[
  {
    "path": ".gitignore",
    "content": "*.pyc\ninterface\n"
  },
  {
    "path": "README.md",
    "content": "# End-to-end speech secognition toolkit\nThis is an E2E ASR toolkit modified from Espnet1 (version 0.9.9).  \nIf this repositry can help you, we will be appreciate if you can star it and cite our papers.\n\nThis is the official implementation following papers:  \n[**Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI**](https://ieeexplore.ieee.org/document/9746579/) (Accepted by ICASSP 2022)  \n[**Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model**](https://ieeexplore.ieee.org/document/9721084) (Accepted by SPL)  \n[**Integrate Lattice-Free MMI into End-to-End Speech Recognition**](https://arxiv.org/abs/2203.15614) (Submitted to TASLP) \n\n\nWe achieve state-of-the-art results on two of the most popular results in Aishell-1 and AIshell-2 Mandarin datasets.  \nPlease feel free to change / modify the code as you like. :)\n### Update\n- 2021/12/29: Release the first version, which contains all MMI-related features, including MMI training criteria, MMI Prefix Score (for attention-based encoder-decoder, AED) and MMI Alignment Score (For neural transducer, NT).\n- 2022/1/6: Release the word-level N-gram LM scorer.\n- 2022/1/12: We update the instructions to build the environment. We also release the trained NT model for Aishell-1 for quick performance check. We update the guildline to run our code.\n- 2022/3/29 We release a new CTC / RNN-T recipe for code-switch problem based on ASRU 2019 Mandarin-English code-switch dataset (see egs/asrucs); Results on Aishell-1 and Aishell-2 are also updated.\n\n### Environment:\nThe main dependencies of this code can be divided into three part: `kaldi`, `espnet` and `k2`  \nPlease follow the instructions in [build_env.sh](https://github.com/jctian98/e2e_lfmmi/blob/master/env/build_env.sh) to build the environment.  
\nNote that the script cannot be run automatically; you need to run it line by line.\n### Results\nCurrently we have released examples on the Aishell-1 and Aishell-2 datasets.  \nWith the MMI training & decoding methods and the word-level N-gram LM, we achieve the results below on Aishell-1 and Aishell-2. All results are in CER%.  \nThe model file of the Aishell-1 NT system is [here](https://drive.google.com/file/d/1VE2YtLb70UpQkeGWE8WhHJl7sSwNa_zG/view?usp=sharing) for a quick performance check.\n\n|  Test set                      | Aishell-1-dev | Aishell-1-test | Aishell-2-ios | Aishell-2-android | Aishell-2-mic |  \n|  :----                         | :-: | :--: | :-: | :-----: | :-: |\n| AED                            | 4.60| 5.07 | 5.72| 6.60    | 6.58| \n| AED + MMI + Word Ngram         | 4.08| 4.45 | 5.15| 5.92    | 5.77|\n| NT                             | 4.41| 4.82 | 5.81| 6.52    | 6.52|\n| NT + MMI + Word Ngram          | 3.79| 4.10 | 5.02| 5.85    | 5.66|\n \n### Getting Started\nTake Aishell-1 as an example. The workflow for the other examples is very similar.  \nstep 1: clone the code and link kaldi\n```\nconda activate lfmmi\ngit clone https://github.com/jctian98/e2e_lfmmi E2E-ASR-Framework # clone and RENAME\ncd E2E-ASR-Framework\nln -s <path-to-kaldi> kaldi                                       # link kaldi\n```\nstep 2: prepare data, lexicon and LMs. Before you run, please set the datadir in `prepare.sh`\n```\ncd egs/aishell1\nbash prepare.sh \n```\nstep 3: model training. You should split the data before starting the training.  \nYou can skip this step and download our trained model [here](https://drive.google.com/file/d/1VE2YtLb70UpQkeGWE8WhHJl7sSwNa_zG/view?usp=sharing)\n```\npython3 espnet_utils/splitjson.py -p <ngpu> dump/train_sp/deltafalse/data.json\nbash nt.sh --stop_stage 1\n```\nstep 4: decode \n```\nbash nt.sh --stage 2 --mmi-weight 0.2 --word-ngram-weight 0.4\n```\nSeveral hints:\n1. Please change the paths in `path.sh` accordingly before you start\n2. 
Please set `data` in `prepare.sh` to your data path\n3. Our code runs in DDP style and requires some global variables. Before you start, you need to set them manually. We assume the PyTorch distributed API works on your machine.  \n```\nexport HOST_GPU_NUM=x       # number of GPUs on each host\nexport HOST_NUM=x           # number of hosts\nexport NODE_NUM=x           # number of GPUs in total (on all hosts)\nexport INDEX=x              # index of this host\nexport CHIEF_IP=xx.xx.xx.xx # IP of the master host\n```\n4. You may encounter some problems with `k2`. If so, try deleting `data/lang_phone/Linv.pt` (in training) and `data/word_3gram/G.pt` (in decoding) and regenerating them. \n5. Multiple choices are available during decoding (we take `nt.sh` as an example, but the usage of `aed.sh` is the same).  \n   To use the MMI-related scorers, you need to train the model with the MMI auxiliary criterion.  \n   \n  To use MMI Prefix Score (in AED) or MMI Alignment Score (in NT):\n  ```\n  bash nt.sh --stage 2 --mmi-weight 0.2\n  ```\n  To use any external LM, you need to train it in advance (as implemented in `prepare.sh`)  \n  \n  To use word-level N-gram LM:\n  ```\n  bash nt.sh --stage 2 --word-ngram-weight 0.4\n  ```\n  To use character-level N-gram LM:\n  ```\n  bash nt.sh --stage 2 --ngram-weight 1.0\n  ```\n  To use neural network LM:\n  ```\n  bash nt.sh --stage 2 --lm-weight 1.0\n  ```\n### Reference\nkaldi: https://github.com/kaldi-asr/kaldi  \nESPnet: https://github.com/espnet/espnet  \nk2-fsa: https://github.com/k2-fsa/k2  \n### Citations\n```\n@INPROCEEDINGS{9746579,\n  author={Tian, Jinchuan and Yu, Jianwei and Weng, Chao and Zhang, Shi-Xiong and Su, Dan and Yu, Dong and Zou, Yuexian},\n  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, \n  title={Consistent Training and Decoding for End-to-End Speech Recognition Using Lattice-Free MMI}, \n  year={2022},\n  volume={},\n  number={},\n 
 pages={7782-7786},\n  doi={10.1109/ICASSP43922.2022.9746579}}\n\n@ARTICLE{9721084,\n  author={Tian, Jinchuan and Yu, Jianwei and Weng, Chao and Zou, Yuexian and Yu, Dong},\n  journal={IEEE Signal Processing Letters}, \n  title={Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model}, \n  year={2022},\n  volume={},\n  number={},\n  pages={1-1},\n  doi={10.1109/LSP.2022.3154241}}\n  \n@article{tian2022integrate,\n  title={Integrate Lattice-Free MMI into End-to-End Speech Recognition},\n  author={Tian, Jinchuan and Yu, Jianwei and Weng, Chao and Zou, Yuexian and Yu, Dong},\n  journal={arXiv preprint arXiv:2203.15614},\n  year={2022}\n}\n```\n### Authorship\nJinchuan Tian;  tianjinchuan@stu.pku.edu.cn or tyriontian@tencent.com  \nJianwei Yu; tomasyu@tencent.com (supervisor)  \nChao Weng; cweng@tencent.com  \nYuexian Zou; zouyx@pku.edu.cn\n"
  },
  {
    "path": "__init__.py",
    "content": "\"\"\"Initialize espnet package.\"\"\"\n\nimport os\ndirname = os.path.dirname(__file__)\nversion_file = os.path.join(dirname, \"version.txt\")\nwith open(version_file, \"r\") as f:\n    __version__ = f.read().strip()\n"
  },
  {
    "path": "asr/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "asr/asr_mix_utils.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nThis script is used to provide utility functions designed for multi-speaker ASR.\n\nCopyright 2017 Johns Hopkins University (Shinji Watanabe)\n Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nMost functions can be directly used as in asr_utils.py:\n    CompareValueTrigger, restore_snapshot, adadelta_eps_decay, chainer_load,\n    torch_snapshot, torch_save, torch_resume, AttributeDict, get_model_conf.\n\n\"\"\"\n\nimport copy\nimport logging\nimport os\n\nfrom chainer.training import extension\n\nimport matplotlib\n\nfrom espnet.asr.asr_utils import parse_hypothesis\n\n\nmatplotlib.use(\"Agg\")\n\n\n# * -------------------- chainer extension related -------------------- *\nclass PlotAttentionReport(extension.Extension):\n    \"\"\"Plot attention reporter.\n\n    Args:\n        att_vis_fn (espnet.nets.*_backend.e2e_asr.calculate_all_attentions):\n            Function of attention visualization.\n        data (list[tuple(str, dict[str, dict[str, Any]])]): List json utt key items.\n        outdir (str): Directory to save figures.\n        converter (espnet.asr.*_backend.asr.CustomConverter):\n            CustomConverter object. 
Function to convert data.\n        device (torch.device): The destination device to send tensor.\n        reverse (bool): If True, input and output length are reversed.\n\n    \"\"\"\n\n    def __init__(self, att_vis_fn, data, outdir, converter, device, reverse=False):\n        \"\"\"Initialize PlotAttentionReport.\"\"\"\n        self.att_vis_fn = att_vis_fn\n        self.data = copy.deepcopy(data)\n        self.outdir = outdir\n        self.converter = converter\n        self.device = device\n        self.reverse = reverse\n        if not os.path.exists(self.outdir):\n            os.makedirs(self.outdir)\n\n    def __call__(self, trainer):\n        \"\"\"Plot and save imaged matrix of att_ws.\"\"\"\n        att_ws_sd = self.get_attention_weights()\n        for ns, att_ws in enumerate(att_ws_sd):\n            for idx, att_w in enumerate(att_ws):\n                filename = \"%s/%s.ep.{.updater.epoch}.output%d.png\" % (\n                    self.outdir,\n                    self.data[idx][0],\n                    ns + 1,\n                )\n                att_w = self.get_attention_weight(idx, att_w, ns)\n                self._plot_and_save_attention(att_w, filename.format(trainer))\n\n    def log_attentions(self, logger, step):\n        \"\"\"Add image files of attention matrix to tensorboard.\"\"\"\n        att_ws_sd = self.get_attention_weights()\n        for ns, att_ws in enumerate(att_ws_sd):\n            for idx, att_w in enumerate(att_ws):\n                att_w = self.get_attention_weight(idx, att_w, ns)\n                plot = self.draw_attention_plot(att_w)\n                logger.add_figure(\"%s\" % (self.data[idx][0]), plot.gcf(), step)\n                plot.clf()\n\n    def get_attention_weights(self):\n        \"\"\"Return attention weights.\n\n        Returns:\n            arr_ws_sd (numpy.ndarray): attention weights. 
It's shape would be\n                differ from bachend.dtype=float\n                * pytorch-> 1) multi-head case => (B, H, Lmax, Tmax). 2)\n                  other case => (B, Lmax, Tmax).\n                * chainer-> attention weights (B, Lmax, Tmax).\n\n        \"\"\"\n        batch = self.converter([self.converter.transform(self.data)], self.device)\n        att_ws_sd = self.att_vis_fn(*batch)\n        return att_ws_sd\n\n    def get_attention_weight(self, idx, att_w, spkr_idx):\n        \"\"\"Transform attention weight in regard to self.reverse.\"\"\"\n        if self.reverse:\n            dec_len = int(self.data[idx][1][\"input\"][0][\"shape\"][0])\n            enc_len = int(self.data[idx][1][\"output\"][spkr_idx][\"shape\"][0])\n        else:\n            dec_len = int(self.data[idx][1][\"output\"][spkr_idx][\"shape\"][0])\n            enc_len = int(self.data[idx][1][\"input\"][0][\"shape\"][0])\n        if len(att_w.shape) == 3:\n            att_w = att_w[:, :dec_len, :enc_len]\n        else:\n            att_w = att_w[:dec_len, :enc_len]\n        return att_w\n\n    def draw_attention_plot(self, att_w):\n        \"\"\"Visualize attention weights matrix.\n\n        Args:\n            att_w(Tensor): Attention weight matrix.\n\n        Returns:\n            matplotlib.pyplot: pyplot object with attention matrix image.\n\n        \"\"\"\n        import matplotlib.pyplot as plt\n\n        if len(att_w.shape) == 3:\n            for h, aw in enumerate(att_w, 1):\n                plt.subplot(1, len(att_w), h)\n                plt.imshow(aw, aspect=\"auto\")\n                plt.xlabel(\"Encoder Index\")\n                plt.ylabel(\"Decoder Index\")\n        else:\n            plt.imshow(att_w, aspect=\"auto\")\n            plt.xlabel(\"Encoder Index\")\n            plt.ylabel(\"Decoder Index\")\n        plt.tight_layout()\n        return plt\n\n    def _plot_and_save_attention(self, att_w, filename):\n        plt = self.draw_attention_plot(att_w)\n        
plt.savefig(filename)\n        plt.close()\n\n\ndef add_results_to_json(js, nbest_hyps_sd, char_list):\n    \"\"\"Add N-best results to json.\n\n    Args:\n        js (dict[str, Any]): Groundtruth utterance dict.\n        nbest_hyps_sd (list[dict[str, Any]]):\n            List of hypothesis for multi_speakers (# Utts x # Spkrs).\n        char_list (list[str]): List of characters.\n\n    Returns:\n        dict[str, Any]: N-best results added utterance dict.\n\n    \"\"\"\n    # copy old json info\n    new_js = dict()\n    new_js[\"utt2spk\"] = js[\"utt2spk\"]\n    num_spkrs = len(nbest_hyps_sd)\n    new_js[\"output\"] = []\n\n    for ns in range(num_spkrs):\n        tmp_js = []\n        nbest_hyps = nbest_hyps_sd[ns]\n\n        for n, hyp in enumerate(nbest_hyps, 1):\n            # parse hypothesis\n            rec_text, rec_token, rec_tokenid, score = parse_hypothesis(hyp, char_list)\n\n            # copy ground-truth\n            out_dic = dict(js[\"output\"][ns].items())\n\n            # update name\n            out_dic[\"name\"] += \"[%d]\" % n\n\n            # add recognition results\n            out_dic[\"rec_text\"] = rec_text\n            out_dic[\"rec_token\"] = rec_token\n            out_dic[\"rec_tokenid\"] = rec_tokenid\n            out_dic[\"score\"] = score\n\n            # add to list of N-best result dicts\n            tmp_js.append(out_dic)\n\n            # show 1-best result\n            if n == 1:\n                logging.info(\"groundtruth: %s\" % out_dic[\"text\"])\n                logging.info(\"prediction : %s\" % out_dic[\"rec_text\"])\n\n        new_js[\"output\"].append(tmp_js)\n    return new_js\n"
  },
  {
    "path": "asr/asr_utils.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n# Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport copy\nimport json\nimport logging\nimport os\nimport shutil\nimport tempfile\nimport numpy as np\nimport torch\n\n\n# * -------------------- training iterator related -------------------- *\n\n\nclass CompareValueTrigger(object):\n    \"\"\"Trigger invoked when key value getting bigger or lower than before.\n\n    Args:\n        key (str) : Key of value.\n        compare_fn ((float, float) -> bool) : Function to compare the values.\n        trigger (tuple(int, str)) : Trigger that decide the comparison interval.\n\n    \"\"\"\n\n    def __init__(self, key, compare_fn, trigger=(1, \"epoch\")):\n        from chainer import training\n\n        self._key = key\n        self._best_value = None\n        self._interval_trigger = training.util.get_trigger(trigger)\n        self._init_summary()\n        self._compare_fn = compare_fn\n\n    def __call__(self, trainer):\n        \"\"\"Get value related to the key and compare with current value.\"\"\"\n        observation = trainer.observation\n        summary = self._summary\n        key = self._key\n        if key in observation:\n            summary.add({key: observation[key]})\n\n        if not self._interval_trigger(trainer):\n            return False\n\n        stats = summary.compute_mean()\n        value = float(stats[key])  # copy to CPU\n        self._init_summary()\n\n        if self._best_value is None:\n            # initialize best value\n            self._best_value = value\n            return False\n        elif self._compare_fn(self._best_value, value):\n            return True\n        else:\n            self._best_value = value\n            return False\n\n    def _init_summary(self):\n        import chainer\n\n        self._summary = chainer.reporter.DictSummary()\n\n\ntry:\n    from chainer.training import extension\nexcept ImportError:\n    
PlotAttentionReport = None\nelse:\n\n    class PlotAttentionReport(extension.Extension):\n        \"\"\"Plot attention reporter.\n\n        Args:\n            att_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_attentions):\n                Function of attention visualization.\n            data (list[tuple(str, dict[str, list[Any]])]): List json utt key items.\n            outdir (str): Directory to save figures.\n            converter (espnet.asr.*_backend.asr.CustomConverter):\n                Function to convert data.\n            device (int | torch.device): Device.\n            reverse (bool): If True, input and output length are reversed.\n            ikey (str): Key to access input\n                (for ASR/ST ikey=\"input\", for MT ikey=\"output\".)\n            iaxis (int): Dimension to access input\n                (for ASR/ST iaxis=0, for MT iaxis=1.)\n            okey (str): Key to access output\n                (for ASR/ST okey=\"input\", MT okay=\"output\".)\n            oaxis (int): Dimension to access output\n                (for ASR/ST oaxis=0, for MT oaxis=0.)\n            subsampling_factor (int): subsampling factor in encoder\n\n        \"\"\"\n\n        def __init__(\n            self,\n            att_vis_fn,\n            data,\n            outdir,\n            converter,\n            transform,\n            device,\n            reverse=False,\n            ikey=\"input\",\n            iaxis=0,\n            okey=\"output\",\n            oaxis=0,\n            subsampling_factor=1,\n        ):\n            self.att_vis_fn = att_vis_fn\n            self.data = copy.deepcopy(data)\n            self.data_dict = {k: v for k, v in copy.deepcopy(data)}\n            # key is utterance ID\n            self.outdir = outdir\n            self.converter = converter\n            self.transform = transform\n            self.device = device\n            self.reverse = reverse\n            self.ikey = ikey\n            self.iaxis = iaxis\n            
self.okey = okey\n            self.oaxis = oaxis\n            self.factor = subsampling_factor\n            if not os.path.exists(self.outdir):\n                os.makedirs(self.outdir)\n\n        def __call__(self, trainer):\n            \"\"\"Plot and save image file of att_ws matrix.\"\"\"\n            att_ws, uttid_list = self.get_attention_weights()\n            if isinstance(att_ws, list):  # multi-encoder case\n                num_encs = len(att_ws) - 1\n                # atts\n                for i in range(num_encs):\n                    for idx, att_w in enumerate(att_ws[i]):\n                        filename = \"%s/%s.ep.{.updater.epoch}.att%d.png\" % (\n                            self.outdir,\n                            uttid_list[idx],\n                            i + 1,\n                        )\n                        att_w = self.trim_attention_weight(uttid_list[idx], att_w)\n                        np_filename = \"%s/%s.ep.{.updater.epoch}.att%d.npy\" % (\n                            self.outdir,\n                            uttid_list[idx],\n                            i + 1,\n                        )\n                        np.save(np_filename.format(trainer), att_w)\n                        self._plot_and_save_attention(att_w, filename.format(trainer))\n                # han\n                for idx, att_w in enumerate(att_ws[num_encs]):\n                    filename = \"%s/%s.ep.{.updater.epoch}.han.png\" % (\n                        self.outdir,\n                        uttid_list[idx],\n                    )\n                    att_w = self.trim_attention_weight(uttid_list[idx], att_w)\n                    np_filename = \"%s/%s.ep.{.updater.epoch}.han.npy\" % (\n                        self.outdir,\n                        uttid_list[idx],\n                    )\n                    np.save(np_filename.format(trainer), att_w)\n                    self._plot_and_save_attention(\n                        att_w, filename.format(trainer), 
han_mode=True\n                    )\n            else:\n                for idx, att_w in enumerate(att_ws):\n                    filename = \"%s/%s.ep.{.updater.epoch}.png\" % (\n                        self.outdir,\n                        uttid_list[idx],\n                    )\n                    att_w = self.trim_attention_weight(uttid_list[idx], att_w)\n                    np_filename = \"%s/%s.ep.{.updater.epoch}.npy\" % (\n                        self.outdir,\n                        uttid_list[idx],\n                    )\n                    np.save(np_filename.format(trainer), att_w)\n                    self._plot_and_save_attention(att_w, filename.format(trainer))\n\n        def log_attentions(self, logger, step):\n            \"\"\"Add image files of att_ws matrix to the tensorboard.\"\"\"\n            att_ws, uttid_list = self.get_attention_weights()\n            if isinstance(att_ws, list):  # multi-encoder case\n                num_encs = len(att_ws) - 1\n                # atts\n                for i in range(num_encs):\n                    for idx, att_w in enumerate(att_ws[i]):\n                        att_w = self.trim_attention_weight(uttid_list[idx], att_w)\n                        plot = self.draw_attention_plot(att_w)\n                        logger.add_figure(\n                            \"%s_att%d\" % (uttid_list[idx], i + 1),\n                            plot.gcf(),\n                            step,\n                        )\n                # han\n                for idx, att_w in enumerate(att_ws[num_encs]):\n                    att_w = self.trim_attention_weight(uttid_list[idx], att_w)\n                    plot = self.draw_han_plot(att_w)\n                    logger.add_figure(\n                        \"%s_han\" % (uttid_list[idx]),\n                        plot.gcf(),\n                        step,\n                    )\n            else:\n                for idx, att_w in enumerate(att_ws):\n                    att_w = 
self.trim_attention_weight(uttid_list[idx], att_w)\n                    plot = self.draw_attention_plot(att_w)\n                    logger.add_figure(\"%s\" % (uttid_list[idx]), plot.gcf(), step)\n\n        def get_attention_weights(self):\n            \"\"\"Return attention weights.\n\n            Returns:\n                numpy.ndarray: attention weights. float. Its shape would be\n                    differ from backend.\n                    * pytorch-> 1) multi-head case => (B, H, Lmax, Tmax), 2)\n                      other case => (B, Lmax, Tmax).\n                    * chainer-> (B, Lmax, Tmax)\n\n            \"\"\"\n            return_batch, uttid_list = self.transform(self.data, return_uttid=True)\n            batch = self.converter([return_batch], self.device)\n            if isinstance(batch, tuple):\n                att_ws = self.att_vis_fn(*batch)\n            else:\n                att_ws = self.att_vis_fn(**batch)\n            return att_ws, uttid_list\n\n        def trim_attention_weight(self, uttid, att_w):\n            \"\"\"Transform attention matrix with regard to self.reverse.\"\"\"\n            if self.reverse:\n                enc_key, enc_axis = self.okey, self.oaxis\n                dec_key, dec_axis = self.ikey, self.iaxis\n            else:\n                enc_key, enc_axis = self.ikey, self.iaxis\n                dec_key, dec_axis = self.okey, self.oaxis\n            dec_len = int(self.data_dict[uttid][dec_key][dec_axis][\"shape\"][0])\n            enc_len = int(self.data_dict[uttid][enc_key][enc_axis][\"shape\"][0])\n            if self.factor > 1:\n                enc_len //= self.factor\n            if len(att_w.shape) == 3:\n                att_w = att_w[:, :dec_len, :enc_len]\n            else:\n                att_w = att_w[:dec_len, :enc_len]\n            return att_w\n\n        def draw_attention_plot(self, att_w):\n            \"\"\"Plot the att_w matrix.\n\n            Returns:\n                matplotlib.pyplot: pyplot object 
with attention matrix image.\n\n            \"\"\"\n            import matplotlib\n\n            matplotlib.use(\"Agg\")\n            import matplotlib.pyplot as plt\n\n            plt.clf()\n            att_w = att_w.astype(np.float32)\n            if len(att_w.shape) == 3:\n                for h, aw in enumerate(att_w, 1):\n                    plt.subplot(1, len(att_w), h)\n                    plt.imshow(aw, aspect=\"auto\")\n                    plt.xlabel(\"Encoder Index\")\n                    plt.ylabel(\"Decoder Index\")\n            else:\n                plt.imshow(att_w, aspect=\"auto\")\n                plt.xlabel(\"Encoder Index\")\n                plt.ylabel(\"Decoder Index\")\n            plt.tight_layout()\n            return plt\n\n        def draw_han_plot(self, att_w):\n            \"\"\"Plot the att_w matrix for hierarchical attention.\n\n            Returns:\n                matplotlib.pyplot: pyplot object with attention matrix image.\n\n            \"\"\"\n            import matplotlib\n\n            matplotlib.use(\"Agg\")\n            import matplotlib.pyplot as plt\n\n            plt.clf()\n            if len(att_w.shape) == 3:\n                for h, aw in enumerate(att_w, 1):\n                    legends = []\n                    plt.subplot(1, len(att_w), h)\n                    for i in range(aw.shape[1]):\n                        plt.plot(aw[:, i])\n                        legends.append(\"Att{}\".format(i))\n                    plt.ylim([0, 1.0])\n                    plt.xlim([0, aw.shape[0]])\n                    plt.grid(True)\n                    plt.ylabel(\"Attention Weight\")\n                    plt.xlabel(\"Decoder Index\")\n                    plt.legend(legends)\n            else:\n                legends = []\n                for i in range(att_w.shape[1]):\n                    plt.plot(att_w[:, i])\n                    legends.append(\"Att{}\".format(i))\n                plt.ylim([0, 1.0])\n                plt.xlim([0, 
att_w.shape[0]])\n                plt.grid(True)\n                plt.ylabel(\"Attention Weight\")\n                plt.xlabel(\"Decoder Index\")\n                plt.legend(legends)\n            plt.tight_layout()\n            return plt\n\n        def _plot_and_save_attention(self, att_w, filename, han_mode=False):\n            if han_mode:\n                plt = self.draw_han_plot(att_w)\n            else:\n                plt = self.draw_attention_plot(att_w)\n            plt.savefig(filename)\n            plt.close()\n\n\ntry:\n    from chainer.training import extension\nexcept ImportError:\n    PlotCTCReport = None\nelse:\n\n    class PlotCTCReport(extension.Extension):\n        \"\"\"Plot CTC reporter.\n\n        Args:\n            ctc_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_ctc_probs):\n                Function of CTC visualization.\n            data (list[tuple(str, dict[str, list[Any]])]): List json utt key items.\n            outdir (str): Directory to save figures.\n            converter (espnet.asr.*_backend.asr.CustomConverter):\n                Function to convert data.\n            device (int | torch.device): Device.\n            reverse (bool): If True, input and output length are reversed.\n            ikey (str): Key to access input\n                (for ASR/ST ikey=\"input\", for MT ikey=\"output\".)\n            iaxis (int): Dimension to access input\n                (for ASR/ST iaxis=0, for MT iaxis=1.)\n            okey (str): Key to access output\n                (for ASR/ST okey=\"input\", MT okay=\"output\".)\n            oaxis (int): Dimension to access output\n                (for ASR/ST oaxis=0, for MT oaxis=0.)\n            subsampling_factor (int): subsampling factor in encoder\n\n        \"\"\"\n\n        def __init__(\n            self,\n            ctc_vis_fn,\n            data,\n            outdir,\n            converter,\n            transform,\n            device,\n            reverse=False,\n            
ikey=\"input\",\n            iaxis=0,\n            okey=\"output\",\n            oaxis=0,\n            subsampling_factor=1,\n        ):\n            self.ctc_vis_fn = ctc_vis_fn\n            self.data = copy.deepcopy(data)\n            self.data_dict = {k: v for k, v in copy.deepcopy(data)}\n            # key is utterance ID\n            self.outdir = outdir\n            self.converter = converter\n            self.transform = transform\n            self.device = device\n            self.reverse = reverse\n            self.ikey = ikey\n            self.iaxis = iaxis\n            self.okey = okey\n            self.oaxis = oaxis\n            self.factor = subsampling_factor\n            if not os.path.exists(self.outdir):\n                os.makedirs(self.outdir)\n\n        def __call__(self, trainer):\n            \"\"\"Plot and save image file of ctc prob.\"\"\"\n            ctc_probs, uttid_list = self.get_ctc_probs()\n            if isinstance(ctc_probs, list):  # multi-encoder case\n                num_encs = len(ctc_probs) - 1\n                for i in range(num_encs):\n                    for idx, ctc_prob in enumerate(ctc_probs[i]):\n                        filename = \"%s/%s.ep.{.updater.epoch}.ctc%d.png\" % (\n                            self.outdir,\n                            uttid_list[idx],\n                            i + 1,\n                        )\n                        ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob)\n                        np_filename = \"%s/%s.ep.{.updater.epoch}.ctc%d.npy\" % (\n                            self.outdir,\n                            uttid_list[idx],\n                            i + 1,\n                        )\n                        np.save(np_filename.format(trainer), ctc_prob)\n                        self._plot_and_save_ctc(ctc_prob, filename.format(trainer))\n            else:\n                for idx, ctc_prob in enumerate(ctc_probs):\n                    filename = 
\"%s/%s.ep.{.updater.epoch}.png\" % (\n                        self.outdir,\n                        uttid_list[idx],\n                    )\n                    ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob)\n                    np_filename = \"%s/%s.ep.{.updater.epoch}.npy\" % (\n                        self.outdir,\n                        uttid_list[idx],\n                    )\n                    np.save(np_filename.format(trainer), ctc_prob)\n                    self._plot_and_save_ctc(ctc_prob, filename.format(trainer))\n\n        def log_ctc_probs(self, logger, step):\n            \"\"\"Add image files of ctc probs to the tensorboard.\"\"\"\n            ctc_probs, uttid_list = self.get_ctc_probs()\n            if isinstance(ctc_probs, list):  # multi-encoder case\n                num_encs = len(ctc_probs) - 1\n                for i in range(num_encs):\n                    for idx, ctc_prob in enumerate(ctc_probs[i]):\n                        ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob)\n                        plot = self.draw_ctc_plot(ctc_prob)\n                        logger.add_figure(\n                            \"%s_ctc%d\" % (uttid_list[idx], i + 1),\n                            plot.gcf(),\n                            step,\n                        )\n            else:\n                for idx, ctc_prob in enumerate(ctc_probs):\n                    ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob)\n                    plot = self.draw_ctc_plot(ctc_prob)\n                    logger.add_figure(\"%s\" % (uttid_list[idx]), plot.gcf(), step)\n\n        def get_ctc_probs(self):\n            \"\"\"Return CTC probs.\n\n            Returns:\n                numpy.ndarray: CTC probs. float. Its shape would be\n                    differ from backend. 
(B, Tmax, vocab).\n\n            \"\"\"\n            return_batch, uttid_list = self.transform(self.data, return_uttid=True)\n            batch = self.converter([return_batch], self.device)\n            if isinstance(batch, tuple):\n                probs = self.ctc_vis_fn(*batch)\n            else:\n                probs = self.ctc_vis_fn(**batch)\n            return probs, uttid_list\n\n        def trim_ctc_prob(self, uttid, prob):\n            \"\"\"Trim CTC posteriors accoding to input lengths.\"\"\"\n            enc_len = int(self.data_dict[uttid][self.ikey][self.iaxis][\"shape\"][0])\n            if self.factor > 1:\n                enc_len //= self.factor\n            prob = prob[:enc_len]\n            return prob\n\n        def draw_ctc_plot(self, ctc_prob):\n            \"\"\"Plot the ctc_prob matrix.\n\n            Returns:\n                matplotlib.pyplot: pyplot object with CTC prob matrix image.\n\n            \"\"\"\n            import matplotlib\n\n            matplotlib.use(\"Agg\")\n            import matplotlib.pyplot as plt\n\n            ctc_prob = ctc_prob.astype(np.float32)\n\n            plt.clf()\n            topk_ids = np.argsort(ctc_prob, axis=1)\n            n_frames, vocab = ctc_prob.shape\n            times_probs = np.arange(n_frames)\n\n            plt.figure(figsize=(20, 8))\n\n            # NOTE: index 0 is reserved for blank\n            for idx in set(topk_ids.reshape(-1).tolist()):\n                if idx == 0:\n                    plt.plot(\n                        times_probs, ctc_prob[:, 0], \":\", label=\"<blank>\", color=\"grey\"\n                    )\n                else:\n                    plt.plot(times_probs, ctc_prob[:, idx])\n            plt.xlabel(u\"Input [frame]\", fontsize=12)\n            plt.ylabel(\"Posteriors\", fontsize=12)\n            plt.xticks(list(range(0, int(n_frames) + 1, 10)))\n            plt.yticks(list(range(0, 2, 1)))\n            plt.tight_layout()\n            return plt\n\n        def 
_plot_and_save_ctc(self, ctc_prob, filename):\n            plt = self.draw_ctc_plot(ctc_prob)\n            plt.savefig(filename)\n            plt.close()\n\n\ndef restore_snapshot(model, snapshot, load_fn=None):\n    \"\"\"Extension to restore snapshot.\n\n    Returns:\n        An extension function.\n\n    \"\"\"\n    import chainer\n    from chainer import training\n\n    if load_fn is None:\n        load_fn = chainer.serializers.load_npz\n\n    @training.make_extension(trigger=(1, \"epoch\"))\n    def restore_snapshot(trainer):\n        _restore_snapshot(model, snapshot, load_fn)\n\n    return restore_snapshot\n\n\ndef _restore_snapshot(model, snapshot, load_fn=None):\n    if load_fn is None:\n        import chainer\n\n        load_fn = chainer.serializers.load_npz\n\n    load_fn(snapshot, model)\n    logging.info(\"restored from \" + str(snapshot))\n\n\ndef adadelta_eps_decay(eps_decay):\n    \"\"\"Extension to perform adadelta eps decay.\n\n    Args:\n        eps_decay (float): Decay rate of eps.\n\n    Returns:\n        An extension function.\n\n    \"\"\"\n    from chainer import training\n\n    @training.make_extension(trigger=(1, \"epoch\"))\n    def adadelta_eps_decay(trainer):\n        _adadelta_eps_decay(trainer, eps_decay)\n\n    return adadelta_eps_decay\n\n\ndef _adadelta_eps_decay(trainer, eps_decay):\n    optimizer = trainer.updater.get_optimizer(\"main\")\n    # for chainer\n    if hasattr(optimizer, \"eps\"):\n        current_eps = optimizer.eps\n        setattr(optimizer, \"eps\", current_eps * eps_decay)\n        logging.info(\"adadelta eps decayed to \" + str(optimizer.eps))\n    # pytorch\n    else:\n        for p in optimizer.param_groups:\n            p[\"eps\"] *= eps_decay\n            logging.info(\"adadelta eps decayed to \" + str(p[\"eps\"]))\n\n\ndef adam_lr_decay(eps_decay):\n    \"\"\"Extension to perform adam lr decay.\n\n    Args:\n        eps_decay (float): Decay rate of lr.\n\n    Returns:\n        An extension function.\n\n    
\"\"\"\n    from chainer import training\n\n    @training.make_extension(trigger=(1, \"epoch\"))\n    def adam_lr_decay(trainer):\n        _adam_lr_decay(trainer, eps_decay)\n\n    return adam_lr_decay\n\n\ndef _adam_lr_decay(trainer, eps_decay):\n    optimizer = trainer.updater.get_optimizer(\"main\")\n    # for chainer\n    if hasattr(optimizer, \"lr\"):\n        current_lr = optimizer.lr\n        setattr(optimizer, \"lr\", current_lr * eps_decay)\n        logging.info(\"adam lr decayed to \" + str(optimizer.lr))\n    # pytorch\n    else:\n        for p in optimizer.param_groups:\n            p[\"lr\"] *= eps_decay\n            logging.info(\"adam lr decayed to \" + str(p[\"lr\"]))\n\n\ndef torch_snapshot(savefun=torch.save, filename=\"snapshot.ep.{.updater.epoch}\"):\n    \"\"\"Extension to take snapshot of the trainer for pytorch.\n\n    Returns:\n        An extension function.\n\n    \"\"\"\n    from chainer.training import extension\n\n    @extension.make_extension(trigger=(1, \"epoch\"), priority=-100)\n    def torch_snapshot(trainer):\n        _torch_snapshot_object(trainer, trainer, filename.format(trainer), savefun)\n\n    return torch_snapshot\n\n\ndef _torch_snapshot_object(trainer, target, filename, savefun):\n    from chainer.serializers import DictionarySerializer\n\n    # make snapshot_dict dictionary\n    s = DictionarySerializer()\n    s.save(trainer)\n    if hasattr(trainer.updater.model, \"model\"):\n        # (for TTS)\n        if hasattr(trainer.updater.model.model, \"module\"):\n            model_state_dict = trainer.updater.model.model.module.state_dict()\n        else:\n            model_state_dict = trainer.updater.model.model.state_dict()\n    else:\n        # (for ASR)\n        if hasattr(trainer.updater.model, \"module\"):\n            model_state_dict = trainer.updater.model.module.state_dict()\n        else:\n            model_state_dict = trainer.updater.model.state_dict()\n    \n\n    snapshot_dict = {\n        \"trainer\": 
s.target,\n        \"model\": model_state_dict,\n    }\n\n    if hasattr(trainer.updater, \"ddp_trainer\"):\n        # For ASR\n        snapshot_dict[\"optimizer\"] = trainer.updater.ddp_trainer.optimizer.state_dict()\n    else:\n        # Others like LM\n        snapshot_dict[\"optimizer\"] = trainer.updater.get_optimizer(\"main\").state_dict() \n\n    # save snapshot dictionary\n    fn = filename.format(trainer)\n    prefix = \"tmp\" + fn\n    tmpdir = tempfile.mkdtemp(prefix=prefix, dir=trainer.out)\n    tmppath = os.path.join(tmpdir, fn)\n    try:\n        savefun(snapshot_dict, tmppath)\n        shutil.move(tmppath, os.path.join(trainer.out, fn))\n    finally:\n        shutil.rmtree(tmpdir)\n\n\ndef add_gradient_noise(model, iteration, duration=100, eta=1.0, scale_factor=0.55):\n    \"\"\"Adds noise from a standard normal distribution to the gradients.\n\n    The standard deviation (`sigma`) is controlled by the three hyper-parameters below.\n    `sigma` goes to zero (no noise) with more iterations.\n\n    Args:\n        model (torch.nn.model): Model.\n        iteration (int): Number of iterations.\n        duration (int) {100, 1000}:\n            Number of durations to control the interval of the `sigma` change.\n        eta (float) {0.01, 0.3, 1.0}: The magnitude of `sigma`.\n        scale_factor (float) {0.55}: The scale of `sigma`.\n    \"\"\"\n    interval = (iteration // duration) + 1\n    sigma = eta / interval ** scale_factor\n    for param in model.parameters():\n        if param.grad is not None:\n            _shape = param.grad.size()\n            noise = sigma * torch.randn(_shape).to(param.device)\n            param.grad += noise\n\n\n# * -------------------- general -------------------- *\ndef get_model_conf(model_path, conf_path=None):\n    \"\"\"Get model config information by reading a model config file (model.json).\n\n    Args:\n        model_path (str): Model path.\n        conf_path (str): Optional model config path.\n\n    Returns:\n      
  argparse.Namespace or tuple[int, int, argparse.Namespace]: Config loaded from\n            the json file: a Namespace for LM, else (idim, odim, Namespace).\n\n    \"\"\"\n    if conf_path is None:\n        model_conf = os.path.dirname(model_path) + \"/model.json\"\n    else:\n        model_conf = conf_path\n    with open(model_conf, \"rb\") as f:\n        logging.info(\"reading a config file from \" + model_conf)\n        confs = json.load(f)\n    if isinstance(confs, dict):\n        # for lm\n        args = confs\n        return argparse.Namespace(**args)\n    else:\n        # for asr, tts, mt\n        idim, odim, args = confs\n        return idim, odim, argparse.Namespace(**args)\n\n\ndef chainer_load(path, model):\n    \"\"\"Load chainer model parameters.\n\n    Args:\n        path (str): Model path or snapshot file path to be loaded.\n        model (chainer.Chain): Chainer model.\n\n    \"\"\"\n    import chainer\n\n    if \"snapshot\" in os.path.basename(path):\n        chainer.serializers.load_npz(path, model, path=\"updater/model:main/\")\n    else:\n        chainer.serializers.load_npz(path, model)\n\n\ndef torch_save(path, model):\n    \"\"\"Save torch model states.\n\n    Args:\n        path (str): Model path to be saved.\n        model (torch.nn.Module): Torch model.\n\n    \"\"\"\n    if hasattr(model, \"module\"):\n        torch.save(model.module.state_dict(), path)\n    else:\n        torch.save(model.state_dict(), path)\n\n\ndef snapshot_object(target, filename):\n    \"\"\"Returns a trainer extension to take snapshots of a given object.\n\n    Args:\n        target (model): Object to serialize.\n        filename (str): Name of the file into which the object is serialized.\n            It can be a format string, where the trainer object is passed to\n            the :meth:`str.format` method. 
For example,\n            ``'snapshot_{.updater.iteration}'`` is converted to\n            ``'snapshot_10000'`` at the 10,000th iteration.\n\n    Returns:\n        An extension function.\n\n    \"\"\"\n    from chainer.training import extension\n\n    @extension.make_extension(trigger=(1, \"epoch\"), priority=-100)\n    def snapshot_object(trainer):\n        torch_save(os.path.join(trainer.out, filename.format(trainer)), target)\n\n    return snapshot_object\n\n\ndef torch_load(path, model):\n    \"\"\"Load torch model states.\n\n    Args:\n        path (str): Model path or snapshot file path to be loaded.\n        model (torch.nn.Module): Torch model.\n\n    \"\"\"\n    if \"snapshot\" in os.path.basename(path):\n        model_state_dict = torch.load(path, map_location=lambda storage, loc: storage)[\n            \"model\"\n        ]\n    else:\n        model_state_dict = torch.load(path, map_location=lambda storage, loc: storage)\n\n    if hasattr(model, \"module\"):\n        model.module.load_state_dict(model_state_dict)\n    else:\n        model.load_state_dict(model_state_dict)\n\n    del model_state_dict\n\n\ndef torch_resume(snapshot_path, trainer, load_trainer_and_opt=True):\n    \"\"\"Resume from snapshot for pytorch.\n\n    Args:\n        snapshot_path (str): Snapshot file path.\n        trainer (chainer.training.Trainer): Chainer's trainer instance.\n        load_trainer_and_opt (bool): If False, restore only the model weights\n            and skip the trainer and optimizer states.\n\n    \"\"\"\n    from chainer.serializers import NpzDeserializer\n\n    if not load_trainer_and_opt:\n        print(\"Only model weights are resumed\")\n        print(\"Trainer and optimizer states are ignored\")\n        print(\"Make sure this is the second-stage training\")\n\n    # load snapshot\n    snapshot_dict = torch.load(snapshot_path, map_location=lambda storage, loc: storage)\n\n    # restore trainer states\n    if load_trainer_and_opt:\n        d = NpzDeserializer(snapshot_dict[\"trainer\"])\n        d.load(trainer)\n\n    # restore model states\n    if hasattr(trainer.updater.model, \"model\"):\n        
# (for TTS model)\n        if hasattr(trainer.updater.model.model, \"module\"):\n            trainer.updater.model.model.module.load_state_dict(snapshot_dict[\"model\"])\n        else:\n            trainer.updater.model.model.load_state_dict(snapshot_dict[\"model\"])\n    else:\n        # (for ASR model)\n        if hasattr(trainer.updater.model, \"module\"):\n            trainer.updater.model.module.load_state_dict(snapshot_dict[\"model\"])\n        else:\n            trainer.updater.model.load_state_dict(snapshot_dict[\"model\"])\n\n    # restore optimizer states (mirrors the layout written by torch_snapshot)\n    if load_trainer_and_opt:\n        # updaters without a ddp_trainer attribute (e.g. LM) must not raise here\n        ddp_trainer = getattr(trainer.updater, \"ddp_trainer\", None)\n        if ddp_trainer is None:\n            # Others like LM\n            trainer.updater.get_optimizer(\"main\").load_state_dict(\n                snapshot_dict[\"optimizer\"]\n            )\n        elif hasattr(ddp_trainer, \"optimizer\"):\n            # For ASR\n            ddp_trainer.optimizer.load_state_dict(snapshot_dict[\"optimizer\"])\n\n    # delete opened snapshot\n    del snapshot_dict\n\n\n# * ------------------ recognition related ------------------ *\ndef parse_hypothesis(hyp, char_list):\n    \"\"\"Parse hypothesis.\n\n    Args:\n        hyp (dict[str, Any]): Recognition hypothesis.\n        char_list (list[str]): List of characters.\n\n    Returns:\n        tuple(str, str, str, float)\n\n    \"\"\"\n    # remove sos and get results\n    tokenid_as_list = list(map(int, hyp[\"yseq\"][1:]))\n    token_as_list = [char_list[idx] for idx in tokenid_as_list]\n    score = float(hyp[\"score\"])\n\n    # convert to string\n    tokenid = \" \".join([str(idx) for idx in tokenid_as_list])\n    token = \" \".join(token_as_list)\n    text = \"\".join(token_as_list).replace(\"<space>\", \" \")\n\n    return text, token, tokenid, score\n\n\ndef add_results_to_json(js, nbest_hyps, char_list):\n    \"\"\"Add N-best results to json.\n\n    Args:\n        js (dict[str, Any]): Groundtruth utterance dict.\n        nbest_hyps (list[dict[str, Any]]): List of N-best hypotheses.\n        char_list (list[str]): List of characters.\n\n    Returns:\n        dict[str, Any]: N-best results added utterance dict.\n\n    \"\"\"\n    # 
copy old json info\n    new_js = dict()\n    new_js[\"utt2spk\"] = js[\"utt2spk\"]\n    new_js[\"output\"] = []\n\n    for n, hyp in enumerate(nbest_hyps, 1):\n        # parse hypothesis\n        rec_text, rec_token, rec_tokenid, score = parse_hypothesis(hyp, char_list)\n\n        # copy ground-truth\n        if len(js[\"output\"]) > 0:\n            out_dic = dict(js[\"output\"][0].items())\n        else:\n            # for no reference case (e.g., speech translation)\n            out_dic = {\"name\": \"\"}\n\n        # update name\n        out_dic[\"name\"] += \"[%d]\" % n\n\n        # add recognition results\n        out_dic[\"rec_text\"] = rec_text\n        out_dic[\"rec_token\"] = rec_token\n        out_dic[\"rec_tokenid\"] = rec_tokenid\n        out_dic[\"score\"] = score\n       \n        # RNNT MMI \n        if \"mmi_tot_score\" in hyp:\n            out_dic[\"mmi_tot_score\"] = hyp[\"mmi_tot_score\"]\n       \n        # LASCTC MMI \n        if \"scores\" in hyp:\n            if \"mmi_tot_score\" in hyp[\"scores\"]:\n                out_dic[\"mmi_tot_score\"] = hyp[\"scores\"][\"mmi_tot_score\"]\n            if \"mmi\" in hyp[\"scores\"]:\n                out_dic[\"mmi\"] = hyp[\"scores\"][\"mmi\"]\n\n        # add to list of N-best result dicts\n        new_js[\"output\"].append(out_dic)\n\n        # show 1-best result\n        if n == 1:\n            if \"text\" in out_dic.keys():\n                logging.info(\"groundtruth: %s\" % out_dic[\"text\"])\n            logging.info(\"prediction : %s\" % out_dic[\"rec_text\"])\n\n    return new_js\n\n\ndef plot_spectrogram(\n    plt,\n    spec,\n    mode=\"db\",\n    fs=None,\n    frame_shift=None,\n    bottom=True,\n    left=True,\n    right=True,\n    top=False,\n    labelbottom=True,\n    labelleft=True,\n    labelright=True,\n    labeltop=False,\n    cmap=\"inferno\",\n):\n    \"\"\"Plot spectrogram using matplotlib.\n\n    Args:\n        plt (matplotlib.pyplot): pyplot object.\n        spec (numpy.ndarray): 
Input stft (Freq, Time)\n        mode (str): db or linear.\n        fs (int): Sample frequency. To convert y-axis to kHz unit.\n        frame_shift (int): The frame shift of stft. To convert x-axis to second unit.\n        bottom (bool):Whether to draw the respective ticks.\n        left (bool):\n        right (bool):\n        top (bool):\n        labelbottom (bool):Whether to draw the respective tick labels.\n        labelleft (bool):\n        labelright (bool):\n        labeltop (bool):\n        cmap (str): Colormap defined in matplotlib.\n\n    \"\"\"\n    spec = np.abs(spec)\n    if mode == \"db\":\n        x = 20 * np.log10(spec + np.finfo(spec.dtype).eps)\n    elif mode == \"linear\":\n        x = spec\n    else:\n        raise ValueError(mode)\n\n    if fs is not None:\n        ytop = fs / 2000\n        ylabel = \"kHz\"\n    else:\n        ytop = x.shape[0]\n        ylabel = \"bin\"\n\n    if frame_shift is not None and fs is not None:\n        xtop = x.shape[1] * frame_shift / fs\n        xlabel = \"s\"\n    else:\n        xtop = x.shape[1]\n        xlabel = \"frame\"\n\n    extent = (0, xtop, 0, ytop)\n    plt.imshow(x[::-1], cmap=cmap, extent=extent)\n\n    if labelbottom:\n        plt.xlabel(\"time [{}]\".format(xlabel))\n    if labelleft:\n        plt.ylabel(\"freq [{}]\".format(ylabel))\n    plt.colorbar().set_label(\"{}\".format(mode))\n\n    plt.tick_params(\n        bottom=bottom,\n        left=left,\n        right=right,\n        top=top,\n        labelbottom=labelbottom,\n        labelleft=labelleft,\n        labelright=labelright,\n        labeltop=labeltop,\n    )\n    plt.axis(\"auto\")\n\n\n# * ------------------ recognition related ------------------ *\ndef format_mulenc_args(args):\n    \"\"\"Format args for multi-encoder setup.\n\n    It deals with following situations:  (when args.num_encs=2):\n    1. args.elayers = None -> args.elayers = [4, 4];\n    2. args.elayers = 4 -> args.elayers = [4, 4];\n    3. 
args.elayers = [4, 4, 4] -> args.elayers = [4, 4].\n\n    \"\"\"\n    # default values when None is assigned.\n    default_dict = {\n        \"etype\": \"blstmp\",\n        \"elayers\": 4,\n        \"eunits\": 300,\n        \"subsample\": \"1\",\n        \"dropout_rate\": 0.0,\n        \"atype\": \"dot\",\n        \"adim\": 320,\n        \"awin\": 5,\n        \"aheads\": 4,\n        \"aconv_chans\": -1,\n        \"aconv_filts\": 100,\n    }\n    for k in default_dict.keys():\n        if isinstance(vars(args)[k], list):\n            if len(vars(args)[k]) != args.num_encs:\n                logging.warning(\n                    \"Length mismatch {}: Convert {} to {}.\".format(\n                        k, vars(args)[k], vars(args)[k][: args.num_encs]\n                    )\n                )\n            vars(args)[k] = vars(args)[k][: args.num_encs]\n        else:\n            if not vars(args)[k]:\n                # assign default value if it is None\n                vars(args)[k] = default_dict[k]\n                logging.warning(\n                    \"{} is not specified, use default value {}.\".format(\n                        k, default_dict[k]\n                    )\n                )\n            # duplicate\n            logging.warning(\n                \"Type mismatch {}: Convert {} to {}.\".format(\n                    k, vars(args)[k], [vars(args)[k] for _ in range(args.num_encs)]\n                )\n            )\n            vars(args)[k] = [vars(args)[k] for _ in range(args.num_encs)]\n    return args\n"
  },
  {
    "path": "asr/chainer_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "asr/chainer_backend/asr.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Training/decoding definition for the speech recognition task.\"\"\"\n\nimport json\nimport logging\nimport os\nimport six\n\n# chainer related\nimport chainer\n\nfrom chainer import training\n\nfrom chainer.datasets import TransformDataset\nfrom chainer.training import extensions\n\n# espnet related\nfrom espnet.asr.asr_utils import adadelta_eps_decay\nfrom espnet.asr.asr_utils import add_results_to_json\nfrom espnet.asr.asr_utils import chainer_load\nfrom espnet.asr.asr_utils import CompareValueTrigger\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import restore_snapshot\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.utils.deterministic_utils import set_deterministic_chainer\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.evaluator import BaseEvaluator\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.iterators import ToggleableShufflingMultiprocessIterator\nfrom espnet.utils.training.iterators import ToggleableShufflingSerialIterator\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\n# rnnlm\nimport espnet.lm.chainer_backend.extlm as extlm_chainer\nimport espnet.lm.chainer_backend.lm as lm_chainer\n\n# numpy related\nimport matplotlib\n\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom tensorboardX import SummaryWriter\n\nmatplotlib.use(\"Agg\")\n\n\ndef train(args):\n    \"\"\"Train with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    # display chainer version\n    logging.info(\"chainer version = \" + chainer.__version__)\n\n    
set_deterministic_chainer(args)\n\n    # check cuda and cudnn availability\n    if not chainer.cuda.available:\n        logging.warning(\"cuda is not available\")\n    if not chainer.cuda.cudnn_enabled:\n        logging.warning(\"cudnn is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n    idim = int(valid_json[utts[0]][\"input\"][0][\"shape\"][1])\n    odim = int(valid_json[utts[0]][\"output\"][0][\"shape\"][1])\n    logging.info(\"#input dims : \" + str(idim))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # specify attention, CTC, hybrid mode\n    if args.mtlalpha == 1.0:\n        mtl_mode = \"ctc\"\n        logging.info(\"Pure CTC mode\")\n    elif args.mtlalpha == 0.0:\n        mtl_mode = \"att\"\n        logging.info(\"Pure attention mode\")\n    else:\n        mtl_mode = \"mtl\"\n        logging.info(\"Multitask learning mode\")\n\n    # specify model architecture\n    logging.info(\"import model module: \" + args.model_module)\n    model_class = dynamic_import(args.model_module)\n    model = model_class(idim, odim, args, flag_return=False)\n    assert isinstance(model, ASRInterface)\n    total_subsampling_factor = model.get_total_subsampling_factor()\n\n    # write model config\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim, odim, vars(args)), indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    # Set gpu\n    ngpu = args.ngpu\n    if ngpu == 1:\n        gpu_id = 0\n        # Make a specified GPU 
current\n        chainer.cuda.get_device_from_id(gpu_id).use()\n        model.to_gpu()  # Copy the model to the GPU\n        logging.info(\"single gpu calculation.\")\n    elif ngpu > 1:\n        gpu_id = 0\n        devices = {\"main\": gpu_id}\n        for gid in six.moves.xrange(1, ngpu):\n            devices[\"sub_%d\" % gid] = gid\n        logging.info(\"multi gpu calculation (#gpus = %d).\" % ngpu)\n        logging.warning(\n            \"batch size is automatically increased (%d -> %d)\"\n            % (args.batch_size, args.batch_size * args.ngpu)\n        )\n    else:\n        gpu_id = -1\n        logging.info(\"cpu calculation\")\n\n    # Setup an optimizer\n    if args.opt == \"adadelta\":\n        optimizer = chainer.optimizers.AdaDelta(eps=args.eps)\n    elif args.opt == \"adam\":\n        optimizer = chainer.optimizers.Adam()\n    elif args.opt == \"noam\":\n        optimizer = chainer.optimizers.Adam(alpha=0, beta1=0.9, beta2=0.98, eps=1e-9)\n    else:\n        raise NotImplementedError(\"args.opt={}\".format(args.opt))\n\n    optimizer.setup(model)\n    optimizer.add_hook(chainer.optimizer.GradientClipping(args.grad_clip))\n\n    # Setup a converter\n    converter = model.custom_converter(subsampling_factor=model.subsample[0])\n\n    # read json data\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    # set up training iterator and updater\n    load_tr = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": True},  # Switch the mode of preprocessing\n    )\n    load_cv = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n\n    use_sortagrad = args.sortagrad == 
-1 or args.sortagrad > 0\n    accum_grad = args.accum_grad\n    if ngpu <= 1:\n        # make minibatch list (variable length)\n        train = make_batchset(\n            train_json,\n            args.batch_size,\n            args.maxlen_in,\n            args.maxlen_out,\n            args.minibatches,\n            min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n            shortest_first=use_sortagrad,\n            count=args.batch_count,\n            batch_bins=args.batch_bins,\n            batch_frames_in=args.batch_frames_in,\n            batch_frames_out=args.batch_frames_out,\n            batch_frames_inout=args.batch_frames_inout,\n            iaxis=0,\n            oaxis=0,\n        )\n        # hack to make batchsize argument as 1\n        # actual batchsize is included in a list\n        if args.n_iter_processes > 0:\n            train_iters = [\n                ToggleableShufflingMultiprocessIterator(\n                    TransformDataset(train, load_tr),\n                    batch_size=1,\n                    n_processes=args.n_iter_processes,\n                    n_prefetch=8,\n                    maxtasksperchild=20,\n                    shuffle=not use_sortagrad,\n                )\n            ]\n        else:\n            train_iters = [\n                ToggleableShufflingSerialIterator(\n                    TransformDataset(train, load_tr),\n                    batch_size=1,\n                    shuffle=not use_sortagrad,\n                )\n            ]\n\n        # set up updater\n        updater = model.custom_updater(\n            train_iters[0],\n            optimizer,\n            converter=converter,\n            device=gpu_id,\n            accum_grad=accum_grad,\n        )\n    else:\n        if args.batch_count not in (\"auto\", \"seq\") and args.batch_size == 0:\n            raise NotImplementedError(\n                \"--batch-count 'bin' and 'frame' are not implemented \"\n                \"in chainer multi gpu\"\n            )\n   
     # set up minibatches\n        train_subsets = []\n        for gid in six.moves.xrange(ngpu):\n            # make subset\n            train_json_subset = {\n                k: v for i, (k, v) in enumerate(train_json.items()) if i % ngpu == gid\n            }\n            # make minibatch list (variable length)\n            train_subsets += [\n                make_batchset(\n                    train_json_subset,\n                    args.batch_size,\n                    args.maxlen_in,\n                    args.maxlen_out,\n                    args.minibatches,\n                )\n            ]\n\n        # each subset must have same length for MultiprocessParallelUpdater\n        maxlen = max([len(train_subset) for train_subset in train_subsets])\n        for train_subset in train_subsets:\n            if maxlen != len(train_subset):\n                for i in six.moves.xrange(maxlen - len(train_subset)):\n                    train_subset += [train_subset[i]]\n\n        # hack to make batchsize argument as 1\n        # actual batchsize is included in a list\n        if args.n_iter_processes > 0:\n            train_iters = [\n                ToggleableShufflingMultiprocessIterator(\n                    TransformDataset(train_subsets[gid], load_tr),\n                    batch_size=1,\n                    n_processes=args.n_iter_processes,\n                    n_prefetch=8,\n                    maxtasksperchild=20,\n                    shuffle=not use_sortagrad,\n                )\n                for gid in six.moves.xrange(ngpu)\n            ]\n        else:\n            train_iters = [\n                ToggleableShufflingSerialIterator(\n                    TransformDataset(train_subsets[gid], load_tr),\n                    batch_size=1,\n                    shuffle=not use_sortagrad,\n                )\n                for gid in six.moves.xrange(ngpu)\n            ]\n\n        # set up updater\n        updater = model.custom_parallel_updater(\n            
train_iters, optimizer, converter=converter, devices=devices\n        )\n\n    # Set up a trainer\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler(train_iters),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n    if args.opt == \"noam\":\n        from espnet.nets.chainer_backend.transformer.training import VaswaniRule\n\n        trainer.extend(\n            VaswaniRule(\n                \"alpha\",\n                d=args.adim,\n                warmup_steps=args.transformer_warmup_steps,\n                scale=args.transformer_lr,\n            ),\n            trigger=(1, \"iteration\"),\n        )\n    # Resume from a snapshot\n    if args.resume:\n        chainer.serializers.load_npz(args.resume, trainer)\n\n    # set up validation iterator\n    valid = make_batchset(\n        valid_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=0,\n    )\n\n    if args.n_iter_processes > 0:\n        valid_iter = chainer.iterators.MultiprocessIterator(\n            TransformDataset(valid, load_cv),\n            batch_size=1,\n            repeat=False,\n            shuffle=False,\n            n_processes=args.n_iter_processes,\n            n_prefetch=8,\n            maxtasksperchild=20,\n        )\n    else:\n        valid_iter = chainer.iterators.SerialIterator(\n            TransformDataset(valid, load_cv), batch_size=1, repeat=False, shuffle=False\n        )\n\n    # Evaluate the model with the test dataset for each epoch\n    
trainer.extend(BaseEvaluator(valid_iter, model, converter=converter, device=gpu_id))\n\n    # Save attention weight each epoch\n    if args.num_save_attention > 0 and args.mtlalpha != 1.0:\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"input\"][0][\"shape\"][1]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n        logging.info(\"Using custom PlotAttentionReport\")\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            transform=load_cv,\n            device=gpu_id,\n            subsampling_factor=total_subsampling_factor,\n        )\n        trainer.extend(att_reporter, trigger=(1, \"epoch\"))\n    else:\n        att_reporter = None\n\n    # Take a snapshot for each specified epoch\n    trainer.extend(\n        extensions.snapshot(filename=\"snapshot.ep.{.updater.epoch}\"),\n        trigger=(1, \"epoch\"),\n    )\n\n    # Make a plot for training and validation values\n    trainer.extend(\n        extensions.PlotReport(\n            [\n                \"main/loss\",\n                \"validation/main/loss\",\n                \"main/loss_ctc\",\n                \"validation/main/loss_ctc\",\n                \"main/loss_att\",\n                \"validation/main/loss_att\",\n            ],\n            \"epoch\",\n            file_name=\"loss.png\",\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/acc\", \"validation/main/acc\"], \"epoch\", file_name=\"acc.png\"\n        )\n    )\n\n    # Save best models\n    trainer.extend(\n        
extensions.snapshot_object(model, \"model.loss.best\"),\n        trigger=training.triggers.MinValueTrigger(\"validation/main/loss\"),\n    )\n    if mtl_mode != \"ctc\":\n        trainer.extend(\n            extensions.snapshot_object(model, \"model.acc.best\"),\n            trigger=training.triggers.MaxValueTrigger(\"validation/main/acc\"),\n        )\n\n    # epsilon decay in the optimizer\n    if args.opt == \"adadelta\":\n        if args.criterion == \"acc\" and mtl_mode != \"ctc\":\n            trainer.extend(\n                restore_snapshot(model, args.outdir + \"/model.acc.best\"),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(model, args.outdir + \"/model.loss.best\"),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n\n    # Write a log of evaluation statistics for each epoch\n    trainer.extend(\n        extensions.LogReport(trigger=(args.report_interval_iters, \"iteration\"))\n    )\n    report_keys = [\n        \"epoch\",\n        \"iteration\",\n        
\"main/loss\",\n        \"main/loss_ctc\",\n        \"main/loss_att\",\n        \"validation/main/loss\",\n        \"validation/main/loss_ctc\",\n        \"validation/main/loss_att\",\n        \"main/acc\",\n        \"validation/main/acc\",\n        \"elapsed_time\",\n    ]\n    if args.opt == \"adadelta\":\n        trainer.extend(\n            extensions.observe_value(\n                \"eps\", lambda trainer: trainer.updater.get_optimizer(\"main\").eps\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"eps\")\n    trainer.extend(\n        extensions.PrintReport(report_keys),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n\n    trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n\n    set_early_stop(trainer, args)\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        writer = SummaryWriter(args.tensorboard_dir)\n        trainer.extend(\n            TensorboardLogger(writer, att_reporter),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, args.epochs)\n\n\ndef recog(args):\n    \"\"\"Decode with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    # display chainer version\n    logging.info(\"chainer version = \" + chainer.__version__)\n\n    set_deterministic_chainer(args)\n\n    # read training config\n    idim, odim, train_args = get_model_conf(args.model, args.model_conf)\n\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    # specify model architecture\n    logging.info(\"reading model parameters from \" + args.model)\n    # To be compatible with v.0.3.0 models\n    if hasattr(train_args, \"model_module\"):\n        model_module = train_args.model_module\n    else:\n        model_module = 
\"espnet.nets.chainer_backend.e2e_asr:E2E\"\n    model_class = dynamic_import(model_module)\n    model = model_class(idim, odim, train_args)\n    assert isinstance(model, ASRInterface)\n    chainer_load(args.model, model)\n\n    # read rnnlm\n    if args.rnnlm:\n        rnnlm_args = get_model_conf(args.rnnlm, args.rnnlm_conf)\n        rnnlm = lm_chainer.ClassifierWithState(\n            lm_chainer.RNNLM(\n                len(train_args.char_list), rnnlm_args.layer, rnnlm_args.unit\n            )\n        )\n        chainer_load(args.rnnlm, rnnlm)\n    else:\n        rnnlm = None\n\n    if args.word_rnnlm:\n        rnnlm_args = get_model_conf(args.word_rnnlm, args.word_rnnlm_conf)\n        word_dict = rnnlm_args.char_list_dict\n        char_dict = {x: i for i, x in enumerate(train_args.char_list)}\n        word_rnnlm = lm_chainer.ClassifierWithState(\n            lm_chainer.RNNLM(len(word_dict), rnnlm_args.layer, rnnlm_args.unit)\n        )\n        chainer_load(args.word_rnnlm, word_rnnlm)\n\n        if rnnlm is not None:\n            rnnlm = lm_chainer.ClassifierWithState(\n                extlm_chainer.MultiLevelLM(\n                    word_rnnlm.predictor, rnnlm.predictor, word_dict, char_dict\n                )\n            )\n        else:\n            rnnlm = lm_chainer.ClassifierWithState(\n                extlm_chainer.LookAheadWordLM(\n                    word_rnnlm.predictor, word_dict, char_dict\n                )\n            )\n\n    # read json data\n    with open(args.recog_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=False,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n\n    # decode each utterance\n    new_js = {}\n    with 
chainer.no_backprop_mode():\n        for idx, name in enumerate(js.keys(), 1):\n            logging.info(\"(%d/%d) decoding \" + name, idx, len(js.keys()))\n            batch = [(name, js[name])]\n            feat = load_inputs_and_targets(batch)[0][0]\n            nbest_hyps = model.recognize(feat, args, train_args.char_list, rnnlm)\n            new_js[name] = add_results_to_json(\n                js[name], nbest_hyps, train_args.char_list\n            )\n\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n"
  },
  {
    "path": "asr/pytorch_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "asr/pytorch_backend/asr.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Training/decoding definition for the speech recognition task.\"\"\"\n\nimport copy\nimport json\nimport logging\nimport math\nimport os\nimport sys\n\nfrom chainer import reporter as reporter_module\nfrom chainer import training\nfrom chainer.training import extensions\nfrom chainer.training.updater import StandardUpdater\nimport numpy as np\nimport torch\nimport torch.distributed as dist\nimport time\n\nfrom espnet.asr.asr_utils import adadelta_eps_decay\nfrom espnet.asr.asr_utils import add_results_to_json\nfrom espnet.asr.asr_utils import CompareValueTrigger\nfrom espnet.asr.asr_utils import format_mulenc_args\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import plot_spectrogram\nfrom espnet.asr.asr_utils import restore_snapshot\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\nfrom espnet.asr.pytorch_backend.asr_init import freeze_modules\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_model\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_modules\nimport espnet.lm.pytorch_backend.extlm as extlm_pytorch\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.beam_search_transducer import BeamSearchTransducer\nfrom espnet.nets.pytorch_backend.e2e_asr import pad_list\nimport espnet.nets.pytorch_backend.lm.default as lm_pytorch\nfrom espnet.nets.pytorch_backend.streaming.segment import SegmentStreamingE2E\nfrom espnet.nets.pytorch_backend.streaming.window import WindowStreamingE2E\nfrom espnet.transform.spectrogram import IStft\nfrom espnet.transform.transformation import Transformation\nfrom espnet.utils.cli_writers import file_writer_helper\nfrom espnet.utils.dataset import ChainerDataLoader\nfrom 
espnet.utils.dataset import TransformDataset\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.evaluator import BaseEvaluator\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\nfrom espnet.snowfall.warpper.k2_decode import k2_decode\nimport matplotlib\n\nfrom espnet.utils.parse_decoding_process import plot_decoding_logs\nfrom espnet.utils.bmuf import BlockAdamTrainer\nmatplotlib.use(\"Agg\")\n\nif sys.version_info[0] == 2:\n    from itertools import izip_longest as zip_longest\nelse:\n    from itertools import zip_longest as zip_longest\n\nfrom espnet.nets.scorers.mmi_rnnt_scorer import MMIRNNTScorer\n# from espnet.nets.scorers.mmi_alignment_score import MMIRNNTScorer\nfrom espnet.utils.print import step_print\nfrom espnet.utils.sampler import BufferSampler\nfrom espnet.utils.rtf_calculator import RTF_calculator\nfrom espnet.nets.lm_interface import dynamic_import_lm\n\n\ndef _recursive_to(xs, device):\n    if torch.is_tensor(xs):\n        return xs.to(device)\n    if isinstance(xs, tuple):\n        return tuple(_recursive_to(x, device) for x in xs)\n    return xs\n\ndef is_alphabet(char):\n    if (char >= '\\u0041' and char <= '\\u005a') or (char >= '\\u0061' and char <= '\\u007a'):\n        return True\n    else:\n        return False\n\nclass CustomEvaluator(BaseEvaluator):\n    \"\"\"Custom Evaluator for Pytorch.\n\n    Args:\n        model (torch.nn.Module): The model to evaluate.\n        iterator (chainer.dataset.Iterator) : The train iterator.\n\n        target (link | dict[str, link]) :Link object or a dictionary of\n            links 
to evaluate. If this is just a link object, the link is\n            registered by the name ``'main'``.\n\n        device (torch.device): The device used.\n        ngpu (int): The number of GPUs.\n\n    \"\"\"\n\n    def __init__(self, model, iterator, target, device, ngpu=None):\n        super(CustomEvaluator, self).__init__(iterator, target)\n        self.model = model\n        self.device = device\n        if ngpu is not None:\n            self.ngpu = ngpu\n        elif device.type == \"cpu\":\n            self.ngpu = 0\n        else:\n            self.ngpu = 1\n\n    # The core part of the update routine can be customized by overriding\n    def evaluate(self):\n        \"\"\"Main evaluate routine for CustomEvaluator.\"\"\"\n        iterator = self._iterators[\"main\"]\n\n        if self.eval_hook:\n            self.eval_hook(self)\n\n        if hasattr(iterator, \"reset\"):\n            iterator.reset()\n            it = iterator\n        else:\n            it = copy.copy(iterator)\n\n        summary = reporter_module.DictSummary()\n\n        self.model.eval()\n        with torch.no_grad():\n            for batch in it:\n                print(\"evaluation batch\")\n                x = _recursive_to(batch, self.device)\n                observation = {}\n                with reporter_module.report_scope(observation):\n                    # read scp files\n                    # x: original json with loaded features\n                    #    will be converted to chainer variable later\n                    if self.ngpu == 0:\n                        self.model(*x)\n                    else:\n                        # apex does not support torch.nn.DataParallel\n                        # data_parallel(self.model, x, range(self.ngpu))\n                        self.model(*x)\n                summary.add(observation)\n        self.model.train()\n\n        return summary.compute_mean()\n\n\nclass CustomUpdater(StandardUpdater):\n    \"\"\"Custom Updater for Pytorch.\n\n  
  Args:\n        model (torch.nn.Module): The model to update.\n        grad_clip_threshold (float): The gradient clipping value to use.\n        train_iter (chainer.dataset.Iterator): The training iterator.\n        optimizer (torch.optim.optimizer): The training optimizer.\n\n        device (torch.device): The device to use.\n        ngpu (int): The number of gpus to use.\n        use_apex (bool): The flag to use Apex in backprop.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        model,\n        grad_clip_threshold,\n        train_iter,\n        optimizer,\n        device,\n        ngpu,\n        grad_noise=False,\n        accum_grad=1,\n        use_apex=False,\n        ddp_trainer=None\n    ):\n        super(CustomUpdater, self).__init__(train_iter, optimizer)\n        self.model = model\n        self.grad_clip_threshold = grad_clip_threshold\n        self.device = device\n        self.ngpu = ngpu\n        self.accum_grad = accum_grad\n        self.forward_count = 0\n        self.grad_noise = grad_noise\n        self.iteration = 0\n        self.use_apex = use_apex\n        self.ddp_trainer = ddp_trainer\n        self.optimizer = optimizer\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Main update routine of the CustomUpdater.\"\"\"\n        # When we pass one iterator and optimizer to StandardUpdater.__init__,\n        # they are automatically named 'main'.\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n        epoch = train_iter.epoch\n\n        \n        batch = train_iter.next()\n        \n        x = _recursive_to(batch, self.device)\n        is_new_epoch = train_iter.epoch != epoch\n        \n        if self.ngpu == 0:\n            loss = self.model(*x).mean() / self.accum_grad\n        else:\n            # apex does not support torch.nn.DataParallel\n            #loss = (\n            #    data_parallel(self.model, x, 
range(self.ngpu)).mean() / self.accum_grad\n            #)\n            loss = self.model(*x) / self.accum_grad\n        if self.use_apex:\n            from apex import amp\n\n            # NOTE: for a compatibility with noam optimizer\n            opt = optimizer.optimizer if hasattr(optimizer, \"optimizer\") else optimizer\n            with amp.scale_loss(loss, opt) as scaled_loss:\n                scaled_loss.backward()\n        else:\n            loss.backward()\n        # step_print(f\"| forward_count {self.forward_count} | finish backward\")\n        # gradient noise injection\n        if self.grad_noise:\n            from espnet.asr.asr_utils import add_gradient_noise\n\n            add_gradient_noise(\n                self.model, self.iteration, duration=100, eta=1.0, scale_factor=0.55\n            )\n\n        # update parameters\n        self.forward_count += 1\n        if not is_new_epoch and self.forward_count != self.accum_grad:\n            return\n        self.forward_count = 0\n        # compute the gradient norm to check if it is normal or not\n        grad_norm = torch.nn.utils.clip_grad_norm_(\n            self.model.parameters(), self.grad_clip_threshold\n        )\n        logging.info(\"on device {} grad norm={}\".format(self.device, grad_norm))\n        if math.isnan(grad_norm):\n            logging.warning(\"grad norm is nan. Do not update model.\")\n            self.ddp_trainer.optimizer.zero_grad()\n        else:\n            \"\"\"\n            Optimizer is never used for update. 
\n            The real updating process and the DDP communication is in \n            this `update_and_sync()`\n            \"\"\"\n            # self.optimizer.step()\n            self.ddp_trainer.update_and_sync()\n            if self.iteration % 1 == 0:\n                step_print(f\"| iteration: {self.iteration} | gradient applied\")\n\n    def update(self):\n        self.update_core()\n        # #iterations with accum_grad > 1\n        # Ref.: https://github.com/espnet/espnet/issues/777\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass CustomConverter(object):\n    \"\"\"Custom batch converter for Pytorch.\n\n    Args:\n        subsampling_factor (int): The subsampling factor.\n        dtype (torch.dtype): Data type to convert.\n\n    \"\"\"\n\n    def __init__(self, subsampling_factor=1, dtype=torch.float32):\n        \"\"\"Construct a CustomConverter object.\"\"\"\n        self.subsampling_factor = subsampling_factor\n        self.ignore_id = -1\n        self.dtype = dtype\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Transform a batch and send it to a device.\n\n        Args:\n            batch (list): The batch to transform.\n            device (torch.device): The device to send to.\n\n        Returns:\n            tuple(torch.Tensor, torch.Tensor, torch.Tensor)\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys, texts, xs_orig = batch[0]\n\n        # perform subsampling\n        if self.subsampling_factor > 1:\n            xs = [x[:: self.subsampling_factor, :] for x in xs]\n\n        # get batch of lengths of input sequences\n        ilens = np.array([x.shape[0] for x in xs])\n\n        # perform padding and convert to tensor\n        # currently only support real number\n        if xs[0].dtype.kind == \"c\":\n            xs_pad_real = pad_list(\n                [torch.from_numpy(x.real).float() for x in xs], 0\n            
).to(device, dtype=self.dtype)\n            xs_pad_imag = pad_list(\n                [torch.from_numpy(x.imag).float() for x in xs], 0\n            ).to(device, dtype=self.dtype)\n            # Note(kamo):\n            # {'real': ..., 'imag': ...} will be changed to ComplexTensor in E2E.\n            # Don't create ComplexTensor and give it to E2E here\n            # because torch.nn.DataParallel can't handle it.\n            xs_pad = {\"real\": xs_pad_real, \"imag\": xs_pad_imag}\n        else:\n            xs_pad = pad_list([torch.from_numpy(x).float() for x in xs], 0).to(\n                device, dtype=self.dtype\n            )\n\n        xs_pad_orig = pad_list([torch.from_numpy(x).float() for x in xs_orig], 0).to(\n            device, dtype=self.dtype\n        )\n\n        ilens = torch.from_numpy(ilens).to(device)\n        # NOTE: this is for multi-output (e.g., speech translation)\n        ys_pad = pad_list(\n            [\n                torch.from_numpy(\n                    np.array(y[0][:]) if isinstance(y, tuple) else y\n                ).long()\n                for y in ys\n            ],\n            self.ignore_id,\n        ).to(device)\n\n        return xs_pad, ilens, ys_pad, texts, xs_pad_orig\n\n\nclass CustomConverterMulEnc(object):\n    \"\"\"Custom batch converter for Pytorch in multi-encoder case.\n\n    Args:\n        subsamping_factors (list): List of subsampling factors for each encoder.\n        dtype (torch.dtype): Data type to convert.\n\n    \"\"\"\n\n    def __init__(self, subsamping_factors=[1, 1], dtype=torch.float32):\n        \"\"\"Initialize the converter.\"\"\"\n        self.subsamping_factors = subsamping_factors\n        self.ignore_id = -1\n        self.dtype = dtype\n        self.num_encs = len(subsamping_factors)\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Transform a batch and send it to a device.\n\n        Args:\n            batch (list): The batch to transform.\n            device 
(torch.device): The device to send to.\n\n        Returns:\n            tuple( list(torch.Tensor), list(torch.Tensor), torch.Tensor)\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs_list = batch[0][: self.num_encs]\n        ys = batch[0][-1]\n\n        # perform subsampling\n        if np.sum(self.subsamping_factors) > self.num_encs:\n            xs_list = [\n                [x[:: self.subsamping_factors[i], :] for x in xs_list[i]]\n                for i in range(self.num_encs)\n            ]\n\n        # get batch of lengths of input sequences\n        ilens_list = [\n            np.array([x.shape[0] for x in xs_list[i]]) for i in range(self.num_encs)\n        ]\n\n        # perform padding and convert to tensor\n        # currently only support real number\n        xs_list_pad = [\n            pad_list([torch.from_numpy(x).float() for x in xs_list[i]], 0).to(\n                device, dtype=self.dtype\n            )\n            for i in range(self.num_encs)\n        ]\n\n        ilens_list = [\n            torch.from_numpy(ilens_list[i]).to(device) for i in range(self.num_encs)\n        ]\n        # NOTE: this is for multi-task learning (e.g., speech translation)\n        ys_pad = pad_list(\n            [\n                torch.from_numpy(np.array(y[0]) if isinstance(y, tuple) else y).long()\n                for y in ys\n            ],\n            self.ignore_id,\n        ).to(device)\n\n        return xs_list_pad, ilens_list, ys_pad\n\n\ndef train(args):\n    \"\"\"Train with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n    if args.num_encs > 1:\n        args = format_mulenc_args(args)\n\n    # check cuda availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = 
json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n    idim_list = [\n        int(valid_json[utts[0]][\"input\"][i][\"shape\"][-1]) for i in range(args.num_encs)\n    ]\n    odim = int(valid_json[utts[0]][\"output\"][0][\"shape\"][-1])\n    for i in range(args.num_encs):\n        logging.info(\"stream{}: input dims : {}\".format(i + 1, idim_list[i]))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # specify attention, CTC, hybrid mode\n    if \"transducer\" in args.model_module:\n        if (\n            getattr(args, \"etype\", False) == \"custom\"\n            or getattr(args, \"dtype\", False) == \"custom\"\n        ):\n            mtl_mode = \"custom_transducer\"\n        else:\n            mtl_mode = \"transducer\"\n        logging.info(\"Pure transducer mode\")\n    elif args.mtlalpha == 1.0:\n        mtl_mode = \"ctc\"\n        logging.info(\"Pure CTC mode\")\n    elif args.mtlalpha == 0.0:\n        mtl_mode = \"att\"\n        logging.info(\"Pure attention mode\")\n    else:\n        mtl_mode = \"mtl\"\n        logging.info(\"Multitask learning mode\")\n\n    if (args.enc_init is not None or args.dec_init is not None) and args.num_encs == 1:\n        model = load_trained_modules(idim_list[0], odim, args)\n    else:\n        model_class = dynamic_import(args.model_module)\n        model = model_class(\n            idim_list[0] if args.num_encs == 1 else idim_list, odim, args\n        )\n    assert isinstance(model, ASRInterface)\n    total_subsampling_factor = model.get_total_subsampling_factor()\n\n    print(model)\n    logging.info(\n        \" Total parameter of the model = \"\n        + str(sum(p.numel() for p in model.parameters()))\n    )\n\n    if args.rnnlm is not None:\n        rnnlm_args = get_model_conf(args.rnnlm, args.rnnlm_conf)\n        rnnlm = lm_pytorch.ClassifierWithState(\n            lm_pytorch.RNNLM(len(args.char_list), rnnlm_args.layer, rnnlm_args.unit)\n        )\n        torch_load(args.rnnlm, rnnlm)\n        
model.rnnlm = rnnlm\n\n    # write model config\n    global_rank = args.node_rank * args.node_size + args.local_rank\n    args.outdir = args.outdir.replace(\"RANK\", str(global_rank))\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim_list[0] if args.num_encs == 1 else idim_list, odim, vars(args)),\n                indent=4,\n                ensure_ascii=False,\n                sort_keys=True,\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    reporter = model.reporter\n\n    # check the use of multi-gpu\n    if args.ngpu > 1:\n        if args.batch_size != 0:\n            logging.warning(\n                \"batch size is automatically increased (%d -> %d)\"\n                % (args.batch_size, args.batch_size * args.ngpu)\n            )\n            args.batch_size *= args.ngpu\n        if args.num_encs > 1:\n            # TODO(ruizhili): implement data parallel for multi-encoder setup.\n            raise NotImplementedError(\n                \"Data parallel is not supported for multi-encoder setup.\"\n            )\n\n    # set torch device \n    assert args.ngpu in [1, 0] # this is ddp version\n    device = torch.device(f\"cuda:{args.local_rank}\" if args.ngpu > 0 else \"cpu\")\n    \n    if args.train_dtype in (\"float16\", \"float32\", \"float64\"):\n        dtype = getattr(torch, args.train_dtype)\n    else:\n        dtype = torch.float32\n    model = model.to(device=device, dtype=dtype)\n    if args.freeze_mods:\n        model, model_params = freeze_modules(model, args.freeze_mods)\n    else:\n        model_params = model.parameters()\n    logging.warning(\n        \"num. model params: {:,} (num. 
trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # We build the SGD optimizer but never use it.\n    # Other code needs this\n    # The real optimizer is in ddp_trainer\n    optimizer = torch.optim.SGD(model_params, lr=1.0)\n\n    # setup apex.amp\n    if args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\"):\n        try:\n            from apex import amp\n        except ImportError as e:\n            logging.error(\n                f\"You need to install apex for --train-dtype {args.train_dtype}. \"\n                \"See https://github.com/NVIDIA/apex#linux\"\n            )\n            raise e\n        if args.opt == \"noam\":\n            model, optimizer.optimizer = amp.initialize(\n                model, optimizer.optimizer, opt_level=args.train_dtype\n            )\n        else:\n            model, optimizer = amp.initialize(\n                model, optimizer, opt_level=args.train_dtype\n            )\n        use_apex = True\n\n        from espnet.nets.pytorch_backend.ctc import CTC\n\n        amp.register_float_function(CTC, \"loss_fn\")\n        amp.init()\n        logging.warning(\"register ctc as float function\")\n    else:\n        use_apex = False\n\n    # FIXME: TOO DIRTY HACK\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    # Setup a converter\n    if args.num_encs == 1:\n        converter = CustomConverter(subsampling_factor=model.subsample[0], dtype=dtype)\n    else:\n        converter = CustomConverterMulEnc(\n            [i[0] for i in model.subsample_list], dtype=dtype\n        )\n\n    # read json data\n    args.train_json = args.train_json.replace(\"RANK\", str(global_rank 
+ 1))\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    # if block_load is used, the utterances must be sorted from shortest to longest\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0 or args.block_load\n    # make minibatch list (variable length)\n    # disable the adaptive batch_size to sync DDP training\n    # if frames are used as the count, we do not set min_batch_size\n    train = make_batchset(\n        train_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.batch_size if args.batch_size > 0 else 1, #args.ngpu if args.ngpu > 1 else 1,\n        shortest_first=use_sortagrad,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=0,\n        no_sort=args.block_load,\n    )\n    valid = make_batchset(\n        valid_json,\n        args.batch_size * 2,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.batch_size, #args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=0,\n    )\n\n    if args.block_load:\n        assert args.n_iter_processes <= 1, \"never use more than one worker\"\n        sampler = BufferSampler(\n            length=len(train),\n            utts_per_ark=args.utts_per_ark,\n            batch_size=args.batch_size,\n            buf_size=args.block_buffer_size,\n            seed=args.seed,\n        )\n        prefetch_factor = 
sampler.get_prefetch_factor()\n        shuffle = None\n    else:\n        sampler = None\n        prefetch_factor = 20\n        shuffle = not use_sortagrad\n\n    load_tr = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": True},  # Switch the mode of preprocessing\n        block_load=args.block_load,\n    )\n    load_cv = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n    # hack: the DataLoader batch_size argument is set to 1;\n    # the actual batch is already packed inside a list\n    # the default collate function converts numpy arrays to pytorch tensors;\n    # we use a pass-through collate function instead, which returns the list element\n    train_dataset = TransformDataset(train, lambda data: converter([load_tr(data)]))\n    valid_dataset = TransformDataset(valid, lambda data: converter([load_cv(data)]))\n\n    train_iter = ChainerDataLoader(\n        dataset=train_dataset,\n        batch_size=1,\n        num_workers=args.n_iter_processes,\n        shuffle=shuffle,\n        collate_fn=lambda x: x[0],\n        prefetch_factor=prefetch_factor,\n        sampler=sampler\n    )\n    # prefetch_factor=5,\n    valid_iter = ChainerDataLoader(\n        dataset=valid_dataset,\n        batch_size=1,\n        shuffle=False,\n        collate_fn=lambda x: x[0],\n        num_workers=args.n_iter_processes,\n    )\n\n    \n    # Set up a trainer\n    ddp_trainer = BlockAdamTrainer(args,\n                                   master_node=args.master_node,\n                                   rank=global_rank,\n                                   world_size=args.world_size,\n                                   model=model,\n    )\n    \n    updater = CustomUpdater(\n        model,\n        args.grad_clip,\n        {\"main\": train_iter},\n        optimizer,\n        
device,\n        args.ngpu,\n        args.grad_noise,\n        args.accum_grad,\n        use_apex=use_apex,\n        ddp_trainer=ddp_trainer\n    )\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    if use_sortagrad and args.sortagrad != 0:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n\n    # Resume from a snapshot\n    if args.resume:\n        logging.info(\"resumed from %s\" % args.resume)\n        torch_resume(args.resume, trainer, args.load_trainer_and_opt)\n\n    # Evaluate the model with the test dataset for each epoch\n    if args.save_interval_iters > 0:\n        trainer.extend(\n            CustomEvaluator(model, {\"main\": valid_iter}, reporter, device, args.ngpu),\n            trigger=(args.save_interval_iters, \"iteration\"),\n        )\n    else:\n        trainer.extend(\n            CustomEvaluator(model, {\"main\": valid_iter}, reporter, device, args.ngpu)\n        )\n\n    # Save attention weight each epoch\n    is_attn_plot = (\n        \"transformer\" in args.model_module\n        or \"conformer\" in args.model_module\n        or mtl_mode in [\"att\", \"mtl\", \"custom_transducer\"]\n    )\n\n    if args.num_save_attention > 0 and is_attn_plot:\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"input\"][0][\"shape\"][1]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            
transform=load_cv,\n            device=device,\n            subsampling_factor=total_subsampling_factor,\n        )\n        trainer.extend(att_reporter, trigger=(1, \"epoch\"))\n    else:\n        att_reporter = None\n\n    # Save CTC prob at each epoch\n    if mtl_mode in [\"ctc\", \"mtl\"] and args.num_save_ctc > 0:\n        # NOTE: sort it by output lengths\n        data = sorted(\n            list(valid_json.items())[: args.num_save_ctc],\n            key=lambda x: int(x[1][\"output\"][0][\"shape\"][0]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            ctc_vis_fn = model.module.calculate_all_ctc_probs\n            plot_class = model.module.ctc_plot_class\n        else:\n            ctc_vis_fn = model.calculate_all_ctc_probs\n            plot_class = model.ctc_plot_class\n        ctc_reporter = plot_class(\n            ctc_vis_fn,\n            data,\n            args.outdir + \"/ctc_prob\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n            subsampling_factor=total_subsampling_factor,\n        )\n        trainer.extend(ctc_reporter, trigger=(1, \"epoch\"))\n    else:\n        ctc_reporter = None\n\n    # Make a plot for training and validation values\n    if args.num_encs > 1:\n        report_keys_loss_ctc = [\n            \"main/loss_ctc{}\".format(i + 1) for i in range(model.num_encs)\n        ] + [\"validation/main/loss_ctc{}\".format(i + 1) for i in range(model.num_encs)]\n        report_keys_cer_ctc = [\n            \"main/cer_ctc{}\".format(i + 1) for i in range(model.num_encs)\n        ] + [\"validation/main/cer_ctc{}\".format(i + 1) for i in range(model.num_encs)]\n\n    if hasattr(model, \"is_rnnt\"):\n        trainer.extend(\n            extensions.PlotReport(\n                [\n                    \"main/loss\",\n                    \"validation/main/loss\",\n                    \"main/loss_trans\",\n                    \"validation/main/loss_trans\",\n 
                   \"main/loss_ctc\",\n                    \"validation/main/loss_ctc\",\n                    \"main/loss_lm\",\n                    \"validation/main/loss_lm\",\n                    \"main/loss_aux_trans\",\n                    \"validation/main/loss_aux_trans\",\n                    \"main/loss_aux_symm_kl\",\n                    \"validation/main/loss_aux_symm_kl\",\n                    \"main/loss_mbr\",\n                    \"validation/main/loss_mbr\",\n                    \"main/loss_mmi\",\n                    \"validation/main/loss_mmi\",\n                    \"main/loss_lang\",\n                    \"validation/main/loss_lang\",\n                    \"main/loss_att\",\n                    \"validation/main/loss_att\",\n                ],\n                \"epoch\",\n                file_name=\"loss.png\",\n            )\n        )\n    else:\n        trainer.extend(\n            extensions.PlotReport(\n                [\n                    \"main/loss\",\n                    \"validation/main/loss\",\n                    \"main/loss_ctc\",\n                    \"validation/main/loss_ctc\",\n                    \"main/loss_att\",\n                    \"validation/main/loss_att\",\n                    \"main/loss_third\",\n                    \"validation/main/loss_third\",\n                    \"main/loss_mbr\",\n                    \"validation/main/loss_mbr\",\n                ]\n                + ([] if args.num_encs == 1 else report_keys_loss_ctc),\n                \"epoch\",\n                file_name=\"loss.png\",\n            )\n        )\n\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/acc\", \"validation/main/acc\"], \"epoch\", file_name=\"acc.png\"\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/cer_ctc\", \"validation/main/cer_ctc\"]\n            + ([] if args.num_encs == 1 else report_keys_loss_ctc),\n            \"epoch\",\n            
file_name=\"cer.png\",\n        )\n    )\n\n    # save the checkpoint only if this is the master GPU\n    if global_rank == 0:\n        # Save best models\n        trainer.extend(\n            snapshot_object(model, \"model.loss.best\"),\n            trigger=training.triggers.MinValueTrigger(\"validation/main/loss\"),\n        )\n        if mtl_mode not in [\"ctc\", \"transducer\", \"custom_transducer\"]:\n            trainer.extend(\n                snapshot_object(model, \"model.acc.best\"),\n                trigger=training.triggers.MaxValueTrigger(\"validation/main/acc\"),\n            )\n    \n        # save snapshot which contains model and optimizer states\n        if args.save_interval_iters > 0:\n            trainer.extend(\n                torch_snapshot(filename=\"snapshot.iter.{.updater.iteration}\"),\n                trigger=(args.save_interval_iters, \"iteration\"),\n            )\n    \n        # save snapshot at every epoch - for model averaging\n        trainer.extend(torch_snapshot(), trigger=(1, \"epoch\"))\n\n    # epsilon decay in the optimizer\n    if args.opt == \"adadelta\":\n        if args.criterion == \"acc\" and mtl_mode != \"ctc\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.acc.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + 
\"/model.loss.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n        # NOTE: In some cases, it may take more than one epoch for the model's loss\n        # to escape from a local minimum.\n        # Thus, restore_snapshot extension is not used here.\n        # see details in https://github.com/espnet/espnet/pull/2171\n        elif args.criterion == \"loss_eps_decay_only\":\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n\n    # Write a log of evaluation statistics for each epoch\n    trainer.extend(\n        extensions.LogReport(trigger=(args.report_interval_iters, \"iteration\"))\n    )\n\n    if hasattr(model, \"is_rnnt\"):\n        report_keys = [\n            \"epoch\",\n            \"iteration\",\n            \"main/loss\",\n            \"main/loss_trans\",\n            \"main/loss_ctc\",\n            \"main/loss_lm\",\n            \"main/loss_aux_trans\",\n            \"main/loss_aux_symm_kl\",\n            \"main/loss_mbr\",\n            \"main/loss_mmi\",\n            \"main/loss_att\",\n            \"main/loss_lang\",\n            \"validation/main/loss\",\n            \"validation/main/loss_trans\",\n            \"validation/main/loss_ctc\",\n            \"validation/main/loss_lm\",\n            
\"validation/main/loss_aux_trans\",\n            \"validation/main/loss_aux_symm_kl\",\n            \"validation/main/loss_mbr\",\n            \"validation/main/loss_mmi\",\n            \"validation/main/loss_att\",\n            \"validation/main/loss_lang\",\n            \"elapsed_time\",\n        ]\n    else:\n        report_keys = [\n            \"epoch\",\n            \"iteration\",\n            \"main/loss\",\n            \"main/loss_ctc\",\n            \"main/loss_att\",\n            \"main/loss_third\",\n            \"main/loss_mbr\",\n            \"validation/main/loss\",\n            \"validation/main/loss_ctc\",\n            \"validation/main/loss_att\",\n            \"validation/main/loss_third\",\n            \"validation/main/loss_mbr\",\n            \"main/acc\",\n            \"validation/main/acc\",\n            \"main/cer_ctc\",\n            \"validation/main/cer_ctc\",\n            \"elapsed_time\",\n        ] + ([] if args.num_encs == 1 else report_keys_cer_ctc + report_keys_loss_ctc)\n\n    if args.opt == \"adadelta\":\n        trainer.extend(\n            extensions.observe_value(\n                \"eps\",\n                lambda trainer: trainer.updater.get_optimizer(\"main\").param_groups[0][\n                    \"eps\"\n                ],\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"eps\")\n    if args.report_cer:\n        report_keys.append(\"validation/main/cer\")\n    if args.report_wer:\n        report_keys.append(\"validation/main/wer\")\n\n    logwriter = open(args.outdir + f\"/train.{global_rank}.log\", 'w')\n    trainer.extend(\n        extensions.PrintReport(report_keys, out=logwriter),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n\n    # trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n    set_early_stop(trainer, args)\n\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, 
args.epochs)\n\n\ndef recog(args):\n    \"\"\"Decode with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n\n    if args.ngpu == 1:\n        gpu_id = args.local_rank - 1\n        logging.warning(\"gpu id: \" + str(gpu_id))\n        device=torch.device(\"cuda:{}\".format(gpu_id))\n    else:\n        device=torch.device(\"cpu\")\n        os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"-1\" # disable GPU\n\n    model, train_args = load_trained_model(args.model, training=False)\n    assert isinstance(model, ASRInterface)\n    model.recog_args = args\n\n    if args.streaming_mode and \"transformer\" in train_args.model_module:\n        raise NotImplementedError(\"streaming mode for transformer is not implemented\")\n    logging.info(\n        \" Total parameter of the model = \"\n        + str(sum(p.numel() for p in model.parameters()))\n    )\n\n    # read rnnlm\n    if args.rnnlm and args.lm_weight > 0.0:\n        rnnlm_args = get_model_conf(args.rnnlm, args.rnnlm_conf)\n        if getattr(rnnlm_args, \"model_module\", \"default\") == \"default\":\n            rnnlm = lm_pytorch.ClassifierWithState(\n                lm_pytorch.RNNLM(\n                    len(train_args.char_list),\n                    rnnlm_args.layer,\n                    rnnlm_args.unit,\n                    getattr(rnnlm_args, \"embed_unit\", None),  # for backward compatibility\n                )\n            )\n        elif getattr(rnnlm_args, \"model_module\", \"default\") == \"transformer\":\n            lm_class = dynamic_import_lm(\"transformer\", rnnlm_args.backend)\n            rnnlm = lm_class(len(train_args.char_list), rnnlm_args)\n        else:\n            raise ValueError(\"Unsupported LM type\")\n\n        torch_load(args.rnnlm, rnnlm)\n        rnnlm.eval()\n    else:\n        rnnlm = None\n\n    if args.word_rnnlm:\n        rnnlm_args = get_model_conf(args.word_rnnlm, args.word_rnnlm_conf)\n        word_dict = 
rnnlm_args.char_list_dict\n        char_dict = {x: i for i, x in enumerate(train_args.char_list)}\n        word_rnnlm = lm_pytorch.ClassifierWithState(\n            lm_pytorch.RNNLM(\n                len(word_dict),\n                rnnlm_args.layer,\n                rnnlm_args.unit,\n                getattr(rnnlm_args, \"embed_unit\", None),  # for backward compatibility\n            )\n        )\n        torch_load(args.word_rnnlm, word_rnnlm)\n        word_rnnlm.eval()\n\n        if rnnlm is not None:\n            rnnlm = lm_pytorch.ClassifierWithState(\n                extlm_pytorch.MultiLevelLM(\n                    word_rnnlm.predictor, rnnlm.predictor, word_dict, char_dict\n                )\n            )\n        else:\n            rnnlm = lm_pytorch.ClassifierWithState(\n                extlm_pytorch.LookAheadWordLM(\n                    word_rnnlm.predictor, word_dict, char_dict\n                )\n            )\n\n    model = model.to(device)\n    if rnnlm:\n        rnnlm = rnnlm.to(device)\n\n    # read json data\n    with open(args.recog_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    new_js = {}\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},\n    )\n\n    # load transducer beam search\n    if hasattr(model, \"is_rnnt\"):\n        if hasattr(model, \"dec\"):\n            trans_decoder = model.dec\n        else:\n            trans_decoder = model.decoder\n        joint_network = model.joint_network\n     \n        # We only use the MMIRNNTScorer now \n        if train_args.aux_mmi and train_args.aux_mmi_type == \"mmi\":\n            adim = train_args.enc_block_arch[0]['d_hidden']\n            weight_path = os.path.dirname(args.result_label) + \"/dump\" \n            
os.makedirs(weight_path, exist_ok=True)\n            model.aux_mmi.dump_weight(args.local_rank, weight_path)\n\n            mmi_scorer_module = MMIRNNTScorer\n            mmi_scorer = mmi_scorer_module(lang=model.aux_mmi.lang,\n                                       device=device,\n                                       idim=adim,\n                                       sos_id=model.sos,\n                                       rank=args.local_rank,\n                                       use_segment=args.use_segment,\n                                       char_list=train_args.char_list,\n                                       weight_path=weight_path,\n                                       lookahead=args.mas_lookahead,\n                                       ) \n        else:\n            mmi_scorer = None\n\n        if args.ngram_model and args.ngram_weight > 0.0:\n            print(f\"Using ngram model: {args.ngram_model}\", flush=True)\n            from espnet.nets.scorers.ngram import NgramPartScorer\n            ngram_scorer = NgramPartScorer(args.ngram_model, train_args.char_list)\n        else:\n            ngram_scorer = None\n\n        if args.word_ngram is not None and args.word_ngram_weight > 0.0:\n            from espnet.nets.scorers.word_ngram import WordNgramPartialScorer\n            word_ngram_scorer = WordNgramPartialScorer\n            word_ngram_scorer = word_ngram_scorer(\n                                  args.word_ngram, device, train_args.char_list,\n                                  log_semiring=args.word_ngram_log_semiring,\n                                  lower_char=args.word_ngram_lower_char)\n        else:\n            word_ngram_scorer = None\n\n        if args.tlg_scorer is not None and args.tlg_weight > 0.0:\n            print(f\"Using tlg scorer: {args.tlg_scorer}\", flush=True)\n            from espnet.nets.scorers.tlg_scorer import TlgPartialScorer\n            tlg_scorer = TlgPartialScorer(lang=args.tlg_scorer, \n               
                           nonblk_reward=args.tlg_nonblk_reward)\n        else:\n            tlg_scorer = None\n\n        # for code-switch data\n        if args.cs_nt_decode_feature in [\"chn\", \"eng\"]:\n            ctc_module = getattr(model, \"aux_ctc\", None)\n        else:\n            ctc_module = getattr(model, \"decoder_ctc\", None)\n\n        if args.eng_vocab is not None and os.path.isfile(args.eng_vocab):\n            eng_vocab = [s.strip() for s in open(args.eng_vocab, encoding=\"utf-8\").readlines()]\n        else:\n            eng_vocab = None\n\n        beam_search_transducer = BeamSearchTransducer(\n            decoder=trans_decoder,\n            joint_network=joint_network,\n            beam_size=args.beam_size,\n            nbest=args.nbest,\n            lm=rnnlm,\n            lm_weight=args.lm_weight,\n            search_type=args.search_type,\n            char_list=train_args.char_list,\n            max_sym_exp=args.max_sym_exp,\n            u_max=args.u_max,\n            nstep=args.nstep,\n            prefix_alpha=args.prefix_alpha,\n            score_norm=args.score_norm,\n            mmi_scorer=mmi_scorer,\n            mmi_weight=args.mmi_weight,\n            ngram_scorer=ngram_scorer,\n            ngram_weight=args.ngram_weight,\n            word_ngram_scorer=word_ngram_scorer,\n            word_ngram_weight=args.word_ngram_weight,\n            tlg_scorer=tlg_scorer,\n            tlg_weight=args.tlg_weight,\n            forbid_eng=args.forbid_eng,\n            ctc_module=ctc_module,\n            ctc_weight=args.ctc_weight,\n            eng_vocab=eng_vocab\n        )\n\n    if args.k2_decode:\n        k2_decode(model, device, js, load_inputs_and_targets, args.batchsize, args.use_segment)\n        print(\"Finish FST decoding. 
Abort!\")\n        return\n    \n    nbest_dict = {}\n    rtf_calculator = RTF_calculator(js)\n    rtf_calculator.tik()\n    if args.batchsize == 0:\n        with torch.no_grad():\n            for idx, name in enumerate(js.keys(), 1):\n                logging.info(\"(%d/%d) decoding \" + name, idx, len(js.keys()))\n                batch = [(name, js[name])]\n                feats = load_inputs_and_targets(batch)\n                feat = (\n                    feats[0][0]\n                    if args.num_encs == 1\n                    else [feats[idx][0] for idx in range(model.num_encs)]\n                )\n\n                # For Oteam ASR Only: skip all transcriptions that have english chars\n                text_trans = js[name][\"output\"][0][\"text\"]\n                if any([is_alphabet(x) for x in text_trans]) and args.skip_eng:\n                    continue\n\n                if args.streaming_mode == \"window\" and args.num_encs == 1:\n                    logging.info(\n                        \"Using streaming recognizer with window size %d frames\",\n                        args.streaming_window,\n                    )\n                    se2e = WindowStreamingE2E(e2e=model, recog_args=args, rnnlm=rnnlm)\n                    for i in range(0, feat.shape[0], args.streaming_window):\n                        logging.info(\n                            \"Feeding frames %d - %d\", i, i + args.streaming_window\n                        )\n                        se2e.accept_input(feat[i : i + args.streaming_window])\n                    logging.info(\"Running offline attention decoder\")\n                    se2e.decode_with_attention_offline()\n                    logging.info(\"Offline attention decoder finished\")\n                    nbest_hyps = se2e.retrieve_recognition()\n                elif args.streaming_mode == \"segment\" and args.num_encs == 1:\n                    logging.info(\n                        \"Using streaming recognizer with threshold 
value %d\",\n                        args.streaming_min_blank_dur,\n                    )\n                    nbest_hyps = []\n                    for n in range(args.nbest):\n                        nbest_hyps.append({\"yseq\": [], \"score\": 0.0})\n                    se2e = SegmentStreamingE2E(e2e=model, recog_args=args, rnnlm=rnnlm)\n                    r = np.prod(model.subsample)\n                    for i in range(0, feat.shape[0], r):\n                        hyps = se2e.accept_input(feat[i : i + r])\n                        if hyps is not None:\n                            text = \"\".join(\n                                [\n                                    train_args.char_list[int(x)]\n                                    for x in hyps[0][\"yseq\"][1:-1]\n                                    if int(x) != -1\n                                ]\n                            )\n                            text = text.replace(\n                                \"\\u2581\", \" \"\n                            ).strip()  # for SentencePiece\n                            text = text.replace(model.space, \" \")\n                            text = text.replace(model.blank, \"\")\n                            logging.info(text)\n                            for n in range(args.nbest):\n                                nbest_hyps[n][\"yseq\"].extend(hyps[n][\"yseq\"])\n                                nbest_hyps[n][\"score\"] += hyps[n][\"score\"]\n                elif hasattr(model, \"is_rnnt\"):\n                    nbest_hyps = model.recognize(feat, beam_search_transducer,\n                                                 decode_feature=args.cs_nt_decode_feature)\n                else:\n                    nbest_hyps = model.recognize(\n                        feat, args, train_args.char_list, rnnlm\n                    )\n                # visualization\n                # decode_dir = os.path.dirname(args.result_label)\n                # graph_dir = 
os.path.join(decode_dir, \"graph\")\n                # os.makedirs(graph_dir, exist_ok=True)\n                # plot_decoding_logs(graph_dir, train_args.char_list,\n                #                    args, name, nbest_hyps)\n                nbest_dict[name] = nbest_hyps\n                new_js[name] = add_results_to_json(\n                    js[name], nbest_hyps, train_args.char_list\n                )\n\n    else:\n\n        def grouper(n, iterable, fillvalue=None):\n            kargs = [iter(iterable)] * n\n            return zip_longest(*kargs, fillvalue=fillvalue)\n\n        # sort data if batchsize > 1\n        keys = list(js.keys())\n        if args.batchsize > 1:\n            feat_lens = [js[key][\"input\"][0][\"shape\"][0] for key in keys]\n            sorted_index = sorted(range(len(feat_lens)), key=lambda i: -feat_lens[i])\n            keys = [keys[i] for i in sorted_index]\n\n        with torch.no_grad():\n            for names in grouper(args.batchsize, keys, None):\n                names = [name for name in names if name]\n                batch = [(name, js[name]) for name in names]\n                feats = (\n                    load_inputs_and_targets(batch)[0]\n                    if args.num_encs == 1\n                    else load_inputs_and_targets(batch)\n                )\n                if args.streaming_mode == \"window\" and args.num_encs == 1:\n                    raise NotImplementedError\n                elif args.streaming_mode == \"segment\" and args.num_encs == 1:\n                    if args.batchsize > 1:\n                        raise NotImplementedError\n                    feat = feats[0]\n                    nbest_hyps = []\n                    for n in range(args.nbest):\n                        nbest_hyps.append({\"yseq\": [], \"score\": 0.0})\n                    se2e = SegmentStreamingE2E(e2e=model, recog_args=args, rnnlm=rnnlm)\n                    r = np.prod(model.subsample)\n                    for i in range(0, 
feat.shape[0], r):\n                        hyps = se2e.accept_input(feat[i : i + r])\n                        if hyps is not None:\n                            text = \"\".join(\n                                [\n                                    train_args.char_list[int(x)]\n                                    for x in hyps[0][\"yseq\"][1:-1]\n                                    if int(x) != -1\n                                ]\n                            )\n                            text = text.replace(\n                                \"\\u2581\", \" \"\n                            ).strip()  # for SentencePiece\n                            text = text.replace(model.space, \" \")\n                            text = text.replace(model.blank, \"\")\n                            logging.info(text)\n                            for n in range(args.nbest):\n                                nbest_hyps[n][\"yseq\"].extend(hyps[n][\"yseq\"])\n                                nbest_hyps[n][\"score\"] += hyps[n][\"score\"]\n                    nbest_hyps = [nbest_hyps]\n                else:\n                    nbest_hyps = model.recognize_batch(\n                        feats, args, train_args.char_list, rnnlm=rnnlm\n                    )\n\n                for i, nbest_hyp in enumerate(nbest_hyps):\n                    name = names[i]\n                    new_js[name] = add_results_to_json(\n                        js[name], nbest_hyp, train_args.char_list\n                    )\n\n    rtf_calculator.tok()\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    \n\n\ndef enhance(args):\n    \"\"\"Dumping enhanced speech and mask.\n\n    Args:\n        args (namespace): The program arguments.\n    \"\"\"\n    set_deterministic_pytorch(args)\n    # read training config\n    idim, odim, train_args = 
get_model_conf(args.model, args.model_conf)\n\n    # TODO(ruizhili): implement enhance for multi-encoder model\n    assert args.num_encs == 1, \"number of encoder should be 1 ({} is given)\".format(\n        args.num_encs\n    )\n\n    # load trained model parameters\n    logging.info(\"reading model parameters from \" + args.model)\n    model_class = dynamic_import(train_args.model_module)\n    model = model_class(idim, odim, train_args)\n    assert isinstance(model, ASRInterface)\n    torch_load(args.model, model)\n    model.recog_args = args\n\n    # gpu\n    if args.ngpu == 1:\n        gpu_id = list(range(args.ngpu))\n        logging.info(\"gpu id: \" + str(gpu_id))\n        model.cuda()\n\n    # read json data\n    with open(args.recog_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=False,\n        sort_in_input_length=False,\n        preprocess_conf=None,  # Apply pre_process in outer func\n    )\n    if args.batchsize == 0:\n        args.batchsize = 1\n\n    # Creates writers for outputs from the network\n    if args.enh_wspecifier is not None:\n        enh_writer = file_writer_helper(args.enh_wspecifier, filetype=args.enh_filetype)\n    else:\n        enh_writer = None\n\n    # Creates a Transformation instance\n    preprocess_conf = (\n        train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf\n    )\n    if preprocess_conf is not None:\n        logging.info(f\"Use preprocessing: {preprocess_conf}\")\n        transform = Transformation(preprocess_conf)\n    else:\n        transform = None\n\n    # Creates a IStft instance\n    istft = None\n    frame_shift = args.istft_n_shift  # Used for plot the spectrogram\n    if args.apply_istft:\n        if preprocess_conf is not None:\n            # Read the conffile and find stft setting\n            with open(preprocess_conf) as f:\n                # Json 
format: e.g.\n                #    {\"process\": [{\"type\": \"stft\",\n                #                  \"win_length\": 400,\n                #                  \"n_fft\": 512, \"n_shift\": 160,\n                #                  \"window\": \"hann\"},\n                #                 {\"type\": \"foo\", ...}, ...]}\n                conf = json.load(f)\n                assert \"process\" in conf, conf\n                # Find stft setting\n                for p in conf[\"process\"]:\n                    if p[\"type\"] == \"stft\":\n                        istft = IStft(\n                            win_length=p[\"win_length\"],\n                            n_shift=p[\"n_shift\"],\n                            window=p.get(\"window\", \"hann\"),\n                        )\n                        logging.info(\n                            \"stft is found in {}. \"\n                            \"Setting istft config from it\\n{}\".format(\n                                preprocess_conf, istft\n                            )\n                        )\n                        frame_shift = p[\"n_shift\"]\n                        break\n        if istft is None:\n            # Set from command line arguments\n            istft = IStft(\n                win_length=args.istft_win_length,\n                n_shift=args.istft_n_shift,\n                window=args.istft_window,\n            )\n            logging.info(\n                \"Setting istft config from the command line args\\n{}\".format(istft)\n            )\n\n    # sort data\n    keys = list(js.keys())\n    feat_lens = [js[key][\"input\"][0][\"shape\"][0] for key in keys]\n    sorted_index = sorted(range(len(feat_lens)), key=lambda i: -feat_lens[i])\n    keys = [keys[i] for i in sorted_index]\n\n    def grouper(n, iterable, fillvalue=None):\n        kargs = [iter(iterable)] * n\n        return zip_longest(*kargs, fillvalue=fillvalue)\n\n    num_images = 0\n    if not os.path.exists(args.image_dir):\n        
os.makedirs(args.image_dir)\n\n    for names in grouper(args.batchsize, keys, None):\n        # Drop the None fill values that pad the last, incomplete batch\n        names = [name for name in names if name is not None]\n        batch = [(name, js[name]) for name in names]\n\n        # May be in time domain: (Batch, [Time, Channel])\n        org_feats = load_inputs_and_targets(batch)[0]\n        if transform is not None:\n            # May be in time-frequency domain: (Batch, [Time, Channel, Freq])\n            feats = transform(org_feats, train=False)\n        else:\n            feats = org_feats\n\n        with torch.no_grad():\n            enhanced, mask, ilens = model.enhance(feats)\n\n        for idx, name in enumerate(names):\n            # Assuming mask, feats : [Batch, Time, Channel, Freq]\n            #          enhanced    : [Batch, Time, Freq]\n            enh = enhanced[idx][: ilens[idx]]\n            mas = mask[idx][: ilens[idx]]\n            feat = feats[idx]\n\n            # Plot spectrogram\n            if args.image_dir is not None and num_images < args.num_images:\n                import matplotlib.pyplot as plt\n\n                num_images += 1\n                ref_ch = 0\n\n                plt.figure(figsize=(20, 10))\n                plt.subplot(4, 1, 1)\n                plt.title(\"Mask [ref={}ch]\".format(ref_ch))\n                plot_spectrogram(\n                    plt,\n                    mas[:, ref_ch].T,\n                    fs=args.fs,\n                    mode=\"linear\",\n                    frame_shift=frame_shift,\n                    bottom=False,\n                    labelbottom=False,\n                )\n\n                plt.subplot(4, 1, 2)\n                plt.title(\"Noisy speech [ref={}ch]\".format(ref_ch))\n                plot_spectrogram(\n                    plt,\n                    feat[:, ref_ch].T,\n                    fs=args.fs,\n                    mode=\"db\",\n                    frame_shift=frame_shift,\n                    bottom=False,\n                    labelbottom=False,\n                )\n\n                plt.subplot(4, 1, 3)\n       
         plt.title(\"Masked speech [ref={}ch]\".format(ref_ch))\n                plot_spectrogram(\n                    plt,\n                    (feat[:, ref_ch] * mas[:, ref_ch]).T,\n                    frame_shift=frame_shift,\n                    fs=args.fs,\n                    mode=\"db\",\n                    bottom=False,\n                    labelbottom=False,\n                )\n\n                plt.subplot(4, 1, 4)\n                plt.title(\"Enhanced speech\")\n                plot_spectrogram(\n                    plt, enh.T, fs=args.fs, mode=\"db\", frame_shift=frame_shift\n                )\n\n                plt.savefig(os.path.join(args.image_dir, name + \".png\"))\n                plt.clf()\n\n            # Write enhanced wave files\n            if enh_writer is not None:\n                if istft is not None:\n                    enh = istft(enh)\n\n                if args.keep_length:\n                    if len(org_feats[idx]) < len(enh):\n                        # Truncate the frames added by stft padding\n                        enh = enh[: len(org_feats[idx])]\n                    elif len(org_feats[idx]) > len(enh):\n                        # Zero-pad along the first axis up to the original length\n                        padwidth = [(0, (len(org_feats[idx]) - len(enh)))] + [\n                            (0, 0)\n                        ] * (enh.ndim - 1)\n                        enh = np.pad(enh, padwidth, mode=\"constant\")\n\n                if args.enh_filetype in (\"sound\", \"sound.hdf5\"):\n                    enh_writer[name] = (args.fs, enh)\n                else:\n                    # Hint: To dump stft_signal, mask, etc.,\n                    # enh_filetype='hdf5' might be convenient.\n                    enh_writer[name] = enh\n\n            if num_images >= args.num_images and enh_writer is None:\n                logging.info(\"Breaking the process.\")\n                break\n\n\ndef ctc_align(args):\n    \"\"\"CTC forced alignments with the given args.\n\n  
  Args:\n        args (namespace): The program arguments.\n    \"\"\"\n\n    def add_alignment_to_json(js, alignment, char_list):\n        \"\"\"Add N-best results to json.\n\n        Args:\n            js (dict[str, Any]): Groundtruth utterance dict.\n            alignment (list[int]): List of alignment.\n            char_list (list[str]): List of characters.\n\n        Returns:\n            dict[str, Any]: N-best results added utterance dict.\n\n        \"\"\"\n        # copy old json info\n        new_js = dict()\n        new_js[\"ctc_alignment\"] = []\n\n        alignment_tokens = []\n        for idx, a in enumerate(alignment):\n            alignment_tokens.append(char_list[a])\n        alignment_tokens = \" \".join(alignment_tokens)\n\n        new_js[\"ctc_alignment\"] = alignment_tokens\n\n        return new_js\n\n    set_deterministic_pytorch(args)\n    model, train_args = load_trained_model(args.model)\n    assert isinstance(model, ASRInterface)\n    model.eval()\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},\n    )\n\n    if args.ngpu > 1:\n        raise NotImplementedError(\"only single GPU decoding is supported\")\n    if args.ngpu == 1:\n        device = \"cuda\"\n    else:\n        device = \"cpu\"\n    dtype = getattr(torch, args.dtype)\n    logging.info(f\"Decoding device={device}, dtype={dtype}\")\n    model.to(device=device, dtype=dtype).eval()\n\n    # read json data\n    with open(args.align_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    new_js = {}\n    if args.batchsize == 0:\n        with torch.no_grad():\n            for idx, name in enumerate(js.keys(), 1):\n                logging.info(\"(%d/%d) aligning \" + name, idx, len(js.keys()))\n                batch = 
[(name, js[name])]\n                feat, label = load_inputs_and_targets(batch)\n                feat = feat[0]\n                label = label[0]\n                enc = model.encode(torch.as_tensor(feat).to(device)).unsqueeze(0)\n                alignment = model.ctc.forced_align(enc, label)\n                new_js[name] = add_alignment_to_json(\n                    js[name], alignment, train_args.char_list\n                )\n    else:\n        raise NotImplementedError(\"Align_batch is not implemented.\")\n\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n"
  },
  {
    "path": "asr/pytorch_backend/asr_init.py",
    "content": "\"\"\"Finetuning methods.\"\"\"\n\nimport logging\nimport os\nimport torch\n\nfrom collections import OrderedDict\n\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.mt_interface import MTInterface\nfrom espnet.nets.pytorch_backend.transducer.utils import custom_torch_load\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.dynamic_import import dynamic_import\n\n\ndef freeze_modules(model, modules):\n    \"\"\"Freeze model parameters according to modules list.\n\n    Args:\n        model (torch.nn.Module): main model to update\n        modules (list): specified module list for freezing\n\n    Return:\n        model (torch.nn.Module): updated model\n        model_params (filter): filtered model parameters\n\n    \"\"\"\n    for mod, param in model.named_parameters():\n        if any(mod.startswith(m) for m in modules):\n            logging.info(f\"freezing {mod}, it will not be updated.\")\n            param.requires_grad = False\n\n    model_params = filter(lambda x: x.requires_grad, model.parameters())\n\n    return model, model_params\n\n\ndef transfer_verification(model_state_dict, partial_state_dict, modules):\n    \"\"\"Verify tuples (key, shape) for input model modules match specified modules.\n\n    Args:\n        model_state_dict (OrderedDict): the initial model state_dict\n        partial_state_dict (OrderedDict): the trained model state_dict\n        modules (list): specified module list for transfer\n\n    Return:\n        (boolean): allow transfer\n\n    \"\"\"\n    modules_model = []\n    partial_modules = []\n\n    for key_p, value_p in partial_state_dict.items():\n        if any(key_p.startswith(m) for m in modules):\n            partial_modules += [(key_p, value_p.shape)]\n\n    for key_m, value_m in model_state_dict.items():\n        if any(key_m.startswith(m) for m in modules):\n            
modules_model += [(key_m, value_m.shape)]\n\n    len_match = len(modules_model) == len(partial_modules)\n\n    module_match = sorted(modules_model, key=lambda x: (x[0], x[1])) == sorted(\n        partial_modules, key=lambda x: (x[0], x[1])\n    )\n\n    return len_match and module_match\n\n\ndef get_partial_state_dict(model_state_dict, modules):\n    \"\"\"Create state_dict with specified modules matching input model modules.\n\n    Note that get_partial_lm_state_dict is used if a LM specified.\n\n    Args:\n        model_state_dict (OrderedDict): trained model state_dict\n        modules (list): specified module list for transfer\n\n    Return:\n        new_state_dict (OrderedDict): the updated state_dict\n\n    \"\"\"\n    new_state_dict = OrderedDict()\n\n    for key, value in model_state_dict.items():\n        if any(key.startswith(m) for m in modules):\n            new_state_dict[key] = value\n\n    return new_state_dict\n\n\ndef get_lm_state_dict(lm_state_dict):\n    \"\"\"Create compatible ASR decoder state dict from LM state dict.\n\n    Args:\n        lm_state_dict (OrderedDict): pre-trained LM state_dict\n\n    Return:\n        new_state_dict (OrderedDict): LM state_dict with updated keys\n\n    \"\"\"\n    new_state_dict = OrderedDict()\n\n    for key, value in list(lm_state_dict.items()):\n        if key == \"predictor.embed.weight\":\n            new_state_dict[\"dec.embed.weight\"] = value\n        elif key.startswith(\"predictor.rnn.\"):\n            _split = key.split(\".\")\n\n            new_key = \"dec.decoder.\" + _split[2] + \".\" + _split[3] + \"_l0\"\n            new_state_dict[new_key] = value\n\n    return new_state_dict\n\n\ndef filter_modules(model_state_dict, modules):\n    \"\"\"Filter non-matched modules in module_state_dict.\n\n    Args:\n        model_state_dict (OrderedDict): trained model state_dict\n        modules (list): specified module list for transfer\n\n    Return:\n        new_mods (list): the update module list\n\n    
\"\"\"\n    new_mods = []\n    incorrect_mods = []\n\n    mods_model = list(model_state_dict.keys())\n    for mod in modules:\n        if any(key.startswith(mod) for key in mods_model):\n            new_mods += [mod]\n        else:\n            incorrect_mods += [mod]\n\n    if incorrect_mods:\n        logging.warning(\n            \"module(s) %s don't match or (partially match) \"\n            \"available modules in model.\",\n            incorrect_mods,\n        )\n        logging.warning(\"for information, the existing modules in model are:\")\n        logging.warning(\"%s\", mods_model)\n\n    return new_mods\n\n\ndef load_trained_model(model_path, training=True):\n    \"\"\"Load the trained model for recognition.\n\n    Args:\n        model_path (str): Path to model.***.best\n\n    \"\"\"\n    idim, odim, train_args = get_model_conf(\n        model_path, os.path.join(os.path.dirname(model_path), \"model.json\")\n    )\n\n    logging.warning(\"reading model parameters from \" + model_path)\n\n    if hasattr(train_args, \"model_module\"):\n        model_module = train_args.model_module\n    else:\n        model_module = \"espnet.nets.pytorch_backend.e2e_asr:E2E\"\n    # CTC Loss is not needed, default to builtin to prevent import errors\n    # if hasattr(train_args, \"ctc_type\"):\n    #     train_args.ctc_type = \"builtin\"\n\n    model_class = dynamic_import(model_module)\n\n    if \"transducer\" in model_module:\n        model = model_class(idim, odim, train_args, training=training)\n        custom_torch_load(model_path, model, training=training)\n    else:\n        model = model_class(idim, odim, train_args)\n        torch_load(model_path, model)\n\n    return model, train_args\n\n# when start decoding jobs with very large nj, this function leads\n# to reading error. 
Retry up to `patience` times.\ndef _load_trained_model(model_path, training=True, patience=10):\n    \"\"\"Load the trained model, retrying when reading fails.\"\"\"\n    for i in range(patience):\n        try:\n            # load_trained_model is the plain (non-retrying) loader defined above\n            model, train_args = load_trained_model(model_path, training=training)\n            print(f\"Model Init: successfully initialized model on trial {i}\", flush=True)\n            return model, train_args\n        except Exception as e:\n            print(f\"Model Init: failed on trial {i} ({e}). Trying again!\", flush=True)\n    raise RuntimeError(f\"Model Init: failed to load {model_path} after {patience} trials\")\n\n\ndef get_trained_model_state_dict(model_path):\n    \"\"\"Extract the trained model state dict for pre-initialization.\n\n    Args:\n        model_path (str): Path to model.***.best\n\n    Return:\n        model.state_dict() (OrderedDict): the loaded model state_dict\n\n    \"\"\"\n    conf_path = os.path.join(os.path.dirname(model_path), \"model.json\")\n    if \"rnnlm\" in model_path:\n        logging.warning(\"reading model parameters from %s\", model_path)\n\n        return get_lm_state_dict(torch.load(model_path))\n\n    idim, odim, args = get_model_conf(model_path, conf_path)\n\n    logging.warning(\"reading model parameters from \" + model_path)\n\n    if hasattr(args, \"model_module\"):\n        model_module = args.model_module\n    else:\n        model_module = \"espnet.nets.pytorch_backend.e2e_asr:E2E\"\n\n    model_class = dynamic_import(model_module)\n    model = model_class(idim, odim, args)\n    torch_load(model_path, model)\n    assert (\n        isinstance(model, MTInterface)\n        or isinstance(model, ASRInterface)\n        or isinstance(model, TTSInterface)\n    )\n\n    return model.state_dict()\n\n\ndef load_trained_modules(idim, odim, args, interface=ASRInterface):\n    \"\"\"Load model encoder or/and decoder modules with ESPNET pre-trained model(s).\n\n    Args:\n        idim (int): initial input dimension.\n        odim (int): initial output dimension.\n        args (Namespace): The initial model arguments.\n        interface 
(Interface): ASRInterface or STInterface or TTSInterface.\n\n    Return:\n        model (torch.nn.Module): The model with pretrained modules.\n\n    \"\"\"\n\n    def print_new_keys(state_dict, modules, model_path):\n        logging.warning(\"loading %s from model: %s\", modules, model_path)\n\n        for k in state_dict.keys():\n            logging.warning(\"override %s\" % k)\n\n    enc_model_path = args.enc_init\n    dec_model_path = args.dec_init\n    enc_modules = args.enc_init_mods\n    dec_modules = args.dec_init_mods\n\n    model_class = dynamic_import(args.model_module)\n    main_model = model_class(idim, odim, args)\n    assert isinstance(main_model, interface)\n\n    main_state_dict = main_model.state_dict()\n\n    logging.warning(\"model(s) found for pre-initialization\")\n    for model_path, modules in [\n        (enc_model_path, enc_modules),\n        (dec_model_path, dec_modules),\n    ]:\n        if model_path is not None:\n            if os.path.isfile(model_path):\n                model_state_dict = get_trained_model_state_dict(model_path)\n\n                modules = filter_modules(model_state_dict, modules)\n\n                partial_state_dict = get_partial_state_dict(model_state_dict, modules)\n\n                if partial_state_dict:\n                    if transfer_verification(\n                        main_state_dict, partial_state_dict, modules\n                    ):\n                        print_new_keys(partial_state_dict, modules, model_path)\n                        main_state_dict.update(partial_state_dict)\n                    else:\n                        logging.warning(\n                            f\"modules {modules} in model {model_path} \"\n                            f\"don't match your training config\",\n                        )\n            else:\n                logging.warning(\"model was not found : %s\", model_path)\n\n    main_model.load_state_dict(main_state_dict)\n\n    return main_model\n"
  },
  {
    "path": "asr/pytorch_backend/asr_mix.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nThis script is used for multi-speaker speech recognition.\n\nCopyright 2017 Johns Hopkins University (Shinji Watanabe)\n Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\"\"\"\nimport json\nimport logging\nimport os\n\n# chainer related\nfrom chainer import training\nfrom chainer.training import extensions\nfrom itertools import zip_longest as zip_longest\nimport numpy as np\nfrom tensorboardX import SummaryWriter\nimport torch\n\nfrom espnet.asr.asr_mix_utils import add_results_to_json\nfrom espnet.asr.asr_utils import adadelta_eps_decay\n\nfrom espnet.asr.asr_utils import CompareValueTrigger\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import restore_snapshot\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\nfrom espnet.asr.pytorch_backend.asr import CustomEvaluator\nfrom espnet.asr.pytorch_backend.asr import CustomUpdater\nfrom espnet.asr.pytorch_backend.asr import load_trained_model\nimport espnet.lm.pytorch_backend.extlm as extlm_pytorch\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.pytorch_backend.e2e_asr_mix import pad_list\nimport espnet.nets.pytorch_backend.lm.default as lm_pytorch\nfrom espnet.utils.dataset import ChainerDataLoader\nfrom espnet.utils.dataset import TransformDataset\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\nimport 
matplotlib\n\nmatplotlib.use(\"Agg\")\n\n\nclass CustomConverter(object):\n    \"\"\"Custom batch converter for Pytorch.\n\n    Args:\n        subsampling_factor (int): The subsampling factor.\n        dtype (torch.dtype): Data type to convert.\n\n    \"\"\"\n\n    def __init__(self, subsampling_factor=1, dtype=torch.float32, num_spkrs=2):\n        \"\"\"Initialize the converter.\"\"\"\n        self.subsampling_factor = subsampling_factor\n        self.ignore_id = -1\n        self.dtype = dtype\n        self.num_spkrs = num_spkrs\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Transform a batch and send it to a device.\n\n        Args:\n            batch (list(tuple(str, dict[str, dict[str, Any]]))): The batch to transform.\n            device (torch.device): The device to send to.\n\n        Returns:\n            tuple(torch.Tensor, torch.Tensor, torch.Tensor): Transformed batch.\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys = batch[0][0], batch[0][-self.num_spkrs :]\n\n        # perform subsampling\n        if self.subsampling_factor > 1:\n            xs = [x[:: self.subsampling_factor, :] for x in xs]\n\n        # get batch of lengths of input sequences\n        ilens = np.array([x.shape[0] for x in xs])\n\n        # perform padding and convert to tensor\n        # currently only support real number\n        if xs[0].dtype.kind == \"c\":\n            xs_pad_real = pad_list(\n                [torch.from_numpy(x.real).float() for x in xs], 0\n            ).to(device, dtype=self.dtype)\n            xs_pad_imag = pad_list(\n                [torch.from_numpy(x.imag).float() for x in xs], 0\n            ).to(device, dtype=self.dtype)\n            # Note(kamo):\n            # {'real': ..., 'imag': ...} will be changed to ComplexTensor in E2E.\n            # Don't create ComplexTensor and give it to E2E here\n            # because torch.nn.DataParallel can't handle it.\n        
    xs_pad = {\"real\": xs_pad_real, \"imag\": xs_pad_imag}\n        else:\n            xs_pad = pad_list([torch.from_numpy(x).float() for x in xs], 0).to(\n                device, dtype=self.dtype\n            )\n\n        ilens = torch.from_numpy(ilens).to(device)\n        if not isinstance(ys[0], np.ndarray):\n            ys_pad = []\n            for i in range(len(ys)):  # speakers\n                ys_pad += [torch.from_numpy(y).long() for y in ys[i]]\n            ys_pad = pad_list(ys_pad, self.ignore_id)\n            ys_pad = (\n                ys_pad.view(self.num_spkrs, -1, ys_pad.size(1))\n                .transpose(0, 1)\n                .to(device)\n            )  # (B, num_spkrs, Tmax)\n        else:\n            ys_pad = pad_list(\n                [torch.from_numpy(y).long() for y in ys], self.ignore_id\n            ).to(device)\n\n        return xs_pad, ilens, ys_pad\n\n\ndef train(args):\n    \"\"\"Train with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n\n    # check cuda availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n    idim = int(valid_json[utts[0]][\"input\"][0][\"shape\"][-1])\n    odim = int(valid_json[utts[0]][\"output\"][0][\"shape\"][-1])\n    logging.info(\"#input dims : \" + str(idim))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # specify attention, CTC, hybrid mode\n    if args.mtlalpha == 1.0:\n        mtl_mode = \"ctc\"\n        logging.info(\"Pure CTC mode\")\n    elif args.mtlalpha == 0.0:\n        mtl_mode = \"att\"\n        logging.info(\"Pure attention mode\")\n    else:\n        mtl_mode = \"mtl\"\n        logging.info(\"Multitask learning mode\")\n\n    # specify model architecture\n    model_class = 
dynamic_import(args.model_module)\n    model = model_class(idim, odim, args)\n    assert isinstance(model, ASRInterface)\n    subsampling_factor = model.subsample[0]\n\n    if args.rnnlm is not None:\n        rnnlm_args = get_model_conf(args.rnnlm, args.rnnlm_conf)\n        rnnlm = lm_pytorch.ClassifierWithState(\n            lm_pytorch.RNNLM(\n                len(args.char_list),\n                rnnlm_args.layer,\n                rnnlm_args.unit,\n                getattr(rnnlm_args, \"embed_unit\", None),  # for backward compatibility\n            )\n        )\n        torch_load(args.rnnlm, rnnlm)\n        model.rnnlm = rnnlm\n\n    # write model config\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim, odim, vars(args)), indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    reporter = model.reporter\n\n    # check the use of multi-gpu\n    if args.ngpu > 1:\n        if args.batch_size != 0:\n            logging.warning(\n                \"batch size is automatically increased (%d -> %d)\"\n                % (args.batch_size, args.batch_size * args.ngpu)\n            )\n            args.batch_size *= args.ngpu\n\n    # set torch device\n    device = torch.device(\"cuda\" if args.ngpu > 0 else \"cpu\")\n    if args.train_dtype in (\"float16\", \"float32\", \"float64\"):\n        dtype = getattr(torch, args.train_dtype)\n    else:\n        dtype = torch.float32\n    model = model.to(device=device, dtype=dtype)\n\n    logging.warning(\n        \"num. model params: {:,} (num. 
trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # Setup an optimizer\n    if args.opt == \"adadelta\":\n        optimizer = torch.optim.Adadelta(\n            model.parameters(), rho=0.95, eps=args.eps, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"adam\":\n        optimizer = torch.optim.Adam(model.parameters(), weight_decay=args.weight_decay)\n    elif args.opt == \"noam\":\n        from espnet.nets.pytorch_backend.transformer.optimizer import get_std_opt\n\n        optimizer = get_std_opt(\n            model.parameters(),\n            args.adim,\n            args.transformer_warmup_steps,\n            args.transformer_lr,\n        )\n    else:\n        raise NotImplementedError(\"unknown optimizer: \" + args.opt)\n\n    # setup apex.amp\n    if args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\"):\n        try:\n            from apex import amp\n        except ImportError as e:\n            logging.error(\n                f\"You need to install apex for --train-dtype {args.train_dtype}. 
\"\n                \"See https://github.com/NVIDIA/apex#linux\"\n            )\n            raise e\n        if args.opt == \"noam\":\n            model, optimizer.optimizer = amp.initialize(\n                model, optimizer.optimizer, opt_level=args.train_dtype\n            )\n        else:\n            model, optimizer = amp.initialize(\n                model, optimizer, opt_level=args.train_dtype\n            )\n        use_apex = True\n    else:\n        use_apex = False\n\n    # FIXME: TOO DIRTY HACK\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    # Setup a converter\n    converter = CustomConverter(\n        subsampling_factor=subsampling_factor, dtype=dtype, num_spkrs=args.num_spkrs\n    )\n\n    # read json data\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n    # make minibatch list (variable length)\n    train = make_batchset(\n        train_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        shortest_first=use_sortagrad,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=-1,\n    )\n    valid = make_batchset(\n        valid_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n     
   batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=-1,\n    )\n\n    load_tr = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": True},  # Switch the mode of preprocessing\n    )\n    load_cv = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n    # hack: the DataLoader batch_size argument is set to 1;\n    # the actual batch is included in a list.\n    # The default collate function converts numpy arrays to pytorch tensors;\n    # we use a trivial collate function instead, which returns the single batch.\n    train_iter = {\n        \"main\": ChainerDataLoader(\n            dataset=TransformDataset(train, lambda data: converter([load_tr(data)])),\n            batch_size=1,\n            num_workers=args.n_iter_processes,\n            shuffle=True,\n            collate_fn=lambda x: x[0],\n        )\n    }\n    valid_iter = {\n        \"main\": ChainerDataLoader(\n            dataset=TransformDataset(valid, lambda data: converter([load_cv(data)])),\n            batch_size=1,\n            shuffle=False,\n            collate_fn=lambda x: x[0],\n            num_workers=args.n_iter_processes,\n        )\n    }\n\n    # Set up a trainer\n    updater = CustomUpdater(\n        model,\n        args.grad_clip,\n        train_iter,\n        optimizer,\n        device,\n        args.ngpu,\n        args.grad_noise,\n        args.accum_grad,\n        use_apex=use_apex,\n    )\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n\n    # Resume from a snapshot\n    if args.resume:\n        
logging.info(\"resumed from %s\" % args.resume)\n        torch_resume(args.resume, trainer)\n\n    # Evaluate the model with the test dataset for each epoch\n    trainer.extend(CustomEvaluator(model, valid_iter, reporter, device, args.ngpu))\n\n    # Save attention weight each epoch\n    if args.num_save_attention > 0 and args.mtlalpha != 1.0:\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"input\"][0][\"shape\"][1]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n        )\n        trainer.extend(att_reporter, trigger=(1, \"epoch\"))\n    else:\n        att_reporter = None\n\n    # Make a plot for training and validation values\n    trainer.extend(\n        extensions.PlotReport(\n            [\n                \"main/loss\",\n                \"validation/main/loss\",\n                \"main/loss_ctc\",\n                \"validation/main/loss_ctc\",\n                \"main/loss_att\",\n                \"validation/main/loss_att\",\n            ],\n            \"epoch\",\n            file_name=\"loss.png\",\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/acc\", \"validation/main/acc\"], \"epoch\", file_name=\"acc.png\"\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/cer_ctc\", \"validation/main/cer_ctc\"], \"epoch\", file_name=\"cer.png\"\n        )\n    )\n\n    # Save best models\n    trainer.extend(\n        
snapshot_object(model, \"model.loss.best\"),\n        trigger=training.triggers.MinValueTrigger(\"validation/main/loss\"),\n    )\n    if mtl_mode != \"ctc\":\n        trainer.extend(\n            snapshot_object(model, \"model.acc.best\"),\n            trigger=training.triggers.MaxValueTrigger(\"validation/main/acc\"),\n        )\n\n    # save snapshot which contains model and optimizer states\n    trainer.extend(torch_snapshot(), trigger=(1, \"epoch\"))\n\n    # epsilon decay in the optimizer\n    if args.opt == \"adadelta\":\n        if args.criterion == \"acc\" and mtl_mode != \"ctc\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.acc.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.loss.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n\n    # Write a log 
of evaluation statistics for each epoch\n    trainer.extend(\n        extensions.LogReport(trigger=(args.report_interval_iters, \"iteration\"))\n    )\n    report_keys = [\n        \"epoch\",\n        \"iteration\",\n        \"main/loss\",\n        \"main/loss_ctc\",\n        \"main/loss_att\",\n        \"validation/main/loss\",\n        \"validation/main/loss_ctc\",\n        \"validation/main/loss_att\",\n        \"main/acc\",\n        \"validation/main/acc\",\n        \"main/cer_ctc\",\n        \"validation/main/cer_ctc\",\n        \"elapsed_time\",\n    ]\n    if args.opt == \"adadelta\":\n        trainer.extend(\n            extensions.observe_value(\n                \"eps\",\n                lambda trainer: trainer.updater.get_optimizer(\"main\").param_groups[0][\n                    \"eps\"\n                ],\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"eps\")\n    if args.report_cer:\n        report_keys.append(\"validation/main/cer\")\n    if args.report_wer:\n        report_keys.append(\"validation/main/wer\")\n    trainer.extend(\n        extensions.PrintReport(report_keys),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n\n    trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n    set_early_stop(trainer, args)\n\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        trainer.extend(\n            TensorboardLogger(SummaryWriter(args.tensorboard_dir), att_reporter),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, args.epochs)\n\n\ndef recog(args):\n    \"\"\"Decode with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n    model, train_args = load_trained_model(args.model)\n    assert isinstance(model, 
ASRInterface)\n    model.recog_args = args\n\n    # read rnnlm\n    if args.rnnlm:\n        rnnlm_args = get_model_conf(args.rnnlm, args.rnnlm_conf)\n        if getattr(rnnlm_args, \"model_module\", \"default\") != \"default\":\n            raise ValueError(\n                \"use '--api v2' option to decode with non-default language model\"\n            )\n        rnnlm = lm_pytorch.ClassifierWithState(\n            lm_pytorch.RNNLM(\n                len(train_args.char_list),\n                rnnlm_args.layer,\n                rnnlm_args.unit,\n                getattr(rnnlm_args, \"embed_unit\", None),  # for backward compatibility\n            )\n        )\n        torch_load(args.rnnlm, rnnlm)\n        rnnlm.eval()\n    else:\n        rnnlm = None\n\n    if args.word_rnnlm:\n        rnnlm_args = get_model_conf(args.word_rnnlm, args.word_rnnlm_conf)\n        word_dict = rnnlm_args.char_list_dict\n        char_dict = {x: i for i, x in enumerate(train_args.char_list)}\n        word_rnnlm = lm_pytorch.ClassifierWithState(\n            lm_pytorch.RNNLM(len(word_dict), rnnlm_args.layer, rnnlm_args.unit)\n        )\n        torch_load(args.word_rnnlm, word_rnnlm)\n        word_rnnlm.eval()\n\n        if rnnlm is not None:\n            rnnlm = lm_pytorch.ClassifierWithState(\n                extlm_pytorch.MultiLevelLM(\n                    word_rnnlm.predictor, rnnlm.predictor, word_dict, char_dict\n                )\n            )\n        else:\n            rnnlm = lm_pytorch.ClassifierWithState(\n                extlm_pytorch.LookAheadWordLM(\n                    word_rnnlm.predictor, word_dict, char_dict\n                )\n            )\n\n    # gpu\n    if args.ngpu == 1:\n        gpu_id = list(range(args.ngpu))\n        logging.info(\"gpu id: \" + str(gpu_id))\n        model.cuda()\n        if rnnlm:\n            rnnlm.cuda()\n\n    # read json data\n    with open(args.recog_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    new_js = {}\n\n    
load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=False,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},\n    )\n\n    if args.batchsize == 0:\n        with torch.no_grad():\n            for idx, name in enumerate(js.keys(), 1):\n                logging.info(\"(%d/%d) decoding \" + name, idx, len(js.keys()))\n                batch = [(name, js[name])]\n                feat = load_inputs_and_targets(batch)[0][0]\n                nbest_hyps = model.recognize(feat, args, train_args.char_list, rnnlm)\n                new_js[name] = add_results_to_json(\n                    js[name], nbest_hyps, train_args.char_list\n                )\n\n    else:\n\n        def grouper(n, iterable, fillvalue=None):\n            kargs = [iter(iterable)] * n\n            return zip_longest(*kargs, fillvalue=fillvalue)\n\n        # sort data if batchsize > 1\n        keys = list(js.keys())\n        if args.batchsize > 1:\n            feat_lens = [js[key][\"input\"][0][\"shape\"][0] for key in keys]\n            sorted_index = sorted(range(len(feat_lens)), key=lambda i: -feat_lens[i])\n            keys = [keys[i] for i in sorted_index]\n\n        with torch.no_grad():\n            for names in grouper(args.batchsize, keys, None):\n                names = [name for name in names if name]\n                batch = [(name, js[name]) for name in names]\n                feats = load_inputs_and_targets(batch)[0]\n                nbest_hyps = model.recognize_batch(\n                    feats, args, train_args.char_list, rnnlm=rnnlm\n                )\n\n                for i, name in enumerate(names):\n                    nbest_hyp = [hyp[i] for hyp in nbest_hyps]\n                    new_js[name] = add_results_to_json(\n                        js[name], nbest_hyp, train_args.char_list\n          
          )\n\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n"
  },
  {
    "path": "asr/pytorch_backend/recog.py",
    "content": "\"\"\"V2 backend for `asr_recog.py` using py:class:`espnet.nets.beam_search.BeamSearch`.\"\"\"\n\nimport json\nimport logging\nimport os\nimport torch\n\nfrom espnet.asr.asr_utils import add_results_to_json\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.pytorch_backend.asr import load_trained_model\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.batch_beam_search import BatchBeamSearch\nfrom espnet.nets.beam_search import BeamSearch\nfrom espnet.nets.lm_interface import dynamic_import_lm\nfrom espnet.nets.scorer_interface import BatchScorerInterface\nfrom espnet.nets.scorers.length_bonus import LengthBonus\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.nets.scorers.mmi_frame_scorer import MMIFrameScorer\n# from espnet.nets.scorers.mmi_prefix_score import MMIFrameScorer\nfrom espnet.nets.scorers.ctc import CTCPrefixScorer\nfrom espnet.nets.scorers.word_ngram import WordNgramPartialScorer\nfrom espnet.nets.scorers.mmi_rescorer import MMIRescorer\nfrom espnet.utils.rtf_calculator import RTF_calculator\n\ndef recog_v2(args):\n    \"\"\"Decode with custom models that implements ScorerInterface.\n\n    Notes:\n        The previous backend espnet.asr.pytorch_backend.asr.recog\n        only supports E2E and RNNLM\n\n    Args:\n        args (namespace): The program arguments.\n        See py:func:`espnet.bin.asr_recog.get_parser` for details\n\n    \"\"\"\n    logging.warning(\"experimental API for custom LMs is selected by --api v2\")\n    if args.batchsize > 1:\n        raise NotImplementedError(\"multi-utt batch decoding is not implemented\")\n    if args.streaming_mode is not None:\n        raise NotImplementedError(\"streaming mode is not implemented\")\n    if args.word_rnnlm:\n        raise NotImplementedError(\"word LM is not implemented\")\n\n    if args.ngpu > 1:\n        
raise NotImplementedError(\"only single GPU decoding is supported\")\n    if args.ngpu == 1:\n        device = torch.device(\"cuda\")\n    else:\n        # No GPU requested: hide all CUDA devices and decode on CPU\n        device = torch.device(\"cpu\")\n        os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"-1\"\n        assert not torch.cuda.is_available()\n    print(f\"Rank: {args.local_rank} Using device: {device}, ngpu: {args.ngpu}\")\n\n    set_deterministic_pytorch(args)\n    model, train_args = load_trained_model(args.model)\n    assert isinstance(model, ASRInterface)\n    model.eval()\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=False,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},\n    )\n\n    if args.rnnlm:\n        lm_args = get_model_conf(args.rnnlm, args.rnnlm_conf)\n        # NOTE: for compatibility with models from versions older than 0.5.0\n        lm_model_module = getattr(lm_args, \"model_module\", \"default\")\n        lm_class = dynamic_import_lm(lm_model_module, lm_args.backend)\n        lm = lm_class(len(train_args.char_list), lm_args)\n        torch_load(args.rnnlm, lm)\n        lm.eval()\n    else:\n        lm = None\n\n    if args.ngram_model and args.ngram_weight > 0.0:\n        from espnet.nets.scorers.ngram import NgramFullScorer\n        from espnet.nets.scorers.ngram import NgramPartScorer\n\n        if args.ngram_scorer == \"full\":\n            ngram = NgramFullScorer(args.ngram_model, train_args.char_list)\n        else:\n            ngram = NgramPartScorer(args.ngram_model, train_args.char_list)\n    else:\n        ngram = None\n\n    # load mmi_scorer\n    if args.mmi_weight > 0.0:\n        # Also make sure it is K2MMI\n        assert hasattr(model.ctc, \"dump_weight\")\n        # Dump a pth for each rank to avoid conflicts when reading / 
writing\n        weight_path = os.path.dirname(args.result_label) + \"/dump\"\n        os.makedirs(weight_path, exist_ok=True)\n        model.ctc.dump_weight(args.local_rank, weight_path)\n        mmi_scorer = MMIFrameScorer\n        mmi = mmi_scorer(lang=model.ctc.lang,\n                         device=device,\n                         idim=train_args.adim,\n                         sos_id=model.sos,\n                         rank=args.local_rank,\n                         use_segment=args.use_segment,\n                         char_list=train_args.char_list, \n                         weight_path=weight_path)\n    else:\n        mmi = None\n\n    if args.mmi_rescore:\n        weight_path = os.path.dirname(args.result_label) + \"/dump\"\n        os.makedirs(weight_path, exist_ok=True)\n        model.ctc.dump_weight(args.local_rank, weight_path)\n        assert args.mmi_weight <= 0.0\n        mmi_rescorer = MMIRescorer(lang=model.ctc.lang,\n                                   device=device,\n                                   idim=train_args.adim,\n                                   sos_id=model.sos,\n                                   rank=args.local_rank,\n                                   use_segment=args.use_segment,\n                                   char_list=train_args.char_list,\n                                   weight_path=weight_path)\n    else:\n        mmi_rescorer = None\n\n    if args.ctc_weight > 0.0:\n        ctc_module = model.third_loss if hasattr(model, \"third_loss\") else model.ctc\n        ctc = CTCPrefixScorer(ctc_module, model.eos)\n    else: \n        ctc = None\n\n    if args.word_ngram_weight > 0.0:\n        word_ngram_scorer = WordNgramPartialScorer\n        print(f\"Using word ngram model: {args.word_ngram}\", flush=True)\n        word_ngram_scorer = WordNgramPartialScorer(args.word_ngram, \n                              device,\n                              train_args.char_list, \n                              
log_semiring=args.word_ngram_log_semiring)\n    else:\n        word_ngram_scorer = None\n        \n    scorers = model.scorers()\n    scorers[\"ctc\"] = ctc \n    scorers[\"mmi\"] = mmi \n    scorers[\"lm\"] = lm\n    scorers[\"ngram\"] = ngram\n    scorers[\"length_bonus\"] = LengthBonus(len(train_args.char_list))\n    scorers[\"word_ngram\"] = word_ngram_scorer\n    weights = dict(\n        decoder=1.0 - args.ctc_weight,\n        ctc=args.ctc_weight,\n        lm=args.lm_weight,\n        ngram=args.ngram_weight,\n        length_bonus=args.penalty,\n        mmi=args.mmi_weight,\n        word_ngram=args.word_ngram_weight,\n    )\n    beam_search = BeamSearch(\n        beam_size=args.beam_size,\n        vocab_size=len(train_args.char_list),\n        weights=weights,\n        scorers=scorers,\n        sos=model.sos,\n        eos=model.eos,\n        token_list=train_args.char_list,\n        pre_beam_score_key=None if args.ctc_weight == 1.0 else \"full\",\n        mmi_rescorer=mmi_rescorer,\n    )\n    # TODO(karita): make all scorers batchfied\n    if args.batchsize == 1:\n        non_batch = [\n            k\n            for k, v in beam_search.full_scorers.items()\n            if not isinstance(v, BatchScorerInterface)\n        ]\n        if len(non_batch) == 0:\n            beam_search.__class__ = BatchBeamSearch\n            logging.info(\"BatchBeamSearch implementation is selected.\")\n        else:\n            logging.warning(\n                f\"As non-batch scorers {non_batch} are found, \"\n                f\"fall back to non-batch implementation.\"\n            )\n\n    dtype = getattr(torch, args.dtype)\n    logging.info(f\"Decoding device={device}, dtype={dtype}\")\n    model.to(device=device, dtype=dtype).eval()\n    # beam_search.to(device=device, dtype=dtype).eval()\n\n    # read json data\n    with open(args.recog_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    new_js = {}\n    rtf_calculator = RTF_calculator(js)\n    
rtf_calculator.tik()\n    with torch.no_grad():\n        for idx, name in enumerate(js.keys(), 1):\n            logging.info(\"(%d/%d) decoding \" + name, idx, len(js.keys()))\n            batch = [(name, js[name])]\n            feat = load_inputs_and_targets(batch)[0][0]\n            enc = model.encode(torch.as_tensor(feat).to(device=device, dtype=dtype))\n            nbest_hyps = beam_search(\n                x=enc, maxlenratio=args.maxlenratio, minlenratio=args.minlenratio\n            )\n            nbest_hyps = [\n                h.asdict() for h in nbest_hyps[: min(len(nbest_hyps), args.nbest)]\n            ]\n            new_js[name] = add_results_to_json(\n                js[name], nbest_hyps, train_args.char_list\n            )\n    \n    rtf_calculator.tok()    \n\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n"
  },
  {
    "path": "bin/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "bin/asr_align.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2020 Johns Hopkins University (Xuankai Chang)\n#           2020, Technische Universität München;  Dominik Winkelbauer, Ludwig Kürzinger\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"\nThis program performs CTC segmentation to align utterances within audio files.\n\nInputs:\n    `--data-json`:\n        A json containing list of utterances and audio files\n    `--model`:\n        An already trained ASR model\n\nOutput:\n    `--output`:\n        A plain `segments` file with utterance positions in the audio files.\n\nSelected parameters:\n    `--min-window-size`:\n        Minimum window size considered for a single utterance. The current default value\n        should be OK in most cases. Larger values might give better results; too large\n        values cause IndexErrors.\n    `--subsampling-factor`:\n        If the encoder sub-samples its input, the number of frames at the CTC layer is\n        reduced by this factor.\n    `--frame-duration`:\n        This is the non-overlapping duration of a single frame in milliseconds (the\n        inverse of frames per millisecond).\n    `--set-blank`:\n        In the rare case that the blank token has not the index 0 in the character\n        dictionary, this parameter sets the index of the blank token.\n    `--gratis-blank`:\n        Sets the transition cost for blank tokens to zero. Useful if there are longer\n        unrelated segments between segments.\n    `--replace-spaces-with-blanks`:\n        Spaces are replaced with blanks. Helps to model pauses between words. May\n        increase length of ground truth. 
May lead to misaligned segments when combined\n        with the option `--gratis-blank`.\n\"\"\"\n\nimport configargparse\nimport logging\nimport os\nimport sys\n\n# imports for inference\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_model\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nimport json\nimport torch\n\n# imports for CTC segmentation\nfrom ctc_segmentation import ctc_segmentation\nfrom ctc_segmentation import CtcSegmentationParameters\nfrom ctc_segmentation import determine_utterance_segments\nfrom ctc_segmentation import prepare_text\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get default arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Align text to audio using CTC segmentation \"\n        \"with a pre-trained speech recognition model.\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"Decoding config file path.\")\n    parser.add_argument(\n        \"--ngpu\", type=int, default=0, help=\"Number of GPUs (max. 
1 is supported)\"\n    )\n    parser.add_argument(\n        \"--dtype\",\n        choices=(\"float16\", \"float32\", \"float64\"),\n        default=\"float32\",\n        help=\"Float precision (only available in --api v2)\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        type=str,\n        default=\"pytorch\",\n        choices=[\"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", type=int, default=1, help=\"Debugmode\")\n    parser.add_argument(\"--verbose\", \"-V\", type=int, default=1, help=\"Verbose option\")\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    # task related\n    parser.add_argument(\n        \"--data-json\", type=str, help=\"Json of recognition data for audio and text\"\n    )\n    parser.add_argument(\"--utt-text\", type=str, help=\"Text separated into utterances\")\n    # model (parameter) related\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    parser.add_argument(\n        \"--model-conf\", type=str, default=None, help=\"Model config file\"\n    )\n    parser.add_argument(\n        \"--num-encs\", default=1, type=int, help=\"Number of encoders in the model.\"\n    )\n    # ctc-segmentation related\n    parser.add_argument(\n        \"--subsampling-factor\",\n        type=int,\n        default=None,\n        help=\"Subsampling factor.\"\n        \" If the encoder sub-samples its input, the number of frames at the CTC layer\"\n        \" is reduced by this factor. 
For example, a BLSTMP with subsampling 1_2_2_1_1\"\n        \" has a subsampling factor of 4.\",\n    )\n    parser.add_argument(\n        \"--frame-duration\",\n        type=int,\n        default=None,\n        help=\"Non-overlapping duration of a single frame in milliseconds.\",\n    )\n    parser.add_argument(\n        \"--min-window-size\",\n        type=int,\n        default=None,\n        help=\"Minimum window size considered for utterance.\",\n    )\n    parser.add_argument(\n        \"--max-window-size\",\n        type=int,\n        default=None,\n        help=\"Maximum window size considered for utterance.\",\n    )\n    parser.add_argument(\n        \"--use-dict-blank\",\n        type=int,\n        default=None,\n        help=\"DEPRECATED.\",\n    )\n    parser.add_argument(\n        \"--set-blank\",\n        type=int,\n        default=None,\n        help=\"Index of model dictionary for blank token (default: 0).\",\n    )\n    parser.add_argument(\n        \"--gratis-blank\",\n        type=int,\n        default=None,\n        help=\"Set the transition cost of the blank token to zero. Audio sections\"\n        \" labeled with blank tokens can then be skipped without penalty. 
Useful\"\n        \" if there are unrelated audio segments between utterances.\",\n    )\n    parser.add_argument(\n        \"--replace-spaces-with-blanks\",\n        type=int,\n        default=None,\n        help=\"Fill blanks in between words to better model pauses between words.\"\n        \" Segments can be misaligned if this option is combined with --gratis-blank.\"\n        \" May increase length of ground truth.\",\n    )\n    parser.add_argument(\n        \"--scoring-length\",\n        type=int,\n        default=None,\n        help=\"Changes partitioning length L for calculation of the confidence score.\",\n    )\n    parser.add_argument(\n        \"--output\",\n        type=configargparse.FileType(\"w\"),\n        required=True,\n        help=\"Output segments file\",\n    )\n    return parser\n\n\ndef main(args):\n    \"\"\"Run the main decoding function.\"\"\"\n    parser = get_parser()\n    args, extra = parser.parse_known_args(args)\n    # logging info\n    if args.verbose == 1:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    elif args.verbose == 2:\n        logging.basicConfig(\n            level=logging.DEBUG,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n    if args.ngpu == 0 and args.dtype == \"float16\":\n        raise ValueError(f\"--dtype {args.dtype} does not support the CPU backend.\")\n    # check CUDA_VISIBLE_DEVICES\n    device = \"cpu\"\n    if args.ngpu == 1:\n        device = \"cuda\"\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n    elif 
args.ngpu > 1:\n        logging.error(\"Decoding only supports ngpu=1.\")\n        sys.exit(1)\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n    # recog\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"pytorch\":\n        ctc_align(args, device)\n    else:\n        raise ValueError(\"Only pytorch is supported.\")\n    sys.exit(0)\n\n\ndef ctc_align(args, device):\n    \"\"\"ESPnet-specific interface for CTC segmentation.\n\n    Parses configuration, infers the CTC posterior probabilities,\n    and then aligns start and end of utterances using CTC segmentation.\n    Results are written to the output file given in the args.\n\n    :param args: given configuration\n    :param device: for inference; one of ['cuda', 'cpu']\n    :return:  0 on success\n    \"\"\"\n    model, train_args = load_trained_model(args.model)\n    assert isinstance(model, ASRInterface)\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},\n    )\n    logging.info(f\"Decoding device={device}\")\n    # Warn for nets with high memory consumption on long audio files\n    if hasattr(model, \"enc\"):\n        encoder_module = model.enc.__class__.__module__\n    elif hasattr(model, \"encoder\"):\n        encoder_module = model.encoder.__class__.__module__\n    else:\n        encoder_module = \"Unknown\"\n    logging.info(f\"Encoder module: {encoder_module}\")\n    logging.info(f\"CTC module:     {model.ctc.__class__.__module__}\")\n    if \"rnn\" not in encoder_module:\n        logging.warning(\"No BLSTM model detected; memory consumption may be high.\")\n    model.to(device=device).eval()\n    # read audio and text json data\n    with open(args.data_json, 
\"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    with open(args.utt_text, \"r\", encoding=\"utf-8\") as f:\n        lines = f.readlines()\n        i = 0\n        text = {}\n        segment_names = {}\n        for name in js.keys():\n            text_per_audio = []\n            segment_names_per_audio = []\n            while i < len(lines) and lines[i].startswith(name):\n                text_per_audio.append(lines[i][lines[i].find(\" \") + 1 :])\n                segment_names_per_audio.append(lines[i][: lines[i].find(\" \")])\n                i += 1\n            text[name] = text_per_audio\n            segment_names[name] = segment_names_per_audio\n    # apply configuration\n    config = CtcSegmentationParameters()\n    if args.subsampling_factor is not None:\n        config.subsampling_factor = args.subsampling_factor\n    if args.frame_duration is not None:\n        config.frame_duration_ms = args.frame_duration\n    if args.min_window_size is not None:\n        config.min_window_size = args.min_window_size\n    if args.max_window_size is not None:\n        config.max_window_size = args.max_window_size\n    config.char_list = train_args.char_list\n    if args.use_dict_blank is not None:\n        logging.warning(\n            \"The option --use-dict-blank is deprecated. If needed,\"\n            \" use --set-blank instead.\"\n        )\n    if args.set_blank is not None:\n        config.blank = args.set_blank\n    if args.replace_spaces_with_blanks is not None:\n        if args.replace_spaces_with_blanks:\n            config.replace_spaces_with_blanks = True\n        else:\n            config.replace_spaces_with_blanks = False\n    if args.gratis_blank:\n        config.blank_transition_cost_zero = True\n    if config.blank_transition_cost_zero and args.replace_spaces_with_blanks:\n        logging.error(\n            \"Blanks are inserted between words, and also the transition cost of blank\"\n            \" is zero. 
This configuration may lead to misalignments!\"\n        )\n    if args.scoring_length is not None:\n        config.score_min_mean_over_L = args.scoring_length\n    logging.info(\n        f\"Frame timings: {config.frame_duration_ms}ms * {config.subsampling_factor}\"\n    )\n    # Iterate over audio files to decode and align\n    for idx, name in enumerate(js.keys(), 1):\n        logging.info(\"(%d/%d) Aligning \" + name, idx, len(js.keys()))\n        batch = [(name, js[name])]\n        feat, label = load_inputs_and_targets(batch)\n        feat = feat[0]\n        with torch.no_grad():\n            # Encode input frames\n            enc_output = model.encode(torch.as_tensor(feat).to(device)).unsqueeze(0)\n            # Apply ctc layer to obtain log character probabilities\n            lpz = model.ctc.log_softmax(enc_output)[0].cpu().numpy()\n        # Prepare the text for aligning\n        ground_truth_mat, utt_begin_indices = prepare_text(config, text[name])\n        # Align using CTC segmentation\n        timings, char_probs, state_list = ctc_segmentation(\n            config, lpz, ground_truth_mat\n        )\n        logging.debug(f\"state_list = {state_list}\")\n        # Obtain list of utterances with time intervals and confidence score\n        segments = determine_utterance_segments(\n            config, utt_begin_indices, char_probs, timings, text[name]\n        )\n        # Write to \"segments\" file\n        for i, boundary in enumerate(segments):\n            utt_segment = (\n                f\"{segment_names[name][i]} {name} {boundary[0]:.2f}\"\n                f\" {boundary[1]:.2f} {boundary[2]:.9f}\\n\"\n            )\n            args.output.write(utt_segment)\n    return 0\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/asr_enhance.py",
    "content": "#!/usr/bin/env python3\nimport configargparse\nfrom distutils.util import strtobool\nimport logging\nimport os\nimport random\nimport sys\n\nimport numpy as np\n\nfrom espnet.asr.pytorch_backend.asr import enhance\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    parser = configargparse.ArgumentParser(\n        description=\"Enhance noisy speech for speech recognition\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\"--ngpu\", default=0, type=int, help=\"Number of GPUs\")\n    parser.add_argument(\n        \"--backend\",\n        default=\"chainer\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\"--verbose\", \"-V\", default=1, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--batchsize\",\n        default=1,\n        type=int,\n        help=\"Batch size for beam search (0: means no batch processing)\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    # task related\n    parser.add_argument(\n        \"--recog-json\", type=str, 
help=\"Filename of recognition data (json)\"\n    )\n    # model (parameter) related\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    parser.add_argument(\n        \"--model-conf\", type=str, default=None, help=\"Model config file\"\n    )\n\n    # Outputs configuration\n    parser.add_argument(\n        \"--enh-wspecifier\",\n        type=str,\n        default=None,\n        help=\"Specify the output way for enhanced speech, \"\n        \"e.g. ark,scp:outdir,wav.scp\",\n    )\n    parser.add_argument(\n        \"--enh-filetype\",\n        type=str,\n        default=\"sound\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for enhanced speech. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\"--fs\", type=int, default=16000, help=\"The sample frequency\")\n    parser.add_argument(\n        \"--keep-length\",\n        type=strtobool,\n        default=True,\n        help=\"Adjust the output length to match \" \"with the input for enhanced speech\",\n    )\n    parser.add_argument(\n        \"--image-dir\", type=str, default=None, help=\"The directory saving the images.\"\n    )\n    parser.add_argument(\n        \"--num-images\",\n        type=int,\n        default=20,\n        help=\"The number of image files to be saved. \"\n        \"If negative, all samples are to be saved.\",\n    )\n\n    # IStft\n    parser.add_argument(\n        \"--apply-istft\",\n        type=strtobool,\n        default=True,\n        help=\"Apply istft to the output from the network\",\n    )\n    parser.add_argument(\n        \"--istft-win-length\",\n        type=int,\n        default=512,\n        help=\"The window length for istft. 
\"\n        \"This option is ignored \"\n        \"if stft is found in the preprocess-conf\",\n    )\n    parser.add_argument(\n        \"--istft-n-shift\",\n        type=int,\n        default=256,\n        help=\"The shift length for istft. \"\n        \"This option is ignored \"\n        \"if stft is found in the preprocess-conf\",\n    )\n    parser.add_argument(\n        \"--istft-window\",\n        type=str,\n        default=\"hann\",\n        help=\"The window type for istft. \"\n        \"This option is ignored \"\n        \"if stft is found in the preprocess-conf\",\n    )\n    return parser\n\n\ndef main(args):\n    parser = get_parser()\n    args = parser.parse_args(args)\n\n    # logging info\n    if args.verbose == 1:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    elif args.verbose == 2:\n        logging.basicConfig(\n            level=logging.DEBUG,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # check CUDA_VISIBLE_DEVICES\n    if args.ngpu > 0:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n        elif args.ngpu != len(cvd.split(\",\")):\n            logging.error(\"#gpus is not matched with CUDA_VISIBLE_DEVICES.\")\n            sys.exit(1)\n\n        # TODO(kamo): support of multiple GPUs\n        if args.ngpu > 1:\n            logging.error(\"The program only supports ngpu=1.\")\n            sys.exit(1)\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # seed setting\n    
random.seed(args.seed)\n    np.random.seed(args.seed)\n    logging.info(\"set random seed = %d\" % args.seed)\n\n    # recog\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"pytorch\":\n        enhance(args)\n    else:\n        raise ValueError(\"Only pytorch is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/asr_recog.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"End-to-end speech recognition model decoding script.\"\"\"\n\nimport configargparse\nimport logging\nimport os\nimport random\nimport sys\nimport tracemalloc\nimport numpy as np\n\nfrom espnet.utils.cli_utils import strtobool\n\n# NOTE: you need this func to generate our sphinx doc\n\n\ndef get_parser():\n    \"\"\"Get default arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Transcribe text from speech using \"\n        \"a speech recognition model on one CPU or GPU\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"Config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"Second config file path that overwrites the settings in `--config`\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"Third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`\",\n    )\n\n    parser.add_argument(\"--ngpu\", type=int, default=0, help=\"Number of GPUs\")\n    parser.add_argument(\n        \"--dtype\",\n        choices=(\"float16\", \"float32\", \"float64\"),\n        default=\"float32\",\n        help=\"Float precision (only available in --api v2)\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        type=str,\n        default=\"chainer\",\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", type=int, default=1, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", type=int, default=1, help=\"Random seed\")\n    parser.add_argument(\"--verbose\", 
\"-V\", type=int, default=1, help=\"Verbose option\")\n    parser.add_argument(\n        \"--batchsize\",\n        type=int,\n        default=1,\n        help=\"Batch size for beam search (0 means no batch processing)\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"--api\",\n        default=\"v1\",\n        choices=[\"v1\", \"v2\"],\n        help=\"Beam search APIs \"\n        \"v1: Default API. It only supports the ASRInterface.recognize method \"\n        \"and DefaultRNNLM. \"\n        \"v2: Experimental API. It supports any model that implements ScorerInterface.\",\n    )\n    # task related\n    parser.add_argument(\n        \"--recog-json\", type=str, help=\"Filename of recognition data (json)\"\n    )\n    parser.add_argument(\n        \"--result-label\",\n        type=str,\n        required=True,\n        help=\"Filename of result label data (json)\",\n    )\n    # model (parameter) related\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    parser.add_argument(\n        \"--model-conf\", type=str, default=None, help=\"Model config file\"\n    )\n    parser.add_argument(\n        \"--num-spkrs\",\n        type=int,\n        default=1,\n        choices=[1, 2],\n        help=\"Number of speakers in the speech\",\n    )\n    parser.add_argument(\n        \"--num-encs\", default=1, type=int, help=\"Number of encoders in the model.\"\n    )\n    # search related\n    parser.add_argument(\"--nbest\", type=int, default=10, help=\"Output N-best hypotheses\")\n    parser.add_argument(\"--beam-size\", type=int, default=1, help=\"Beam size\")\n    parser.add_argument(\"--penalty\", type=float, default=0.0, help=\"Insertion penalty\")\n    parser.add_argument(\n        \"--maxlenratio\",\n        type=float,\n        
default=0.0,\n        help=\"\"\"Input length ratio to obtain max output length.\n                        If maxlenratio=0.0 (default), it uses an end-detect function\n                        to automatically find maximum hypothesis lengths\"\"\",\n    )\n    parser.add_argument(\n        \"--minlenratio\",\n        type=float,\n        default=0.0,\n        help=\"Input length ratio to obtain min output length\",\n    )\n    parser.add_argument(\n        \"--ctc-weight\", type=float, default=0.0, help=\"CTC weight in joint decoding\"\n    )\n    parser.add_argument(\n        \"--weights-ctc-dec\",\n        type=float,\n        action=\"append\",\n        help=\"CTC weight assigned to each encoder during decoding \"\n        \"[in multi-encoder mode only]\",\n    )\n    parser.add_argument(\n        \"--ctc-window-margin\",\n        type=int,\n        default=0,\n        help=\"\"\"Use CTC window with margin parameter to accelerate\n                        CTC/attention decoding especially on GPU. 
Smaller margin\n                        makes decoding faster, but may increase search errors.\n                        If margin=0 (default), this function is disabled\"\"\",\n    )\n    # transducer related\n    parser.add_argument(\n        \"--search-type\",\n        type=str,\n        default=\"alsd\",\n        choices=[\"default\", \"nsc\", \"tsd\", \"alsd\", \"ctc_greedy\", \"ctc_beam\"],\n        help=\"\"\"Type of beam search implementation to use during inference.\n        Can be either: default beam search, n-step constrained beam search (\"nsc\"),\n        time-synchronous decoding (\"tsd\") or alignment-length synchronous decoding\n        (\"alsd\").\n        Additional associated parameters: \"nstep\" + \"prefix-alpha\" (for nsc),\n        \"max-sym-exp\" (for tsd) and \"u-max\" (for alsd)\"\"\",\n    )\n    parser.add_argument(\n        \"--nstep\",\n        type=int,\n        default=1,\n        help=\"Number of expansion steps allowed in NSC beam search.\",\n    )\n    parser.add_argument(\n        \"--prefix-alpha\",\n        type=int,\n        default=2,\n        help=\"Length prefix difference allowed in NSC beam search.\",\n    )\n    parser.add_argument(\n        \"--max-sym-exp\",\n        type=int,\n        default=2,\n        help=\"Number of symbol expansions allowed in TSD decoding.\",\n    )\n    parser.add_argument(\n        \"--u-max\",\n        type=int,\n        default=400,\n        help=\"Length prefix difference allowed in ALSD beam search.\",\n    )\n    parser.add_argument(\n        \"--score-norm\",\n        type=strtobool,\n        nargs=\"?\",\n        default=True,\n        help=\"Normalize transducer scores by length\",\n    )\n    # rnnlm related\n    parser.add_argument(\n        \"--rnnlm\", type=str, default=None, help=\"RNNLM model file to read\"\n    )\n    parser.add_argument(\n        \"--rnnlm-conf\", type=str, default=None, help=\"RNNLM model config file to read\"\n    )\n    parser.add_argument(\n        
\"--word-rnnlm\", type=str, default=None, help=\"Word RNNLM model file to read\"\n    )\n    parser.add_argument(\n        \"--word-rnnlm-conf\",\n        type=str,\n        default=None,\n        help=\"Word RNNLM model config file to read\",\n    )\n    parser.add_argument(\"--word-dict\", type=str, default=None, help=\"Word list to read\")\n    parser.add_argument(\"--lm-weight\", type=float, default=0.1, help=\"RNNLM weight\")\n    # ngram related\n    parser.add_argument(\n        \"--ngram-model\", type=str, default=None, help=\"ngram model file to read\"\n    )\n    parser.add_argument(\"--ngram-weight\", type=float, default=0.1, help=\"ngram weight\")\n    parser.add_argument(\n        \"--ngram-scorer\",\n        type=str,\n        default=\"part\",\n        choices=(\"full\", \"part\"),\n        help=\"\"\"If the ngram is set as a part scorer, similar to the CTC scorer,\n                the ngram scorer only scores the top-K hypotheses.\n                If the ngram is set as a full scorer, it scores all hypotheses.\n                Decoding with the part scorer is much faster than with the full one\"\"\",\n    )\n    # streaming related\n    parser.add_argument(\n        \"--streaming-mode\",\n        type=str,\n        default=None,\n        choices=[\"window\", \"segment\"],\n        help=\"\"\"Use streaming recognizer for inference.\n                        `--batchsize` must be set to 0 to enable this mode\"\"\",\n    )\n    parser.add_argument(\"--streaming-window\", type=int, default=10, help=\"Window size\")\n    parser.add_argument(\n        \"--streaming-min-blank-dur\",\n        type=int,\n        default=10,\n        help=\"Minimum blank duration threshold\",\n    )\n    parser.add_argument(\n        \"--streaming-onset-margin\", type=int, default=1, help=\"Onset margin\"\n    )\n    parser.add_argument(\n        \"--streaming-offset-margin\", type=int, default=1, help=\"Offset margin\"\n    )\n    # non-autoregressive related\n    # Mask CTC related. 
See https://arxiv.org/abs/2005.08700 for details.\n    parser.add_argument(\n        \"--maskctc-n-iterations\",\n        type=int,\n        default=10,\n        help=\"Number of decoding iterations. \"\n        \"For Mask CTC, set 0 to predict 1 mask/iter.\",\n    )\n    parser.add_argument(\n        \"--maskctc-probability-threshold\",\n        type=float,\n        default=0.999,\n        help=\"Threshold probability for CTC output\",\n    )\n\n    # NOTE: `type=bool` would treat any non-empty string (even \"False\") as True,\n    # so parse the flag with strtobool like the other boolean options.\n    parser.add_argument(\n        \"--k2-decode\",\n        type=strtobool,\n        default=False,\n        help=\"Using K2 decoding\",\n    )\n    parser.add_argument(\n        \"--local-rank\",\n        type=int,\n        default=-1,\n        help=\"To choose GPU\",\n    )\n    parser.add_argument(\n        \"--mmi-weight\",\n        type=float,\n        default=0.0,\n        help=\"MMI scorer weight\",\n    )\n    parser.add_argument(\n        \"--mas-lookahead\",\n        type=int,\n        default=0,\n        help=\"Number of frames to look ahead in MMI alignment scores\",\n    )\n    parser.add_argument(\n        \"--use-segment\",\n        type=strtobool,\n        default=False,\n        help=\"If true, the MMI score is parsed by jieba. 
(Chinese only)\",\n    )\n    parser.add_argument(\n        \"--mmi-rescore\",\n        type=strtobool,\n        default=False,\n        help=\"Do MMI rescoring after decoding, only for lasctc framework\"\n    )\n    parser.add_argument(\n        \"--word-ngram\",\n        type=str,\n        default=\"\",\n        help=\"Path to word-level N-gram model lang directory\"\n    )\n    parser.add_argument(\n        \"--word-ngram-weight\",\n        type=float,\n        default=0.0,\n        help=\"weight of the N-gram model\"\n    )\n    parser.add_argument(\n        \"--word-ngram-log-semiring\",\n        type=strtobool,\n        default=True,\n        help=\"If true, score the lattice with log-semiring, else tropical semiring\"\n    )\n    parser.add_argument(\n        \"--word-ngram-lower-char\",\n        type=strtobool,\n        default=True,\n        help=\"If true, all English characters will be converted to lower case; otherwise, to upper case\"\n    )\n    parser.add_argument(\n        \"--tlg-scorer\",\n        type=str,\n        default=\"\",\n        help=\"lang directory that stores the LG.fst. Only useful for RNNT ALSD decoding\"\n    )\n    parser.add_argument(\n        \"--tlg-nonblk-reward\",\n        type=float,\n        default=1.5,\n        help=\"Reward whenever a non-blank token is generated. 
Used in TLG scorer\",\n    )\n    parser.add_argument(\n        \"--tlg-weight\",\n        type=float,\n        default=0.0,\n        help=\"weight for TLG scorer in decoding\",\n    )\n    parser.add_argument(\n        \"--skip-eng\",\n        type=strtobool,\n        default=False,\n        help=\"If true, skip utterances whose transcriptions contain English letters (rnnt only)\",\n    )\n    parser.add_argument(\n        \"--forbid-eng\",\n        type=strtobool,\n        default=False,\n        help=\"If true, forbid the rnnt model to predict English characters (rnnt only)\",\n    )\n    parser.add_argument(\n        \"--cs-nt-decode-feature\",\n        type=str,\n        default=\"combine\",\n        choices=[\"combine\", \"chn\", \"eng\"],\n        help=\"feature used for decoding\",\n    )\n    parser.add_argument(\n        \"--cs-lang-weight\",\n        type=float,\n        default=0.0,\n        help=\"weight of language classification loss\",\n    )\n    parser.add_argument(\n        \"--eng-vocab\",\n        type=str,\n        default=None,\n        help=\"If given, the hypothesis is valid only if all English words are in this vocab\",\n    )\n    return parser\n\n\ndef main(args):\n    \"\"\"Run the main decoding function.\"\"\"\n    parser = get_parser()\n    args = parser.parse_args(args)\n\n    if args.ngpu == 0 and args.dtype == \"float16\":\n        raise ValueError(f\"--dtype {args.dtype} does not support the CPU backend.\")\n\n    # logging info\n    if args.verbose == 1:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    elif args.verbose == 2:\n        logging.basicConfig(\n            level=logging.DEBUG,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s 
(%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # check CUDA_VISIBLE_DEVICES\n    if args.ngpu > 0:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n        elif args.ngpu != len(cvd.split(\",\")):\n            logging.error(\"#gpus is not matched with CUDA_VISIBLE_DEVICES.\")\n            sys.exit(1)\n\n        # TODO(mn5k): support of multiple GPUs\n        if args.ngpu > 1:\n            logging.error(\"The program only supports ngpu=1.\")\n            sys.exit(1)\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # seed setting\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    logging.info(\"set random seed = %d\" % args.seed)\n\n    # validate rnn options\n    if args.rnnlm is not None and args.word_rnnlm is not None:\n        logging.error(\n            \"It seems that both --rnnlm and --word-rnnlm are specified. 
\"\n            \"Please use either option.\"\n        )\n        sys.exit(1)\n\n    # recog\n    logging.info(\"backend = \" + args.backend)\n    if args.num_spkrs == 1:\n        if args.backend == \"chainer\":\n            from espnet.asr.chainer_backend.asr import recog\n\n            recog(args)\n        elif args.backend == \"pytorch\":\n            if args.num_encs == 1:\n                # Experimental API that supports custom LMs\n                if args.api == \"v2\":\n                    from espnet.asr.pytorch_backend.recog import recog_v2\n\n                    recog_v2(args)\n                else:\n                    from espnet.asr.pytorch_backend.asr import recog\n\n                    if args.dtype != \"float32\":\n                        raise NotImplementedError(\n                            f\"`--dtype {args.dtype}` is only available with `--api v2`\"\n                        )\n                    recog(args)\n            else:\n                if args.api == \"v2\":\n                    raise NotImplementedError(\n                        f\"--num-encs {args.num_encs} > 1 is not supported in --api v2\"\n                    )\n                else:\n                    from espnet.asr.pytorch_backend.asr import recog\n\n                    recog(args)\n        else:\n            raise ValueError(\"Only chainer and pytorch are supported.\")\n    elif args.num_spkrs == 2:\n        if args.backend == \"pytorch\":\n            from espnet.asr.pytorch_backend.asr_mix import recog\n\n            recog(args)\n        else:\n            raise ValueError(\"Only pytorch is supported.\")\n\n\nif __name__ == \"__main__\":\n    # tracemalloc.start(10000)\n    main(sys.argv[1:])\n    # size, peak = tracemalloc.get_traced_memory()\n    # peak /= (1024 ** 2)\n    # print(f\"Maximum Memory consumed: {peak}MB\")\n"
  },
  {
    "path": "bin/asr_train.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Tomoki Hayashi (Nagoya University)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Automatic speech recognition model training script.\"\"\"\n\nimport logging\nimport os\nimport random\nimport subprocess\nimport sys\n\nfrom distutils.version import LooseVersion\n\nimport configargparse\nimport numpy as np\nimport torch\n\nfrom espnet import __version__\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.training.batchfy import BATCH_COUNT_CHOICES\n\nis_torch_1_2_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.2\")\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser(parser=None, required=True):\n    \"\"\"Get default arguments.\"\"\"\n    if parser is None:\n        parser = configargparse.ArgumentParser(\n            description=\"Train an automatic speech recognition (ASR) model on one CPU, \"\n            \"one or multiple GPUs\",\n            config_file_parser_class=configargparse.YAMLConfigFileParser,\n            formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n        )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings in \"\n        \"`--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\n        \"--ngpu\",\n        default=None,\n        type=int,\n        help=\"Number of GPUs. 
If not given, use all visible devices\",\n    )\n    parser.add_argument(\n        \"--train-dtype\",\n        default=\"float32\",\n        choices=[\"float16\", \"float32\", \"float64\", \"O0\", \"O1\", \"O2\", \"O3\"],\n        help=\"Data type for training (only pytorch backend). \"\n        \"O0,O1,.. flags require apex. \"\n        \"See https://nvidia.github.io/apex/amp.html#opt-levels\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        default=\"chainer\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\n        \"--outdir\", type=str, required=required, help=\"Output directory\"\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--dict\", required=required, help=\"Dictionary\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\"--debugdir\", type=str, help=\"Output directory for debugging\")\n    parser.add_argument(\n        \"--resume\",\n        \"-r\",\n        default=\"\",\n        nargs=\"?\",\n        help=\"Resume the training from snapshot\",\n    )\n    parser.add_argument(\n        \"--minibatches\",\n        \"-N\",\n        type=int,\n        default=\"-1\",\n        help=\"Process only N minibatches (for debug)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--tensorboard-dir\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Tensorboard log dir path\",\n    )\n    parser.add_argument(\n        \"--report-interval-iters\",\n        default=300,\n        type=int,\n        help=\"Report interval iterations\",\n    )\n    parser.add_argument(\n        \"--save-interval-iters\",\n        default=0,\n        type=int,\n        help=\"Save snapshot interval iterations\",\n    )\n    # task related\n    
parser.add_argument(\n        \"--train-json\",\n        type=str,\n        default=None,\n        help=\"Filename of train label data (json)\",\n    )\n    parser.add_argument(\n        \"--valid-json\",\n        type=str,\n        default=None,\n        help=\"Filename of validation label data (json)\",\n    )\n    # network architecture\n    parser.add_argument(\n        \"--model-module\",\n        type=str,\n        default=None,\n        help=\"model defined module (default: espnet.nets.xxx_backend.e2e_asr:E2E)\",\n    )\n    # encoder\n    parser.add_argument(\n        \"--num-encs\", default=1, type=int, help=\"Number of encoders in the model.\"\n    )\n    # loss related\n    parser.add_argument(\n        \"--ctc_type\",\n        default=\"warpctc\",\n        type=str,\n        choices=[\"builtin\", \"warpctc\", \"gtnctc\", \"cudnnctc\", \"k2mmi\", \"k2ctc\"],\n        help=\"Type of CTC implementation to calculate loss.\",\n    )\n    parser.add_argument(\n        \"--mtlalpha\",\n        default=0.5,\n        type=float,\n        help=\"Multitask learning coefficient, \"\n        \"alpha: alpha*ctc_loss + (1-alpha)*att_loss \",\n    )\n    parser.add_argument(\n        \"--lsm-weight\", default=0.0, type=float, help=\"Label smoothing weight\"\n    )\n    # recognition options to compute CER/WER\n    parser.add_argument(\n        \"--report-cer\",\n        default=False,\n        action=\"store_true\",\n        help=\"Compute CER on development set\",\n    )\n    parser.add_argument(\n        \"--report-wer\",\n        default=False,\n        action=\"store_true\",\n        help=\"Compute WER on development set\",\n    )\n    parser.add_argument(\"--nbest\", type=int, default=1, help=\"Output N-best hypotheses\")\n    parser.add_argument(\"--beam-size\", type=int, default=4, help=\"Beam size\")\n    parser.add_argument(\"--penalty\", default=0.0, type=float, help=\"Insertion penalty\")\n    parser.add_argument(\n        \"--maxlenratio\",\n        
default=0.0,\n        type=float,\n        help=\"\"\"Input length ratio to obtain max output length.\n                        If maxlenratio=0.0 (default), it uses an end-detect function\n                        to automatically find maximum hypothesis lengths\"\"\",\n    )\n    parser.add_argument(\n        \"--minlenratio\",\n        default=0.0,\n        type=float,\n        help=\"Input length ratio to obtain min output length\",\n    )\n    parser.add_argument(\n        \"--ctc-weight\", default=0.3, type=float, help=\"CTC weight in joint decoding\"\n    )\n    parser.add_argument(\n        \"--rnnlm\", type=str, default=None, help=\"RNNLM model file to read\"\n    )\n    parser.add_argument(\n        \"--rnnlm-conf\", type=str, default=None, help=\"RNNLM model config file to read\"\n    )\n    parser.add_argument(\"--lm-weight\", default=0.1, type=float, help=\"RNNLM weight.\")\n    parser.add_argument(\"--sym-space\", default=\"<space>\", type=str, help=\"Space symbol\")\n    parser.add_argument(\"--sym-blank\", default=\"<blank>\", type=str, help=\"Blank symbol\")\n    # minibatch related\n    parser.add_argument(\n        \"--sortagrad\",\n        default=0,\n        type=int,\n        nargs=\"?\",\n        help=\"How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs\",\n    )\n    parser.add_argument(\n        \"--batch-count\",\n        default=\"auto\",\n        choices=BATCH_COUNT_CHOICES,\n        help=\"How to count batch_size. 
\"\n        \"The default (auto) will find how to count by args.\",\n    )\n    parser.add_argument(\n        \"--batch-size\",\n        \"--batch-seqs\",\n        \"-b\",\n        default=0,\n        type=int,\n        help=\"Maximum seqs in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-bins\",\n        default=0,\n        type=int,\n        help=\"Maximum bins in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-in\",\n        default=0,\n        type=int,\n        help=\"Maximum input frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-out\",\n        default=0,\n        type=int,\n        help=\"Maximum output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-inout\",\n        default=0,\n        type=int,\n        help=\"Maximum input+output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--maxlen-in\",\n        \"--batch-seq-maxlen-in\",\n        default=800,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the input sequence length > ML.\",\n    )\n    parser.add_argument(\n        \"--maxlen-out\",\n        \"--batch-seq-maxlen-out\",\n        default=150,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the output sequence length > ML\",\n    )\n    parser.add_argument(\n        \"--n-iter-processes\",\n        default=0,\n        type=int,\n        help=\"Number of processes of iterator\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        nargs=\"?\",\n        help=\"The configuration file for the pre-processing\",\n    )\n    # optimization related\n    parser.add_argument(\n        \"--opt\",\n        default=\"noam_sgd\",\n        
type=str,\n        choices=[\"adadelta\", \"adam\", \"noam\", \"noam_sgd\"],\n        help=\"Optimizer\",\n    )\n    parser.add_argument(\n        \"--accum-grad\", default=1, type=int, help=\"Number of gradient accumulation\"\n    )\n    parser.add_argument(\n        \"--eps\", default=1e-8, type=float, help=\"Epsilon constant for optimizer\"\n    )\n    parser.add_argument(\n        \"--eps-decay\", default=0.01, type=float, help=\"Decaying ratio of epsilon\"\n    )\n    parser.add_argument(\n        \"--weight-decay\", default=0.0, type=float, help=\"Weight decay ratio\"\n    )\n    parser.add_argument(\n        \"--criterion\",\n        default=\"acc\",\n        type=str,\n        choices=[\"loss\", \"loss_eps_decay_only\", \"acc\"],\n        help=\"Criterion to perform epsilon decay\",\n    )\n    parser.add_argument(\n        \"--threshold\", default=1e-4, type=float, help=\"Threshold to stop iteration\"\n    )\n    parser.add_argument(\n        \"--epochs\", \"-e\", default=30, type=int, help=\"Maximum number of epochs\"\n    )\n    parser.add_argument(\n        \"--early-stop-criterion\",\n        default=\"validation/main/acc\",\n        type=str,\n        nargs=\"?\",\n        help=\"Value to monitor to trigger an early stopping of the training\",\n    )\n    parser.add_argument(\n        \"--patience\",\n        default=3,\n        type=int,\n        nargs=\"?\",\n        help=\"Number of epochs to wait without improvement \"\n        \"before stopping the training\",\n    )\n    parser.add_argument(\n        \"--grad-clip\", default=5, type=float, help=\"Gradient norm threshold to clip\"\n    )\n    parser.add_argument(\n        \"--num-save-attention\",\n        default=0,\n        type=int,\n        help=\"Number of samples of attention to be saved\",\n    )\n    parser.add_argument(\n        \"--num-save-ctc\",\n        default=0,\n        type=int,\n        help=\"Number of samples of CTC probability to be saved\",\n    )\n    
parser.add_argument(\n        \"--grad-noise\",\n        type=strtobool,\n        default=False,\n        help=\"The flag to switch to use noise injection to gradients during training\",\n    )\n    # asr_mix related\n    parser.add_argument(\n        \"--num-spkrs\",\n        default=1,\n        type=int,\n        choices=[1, 2],\n        help=\"Number of speakers in the speech.\",\n    )\n    # decoder related\n    parser.add_argument(\n        \"--context-residual\",\n        default=False,\n        type=strtobool,\n        nargs=\"?\",\n        help=\"The flag to switch to use context vector residual in the decoder network\",\n    )\n    # finetuning related\n    parser.add_argument(\n        \"--enc-init\",\n        default=None,\n        type=str,\n        help=\"Pre-trained ASR model to initialize encoder.\",\n    )\n    parser.add_argument(\n        \"--enc-init-mods\",\n        default=\"enc.enc.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of encoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--dec-init\",\n        default=None,\n        type=str,\n        help=\"Pre-trained ASR, MT or LM model to initialize decoder.\",\n    )\n    parser.add_argument(\n        \"--dec-init-mods\",\n        default=\"att.,dec.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of decoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--freeze-mods\",\n        default=None,\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of modules to freeze, separated by a comma.\",\n    )\n    # front end related\n    parser.add_argument(\n        \"--use-frontend\",\n        type=strtobool,\n        default=False,\n        help=\"The flag to switch to use frontend system.\",\n    )\n\n    # WPE related\n    parser.add_argument(\n        
\"--use-wpe\",\n        type=strtobool,\n        default=False,\n        help=\"Apply Weighted Prediction Error\",\n    )\n    parser.add_argument(\n        \"--wtype\",\n        default=\"blstmp\",\n        type=str,\n        choices=[\n            \"lstm\",\n            \"blstm\",\n            \"lstmp\",\n            \"blstmp\",\n            \"vgglstmp\",\n            \"vggblstmp\",\n            \"vgglstm\",\n            \"vggblstm\",\n            \"gru\",\n            \"bgru\",\n            \"grup\",\n            \"bgrup\",\n            \"vgggrup\",\n            \"vggbgrup\",\n            \"vgggru\",\n            \"vggbgru\",\n        ],\n        help=\"Type of encoder network architecture \"\n        \"of the mask estimator for WPE. \"\n        \"\",\n    )\n    parser.add_argument(\"--wlayers\", type=int, default=2, help=\"\")\n    parser.add_argument(\"--wunits\", type=int, default=300, help=\"\")\n    parser.add_argument(\"--wprojs\", type=int, default=300, help=\"\")\n    parser.add_argument(\"--wdropout-rate\", type=float, default=0.0, help=\"\")\n    parser.add_argument(\"--wpe-taps\", type=int, default=5, help=\"\")\n    parser.add_argument(\"--wpe-delay\", type=int, default=3, help=\"\")\n    parser.add_argument(\n        \"--use-dnn-mask-for-wpe\",\n        type=strtobool,\n        default=False,\n        help=\"Use DNN to estimate the power spectrogram. 
\"\n        \"This option is experimental.\",\n    )\n    # Beamformer related\n    parser.add_argument(\"--use-beamformer\", type=strtobool, default=True, help=\"\")\n    parser.add_argument(\n        \"--btype\",\n        default=\"blstmp\",\n        type=str,\n        choices=[\n            \"lstm\",\n            \"blstm\",\n            \"lstmp\",\n            \"blstmp\",\n            \"vgglstmp\",\n            \"vggblstmp\",\n            \"vgglstm\",\n            \"vggblstm\",\n            \"gru\",\n            \"bgru\",\n            \"grup\",\n            \"bgrup\",\n            \"vgggrup\",\n            \"vggbgrup\",\n            \"vgggru\",\n            \"vggbgru\",\n        ],\n        help=\"Type of encoder network architecture \"\n        \"of the mask estimator for Beamformer.\",\n    )\n    parser.add_argument(\"--blayers\", type=int, default=2, help=\"\")\n    parser.add_argument(\"--bunits\", type=int, default=300, help=\"\")\n    parser.add_argument(\"--bprojs\", type=int, default=300, help=\"\")\n    parser.add_argument(\"--badim\", type=int, default=320, help=\"\")\n    parser.add_argument(\n        \"--bnmask\",\n        type=int,\n        default=2,\n        help=\"Number of beamforming masks, \" \"default is 2 for [speech, noise].\",\n    )\n    parser.add_argument(\n        \"--ref-channel\",\n        type=int,\n        default=-1,\n        help=\"The reference channel used for beamformer. 
\"\n        \"By default, the channel is estimated by DNN.\",\n    )\n    parser.add_argument(\"--bdropout-rate\", type=float, default=0.0, help=\"\")\n    # Feature transform: Normalization\n    parser.add_argument(\n        \"--stats-file\",\n        type=str,\n        default=None,\n        help=\"The stats file for the feature normalization\",\n    )\n    parser.add_argument(\n        \"--apply-uttmvn\",\n        type=strtobool,\n        default=True,\n        help=\"Apply utterance level mean \" \"variance normalization.\",\n    )\n    parser.add_argument(\"--uttmvn-norm-means\", type=strtobool, default=True, help=\"\")\n    parser.add_argument(\"--uttmvn-norm-vars\", type=strtobool, default=False, help=\"\")\n    # Feature transform: Fbank\n    parser.add_argument(\n        \"--fbank-fs\",\n        type=int,\n        default=16000,\n        help=\"The sample frequency used for \" \"the mel-fbank creation.\",\n    )\n    parser.add_argument(\n        \"--n-mels\", type=int, default=80, help=\"The number of mel-frequency bins.\"\n    )\n    parser.add_argument(\"--fbank-fmin\", type=float, default=0.0, help=\"\")\n    parser.add_argument(\"--fbank-fmax\", type=float, default=None, help=\"\")\n\n    # K2\n    parser.add_argument(\"--lang\", type=str,\n                        help=\"k2 lang dir\")\n    parser.add_argument(\"--den-scale\", type=float, default=1.0,\n                        help=\"denominator scale: loss = num + den_scale * den\")\n    parser.add_argument(\"--third-weight\", type=float, default=0.0,\n                        help=\"we still need ctc loss if encoder is supervised by MMI. This is ctc_weight\")\n    parser.add_argument(\"--use-segment\", type=strtobool, default=False,\n                        help=\"If true, MMI supervision is from text_org. 
If false, it is from ys_pad\")\n    \n    # DDP\n    parser.add_argument(\"--master-node\", type=int, default=0,\n                        help=\"master node rank\")\n    parser.add_argument(\"--local_rank\", type=int, default=-1,\n                        help=\"local GPU rank\")\n    parser.add_argument(\"--world-size\", type=int, default=-1,\n                        help=\"BMUF world size\")\n    parser.add_argument(\"--node-rank\", type=int, default=-1,\n                        help=\"DDP node rank\")\n    parser.add_argument(\"--node-size\", type=int, default=8,\n                        help=\"number of GPUs on each node\")\n\n    # MBR\n    parser.add_argument(\"--load-trainer-and-opt\", type=strtobool, default=True,\n                        help=\"If false, only the model weights are loaded from the snapshot\")\n\n    \n    parser.add_argument(\"--block-load\", type=strtobool, default=False,\n                        help=\"block loading for training. make sure all batches are in the same ark\")\n    parser.add_argument(\"--utts-per-ark\", type=int, default=256,\n                        help=\"number of utterances in each ark\")\n    \"\"\"\n    Because the Ceph storage is slow, we cannot load data in a fully random order.\n    Thus, the randomness is implemented hierarchically.\n    (1) We ensure that each minibatch comes from a single ark file. Also, make\n        sure the utterances in the json file are sorted from shortest to longest.\n        You should do this before training starts.\n    (2) The whole dataset is divided into many groups; each group contains\n        \"block-buffer-size\" arks. Shuffling is applied both within a group\n        and across groups. The larger the 'block-buffer-size',\n        the better the shuffling, but the more memory\n        is consumed.\n    (3) At the beginning of each epoch, training may stall since nearly\n        each update needs to load a new ark. 
This does not last long: training\n        runs smoothly once a group of data is completely loaded.\n    (4) Once a minibatch is consumed, we delete it from memory to avoid OOM.\n    (5) With this loading strategy, only one worker process can load\n        the data, to avoid conflicts in the memory buffer. This is fine since\n        each ark contains many utterances and one worker is more than enough.\n    (6) A buffer that is too large (e.g., size > 100) will slow down the GPU since\n        virtual memory (actually the disk) is then used as the buffer.\n    \"\"\"\n    parser.add_argument(\"--block-buffer-size\", type=int, default=80,\n                        help=\"number of arks in the buffer. At most 3*block_buffer_size arks are stored in memory\")\n    return parser\n\n\ndef main(cmd_args):\n    \"\"\"Run the main training function.\"\"\"\n    parser = get_parser()\n    args, _ = parser.parse_known_args(cmd_args)\n    if args.backend == \"chainer\" and args.train_dtype != \"float32\":\n        raise NotImplementedError(\n            f\"chainer backend does not support --train-dtype {args.train_dtype}.\"\n            \" Use --train-dtype float32.\"\n        )\n    if args.ngpu == 0 and args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\", \"float16\"):\n        raise ValueError(\n            f\"--train-dtype {args.train_dtype} does not support the CPU backend.\"\n        )\n\n    from espnet.utils.dynamic_import import dynamic_import\n\n    if args.model_module is None:\n        if args.num_spkrs == 1:\n            model_module = \"espnet.nets.\" + args.backend + \"_backend.e2e_asr:E2E\"\n        else:\n            model_module = \"espnet.nets.\" + args.backend + \"_backend.e2e_asr_mix:E2E\"\n    else:\n        model_module = args.model_module\n    model_class = dynamic_import(model_module)\n    model_class.add_arguments(parser)\n\n    args = parser.parse_args(cmd_args)\n    args.model_module = model_module\n    if \"chainer_backend\" in args.model_module:\n   
     args.backend = \"chainer\"\n    if \"pytorch_backend\" in args.model_module:\n        args.backend = \"pytorch\"\n\n    # add version info in args\n    args.version = __version__\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # If --ngpu is not given,\n    #   1. if CUDA_VISIBLE_DEVICES is set, all visible devices\n    #   2. if nvidia-smi exists, use all devices\n    #   3. else ngpu=0\n    if args.ngpu is None:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is not None:\n            ngpu = len(cvd.split(\",\"))\n        else:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n            try:\n                p = subprocess.run(\n                    [\"nvidia-smi\", \"-L\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE\n                )\n            except (subprocess.CalledProcessError, FileNotFoundError):\n                ngpu = 0\n            else:\n                ngpu = len(p.stderr.decode().split(\"\\n\")) - 1\n    else:\n        if is_torch_1_2_plus and args.ngpu != 1:\n            logging.debug(\n                \"There are some bugs with multi-GPU processing in PyTorch 1.2+\"\n                + \" (see https://github.com/pytorch/pytorch/issues/21108)\"\n            )\n        ngpu = args.ngpu\n    logging.info(f\"ngpu: {ngpu}\")\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # set random seed\n    logging.info(\"random seed = %d\" % args.seed)\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n\n    # load dictionary for debug log\n    
if args.dict is not None:\n        with open(args.dict, \"rb\") as f:\n            dictionary = f.readlines()\n        char_list = [entry.decode(\"utf-8\").split(\" \")[0] for entry in dictionary]\n        char_list.insert(0, \"<blank>\")\n        char_list.append(\"<eos>\")\n        # for non-autoregressive maskctc model\n        if \"maskctc\" in args.model_module:\n            char_list.append(\"<mask>\")\n        args.char_list = char_list\n    else:\n        args.char_list = None\n\n    # train\n    logging.info(\"backend = \" + args.backend)\n\n    if args.num_spkrs == 1:\n        if args.backend == \"chainer\":\n            from espnet.asr.chainer_backend.asr import train\n\n            train(args)\n        elif args.backend == \"pytorch\":\n            from espnet.asr.pytorch_backend.asr import train\n\n            train(args)\n        else:\n            raise ValueError(\"Only chainer and pytorch are supported.\")\n    else:\n        # FIXME(kamo): Support --model-module\n        if args.backend == \"pytorch\":\n            from espnet.asr.pytorch_backend.asr_mix import train\n\n            train(args)\n        else:\n            raise ValueError(\"Only pytorch is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/lm_train.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# This code is ported from the following implementation written in Torch.\n# https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb_custom_loop.py\n\n\"\"\"Language model training script.\"\"\"\n\nimport logging\nimport os\nimport random\nimport subprocess\nimport sys\n\nimport configargparse\nimport numpy as np\n\nfrom espnet import __version__\nfrom espnet.nets.lm_interface import dynamic_import_lm\nfrom espnet.optimizer.factory import dynamic_import_optimizer\nfrom espnet.scheduler.scheduler import dynamic_import_scheduler\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser(parser=None, required=True):\n    \"\"\"Get parser.\"\"\"\n    if parser is None:\n        parser = configargparse.ArgumentParser(\n            description=\"Train a new language model on one CPU or one GPU\",\n            config_file_parser_class=configargparse.YAMLConfigFileParser,\n            formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n        )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\n        \"--ngpu\",\n        default=None,\n        type=int,\n        help=\"Number of GPUs. 
If not given, use all visible devices\",\n    )\n    parser.add_argument(\n        \"--train-dtype\",\n        default=\"float32\",\n        choices=[\"float16\", \"float32\", \"float64\", \"O0\", \"O1\", \"O2\", \"O3\"],\n        help=\"Data type for training (only pytorch backend). \"\n        \"O0,O1,.. flags require apex. \"\n        \"See https://nvidia.github.io/apex/amp.html#opt-levels\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        default=\"chainer\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\n        \"--outdir\", type=str, required=required, help=\"Output directory\"\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--dict\", type=str, required=required, help=\"Dictionary\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\n        \"--resume\",\n        \"-r\",\n        default=\"\",\n        nargs=\"?\",\n        help=\"Resume the training from snapshot\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--tensorboard-dir\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Tensorboard log dir path\",\n    )\n    parser.add_argument(\n        \"--report-interval-iters\",\n        default=100,\n        type=int,\n        help=\"Report interval iterations\",\n    )\n    # task related\n    parser.add_argument(\n        \"--train-label\",\n        type=str,\n        required=required,\n        help=\"Filename of train label data\",\n    )\n    parser.add_argument(\n        \"--valid-label\",\n        type=str,\n        required=required,\n        help=\"Filename of validation label data\",\n    )\n    parser.add_argument(\"--test-label\", type=str, help=\"Filename of test label data\")\n    
parser.add_argument(\n        \"--dump-hdf5-path\",\n        type=str,\n        default=None,\n        help=\"Path to dump a preprocessed dataset as hdf5\",\n    )\n    # training configuration\n    parser.add_argument(\"--opt\", default=\"sgd\", type=str, help=\"Optimizer\")\n    parser.add_argument(\n        \"--sortagrad\",\n        default=0,\n        type=int,\n        nargs=\"?\",\n        help=\"How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs\",\n    )\n    parser.add_argument(\n        \"--batchsize\",\n        \"-b\",\n        type=int,\n        default=300,\n        help=\"Number of examples in each mini-batch\",\n    )\n    parser.add_argument(\n        \"--accum-grad\", type=int, default=1, help=\"Number of gradient accumulation steps\"\n    )\n    parser.add_argument(\n        \"--epoch\",\n        \"-e\",\n        type=int,\n        default=20,\n        help=\"Number of sweeps over the dataset to train\",\n    )\n    parser.add_argument(\n        \"--early-stop-criterion\",\n        default=\"validation/main/loss\",\n        type=str,\n        nargs=\"?\",\n        help=\"Value to monitor to trigger an early stopping of the training\",\n    )\n    parser.add_argument(\n        \"--patience\",\n        default=3,\n        type=int,\n        nargs=\"?\",\n        help=\"Number of epochs \"\n        \"to wait without improvement before stopping the training\",\n    )\n    parser.add_argument(\n        \"--schedulers\",\n        default=None,\n        action=\"append\",\n        type=lambda kv: kv.split(\"=\"),\n        help=\"optimizer schedulers, you can configure params like:\"\n        \" <optimizer-param>-<scheduler-name>-<scheduler-param>\"\n        ' e.g., \"--schedulers lr=noam --lr-noam-warmup 1000\".',\n    )\n    parser.add_argument(\n        \"--gradclip\",\n        \"-c\",\n        type=float,\n        default=5,\n        help=\"Gradient norm threshold to clip\",\n    )\n    parser.add_argument(\n        \"--maxlen\",\n       
 type=int,\n        default=40,\n        help=\"Batch size is reduced if the input sequence > ML\",\n    )\n    parser.add_argument(\n        \"--model-module\",\n        type=str,\n        default=\"default\",\n        help=\"model defined module \"\n        \"(default: espnet.nets.xxx_backend.lm.default:DefaultRNNLM)\",\n    )\n    return parser\n\n\ndef main(cmd_args):\n    \"\"\"Train LM.\"\"\"\n    parser = get_parser()\n    args, _ = parser.parse_known_args(cmd_args)\n    if args.backend == \"chainer\" and args.train_dtype != \"float32\":\n        raise NotImplementedError(\n            f\"chainer backend does not support --train-dtype {args.train_dtype}.\"\n            \" Use --train-dtype float32.\"\n        )\n    if args.ngpu == 0 and args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\", \"float16\"):\n        raise ValueError(\n            f\"--train-dtype {args.train_dtype} does not support the CPU backend.\"\n        )\n\n    # parse arguments dynamically\n    model_class = dynamic_import_lm(args.model_module, args.backend)\n    model_class.add_arguments(parser)\n    if args.schedulers is not None:\n        for k, v in args.schedulers:\n            scheduler_class = dynamic_import_scheduler(v)\n            scheduler_class.add_arguments(k, parser)\n\n    opt_class = dynamic_import_optimizer(args.opt, args.backend)\n    opt_class.add_arguments(parser)\n\n    args = parser.parse_args(cmd_args)\n\n    # add version info in args\n    args.version = __version__\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # If --ngpu is not given,\n    #   1. 
if CUDA_VISIBLE_DEVICES is set, all visible devices\n    #   2. if nvidia-smi exists, use all devices\n    #   3. else ngpu=0\n    if args.ngpu is None:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is not None:\n            ngpu = len(cvd.split(\",\"))\n        else:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n            try:\n                p = subprocess.run(\n                    [\"nvidia-smi\", \"-L\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE\n                )\n            except (subprocess.CalledProcessError, FileNotFoundError):\n                ngpu = 0\n            else:\n                # nvidia-smi -L prints one line per GPU on stdout\n                ngpu = len(p.stdout.decode().split(\"\\n\")) - 1\n        args.ngpu = ngpu\n    else:\n        ngpu = args.ngpu\n    logging.info(f\"ngpu: {ngpu}\")\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # seed setting\n    nseed = args.seed\n    random.seed(nseed)\n    np.random.seed(nseed)\n\n    # load dictionary\n    with open(args.dict, \"rb\") as f:\n        dictionary = f.readlines()\n    char_list = [entry.decode(\"utf-8\").split(\" \")[0] for entry in dictionary]\n    char_list.insert(0, \"<blank>\")\n    char_list.append(\"<eos>\")\n    args.char_list_dict = {x: i for i, x in enumerate(char_list)}\n    args.n_vocab = len(char_list)\n\n    # train\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"chainer\":\n        from espnet.lm.chainer_backend.lm import train\n\n        train(args)\n    elif args.backend == \"pytorch\":\n        from espnet.lm.pytorch_backend.lm import train\n\n        train(args)\n    else:\n        raise ValueError(\"Only chainer and pytorch are supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/mt_train.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Neural machine translation model training script.\"\"\"\n\nimport logging\nimport os\nimport random\nimport subprocess\nimport sys\n\nfrom distutils.version import LooseVersion\n\nimport configargparse\nimport numpy as np\nimport torch\n\nfrom espnet import __version__\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.training.batchfy import BATCH_COUNT_CHOICES\n\nis_torch_1_2_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.2\")\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser(parser=None, required=True):\n    \"\"\"Get default arguments.\"\"\"\n    if parser is None:\n        parser = configargparse.ArgumentParser(\n            description=\"Train a neural machine translation (NMT) model on one CPU, \"\n            \"one or multiple GPUs\",\n            config_file_parser_class=configargparse.YAMLConfigFileParser,\n            formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n        )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\n        \"--ngpu\",\n        default=None,\n        type=int,\n        help=\"Number of GPUs. 
If not given, use all visible devices\",\n    )\n    parser.add_argument(\n        \"--train-dtype\",\n        default=\"float32\",\n        choices=[\"float16\", \"float32\", \"float64\", \"O0\", \"O1\", \"O2\", \"O3\"],\n        help=\"Data type for training (only pytorch backend). \"\n        \"O0,O1,.. flags require apex. \"\n        \"See https://nvidia.github.io/apex/amp.html#opt-levels\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        default=\"chainer\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\n        \"--outdir\", type=str, required=required, help=\"Output directory\"\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\n        \"--dict\", required=required, help=\"Dictionary for source/target languages\"\n    )\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\"--debugdir\", type=str, help=\"Output directory for debugging\")\n    parser.add_argument(\n        \"--resume\",\n        \"-r\",\n        default=\"\",\n        nargs=\"?\",\n        help=\"Resume the training from snapshot\",\n    )\n    parser.add_argument(\n        \"--minibatches\",\n        \"-N\",\n        type=int,\n        default=\"-1\",\n        help=\"Process only N minibatches (for debug)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--tensorboard-dir\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Tensorboard log dir path\",\n    )\n    parser.add_argument(\n        \"--report-interval-iters\",\n        default=100,\n        type=int,\n        help=\"Report interval iterations\",\n    )\n    parser.add_argument(\n        \"--save-interval-iters\",\n        default=0,\n        type=int,\n        help=\"Save snapshot interval 
iterations\",\n    )\n    # task related\n    parser.add_argument(\n        \"--train-json\",\n        type=str,\n        default=None,\n        help=\"Filename of train label data (json)\",\n    )\n    parser.add_argument(\n        \"--valid-json\",\n        type=str,\n        default=None,\n        help=\"Filename of validation label data (json)\",\n    )\n    # network architecture\n    parser.add_argument(\n        \"--model-module\",\n        type=str,\n        default=None,\n        help=\"model defined module (default: espnet.nets.xxx_backend.e2e_mt:E2E)\",\n    )\n    # loss related\n    parser.add_argument(\n        \"--lsm-weight\", default=0.0, type=float, help=\"Label smoothing weight\"\n    )\n    # translation options to compute BLEU\n    parser.add_argument(\n        \"--report-bleu\",\n        default=True,\n        action=\"store_true\",\n        help=\"Compute BLEU on development set\",\n    )\n    parser.add_argument(\"--nbest\", type=int, default=1, help=\"Output N-best hypotheses\")\n    parser.add_argument(\"--beam-size\", type=int, default=4, help=\"Beam size\")\n    parser.add_argument(\"--penalty\", default=0.0, type=float, help=\"Insertion penalty\")\n    parser.add_argument(\n        \"--maxlenratio\",\n        default=0.0,\n        type=float,\n        help=\"\"\"Input length ratio to obtain max output length.\n                        If maxlenratio=0.0 (default), it uses an end-detect function\n                        to automatically find maximum hypothesis lengths\"\"\",\n    )\n    parser.add_argument(\n        \"--minlenratio\",\n        default=0.0,\n        type=float,\n        help=\"Input length ratio to obtain min output length\",\n    )\n    parser.add_argument(\n        \"--rnnlm\", type=str, default=None, help=\"RNNLM model file to read\"\n    )\n    parser.add_argument(\n        \"--rnnlm-conf\", type=str, default=None, help=\"RNNLM model config file to read\"\n    )\n    parser.add_argument(\"--lm-weight\", default=0.0, 
type=float, help=\"RNNLM weight.\")\n    parser.add_argument(\"--sym-space\", default=\"<space>\", type=str, help=\"Space symbol\")\n    parser.add_argument(\"--sym-blank\", default=\"<blank>\", type=str, help=\"Blank symbol\")\n    # minibatch related\n    parser.add_argument(\n        \"--sortagrad\",\n        default=0,\n        type=int,\n        nargs=\"?\",\n        help=\"How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs\",\n    )\n    parser.add_argument(\n        \"--batch-count\",\n        default=\"auto\",\n        choices=BATCH_COUNT_CHOICES,\n        help=\"How to count batch_size. \"\n        \"The default (auto) will find how to count by args.\",\n    )\n    parser.add_argument(\n        \"--batch-size\",\n        \"--batch-seqs\",\n        \"-b\",\n        default=0,\n        type=int,\n        help=\"Maximum seqs in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-bins\",\n        default=0,\n        type=int,\n        help=\"Maximum bins in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-in\",\n        default=0,\n        type=int,\n        help=\"Maximum input frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-out\",\n        default=0,\n        type=int,\n        help=\"Maximum output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-inout\",\n        default=0,\n        type=int,\n        help=\"Maximum input+output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--maxlen-in\",\n        \"--batch-seq-maxlen-in\",\n        default=100,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the input sequence length > ML.\",\n    )\n    parser.add_argument(\n        \"--maxlen-out\",\n        \"--batch-seq-maxlen-out\",\n        default=100,\n        
type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the output sequence length > ML\",\n    )\n    parser.add_argument(\n        \"--n-iter-processes\",\n        default=0,\n        type=int,\n        help=\"Number of processes of iterator\",\n    )\n    # optimization related\n    parser.add_argument(\n        \"--opt\",\n        default=\"adadelta\",\n        type=str,\n        choices=[\"adadelta\", \"adam\", \"noam\"],\n        help=\"Optimizer\",\n    )\n    parser.add_argument(\n        \"--accum-grad\", default=1, type=int, help=\"Number of gradient accumulation steps\"\n    )\n    parser.add_argument(\n        \"--eps\", default=1e-8, type=float, help=\"Epsilon constant for optimizer\"\n    )\n    parser.add_argument(\n        \"--eps-decay\", default=0.01, type=float, help=\"Decaying ratio of epsilon\"\n    )\n    parser.add_argument(\n        \"--lr\", default=1e-3, type=float, help=\"Learning rate for optimizer\"\n    )\n    parser.add_argument(\n        \"--lr-decay\", default=1.0, type=float, help=\"Decaying ratio of learning rate\"\n    )\n    parser.add_argument(\n        \"--weight-decay\", default=0.0, type=float, help=\"Weight decay ratio\"\n    )\n    parser.add_argument(\n        \"--criterion\",\n        default=\"acc\",\n        type=str,\n        choices=[\"loss\", \"acc\"],\n        help=\"Criterion to perform epsilon decay\",\n    )\n    parser.add_argument(\n        \"--threshold\", default=1e-4, type=float, help=\"Threshold to stop iteration\"\n    )\n    parser.add_argument(\n        \"--epochs\", \"-e\", default=30, type=int, help=\"Maximum number of epochs\"\n    )\n    parser.add_argument(\n        \"--early-stop-criterion\",\n        default=\"validation/main/acc\",\n        type=str,\n        nargs=\"?\",\n        help=\"Value to monitor to trigger an early stopping of the training\",\n    )\n    parser.add_argument(\n        \"--patience\",\n        default=3,\n        
type=int,\n        nargs=\"?\",\n        help=\"Number of epochs to wait \"\n        \"without improvement before stopping the training\",\n    )\n    parser.add_argument(\n        \"--grad-clip\", default=5, type=float, help=\"Gradient norm threshold to clip\"\n    )\n    parser.add_argument(\n        \"--num-save-attention\",\n        default=3,\n        type=int,\n        help=\"Number of samples of attention to be saved\",\n    )\n    # decoder related\n    parser.add_argument(\n        \"--context-residual\",\n        default=False,\n        type=strtobool,\n        nargs=\"?\",\n        help=\"The flag to switch to use context vector residual in the decoder network\",\n    )\n    parser.add_argument(\n        \"--tie-src-tgt-embedding\",\n        default=False,\n        type=strtobool,\n        nargs=\"?\",\n        help=\"Tie parameters of source embedding and target embedding.\",\n    )\n    parser.add_argument(\n        \"--tie-classifier\",\n        default=False,\n        type=strtobool,\n        nargs=\"?\",\n        help=\"Tie parameters of target embedding and output projection layer.\",\n    )\n    # finetuning related\n    parser.add_argument(\n        \"--enc-init\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Pre-trained ASR model to initialize encoder.\",\n    )\n    parser.add_argument(\n        \"--enc-init-mods\",\n        default=\"enc.enc.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of encoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--dec-init\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Pre-trained ASR, MT or LM model to initialize decoder.\",\n    )\n    parser.add_argument(\n        \"--dec-init-mods\",\n        default=\"att., dec.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of decoder modules to 
initialize, separated by a comma.\",\n    )\n    # multilingual related\n    parser.add_argument(\n        \"--multilingual\",\n        default=False,\n        type=strtobool,\n        help=\"Prepend target language ID to the source sentence. \"\n        \"Both source/target language IDs must be prepended in the pre-processing stage.\",\n    )\n    parser.add_argument(\n        \"--replace-sos\",\n        default=False,\n        type=strtobool,\n        help=\"Replace <sos> in the decoder with a target language ID \"\n        \"(the first token in the target sequence)\",\n    )\n\n    return parser\n\n\ndef main(cmd_args):\n    \"\"\"Run the main training function.\"\"\"\n    parser = get_parser()\n    args, _ = parser.parse_known_args(cmd_args)\n    if args.backend == \"chainer\" and args.train_dtype != \"float32\":\n        raise NotImplementedError(\n            f\"chainer backend does not support --train-dtype {args.train_dtype}.\"\n            \" Use --train-dtype float32.\"\n        )\n    if args.ngpu == 0 and args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\", \"float16\"):\n        raise ValueError(\n            f\"--train-dtype {args.train_dtype} does not support the CPU backend.\"\n        )\n\n    from espnet.utils.dynamic_import import dynamic_import\n\n    if args.model_module is None:\n        model_module = \"espnet.nets.\" + args.backend + \"_backend.e2e_mt:E2E\"\n    else:\n        model_module = args.model_module\n    model_class = dynamic_import(model_module)\n    model_class.add_arguments(parser)\n\n    args = parser.parse_args(cmd_args)\n    args.model_module = model_module\n    if \"chainer_backend\" in args.model_module:\n        args.backend = \"chainer\"\n    if \"pytorch_backend\" in args.model_module:\n        args.backend = \"pytorch\"\n\n    # add version info in args\n    args.version = __version__\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s 
(%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # If --ngpu is not given,\n    #   1. if CUDA_VISIBLE_DEVICES is set, all visible devices\n    #   2. if nvidia-smi exists, use all devices\n    #   3. else ngpu=0\n    if args.ngpu is None:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is not None:\n            ngpu = len(cvd.split(\",\"))\n        else:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n            try:\n                p = subprocess.run(\n                    [\"nvidia-smi\", \"-L\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE\n                )\n            except (subprocess.CalledProcessError, FileNotFoundError):\n                ngpu = 0\n            else:\n                # nvidia-smi -L prints one line per GPU on stdout\n                ngpu = len(p.stdout.decode().split(\"\\n\")) - 1\n        args.ngpu = ngpu\n    else:\n        if is_torch_1_2_plus and args.ngpu != 1:\n            logging.debug(\n                \"There are some bugs with multi-GPU processing in PyTorch 1.2+\"\n                + \" (see https://github.com/pytorch/pytorch/issues/21108)\"\n            )\n        ngpu = args.ngpu\n    logging.info(f\"ngpu: {ngpu}\")\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # set random seed\n    logging.info(\"random seed = %d\" % args.seed)\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n\n    # load dictionary for debug log\n    if args.dict is not None:\n        with open(args.dict, \"rb\") as f:\n            dictionary = f.readlines()\n        char_list = [entry.decode(\"utf-8\").split(\" \")[0] for entry in dictionary]\n        char_list.insert(0, \"<blank>\")\n        char_list.append(\"<eos>\")\n        args.char_list = 
char_list\n    else:\n        args.char_list = None\n\n    # train\n    logging.info(\"backend = \" + args.backend)\n\n    if args.backend == \"pytorch\":\n        from espnet.mt.pytorch_backend.mt import train\n\n        train(args)\n    else:\n        raise ValueError(\"Only the pytorch backend is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/mt_trans.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Neural machine translation model decoding script.\"\"\"\n\nimport configargparse\nimport logging\nimport os\nimport random\nimport sys\n\nimport numpy as np\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get default arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Translate text from speech \"\n        \"using a speech translation model on one CPU or GPU\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"Config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"Second config file path that overwrites the settings in `--config`\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"Third config file path \"\n        \"that overwrites the settings in `--config` and `--config2`\",\n    )\n\n    parser.add_argument(\"--ngpu\", type=int, default=0, help=\"Number of GPUs\")\n    parser.add_argument(\n        \"--dtype\",\n        choices=(\"float16\", \"float32\", \"float64\"),\n        default=\"float32\",\n        help=\"Float precision (only available in --api v2)\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        type=str,\n        default=\"chainer\",\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", type=int, default=1, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", type=int, default=1, help=\"Random seed\")\n    parser.add_argument(\"--verbose\", \"-V\", type=int, default=1, help=\"Verbose option\")\n    parser.add_argument(\n   
     \"--batchsize\",\n        type=int,\n        default=1,\n        help=\"Batch size for beam search (0: means no batch processing)\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"--api\",\n        default=\"v1\",\n        choices=[\"v1\", \"v2\"],\n        help=\"Beam search APIs \"\n        \"v1: Default API. It only supports \"\n        \"the ASRInterface.recognize method and DefaultRNNLM. \"\n        \"v2: Experimental API. \"\n        \"It supports any models that implements ScorerInterface.\",\n    )\n    # task related\n    parser.add_argument(\n        \"--trans-json\", type=str, help=\"Filename of translation data (json)\"\n    )\n    parser.add_argument(\n        \"--result-label\",\n        type=str,\n        required=True,\n        help=\"Filename of result label data (json)\",\n    )\n    # model (parameter) related\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    parser.add_argument(\n        \"--model-conf\", type=str, default=None, help=\"Model config file\"\n    )\n    # search related\n    parser.add_argument(\"--nbest\", type=int, default=1, help=\"Output N-best hypotheses\")\n    parser.add_argument(\"--beam-size\", type=int, default=1, help=\"Beam size\")\n    parser.add_argument(\"--penalty\", type=float, default=0.1, help=\"Incertion penalty\")\n    parser.add_argument(\n        \"--maxlenratio\",\n        type=float,\n        default=3.0,\n        help=\"\"\"Input length ratio to obtain max output length.\n                        If maxlenratio=0.0 (default), it uses a end-detect function\n                        to automatically find maximum hypothesis lengths\"\"\",\n    )\n    parser.add_argument(\n        \"--minlenratio\",\n        type=float,\n        default=0.0,\n        help=\"Input 
length ratio to obtain min output length\",\n    )\n    # multilingual related\n    parser.add_argument(\n        \"--tgt-lang\",\n        default=False,\n        type=str,\n        help=\"target language ID (e.g., <en>, <de>, and <fr> etc.)\",\n    )\n    return parser\n\n\ndef main(args):\n    \"\"\"Run the main decoding function.\"\"\"\n    parser = get_parser()\n    args = parser.parse_args(args)\n\n    # logging info\n    if args.verbose == 1:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    elif args.verbose == 2:\n        logging.basicConfig(\n            level=logging.DEBUG,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # check CUDA_VISIBLE_DEVICES\n    if args.ngpu > 0:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n        elif args.ngpu != len(cvd.split(\",\")):\n            logging.error(\"#gpus is not matched with CUDA_VISIBLE_DEVICES.\")\n            sys.exit(1)\n\n        # TODO(mn5k): support of multiple GPUs\n        if args.ngpu > 1:\n            logging.error(\"The program only supports ngpu=1.\")\n            sys.exit(1)\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # seed setting\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    logging.info(\"set random seed = %d\" % args.seed)\n\n    # trans\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"pytorch\":\n        # Experimental API that supports custom LMs\n        from 
espnet.mt.pytorch_backend.mt import trans\n\n        if args.dtype != \"float32\":\n            raise NotImplementedError(\n                f\"`--dtype {args.dtype}` is only available with `--api v2`\"\n            )\n        trans(args)\n    else:\n        raise ValueError(\"Only the pytorch backend is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/st_train.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"End-to-end speech translation model training script.\"\"\"\n\nfrom distutils.version import LooseVersion\nimport logging\nimport os\nimport random\nimport subprocess\nimport sys\n\nimport configargparse\nimport numpy as np\nimport torch\n\nfrom espnet import __version__\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.training.batchfy import BATCH_COUNT_CHOICES\n\nis_torch_1_2_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.2\")\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser(parser=None, required=True):\n    \"\"\"Get default arguments.\"\"\"\n    if parser is None:\n        parser = configargparse.ArgumentParser(\n            description=\"Train a speech translation (ST) model on one CPU, \"\n            \"one or multiple GPUs\",\n            config_file_parser_class=configargparse.YAMLConfigFileParser,\n            formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n        )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\n        \"--ngpu\",\n        default=None,\n        type=int,\n        help=\"Number of GPUs. 
If not given, use all visible devices\",\n    )\n    parser.add_argument(\n        \"--train-dtype\",\n        default=\"float32\",\n        choices=[\"float16\", \"float32\", \"float64\", \"O0\", \"O1\", \"O2\", \"O3\"],\n        help=\"Data type for training (only pytorch backend). \"\n        \"O0,O1,.. flags require apex. \"\n        \"See https://nvidia.github.io/apex/amp.html#opt-levels\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        default=\"chainer\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\n        \"--outdir\", type=str, required=required, help=\"Output directory\"\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--dict\", required=required, help=\"Dictionary\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\"--debugdir\", type=str, help=\"Output directory for debugging\")\n    parser.add_argument(\n        \"--resume\",\n        \"-r\",\n        default=\"\",\n        nargs=\"?\",\n        help=\"Resume the training from snapshot\",\n    )\n    parser.add_argument(\n        \"--minibatches\",\n        \"-N\",\n        type=int,\n        default=\"-1\",\n        help=\"Process only N minibatches (for debug)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--tensorboard-dir\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Tensorboard log dir path\",\n    )\n    parser.add_argument(\n        \"--report-interval-iters\",\n        default=100,\n        type=int,\n        help=\"Report interval iterations\",\n    )\n    parser.add_argument(\n        \"--save-interval-iters\",\n        default=0,\n        type=int,\n        help=\"Save snapshot interval iterations\",\n    )\n    # task related\n    
parser.add_argument(\n        \"--train-json\",\n        type=str,\n        default=None,\n        help=\"Filename of train label data (json)\",\n    )\n    parser.add_argument(\n        \"--valid-json\",\n        type=str,\n        default=None,\n        help=\"Filename of validation label data (json)\",\n    )\n    # network architecture\n    parser.add_argument(\n        \"--model-module\",\n        type=str,\n        default=None,\n        help=\"model defined module (default: espnet.nets.xxx_backend.e2e_st:E2E)\",\n    )\n    # loss related\n    parser.add_argument(\n        \"--ctc_type\",\n        default=\"warpctc\",\n        type=str,\n        choices=[\"builtin\", \"warpctc\", \"gtnctc\", \"cudnnctc\"],\n        help=\"Type of CTC implementation to calculate loss.\",\n    )\n    parser.add_argument(\n        \"--mtlalpha\",\n        default=0.0,\n        type=float,\n        help=\"Multitask learning coefficient, alpha: \\\n                                alpha*ctc_loss + (1-alpha)*att_loss\",\n    )\n    parser.add_argument(\n        \"--asr-weight\",\n        default=0.0,\n        type=float,\n        help=\"Multitask learning coefficient for ASR task, weight: \"\n        \" asr_weight*(alpha*ctc_loss + (1-alpha)*att_loss)\"\n        \" + (1-asr_weight-mt_weight)*st_loss\",\n    )\n    parser.add_argument(\n        \"--mt-weight\",\n        default=0.0,\n        type=float,\n        help=\"Multitask learning coefficient for MT task, weight: \\\n                                mt_weight*mt_loss + (1-mt_weight-asr_weight)*st_loss\",\n    )\n    parser.add_argument(\n        \"--lsm-weight\", default=0.0, type=float, help=\"Label smoothing weight\"\n    )\n    # recognition options to compute CER/WER\n    parser.add_argument(\n        \"--report-cer\",\n        default=False,\n        action=\"store_true\",\n        help=\"Compute CER on development set\",\n    )\n    parser.add_argument(\n        \"--report-wer\",\n        default=False,\n        
action=\"store_true\",\n        help=\"Compute WER on development set\",\n    )\n    # translation options to compute BLEU\n    parser.add_argument(\n        \"--report-bleu\",\n        default=True,\n        action=\"store_true\",\n        help=\"Compute BLEU on development set\",\n    )\n    parser.add_argument(\"--nbest\", type=int, default=1, help=\"Output N-best hypotheses\")\n    parser.add_argument(\"--beam-size\", type=int, default=4, help=\"Beam size\")\n    parser.add_argument(\"--penalty\", default=0.0, type=float, help=\"Insertion penalty\")\n    parser.add_argument(\n        \"--maxlenratio\",\n        default=0.0,\n        type=float,\n        help=\"\"\"Input length ratio to obtain max output length.\n                        If maxlenratio=0.0 (default), it uses an end-detect function\n                        to automatically find maximum hypothesis lengths\"\"\",\n    )\n    parser.add_argument(\n        \"--minlenratio\",\n        default=0.0,\n        type=float,\n        help=\"Input length ratio to obtain min output length\",\n    )\n    parser.add_argument(\n        \"--rnnlm\", type=str, default=None, help=\"RNNLM model file to read\"\n    )\n    parser.add_argument(\n        \"--rnnlm-conf\", type=str, default=None, help=\"RNNLM model config file to read\"\n    )\n    parser.add_argument(\"--lm-weight\", default=0.0, type=float, help=\"RNNLM weight.\")\n    parser.add_argument(\"--sym-space\", default=\"<space>\", type=str, help=\"Space symbol\")\n    parser.add_argument(\"--sym-blank\", default=\"<blank>\", type=str, help=\"Blank symbol\")\n    # minibatch related\n    parser.add_argument(\n        \"--sortagrad\",\n        default=0,\n        type=int,\n        nargs=\"?\",\n        help=\"How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs\",\n    )\n    parser.add_argument(\n        \"--batch-count\",\n        default=\"auto\",\n        choices=BATCH_COUNT_CHOICES,\n        help=\"How to count batch_size. 
\"\n        \"The default (auto) will find how to count by args.\",\n    )\n    parser.add_argument(\n        \"--batch-size\",\n        \"--batch-seqs\",\n        \"-b\",\n        default=0,\n        type=int,\n        help=\"Maximum seqs in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-bins\",\n        default=0,\n        type=int,\n        help=\"Maximum bins in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-in\",\n        default=0,\n        type=int,\n        help=\"Maximum input frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-out\",\n        default=0,\n        type=int,\n        help=\"Maximum output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-inout\",\n        default=0,\n        type=int,\n        help=\"Maximum input+output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--maxlen-in\",\n        \"--batch-seq-maxlen-in\",\n        default=800,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, batch size is reduced \"\n        \"if the input sequence length > ML.\",\n    )\n    parser.add_argument(\n        \"--maxlen-out\",\n        \"--batch-seq-maxlen-out\",\n        default=150,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the output sequence length > ML\",\n    )\n    parser.add_argument(\n        \"--n-iter-processes\",\n        default=0,\n        type=int,\n        help=\"Number of processes of iterator\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        nargs=\"?\",\n        help=\"The configuration file for the pre-processing\",\n    )\n    # optimization related\n    parser.add_argument(\n        \"--opt\",\n        default=\"adadelta\",\n        
type=str,\n        choices=[\"adadelta\", \"adam\", \"noam\"],\n        help=\"Optimizer\",\n    )\n    parser.add_argument(\n        \"--accum-grad\", default=1, type=int, help=\"Number of gradient accumulation\"\n    )\n    parser.add_argument(\n        \"--eps\", default=1e-8, type=float, help=\"Epsilon constant for optimizer\"\n    )\n    parser.add_argument(\n        \"--eps-decay\", default=0.01, type=float, help=\"Decaying ratio of epsilon\"\n    )\n    parser.add_argument(\n        \"--lr\", default=1e-3, type=float, help=\"Learning rate for optimizer\"\n    )\n    parser.add_argument(\n        \"--lr-decay\", default=1.0, type=float, help=\"Decaying ratio of learning rate\"\n    )\n    parser.add_argument(\n        \"--weight-decay\", default=0.0, type=float, help=\"Weight decay ratio\"\n    )\n    parser.add_argument(\n        \"--criterion\",\n        default=\"acc\",\n        type=str,\n        choices=[\"loss\", \"acc\"],\n        help=\"Criterion to perform epsilon decay\",\n    )\n    parser.add_argument(\n        \"--threshold\", default=1e-4, type=float, help=\"Threshold to stop iteration\"\n    )\n    parser.add_argument(\n        \"--epochs\", \"-e\", default=30, type=int, help=\"Maximum number of epochs\"\n    )\n    parser.add_argument(\n        \"--early-stop-criterion\",\n        default=\"validation/main/acc\",\n        type=str,\n        nargs=\"?\",\n        help=\"Value to monitor to trigger an early stopping of the training\",\n    )\n    parser.add_argument(\n        \"--patience\",\n        default=3,\n        type=int,\n        nargs=\"?\",\n        help=\"Number of epochs to wait \"\n        \"without improvement before stopping the training\",\n    )\n    parser.add_argument(\n        \"--grad-clip\", default=5, type=float, help=\"Gradient norm threshold to clip\"\n    )\n    parser.add_argument(\n        \"--num-save-attention\",\n        default=3,\n        type=int,\n        help=\"Number of samples of attention to be saved\",\n  
  )\n    parser.add_argument(\n        \"--num-save-ctc\",\n        default=3,\n        type=int,\n        help=\"Number of samples of CTC probability to be saved\",\n    )\n    parser.add_argument(\n        \"--grad-noise\",\n        type=strtobool,\n        default=False,\n        help=\"The flag to switch to use noise injection to gradients during training\",\n    )\n    # speech translation related\n    parser.add_argument(\n        \"--context-residual\",\n        default=False,\n        type=strtobool,\n        nargs=\"?\",\n        help=\"The flag to switch to use context vector residual in the decoder network\",\n    )\n    # finetuning related\n    parser.add_argument(\n        \"--enc-init\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Pre-trained ASR model to initialize encoder.\",\n    )\n    parser.add_argument(\n        \"--enc-init-mods\",\n        default=\"enc.enc.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of encoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--dec-init\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Pre-trained ASR, MT or LM model to initialize decoder.\",\n    )\n    parser.add_argument(\n        \"--dec-init-mods\",\n        default=\"att., dec.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of decoder modules to initialize, separated by a comma.\",\n    )\n    # multilingual related\n    parser.add_argument(\n        \"--multilingual\",\n        default=False,\n        type=strtobool,\n        help=\"Prepend target language ID to the source sentence. 
\"\n        \" Both source/target language IDs must be prepend in the pre-processing stage.\",\n    )\n    parser.add_argument(\n        \"--replace-sos\",\n        default=False,\n        type=strtobool,\n        help=\"Replace <sos> in the decoder with a target language ID \\\n                              (the first token in the target sequence)\",\n    )\n    # Feature transform: Normalization\n    parser.add_argument(\n        \"--stats-file\",\n        type=str,\n        default=None,\n        help=\"The stats file for the feature normalization\",\n    )\n    parser.add_argument(\n        \"--apply-uttmvn\",\n        type=strtobool,\n        default=True,\n        help=\"Apply utterance level mean \" \"variance normalization.\",\n    )\n    parser.add_argument(\"--uttmvn-norm-means\", type=strtobool, default=True, help=\"\")\n    parser.add_argument(\"--uttmvn-norm-vars\", type=strtobool, default=False, help=\"\")\n    # Feature transform: Fbank\n    parser.add_argument(\n        \"--fbank-fs\",\n        type=int,\n        default=16000,\n        help=\"The sample frequency used for \" \"the mel-fbank creation.\",\n    )\n    parser.add_argument(\n        \"--n-mels\", type=int, default=80, help=\"The number of mel-frequency bins.\"\n    )\n    parser.add_argument(\"--fbank-fmin\", type=float, default=0.0, help=\"\")\n    parser.add_argument(\"--fbank-fmax\", type=float, default=None, help=\"\")\n    return parser\n\n\ndef main(cmd_args):\n    \"\"\"Run the main training function.\"\"\"\n    parser = get_parser()\n    args, _ = parser.parse_known_args(cmd_args)\n    if args.backend == \"chainer\" and args.train_dtype != \"float32\":\n        raise NotImplementedError(\n            f\"chainer backend does not support --train-dtype {args.train_dtype}.\"\n            \"Use --dtype float32.\"\n        )\n    if args.ngpu == 0 and args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\", \"float16\"):\n        raise ValueError(\n            f\"--train-dtype 
{args.train_dtype} does not support the CPU backend.\"\n        )\n\n    from espnet.utils.dynamic_import import dynamic_import\n\n    if args.model_module is None:\n        model_module = \"espnet.nets.\" + args.backend + \"_backend.e2e_st:E2E\"\n    else:\n        model_module = args.model_module\n    model_class = dynamic_import(model_module)\n    model_class.add_arguments(parser)\n\n    args = parser.parse_args(cmd_args)\n    args.model_module = model_module\n    if \"chainer_backend\" in args.model_module:\n        args.backend = \"chainer\"\n    if \"pytorch_backend\" in args.model_module:\n        args.backend = \"pytorch\"\n\n    # add version info in args\n    args.version = __version__\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # If --ngpu is not given,\n    #   1. if CUDA_VISIBLE_DEVICES is set, all visible devices\n    #   2. if nvidia-smi exists, use all devices\n    #   3. 
else ngpu=0\n    if args.ngpu is None:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is not None:\n            ngpu = len(cvd.split(\",\"))\n        else:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n            try:\n                p = subprocess.run(\n                    [\"nvidia-smi\", \"-L\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE\n                )\n            except (subprocess.CalledProcessError, FileNotFoundError):\n                ngpu = 0\n            else:\n                ngpu = len(p.stdout.decode().split(\"\\n\")) - 1\n        args.ngpu = ngpu\n    else:\n        if is_torch_1_2_plus and args.ngpu != 1:\n            logging.debug(\n                \"There are some bugs with multi-GPU processing in PyTorch 1.2+\"\n                + \" (see https://github.com/pytorch/pytorch/issues/21108)\"\n            )\n        ngpu = args.ngpu\n    logging.info(f\"ngpu: {ngpu}\")\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # set random seed\n    logging.info(\"random seed = %d\" % args.seed)\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n\n    # load dictionary for debug log\n    if args.dict is not None:\n        with open(args.dict, \"rb\") as f:\n            dictionary = f.readlines()\n        char_list = [entry.decode(\"utf-8\").split(\" \")[0] for entry in dictionary]\n        char_list.insert(0, \"<blank>\")\n        char_list.append(\"<eos>\")\n        args.char_list = char_list\n    else:\n        args.char_list = None\n\n    # train\n    logging.info(\"backend = \" + args.backend)\n\n    if args.backend == \"pytorch\":\n        from espnet.st.pytorch_backend.st import train\n\n        train(args)\n    else:\n        raise ValueError(\"Only the pytorch backend is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/st_trans.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"End-to-end speech translation model decoding script.\"\"\"\n\nimport logging\nimport os\nimport random\nimport sys\n\nimport configargparse\nimport numpy as np\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get default arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Translate text from speech using a speech translation \"\n        \"model on one CPU or GPU\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"Config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"Second config file path that overwrites the settings in `--config`\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"Third config file path that overwrites \"\n        \"the settings in `--config` and `--config2`\",\n    )\n\n    parser.add_argument(\"--ngpu\", type=int, default=0, help=\"Number of GPUs\")\n    parser.add_argument(\n        \"--dtype\",\n        choices=(\"float16\", \"float32\", \"float64\"),\n        default=\"float32\",\n        help=\"Float precision (only available in --api v2)\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        type=str,\n        default=\"chainer\",\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", type=int, default=1, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", type=int, default=1, help=\"Random seed\")\n    parser.add_argument(\"--verbose\", \"-V\", type=int, default=1, help=\"Verbose option\")\n    
parser.add_argument(\n        \"--batchsize\",\n        type=int,\n        default=1,\n        help=\"Batch size for beam search (0: means no batch processing)\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"--api\",\n        default=\"v1\",\n        choices=[\"v1\", \"v2\"],\n        help=\"Beam search APIs \"\n        \"v1: Default API. \"\n        \"It only supports the ASRInterface.recognize method and DefaultRNNLM. \"\n        \"v2: Experimental API. \"\n        \"It supports any models that implements ScorerInterface.\",\n    )\n    # task related\n    parser.add_argument(\n        \"--trans-json\", type=str, help=\"Filename of translation data (json)\"\n    )\n    parser.add_argument(\n        \"--result-label\",\n        type=str,\n        required=True,\n        help=\"Filename of result label data (json)\",\n    )\n    # model (parameter) related\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    # search related\n    parser.add_argument(\"--nbest\", type=int, default=1, help=\"Output N-best hypotheses\")\n    parser.add_argument(\"--beam-size\", type=int, default=1, help=\"Beam size\")\n    parser.add_argument(\"--penalty\", type=float, default=0.0, help=\"Insertion penalty\")\n    parser.add_argument(\n        \"--maxlenratio\",\n        type=float,\n        default=0.0,\n        help=\"\"\"Input length ratio to obtain max output length.\n                        If maxlenratio=0.0 (default), it uses an end-detect function\n                        to automatically find maximum hypothesis lengths\"\"\",\n    )\n    parser.add_argument(\n        \"--minlenratio\",\n        type=float,\n        default=0.0,\n        help=\"Input length ratio to obtain min output length\",\n    )\n    # multilingual related\n    
parser.add_argument(\n        \"--tgt-lang\",\n        default=False,\n        type=str,\n        help=\"target language ID (e.g., <en>, <de>, and <fr> etc.)\",\n    )\n    return parser\n\n\ndef main(args):\n    \"\"\"Run the main decoding function.\"\"\"\n    parser = get_parser()\n    args = parser.parse_args(args)\n\n    # logging info\n    if args.verbose == 1:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    elif args.verbose == 2:\n        logging.basicConfig(\n            level=logging.DEBUG,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # check CUDA_VISIBLE_DEVICES\n    if args.ngpu > 0:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n        elif args.ngpu != len(cvd.split(\",\")):\n            logging.error(\"#gpus is not matched with CUDA_VISIBLE_DEVICES.\")\n            sys.exit(1)\n\n        # TODO(mn5k): support of multiple GPUs\n        if args.ngpu > 1:\n            logging.error(\"The program only supports ngpu=1.\")\n            sys.exit(1)\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # seed setting\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n    logging.info(\"set random seed = %d\" % args.seed)\n\n    # trans\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"pytorch\":\n        # Experimental API that supports custom LMs\n        from espnet.st.pytorch_backend.st import trans\n\n        if args.dtype != \"float32\":\n            raise 
NotImplementedError(\n                f\"`--dtype {args.dtype}` is only available with `--api v2`\"\n            )\n        trans(args)\n    else:\n        raise ValueError(\"Only the pytorch backend is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/tts_decode.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"TTS decoding script.\"\"\"\n\nimport configargparse\nimport logging\nimport os\nimport platform\nimport subprocess\nimport sys\n\nfrom espnet.utils.cli_utils import strtobool\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get parser of decoding arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Synthesize speech from text using a TTS model on one CPU\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites \"\n        \"the settings in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\"--ngpu\", default=0, type=int, help=\"Number of GPUs\")\n    parser.add_argument(\n        \"--backend\",\n        default=\"pytorch\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\"--out\", type=str, required=True, help=\"Output filename\")\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n 
   # task related\n    parser.add_argument(\n        \"--json\", type=str, required=True, help=\"Filename of train label data (json)\"\n    )\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    parser.add_argument(\n        \"--model-conf\", type=str, default=None, help=\"Model config file\"\n    )\n    # decoding related\n    parser.add_argument(\n        \"--maxlenratio\", type=float, default=5, help=\"Maximum length ratio in decoding\"\n    )\n    parser.add_argument(\n        \"--minlenratio\", type=float, default=0, help=\"Minimum length ratio in decoding\"\n    )\n    parser.add_argument(\n        \"--threshold\", type=float, default=0.5, help=\"Threshold value in decoding\"\n    )\n    parser.add_argument(\n        \"--use-att-constraint\",\n        type=strtobool,\n        default=False,\n        help=\"Whether to use the attention constraint\",\n    )\n    parser.add_argument(\n        \"--backward-window\",\n        type=int,\n        default=1,\n        help=\"Backward window size in the attention constraint\",\n    )\n    parser.add_argument(\n        \"--forward-window\",\n        type=int,\n        default=3,\n        help=\"Forward window size in the attention constraint\",\n    )\n    parser.add_argument(\n        \"--fastspeech-alpha\",\n        type=float,\n        default=1.0,\n        help=\"Alpha to change the speed for FastSpeech\",\n    )\n    # save related\n    parser.add_argument(\n        \"--save-durations\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to save durations converted from attentions\",\n    )\n    parser.add_argument(\n        \"--save-focus-rates\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to save focus rates of attentions\",\n    )\n    return parser\n\n\ndef main(args):\n    \"\"\"Run decoding.\"\"\"\n    parser = get_parser()\n    args = parser.parse_args(args)\n\n    # logging info\n    if 
args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # check CUDA_VISIBLE_DEVICES\n    if args.ngpu > 0:\n        # python 2 case\n        if platform.python_version_tuple()[0] == \"2\":\n            if \"clsp.jhu.edu\" in subprocess.check_output([\"hostname\", \"-f\"]):\n                cvd = subprocess.check_output(\n                    [\"/usr/local/bin/free-gpu\", \"-n\", str(args.ngpu)]\n                ).strip()\n                logging.info(\"CLSP: use gpu\" + cvd)\n                os.environ[\"CUDA_VISIBLE_DEVICES\"] = cvd\n        # python 3 case\n        else:\n            if \"clsp.jhu.edu\" in subprocess.check_output([\"hostname\", \"-f\"]).decode():\n                cvd = (\n                    subprocess.check_output(\n                        [\"/usr/local/bin/free-gpu\", \"-n\", str(args.ngpu)]\n                    )\n                    .decode()\n                    .strip()\n                )\n                logging.info(\"CLSP: use gpu\" + cvd)\n                os.environ[\"CUDA_VISIBLE_DEVICES\"] = cvd\n\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n        elif args.ngpu != len(cvd.split(\",\")):\n            logging.error(\"#gpus is not matched with CUDA_VISIBLE_DEVICES.\")\n            sys.exit(1)\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # extract\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"pytorch\":\n        from espnet.tts.pytorch_backend.tts import decode\n\n        
decode(args)\n    else:\n        raise NotImplementedError(\"Only pytorch is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/tts_train.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Text-to-speech model training script.\"\"\"\n\nimport logging\nimport os\nimport random\nimport subprocess\nimport sys\n\nimport configargparse\nimport numpy as np\n\nfrom espnet import __version__\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.training.batchfy import BATCH_COUNT_CHOICES\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get parser of training arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Train a new text-to-speech (TTS) model on one CPU, \"\n        \"one or multiple GPUs\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites \"\n        \"the settings in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\n        \"--ngpu\",\n        default=None,\n        type=int,\n        help=\"Number of GPUs. 
If not given, use all visible devices\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        default=\"pytorch\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--outdir\", type=str, required=True, help=\"Output directory\")\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\n        \"--resume\",\n        \"-r\",\n        default=\"\",\n        type=str,\n        nargs=\"?\",\n        help=\"Resume the training from snapshot\",\n    )\n    parser.add_argument(\n        \"--minibatches\",\n        \"-N\",\n        type=int,\n        default=\"-1\",\n        help=\"Process only N minibatches (for debug)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--tensorboard-dir\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Tensorboard log directory path\",\n    )\n    parser.add_argument(\n        \"--eval-interval-epochs\", default=1, type=int, help=\"Evaluation interval epochs\"\n    )\n    parser.add_argument(\n        \"--save-interval-epochs\", default=1, type=int, help=\"Save interval epochs\"\n    )\n    parser.add_argument(\n        \"--report-interval-iters\",\n        default=100,\n        type=int,\n        help=\"Report interval iterations\",\n    )\n    # task related\n    parser.add_argument(\n        \"--train-json\", type=str, required=True, help=\"Filename of training json\"\n    )\n    parser.add_argument(\n        \"--valid-json\", type=str, required=True, help=\"Filename of validation json\"\n    )\n    # network architecture\n    parser.add_argument(\n        \"--model-module\",\n        type=str,\n        default=\"espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2\",\n        
help=\"model defined module\",\n    )\n    # minibatch related\n    parser.add_argument(\n        \"--sortagrad\",\n        default=0,\n        type=int,\n        nargs=\"?\",\n        help=\"How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs\",\n    )\n    parser.add_argument(\n        \"--batch-sort-key\",\n        default=\"shuffle\",\n        type=str,\n        choices=[\"shuffle\", \"output\", \"input\"],\n        nargs=\"?\",\n        help='Batch sorting key. \"shuffle\" only works with --batch-count \"seq\".',\n    )\n    parser.add_argument(\n        \"--batch-count\",\n        default=\"auto\",\n        choices=BATCH_COUNT_CHOICES,\n        help=\"How to count batch_size. \"\n        \"The default (auto) will find how to count by args.\",\n    )\n    parser.add_argument(\n        \"--batch-size\",\n        \"--batch-seqs\",\n        \"-b\",\n        default=0,\n        type=int,\n        help=\"Maximum seqs in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-bins\",\n        default=0,\n        type=int,\n        help=\"Maximum bins in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-in\",\n        default=0,\n        type=int,\n        help=\"Maximum input frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-out\",\n        default=0,\n        type=int,\n        help=\"Maximum output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-inout\",\n        default=0,\n        type=int,\n        help=\"Maximum input+output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--maxlen-in\",\n        \"--batch-seq-maxlen-in\",\n        default=100,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the input sequence length > ML.\",\n    )\n    parser.add_argument(\n        
\"--maxlen-out\",\n        \"--batch-seq-maxlen-out\",\n        default=200,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the output sequence length > ML\",\n    )\n    parser.add_argument(\n        \"--num-iter-processes\",\n        default=0,\n        type=int,\n        help=\"Number of processes of iterator\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"--use-speaker-embedding\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to use speaker embedding\",\n    )\n    parser.add_argument(\n        \"--use-second-target\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to use second target\",\n    )\n    # optimization related\n    parser.add_argument(\n        \"--opt\", default=\"adam\", type=str, choices=[\"adam\", \"noam\"], help=\"Optimizer\"\n    )\n    parser.add_argument(\n        \"--accum-grad\", default=1, type=int, help=\"Number of gradient accumulation steps\"\n    )\n    parser.add_argument(\n        \"--lr\", default=1e-3, type=float, help=\"Learning rate for optimizer\"\n    )\n    parser.add_argument(\"--eps\", default=1e-6, type=float, help=\"Epsilon for optimizer\")\n    parser.add_argument(\n        \"--weight-decay\",\n        default=1e-6,\n        type=float,\n        help=\"Weight decay coefficient for optimizer\",\n    )\n    parser.add_argument(\n        \"--epochs\", \"-e\", default=30, type=int, help=\"Number of maximum epochs\"\n    )\n    parser.add_argument(\n        \"--early-stop-criterion\",\n        default=\"validation/main/loss\",\n        type=str,\n        nargs=\"?\",\n        help=\"Value to monitor to trigger an early stopping of the training\",\n    )\n    parser.add_argument(\n        \"--patience\",\n        default=3,\n        
type=int,\n        nargs=\"?\",\n        help=\"Number of epochs to wait \"\n        \"without improvement before stopping the training\",\n    )\n    parser.add_argument(\n        \"--grad-clip\", default=1, type=float, help=\"Gradient norm threshold to clip\"\n    )\n    parser.add_argument(\n        \"--num-save-attention\",\n        default=5,\n        type=int,\n        help=\"Number of samples of attention to be saved\",\n    )\n    parser.add_argument(\n        \"--keep-all-data-on-mem\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to keep all data on memory\",\n    )\n    # finetuning related\n    parser.add_argument(\n        \"--enc-init\",\n        default=None,\n        type=str,\n        help=\"Pre-trained TTS model path to initialize encoder.\",\n    )\n    parser.add_argument(\n        \"--enc-init-mods\",\n        default=\"enc.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of encoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--dec-init\",\n        default=None,\n        type=str,\n        help=\"Pre-trained TTS model path to initialize decoder.\",\n    )\n    parser.add_argument(\n        \"--dec-init-mods\",\n        default=\"dec.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of decoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--freeze-mods\",\n        default=None,\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of modules to freeze (not to train), separated by a comma.\",\n    )\n\n    return parser\n\n\ndef main(cmd_args):\n    \"\"\"Run training.\"\"\"\n    parser = get_parser()\n    args, _ = parser.parse_known_args(cmd_args)\n\n    from espnet.utils.dynamic_import import dynamic_import\n\n    model_class = dynamic_import(args.model_module)\n    assert 
issubclass(model_class, TTSInterface)\n    model_class.add_arguments(parser)\n    args = parser.parse_args(cmd_args)\n\n    # add version info in args\n    args.version = __version__\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # If --ngpu is not given,\n    #   1. if CUDA_VISIBLE_DEVICES is set, all visible devices\n    #   2. if nvidia-smi exists, use all devices\n    #   3. else ngpu=0\n    if args.ngpu is None:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is not None:\n            ngpu = len(cvd.split(\",\"))\n        else:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n            try:\n                p = subprocess.run(\n                    [\"nvidia-smi\", \"-L\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE\n                )\n            except (subprocess.CalledProcessError, FileNotFoundError):\n                ngpu = 0\n            else:\n                # nvidia-smi -L lists one GPU per line on stdout\n                ngpu = len(p.stdout.decode().split(\"\\n\")) - 1\n        args.ngpu = ngpu\n    else:\n        ngpu = args.ngpu\n    logging.info(f\"ngpu: {ngpu}\")\n\n    # set random seed\n    logging.info(\"random seed = %d\" % args.seed)\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n\n    if args.backend == \"pytorch\":\n        from espnet.tts.pytorch_backend.tts import train\n\n        train(args)\n    else:\n        raise NotImplementedError(\"Only pytorch is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/vc_decode.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"VC decoding script.\"\"\"\n\nimport configargparse\nimport logging\nimport os\nimport platform\nimport subprocess\nimport sys\n\nfrom espnet.utils.cli_utils import strtobool\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get parser of decoding arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Converting speech using a VC model on one CPU\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\"--ngpu\", default=0, type=int, help=\"Number of GPUs\")\n    parser.add_argument(\n        \"--backend\",\n        default=\"pytorch\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\"--out\", type=str, required=True, help=\"Output filename\")\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    # task 
related\n    parser.add_argument(\n        \"--json\", type=str, required=True, help=\"Filename of train label data (json)\"\n    )\n    parser.add_argument(\n        \"--model\", type=str, required=True, help=\"Model file parameters to read\"\n    )\n    parser.add_argument(\n        \"--model-conf\", type=str, default=None, help=\"Model config file\"\n    )\n    # decoding related\n    parser.add_argument(\n        \"--maxlenratio\", type=float, default=5, help=\"Maximum length ratio in decoding\"\n    )\n    parser.add_argument(\n        \"--minlenratio\", type=float, default=0, help=\"Minimum length ratio in decoding\"\n    )\n    parser.add_argument(\n        \"--threshold\", type=float, default=0.5, help=\"Threshold value in decoding\"\n    )\n    parser.add_argument(\n        \"--use-att-constraint\",\n        type=strtobool,\n        default=False,\n        help=\"Whether to use the attention constraint\",\n    )\n    parser.add_argument(\n        \"--backward-window\",\n        type=int,\n        default=1,\n        help=\"Backward window size in the attention constraint\",\n    )\n    parser.add_argument(\n        \"--forward-window\",\n        type=int,\n        default=3,\n        help=\"Forward window size in the attention constraint\",\n    )\n    # save related\n    parser.add_argument(\n        \"--save-durations\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to save durations converted from attentions\",\n    )\n    parser.add_argument(\n        \"--save-focus-rates\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to save focus rates of attentions\",\n    )\n    return parser\n\n\ndef main(args):\n    \"\"\"Run decoding.\"\"\"\n    parser = get_parser()\n    args = parser.parse_args(args)\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        
)\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # check CUDA_VISIBLE_DEVICES\n    if args.ngpu > 0:\n        # python 2 case\n        if platform.python_version_tuple()[0] == \"2\":\n            if \"clsp.jhu.edu\" in subprocess.check_output([\"hostname\", \"-f\"]):\n                cvd = subprocess.check_output(\n                    [\"/usr/local/bin/free-gpu\", \"-n\", str(args.ngpu)]\n                ).strip()\n                logging.info(\"CLSP: use gpu\" + cvd)\n                os.environ[\"CUDA_VISIBLE_DEVICES\"] = cvd\n        # python 3 case\n        else:\n            if \"clsp.jhu.edu\" in subprocess.check_output([\"hostname\", \"-f\"]).decode():\n                cvd = (\n                    subprocess.check_output(\n                        [\"/usr/local/bin/free-gpu\", \"-n\", str(args.ngpu)]\n                    )\n                    .decode()\n                    .strip()\n                )\n                logging.info(\"CLSP: use gpu\" + cvd)\n                os.environ[\"CUDA_VISIBLE_DEVICES\"] = cvd\n\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is None:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n        elif args.ngpu != len(cvd.split(\",\")):\n            logging.error(\"#gpus is not matched with CUDA_VISIBLE_DEVICES.\")\n            sys.exit(1)\n\n    # display PYTHONPATH\n    logging.info(\"python path = \" + os.environ.get(\"PYTHONPATH\", \"(None)\"))\n\n    # extract\n    logging.info(\"backend = \" + args.backend)\n    if args.backend == \"pytorch\":\n        from espnet.vc.pytorch_backend.vc import decode\n\n        decode(args)\n    else:\n        raise NotImplementedError(\"Only pytorch is supported.\")\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "bin/vc_train.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Voice conversion model training script.\"\"\"\n\nimport logging\nimport os\nimport random\nimport subprocess\nimport sys\n\nimport configargparse\nimport numpy as np\n\nfrom espnet import __version__\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.training.batchfy import BATCH_COUNT_CHOICES\n\n\n# NOTE: you need this func to generate our sphinx doc\ndef get_parser():\n    \"\"\"Get parser of training arguments.\"\"\"\n    parser = configargparse.ArgumentParser(\n        description=\"Train a new voice conversion (VC) model on one CPU, \"\n        \"one or multiple GPUs\",\n        config_file_parser_class=configargparse.YAMLConfigFileParser,\n        formatter_class=configargparse.ArgumentDefaultsHelpFormatter,\n    )\n\n    # general configuration\n    parser.add(\"--config\", is_config_file=True, help=\"config file path\")\n    parser.add(\n        \"--config2\",\n        is_config_file=True,\n        help=\"second config file path that overwrites the settings in `--config`.\",\n    )\n    parser.add(\n        \"--config3\",\n        is_config_file=True,\n        help=\"third config file path that overwrites the settings \"\n        \"in `--config` and `--config2`.\",\n    )\n\n    parser.add_argument(\n        \"--ngpu\",\n        default=None,\n        type=int,\n        help=\"Number of GPUs. 
If not given, use all visible devices\",\n    )\n    parser.add_argument(\n        \"--backend\",\n        default=\"pytorch\",\n        type=str,\n        choices=[\"chainer\", \"pytorch\"],\n        help=\"Backend library\",\n    )\n    parser.add_argument(\"--outdir\", type=str, required=True, help=\"Output directory\")\n    parser.add_argument(\"--debugmode\", default=1, type=int, help=\"Debugmode\")\n    parser.add_argument(\"--seed\", default=1, type=int, help=\"Random seed\")\n    parser.add_argument(\n        \"--resume\",\n        \"-r\",\n        default=\"\",\n        type=str,\n        nargs=\"?\",\n        help=\"Resume the training from snapshot\",\n    )\n    parser.add_argument(\n        \"--minibatches\",\n        \"-N\",\n        type=int,\n        default=\"-1\",\n        help=\"Process only N minibatches (for debug)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--tensorboard-dir\",\n        default=None,\n        type=str,\n        nargs=\"?\",\n        help=\"Tensorboard log directory path\",\n    )\n    parser.add_argument(\n        \"--eval-interval-epochs\",\n        default=100,\n        type=int,\n        help=\"Evaluation interval epochs\",\n    )\n    parser.add_argument(\n        \"--save-interval-epochs\", default=1, type=int, help=\"Save interval epochs\"\n    )\n    parser.add_argument(\n        \"--report-interval-iters\",\n        default=10,\n        type=int,\n        help=\"Report interval iterations\",\n    )\n    # task related\n    parser.add_argument(\"--srcspk\", type=str, help=\"Source speaker\")\n    parser.add_argument(\"--trgspk\", type=str, help=\"Target speaker\")\n    parser.add_argument(\n        \"--train-json\", type=str, required=True, help=\"Filename of training json\"\n    )\n    parser.add_argument(\n        \"--valid-json\", type=str, required=True, help=\"Filename of validation json\"\n    )\n\n    # network 
architecture\n    parser.add_argument(\n        \"--model-module\",\n        type=str,\n        default=\"espnet.nets.pytorch_backend.e2e_tts_tacotron2:Tacotron2\",\n        help=\"model defined module\",\n    )\n    # minibatch related\n    parser.add_argument(\n        \"--sortagrad\",\n        default=0,\n        type=int,\n        nargs=\"?\",\n        help=\"How many epochs to use sortagrad for. 0 = deactivated, -1 = all epochs\",\n    )\n    parser.add_argument(\n        \"--batch-sort-key\",\n        default=\"shuffle\",\n        type=str,\n        choices=[\"shuffle\", \"output\", \"input\"],\n        nargs=\"?\",\n        help='Batch sorting key. \"shuffle\" only works with --batch-count \"seq\".',\n    )\n    parser.add_argument(\n        \"--batch-count\",\n        default=\"auto\",\n        choices=BATCH_COUNT_CHOICES,\n        help=\"How to count batch_size. \"\n        \"The default (auto) will find how to count by args.\",\n    )\n    parser.add_argument(\n        \"--batch-size\",\n        \"--batch-seqs\",\n        \"-b\",\n        default=0,\n        type=int,\n        help=\"Maximum seqs in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-bins\",\n        default=0,\n        type=int,\n        help=\"Maximum bins in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-in\",\n        default=0,\n        type=int,\n        help=\"Maximum input frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-out\",\n        default=0,\n        type=int,\n        help=\"Maximum output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--batch-frames-inout\",\n        default=0,\n        type=int,\n        help=\"Maximum input+output frames in a minibatch (0 to disable)\",\n    )\n    parser.add_argument(\n        \"--maxlen-in\",\n        \"--batch-seq-maxlen-in\",\n        default=100,\n        type=int,\n        
metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the input sequence length > ML.\",\n    )\n    parser.add_argument(\n        \"--maxlen-out\",\n        \"--batch-seq-maxlen-out\",\n        default=200,\n        type=int,\n        metavar=\"ML\",\n        help=\"When --batch-count=seq, \"\n        \"batch size is reduced if the output sequence length > ML\",\n    )\n    parser.add_argument(\n        \"--num-iter-processes\",\n        default=0,\n        type=int,\n        help=\"Number of processes of iterator\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"--use-speaker-embedding\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to use speaker embedding\",\n    )\n    parser.add_argument(\n        \"--use-second-target\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to use second target\",\n    )\n    # optimization related\n    parser.add_argument(\n        \"--opt\",\n        default=\"adam\",\n        type=str,\n        choices=[\"adam\", \"noam\", \"lamb\"],\n        help=\"Optimizer\",\n    )\n    parser.add_argument(\n        \"--accum-grad\", default=1, type=int, help=\"Number of gradient accumulation steps\"\n    )\n    parser.add_argument(\n        \"--lr\", default=1e-3, type=float, help=\"Learning rate for optimizer\"\n    )\n    parser.add_argument(\"--eps\", default=1e-6, type=float, help=\"Epsilon for optimizer\")\n    parser.add_argument(\n        \"--weight-decay\",\n        default=1e-6,\n        type=float,\n        help=\"Weight decay coefficient for optimizer\",\n    )\n    parser.add_argument(\n        \"--epochs\", \"-e\", default=30, type=int, help=\"Number of maximum epochs\"\n    )\n    parser.add_argument(\n        \"--early-stop-criterion\",\n        
default=\"validation/main/loss\",\n        type=str,\n        nargs=\"?\",\n        help=\"Value to monitor to trigger an early stopping of the training\",\n    )\n    parser.add_argument(\n        \"--patience\",\n        default=3,\n        type=int,\n        nargs=\"?\",\n        help=\"Number of epochs to wait without improvement \"\n        \"before stopping the training\",\n    )\n    parser.add_argument(\n        \"--grad-clip\", default=1, type=float, help=\"Gradient norm threshold to clip\"\n    )\n    parser.add_argument(\n        \"--num-save-attention\",\n        default=5,\n        type=int,\n        help=\"Number of samples of attention to be saved\",\n    )\n    parser.add_argument(\n        \"--keep-all-data-on-mem\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to keep all data on memory\",\n    )\n\n    parser.add_argument(\n        \"--enc-init\",\n        default=None,\n        type=str,\n        help=\"Pre-trained model path to initialize encoder.\",\n    )\n    parser.add_argument(\n        \"--enc-init-mods\",\n        default=\"enc.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of encoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--dec-init\",\n        default=None,\n        type=str,\n        help=\"Pre-trained model path to initialize decoder.\",\n    )\n    parser.add_argument(\n        \"--dec-init-mods\",\n        default=\"dec.\",\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of decoder modules to initialize, separated by a comma.\",\n    )\n    parser.add_argument(\n        \"--freeze-mods\",\n        default=None,\n        type=lambda s: [str(mod) for mod in s.split(\",\") if s != \"\"],\n        help=\"List of modules to freeze (not to train), separated by a comma.\",\n    )\n\n    return parser\n\n\ndef main(cmd_args):\n    \"\"\"Run training.\"\"\"\n    
parser = get_parser()\n    args, _ = parser.parse_known_args(cmd_args)\n\n    from espnet.utils.dynamic_import import dynamic_import\n\n    model_class = dynamic_import(args.model_module)\n    assert issubclass(model_class, TTSInterface)\n    model_class.add_arguments(parser)\n    args = parser.parse_args(cmd_args)\n\n    # add version info in args\n    args.version = __version__\n\n    # logging info\n    if args.verbose > 0:\n        logging.basicConfig(\n            level=logging.INFO,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n    else:\n        logging.basicConfig(\n            level=logging.WARN,\n            format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n        )\n        logging.warning(\"Skip DEBUG/INFO messages\")\n\n    # If --ngpu is not given,\n    #   1. if CUDA_VISIBLE_DEVICES is set, all visible devices\n    #   2. if nvidia-smi exists, use all devices\n    #   3. else ngpu=0\n    if args.ngpu is None:\n        cvd = os.environ.get(\"CUDA_VISIBLE_DEVICES\")\n        if cvd is not None:\n            ngpu = len(cvd.split(\",\"))\n        else:\n            logging.warning(\"CUDA_VISIBLE_DEVICES is not set.\")\n            try:\n                p = subprocess.run(\n                    [\"nvidia-smi\", \"-L\"], stdout=subprocess.PIPE, stderr=subprocess.PIPE\n                )\n            except (subprocess.CalledProcessError, FileNotFoundError):\n                ngpu = 0\n            else:\n                # nvidia-smi -L lists one GPU per line on stdout\n                ngpu = len(p.stdout.decode().split(\"\\n\")) - 1\n        # store the resolved value so train() never sees ngpu=None\n        args.ngpu = ngpu\n    else:\n        ngpu = args.ngpu\n    logging.info(f\"ngpu: {ngpu}\")\n\n    # set random seed\n    logging.info(\"random seed = %d\" % args.seed)\n    random.seed(args.seed)\n    np.random.seed(args.seed)\n\n    if args.backend == \"pytorch\":\n        from espnet.vc.pytorch_backend.vc import train\n\n        train(args)\n    else:\n        raise NotImplementedError(\"Only pytorch is supported.\")\n\n\nif 
__name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/.gitignore",
    "content": "launch\nespnet-2021724\nsegment_aishell1\nword_ngram\n"
  },
  {
    "path": "egs/aishell1/.gitignore",
    "content": "dump\ndump32\ndump64\ndata\nexp\nfbank\n"
  },
  {
    "path": "egs/aishell1/aed.sh",
    "content": "#!/usr/bin/env bash\n\n# author: tyriontian\n# tyriontian@tencent.com\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\nresume=        # Resume the training from snapshot\ndebug=false\n\n# feature configuration\ndo_delta=false\n\npreprocess_config=conf/specaug.yaml\ntrain_config=conf/tuning/train_pytorch_conformer_kernel31.yaml\nlm_config=conf/lm.yaml\ndecode_config=conf/decode.yaml\n\n# rnnlm related\nlm_resume=         # specify a snapshot file to resume LM training\nlmtag=             # tag for managing LMs\n\n# ngram\nngramtag=\nn_gram=4\n\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\n\n# data\ndata=/data/asr_data/aishell/\ndata_url=www.openslr.org/resources/33\ndict=data/lang_1char/train_sp_units.txt\nlang=data/lang_phone\n\n### Configurable parameters ###\ntag=\"8v100_lasmmictc_alpha03_ctc03\"\nngpu=8\n\n# Train config\nseed=888\nbatch_size=8\naccum_grad=1\nepochs=100\nuse_segment=true # if true, use word-level transcription in MMI criterion\nctc_type=\"k2mmi\" # k2mmi | k2ctc | default\nmtlalpha=0.3\nthird_weight=0.3\n\n# MBR training config\naux_mbr=false\naux_mbr_weight=1.0\naux_mbr_beam=4\nmbr_epochs=100\nmbr_lr=0.1\nmbr_warmup=2500\nmbr_resume=\n\n# Decode config\nidx_average=41_50\nmmi_weight=0.0 # MMI / phonectc joint decoding\nctc_weight=0.5 # char ctc joint decoding\nngram_weight=0.0\nngram_order=4\nword_ngram_tag=word_3gram\nword_ngram_weight=0.0\nword_ngram_log_semiring=true\nlm_weight=0.0\nbeam_size=10\nmmi_rescore=false\nrecog_set=\"test dev\"\n\n. 
utils/parse_options.sh || exit 1;\n\nif [ $debug == true ]; then\n    export HOST_GPU_NUM=1\n    export HOST_NUM=1\n    export NODE_NUM=1\n    export INDEX=0\n    export CHIEF_IP=\"9.135.217.29\"\nfi\n\ntrain_opts=\\\n\"\\\n--seed $seed \\\n--batch-size $batch_size \\\n--accum-grad $accum_grad \\\n--epochs $epochs \\\n--use-segment $use_segment \\\n--ctc_type $ctc_type \\\n--mtlalpha $mtlalpha \\\n--third-weight $third_weight \\\n\"\n\nif [ $aux_mbr == true ]; then\n    train_opts=\"$train_opts \\\n                --aux-mbr $aux_mbr \\\n                --aux-mbr-weight $aux_mbr_weight \\\n                --aux-mbr-beam $aux_mbr_beam \\\n                --transformer-lr $mbr_lr \\\n                --epochs $mbr_epochs \\\n                --transformer-warmup-steps $mbr_warmup \\\n                --resume $mbr_resume \\\n                --load-trainer-and-opt false \\\n                --save-interval-iters 1000 \\\n                \"\n    export OMP_NUM_THREADS=6 # for on-the-fly decoding\nfi\n\ndecode_opts=\\\n\"\\\n--ctc-weight $ctc_weight \\\n--mmi-weight $mmi_weight \\\n--ngram-weight $ngram_weight \\\n--mmi-rescore $mmi_rescore \\\n--beam-size $beam_size \\\n--word-ngram data/${word_ngram_tag} \\\n--word-ngram-weight $word_ngram_weight \\\n--word-ngram-log-semiring $word_ngram_log_semiring \\\n--lm-weight $lm_weight \\\n\"\n\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 
'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\ntrain_set=train_sp\ntrain_dev=dev\n\nexpname=${train_set}_${backend}_${tag}\nexpdir=exp/${expname}\nmkdir -p ${expdir}\nfeat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}\nfeat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Network Training\"\n\n    # make sure in jizhi config file: \"exec_start_in_all_mpi_pods\": true, \n    MASTER_PORT=22277\n    NCCL_DEBUG=TRACE python3 -m torch.distributed.launch \\\n        --nproc_per_node ${HOST_GPU_NUM} --master_port $MASTER_PORT \\\n        --nnodes=${HOST_NUM} --node_rank=${INDEX} --master_addr=${CHIEF_IP} \\\n        ${MAIN_ROOT}/bin/asr_train.py \\\n        --config ${train_config} \\\n        --preprocess-conf ${preprocess_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --outdir ${expdir}/results_RANK \\\n        --debugmode ${debugmode} \\\n        --dict ${dict} \\\n        --debugdir ${expdir} \\\n        --minibatches ${N} \\\n        --verbose ${verbose} \\\n        --resume ${resume} \\\n        --train-json ${feat_tr_dir}/split${ngpu}utt/data_tiny.RANK.json \\\n        --valid-json ${feat_dt_dir}/data.json \\\n        --lang $lang \\\n        --opt \"noam_sgd\" \\\n        --n-iter-processes 8 \\\n        --world-size $ngpu \\\n        --node-rank ${INDEX} \\\n        --node-size ${HOST_GPU_NUM} \\\n        $train_opts > ${expdir}/global_record.${INDEX}.txt 2>&1\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Decoding\"\n    nj=500\n    if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} etype) = custom ]] || \\\n           [[ $(get_yaml.py ${train_config} dtype) = custom ]]; then\n        
recog_model=model.last${idx_average}.avg.best\n        echo ${expdir}/results_0/${recog_model}\n        average_checkpoints.py --backend ${backend} \\\n         \t\t       --snapshots ${expdir}/results_0/snapshot.ep.* \\\n        \t\t       --out ${expdir}/results_0/${recog_model} \\\n        \t\t       --num ${idx_average}\n    fi\n\n    decode_parent_dir=decode_mmi${mmi_weight}_${word_ngram_tag}${word_ngram_weight}_ctc${ctc_weight}_beam${beam_size}_${idx_average}\n    for rtask in ${recog_set}; do\n        decode_dir=$decode_parent_dir/$rtask\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n\n        # split data\n        splitjson.py --parts ${nj} ${feat_recog_dir}/data.json\n\n        #### use CPU for decoding\n        ngpu=0\n\n        ${decode_cmd} JOB=1:$nj ${expdir}/${decode_dir}/log/decode.JOB.log \\\n            asr_recog.py \\\n            --config ${decode_config} \\\n            --ngpu ${ngpu} \\\n            --backend ${backend} \\\n            --batchsize 0 \\\n            --recog-json ${feat_recog_dir}/split${nj}utt/data.JOB.json \\\n            --result-label ${expdir}/${decode_dir}/data.JOB.json \\\n            --model ${expdir}/results_0/${recog_model}  \\\n            --ngram-model exp/train_ngram/${ngram_order}gram.bin \\\n            --rnnlm exp/train_rnnlm_pytorch_lm_transformer/rnnlm.model.best \\\n            --rnnlm-conf exp/train_rnnlm_pytorch_lm_transformer/model.json \\\n            --local-rank JOB --api v2 \\\n            $decode_opts\n\n        score_sclite.sh ${expdir}/${decode_dir} ${dict} \\\n          > ${expdir}/${decode_dir}/decode_result.txt\n\n    done\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/aishell1/cmd.sh",
    "content": "# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======\n# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>\n# e.g.\n#   run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB\n#\n# Options:\n#   --time <time>: Limit the maximum time to execute.\n#   --mem <mem>: Limit the maximum memory usage.\n#   -–max-jobs-run <njob>: Limit the number parallel jobs. This is ignored for non-array jobs.\n#   --num-threads <ngpu>: Specify the number of CPU core.\n#   --gpu <ngpu>: Specify the number of GPU devices.\n#   --config: Change the configuration file from default.\n#\n# \"JOB=1:10\" is used for \"array jobs\" and it can control the number of parallel jobs.\n# The left string of \"=\", i.e. \"JOB\", is replaced by <N>(Nth job) in the command and the log file name,\n# e.g. \"echo JOB\" is changed to \"echo 3\" for the 3rd job and \"echo 8\" for 8th job respectively.\n# Note that the number must start with a positive number, so you can't use \"JOB=0:10\" for example.\n#\n# run.pl, queue.pl, slurm.pl, and ssh.pl have unified interface, not depending on its backend.\n# These options are mapping to specific options for each backend and\n# it is configured by \"conf/queue.conf\" and \"conf/slurm.conf\" by default.\n# If jobs failed, your configuration might be wrong for your environment.\n#\n#\n# The official documentaion for run.pl, queue.pl, slurm.pl, and ssh.pl:\n#   \"Parallelization in Kaldi\": http://kaldi-asr.org/doc/queue.html\n# =========================================================~\n\n\n# Select the backend used by run.sh from \"local\", \"sge\", \"slurm\", or \"ssh\"\ncmd_backend='local'\n\n# Local machine, without any Job scheduling system\nif [ \"${cmd_backend}\" = local ]; then\n\n    # The other usage\n    export train_cmd=\"run.pl\"\n    # Used for \"*_train.py\": \"--gpu\" is appended optionally by run.sh\n    export cuda_cmd=\"run.pl\"\n    # Used for \"*_recog.py\"\n    export decode_cmd=\"run.pl\"\n\n# \"qsub\" (SGE, Torque, PBS, 
etc.)\nelif [ \"${cmd_backend}\" = sge ]; then\n    # The default setting is written in conf/queue.conf.\n    # You must change \"-q g.q\" for the \"queue\" for your environment.\n    # To know the \"queue\" names, type \"qhost -q\"\n    # Note that to use \"--gpu *\", you have to set up \"complex_value\" for the system scheduler.\n\n    export train_cmd=\"queue.pl\"\n    export cuda_cmd=\"queue.pl\"\n    export decode_cmd=\"queue.pl\"\n\n# \"sbatch\" (Slurm)\nelif [ \"${cmd_backend}\" = slurm ]; then\n    # The default setting is written in conf/slurm.conf.\n    # You must change \"-p cpu\" and \"-p gpu\" for the \"partition\" for your environment.\n    # To know the \"partition\" names, type \"sinfo\".\n    # You can use \"--gpu * \" by default for slurm and it is interpreted as \"--gres gpu:*\"\n    # The devices are allocated exclusively using \"${CUDA_VISIBLE_DEVICES}\".\n\n    export train_cmd=\"slurm.pl\"\n    export cuda_cmd=\"slurm.pl\"\n    export decode_cmd=\"slurm.pl\"\n\nelif [ \"${cmd_backend}\" = ssh ]; then\n    # You have to create \".queue/machines\" to specify the host to execute jobs.\n    # e.g. .queue/machines\n    #   host1\n    #   host2\n    #   host3\n    # This assumes you can log in to them without a password, i.e. you have to set up SSH keys.\n\n    export train_cmd=\"ssh.pl\"\n    export cuda_cmd=\"ssh.pl\"\n    export decode_cmd=\"ssh.pl\"\n\n# This is an example of specifying several unique options in the JHU CLSP cluster setup.\n# Users can modify/add their own command options according to their cluster environments.\nelif [ \"${cmd_backend}\" = jhu ]; then\n\n    export train_cmd=\"queue.pl --mem 2G\"\n    export cuda_cmd=\"queue-freegpu.pl --mem 2G --gpu 1 --config conf/gpu.conf\"\n    export decode_cmd=\"queue.pl --mem 4G\"\n\nelse\n    echo \"$0: Error: Unknown cmd_backend=${cmd_backend}\" 1>&2\n    return 1\nfi\n"
  },
  {
    "path": "egs/aishell1/conf/fbank.conf",
    "content": "--sample-frequency=16000 \n--num-mel-bins=80\n"
  },
  {
    "path": "egs/aishell1/conf/gpu.conf",
    "content": "# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l 'hostname=b1[12345678]*|c*,gpu=$0' -q g.q"
  },
  {
    "path": "egs/aishell1/conf/lm.yaml",
    "content": "# rnnlm related\nlayer: 2\nunit: 650\nopt: sgd        # or adam\nbatchsize: 64   # batch size in LM training\nepoch: 20      # if the data size is large, we can reduce this\npatience: 3\nmaxlen: 100     # if sentence length > lm_maxlen, lm_batchsize is automatically reduced\n"
  },
  {
    "path": "egs/aishell1/conf/lm_rnn.yaml",
    "content": "lm.yaml"
  },
  {
    "path": "egs/aishell1/conf/lm_transformer.yaml",
    "content": "# This Transformer LM setting w/ 4 GPUs took around 60 days for 50 epochs.\n# However, you can get better results in 6 days for 5 epochs (WER: 2.2/5.4/2.6/5.7)\n# than LSTM LM (WER: 2.6/5.6/2.6/5.7) in 60 days for 20 epochs\n# And if you does not have 4 GPUs, try accum-grad=4.\n\n# network architecture\nmodel-module: transformer\natt-unit: 512\nembed-unit: 128\nhead: 8\nlayer: 16\npos-enc: none\nunit: 2048\n\n# minibatch related\nbatchsize: 32\nmaxlen: 40\n\n# optimization related\nopt: adam\nschedulers: lr=cosine\ndropout-rate: 0.0\nepoch: 50\ngradclip: 1.0\nlr: 1e-4\nlr-cosine-total: 100000\nlr-cosine-warmup: 1000\npatience: 0\nsortagrad: 0\n"
  },
  {
    "path": "egs/aishell1/conf/pitch.conf",
    "content": "--sample-frequency=16000\n"
  },
  {
    "path": "egs/aishell1/conf/queue.conf",
    "content": "# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l gpu=$0 -q g.q\n"
  },
  {
    "path": "egs/aishell1/conf/slurm.conf",
    "content": "# Default configuration\ncommand sbatch --export=PATH\noption name=* --job-name $0\noption time=* --time $0\noption mem=* --mem-per-cpu $0\noption mem=0\noption num_threads=* --cpus-per-task $0\noption num_threads=1 --cpus-per-task 1\noption num_nodes=* --nodes $0\ndefault gpu=0\noption gpu=0 -p cpu\noption gpu=* -p gpu --gres=gpu:$0 -c $0  # Recommend allocating more CPU than, or equal to the number of GPU\n# note: the --max-jobs-run option is supported as a special case\n# by slurm.pl and you don't have to handle it in the config file.\n"
  },
  {
    "path": "egs/aishell1/conf/specaug.yaml",
    "content": "process:\n  # these three processes are a.k.a. SpecAugument\n  - type: \"time_warp\"\n    max_time_warp: 5\n    inplace: true\n    mode: \"PIL\"\n  - type: \"freq_mask\"\n    F: 30\n    n_mask: 2\n    inplace: true\n    replace_with_zero: false\n  - type: \"time_mask\"\n    T: 40\n    n_mask: 2\n    inplace: true\n    replace_with_zero: false\n"
  },
  {
    "path": "egs/aishell1/conf/specaug_test.yaml",
    "content": "process:\n  # these three processes are a.k.a. SpecAugument\n  - type: \"time_warp\"\n    max_time_warp: 0\n    inplace: true\n    mode: \"PIL\"\n  - type: \"freq_mask\"\n    F: 30\n    n_mask: 2\n    inplace: true\n    replace_with_zero: true\n  - type: \"time_mask\"\n    T: 40\n    n_mask: 2\n    inplace: true\n    replace_with_zero: true\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/decode_pytorch_transformer.yaml",
    "content": "batchsize: 0\nbeam-size: 10\npenalty: 0.0\nmaxlenratio: 0.0\nminlenratio: 0.0\nctc-weight: 0.5\nlm-weight: 0.0\nngram-weight: 0.3\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/decode_rnn.yaml",
    "content": "beam-size: 20\npenalty: 0.0\nmaxlenratio: 0.0\nminlenratio: 0.0\nctc-weight: 0.6\nlm-weight: 0.3\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/train_pytorch_conformer_kernel15.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nrel-pos-type: latest\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 15\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/train_pytorch_conformer_kernel31.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/train_pytorch_conformer_kernel31_large.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 16\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 512\naheads: 8\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/train_pytorch_conformer_kernel31_small.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 8\neunits: 1024\n# decoder related\ndlayers: 4\ndunits: 1024\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/train_pytorch_transformer.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/train_rnn.yaml",
    "content": "# network architecture\n# encoder related\netype: vggblstm     # encoder architecture type\nelayers: 3\neunits: 1024\neprojs: 1024\nsubsample: \"1_2_2_1_1\" # skip every n frame from input to nth layers\n# decoder related\ndlayers: 2\ndunits: 1024\n# attention related\natype: location\nadim: 1024\naconv-chans: 10\naconv-filts: 100\n\n# hybrid CTC/attention\nmtlalpha: 0.5\n\n# minibatch related\nbatch-size: 30\nmaxlen-in: 800  # if input length  > maxlen_in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen_out, batchsize is automatically reduced\n\n# optimization related\nopt: adadelta\nepochs: 10\npatience: 0\n\n# scheduled sampling option\nsampling-probability: 0.0\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/decode_default.yaml",
    "content": "# decoding parameters\nbatch: 0\nbeam-size: 10\nsearch-type: default\nscore-norm: True\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_conformer-rnn_transducer.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\naccum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4_att.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# Attention scorer auxiliary task: mainly follow the settings in LASCTC decoder\natt-adim: 512\natt-aheads: 8\natt-dlayers: 6\natt-dunits: 2048\natt-dropout-rate: 0.1\natt-attn-dropout-rate: 0.0\natt-length-normalized-loss: false\nlsm-weight: 0.1\n\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4_small.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 256\n          d_ff: 1024\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 512\ndunits: 256\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 256\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\n#aux-ctc: True\n#aux-ctc-weight: 0.5\n#aux-ctc-dropout-rate: 0.1\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_conformer-rnn_transducer_ngpu4.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: False\naux-ctc-weight: 0.0\naux-ctc-dropout-rate: 0.0\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_conformer-rnn_transducer_ngpu4_large.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 31\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 16\n## decoder related\ndtype: lstm\ndlayers: 2\ndec-embed-dim: 1024\ndunits: 1024\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: False\naux-ctc-weight: 0.0\naux-ctc-dropout-rate: 0.0\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_transducer.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: adadelta\nepochs: 30\npatience: 3\naccum-grad: 2\n\n# network architecture\n## encoder related\netype: vggblstm\nelayers: 6\neunits: 512\neprojs: 512\ndropout-rate: 0.4\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n"
  },
  {
    "path": "egs/aishell1/conf/tuning/transducer/train_transducer_aux.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: adadelta\nepochs: 30\npatience: 3\naccum-grad: 2\n\n# network architecture\n## encoder related\netype: vggblstm\nelayers: 6\neunits: 512\neprojs: 512\ndropout-rate: 0.4\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: True\naux-ctc-weight: 0.1\naux-ctc-dropout-rate: 0.1\n"
  },
  {
    "path": "egs/aishell1/local/add_lex_disambig.pl",
    "content": "#!/usr/bin/env perl\n#  Copyright 2010-2011  Microsoft Corporation\n#            2013-2016  Johns Hopkins University (author: Daniel Povey)\n#                 2015  Hainan Xu\n#                 2015  Guoguo Chen\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Adds disambiguation symbols to a lexicon.\n# Outputs still in the normal lexicon format.\n# Disambig syms are numbered #1, #2, #3, etc. (#0\n# reserved for symbol in grammar).\n# Outputs the number of disambig syms to the standard output.\n# With the --pron-probs option, expects the second field\n# of each lexicon line to be a pron-prob.\n# With the --sil-probs option, expects three additional\n# fields after the pron-prob, representing various components\n# of the silence probability model.\n\n$pron_probs = 0;\n$sil_probs = 0;\n$first_allowed_disambig = 1;\n\nfor ($n = 1; $n <= 3 && @ARGV > 0; $n++) {\n  if ($ARGV[0] eq \"--pron-probs\") {\n    $pron_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--sil-probs\") {\n    $sil_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--first-allowed-disambig\") {\n    $first_allowed_disambig = 0 + $ARGV[1];\n    if ($first_allowed_disambig < 1) {\n      die \"add_lex_disambig.pl: invalid --first-allowed-disambig option: $first_allowed_disambig\\n\";\n    }\n    shift @ARGV;\n    shift @ARGV;\n  }\n}\n\nif (@ARGV != 2) {\n  die \"Usage: add_lex_disambig.pl [opts] <lexicon-in> 
<lexicon-out>\\n\" .\n    \"This script adds disambiguation symbols to a lexicon in order to\\n\" .\n    \"make decoding graphs determinizable; it adds pseudo-phone\\n\" .\n    \"disambiguation symbols #1, #2 and so on at the ends of phones\\n\" .\n    \"to ensure that all pronunciations are different, and that none\\n\" .\n    \"is a prefix of another.\\n\" .\n    \"It prints to the standard output the number of the largest-numbered\" .\n    \"disambiguation symbol that was used.\\n\" .\n    \"\\n\" .\n    \"Options:   --pron-probs       Expect pronunciation probabilities in the 2nd field\\n\" .\n    \"           --sil-probs        [should be with --pron-probs option]\\n\" .\n    \"                              Expect 3 extra fields after the pron-probs, for aspects of\\n\" .\n    \"                              the silence probability model\\n\" .\n    \"           --first-allowed-disambig <n>  The number of the first disambiguation symbol\\n\" .\n    \"                              that this script is allowed to add.  
By default this is\\n\" .\n    \"                              #1, but you can set this to a larger value using this option.\\n\" .\n    \"e.g.:\\n\" .\n    \" add_lex_disambig.pl lexicon.txt lexicon_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs lexiconp.txt lexiconp_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs --sil-probs lexiconp_silprob.txt lexiconp_silprob_disambig.txt\\n\";\n}\n\n\n$lexfn = shift @ARGV;\n$lexoutfn = shift @ARGV;\n\nopen(L, \"<$lexfn\") || die \"Error opening lexicon $lexfn\";\n\n# (1)  Read in the lexicon.\n@L = ( );\nwhile(<L>) {\n    @A = split(\" \", $_);\n    push @L, join(\" \", @A);\n}\n\n# (2) Work out the count of each phone-sequence in the\n# lexicon.\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) {\n      $p = shift @A;\n      if (!($p > 0.0 && $p <= 1.0)) { die \"Bad lexicon line $l (expecting pron-prob as second field)\"; }\n    }\n    if ($sil_probs) {\n      $silp = shift @A;\n      if (!($silp > 0.0 && $silp <= 1.0)) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n    }\n    if (!(@A)) {\n      die \"Bad lexicon line $l, no phone in phone list\";\n    }\n    $count{join(\" \",@A)}++;\n}\n\n# (3) For each left sub-sequence of each phone-sequence, note down\n# that it exists (for identifying prefixes of longer strings).\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) { shift @A; } # remove pron-prob.\n    if ($sil_probs) {\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob, there are three numbers for sil_probs\n    }\n    while(@A > 0) {\n        pop @A;  # Remove last phone\n        $issubseq{join(\" \",@A)} = 1;\n    }\n}\n\n# (4) For 
each entry in the lexicon:\n#  if the phone sequence is unique and is not a\n#  prefix of another word, no disambig symbol.\n#  Else output #1, or #2, #3, ... if the same phone-seq\n#  has already been assigned a disambig symbol.\n\n\nopen(O, \">$lexoutfn\") || die \"Opening lexicon file $lexoutfn for writing.\\n\";\n\n# max_disambig will always be the highest-numbered disambiguation symbol that\n# has been used so far.\n$max_disambig = $first_allowed_disambig - 1;\n\nforeach $l (@L) {\n  @A = split(\" \", $l);\n  $word = shift @A;\n  if ($pron_probs) {\n    $pron_prob = shift @A;\n  }\n  if ($sil_probs) {\n    $sil_word_prob = shift @A;\n    $word_sil_correction = shift @A;\n    $prev_nonsil_correction = shift @A;\n  }\n  $phnseq = join(\" \", @A);\n  if (!defined $issubseq{$phnseq}\n      && $count{$phnseq} == 1) {\n    ;                           # Do nothing.\n  } else {\n    if ($phnseq eq \"\") {        # need disambig symbols for the empty string\n      # that are not used anywhere else.\n      $max_disambig++;\n      $reserved_for_the_empty_string{$max_disambig} = 1;\n      $phnseq = \"#$max_disambig\";\n    } else {\n      $cur_disambig = $last_used_disambig_symbol_of{$phnseq};\n      if (!defined $cur_disambig) {\n        $cur_disambig = $first_allowed_disambig;\n      } else {\n        $cur_disambig++;           # Get a number that has not been used yet for\n                                   # this phone sequence.\n      }\n      while (defined $reserved_for_the_empty_string{$cur_disambig}) {\n        $cur_disambig++;\n      }\n      if ($cur_disambig > $max_disambig) {\n        $max_disambig = $cur_disambig;\n      }\n      $last_used_disambig_symbol_of{$phnseq} = $cur_disambig;\n      $phnseq = $phnseq . \" #\" . 
$cur_disambig;\n    }\n  }\n  if ($pron_probs) {\n    if ($sil_probs) {\n      print O \"$word\\t$pron_prob\\t$sil_word_prob\\t$word_sil_correction\\t$prev_nonsil_correction\\t$phnseq\\n\";\n    } else {\n      print O \"$word\\t$pron_prob\\t$phnseq\\n\";\n    }\n  } else {\n    print O \"$word\\t$phnseq\\n\";\n  }\n}\n\nprint $max_disambig . \"\\n\";\n"
  },
  {
    "path": "egs/aishell1/local/aishell_data_prep.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Xingyu Na\n# Apache 2.0\n\n. ./path.sh || exit 1;\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 <audio-path> <text-path>\"\n  echo \" $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript\"\n  exit 1;\nfi\n\naishell_audio_dir=$1\naishell_text=$2/aishell_transcript_v0.8.txt\n\ntrain_dir=data/local/train\ndev_dir=data/local/dev\ntest_dir=data/local/test\ntmp_dir=data/local/tmp\n\nmkdir -p $train_dir\nmkdir -p $dev_dir\nmkdir -p $test_dir\nmkdir -p $tmp_dir\n\n# data directory check\nif [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then\n  echo \"Error: $0 requires two directory arguments\"\n  exit 1;\nfi\n\n# find wav audio file for train, dev and test resp.\nfind $aishell_audio_dir -iname \"*.wav\" > $tmp_dir/wav.flist\nn=`cat $tmp_dir/wav.flist | wc -l`\n[ $n -ne 141925 ] && \\\n  echo Warning: expected 141925 data data files, found $n\n\ngrep -i \"wav/train\" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;\ngrep -i \"wav/dev\" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;\ngrep -i \"wav/test\" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;\n\nrm -r $tmp_dir\n\n# Transcriptions preparation\nfor dir in $train_dir $dev_dir $test_dir; do\n  echo Preparing $dir transcriptions\n  sed -e 's/\\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list\n  sed -e 's/\\.wav//' $dir/wav.flist | awk -F '/' '{i=NF-1;printf(\"%s %s\\n\",$NF,$i)}' > $dir/utt2spk_all\n  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all\n  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt\n  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list\n  utils/filter_scp.pl -f 1 $dir/utt.list $dir/utt2spk_all | sort -u > $dir/utt2spk\n  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp\n  sort -u $dir/transcripts.txt > $dir/text\n  utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt\ndone\n\nmkdir -p data/train data/dev 
data/test\n\nfor f in spk2utt utt2spk wav.scp text; do\n  cp $train_dir/$f data/train/$f || exit 1;\n  cp $dev_dir/$f data/dev/$f || exit 1;\n  cp $test_dir/$f data/test/$f || exit 1;\ndone\n\necho \"$0: AISHELL data preparation succeeded\"\nexit 0;\n"
  },
  {
    "path": "egs/aishell1/local/aishell_train_lms.sh",
    "content": "#!/usr/bin/env bash\n\n# To be run from one directory above this script.\n. ./path.sh\n\ntext=data/local/train/text\nlexicon=data/local/dict_nosp/lexicon.txt\n\nfor f in \"$text\" \"$lexicon\"; do\n  [ ! -f $x ] && echo \"$0: No such file $f\" && exit 1\ndone\n\n# This script takes no arguments.  It assumes you have already run\n# aishell_data_prep.sh.\n# It takes as input the files\n# data/local/train/text\n# data/local/dict/lexicon.txt\ndir=data/local/lm\nmkdir -p $dir\n\nkaldi_lm=$(which train_lm.sh)\nif [ -z $kaldi_lm ]; then\n  echo \"$0: train_lm.sh is not found. That might mean it's not installed\"\n  echo \"$0: or it is not added to PATH\"\n  echo \"$0: Please use the following commands to install it\"\n  echo \"  git clone https://github.com/danpovey/kaldi_lm.git\"\n  echo \"  cd kaldi_lm\"\n  echo \"  make -j\"\n  echo \"Then add the path of kaldi_lm to PATH and rerun $0\"\n  exit 1\nfi\n\ncleantext=$dir/text.no_oov\n\ncat $text | awk -v lex=$lexicon 'BEGIN{while((getline<lex) >0){ seen[$1]=1; } }\n  {for(n=1; n<=NF;n++) {  if (seen[$n]) { printf(\"%s \", $n); } else {printf(\"<UNK> \");} } printf(\"\\n\");}' \\\n  >$cleantext || exit 1\n\ncat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | sort | uniq -c |\n  sort -nr >$dir/word.counts || exit 1\n\n# Get counts from acoustic training transcripts, and add  one-count\n# for each word in the lexicon (but not silence, we don't want it\n# in the LM-- we'll add it optionally later).\ncat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' |\n  cat - <(grep -w -v '!SIL' $lexicon | awk '{print $1}') |\n  sort | uniq -c | sort -nr >$dir/unigram.counts || exit 1\n\n# note: we probably won't really make use of <UNK> as there aren't any OOVs\ncat $dir/unigram.counts | awk '{print $2}' | get_word_map.pl \"<s>\" \"</s>\" \"<UNK>\" >$dir/word_map ||\n  exit 1\n\n# note: ignore 1st field of train.txt, it's the utterance-id.\ncat $cleantext | awk -v wmap=$dir/word_map 
'BEGIN{while((getline<wmap)>0)map[$1]=$2;}\n  { for(n=2;n<=NF;n++) { printf map[$n]; if(n<NF){ printf \" \"; } else { print \"\"; }}}' | gzip -c >$dir/train.gz ||\n  exit 1\n\ntrain_lm.sh --arpa --lmtype 3gram-mincount $dir || exit 1\ntrain_lm.sh --arpa --lmtype 4gram-mincount $dir || exit 1\n\n# LM is small enough that we don't need to prune it (only about 0.7M N-grams).\n# Perplexity over 128254.000000 words is 90.446690\n\n# note: output is\n# data/local/lm/3gram-mincount/lm_unpruned.gz\n\nexit 0\n\n# From here is some commands to do a baseline with SRILM (assuming\n# you have it installed).\nheldout_sent=10000 # Don't change this if you want result to be comparable with\n# kaldi_lm results\nsdir=$dir/srilm # in case we want to use SRILM to double-check perplexities.\nmkdir -p $sdir\ncat $cleantext | awk '{for(n=2;n<=NF;n++){ printf $n; if(n<NF) printf \" \"; else print \"\"; }}' |\n  head -$heldout_sent >$sdir/heldout\ncat $cleantext | awk '{for(n=2;n<=NF;n++){ printf $n; if(n<NF) printf \" \"; else print \"\"; }}' |\n  tail -n +$heldout_sent >$sdir/train\n\ncat $dir/word_map | awk '{print $1}' | cat - <(\n  echo \"<s>\"\n  echo \"</s>\"\n) >$sdir/wordlist\n\nngram-count -text $sdir/train -order 3 -limit-vocab -vocab $sdir/wordlist -unk \\\n  -map-unk \"<UNK>\" -kndiscount -interpolate -lm $sdir/srilm.o3g.kn.gz\nngram -lm $sdir/srilm.o3g.kn.gz -ppl $sdir/heldout\n# 0 zeroprobs, logprob= -250954 ppl= 90.5091 ppl1= 132.482\n\n# Note: perplexity SRILM gives to Kaldi-LM model is same as kaldi-lm reports above.\n# Difference in WSJ must have been due to different treatment of <UNK>.\nngram -lm $dir/3gram-mincount/lm_unpruned.gz -ppl $sdir/heldout\n# 0 zeroprobs, logprob= -250913 ppl= 90.4439 ppl1= 132.379\n"
  },
  {
    "path": "egs/aishell1/local/apply_map.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\n# This program is a bit like ./sym2int.pl in that it applies a map\n# to things in a file, but it's a bit more general in that it doesn't\n# assume the things being mapped to are single tokens, they could\n# be sequences of tokens.  See the usage message.\n\n\n$permissive = 0;\n\nfor ($x = 0; $x <= 2; $x++) {\n\n  if (@ARGV > 0 && $ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 1:10 as a courtesty (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n\n  if (@ARGV > 0 && $ARGV[0] eq '--permissive') {\n    shift @ARGV;\n    # Mapping is optional (missing key is printed to output)\n    $permissive = 1;\n  }\n}\n\nif(@ARGV != 1) {\n  print STDERR \"Invalid usage: \" . join(\" \", @ARGV) . 
\"\\n\";\n  print STDERR <<'EOF';\nUsage: apply_map.pl [options] map <input >output\n options: [-f <field-range> ] [--permissive]\n   This applies a map to some specified fields of some input text:\n   For each line in the map file: the first field is the thing we\n   map from, and the remaining fields are the sequence we map it to.\n   The -f (field-range) option says which fields of the input file the map\n   map should apply to.\n   If the --permissive option is supplied, fields which are not present\n   in the map will be left as they were.\n Applies the map 'map' to all input text, where each line of the map\n is interpreted as a map from the first field to the list of the other fields\n Note: <field-range> can look like 4-5, or 4-, or 5-, or 1, it means the field\n range in the input to apply the map to.\n e.g.: echo A B | apply_map.pl a.txt\n where a.txt is:\n A a1 a2\n B b\n will produce:\n a1 a2 b\nEOF\n  exit(1);\n}\n\n($map_file) = @ARGV;\nopen(M, \"<$map_file\") || die \"Error opening map file $map_file: $!\";\n\nwhile (<M>) {\n  @A = split(\" \", $_);\n  @A >= 1 || die \"apply_map.pl: empty line.\";\n  $i = shift @A;\n  $o = join(\" \", @A);\n  $map{$i} = $o;\n}\n\nwhile(<STDIN>) {\n  @A = split(\" \", $_);\n  for ($x = 0; $x < @A; $x++) {\n    if ( (!defined $field_begin || $x >= $field_begin)\n         && (!defined $field_end || $x <= $field_end)) {\n      $a = $A[$x];\n      if (!defined $map{$a}) {\n        if (!$permissive) {\n          die \"apply_map.pl: undefined key $a in $map_file\\n\";\n        } else {\n          print STDERR \"apply_map.pl: warning! missing key $a in $map_file\\n\";\n        }\n      } else {\n        $A[$x] = $map{$a};\n      }\n    }\n  }\n  print join(\" \", @A) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/aishell1/local/build_sp_text.py",
    "content": "import sys\n\nin_f = sys.argv[1]\n\nfor line in open(in_f, 'r', encoding=\"utf8\"):\n    elems = line.split()\n    uttid = elems[0]\n    for sp in [\"0.9\", \"1.0\", \"1.1\"]:\n        uttid_sp = f\"sp{sp}-{uttid}\"\n        line = f\"{uttid_sp} \" + \" \".join(elems[1:])\n        print(line)\n"
  },
  {
    "path": "egs/aishell1/local/build_word_mapping.py",
    "content": "# convert the attention output vocabulary into lexicon vocabulary\nimport sys\n\natt_vocab = sys.argv[1]\nlex_vocab = sys.argv[2]\nout_map = sys.argv[3]\n\n# load lex_vocab\nlex = {}\nfor line in open(lex_vocab, encoding='utf8'):\n    tok, tid = line.split()\n    lex[tok] = tid\n\nwriter = open(out_map, 'w', encoding='utf8')\nfor line in open(att_vocab, encoding='utf8'):\n    tok, tid = line.split()\n    if tok in lex.keys():\n        info = \"{} {}\\n\".format(tid, lex[tok])\n        writer.write(info)\n    else:\n        print(\"CANNOT find \", tok)\n"
  },
  {
    "path": "egs/aishell1/local/compile_bigram.sh",
    "content": "# Compile char level bigram LM. for MMI training. \n# The bigram should be sparse or 4300+ words would lead to 17M arcs and overflow of GPU memory\n\nlang=$1\ntrain_text=$2\nthreshold=2\n\nlmplz -o 2 --prune $threshold < $train_text > $lang/P.arpa\npython3 -m kaldilm \\\n        --read-symbol-table=\"${lang}/words.txt\" \\\n        --disambig-symbol='#0' \\\n        --max-order=2 \\\n        $lang/P.arpa > ${lang}/P.fst.txt\n"
  },
  {
    "path": "egs/aishell1/local/download_and_untar.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2014  Johns Hopkins University (author: Daniel Povey)\n#             2017  Xingyu Na\n# Apache 2.0\n\nremove_archive=false\n\nif [ \"$1\" == --remove-archive ]; then\n  remove_archive=true\n  shift\nfi\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [--remove-archive] <data-base> <url-base> <corpus-part>\"\n  echo \"e.g.: $0 /export/a05/xna/data www.openslr.org/resources/33 data_aishell\"\n  echo \"With --remove-archive it will remove the archive after successfully un-tarring it.\"\n  echo \"<corpus-part> can be one of: data_aishell, resource_aishell.\"\nfi\n\ndata=$1\nurl=$2\npart=$3\n\nif [ ! -d \"$data\" ]; then\n  echo \"$0: no such directory $data\"\n  exit 1;\nfi\n\npart_ok=false\nlist=\"data_aishell resource_aishell\"\nfor x in $list; do\n  if [ \"$part\" == $x ]; then part_ok=true; fi\ndone\nif ! $part_ok; then\n  echo \"$0: expected <corpus-part> to be one of $list, but got '$part'\"\n  exit 1;\nfi\n\nif [ -z \"$url\" ]; then\n  echo \"$0: empty URL base.\"\n  exit 1;\nfi\n\nif [ -f $data/$part/.complete ]; then\n  echo \"$0: data part $part was already successfully extracted, nothing to do.\"\n  exit 0;\nfi\n\n# sizes of the archive files in bytes.\nsizes=\"15582913665 1246920\"\n\nif [ -f $data/$part.tgz ]; then\n  size=$(/bin/ls -l $data/$part.tgz | awk '{print $5}')\n  size_ok=false\n  for s in $sizes; do if [ $s == $size ]; then size_ok=true; fi; done\n  if ! $size_ok; then\n    echo \"$0: removing existing file $data/$part.tgz because its size in bytes $size\"\n    echo \"does not equal the size of one of the archives.\"\n    rm $data/$part.tgz\n  else\n    echo \"$data/$part.tgz exists and appears to be complete.\"\n  fi\nfi\n\nif [ ! -f $data/$part.tgz ]; then\n  if ! command -v wget >/dev/null; then\n    echo \"$0: wget is not installed.\"\n    exit 1;\n  fi\n  full_url=$url/$part.tgz\n  echo \"$0: downloading data from $full_url.  
This may take some time, please be patient.\"\n\n  cd $data || exit 1\n  if ! wget --no-check-certificate $full_url; then\n    echo \"$0: error executing wget $full_url\"\n    exit 1;\n  fi\nfi\n\ncd $data || exit 1\n\nif ! tar -xvzf $part.tgz; then\n  echo \"$0: error un-tarring archive $data/$part.tgz\"\n  exit 1;\nfi\n\ntouch $data/$part/.complete\n\nif [ $part == \"data_aishell\" ]; then\n  cd $data/$part/wav || exit 1\n  for wav in ./*.tar.gz; do\n    echo \"Extracting wav from $wav\"\n    tar -zxf $wav && rm $wav\n  done\nfi\n\necho \"$0: Successfully downloaded and un-tarred $data/$part.tgz\"\n\nif $remove_archive; then\n  echo \"$0: removing $data/$part.tgz file since --remove-archive option was supplied.\"\n  rm $data/$part.tgz\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/aishell1/local/fstaddselfloops.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2020 Xiaomi Corporation (Author: Junbo Zhang)\n# Apache 2.0\n\nuse strict;\nuse warnings;\n\nmy $Usage = <<EOU;\nfstaddselfloops.pl:\nAdds self-loops to states of an FST to propagate disambiguation symbols through it.\nThey are added on each final state and each state with non-epsilon output symbols\non at least one arc out of the state. \n\nUsage: local/fstaddselfloops.pl <wdisambig_phone> <wdisambig_word> < <openfst_text>\n e.g.: cat L_disambig.txt | local/fstaddselfloops.pl 347 200004 > L_disambig_with_loop.txt\nEOU\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\nmy $wdisambig_phone = shift @ARGV;\nmy $wdisambig_word = shift @ARGV;\n\nmy %states_needs_self_loops;\nwhile (<>) {\n    print $_;\n\n    my @items = split(/\\s+/);\n    if (@items == 2) {\n        # it is a final state\n        $states_needs_self_loops{$items[0]} = 1;\n    } elsif (@items == 5) {\n        my ($src, $dst, $inlabel, $outlabel, $score) = @items;\n        $states_needs_self_loops{$src} = 1 if ($outlabel != 0);\n    } else {\n        die \"Invalid openfst line.\";\n    }\n}\n\nforeach (keys %states_needs_self_loops) {\n    print \"$_ $_ $wdisambig_phone $wdisambig_word 0.0\\n\"\n}\n"
  },
  {
    "path": "egs/aishell1/local/k2_aishell_prepare_dict.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Xingyu Na\n# Apache 2.0\n\n# prepare dict resources\n\n# . ./path.sh\n\n[ $# != 2 ] && echo \"Usage: $0 <resource-path> <dest-path>\" && exit 1\n\nres_dir=$1\ndict_dir=$2\nmkdir -p $dict_dir\ncp $res_dir/lexicon.txt $dict_dir\necho '<UNK> spn' >>$dict_dir/lexicon.txt\n\ncat $dict_dir/lexicon.txt | awk '{ for(n=2;n<=NF;n++){ phones[$n] = 1; }} END{for (p in phones) print p;}' |\n  perl -e 'while(<>){ chomp($_); $phone = $_; next if ($phone eq \"sil\");\n    m:^([^\\d]+)(\\d*)$: || die \"Bad phone $_\"; $q{$1} .= \"$phone \"; }\n    foreach $l (values %q) {print \"$l\\n\";}\n  ' | sort -k1 >$dict_dir/nonsilence_phones.txt || exit 1\n\necho sil >$dict_dir/silence_phones.txt\n\necho sil >$dict_dir/optional_silence.txt\n\n# No \"extra questions\" in the input to this setup, as we don't\n# have stress or tone\n\ncat $dict_dir/silence_phones.txt | awk '{printf(\"%s \", $1);} END{printf \"\\n\";}' >$dict_dir/extra_questions.txt || exit 1\ncat $dict_dir/nonsilence_phones.txt | perl -e 'while(<>){ foreach $p (split(\" \", $_)) {\n  $p =~ m:^([^\\d]+)(\\d*)$: || die \"Bad phone $_\"; $q{$2} .= \"$p \"; } } foreach $l (values %q) {print \"$l\\n\";}' \\\n  >>$dict_dir/extra_questions.txt || exit 1\n\necho \"$0: AISHELL dict preparation succeeded\"\nexit 0\n"
  },
  {
    "path": "egs/aishell1/local/k2_aishell_prepare_dict_char.sh",
    "content": "# Build character-level dict for K2 CTC / MMI \n# The token list would be very large (10k+) if we use aishell lexicon \n# so we use the token list of espnet\n[ $# != 2 ] && echo \"Usage: $0 <espnet-char-list> <dest-path>\" && exit 1\n\nlex=$1\ndict=$2\n\nrm -r $dict\nmkdir -p $dict\n\n# prepare lexicon\ncat $lex | tail -n +2 | awk '{print $1, $1}' > $dict/lexicon.txt\necho \"<UNK> spn\" >> $dict/lexicon.txt\necho \"SIL sil\" >> $dict/lexicon.txt\necho \"<SPOKEN_NOISE> sil\" >> $dict/lexicon.txt\n\n# phones and extra questions\necho sil >$dict/silence_phones.txt\necho sil >$dict/optional_silence.txt\necho sil >$dict/extra_questions.txt\ncat $dict/lexicon.txt | cut -d \" \" -f 2 | grep -v \"sil\" > $dict/nonsilence_phones.txt\n"
  },
  {
    "path": "egs/aishell1/local/k2_prepare_lang.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey);\n#                      Arnab Ghoshal\n#                2014  Guoguo Chen\n#                2015  Hainan Xu\n#                2016  FAU Erlangen (Author: Axel Horndasch)\n#                2020  Xiaomi Corporation (Author: Junbo Zhang)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script prepares a directory such as data/lang/, in the standard format,\n# given a source directory containing a dictionary lexicon.txt in a form like:\n# word phone1 phone2 ... phoneN\n# per line (alternate prons would be separate lines), or a dictionary with probabilities\n# called lexiconp.txt in a form:\n# word pron-prob phone1 phone2 ... phoneN\n# (with 0.0 < pron-prob <= 1.0); note: if lexiconp.txt exists, we use it even if\n# lexicon.txt exists.\n# and also files silence_phones.txt, nonsilence_phones.txt, optional_silence.txt\n# and extra_questions.txt\n# Here, silence_phones.txt and nonsilence_phones.txt are lists of silence and\n# non-silence phones respectively (where silence includes various kinds of\n# noise, laugh, cough, filled pauses etc., and nonsilence phones includes the\n# \"real\" phones.)\n# In each line of those files is a list of phones, and the phones on each line\n# are assumed to correspond to the same \"base phone\", i.e. 
they will be\n# different stress or tone variations of the same basic phone.\n# The file \"optional_silence.txt\" contains just a single phone (typically SIL)\n# which is used for optional silence in the lexicon.\n# extra_questions.txt might be empty; typically will consist of lists of phones,\n# all members of each list with the same stress or tone; and also possibly a\n# list for the silence phones.  This will augment the automatically generated\n# questions (note: the automatically generated ones will treat all the\n# stress/tone versions of a phone the same, so will not \"get to ask\" about\n# stress or tone).\n#\n\n# This script adds word-position-dependent phones and constructs a host of other\n# derived files, that go in data/lang/.\n\n# Begin configuration section.\nnum_sil_states=5\nnum_nonsil_states=3\nposition_dependent_phones=true\n# position_dependent_phones is false also when position dependent phones and word_boundary.txt\n# have been generated by another source\nshare_silence_phones=false  # if true, then share pdfs of different silence\n                            # phones together.\nsil_prob=0.5\nnum_extra_phone_disambig_syms=1 # Standard one phone disambiguation symbol is used for optional silence.\n                                # Increasing this number does not harm, but is only useful if you later\n                                # want to introduce this labels to L_disambig.fst\n\n\n# end configuration sections\n\necho \"$0 $@\"  # Print the command line for logging\necho $sil_prob\n. 
local/parse_options.sh\necho $sil_prob\nif [ $# -ne 4 ]; then\n  echo \"Usage: local/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>\"\n  echo \"e.g.: local/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang\"\n  echo \"<dict-src-dir> should contain the following files:\"\n  echo \" extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt\"\n  echo \"See http://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating for more info.\"\n  echo \"options: \"\n  echo \"<dict-src-dir> may also, for the grammar-decoding case (see http://kaldi-asr.org/doc/grammar.html)\"\n  echo \"contain a file nonterminals.txt containing symbols like #nonterm:contact_list, one per line.\"\n  echo \"     --num-sil-states <number of states>             # default: 5, #states in silence models.\"\n  echo \"     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.\"\n  echo \"     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I\"\n  echo \"                                                     # markers on phones to indicate word-internal positions. \"\n  echo \"     --share-silence-phones (true|false)             # default: false; if true, share pdfs of \"\n  echo \"                                                     # all silence phones. \"\n  echo \"     --sil-prob <probability of silence>             # default: 0.5 [must have 0 <= silprob < 1]\"\n  exit 1;\nfi\n\nsrcdir=$1\noov_word=$2\ntmpdir=$3\ndir=$4\n\n\nif [ -d $dir/phones ]; then\n  rm -r $dir/phones\nfi\nmkdir -p $dir $tmpdir $dir/phones\n\nsilprob=false\n[ -f $srcdir/lexiconp_silprob.txt ] && silprob=true\n\n[ -f path.sh ] && . ./path.sh\n\nif [[ ! -f $srcdir/lexicon.txt ]]; then\n  echo \"**Creating $srcdir/lexicon.txt from $srcdir/lexiconp.txt\"\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' < $srcdir/lexiconp.txt > $srcdir/lexicon.txt || exit 1;\nfi\nif [[ ! 
-f $srcdir/lexiconp.txt ]]; then\n  echo \"**Creating $srcdir/lexiconp.txt from $srcdir/lexicon.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1.0\\t$2/;' < $srcdir/lexicon.txt > $srcdir/lexiconp.txt || exit 1;\nfi\n\nif [ ! -z \"$unk_fst\" ] && [ ! -f \"$unk_fst\" ]; then\n  echo \"$0: expected --unk-fst $unk_fst to exist as a file\"\n  exit 1\nfi\n\nif $position_dependent_phones; then\n  # Create $tmpdir/lexiconp.txt from $srcdir/lexiconp.txt (or\n  # $tmpdir/lexiconp_silprob.txt from $srcdir/lexiconp_silprob.txt) by\n  # adding the markers _B, _E, _S, _I depending on word position.\n  # In this recipe, these markers apply to silence also.\n  # Do this starting from lexiconp.txt only.\n  if \"$silprob\"; then\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; $silword_p = shift @A;\n              $wordsil_f = shift @A; $wordnonsil_f = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_S\\n\"; }\n         else { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n                < $srcdir/lexiconp_silprob.txt > $tmpdir/lexiconp_silprob.txt\n  else\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $A[0]_S\\n\"; } else { print \"$w $p $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n         < $srcdir/lexiconp.txt > $tmpdir/lexiconp.txt || exit 1;\n  fi\n\n  # create $tmpdir/phone_map.txt\n  # this has the format (on each line)\n  # <original phone> <version 1 of original phone> <version 2> ...\n  # where the versions depend on the position of the phone within a word.\n  # For instance, we'd have:\n  # AA AA_B AA_E AA_I AA_S\n  # for (B)egin, (E)nd, (I)nternal and (S)ingleton\n  # and in the case of silence\n  # SIL SIL SIL_B SIL_E SIL_I SIL_S\n  # [because SIL on its own is one of the variants; this 
is for when it doesn't\n  #  occur inside a word but as an option in the lexicon.]\n\n  # This phone map expands the phone lists into all the word-position-dependent\n  # versions of the phone lists.\n  cat <(set -f; for x in `cat $srcdir/silence_phones.txt`; do for y in \"\" \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    <(set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do for y in \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    > $tmpdir/phone_map.txt\nelse\n  if \"$silprob\"; then\n    cp $srcdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob.txt\n  else\n    cp $srcdir/lexiconp.txt $tmpdir/lexiconp.txt\n  fi\n\n  cat $srcdir/silence_phones.txt $srcdir/nonsilence_phones.txt | \\\n    awk '{for(n=1;n<=NF;n++) print $n; }' > $tmpdir/phones\n  paste -d' ' $tmpdir/phones $tmpdir/phones > $tmpdir/phone_map.txt\nfi\n\n\n# Making monophone systems.\ncat $srcdir/silence_phones.txt | local/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/silence.txt\ncat $srcdir/nonsilence_phones.txt | local/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/nonsilence.txt\ncp $srcdir/optional_silence.txt $dir/phones/optional_silence.txt\n\n# if extra_questions.txt is empty, it's OK.\ncat $srcdir/extra_questions.txt 2>/dev/null | local/apply_map.pl $tmpdir/phone_map.txt \\\n  >$dir/phones/extra_questions.txt\n\n# Want extra questions about the word-start/word-end stuff. Make it separate for\n# silence and non-silence. 
Probably doesn't matter, as silence will rarely\n# be inside a word.\nif $position_dependent_phones; then\n  for suffix in _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\n  for suffix in \"\" _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/silence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\nfi\n\n# add_lex_disambig.pl is responsible for adding disambiguation symbols to\n# the lexicon, for telling us how many disambiguation symbols it used,\n# and also for modifying the unknown-word's pronunciation (if the\n# --unk-fst was provided) to the sequence \"#1 #2 #3\", and reserving those\n# disambig symbols for that purpose.\n# The #2 will later be replaced with the actual unk model.  The reason\n# for the #1 and the #3 is for disambiguation and also to keep the\n# FST compact.  If we didn't have the #1, we might have a different copy of\n# the unk-model FST, or at least some of its arcs, for each start-state from\n# which an <unk> transition comes (instead of per end-state, which is more compact);\n# and adding the #3 prevents us from potentially having 2 copies of the unk-model\n# FST due to the optional-silence [the last phone of any word gets 2 arcs].\nif [ ! 
-z \"$unk_fst\" ]; then  # if the --unk-fst option was provided...\n  if \"$silprob\"; then\n    local/lang/internal/modify_unk_pron.py $tmpdir/lexiconp_silprob.txt \"$oov_word\" || exit 1\n  else\n    local/lang/internal/modify_unk_pron.py $tmpdir/lexiconp.txt \"$oov_word\" || exit 1\n  fi\n  unk_opt=\"--first-allowed-disambig 4\"\nelse\n  unk_opt=\nfi\n\nif \"$silprob\"; then\n  ndisambig=$(local/add_lex_disambig.pl $unk_opt --pron-probs --sil-probs $tmpdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob_disambig.txt)\nelse\n  ndisambig=$(local/add_lex_disambig.pl $unk_opt --pron-probs $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt)\nfi\nndisambig=$[$ndisambig+$num_extra_phone_disambig_syms]; # add (at least) one disambig symbol for silence in lexicon FST.\necho $ndisambig > $tmpdir/lex_ndisambig\n\n# Format of lexiconp_disambig.txt:\n# !SIL\t1.0   SIL_S\n# <SPOKEN_NOISE>\t1.0   SPN_S #1\n# <UNK>\t1.0  SPN_S #2\n# <NOISE>\t1.0  NSN_S\n# !EXCLAMATION-POINT\t1.0  EH2_B K_I S_I K_I L_I AH0_I M_I EY1_I SH_I AH0_I N_I P_I OY2_I N_I T_E\n\n( for n in `seq 0 $ndisambig`; do echo '#'$n; done ) >$dir/phones/disambig.txt\n\n# Create phone symbol table.\necho \"<eps>\" | cat - $dir/phones/{silence,nonsilence,disambig}.txt | \\\n  awk '{n=NR-1; print $1, n;}' > $dir/phones.txt\n\n# Create a file that describes the word-boundary information for\n# each phone.  
5 categories.\nif $position_dependent_phones; then\n  cat $dir/phones/{silence,nonsilence}.txt | \\\n    awk '/_I$/{print $1, \"internal\"; next;} /_B$/{print $1, \"begin\"; next; }\n         /_S$/{print $1, \"singleton\"; next;} /_E$/{print $1, \"end\"; next; }\n         {print $1, \"nonword\";} ' > $dir/phones/word_boundary.txt\nelse\n  # word_boundary.txt might have been generated by another source\n  [ -f $srcdir/word_boundary.txt ] && cp $srcdir/word_boundary.txt $dir/phones/word_boundary.txt\nfi\n\n# Create word symbol table.\n# <s> and </s> are only needed due to the need to rescore lattices with\n# ConstArpaLm format language model. They do not normally appear in G.fst or\n# L.fst.\n\nif \"$silprob\"; then\n  # remove the silprob\n  cat $tmpdir/lexiconp_silprob.txt |\\\n    awk '{\n      for(i=1; i<=NF; i++) {\n        if(i!=3 && i!=4 && i!=5) printf(\"%s\\t\", $i); if(i==NF) print \"\";\n      }\n    }' > $tmpdir/lexiconp.txt\nfi\n\ncat $tmpdir/lexiconp.txt | awk '{print $1}' | sort | uniq  | awk '\n  BEGIN {\n    print \"<eps> 0\";\n  }\n  {\n    if ($1 == \"<s>\") {\n      print \"<s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    if ($1 == \"</s>\") {\n      print \"</s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    printf(\"%s %d\\n\", $1, NR);\n  }\n  END {\n    printf(\"#0 %d\\n\", NR+1);\n    printf(\"<s> %d\\n\", NR+2);\n    printf(\"</s> %d\\n\", NR+3);\n  }' > $dir/words.txt || exit 1;\n\n# format of $dir/words.txt:\n#<eps> 0\n#a 1\n#aa 2\n#aarvark 3\n#...\n\nsilphone=`cat $srcdir/optional_silence.txt` || exit 1;\n[ -z \"$silphone\" ] && \\\n  ( echo \"You have no optional-silence phone; it is required in the current scripts\"\n    echo \"but you may use the option --sil-prob 0.0 to stop it being used.\" ) && \\\n   exit 1;\n\ngrammar_opts=\n\n# Create the basic L.fst without disambiguation symbols, for use\n# in training.\n\nif $silprob; then\n  # Add silence probabilities (models the prob. 
of silence before and after each\n  # word).  On some setups this helps a bit.  See local/dict_dir_add_pronprobs.sh\n  # and where it's called in the example scripts (run.sh).\n  local/make_lexicon_fst_silprob.py $grammar_opts --sil-phone=$silphone \\\n    $tmpdir/lexiconp_silprob.txt $srcdir/silprob.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt  > $dir/L.fst.txt || exit 1;\n\n    # fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n    #   --keep_isymbols=false --keep_osymbols=false |   \\\n    # fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nelse\n  local/make_lexicon_fst.py $grammar_opts --sil-prob=$sil_prob --sil-phone=$silphone \\\n    $tmpdir/lexiconp.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt > $dir/L.fst.txt || exit 1;\n\n    # fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n    #   --keep_isymbols=false --keep_osymbols=false | \\\n    # fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nfi\n\n# The file oov.txt contains a word that we will map any OOVs to during\n# training.\necho \"$oov_word\" > $dir/oov.txt || exit 1;\ncat $dir/oov.txt | local/sym2int.pl $dir/words.txt >$dir/oov.int || exit 1;\n# integer version of oov symbol, used in some scripts.\n\n\n# the file wdisambig.txt contains a (line-by-line) list of the text-form of the\n# disambiguation symbols that are used in the grammar and passed through by the\n# lexicon.  
At this stage it's hardcoded as '#0', but we're laying the groundwork\n# for more generality (which probably would be added by another script).\n# wdisambig_words.int contains the corresponding list interpreted by the\n# symbol table words.txt, and wdisambig_phones.int contains the corresponding\n# list interpreted by the symbol table phones.txt.\necho '#0' >$dir/phones/wdisambig.txt\n\nwdisambig_phone=`local/sym2int.pl $dir/phones.txt <$dir/phones/wdisambig.txt`\nwdisambig_word=`local/sym2int.pl $dir/words.txt <$dir/phones/wdisambig.txt`\n\n# Create these lists of phones in colon-separated integer list form too,\n# for purposes of being given to programs as command-line options.\nfor f in silence nonsilence optional_silence disambig; do\n  local/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt >$dir/phones/$f.int\n  local/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt | \\\n   awk '{printf(\":%d\", $1);} END{printf \"\\n\"}' | sed s/:// > $dir/phones/$f.csl || exit 1;\ndone\n\nif [ -f $dir/phones/word_boundary.txt ]; then\n  local/sym2int.pl -f 1 $dir/phones.txt <$dir/phones/word_boundary.txt \\\n    > $dir/phones/word_boundary.int || exit 1;\nfi\n\nsilphonelist=`cat $dir/phones/silence.csl`\nnonsilphonelist=`cat $dir/phones/nonsilence.csl`\n\n# Create the lexicon FST with disambiguation symbols, and put it in lang_test.\n# There is an extra step where we create a loop to \"pass through\" the\n# disambiguation symbols from G.fst.\n\nif $silprob; then\n  local/make_lexicon_fst_silprob.py $grammar_opts \\\n    --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    $tmpdir/lexiconp_silprob_disambig.txt $srcdir/silprob.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt | \\\n    local/fstaddselfloops.pl $wdisambig_phone $wdisambig_word > $dir/L_disambig.fst.txt || exit 1;\nelse\n  local/make_lexicon_fst.py $grammar_opts \\\n    --sil-prob=$sil_prob --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    
$tmpdir/lexiconp_disambig.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt | \\\n    local/fstaddselfloops.pl $wdisambig_phone $wdisambig_word > $dir/L_disambig.fst.txt || exit 1;\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/aishell1/local/make_lexicon_fst.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright   2018  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\n# see get_args() below for usage message.\nimport argparse\nimport os\nimport sys\nimport math\nimport re\n\n# The use of latin-1 encoding does not preclude reading utf-8.  latin-1\n# encoding means \"treat words as sequences of bytes\", and it is compatible\n# with utf-8 encoding as well as other encodings such as gbk, as long as the\n# spaces are also spaces in ascii (which we check).  It is basically how we\n# emulate the behavior of python before python3.\nsys.stdout = open(1, 'w', encoding='latin-1', closefd=False)\nsys.stderr = open(2, 'w', encoding='latin-1', closefd=False)\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script creates the\n       text form of a lexicon FST, to be compiled by fstcompile using the\n       appropriate symbol tables (phones.txt and words.txt) .  It will mostly\n       be invoked indirectly via utils/prepare_lang.sh.  The output goes to\n       the stdout.\"\"\")\n\n    parser.add_argument('--sil-phone', dest='sil_phone', type=str,\n                        help=\"\"\"Text form of optional-silence phone, e.g. 'SIL'.  See also\n                        the --silprob option.\"\"\")\n    parser.add_argument('--sil-prob', dest='sil_prob', type=float, default=0.0,\n                        help=\"\"\"Probability of silence between words (including at the\n                        beginning and end of word sequences).  Must be in the range [0.0, 1.0].\n                        This refers to the optional silence inserted by the lexicon; see\n                        the --silphone option.\"\"\")\n    parser.add_argument('--sil-disambig', dest='sil_disambig', type=str,\n                        help=\"\"\"Disambiguation symbol to disambiguate silence, e.g. 
#5.\n                        Will only be supplied if you are creating the version of L.fst\n                        with disambiguation symbols, intended for use with cyclic G.fst.\n                        This symbol was introduced to fix a rather obscure source of\n                        nondeterminism of CLG.fst, that has to do with reordering of\n                        disambiguation symbols and phone symbols.\"\"\")\n    parser.add_argument('--left-context-phones', dest='left_context_phones', type=str,\n                        help=\"\"\"Only relevant if --nonterminals is also supplied; this relates\n                        to grammar decoding (see http://kaldi-asr.org/doc/grammar.html or\n                        src/doc/grammar.dox).  Format is a list of left-context phones,\n                        in text form, one per line.  E.g. data/lang/phones/left_context_phones.txt\"\"\")\n    parser.add_argument('--nonterminals', type=str,\n                        help=\"\"\"If supplied, --left-context-phones must also be supplied.\n                        List of user-defined nonterminal symbols such as #nonterm:contact_list,\n                        one per line.  E.g. data/local/dict/nonterminals.txt.\"\"\")\n    parser.add_argument('lexiconp', type=str,\n                        help=\"\"\"Filename of lexicon with pronunciation probabilities\n                        (normally lexiconp.txt), with lines of the form 'word prob p1 p2...',\n                        e.g. 
'a   1.0    ay'\"\"\")\n    args = parser.parse_args()\n    return args\n\n\ndef read_lexiconp(filename):\n    \"\"\"Reads the lexiconp.txt file in 'filename', with lines like 'word prob p1 p2 ...'.\n    Returns a list of tuples (word, pron_prob, pron), where 'word' is a string,\n   'pron_prob', a float, is the pronunciation probability (which must be >0.0\n    and would normally be <=1.0),  and 'pron' is a list of strings representing phones.\n    An element in the returned list might be ('hello', 1.0, ['h', 'eh', 'l', 'ow']).\n    \"\"\"\n\n    ans = []\n    found_empty_prons = False\n    found_large_pronprobs = False\n    # See the comment near the top of this file, RE why we use latin-1.\n    with open(filename, 'r', encoding='latin-1') as f:\n        whitespace = re.compile(\"[ \\t]+\")\n        for line in f:\n            a = whitespace.split(line.strip(\" \\t\\r\\n\"))\n            if len(a) < 2:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2} \".format(\n                    sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            word = a[0]\n            if word == \"<eps>\":\n                # This would clash with the epsilon symbol normally used in OpenFst.\n                print(\"{0}: error: found <eps> as a word in lexicon file \"\n                      \"{1}\".format(sys.argv[0], filename), file=sys.stderr)\n                sys.exit(1)\n            try:\n                pron_prob = float(a[1])\n            except ValueError:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2}, 2nd field \"\n                      \"should be pron-prob\".format(sys.argv[0], line.strip(\" \\t\\r\\n\"), filename),\n                      file=sys.stderr)\n                sys.exit(1)\n            prons = a[2:]\n            if pron_prob <= 0.0:\n                print(\"{0}: error: invalid pron-prob in line '{1}' of lexicon file {2} \".format(\n                   
 sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            if len(prons) == 0:\n                found_empty_prons = True\n            ans.append( (word, pron_prob, prons) )\n            if pron_prob > 1.0:\n                found_large_pronprobs = True\n    if found_empty_prons:\n        print(\"{0}: warning: found at least one word with an empty pronunciation \"\n              \"in lexicon file {1}.\".format(sys.argv[0], filename),\n              file=sys.stderr)\n    if found_large_pronprobs:\n        print(\"{0}: warning: found at least one word with pron-prob >1.0 \"\n              \"in {1}\".format(sys.argv[0], filename), file=sys.stderr)\n\n\n    if len(ans) == 0:\n        print(\"{0}: error: found no pronunciations in lexicon file {1}\".format(\n            sys.argv[0], filename), file=sys.stderr)\n        sys.exit(1)\n    return ans\n\n\ndef write_nonterminal_arcs(start_state, loop_state, next_state,\n                           nonterminals, left_context_phones):\n    \"\"\"This function relates to the grammar-decoding setup, see\n    kaldi-asr.org/doc/grammar.html.  It is called from write_fst_no_silence\n    and write_fst_silence, and writes to the stdout some extra arcs\n    in the lexicon FST that relate to nonterminal symbols.\n    See the section \"Special symbols in L.fst,\n    kaldi-asr.org/doc/grammar.html#grammar_special_l.\n       start_state: the start-state of L.fst.\n       loop_state:  the state of high out-degree in L.fst where words leave\n                  and enter.\n       next_state: the number from which this function can start allocating its\n                  own states.  the updated value of next_state will be returned.\n       nonterminals: the user-defined nonterminal symbols as a list of\n          strings, e.g. ['#nonterm:contact_list', ... ].\n       left_context_phones: a list of phones that may appear as left-context,\n          e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n    shared_state = next_state\n    next_state += 1\n    final_state = next_state\n    next_state += 1\n\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=start_state, dest=shared_state,\n        phone='#nonterm_begin', word='#nonterm_begin',\n        cost=0.0))\n\n    for nonterminal in nonterminals:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=loop_state, dest=shared_state,\n            phone=nonterminal, word=nonterminal,\n            cost=0.0))\n    # this_cost equals log(len(left_context_phones)) but the expression below\n    # better captures the meaning.  Applying this cost to arcs keeps the FST\n    # stochastic (sum-to-one, like an HMM), so that if we do weight pushing\n    # things won't get weird.  In the grammar-FST code when we splice things\n    # together we will cancel out this cost, see the function CombineArcs().\n    this_cost = -math.log(1.0 / len(left_context_phones))\n\n    for left_context_phone in left_context_phones:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=shared_state, dest=loop_state,\n            phone=left_context_phone, word='<eps>', cost=this_cost))\n    # arc from loop-state to a final-state with #nonterm_end as ilabel and olabel\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=loop_state, dest=final_state,\n        phone='#nonterm_end', word='#nonterm_end', cost=0.0))\n    print(\"{state}\\t{final_cost}\".format(\n        state=final_state, final_cost=0.0))\n    return next_state\n\n\n\ndef write_fst_no_silence(lexicon, nonterminals=None, left_context_phones=None):\n    \"\"\"Writes the text format of L.fst to the standard output.  
This version is for\n    when --sil-prob=0.0, meaning there is no optional silence allowed.\n\n      'lexicon' is a list of 3-tuples (word, pron-prob, prons) as returned by\n        read_lexiconp().\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... '#nonterm_bos'].\n    \"\"\"\n\n    loop_state = 0\n    next_state = 1  # the next un-allocated state, will be incremented as we go.\n    for (word, pronprob, pron) in lexicon:\n        cost = -math.log(pronprob)\n        cur_state = loop_state\n        for i in range(len(pron) - 1):\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=cur_state,\n                dest=next_state,\n                phone=pron[i],\n                word=(word if i == 0 else '<eps>'),\n                cost=(cost if i == 0 else 0.0)))\n            cur_state = next_state\n            next_state += 1\n\n        i = len(pron) - 1  # note: i == -1 if pron is empty.\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=loop_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=(cost if i <= 0 else 0.0)))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            loop_state, loop_state, next_state,\n            nonterminals, left_context_phones)\n\n    print(\"{state}\\t{final_cost}\".format(\n        state=loop_state,\n        final_cost=0.0))\n\n\ndef write_fst_with_silence(lexicon, sil_prob, sil_phone, sil_disambig,\n                        
   nonterminals=None, left_context_phones=None):\n    \"\"\"Writes the text format of L.fst to the standard output.  This version is for\n       when --sil-prob != 0.0, meaning there is optional silence.\n     'lexicon' is a list of 3-tuples (word, pron-prob, prons)\n         as returned by read_lexiconp().\n     'sil_prob', which is expected to be strictly between 0.0 and 1.0, is the\n         probability of silence.\n     'sil_phone' is the silence phone, e.g. \"SIL\".\n     'sil_disambig' is either None, or the silence disambiguation symbol, e.g. \"#5\".\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n\n    assert sil_prob > 0.0 and sil_prob < 1.0\n    sil_cost = -math.log(sil_prob)\n    no_sil_cost = -math.log(1.0 - sil_prob);\n\n    start_state = 0\n    loop_state = 1  # words enter and leave from here\n    sil_state = 2   # words terminate here when followed by silence; this state\n                    # has a silence transition to loop_state.\n    next_state = 3  # the next un-allocated state, will be incremented as we go.\n\n\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=loop_state,\n        phone='<eps>', word='<eps>', cost=no_sil_cost))\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=sil_state,\n        phone='<eps>', word='<eps>', cost=sil_cost))\n    if sil_disambig is None:\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_state, dest=loop_state,\n            phone=sil_phone, word='<eps>', cost=0.0))\n    else:\n        sil_disambig_state = next_state\n        next_state += 1\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_state, dest=sil_disambig_state,\n            phone=sil_phone, word='<eps>', cost=0.0))\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_disambig_state, dest=loop_state,\n            phone=sil_disambig, word='<eps>', cost=0.0))\n\n\n    for (word, pronprob, pron) in lexicon:\n        pron_cost = -math.log(pronprob)\n        cur_state = loop_state\n        for i in range(len(pron) - 1):\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=cur_state, dest=next_state,\n                phone=pron[i],\n                word=(word if i == 0 else '<eps>'),\n                cost=(pron_cost if i == 0 else 0.0)))\n            cur_state = next_state\n            next_state += 1\n\n        i = len(pron) - 1  # note: i == -1 if pron is empty.\n        
print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=loop_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=no_sil_cost + (pron_cost if i <= 0 else 0.0)))\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=sil_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=sil_cost + (pron_cost if i <= 0 else 0.0)))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            start_state, loop_state, next_state,\n            nonterminals, left_context_phones)\n\n    print(\"{state}\\t{final_cost}\".format(\n        state=loop_state,\n        final_cost=0.0))\n\n\n\n\ndef write_words_txt(orig_lines, highest_numbered_symbol, nonterminals, filename):\n    \"\"\"Writes updated words.txt to 'filename'.  'orig_lines' is the original lines\n       in the words.txt file as a list of strings (without the newlines);\n       highest_numbered_symbol is the highest numbered symbol in the original\n       words.txt; nonterminals is a list of strings like '#nonterm:foo'.\"\"\"\n    with open(filename, 'w', encoding='latin-1') as f:\n        for l in orig_lines:\n            print(l, file=f)\n        cur_symbol = highest_numbered_symbol + 1\n        for n in [ '#nonterm_begin', '#nonterm_end' ] + nonterminals:\n            print(\"{0} {1}\".format(n, cur_symbol), file=f)\n            cur_symbol = cur_symbol + 1\n\n\ndef read_nonterminals(filename):\n    \"\"\"Reads the user-defined nonterminal symbols in 'filename', checks that\n       it has the expected format and has no duplicates, and returns the nonterminal\n       symbols as a list of strings, e.g.\n       ['#nonterm:contact_list', '#nonterm:phone_number', ... ]. 
\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no nonterminals symbols.\".format(filename))\n    for nonterm in ans:\n        if nonterm[:9] != '#nonterm:':\n            raise RuntimeError(\"In file '{0}', expected nonterminal symbols to start with '#nonterm:', found '{1}'\"\n                               .format(filename, nonterm))\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\ndef read_left_context_phones(filename):\n    \"\"\"Reads, checks, and returns a list of left-context phones, in text form, one\n       per line.  Returns a list of strings, e.g. ['a', 'ah', ..., '#nonterm_bos' ]\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no left-context phones.\".format(filename))\n    whitespace = re.compile(\"[ \\t]+\")\n    for s in ans:\n        if len(whitespace.split(s)) != 1:\n            raise RuntimeError(\"The file {0} contains an invalid line '{1}'\".format(filename, s)   )\n\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\n\ndef is_token(s):\n    \"\"\"Returns true if s is a string and is space-free.\"\"\"\n    if not isinstance(s, str):\n        return False\n    whitespace = re.compile(\"[ \\t\\r\\n]+\")\n    split_str = whitespace.split(s);\n    return len(split_str) == 1 and s == split_str[0]\n\n\ndef main():\n    args = get_args()\n\n    lexicon = read_lexiconp(args.lexiconp)\n\n    if args.nonterminals is None:\n        nonterminals, left_context_phones = None, None\n    else:\n        if args.left_context_phones is None:\n            print(\"{0}: if --nonterminals is specified, 
--left-context-phones must also \"\n                  \"be specified\".format(sys.argv[0]), file=sys.stderr)\n            sys.exit(1)\n        nonterminals = read_nonterminals(args.nonterminals)\n        left_context_phones = read_left_context_phones(args.left_context_phones)\n\n    if args.sil_prob == 0.0:\n        write_fst_no_silence(lexicon,\n                             nonterminals=nonterminals,\n                             left_context_phones=left_context_phones)\n    else:\n        # Do some checking that the options make sense.\n        if args.sil_prob < 0.0 or args.sil_prob >= 1.0:\n            print(\"{0}: invalid value specified --sil-prob={1}\".format(\n                sys.argv[0], args.sil_prob), file=sys.stderr)\n            sys.exit(1)\n\n        if not is_token(args.sil_phone):\n            print(\"{0}: you specified --sil-prob={1} but --sil-phone is set \"\n                  \"to '{2}'\".format(sys.argv[0], args.sil_prob, args.sil_phone),\n                  file=sys.stderr)\n            sys.exit(1)\n        if args.sil_disambig is not None and not is_token(args.sil_disambig):\n            print(\"{0}: invalid value --sil-disambig='{1}' was specified.\"\n                  \"\".format(sys.argv[0], args.sil_disambig), file=sys.stderr)\n            sys.exit(1)\n        write_fst_with_silence(lexicon, args.sil_prob, args.sil_phone,\n                               args.sil_disambig,\n                               nonterminals=nonterminals,\n                               left_context_phones=left_context_phones)\n\n\n\n#    (lines, highest_symbol) = read_words_txt(args.input_words_txt)\n#    nonterminals = read_nonterminals(args.nonterminal_symbols_list)\n#    write_words_txt(lines, highest_symbol, nonterminals, args.output_words_txt)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/aishell1/local/max_rescore.py",
    "content": "import sys\nimport json\nimport codecs\nimport copy\n\njson_f = sys.argv[1]\njson_f_out = sys.argv[2]\nbest_dict_f = sys.argv[3]\n\nwith codecs.open(json_f, \"r\", encoding=\"utf-8\") as f:\n        j = json.load(f)\n\nbest_dict = {}\nfor name in j[\"utts\"]:\n    hyp_lst = j[\"utts\"][name][\"output\"]\n    for idx, hyp in enumerate(hyp_lst):\n        if hyp[\"text\"] == hyp[\"rec_text\"].replace(\"<eos>\", \"\") and idx > 0:\n            best_dict[name] = copy.deepcopy([hyp_lst[0]] + [hyp_lst[idx]]) \n            print(f\"{name}: {idx}-th is the best\")\n            if hyp_lst[0][\"mmi_tot_score\"] - hyp_lst[idx][\"mmi_tot_score\"] <  - 1e-5:\n                print(\"May be corrected by MMI\")\n            \n\n\n            hyp_lst = [hyp]\n    j[\"utts\"][name][\"output\"] = hyp_lst[:1]\n\nwith open(json_f_out, \"wb\") as f:\n    f.write(\n        json.dumps(\n            j, indent=4, ensure_ascii=False, sort_keys=True\n        ).encode(\"utf_8\")\n    )\n\nwith open(best_dict_f, \"wb\") as f:\n    f.write(\n        json.dumps(\n            best_dict, indent=4, ensure_ascii=False, sort_keys=True\n        ).encode(\"utf_8\")\n    )\n"
  },
  {
    "path": "egs/aishell1/local/parse_options.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey);\n#                 Arnab Ghoshal, Karel Vesely\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Parse command-line options.\n# To be sourced by another script (as in \". parse_options.sh\").\n# Option format is: --option-name arg\n# and shell variable \"option_name\" gets set to value \"arg.\"\n# The exception is --help, which takes no arguments, but prints the\n# $help_message variable (if defined).\n\n\n###\n### The --config file options have lower priority to command line\n### options, so we need to import them first...\n###\n\n# Now import all the configs specified by command-line, in left-to-right order\nfor ((argpos=1; argpos<$#; argpos++)); do\n  if [ \"${!argpos}\" == \"--config\" ]; then\n    argpos_plus1=$((argpos+1))\n    config=${!argpos_plus1}\n    [ ! -r $config ] && echo \"$0: missing config '$config'\" && exit 1\n    . $config  # source the config file.\n  fi\ndone\n\n\n###\n### Now we process the command line options\n###\nwhile true; do\n  [ -z \"${1:-}\" ] && break;  # break if there are no arguments\n  case \"$1\" in\n    # If the enclosing script is called with --help option, print the help\n    # message and exit.  
Scripts should put help messages in $help_message\n    --help|-h) if [ -z \"$help_message\" ]; then echo \"No help found.\" 1>&2;\n      else printf \"$help_message\\n\" 1>&2 ; fi;\n      exit 0 ;;\n    --*=*) echo \"$0: options to scripts must be of the form --name value, got '$1'\"\n      exit 1 ;;\n    # If the first command-line argument begins with \"--\" (e.g. --foo-bar),\n    # then work out the variable name as $name, which will equal \"foo_bar\".\n    --*) name=`echo \"$1\" | sed s/^--// | sed s/-/_/g`;\n      # Next we test whether the variable in question is undefned-- if so it's\n      # an invalid option and we die.  Note: $0 evaluates to the name of the\n      # enclosing script.\n      # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar\n      # is undefined.  We then have to wrap this test inside \"eval\" because\n      # foo_bar is itself inside a variable ($name).\n      eval '[ -z \"${'$name'+xxx}\" ]' && echo \"$0: invalid option $1\" 1>&2 && exit 1;\n\n      oldval=\"`eval echo \\\\$$name`\";\n      # Work out whether we seem to be expecting a Boolean argument.\n      if [ \"$oldval\" == \"true\" ] || [ \"$oldval\" == \"false\" ]; then\n        was_bool=true;\n      else\n        was_bool=false;\n      fi\n\n      # Set the variable to the right value-- the escaped quotes make it work if\n      # the option had spaces, like --cmd \"queue.pl -sync y\"\n      eval $name=\\\"$2\\\";\n\n      # Check that Boolean-valued arguments are really Boolean.\n      if $was_bool && [[ \"$2\" != \"true\" && \"$2\" != \"false\" ]]; then\n        echo \"$0: expected \\\"true\\\" or \\\"false\\\": $1 $2\" 1>&2\n        exit 1;\n      fi\n      shift 2;\n      ;;\n  *) break;\n  esac\ndone\n\n\n# Check for an empty argument to the --cmd option, which can easily occur as a\n# result of scripting errors.\n[ ! 
-z \"${cmd+xxx}\" ] && [ -z \"$cmd\" ] && echo \"$0: empty argument to --cmd option\" 1>&2 && exit 1;\n\n\ntrue; # so this script returns exit code 0.\n"
  },
  {
    "path": "egs/aishell1/local/parse_text_jieba.py",
    "content": "import jieba\nimport sys\n\nin_f = sys.argv[1]\nout_f = sys.argv[2]\nword_dict = sys.argv[3]\n\njieba.load_userdict(word_dict)\n\nwriter = open(out_f, 'w', encoding=\"utf8\")\nfor line in open(in_f, encoding=\"utf8\"):\n    elems = line.split()\n    uttid = elems[0]\n    trans = \"\".join(elems[1:])\n    trans_seg = \" \".join(elems[1:])\n    trans_seg_jieba = list(jieba.cut(trans, cut_all=False))\n    trans_seg_jieba = \" \".join(trans_seg_jieba)\n    if not trans_seg_jieba == trans_seg:\n        writer.write(f\"Initail: {trans_seg} | jieba: {trans_seg_jieba}\\n\")\nwriter.close()\n"
  },
  {
    "path": "egs/aishell1/local/prepare_word_lex.py",
    "content": "import sys\n\n\"\"\"\nMake a word-level lexicon for MMI training. \nPrevious lexicon accepts phones, here this lexicon accepts words.\n\"\"\"\n\nin_f = sys.argv[1]\nout_f = sys.argv[2]\nchar_out_f = sys.argv[3]\n\ncnt = 0\nwriter = open(out_f, 'w', encoding=\"utf8\")\nchar_writer = open(char_out_f, 'w', encoding=\"utf8\")\nfor line in open(in_f, encoding=\"utf8\"):\n    cnt += 1\n  \n    # The first two lines should be kept: special tokens   \n    if cnt <= 2:\n        writer.write(line)\n        char_writer.write(line)\n        continue\n\n    word = line.split()[0]\n    line = word + \" \" + \" \".join(list(word)) + \"\\n\"\n    writer.write(line)\n\n    if len(word) == 1:\n        line = f\"{word} {word}\\n\"\n        char_writer.write(line)\n\nwriter.close()\nchar_writer.close()\n \n"
  },
  {
    "path": "egs/aishell1/local/sym2int.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n$ignore_oov = 0;\n\nfor($x = 0; $x < 2; $x++) {\n  if ($ARGV[0] eq \"--map-oov\") {\n    shift @ARGV;\n    $map_oov = shift @ARGV;\n    if ($map_oov eq \"-f\" || $map_oov =~ m/words\\.txt$/ || $map_oov eq \"\") {\n      # disallow '-f', the empty string and anything ending in words.txt as the\n      # OOV symbol because these are likely command-line errors.\n      die \"the --map-oov option requires an argument\";\n    }\n  }\n  if ($ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 
1:10 as a courtesy (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n}\n\n$symtab = shift @ARGV;\nif (!defined $symtab) {\n  print STDERR \"Usage: sym2int.pl [options] symtab [input transcriptions] > output transcriptions\\n\" .\n    \"options: [--map-oov <oov-symbol> ]  [-f <field-range> ]\\n\" .\n      \"note: <field-range> can look like 4-5, or 4-, or 5-, or 1.\\n\";\n}\nopen(F, \"<$symtab\") || die \"Error opening symbol table file $symtab\";\nwhile(<F>) {\n    @A = split(\" \", $_);\n    @A == 2 || die \"bad line in symbol table file: $_\";\n    $sym2int{$A[0]} = $A[1] + 0;\n}\n\nif (defined $map_oov && $map_oov !~ m/^\\d+$/) { # not numeric-> look it up\n  if (!defined $sym2int{$map_oov}) { die \"OOV symbol $map_oov not defined.\"; }\n  $map_oov = $sym2int{$map_oov};\n}\n\n$num_warning = 0;\n$max_warning = 20;\n\nwhile (<>) {\n  @A = split(\" \", $_);\n  @B = ();\n  for ($n = 0; $n < @A; $n++) {\n    $a = $A[$n];\n    if ( (!defined $field_begin || $n >= $field_begin)\n         && (!defined $field_end || $n <= $field_end)) {\n      $i = $sym2int{$a};\n      if (!defined ($i)) {\n        if (defined $map_oov) {\n          if ($num_warning++ < $max_warning) {\n            print STDERR \"sym2int.pl: replacing $a with $map_oov\\n\";\n            if ($num_warning == $max_warning) {\n              print STDERR \"sym2int.pl: not warning for OOVs any more times\\n\";\n            }\n          }\n          $i = $map_oov;\n        }\n      }\n      $a = $i;\n    }\n    push @B, $a;\n  }\n  print join(\" \", @B);\n  print \"\\n\";\n}\nif ($num_warning > 0) {\n  print STDERR \"** Replaced $num_warning instances of OOVs with $map_oov\\n\";\n}\n\nexit(0);\n"
  },
  {
    "path": "egs/aishell1/nt.sh",
    "content": "#!/usr/bin/env bash\n\n# author: tyriontian\n# tyriontian@tencent.com\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\ndebug=false\n\n# feature configuration\ndo_delta=false\n\npreprocess_config=conf/specaug.yaml\ntrain_config=conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4.yaml\nlm_config=conf/lm.yaml\ndecode_config=conf/tuning/transducer/decode_default.yaml\n\n# rnnlm related\nlm_resume=         # specify a snapshot file to resume LM training\nlmtag=             # tag for managing LMs\n\n# ngram\nngramtag=\nn_gram=4\n\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\n\n# data\ndata=/data/asr_data/aishell/\ndata_url=www.openslr.org/resources/33\ndict=data/lang_1char/train_sp_units.txt\nlang=data/lang_phone\n\n### Configurable parameters ###\ntag=\"8v100_ddp_rnnt_mmi\"\nngpu=8\n\n# Train config\nseed=888\nbatch_size=8\naccum_grad=1\nepochs=100\nuse_segment=true # if true, use word-level transcription in MMI criterion\naux_ctc=true\naux_ctc_weight=0.5\naux_ctc_dropout_rate=0.1\naux_mmi=true\naux_mmi_weight=0.5\naux_mmi_dropout_rate=0.1\naux_mmi_type='mmi' # mmi or phonectc\natt_scorer_weight=0.0 # train an attention scorer for rescoring\nresume=\n\n# MBR training config\naux_mbr=false\naux_mbr_weight=1.0\naux_mbr_beam=4\nmbr_epochs=100\nmbr_lr=0.1\nmbr_warmup=2500\nmbr_resume=\n\nmaster_port=22275\n\n# Decode config\nidx_average=91_100\nsearch_type=\"alsd\" # \"default\", \"nsc\", \"tsd\", \"alsd\"\nmmi_weight=0.0 # MMI / phonectc joint decoding\nmas_lookahead=0 # MMI Alignment look-ahead frames\nctc_weight=0.0 # char ctc joint 
decoding\nngram_order=4\nngram_weight=0.0\nlm_weight=0.0\nword_ngram_weight=0.0\nword_ngram_tag=word_3gram_wbdiscount\nword_ngram_log_semiring=true\nbeam_size=10\nrecog_set=\"test dev\"\nmax_job=144\n\n. utils/parse_options.sh || exit 1;\n\nif [ $debug == true ]; then\n    export HOST_GPU_NUM=1\n    export HOST_NUM=1\n    export NODE_NUM=1\n    export INDEX=0\n    export CHIEF_IP=\"9.135.217.29\"\nfi\n\ntrain_opts=\\\n\"\\\n--seed $seed \\\n--batch-size $batch_size \\\n--accum-grad $accum_grad \\\n--epochs $epochs \\\n--use-segment $use_segment \\\n--aux-ctc $aux_ctc \\\n--aux-ctc-weight $aux_ctc_weight \\\n--aux-ctc-dropout-rate $aux_ctc_dropout_rate \\\n--aux-mmi $aux_mmi \\\n--aux-mmi-weight $aux_mmi_weight \\\n--aux-mmi-dropout-rate $aux_mmi_dropout_rate \\\n--aux-mmi-type $aux_mmi_type \\\n--att-scorer-weight $att_scorer_weight \\\n\"\n\nif [ $aux_mbr == true ]; then\n    train_opts=\"$train_opts \\\n                --aux-mbr $aux_mbr \\\n                --aux-mbr-weight $aux_mbr_weight \\\n                --aux-mbr-beam $aux_mbr_beam \\\n                --transformer-lr $mbr_lr \\\n                --epochs $mbr_epochs \\\n                --transformer-warmup-steps $mbr_warmup \\\n                --resume $mbr_resume \\\n                --load-trainer-and-opt false \\\n                --save-interval-iters 1000 \\\n                \"\n    export OMP_NUM_THREADS=6 # for on-the-fly decoding\nfi\n\ndecode_opts=\\\n\"\\\n--search-type $search_type \\\n--mmi-weight $mmi_weight \\\n--mas-lookahead $mas_lookahead \\\n--beam-size $beam_size \\\n--ctc-weight $ctc_weight \\\n--ngram-weight $ngram_weight \\\n--word-ngram-weight $word_ngram_weight \\\n--word-ngram data/$word_ngram_tag \\\n--word-ngram-log-semiring ${word_ngram_log_semiring} \\\n--lm-weight $lm_weight \\\n\"\n\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 
'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\ntrain_set=train_sp\ntrain_dev=dev\n\nexpname=${train_set}_${backend}_${tag}\nexpdir=exp/${expname}\nmkdir -p ${expdir}\n\nfeat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}\nfeat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Network Training\"\n    \n    # make sure in jizhi config file: \"exec_start_in_all_mpi_pods\": true, \n    MASTER_PORT=$master_port\n    NCCL_DEBUG=TRACE python3 -m torch.distributed.launch \\\n        --nproc_per_node ${HOST_GPU_NUM} --master_port $MASTER_PORT \\\n        --nnodes=${HOST_NUM} --node_rank=${INDEX} --master_addr=${CHIEF_IP} \\\n        ${MAIN_ROOT}/bin/asr_train.py \\\n        --config ${train_config} \\\n        --preprocess-conf ${preprocess_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --outdir ${expdir}/results_RANK \\\n        --debugmode ${debugmode} \\\n        --dict ${dict} \\\n        --debugdir ${expdir} \\\n        --minibatches ${N} \\\n        --verbose ${verbose} \\\n        --resume ${resume} \\\n        --train-json ${feat_tr_dir}/split${ngpu}utt/data.RANK.json \\\n        --valid-json ${feat_dt_dir}/data.json \\\n        --lang $lang \\\n        --opt \"noam_sgd\" \\\n        --n-iter-processes 8 \\\n        --world-size $ngpu \\\n        --node-rank ${INDEX} \\\n        --node-size ${HOST_GPU_NUM} \\\n        $train_opts > ${expdir}/global_record.${INDEX}.txt 2>&1\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Decoding\"\n    nj=500\n    if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} etype) = custom ]] || \\\n           [[ $(get_yaml.py ${train_config} dtype) = custom ]]; then\n        
recog_model=model.last${idx_average}.avg.best\n        average_checkpoints.py --backend ${backend} \\\n         \t\t       --snapshots ${expdir}/results_0/snapshot.ep.* \\\n          \t\t       --out ${expdir}/results_0/${recog_model} \\\n         \t\t       --num ${idx_average}\n    fi\n\n    decode_parent_dir=decode_mmi${mmi_weight}_${word_ngram_tag}${word_ngram_weight}_lookahead${mas_lookahead}_beam${beam_size}_${idx_average}\n    for rtask in ${recog_set}; do\n        decode_dir=$decode_parent_dir/$rtask \n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n\n        # split data\n        splitjson.py --parts ${nj} ${feat_recog_dir}/data.json\n\n        #### use CPU for decoding\n        ngpu=0\n\n        # If use rnnlm, download the official ckpts and add:\n        # --rnnlm exp/train_rnnlm_pytorch_lm/official_ckpts/rnnlm.model.best \\\n        # --rnnlm-conf exp/train_rnnlm_pytorch_lm/official_ckpts/model.json \\\n\n        # If use character-level N-gram lm, train with kenlm and add:\n        # --ngram-model exp/train_ngram/${ngram_order}gram.bin \\\n\n        ${decode_cmd} JOB=1:${nj} ${expdir}/${decode_dir}/log/decode.JOB.log \\\n            asr_recog.py \\\n            --config ${decode_config} \\\n            --ngpu ${ngpu} \\\n            --backend ${backend} \\\n            --batchsize 0 \\\n            --recog-json ${feat_recog_dir}/split${nj}utt/data.JOB.json \\\n            --result-label ${expdir}/${decode_dir}/data.JOB.json \\\n            --model ${expdir}/results_0/${recog_model} \\\n            --local-rank JOB \\\n            $decode_opts\n\n        score_sclite.sh ${expdir}/${decode_dir} ${dict} \\\n          > ${expdir}/${decode_dir}/decode_result.txt\n     \n    done\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/aishell1/path.sh",
    "content": "# this is necessary since docker images would not run .bashrc if the command line is not \"bash\"\nsource ~/.bashrc # to include libfst.so\n\nMAIN_ROOT=$PWD/../../\nKALDI_ROOT=../../kaldi/ # Kaldi is local and is not available on jizhi task\n\nexport PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH\n[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 \"The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!\" && exit 1\n. $KALDI_ROOT/tools/config/common_path.sh\nexport LC_ALL=C\n\nexport PATH=$PWD/espnet_utils:$MAIN_ROOT/bin:$MAIN_ROOT:$PATH\n\nexport OMP_NUM_THREADS=1\n# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C\nexport PYTHONIOENCODING=UTF-8\nexport PATH=${KALDI_ROOT}/tools/sph2pipe:${KALDI_ROOT}/tools/kaldi_lm:${KALDI_ROOT}/tools/sctk/bin:${KALDI_ROOT}/tools/kenlm/build/bin:${KALDI_ROOT}/tools/srilm/bin/i686-m64/:${KALDI_ROOT}/tools/srilm/lm/bin/i686-m64/:$PATH\nexport PYTHONPATH=$MAIN_ROOT:$MAIN_ROOT/bin/:$MAIN_ROOT/../:$PWD:$PYTHONPATH # so espnet could be find like a library\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib64:${KALDI_ROOT}/tools/openfst/lib/\n# nvidia-smi -c 3\n"
  },
  {
    "path": "egs/aishell1/prepare.sh",
    "content": "#!/usr/bin/env bash\n\n# author: tyriontian\n# tyriontian@tencent.com\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=1         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\nresume=        # Resume the training from snapshot\n\n# feature configuration\ndo_delta=false\n\npreprocess_config=conf/specaug.yaml\ntrain_config=conf/train.yaml\nlm_config=conf/lm_rnn.yaml\ndecode_config=conf/decode.yaml\n\n# rnnlm related\nlm_resume=         # specify a snapshot file to resume LM training\nlmtag=             # tag for managing LMs\n\n# ngram\nngramtag=\nn_gram=4\n\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\nn_average=10\n\n# data\ndata=/data/asr_data/aishell/\ndata_url=www.openslr.org/resources/33\n\n# exp tag\ntag=\"\" # tag for managing experiments.\n\n. utils/parse_options.sh || exit 1;\n\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\ntrain_set=train_sp\ntrain_dev=dev\nrecog_set=\"dev test\"\n\nif [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then\n    echo \"stage -1: Data Download\"\n    local/download_and_untar.sh ${data} ${data_url} data_aishell\n    local/download_and_untar.sh ${data} ${data_url} resource_aishell\nfi\n\nif [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n    ### Task dependent. 
You have to make data the following preparation part by yourself.\n    ### But you can utilize Kaldi recipes in most cases\n    echo \"stage 0: Data preparation\"\n    local/aishell_data_prep.sh ${data}/data_aishell/wav ${data}/data_aishell/transcript\n    # remove space in text\n    for x in train dev test; do\n        cp data/${x}/text data/${x}/text_org\n        paste -d \" \" <(cut -f 1 -d\" \" data/${x}/text_org) <(cut -f 2- -d\" \" data/${x}/text_org | tr -d \" \") \\\n            > data/${x}/text\n    done\nfi\n\nfeat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}\nfeat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    ### Task dependent. You have to design training and dev sets by yourself.\n    ### But you can utilize Kaldi recipes in most cases\n    echo \"stage 1: Feature Generation\"\n    fbankdir=fbank\n    # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 30 --write_utt2num_frames true \\\n        data/train exp/make_fbank/train ${fbankdir}\n    utils/fix_data_dir.sh data/train\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 10 --write_utt2num_frames true \\\n        data/dev exp/make_fbank/dev ${fbankdir}\n    utils/fix_data_dir.sh data/dev\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 10 --write_utt2num_frames true \\\n        data/test exp/make_fbank/test ${fbankdir}\n    utils/fix_data_dir.sh data/test\n\n    # speed-perturbed\n    utils/perturb_data_dir_speed.sh 0.9 data/train data/temp1\n    utils/perturb_data_dir_speed.sh 1.0 data/train data/temp2\n    utils/perturb_data_dir_speed.sh 1.1 data/train data/temp3\n    utils/combine_data.sh --extra-files utt2uniq data/${train_set} data/temp1 data/temp2 data/temp3\n    rm -r data/temp1 data/temp2 data/temp3\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 30 
--write_utt2num_frames true \\\n        data/${train_set} exp/make_fbank/${train_set} ${fbankdir}\n    utils/fix_data_dir.sh data/${train_set}\n\n    # By tyriontian: Additionally you need to copy text_org from data/train to data_train_sp\n    # text_org in this script refer the transcriptions that are segmented into word level\n    # This is useful for MMI as our MMI criterion works in word level\n    python3 espnet_utils/build_sp_text.py data/train/text_org | sort -k 1 > data/${train_set}/text_org\n\n    # compute global CMVN\n    compute-cmvn-stats scp:data/${train_set}/feats.scp data/${train_set}/cmvn.ark\n\n    # dump features for training\n    split_dir=$(echo $PWD | awk -F \"/\" '{print $NF \"/\" $(NF-1)}')\n    if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_tr_dir}/storage ]; then\n    utils/create_split_dir.pl \\\n        /export/a{11,12,13,14}/${USER}/espnet-data/egs/${split_dir}/dump/${train_set}/delta${do_delta}/storage \\\n        ${feat_tr_dir}/storage\n    fi\n    if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_dt_dir}/storage ]; then\n    utils/create_split_dir.pl \\\n        /export/a{11,12,13,14}/${USER}/espnet-data/egs/${split_dir}/dump/${train_dev}/delta${do_delta}/storage \\\n        ${feat_dt_dir}/storage\n    fi\n    dump.sh --cmd \"$train_cmd\" --nj 32 --do_delta ${do_delta} \\\n        data/${train_set}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/train ${feat_tr_dir}\n    for rtask in ${recog_set}; do\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}; mkdir -p ${feat_recog_dir}\n        dump.sh --cmd \"$train_cmd\" --nj 10 --do_delta ${do_delta} \\\n            data/${rtask}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/recog/${rtask} \\\n            ${feat_recog_dir}\n    done\nfi\n\ndict=data/lang_1char/${train_set}_units.txt\necho \"dictionary: ${dict}\"\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    ### Task dependent. 
You have to check non-linguistic symbols used in the corpus.\n    echo \"stage 2: Dictionary and Json Data Preparation\"\n    mkdir -p data/lang_1char/\n\n    echo \"make a dictionary\"\n    echo \"<unk> 1\" > ${dict} # <unk> must be 1, 0 will be used for \"blank\" in CTC\n    text2token.py -s 1 -n 1 data/${train_set}/text | cut -f 2- -d\" \" | tr \" \" \"\\n\" \\\n    | sort | uniq | grep -v -e '^\\s*$' | awk '{print $0 \" \" NR+1}' >> ${dict}\n    wc -l ${dict}\n\n    echo \"make json files\"\n    data2json.sh --feat ${feat_tr_dir}/feats.scp \\\n                 --text-org data/${train_set}/text_org \\\n\t\t data/${train_set} ${dict} > ${feat_tr_dir}/data.json\n    for rtask in ${recog_set}; do\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n        data2json.sh --feat ${feat_recog_dir}/feats.scp \\\n                     --text-org data/${rtask}/text_org \\\n\t\t     data/${rtask} ${dict} > ${feat_recog_dir}/data.json\n    done\nfi\n\n# you can skip this and remove --rnnlm option in the recognition (stage 5)\nif [ -z ${lmtag} ]; then\n    lmtag=$(basename ${lm_config%.*})\nfi\nlmexpname=train_rnnlm_${backend}_${lmtag}\nlmexpdir=exp/${lmexpname}\nmkdir -p ${lmexpdir}\n\nngramexpname=train_ngram\nngramexpdir=exp/${ngramexpname}\nif [ -z ${ngramtag} ]; then\n    ngramtag=${n_gram}\nfi\nmkdir -p ${ngramexpdir}\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: LM Preparation\"\n    lmdatadir=data/local/lm_train\n    mkdir -p ${lmdatadir}\n    text2token.py -s 1 -n 1 data/train/text | cut -f 2- -d\" \" \\\n        > ${lmdatadir}/train.txt\n    text2token.py -s 1 -n 1 data/${train_dev}/text | cut -f 2- -d\" \" \\\n        > ${lmdatadir}/valid.txt\n\n    # NNLM. 
by default you do not need this\n    ${cuda_cmd} --gpu ${ngpu} ${lmexpdir}/train.log \\\n        lm_train.py \\\n        --config ${lm_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --verbose 1 \\\n        --outdir ${lmexpdir} \\\n        --tensorboard-dir tensorboard/${lmexpname} \\\n        --train-label ${lmdatadir}/train.txt \\\n        --valid-label ${lmdatadir}/valid.txt \\\n        --resume ${lm_resume} \\\n        --dict ${dict}\n\n    # prepare character-level N-gram LM. You need kenlm to run this  \n    # lmplz --discount_fallback -o ${n_gram} <${lmdatadir}/train.txt > ${ngramexpdir}/${n_gram}gram.arpa\n    # build_binary -s ${ngramexpdir}/${n_gram}gram.arpa ${ngramexpdir}/${n_gram}gram.bin\nfi\n\nlang=data/lang_phone\nif [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then\n  local/k2_aishell_prepare_dict.sh $data/resource_aishell data/local/dict_nosp\n  local/k2_prepare_lang.sh --position-dependent-phones false data/local/dict_nosp \\\n      \"<UNK>\" data/local/lang_tmp_nosp $lang || exit 1\n\n  # We also prepare Word-level N-gram LM; order = 3, 4\n  local/aishell_train_lms.sh\n\n  for order in 3 4 ; do\n      mkdir -p data/word_${order}gram\n      gunzip -c data/local/lm/${order}gram-mincount/lm_unpruned.gz \\\n        > data/word_${order}gram/lm.arpa\n\n      cp $lang/words.txt data/word_${order}gram/\n      cp $lang/oov.int data/word_${order}gram/\n\n      python3 -m kaldilm \\\n        --read-symbol-table=\"data/word_${order}gram/words.txt\" \\\n        --disambig-symbol='#0' \\\n        --max-order=$order \\\n        data/word_${order}gram/lm.arpa > data/word_${order}gram/G.fst.txt\n    \n    done\nfi\n\n# Prepare these word N-gram LMs for SPL response\n# (1) use different smooth method\n# (2) use jieba rather than the ground-truth transcription\nif [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then\n    # 3-gram LM with different smooth\n    for sm in -wbdiscount -kndiscount -ukndiscount -ndiscount; do\n        bash 
espnet_utils/train_lms_srilm.sh \\\n          --unk \"<UNK>\" --lm-opts $sm data/local/dict_nosp/lexicon.txt \\\n          data/local/train/text data/local/lm$sm  \n    done\n\n    # gtdiscount\n    bash espnet_utils/train_lms_srilm.sh \\\n          --unk \"<UNK>\" data/local/dict_nosp/lexicon.txt \\\n          data/local/train/text data/local/lm-gtdiscount\n\n    # word segmentation by jieba\n    python3 espnet_utils/jieba_build_dict.py $lang/words.txt $lang/jieba_dict.txt\n    python3 espnet_utils/text_norm.py --in-f data/train/text \\\n      --out-f data/local/train/text.jieba --segment\n    bash espnet_utils/train_lms_srilm.sh \\\n      --unk \"<UNK>\" data/local/dict_nosp/lexicon.txt \\\n      data/local/train/text.jieba data/local/lm-jieba\n\n    # build k2 directory\n    for tag in wbdiscount kndiscount ukndiscount ndiscount gtdiscount jieba; do\n        mkdir -p data/word_3gram_$tag; lmdir=data/word_3gram_$tag\n        gunzip -c data/local/lm-$tag/srilm/srilm.o3g.kn.gz \\\n          > $lmdir/lm.arpa\n\n        cp $lang/words.txt $lmdir\n        cp $lang/oov.int $lmdir\n\n        python3 -m kaldilm \\\n            --read-symbol-table=\"$lmdir/words.txt\" \\\n            --disambig-symbol='#0' \\\n            --max-order=3 \\\n            $lmdir/lm.arpa > $lmdir/G.fst.txt\n\n        python3 espnet/nets/scorers/word_ngram.py $lmdir\n    done\n    \nfi\n"
  },
  {
    "path": "egs/aishell2/.gitignore",
    "content": "dump\ndump32\ndump64\ndata\nexp\nfbank\nexp_without_segmentation\n_exp\n"
  },
  {
    "path": "egs/aishell2/aed.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=2         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\nresume=        # Resume the training from snapshot\n\n# feature configuration\ndo_delta=false\n\npreprocess_config=conf/specaug.yaml\ntrain_config=conf/tuning/train_pytorch_conformer_kernel31.yaml\nlm_config=conf/lm.yaml\ndecode_config=conf/decode.yaml\n\n# rnnlm related\nlm_resume=         # specify a snapshot file to resume LM training\nlmtag=             # tag for managing LMs\n\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\n\n# data dir, modify this to your AISHELL-2 data path\ntr_dir=/data/asr_data/aishell2/iOS/data\ndev_tst_dir=/data/asr_data/aishell2/AISHELL-DEV-TEST-SET\n\n# exp tag\n### Configurable parameters ###\ntag=\"8v100_lasmmictc_alpha03_ctc03_seg\"\nngpu=8\ndebug=false\n# Train config\nseed=888\nbatch_size=4\naccum_grad=16\nepochs=100\nuse_segment=true # if true, use word-level transcription in MMI criterion\nctc_type=\"k2mmi\" # k2mmi k2ctc builtin\nmtlalpha=0.3\nthird_weight=0.3\n\n# MBR training config\naux_mbr=false\naux_mbr_weight=1.0\naux_mbr_beam=4\nmbr_epochs=100\nmbr_lr=0.1\nmbr_warmup=2500\nmbr_resume=\n\n# Decode config\nidx_average=41_50\nmmi_weight=0.0 # MMI / phonectc joint decoding\nctc_weight=0.5 # char ctc joint decoding\nngram_weight=0.0\nngram_order=4\nword_ngram_weight=0.0\nword_ngram_tag=word_3gram_wbdiscount # 3 or 4 
gram\nword_ngram_log_semiring=true\nlm_weight=0.0\nmmi_rescore=false # or rescore\nbeam_size=10\nrecog_set=\"test_android test_ios test_mic\"\n\n. utils/parse_options.sh || exit 1;\n\nif [ $debug == true ]; then\n    export HOST_GPU_NUM=1\n    export HOST_NUM=1\n    export NODE_NUM=1\n    export INDEX=0\n    export CHIEF_IP=\"9.135.217.29\"\nfi\n\ntrain_opts=\\\n\"\\\n--seed $seed \\\n--batch-size $batch_size \\\n--accum-grad $accum_grad \\\n--epochs $epochs \\\n--use-segment $use_segment \\\n--ctc_type $ctc_type \\\n--mtlalpha $mtlalpha \\\n--third-weight $third_weight \\\n\"\n\nif [ $aux_mbr == true ]; then\n    train_opts=\"$train_opts \\\n                --aux-mbr $aux_mbr \\\n                --aux-mbr-weight $aux_mbr_weight \\\n                --aux-mbr-beam $aux_mbr_beam \\\n                --transformer-lr $mbr_lr \\\n                --epochs $mbr_epochs \\\n                --transformer-warmup-steps $mbr_warmup \\\n                --resume $mbr_resume \\\n                --load-trainer-and-opt false \\\n                --save-interval-iters 1000 \\\n                \"\n    export OMP_NUM_THREADS=6 # for on-the-fly decoding\nfi\n\ndecode_opts=\\\n\"\\\n--mmi-weight $mmi_weight \\\n--mmi-rescore $mmi_rescore \\\n--beam-size $beam_size \\\n--ctc-weight $ctc_weight \\\n--ngram-weight $ngram_weight \\\n--word-ngram-weight $word_ngram_weight \\\n--word-ngram data/${word_ngram_tag} \\\n--word-ngram-log-semiring $word_ngram_log_semiring \\\n--lm-weight $lm_weight \\\n\"\ndict=data/lang_1char/train_sp_units.txt\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 
'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\ntrain_set=train_sp\ntrain_dev=dev_ios\n\nexpname=${train_set}_${backend}_${tag}\nexpdir=exp/${expname}\nmkdir -p ${expdir}\n\nlang=data/lang_phone\nfeat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}\nfeat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Network Training\"\n    MASTER_PORT=22277\n    NCCL_DEBUG=TRACE python3 -m torch.distributed.launch \\\n        --nproc_per_node ${HOST_GPU_NUM} --master_port $MASTER_PORT \\\n        --nnodes=${HOST_NUM} --node_rank=${INDEX} --master_addr=${CHIEF_IP} \\\n        ${MAIN_ROOT}/bin/asr_train.py \\\n        --config ${train_config} \\\n        --preprocess-conf ${preprocess_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --outdir ${expdir}/results_RANK \\\n        --debugmode ${debugmode} \\\n        --dict ${dict} \\\n        --debugdir ${expdir} \\\n        --minibatches ${N} \\\n        --verbose ${verbose} \\\n        --resume ${resume} \\\n        --train-json ${feat_tr_dir}/split${ngpu}utt/data_noeng.RANK.json \\\n        --valid-json ${feat_dt_dir}/data.json \\\n        --lang $lang \\\n        --opt \"noam_sgd\" \\\n        --n-iter-processes 8 \\\n        --world-size $ngpu \\\n        --node-rank ${INDEX} \\\n        --node-size ${HOST_GPU_NUM} \\\n        $train_opts > ${expdir}/global_record.${INDEX}.txt 2>&1\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Decoding\"\n    nj=1000\n    recog_model=model.last${idx_average}.avg.best\n    if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} etype) = custom ]] || \\\n           [[ $(get_yaml.py ${train_config} dtype) = custom ]]; 
then\n\trecog_model=model.last${idx_average}.avg.best\n\taverage_checkpoints.py --backend ${backend} \\\n        \t               --snapshots ${expdir}/results_0/snapshot.ep.* \\\n\t\t\t       --out ${expdir}/results_0/${recog_model} \\\n\t\t\t       --num ${idx_average}\n    fi\n\n    decode_parent_dir=decode_mmi${mmi_weight}_${word_ngram_tag}${word_ngram_weight}_ctc${ctc_weight}_ep${idx_average}_beam${beam_size}\n    for rtask in ${recog_set}; do\n        decode_dir=$decode_parent_dir/$rtask\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n\n        # split data\n        splitjson.py --parts ${nj} ${feat_recog_dir}/data.json\n\n        #### use CPU for decoding\n        ngpu=0\n        ${decode_cmd} JOB=1:$nj ${expdir}/${decode_dir}/log/decode.JOB.log \\\n            python3 ${MAIN_ROOT}/bin/asr_recog.py \\\n            --config ${decode_config} \\\n            --ngpu ${ngpu} \\\n            --backend ${backend} \\\n            --batchsize 0 \\\n            --recog-json ${feat_recog_dir}/split${nj}utt/data.JOB.json \\\n            --result-label ${expdir}/${decode_dir}/data.JOB.json \\\n            --model ${expdir}/results_0/${recog_model}  \\\n            --ngram-model exp/train_ngram/${ngram_order}gram.bin \\\n            --rnnlm exp/train_rnnlm_pytorch_lm_transformer/results/rnnlm.model.best \\\n            --rnnlm-conf exp/train_rnnlm_pytorch_lm_transformer/results/model.json \\\n            --api v2 \\\n            --local-rank JOB $decode_opts  \n\n        score_sclite.sh ${expdir}/${decode_dir} ${dict} \\\n          > ${expdir}/${decode_dir}/decode_result.txt\n\n    done\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/aishell2/conf/fbank.conf",
    "content": "--sample-frequency=16000 \n--num-mel-bins=80\n"
  },
  {
    "path": "egs/aishell2/conf/gpu.conf",
    "content": "# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l 'hostname=b1[12345678]*|c*,gpu=$0' -q g.q"
  },
  {
    "path": "egs/aishell2/conf/lm.yaml",
    "content": "# rnnlm related\nlayer: 2\nunit: 650\nopt: sgd        # or adam\nbatchsize: 64   # batch size in LM training\nepoch: 20      # if the data size is large, we can reduce this\npatience: 3\nmaxlen: 100     # if sentence length > lm_maxlen, lm_batchsize is automatically reduced\n"
  },
  {
    "path": "egs/aishell2/conf/lm_rnn.yaml",
    "content": "lm.yaml"
  },
  {
    "path": "egs/aishell2/conf/lm_transformer.yaml",
    "content": "# This Transformer LM setting w/ 4 GPUs took around 60 days for 50 epochs.\n# However, you can get better results in 6 days for 5 epochs (WER: 2.2/5.4/2.6/5.7)\n# than LSTM LM (WER: 2.6/5.6/2.6/5.7) in 60 days for 20 epochs\n# If you do not have 4 GPUs, try accum-grad=4.\n\n# network architecture\nmodel-module: transformer\natt-unit: 512\nembed-unit: 128\nhead: 8\nlayer: 16\npos-enc: none\nunit: 2048\n\n# minibatch related\nbatchsize: 32\nmaxlen: 40\n\n# optimization related\nopt: adam\nschedulers: lr=cosine\ndropout-rate: 0.0\nepoch: 50\ngradclip: 1.0\nlr: 1e-4\nlr-cosine-total: 100000\nlr-cosine-warmup: 1000\npatience: 0\nsortagrad: 0\n"
  },
  {
    "path": "egs/aishell2/conf/pitch.conf",
    "content": "--sample-frequency=16000\n"
  },
  {
    "path": "egs/aishell2/conf/queue.conf",
    "content": "# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l gpu=$0 -q g.q\n"
  },
  {
    "path": "egs/aishell2/conf/slurm.conf",
    "content": "# Default configuration\ncommand sbatch --export=PATH\noption name=* --job-name $0\noption time=* --time $0\noption mem=* --mem-per-cpu $0\noption mem=0\noption num_threads=* --cpus-per-task $0\noption num_threads=1 --cpus-per-task 1\noption num_nodes=* --nodes $0\ndefault gpu=0\noption gpu=0 -p cpu\noption gpu=* -p gpu --gres=gpu:$0 -c $0  # Recommend allocating at least as many CPUs as GPUs\n# note: the --max-jobs-run option is supported as a special case\n# by slurm.pl and you don't have to handle it in the config file.\n"
  },
  {
    "path": "egs/aishell2/conf/specaug.yaml",
    "content": "process:\n  # these three processes together are known as SpecAugment\n  - type: \"time_warp\"\n    max_time_warp: 5\n    inplace: true\n    mode: \"PIL\"\n  - type: \"freq_mask\"\n    F: 30\n    n_mask: 2\n    inplace: true\n    replace_with_zero: false\n  - type: \"time_mask\"\n    T: 40\n    n_mask: 2\n    inplace: true\n    replace_with_zero: false\n"
  },
  {
    "path": "egs/aishell2/conf/specaug_test.yaml",
    "content": "process:\n  # these three processes together are known as SpecAugment\n  - type: \"time_warp\"\n    max_time_warp: 0\n    inplace: true\n    mode: \"PIL\"\n  - type: \"freq_mask\"\n    F: 30\n    n_mask: 2\n    inplace: true\n    replace_with_zero: true\n  - type: \"time_mask\"\n    T: 40\n    n_mask: 2\n    inplace: true\n    replace_with_zero: true\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/decode_pytorch_transformer.yaml",
    "content": "batchsize: 0\nbeam-size: 10\npenalty: 0.0\nmaxlenratio: 0.0\nminlenratio: 0.0\nctc-weight: 0.5\nlm-weight: 0.0\nngram-weight: 0.3\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/decode_rnn.yaml",
    "content": "beam-size: 20\npenalty: 0.0\nmaxlenratio: 0.0\nminlenratio: 0.0\nctc-weight: 0.6\nlm-weight: 0.3\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/train_pytorch_conformer_kernel15.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nrel-pos-type: latest\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 15\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/train_pytorch_conformer_kernel31.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/train_pytorch_transformer.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/train_rnn.yaml",
    "content": "# network architecture\n# encoder related\netype: vggblstm     # encoder architecture type\nelayers: 3\neunits: 1024\neprojs: 1024\nsubsample: \"1_2_2_1_1\" # skip every n frame from input to nth layers\n# decoder related\ndlayers: 2\ndunits: 1024\n# attention related\natype: location\nadim: 1024\naconv-chans: 10\naconv-filts: 100\n\n# hybrid CTC/attention\nmtlalpha: 0.5\n\n# minibatch related\nbatch-size: 30\nmaxlen-in: 800  # if input length  > maxlen_in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen_out, batchsize is automatically reduced\n\n# optimization related\nopt: adadelta\nepochs: 10\npatience: 0\n\n# scheduled sampling option\nsampling-probability: 0.0\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/transducer/decode_default.yaml",
    "content": "# decoding parameters\nbatch: 0\nbeam-size: 10\nsearch-type: default\nscore-norm: True\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/transducer/train_conformer-rnn_transducer.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\naccum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\n#aux-ctc: True\n#aux-ctc-weight: 0.5\n#aux-ctc-dropout-rate: 0.1\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/transducer/train_conformer-rnn_transducer_ngpu4.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: False\naux-ctc-weight: 0.0\naux-ctc-dropout-rate: 0.0\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/transducer/train_transducer.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: adadelta\nepochs: 30\npatience: 3\naccum-grad: 2\n\n# network architecture\n## encoder related\netype: vggblstm\nelayers: 6\neunits: 512\neprojs: 512\ndropout-rate: 0.4\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n"
  },
  {
    "path": "egs/aishell2/conf/tuning/transducer/train_transducer_aux.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: adadelta\nepochs: 30\npatience: 3\naccum-grad: 2\n\n# network architecture\n## encoder related\netype: vggblstm\nelayers: 6\neunits: 512\neprojs: 512\ndropout-rate: 0.4\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: True\naux-ctc-weight: 0.1\naux-ctc-dropout-rate: 0.1\n"
  },
  {
    "path": "egs/aishell2/local/add_lex_disambig.pl",
    "content": "#!/usr/bin/env perl\n#  Copyright 2010-2011  Microsoft Corporation\n#            2013-2016  Johns Hopkins University (author: Daniel Povey)\n#                 2015  Hainan Xu\n#                 2015  Guoguo Chen\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Adds disambiguation symbols to a lexicon.\n# Outputs still in the normal lexicon format.\n# Disambig syms are numbered #1, #2, #3, etc. (#0\n# reserved for symbol in grammar).\n# Outputs the number of disambig syms to the standard output.\n# With the --pron-probs option, expects the second field\n# of each lexicon line to be a pron-prob.\n# With the --sil-probs option, expects three additional\n# fields after the pron-prob, representing various components\n# of the silence probability model.\n\n$pron_probs = 0;\n$sil_probs = 0;\n$first_allowed_disambig = 1;\n\nfor ($n = 1; $n <= 3 && @ARGV > 0; $n++) {\n  if ($ARGV[0] eq \"--pron-probs\") {\n    $pron_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--sil-probs\") {\n    $sil_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--first-allowed-disambig\") {\n    $first_allowed_disambig = 0 + $ARGV[1];\n    if ($first_allowed_disambig < 1) {\n      die \"add_lex_disambig.pl: invalid --first-allowed-disambig option: $first_allowed_disambig\\n\";\n    }\n    shift @ARGV;\n    shift @ARGV;\n  }\n}\n\nif (@ARGV != 2) {\n  die \"Usage: add_lex_disambig.pl [opts] <lexicon-in> 
<lexicon-out>\\n\" .\n    \"This script adds disambiguation symbols to a lexicon in order to\\n\" .\n    \"make decoding graphs determinizable; it adds pseudo-phone\\n\" .\n    \"disambiguation symbols #1, #2 and so on at the ends of phones\\n\" .\n    \"to ensure that all pronunciations are different, and that none\\n\" .\n    \"is a prefix of another.\\n\" .\n    \"It prints to the standard output the number of the largest-numbered\\n\" .\n    \"disambiguation symbol that was used.\\n\" .\n    \"\\n\" .\n    \"Options:   --pron-probs       Expect pronunciation probabilities in the 2nd field\\n\" .\n    \"           --sil-probs        [should be with --pron-probs option]\\n\" .\n    \"                              Expect 3 extra fields after the pron-probs, for aspects of\\n\" .\n    \"                              the silence probability model\\n\" .\n    \"           --first-allowed-disambig <n>  The number of the first disambiguation symbol\\n\" .\n    \"                              that this script is allowed to add.  
By default this is\\n\" .\n    \"                              #1, but you can set this to a larger value using this option.\\n\" .\n    \"e.g.:\\n\" .\n    \" add_lex_disambig.pl lexicon.txt lexicon_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs lexiconp.txt lexiconp_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs --sil-probs lexiconp_silprob.txt lexiconp_silprob_disambig.txt\\n\";\n}\n\n\n$lexfn = shift @ARGV;\n$lexoutfn = shift @ARGV;\n\nopen(L, \"<$lexfn\") || die \"Error opening lexicon $lexfn\";\n\n# (1)  Read in the lexicon.\n@L = ( );\nwhile(<L>) {\n    @A = split(\" \", $_);\n    push @L, join(\" \", @A);\n}\n\n# (2) Work out the count of each phone-sequence in the\n# lexicon.\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) {\n      $p = shift @A;\n      if (!($p > 0.0 && $p <= 1.0)) { die \"Bad lexicon line $l (expecting pron-prob as second field)\"; }\n    }\n    if ($sil_probs) {\n      $silp = shift @A;\n      if (!($silp > 0.0 && $silp <= 1.0)) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n    }\n    if (!(@A)) {\n      die \"Bad lexicon line $l, no phone in phone list\";\n    }\n    $count{join(\" \",@A)}++;\n}\n\n# (3) For each left sub-sequence of each phone-sequence, note down\n# that it exists (for identifying prefixes of longer strings).\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) { shift @A; } # remove pron-prob.\n    if ($sil_probs) {\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob; there are three numbers for sil_probs\n    }\n    while(@A > 0) {\n        pop @A;  # Remove last phone\n        $issubseq{join(\" \",@A)} = 1;\n    }\n}\n\n# (4) For 
each entry in the lexicon:\n#  if the phone sequence is unique and is not a\n#  prefix of another word, no disambig symbol.\n#  Else output #1, or #2, #3, ... if the same phone-seq\n#  has already been assigned a disambig symbol.\n\n\nopen(O, \">$lexoutfn\") || die \"Error opening lexicon file $lexoutfn for writing.\\n\";\n\n# max_disambig will always be the highest-numbered disambiguation symbol that\n# has been used so far.\n$max_disambig = $first_allowed_disambig - 1;\n\nforeach $l (@L) {\n  @A = split(\" \", $l);\n  $word = shift @A;\n  if ($pron_probs) {\n    $pron_prob = shift @A;\n  }\n  if ($sil_probs) {\n    $sil_word_prob = shift @A;\n    $word_sil_correction = shift @A;\n    $prev_nonsil_correction = shift @A;\n  }\n  $phnseq = join(\" \", @A);\n  if (!defined $issubseq{$phnseq}\n      && $count{$phnseq} == 1) {\n    ;                           # Do nothing.\n  } else {\n    if ($phnseq eq \"\") {        # need disambig symbols for the empty string\n      # that are not used anywhere else.\n      $max_disambig++;\n      $reserved_for_the_empty_string{$max_disambig} = 1;\n      $phnseq = \"#$max_disambig\";\n    } else {\n      $cur_disambig = $last_used_disambig_symbol_of{$phnseq};\n      if (!defined $cur_disambig) {\n        $cur_disambig = $first_allowed_disambig;\n      } else {\n        $cur_disambig++;           # Get a number that has not been used yet for\n                                   # this phone sequence.\n      }\n      while (defined $reserved_for_the_empty_string{$cur_disambig}) {\n        $cur_disambig++;\n      }\n      if ($cur_disambig > $max_disambig) {\n        $max_disambig = $cur_disambig;\n      }\n      $last_used_disambig_symbol_of{$phnseq} = $cur_disambig;\n      $phnseq = $phnseq . \" #\" . 
$cur_disambig;\n    }\n  }\n  if ($pron_probs) {\n    if ($sil_probs) {\n      print O \"$word\\t$pron_prob\\t$sil_word_prob\\t$word_sil_correction\\t$prev_nonsil_correction\\t$phnseq\\n\";\n    } else {\n      print O \"$word\\t$pron_prob\\t$phnseq\\n\";\n    }\n  } else {\n    print O \"$word\\t$phnseq\\n\";\n  }\n}\n\nprint $max_disambig . \"\\n\";\n"
  },
  {
    "path": "egs/aishell2/local/apply_map.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\n# This program is a bit like ./sym2int.pl in that it applies a map\n# to things in a file, but it's a bit more general in that it doesn't\n# assume the things being mapped to are single tokens, they could\n# be sequences of tokens.  See the usage message.\n\n\n$permissive = 0;\n\nfor ($x = 0; $x <= 2; $x++) {\n\n  if (@ARGV > 0 && $ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 1:10 as a courtesy (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n\n  if (@ARGV > 0 && $ARGV[0] eq '--permissive') {\n    shift @ARGV;\n    # Mapping is optional (missing key is printed to output)\n    $permissive = 1;\n  }\n}\n\nif(@ARGV != 1) {\n  print STDERR \"Invalid usage: \" . join(\" \", @ARGV) . 
\"\\n\";\n  print STDERR <<'EOF';\nUsage: apply_map.pl [options] map <input >output\n options: [-f <field-range> ] [--permissive]\n   This applies a map to some specified fields of some input text:\n   For each line in the map file: the first field is the thing we\n   map from, and the remaining fields are the sequence we map it to.\n   The -f (field-range) option says which fields of the input file the map\n   should apply to.\n   If the --permissive option is supplied, fields which are not present\n   in the map will be left as they were.\n Applies the map 'map' to all input text, where each line of the map\n is interpreted as a map from the first field to the list of the other fields\n Note: <field-range> can look like 4-5, or 4-, or 5-, or 1, it means the field\n range in the input to apply the map to.\n e.g.: echo A B | apply_map.pl a.txt\n where a.txt is:\n A a1 a2\n B b\n will produce:\n a1 a2 b\nEOF\n  exit(1);\n}\n\n($map_file) = @ARGV;\nopen(M, \"<$map_file\") || die \"Error opening map file $map_file: $!\";\n\nwhile (<M>) {\n  @A = split(\" \", $_);\n  @A >= 1 || die \"apply_map.pl: empty line.\";\n  $i = shift @A;\n  $o = join(\" \", @A);\n  $map{$i} = $o;\n}\n\nwhile(<STDIN>) {\n  @A = split(\" \", $_);\n  for ($x = 0; $x < @A; $x++) {\n    if ( (!defined $field_begin || $x >= $field_begin)\n         && (!defined $field_end || $x <= $field_end)) {\n      $a = $A[$x];\n      if (!defined $map{$a}) {\n        if (!$permissive) {\n          die \"apply_map.pl: undefined key $a in $map_file\\n\";\n        } else {\n          print STDERR \"apply_map.pl: warning! missing key $a in $map_file\\n\";\n        }\n      } else {\n        $A[$x] = $map{$a};\n      }\n    }\n  }\n  print join(\" \", @A) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/aishell2/local/fstaddselfloops.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2020 Xiaomi Corporation (Author: Junbo Zhang)\n# Apache 2.0\n\nuse strict;\nuse warnings;\n\nmy $Usage = <<EOU;\nfstaddselfloops.pl:\nAdds self-loops to states of an FST to propagate disambiguation symbols through it.\nThey are added on each final state and each state with non-epsilon output symbols\non at least one arc out of the state. \n\nUsage: local/fstaddselfloops.pl <wdisambig_phone> <wdisambig_word> < <openfst_text>\n e.g.: cat L_disambig.txt | local/fstaddselfloops.pl 347 200004 > L_disambig_with_loop.txt\nEOU\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\nmy $wdisambig_phone = shift @ARGV;\nmy $wdisambig_word = shift @ARGV;\n\nmy %states_needs_self_loops;\nwhile (<>) {\n    print $_;\n\n    my @items = split(/\\s+/);\n    if (@items == 2) {\n        # it is a final state\n        $states_needs_self_loops{$items[0]} = 1;\n    } elsif (@items == 5) {\n        my ($src, $dst, $inlabel, $outlabel, $score) = @items;\n        $states_needs_self_loops{$src} = 1 if ($outlabel != 0);\n    } else {\n        die \"Invalid openfst line.\";\n    }\n}\n\nforeach (keys %states_needs_self_loops) {\n    print \"$_ $_ $wdisambig_phone $wdisambig_word 0.0\\n\"\n}\n"
  },
  {
    "path": "egs/aishell2/local/jieba_split_text.py",
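    "content": "import sys\n\nimport jieba\n\n# Usage: jieba_split_text.py <src-text> <dst-text> <word-dict>\nsrc_file = sys.argv[1]\ndst_file = sys.argv[2]\ndict_file = sys.argv[3]\n# Segment with the given word list instead of jieba's default dictionary.\njieba.set_dictionary(dict_file)\n\n# Vocabulary, used only by the optional OOV report below.\nword_dict = {}\nwith open(dict_file) as f:\n    for line in f:\n        w = line.strip().split()[0]\n        word_dict[w] = 0\n\n# Each input line is \"<uttid> <text>\"; the context managers close both files.\nwith open(src_file, 'r') as reader, open(dst_file, 'w') as writer:\n    for i, line in enumerate(reader):\n        elems = line.strip().split()\n        uttid, ctx = elems[0], elems[1:]\n        ctx = \" \".join(ctx)\n        # HMM=False disables jieba's HMM-based discovery of out-of-dictionary words.\n        ctx = jieba.lcut(ctx, HMM=False)\n        #for x in ctx:\n        #    if x not in word_dict:\n        #        print(i, x, flush=True)\n        ctx = \" \".join(ctx)\n        writer.write(f\"{uttid} {ctx}\\n\")\n"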
  },
  {
    "path": "egs/aishell2/local/k2_prepare_lang.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey);\n#                      Arnab Ghoshal\n#                2014  Guoguo Chen\n#                2015  Hainan Xu\n#                2016  FAU Erlangen (Author: Axel Horndasch)\n#                2020  Xiaomi Corporation (Author: Junbo Zhang)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script prepares a directory such as data/lang/, in the standard format,\n# given a source directory containing a dictionary lexicon.txt in a form like:\n# word phone1 phone2 ... phoneN\n# per line (alternate prons would be separate lines), or a dictionary with probabilities\n# called lexiconp.txt in a form:\n# word pron-prob phone1 phone2 ... phoneN\n# (with 0.0 < pron-prob <= 1.0); note: if lexiconp.txt exists, we use it even if\n# lexicon.txt exists.\n# and also files silence_phones.txt, nonsilence_phones.txt, optional_silence.txt\n# and extra_questions.txt\n# Here, silence_phones.txt and nonsilence_phones.txt are lists of silence and\n# non-silence phones respectively (where silence includes various kinds of\n# noise, laugh, cough, filled pauses etc., and nonsilence phones includes the\n# \"real\" phones.)\n# In each line of those files is a list of phones, and the phones on each line\n# are assumed to correspond to the same \"base phone\", i.e. 
they will be\n# different stress or tone variations of the same basic phone.\n# The file \"optional_silence.txt\" contains just a single phone (typically SIL)\n# which is used for optional silence in the lexicon.\n# extra_questions.txt might be empty; typically will consist of lists of phones,\n# all members of each list with the same stress or tone; and also possibly a\n# list for the silence phones.  This will augment the automatically generated\n# questions (note: the automatically generated ones will treat all the\n# stress/tone versions of a phone the same, so will not \"get to ask\" about\n# stress or tone).\n#\n\n# This script adds word-position-dependent phones and constructs a host of other\n# derived files, that go in data/lang/.\n\n# Begin configuration section.\nnum_sil_states=5\nnum_nonsil_states=3\nposition_dependent_phones=true\n# position_dependent_phones is false also when position dependent phones and word_boundary.txt\n# have been generated by another source\nshare_silence_phones=false  # if true, then share pdfs of different silence\n                            # phones together.\nsil_prob=0.5\nnum_extra_phone_disambig_syms=1 # Standard one phone disambiguation symbol is used for optional silence.\n                                # Increasing this number does not harm, but is only useful if you later\n                                # want to introduce this labels to L_disambig.fst\n\n\n# end configuration sections\n\necho \"$0 $@\"  # Print the command line for logging\necho $sil_prob\n. 
local/parse_options.sh\necho $sil_prob\nif [ $# -ne 4 ]; then\n  echo \"Usage: local/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>\"\n  echo \"e.g.: local/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang\"\n  echo \"<dict-src-dir> should contain the following files:\"\n  echo \" extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt\"\n  echo \"See http://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating for more info.\"\n  echo \"options: \"\n  echo \"<dict-src-dir> may also, for the grammar-decoding case (see http://kaldi-asr.org/doc/grammar.html)\"\n  echo \"contain a file nonterminals.txt containing symbols like #nonterm:contact_list, one per line.\"\n  echo \"     --num-sil-states <number of states>             # default: 5, #states in silence models.\"\n  echo \"     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.\"\n  echo \"     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I\"\n  echo \"                                                     # markers on phones to indicate word-internal positions. \"\n  echo \"     --share-silence-phones (true|false)             # default: false; if true, share pdfs of \"\n  echo \"                                                     # all silence phones. \"\n  echo \"     --sil-prob <probability of silence>             # default: 0.5 [must have 0 <= silprob < 1]\"\n  exit 1;\nfi\n\nsrcdir=$1\noov_word=$2\ntmpdir=$3\ndir=$4\n\n\nif [ -d $dir/phones ]; then\n  rm -r $dir/phones\nfi\nmkdir -p $dir $tmpdir $dir/phones\n\nsilprob=false\n[ -f $srcdir/lexiconp_silprob.txt ] && silprob=true\n\n[ -f path.sh ] && . ./path.sh\n\nif [[ ! -f $srcdir/lexicon.txt ]]; then\n  echo \"**Creating $srcdir/lexicon.txt from $srcdir/lexiconp.txt\"\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' < $srcdir/lexiconp.txt > $srcdir/lexicon.txt || exit 1;\nfi\nif [[ ! 
-f $srcdir/lexiconp.txt ]]; then\n  echo \"**Creating $srcdir/lexiconp.txt from $srcdir/lexicon.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1.0\\t$2/;' < $srcdir/lexicon.txt > $srcdir/lexiconp.txt || exit 1;\nfi\n\nif [ ! -z \"$unk_fst\" ] && [ ! -f \"$unk_fst\" ]; then\n  echo \"$0: expected --unk-fst $unk_fst to exist as a file\"\n  exit 1\nfi\n\nif $position_dependent_phones; then\n  # Create $tmpdir/lexiconp.txt from $srcdir/lexiconp.txt (or\n  # $tmpdir/lexiconp_silprob.txt from $srcdir/lexiconp_silprob.txt) by\n  # adding the markers _B, _E, _S, _I depending on word position.\n  # In this recipe, these markers apply to silence also.\n  # Do this starting from lexiconp.txt only.\n  if \"$silprob\"; then\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; $silword_p = shift @A;\n              $wordsil_f = shift @A; $wordnonsil_f = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_S\\n\"; }\n         else { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n                < $srcdir/lexiconp_silprob.txt > $tmpdir/lexiconp_silprob.txt\n  else\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $A[0]_S\\n\"; } else { print \"$w $p $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n         < $srcdir/lexiconp.txt > $tmpdir/lexiconp.txt || exit 1;\n  fi\n\n  # create $tmpdir/phone_map.txt\n  # this has the format (on each line)\n  # <original phone> <version 1 of original phone> <version 2> ...\n  # where the versions depend on the position of the phone within a word.\n  # For instance, we'd have:\n  # AA AA_B AA_E AA_I AA_S\n  # for (B)egin, (E)nd, (I)nternal and (S)ingleton\n  # and in the case of silence\n  # SIL SIL SIL_B SIL_E SIL_I SIL_S\n  # [because SIL on its own is one of the variants; this 
is for when it doesn't\n  #  occur inside a word but as an option in the lexicon.]\n\n  # This phone map expands the phone lists into all the word-position-dependent\n  # versions of the phone lists.\n  cat <(set -f; for x in `cat $srcdir/silence_phones.txt`; do for y in \"\" \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    <(set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do for y in \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    > $tmpdir/phone_map.txt\nelse\n  if \"$silprob\"; then\n    cp $srcdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob.txt\n  else\n    cp $srcdir/lexiconp.txt $tmpdir/lexiconp.txt\n  fi\n\n  cat $srcdir/silence_phones.txt $srcdir/nonsilence_phones.txt | \\\n    awk '{for(n=1;n<=NF;n++) print $n; }' > $tmpdir/phones\n  paste -d' ' $tmpdir/phones $tmpdir/phones > $tmpdir/phone_map.txt\nfi\n\n\n# Making monophone systems.\ncat $srcdir/silence_phones.txt | local/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/silence.txt\ncat $srcdir/nonsilence_phones.txt | local/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/nonsilence.txt\ncp $srcdir/optional_silence.txt $dir/phones/optional_silence.txt\n\n# if extra_questions.txt is empty, it's OK.\ncat $srcdir/extra_questions.txt 2>/dev/null | local/apply_map.pl $tmpdir/phone_map.txt \\\n  >$dir/phones/extra_questions.txt\n\n# Want extra questions about the word-start/word-end stuff. Make it separate for\n# silence and non-silence. 
Probably doesn't matter, as silence will rarely\n# be inside a word.\nif $position_dependent_phones; then\n  for suffix in _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\n  for suffix in \"\" _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/silence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\nfi\n\n# add_lex_disambig.pl is responsible for adding disambiguation symbols to\n# the lexicon, for telling us how many disambiguation symbols it used,\n# and also for modifying the unknown-word's pronunciation (if the\n# --unk-fst was provided) to the sequence \"#1 #2 #3\", and reserving those\n# disambig symbols for that purpose.\n# The #2 will later be replaced with the actual unk model.  The reason\n# for the #1 and the #3 is for disambiguation and also to keep the\n# FST compact.  If we didn't have the #1, we might have a different copy of\n# the unk-model FST, or at least some of its arcs, for each start-state from\n# which an <unk> transition comes (instead of per end-state, which is more compact);\n# and adding the #3 prevents us from potentially having 2 copies of the unk-model\n# FST due to the optional-silence [the last phone of any word gets 2 arcs].\nif [ ! 
-z \"$unk_fst\" ]; then  # if the --unk-fst option was provided...\n  if \"$silprob\"; then\n    local/lang/internal/modify_unk_pron.py $tmpdir/lexiconp_silprob.txt \"$oov_word\" || exit 1\n  else\n    local/lang/internal/modify_unk_pron.py $tmpdir/lexiconp.txt \"$oov_word\" || exit 1\n  fi\n  unk_opt=\"--first-allowed-disambig 4\"\nelse\n  unk_opt=\nfi\n\nif \"$silprob\"; then\n  ndisambig=$(local/add_lex_disambig.pl $unk_opt --pron-probs --sil-probs $tmpdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob_disambig.txt)\nelse\n  ndisambig=$(local/add_lex_disambig.pl $unk_opt --pron-probs $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt)\nfi\nndisambig=$[$ndisambig+$num_extra_phone_disambig_syms]; # add (at least) one disambig symbol for silence in lexicon FST.\necho $ndisambig > $tmpdir/lex_ndisambig\n\n# Format of lexiconp_disambig.txt:\n# !SIL\t1.0   SIL_S\n# <SPOKEN_NOISE>\t1.0   SPN_S #1\n# <UNK>\t1.0  SPN_S #2\n# <NOISE>\t1.0  NSN_S\n# !EXCLAMATION-POINT\t1.0  EH2_B K_I S_I K_I L_I AH0_I M_I EY1_I SH_I AH0_I N_I P_I OY2_I N_I T_E\n\n( for n in `seq 0 $ndisambig`; do echo '#'$n; done ) >$dir/phones/disambig.txt\n\n# Create phone symbol table.\necho \"<eps>\" | cat - $dir/phones/{silence,nonsilence,disambig}.txt | \\\n  awk '{n=NR-1; print $1, n;}' > $dir/phones.txt\n\n# Create a file that describes the word-boundary information for\n# each phone.  
5 categories.\nif $position_dependent_phones; then\n  cat $dir/phones/{silence,nonsilence}.txt | \\\n    awk '/_I$/{print $1, \"internal\"; next;} /_B$/{print $1, \"begin\"; next; }\n         /_S$/{print $1, \"singleton\"; next;} /_E$/{print $1, \"end\"; next; }\n         {print $1, \"nonword\";} ' > $dir/phones/word_boundary.txt\nelse\n  # word_boundary.txt might have been generated by another source\n  [ -f $srcdir/word_boundary.txt ] && cp $srcdir/word_boundary.txt $dir/phones/word_boundary.txt\nfi\n\n# Create word symbol table.\n# <s> and </s> are only needed due to the need to rescore lattices with\n# ConstArpaLm format language model. They do not normally appear in G.fst or\n# L.fst.\n\nif \"$silprob\"; then\n  # remove the silprob\n  cat $tmpdir/lexiconp_silprob.txt |\\\n    awk '{\n      for(i=1; i<=NF; i++) {\n        if(i!=3 && i!=4 && i!=5) printf(\"%s\\t\", $i); if(i==NF) print \"\";\n      }\n    }' > $tmpdir/lexiconp.txt\nfi\n\ncat $tmpdir/lexiconp.txt | awk '{print $1}' | sort | uniq  | awk '\n  BEGIN {\n    print \"<eps> 0\";\n  }\n  {\n    if ($1 == \"<s>\") {\n      print \"<s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    if ($1 == \"</s>\") {\n      print \"</s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    printf(\"%s %d\\n\", $1, NR);\n  }\n  END {\n    printf(\"#0 %d\\n\", NR+1);\n    printf(\"<s> %d\\n\", NR+2);\n    printf(\"</s> %d\\n\", NR+3);\n  }' > $dir/words.txt || exit 1;\n\n# format of $dir/words.txt:\n#<eps> 0\n#a 1\n#aa 2\n#aarvark 3\n#...\n\nsilphone=`cat $srcdir/optional_silence.txt` || exit 1;\n[ -z \"$silphone\" ] && \\\n  ( echo \"You have no optional-silence phone; it is required in the current scripts\"\n    echo \"but you may use the option --sil-prob 0.0 to stop it being used.\" ) && \\\n   exit 1;\n\ngrammar_opts=\n\n# Create the basic L.fst without disambiguation symbols, for use\n# in training.\n\nif $silprob; then\n  # Add silence probabilities (models the prob. 
of silence before and after each\n  # word).  On some setups this helps a bit.  See local/dict_dir_add_pronprobs.sh\n  # and where it's called in the example scripts (run.sh).\n  local/make_lexicon_fst_silprob.py $grammar_opts --sil-phone=$silphone \\\n    $tmpdir/lexiconp_silprob.txt $srcdir/silprob.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt  > $dir/L.fst.txt || exit 1;\n\n    # fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n    #   --keep_isymbols=false --keep_osymbols=false |   \\\n    # fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nelse\n  local/make_lexicon_fst.py $grammar_opts --sil-prob=$sil_prob --sil-phone=$silphone \\\n    $tmpdir/lexiconp.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt > $dir/L.fst.txt || exit 1;\n\n    # fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n    #   --keep_isymbols=false --keep_osymbols=false | \\\n    # fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nfi\n\n# The file oov.txt contains a word that we will map any OOVs to during\n# training.\necho \"$oov_word\" > $dir/oov.txt || exit 1;\ncat $dir/oov.txt | local/sym2int.pl $dir/words.txt >$dir/oov.int || exit 1;\n# integer version of oov symbol, used in some scripts.\n\n\n# the file wdisambig.txt contains a (line-by-line) list of the text-form of the\n# disambiguation symbols that are used in the grammar and passed through by the\n# lexicon.  
At this stage it's hardcoded as '#0', but we're laying the groundwork\n# for more generality (which probably would be added by another script).\n# wdisambig_words.int contains the corresponding list interpreted by the\n# symbol table words.txt, and wdisambig_phones.int contains the corresponding\n# list interpreted by the symbol table phones.txt.\necho '#0' >$dir/phones/wdisambig.txt\n\nwdisambig_phone=`local/sym2int.pl $dir/phones.txt <$dir/phones/wdisambig.txt`\nwdisambig_word=`local/sym2int.pl $dir/words.txt <$dir/phones/wdisambig.txt`\n\n# Create these lists of phones in colon-separated integer list form too,\n# for purposes of being given to programs as command-line options.\nfor f in silence nonsilence optional_silence disambig; do\n  local/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt >$dir/phones/$f.int\n  local/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt | \\\n   awk '{printf(\":%d\", $1);} END{printf \"\\n\"}' | sed s/:// > $dir/phones/$f.csl || exit 1;\ndone\n\nif [ -f $dir/phones/word_boundary.txt ]; then\n  local/sym2int.pl -f 1 $dir/phones.txt <$dir/phones/word_boundary.txt \\\n    > $dir/phones/word_boundary.int || exit 1;\nfi\n\nsilphonelist=`cat $dir/phones/silence.csl`\nnonsilphonelist=`cat $dir/phones/nonsilence.csl`\n\n# Create the lexicon FST with disambiguation symbols, and put it in lang_test.\n# There is an extra step where we create a loop to \"pass through\" the\n# disambiguation symbols from G.fst.\n\nif $silprob; then\n  local/make_lexicon_fst_silprob.py $grammar_opts \\\n    --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    $tmpdir/lexiconp_silprob_disambig.txt $srcdir/silprob.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt | \\\n    local/fstaddselfloops.pl $wdisambig_phone $wdisambig_word > $dir/L_disambig.fst.txt || exit 1;\nelse\n  local/make_lexicon_fst.py $grammar_opts \\\n    --sil-prob=$sil_prob --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    
$tmpdir/lexiconp_disambig.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt | \\\n    local/fstaddselfloops.pl $wdisambig_phone $wdisambig_word > $dir/L_disambig.fst.txt || exit 1;\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/aishell2/local/make_lexicon_fst.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright   2018  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\n# see get_args() below for usage message.\nimport argparse\nimport os\nimport sys\nimport math\nimport re\n\n# The use of latin-1 encoding does not preclude reading utf-8.  latin-1\n# encoding means \"treat words as sequences of bytes\", and it is compatible\n# with utf-8 encoding as well as other encodings such as gbk, as long as the\n# spaces are also spaces in ascii (which we check).  It is basically how we\n# emulate the behavior of python before python3.\nsys.stdout = open(1, 'w', encoding='latin-1', closefd=False)\nsys.stderr = open(2, 'w', encoding='latin-1', closefd=False)\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script creates the\n       text form of a lexicon FST, to be compiled by fstcompile using the\n       appropriate symbol tables (phones.txt and words.txt) .  It will mostly\n       be invoked indirectly via utils/prepare_lang.sh.  The output goes to\n       the stdout.\"\"\")\n\n    parser.add_argument('--sil-phone', dest='sil_phone', type=str,\n                        help=\"\"\"Text form of optional-silence phone, e.g. 'SIL'.  See also\n                        the --silprob option.\"\"\")\n    parser.add_argument('--sil-prob', dest='sil_prob', type=float, default=0.0,\n                        help=\"\"\"Probability of silence between words (including at the\n                        beginning and end of word sequences).  Must be in the range [0.0, 1.0].\n                        This refers to the optional silence inserted by the lexicon; see\n                        the --silphone option.\"\"\")\n    parser.add_argument('--sil-disambig', dest='sil_disambig', type=str,\n                        help=\"\"\"Disambiguation symbol to disambiguate silence, e.g. 
#5.\n                        Will only be supplied if you are creating the version of L.fst\n                        with disambiguation symbols, intended for use with cyclic G.fst.\n                        This symbol was introduced to fix a rather obscure source of\n                        nondeterminism of CLG.fst, that has to do with reordering of\n                        disambiguation symbols and phone symbols.\"\"\")\n    parser.add_argument('--left-context-phones', dest='left_context_phones', type=str,\n                        help=\"\"\"Only relevant if --nonterminals is also supplied; this relates\n                        to grammar decoding (see http://kaldi-asr.org/doc/grammar.html or\n                        src/doc/grammar.dox).  Format is a list of left-context phones,\n                        in text form, one per line.  E.g. data/lang/phones/left_context_phones.txt\"\"\")\n    parser.add_argument('--nonterminals', type=str,\n                        help=\"\"\"If supplied, --left-context-phones must also be supplied.\n                        List of user-defined nonterminal symbols such as #nonterm:contact_list,\n                        one per line.  E.g. data/local/dict/nonterminals.txt.\"\"\")\n    parser.add_argument('lexiconp', type=str,\n                        help=\"\"\"Filename of lexicon with pronunciation probabilities\n                        (normally lexiconp.txt), with lines of the form 'word prob p1 p2...',\n                        e.g. 
'a   1.0    ay'\"\"\")\n    args = parser.parse_args()\n    return args\n\n\ndef read_lexiconp(filename):\n    \"\"\"Reads the lexiconp.txt file in 'filename', with lines like 'word pron-prob p1 p2 ...'.\n    Returns a list of tuples (word, pron_prob, pron), where 'word' is a string,\n   'pron_prob', a float, is the pronunciation probability (which must be >0.0\n    and would normally be <=1.0),  and 'pron' is a list of strings representing phones.\n    An element in the returned list might be ('hello', 1.0, ['h', 'eh', 'l', 'ow']).\n    \"\"\"\n\n    ans = []\n    found_empty_prons = False\n    found_large_pronprobs = False\n    # See the comment near the top of this file, RE why we use latin-1.\n    with open(filename, 'r', encoding='latin-1') as f:\n        whitespace = re.compile(\"[ \\t]+\")\n        for line in f:\n            a = whitespace.split(line.strip(\" \\t\\r\\n\"))\n            if len(a) < 2:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2} \".format(\n                    sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            word = a[0]\n            if word == \"<eps>\":\n                # This would clash with the epsilon symbol normally used in OpenFst.\n                print(\"{0}: error: found <eps> as a word in lexicon file \"\n                      \"{1}\".format(sys.argv[0], filename), file=sys.stderr)\n                sys.exit(1)\n            try:\n                pron_prob = float(a[1])\n            except ValueError:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2}, 2nd field \"\n                      \"should be pron-prob\".format(sys.argv[0], line.strip(\" \\t\\r\\n\"), filename),\n                      file=sys.stderr)\n                sys.exit(1)\n            prons = a[2:]\n            if pron_prob <= 0.0:\n                print(\"{0}: error: invalid pron-prob in line '{1}' of lexicon file {2} \".format(\n                   
 sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            if len(prons) == 0:\n                found_empty_prons = True\n            ans.append( (word, pron_prob, prons) )\n            if pron_prob > 1.0:\n                found_large_pronprobs = True\n    if found_empty_prons:\n        print(\"{0}: warning: found at least one word with an empty pronunciation \"\n              \"in lexicon file {1}.\".format(sys.argv[0], filename),\n              file=sys.stderr)\n    if found_large_pronprobs:\n        print(\"{0}: warning: found at least one word with pron-prob >1.0 \"\n              \"in {1}\".format(sys.argv[0], filename), file=sys.stderr)\n\n\n    if len(ans) == 0:\n        print(\"{0}: error: found no pronunciations in lexicon file {1}\".format(\n            sys.argv[0], filename), file=sys.stderr)\n        sys.exit(1)\n    return ans\n\n\ndef write_nonterminal_arcs(start_state, loop_state, next_state,\n                           nonterminals, left_context_phones):\n    \"\"\"This function relates to the grammar-decoding setup, see\n    kaldi-asr.org/doc/grammar.html.  It is called from write_fst_no_silence\n    and write_fst_silence, and writes to the stdout some extra arcs\n    in the lexicon FST that relate to nonterminal symbols.\n    See the section \"Special symbols in L.fst,\n    kaldi-asr.org/doc/grammar.html#grammar_special_l.\n       start_state: the start-state of L.fst.\n       loop_state:  the state of high out-degree in L.fst where words leave\n                  and enter.\n       next_state: the number from which this function can start allocating its\n                  own states.  the updated value of next_state will be returned.\n       nonterminals: the user-defined nonterminal symbols as a list of\n          strings, e.g. ['#nonterm:contact_list', ... ].\n       left_context_phones: a list of phones that may appear as left-context,\n          e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n    shared_state = next_state\n    next_state += 1\n    final_state = next_state\n    next_state += 1\n\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=start_state, dest=shared_state,\n        phone='#nonterm_begin', word='#nonterm_begin',\n        cost=0.0))\n\n    for nonterminal in nonterminals:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=loop_state, dest=shared_state,\n            phone=nonterminal, word=nonterminal,\n            cost=0.0))\n    # this_cost equals log(len(left_context_phones)) but the expression below\n    # better captures the meaning.  Applying this cost to arcs keeps the FST\n    # stochatic (sum-to-one, like an HMM), so that if we do weight pushing\n    # things won't get weird.  In the grammar-FST code when we splice things\n    # together we will cancel out this cost, see the function CombineArcs().\n    this_cost = -math.log(1.0 / len(left_context_phones))\n\n    for left_context_phone in left_context_phones:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=shared_state, dest=loop_state,\n            phone=left_context_phone, word='<eps>', cost=this_cost))\n    # arc from loop-state to a final-state with #nonterm_end as ilabel and olabel\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=loop_state, dest=final_state,\n        phone='#nonterm_end', word='#nonterm_end', cost=0.0))\n    print(\"{state}\\t{final_cost}\".format(\n        state=final_state, final_cost=0.0))\n    return next_state\n\n\n\ndef write_fst_no_silence(lexicon, nonterminals=None, left_context_phones=None):\n    \"\"\"Writes the text format of L.fst to the standard output.  
This version is for\n    when --sil-prob=0.0, meaning there is no optional silence allowed.\n\n      'lexicon' is a list of 3-tuples (word, pron-prob, prons) as returned by\n        read_lexiconp().\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... '#nonterm_bos'].\n    \"\"\"\n\n    loop_state = 0\n    next_state = 1  # the next un-allocated state, will be incremented as we go.\n    for (word, pronprob, pron) in lexicon:\n        cost = -math.log(pronprob)\n        cur_state = loop_state\n        for i in range(len(pron) - 1):\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=cur_state,\n                dest=next_state,\n                phone=pron[i],\n                word=(word if i == 0 else '<eps>'),\n                cost=(cost if i == 0 else 0.0)))\n            cur_state = next_state\n            next_state += 1\n\n        i = len(pron) - 1  # note: i == -1 if pron is empty.\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=loop_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=(cost if i <= 0 else 0.0)))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            loop_state, loop_state, next_state,\n            nonterminals, left_context_phones)\n\n    print(\"{state}\\t{final_cost}\".format(\n        state=loop_state,\n        final_cost=0.0))\n\n\ndef write_fst_with_silence(lexicon, sil_prob, sil_phone, sil_disambig,\n                        
   nonterminals=None, left_context_phones=None):\n    \"\"\"Writes the text format of L.fst to the standard output.  This version is for\n       when --sil-prob != 0.0, meaning there is optional silence\n     'lexicon' is a list of 3-tuples (word, pron-prob, prons)\n         as returned by read_lexiconp().\n     'sil_prob', which is expected to be strictly between 0.. and 1.0, is the\n         probability of silence\n     'sil_phone' is the silence phone, e.g. \"SIL\".\n     'sil_disambig' is either None, or the silence disambiguation symbol, e.g. \"#5\".\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n\n    assert sil_prob > 0.0 and sil_prob < 1.0\n    sil_cost = -math.log(sil_prob)\n    no_sil_cost = -math.log(1.0 - sil_prob);\n\n    start_state = 0\n    loop_state = 1  # words enter and leave from here\n    sil_state = 2   # words terminate here when followed by silence; this state\n                    # has a silence transition to loop_state.\n    next_state = 3  # the next un-allocated state, will be incremented as we go.\n\n\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=loop_state,\n        phone='<eps>', word='<eps>', cost=no_sil_cost))\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=sil_state,\n        phone='<eps>', word='<eps>', cost=sil_cost))\n    if sil_disambig is None:\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_state, dest=loop_state,\n            phone=sil_phone, word='<eps>', cost=0.0))\n    else:\n        sil_disambig_state = next_state\n        next_state += 1\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_state, dest=sil_disambig_state,\n            phone=sil_phone, word='<eps>', cost=0.0))\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_disambig_state, dest=loop_state,\n            phone=sil_disambig, word='<eps>', cost=0.0))\n\n\n    for (word, pronprob, pron) in lexicon:\n        pron_cost = -math.log(pronprob)\n        cur_state = loop_state\n        for i in range(len(pron) - 1):\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=cur_state, dest=next_state,\n                phone=pron[i],\n                word=(word if i == 0 else '<eps>'),\n                cost=(pron_cost if i == 0 else 0.0)))\n            cur_state = next_state\n            next_state += 1\n\n        i = len(pron) - 1  # note: i == -1 if pron is empty.\n        
print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=loop_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=no_sil_cost + (pron_cost if i <= 0 else 0.0)))\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=sil_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=sil_cost + (pron_cost if i <= 0 else 0.0)))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            start_state, loop_state, next_state,\n            nonterminals, left_context_phones)\n\n    print(\"{state}\\t{final_cost}\".format(\n        state=loop_state,\n        final_cost=0.0))\n\n\n\n\ndef write_words_txt(orig_lines, highest_numbered_symbol, nonterminals, filename):\n    \"\"\"Writes updated words.txt to 'filename'.  'orig_lines' is the original lines\n       in the words.txt file as a list of strings (without the newlines);\n       highest_numbered_symbol is the highest numbered symbol in the original\n       words.txt; nonterminals is a list of strings like '#nonterm:foo'.\"\"\"\n    with open(filename, 'w', encoding='latin-1') as f:\n        for l in orig_lines:\n            print(l, file=f)\n        cur_symbol = highest_numbered_symbol + 1\n        for n in [ '#nonterm_begin', '#nonterm_end' ] + nonterminals:\n            print(\"{0} {1}\".format(n, cur_symbol), file=f)\n            cur_symbol = cur_symbol + 1\n\n\ndef read_nonterminals(filename):\n    \"\"\"Reads the user-defined nonterminal symbols in 'filename', checks that\n       it has the expected format and has no duplicates, and returns the nonterminal\n       symbols as a list of strings, e.g.\n       ['#nonterm:contact_list', '#nonterm:phone_number', ... ]. 
\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no nonterminals symbols.\".format(filename))\n    for nonterm in ans:\n        if nonterm[:9] != '#nonterm:':\n            raise RuntimeError(\"In file '{0}', expected nonterminal symbols to start with '#nonterm:', found '{1}'\"\n                               .format(filename, nonterm))\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\ndef read_left_context_phones(filename):\n    \"\"\"Reads, checks, and returns a list of left-context phones, in text form, one\n       per line.  Returns a list of strings, e.g. ['a', 'ah', ..., '#nonterm_bos' ]\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no left-context phones.\".format(filename))\n    whitespace = re.compile(\"[ \\t]+\")\n    for s in ans:\n        if len(whitespace.split(s)) != 1:\n            raise RuntimeError(\"The file {0} contains an invalid line '{1}'\".format(filename, s)   )\n\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\n\ndef is_token(s):\n    \"\"\"Returns true if s is a string and is space-free.\"\"\"\n    if not isinstance(s, str):\n        return False\n    whitespace = re.compile(\"[ \\t\\r\\n]+\")\n    split_str = whitespace.split(s);\n    return len(split_str) == 1 and s == split_str[0]\n\n\ndef main():\n    args = get_args()\n\n    lexicon = read_lexiconp(args.lexiconp)\n\n    if args.nonterminals is None:\n        nonterminals, left_context_phones = None, None\n    else:\n        if args.left_context_phones is None:\n            print(\"{0}: if --nonterminals is specified, 
--left-context-phones must also \"\n                  \"be specified\".format(sys.argv[0]))\n            sys.exit(1)\n        nonterminals = read_nonterminals(args.nonterminals)\n        left_context_phones = read_left_context_phones(args.left_context_phones)\n\n    if args.sil_prob == 0.0:\n          write_fst_no_silence(lexicon,\n                               nonterminals=nonterminals,\n                               left_context_phones=left_context_phones)\n    else:\n        # Do some checking that the options make sense.\n        if args.sil_prob < 0.0 or args.sil_prob >= 1.0:\n            print(\"{0}: invalid value specified --sil-prob={1}\".format(\n                sys.argv[0], args.sil_prob), file=sys.stderr)\n            sys.exit(1)\n\n        if not is_token(args.sil_phone):\n            print(\"{0}: you specified --sil-prob={1} but --sil-phone is set \"\n                  \"to '{2}'\".format(sys.argv[0], args.sil_prob, args.sil_phone),\n                  file=sys.stderr)\n            sys.exit(1)\n        if args.sil_disambig is not None and not is_token(args.sil_disambig):\n            print(\"{0}: invalid value --sil-disambig='{1}' was specified.\"\n                  \"\".format(sys.argv[0], args.sil_disambig), file=sys.stderr)\n            sys.exit(1)\n        write_fst_with_silence(lexicon, args.sil_prob, args.sil_phone,\n                               args.sil_disambig,\n                               nonterminals=nonterminals,\n                               left_context_phones=left_context_phones)\n\n\n\n#    (lines, highest_symbol) = read_words_txt(args.input_words_txt)\n#    nonterminals = read_nonterminals(args.nonterminal_symbols_list)\n#    write_words_txt(lines, highest_symbol, nonterminals, args.output_words_txt)\n\n\nif __name__ == '__main__':\n      main()\n"
  },
  {
    "path": "egs/aishell2/local/max_rescore.py",
    "content": "\"\"\"Oracle selection over an n-best list.\n\nUsage: max_rescore.py <in-json> <out-json> <best-dict-json>\n\nFor each utterance, keep the first hypothesis whose text matches the reference\n(or the top hypothesis if none matches) and record the cases where the match\nis not already ranked first.\n\"\"\"\nimport sys\nimport json\nimport codecs\nimport copy\n\njson_f = sys.argv[1]\njson_f_out = sys.argv[2]\nbest_dict_f = sys.argv[3]\n\nwith codecs.open(json_f, \"r\", encoding=\"utf-8\") as f:\n    j = json.load(f)\n\nbest_dict = {}\nfor name in j[\"utts\"]:\n    hyp_lst = j[\"utts\"][name][\"output\"]\n    for idx, hyp in enumerate(hyp_lst):\n        # A hypothesis below rank 0 that exactly matches the reference.\n        if hyp[\"text\"] == hyp[\"rec_text\"].replace(\"<eos>\", \"\") and idx > 0:\n            best_dict[name] = copy.deepcopy([hyp_lst[0], hyp])\n            print(f\"{name}: {idx}-th is the best\")\n            # The matching hypothesis has a higher MMI score than the top one.\n            if hyp_lst[0][\"mmi_tot_score\"] - hyp[\"mmi_tot_score\"] < -1e-5:\n                print(\"May be corrected by MMI\")\n            hyp_lst = [hyp]\n            # Stop here: hyp_lst no longer holds the full n-best list.\n            break\n    j[\"utts\"][name][\"output\"] = hyp_lst[:1]\n\nwith open(json_f_out, \"wb\") as f:\n    f.write(\n        json.dumps(\n            j, indent=4, ensure_ascii=False, sort_keys=True\n        ).encode(\"utf_8\")\n    )\n\nwith open(best_dict_f, \"wb\") as f:\n    f.write(\n        json.dumps(\n            best_dict, indent=4, ensure_ascii=False, sort_keys=True\n        ).encode(\"utf_8\")\n    )\n"
  },
  {
    "path": "egs/aishell2/local/mmi_rescore.sh",
    "content": "#!/usr/bin/env bash\n# Usage: mmi_rescore.sh <decode-dir> <dict>\n# Scores the oracle (best-in-n-best) result, then sweeps the interpolation\n# weight passed to local/rerank.py and scores each setting.\n\ndecode_dir=$1\ndict=$2\n\nmkdir -p $decode_dir/rescore\ndir=$decode_dir/rescore\n\nmkdir -p $dir/best\npython3 local/max_rescore.py $decode_dir/data.json ${dir}/best/data.1.json $dir/best/best.json\nscore_sclite.sh  $dir/best ${dict} > ${dir}/best/decode_result.txt\n\nfor w in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do\n    mkdir -p $dir/$w\n    python3 local/rerank.py $decode_dir/data.json $w ${dir}/${w}/data.1.json\n    score_sclite.sh  $dir/$w ${dict} > ${dir}/$w/decode_result.txt\ndone\n"
  },
  {
    "path": "egs/aishell2/local/parse_options.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey);\n#                 Arnab Ghoshal, Karel Vesely\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Parse command-line options.\n# To be sourced by another script (as in \". parse_options.sh\").\n# Option format is: --option-name arg\n# and shell variable \"option_name\" gets set to value \"arg.\"\n# The exception is --help, which takes no arguments, but prints the\n# $help_message variable (if defined).\n\n\n###\n### The --config file options have lower priority to command line\n### options, so we need to import them first...\n###\n\n# Now import all the configs specified by command-line, in left-to-right order\nfor ((argpos=1; argpos<$#; argpos++)); do\n  if [ \"${!argpos}\" == \"--config\" ]; then\n    argpos_plus1=$((argpos+1))\n    config=${!argpos_plus1}\n    [ ! -r $config ] && echo \"$0: missing config '$config'\" && exit 1\n    . $config  # source the config file.\n  fi\ndone\n\n\n###\n### Now we process the command line options\n###\nwhile true; do\n  [ -z \"${1:-}\" ] && break;  # break if there are no arguments\n  case \"$1\" in\n    # If the enclosing script is called with --help option, print the help\n    # message and exit.  
Scripts should put help messages in $help_message\n    --help|-h) if [ -z \"$help_message\" ]; then echo \"No help found.\" 1>&2;\n      else printf \"$help_message\\n\" 1>&2 ; fi;\n      exit 0 ;;\n    --*=*) echo \"$0: options to scripts must be of the form --name value, got '$1'\"\n      exit 1 ;;\n    # If the first command-line argument begins with \"--\" (e.g. --foo-bar),\n    # then work out the variable name as $name, which will equal \"foo_bar\".\n    --*) name=`echo \"$1\" | sed s/^--// | sed s/-/_/g`;\n      # Next we test whether the variable in question is undefned-- if so it's\n      # an invalid option and we die.  Note: $0 evaluates to the name of the\n      # enclosing script.\n      # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar\n      # is undefined.  We then have to wrap this test inside \"eval\" because\n      # foo_bar is itself inside a variable ($name).\n      eval '[ -z \"${'$name'+xxx}\" ]' && echo \"$0: invalid option $1\" 1>&2 && exit 1;\n\n      oldval=\"`eval echo \\\\$$name`\";\n      # Work out whether we seem to be expecting a Boolean argument.\n      if [ \"$oldval\" == \"true\" ] || [ \"$oldval\" == \"false\" ]; then\n        was_bool=true;\n      else\n        was_bool=false;\n      fi\n\n      # Set the variable to the right value-- the escaped quotes make it work if\n      # the option had spaces, like --cmd \"queue.pl -sync y\"\n      eval $name=\\\"$2\\\";\n\n      # Check that Boolean-valued arguments are really Boolean.\n      if $was_bool && [[ \"$2\" != \"true\" && \"$2\" != \"false\" ]]; then\n        echo \"$0: expected \\\"true\\\" or \\\"false\\\": $1 $2\" 1>&2\n        exit 1;\n      fi\n      shift 2;\n      ;;\n  *) break;\n  esac\ndone\n\n\n# Check for an empty argument to the --cmd option, which can easily occur as a\n# result of scripting errors.\n[ ! 
-z \"${cmd+xxx}\" ] && [ -z \"$cmd\" ] && echo \"$0: empty argument to --cmd option\" 1>&2 && exit 1;\n\n\ntrue; # so this script returns exit code 0.\n"
  },
  {
    "path": "egs/aishell2/local/prepare_data.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)\n#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)\n# Apache 2.0\n\n# transform raw AISHELL-2 data to kaldi format\n\n. ./path.sh || exit 1;\n\ntmp=\ndir=\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>\"\n  echo \" $0 /export/AISHELL-2/iOS/train data/local/train data/train\"\n  exit 1;\nfi\n\ncorpus=$1\n#dict_dir=$2\ntmp=$2\ndir=$3\n\necho \"prepare_data.sh: Preparing data in $corpus\"\n\nmkdir -p $tmp\nmkdir -p $dir\n\n# corpus check\nif [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then\n  echo \"Error: $0 requires wav.scp and trans.txt under $corpus directory.\"\n  exit 1;\nfi\n\n# validate utt-key list, IC0803W0380 is a bad utterance\nawk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list\nawk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list\nutils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list\n\n# wav.scp\nawk -F'\\t' -v path_prefix=$corpus '{printf(\"%s\\t%s/%s\\n\",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp\nutils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp\n\n# text\nutils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text\n\n# utt2spk & spk2utt\nawk -F'\\t' '{print $2}' $tmp/wav.scp > $tmp/wav.list\nsed -e 's:\\.wav::g' $tmp/wav.list | \\\n  awk -F'/' '{i=NF-1;printf(\"%s\\t%s\\n\",$NF,$i)}' > $tmp/tmp_utt2spk\nutils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_utt2spk | sort -k 1 | uniq > $tmp/utt2spk\nutils/utt2spk_to_spk2utt.pl $tmp/utt2spk | sort -k 1 | uniq > $tmp/spk2utt\n\n# copy prepared resources from tmp_dir to target dir\nmkdir -p $dir\nfor f in wav.scp text spk2utt utt2spk; do\n  cp $tmp/$f $dir/$f || exit 1;\ndone\n\necho \"local/prepare_data.sh succeeded\"\nexit 0;\n"
  },
  {
    "path": "egs/aishell2/local/prepare_dict.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)\n#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)\n# Apache 2.0\n\n# This script downloads and processes DaCiDian for Mandarin ASR.\n\n. ./path.sh\n\ndownload_dir=data/local/DaCiDian\ndir=data/local/dict\n\nif [ $# -ne 1 ]; then\n  echo \"Usage: $0 <dict-dir>\";\n  exit 1;\nfi\n\ndir=$1\n\n# download DaCiDian from GitHub\nif [ ! -d $download_dir ]; then\n  git clone https://github.com/aishell-foundation/DaCiDian.git $download_dir\nfi\n\n# here we map <UNK> to the phone spn(spoken noise)\nmkdir -p $dir\npython $download_dir/DaCiDian.py $download_dir/word_to_pinyin.txt $download_dir/pinyin_to_phone.txt > $dir/lexicon.txt\necho -e \"<UNK>\\tspn\" >> $dir/lexicon.txt\n\n# prepare silence_phones.txt, nonsilence_phones.txt, optional_silence.txt, extra_questions.txt\ncat $dir/lexicon.txt | awk '{ for(n=2;n<=NF;n++){ phones[$n] = 1; }} END{for (p in phones) print p;}'| \\\n  perl -e 'while(<>){ chomp($_); $phone = $_; next if ($phone eq \"sil\");\n    m:^([^\\d]+)(\\d*)$: || die \"Bad phone $_\"; $q{$1} .= \"$phone \"; }\n    foreach $l (values %q) {print \"$l\\n\";}\n  ' | sort -k1 > $dir/nonsilence_phones.txt  || exit 1;\n\necho sil > $dir/silence_phones.txt\necho sil > $dir/optional_silence.txt\n\ncat $dir/silence_phones.txt | awk '{printf(\"%s \", $1);} END{printf \"\\n\";}' > $dir/extra_questions.txt || exit 1;\ncat $dir/nonsilence_phones.txt | perl -e 'while(<>){ foreach $p (split(\" \", $_)) {\n  $p =~ m:^([^\\d]+)(\\d*)$: || die \"Bad phone $_\"; if($p eq \"\\$0\"){$q{\"\"} .= \"$p \";}else{$q{$2} .= \"$p \";} } } foreach $l (values %q) {print \"$l\\n\";}' \\\n >> $dir/extra_questions.txt || exit 1;\n\necho \"local/prepare_dict.sh succeeded\"\nexit 0;\n"
  },
  {
    "path": "egs/aishell2/local/rerank.py",
    "content": "\"\"\"Rerank n-best hypotheses by interpolating the original score with the MMI score.\n\nUsage: rerank.py <in-json> <weight> <out-json>\n\"\"\"\nimport sys\nimport json\nimport codecs\n\n\njson_f = sys.argv[1]\nweight = float(sys.argv[2])\njson_f_out = sys.argv[3]\n\nwith codecs.open(json_f, \"r\", encoding=\"utf-8\") as f:\n    j = json.load(f)\n\nfor name in j[\"utts\"]:\n    hyp_lst = j[\"utts\"][name][\"output\"]\n    for hyp in hyp_lst:\n        # new score = weight * original score + (1 - weight) * MMI score\n        hyp[\"score\"] = float(hyp[\"score\"]) * weight + float(hyp[\"mmi_tot_score\"]) * (1 - weight)\n    # Highest combined score first.\n    hyp_lst.sort(key=lambda hyp: hyp[\"score\"], reverse=True)\n    j[\"utts\"][name][\"output\"] = hyp_lst\n\nwith open(json_f_out, \"wb\") as f:\n    f.write(\n        json.dumps(\n            j, indent=4, ensure_ascii=False, sort_keys=True\n        ).encode(\"utf_8\")\n    )\n"
  },
  {
    "path": "egs/aishell2/local/sym2int.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n$ignore_oov = 0;\n\nfor($x = 0; $x < 2; $x++) {\n  if ($ARGV[0] eq \"--map-oov\") {\n    shift @ARGV;\n    $map_oov = shift @ARGV;\n    if ($map_oov eq \"-f\" || $map_oov =~ m/words\\.txt$/ || $map_oov eq \"\") {\n      # disallow '-f', the empty string and anything ending in words.txt as the\n      # OOV symbol because these are likely command-line errors.\n      die \"the --map-oov option requires an argument\";\n    }\n  }\n  if ($ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 
1:10 as a courtesy (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n}\n\n$symtab = shift @ARGV;\nif (!defined $symtab) {\n  print STDERR \"Usage: sym2int.pl [options] symtab [input transcriptions] > output transcriptions\\n\" .\n    \"options: [--map-oov <oov-symbol> ]  [-f <field-range> ]\\n\" .\n      \"note: <field-range> can look like 4-5, or 4-, or 5-, or 1.\\n\";\n}\nopen(F, \"<$symtab\") || die \"Error opening symbol table file $symtab\";\nwhile(<F>) {\n    @A = split(\" \", $_);\n    @A == 2 || die \"bad line in symbol table file: $_\";\n    $sym2int{$A[0]} = $A[1] + 0;\n}\n\nif (defined $map_oov && $map_oov !~ m/^\\d+$/) { # not numeric-> look it up\n  if (!defined $sym2int{$map_oov}) { die \"OOV symbol $map_oov not defined.\"; }\n  $map_oov = $sym2int{$map_oov};\n}\n\n$num_warning = 0;\n$max_warning = 20;\n\nwhile (<>) {\n  @A = split(\" \", $_);\n  @B = ();\n  for ($n = 0; $n < @A; $n++) {\n    $a = $A[$n];\n    if ( (!defined $field_begin || $n >= $field_begin)\n         && (!defined $field_end || $n <= $field_end)) {\n      $i = $sym2int{$a};\n      if (!defined ($i)) {\n        if (defined $map_oov) {\n          if ($num_warning++ < $max_warning) {\n            print STDERR \"sym2int.pl: replacing $a with $map_oov\\n\";\n            if ($num_warning == $max_warning) {\n              print STDERR \"sym2int.pl: not warning for OOVs any more times\\n\";\n            }\n          }\n          $i = $map_oov;\n        }\n      }\n      $a = $i;\n    }\n    push @B, $a;\n  }\n  print join(\" \", @B);\n  print \"\\n\";\n}\nif ($num_warning > 0) {\n  print STDERR \"** Replaced $num_warning instances of OOVs with $map_oov\\n\";\n}\n\nexit(0);\n"
  },
  {
    "path": "egs/aishell2/local/train_lms.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)\n#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)\n# Apache 2.0\n\n. ./path.sh\n. ./utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"train_lms.sh <lexicon> <word-segmented-text> <dir>\"\n  echo \" e.g. train_lms.sh data/local/dict/lexicon.txt data/local/train/text data/local/lm\"\n  exit 1;\nfi\n\nlexicon=$1\ntext=$2\ndir=$3\n\nfor f in \"$text\" \"$lexicon\"; do\n  [ ! -f \"$f\" ] && echo \"$0: No such file $f\" && exit 1;\ndone\n\nkaldi_lm=`which train_lm.sh`\nif [ -z \"$kaldi_lm\" ]; then\n  echo \"$0: train_lm.sh is not found. That might mean it's not installed\"\n  echo \"$0: or it is not added to PATH\"\n  echo \"$0: Use the script tools/extras/install_kaldi_lm.sh to install it\"\n  exit 1\nfi\n\nmkdir -p $dir\ncleantext=$dir/text.no_oov\n\ncat $text | awk -v lex=$lexicon 'BEGIN{while((getline<lex) >0){ seen[$1]=1; } }\n  {for(n=1; n<=NF;n++) {  if (seen[$n]) { printf(\"%s \", $n); } else {printf(\"<UNK> \");} } printf(\"\\n\");}' \\\n  > $cleantext || exit 1;\n\ncat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | sort | uniq -c | \\\n   sort -nr > $dir/word.counts || exit 1;\n\n# Get counts from acoustic training transcripts, and add  one-count\n# for each word in the lexicon (but not silence, we don't want it\n# in the LM-- we'll add it optionally later).\ncat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | \\\n  cat - <(grep -w -v '!SIL' $lexicon | awk '{print $1}') | \\\n   sort | uniq -c | sort -nr > $dir/unigram.counts || exit 1;\n\n# note: we probably won't really make use of <UNK> as there aren't any OOVs\ncat $dir/unigram.counts  | awk '{print $2}' | get_word_map.pl \"<s>\" \"</s>\" \"<UNK>\" > $dir/word_map \\\n   || exit 1;\n\n# note: ignore 1st field of train.txt, it's the utterance-id.\ncat $cleantext | awk -v wmap=$dir/word_map 'BEGIN{while((getline<wmap)>0)map[$1]=$2;}\n  { for(n=2;n<=NF;n++) { 
printf map[$n]; if(n<NF){ printf \" \"; } else { print \"\"; }}}' | gzip -c >$dir/train.gz \\\n   || exit 1;\n\ntrain_lm.sh --arpa --lmtype 3gram-mincount $dir || exit 1;\ntrain_lm.sh --arpa --lmtype 4gram-mincount $dir\n# note: output is\n# data/local/lm/3gram-mincount/lm_unpruned.gz\n\necho \"local/train_lms.sh succeeded\"\nexit 0\n\n\n# From here is some commands to do a baseline with SRILM (assuming\n# you have it installed).\nheldout_sent=10000 # Don't change this if you want result to be comparable with\n    # kaldi_lm results\nsdir=$dir/srilm # in case we want to use SRILM to double-check perplexities.\nmkdir -p $sdir\ncat $cleantext | awk '{for(n=2;n<=NF;n++){ printf $n; if(n<NF) printf \" \"; else print \"\"; }}' | \\\n  head -$heldout_sent > $sdir/heldout\ncat $cleantext | awk '{for(n=2;n<=NF;n++){ printf $n; if(n<NF) printf \" \"; else print \"\"; }}' | \\\n  tail -n +$heldout_sent > $sdir/train\n\ncat $dir/word_map | awk '{print $1}' | cat - <(echo \"<s>\"; echo \"</s>\" ) > $sdir/wordlist\n\n\nngram-count -text $sdir/train -order 3 -limit-vocab -vocab $sdir/wordlist -unk \\\n  -map-unk \"<UNK>\" -kndiscount -interpolate -lm $sdir/srilm.o3g.kn.gz\nngram -lm $sdir/srilm.o3g.kn.gz -ppl $sdir/heldout\n# 0 zeroprobs, logprob= -250954 ppl= 90.5091 ppl1= 132.482\n\n# Note: perplexity SRILM gives to Kaldi-LM model is same as kaldi-lm reports above.\n# Difference in WSJ must have been due to different treatment of <UNK>.\nngram -lm $dir/3gram-mincount/lm_unpruned.gz  -ppl $sdir/heldout\n# 0 zeroprobs, logprob= -250913 ppl= 90.4439 ppl1= 132.379\n\necho \"local/train_lms.sh succeeded\"\nexit 0\n"
  },
  {
    "path": "egs/aishell2/local/word_segmentation.py",
    "content": "# encoding=utf-8\n# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)\n#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)\n# Apache 2.0\n\nimport io\nimport sys\nimport jieba\n\nif len(sys.argv) < 3:\n  sys.stderr.write(\"word_segmentation.py <vocab> <trans> > <word-segmented-trans>\\n\")\n  sys.exit(1)\n\nvocab_file=sys.argv[1]\ntrans_file=sys.argv[2]\n\n# Read and write UTF-8 explicitly instead of relying on the Python 2-only\n# reload(sys) / sys.setdefaultencoding('utf-8') idiom.\nout = io.open(sys.stdout.fileno(), 'w', encoding='utf-8', closefd=False)\n\njieba.set_dictionary(vocab_file)\nwith io.open(trans_file, encoding='utf-8') as f:\n  for line in f:\n    key,trans = line.strip().split('\\t',1)\n    words = jieba.cut(trans, HMM=False) # turn off new word discovery (HMM-based)\n    new_line = key + '\\t' + \" \".join(words)\n    out.write(new_line + '\\n')\nout.flush()\n"
  },
  {
    "path": "egs/aishell2/nt.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=2         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\nresume=        # Resume the training from snapshot\n\n# feature configuration\ndo_delta=false\npreprocess_config=conf/specaug.yaml\ntrain_config=conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4.yaml\nlm_config=conf/lm.yaml\ndecode_config=conf/tuning/transducer/decode_default.yaml\n\n# rnnlm related\nlm_resume=         # specify a snapshot file to resume LM training\nlmtag=             # tag for managing LMs\n\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\n\n# data dir, modify this to your AISHELL-2 data path\ntr_dir=/data/asr_data/aishell2/iOS/data\ndev_tst_dir=/data/asr_data/aishell2/AISHELL-DEV-TEST-SET\n\n# exp tag\n### Configurable parameters ###\ntag=\"8v100_rnnt_mmi_ctc\"\nngpu=8\ndebug=false\n\n# Train config\nseed=888\nbatch_size=8\naccum_grad=8\nepochs=100\nuse_segment=true # if true, use word-level transcription in MMI criterion\naux_ctc=false\naux_ctc_weight=0.5\naux_ctc_dropout_rate=0.1\naux_mmi=false\naux_mmi_weight=0.5\naux_mmi_dropout_rate=0.1\naux_mmi_type='mmi' # mmi or phonectc\n\n# MBR training config\naux_mbr=false\naux_mbr_weight=1.0\naux_mbr_beam=4\nmbr_epochs=100\nmbr_lr=0.1\nmbr_warmup=2500\nmbr_resume=\n\n# Decode config\nidx_average=41_50\nsearch_type=\"alsd\" # \"default\", \"nsc\", \"tsd\", \"alsd\"\nmmi_weight=0.0 # MMI / phonectc joint 
decoding\nmas_lookahead=0\nctc_weight=0.0 # char ctc joint decoding\nngram_weight=0.0\nngram_order=4\nword_ngram_weight=0.0\nword_ngram_tag=word_3gram # 3 or 4 gram\nword_ngram_log_semiring=true\nlm_weight=0.0\nbeam_size=10\nrecog_set=\"test_android test_ios test_mic\"\n\n. utils/parse_options.sh || exit 1;\n\nif [ $debug == true ]; then\n    export HOST_GPU_NUM=1\n    export HOST_NUM=1\n    export NODE_NUM=1\n    export INDEX=0\n    export CHIEF_IP=\"9.135.217.29\"\nfi\n\ntrain_opts=\\\n\"\\\n--seed $seed \\\n--batch-size $batch_size \\\n--accum-grad $accum_grad \\\n--epochs $epochs \\\n--use-segment $use_segment \\\n--aux-ctc $aux_ctc \\\n--aux-ctc-weight $aux_ctc_weight \\\n--aux-ctc-dropout-rate $aux_ctc_dropout_rate \\\n--aux-mmi $aux_mmi \\\n--aux-mmi-weight $aux_mmi_weight \\\n--aux-mmi-dropout-rate $aux_mmi_dropout_rate \\\n--aux-mmi-type $aux_mmi_type \\\n\"\n\n\nif [ $aux_mbr == true ]; then\n    train_opts=\"$train_opts \\\n                --aux-mbr $aux_mbr \\\n                --aux-mbr-weight $aux_mbr_weight \\\n                --aux-mbr-beam $aux_mbr_beam \\\n                --transformer-lr $mbr_lr \\\n                --epochs $mbr_epochs \\\n                --transformer-warmup-steps $mbr_warmup \\\n                --resume $mbr_resume \\\n                --load-trainer-and-opt false \\\n                --save-interval-iters 1000 \\\n                \"\n    export OMP_NUM_THREADS=6 # for on-the-fly decoding\nfi\n\ndecode_opts=\\\n\"\\\n--search-type $search_type \\\n--mmi-weight $mmi_weight \\\n--beam-size $beam_size \\\n--ctc-weight $ctc_weight \\\n--ngram-weight $ngram_weight \\\n--word-ngram-weight $word_ngram_weight \\\n--word-ngram data/${word_ngram_tag} \\\n--word-ngram-log-semiring $word_ngram_log_semiring \\\n--lm-weight $lm_weight \\\n--mas-lookahead $mas_lookahead \\\n\"\ndict=data/lang_1char/train_sp_units.txt\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 
'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\ntrain_set=train_sp\ntrain_dev=dev_ios\n\nexpname=${train_set}_${backend}_${tag}\nexpdir=exp/${expname}\nmkdir -p ${expdir}\n\nlang=data/lang_phone\nfeat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}\nfeat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Network Training\"\n    MASTER_PORT=22277\n    NCCL_DEBUG=TRACE python3 -m torch.distributed.launch \\\n        --nproc_per_node ${HOST_GPU_NUM} --master_port $MASTER_PORT \\\n        --nnodes=${HOST_NUM} --node_rank=${INDEX} --master_addr=${CHIEF_IP} \\\n        ${MAIN_ROOT}/bin/asr_train.py \\\n        --config ${train_config} \\\n        --preprocess-conf ${preprocess_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --outdir ${expdir}/results_RANK \\\n        --debugmode ${debugmode} \\\n        --dict ${dict} \\\n        --debugdir ${expdir} \\\n        --minibatches ${N} \\\n        --verbose ${verbose} \\\n        --resume ${resume} \\\n        --train-json ${feat_tr_dir}/split${ngpu}utt/data_noeng.RANK.json \\\n        --valid-json ${feat_dt_dir}/data.json \\\n        --lang $lang \\\n        --opt \"noam_sgd\" \\\n        --n-iter-processes 8 \\\n        --world-size $ngpu \\\n        --node-rank ${INDEX} \\\n        --node-size ${HOST_GPU_NUM} \\\n        $train_opts > ${expdir}/global_record.${INDEX}.txt 2>&1\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Decoding\"\n    nj=2500\n    recog_model=model.last${idx_average}.avg.best\n    if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} etype) = custom ]] || \\\n           [[ $(get_yaml.py ${train_config} dtype) = custom ]]; 
then\n\trecog_model=model.last${idx_average}.avg.best\n\taverage_checkpoints.py --backend ${backend} \\\n        \t               --snapshots ${expdir}/results_0/snapshot.ep.* \\\n\t     \t \t       --out ${expdir}/results_0/${recog_model} \\\n\t \t\t       --num ${idx_average}\n    fi\n    \n    decode_parent_dir=decode_mmi${mmi_weight}_${word_ngram_tag}${word_ngram_weight}_lookahead${mas_lookahead}_ep${idx_average}_beam${beam_size}\n    for rtask in ${recog_set}; do\n        decode_dir=$decode_parent_dir/$rtask\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n\n        # split data\n        splitjson.py --parts ${nj} ${feat_recog_dir}/data.json\n\n        #### use CPU for decoding\n        ngpu=0\n        ${decode_cmd} JOB=1:$nj ${expdir}/${decode_dir}/log/decode.JOB.log \\\n            python3 ${MAIN_ROOT}/bin/asr_recog.py \\\n            --config ${decode_config} \\\n            --ngpu ${ngpu} \\\n            --backend ${backend} \\\n            --batchsize 0 \\\n            --recog-json ${feat_recog_dir}/split${nj}utt/data.JOB.json \\\n            --result-label ${expdir}/${decode_dir}/data.JOB.json \\\n            --model ${expdir}/results_0/${recog_model}  \\\n            --local-rank JOB $decode_opts  \n\n        score_sclite.sh ${expdir}/${decode_dir} ${dict} \\\n          > ${expdir}/${decode_dir}/decode_result.txt\n\n    done\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/aishell2/prepare.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=2         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\nresume=        # Resume the training from snapshot\n\n# feature configuration\ndo_delta=false\n\ntrain_config=conf/train.yaml\nlm_config=conf/lm_rnn.yaml\ndecode_config=conf/decode.yaml\n\n# rnnlm related\nlm_resume=         # specify a snapshot file to resume LM training\nlmtag=             # tag for managing LMs\nn_gram=4\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\n\n# data dir, modify this to your AISHELL-2 data path\ntr_dir=/data/asr_data/aishell2/iOS/data\ndev_tst_dir=/data/asr_data/aishell2/AISHELL-DEV-TEST-SET\nword_arpa=/apdcephfs/share_1149801/speech_user/tomasyu/jinchuan/data/ngram/cweng_3g_5gram.arpa\n# exp tag\n### Configurable parameters ###\ntag=\"8v100_rnnt_mmi_ctc\"\nngpu=8\n\n# Train config\nseed=888\nbatch_size=8\naccum_grad=1\nepochs=100\nuse_segment=true # if true, use word-level transcription in MMI criterion\naux_ctc_weight=0.5\naux_ctc_dropout_rate=0.1\naux_mmi=true\naux_mmi_weight=0.5\naux_mmi_dropout_rate=0.1\naux_mmi_type='mmi' # mmi or phonectc\n\n# Decode config\nidx_average=91_100\nsearch_type=\"alsd\" # \"default\", \"nsc\", \"tsd\", \"alsd\"\nmmi_weight=0.2 # MMI / phonectc joint decoding\nctc_weight=0.0 # char ctc joint decoding\nngram_weight=0.0\nword_ngram_weight=0.0\nword_ngram_order=4 # 3 or 4 gram\nmmi_type=\"frame\" # or 
rescore\nbeam_size=10\nrecog_set=\"test_android test_ios test_mic\"\n\n. utils/parse_options.sh || exit 1;\n\n#if [ $debug -eq true ]; then\n#    export HOST_GPU_NUM=1\n#    export HOST_NUM=1\n#    export NODE_NUM=1\n#    export INDEX=0\n#    export CHIEF_IP=\"9.135.217.29\"\n#fi\n\ntrain_opts=\\\n\"\\\n--seed $seed \\\n--batch-size $batch_size \\\n--accum-grad $accum_grad \\\n--epochs $epochs \\\n--use-segment $use_segment \\\n--aux-ctc $aux_ctc \\\n--aux-ctc-weight $aux_ctc_weight \\\n--aux-ctc-dropout-rate $aux_ctc_dropout_rate \\\n--aux-mmi $aux_mmi \\\n--aux-mmi-weight $aux_mmi_weight \\\n--aux-mmi-dropout-rate $aux_mmi_dropout_rate \\\n--aux-mmi-type $aux_mmi_type \\\n\"\n\ndecode_opts=\\\n\"\\\n--search-type $search_type \\\n--mmi-weight $mmi_weight \\\n--beam-size $beam_size \\\n--ctc-weight $ctc_weight \\\n--mmi-type $mmi_type \\\n--ngram-weight $ngram_weight \\\n--word-ngram-weight $word_ngram_weight \\\n--word-ngram data/word_${word_ngram_order}gram \\\n\"\n\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\ntrain_set=train_sp\ntrain_dev=dev_ios\n\nif [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n    ### Task dependent. 
You have to do the following data preparation by yourself.\n    ### But you can utilize Kaldi recipes in most cases\n    echo \"stage 0: Data preparation\"\n    # For training set\n    local/prepare_data.sh ${tr_dir} data/local/train data/train || exit 1;\n    # For dev and test sets\n    for x in Android iOS Mic; do\n        local/prepare_data.sh ${dev_tst_dir}/${x}/dev data/local/dev_${x,,} data/dev_${x,,} || exit 1;\n        local/prepare_data.sh ${dev_tst_dir}/${x}/test data/local/test_${x,,} data/test_${x,,} || exit 1;\n    done\n    # Normalize text to uppercase letters\n    for x in train dev_android dev_ios dev_mic test_android test_ios test_mic; do\n        mv data/${x}/text data/${x}/text_org\n        paste <(cut -f 1 data/${x}/text_org) <(cut -f 2 data/${x}/text_org | tr '[:lower:]' '[:upper:]') \\\n            > data/${x}/text\n        rm data/${x}/text_org\n    done\nfi\n\nfeat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}\nfeat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    ### Task dependent. 
You have to design training and dev sets by yourself.\n    ### But you can utilize Kaldi recipes in most cases\n    echo \"stage 1: Feature Generation\"\n    fbankdir=fbank\n    # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 188 --write_utt2num_frames true \\\n        data/train exp/make_fbank/train ${fbankdir}\n    utils/fix_data_dir.sh data/train\n    for x in android ios mic; do\n        steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 20 --write_utt2num_frames true \\\n            data/dev_${x} exp/make_fbank/dev_${x} ${fbankdir}\n        utils/fix_data_dir.sh data/dev_${x}     \n        steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 20 --write_utt2num_frames true \\\n            data/test_${x} exp/make_fbank/test_${x} ${fbankdir}\n        utils/fix_data_dir.sh data/test_${x}\n    done\n    \n    # speed-perturbed\n    utils/perturb_data_dir_speed.sh 0.9 data/train data/temp1\n    utils/perturb_data_dir_speed.sh 1.0 data/train data/temp2\n    utils/perturb_data_dir_speed.sh 1.1 data/train data/temp3\n    utils/combine_data.sh --extra-files utt2uniq data/${train_set} data/temp1 data/temp2 data/temp3\n    rm -r data/temp1 data/temp2 data/temp3\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 30 --write_utt2num_frames true \\\n    data/${train_set} exp/make_fbank/${train_set} ${fbankdir}\n    utils/fix_data_dir.sh data/${train_set}\n\n    # compute global CMVN\n    compute-cmvn-stats scp:data/${train_set}/feats.scp data/${train_set}/cmvn.ark\n\n    # dump features for training\n    split_dir=$(echo $PWD | awk -F \"/\" '{print $NF \"/\" $(NF-1)}')\n    if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! 
-d ${feat_tr_dir}/storage ]; then\n    utils/create_split_dir.pl \\\n        /export/a{11,12,13,14}/${USER}/espnet-data/egs/${split_dir}/dump/${train_set}/delta${do_delta}/storage \\\n        ${feat_tr_dir}/storage\n    fi\n    if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_dt_dir}/storage ]; then\n    utils/create_split_dir.pl \\\n        /export/a{11,12,13,14}/${USER}/espnet-data/egs/${split_dir}/dump/${train_dev}/delta${do_delta}/storage \\\n        ${feat_dt_dir}/storage\n    fi\n    dump.sh --cmd \"$train_cmd\" --nj 100 --do_delta ${do_delta} \\\n        data/${train_set}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/train ${feat_tr_dir}\n        \n    for rtask in ${recog_set}; do\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}; mkdir -p ${feat_recog_dir}\n        dump.sh --cmd \"$train_cmd\" --nj 20 --do_delta ${do_delta} \\\n            data/${rtask}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/recog/${rtask} \\\n            ${feat_recog_dir}\n    done\nfi\n\nlang=data/lang_phone\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: prepare lang and do text segmentation\"\n    bash local/prepare_dict.sh data/local/dict\n    local/k2_prepare_lang.sh --position-dependent-phones false data/local/dict \\\n      \"<UNK>\" data/local/lang_tmp_nosp $lang || exit 1\n\n    # use jieba for segmentation used in MMI. 
This may take a few minutes\n    python3 local/jieba_build_dict.py $lang/words.txt $lang/jieba_dict.txt\n    for part in train_sp dev_android dev_ios dev_mic test_android test_ios test_mic; do\n        python3 local/jieba_split_text.py data/${part}/text data/${part}/text_orig.scp $lang/jieba_dict.txt\n    done\n\n    # word-level N-gram model\n    mkdir -p data/local/lm\n    awk '{print $1}' data/local/dict/lexicon.txt | sort | uniq | awk '{print $1,99}' \\\n      > data/local/lm/word_seg_vocab.txt\n    python2 local/word_segmentation.py data/local/lm/word_seg_vocab.txt \\\n      data/local/train/text > data/local/lm/trans.txt\n\n    local/train_lms.sh \\\n     data/local/dict/lexicon.txt \\\n     data/local/lm/trans.txt \\\n     data/local/lm || exit 1;\n\n    for order in 3 4; do\n      wngram_dir=data/word_${order}gram; mkdir -p $wngram_dir\n      cp $lang/words.txt $wngram_dir\n      cp $lang/oov.int $wngram_dir\n      gunzip -c data/local/lm/${order}gram-mincount/lm_unpruned.gz \\\n        > $wngram_dir/lm.arpa\n      python3 -m kaldilm \\\n      --read-symbol-table=\"$wngram_dir/words.txt\" \\\n      --disambig-symbol='#0' \\\n      --max-order=$order \\\n      $wngram_dir/lm.arpa > $wngram_dir/G.fst.txt\n    done\n\n    # prepare a very large LM with external resources\n    mkdir -p data/word_5gram; wdir=data/word_5gram\n    cp $lang/words.txt $wdir/\n    cp $lang/oov.int $wdir/\n\n    python3 -m kaldilm \\\n        --read-symbol-table=\"$wdir/words.txt\" \\\n        --disambig-symbol='#0' \\\n        --max-order=5 \\\n        $word_arpa > $wdir/G.fst.txt\n    # build the .pt file\n    python3 espnet/nets/scorers/word_ngram.py\nfi\n\ndict=data/lang_1char/${train_set}_units.txt\necho \"dictionary: ${dict}\"\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    ### Task dependent. 
You have to check non-linguistic symbols used in the corpus.\n    echo \"stage 3: Dictionary and Json Data Preparation\"\n    mkdir -p data/lang_1char/\n\n    echo \"make a dictionary\"\n    echo \"<unk> 1\" > ${dict} # <unk> must be 1, 0 will be used for \"blank\" in CTC\n    text2token.py -s 1 -n 1 data/${train_set}/text | cut -f 2- -d\" \" | tr \" \" \"\\n\" \\\n    | sort | uniq | grep -v -e '^\\s*$' | awk '{print $0 \" \" NR+1}' >> ${dict}\n    wc -l ${dict}\n\n    echo \"make json files\"\n    data2json.sh --feat ${feat_tr_dir}/feats.scp \\\n                 --text_org data/${train_set}/text_orig.scp \\\n\t\t data/${train_set} ${dict} > ${feat_tr_dir}/data.json\n    for rtask in ${recog_set}; do\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n        data2json.sh --feat ${feat_recog_dir}/feats.scp \\\n                     --text_org data/${rtask}/text_orig.scp \\\n\t\t     data/${rtask} ${dict} > ${feat_recog_dir}/data.json\n    done   \nfi\n\n# you can skip this and remove --rnnlm option in the recognition (stage 5)\nif [ -z ${lmtag} ]; then\n    lmtag=$(basename ${lm_config%.*})\nfi\nlmexpname=train_rnnlm_${backend}_${lmtag}\nlmexpdir=exp/${lmexpname}\nmkdir -p ${lmexpdir}\n\nif [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then\n    echo \"stage 4: LM Preparation\"\n    lmdatadir=data/local/lm_train\n    mkdir -p ${lmdatadir}\n    text2token.py -s 1 -n 1 data/train/text | cut -f 2- -d\" \" \\\n        > ${lmdatadir}/train.txt\n    text2token.py -s 1 -n 1 data/${train_dev}/text | cut -f 2- -d\" \" \\\n        > ${lmdatadir}/valid.txt\n    mkdir -p ${lmexpdir}/results\n\n    ${cuda_cmd} --gpu ${ngpu} ${lmexpdir}/train.log \\\n        lm_train.py \\\n        --config ${lm_config} \\\n        --ngpu 4 \\\n        --backend ${backend} \\\n        --verbose ${verbose} \\\n        --outdir ${lmexpdir}/results \\\n        --tensorboard-dir ${lmexpdir}/tensorboard \\\n        --train-label ${lmdatadir}/train.txt \\\n        --valid-label 
${lmdatadir}/valid.txt \\\n        --resume ${lm_resume} \\\n        --dict ${dict}\n \n    ngramexpdir=exp/train_ngram\n    mkdir -p ${ngramexpdir}  # the output redirection below fails if this directory is missing\n    lmplz --discount_fallback -o ${n_gram} <${lmdatadir}/train.txt > ${ngramexpdir}/${n_gram}gram.arpa\n    build_binary -s ${ngramexpdir}/${n_gram}gram.arpa ${ngramexpdir}/${n_gram}gram.bin\n\nfi\n\n# Prepare these word N-gram LMs for SPL response\n# (1) use different smoothing methods\nif [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then\n    # 3-gram LMs with different smoothing methods\n    for sm in -wbdiscount -kndiscount -ukndiscount -ndiscount; do\n        bash espnet_utils/train_lms_srilm.sh \\\n          --unk \"<UNK>\" --lm-opts $sm data/local/dict/lexicon.txt \\\n          data/local/lm/trans.txt data/local/lm$sm\n    done\n\n    # Good-Turing discounting\n    bash espnet_utils/train_lms_srilm.sh \\\n      --unk \"<UNK>\" data/local/dict/lexicon.txt \\\n      data/local/lm/trans.txt data/local/lm-gtdiscount\n\n    # build k2 directory\n    for tag in wbdiscount kndiscount ukndiscount ndiscount gtdiscount; do\n        mkdir -p data/word_3gram_$tag; lmdir=data/word_3gram_$tag\n        gunzip -c data/local/lm-$tag/srilm/srilm.o3g.kn.gz \\\n          > $lmdir/lm.arpa\n\n        cp $lang/words.txt $lmdir\n        cp $lang/oov.int $lmdir\n\n        python3 -m kaldilm \\\n            --read-symbol-table=\"$lmdir/words.txt\" \\\n            --disambig-symbol='#0' \\\n            --max-order=3 \\\n            $lmdir/lm.arpa > $lmdir/G.fst.txt\n\n        python3 espnet/nets/scorers/word_ngram.py $lmdir\n    done\nfi\n"
  },
  {
    "path": "egs/asrucs/.gitignore",
    "content": "dump\ndump32\ndump64\ndata\nexp*\nfbank\n"
  },
  {
    "path": "egs/asrucs/cmd.sh",
    "content": "../aishell1/cmd.sh"
  },
  {
    "path": "egs/asrucs/conf/decode.yaml",
    "content": "tuning/decode_pytorch_transformer.yaml"
  },
  {
    "path": "egs/asrucs/conf/fbank.conf",
    "content": "--sample-frequency=16000 \n--num-mel-bins=80\n"
  },
  {
    "path": "egs/asrucs/conf/gpu.conf",
    "content": "# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l 'hostname=b1[12345678]*|c*,gpu=$0' -q g.q"
  },
  {
    "path": "egs/asrucs/conf/lm.yaml",
    "content": "# rnnlm related\nlayer: 2\nunit: 650\nopt: sgd        # or adam\nbatchsize: 64   # batch size in LM training\nepoch: 20      # if the data size is large, we can reduce this\npatience: 3\nmaxlen: 100     # if sentence length > lm_maxlen, lm_batchsize is automatically reduced\n"
  },
  {
    "path": "egs/asrucs/conf/lm_rnn.yaml",
    "content": "lm.yaml"
  },
  {
    "path": "egs/asrucs/conf/lm_transformer.yaml",
    "content": "# This Transformer LM setting w/ 4 GPUs took around 60 days for 50 epochs.\n# However, you can get better results in 6 days for 5 epochs (WER: 2.2/5.4/2.6/5.7)\n# than LSTM LM (WER: 2.6/5.6/2.6/5.7) in 60 days for 20 epochs\n# And if you does not have 4 GPUs, try accum-grad=4.\n\n# network architecture\nmodel-module: transformer\natt-unit: 512\nembed-unit: 128\nhead: 8\nlayer: 16\npos-enc: none\nunit: 2048\n\n# minibatch related\nbatchsize: 32\nmaxlen: 40\n\n# optimization related\nopt: adam\nschedulers: lr=cosine\ndropout-rate: 0.0\nepoch: 10\ngradclip: 1.0\nlr: 1e-4\nlr-cosine-total: 100000\nlr-cosine-warmup: 1000\npatience: 0\nsortagrad: 0\n"
  },
  {
    "path": "egs/asrucs/conf/pitch.conf",
    "content": "--sample-frequency=16000\n"
  },
  {
    "path": "egs/asrucs/conf/pure_ctc.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 15\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 1.0\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 16\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 4\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/asrucs/conf/queue.conf",
    "content": "# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l gpu=$0 -q g.q\n"
  },
  {
    "path": "egs/asrucs/conf/slurm.conf",
    "content": "# Default configuration\ncommand sbatch --export=PATH\noption name=* --job-name $0\noption time=* --time $0\noption mem=* --mem-per-cpu $0\noption mem=0\noption num_threads=* --cpus-per-task $0\noption num_threads=1 --cpus-per-task 1\noption num_nodes=* --nodes $0\ndefault gpu=0\noption gpu=0 -p cpu\noption gpu=* -p gpu --gres=gpu:$0 -c $0  # Recommend allocating more CPU than, or equal to the number of GPU\n# note: the --max-jobs-run option is supported as a special case\n# by slurm.pl and you don't have to handle it in the config file.\n"
  },
  {
    "path": "egs/asrucs/conf/specaug.yaml",
    "content": "process:\n  # these three processes are a.k.a. SpecAugument\n  - type: \"time_warp\"\n    max_time_warp: 5\n    inplace: true\n    mode: \"PIL\"\n  - type: \"freq_mask\"\n    F: 30\n    n_mask: 2\n    inplace: true\n    replace_with_zero: false\n  - type: \"time_mask\"\n    T: 40\n    n_mask: 2\n    inplace: true\n    replace_with_zero: false\n"
  },
  {
    "path": "egs/asrucs/conf/specaug_test.yaml",
    "content": "process:\n  # these three processes are a.k.a. SpecAugument\n  - type: \"time_warp\"\n    max_time_warp: 0\n    inplace: true\n    mode: \"PIL\"\n  - type: \"freq_mask\"\n    F: 30\n    n_mask: 2\n    inplace: true\n    replace_with_zero: true\n  - type: \"time_mask\"\n    T: 40\n    n_mask: 2\n    inplace: true\n    replace_with_zero: true\n"
  },
  {
    "path": "egs/asrucs/conf/train.yaml",
    "content": "tuning/train_pytorch_conformer_kernel15.yaml"
  },
  {
    "path": "egs/asrucs/conf/train_conformer-rnn_transducer_cs.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 50\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 256\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer_cs:E2E\"\n\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/decode_pytorch_transformer.yaml",
    "content": "batchsize: 0\nbeam-size: 10\npenalty: 0.0\nmaxlenratio: 0.0\nminlenratio: 0.0\nctc-weight: 0.5\nlm-weight: 0.0\nngram-weight: 0.3\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/decode_rnn.yaml",
    "content": "beam-size: 20\npenalty: 0.0\nmaxlenratio: 0.0\nminlenratio: 0.0\nctc-weight: 0.6\nlm-weight: 0.3\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/train_pytorch_conformer_kernel15.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nrel-pos-type: latest\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 15\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/train_pytorch_conformer_kernel31.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/train_pytorch_conformer_kernel31_large.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 16\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 512\naheads: 8\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/train_pytorch_conformer_kernel31_small.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 8\neunits: 1024\n# decoder related\ndlayers: 4\ndunits: 1024\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n\n# conformer specific setting\ntransformer-encoder-pos-enc-layer-type: rel_pos\ntransformer-encoder-selfattn-layer-type: rel_selfattn\ntransformer-encoder-activation-type: swish\nmacaron-style: true\nuse-cnn-module: true\ncnn-module-kernel: 31\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/train_pytorch_transformer.yaml",
    "content": "# network architecture\n# encoder related\nelayers: 12\neunits: 2048\n# decoder related\ndlayers: 6\ndunits: 2048\n# attention related\nadim: 256\naheads: 4\n\n# hybrid CTC/attention\nmtlalpha: 0.3\n\n# label smoothing\nlsm-weight: 0.1\n\n# minibatch related\nbatch-size: 32\nmaxlen-in: 512  # if input length  > maxlen-in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced\n\n# optimization related\nsortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs\nopt: noam\naccum-grad: 2\ngrad-clip: 5\npatience: 0\nepochs: 50\ndropout-rate: 0.1\n\n# transformer specific setting\nbackend: pytorch\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transformer:E2E\"\ntransformer-input-layer: conv2d     # encoder architecture type\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\ntransformer-attn-dropout-rate: 0.0\ntransformer-length-normalized-loss: false\ntransformer-init: pytorch\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/train_rnn.yaml",
    "content": "# network architecture\n# encoder related\netype: vggblstm     # encoder architecture type\nelayers: 3\neunits: 1024\neprojs: 1024\nsubsample: \"1_2_2_1_1\" # skip every n frame from input to nth layers\n# decoder related\ndlayers: 2\ndunits: 1024\n# attention related\natype: location\nadim: 1024\naconv-chans: 10\naconv-filts: 100\n\n# hybrid CTC/attention\nmtlalpha: 0.5\n\n# minibatch related\nbatch-size: 30\nmaxlen-in: 800  # if input length  > maxlen_in, batchsize is automatically reduced\nmaxlen-out: 150 # if output length > maxlen_out, batchsize is automatically reduced\n\n# optimization related\nopt: adadelta\nepochs: 10\npatience: 0\n\n# scheduled sampling option\nsampling-probability: 0.0\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/decode_default.yaml",
    "content": "# decoding parameters\nbatch: 0\nbeam-size: 10\nsearch-type: default\nscore-norm: True\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_conformer-rnn_transducer.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\naccum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 15\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 256\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4_att.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# Attention scorer auxiliary task: mainly follow the settings in LASCTC decoder\natt-adim: 512\natt-aheads: 8\natt-dlayers: 6\natt-dunits: 2048\natt-dropout-rate: 0.1\natt-attn-dropout-rate: 0.0\natt-length-normalized-loss: false\nlsm-weight: 0.1\n\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_conformer-rnn_transducer_aux_ngpu4_small.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 256\n          d_ff: 1024\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 512\ndunits: 256\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 256\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\n#aux-ctc: True\n#aux-ctc-weight: 0.5\n#aux-ctc-dropout-rate: 0.1\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_conformer-rnn_transducer_ngpu4.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 15\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 12\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: False\naux-ctc-weight: 0.0\naux-ctc-dropout-rate: 0.0\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_conformer-rnn_transducer_ngpu4_large.yaml",
    "content": "# minibatch related\nbatch-size: 32\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: noam\ntransformer-lr: 1.0\ntransformer-warmup-steps: 25000\nepochs: 100\npatience: 0\n#accum-grad: 2\ngrad-clip: 5.0\n\n# network architecture\n## general\ncustom-enc-positional-encoding-type: rel_pos\ncustom-enc-self-attn-type: rel_self_attn\ncustom-enc-pw-activation-type: swish\n## encoder related\netype: custom\ncustom-enc-input-layer: vgg2l\nenc-block-arch:\n        - type: conformer\n          d_hidden: 512\n          d_ff: 2048\n          heads: 4\n          macaron_style: True\n          use_conv_mod: True\n          conv_mod_kernel: 31\n          dropout-rate: 0.3\n          att-dropout-rate: 0.3\nenc-block-repeat: 16\n## decoder related\ndtype: lstm\ndlayers: 2\ndec-embed-dim: 1024\ndunits: 1024\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: False\naux-ctc-weight: 0.0\naux-ctc-dropout-rate: 0.0\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_transducer.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: adadelta\nepochs: 30\npatience: 3\naccum-grad: 2\n\n# network architecture\n## encoder related\netype: vggblstm\nelayers: 6\neunits: 512\neprojs: 512\ndropout-rate: 0.4\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n"
  },
  {
    "path": "egs/asrucs/conf/tuning/transducer/train_transducer_aux.yaml",
    "content": "# minibatch related\nbatch-size: 64\nmaxlen-in: 512\nmaxlen-out: 150\n\n# optimization related\ncriterion: loss\nearly-stop-criterion: \"validation/main/loss\"\nsortagrad: 0\nopt: adadelta\nepochs: 30\npatience: 3\naccum-grad: 2\n\n# network architecture\n## encoder related\netype: vggblstm\nelayers: 6\neunits: 512\neprojs: 512\ndropout-rate: 0.4\n## decoder related\ndtype: lstm\ndlayers: 1\ndec-embed-dim: 1024\ndunits: 512\ndropout-rate-embed-decoder: 0.2\ndropout-rate-decoder: 0.1\n## joint network related\njoint-dim: 512\n\n# transducer related\nmodel-module: \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\"\n\n# reporter related\nreport-wer: True\nreport-cer: True\n\n# auxiliary task\naux-ctc: True\naux-ctc-weight: 0.1\naux-ctc-dropout-rate: 0.1\n"
  },
  {
    "path": "egs/asrucs/espnet",
    "content": "../../../E2E-ASR-Framework/"
  },
  {
    "path": "egs/asrucs/espnet_utils",
    "content": "../espnet_utils/"
  },
  {
    "path": "egs/asrucs/local/add_seperator.py",
    "content": "import sys\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n\nin_f, out_f = sys.argv[1:]\n\nlines = open(in_f, encoding=\"utf-8\").readlines()\nseperator = u'\\u2581'\nwriter = open(out_f, 'w', encoding=\"utf-8\")\n\nfor line in lines:\n    line = line.strip().split()\n    ans = []\n    for p in line[1:]: # remove uttid\n        if is_all_chinese(p):\n            ans.append(p)\n        else:\n            ans.append(seperator + p)\n    line = \" \".join(ans) + '\\n'\n    writer.write(line)\nwriter.close()\n"
  },
  {
    "path": "egs/asrucs/local/generate_fake_cs.py",
    "content": "import sys\nimport torch\nimport random\nimport torchaudio\n\nMAX_LENGTH = 12 * 100 # 12 seconds\n\n\ndef read_datadir(d):\n    wav_dict = {}\n    for line in open(d + \"/wav.scp\", encoding=\"utf-8\"):\n        uttid, item = line.split()\n        wav_dict[uttid] = item\n\n    text_dict = {}\n    for line in open(d + \"/text\", encoding=\"utf-8\"):\n        uttid, item = line.split()[0], line.split()[1:]\n        text_dict[uttid] = \" \".join(item)\n\n    dur_dict = {}\n    for line in open(d + \"/utt2num_frames\", encoding=\"utf-8\"):\n        uttid, item = line.strip().split()\n        dur_dict[uttid] = int(item)\n    \n    return wav_dict, text_dict, dur_dict\n\ndef generate_pairs(chn_dur_dict, eng_dur_dict, dur_maximum):\n    dur_sum = 0 \n    ans = []\n    chn_keys = list(chn_dur_dict.keys())\n    eng_keys = list(eng_dur_dict.keys())\n    random.shuffle(chn_keys)\n    random.shuffle(eng_keys)\n\n    for i, (chn_k, eng_k) in enumerate(zip(chn_keys, eng_keys)):\n        length = chn_dur_dict[chn_k] + eng_dur_dict[eng_k]\n        if length > MAX_LENGTH:\n            continue\n\n        k_lst = [chn_k, eng_k]\n        random.shuffle(k_lst)\n\n        ans.append(k_lst)\n        dur_sum += length\n\n        if dur_sum > dur_maximum:\n            break\n\n    return ans \n\ndef write_utts(tgt_dir, pair_list, wav_dict, text_dict):\n    text_writer = open(tgt_dir + \"/text\", 'w', encoding=\"utf-8\")\n    scp_writer = open(tgt_dir + \"/wav.scp\", 'w', encoding=\"utf-8\")\n\n    def write_utt(path1, path2, path):\n        wave1, sr1 = torchaudio.load(path1)\n        wave2, sr2 = torchaudio.load(path2)\n        assert sr1 == sr2\n        wave = torch.cat([wave1, wave2], dim=-1)\n        torchaudio.save(path, wave, sample_rate=sr1)\n\n    for i, (k1, k2) in enumerate(pair_list):\n        uttid = k1 + \"_and_\" + k2\n        wave_path = tgt_dir + '/wavs/' + uttid + '.wav'\n        write_utt(wav_dict[k1], wav_dict[k2], wave_path)\n   \n        text = uttid + 
\" \" + text_dict[k1] + ' ' + text_dict[k2] + \"\\n\"\n        text_writer.write(text)\n        text_writer.flush()\n\n        scp_info = uttid + \" \" + wave_path + \"\\n\"\n        scp_writer.write(scp_info)\n        scp_writer.flush()\n\n        if i % 10000 == 0:\n            print(f\"have generate {i} utts\")\n\n    text_writer.close()\n    scp_writer.close()\n\ndef main():\n\n    chn_dir, eng_dir, tgt_dir = sys.argv[1:4]\n\n    chn_wav_dict, chn_text_dict, chn_dur_dict = \\\n        read_datadir(chn_dir)\n\n    eng_wav_dict, eng_text_dict, eng_dur_dict = \\\n        read_datadir(eng_dir)\n    \n    chn_wav_dict.update(eng_wav_dict)\n    chn_text_dict.update(eng_text_dict)\n\n    # 200 hours\n    pair_list = generate_pairs(chn_dur_dict, eng_dur_dict, dur_maximum=100 * 3600 * 200)\n    write_utts(tgt_dir, pair_list, chn_wav_dict, chn_text_dict)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/asrucs/local/prepare_fake_cs.sh",
    "content": "cs_dir=data/train_cs_fake/\n# generate wav files\n# python3 local/generate_fake_cs.py data/train_zh_trim/ data/train_en_trim/ data/train_cs_fake/\n\n# cat $cs_dir/text | cut -d ' ' -f 1 | awk '{print $1, $1}' > $cs_dir/spk2utt \n# cp $cs_dir/spk2utt $cs_dir/utt2spk\n# utils/fix_data_dir.sh $cs_dir\n\n# steps/make_fbank_pitch.sh --nj 500 --write_utt2num_frames true \\\n#     $cs_dir exp/make_fbank/cs_fake fbank\n\n# dump.sh  --nj 48 --do_delta false \\\n#     $cs_dir/feats.scp data/cmvn/cmvn.ark exp/dump_feats/fake_cs \\\n#     dump/train_cs_fake/deltafalse/\n\n# python3 espnet_utils/text_norm.py --in-f data/train_cs_fake/text --out-f data/train_cs_fake/text_org --eng-upper\n# mv data/train_cs_fake/text_org data/train_cs_fake/text\n\n# feat_part_dir=dump/train_cs_fake/deltafalse/\n# data2json.sh --nj 20 --feat $feat_part_dir/feats.scp --bpecode data/dict_cs/bpe.model \\\n#     data/train_cs_fake data/dict_cs/dict.txt > $feat_part_dir/data.json\n\n# dict=data/dict_cs/dict.txt\n# n_symbols=`wc -l $dict | cut -d ' ' -f 1`\n\n# python3 espnet_utils/add_uttcls_json.py dump/train_cs_fake/deltafalse/data.json \\\n#             dump/train_cs_fake/deltafalse/data_withcls.json $[$n_symbols + 4]\n\npython3 espnet_utils/concatjson.py \\\n        dump/train_zh_trim/deltafalse/data_withcls.json \\\n        dump/train_en_trim/deltafalse/data_withcls.json \\\n        dump/train_cs_fake/deltafalse/data_withcls.json \\\n        --shuffle > dump/jsons/pretrain_data_fakecs.json\npython3 espnet_utils/splitjson.py -p 8 --original-order dump/jsons/pretrain_data_fakecs.json\n"
  },
  {
    "path": "egs/asrucs/nt.sh",
    "content": "#!/usr/bin/env bash\n\n# author: tyriontian\n# tyriontian@tencent.com\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\ndebugmode=1\ndumpdir=dump   # directory to dump full features\nN=0            # number of minibatches to be used (mainly for debugging). \"0\" uses all minibatches.\nverbose=0      # verbose option\ndebug=false\n\n# feature configuration\ndo_delta=false\n\npreprocess_config=conf/specaug.yaml\ntrain_config=conf/train_conformer-rnn_transducer_cs.yaml\ndecode_config=conf/tuning/transducer/decode_default.yaml\n\n# decoding parameter\nrecog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'\nresume=\n\n# data\ntrain_json=dump/jsons/split8utt/zh_en_cs_withcls.RANK.json\nvalid_json=dump/dev_cs/deltafalse/data_withcls.json\ndict=data/dict_cs/dict.txt\nnlsyms=data/dict_cs/nlsyms.txt\nnbpe=5000\nbpemodel=data/dict_cs/bpe.model\n\n### Configurable parameters ###\ntag=debug\nngpu=8\n\n# Train config\nseed=888\nbatch_size=16\naccum_grad=4\nepochs=50\naux_ctc_weight=1.0\ncs_lang_weight=0.0\ncs_share_encoder=true\ncs_share_encoder_layers=9\ncs_is_pretrain=false\ncs_use_adversial_examples=true\ncs_is_ctc_decoder=true\ncs_use_mask_predictor=false\nenc_block_repeat=3\npretrain_model=\n\n# Decode config \ndecode_tag=\"default_decode\"\nidx_average=91_100\ndecode_feature=combine \nsearch_type=\"ctc_greedy\" # ctc_greedy | ctc_beam | alsd \nbeam_size=10\nngram_model=data/ngram/train_cs_5gram.bin\nngram_weight=0.0\nrnnlm=exp/train_nnlm_combine/rnnlm.model.best\nrnnlm_conf=exp/train_nnlm_combine/model.json\nlm_weight=0.0\nword_ngram=data/word_ngram/train_combine/\nword_ngram_weight=0.0\neng_vocab=None\nrecog_set=\"test_zh test_en test_cs\"\n\n. 
utils/parse_options.sh || exit 1;\n\nif [ $debug == true ]; then\n    export HOST_GPU_NUM=1\n    export HOST_NUM=1\n    export NODE_NUM=1\n    export INDEX=0\n    export CHIEF_IP=\"9.135.217.29\"\nfi\n\ntrain_opts=\\\n\"\\\n--seed $seed \\\n--batch-size $batch_size \\\n--accum-grad $accum_grad \\\n--epochs $epochs \\\n--aux-ctc-weight $aux_ctc_weight \\\n--cs-lang-weight $cs_lang_weight \\\n--cs-share-encoder $cs_share_encoder \\\n--cs-share-encoder-layers $cs_share_encoder_layers \\\n--cs-use-adversial-examples $cs_use_adversial_examples \\\n--cs-is-pretrain $cs_is_pretrain \\\n--cs-is-ctc-decoder $cs_is_ctc_decoder \\\n--cs-use-mask-predictor $cs_use_mask_predictor \\\n--enc-block-repeat $enc_block_repeat \\\n\"\n\ndecode_opts=\\\n\"\\\n--beam-size $beam_size \\\n--cs-nt-decode-feature $decode_feature \\\n--search-type $search_type \\\n--ngram-model $ngram_model \\\n--ngram-weight $ngram_weight \\\n--rnnlm $rnnlm \\\n--rnnlm-conf $rnnlm_conf \\\n--lm-weight $lm_weight \\\n--word-ngram-lower-char false \\\n--word-ngram $word_ngram \\\n--word-ngram-weight $word_ngram_weight \\\n--eng-vocab $eng_vocab \\\n\"\n\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 
'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\nexpname=${tag}\nexpdir=exp/${expname}\nmkdir -p ${expdir}\n\n\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Network Pre-Training: CTC bilingual system\"\n    # keep the model file \n    cp espnet/nets/pytorch_backend/e2e_asr_transducer_cs.py $expdir \n    MASTER_PORT=22275\n    NCCL_DEBUG=TRACE python3 -m torch.distributed.launch \\\n        --nproc_per_node ${HOST_GPU_NUM} --master_port $MASTER_PORT \\\n        --nnodes=${HOST_NUM} --node_rank=${INDEX} --master_addr=${CHIEF_IP} \\\n        ${MAIN_ROOT}/bin/asr_train.py \\\n        --config ${train_config} \\\n        --preprocess-conf ${preprocess_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --outdir ${expdir}/pretrain_RANK \\\n        --debugmode ${debugmode} \\\n        --dict ${dict} \\\n        --debugdir ${expdir} \\\n        --minibatches ${N} \\\n        --verbose ${verbose} \\\n        --resume ${resume} \\\n        --train-json $train_json \\\n        --valid-json $valid_json \\\n        --n-iter-processes 8 \\\n        --world-size $ngpu \\\n        --node-rank ${INDEX} \\\n        --node-size ${HOST_GPU_NUM} \\\n        --num-save-attention 0 \\\n        --cs-is-pretrain true $train_opts \\\n        > ${expdir}/pretrain.${INDEX}.txt 2>&1\nfi\n\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Network Fine-tuning: RNNT bilingual system\"\n    MASTER_PORT=22275\n    NCCL_DEBUG=TRACE python3 -m torch.distributed.launch \\\n        --nproc_per_node ${HOST_GPU_NUM} --master_port $MASTER_PORT \\\n        --nnodes=${HOST_NUM} --node_rank=${INDEX} --master_addr=${CHIEF_IP} \\\n        ${MAIN_ROOT}/bin/asr_train.py \\\n        --config ${train_config} \\\n        --preprocess-conf ${preprocess_config} \\\n        --ngpu 1 \\\n        --backend ${backend} \\\n        --outdir ${expdir}/finetuning_RANK \\\n        --debugmode ${debugmode} \\\n        
--dict ${dict} \\\n        --debugdir ${expdir} \\\n        --minibatches ${N} \\\n        --verbose ${verbose} \\\n        --train-json $train_json \\\n        --valid-json $valid_json \\\n        --n-iter-processes 8 \\\n        --world-size $ngpu \\\n        --node-rank ${INDEX} \\\n        --node-size ${HOST_GPU_NUM} \\\n        --num-save-attention 0 \\\n        --resume $pretrain_model \\\n        --load-trainer-and-opt false \\\n        $train_opts > ${expdir}/finetuning.${INDEX}.txt 2>&1\n\n       \nfi\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: Decoding\"\n    nj=5\n    recog_model=model.last${idx_average}.avg.best\n    if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]] || \\\n           [[ $(get_yaml.py ${train_config} etype) = custom ]] || \\\n           [[ $(get_yaml.py ${train_config} dtype) = custom ]]; then \\\n        if [ ! -f ${expdir}/pretrain_0/${recog_model} ]; then\n            echo \"conduct model average\"\n            average_checkpoints.py --backend ${backend} \\\n                                   --snapshots ${expdir}/pretrain_0/snapshot.ep.* \\\n                                   --out ${expdir}/pretrain_0/${recog_model} \\\n                                   --num ${idx_average} \n        fi\n    fi\n    decode_parent_dir=${decode_tag}\n    for rtask in ${recog_set}; do\n        decode_dir=$decode_parent_dir/${rtask}_${decode_feature}_${search_type}\n        feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}\n\n        # split data\n        splitjson.py --parts ${nj} ${feat_recog_dir}/data.json\n\n        #### use CPU for decoding\n        ngpu=0\n\n        ${decode_cmd} JOB=1:${nj} --max-job 100 ${expdir}/${decode_dir}/log/decode.JOB.log \\\n            asr_recog.py \\\n            --config ${decode_config} \\\n            --ngpu ${ngpu} \\\n            --backend ${backend} \\\n            --batchsize 
0 \\\n            --recog-json ${feat_recog_dir}/split${nj}utt/data.JOB.json \\\n            --result-label ${expdir}/${decode_dir}/data.JOB.json \\\n            --model ${expdir}/pretrain_0/${recog_model} \\\n            $decode_opts\n\n        if [ $rtask == \"test_cs\" ]; then \n            mer_opt=\"--mer true\"\n        else\n            mer_opt=\"\"\n        fi\n        \n        score_sclite.sh --bpe $nbpe --bpemodel ${bpemodel} --wer true $mer_opt --nlsyms data/dict_cs/nlsyms.txt \\\n          ${expdir}/${decode_dir} ${dict} > ${expdir}/${decode_dir}/decode_result.txt\n    done\n    echo \"Finished\"\nfi\n\n"
  },
  {
    "path": "egs/asrucs/path.sh",
    "content": "../aishell1/path.sh"
  },
  {
    "path": "egs/asrucs/prepare.sh",
    "content": "#!/usr/bin/env bash\n\n# author: tyriontian\n# tianjinchuan@stu.pku.edu.cn ; tyriontian@tencent.com\n\n# A Code-Switch ASR recipe\n\n. ./path.sh || exit 1;\n. ./cmd.sh || exit 1;\n\nstage=2\nstop_stage=100\ndumpdir=dump\nfbankdir=fank\ndo_delta=false\nnbpe=5000\noteam_cs_text=../oteam_asr3/data/eng/text\n\n. utils/parse_options.sh || exit 1;\n\nif [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n    echo \"stage 0: prepare features\"\n    # make raw features and remove long-short utts\n    for part in train_zh train_en train_cs; do \n        steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 500 --write_utt2num_frames true \\\n            data/$part exp/make_fbank/$part ${fbankdir}\n        espnet_utils/remove_longshortdata.sh --maxframes 1200 --maxchars 400 data/${part} data/${part}_trim\n    done\n\n    # compute cmvn\n    mkdir -p data/cmvn\n    cat data/train_zh_trim/feats.scp data/train_en_trim/feats.scp data/train_cs_trim/feats.scp \\\n        | shuf | head -n 10000 > data/cmvn/feats_cmvn.scp\n    compute-cmvn-stats scp:data/cmvn/feats_cmvn.scp data/cmvn/cmvn.ark\n\n    # dump features without speed perturb\n    for part in train_zh_trim train_en_trim train_cs_trim \\\n                dev_zh dev_en dev_cs test_zh test_en test_cs; do\n        feat_part_dir=${dumpdir}/${part}/delta${do_delta}; mkdir -p ${feat_part_dir}\n        dump.sh --cmd \"$train_cmd\" --nj 48 --do_delta ${do_delta} \\\n            data/${part}/feats.scp data/cmvn/cmvn.ark exp/dump_feats/${part} \\\n            ${feat_part_dir}\n    done\nfi\n\nmkdir -p data/dict_cs\ndict=data/dict_cs/dict.txt\nnlsyms=data/dict_cs/nlsyms.txt\nbpemodel=data/dict_cs/bpe\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: prepare dictionary and json file\"\n    \n    # prepare special symbols\n    echo \"<unk> 1\" > $dict; echo \"<chn> 2\" >> $dict; echo \"<eng> 3\" >> $dict\n    echo \"<chn>\" > $nlsyms; echo \"<eng>\" >> $nlsyms\n\n    # chn symbols\n    
text2token.py -s 1 -n 1 data/train_zh_trim/text | cut -f 2- -d\" \" | tr \" \" \"\\n\" \\\n      | sort | uniq | grep -v -e '^\\s*$' | awk '{print $0 \" \" NR+3}' >> ${dict}\n    nchn_symbols=`wc -l $dict | cut -d ' ' -f 1` \n\n    # eng symbols and bpe models \n    cat data/train_en_trim/text | cut -d ' ' -f 2- > data/dict_cs/bpe_input.txt\n    spm_train --input=data/dict_cs/bpe_input.txt --vocab_size=${nbpe} --model_type=unigram \\\n        --model_prefix=${bpemodel} --input_sentence_size=100000000\n    spm_encode --model=${bpemodel}.model --output_format=piece --split-chn < data/dict_cs/bpe_input.txt \\\n        | tr ' ' '\\n' | sort | uniq | awk '{print $0 \" \" NR+'${nchn_symbols}'}' >> ${dict}\n\n    # eng words\n    cat data/train_en_trim/text data/train_cs_trim/text \n\n    # make json files\n    for part in train_zh_trim train_en_trim train_cs_trim \\\n                dev_zh dev_en dev_cs test_zh test_en test_cs; do\n        feat_part_dir=${dumpdir}/${part}/delta${do_delta}\n        data2json.sh --nj 20 --feat $feat_part_dir/feats.scp --bpecode ${bpemodel}.model \\\n            data/$part $dict > $feat_part_dir/data.json \n    done\n\n    # Add language-id in label sequence; consider <blk> <eos>\n    # We add these label only for model inference -> no test sets\n    n_symbols=`wc -l $dict | cut -d ' ' -f 1`\n    for part in train_zh_trim dev_zh; do\n        python3 espnet_utils/add_uttcls_json.py dump/${part}/deltafalse/data.json \\\n            dump/${part}/deltafalse/data_withcls.json $[$n_symbols + 2]\n    done\n\n    for part in train_en_trim dev_en; do\n        python3 espnet_utils/add_uttcls_json.py dump/${part}/deltafalse/data.json \\\n            dump/${part}/deltafalse/data_withcls.json $[$n_symbols + 3]\n    done\n\n    for part in train_cs_trim dev_cs; do\n        python3 espnet_utils/add_uttcls_json.py dump/${part}/deltafalse/data.json \\\n            dump/${part}/deltafalse/data_withcls.json $[$n_symbols + 4]\n    done\nfi\n\nif [ ${stage} -le 
2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Process json files in multiple styles\"\n    # make sure the output directory for the merged json files exists\n    mkdir -p dump/jsons\n\n    # dev\n    python3 espnet_utils/concatjson.py \\\n        dump/dev_zh/deltafalse/data.json \\\n        dump/dev_en/deltafalse/data.json \\\n        dump/dev_cs/deltafalse/data.json \\\n        > dump/jsons/dev.json\n\n    # zh + en \n    python3 espnet_utils/concatjson.py \\\n        dump/train_zh_trim/deltafalse/data.json \\\n        dump/train_en_trim/deltafalse/data.json \\\n        --shuffle > dump/jsons/zh_en.json\n    \n    # zh + en + cs\n    python3 espnet_utils/concatjson.py \\\n        dump/train_zh_trim/deltafalse/data.json \\\n        dump/train_en_trim/deltafalse/data.json \\\n        dump/train_cs_trim/deltafalse/data.json \\\n        --shuffle > dump/jsons/zh_en_cs.json\n    \n    # zh + en + fakecs\n    python3 espnet_utils/concatjson.py \\\n        dump/train_zh_trim/deltafalse/data.json \\\n        dump/train_en_trim/deltafalse/data.json \\\n        dump/train_cs_fake/deltafalse/data.json \\\n        --shuffle > dump/jsons/zh_en_fakecs.json\n\n    ### With class label ###\n    \n    # dev\n    python3 espnet_utils/concatjson.py \\\n        dump/dev_zh/deltafalse/data_withcls.json \\\n        dump/dev_en/deltafalse/data_withcls.json \\\n        dump/dev_cs/deltafalse/data_withcls.json \\\n        > dump/jsons/dev_withcls.json\n\n    # zh + en \n    python3 espnet_utils/concatjson.py \\\n        dump/train_zh_trim/deltafalse/data_withcls.json \\\n        dump/train_en_trim/deltafalse/data_withcls.json \\\n        --shuffle > dump/jsons/zh_en_withcls.json\n\n    # zh + en + cs\n    python3 espnet_utils/concatjson.py \\\n        dump/train_zh_trim/deltafalse/data_withcls.json \\\n        dump/train_en_trim/deltafalse/data_withcls.json \\\n        dump/train_cs_trim/deltafalse/data_withcls.json \\\n        --shuffle > dump/jsons/zh_en_cs_withcls.json\n\n    # zh + en + fakecs\n    python3 espnet_utils/concatjson.py \\\n        
dump/train_zh_trim/deltafalse/data_withcls.json \\\n        dump/train_en_trim/deltafalse/data_withcls.json \\\n        dump/train_cs_fake/deltafalse/data_withcls.json \\\n        --shuffle > dump/jsons/zh_en_fakecs_withcls.json\n\n    for json in zh_en zh_en_withcls zh_en_cs zh_en_cs_withcls zh_en_fakecs zh_en_fakecs_withcls; do\n        python3 espnet_utils/splitjson.py -p 8 --original-order dump/jsons/${json}.json &\n    done; wait\nfi\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: build LMs\"\n    \n#    mkdir -p data/ngram\n#    # process corpus\n#    for part in train_zh train_en train_cs dev_zh dev_en dev_cs; do\n#        cut -d \" \" -f 2- data/$part/text | spm_encode --model=${bpemodel}.model \\\n#          --output_format=piece > data/ngram/${part}_input.txt\n#    done\n#    cut -d \" \" -f 2- $oteam_cs_text | spm_encode --model=${bpemodel}.model \\\n#          --output_format=piece > data/ngram/oteam_cs_input.txt\n#\n#    rm -f data/ngram/train_combine_input.txt data/ngram/dev_combine_input.txt \n#    cat data/ngram/train*_input.txt > data/ngram/train_combine_input.txt\n#    cat data/ngram/dev*_input.txt   > data/ngram/dev_combine_input.txt\n#  \n#    # train N-gram LM \n#    for part in zh en cs combine; do \n#        lmplz --discount_fallback -o 5 < data/ngram/train_${part}_input.txt \\\n#          > data/ngram/${part}_5gram.arpa\n#        build_binary data/ngram/${part}_5gram.arpa data/ngram/${part}_5gram.bin\n#\n#        cat data/ngram/train_${part}_input.txt data/ngram/oteam_cs_input.txt \\\n#          > data/ngram/train_${part}_input_extended.txt\n#        lmplz --discount_fallback -o 5 <  data/ngram/train_${part}_input_extended.txt\\\n#          > data/ngram/${part}_5gram_oteam.arpa\n#        build_binary data/ngram/${part}_5gram_oteam.arpa data/ngram/${part}_5gram_oteam.bin\n#    done\n#\n#    # train token-level LM\n#    for part in combine; do\n#        ${cuda_cmd} --gpu 4 exp/train_nnlm_${part}/train.log \\\n#   
         lm_train.py \\\n#            --config conf/lm_transformer.yaml \\\n#            --ngpu 4 \\\n#            --backend pytorch \\\n#            --verbose 1 \\\n#            --outdir exp/train_nnlm_${part} \\\n#            --train-label data/ngram/train_${part}_input.txt \\\n#            --valid-label data/ngram/dev_${part}_input.txt \\\n#            --dict ${dict}\n#    done\n\n#    mkdir -p data/word_lm; rm data/word_lm/*\n#    for part in train_zh train_en train_cs dev_zh dev_en dev_cs; do\n#        python3 espnet_utils/text_norm.py --in-f data/${part}/text \\\n#          --out-f data/word_lm/${part}_input.txt --eng-upper --segment-chn \n#    done\n#    rm -f data/word_lm/train_combine_input.txt \n#    rm -f data/word_lm/dev_combine_input.txt\n#    cat data/word_lm/train_* > data/word_lm/train_combine_input.txt\n#    cat data/word_lm/dev_*   > data/word_lm/dev_combine_input.txt\n#\n#    word_dict=data/word_lm/dict.txt\n#    text2vocabulary.py -s 65000 -o ${word_dict} data/word_lm/train_combine_input.txt\n\n    \nfi\nexit 0;\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: build word N-gram LMs\"\n    mkdir -p data/word_ngram;\n   \n    # prepare corpus \n    for part in train_zh train_en train_cs; do\n        python3 espnet_utils/text_norm.py --in-f data/${part}/text \\\n          --out-f data/word_ngram/${part}_seg.txt --segment --eng-upper\n        cut -d ' ' -f 2- data/word_ngram/${part}_seg.txt | spm_encode \\\n          --model=${bpemodel}.model --output_format=piece \\\n          > data/word_ngram/${part}_bpe.txt\n        python3 local/add_seperator.py data/word_ngram/${part}_seg.txt \\\n          data/word_ngram/${part}_word.txt\n        cat data/word_ngram/${part}_bpe.txt data/word_ngram/${part}_word.txt \\\n          > data/word_ngram/${part}_input.txt\n    done \n    cat data/word_ngram/*_input.txt > data/word_ngram/train_combine_input.txt\n\n    # prepare words.txt\n    words=data/word_ngram/words.txt\n    cat 
data/word_ngram/train_combine_input.txt | tr \" \" \"\\n\" | sort | uniq | \\\n      grep -v '<unk>' > ${words} \n\n    # train N-gram and convert to torch version\n    words_disambig=data/word_ngram/words_disambig.txt\n    (echo \"<eps>\"; echo \"<unk>\") | cat - $words |\\\n      awk '{print $0 \" \" NR-1}' > $words_disambig\n    echo \"#0 `wc -l $words_disambig | cut -d ' ' -f 1`\" \\\n      >> $words_disambig\n    for part in train_zh train_en train_cs train_combine; do\n        bash espnet_utils/train_lms_srilm.sh \\\n          --unk \"<unk>\" --lm-opts -wbdiscount --order 5 \\\n          $words data/word_ngram/${part}_input.txt \\\n          data/word_ngram/$part\n        gunzip -c data/word_ngram/$part/srilm/srilm.o3g.kn.gz \\\n          > data/word_ngram/$part/lm.arpa\n        cp $words_disambig data/word_ngram/$part/words.txt\n        echo 1 > data/word_ngram/$part/oov.int\n\n        python3 -m kaldilm \\\n          --read-symbol-table=$words_disambig \\\n          --disambig-symbol='#0' \\\n          --max-order=5 \\\n          data/word_ngram/$part/lm.arpa \\\n          > data/word_ngram/$part/G.fst.txt\n\n        python3 espnet/nets/scorers/word_ngram.py data/word_ngram/$part \n    done \nfi\n"
  },
  {
    "path": "egs/asrucs/steps",
    "content": "../steps/"
  },
  {
    "path": "egs/asrucs/text",
    "content": ""
  },
  {
    "path": "egs/asrucs/utils",
    "content": "../utils/"
  },
  {
    "path": "egs/espnet_utils/add_uttcls_json.py",
    "content": "import json\nimport sys\n\ndef main():\n    in_json = sys.argv[1]\n    out_json = sys.argv[2]\n    clsid = sys.argv[3]\n\n    reader = open(in_json, encoding=\"utf-8\")\n    j = json.load(reader)\n\n    for name in j[\"utts\"].keys():\n        j[\"utts\"][name][\"output\"][0][\"tokenid\"] = \\\n            clsid + \" \" + j[\"utts\"][name][\"output\"][0][\"tokenid\"]\n\n    with open(out_json, \"wb\") as f:\n        f.write(\n            json.dumps(\n                j, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/addjson.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport sys\n\nfrom distutils.util import strtobool\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"add multiple json values to an input or output value\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"jsons\", type=str, nargs=\"+\", help=\"json files\")\n    parser.add_argument(\n        \"-i\",\n        \"--is-input\",\n        default=True,\n        type=strtobool,\n        help=\"If true, add to input. If false, add to output\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    # make intersection set for utterance keys\n    js = []\n    intersec_ks = []\n    for x in args.jsons:\n        with codecs.open(x, \"r\", encoding=\"utf-8\") as f:\n            j = json.load(f)\n        ks = j[\"utts\"].keys()\n        logging.info(x + \": has \" + str(len(ks)) + \" utterances\")\n        if len(intersec_ks) > 0:\n            intersec_ks = intersec_ks.intersection(set(ks))\n            if len(intersec_ks) == 0:\n                logging.warning(\"Empty intersection\")\n                break\n        else:\n            intersec_ks = set(ks)\n        js.append(j)\n    
logging.info(\"new json has \" + str(len(intersec_ks)) + \" utterances\")\n\n    # updated original dict to keep intersection\n    intersec_org_dic = dict()\n    for k in intersec_ks:\n        v = js[0][\"utts\"][k]\n        intersec_org_dic[k] = v\n\n    intersec_add_dic = dict()\n    for k in intersec_ks:\n        v = js[1][\"utts\"][k]\n        for j in js[2:]:\n            v.update(j[\"utts\"][k])\n        intersec_add_dic[k] = v\n\n    new_dic = dict()\n    for key_id in intersec_org_dic:\n        orgdic = intersec_org_dic[key_id]\n        adddic = intersec_add_dic[key_id]\n\n        if \"utt2spk\" not in orgdic:\n            orgdic[\"utt2spk\"] = \"\"\n        # NOTE: for machine translation\n\n        # add as input\n        if args.is_input:\n            # original input\n            input_list = orgdic[\"input\"]\n            # additional input\n            in_add_dic = {}\n            if \"idim\" in adddic and \"ilen\" in adddic:\n                in_add_dic[\"shape\"] = [int(adddic[\"ilen\"]), int(adddic[\"idim\"])]\n            elif \"idim\" in adddic:\n                in_add_dic[\"shape\"] = [int(adddic[\"idim\"])]\n            # add all other key value\n            for key, value in adddic.items():\n                if key in [\"idim\", \"ilen\"]:\n                    continue\n                in_add_dic[key] = value\n            # add name\n            in_add_dic[\"name\"] = \"input%d\" % (len(input_list) + 1)\n\n            input_list.append(in_add_dic)\n            new_dic[key_id] = {\n                \"input\": input_list,\n                \"output\": orgdic[\"output\"],\n                \"utt2spk\": orgdic[\"utt2spk\"],\n            }\n        # add as output\n        else:\n            # original output\n            output_list = orgdic[\"output\"]\n            # additional output\n            out_add_dic = {}\n            # add shape\n            if \"odim\" in adddic and \"olen\" in adddic:\n                out_add_dic[\"shape\"] = 
[int(adddic[\"olen\"]), int(adddic[\"odim\"])]\n            elif \"odim\" in adddic:\n                out_add_dic[\"shape\"] = [int(adddic[\"odim\"])]\n            # add all other key value\n            for key, value in adddic.items():\n                if key in [\"odim\", \"olen\"]:\n                    continue\n                out_add_dic[key] = value\n            # add name\n            out_add_dic[\"name\"] = \"target%d\" % (len(output_list) + 1)\n\n            output_list.append(out_add_dic)\n            new_dic[key_id] = {\n                \"input\": orgdic[\"input\"],\n                \"output\": output_list,\n                \"utt2spk\": orgdic[\"utt2spk\"],\n            }\n            if \"lang\" in orgdic.keys():\n                new_dic[key_id][\"lang\"] = orgdic[\"lang\"]\n\n    # ensure \"ensure_ascii=False\", which is a bug\n    jsonstring = json.dumps(\n        {\"utts\": new_dic},\n        indent=4,\n        ensure_ascii=False,\n        sort_keys=True,\n        separators=(\",\", \": \"),\n    )\n    sys.stdout = codecs.getwriter(\"utf-8\")(\n        sys.stdout if is_python2 else sys.stdout.buffer\n    )\n    print(jsonstring)\n"
  },
  {
    "path": "egs/espnet_utils/apply-cmvn.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nfrom distutils.util import strtobool\nimport logging\n\nimport kaldiio\nimport numpy\n\nfrom espnet.transform.cmvn import CMVN\nfrom espnet.utils.cli_readers import file_reader_helper\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_utils import is_scipy_wav_style\nfrom espnet.utils.cli_writers import file_writer_helper\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"apply mean-variance normalization to files\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--in-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--stats-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"npy\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--out-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\"],\n        help=\"Specify the file format for the wspecifier. 
\"\n        '\"mat\" is the matrix format in kaldi',\n    )\n\n    parser.add_argument(\n        \"--norm-means\",\n        type=strtobool,\n        default=True,\n        help=\"Do variance normalization or not.\",\n    )\n    parser.add_argument(\n        \"--norm-vars\",\n        type=strtobool,\n        default=False,\n        help=\"Do variance normalization or not.\",\n    )\n    parser.add_argument(\n        \"--reverse\", type=strtobool, default=False, help=\"Do reverse mode or not\"\n    )\n    parser.add_argument(\n        \"--spk2utt\",\n        type=str,\n        help=\"A text file of speaker to utterance-list map. \"\n        \"(Don't give rspecifier format, such as \"\n        '\"ark:spk2utt\")',\n    )\n    parser.add_argument(\n        \"--utt2spk\",\n        type=str,\n        help=\"A text file of utterance to speaker map. \"\n        \"(Don't give rspecifier format, such as \"\n        '\"ark:utt2spk\")',\n    )\n    parser.add_argument(\n        \"--write-num-frames\", type=str, help=\"Specify wspecifer for utt2num_frames\"\n    )\n    parser.add_argument(\n        \"--compress\", type=strtobool, default=False, help=\"Save in compressed format\"\n    )\n    parser.add_argument(\n        \"--compression-method\",\n        type=int,\n        default=2,\n        help=\"Specify the method(if mat) or \" \"gzip-level(if hdf5)\",\n    )\n    parser.add_argument(\n        \"stats_rspecifier_or_rxfilename\",\n        help=\"Input stats. e.g. ark:stats.ark or stats.mat\",\n    )\n    parser.add_argument(\n        \"rspecifier\", type=str, help=\"Read specifier id. e.g. ark:some.ark\"\n    )\n    parser.add_argument(\n        \"wspecifier\", type=str, help=\"Write specifier id. e.g. 
ark:some.ark\"\n    )\n    return parser\n\n\ndef main():\n    args = get_parser().parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    if \":\" in args.stats_rspecifier_or_rxfilename:\n        is_rspcifier = True\n        if args.stats_filetype == \"npy\":\n            stats_filetype = \"hdf5\"\n        else:\n            stats_filetype = args.stats_filetype\n\n        stats_dict = dict(\n            file_reader_helper(args.stats_rspecifier_or_rxfilename, stats_filetype)\n        )\n    else:\n        is_rspcifier = False\n        if args.stats_filetype == \"mat\":\n            stats = kaldiio.load_mat(args.stats_rspecifier_or_rxfilename)\n        else:\n            stats = numpy.load(args.stats_rspecifier_or_rxfilename)\n        stats_dict = {None: stats}\n\n    cmvn = CMVN(\n        stats=stats_dict,\n        norm_means=args.norm_means,\n        norm_vars=args.norm_vars,\n        utt2spk=args.utt2spk,\n        spk2utt=args.spk2utt,\n        reverse=args.reverse,\n    )\n\n    with file_writer_helper(\n        args.wspecifier,\n        filetype=args.out_filetype,\n        write_num_frames=args.write_num_frames,\n        compress=args.compress,\n        compression_method=args.compression_method,\n    ) as writer:\n        for utt, mat in file_reader_helper(args.rspecifier, args.in_filetype):\n            if is_scipy_wav_style(mat):\n                # If data is sound file, then got as Tuple[int, ndarray]\n                rate, mat = mat\n            mat = cmvn(mat, utt if is_rspcifier else None)\n            writer[utt] = mat\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/asr_align_wav.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2020 Johns Hopkins University (Xuankai Chang)\n# 2020 Technische Universität München, Authors: Ludwig Kürzinger, Dominik Winkelbauer\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nif [ ! -f path.sh ] || [ ! -f cmd.sh ]; then\n    echo \"Please change current directory to recipe directory e.g., egs/tedlium2/asr1\"\n    exit 1\nfi\n\n. ./path.sh\n\n# general configuration\npython=python3\nbackend=pytorch\nstage=-1       # start from -1 if you need to start from model download\nstop_stage=100\nngpu=0         # number of gpus (\"0\" uses cpu, otherwise use gpu)\nverbose=1      # verbose option\n\n# feature configuration\ndo_delta=false\ncmvn=\n\n# decoding parameter\nalign_model=\nalign_config=\nalign_dir=align\napi=v1\n\n# Parameters for CTC alignment\n# The subsampling factor depends on whether the encoder uses subsampling\nsubsampling_factor=4\n# minimum confidence score in log space - may need adjustment depending on data and model, e.g. -1.5 or -5.0\nmin_confidence_score=-5.0\n# minimum length of one utterance (counted in frames)\nmin_window_size=8000\n# partitioning length L for calculation of the confidence score\nscoring_length=30\n\n\n# download related\nmodels=tedlium2.rnn.v2\ndict=\nnlsyms=\n\n. utils/parse_options.sh || exit 1;\n\nhelp_message=$(cat <<EOF\nUsage:\n    $0 [options] <wav_file> \"<text>\"\n\nOptions:\n    --backend <chainer|pytorch>     # chainer or pytorch (Default: pytorch)\n    --ngpu <ngpu>                   # Number of GPUs (Default: 0)\n    --align-dir <directory_name>    # Name of directory to store decoding temporary data\n    --models <model_name>           # Model name (e.g. 
tedlium2.transformer.v1)\n    --cmvn <path>                   # Location of cmvn.ark\n    --align-model <path>            # Location of E2E model\n    --align-config <path>           # Location of configuration file\n    --api <api_version>             # API version (v1 or v2, available only in the pytorch backend)\n    --nlsyms <path>                 # Non-linguistic symbol list\n\nExample:\n    # Record audio from microphone input as example.wav\n    rec -c 1 -r 16000 example.wav trim 0 5\n\n    # Align using model name\n    $0 --models tedlium2.transformer.v1 example.wav \"example text\"\n\n    # Align using model file\n    $0 --cmvn cmvn.ark --align-model model.acc.best --align-config conf/align.yaml example.wav \"example text\"\n\n    # Align with GPU (requires batchsize > 0 in configuration file)\n    $0 --ngpu 1 example.wav \"example text\"\n\nAvailable models:\n    - tedlium2.rnn.v1\n    - tedlium2.rnn.v2\n    - tedlium2.transformer.v1\n    - tedlium3.transformer.v1\n    - librispeech.transformer.v1\n    - librispeech.transformer.v1.transformerlm.v1\n    - commonvoice.transformer.v1\n    - csj.transformer.v1\n    - csj.rnn.v1\n    - wsj.transformer.v1\n    - wsj.transformer_small.v1\nEOF\n)\n\n\n# make shellcheck happy\ntrain_cmd=\n\n. ./cmd.sh\n\nwav=$1\ntext=$2\ndownload_dir=${align_dir}/download\n\nif [ ! $# -eq 2 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\n# check backend (only pytorch is supported)\nif [ \"${backend}\" = \"chainer\" ]; then\n    echo \"chainer backend is not supported.\" >&2\n    exit 1;\nfi\n\n# Check model name or model file is set\nif [ -z $models ]; then\n    if [[ -z $cmvn || -z $align_model || -z $align_config ]]; then\n        echo 'Error: models or set of cmvn, align_model and align_config are required.' 
>&2\n        exit 1\n    fi\nfi\n\n# Check for transformer models because of their memory consumption\nif [[ $models == *\"rnn\"* ]]; then\n    echo \"Using RNN model: \"${models}\nelse\n    echo \"Using Transformer model: \"${models}\n    echo \"WARNING. For large audio files, use an RNN model.\"\nfi\n\ndir=${download_dir}/${models}\nmkdir -p ${dir}\n\nfunction download_models () {\n    if [ -z $models ]; then\n        return\n    fi\n\n    file_ext=\"tar.gz\"\n    case \"${models}\" in\n        \"tedlium2.rnn.v1\") share_url=\"https://drive.google.com/open?id=1UqIY6WJMZ4sxNxSugUqp3mrGb3j6h7xe\"; api=v1 ;;\n        \"tedlium2.rnn.v2\") share_url=\"https://drive.google.com/open?id=1cac5Uc09lJrCYfWkLQsF8eapQcxZnYdf\"; api=v1 ;;\n        \"tedlium2.transformer.v1\") share_url=\"https://drive.google.com/open?id=1cVeSOYY1twOfL9Gns7Z3ZDnkrJqNwPow\" ;;\n        \"tedlium3.transformer.v1\") share_url=\"https://drive.google.com/open?id=1zcPglHAKILwVgfACoMWWERiyIquzSYuU\" ;;\n        \"librispeech.transformer.v1\") share_url=\"https://drive.google.com/open?id=1BtQvAnsFvVi-dp_qsaFP7n4A_5cwnlR6\" ;;\n        \"librispeech.transformer.v1.transformerlm.v1\") share_url=\"https://drive.google.com/open?id=17cOOSHHMKI82e1MXj4r2ig8gpGCRmG2p\" ;;\n        \"commonvoice.transformer.v1\") share_url=\"https://drive.google.com/open?id=1tWccl6aYU67kbtkm8jv5H6xayqg1rzjh\" ;;\n        \"csj.transformer.v1\") share_url=\"https://drive.google.com/open?id=120nUQcSsKeY5dpyMWw_kI33ooMRGT2uF\" ;;\n        \"csj.rnn.v1\") share_url=\"https://drive.google.com/open?id=1ALvD4nHan9VDJlYJwNurVr7H7OV0j2X9\" ;;\n        \"wsj.transformer.v1\") share_url=\"https://drive.google.com/open?id=1Az-4H25uwnEFa4lENc-EKiPaWXaijcJp\" ;;\n        \"wsj.transformer_small.v1\") share_url=\"https://drive.google.com/open?id=1jdEKbgWhLTxN_qP4xwE7mTOPmp7Ga--T\" ;;\n        *) echo \"No such models: ${models}\"; exit 1 ;;\n    esac\n\n    if [ ! 
-e ${dir}/.complete ]; then\n        download_from_google_drive.sh ${share_url} ${dir} ${file_ext}\n        touch ${dir}/.complete\n    fi\n}\n\n# Download trained models\nif [ -z \"${cmvn}\" ]; then\n    download_models\n    cmvn=$(find ${download_dir}/${models} -name \"cmvn.ark\" | head -n 1)\nfi\nif [ -z \"${align_model}\" ]; then\n    download_models\n    align_model=$(find ${download_dir}/${models} -name \"model*.best*\" | head -n 1)\nfi\nif [ -z \"${align_config}\" ]; then\n    download_models\n    align_config=$(find ${download_dir}/${models} -name \"decode*.yaml\" | head -n 1)\nfi\nif [ -z \"${wav}\" ]; then\n    download_models\n    wav=$(find ${download_dir}/${models} -name \"*.wav\" | head -n 1)\nfi\nif [ -z \"${dict}\" ]; then\n    download_models\n\n    if [ -z \"${dict}\" ]; then\n        mkdir -p ${download_dir}/${models}/data/lang_autochar/\n        model_config=$(find -L ${download_dir}/${models}/exp/*/results/model.json | head -n 1)\n        dict=${download_dir}/${models}/data/lang_autochar/dict.txt\n        python -c 'import json,sys;obj=json.load(sys.stdin);[print(char + \" \" + str(i + 1)) for i, char in enumerate(obj[2][\"char_list\"])]' > ${dict} < ${model_config}\n    fi\nfi\n\n# Check file existence\nif [ ! -f \"${cmvn}\" ]; then\n    echo \"No such CMVN file: ${cmvn}\"\n    exit 1\nfi\nif [ ! -f \"${align_model}\" ]; then\n    echo \"No such E2E model: ${align_model}\"\n    exit 1\nfi\nif [ ! -f \"${align_config}\" ]; then\n    echo \"No such config file: ${align_config}\"\n    exit 1\nfi\nif [ ! -f \"${dict}\" ]; then\n    echo \"No such Dictionary file: ${dict}\"\n    exit 1\nfi\nif [ ! 
-f \"${wav}\" ]; then\n    echo \"No such WAV file: ${wav}\"\n    exit 1\nfi\nif [ -z \"${text}\" ]; then\n    echo \"Text is empty: ${text}\"\n    exit 1\nfi\n\nbase=$(basename $wav .wav)\n\nif [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n    echo \"stage 0: Data preparation\"\n\n    mkdir -p ${align_dir}/data\n    echo \"$base $wav\" > ${align_dir}/data/wav.scp\n    echo \"X $base\" > ${align_dir}/data/spk2utt\n    echo \"$base X\" > ${align_dir}/data/utt2spk\n    echo \"$base $text\" > ${align_dir}/data/text\nfi\n\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Feature Generation\"\n\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 1 --write_utt2num_frames true \\\n        ${align_dir}/data ${align_dir}/log ${align_dir}/fbank || exit 1;\n\n    feat_align_dir=${align_dir}/dump; mkdir -p ${feat_align_dir}\n    dump.sh --cmd \"$train_cmd\" --nj 1 --do_delta ${do_delta} \\\n        ${align_dir}/data/feats.scp ${cmvn} ${align_dir}/log \\\n        ${feat_align_dir}\n    utils/fix_data_dir.sh ${align_dir}/data\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Json Data Preparation\"\n\n    nlsyms_opts=\"\"\n    if [[ -n ${nlsyms} ]]; then\n        nlsyms_opts=\"--nlsyms ${nlsyms}\"\n    fi\n\n    feat_align_dir=${align_dir}/dump\n    data2json.sh --feat ${feat_align_dir}/feats.scp ${nlsyms_opts} \\\n        ${align_dir}/data ${dict} > ${feat_align_dir}/data.json || exit 1;\n\nfi\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: Aligning\"\n    feat_align_dir=${align_dir}/dump\n\n    ${python} -m espnet.bin.asr_align \\\n        --config ${align_config} \\\n        --ngpu ${ngpu} \\\n        --verbose ${verbose} \\\n        --data-json ${feat_align_dir}/data.json \\\n        --model ${align_model} \\\n        --subsampling-factor ${subsampling_factor} \\\n        --min-window-size ${min_window_size} \\\n        --scoring-length ${scoring_length} \\\n        --api 
${api} \\\n        --utt-text ${align_dir}/utt_text \\\n        --output ${align_dir}/aligned_segments || exit 1;\n\n    echo \"\"\n    echo \"Segments file: $(wc -l ${align_dir}/aligned_segments)\"\n    count_reliable=$(awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments | wc -l)\n    echo \"Utterances with min confidence score: ${count_reliable}\"\n    echo \"Finished.\"\nfi\n"
  },
  {
    "path": "egs/espnet_utils/average_checkpoints.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\nimport argparse\nimport json\nimport os\n\nimport numpy as np\n\n\ndef main():\n    if args.log is not None:\n        with open(args.log) as f:\n            logs = json.load(f)\n        val_scores = []\n        for log in logs:\n            if log[\"epoch\"] > args.max_epoch:\n                continue\n\n            if args.metric == \"acc\":\n                if \"validation/main/acc\" in log.keys():\n                    val_scores += [[log[\"epoch\"], log[\"validation/main/acc\"]]]\n            elif args.metric == \"perplexity\":\n                if \"val_perplexity\" in log.keys():\n                    val_scores += [[log[\"epoch\"], 1 / log[\"val_perplexity\"]]]\n            elif args.metric == \"loss\":\n                if \"validation/main/loss\" in log.keys():\n                    val_scores += [[log[\"epoch\"], -log[\"validation/main/loss\"]]]\n            elif args.metric == \"bleu\":\n                if \"validation/main/bleu\" in log.keys():\n                    val_scores += [[log[\"epoch\"], log[\"validation/main/bleu\"]]]\n            elif args.metric == \"cer\":\n                if \"validation/main/cer\" in log.keys():\n                    val_scores += [[log[\"epoch\"], -log[\"validation/main/cer\"]]]\n            elif args.metric == \"cer_ctc\":\n                if \"validation/main/cer_ctc\" in log.keys():\n                    val_scores += [[log[\"epoch\"], -log[\"validation/main/cer_ctc\"]]]\n            else:\n                # Keep original order for compatibility\n                if \"validation/main/acc\" in log.keys():\n                    val_scores += [[log[\"epoch\"], log[\"validation/main/acc\"]]]\n                elif \"val_perplexity\" in log.keys():\n                    val_scores += [[log[\"epoch\"], 1 / log[\"val_perplexity\"]]]\n                elif \"validation/main/loss\" in log.keys():\n                    val_scores += [[log[\"epoch\"], 
-log[\"validation/main/loss\"]]]\n\n        if len(val_scores) == 0:\n            raise ValueError(\"%s is not found in log.\" % args.metric)\n        val_scores = np.array(val_scores)\n        sort_idx = np.argsort(val_scores[:, -1])\n        sorted_val_scores = val_scores[sort_idx][::-1]\n        print(\"metric: %s\" % args.metric)\n        print(\"best val scores = \" + str(sorted_val_scores[: int(args.num), 1]))\n        print(\n            \"selected epochs = \"\n            + str(sorted_val_scores[: int(args.num), 0].astype(np.int64))\n        )\n        last = [\n            os.path.dirname(args.snapshots[0]) + \"/snapshot.ep.%d\" % (int(epoch))\n            for epoch in sorted_val_scores[: int(args.num), 0]\n        ]\n        args.num = int(args.num)\n    else:\n        print(args.num)\n        last = sorted(args.snapshots, key=lambda x: int(x.split(\".\")[-1]))\n        if args.num.isdigit():\n            last = last[-int(args.num) :]\n        else:\n            start, end = args.num.split('_')\n            start, end  = int(start) - 1, int(end)\n            last = last[start: end]\n        args.num = len(last)\n    print(\"average over\", last)\n    avg = None\n\n    if args.backend == \"pytorch\":\n        import torch\n\n        # sum\n        for path in last:\n            states = torch.load(path, map_location=torch.device(\"cpu\"))[\"model\"]\n            if avg is None:\n                avg = states\n            else:\n                for k in avg.keys():\n                    avg[k] += states[k]\n\n        # average\n        for k in avg.keys():\n            if avg[k] is not None:\n                if avg[k].is_floating_point():\n                    avg[k] /= args.num\n                else:\n                    avg[k] //= args.num\n\n        torch.save(avg, args.out)\n\n    elif args.backend == \"chainer\":\n        # sum\n        for path in last:\n            states = np.load(path)\n            if avg is None:\n                keys = 
[x.split(\"main/\")[1] for x in states if \"model\" in x]\n                avg = dict()\n                for k in keys:\n                    avg[k] = states[\"updater/model:main/{}\".format(k)]\n            else:\n                for k in keys:\n                    avg[k] += states[\"updater/model:main/{}\".format(k)]\n        # average\n        for k in keys:\n            if avg[k] is not None:\n                avg[k] /= args.num\n        np.savez_compressed(args.out, **avg)\n        os.rename(\"{}.npz\".format(args.out), args.out)  # numpy save with .npz extension\n    else:\n        raise ValueError(\"Incorrect type of backend\")\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"average models from snapshot\")\n    parser.add_argument(\"--snapshots\", required=True, type=str, nargs=\"+\")\n    parser.add_argument(\"--out\", required=True, type=str)\n    # The default must be a string: main() calls args.num.isdigit(), and argparse\n    # applies type= only to string defaults.\n    parser.add_argument(\n        \"--num\",\n        default=\"10\",\n        type=str,\n        help=\"Number of snapshots to average, or a 1-based inclusive range \"\n        \"such as 3_7 (range form only without --log)\",\n    )\n    parser.add_argument(\"--backend\", default=\"chainer\", type=str)\n    parser.add_argument(\"--log\", default=None, type=str, nargs=\"?\")\n    parser.add_argument(\n        \"--metric\",\n        default=\"\",\n        type=str,\n        nargs=\"?\",\n        choices=[\"acc\", \"bleu\", \"cer\", \"cer_ctc\", \"loss\", \"perplexity\"],\n    )\n    parser.add_argument(\n        \"--max-epoch\",\n        default=10000000,\n        type=int,\n        nargs=\"?\",\n    )\n    return parser\n\n\nif __name__ == \"__main__\":\n    args = get_parser().parse_args()\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/build_fake_lexicon.py",
    "content": "import sys\nimport os\n\norig_lexicon = sys.argv[1]\n\nfor line in open(orig_lexicon, encoding=\"utf-8\"):\n    word = line.strip().split()[0]\n    if word.startswith(\"<\") or word == \"SIL\" or word == \"sil\":\n        print(f\"{word} {word}\")\n    else:\n        out = [word] + list(word)\n        out = \" \".join(out)\n        print(out)\n        \n"
  },
  {
    "path": "egs/espnet_utils/build_sp_text.py",
    "content": "import sys\n\nin_f = sys.argv[1]\n\nfor line in open(in_f, 'r', encoding=\"utf8\"):\n    elems = line.split()\n    uttid = elems[0]\n    for sp in [\"0.9\", \"1.0\", \"1.1\"]:\n        uttid_sp = f\"sp{sp}-{uttid}\"\n        line = f\"{uttid_sp} \" + \" \".join(elems[1:])\n        print(line)\n"
  },
  {
    "path": "egs/espnet_utils/calculate_rtf.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2021 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nfrom dateutil import parser\nimport glob\nimport os\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"calculate real time factor (RTF)\")\n    parser.add_argument(\n        \"--log-dir\",\n        type=str,\n        default=None,\n        help=\"path to logging directory\",\n    )\n    return parser\n\n\ndef main():\n\n    args = get_parser().parse_args()\n\n    audio_sec = 0\n    decode_sec = 0\n    n_utt = 0\n\n    for x in glob.glob(os.path.join(args.log_dir, \"decode.*.log\")):\n        # Reset the per-file lists so that earlier logs are not counted again\n        audio_durations = []\n        start_times = []\n        end_times = []\n        with codecs.open(x, \"r\", \"utf-8\") as f:\n            for line in f:\n                x = line.strip()\n                if \"INFO: input lengths\" in x:\n                    audio_durations += [int(x.split(\"input lengths: \")[1])]\n                    start_times += [parser.parse(x.split(\"(\")[0])]\n                elif \"INFO: prediction\" in x:\n                    end_times += [parser.parse(x.split(\"(\")[0])]\n        assert len(audio_durations) == len(end_times), (\n            len(audio_durations),\n            len(end_times),\n        )\n        assert len(start_times) == len(end_times), (len(start_times), len(end_times))\n        audio_sec += sum(audio_durations) / 100  # [sec]\n        decode_sec += sum(\n            [\n                (end - start).total_seconds()\n                for start, end in zip(start_times, end_times)\n            ]\n        )\n        n_utt += len(audio_durations)\n\n    print(\"Total audio duration: %.3f [sec]\" % audio_sec)\n    print(\"Total decoding time: %.3f [sec]\" % decode_sec)\n    rtf = decode_sec / audio_sec if audio_sec > 0 else 0\n    print(\"RTF: %.3f\" % rtf)\n    latency = decode_sec * 1000 / n_utt if n_utt > 0 else 0\n    
print(\"Latency: %.3f [ms/sentence]\" % latency)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/change_root.py",
    "content": "# author: tyriontian\n# tyriontian@tencent.com\n\n# This script changes the root dir recorded in data files, such as the dump\n# directory in espnet format. By changing the root, we can transplant the data\n# into other experiments.\n\n# e.g., python3 espnet_utils/change_root.py /mnt/ceph_asr_ts/tomasyu/jinchuan/las_mmi/ /apdcephfs/share_1149801/speech_user/tomasyu/jinchuan/lasmmi/ dump/dev_test/ \".json,.scp\"\nimport sys\nimport os\n\norg_pref = sys.argv[1]\ndst_pref = sys.argv[2]\nroot = sys.argv[3]\nsuffix = sys.argv[4]\n\nif org_pref[-1] != '/' or dst_pref[-1] != '/':\n    raise ValueError(\"path should end with /\")\n\nsuffix = suffix.strip().split(\",\")\nprint(f\"Working under the directory: {root}\")\nprint(f\"Change the prefix in all files that end with {suffix}\")\nprint(f\"The initial prefix is {org_pref}; The destination prefix is {dst_pref}\")\n\n# BFS search: find all files to change root\nqueue = [root]\nflist = []\nwhile queue:\n    cur_dir = queue.pop(0)\n    for d in os.listdir(cur_dir):\n        d = os.path.join(cur_dir, d)\n\n        # record each matching file once, even if several suffixes match\n        if os.path.isfile(d) and d.endswith(tuple(suffix)):\n            flist.append(d)\n            print(f\"File to change: {d}\")\n\n        if os.path.isdir(d):\n            queue.append(d)\n\n# process these files one by one\nfor f in flist:\n    # the context manager closes (and flushes) each file after rewriting\n    with open(f, 'r+', encoding=\"utf-8\") as handle:\n        context = handle.readlines()\n        handle.seek(0)\n        handle.truncate(0)\n        for line in context:\n            handle.write(line.replace(org_pref, dst_pref))\n"
  },
  {
    "path": "egs/espnet_utils/change_yaml.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nfrom pathlib import Path\n\nimport yaml\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"change specified attributes of a YAML file\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n\n    egroup = parser.add_mutually_exclusive_group()\n    parser.add_argument(\"inyaml\", nargs=\"?\")\n    egroup.add_argument(\"-o\", \"--outyaml\")\n    egroup.add_argument(\"--outdir\")\n    parser.add_argument(\n        \"-a\",\n        \"--arg\",\n        action=\"append\",\n        default=[],\n        help=\"e.g -a a.b.c=4 -> {'a': {'b': {'c': 4}}}\",\n    )\n    parser.add_argument(\n        \"-d\",\n        \"--delete\",\n        action=\"append\",\n        default=[],\n        help='e.g -d a -> \"a\" is removed from the input yaml',\n    )\n    return parser\n\n\ndef main():\n    args = get_parser().parse_args()\n\n    if args.inyaml is None:\n        indict = {}\n    else:\n        with open(args.inyaml, \"r\") as f:\n            indict = yaml.load(f, Loader=yaml.Loader)\n        if indict is None:\n            indict = {}\n\n    if args.outyaml is None:\n        # Auto naming from arguments\n        eles = []\n        if args.inyaml is not None:\n            p = Path(args.inyaml)\n            if args.outdir is None:\n                outdir = p.parent\n            else:\n                outdir = Path(args.outdir)\n            eles.append(str(outdir / p.stem))\n\n        table = str.maketrans(\"{}[]()\", \"%%__--\", \" |&;#*?~\\\"'\\\\\")\n        for arg in args.delete:\n            value = arg.translate(table)\n            eles.append(\"del-\" + value)\n        for arg in args.arg:\n            if \"=\" not in arg:\n                raise RuntimeError(f'\"{arg}\" does\\'t include \"=\"')\n            key, value = arg.split(\"=\")\n            key = key.translate(table)\n            value = value.translate(table)\n            eles.append(key + value)\n\n      
  outyaml = \"_\".join(eles)\n        if outyaml == \"\":\n            outyaml = \"config\"\n        outyaml += \".yaml\"\n        if args.inyaml == outyaml:\n            # args.outyaml is None in this branch, so derive the new name\n            # from the auto-generated one\n            p = Path(outyaml)\n            outyaml = p.parent / (p.stem + \".2\" + p.suffix)\n\n        outyaml = Path(outyaml)\n    else:\n        outyaml = Path(args.outyaml)\n\n    for arg in args.delete + args.arg:\n        if \"=\" in arg:\n            key, value = arg.split(\"=\")\n            if not value.strip() == \"\":\n                value = yaml.load(value, Loader=yaml.Loader)\n        else:\n            key = arg\n            value = None\n\n        keys = key.split(\".\")\n        d = indict\n        for idx, k in enumerate(keys):\n            if idx == len(keys) - 1:\n                if isinstance(d, (tuple, list)):\n                    k = int(k)\n                    if k >= len(d):\n                        d += type(d)(None for _ in range(k - len(d) + 1))\n                if value is not None:\n                    d[k] = value\n                else:\n                    del d[k]\n            else:\n                if isinstance(d, (tuple, list)):\n                    k = int(k)\n                    if k >= len(d):\n                        d += type(d)(None for _ in range(k - len(d) + 1))\n                elif isinstance(d, dict):\n                    if k not in d:\n                        d[k] = {}\n                if not isinstance(d[k], (dict, tuple, list)):\n                    d[k] = {}\n                d = d[k]\n\n    outyaml.parent.mkdir(parents=True, exist_ok=True)\n    with outyaml.open(\"w\") as f:\n        yaml.dump(indict, f, Dumper=yaml.Dumper, indent=4, sort_keys=False)\n    print(outyaml)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/clean_corpus.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2021 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nmaxframes=3000\nmaxchars=400\nutt_extra_files=\"text.tc text.lc text.lc.rm\"\nno_feat=false\n\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> <langs>\ne.g.: $0 data/train \"en de\"\nOptions:\n  --maxframes        # maximum number of input frames\n  --maxchars         # maximum number of characters\n  --utt_extra_files  # extra text files for target sequence\n  --no_feat          # set to True for MT recipe\nEOF\n)\necho \"$0 $*\"  # Print the command line for logging\n\n. parse_options.sh || exit 1;\n\n# Both <data-dir> and <langs> are required (\"set -u\" below rejects an unset \\$2)\nif [ $# -ne 2 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\ndata_dir=$1\nlangs=$2\n\nmkdir -p ${data_dir}\ntmpdir=$(mktemp -d ${data_dir}/tmp-XXXXX)\ntrap 'rm -rf ${tmpdir}' EXIT\n\n# remove utt having more than ${maxframes} frames\n# remove utt having more than ${maxchars} characters\nfor lang in ${langs}; do\n    remove_longshortdata.sh --no_feat ${no_feat} --maxframes ${maxframes} --maxchars ${maxchars} ${data_dir}.${lang} ${tmpdir}.${lang}\ndone\n\n# Match the number of utterances between source and target languages\nfor lang in ${langs}; do\n    cut -f 1 -d \" \" ${tmpdir}.${lang}/text > ${tmpdir}.${lang}/reclist\n    if [ ! -f ${tmpdir}/reclist ]; then\n        cp ${tmpdir}.${lang}/reclist  ${tmpdir}/reclist\n    else\n        # extract common lines\n        comm -12 ${tmpdir}/reclist ${tmpdir}.${lang}/reclist > ${tmpdir}/reclist.tmp\n        mv ${tmpdir}/reclist.tmp ${tmpdir}/reclist\n    fi\ndone\n\nfor lang in ${langs}; do\n    reduce_data_dir.sh ${tmpdir}.${lang} ${tmpdir}/reclist ${data_dir}.${lang}\n    utils/fix_data_dir.sh --utt_extra_files \"${utt_extra_files}\" ${data_dir}.${lang}\ndone\n\nrm -rf ${tmpdir}*\n"
  },
  {
    "path": "egs/espnet_utils/compute-cmvn-stats.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nimport logging\n\nimport kaldiio\nimport numpy as np\n\nfrom espnet.transform.transformation import Transformation\nfrom espnet.utils.cli_readers import file_reader_helper\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_utils import is_scipy_wav_style\nfrom espnet.utils.cli_writers import file_writer_helper\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Compute cepstral mean and \"\n        \"variance normalization statistics. \"\n        \"If wspecifier provided: per-utterance by default, \"\n        \"or per-speaker if \"\n        \"spk2utt option provided; if wxfilename: global\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--spk2utt\",\n        type=str,\n        help=\"A text file of speaker to utterance-list map. \"\n        \"(Don't give rspecifier format, such as \"\n        '\"ark:spk2utt\")',\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--in-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--out-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"npy\"],\n        help=\"Specify the file format for the wspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"rspecifier\", type=str, help=\"Read specifier for feats. e.g. 
ark:some.ark\"\n    )\n    parser.add_argument(\n        \"wspecifier_or_wxfilename\", type=str, help=\"Write specifier. e.g. ark:some.ark\"\n    )\n    return parser\n\n\ndef main():\n    args = get_parser().parse_args()\n\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    is_wspecifier = \":\" in args.wspecifier_or_wxfilename\n\n    if is_wspecifier:\n        if args.spk2utt is not None:\n            logging.info(\"Performing as speaker CMVN mode\")\n            utt2spk_dict = {}\n            with open(args.spk2utt) as f:\n                for line in f:\n                    spk, utts = line.rstrip().split(None, 1)\n                    for utt in utts.split():\n                        utt2spk_dict[utt] = spk\n\n            def utt2spk(x):\n                return utt2spk_dict[x]\n\n        else:\n            logging.info(\"Performing as utterance CMVN mode\")\n\n            def utt2spk(x):\n                return x\n\n        if args.out_filetype == \"npy\":\n            logging.warning(\n                \"--out-filetype npy is allowed only for \"\n                \"Global CMVN mode, changing to hdf5\"\n            )\n            args.out_filetype = \"hdf5\"\n\n    else:\n        logging.info(\"Performing as global CMVN mode\")\n        if args.spk2utt is not None:\n            logging.warning(\"spk2utt is not used for global CMVN mode\")\n\n        def utt2spk(x):\n            return None\n\n        if args.out_filetype == \"hdf5\":\n            logging.warning(\n                \"--out-filetype hdf5 is not allowed for \"\n                \"Global CMVN mode, changing to npy\"\n            )\n            args.out_filetype = \"npy\"\n\n    if args.preprocess_conf is not None:\n        preprocessing = 
Transformation(args.preprocess_conf)\n        logging.info(\"Apply preprocessing: {}\".format(preprocessing))\n    else:\n        preprocessing = None\n\n    # Calculate stats for each speaker\n    counts = {}\n    sum_feats = {}\n    square_sum_feats = {}\n\n    idx = 0\n    for idx, (utt, matrix) in enumerate(\n        file_reader_helper(args.rspecifier, args.in_filetype), 1\n    ):\n        if is_scipy_wav_style(matrix):\n            # If data is sound file, then got as Tuple[int, ndarray]\n            rate, matrix = matrix\n        if preprocessing is not None:\n            matrix = preprocessing(matrix, uttid_list=utt)\n\n        spk = utt2spk(utt)\n\n        # Init at the first seen of the spk\n        if spk not in counts:\n            counts[spk] = 0\n            feat_shape = matrix.shape[1:]\n            # Accumulate in double precision\n            sum_feats[spk] = np.zeros(feat_shape, dtype=np.float64)\n            square_sum_feats[spk] = np.zeros(feat_shape, dtype=np.float64)\n\n        counts[spk] += matrix.shape[0]\n        sum_feats[spk] += matrix.sum(axis=0)\n        square_sum_feats[spk] += (matrix ** 2).sum(axis=0)\n    logging.info(\"Processed {} utterances\".format(idx))\n    assert idx > 0, idx\n\n    cmvn_stats = {}\n    for spk in counts:\n        feat_shape = sum_feats[spk].shape\n        cmvn_shape = (2, feat_shape[0] + 1) + feat_shape[1:]\n        _cmvn_stats = np.empty(cmvn_shape, dtype=np.float64)\n        _cmvn_stats[0, :-1] = sum_feats[spk]\n        _cmvn_stats[1, :-1] = square_sum_feats[spk]\n\n        _cmvn_stats[0, -1] = counts[spk]\n        _cmvn_stats[1, -1] = 0.0\n\n        # You can get the mean and std as following,\n        # >>> N = _cmvn_stats[0, -1]\n        # >>> mean = _cmvn_stats[0, :-1] / N\n        # >>> std = np.sqrt(_cmvn_stats[1, :-1] / N - mean ** 2)\n\n        cmvn_stats[spk] = _cmvn_stats\n\n    # Per utterance or speaker CMVN\n    if is_wspecifier:\n        with file_writer_helper(\n            
args.wspecifier_or_wxfilename, filetype=args.out_filetype\n        ) as writer:\n            for spk, mat in cmvn_stats.items():\n                writer[spk] = mat\n\n    # Global CMVN\n    else:\n        matrix = cmvn_stats[None]\n        if args.out_filetype == \"npy\":\n            np.save(args.wspecifier_or_wxfilename, matrix)\n        elif args.out_filetype == \"mat\":\n            # Kaldi supports only matrix or vector\n            kaldiio.save_mat(args.wspecifier_or_wxfilename, matrix)\n        else:\n            raise RuntimeError(\n                \"Not supporting: --out-filetype {}\".format(args.out_filetype)\n            )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/compute-fbank-feats.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nfrom distutils.util import strtobool\nimport logging\n\nimport kaldiio\nimport numpy\nimport resampy\n\nfrom espnet.transform.spectrogram import logmelspectrogram\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_writers import file_writer_helper\nfrom espnet2.utils.types import int_or_none\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"compute FBANK feature from WAV\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--fs\", type=int_or_none, help=\"Sampling frequency\")\n    parser.add_argument(\n        \"--fmax\", type=int_or_none, default=None, nargs=\"?\", help=\"Maximum frequency\"\n    )\n    parser.add_argument(\n        \"--fmin\", type=int_or_none, default=None, nargs=\"?\", help=\"Minimum frequency\"\n    )\n    parser.add_argument(\"--n_mels\", type=int, default=80, help=\"Number of mel basis\")\n    parser.add_argument(\"--n_fft\", type=int, default=1024, help=\"FFT length in point\")\n    parser.add_argument(\n        \"--n_shift\", type=int, default=512, help=\"Shift length in point\"\n    )\n    parser.add_argument(\n        \"--win_length\",\n        type=int_or_none,\n        default=None,\n        nargs=\"?\",\n        help=\"Analisys window length in point\",\n    )\n    parser.add_argument(\n        \"--window\",\n        type=str,\n        default=\"hann\",\n        choices=[\"hann\", \"hamming\"],\n        help=\"Type of window\",\n    )\n    parser.add_argument(\n        \"--write-num-frames\", type=str, help=\"Specify wspecifer for utt2num_frames\"\n    )\n    parser.add_argument(\n        \"--filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\"],\n        help=\"Specify the file format for output. 
\"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--compress\", type=strtobool, default=False, help=\"Save in compressed format\"\n    )\n    parser.add_argument(\n        \"--compression-method\",\n        type=int,\n        default=2,\n        help=\"Specify the method(if mat) or \" \"gzip-level(if hdf5)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--normalize\",\n        choices=[1, 16, 24, 32],\n        type=int,\n        default=None,\n        help=\"Give the bit depth of the PCM, \"\n        \"then normalizes data to scale in [-1,1]\",\n    )\n    parser.add_argument(\"rspecifier\", type=str, help=\"WAV scp file\")\n    parser.add_argument(\n        \"--segments\",\n        type=str,\n        help=\"segments-file format: each line is either\"\n        \"<segment-id> <recording-id> <start-time> <end-time>\"\n        \"e.g. call-861225-A-0050-0065 call-861225-A 5.0 6.5\",\n    )\n    parser.add_argument(\"wspecifier\", type=str, help=\"Write specifier\")\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    with kaldiio.ReadHelper(\n        args.rspecifier, segments=args.segments\n    ) as reader, file_writer_helper(\n        args.wspecifier,\n        filetype=args.filetype,\n        write_num_frames=args.write_num_frames,\n        compress=args.compress,\n        compression_method=args.compression_method,\n    ) as writer:\n        for utt_id, (rate, array) in reader:\n            array = array.astype(numpy.float32)\n            if args.fs is not None and rate != args.fs:\n               
 array = resampy.resample(array, rate, args.fs, axis=0)\n            if args.normalize is not None and args.normalize != 1:\n                array = array / (1 << (args.normalize - 1))\n\n            lmspc = logmelspectrogram(\n                x=array,\n                fs=args.fs if args.fs is not None else rate,\n                n_mels=args.n_mels,\n                n_fft=args.n_fft,\n                n_shift=args.n_shift,\n                win_length=args.win_length,\n                window=args.window,\n                fmin=args.fmin,\n                fmax=args.fmax,\n            )\n            writer[utt_id] = lmspc\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/compute-stft-feats.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nfrom distutils.util import strtobool\nimport logging\n\nimport kaldiio\nimport numpy\nimport resampy\n\nfrom espnet.transform.spectrogram import spectrogram\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_writers import file_writer_helper\nfrom espnet2.utils.types import int_or_none\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"compute STFT feature from WAV\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--fs\", type=int_or_none, help=\"Sampling frequency\")\n    parser.add_argument(\"--n_fft\", type=int, default=1024, help=\"FFT length in point\")\n    parser.add_argument(\n        \"--n_shift\", type=int, default=512, help=\"Shift length in point\"\n    )\n    parser.add_argument(\n        \"--win_length\",\n        type=int_or_none,\n        default=None,\n        nargs=\"?\",\n        help=\"Analisys window length in point\",\n    )\n    parser.add_argument(\n        \"--window\",\n        type=str,\n        default=\"hann\",\n        choices=[\"hann\", \"hamming\"],\n        help=\"Type of window\",\n    )\n    parser.add_argument(\n        \"--write-num-frames\", type=str, help=\"Specify wspecifer for utt2num_frames\"\n    )\n    parser.add_argument(\n        \"--filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\"],\n        help=\"Specify the file format. 
\" '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--compress\", type=strtobool, default=False, help=\"Save in compressed format\"\n    )\n    parser.add_argument(\n        \"--compression-method\",\n        type=int,\n        default=2,\n        help=\"Specify the method(if mat) or \" \"gzip-level(if hdf5)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--normalize\",\n        choices=[1, 16, 24, 32],\n        type=int,\n        default=None,\n        help=\"Give the bit depth of the PCM, \"\n        \"then normalizes data to scale in [-1,1]\",\n    )\n    parser.add_argument(\"rspecifier\", type=str, help=\"WAV scp file\")\n    parser.add_argument(\n        \"--segments\",\n        type=str,\n        help=\"segments-file format: each line is either\"\n        \"<segment-id> <recording-id> <start-time> <end-time>\"\n        \"e.g. call-861225-A-0050-0065 call-861225-A 5.0 6.5\",\n    )\n    parser.add_argument(\"wspecifier\", type=str, help=\"Write specifier\")\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    with kaldiio.ReadHelper(\n        args.rspecifier, segments=args.segments\n    ) as reader, file_writer_helper(\n        args.wspecifier,\n        filetype=args.filetype,\n        write_num_frames=args.write_num_frames,\n        compress=args.compress,\n        compression_method=args.compression_method,\n    ) as writer:\n        for utt_id, (rate, array) in reader:\n            array = array.astype(numpy.float32)\n            if args.fs is not None and rate != args.fs:\n                array = 
resampy.resample(array, rate, args.fs, axis=0)\n            if args.normalize is not None and args.normalize != 1:\n                array = array / (1 << (args.normalize - 1))\n            spc = spectrogram(\n                x=array,\n                n_fft=args.n_fft,\n                n_shift=args.n_shift,\n                win_length=args.win_length,\n                window=args.window,\n            )\n            writer[utt_id] = spc\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/concat_json_multiref.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2018 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"concatenate multiple json files for data augmentation\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"jsons\", type=str, nargs=\"+\", help=\"json files\")\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    logging.basicConfig(level=logging.INFO, format=logfmt)\n    logging.info(get_commandline_args())\n\n    # make intersection set for utterance keys\n    num_keys = 0\n    js = {}\n    for i, x in enumerate(args.jsons):\n        with codecs.open(x, encoding=\"utf-8\") as f:\n            j = json.load(f)\n        ks = j[\"utts\"].keys()\n        logging.debug(x + \": has \" + str(len(ks)) + \" utterances\")\n\n        num_keys += len(ks)\n        if i > 0:\n            for k in ks:\n                js[k + \".\" + str(i)] = j[\"utts\"][k]\n        else:\n            js = j[\"utts\"]\n        # js.update(j['utts'])\n\n    # logging.info('new json has ' + str(len(js.keys())) + ' utterances')\n    logging.info(\"new json has \" + str(num_keys) + \" utterances\")\n\n    # ensure \"ensure_ascii=False\", which is a bug\n    jsonstring = json.dumps(\n        {\"utts\": js},\n        indent=4,\n        sort_keys=True,\n        ensure_ascii=False,\n        separators=(\",\", \": \"),\n    )\n    sys.stdout = codecs.getwriter(\"utf-8\")(sys.stdout.buffer)\n    print(jsonstring)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/concatjson.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport sys\nimport random\nfrom espnet.utils.cli_utils import get_commandline_args\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"concatenate json files\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--ark-size\", type=int, default=0, help=\"json files\")\n    parser.add_argument(\"--shuffle\", action='store_true', help=\"shuffle the output\")\n    parser.add_argument(\"jsons\", type=str, nargs=\"+\", help=\"json files\")\n    return parser\n\ndef truncate_tail(d, size):\n    tot_length = len(list(d.keys()))\n    tot_length -= tot_length % size\n    out = {}\n    keys = list(d.keys())[:tot_length]\n    for k in keys:\n        out[k] = d[k]\n    return out\n\nif __name__ == \"__main__\":\n    args = get_parser().parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    logging.basicConfig(level=logging.INFO, format=logfmt)\n    logging.info(get_commandline_args())\n\n    # make intersection set for utterance keys\n    js = {}\n    for x in args.jsons:\n        with codecs.open(x, encoding=\"utf-8\") as f:\n            j = json.load(f)\n        ks = j[\"utts\"].keys()\n        logging.debug(x + \": has \" + str(len(ks)) + \" utterances\")\n        if args.ark_size > 0:\n            dict_truncated = truncate_tail(j[\"utts\"], args.ark_size)\n        else:\n            dict_truncated = j[\"utts\"]\n        js.update(dict_truncated)\n    logging.info(\"new json has \" + str(len(js.keys())) + \" utterances\")\n\n    if args.shuffle:\n        keys = list(js.keys())\n        random.shuffle(keys)\n        new_js = {k: js[k] for k in 
keys}\n        js = new_js\n\n    # ensure \"ensure_ascii=False\", which is a bug\n    jsonstring = json.dumps(\n        {\"utts\": js},\n        indent=4,\n        sort_keys=False,\n        ensure_ascii=False,\n        separators=(\",\", \": \"),\n    )\n    sys.stdout = codecs.getwriter(\"utf-8\")(\n        sys.stdout if is_python2 else sys.stdout.buffer\n    )\n    print(jsonstring)\n"
  },
  {
    "path": "egs/espnet_utils/convert_fbank.sh",
    "content": "#!/usr/bin/env bash\n# Set bash to 'debug' mode, it will exit on :\n# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',\nset -e\nset -u\nset -o pipefail\n\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# Begin configuration section.\nnj=4\nfs=22050\nfmax=\nfmin=\nn_fft=1024\nn_shift=512\nwin_length=\nn_mels=\niters=64\ncmd=run.pl\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<fbank-dir>] ]\ne.g.: $0 data/train exp/griffin_lim/train wav\nNote: <log-dir> defaults to <data-dir>/log, and <fbank-dir> defaults to <data-dir>/data\nOptions:\n  --nj <nj>                  # number of parallel jobs\n  --fs <fs>                  # sampling rate\n  --fmax <fmax>              # maximum frequency\n  --fmin <fmin>              # minimum frequency\n  --n_fft <n_fft>            # number of FFT points (default=1024)\n  --n_shift <n_shift>        # shift size in point (default=256)\n  --win_length <win_length>  # window length in point (default=)\n  --n_mels <n_mels>          # number of mel basis (default=80)\n  --iters <iters>            # number of Griffin-lim iterations (default=64)\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\nEOF\n)\n# End configuration section.\n\necho \"$0 $*\"  # Print the command line for logging\n\n. 
parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=${data}/log\nfi\nif [ $# -ge 3 ]; then\n  wavdir=$3\nelse\n  wavdir=${data}/data\nfi\n\n# use \"name\" as part of name of the archive.\nname=$(basename ${data})\n\nmkdir -p ${wavdir} || exit 1;\nmkdir -p ${logdir} || exit 1;\n\nscp=${data}/feats.scp\n\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"$split_scps $logdir/feats.$n.scp\"\ndone\n\nutils/split_scp.pl ${scp} ${split_scps} || exit 1;\n\n${cmd} JOB=1:${nj} ${logdir}/griffin_lim_${name}.JOB.log \\\n    convert_fbank_to_wav.py \\\n        --fs ${fs} \\\n        --fmax ${fmax} \\\n        --fmin ${fmin} \\\n        --win_length ${win_length} \\\n        --n_fft ${n_fft} \\\n        --n_shift ${n_shift} \\\n        --n_mels ${n_mels} \\\n        --iters ${iters} \\\n        scp:${logdir}/feats.JOB.scp \\\n        ${wavdir}\n\nrm ${logdir}/feats.*.scp 2>/dev/null\n\necho \"Succeeded creating wav for $name\"\n"
  },
  {
    "path": "egs/espnet_utils/convert_fbank_to_wav.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport logging\nimport os\n\nfrom distutils.version import LooseVersion\n\nimport librosa\nimport numpy as np\nfrom scipy.io.wavfile import write\n\nfrom espnet.utils.cli_readers import file_reader_helper\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\nEPS = 1e-10\n\n\ndef logmelspc_to_linearspc(lmspc, fs, n_mels, n_fft, fmin=None, fmax=None):\n    \"\"\"Convert log Mel filterbank to linear spectrogram.\n\n    Args:\n        lmspc (ndarray): Log Mel filterbank (T, n_mels).\n        fs (int): Sampling frequency.\n        n_mels (int): Number of mel basis.\n        n_fft (int): Number of FFT points.\n        f_min (int, optional): Minimum frequency to analyze.\n        f_max (int, optional): Maximum frequency to analyze.\n\n    Returns:\n        ndarray: Linear spectrogram (T, n_fft // 2 + 1).\n\n    \"\"\"\n    assert lmspc.shape[1] == n_mels\n    fmin = 0 if fmin is None else fmin\n    fmax = fs / 2 if fmax is None else fmax\n    mspc = np.power(10.0, lmspc)\n    mel_basis = librosa.filters.mel(fs, n_fft, n_mels, fmin, fmax)\n    inv_mel_basis = np.linalg.pinv(mel_basis)\n    spc = np.maximum(EPS, np.dot(inv_mel_basis, mspc.T).T)\n\n    return spc\n\n\ndef griffin_lim(spc, n_fft, n_shift, win_length, window=\"hann\", n_iters=100):\n    \"\"\"Convert linear spectrogram into waveform using Griffin-Lim.\n\n    Args:\n        spc (ndarray): Linear spectrogram (T, n_fft // 2 + 1).\n        n_fft (int): Number of FFT points.\n        n_shift (int): Shift size in points.\n        win_length (int): Window length in points.\n        window (str, optional): Window function type.\n        n_iters (int, optionl): Number of iterations of Griffin-Lim Algorithm.\n\n    Returns:\n        ndarray: Reconstructed waveform (N,).\n\n    \"\"\"\n    # assert the size of input linear spectrogram\n    
assert spc.shape[1] == n_fft // 2 + 1\n\n    if LooseVersion(librosa.__version__) >= LooseVersion(\"0.7.0\"):\n        # use librosa's fast Griffin-Lim algorithm\n        spc = np.abs(spc.T)\n        y = librosa.griffinlim(\n            S=spc,\n            n_iter=n_iters,\n            hop_length=n_shift,\n            win_length=win_length,\n            window=window,\n            center=True if spc.shape[1] > 1 else False,\n        )\n    else:\n        # use slower version of Griffin-Lim algorithm\n        logging.warning(\n            \"librosa version is old. Use slow version of Griffin-Lim algorithm. \"\n            \"If you want to use fast Griffin-Lim, please update librosa via \"\n            \"`source ./path.sh && pip install librosa==0.7.0`.\"\n        )\n        # the np.complex alias was removed from recent NumPy; builtin complex is equivalent\n        cspc = np.abs(spc).astype(complex).T\n        angles = np.exp(2j * np.pi * np.random.rand(*cspc.shape))\n        y = librosa.istft(cspc * angles, n_shift, win_length, window=window)\n        for i in range(n_iters):\n            angles = np.exp(\n                1j\n                * np.angle(librosa.stft(y, n_fft, n_shift, win_length, window=window))\n            )\n            y = librosa.istft(cspc * angles, n_shift, win_length, window=window)\n\n    return y\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert FBANK to WAV using Griffin-Lim algorithm\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--fs\", type=int, default=22050, help=\"Sampling frequency\")\n    parser.add_argument(\n        \"--fmax\", type=int, default=None, nargs=\"?\", help=\"Maximum frequency\"\n    )\n    parser.add_argument(\n        \"--fmin\", type=int, default=None, nargs=\"?\", help=\"Minimum frequency\"\n    )\n    parser.add_argument(\"--n_fft\", type=int, default=1024, help=\"FFT length in point\")\n    parser.add_argument(\n        \"--n_shift\", type=int, default=512, help=\"Shift length in point\"\n    )\n    
parser.add_argument(\n        \"--win_length\",\n        type=int,\n        default=None,\n        nargs=\"?\",\n        help=\"Analisys window length in point\",\n    )\n    parser.add_argument(\n        \"--n_mels\", type=int, default=None, nargs=\"?\", help=\"Number of mel basis\"\n    )\n    parser.add_argument(\n        \"--window\",\n        type=str,\n        default=\"hann\",\n        choices=[\"hann\", \"hamming\"],\n        help=\"Type of window\",\n    )\n    parser.add_argument(\n        \"--iters\", type=int, default=100, help=\"Number of iterations in Grriffin Lim\"\n    )\n    parser.add_argument(\n        \"--filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\"rspecifier\", type=str, help=\"Input feature\")\n    parser.add_argument(\"outdir\", type=str, help=\"Output directory\")\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logging.basicConfig(\n        level=logging.INFO,\n        format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n    )\n    logging.info(get_commandline_args())\n\n    # check directory\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n\n    for idx, (utt_id, lmspc) in enumerate(\n        file_reader_helper(args.rspecifier, args.filetype), 1\n    ):\n        if args.n_mels is not None:\n            spc = logmelspc_to_linearspc(\n                lmspc,\n                fs=args.fs,\n                n_mels=args.n_mels,\n                n_fft=args.n_fft,\n                fmin=args.fmin,\n                fmax=args.fmax,\n            )\n        else:\n            spc = lmspc\n        y = griffin_lim(\n            spc,\n            n_fft=args.n_fft,\n            n_shift=args.n_shift,\n            win_length=args.win_length,\n   
         window=args.window,\n            n_iters=args.iters,\n        )\n        logging.info(\"(%d) %s\" % (idx, utt_id))\n        write(\n            args.outdir + \"/%s.wav\" % utt_id,\n            args.fs,\n            (y * np.iinfo(np.int16).max).astype(np.int16),\n        )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/copy-feats.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nfrom distutils.util import strtobool\nimport logging\n\nfrom espnet.transform.transformation import Transformation\nfrom espnet.utils.cli_readers import file_reader_helper\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_utils import is_scipy_wav_style\nfrom espnet.utils.cli_writers import file_writer_helper\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"copy feature with preprocessing\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--in-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--out-filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for the wspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--write-num-frames\", type=str, help=\"Specify wspecifer for utt2num_frames\"\n    )\n    parser.add_argument(\n        \"--compress\", type=strtobool, default=False, help=\"Save in compressed format\"\n    )\n    parser.add_argument(\n        \"--compression-method\",\n        type=int,\n        default=2,\n        help=\"Specify the method(if mat) or \" \"gzip-level(if hdf5)\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"rspecifier\", type=str, help=\"Read specifier for feats. e.g. 
ark:some.ark\"\n    )\n    parser.add_argument(\n        \"wspecifier\", type=str, help=\"Write specifier. e.g. ark:some.ark\"\n    )\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    if args.preprocess_conf is not None:\n        preprocessing = Transformation(args.preprocess_conf)\n        logging.info(\"Apply preprocessing: {}\".format(preprocessing))\n    else:\n        preprocessing = None\n\n    with file_writer_helper(\n        args.wspecifier,\n        filetype=args.out_filetype,\n        write_num_frames=args.write_num_frames,\n        compress=args.compress,\n        compression_method=args.compression_method,\n    ) as writer:\n        for utt, mat in file_reader_helper(args.rspecifier, args.in_filetype):\n            if is_scipy_wav_style(mat):\n                # If data is sound file, then got as Tuple[int, ndarray]\n                rate, mat = mat\n\n            if preprocessing is not None:\n                mat = preprocessing(mat, uttid_list=utt)\n\n            # shape = (Time, Channel)\n            if args.out_filetype in [\"sound.hdf5\", \"sound\"]:\n                # Write Tuple[int, numpy.ndarray] (scipy style)\n                writer[utt] = (rate, mat)\n            else:\n                writer[utt] = mat\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/data2json.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\necho \"$0 $*\" >&2 # Print the command line for logging\n. ./path.sh\n\nnj=10\ncmd=run.pl\nnlsyms=\"\"\nlang=\"\"\nfeat=\"\" # feat.scp\noov=\"<unk>\"\nbpecode=\"\"\nallow_one_column=false\nverbose=0\ntrans_type=char\nfiletype=\"\"\npreprocess_conf=\"\"\ncategory=\"\"\ntext_org=\"\"\nout=\"\" # If omitted, write in stdout\n\ntext=\"\"\nmultilingual=false\n\nhelp_message=$(cat << EOF\nUsage: $0 <data-dir> <dict>\ne.g. $0 data/train data/lang_1char/train_units.txt\nOptions:\n  --nj <nj>                                        # number of parallel jobs\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\n  --feat <feat-scp>                                # feat.scp or feat1.scp,feat2.scp,...\n  --oov <oov-word>                                 # Default: <unk>\n  --out <outputfile>                               # If omitted, write in stdout\n  --filetype <mat|hdf5|sound.hdf5>                 # Specify the format of feats file\n  --preprocess-conf <json>                         # Apply preprocess to feats when creating shape.scp\n  --verbose <num>                                  # Default: 0\nEOF\n)\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n    echo $@\n    echo \"${help_message}\" 1>&2\n    exit 1;\nfi\n\nset -euo pipefail\n\ndir=$1\ndic=$2\ntmpdir=$(mktemp -d ${dir}/tmp-XXXXX)\ntrap 'rm -rf ${tmpdir}' EXIT\n\nif [ -z ${text} ]; then\n    text=${dir}/text\nfi\n\n# 1. 
Create scp files for inputs\n#   These are not necessary for decoding mode, and make it as an option\ninput=\nif [ -n \"${feat}\" ]; then\n    _feat_scps=$(echo \"${feat}\" | tr ',' ' ' )\n    read -r -a feat_scps <<< $_feat_scps\n    num_feats=${#feat_scps[@]}\n\n    for (( i=1; i<=num_feats; i++ )); do\n        feat=${feat_scps[$((i-1))]}\n        mkdir -p ${tmpdir}/input_${i}\n        input+=\"input_${i} \"\n        cat ${feat} > ${tmpdir}/input_${i}/feat.scp\n\n        # Dump in the \"legacy\" style JSON format\n        if [ -n \"${filetype}\" ]; then\n            awk -v filetype=${filetype} '{print $1 \" \" filetype}' ${feat} \\\n                > ${tmpdir}/input_${i}/filetype.scp\n        fi\n\n        feat_to_shape.sh --cmd \"${cmd}\" --nj ${nj} \\\n            --filetype \"${filetype}\" \\\n            --preprocess-conf \"${preprocess_conf}\" \\\n            --verbose ${verbose} ${feat} ${tmpdir}/input_${i}/shape.scp\n    done\nfi\n\n# 2. Create scp files for outputs\nmkdir -p ${tmpdir}/output\nif [ -n \"${bpecode}\" ]; then\n    if [ ${multilingual} = true ]; then\n        # remove a space before the language ID\n        paste -d \" \" <(awk '{print $1}' ${text}) <(cut -f 2- -d\" \" ${text} \\\n            | spm_encode --model=${bpecode} --output_format=piece | cut -f 2- -d\" \") \\\n            > ${tmpdir}/output/token.scp\n    else\n        paste -d \" \" <(awk '{print $1}' ${text}) <(cut -f 2- -d\" \" ${text} \\\n            | spm_encode --model=${bpecode} --output_format=piece --split-chn) \\\n            > ${tmpdir}/output/token.scp\n    fi\nelif [ -n \"${nlsyms}\" ]; then\n    text2token.py -s 1 -n 1 -l ${nlsyms} ${text} --trans_type ${trans_type} > ${tmpdir}/output/token.scp\nelse\n    text2token.py -s 1 -n 1 ${text} --trans_type ${trans_type} > ${tmpdir}/output/token.scp\nfi\n< ${tmpdir}/output/token.scp utils/sym2int.pl --map-oov ${oov} -f 2- ${dic} > ${tmpdir}/output/tokenid.scp\n# +2 comes from CTC blank and EOS\nvocsize=$(tail -n 1 ${dic} | awk 
'{print $2}')\nodim=$(echo \"$vocsize + 2\" | bc)\n< ${tmpdir}/output/tokenid.scp awk -v odim=${odim} '{print $1 \" \" NF-1 \",\" odim}' > ${tmpdir}/output/shape.scp\n\ncat ${text} > ${tmpdir}/output/text.scp\n\n\n# 3. Create scp files for the others\nmkdir -p ${tmpdir}/other\nif [ ${multilingual} == true ]; then\n    awk '{\n        n = split($1,S,\"[-]\");\n        lang=S[n];\n        print $1 \" \" lang\n    }' ${text} > ${tmpdir}/other/lang.scp\nelif [ -n \"${lang}\" ]; then\n    awk -v lang=${lang} '{print $1 \" \" lang}' ${text} > ${tmpdir}/other/lang.scp\nfi\n\nif [ -n \"${category}\" ]; then\n    awk -v category=${category} '{print $1 \" \" category}' ${dir}/text \\\n        > ${tmpdir}/other/category.scp\nfi\ncat ${dir}/utt2spk > ${tmpdir}/other/utt2spk.scp\n\nif [ -n \"${text_org}\" ]; then\n    cp $text_org ${tmpdir}/other/text_org.scp\nfi\n\n# 4. Merge scp files into a JSON file\nopts=\"\"\nif [ -n \"${feat}\" ]; then\n    intypes=\"${input} output other\"\nelse\n    intypes=\"output other\"\nfi\nfor intype in ${intypes}; do\n    if [ -z \"$(find \"${tmpdir}/${intype}\" -name \"*.scp\")\" ]; then\n        continue\n    fi\n\n    if [ ${intype} != other ]; then\n        opts+=\"--${intype%_*}-scps \"\n    else\n        opts+=\"--scps \"\n    fi\n\n    for x in \"${tmpdir}/${intype}\"/*.scp; do\n        k=$(basename ${x} .scp)\n        if [ ${k} = shape ]; then\n            opts+=\"shape:${x}:shape \"\n        else\n            opts+=\"${k}:${x} \"\n        fi\n    done\ndone\n\nif ${allow_one_column}; then\n    opts+=\"--allow-one-column true \"\nelse\n    opts+=\"--allow-one-column false \"\nfi\n\nif [ -n \"${out}\" ]; then\n    opts+=\"-O ${out}\"\nfi\nmerge_scp2json.py --verbose ${verbose} ${opts}\n\nrm -fr ${tmpdir}\n"
  },
  {
    "path": "egs/espnet_utils/divide_lang.sh",
    "content": "#!/bin/bash\n\n# Copyright 2021 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n. ./path.sh\n\nif [ \"$#\" -ne 2 ]; then\n    echo \"Usage: $0 <set> <langs divided by space>\"\n    echo \"e.g.: $0 dev\"\n    exit 1\nfi\n\nset=$1\nlangs=$2\n\n# Copy stuff into its final locations [this has been moved from the format_data script]\nfor lang in ${langs}; do\n    mkdir -p data/${set}.${lang}\n    for f in spk2utt utt2spk segments wav.scp feats.scp utt2num_frames; do\n        if [ -f data/${set}/${f} ]; then\n            sort data/${set}/${f} > data/${set}.${lang}/${f}\n        fi\n    done\n    sort data/${set}/text.lc.rm.${lang} > data/${set}.${lang}/text  # dummy\n    for case in lc.rm lc tc; do\n        sort data/${set}/text.${case}.${lang} > data/${set}.${lang}/text.${case}\n    done\n    utils/fix_data_dir.sh --utt_extra_files \"text.tc text.lc text.lc.rm\" data/${set}.${lang}\n    if [ -f data/${set}.${lang}/feats.scp ]; then\n        utils/validate_data_dir.sh data/${set}.${lang} || exit 1;\n    else\n        utils/validate_data_dir.sh --no-feats --no-wav data/${set}.${lang} || exit 1;\n    fi\ndone\n"
  },
  {
    "path": "egs/espnet_utils/double_precious_cer.py",
    "content": "import sys\n\nin_f = sys.argv[1]\n\n# Recompute the CER to three decimals from the raw counts on the \"Sum\" row\n# (field 4: total tokens, field 10: errors).\nwith open(in_f, encoding=\"utf-8\") as f:\n    for line in f:\n        if \"Sum\" in line and \"|\" in line and \"Avg\" not in line:\n            fields = line.strip().split()\n            tot = fields[4]\n            err = fields[10]\n            cer = float(err) / float(tot) * 100\n            print(\"CER: {:.3f}\".format(cer))\n"
  },
  {
    "path": "egs/espnet_utils/download_from_google_drive.sh",
    "content": "#!/usr/bin/env bash\n\n# Download zip, tar, or tar.gz file from Google Drive\n\n# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nshare_url=$1\ndownload_dir=${2:-\"downloads\"}\nfile_ext=${3:-\"zip\"}\n\nif [ \"$1\" = \"--help\" ] || [ $# -lt 1 ] || [ $# -gt 3 ]; then\n   echo \"Usage: $0 <share-url> [<download_dir> <file_ext>]\";\n   echo \"e.g.: $0 https://drive.google.com/open?id=1zF88bRNbJhw9hNBq3NrDg8vnGGibREmg downloads zip\"\n   echo \"Options:\"\n   echo \"    <download_dir>: directory to save downloaded file. (Default=downloads)\"\n   echo \"    <file_ext>: file extension of the file to be downloaded. (Default=zip)\"\n   if [ \"$1\" = \"--help\" ]; then\n       exit 0;\n   fi\n   exit 1;\nfi\n\n[ ! -e \"${download_dir}\" ] && mkdir -p \"${download_dir}\"\ntmp=$(mktemp \"${download_dir}/XXXXXX.${file_ext}\")\n\n# The file id in Google Drive can be obtained from the sharing link\n# ref: https://qiita.com/namakemono/items/c963e75e0af3f7eed732\nfile_id=$(echo \"${share_url}\" | cut -d\"=\" -f 2)\n\n# define decompressor\ndecompress () {\n    filename=$1\n    decompress_dir=$2\n    if echo \"${filename}\" | grep -q \".zip\"; then\n        unzip \"${filename}\" -d \"${decompress_dir}\"\n    elif echo \"${filename}\" | grep -q -e \".tar\" -e \".tar.gz\" -e \".tgz\"; then\n        tar xvzf \"${filename}\" -C \"${decompress_dir}\"\n    else\n        echo \"Unsupported file extension.\" >&2 && exit 1\n    fi\n}\n\nset -e\n# Solution from https://github.com/wkentaro/gdown\ngdown --id \"${file_id}\" -O \"${tmp}\"\ndecompress \"${tmp}\" \"${download_dir}\"\n\n# remove tmpfiles\nrm \"${tmp}\"\necho \"Successfully downloaded ${file_ext} file from ${share_url}\"\n"
  },
  {
    "path": "egs/espnet_utils/dump-pcm.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nfrom distutils.util import strtobool\nimport logging\n\nimport kaldiio\nimport numpy\n\nfrom espnet.transform.transformation import Transformation\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_writers import file_writer_helper\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"dump PCM files from a WAV scp file\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--write-num-frames\", type=str, help=\"Specify wspecifier for utt2num_frames\"\n    )\n    parser.add_argument(\n        \"--filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for output. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--format\",\n        type=str,\n        default=None,\n        help=\"The file format for output pcm. 
\"\n        \"This option is only valid \"\n        'when \"--filetype\" is \"sound.hdf5\" or \"sound\"',\n    )\n    parser.add_argument(\n        \"--compress\", type=strtobool, default=False, help=\"Save in compressed format\"\n    )\n    parser.add_argument(\n        \"--compression-method\",\n        type=int,\n        default=2,\n        help=\"Specify the method(if mat) or \" \"gzip-level(if hdf5)\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--normalize\",\n        choices=[1, 16, 24, 32],\n        type=int,\n        default=None,\n        help=\"Give the bit depth of the PCM, \"\n        \"then normalizes data to scale in [-1,1]\",\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"--keep-length\",\n        type=strtobool,\n        default=True,\n        help=\"Truncating or zero padding if the output length \"\n        \"is changed from the input by preprocessing\",\n    )\n    parser.add_argument(\"rspecifier\", type=str, help=\"WAV scp file\")\n    parser.add_argument(\n        \"--segments\",\n        type=str,\n        help=\"segments-file format: each line is either \"\n        \"<segment-id> <recording-id> <start-time> <end-time> \"\n        \"e.g. 
call-861225-A-0050-0065 call-861225-A 5.0 6.5\",\n    )\n    parser.add_argument(\"wspecifier\", type=str, help=\"Write specifier\")\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    if args.preprocess_conf is not None:\n        preprocessing = Transformation(args.preprocess_conf)\n        logging.info(\"Apply preprocessing: {}\".format(preprocessing))\n    else:\n        preprocessing = None\n\n    with file_writer_helper(\n        args.wspecifier,\n        filetype=args.filetype,\n        write_num_frames=args.write_num_frames,\n        compress=args.compress,\n        compression_method=args.compression_method,\n        pcm_format=args.format,\n    ) as writer:\n        for utt_id, (rate, array) in kaldiio.ReadHelper(args.rspecifier, args.segments):\n            if args.filetype == \"mat\":\n                # Kaldi-matrix doesn't support integer\n                array = array.astype(numpy.float32)\n\n            if array.ndim == 1:\n                # (Time) -> (Time, Channel)\n                array = array[:, None]\n\n            if args.normalize is not None and args.normalize != 1:\n                array = array.astype(numpy.float32)\n                array = array / (1 << (args.normalize - 1))\n\n            if preprocessing is not None:\n                orgtype = array.dtype\n                out = preprocessing(array, uttid_list=utt_id)\n                out = out.astype(orgtype)\n\n                if args.keep_length:\n                    if len(out) < len(array):\n                        # The length can be changed by stft, for example.\n                        # Zero-pad the output back to the input length.\n                        out = numpy.pad(\n                            out,\n                            [(0, len(array) - len(out))]\n                            
+ [(0, 0) for _ in range(out.ndim - 1)],\n                            mode=\"constant\",\n                        )\n                    elif len(out) > len(array):\n                        # Truncate the output to the input length.\n                        out = out[: len(array)]\n\n                array = out\n\n            # shape = (Time, Channel)\n            if args.filetype in [\"sound.hdf5\", \"sound\"]:\n                # Write Tuple[int, numpy.ndarray] (scipy style)\n                writer[utt_id] = (rate, array)\n            else:\n                writer[utt_id] = array\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/dump.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\necho \"$0 $*\"  # Print the command line for logging\n. ./path.sh\n\ncmd=run.pl\ndo_delta=false\nnj=1\nverbose=0\ncompress=true\nwrite_utt2num_frames=true\nfiletype='mat'  # mat or hdf5\nhelp_message=\"Usage: $0 <scp> <cmvnark> <logdir> <dumpdir>\"\n\n. utils/parse_options.sh\n\nscp=$1\ncvmnark=$2\nlogdir=$3\ndumpdir=$4\n\nif [ $# != 4 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\nmkdir -p ${logdir}\nmkdir -p ${dumpdir}\n\ndumpdir=$(perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' ${dumpdir} ${PWD})\n\nfor n in $(seq ${nj}); do\n    # the next command does nothing unless $dumpdir/storage/ exists, see\n    # utils/create_data_link.pl for more info.\n    utils/create_data_link.pl ${dumpdir}/feats.${n}.ark\ndone\n\nif ${write_utt2num_frames}; then\n    write_num_frames_opt=\"--write-num-frames=ark,t:$dumpdir/utt2num_frames.JOB\"\nelse\n    write_num_frames_opt=\nfi\n\n# split scp file\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"$split_scps $logdir/feats.$n.scp\"\ndone\n\nutils/split_scp.pl ${scp} ${split_scps} || exit 1;\n\n# dump features\nif ${do_delta}; then\n    ${cmd} JOB=1:${nj} ${logdir}/dump_feature.JOB.log \\\n        apply-cmvn --norm-vars=true ${cvmnark} scp:${logdir}/feats.JOB.scp ark:- \\| \\\n        add-deltas ark:- ark:- \\| \\\n        copy-feats.py --verbose ${verbose} --out-filetype ${filetype} \\\n            --compress=${compress} --compression-method=2 ${write_num_frames_opt} \\\n            ark:- ark,scp:${dumpdir}/feats.JOB.ark,${dumpdir}/feats.JOB.scp \\\n        || exit 1\nelse\n    ${cmd} JOB=1:${nj} ${logdir}/dump_feature.JOB.log \\\n        apply-cmvn --norm-vars=true ${cvmnark} scp:${logdir}/feats.JOB.scp ark:- \\| \\\n        copy-feats.py --verbose ${verbose} --out-filetype ${filetype} \\\n          
  --compress=${compress} --compression-method=2 ${write_num_frames_opt} \\\n            ark:- ark,scp:${dumpdir}/feats.JOB.ark,${dumpdir}/feats.JOB.scp \\\n        || exit 1\nfi\n\n# concatenate scp files\nfor n in $(seq ${nj}); do\n    cat ${dumpdir}/feats.${n}.scp || exit 1;\ndone > ${dumpdir}/feats.scp || exit 1\n\nif ${write_utt2num_frames}; then\n    for n in $(seq ${nj}); do\n        cat ${dumpdir}/utt2num_frames.${n} || exit 1;\n    done > ${dumpdir}/utt2num_frames || exit 1\n    rm ${dumpdir}/utt2num_frames.* 2>/dev/null\nfi\n\n# Write the filetype, this will be used for data2json.sh\necho ${filetype} > ${dumpdir}/filetype\n\n\n# remove temp scps\nrm ${logdir}/feats.*.scp 2>/dev/null\nif [ ${verbose} -eq 1 ]; then\n    echo \"Succeeded dumping features for training\"\nfi\n"
  },
  {
    "path": "egs/espnet_utils/dump_pcm.sh",
    "content": "#!/usr/bin/env bash\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\ncompress=false\nwrite_utt2num_frames=false # if true writes utt2num_frames\nverbose=2\nfiletype=mat # mat or hdf5\nkeep_length=true\nformat=wav\n# End configuration section.\n\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<pcm-dir>] ]\ne.g.: $0 data/train exp/dump_pcm/train pcm\nNote: <log-dir> defaults to <data-dir>/log, and <pcm-dir> defaults to <data-dir>/data\nOptions:\n  --nj <nj>                                        # number of parallel jobs\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\n  --write-utt2num-frames <true|false>     # If true, write utt2num_frames file.\n  --filetype <mat|hdf5|sound.hdf5>                 # Specify the format of feats file\nEOF\n)\necho \"$0 $*\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=${data}/log\nfi\nif [ $# -ge 3 ]; then\n  pcmdir=$3\nelse\n  pcmdir=${data}/data\nfi\n\n\n# make $pcmdir an absolute pathname.\npcmdir=$(perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' ${pcmdir} ${PWD})\n\n# use \"name\" as part of name of the archive.\nname=$(basename ${data})\n\nmkdir -p ${pcmdir}\nmkdir -p ${logdir}\n\nif [ -f ${data}/feats.scp ]; then\n  mkdir -p ${data}/.backup\n  echo \"$0: moving ${data}/feats.scp to ${data}/.backup\"\n  mv ${data}/feats.scp ${data}/.backup\nfi\n\nscp=${data}/wav.scp\n\nrequired=\"${scp}\"\n\nfor f in ${required}; do\n  if [ ! 
-f ${f} ]; then\n    echo \"$0: no such file ${f}\"\n  fi\ndone\n\nutils/validate_data_dir.sh --no-text --no-feats ${data}\n\nif ${write_utt2num_frames}; then\n    opts=\"--write-num-frames=ark,t:${logdir}/utt2num_frames.JOB \"\nelse\n    opts=\nfi\n\nif [ \"${filetype}\" == hdf5 ]; then\n    ext=.h5\nelif [ \"${filetype}\" == sound.hdf5 ]; then\n    ext=.flac.h5\n    opts+=\"--format ${format} \"\n\nelif [ \"${filetype}\" == sound ]; then\n    ext=\n    opts+=\"--format wav \"\nelse\n    ext=.ark\nfi\n\nif [ -f ${data}/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\"\"\n  for n in $(seq ${nj}); do\n    split_segments=\"${split_segments} ${logdir}/segments.${n}\"\n  done\n\n  utils/split_scp.pl ${data}/segments ${split_segments}\n\n  ${cmd} JOB=1:${nj} ${logdir}/dump_pcm_${name}.JOB.log \\\n      dump-pcm.py ${opts} --filetype ${filetype} --verbose=${verbose} --compress=${compress} \\\n      --keep-length ${keep_length} --segment=${logdir}/segments.JOB scp:${scp} \\\n      ark,scp:${pcmdir}/raw_pcm_${name}.JOB${ext},${pcmdir}/raw_pcm_${name}.JOB.scp\n\nelse\n\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\"\"\n  for n in $(seq ${nj}); do\n    split_scps=\"${split_scps} ${logdir}/wav.${n}.scp\"\n  done\n\n  utils/split_scp.pl ${scp} ${split_scps}\n\n  ${cmd} JOB=1:${nj} ${logdir}/dump_pcm_${name}.JOB.log \\\n      dump-pcm.py ${opts} --filetype ${filetype} --verbose=${verbose} --compress=${compress} \\\n      --keep-length ${keep_length} scp:${logdir}/wav.JOB.scp \\\n      ark,scp:${pcmdir}/raw_pcm_${name}.JOB${ext},${pcmdir}/raw_pcm_${name}.JOB.scp\n\nfi\n\n\n# concatenate the .scp files together.\nfor n in $(seq ${nj}); do\n  cat ${pcmdir}/raw_pcm_${name}.${n}.scp\ndone > ${data}/feats.scp\n\nif ${write_utt2num_frames}; then\n  for n in $(seq ${nj}); do\n    cat ${logdir}/utt2num_frames.${n}\n  done > ${data}/utt2num_frames\n  rm 
${logdir}/utt2num_frames.*\nfi\n\nrm -f ${logdir}/wav.*.scp ${logdir}/segments.* 2>/dev/null\n\n# Write the filetype, this will be used for data2json.sh\necho ${filetype} > ${data}/filetype\n\nnf=$(< $data/feats.scp wc -l)\nnu=$(< $data/utt2spk wc -l)\nif [ ${nf} -ne ${nu} ]; then\n  echo \"It seems not all of the feature files were successfully processed (${nf} != ${nu});\"\n  echo \"consider using utils/fix_data_dir.sh ${data}\"\nfi\n\necho \"Succeeded dumping pcm for ${name}\"\n"
  },
  {
    "path": "egs/espnet_utils/eval-source-separation.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nfrom collections import OrderedDict\nfrom distutils.util import strtobool\nimport itertools\nimport logging\nimport os\nfrom pathlib import Path\nimport shutil\nimport subprocess\nimport sys\nfrom tempfile import TemporaryDirectory\nimport warnings\n\nimport museval\nimport numpy as np\nfrom pystoi.stoi import stoi\nimport soundfile\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef eval_STOI(ref, y, fs, extended=False, compute_permutation=True):\n    \"\"\"Calculate STOI\n\n    Reference:\n        A short-time objective intelligibility measure\n            for time-frequency weighted noisy speech\n        https://ieeexplore.ieee.org/document/5495701\n\n    Note(kamo):\n        STOI is defined on the signal at 10kHz\n        and the input at the other sampling rate will be resampled.\n        Thus, the result differs depending on the implementation of resampling.\n        Especially, pystoi cannot reproduce matlab's resampling now.\n\n    :param ref (np.ndarray): Reference (Nsrc, Nframe, Nmic)\n    :param y (np.ndarray): Enhanced (Nsrc, Nframe, Nmic)\n    :param fs (int): Sample frequency\n    :param extended (bool): stoi or estoi\n    :param compute_permutation (bool):\n    :return: value, perm\n    :rtype: Tuple[Tuple[float, ...], Tuple[int, ...]]\n    \"\"\"\n    if ref.shape != y.shape:\n        raise ValueError(\n            \"ref and y should have the same shape: {} != {}\".format(ref.shape, y.shape)\n        )\n    if ref.ndim != 3:\n        raise ValueError(\"Input must have 3 dims: {}\".format(ref.ndim))\n    n_src = ref.shape[0]\n    n_mic = ref.shape[2]\n\n    if compute_permutation:\n        index_list = list(itertools.permutations(range(n_src)))\n    else:\n        index_list = [list(range(n_src))]\n\n    values = [\n        [\n            sum(stoi(ref[i, :, ch], y[j, :, ch], fs, extended) for ch in range(n_mic))\n            / n_mic\n            for i, j in 
enumerate(indices)\n        ]\n        for indices in index_list\n    ]\n\n    best_pairs = sorted(\n        [(v, i) for v, i in zip(values, index_list)], key=lambda x: sum(x[0])\n    )[-1]\n    value, perm = best_pairs\n    return tuple(value), tuple(perm)\n\n\ndef eval_PESQ(ref, enh, fs, compute_permutation: bool = True, wideband: bool = True):\n    \"\"\"Evaluate PESQ\n\n    PESQ program can be downloaded from here:\n        http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-P.862-200511-I!Amd2!SOFT-ZST-E&type=items\n\n    Reference:\n        Perceptual evaluation of speech quality (PESQ)-a new method\n            for speech quality assessment of telephone networks and codecs\n        https://ieeexplore.ieee.org/document/941023\n\n    :param ref (np.ndarray): Reference (Nsrc, Nframe, Nmic)\n    :param enh (np.ndarray): Enhanced (Nsrc, Nframe, Nmic)\n    :param fs (int): Sample frequency\n    :param compute_permutation (bool):\n    \"\"\"\n    if shutil.which(\"PESQ\") is None:\n        raise RuntimeError(\"PESQ: command not found: Please install\")\n    if fs not in (8000, 16000):\n        raise ValueError(\"Sample frequency must be 8000 or 16000: {}\".format(fs))\n    if ref.shape != enh.shape:\n        raise ValueError(\n            \"ref and enh should have the same shape: {} != {}\".format(\n                ref.shape, enh.shape\n            )\n        )\n    if ref.ndim != 3:\n        raise ValueError(\"Input must have 3 dims: {}\".format(ref.ndim))\n\n    n_src = ref.shape[0]\n    n_mic = ref.shape[2]\n    with TemporaryDirectory() as d:\n        # Dumping wav files temporary\n        ref_files = []\n        enh_files = []\n        for isrc in range(n_src):\n            refs = []  # [Nsrc, Nmic]\n            enhs = []  # [Nsrc, Nmic]\n            for imic in range(n_mic):\n                wv = str(os.path.join(d, \"ref.{}.{}.wav\".format(isrc, imic)))\n                soundfile.write(wv, ref[isrc, :, imic].astype(np.int16), fs)\n                
refs.append(wv)\n\n                wv = str(os.path.join(d, \"enh.{}.{}.wav\".format(isrc, imic)))\n                soundfile.write(wv, enh[isrc, :, imic].astype(np.int16), fs)\n                enhs.append(wv)\n            ref_files.append(refs)\n            enh_files.append(enhs)\n\n        if compute_permutation:\n            index_list = list(itertools.permutations(range(n_src)))\n        else:\n            index_list = [list(range(n_src))]\n\n        values = []\n        for indices in index_list:\n            values2 = []\n            for i, j in enumerate(indices):\n                lis = []\n                for imic in range(n_mic):\n                    # PESQ +<8000|16000> <ref.wav> <enh.wav> [smos] [cond]\n                    if wideband:\n                        commands = [\n                            \"PESQ\",\n                            \"+{}\".format(fs),\n                            \"+wb\",\n                            ref_files[i][imic],\n                            enh_files[j][imic],\n                        ]\n                    else:\n                        commands = [\n                            \"PESQ\",\n                            \"+{}\".format(fs),\n                            ref_files[i][imic],\n                            enh_files[j][imic],\n                        ]\n                    with subprocess.Popen(\n                        commands, stdout=subprocess.DEVNULL, cwd=d\n                    ) as p:\n                        _, _ = p.communicate()\n\n                    # e.g.\n                    # REFERENCE\t DEGRADED\t PESQMOS\t MOSLQO\t SAMPLE_FREQ\t MODE\n                    # /tmp/t/ref.0.wav\t /tmp/t/enh.0.wav\t -1.000\t 4.644\t 16000\twb\n                    result_txt = Path(d) / \"pesq_results.txt\"\n                    if result_txt.exists():\n                        with result_txt.open(\"r\") as f:\n                            lis.append(float(f.readlines()[1].split()[3]))\n                    else:\n            
            # PESQ sometimes fails for unknown reasons.\n                        warnings.warn(\"Processing error is found.\")\n                        lis.append(1.0)\n                # Averaging over n_mic\n                values2.append(sum(lis) / len(lis))\n            values.append(values2)\n    best_pairs = sorted(\n        [(v, i) for v, i in zip(values, index_list)], key=lambda x: sum(x[0])\n    )[-1]\n    value, perm = best_pairs\n    return tuple(value), tuple(perm)\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Evaluate enhanced speech. \"\n        \"e.g. {c} --ref ref.scp --enh enh.scp --outdir outputdir \"\n        \"or {c} --ref ref.scp ref2.scp --enh enh.scp enh2.scp \"\n        \"--outdir outputdir\".format(c=sys.argv[0]),\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--ref\",\n        dest=\"reffiles\",\n        nargs=\"+\",\n        type=str,\n        required=True,\n        help=\"WAV file lists for reference\",\n    )\n    parser.add_argument(\n        \"--enh\",\n        dest=\"enhfiles\",\n        nargs=\"+\",\n        type=str,\n        required=True,\n        help=\"WAV file lists for enhanced\",\n    )\n    parser.add_argument(\"--outdir\", type=str, required=True)\n    parser.add_argument(\n        \"--keylist\",\n        type=str,\n        help=\"Specify the target samples. 
By default, \"\n        \"using all keys in the first reference file\",\n    )\n    parser.add_argument(\n        \"--evaltypes\",\n        type=str,\n        nargs=\"+\",\n        choices=[\"SDR\", \"STOI\", \"ESTOI\", \"PESQ\"],\n        default=[\"SDR\", \"STOI\", \"ESTOI\", \"PESQ\"],\n    )\n    parser.add_argument(\n        \"--permutation\",\n        type=strtobool,\n        default=True,\n        help=\"Compute all permutations or \" \"use the pair of input order\",\n    )\n\n    # About BSS Eval v4:\n    # The 2018 Signal Separation Evaluation Campaign\n    # https://arxiv.org/abs/1804.06267\n    parser.add_argument(\n        \"--bss-eval-images\",\n        type=strtobool,\n        default=True,\n        help=\"Use bss_eval_images or bss_eval_sources. \"\n        \"For more detail, see museval source codes.\",\n    )\n    parser.add_argument(\n        \"--bss-eval-version\",\n        type=str,\n        default=\"v3\",\n        choices=[\"v3\", \"v4\"],\n        help=\"Specify bss-eval-version: v3 or v4\",\n    )\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n    if len(args.reffiles) != len(args.enhfiles):\n        raise RuntimeError(\n            \"The number of ref files are different \"\n            \"from the enh files: {} != {}\".format(\n                len(args.reffiles), len(args.enhfiles)\n            )\n        )\n    if len(args.enhfiles) == 1:\n        args.permutation = False\n\n    # Read text files and created a mapping of key2filepath\n    reffiles_dict = OrderedDict()  # Dict[str, Dict[str, str]]\n    for ref in args.reffiles:\n        d = OrderedDict()\n        with open(ref, \"r\") as f:\n            for 
line in f:\n                key, path = line.split(None, 1)\n                d[key] = path.rstrip()\n        reffiles_dict[ref] = d\n\n    enhfiles_dict = OrderedDict()  # Dict[str, Dict[str, str]]\n    for enh in args.enhfiles:\n        d = OrderedDict()\n        with open(enh, \"r\") as f:\n            for line in f:\n                key, path = line.split(None, 1)\n                d[key] = path.rstrip()\n        enhfiles_dict[enh] = d\n\n    if args.keylist is not None:\n        with open(args.keylist, \"r\") as f:\n            keylist = [line.rstrip().split()[0] for line in f]\n    else:\n        keylist = list(reffiles_dict.values())[0]\n\n    if len(keylist) == 0:\n        raise RuntimeError(\"No keys are found\")\n\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n\n    evaltypes = []\n    for evaltype in args.evaltypes:\n        if evaltype == \"SDR\":\n            evaltypes += [\"SDR\", \"ISR\", \"SIR\", \"SAR\"]\n        else:\n            evaltypes.append(evaltype)\n\n    # Open files in write mode\n    writers = {k: open(os.path.join(args.outdir, k), \"w\") for k in evaltypes}\n\n    for key in keylist:\n        # 1. Load ref files\n        rate_prev = None\n\n        ref_signals = []\n        for listname, d in reffiles_dict.items():\n            if key not in d:\n                raise RuntimeError(\"{} doesn't exist in {}\".format(key, listname))\n            filepath = d[key]\n            signal, rate = soundfile.read(filepath, dtype=np.int16)\n            if signal.ndim == 1:\n                # (Nframe) -> (Nframe, 1)\n                signal = signal[:, None]\n            ref_signals.append(signal)\n            if rate_prev is not None and rate != rate_prev:\n                raise RuntimeError(\"Sampling rates mismatch\")\n            rate_prev = rate\n\n        # 2. 
Load enh files\n        enh_signals = []\n        for listname, d in enhfiles_dict.items():\n            if key not in d:\n                raise RuntimeError(\"{} doesn't exist in {}\".format(key, listname))\n            filepath = d[key]\n            signal, rate = soundfile.read(filepath, dtype=np.int16)\n            if signal.ndim == 1:\n                # (Nframe) -> (Nframe, 1)\n                signal = signal[:, None]\n            enh_signals.append(signal)\n            if rate_prev is not None and rate != rate_prev:\n                raise RuntimeError(\"Sampling rates mismatch\")\n            rate_prev = rate\n\n        for signal in ref_signals + enh_signals:\n            if signal.shape[1] != ref_signals[0].shape[1]:\n                raise RuntimeError(\"The number of channels mismatch\")\n\n        # 3. Zero padding to adjust the length to the maximum length in inputs\n        ml = max(len(s) for s in ref_signals + enh_signals)\n        ref_signals = [\n            np.pad(s, [(0, ml - len(s)), (0, 0)], mode=\"constant\") if len(s) < ml else s\n            for s in ref_signals\n        ]\n\n        enh_signals = [\n            np.pad(s, [(0, ml - len(s)), (0, 0)], mode=\"constant\") if len(s) < ml else s\n            for s in enh_signals\n        ]\n\n        # ref_signals, enh_signals: (Nsrc, Nframe, Nmic)\n        ref_signals = np.stack(ref_signals, axis=0)\n        enh_signals = np.stack(enh_signals, axis=0)\n\n        # 4. 
Evaluates\n        for evaltype in args.evaltypes:\n            if evaltype == \"SDR\":\n                (sdr, isr, sir, sar, perm) = museval.metrics.bss_eval(\n                    ref_signals,\n                    enh_signals,\n                    window=np.inf,\n                    hop=np.inf,\n                    compute_permutation=args.permutation,\n                    filters_len=512,\n                    framewise_filters=args.bss_eval_version == \"v3\",\n                    bsseval_sources_version=not args.bss_eval_images,\n                )\n\n                # sdr: (Nsrc, Nframe)\n                writers[\"SDR\"].write(\n                    \"{} {}\\n\".format(key, \" \".join(map(str, sdr[:, 0])))\n                )\n                writers[\"ISR\"].write(\n                    \"{} {}\\n\".format(key, \" \".join(map(str, isr[:, 0])))\n                )\n                writers[\"SIR\"].write(\n                    \"{} {}\\n\".format(key, \" \".join(map(str, sir[:, 0])))\n                )\n                writers[\"SAR\"].write(\n                    \"{} {}\\n\".format(key, \" \".join(map(str, sar[:, 0])))\n                )\n\n            elif evaltype == \"STOI\":\n                stoi, perm = eval_STOI(\n                    ref_signals,\n                    enh_signals,\n                    rate,\n                    extended=False,\n                    compute_permutation=args.permutation,\n                )\n                writers[\"STOI\"].write(\"{} {}\\n\".format(key, \" \".join(map(str, stoi))))\n\n            elif evaltype == \"ESTOI\":\n                estoi, perm = eval_STOI(\n                    ref_signals,\n                    enh_signals,\n                    rate,\n                    extended=True,\n                    compute_permutation=args.permutation,\n                )\n                writers[\"ESTOI\"].write(\"{} {}\\n\".format(key, \" \".join(map(str, estoi))))\n\n            elif evaltype == \"PESQ\":\n                pesq, 
perm = eval_PESQ(\n                    ref_signals, enh_signals, rate, compute_permutation=args.permutation\n                )\n                writers[\"PESQ\"].write(\"{} {}\\n\".format(key, \" \".join(map(str, pesq))))\n            else:\n                # Cannot reach\n                raise RuntimeError\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/eval_perm_free_error.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Johns Hopkins University (Xuankai Chang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\nimport argparse\nimport codecs\nimport json\nimport logging\nimport re\nimport six\nimport sys\n\nimport numpy as np\n\n\ndef permutationDFS(source, start, res):\n    # get permutations with DFS\n    # return order in [[1, 2], [2, 1]] or\n    # [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 2, 1], [3, 1, 2]]\n    if start == len(source) - 1:  # reach final state\n        res.append(source.tolist())\n    for i in range(start, len(source)):\n        # swap values at position start and i\n        source[start], source[i] = source[i], source[start]\n        permutationDFS(source, start + 1, res)\n        # reverse the swap\n        source[start], source[i] = source[i], source[start]\n\n\n# pre-set the permutation scheme (ref_idx, hyp_idx)\ndef permutation_schemes(num_spkrs):\n    src = [x for x in range(1, num_spkrs + 1)]\n    perms = []\n\n    # get all permutations of [1, ..., num_spkrs]\n    # [[r1h1, r2h2], [r1h2, r2h1]]\n    # [[r1h1, r2h2, r3h3], [r1h1, r2h3, r3h2], [r1h2, r2h1, r3h3],\n    #  [r1h2, r2h3, r3h2], [r1h3, r2h2, r3h1], [r1h3, r2h1, r3h2]]]\n    # ...\n    permutationDFS(np.array(src), 0, perms)\n\n    keys = []\n    for perm in perms:\n        keys.append([\"r%dh%d\" % (i, j) for i, j in enumerate(perm, 1)])\n\n    return sum(keys, []), keys\n\n\ndef convert_score(keys, dic):\n    ret = {}\n    pat = re.compile(r\"\\d+\")\n    for k in keys:\n        score = dic[k][\"Scores\"]\n        score = list(map(int, pat.findall(score)))  # [c,s,d,i]\n        assert len(score) == 4\n        ret[k] = score\n    return ret\n\n\ndef get_utt_permutation(old_dic, num_spkrs=2):\n    perm, keys = permutation_schemes(num_spkrs)\n    new_dic = {}\n\n    for id in old_dic.keys():\n        # compute error rate for each utt\n        in_dic = old_dic[id]\n        score = convert_score(perm, 
in_dic)\n        perm_score = []\n        for ks in keys:\n            tmp_score = [0, 0, 0, 0]\n            for k in ks:\n                tmp_score = [tmp_score[i] + score[k][i] for i in range(4)]\n            perm_score.append(tmp_score)\n\n        error_rate = [\n            sum(s[1:4]) / float(sum(s[0:3])) for s in perm_score\n        ]  # (s+d+i) / (c+s+d)\n\n        min_idx, min_v = min(enumerate(error_rate), key=lambda x: x[1])\n        dic = {}\n        for k in keys[min_idx]:\n            dic[k] = in_dic[k]\n        dic[\"Scores\"] = \"(#C #S #D #I) \" + \" \".join(map(str, perm_score[min_idx]))\n        new_dic[id] = dic\n\n    return new_dic\n\n\ndef get_results(result_file, result_key):\n    re_id = r\"^id: \"\n    re_strings = {\n        \"Speaker\": r\"^Speaker sentences\",\n        \"Scores\": r\"^Scores: \",\n        \"REF\": r\"^REF: \",\n        \"HYP\": r\"^HYP: \",\n    }\n    re_id = re.compile(re_id)\n    re_patterns = {}\n    for p in re_strings.keys():\n        re_patterns[p] = re.compile(re_strings[p])\n\n    results = {}\n    tmp_id = None\n    tmp_ret = {}\n\n    with codecs.open(result_file, \"r\", encoding=\"utf-8\") as f:\n        line = f.readline()\n        while line:\n            x = line.rstrip()\n            x_split = x.split()\n\n            if re_id.match(x):\n                if tmp_id:\n                    results[tmp_id] = {result_key: tmp_ret}\n                    tmp_ret = {}\n                tmp_id = x_split[1]\n            for p in re_patterns.keys():\n                if re_patterns[p].match(x):\n                    tmp_ret[p] = \" \".join(x_split[1:])\n            line = f.readline()\n\n    if tmp_ret != {}:\n        results[tmp_id] = {result_key: tmp_ret}\n\n    return {\"utts\": results}\n\n\ndef merge_results(results):\n    rslt_lst = []\n\n    # make intersection set for utterance keys\n    intersec_keys = []\n    for x in results.keys():\n        j = results[x]\n\n        ks = j[\"utts\"].keys()\n        
logging.info(x + \": has \" + str(len(ks)) + \" utterances\")\n\n        if len(intersec_keys) > 0:\n            intersec_keys = intersec_keys.intersection(set(ks))\n        else:\n            intersec_keys = set(ks)\n        rslt_lst.append(j)\n\n    logging.info(\n        \"After merge, the result has \" + str(len(intersec_keys)) + \" utterances\"\n    )\n\n    # merging results\n    dic = dict()\n    for k in intersec_keys:\n        v = rslt_lst[0][\"utts\"][k]\n        for j in rslt_lst[1:]:\n            v.update(j[\"utts\"][k])\n        dic[k] = v\n\n    return dic\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"evaluate permutation-free error\")\n    parser.add_argument(\n        \"--num-spkrs\", type=int, default=2, help=\"number of mixed speakers.\"\n    )\n    parser.add_argument(\n        \"results\",\n        type=str,\n        nargs=\"+\",\n        help=\"the scores between references and hypotheses, \"\n        \"in ascending order of references (1st) and hypotheses (2nd), \"\n        \"e.g. 
[r1h1, r1h2, r2h1, r2h2] in 2-speaker-mix case.\",\n    )\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    if len(args.results) != args.num_spkrs ** 2:\n        parser.print_help()\n        sys.exit(1)\n\n    # Read results from files\n    results = {}\n    for r in six.moves.range(1, args.num_spkrs + 1):\n        for h in six.moves.range(1, args.num_spkrs + 1):\n            idx = (r - 1) * args.num_spkrs + h - 1\n            key = \"r{}h{}\".format(r, h)\n\n            result = get_results(args.results[idx], key)\n            results[key] = result\n\n    # Merge the results of every permutation\n    results = merge_results(results)\n\n    # Get the final results with best permutation\n    new_results = get_utt_permutation(results, args.num_spkrs)\n\n    # Get WER/CER\n    pat = re.compile(r\"\\d+\")\n    score = np.zeros((len(new_results.keys()), 4))\n    for idx, key in enumerate(new_results.keys()):\n        # [c, s, d, i]\n        tmp_score = list(map(int, pat.findall(new_results[key][\"Scores\"])))\n        score[idx] = tmp_score\n    return score, new_results\n\n\nif __name__ == \"__main__\":\n    sys.stdout = codecs.getwriter(\"utf-8\")(sys.stdout.buffer)\n\n    scores, new_results = main()\n    score_sum = np.sum(scores, axis=0, dtype=int)\n\n    # Print results\n    print(sys.argv)\n    print(\"Total Scores: (#C #S #D #I) \" + \" \".join(map(str, list(score_sum))))\n    print(\n        \"Error Rate:   {:0.2f}\".format(\n            100 * sum(score_sum[1:4]) / float(sum(score_sum[0:3]))\n        )\n    )\n    print(\"Total Utts: \", str(scores.shape[0]))\n\n    print(\n        json.dumps(\n            {\"utts\": new_results},\n            indent=4,\n            ensure_ascii=False,\n            sort_keys=True,\n            separators=(\",\", \": \"),\n        )\n    )\n"
  },
  {
    "path": "egs/espnet_utils/eval_source_separation.sh",
    "content": "#!/usr/bin/env bash\n\necho \"$0 $*\" >&2 # Print the command line for logging\n\nnj=10\ncmd=run.pl\nevaltypes=\"SDR STOI ESTOI PESQ\"\npermutation=true\n# Use museval.metrics.bss_eval_images or museval.metrics.bss_eval_source\nbss_eval_images=true\nbss_eval_version=v3\n\nhelp_message=$(cat << EOF\nUsage: $0 reffiles enffiles <dir>\n    e.g. $0 reference.scp enhanced.scp outdir\n\nAnd also supporting multiple sources:\n    e.g. $0 \"ref1.scp,ref2.scp\" \"enh1.scp,enh2.scp\" outdir\n\nOptions:\n  --nj <nj>                                        # number of parallel jobs\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\nEOF\n)\n\n. ./path.sh\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n    echo \"${help_message}\" 1>&2\n    exit 1;\nfi\n\nset -euo pipefail\n\nIFS=, read -r -a reffiles <<<$1\nIFS=, read -r -a enhfiles <<<$2\ndir=$3\nlogdir=${dir}/log\nmkdir -p ${logdir}\n\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"${split_scps} ${logdir}/key.${n}.scp\"\ndone\n\n# Split the first reference\nutils/split_scp.pl ${reffiles[0]} ${split_scps} || exit 1;\n\n${cmd} JOB=1:${nj} ${logdir}/eval-enhanced-speech.JOB.log \\\n    eval-source-separation.py \\\n    --ref \"${reffiles[@]}\" --enh \"${enhfiles[@]}\" \\\n    --keylist ${logdir}/key.JOB.scp \\\n    --out ${logdir}/JOB \\\n    --evaltypes ${evaltypes} \\\n    --permutation ${permutation} \\\n    --bss-eval-images ${bss_eval_images} \\\n    --bss-eval-version ${bss_eval_version}\n\n\nfor t in ${evaltypes/SDR/SDR ISR SIR SAR}; do\n    for i in $(seq 1 ${nj}); do\n        cat ${logdir}/${i}/${t}\n    done > ${dir}/${t}\n\n    # Calculate the mean over files\n    python3 << EOF > ${dir}/mean_${t}\nwith open('${dir}/${t}', 'r') as f:\n    values = []\n    for l in f:\n        vs = l.rstrip().split(None)[1:]\n        values.append(sum(map(float, vs)) / len(vs))\n    mean = sum(values) / len(values)\nprint(mean)\nEOF\n\ndone\n"
  },
  {
    "path": "egs/espnet_utils/feat-to-shape.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nimport logging\nimport sys\n\nfrom espnet.transform.transformation import Transformation\nfrom espnet.utils.cli_readers import file_reader_helper\nfrom espnet.utils.cli_utils import get_commandline_args\nfrom espnet.utils.cli_utils import is_scipy_wav_style\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert feature to its shape\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\", \"sound.hdf5\", \"sound\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\n        \"--preprocess-conf\",\n        type=str,\n        default=None,\n        help=\"The configuration file for the pre-processing\",\n    )\n    parser.add_argument(\n        \"rspecifier\", type=str, help=\"Read specifier for feats. e.g. ark:some.ark\"\n    )\n    parser.add_argument(\n        \"out\",\n        nargs=\"?\",\n        type=argparse.FileType(\"w\"),\n        default=sys.stdout,\n        help=\"The output filename. 
\" \"If omitted, then output to sys.stdout\",\n    )\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    if args.preprocess_conf is not None:\n        preprocessing = Transformation(args.preprocess_conf)\n        logging.info(\"Apply preprocessing: {}\".format(preprocessing))\n    else:\n        preprocessing = None\n\n    # There are no necessary for matrix without preprocessing,\n    # so change to file_reader_helper to return shape.\n    # This make sense only with filetype=\"hdf5\".\n    for utt, mat in file_reader_helper(\n        args.rspecifier, args.filetype, return_shape=preprocessing is None\n    ):\n        if preprocessing is not None:\n            if is_scipy_wav_style(mat):\n                # If data is sound file, then got as Tuple[int, ndarray]\n                rate, mat = mat\n            mat = preprocessing(mat, uttid_list=utt)\n            shape_str = \",\".join(map(str, mat.shape))\n        else:\n            if len(mat) == 2 and isinstance(mat[1], tuple):\n                # If data is sound file, Tuple[int, Tuple[int, ...]]\n                rate, mat = mat\n            shape_str = \",\".join(map(str, mat))\n        args.out.write(\"{} {}\\n\".format(utt, shape_str))\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/feat_to_shape.sh",
    "content": "#!/usr/bin/env bash\n\n# Begin configuration section.\nnj=188\ncmd=run.pl\nverbose=0\nfiletype=\"\"\npreprocess_conf=\"\"\n# End configuration section.\n\nhelp_message=$(cat << EOF\nUsage: $0 [options] <input-scp> <output-scp> [<log-dir>]\ne.g.: $0 data/train/feats.scp data/train/shape.scp data/train/log\nOptions:\n  --nj <nj>                                        # number of parallel jobs\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\n  --filetype <mat|hdf5|sound.hdf5>                 # Specify the format of feats file\n  --preprocess-conf <json>                         # Apply preprocess to feats when creating shape.scp\n  --verbose <num>                                  # Default: 0\nEOF\n)\n\necho \"$0 $*\" 1>&2 # Print the command line for logging\n\n. parse_options.sh || exit 1;\n\nif [ $# -lt 2 ] || [ $# -gt 3 ]; then\n    echo \"${help_message}\" 1>&2\n    exit 1;\nfi\n\nset -euo pipefail\n\nscp=$1\noutscp=$2\ndata=$(dirname ${scp})\nif [ $# -eq 3 ]; then\n  logdir=$3\nelse\n  logdir=${data}/log\nfi\nmkdir -p ${logdir}\n\nnj=$((nj<$(<\"${scp}\" wc -l)?nj:$(<\"${scp}\" wc -l)))\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"${split_scps} ${logdir}/feats.${n}.scp\"\ndone\n\nutils/split_scp.pl ${scp} ${split_scps}\n\nif [ -n \"${preprocess_conf}\" ]; then\n    preprocess_opt=\"--preprocess-conf ${preprocess_conf}\"\nelse\n    preprocess_opt=\"\"\nfi\nif [ -n \"${filetype}\" ]; then\n    filetype_opt=\"--filetype ${filetype}\"\nelse\n    filetype_opt=\"\"\nfi\n\n${cmd} JOB=1:${nj} ${logdir}/feat_to_shape.JOB.log \\\n    feat-to-shape.py --verbose ${verbose} ${preprocess_opt} ${filetype_opt} \\\n    scp:${logdir}/feats.JOB.scp ${logdir}/shape.JOB.scp\n\n# concatenate the .scp files together.\nfor n in $(seq ${nj}); do\n    cat ${logdir}/shape.${n}.scp\ndone > ${outscp}\n\nrm -f ${logdir}/feats.*.scp 2>/dev/null\n"
  },
  {
    "path": "egs/espnet_utils/feats2npy.py",
    "content": "#!/usr/bin/env python\n#  coding: utf-8\n\nimport argparse\nfrom kaldiio import ReadHelper\nimport numpy as np\nimport os\nfrom os.path import join\nimport sys\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Convet kaldi-style features to numpy arrays\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"scp_file\", type=str, help=\"scp file\")\n    parser.add_argument(\"out_dir\", type=str, help=\"output directory\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    args = get_parser().parse_args(sys.argv[1:])\n    os.makedirs(args.out_dir, exist_ok=True)\n    with ReadHelper(f\"scp:{args.scp_file}\") as f:\n        for utt_id, arr in f:\n            out_path = join(args.out_dir, f\"{utt_id}-feats.npy\")\n            np.save(out_path, arr, allow_pickle=False)\n    sys.exit(0)\n"
  },
  {
    "path": "egs/espnet_utils/filt.py",
    "content": "#!/usr/bin/env python3\n\n# Apache 2.0\n\nimport argparse\nimport codecs\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"filter words in a text file\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--exclude\",\n        \"-v\",\n        dest=\"exclude\",\n        action=\"store_true\",\n        help=\"exclude filter words\",\n    )\n    parser.add_argument(\"filt\", type=str, help=\"filter list\")\n    parser.add_argument(\"infile\", type=str, help=\"input file\")\n    return parser\n\n\ndef main(args):\n    args = get_parser().parse_args(args)\n    filter_file(args.infile, args.filt, args.exclude)\n\n\ndef filter_file(infile, filt, exclude):\n    vocab = set()\n    with codecs.open(filt, \"r\", encoding=\"utf-8\") as vocabfile:\n        for line in vocabfile:\n            vocab.add(line.strip())\n\n    sys.stdout = codecs.getwriter(\"utf-8\")(\n        sys.stdout if is_python2 else sys.stdout.buffer\n    )\n    with codecs.open(infile, \"r\", encoding=\"utf-8\") as textfile:\n        for line in textfile:\n            if exclude:\n                print(\n                    \" \".join(\n                        map(\n                            lambda word: word if word not in vocab else \"\",\n                            line.strip().split(),\n                        )\n                    )\n                )\n            else:\n                print(\n                    \" \".join(\n                        map(\n                            lambda word: word if word in vocab else \"<UNK>\",\n                            line.strip().split(),\n                        )\n                    )\n                )\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/filter_all_eng_utts.py",
    "content": "import sys\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n\nf_in = sys.argv[1]\nf_out = sys.argv[2]\n\nwriter = open(f_out, 'w', encoding='utf-8')\nfor line in open(f_in, encoding='utf-8'):\n    elems = line.strip().split()\n    uttid = elems[0]\n    \n    if len(elems) <= 1:\n        continue\n\n    text = \" \".join(elems[1:])\n    if not is_all_chinese(text):\n        continue\n\n    out_line = \" \".join([uttid, text, '\\n'])\n    writer.write(out_line)\n"
  },
  {
    "path": "egs/espnet_utils/filter_scp.py",
    "content": "import sys\n\nref_f = sys.argv[1]\nin_f = sys.argv[2]\n\n# output is in the order of ref_f\nref = []\nfor line in open(ref_f, encoding='utf-8'):\n    uttid = line.strip().split()[0]\n    ref.append(uttid)\n\nin_dic = {}\nfor line in open(in_f, encoding='utf-8'):\n    elems = line.strip().split()\n    uttid = elems[0]\n    ctx = \" \".join(elems[1:])\n    in_dic[uttid] = ctx\n\nfor e in ref:\n    if e in in_dic:\n        print(f\"{e} {in_dic[e]}\")\n"
  },
  {
    "path": "egs/espnet_utils/filter_trn.py",
    "content": "# this is to process the hyp.trn of sppd3\nimport sys\n\nin_f = sys.argv[1]\nignore = \"叮 当 叮 当 \"\n\nfor line in open(in_f, encoding=\"utf-8\"):\n    line = line.strip().replace(ignore, \"\")\n    print(line) \n\n"
  },
  {
    "path": "egs/espnet_utils/free-gpu.sh",
    "content": "#!/usr/bin/env bash\n# Author: Gaurav Kumar\n\n\n# Usage: e.g.\n# % free-gpu.sh -n 2\n# 1,2\n\n# Allow requests for multiple GPUs\n# (Optional) defaults to 1\nreq_gpus=1\nwhile getopts ':n:' opt; do\n  case ${opt} in\n    n)\n      req_gpus=${OPTARG}\n      ;;\n    :)\n      echo \"Option -${OPTARG} requires an argument.\" >&2\n      exit 1\n      ;;\n    *)\n      echo \"Option -${OPTARG} is not supported\" >&2\n      exit 1\n      ;;\n  esac\ndone\n\n# Number of free GPUs on a machine\nn_gpus=$(lspci | grep -i \"nvidia\" | grep -c -v \"Audio\")\n\n# Return -1 if there are no GPUs on the machine\n# or if the requested number of GPUs exceed\n# the number of GPUs installed.\nif [ ${n_gpus} -eq 0 ] || [ ${req_gpus} -gt ${n_gpus} ]; then\n  echo \"-1\"\n  exit 1\nfi\n\n# shellcheck disable=SC2026\nf_gpu=$(nvidia-smi | sed -e '1,/Processes/d' \\\n  | tail -n+3 | head -n-1 | awk '{print $2}' \\\n  | awk -v ng=${n_gpus} 'BEGIN{for (n=0;n<ng;++n){g[n] = 1}} {delete g[$1];} END{for (i in g) print i}' \\\n  | tail -n ${req_gpus})\n\n# return -1 if not enough free GPUs were found\nif [[ $(echo ${f_gpu} | grep -v '^$' | wc -w) -ne ${req_gpus} ]]; then\n  echo \"-1\"\n  exit 1\nelse\n  echo ${f_gpu} | sed 's: :,:g'\nfi\n"
  },
  {
    "path": "egs/espnet_utils/gdown.pl",
    "content": "#!/usr/bin/env perl\n#\n# Google Drive direct download of big files\n# ./gdown.pl 'gdrive file url' ['desired file name']\n#\n# v1.0 by circulosmeos 04-2014.\n# v1.1 by circulosmeos 01-2017.\n# v1.2, v1.3, v1.4 by circulosmeos 01-2019, 02-2019.\n# //circulosmeos.wordpress.com/2014/04/12/google-drive-direct-download-of-big-files\n# Distributed under GPL 3 (//www.gnu.org/licenses/gpl-3.0.html)\n#\nuse strict;\nuse POSIX;\n\nmy $TEMP='gdown.cookie.temp';\nmy $COMMAND;\nmy $confirm;\nmy $check;\nsub execute_command();\n\nmy $URL=shift;\ndie \"\\n./gdown.pl 'gdrive file url' [desired file name]\\n\\n\" if $URL eq '';\n\nmy $FILENAME=shift;\n$FILENAME='gdown.'.strftime(\"%Y%m%d%H%M%S\", localtime).'.'.substr(rand,2) if $FILENAME eq '';\n\nif ($URL=~m#^https?://drive.google.com/file/d/([^/]+)#) {\n    $URL=\"https://docs.google.com/uc?id=$1&export=download\";\n}\nelsif ($URL=~m#^https?://drive.google.com/open\\?id=([^/]+)#) {\n    $URL=\"https://docs.google.com/uc?id=$1&export=download\";\n}\n\nexecute_command();\n\nwhile (-s $FILENAME < 100000) { # only if the file isn't the download yet\n    open fFILENAME, '<', $FILENAME;\n    $check=0;\n    foreach (<fFILENAME>) {\n        if (/href=\"(\\/uc\\?export=download[^\"]+)/) {\n            $URL='https://docs.google.com'.$1;\n            $URL=~s/&amp;/&/g;\n            $confirm='';\n            $check=1;\n            last;\n        }\n        if (/confirm=([^;&]+)/) {\n            $confirm=$1;\n            $check=1;\n            last;\n        }\n        if (/\"downloadUrl\":\"([^\"]+)/) {\n            $URL=$1;\n            $URL=~s/\\\\u003d/=/g;\n            $URL=~s/\\\\u0026/&/g;\n            $confirm='';\n            $check=1;\n            last;\n        }\n    }\n    close fFILENAME;\n    die \"Couldn't download the file :-(\\n\" if ($check==0);\n    $URL=~s/confirm=([^;&]+)/confirm=$confirm/ if $confirm ne '';\n\n    execute_command();\n}\n\nunlink $TEMP;\n\nsub execute_command() {\n    $COMMAND=\"wget 
--progress=dot:giga --no-check-certificate --load-cookie $TEMP --save-cookie $TEMP \\\"$URL\\\"\";\n    $COMMAND.=\" -O \\\"$FILENAME\\\"\" if $FILENAME ne '';\n    system ( $COMMAND );\n    return 1;\n}\n"
  },
  {
    "path": "egs/espnet_utils/generate_wav.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# Begin configuration section.\nnj=2\nfs=22050\nn_fft=1024\nn_shift=256\ncmd=run.pl\nhelp_message=$(cat <<EOF\nUsage:\n  $0 [options] <model-path> <data-dir> [<log-dir> [<fbank-dir>] ]\nExample:\n  $0 ljspeech.wavenet.ns.v1/checkpoint-1000000.pkl data/train exp/wavenet_vocoder/train wav\nNote:\n  <log-dir> defaults to <data-dir>/log, and <fbank-dir> defaults to <data-dir>/data\nOptions:\n  --nj <nj>             # number of parallel jobs\n  --fs <fs>             # sampling rate (default=22050)\n  --n_fft <n_fft>       # number of FFT points (default=1024)\n  --n_shift <n_shift>   # shift size in point (default=256)\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\nEOF\n)\n# End configuration section.\n\necho \"$0 $*\"  # Print the command line for logging\n\n. parse_options.sh || exit 1;\n\nif [ $# -lt 2 ] || [ $# -gt 4 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nmodel=$1\ndata=$2\nif [ $# -ge 3 ]; then\n  logdir=$3\nelse\n  logdir=${data}/log\nfi\nif [ $# -ge 4 ]; then\n  wavdir=$4\nelse\n  wavdir=${data}/data\nfi\n\n# use \"name\" as part of name of the archive.\nname=$(basename ${data})\n\nmkdir -p ${wavdir} || exit 1;\nmkdir -p ${logdir} || exit 1;\n\nscp=${data}/feats.scp\n\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"$split_scps $logdir/feats.${n}.scp\"\ndone\n\nutils/split_scp.pl ${scp} ${split_scps} || exit 1;\n\n${cmd} JOB=1:${nj} ${logdir}/generate_with_wavenet_${name}.JOB.log \\\n    generate_wav_from_fbank.py \\\n        --model ${model} \\\n        --fs ${fs} \\\n        --n_fft ${n_fft} \\\n        --n_shift ${n_shift} \\\n        scp:${logdir}/feats.JOB.scp \\\n        ${wavdir}\n\nrm ${logdir}/feats.*.scp 2>/dev/null\n\necho \"Succeeded creating wav for ${name}\"\n"
  },
  {
    "path": "egs/espnet_utils/generate_wav_from_fbank.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"This code is based on https://github.com/kan-bayashi/PytorchWaveNetVocoder.\"\"\"\n\n# Copyright 2019 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport logging\nimport os\nimport time\n\nimport h5py\nimport numpy as np\nimport pysptk\nimport torch\n\nfrom scipy.io.wavfile import write\nfrom sklearn.preprocessing import StandardScaler\n\nfrom espnet.nets.pytorch_backend.wavenet import decode_mu_law\nfrom espnet.nets.pytorch_backend.wavenet import encode_mu_law\nfrom espnet.nets.pytorch_backend.wavenet import WaveNet\nfrom espnet.utils.cli_readers import file_reader_helper\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\nclass TimeInvariantMLSAFilter(object):\n    \"\"\"Time invariant MLSA filter.\n\n    This module is used to perform noise shaping described in\n    `An investigation of noise shaping with perceptual\n     weighting for WaveNet-based speech generation`_.\n\n    Args:\n        coef (ndaaray): MLSA filter coefficient (D,).\n        alpha (float): All pass constant value.\n        n_shift (int): Shift length in points.\n\n    .. 
_`An investigation of noise shaping with perceptual\n        weighting for WaveNet-based speech generation`:\n        https://ieeexplore.ieee.org/abstract/document/8461332\n\n    \"\"\"\n\n    def __init__(self, coef, alpha, n_shift):\n        self.coef = coef\n        self.n_shift = n_shift\n        self.mlsa_filter = pysptk.synthesis.Synthesizer(\n            pysptk.synthesis.MLSADF(order=coef.shape[0] - 1, alpha=alpha),\n            hopsize=n_shift,\n        )\n\n    def __call__(self, y):\n        \"\"\"Apply time invariant MLSA filter.\n\n        Args:\n            y (ndarray): Waveform signal normalized from -1 to 1 (N,).\n\n        Returns:\n            y (ndarray): Filtered waveform signal normalized from -1 to 1 (N,).\n\n        \"\"\"\n        # check shape and type\n        assert len(y.shape) == 1\n        y = np.float64(y)\n\n        # get frame number and then replicate mlsa coef\n        num_frames = int(len(y) / self.n_shift) + 1\n        coef = np.tile(self.coef, [num_frames, 1])\n\n        return self.mlsa_filter.synthesis(y, coef)\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"generate wav from FBANK using wavenet vocoder\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--fs\", type=int, default=22050, help=\"Sampling frequency\")\n    parser.add_argument(\"--n_fft\", type=int, default=1024, help=\"FFT length in point\")\n    parser.add_argument(\n        \"--n_shift\", type=int, default=256, help=\"Shift length in point\"\n    )\n    parser.add_argument(\"--model\", type=str, default=None, help=\"WaveNet model\")\n    parser.add_argument(\n        \"--filetype\",\n        type=str,\n        default=\"mat\",\n        choices=[\"mat\", \"hdf5\"],\n        help=\"Specify the file format for the rspecifier. \"\n        '\"mat\" is the matrix format in kaldi',\n    )\n    parser.add_argument(\"rspecifier\", type=str, help=\"Input feature e.g. 
scp:feat.scp\")\n    parser.add_argument(\"outdir\", type=str, help=\"Output directory\")\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logging.basicConfig(\n        level=logging.INFO,\n        format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n    )\n    logging.info(get_commandline_args())\n\n    # check directory\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n\n    # load model config\n    model_dir = os.path.dirname(args.model)\n    train_args = torch.load(os.path.join(model_dir, \"model.conf\"))\n\n    # load statistics (open read-only; older h5py versions default to append mode)\n    scaler = StandardScaler()\n    with h5py.File(os.path.join(model_dir, \"stats.h5\"), \"r\") as f:\n        scaler.mean_ = f[\"/melspc/mean\"][()]\n        scaler.scale_ = f[\"/melspc/scale\"][()]\n        # TODO(kan-bayashi): include following info as default\n        coef = f[\"/mlsa/coef\"][()]\n        alpha = f[\"/mlsa/alpha\"][()]\n\n    # define MLSA filter for noise shaping\n    mlsa_filter = TimeInvariantMLSAFilter(\n        coef=coef,\n        alpha=alpha,\n        n_shift=args.n_shift,\n    )\n\n    # define model and load parameters\n    device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n    model = WaveNet(\n        n_quantize=train_args.n_quantize,\n        n_aux=train_args.n_aux,\n        n_resch=train_args.n_resch,\n        n_skipch=train_args.n_skipch,\n        dilation_depth=train_args.dilation_depth,\n        dilation_repeat=train_args.dilation_repeat,\n        kernel_size=train_args.kernel_size,\n        upsampling_factor=train_args.upsampling_factor,\n    )\n    model.load_state_dict(torch.load(args.model, map_location=\"cpu\")[\"model\"])\n    model.eval()\n    model.to(device)\n\n    for idx, (utt_id, lmspc) in enumerate(\n        file_reader_helper(args.rspecifier, args.filetype), 1\n    ):\n        logging.info(\"(%d) %s\" % (idx, utt_id))\n\n        # 
perform preprocessing\n        x = encode_mu_law(\n            np.zeros((1)), mu=train_args.n_quantize\n        )  # quantize initial seed waveform\n        h = scaler.transform(lmspc)  # normalize features\n\n        # convert to tensor\n        x = torch.tensor(x, dtype=torch.long, device=device)  # (1,)\n        h = torch.tensor(h, dtype=torch.float, device=device)  # (T, n_aux)\n\n        # get length of waveform\n        n_samples = (h.shape[0] - 1) * args.n_shift + args.n_fft\n\n        # generate\n        start_time = time.time()\n        with torch.no_grad():\n            y = model.generate(x, h, n_samples, interval=100)\n        logging.info(\n            \"generation speed = %s (sec / sample)\"\n            % ((time.time() - start_time) / (len(y) - 1))\n        )\n        y = decode_mu_law(y, mu=train_args.n_quantize)\n\n        # apply mlsa filter for noise shaping\n        y = mlsa_filter(y)\n\n        # save as .wav file\n        write(\n            os.path.join(args.outdir, \"%s.wav\" % utt_id),\n            args.fs,\n            (y * np.iinfo(np.int16).max).astype(np.int16),\n        )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/get_yaml.py",
    "content": "#!/usr/bin/env python3\nimport argparse\n\nimport yaml\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"get a specified attribute from a YAML file\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"inyaml\")\n    parser.add_argument(\n        \"attr\", help='foo.bar will access yaml.load(inyaml)[\"foo\"][\"bar\"]'\n    )\n    return parser\n\n\ndef main():\n    args = get_parser().parse_args()\n    with open(args.inyaml, \"r\") as f:\n        indict = yaml.load(f, Loader=yaml.Loader)\n\n    try:\n        for attr in args.attr.split(\".\"):\n            if attr.isdigit():\n                attr = int(attr)\n            indict = indict[attr]\n        print(indict)\n    except KeyError:\n        # print nothing\n        # sys.exit(1)\n        pass\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/jieba_build_dict.py",
    "content": "import jieba\nimport sys\n_ = jieba.lcut(\"aaa\")\n\n\nwords_file = sys.argv[1]\ndict_file = sys.argv[2]\n\nreader = open(words_file, 'r')\nwriter = open(dict_file, 'w')\nfor line in reader:\n    term = line.strip().split()[0]\n    freq = jieba.dt.FREQ.get(term)\n    freq = 1 if freq == None else freq\n    writer.write(f\"{term} {freq}\\n\")\nwriter.close()\n"
  },
  {
    "path": "egs/espnet_utils/json2sctm.py",
    "content": "#!/usr/bin/python\n# -*- coding: utf-8 -*-\n\nimport argparse\nimport os\nimport subprocess\nimport sys\n\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"convert json to sctm\")\n    parser.add_argument(\"json\", type=str, default=None, nargs=\"?\", help=\"input trn\")\n    parser.add_argument(\"dict\", type=str, help=\"dict\")\n    parser.add_argument(\n        \"--num-spkrs\", type=int, default=1, nargs=\"?\", help=\"number of speakers\"\n    )\n    parser.add_argument(\"--refs\", type=str, nargs=\"*\", help=\"ref for all speakers\")\n    parser.add_argument(\"--hyps\", type=str, nargs=\"*\", help=\"hyp for all outputs\")\n    parser.add_argument(\"--orig-stm\", type=str, nargs=\"?\", help=\"orig stm\")\n    parser.add_argument(\"--stm\", type=str, default=None, nargs=\"+\", help=\"output stm\")\n    parser.add_argument(\"--ctm\", type=str, default=None, nargs=\"+\", help=\"output ctm\")\n    parser.add_argument(\n        \"--bpe\", type=str, default=None, nargs=\"?\", help=\"BPE model if applicable\"\n    )\n    return parser\n\n\ndef main(args):\n    from utils import json2trn\n    from utils import trn2ctm\n    from utils import trn2stm\n\n    parser = get_parser()\n    args = parser.parse_args(args)\n    if args.refs is None:\n        refs = [\"ref_tmp.trn\"]\n        del_ref = True\n    else:\n        refs = args.refs\n        del_ref = False\n    if args.hyps is None:\n        hyps = [\"hyp_tmp.trn\"]\n        del_hyp = True\n    else:\n        hyps = args.hyps\n        del_hyp = False\n    json2trn.convert(args.json, args.dict, refs, hyps, args.num_spkrs)\n    for trn in refs + hyps:\n        # We don't remove non-lang-syms because kaldi already removes them when scoring\n        call_args = [\"sed\", \"-i.bak2\", \"-r\", \"s/<blank> //g\", trn]\n        subprocess.check_call(call_args)\n        if args.bpe is not None:\n            with open(wrd_name(trn), \"w\") as out:\n 
               with open(trn, \"r\") as spm_in:\n                    sed_args = [\"sed\", \"-e\", \"s/▁/ /g\"]\n                    sed = subprocess.Popen(sed_args, stdout=out, stdin=subprocess.PIPE)\n                    spm_args = [\n                        \"spm_decode\",\n                        \"--model=\" + args.bpe,\n                        \"--input_format=piece\",\n                    ]\n                    # Pipe spm_decode's output into sed; otherwise the decoded\n                    # text never reaches the output file.\n                    spm = subprocess.Popen(spm_args, stdin=spm_in, stdout=sed.stdin)\n                    spm.wait()\n                    sed.communicate()\n        else:\n            call_args = [\n                \"sed\",\n                \"-e\",\n                \"s/ //g\",\n                \"-e\",\n                \"s/(/ (/\",\n                \"-e\",\n                \"s/<space>/ /g\",\n                trn,\n            ]\n            with open(wrd_name(trn), \"w\") as out:\n                sed = subprocess.Popen(call_args, stdout=out)\n                sed.communicate()\n    for trn, stm in zip(refs, args.stm):\n        trn2stm.convert(wrd_name(trn), stm, args.orig_stm)\n    if del_ref:\n        os.remove(refs[0])\n        os.remove(refs[0] + \".bak2\")\n        os.remove(wrd_name(refs[0]))\n\n    for trn, ctm in zip(hyps, args.ctm):\n        trn2ctm.convert(wrd_name(trn), ctm)\n    if del_hyp:\n        os.remove(hyps[0])\n        os.remove(hyps[0] + \".bak2\")\n        os.remove(wrd_name(hyps[0]))\n\n\ndef wrd_name(trn):\n    split = trn.split(\".\")\n    return \".\".join(split[:-1]) + \".wrd.\" + split[-1]\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/json2text.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport json\nimport logging\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert ASR recognized json to text\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"json\", type=str, help=\"json files\")\n    parser.add_argument(\"dict\", type=str, help=\"dict\")\n    parser.add_argument(\"ref\", type=str, help=\"ref\")\n    parser.add_argument(\"hyp\", type=str, help=\"hyp\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    args = get_parser().parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    logging.basicConfig(level=logging.INFO, format=logfmt)\n    logging.info(get_commandline_args())\n\n    logging.info(\"reading %s\", args.json)\n    with codecs.open(args.json, \"r\", encoding=\"utf-8\") as f:\n        j = json.load(f)\n\n    logging.info(\"reading %s\", args.dict)\n    with codecs.open(args.dict, \"r\", encoding=\"utf-8\") as f:\n        dictionary = f.readlines()\n    char_list = [entry.split(\" \")[0] for entry in dictionary]\n    char_list.insert(0, \"<blank>\")\n    char_list.append(\"<eos>\")\n    # print([x.encode('utf-8') for x in char_list])\n\n    logging.info(\"writing hyp trn to %s\", args.hyp)\n    logging.info(\"writing ref trn to %s\", args.ref)\n    h = codecs.open(args.hyp, \"w\", encoding=\"utf-8\")\n    r = codecs.open(args.ref, \"w\", encoding=\"utf-8\")\n\n    for x in j[\"utts\"]:\n        seq = [\n            char_list[int(i)] for i in j[\"utts\"][x][\"output\"][0][\"rec_tokenid\"].split()\n        ]\n        h.write(x + \" \" + \" \".join(seq).replace(\"<eos>\", \"\") + \"\\n\")\n\n        if 
\"tokenid\" in j[\"utts\"][x][\"output\"][0].keys():\n            seq = [\n                char_list[int(i)] for i in j[\"utts\"][x][\"output\"][0][\"tokenid\"].split()\n            ]\n            r.write(x + \" \" + \" \".join(seq).replace(\"<eos>\", \"\") + \"\\n\")\n"
  },
  {
    "path": "egs/espnet_utils/json2trn.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#           2018 Xuankai Chang (Shanghai Jiao Tong University)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert a json to a transcription file with a token dictionary\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"json\", type=str, help=\"json files\")\n    parser.add_argument(\"dict\", type=str, help=\"dict\")\n    parser.add_argument(\"--num-spkrs\", type=int, default=1, help=\"number of speakers\")\n    parser.add_argument(\"--refs\", type=str, nargs=\"+\", help=\"ref for all speakers\")\n    parser.add_argument(\"--hyps\", type=str, nargs=\"+\", help=\"hyp for all outputs\")\n    return parser\n\n\ndef main(args):\n    args = get_parser().parse_args(args)\n    convert(args.json, args.dict, args.refs, args.hyps, args.num_spkrs)\n\n\ndef convert(jsonf, dic, refs, hyps, num_spkrs=1):\n    n_ref = len(refs)\n    n_hyp = len(hyps)\n    assert n_ref == n_hyp\n    assert n_ref == num_spkrs\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    logging.basicConfig(level=logging.INFO, format=logfmt)\n    logging.info(get_commandline_args())\n\n    logging.info(\"reading %s\", jsonf)\n    with codecs.open(jsonf, \"r\", encoding=\"utf-8\") as f:\n        j = json.load(f)\n\n    logging.info(\"reading %s\", dic)\n    with codecs.open(dic, \"r\", encoding=\"utf-8\") as f:\n        dictionary = f.readlines()\n    char_list = [entry.split(\" \")[0] for entry in dictionary]\n    char_list.insert(0, \"<blank>\")\n    char_list.append(\"<eos>\")\n\n    for ns in range(num_spkrs):\n        hyp_file = 
codecs.open(hyps[ns], \"w\", encoding=\"utf-8\")\n        ref_file = codecs.open(refs[ns], \"w\", encoding=\"utf-8\")\n\n        for x in j[\"utts\"]:\n            # recognition hypothesis\n            if num_spkrs == 1:\n                seq = [\n                    char_list[int(i)]\n                    for i in j[\"utts\"][x][\"output\"][0][\"rec_tokenid\"].split()\n                ]\n            else:\n                seq = [\n                    char_list[int(i)]\n                    for i in j[\"utts\"][x][\"output\"][ns][0][\"rec_tokenid\"].split()\n                ]\n            # The recognition hypothesis usually ends with an <eos> symbol,\n            # which is removed below.\n            hyp_file.write(\" \".join(seq).replace(\"<eos>\", \"\"))\n            hyp_file.write(\n                \" (\" + j[\"utts\"][x][\"utt2spk\"].replace(\"-\", \"_\") + \"-\" + x + \")\\n\"\n            )\n\n            # reference\n            if num_spkrs == 1:\n                seq = j[\"utts\"][x][\"output\"][0][\"token\"]\n            else:\n                seq = j[\"utts\"][x][\"output\"][ns][0][\"token\"]\n            # Unlike the recognition hypothesis, the reference is taken directly\n            # from the \"token\" field instead of being mapped through the dictionary.\n            # This keeps <unk> symbols out of the reference so that scoring is correct.\n            # The detailed discussion can be found at\n            # https://github.com/espnet/espnet/issues/993\n            ref_file.write(\n                seq + \" (\" + j[\"utts\"][x][\"utt2spk\"].replace(\"-\", \"_\") + \"-\" + x + \")\\n\"\n            )\n\n        hyp_file.close()\n        ref_file.close()\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/json2trn_mt.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2018 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# NOTE: this is made for machine translation\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert json to machine translation transcription\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"json\", type=str, help=\"json files\")\n    parser.add_argument(\"dict\", type=str, help=\"dict for target language\")\n    parser.add_argument(\"--refs\", type=str, nargs=\"+\", help=\"ref for all speakers\")\n    parser.add_argument(\"--hyps\", type=str, nargs=\"+\", help=\"hyp for all outputs\")\n    parser.add_argument(\"--srcs\", type=str, nargs=\"+\", help=\"src for all outputs\")\n    parser.add_argument(\n        \"--dict-src\",\n        type=str,\n        help=\"dict for source language\",\n        default=False,\n        nargs=\"?\",\n    )\n    return parser\n\n\ndef main(args):\n    parser = get_parser()\n    args = parser.parse_args(args)\n    convert(args.json, args.dict, args.refs, args.hyps, args.srcs, args.dict_src)\n\n\ndef convert(jsonf, dic, refs, hyps, srcs, dic_src):\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    logging.basicConfig(level=logging.INFO, format=logfmt)\n    logging.info(get_commandline_args())\n\n    logging.info(\"reading %s\", jsonf)\n    with codecs.open(jsonf, \"r\", encoding=\"utf-8\") as f:\n        j = json.load(f)\n\n    # target dictionary\n    logging.info(\"reading %s\", dic)\n    with codecs.open(dic, \"r\", encoding=\"utf-8\") as f:\n        dictionary = f.readlines()\n    char_list_tgt = [entry.split(\" \")[0] for entry in dictionary]\n    char_list_tgt.insert(0, 
\"<blank>\")\n    char_list_tgt.append(\"<eos>\")\n\n    # source dictionary\n    logging.info(\"reading %s\", dic_src)\n    if dic_src:\n        with codecs.open(dic_src, \"r\", encoding=\"utf-8\") as f:\n            dictionary = f.readlines()\n        char_list_src = [entry.split(\" \")[0] for entry in dictionary]\n        char_list_src.insert(0, \"<blank>\")\n        char_list_src.append(\"<eos>\")\n\n    if hyps:\n        hyp_file = codecs.open(hyps[0], \"w\", encoding=\"utf-8\")\n    ref_file = codecs.open(refs[0], \"w\", encoding=\"utf-8\")\n    if srcs:\n        src_file = codecs.open(srcs[0], \"w\", encoding=\"utf-8\")\n\n    for x in j[\"utts\"]:\n        # hyps\n        if hyps:\n            seq = [\n                char_list_tgt[int(i)]\n                for i in j[\"utts\"][x][\"output\"][0][\"rec_tokenid\"].split()\n            ]\n            hyp_file.write(\" \".join(seq).replace(\"<eos>\", \"\")),\n            hyp_file.write(\n                \" (\" + j[\"utts\"][x][\"utt2spk\"].replace(\"-\", \"_\") + \"-\" + x + \")\\n\"\n            )\n\n        # ref\n        seq = [\n            char_list_tgt[int(i)] for i in j[\"utts\"][x][\"output\"][0][\"tokenid\"].split()\n        ]\n        ref_file.write(\" \".join(seq).replace(\"<eos>\", \"\")),\n        ref_file.write(\n            \" (\" + j[\"utts\"][x][\"utt2spk\"].replace(\"-\", \"_\") + \"-\" + x + \")\\n\"\n        )\n\n        # src\n        if \"tokenid_src\" in j[\"utts\"][x][\"output\"][0].keys():\n            if dic_src:\n                seq = [\n                    char_list_src[int(i)]\n                    for i in j[\"utts\"][x][\"output\"][0][\"tokenid_src\"].split()\n                ]\n            else:\n                seq = [\n                    char_list_tgt[int(i)]\n                    for i in j[\"utts\"][x][\"output\"][0][\"tokenid_src\"].split()\n                ]\n            src_file.write(\" \".join(seq).replace(\"<eos>\", \"\")),\n            src_file.write(\n                \" 
(\" + j[\"utts\"][x][\"utt2spk\"].replace(\"-\", \"_\") + \"-\" + x + \")\\n\"\n            )\n\n    if hyps:\n        hyp_file.close()\n    ref_file.close()\n    if srcs:\n        src_file.close()\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/json2trn_wo_dict.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Okayama University (Katsuki Inoue)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert a json to a transcription file with a token dictionary\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"json\", type=str, help=\"json files\")\n    parser.add_argument(\"--num-spkrs\", type=int, default=1, help=\"number of speakers\")\n    parser.add_argument(\"--refs\", type=str, nargs=\"+\", help=\"ref for all speakers\")\n    parser.add_argument(\"--hyps\", type=str, nargs=\"+\", help=\"hyp for all outputs\")\n    return parser\n\n\ndef main(args):\n    args = get_parser().parse_args(args)\n    convert(args.json, args.refs, args.hyps, args.num_spkrs)\n\n\ndef convert(jsonf, refs, hyps, num_spkrs=1):\n    n_ref = len(refs)\n    n_hyp = len(hyps)\n    assert n_ref == n_hyp\n    assert n_ref == num_spkrs\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    logging.basicConfig(level=logging.INFO, format=logfmt)\n    logging.info(get_commandline_args())\n\n    logging.info(\"reading %s\", jsonf)\n    with codecs.open(jsonf, \"r\", encoding=\"utf-8\") as f:\n        j = json.load(f)\n\n    for ns in range(num_spkrs):\n        hyp_file = codecs.open(hyps[ns], \"w\", encoding=\"utf-8\")\n        ref_file = codecs.open(refs[ns], \"w\", encoding=\"utf-8\")\n\n        for x in j[\"utts\"]:\n            # recognition hypothesis\n            if num_spkrs == 1:\n                seq = j[\"utts\"][x][\"output\"][0][\"rec_text\"].replace(\"<eos>\", \"\")\n            else:\n                seq = j[\"utts\"][x][\"output\"][ns][0][\"rec_text\"].replace(\"<eos>\", \"\")\n            
# The recognition hypothesis usually ends with an <eos> symbol,\n            # which has already been removed above.\n            hyp_file.write(seq)\n            hyp_file.write(\" (\" + x.replace(\"-\", \"_\") + \")\\n\")\n\n            # reference\n            if num_spkrs == 1:\n                seq = j[\"utts\"][x][\"output\"][0][\"text\"]\n            else:\n                seq = j[\"utts\"][x][\"output\"][ns][0][\"text\"]\n            # The reference is read directly from the \"text\" field,\n            # without going through a token dictionary,\n            # which keeps <unk> symbols out of the reference so that scoring is correct.\n            # The detailed discussion can be found at\n            # https://github.com/espnet/espnet/issues/993\n            ref_file.write(seq + \" (\" + x.replace(\"-\", \"_\") + \")\\n\")\n\n        hyp_file.close()\n        ref_file.close()\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/k2/add_lex_disambig.pl",
    "content": "#!/usr/bin/env perl\n#  Copyright 2010-2011  Microsoft Corporation\n#            2013-2016  Johns Hopkins University (author: Daniel Povey)\n#                 2015  Hainan Xu\n#                 2015  Guoguo Chen\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Adds disambiguation symbols to a lexicon.\n# Outputs still in the normal lexicon format.\n# Disambig syms are numbered #1, #2, #3, etc. (#0\n# reserved for symbol in grammar).\n# Outputs the number of disambig syms to the standard output.\n# With the --pron-probs option, expects the second field\n# of each lexicon line to be a pron-prob.\n# With the --sil-probs option, expects three additional\n# fields after the pron-prob, representing various components\n# of the silence probability model.\n\n$pron_probs = 0;\n$sil_probs = 0;\n$first_allowed_disambig = 1;\n\nfor ($n = 1; $n <= 3 && @ARGV > 0; $n++) {\n  if ($ARGV[0] eq \"--pron-probs\") {\n    $pron_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--sil-probs\") {\n    $sil_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--first-allowed-disambig\") {\n    $first_allowed_disambig = 0 + $ARGV[1];\n    if ($first_allowed_disambig < 1) {\n      die \"add_lex_disambig.pl: invalid --first-allowed-disambig option: $first_allowed_disambig\\n\";\n    }\n    shift @ARGV;\n    shift @ARGV;\n  }\n}\n\nif (@ARGV != 2) {\n  die \"Usage: add_lex_disambig.pl [opts] <lexicon-in> 
<lexicon-out>\\n\" .\n    \"This script adds disambiguation symbols to a lexicon in order to\\n\" .\n    \"make decoding graphs determinizable; it adds pseudo-phone\\n\" .\n    \"disambiguation symbols #1, #2 and so on at the ends of phones\\n\" .\n    \"to ensure that all pronunciations are different, and that none\\n\" .\n    \"is a prefix of another.\\n\" .\n    \"It prints to the standard output the number of the largest-numbered\" .\n    \"disambiguation symbol that was used.\\n\" .\n    \"\\n\" .\n    \"Options:   --pron-probs       Expect pronunciation probabilities in the 2nd field\\n\" .\n    \"           --sil-probs        [should be with --pron-probs option]\\n\" .\n    \"                              Expect 3 extra fields after the pron-probs, for aspects of\\n\" .\n    \"                              the silence probability model\\n\" .\n    \"           --first-allowed-disambig <n>  The number of the first disambiguation symbol\\n\" .\n    \"                              that this script is allowed to add.  
By default this is\\n\" .\n    \"                              #1, but you can set this to a larger value using this option.\\n\" .\n    \"e.g.:\\n\" .\n    \" add_lex_disambig.pl lexicon.txt lexicon_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs lexiconp.txt lexiconp_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs --sil-probs lexiconp_silprob.txt lexiconp_silprob_disambig.txt\\n\";\n}\n\n\n$lexfn = shift @ARGV;\n$lexoutfn = shift @ARGV;\n\nopen(L, \"<$lexfn\") || die \"Error opening lexicon $lexfn\";\n\n# (1)  Read in the lexicon.\n@L = ( );\nwhile(<L>) {\n    @A = split(\" \", $_);\n    push @L, join(\" \", @A);\n}\n\n# (2) Work out the count of each phone-sequence in the\n# lexicon.\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) {\n      $p = shift @A;\n      if (!($p > 0.0 && $p <= 1.0)) { die \"Bad lexicon line $l (expecting pron-prob as second field)\"; }\n    }\n    if ($sil_probs) {\n      $silp = shift @A;\n      if (!($silp > 0.0 && $silp <= 1.0)) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n    }\n    if (!(@A)) {\n      die \"Bad lexicon line $l, no phone in phone list\";\n    }\n    $count{join(\" \",@A)}++;\n}\n\n# (3) For each left sub-sequence of each phone-sequence, note down\n# that it exists (for identifying prefixes of longer strings).\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) { shift @A; } # remove pron-prob.\n    if ($sil_probs) {\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob; there are three numbers for sil_probs\n    }\n    while(@A > 0) {\n        pop @A;  # Remove last phone\n        $issubseq{join(\" \",@A)} = 1;\n    }\n}\n\n# (4) For 
each entry in the lexicon:\n#  if the phone sequence is unique and is not a\n#  prefix of another word, no disambig symbol.\n#  Else output #1, or #2, #3, ... if the same phone-seq\n#  has already been assigned a disambig symbol.\n\n\nopen(O, \">$lexoutfn\") || die \"Opening lexicon file $lexoutfn for writing.\\n\";\n\n# max_disambig will always be the highest-numbered disambiguation symbol that\n# has been used so far.\n$max_disambig = $first_allowed_disambig - 1;\n\nforeach $l (@L) {\n  @A = split(\" \", $l);\n  $word = shift @A;\n  if ($pron_probs) {\n    $pron_prob = shift @A;\n  }\n  if ($sil_probs) {\n    $sil_word_prob = shift @A;\n    $word_sil_correction = shift @A;\n    $prev_nonsil_correction = shift @A;\n  }\n  $phnseq = join(\" \", @A);\n  if (!defined $issubseq{$phnseq}\n      && $count{$phnseq} == 1) {\n    ;                           # Do nothing.\n  } else {\n    if ($phnseq eq \"\") {        # need disambig symbols for the empty string\n      # that are not used anywhere else.\n      $max_disambig++;\n      $reserved_for_the_empty_string{$max_disambig} = 1;\n      $phnseq = \"#$max_disambig\";\n    } else {\n      $cur_disambig = $last_used_disambig_symbol_of{$phnseq};\n      if (!defined $cur_disambig) {\n        $cur_disambig = $first_allowed_disambig;\n      } else {\n        $cur_disambig++;           # Get a number that has not been used yet for\n                                   # this phone sequence.\n      }\n      while (defined $reserved_for_the_empty_string{$cur_disambig}) {\n        $cur_disambig++;\n      }\n      if ($cur_disambig > $max_disambig) {\n        $max_disambig = $cur_disambig;\n      }\n      $last_used_disambig_symbol_of{$phnseq} = $cur_disambig;\n      $phnseq = $phnseq . \" #\" . 
$cur_disambig;\n    }\n  }\n  if ($pron_probs) {\n    if ($sil_probs) {\n      print O \"$word\\t$pron_prob\\t$sil_word_prob\\t$word_sil_correction\\t$prev_nonsil_correction\\t$phnseq\\n\";\n    } else {\n      print O \"$word\\t$pron_prob\\t$phnseq\\n\";\n    }\n  } else {\n    print O \"$word\\t$phnseq\\n\";\n  }\n}\n\nprint $max_disambig . \"\\n\";\n"
  },
  {
    "path": "egs/espnet_utils/k2/apply_map.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\n# This program is a bit like ./sym2int.pl in that it applies a map\n# to things in a file, but it's a bit more general in that it doesn't\n# assume the things being mapped to are single tokens, they could\n# be sequences of tokens.  See the usage message.\n\n\n$permissive = 0;\n\nfor ($x = 0; $x <= 2; $x++) {\n\n  if (@ARGV > 0 && $ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 1:10 as a courtesty (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n\n  if (@ARGV > 0 && $ARGV[0] eq '--permissive') {\n    shift @ARGV;\n    # Mapping is optional (missing key is printed to output)\n    $permissive = 1;\n  }\n}\n\nif(@ARGV != 1) {\n  print STDERR \"Invalid usage: \" . join(\" \", @ARGV) . 
\"\\n\";\n  print STDERR <<'EOF';\nUsage: apply_map.pl [options] map <input >output\n options: [-f <field-range> ] [--permissive]\n   This applies a map to some specified fields of some input text:\n   For each line in the map file: the first field is the thing we\n   map from, and the remaining fields are the sequence we map it to.\n   The -f (field-range) option says which fields of the input file the map\n   map should apply to.\n   If the --permissive option is supplied, fields which are not present\n   in the map will be left as they were.\n Applies the map 'map' to all input text, where each line of the map\n is interpreted as a map from the first field to the list of the other fields\n Note: <field-range> can look like 4-5, or 4-, or 5-, or 1, it means the field\n range in the input to apply the map to.\n e.g.: echo A B | apply_map.pl a.txt\n where a.txt is:\n A a1 a2\n B b\n will produce:\n a1 a2 b\nEOF\n  exit(1);\n}\n\n($map_file) = @ARGV;\nopen(M, \"<$map_file\") || die \"Error opening map file $map_file: $!\";\n\nwhile (<M>) {\n  @A = split(\" \", $_);\n  @A >= 1 || die \"apply_map.pl: empty line.\";\n  $i = shift @A;\n  $o = join(\" \", @A);\n  $map{$i} = $o;\n}\n\nwhile(<STDIN>) {\n  @A = split(\" \", $_);\n  for ($x = 0; $x < @A; $x++) {\n    if ( (!defined $field_begin || $x >= $field_begin)\n         && (!defined $field_end || $x <= $field_end)) {\n      $a = $A[$x];\n      if (!defined $map{$a}) {\n        if (!$permissive) {\n          die \"apply_map.pl: undefined key $a in $map_file\\n\";\n        } else {\n          print STDERR \"apply_map.pl: warning! missing key $a in $map_file\\n\";\n        }\n      } else {\n        $A[$x] = $map{$a};\n      }\n    }\n  }\n  print join(\" \", @A) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/espnet_utils/k2/fstaddselfloops.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2020 Xiaomi Corporation (Author: Junbo Zhang)\n# Apache 2.0\n\nuse strict;\nuse warnings;\n\nmy $Usage = <<EOU;\nfstaddselfloops.pl:\nAdds self-loops to states of an FST to propagate disambiguation symbols through it.\nThey are added on each final state and each state with non-epsilon output symbols\non at least one arc out of the state. \n\nUsage: local/fstaddselfloops.pl <wdisambig_phone> <wdisambig_word> < <openfst_text>\n e.g.: cat L_disambig.txt | local/fstaddselfloops.pl 347 200004 > L_disambig_with_loop.txt\nEOU\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\nmy $wdisambig_phone = shift @ARGV;\nmy $wdisambig_word = shift @ARGV;\n\nmy %states_needs_self_loops;\nwhile (<>) {\n    print $_;\n\n    my @items = split(/\\s+/);\n    if (@items == 2) {\n        # it is a final state\n        $states_needs_self_loops{$items[0]} = 1;\n    } elsif (@items == 5) {\n        my ($src, $dst, $inlabel, $outlabel, $score) = @items;\n        $states_needs_self_loops{$src} = 1 if ($outlabel != 0);\n    } else {\n        die \"Invalid openfst line.\";\n    }\n}\n\nforeach (keys %states_needs_self_loops) {\n    print \"$_ $_ $wdisambig_phone $wdisambig_word 0.0\\n\"\n}\n"
  },
  {
    "path": "egs/espnet_utils/k2/k2_prepare_lang.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey);\n#                      Arnab Ghoshal\n#                2014  Guoguo Chen\n#                2015  Hainan Xu\n#                2016  FAU Erlangen (Author: Axel Horndasch)\n#                2020  Xiaomi Corporation (Author: Junbo Zhang)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script prepares a directory such as data/lang/, in the standard format,\n# given a source directory containing a dictionary lexicon.txt in a form like:\n# word phone1 phone2 ... phoneN\n# per line (alternate prons would be separate lines), or a dictionary with probabilities\n# called lexiconp.txt in a form:\n# word pron-prob phone1 phone2 ... phoneN\n# (with 0.0 < pron-prob <= 1.0); note: if lexiconp.txt exists, we use it even if\n# lexicon.txt exists.\n# and also files silence_phones.txt, nonsilence_phones.txt, optional_silence.txt\n# and extra_questions.txt\n# Here, silence_phones.txt and nonsilence_phones.txt are lists of silence and\n# non-silence phones respectively (where silence includes various kinds of\n# noise, laugh, cough, filled pauses etc., and nonsilence phones includes the\n# \"real\" phones.)\n# In each line of those files is a list of phones, and the phones on each line\n# are assumed to correspond to the same \"base phone\", i.e. 
they will be\n# different stress or tone variations of the same basic phone.\n# The file \"optional_silence.txt\" contains just a single phone (typically SIL)\n# which is used for optional silence in the lexicon.\n# extra_questions.txt might be empty; typically will consist of lists of phones,\n# all members of each list with the same stress or tone; and also possibly a\n# list for the silence phones.  This will augment the automatically generated\n# questions (note: the automatically generated ones will treat all the\n# stress/tone versions of a phone the same, so will not \"get to ask\" about\n# stress or tone).\n#\n\n# This script adds word-position-dependent phones and constructs a host of other\n# derived files, that go in data/lang/.\n\n# Begin configuration section.\nnum_sil_states=5\nnum_nonsil_states=3\nposition_dependent_phones=true\n# position_dependent_phones is false also when position dependent phones and word_boundary.txt\n# have been generated by another source\nshare_silence_phones=false  # if true, then share pdfs of different silence\n                            # phones together.\nsil_prob=0.5\nnum_extra_phone_disambig_syms=1 # Standard one phone disambiguation symbol is used for optional silence.\n                                # Increasing this number does not harm, but is only useful if you later\n                                # want to introduce this labels to L_disambig.fst\n\n\n# end configuration sections\n\necho \"$0 $@\"  # Print the command line for logging\necho $sil_prob\n. 
local/parse_options.sh\necho $sil_prob\nif [ $# -ne 4 ]; then\n  echo \"Usage: local/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>\"\n  echo \"e.g.: local/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang\"\n  echo \"<dict-src-dir> should contain the following files:\"\n  echo \" extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt\"\n  echo \"See http://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating for more info.\"\n  echo \"options: \"\n  echo \"<dict-src-dir> may also, for the grammar-decoding case (see http://kaldi-asr.org/doc/grammar.html)\"\n  echo \"contain a file nonterminals.txt containing symbols like #nonterm:contact_list, one per line.\"\n  echo \"     --num-sil-states <number of states>             # default: 5, #states in silence models.\"\n  echo \"     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.\"\n  echo \"     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I\"\n  echo \"                                                     # markers on phones to indicate word-internal positions. \"\n  echo \"     --share-silence-phones (true|false)             # default: false; if true, share pdfs of \"\n  echo \"                                                     # all silence phones. \"\n  echo \"     --sil-prob <probability of silence>             # default: 0.5 [must have 0 <= silprob < 1]\"\n  exit 1;\nfi\n\nsrcdir=$1\noov_word=$2\ntmpdir=$3\ndir=$4\n\n\nif [ -d $dir/phones ]; then\n  rm -r $dir/phones\nfi\nmkdir -p $dir $tmpdir $dir/phones\n\nsilprob=false\n[ -f $srcdir/lexiconp_silprob.txt ] && silprob=true\n\n[ -f path.sh ] && . ./path.sh\n\nif [[ ! -f $srcdir/lexicon.txt ]]; then\n  echo \"**Creating $srcdir/lexicon.txt from $srcdir/lexiconp.txt\"\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' < $srcdir/lexiconp.txt > $srcdir/lexicon.txt || exit 1;\nfi\nif [[ ! 
-f $srcdir/lexiconp.txt ]]; then\n  echo \"**Creating $srcdir/lexiconp.txt from $srcdir/lexicon.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1.0\\t$2/;' < $srcdir/lexicon.txt > $srcdir/lexiconp.txt || exit 1;\nfi\n\nif [ ! -z \"$unk_fst\" ] && [ ! -f \"$unk_fst\" ]; then\n  echo \"$0: expected --unk-fst $unk_fst to exist as a file\"\n  exit 1\nfi\n\nif $position_dependent_phones; then\n  # Create $tmpdir/lexiconp.txt from $srcdir/lexiconp.txt (or\n  # $tmpdir/lexiconp_silprob.txt from $srcdir/lexiconp_silprob.txt) by\n  # adding the markers _B, _E, _S, _I depending on word position.\n  # In this recipe, these markers apply to silence also.\n  # Do this starting from lexiconp.txt only.\n  if \"$silprob\"; then\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; $silword_p = shift @A;\n              $wordsil_f = shift @A; $wordnonsil_f = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_S\\n\"; }\n         else { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n                < $srcdir/lexiconp_silprob.txt > $tmpdir/lexiconp_silprob.txt\n  else\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $A[0]_S\\n\"; } else { print \"$w $p $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n         < $srcdir/lexiconp.txt > $tmpdir/lexiconp.txt || exit 1;\n  fi\n\n  # create $tmpdir/phone_map.txt\n  # this has the format (on each line)\n  # <original phone> <version 1 of original phone> <version 2> ...\n  # where the versions depend on the position of the phone within a word.\n  # For instance, we'd have:\n  # AA AA_B AA_E AA_I AA_S\n  # for (B)egin, (E)nd, (I)nternal and (S)ingleton\n  # and in the case of silence\n  # SIL SIL SIL_B SIL_E SIL_I SIL_S\n  # [because SIL on its own is one of the variants; this 
is for when it doesn't\n  #  occur inside a word but as an option in the lexicon.]\n\n  # This phone map expands the phone lists into all the word-position-dependent\n  # versions of the phone lists.\n  cat <(set -f; for x in `cat $srcdir/silence_phones.txt`; do for y in \"\" \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    <(set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do for y in \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    > $tmpdir/phone_map.txt\nelse\n  if \"$silprob\"; then\n    cp $srcdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob.txt\n  else\n    cp $srcdir/lexiconp.txt $tmpdir/lexiconp.txt\n  fi\n\n  cat $srcdir/silence_phones.txt $srcdir/nonsilence_phones.txt | \\\n    awk '{for(n=1;n<=NF;n++) print $n; }' > $tmpdir/phones\n  paste -d' ' $tmpdir/phones $tmpdir/phones > $tmpdir/phone_map.txt\nfi\n\n\n# Making monophone systems.\ncat $srcdir/silence_phones.txt | local/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/silence.txt\ncat $srcdir/nonsilence_phones.txt | local/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/nonsilence.txt\ncp $srcdir/optional_silence.txt $dir/phones/optional_silence.txt\n\n# if extra_questions.txt is empty, it's OK.\ncat $srcdir/extra_questions.txt 2>/dev/null | local/apply_map.pl $tmpdir/phone_map.txt \\\n  >$dir/phones/extra_questions.txt\n\n# Want extra questions about the word-start/word-end stuff. Make it separate for\n# silence and non-silence. 
Probably doesn't matter, as silence will rarely\n# be inside a word.\nif $position_dependent_phones; then\n  for suffix in _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\n  for suffix in \"\" _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/silence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\nfi\n\n# add_lex_disambig.pl is responsible for adding disambiguation symbols to\n# the lexicon, for telling us how many disambiguation symbols it used,\n# and also for modifying the unknown-word's pronunciation (if the\n# --unk-fst was provided) to the sequence \"#1 #2 #3\", and reserving those\n# disambig symbols for that purpose.\n# The #2 will later be replaced with the actual unk model.  The reason\n# for the #1 and the #3 is for disambiguation and also to keep the\n# FST compact.  If we didn't have the #1, we might have a different copy of\n# the unk-model FST, or at least some of its arcs, for each start-state from\n# which an <unk> transition comes (instead of per end-state, which is more compact);\n# and adding the #3 prevents us from potentially having 2 copies of the unk-model\n# FST due to the optional-silence [the last phone of any word gets 2 arcs].\nif [ ! 
-z \"$unk_fst\" ]; then  # if the --unk-fst option was provided...\n  if \"$silprob\"; then\n    local/lang/internal/modify_unk_pron.py $tmpdir/lexiconp_silprob.txt \"$oov_word\" || exit 1\n  else\n    local/lang/internal/modify_unk_pron.py $tmpdir/lexiconp.txt \"$oov_word\" || exit 1\n  fi\n  unk_opt=\"--first-allowed-disambig 4\"\nelse\n  unk_opt=\nfi\n\nif \"$silprob\"; then\n  ndisambig=$(local/add_lex_disambig.pl $unk_opt --pron-probs --sil-probs $tmpdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob_disambig.txt)\nelse\n  ndisambig=$(local/add_lex_disambig.pl $unk_opt --pron-probs $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt)\nfi\nndisambig=$[$ndisambig+$num_extra_phone_disambig_syms]; # add (at least) one disambig symbol for silence in lexicon FST.\necho $ndisambig > $tmpdir/lex_ndisambig\n\n# Format of lexiconp_disambig.txt:\n# !SIL\t1.0   SIL_S\n# <SPOKEN_NOISE>\t1.0   SPN_S #1\n# <UNK>\t1.0  SPN_S #2\n# <NOISE>\t1.0  NSN_S\n# !EXCLAMATION-POINT\t1.0  EH2_B K_I S_I K_I L_I AH0_I M_I EY1_I SH_I AH0_I N_I P_I OY2_I N_I T_E\n\n( for n in `seq 0 $ndisambig`; do echo '#'$n; done ) >$dir/phones/disambig.txt\n\n# Create phone symbol table.\necho \"<eps>\" | cat - $dir/phones/{silence,nonsilence,disambig}.txt | \\\n  awk '{n=NR-1; print $1, n;}' > $dir/phones.txt\n\n# Create a file that describes the word-boundary information for\n# each phone.  
5 categories.\nif $position_dependent_phones; then\n  cat $dir/phones/{silence,nonsilence}.txt | \\\n    awk '/_I$/{print $1, \"internal\"; next;} /_B$/{print $1, \"begin\"; next; }\n         /_S$/{print $1, \"singleton\"; next;} /_E$/{print $1, \"end\"; next; }\n         {print $1, \"nonword\";} ' > $dir/phones/word_boundary.txt\nelse\n  # word_boundary.txt might have been generated by another source\n  [ -f $srcdir/word_boundary.txt ] && cp $srcdir/word_boundary.txt $dir/phones/word_boundary.txt\nfi\n\n# Create word symbol table.\n# <s> and </s> are only needed due to the need to rescore lattices with\n# ConstArpaLm format language model. They do not normally appear in G.fst or\n# L.fst.\n\nif \"$silprob\"; then\n  # remove the silprob\n  cat $tmpdir/lexiconp_silprob.txt |\\\n    awk '{\n      for(i=1; i<=NF; i++) {\n        if(i!=3 && i!=4 && i!=5) printf(\"%s\\t\", $i); if(i==NF) print \"\";\n      }\n    }' > $tmpdir/lexiconp.txt\nfi\n\ncat $tmpdir/lexiconp.txt | awk '{print $1}' | sort | uniq  | awk '\n  BEGIN {\n    print \"<eps> 0\";\n  }\n  {\n    if ($1 == \"<s>\") {\n      print \"<s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    if ($1 == \"</s>\") {\n      print \"</s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    printf(\"%s %d\\n\", $1, NR);\n  }\n  END {\n    printf(\"#0 %d\\n\", NR+1);\n    printf(\"<s> %d\\n\", NR+2);\n    printf(\"</s> %d\\n\", NR+3);\n  }' > $dir/words.txt || exit 1;\n\n# format of $dir/words.txt:\n#<eps> 0\n#a 1\n#aa 2\n#aarvark 3\n#...\n\nsilphone=`cat $srcdir/optional_silence.txt` || exit 1;\n[ -z \"$silphone\" ] && \\\n  ( echo \"You have no optional-silence phone; it is required in the current scripts\"\n    echo \"but you may use the option --sil-prob 0.0 to stop it being used.\" ) && \\\n   exit 1;\n\ngrammar_opts=\n\n# Create the basic L.fst without disambiguation symbols, for use\n# in training.\n\nif $silprob; then\n  # Add silence probabilities (models the prob. 
of silence before and after each\n  # word).  On some setups this helps a bit.  See local/dict_dir_add_pronprobs.sh\n  # and where it's called in the example scripts (run.sh).\n  local/make_lexicon_fst_silprob.py $grammar_opts --sil-phone=$silphone \\\n    $tmpdir/lexiconp_silprob.txt $srcdir/silprob.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt  > $dir/L.fst.txt || exit 1;\n\n    # fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n    #   --keep_isymbols=false --keep_osymbols=false |   \\\n    # fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nelse\n  local/make_lexicon_fst.py $grammar_opts --sil-prob=$sil_prob --sil-phone=$silphone \\\n    $tmpdir/lexiconp.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt > $dir/L.fst.txt || exit 1;\n\n    # fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n    #   --keep_isymbols=false --keep_osymbols=false | \\\n    # fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nfi\n\n# The file oov.txt contains a word that we will map any OOVs to during\n# training.\necho \"$oov_word\" > $dir/oov.txt || exit 1;\ncat $dir/oov.txt | local/sym2int.pl $dir/words.txt >$dir/oov.int || exit 1;\n# integer version of oov symbol, used in some scripts.\n\n\n# the file wdisambig.txt contains a (line-by-line) list of the text-form of the\n# disambiguation symbols that are used in the grammar and passed through by the\n# lexicon.  
At this stage it's hardcoded as '#0', but we're laying the groundwork\n# for more generality (which probably would be added by another script).\n# wdisambig_words.int contains the corresponding list interpreted by the\n# symbol table words.txt, and wdisambig_phones.int contains the corresponding\n# list interpreted by the symbol table phones.txt.\necho '#0' >$dir/phones/wdisambig.txt\n\nwdisambig_phone=`local/sym2int.pl $dir/phones.txt <$dir/phones/wdisambig.txt`\nwdisambig_word=`local/sym2int.pl $dir/words.txt <$dir/phones/wdisambig.txt`\n\n# Create these lists of phones in colon-separated integer list form too,\n# for purposes of being given to programs as command-line options.\nfor f in silence nonsilence optional_silence disambig; do\n  local/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt >$dir/phones/$f.int\n  local/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt | \\\n   awk '{printf(\":%d\", $1);} END{printf \"\\n\"}' | sed s/:// > $dir/phones/$f.csl || exit 1;\ndone\n\nif [ -f $dir/phones/word_boundary.txt ]; then\n  local/sym2int.pl -f 1 $dir/phones.txt <$dir/phones/word_boundary.txt \\\n    > $dir/phones/word_boundary.int || exit 1;\nfi\n\nsilphonelist=`cat $dir/phones/silence.csl`\nnonsilphonelist=`cat $dir/phones/nonsilence.csl`\n\n# Create the lexicon FST with disambiguation symbols, and put it in lang_test.\n# There is an extra step where we create a loop to \"pass through\" the\n# disambiguation symbols from G.fst.\n\nif $silprob; then\n  local/make_lexicon_fst_silprob.py $grammar_opts \\\n    --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    $tmpdir/lexiconp_silprob_disambig.txt $srcdir/silprob.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt | \\\n    local/fstaddselfloops.pl $wdisambig_phone $wdisambig_word > $dir/L_disambig.fst.txt || exit 1;\nelse\n  local/make_lexicon_fst.py $grammar_opts \\\n    --sil-prob=$sil_prob --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    
$tmpdir/lexiconp_disambig.txt | \\\n    local/sym2int.pl -f 3 $dir/phones.txt | \\\n    local/sym2int.pl -f 4 $dir/words.txt | \\\n    local/fstaddselfloops.pl $wdisambig_phone $wdisambig_word > $dir/L_disambig.fst.txt || exit 1;\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/espnet_utils/k2/parse_options.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey);\n#                 Arnab Ghoshal, Karel Vesely\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Parse command-line options.\n# To be sourced by another script (as in \". parse_options.sh\").\n# Option format is: --option-name arg\n# and shell variable \"option_name\" gets set to value \"arg.\"\n# The exception is --help, which takes no arguments, but prints the\n# $help_message variable (if defined).\n\n\n###\n### The --config file options have lower priority to command line\n### options, so we need to import them first...\n###\n\n# Now import all the configs specified by command-line, in left-to-right order\nfor ((argpos=1; argpos<$#; argpos++)); do\n  if [ \"${!argpos}\" == \"--config\" ]; then\n    argpos_plus1=$((argpos+1))\n    config=${!argpos_plus1}\n    [ ! -r $config ] && echo \"$0: missing config '$config'\" && exit 1\n    . $config  # source the config file.\n  fi\ndone\n\n\n###\n### Now we process the command line options\n###\nwhile true; do\n  [ -z \"${1:-}\" ] && break;  # break if there are no arguments\n  case \"$1\" in\n    # If the enclosing script is called with --help option, print the help\n    # message and exit.  
Scripts should put help messages in $help_message\n    --help|-h) if [ -z \"$help_message\" ]; then echo \"No help found.\" 1>&2;\n      else printf \"$help_message\\n\" 1>&2 ; fi;\n      exit 0 ;;\n    --*=*) echo \"$0: options to scripts must be of the form --name value, got '$1'\"\n      exit 1 ;;\n    # If the first command-line argument begins with \"--\" (e.g. --foo-bar),\n    # then work out the variable name as $name, which will equal \"foo_bar\".\n    --*) name=`echo \"$1\" | sed s/^--// | sed s/-/_/g`;\n      # Next we test whether the variable in question is undefined-- if so it's\n      # an invalid option and we die.  Note: $0 evaluates to the name of the\n      # enclosing script.\n      # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar\n      # is undefined.  We then have to wrap this test inside \"eval\" because\n      # foo_bar is itself inside a variable ($name).\n      eval '[ -z \"${'$name'+xxx}\" ]' && echo \"$0: invalid option $1\" 1>&2 && exit 1;\n\n      oldval=\"`eval echo \\\\$$name`\";\n      # Work out whether we seem to be expecting a Boolean argument.\n      if [ \"$oldval\" == \"true\" ] || [ \"$oldval\" == \"false\" ]; then\n        was_bool=true;\n      else\n        was_bool=false;\n      fi\n\n      # Set the variable to the right value-- the escaped quotes make it work if\n      # the option had spaces, like --cmd \"queue.pl -sync y\"\n      eval $name=\\\"$2\\\";\n\n      # Check that Boolean-valued arguments are really Boolean.\n      if $was_bool && [[ \"$2\" != \"true\" && \"$2\" != \"false\" ]]; then\n        echo \"$0: expected \\\"true\\\" or \\\"false\\\": $1 $2\" 1>&2\n        exit 1;\n      fi\n      shift 2;\n      ;;\n  *) break;\n  esac\ndone\n\n\n# Check for an empty argument to the --cmd option, which can easily occur as a\n# result of scripting errors.\n[ ! 
-z \"${cmd+xxx}\" ] && [ -z \"$cmd\" ] && echo \"$0: empty argument to --cmd option\" 1>&2 && exit 1;\n\n\ntrue; # so this script returns exit code 0.\n"
  },
  {
    "path": "egs/espnet_utils/k2/sym2int.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n$ignore_oov = 0;\n\nfor($x = 0; $x < 2; $x++) {\n  if ($ARGV[0] eq \"--map-oov\") {\n    shift @ARGV;\n    $map_oov = shift @ARGV;\n    if ($map_oov eq \"-f\" || $map_oov =~ m/words\\.txt$/ || $map_oov eq \"\") {\n      # disallow '-f', the empty string and anything ending in words.txt as the\n      # OOV symbol because these are likely command-line errors.\n      die \"the --map-oov option requires an argument\";\n    }\n  }\n  if ($ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 
1:10 as a courtesy (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n}\n\n$symtab = shift @ARGV;\nif (!defined $symtab) {\n  print STDERR \"Usage: sym2int.pl [options] symtab [input transcriptions] > output transcriptions\\n\" .\n    \"options: [--map-oov <oov-symbol> ]  [-f <field-range> ]\\n\" .\n      \"note: <field-range> can look like 4-5, or 4-, or 5-, or 1.\\n\";\n}\nopen(F, \"<$symtab\") || die \"Error opening symbol table file $symtab\";\nwhile(<F>) {\n    @A = split(\" \", $_);\n    @A == 2 || die \"bad line in symbol table file: $_\";\n    $sym2int{$A[0]} = $A[1] + 0;\n}\n\nif (defined $map_oov && $map_oov !~ m/^\\d+$/) { # not numeric-> look it up\n  if (!defined $sym2int{$map_oov}) { die \"OOV symbol $map_oov not defined.\"; }\n  $map_oov = $sym2int{$map_oov};\n}\n\n$num_warning = 0;\n$max_warning = 20;\n\nwhile (<>) {\n  @A = split(\" \", $_);\n  @B = ();\n  for ($n = 0; $n < @A; $n++) {\n    $a = $A[$n];\n    if ( (!defined $field_begin || $n >= $field_begin)\n         && (!defined $field_end || $n <= $field_end)) {\n      $i = $sym2int{$a};\n      if (!defined ($i)) {\n        if (defined $map_oov) {\n          if ($num_warning++ < $max_warning) {\n            print STDERR \"sym2int.pl: replacing $a with $map_oov\\n\";\n            if ($num_warning == $max_warning) {\n              print STDERR \"sym2int.pl: not warning for OOVs any more times\\n\";\n            }\n          }\n          $i = $map_oov;\n        }\n      }\n      $a = $i;\n    }\n    push @B, $a;\n  }\n  print join(\" \", @B);\n  print \"\\n\";\n}\nif ($num_warning > 0) {\n  print STDERR \"** Replaced $num_warning instances of OOVs with $map_oov\\n\";\n}\n\nexit(0);\n"
  },
  {
    "path": "egs/espnet_utils/make_fbank.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# Begin configuration section.\nnj=4\nfs=none\nfmax=\nfmin=\nn_mels=80\nn_fft=1024\nn_shift=512\nwin_length=\nwindow=hann\nwrite_utt2num_frames=true\ncmd=run.pl\ncompress=true\nnormalize=16  # The bit-depth of the input wav files\nfiletype=mat # mat or hdf5\n# End configuration section.\n\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<fbank-dir>] ]\ne.g.: $0 data/train exp/make_fbank/train mfcc\nNote: <log-dir> defaults to <data-dir>/log, and <fbank-dir> defaults to <data-dir>/data\nOptions:\n  --nj <nj>                                        # number of parallel jobs\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\n  --filetype <mat|hdf5|sound.hdf5>                 # Specify the format of feats file\nEOF\n)\necho \"$0 $*\"  # Print the command line for logging\n\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=${data}/log\nfi\nif [ $# -ge 3 ]; then\n  fbankdir=$3\nelse\n  fbankdir=${data}/data\nfi\n\n# make $fbankdir an absolute pathname.\nfbankdir=$(perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' ${fbankdir} ${PWD})\n\n# use \"name\" as part of name of the archive.\nname=$(basename ${data})\n\nmkdir -p ${fbankdir} || exit 1;\nmkdir -p ${logdir} || exit 1;\n\nif [ -f ${data}/feats.scp ]; then\n  mkdir -p ${data}/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv ${data}/feats.scp ${data}/.backup\nfi\n\nscp=${data}/wav.scp\n\nutils/validate_data_dir.sh --no-text --no-feats ${data} || exit 1;\n\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"${split_scps} ${logdir}/wav.${n}.scp\"\ndone\n\nutils/split_scp.pl ${scp} ${split_scps} || exit 1;\n\nif 
${write_utt2num_frames}; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:${logdir}/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif [ \"${filetype}\" == hdf5 ]; then\n    ext=h5\nelse\n    ext=ark\nfi\n\nif [ -f ${data}/segments ]; then\n    echo \"$0 [info]: segments file exists: using that.\"\n    split_segments=\"\"\n    for n in $(seq ${nj}); do\n        split_segments=\"${split_segments} ${logdir}/segments.${n}\"\n    done\n\n    utils/split_scp.pl ${data}/segments ${split_segments}\n\n    ${cmd} JOB=1:${nj} ${logdir}/make_fbank_${name}.JOB.log \\\n        compute-fbank-feats.py \\\n            --fs ${fs} \\\n            --fmax ${fmax} \\\n            --fmin ${fmin} \\\n            --n_fft ${n_fft} \\\n            --n_shift ${n_shift} \\\n            --win_length ${win_length} \\\n            --window ${window} \\\n            --n_mels ${n_mels} \\\n            ${write_num_frames_opt} \\\n            --compress=${compress} \\\n            --filetype ${filetype} \\\n            --normalize ${normalize} \\\n            --segment=${logdir}/segments.JOB scp:${scp} \\\n            ark,scp:${fbankdir}/raw_fbank_${name}.JOB.${ext},${fbankdir}/raw_fbank_${name}.JOB.scp\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming pcm.scp indexed by utterance.\"\n  split_scps=\"\"\n  for n in $(seq ${nj}); do\n    split_scps=\"${split_scps} ${logdir}/wav.${n}.scp\"\n  done\n\n  utils/split_scp.pl ${scp} ${split_scps}\n\n  ${cmd} JOB=1:${nj} ${logdir}/make_fbank_${name}.JOB.log \\\n      compute-fbank-feats.py \\\n          --fs ${fs} \\\n          --fmax ${fmax} \\\n          --fmin ${fmin} \\\n          --n_fft ${n_fft} \\\n          --n_shift ${n_shift} \\\n          --win_length ${win_length} \\\n          --window ${window} \\\n          --n_mels ${n_mels} \\\n          ${write_num_frames_opt} \\\n          --compress=${compress} \\\n          --filetype ${filetype} \\\n          --normalize ${normalize} \\\n          
scp:${logdir}/wav.JOB.scp \\\n          ark,scp:${fbankdir}/raw_fbank_${name}.JOB.${ext},${fbankdir}/raw_fbank_${name}.JOB.scp\nfi\n\n\n# concatenate the .scp files together.\nfor n in $(seq ${nj}); do\n    cat ${fbankdir}/raw_fbank_${name}.${n}.scp || exit 1;\ndone > ${data}/feats.scp || exit 1\n\nif ${write_utt2num_frames}; then\n    for n in $(seq ${nj}); do\n        cat ${logdir}/utt2num_frames.${n} || exit 1;\n    done > ${data}/utt2num_frames || exit 1\n    rm ${logdir}/utt2num_frames.* 2>/dev/null\nfi\n\nrm -f ${logdir}/wav.*.scp ${logdir}/segments.* 2>/dev/null\n\n# Write the filetype, this will be used for data2json.sh\necho ${filetype} > ${data}/filetype\n\nnf=$(wc -l < ${data}/feats.scp)\nnu=$(wc -l < ${data}/wav.scp)\nif [ ${nf} -ne ${nu} ]; then\n    echo \"It seems not all of the feature files were successfully processed ($nf != $nu);\"\n    echo \"consider using utils/fix_data_dir.sh $data\"\nfi\n\necho \"Succeeded creating filterbank features for $name\"\n"
  },
  {
    "path": "egs/espnet_utils/make_pair_json.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nfrom io import open\nimport json\nimport logging\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Merge source and target data.json files into one json file.\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--src-json\", type=str, help=\"Json file for the source speaker\")\n    parser.add_argument(\n        \"--trg-json\",\n        type=str,\n        default=None,\n        help=\"Json file for the target speaker. If not specified, use source only.\",\n    )\n    parser.add_argument(\n        \"--num_utts\", default=-1, type=int, help=\"Number of utterances (take from head)\"\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=1, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--out\",\n        \"-O\",\n        type=str,\n        help=\"The output filename. 
\" \"If omitted, then output to sys.stdout\",\n    )\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    with open(args.src_json, \"rb\") as f:\n        src_json = json.load(f)[\"utts\"]\n    if args.trg_json:\n        with open(args.trg_json, \"rb\") as f:\n            trg_json = json.load(f)[\"utts\"]\n\n    # get source and target speaker\n    _ = list(src_json.keys())[0].split(\"_\")\n    srcspk = _[0]\n    if args.trg_json:\n        _ = list(trg_json.keys())[0].split(\"_\")\n        trgspk = _[0]\n\n    count = 0\n    data = {\"utts\": {}}\n    # (dirty) loop through input only because in/out should have same files\n    for k, v in src_json.items():\n        _ = k.split(\"_\")\n        number = \"_\".join(_[1:])\n\n        entry = {\"input\": src_json[srcspk + \"_\" + number][\"input\"]}\n\n        if args.trg_json:\n            entry[\"output\"] = trg_json[trgspk + \"_\" + number][\"input\"]\n            entry[\"output\"][0][\"name\"] = \"target1\"\n\n        data[\"utts\"][number] = entry\n        count += 1\n        if args.num_utts > 0 and count >= args.num_utts:\n            break\n\n    if args.out is None:\n        out = sys.stdout\n    else:\n        out = open(args.out, \"w\", encoding=\"utf-8\")\n\n    json.dump(\n        data,\n        out,\n        indent=4,\n        ensure_ascii=False,\n        separators=(\",\", \": \"),\n    )\n"
  },
  {
    "path": "egs/espnet_utils/make_stft.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# Begin configuration section.\nnj=4\nfs=none\nn_fft=1024\nn_shift=512\nwin_length=\nwindow=hann\nwrite_utt2num_frames=true\ncmd=run.pl\ncompress=true\nnormalize=16  # The bit-depth of the input wav files\nfiletype=mat # mat or hdf5\n# End configuration section.\n\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<stft-dir>] ]\ne.g.: $0 data/train exp/make_stft/train stft\nNote: <log-dir> defaults to <data-dir>/log, and <stft-dir> defaults to <data-dir>/data\nOptions:\n  --nj <nj>                                        # number of parallel jobs\n  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\n  --filetype <mat|hdf5|sound.hdf5>                 # Specify the format of feats file\nEOF\n)\necho \"$0 $*\"  # Print the command line for logging\n\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=${data}/log\nfi\nif [ $# -ge 3 ]; then\n  stftdir=$3\nelse\n  stftdir=${data}/data\nfi\n\n# make $stftdir an absolute pathname.\nstftdir=$(perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' ${stftdir} ${PWD})\n\n# use \"name\" as part of name of the archive.\nname=$(basename ${data})\n\nmkdir -p ${stftdir} || exit 1;\nmkdir -p ${logdir} || exit 1;\n\nif [ -f ${data}/feats.scp ]; then\n  mkdir -p ${data}/.backup\n  echo \"$0: moving ${data}/feats.scp to ${data}/.backup\"\n  mv ${data}/feats.scp ${data}/.backup\nfi\n\nscp=${data}/wav.scp\n\nutils/validate_data_dir.sh --no-text --no-feats ${data} || exit 1;\n\nif ${write_utt2num_frames}; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:${logdir}/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif [ \"${filetype}\" == hdf5 ]; then\n    
ext=h5\nelse\n    ext=ark\nfi\n\nif [ -f ${data}/segments ]; then\n    echo \"$0 [info]: segments file exists: using that.\"\n    split_segments=\"\"\n    for n in $(seq ${nj}); do\n        split_segments=\"${split_segments} ${logdir}/segments.${n}\"\n    done\n\n    utils/split_scp.pl ${data}/segments ${split_segments}\n\n    ${cmd} JOB=1:${nj} ${logdir}/make_stft_${name}.JOB.log \\\n        compute-stft-feats.py \\\n            --win_length ${win_length} \\\n            --n_fft ${n_fft} \\\n            --n_shift ${n_shift} \\\n            --window ${window} \\\n            ${write_num_frames_opt} \\\n            --compress=${compress} \\\n            --filetype ${filetype} \\\n            --normalize ${normalize} \\\n            --segment=${logdir}/segments.JOB scp:${scp} \\\n            ark,scp:${stftdir}/raw_stft_${name}.JOB.${ext},${stftdir}/raw_stft_${name}.JOB.scp\n\nelse\n    echo \"$0: [info]: no segments file exists: assuming pcm.scp indexed by utterance.\"\n    split_scps=\"\"\n    for n in $(seq ${nj}); do\n        split_scps=\"${split_scps} ${logdir}/wav.${n}.scp\"\n    done\n\n    utils/split_scp.pl ${scp} ${split_scps} || exit 1;\n\n${cmd} JOB=1:${nj} ${logdir}/make_stft_${name}.JOB.log \\\n    compute-stft-feats.py \\\n        --fs ${fs} \\\n        --win_length ${win_length} \\\n        --n_fft ${n_fft} \\\n        --n_shift ${n_shift} \\\n        --window ${window} \\\n        ${write_num_frames_opt} \\\n        --compress=${compress} \\\n        --filetype ${filetype} \\\n        --normalize ${normalize} \\\n        scp:${logdir}/wav.JOB.scp \\\n        ark,scp:${stftdir}/raw_stft_${name}.JOB.${ext},${stftdir}/raw_stft_${name}.JOB.scp\nfi\n\n# concatenate the .scp files together.\nfor n in $(seq ${nj}); do\n    cat ${stftdir}/raw_stft_${name}.${n}.scp || exit 1;\ndone > ${data}/feats.scp || exit 1\n\nif ${write_utt2num_frames}; then\n    for n in $(seq ${nj}); do\n        cat ${logdir}/utt2num_frames.${n} || exit 1;\n    done > 
${data}/utt2num_frames || exit 1\n    rm ${logdir}/utt2num_frames.* 2>/dev/null\nfi\n\nrm -f ${logdir}/wav.*.scp ${logdir}/segments.* 2>/dev/null\n\n# Write the filetype, this will be used for data2json.sh\necho ${filetype} > ${data}/filetype\n\nnf=$(wc -l < ${data}/feats.scp)\nnu=$(wc -l < ${data}/wav.scp)\nif [ ${nf} -ne ${nu} ]; then\n    echo \"It seems not all of the feature files were successfully processed ($nf != $nu);\"\n    echo \"consider using utils/fix_data_dir.sh $data\"\nfi\n\necho \"Succeeded creating STFT features for $name\"\n"
  },
  {
    "path": "egs/espnet_utils/mbr_analysis.py",
    "content": "# Author: Jinchuan Tian ; tianjinchuan@stu.pku.edu.cn ; tyriontian@tencent.com\n# This script provides:\n# (1) CER (2) Bayesian Risk and its variance\n\nimport sys\nimport json\nimport math\nimport editdistance\nimport numpy as np\n\ndef main():\n    # load json file\n    with open(sys.argv[1], \"rb\") as f:\n        results_json = json.load(f)[\"utts\"]\n\n    num_err, num_tot = 0, 0\n    risk_stat, sum_prob_stat, ref_prob_stat = [], [], []\n    for uttid, info in results_json.items():\n        try:\n            hypotheses = info[\"output\"]\n            ref_token = hypotheses[0][\"token\"]\n        \n            # hypotheses and their probabilities\n            texts, probs, find_ref = [], [], False\n            for h in hypotheses:\n                text = h[\"rec_token\"].replace(\"<eos>\", \"\").strip()\n                texts.append(text)\n                probs.append(math.exp(h[\"score\"]))\n                if ref_token == text:\n                    ref_prob_stat.append(math.exp(h[\"score\"]))\n                    find_ref = True\n    \n            if not find_ref:\n                ref_prob_stat.append(0.0)\n    \n            # find edit-distance over token sequences, so that the error\n            # count uses the same unit as num_tot below\n            edit_dists = [editdistance.eval(ref_token.split(), rec_token.split()) \\\n                          for rec_token in texts]\n    \n            # bayesian risk\n            weighted_probs = [a * b for a, b in zip(edit_dists, probs)]\n            risk = sum(weighted_probs) / (sum(probs) + 1e-10)\n            risk_stat.append(risk)\n    \n            # sum prob \n            sum_prob_stat.append(sum(probs))\n    \n            # cer statistics\n            num_err += edit_dists[0]\n            num_tot += len(ref_token.strip().split())\n        except Exception:\n            # skip utterances with missing or malformed entries\n            pass\n\n    # conclusion\n    print(\"### MBR statistics on {} ###\".format(sys.argv[1]))\n    cer = num_err / num_tot * 100\n    print(\"CER: {:.4f}% {}/{}\".format(cer, num_err, num_tot))\n    \n    br_mean, br_deviation = np.mean(risk_stat), 
np.sqrt(np.var(risk_stat))\n    print(\"Mean and Deviation of Bayesian Risk: {:.4f} | {:.4f}\".format(br_mean, br_deviation))\n\n    sum_prob_mean, sum_prob_deviation = np.mean(sum_prob_stat), np.sqrt(np.var(sum_prob_stat))\n    ref_prob_mean, ref_prob_deviation = np.mean(ref_prob_stat), np.sqrt(np.var(ref_prob_stat))\n    print(\"Mean and Deviation of Accumulated probability: {:.4f} | {:.4f}\".format(sum_prob_mean, sum_prob_deviation))\n    print(\"Mean and Deviation of Reference probability: {:.4f} | {:.4f}\".format(ref_prob_mean, ref_prob_deviation))\n\n\nif __name__ == \"__main__\":\n    main() \n"
  },
  {
    "path": "egs/espnet_utils/mcd_calculate.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# Calculate MCD using converted waveform.\n\nimport argparse\nimport fnmatch\nimport multiprocessing as mp\nimport os\n\nfrom fastdtw import fastdtw\nimport numpy as np\nimport pysptk\nimport pyworld as pw\nimport scipy\nfrom scipy.io import wavfile\nfrom scipy.signal import firwin\nfrom scipy.signal import lfilter\n\n\ndef find_files(root_dir, query=\"*.wav\", include_root_dir=True):\n    \"\"\"Find files recursively.\n\n    Args:\n        root_dir (str): Root root_dir to find.\n        query (str): Query to find.\n        include_root_dir (bool): If False, root_dir name is not included.\n\n    Returns:\n        list: List of found filenames.\n\n    \"\"\"\n    files = []\n    for root, dirnames, filenames in os.walk(root_dir, followlinks=True):\n        for filename in fnmatch.filter(filenames, query):\n            files.append(os.path.join(root, filename))\n    if not include_root_dir:\n        files = [file_.replace(root_dir + \"/\", \"\") for file_ in files]\n\n    return files\n\n\ndef low_cut_filter(x, fs, cutoff=70):\n    \"\"\"FUNCTION TO APPLY LOW CUT FILTER\n\n    Args:\n        x (ndarray): Waveform sequence\n        fs (int): Sampling frequency\n        cutoff (float): Cutoff frequency of low cut filter\n\n    Return:\n        (ndarray): Low cut filtered waveform sequence\n    \"\"\"\n\n    nyquist = fs // 2\n    norm_cutoff = cutoff / nyquist\n\n    # low cut filter\n    fil = firwin(255, norm_cutoff, pass_zero=False)\n    lcf_x = lfilter(fil, 1, x)\n\n    return lcf_x\n\n\ndef spc2npow(spectrogram):\n    \"\"\"Calculate normalized power sequence from spectrogram\n\n    Parameters\n    ----------\n    spectrogram : array, shape (T, `fftlen / 2 + 1`)\n        Array of spectrum envelope\n\n    Return\n    ------\n    npow : array, shape (`T`, `1`)\n        Normalized power 
sequence\n\n    \"\"\"\n\n    # frame based processing\n    npow = np.apply_along_axis(_spvec2pow, 1, spectrogram)\n\n    meanpow = np.mean(npow)\n    npow = 10.0 * np.log10(npow / meanpow)\n\n    return npow\n\n\ndef _spvec2pow(specvec):\n    \"\"\"Convert a spectrum envelope into a power\n\n    Parameters\n    ----------\n    specvec : vector, shape (`fftlen / 2 + 1`)\n        Vector of spectrum envelope |H(w)|^2\n\n    Return\n    ------\n    power : scalar,\n        Power of a frame\n\n    \"\"\"\n\n    # set FFT length\n    fftl2 = len(specvec) - 1\n    fftl = fftl2 * 2\n\n    # specvec is not the amplitude spectrum |H(w)| but the power spectrum |H(w)|^2\n    power = specvec[0] + specvec[fftl2]\n    for k in range(1, fftl2):\n        power += 2.0 * specvec[k]\n    power /= fftl\n\n    return power\n\n\ndef extfrm(data, npow, power_threshold=-20):\n    \"\"\"Extract frame over the power threshold\n\n    Parameters\n    ----------\n    data: array, shape (`T`, `dim`)\n        Array of input data\n    npow : array, shape (`T`)\n        Vector of normalized power sequence.\n    power_threshold : float, optional\n        Value of power threshold [dB]\n        Default set to -20\n\n    Returns\n    -------\n    data: array, shape (`T_ext`, `dim`)\n        Remaining data after extracting frame\n        `T_ext` <= `T`\n\n    \"\"\"\n\n    T = data.shape[0]\n    if T != len(npow):\n        raise ValueError(\"Length of two vectors is different.\")\n\n    valid_index = np.where(npow > power_threshold)\n    extdata = data[valid_index]\n    assert extdata.shape[0] <= T\n\n    return extdata\n\n\ndef world_extract(wav_path, args):\n    fs, x = wavfile.read(wav_path)\n    x = np.array(x, dtype=np.float64)\n    x = low_cut_filter(x, fs)\n\n    # extract features\n    f0, time_axis = pw.harvest(\n        x, fs, f0_floor=args.f0min, f0_ceil=args.f0max, frame_period=args.shiftms\n    )\n    sp = pw.cheaptrick(x, f0, time_axis, fs, fft_size=args.fftl)\n    ap = pw.d4c(x, f0, time_axis, fs, 
fft_size=args.fftl)\n    mcep = pysptk.sp2mc(sp, args.mcep_dim, args.mcep_alpha)\n    npow = spc2npow(sp)\n\n    return {\n        \"sp\": sp,\n        \"mcep\": mcep,\n        \"ap\": ap,\n        \"f0\": f0,\n        \"npow\": npow,\n    }\n\n\ndef get_basename(path):\n    return os.path.splitext(os.path.split(path)[-1])[0]\n\n\ndef calculate(file_list, gt_file_list, args, MCD):\n\n    for i, cvt_path in enumerate(file_list):\n        corresponding_list = list(\n            filter(lambda gt_path: get_basename(gt_path) in cvt_path, gt_file_list)\n        )\n        assert len(corresponding_list) == 1\n        gt_path = corresponding_list[0]\n        gt_basename = get_basename(gt_path)\n\n        # extract ground truth and converted features\n        gt_feats = world_extract(gt_path, args)\n        cvt_feats = world_extract(cvt_path, args)\n\n        # VAD & DTW based on power\n        gt_mcep_nonsil_pow = extfrm(gt_feats[\"mcep\"], gt_feats[\"npow\"])\n        cvt_mcep_nonsil_pow = extfrm(cvt_feats[\"mcep\"], cvt_feats[\"npow\"])\n        _, path = fastdtw(\n            cvt_mcep_nonsil_pow,\n            gt_mcep_nonsil_pow,\n            dist=scipy.spatial.distance.euclidean,\n        )\n        twf_pow = np.array(path).T\n\n        # MCD using power-based DTW\n        cvt_mcep_dtw_pow = cvt_mcep_nonsil_pow[twf_pow[0]]\n        gt_mcep_dtw_pow = gt_mcep_nonsil_pow[twf_pow[1]]\n        diff2sum = np.sum((cvt_mcep_dtw_pow - gt_mcep_dtw_pow) ** 2, 1)\n        mcd = np.mean(10.0 / np.log(10.0) * np.sqrt(2 * diff2sum), 0)\n\n        print(\"{} {}\".format(gt_basename, mcd))\n        MCD.append(mcd)\n\n\ndef get_parser():\n\n    parser = argparse.ArgumentParser(description=\"calculate MCD.\")\n    parser.add_argument(\n        \"--wavdir\",\n        required=True,\n        type=str,\n        help=\"path of directory for converted waveforms\",\n    )\n    parser.add_argument(\n        \"--gtwavdir\",\n        required=True,\n        type=str,\n        help=\"path of 
directory for ground truth waveforms\",\n    )\n\n    # analysis related\n    parser.add_argument(\n        \"--mcep_dim\", default=41, type=int, help=\"dimension of mel cepstrum coefficient\"\n    )\n    parser.add_argument(\n        \"--mcep_alpha\", default=0.41, type=float, help=\"all pass constant\"\n    )\n    parser.add_argument(\"--fftl\", default=1024, type=int, help=\"fft length\")\n    parser.add_argument(\"--shiftms\", default=5, type=int, help=\"frame shift (ms)\")\n    parser.add_argument(\n        \"--f0min\", required=True, type=int, help=\"f0 search range (min)\"\n    )\n    parser.add_argument(\n        \"--f0max\", required=True, type=int, help=\"f0 search range (max)\"\n    )\n\n    parser.add_argument(\n        \"--n_jobs\", default=40, type=int, help=\"number of parallel jobs\"\n    )\n    return parser\n\n\ndef main():\n    args = get_parser().parse_args()\n\n    # find files\n    converted_files = sorted(find_files(args.wavdir))\n    gt_files = sorted(find_files(args.gtwavdir))\n\n    # Get and divide list\n\n    print(\"number of utterances = %d\" % len(converted_files))\n    file_lists = np.array_split(converted_files, args.n_jobs)\n    file_lists = [f_list.tolist() for f_list in file_lists]\n\n    # multi processing\n    with mp.Manager() as manager:\n        MCD = manager.list()\n        processes = []\n        for f in file_lists:\n            p = mp.Process(target=calculate, args=(f, gt_files, args, MCD))\n            p.start()\n            processes.append(p)\n\n        # wait for all processes\n        for p in processes:\n            p.join()\n\n        mMCD = np.mean(np.array(MCD))\n        print(\"Mean MCD: {:.2f}\".format(mMCD))\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/merge_scp2json.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n\nimport argparse\nimport codecs\nfrom distutils.util import strtobool\nfrom io import open\nimport json\nimport logging\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\nPY2 = sys.version_info[0] == 2\nsys.stdin = codecs.getreader(\"utf-8\")(sys.stdin if PY2 else sys.stdin.buffer)\nsys.stdout = codecs.getwriter(\"utf-8\")(sys.stdout if PY2 else sys.stdout.buffer)\n\n\n# Special types:\ndef shape(x):\n    \"\"\"Change str to List[int]\n\n    >>> shape('3,5')\n    [3, 5]\n    >>> shape(' [3, 5] ')\n    [3, 5]\n\n    \"\"\"\n\n    # x: ' [3, 5] ' -> '3, 5'\n    x = x.strip()\n    if x[0] == \"[\":\n        x = x[1:]\n    if x[-1] == \"]\":\n        x = x[:-1]\n\n    return list(map(int, x.split(\",\")))\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Give each file path in the format \"\n        \"<key>:<file>:<type>. <type> can be omitted and the default \"\n        'is \"str\". e.g. 
{} '\n        \"--input-scps feat:data/feats.scp shape:data/utt2feat_shape:shape \"\n        \"--input-scps feat:data/feats2.scp shape:data/utt2feat2_shape:shape \"\n        \"--output-scps text:data/text shape:data/utt2text_shape:shape \"\n        \"--scps utt2spk:data/utt2spk\".format(sys.argv[0]),\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--input-scps\",\n        type=str,\n        nargs=\"*\",\n        action=\"append\",\n        default=[],\n        help=\"Json files for the inputs\",\n    )\n    parser.add_argument(\n        \"--output-scps\",\n        type=str,\n        nargs=\"*\",\n        action=\"append\",\n        default=[],\n        help=\"Json files for the outputs\",\n    )\n    parser.add_argument(\n        \"--scps\",\n        type=str,\n        nargs=\"+\",\n        default=[],\n        help=\"The json files except for the input and outputs\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=1, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--allow-one-column\",\n        type=strtobool,\n        default=False,\n        help=\"Allow one column in input scp files. \"\n        \"In this case, the value will be empty string.\",\n    )\n    parser.add_argument(\n        \"--out\",\n        \"-O\",\n        type=str,\n        help=\"The output filename. 
\" \"If omitted, then output to sys.stdout\",\n    )\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n    args.scps = [args.scps]\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    # List[List[Tuple[str, str, Callable[[str], Any], str, str]]]\n    input_infos = []\n    output_infos = []\n    infos = []\n    for lis_list, key_scps_list in [\n        (input_infos, args.input_scps),\n        (output_infos, args.output_scps),\n        (infos, args.scps),\n    ]:\n        for key_scps in key_scps_list:\n            lis = []\n            for key_scp in key_scps:\n                sps = key_scp.split(\":\")\n                if len(sps) == 2:\n                    key, scp = sps\n                    type_func = None\n                    type_func_str = \"none\"\n                elif len(sps) == 3:\n                    key, scp, type_func_str = sps\n                    fail = False\n\n                    try:\n                        # type_func: Callable[[str], Any]\n                        # e.g. type_func_str = \"int\" -> type_func = int\n                        type_func = eval(type_func_str)\n                    except Exception:\n                        raise RuntimeError(\"Unknown type: {}\".format(type_func_str))\n\n                    if not callable(type_func):\n                        raise RuntimeError(\"Unknown type: {}\".format(type_func_str))\n\n                else:\n                    raise RuntimeError(\n                        \"Format <key>:<filepath> \"\n                        \"or <key>:<filepath>:<type>  \"\n                        \"e.g. 
feat:data/feat.scp \"\n                        \"or shape:data/feat.scp:shape: {}\".format(key_scp)\n                    )\n\n                for item in lis:\n                    if key == item[0]:\n                        raise RuntimeError(\n                            'The key \"{}\" is duplicated: {} {}'.format(\n                                key, item[3], key_scp\n                            )\n                        )\n\n                lis.append((key, scp, type_func, key_scp, type_func_str))\n            lis_list.append(lis)\n\n    # Open  scp files\n    input_fscps = [\n        [open(i[1], \"r\", encoding=\"utf-8\") for i in il] for il in input_infos\n    ]\n    output_fscps = [\n        [open(i[1], \"r\", encoding=\"utf-8\") for i in il] for il in output_infos\n    ]\n    fscps = [[open(i[1], \"r\", encoding=\"utf-8\") for i in il] for il in infos]\n\n    # Note(kamo): What is done here?\n    # The final goal is creating a JSON file such as.\n    # {\n    #     \"utts\": {\n    #         \"sample_id1\": {(omitted)},\n    #         \"sample_id2\": {(omitted)},\n    #          ....\n    #     }\n    # }\n    #\n    # To reduce memory usage, reading the input text files for each lines\n    # and writing JSON elements per samples.\n    if args.out is None:\n        out = sys.stdout\n    else:\n        out = open(args.out, \"w\", encoding=\"utf-8\")\n    out.write('{\\n    \"utts\": {\\n')\n    nutt = 0\n    while True:\n        nutt += 1\n        # List[List[str]]\n        input_lines = [[f.readline() for f in fl] for fl in input_fscps]\n        output_lines = [[f.readline() for f in fl] for fl in output_fscps]\n        lines = [[f.readline() for f in fl] for fl in fscps]\n\n        # Get the first line\n        concat = sum(input_lines + output_lines + lines, [])\n        if len(concat) == 0:\n            break\n        first = concat[0]\n\n        # Sanity check: Must be sorted by the first column and have same keys\n        count = 0\n        for 
ls_list in (input_lines, output_lines, lines):\n            for ls in ls_list:\n                for line in ls:\n                    if line == \"\" or first == \"\":\n                        if line != first:\n                            concat = sum(input_infos + output_infos + infos, [])\n                            raise RuntimeError(\n                                \"The number of lines mismatch \"\n                                'between: \"{}\" and \"{}\"'.format(\n                                    concat[0][1], concat[count][1]\n                                )\n                            )\n\n                    elif line.split()[0] != first.split()[0]:\n                        concat = sum(input_infos + output_infos + infos, [])\n                        raise RuntimeError(\n                            \"The keys are mismatch at {}th line \"\n                            'between \"{}\" and \"{}\":\\n>>> {}\\n>>> {}'.format(\n                                nutt,\n                                concat[0][1],\n                                concat[count][1],\n                                first.rstrip(),\n                                line.rstrip(),\n                            )\n                        )\n                    count += 1\n\n        # The end of file\n        if first == \"\":\n            if nutt != 1:\n                out.write(\"\\n\")\n            break\n        if nutt != 1:\n            out.write(\",\\n\")\n\n        entry = {}\n        for inout, _lines, _infos in [\n            (\"input\", input_lines, input_infos),\n            (\"output\", output_lines, output_infos),\n            (\"other\", lines, infos),\n        ]:\n\n            lis = []\n            for idx, (line_list, info_list) in enumerate(zip(_lines, _infos), 1):\n                if inout == \"input\":\n                    d = {\"name\": \"input{}\".format(idx)}\n                elif inout == \"output\":\n                    d = {\"name\": 
\"target{}\".format(idx)}\n                else:\n                    d = {}\n\n                # info_list: List[Tuple[str, str, Callable]]\n                # line_list: List[str]\n                for line, info in zip(line_list, info_list):\n                    sps = line.split(None, 1)\n                    if len(sps) < 2:\n                        if not args.allow_one_column:\n                            raise RuntimeError(\n                                \"Format error {}th line in {}: \"\n                                ' Expecting \"<key> <value>\":\\n>>> {}'.format(\n                                    nutt, info[1], line\n                                )\n                            )\n                        uttid = sps[0]\n                        value = \"\"\n                    else:\n                        uttid, value = sps\n\n                    key = info[0]\n                    type_func = info[2]\n                    value = value.rstrip()\n\n                    if type_func is not None:\n                        try:\n                            # type_func: Callable[[str], Any]\n                            value = type_func(value)\n                        except Exception:\n                            logging.error(\n                                '\"{}\" is an invalid function '\n                                \"for the {} th line in {}: \\n>>> {}\".format(\n                                    info[4], nutt, info[1], line\n                                )\n                            )\n                            raise\n\n                    d[key] = value\n                lis.append(d)\n\n            if inout != \"other\":\n                entry[inout] = lis\n            else:\n                # If key == 'other'. 
only has the first item\n                entry.update(lis[0])\n\n        entry = json.dumps(\n            entry, indent=4, ensure_ascii=False, sort_keys=True, separators=(\",\", \": \")\n        )\n        # Add indent\n        indent = \"    \" * 2\n        entry = (\"\\n\" + indent).join(entry.split(\"\\n\"))\n\n        uttid = first.split()[0]\n        out.write('        \"{}\": {}'.format(uttid, entry))\n\n    out.write(\"    }\\n}\\n\")\n\n    # nutt also counted the final end-of-file iteration, so subtract one\n    logging.info(\"{} entries in {}\".format(nutt - 1, out.name))\n"
  },
  {
    "path": "egs/espnet_utils/mergejson.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport os\nimport sys\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"merge json files\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--input-jsons\",\n        type=str,\n        nargs=\"+\",\n        action=\"append\",\n        default=[],\n        help=\"Json files for the inputs\",\n    )\n    parser.add_argument(\n        \"--output-jsons\",\n        type=str,\n        nargs=\"+\",\n        action=\"append\",\n        default=[],\n        help=\"Json files for the outputs\",\n    )\n    parser.add_argument(\n        \"--jsons\",\n        type=str,\n        nargs=\"+\",\n        action=\"append\",\n        default=[],\n        help=\"The json files except for the input and outputs\",\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\"-O\", dest=\"output\", type=str, help=\"Output json file\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # logging info\n    logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    js_dict = {}  # Dict[str, List[List[Dict[str, Dict[str, dict]]]]]\n    # make intersection set for utterance keys\n    intersec_ks = None  # Set[str]\n    for jtype, jsons_list in [\n        (\"input\", args.input_jsons),\n        
(\"output\", args.output_jsons),\n        (\"other\", args.jsons),\n    ]:\n        js_dict[jtype] = []\n        for jsons in jsons_list:\n            js = []\n            for x in jsons:\n                if os.path.isfile(x):\n                    with codecs.open(x, encoding=\"utf-8\") as f:\n                        j = json.load(f)\n                    ks = list(j[\"utts\"].keys())\n                    logging.info(x + \": has \" + str(len(ks)) + \" utterances\")\n                    if intersec_ks is not None:\n                        intersec_ks = intersec_ks.intersection(set(ks))\n                        if len(intersec_ks) == 0:\n                            logging.warning(\"No intersection\")\n                            break\n                    else:\n                        intersec_ks = set(ks)\n                    js.append(j)\n            js_dict[jtype].append(js)\n    logging.info(\"new json has \" + str(len(intersec_ks)) + \" utterances\")\n\n    new_dic = {}\n    for k in intersec_ks:\n        new_dic[k] = {\"input\": [], \"output\": []}\n        for jtype in [\"input\", \"output\", \"other\"]:\n            for idx, js in enumerate(js_dict[jtype], 1):\n                # Merge dicts from jsons into a dict\n                dic = {k2: v for j in js for k2, v in j[\"utts\"][k].items()}\n\n                if jtype == \"other\":\n                    new_dic[k].update(dic)\n                else:\n                    _dic = {}\n\n                    # FIXME(kamo): ad-hoc way to change str to List[int]\n                    if jtype == \"input\":\n                        _dic[\"name\"] = \"input{}\".format(idx)\n                        if \"ilen\" in dic and \"idim\" in dic:\n                            _dic[\"shape\"] = (int(dic[\"ilen\"]), int(dic[\"idim\"]))\n                        elif \"ilen\" in dic:\n                            _dic[\"shape\"] = (int(dic[\"ilen\"]),)\n                        elif \"idim\" in dic:\n                            
_dic[\"shape\"] = (int(dic[\"idim\"]),)\n\n                    elif jtype == \"output\":\n                        _dic[\"name\"] = \"target{}\".format(idx)\n                        if \"olen\" in dic and \"odim\" in dic:\n                            _dic[\"shape\"] = (int(dic[\"olen\"]), int(dic[\"odim\"]))\n                        elif \"olen\" in dic:\n                            _dic[\"shape\"] = (int(dic[\"olen\"]),)\n                        elif \"odim\" in dic:\n                            _dic[\"shape\"] = (int(dic[\"odim\"]),)\n                    if \"shape\" in dic:\n                        # shape: \"80,1000\" -> [80, 1000]\n                        _dic[\"shape\"] = list(map(int, dic[\"shape\"].split(\",\")))\n\n                    for k2, v in dic.items():\n                        if k2 not in [\"ilen\", \"idim\", \"olen\", \"odim\", \"shape\"]:\n                            _dic[k2] = v\n                    new_dic[k][jtype].append(_dic)\n\n    # ensure \"ensure_ascii=False\", which is a bug\n    if args.output is not None:\n        sys.stdout = codecs.open(args.output, \"w\", encoding=\"utf-8\")\n    else:\n        sys.stdout = codecs.getwriter(\"utf-8\")(\n            sys.stdout if is_python2 else sys.stdout.buffer\n        )\n    print(\n        json.dumps(\n            {\"utts\": new_dic},\n            indent=4,\n            ensure_ascii=False,\n            sort_keys=True,\n            separators=(\",\", \": \"),\n        )\n    )\n"
  },
  {
    "path": "egs/espnet_utils/mix-mono-wav-scp.py",
    "content": "#!/usr/bin/env python3\nimport argparse\nimport io\nimport sys\n\nPY2 = sys.version_info[0] == 2\n\nif PY2:\n    from itertools import izip_longest as zip_longest\nelse:\n    from itertools import zip_longest\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        description=\"Mixing wav.scp files into a multi-channel wav.scp \" \"using sox.\",\n    )\n    parser.add_argument(\"scp\", type=str, nargs=\"+\", help=\"Give wav.scp\")\n    parser.add_argument(\n        \"out\",\n        nargs=\"?\",\n        type=argparse.FileType(\"w\"),\n        default=sys.stdout,\n        help=\"The output filename. \" \"If omitted, then output to sys.stdout\",\n    )\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    fscps = [io.open(scp, \"r\", encoding=\"utf-8\") for scp in args.scp]\n    for linenum, lines in enumerate(zip_longest(*fscps)):\n        keys = []\n        wavs = []\n\n        for line, scp in zip(lines, args.scp):\n            if line is None:\n                raise RuntimeError(\"Numbers of line mismatch\")\n\n            sps = line.split(\" \", 1)\n            if len(sps) != 2:\n                raise RuntimeError(\n                    'Invalid line is found: {}, line {}: \"{}\" '.format(\n                        scp, linenum, line\n                    )\n                )\n            key, wav = sps\n            keys.append(key)\n            wavs.append(wav.strip())\n\n        if not all(k == keys[0] for k in keys):\n            raise RuntimeError(\n                \"The ids mismatch. 
Hint: the input files must be \"\n                \"sorted and must have the same ids: {}\".format(keys)\n            )\n\n        args.out.write(\n            \"{} sox -M {} -c {} -t wav - |\\n\".format(\n                keys[0], \" \".join(\"{}\".format(w) for w in wavs), len(fscps)\n            )\n        )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/mmi_rescore.sh",
    "content": "decode_dir=$1\ndict=$2\n\nmkdir -p $decode_dir/rescore\ndir=$decode_dir/rescore\n\nmkdir -p $dir/best\n\nfor w in 0.05 0.1 0.2 0.3; do\n    mkdir -p $dir/$w\n    (python3 espnet_utils/rerank_mmi.py $decode_dir/data.json $w ${dir}/${w}/data.1.json\n    score_sclite.sh  --sppd3 true $dir/$w ${dict} > ${dir}/$w/decode_result.txt) &\ndone\nwait \n"
  },
  {
    "path": "egs/espnet_utils/pack_model.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2019 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n[ -f ./path.sh ] && . ./path.sh\n\nresults=\"\"\n# e.g., \"exp/tr_it_pytorch_train/decode_dt_it_decode/result.wrd.txt\n#        exp/tr_it_pytorch_train/decode_et_it_decode/result.wrd.txt\"'\nlm=\"\"\ndict=\"\"\netc=\"\"\noutfile=\"model\"\npreprocess_conf=\"\"\n\nhelp_message=$(cat <<EOF\nUsage: $0 --lm <lm> --dict <dict> <tr_conf> <dec_conf> <cmvn> <e2e>, for example:\n<lm>:       exp/train_rnnlm/rnnlm.model.best\n<dict>:     data/lang_char\n<tr_conf>:  conf/train.yaml\n<dec_conf>: conf/decode.yaml\n<cmvn>:     data/tr_it/cmvn.ark\n<e2e>:      exp/tr_it_pytorch_train/results/model.last10.avg.best\nEOF\n)\n. utils/parse_options.sh\n\nif [ $# != 4 ]; then\n    echo \"${help_message}\"\n    exit 1\nfi\n\ntr_conf=$1\ndec_conf=$2\ncmvn=$3\ne2e=$4\n\necho \"  - Model files (archived to ${outfile}.tar.gz by \\`\\$ pack_model.sh\\`)\"\necho \"    - model link: (put the model link manually. 
please contact Shinji Watanabe <shinjiw@ieee.org> if you want a web storage to put your files)\"\n\n# configs\nif [ -e ${tr_conf} ]; then\n    tar cfh ${outfile}.tar ${tr_conf}\n    echo -n \"    - training config file: \\`\"\n    echo ${tr_conf} | sed -e \"s/$/\\`/\"\nelse\n    echo \"missing ${tr_conf}\"\n    exit 1\nfi\nif [ -e ${dec_conf} ]; then\n    tar rfh ${outfile}.tar ${dec_conf}\n    echo -n \"    - decoding config file: \\`\"\n    echo ${dec_conf} | sed -e \"s/$/\\`/\"\nelse\n    echo \"missing ${dec_conf}\"\n    exit 1\nfi\n# NOTE(kan-bayashi): preprocess conf is optional\nif [ -n \"${preprocess_conf}\" ]; then\n    tar rfh ${outfile}.tar ${preprocess_conf}\n    echo -n \"    - preprocess config file: \\`\"\n    echo ${preprocess_conf} | sed -e \"s/$/\\`/\"\nfi\n\n# cmvn\nif [ -e ${cmvn} ]; then\n    tar rfh ${outfile}.tar ${cmvn}\n    echo -n \"    - cmvn file: \\`\"\n    echo ${cmvn} | sed -e \"s/$/\\`/\"\nelse\n    echo \"missing ${cmvn}\"\n    exit 1\nfi\n\n# e2e\nif [ -e ${e2e} ]; then\n    tar rfh ${outfile}.tar ${e2e}\n    echo -n \"    - e2e file: \\`\"\n    echo ${e2e} | sed -e \"s/$/\\`/\"\n\n    e2e_conf=$(dirname ${e2e})/model.json\n    if [ ! -e ${e2e_conf} ]; then\n\techo missing ${e2e_conf}\n\texit 1\n    else\n\techo -n \"    - e2e JSON file: \\`\"\n\techo ${e2e_conf} | sed -e \"s/$/\\`/\"\n\ttar rfh ${outfile}.tar ${e2e_conf}\n    fi\nelse\n    echo \"missing ${e2e}\"\n    exit 1\nfi\n\n# lm\nif [ -n \"${lm}\" ]; then\n    if [ -e ${lm} ]; then\n\ttar rfh ${outfile}.tar ${lm}\n\techo -n \"    - lm file: \\`\"\n\techo ${lm} | sed -e \"s/$/\\`/\"\n\n\tlm_conf=$(dirname ${lm})/model.json\n\tif [ ! 
-e ${lm_conf} ]; then\n\t    echo missing ${lm_conf}\n\t    exit 1\n\telse\n\t    echo -n \"    - lm JSON file: \\`\"\n\t    echo ${lm_conf} | sed -e \"s/$/\\`/\"\n\t    tar rfh ${outfile}.tar ${lm_conf}\n\tfi\n    else\n\techo \"missing ${lm}\"\n\texit 1\n    fi\nfi\n\n# dict\nif [ -n \"${dict}\" ]; then\n    if [ -e ${dict} ]; then\n\ttar rfh ${outfile}.tar ${dict}\n\techo -n \"    - dict file: \\`\"\n\techo ${dict} | sed -e \"s/$/\\`/\"\n    else\n\techo \"missing ${dict}\"\n\texit 1\n    fi\nfi\n\n# etc\nfor x in ${etc}; do\n    if [ -e ${x} ]; then\n\ttar rfh ${outfile}.tar ${x}\n\techo -n \"    - etc file: \\`\"\n\techo ${x} | sed -e \"s/$/\\`/\"\n    else\n\techo \"missing ${x}\"\n\texit 1\n    fi\ndone\n\n# finally compress the tar file\ngzip -f ${outfile}.tar\n\n# results\nif [ -n \"${results}\" ]; then\n    echo \"  - Results (paste them by yourself or obtained by \\`\\$ pack_model.sh --results <results>\\`)\"\n    echo \"\\`\\`\\`\"\nfi\nfor x in ${results}; do\n    if [ -e ${x} ]; then\n\techo \"${x}\"\n\tgrep -e Avg -e SPKR -m 2 ${x}\n    else\n\techo \"missing ${x}\"\n\texit 1\n    fi\ndone\nif [ -n \"${results}\" ]; then\n    echo \"\\`\\`\\`\"\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/espnet_utils/prepare_block_load.sh",
    "content": "num_split=32\nbpe_model=\n\n. utils/parse_options.sh\n\nkdir=$1 # kaldi dataset directory\nddir=$2 # dump directory\ndst=$3 # destination directory\ndict=$4\n\n# step 1: sort the scp in dumpdir according to the utt2num_frames\nmkdir -p $dst\ntmpdir=$dst/tmp; mkdir -p $tmpdir\npython3 espnet_utils/sort_scp_by_length.py $ddir/feats.scp $ddir/utt2num_frames \\\n                                   $tmpdir/feats.scp $tmpdir/utt2num_frames\n\n# step 2: split the feats.scp and utt2num_frames according to  `num_split`\nfeats_scps=\"\"\nfor idx in `seq 1 $num_split`; do\n   dir=$dst/$idx; mkdir -p $dir\n   feats_scps=\"$feats_scps $dir/feats.scp\"\ndone\npython3 espnet_utils/split_scp.py $tmpdir/feats.scp $feats_scps\n\nfor idx in `seq 1 $num_split`; do\n   dir=$dst/$idx;\n   python3 espnet_utils/filter_scp.py $dir/feats.scp \\\n       $tmpdir/utt2num_frames > $dir/utt2num_frames &\ndone\nwait\n\n# step 3: copy-feats\nfor idx in `seq 1 $num_split`; do\n   dir=$dst/$idx;\n   python3 espnet_utils/split_scp_fix_length.py $dir/feats.scp\n   nj=`ls $dir/feats.*.scp | wc -l`\n   ${decode_cmd} JOB=1:$nj $dir/copy_logs/copy_feat.JOB.log \\\n       copy-feats --compress=true --compression-method=2 \\\n           scp:$dir/feats.JOB.scp \\\n           ark,scp:$dir/feats.JOB.ark,$dir/feats_copy.JOB.scp\n   for j in `seq 1 $nj`; do\n       cat $dir/feats_copy.${j}.scp \n   done > $dir/feats.scp\ndone\n\n# step 4: filter the kaldi format data\nfor idx in `seq 1 $num_split`; do\n    dir=$dst/$idx;\n    mkdir -p $dir/kaldi_files\n    for f in text utt2spk spk2utt text_org; do\n        if [ -f  $kdir/$f ]; then\n        python3 espnet_utils/filter_scp.py $dir/feats.scp \\\n            $kdir/$f > $dir/kaldi_files/$f &\n        fi\n    done\n    wait\ndone\n\n# step 5: make json\nfor idx in `seq 1 $num_split`; do\n    dir=$dst/$idx;\n    \n    if [ -f $dir/kaldi_files/text_org ]; then\n        json_opts=\"--text_org $dir/kaldi_files/text_org\"\n    else\n        
json_opts=\"\"\n    fi\n\n    if [ ! -z $bpe_model ]; then\n        json_opts=\"$json_opts --bpecode $bpe_model\"\n    else\n        json_opts=\"$json_opts\"\n    fi\n\n    bash espnet_utils/data2json.sh $json_opts --feat $dir/feats.scp \\\n        $dir/kaldi_files $dict > $dir/data.json &\ndone\nwait\n"
  },
  {
    "path": "egs/espnet_utils/prepare_mer.py",
    "content": "import sys\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n\nin_f, chn_f, eng_f = sys.argv[1:4]\nchn_writer = open(chn_f, 'w', encoding=\"utf-8\")\neng_writer = open(eng_f, 'w', encoding=\"utf-8\")\n\nfor line in open(in_f, encoding=\"utf-8\"):\n    elems = line.strip().split()\n    uttid = elems[-1]\n    chn_buf, eng_buf = [], []\n    for c in elems[:-1]:\n        if is_all_chinese(c):\n            chn_buf.append(c)\n        else:\n            eng_buf.append(c)\n    \n    chn_str = \" \".join(chn_buf + [uttid]) + \"\\n\"\n    eng_str = \" \".join(eng_buf + [uttid]) + \"\\n\"\n\n    chn_writer.write(chn_str)\n    eng_writer.write(eng_str)\n\nchn_writer.close()\neng_writer.close()\n"
  },
  {
    "path": "egs/espnet_utils/queue-freegpu.pl",
    "content": "#!/usr/bin/env perl\nuse strict;\nuse warnings;\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2014  Vimal Manohar (Johns Hopkins University)\n# Apache 2.0.\n\nuse File::Basename;\nuse Cwd;\nuse Getopt::Long;\n\n# queue.pl has the same functionality as run.pl, except that\n# it runs the job in question on the queue (Sun GridEngine).\n# This version of queue.pl uses the task array functionality\n# of the grid engine.  Note: it's different from the queue.pl\n# in the s4 and earlier scripts.\n\n# The script now supports configuring the queue system using a config file\n# (default in conf/queue.conf; but can be passed specified with --config option)\n# and a set of command line options.\n# The current script handles:\n# 1) Normal configuration arguments\n# For e.g. a command line option of \"--gpu 1\" could be converted into the option\n# \"-q g.q -l gpu=1\" to qsub. How the CLI option is handled is determined by a\n# line in the config file like\n# gpu=* -q g.q -l gpu=$0\n# $0 here in the line is replaced with the argument read from the CLI and the\n# resulting string is passed to qsub.\n# 2) Special arguments to options such as\n# gpu=0\n# If --gpu 0 is given in the command line, then no special \"-q\" is given.\n# 3) Default argument\n# default gpu=0\n# If --gpu option is not passed in the command line, then the script behaves as\n# if --gpu 0 was passed since 0 is specified as the default argument for that\n# option\n# 4) Arbitrary options and arguments.\n# Any command line option starting with '--' and its argument would be handled\n# as long as its defined in the config file.\n# 5) Default behavior\n# If the config file that is passed using is not readable, then the script\n# behaves as if the queue has the following config file:\n# $ cat conf/queue.conf\n# # Default configuration\n# command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\n# option mem=* -l mem_free=$0,ram_free=$0\n# option mem=0          # Do 
not add anything to qsub_opts\n# option num_threads=* -pe smp $0\n# option num_threads=1  # Do not add anything to qsub_opts\n# option max_jobs_run=* -tc $0\n# default gpu=0\n# option gpu=0 -q all.q\n# option gpu=* -l gpu=$0 -q g.q\n\nmy $qsub_opts = \"\";\nmy $sync = 0;\nmy $num_threads = 1;\nmy $gpu = 0;\n\nmy $config = \"conf/queue.conf\";\n\nmy %cli_options = ();\n\nmy $jobname;\nmy $jobstart;\nmy $jobend;\nmy $array_job = 0;\nmy $sge_job_id;\n\nsub print_usage() {\n  print STDERR\n   \"Usage: queue.pl [options] [JOB=1:n] log-file command-line arguments...\\n\" .\n   \"e.g.: queue.pl foo.log echo baz\\n\" .\n   \" (which will echo \\\"baz\\\", with stdout and stderr directed to foo.log)\\n\" .\n   \"or: queue.pl -q all.q\\@xyz foo.log echo bar \\| sed s/bar/baz/ \\n\" .\n   \" (which is an example of using a pipe; you can provide other escaped bash constructs)\\n\" .\n   \"or: queue.pl -q all.q\\@qyz JOB=1:10 foo.JOB.log echo JOB \\n\" .\n   \" (which illustrates the mechanism to submit parallel jobs; note, you can use \\n\" .\n   \"  another string other than JOB)\\n\" .\n   \"Note: if you pass the \\\"-sync y\\\" option to qsub, this script will take note\\n\" .\n   \"and change its behavior.  Otherwise it uses qstat to work out when the job finished\\n\" .\n   \"Options:\\n\" .\n   \"  --config <config-file> (default: $config)\\n\" .\n   \"  --mem <mem-requirement> (e.g. 
--mem 2G, --mem 500M, \\n\" .\n   \"                           also support K and numbers mean bytes)\\n\" .\n   \"  --num-threads <num-threads> (default: $num_threads)\\n\" .\n   \"  --max-jobs-run <num-jobs>\\n\" .\n   \"  --gpu <0|1> (default: $gpu)\\n\";\n  exit 1;\n}\n\nsub caught_signal {\n  if ( defined $sge_job_id ) { # Signal trapped after submitting jobs\n    my $signal = $!;\n    system (\"qdel $sge_job_id\");\n    print STDERR \"Caught a signal: $signal , deleting SGE task: $sge_job_id and exiting\\n\";\n    exit(2);\n  }\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nfor (my $x = 1; $x <= 2; $x++) { # This for-loop is to\n  # allow the JOB=1:n option to be interleaved with the\n  # options to qsub.\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {\n    my $switch = shift @ARGV;\n\n    if ($switch eq \"-V\") {\n      $qsub_opts .= \"-V \";\n    } else {\n      my $argument = shift @ARGV;\n      if ($argument =~ m/^--/) {\n        print STDERR \"WARNING: suspicious argument '$argument' to $switch; starts with '-'\\n\";\n      }\n      if ($switch eq \"-sync\" && $argument =~ m/^[yY]/) {\n        $sync = 1;\n        $qsub_opts .= \"$switch $argument \";\n      } elsif ($switch eq \"-pe\") { # e.g. -pe smp 5\n        my $argument2 = shift @ARGV;\n        $qsub_opts .= \"$switch $argument $argument2 \";\n        $num_threads = $argument2;\n      } elsif ($switch =~ m/^--/) { # Config options\n        # Convert CLI option to variable name\n        # by removing '--' from the switch and replacing any\n        # '-' with a '_'\n        $switch =~ s/^--//;\n        $switch =~ s/-/_/g;\n        $cli_options{$switch} = $argument;\n      } else {  # Other qsub options - passed as is\n        $qsub_opts .= \"$switch $argument \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. 
JOB=1:20\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    shift;\n    if ($jobstart > $jobend) {\n      die \"queue.pl: invalid job range $ARGV[0]\";\n    }\n    if ($jobstart <= 0) {\n      die \"queue.pl: invalid job range $ARGV[0], start must be strictly positive (this is a GridEngine limitation).\";\n    }\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. JOB=1.\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"queue.pl: Warning: suspicious first argument to queue.pl: $ARGV[0]\\n\";\n  }\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nif (exists $cli_options{\"config\"}) {\n  $config = $cli_options{\"config\"};\n}\n\nmy $default_config_file = <<'EOF';\n# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l gpu=$0 -q '*.q'\nEOF\n\n# Here the configuration options specified by the user on the command line\n# (e.g. --mem 2G) are converted to options to the qsub system as defined in\n# the config file. (e.g. 
if the config file has the line\n# \"option mem=* -l ram_free=$0,mem_free=$0\"\n# and the user has specified '--mem 2G' on the command line, the options\n# passed to queue system would be \"-l ram_free=2G,mem_free=2G\n# A more detailed description of the ways the options would be handled is at\n# the top of this file.\n\n$SIG{INT} = \\&caught_signal;\n$SIG{TERM} = \\&caught_signal;\n\nmy $opened_config_file = 1;\n\nopen CONFIG, \"<$config\" or $opened_config_file = 0;\n\nmy %cli_config_options = ();\nmy %cli_default_options = ();\n\nif ($opened_config_file == 0 && exists($cli_options{\"config\"})) {\n  print STDERR \"Could not open config file $config\\n\";\n  exit(1);\n} elsif ($opened_config_file == 0 && !exists($cli_options{\"config\"})) {\n  # Open the default config file instead\n  open (CONFIG, \"echo '$default_config_file' |\") or die \"Unable to open pipe\\n\";\n  $config = \"Default config\";\n}\n\nmy $qsub_cmd = \"\";\nmy $read_command = 0;\n\nwhile(<CONFIG>) {\n  chomp;\n  my $line = $_;\n  $_ =~ s/\\s*#.*//g;\n  if ($_ eq \"\") { next; }\n  if ($_ =~ /^command (.+)/) {\n    $read_command = 1;\n    $qsub_cmd = $1 . \" \";\n  } elsif ($_ =~ m/^option ([^=]+)=\\* (.+)$/) {\n    # Config option that needs replacement with parameter value read from CLI\n    # e.g.: option mem=* -l mem_free=$0,ram_free=$0\n    my $option = $1;     # mem\n    my $arg= $2;         # -l mem_free=$0,ram_free=$0\n    if ($arg !~ m:\\$0:) {\n      die \"Unable to parse line '$line' in config file ($config)\\n\";\n    }\n    if (exists $cli_options{$option}) {\n      # Replace $0 with the argument read from command line.\n      # e.g. \"-l mem_free=$0,ram_free=$0\" -> \"-l mem_free=2G,ram_free=2G\"\n      $arg =~ s/\\$0/$cli_options{$option}/g;\n      $cli_config_options{$option} = $arg;\n    }\n  } elsif ($_ =~ m/^option ([^=]+)=(\\S+)\\s?(.*)$/) {\n    # Config option that does not need replacement\n    # e.g. 
option gpu=0 -q all.q\n    my $option = $1;      # gpu\n    my $value = $2;       # 0\n    my $arg = $3;         # -q all.q\n    if (exists $cli_options{$option}) {\n      $cli_default_options{($option,$value)} = $arg;\n    }\n  } elsif ($_ =~ m/^default (\\S+)=(\\S+)/) {\n    # Default options. Used for setting default values to options i.e. when\n    # the user does not specify the option on the command line\n    # e.g. default gpu=0\n    my $option = $1;  # gpu\n    my $value = $2;   # 0\n    if (!exists $cli_options{$option}) {\n      # If the user has specified this option on the command line, then we\n      # don't have to do anything\n      $cli_options{$option} = $value;\n    }\n  } else {\n    print STDERR \"queue.pl: unable to parse line '$line' in config file ($config)\\n\";\n    exit(1);\n  }\n}\n\nclose(CONFIG);\n\nif ($read_command != 1) {\n  print STDERR \"queue.pl: config file ($config) does not contain the line \\\"command .*\\\"\\n\";\n  exit(1);\n}\n\nfor my $option (keys %cli_options) {\n  if ($option eq \"config\") { next; }\n  if ($option eq \"max_jobs_run\" && $array_job != 1) { next; }\n  my $value = $cli_options{$option};\n\n  if (exists $cli_default_options{($option,$value)}) {\n    $qsub_opts .= \"$cli_default_options{($option,$value)} \";\n  } elsif (exists $cli_config_options{$option}) {\n    $qsub_opts .= \"$cli_config_options{$option} \";\n  } else {\n    if ($opened_config_file == 0) { $config = \"default config file\"; }\n    die \"queue.pl: Command line option $option not described in $config (or value '$value' not allowed)\\n\";\n  }\n}\n\nmy $cwd = getcwd();\nmy $logfile = shift @ARGV;\n\nif ($array_job == 1 && $logfile !~ m/$jobname/\n    && $jobend > $jobstart) {\n  print STDERR \"queue.pl: you are trying to run a parallel job but \"\n    . 
\"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n#\n# Work out the command; quote escaping is done here.\n# Note: the rules for escaping stuff are worked out pretty\n# arbitrarily, based on what we want it to do.  Some things that\n# we pass as arguments to queue.pl, such as \"|\", we want to be\n# interpreted by bash, so we don't escape them.  Other things,\n# such as archive specifiers like 'ark:gunzip -c foo.gz|', we want\n# to be passed, in quotes, to the Kaldi program.  Our heuristic\n# is that stuff with spaces in should be quoted.  This doesn't\n# always work.\n#\nmy $cmd = \"\";\n\nforeach my $x (@ARGV) {\n  if ($x =~ m/^\\S+$/) { $cmd .= $x . \" \"; } # If string contains no spaces, take\n                                            # as-is.\n  elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; } # else if it has dbl-quotes, use single\n  else { $cmd .= \"\\\"$x\\\" \"; }  # else use double.\n}\n\n#\n# Work out the location of the script file, and open it for writing.\n#\nmy $dir = dirname($logfile);\nmy $base = basename($logfile);\nmy $qdir = \"$dir/q\";\n$qdir =~ s:/(log|LOG)/*q:/q:; # If qdir ends in .../log/q, make it just .../q.\nmy $queue_logfile = \"$qdir/$base\";\n\nif (!-d $dir) { system \"mkdir -p $dir 2>/dev/null\"; } # another job may be doing this...\nif (!-d $dir) { die \"Cannot make the directory $dir\\n\"; }\n# make a directory called \"q\",\n# where we will put the log created by qsub... normally this doesn't contain\n# anything interesting, everything goes to $logfile.\n# in $qdir/sync we'll put the done.* files... we try to keep this\n# directory small because it's transmitted over NFS many times.\nif (! -d \"$qdir/sync\") {\n  system \"mkdir -p $qdir/sync 2>/dev/null\";\n  sleep(5); ## This is to fix an issue we encountered in denominator lattice creation,\n  ## where if e.g. 
the exp/tri2b_denlats/log/15/q directory had just been\n  ## created and the job immediately ran, it would die with an error because nfs\n  ## had not yet synced.  I'm also decreasing the acdirmin and acdirmax in our\n  ## NFS settings to something like 5 seconds.\n}\n\nmy $queue_array_opt = \"\";\nif ($array_job == 1) { # It's an array job.\n  $queue_array_opt = \"-t $jobstart:$jobend\";\n  $logfile =~ s/$jobname/\\$SGE_TASK_ID/g; # This variable will get\n  # replaced by qsub, in each job, with the job-id.\n  $cmd =~ s/$jobname/\\$\\{SGE_TASK_ID\\}/g; # same for the command...\n  $queue_logfile =~ s/\\.?$jobname//; # the log file in the q/ subdirectory\n  # is for the queue to put its log, and this doesn't need the task array subscript\n  # so we remove it.\n}\n\n# queue_scriptfile is as $queue_logfile [e.g. dir/q/foo.log] but\n# with the suffix .sh.\nmy $queue_scriptfile = $queue_logfile;\n($queue_scriptfile =~ s/\\.[a-zA-Z]{1,5}$/.sh/) || ($queue_scriptfile .= \".sh\");\nif ($queue_scriptfile !~ m:^/:) {\n  $queue_scriptfile = $cwd . \"/\" . $queue_scriptfile; # just in case.\n}\n\n# We'll write to the standard input of \"qsub\" (the file-handle Q),\n# the job that we want it to execute.\n# Also keep our current PATH around, just in case there was something\n# in it that we need (although we also source ./path.sh)\n\nmy $syncfile = \"$qdir/sync/done.$$\";\n\nunlink($queue_logfile, $syncfile);\n#\n# Write to the script file, and then close it.\n#\nopen(Q, \">$queue_scriptfile\") || die \"Failed to write to $queue_scriptfile\";\n\nprint Q \"#!/usr/bin/env bash\\n\";\nprint Q \"cd $cwd\\n\";\nprint Q \". 
./path.sh\\n\";\nprint Q \"( echo '#' Running on \\`hostname\\`\\n\";\nprint Q \"  echo '#' Started at \\`date\\`\\n\";\nprint Q \"  echo -n '# '; cat <<EOF\\n\";\nprint Q \"$cmd\\n\"; # this is a way of echoing the command into a comment in the log file,\nprint Q \"EOF\\n\"; # without having to escape things like \"|\" and quote characters.\nprint Q \") >$logfile\\n\";\nprint Q \"if ! which free-gpu.sh &> /dev/null; then\\n\";\nprint Q \"   echo 'command not found: free-gpu.sh not found.'\\n\";\nprint Q \"   exit 1\\n\";\nprint Q \"fi\\n\";\nprint Q \"gpuid=\\$(free-gpu.sh -n $cli_options{'gpu'})\\n\";\nprint Q \"if [[ \\${gpuid} == -1 ]]; then\\n\";\nprint Q \"   echo 'Failed to find enough free GPUs: $cli_options{'gpu'}'\\n\";\nprint Q \"   exit 1\\n\";\nprint Q \"fi\\n\";\nprint Q \"echo \\\"free gpu: \\${gpuid}\\\" >>$logfile\\n\";\nprint Q \"export CUDA_VISIBLE_DEVICES=\\${gpuid}\\n\";\nprint Q \"time1=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \" ( $cmd ) 2>>$logfile >>$logfile\\n\";\nprint Q \"ret=\\$?\\n\";\nprint Q \"time2=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \"echo '#' Accounting: time=\\$((\\$time2-\\$time1)) threads=$num_threads >>$logfile\\n\";\nprint Q \"echo '#' Finished at \\`date\\` with status \\$ret >>$logfile\\n\";\nprint Q \"[ \\$ret -eq 137 ] && exit 100;\\n\"; # If process was killed (e.g. oom) it will exit with status 137;\n  # let the script return with status 100 which will put it to E state; more easily rerunnable.\nif ($array_job == 0) { # not an array job\n  print Q \"touch $syncfile\\n\"; # so we know it's done.\n} else {\n  print Q \"touch $syncfile.\\$SGE_TASK_ID\\n\"; # touch a bunch of sync-files.\n}\nprint Q \"exit \\$[\\$ret ? 1 : 0]\\n\"; # avoid status 100 which grid-engine\nprint Q \"## submitted with:\\n\";       # treats specially.\n$qsub_cmd .= \"-o $queue_logfile $qsub_opts $queue_array_opt $queue_scriptfile >>$queue_logfile 2>&1\";\nprint Q \"# $qsub_cmd\\n\";\nif (!close(Q)) { # close was not successful... 
|| die \"Could not close script file $shfile\";\n  die \"Failed to close the script file (full disk?)\";\n}\nchmod 0755, $queue_scriptfile;\n\n# This block submits the job to the queue.\nfor (my $try = 1; $try < 5; $try++) {\n  my $ret = system ($qsub_cmd);\n  if ($ret != 0) {\n    if ($sync && $ret == 256) { # this is the exit status when a job failed (bad exit status)\n      if (defined $jobname) {\n        $logfile =~ s/\\$SGE_TASK_ID/*/g;\n      }\n      print STDERR \"queue.pl: job writing to $logfile failed\\n\";\n      exit(1);\n    } else {\n      print STDERR \"queue.pl: Error submitting jobs to queue (return status was $ret)\\n\";\n      print STDERR \"queue log file is $queue_logfile, command was $qsub_cmd\\n\";\n      my $err = `tail $queue_logfile`;\n      print STDERR \"Output of qsub was: $err\\n\";\n      if ($err =~ m/gdi request/ || $err =~ m/qmaster/) {\n        # When we get queue connectivity problems we usually see a message like:\n        # Unable to run job: failed receiving gdi request response for mid=1 (got\n        # syncron message receive timeout error)..\n        my $waitfor = 20;\n        print STDERR \"queue.pl: It looks like the queue master may be inaccessible. \" .\n          \" Trying again after $waitfor seconds\\n\";\n        sleep($waitfor);\n        # ... and continue through the loop.\n      } else {\n        exit(1);\n      }\n    }\n  } else {\n    last;  # break from the loop.\n  }\n}\n\nif (! $sync) { # We're not submitting with -sync y, so we\n  # need to wait for the jobs to finish.  
We wait for the\n  # sync-files we \"touched\" in the script to exist.\n  my @syncfiles = ();\n  if (!defined $jobname) { # not an array job.\n    push @syncfiles, $syncfile;\n  } else {\n    for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n      push @syncfiles, \"$syncfile.$jobid\";\n    }\n  }\n  # We will need the sge_job_id, to check that job still exists\n  { # This block extracts the numeric SGE job-id from the log file in q/.\n    # It may be used later to query 'qstat' about the job.\n    open(L, \"<$queue_logfile\") || die \"Error opening log file $queue_logfile\";\n    undef $sge_job_id;\n    while (<L>) {\n      if (m/Your job\\S* (\\d+)[. ].+ has been submitted/) {\n        if (defined $sge_job_id) {\n          die \"Error: your job was submitted more than once (see $queue_logfile)\";\n        } else {\n          $sge_job_id = $1;\n        }\n      }\n    }\n    close(L);\n    if (!defined $sge_job_id) {\n      die \"Error: log file $queue_logfile does not specify the SGE job-id.\";\n    }\n  }\n  my $check_sge_job_ctr=1;\n\n  my $wait = 0.1;\n  my $counter = 0;\n  foreach my $f (@syncfiles) {\n    # wait for the jobs to finish one by one.\n    while (! -f $f) {\n      sleep($wait);\n      $wait *= 1.2;\n      if ($wait > 3.0) {\n        $wait = 3.0; # never wait more than 3 seconds.\n        # the following (.kick) commands are basically workarounds for NFS bugs.\n        if (rand() < 0.25) { # don't do this every time...\n          if (rand() > 0.5) {\n            system(\"touch $qdir/sync/.kick\");\n          } else {\n            unlink(\"$qdir/sync/.kick\");\n          }\n        }\n        if ($counter++ % 10 == 0) {\n          # This seems to kick NFS in the teeth to cause it to refresh the\n          # directory.  
I've seen cases where it would indefinitely fail to get\n          # updated, even though the file exists on the server.\n          # Only do this every 10 waits (every 30 seconds) though, or if there\n          # are many jobs waiting they can overwhelm the file server.\n          system(\"ls $qdir/sync >/dev/null\");\n        }\n      }\n\n      # The purpose of the next block is so that queue.pl can exit if the job\n      # was killed without terminating.  It's a bit complicated because (a) we\n      # don't want to overload the qmaster by querying it too frequently), and\n      # (b) sometimes the qmaster is unreachable or temporarily down, and we\n      # don't want this to necessarily kill the job.\n      if (($check_sge_job_ctr < 100 && ($check_sge_job_ctr++ % 10) == 0) ||\n          ($check_sge_job_ctr >= 100 && ($check_sge_job_ctr++ % 50) == 0)) {\n        # Don't run qstat too often, avoid stress on SGE; the if-condition above\n        # is designed to check every 10 waits at first, and eventually every 50\n        # waits.\n        if ( -f $f ) { next; }  #syncfile appeared: OK.\n        my $output = `qstat -j $sge_job_id 2>&1`;\n        my $ret = $?;\n        if ($ret >> 8 == 1 && $output !~ m/qmaster/ &&\n            $output !~ m/gdi request/) {\n          # Don't consider immediately missing job as error, first wait some\n          # time to make sure it is not just delayed creation of the syncfile.\n\n          sleep(3);\n          # Sometimes NFS gets confused and thinks it's transmitted the directory\n          # but it hasn't, due to timestamp issues.  
Changing something in the\n          # directory will usually fix that.\n          system(\"touch $qdir/sync/.kick\");\n          unlink(\"$qdir/sync/.kick\");\n          if ( -f $f ) { next; }   #syncfile appeared, ok\n          sleep(7);\n          system(\"touch $qdir/sync/.kick\");\n          sleep(1);\n          unlink(\"$qdir/sync/.kick\");\n          if ( -f $f ) {  next; }   #syncfile appeared, ok\n          sleep(60);\n          system(\"touch $qdir/sync/.kick\");\n          sleep(1);\n          unlink(\"$qdir/sync/.kick\");\n          if ( -f $f ) { next; }  #syncfile appeared, ok\n          $f =~ m/\\.(\\d+)$/ || die \"Bad sync-file name $f\";\n          my $job_id = $1;\n          if (defined $jobname) {\n            $logfile =~ s/\\$SGE_TASK_ID/$job_id/g;\n          }\n          my $last_line = `tail -n 1 $logfile`;\n          if ($last_line =~ m/status 0$/ && (-M $logfile) < 0) {\n            # if the last line of $logfile ended with \"status 0\" and\n            # $logfile is newer than this program [(-M $logfile) gives the\n            # time elapsed between file modification and the start of this\n            # program], then we assume the program really finished OK,\n            # and maybe something is up with the file system.\n            print STDERR \"**queue.pl: syncfile $f was not created but job seems\\n\" .\n              \"**to have finished OK.  Probably your file-system has problems.\\n\" .\n              \"**This is just a warning.\\n\";\n            last;\n          } else {\n            chop $last_line;\n            print STDERR \"queue.pl: Error, unfinished job no \" .\n              \"longer exists, log is in $logfile, last line is '$last_line', \" .\n              \"syncfile is $f, return status of qstat was $ret\\n\" .\n              \"Possible reasons: a) Exceeded time limit? -> Use more jobs!\" .\n              \" b) Shutdown/Frozen machine? -> Run again!  
Qmaster output \" .\n              \"was: $output\\n\";\n            exit(1);\n          }\n        } elsif ($ret != 0) {\n          print STDERR \"queue.pl: Warning: qstat command returned status $ret (qstat -j $sge_job_id,$!)\\n\";\n          print STDERR \"queue.pl: output was: $output\";\n        }\n      }\n    }\n  }\n  unlink(@syncfiles);\n}\n\n# OK, at this point we are synced; we know the job is done.\n# But we don't know about its exit status.  We'll look at $logfile for this.\n# First work out an array @logfiles of file-locations we need to\n# read (just one, unless it's an array job).\nmy @logfiles = ();\nif (!defined $jobname) { # not an array job.\n  push @logfiles, $logfile;\n} else {\n  for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n    my $l = $logfile;\n    $l =~ s/\\$SGE_TASK_ID/$jobid/g;\n    push @logfiles, $l;\n  }\n}\n\nmy $num_failed = 0;\nmy $status = 1;\nforeach my $l (@logfiles) {\n  my @wait_times = (0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 1.0, 2.0, 5.0, 5.0, 5.0, 10.0, 25.0);\n  for (my $iter = 0; $iter <= @wait_times; $iter++) {\n    my $line = `tail -10 $l 2>/dev/null`; # Note: although this line should be the last\n    # line of the file, I've seen cases where it was not quite the last line because\n    # of delayed output by the process that was running, or processes it had called.\n    # so tail -10 gives it a little leeway.\n    if ($line =~ m/with status (\\d+)/) {\n      $status = $1;\n      last;\n    } else {\n      if ($iter < @wait_times) {\n        sleep($wait_times[$iter]);\n      } else {\n        if (! -f $l) {\n          print STDERR \"Log-file $l does not exist.\\n\";\n        } else {\n          print STDERR \"The last line of log-file $l does not seem to indicate the \"\n            . 
\"return status as expected\\n\";\n        }\n        exit(1);                # Something went wrong with the queue, or the\n        # machine it was running on, probably.\n      }\n    }\n  }\n  # OK, now we have $status, which is the return-status of\n  # the command in the job.\n  if ($status != 0) { $num_failed++; }\n}\nif ($num_failed == 0) { exit(0); }\nelse { # we failed.\n  if (@logfiles == 1) {\n    if (defined $jobname) { $logfile =~ s/\\$SGE_TASK_ID/$jobstart/g; }\n    print STDERR \"queue.pl: job failed with status $status, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"queue.pl: probably you forgot to put JOB=1:\\$nj in your script.\\n\";\n    }\n  } else {\n    if (defined $jobname) { $logfile =~ s/\\$SGE_TASK_ID/*/g; }\n    my $numjobs = 1 + $jobend - $jobstart;\n    print STDERR \"queue.pl: $num_failed / $numjobs failed, log is in $logfile\\n\";\n  }\n  exit(1);\n}\n"
  },
  {
    "path": "egs/espnet_utils/recog_wav.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2019 Nagoya University (Takenori Yoshimura)\n#           2019 RevComm Inc. (Takekatsu Hiramura)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nif [ ! -f path.sh ] || [ ! -f cmd.sh ]; then\n    echo \"Please change current directory to recipe directory e.g., egs/tedlium2/asr1\"\n    exit 1\nfi\n\n. ./path.sh\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=0         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\nverbose=1      # verbose option\n\n# feature configuration\ndo_delta=false\ncmvn=\n\n# rnnlm related\nuse_lang_model=true\nlang_model=\n\n# decoding parameter\nrecog_model=\ndecode_config=\ndecode_dir=decode\napi=v2\n\n# download related\nmodels=tedlium2.transformer.v1\n\nhelp_message=$(cat <<EOF\nUsage:\n    $0 [options] <wav_file>\n\nOptions:\n    --backend <chainer|pytorch>     # chainer or pytorch (Default: pytorch)\n    --ngpu <ngpu>                   # Number of GPUs (Default: 0)\n    --decode_dir <directory_name>   # Name of directory to store decoding temporary data\n    --models <model_name>           # Model name (e.g. 
tedlium2.transformer.v1)\n    --cmvn <path>                   # Location of cmvn.ark\n    --lang_model <path>             # Location of language model\n    --recog_model <path>            # Location of E2E model\n    --decode_config <path>          # Location of configuration file\n    --api <api_version>             # API version (v1 or v2, available in only pytorch backend)\n\nExample:\n    # Record audio from microphone input as example.wav\n    rec -c 1 -r 16000 example.wav trim 0 5\n\n    # Decode using model name\n    $0 --models tedlium2.transformer.v1 example.wav\n\n    # Decode with streaming mode (only RNN with API v1 is supported)\n    $0 --models tedlium2.rnn.v2 --api v1 example.wav\n\n    # Decode using model file\n    $0 --cmvn cmvn.ark --lang_model rnnlm.model.best --recog_model model.acc.best --decode_config conf/decode.yaml example.wav\n\n    # Decode with GPU (require batchsize > 0 in configuration file)\n    $0 --ngpu 1 example.wav\n\nAvailable models:\n    - tedlium2.rnn.v1\n    - tedlium2.rnn.v2\n    - tedlium2.transformer.v1\n    - tedlium3.transformer.v1\n    - librispeech.transformer.v1\n    - librispeech.transformer.v1.transformerlm.v1\n    - commonvoice.transformer.v1\n    - csj.transformer.v1\nEOF\n)\n. utils/parse_options.sh || exit 1;\n\n# make shellcheck happy\ntrain_cmd=\ndecode_cmd=\n\n. ./cmd.sh\n\nwav=$1\ndownload_dir=${decode_dir}/download\n\nif [ $# -lt 1 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -e\nset -u\nset -o pipefail\n\n# check api version\nif [ \"${api}\" = \"v2\" ] && [ \"${backend}\" = \"chainer\" ]; then\n    echo \"chainer backend does not support api v2.\" >&2\n    exit 1;\nfi\n\n# Check model name or model file is set\nif [ -z $models ]; then\n    if [ $use_lang_model = \"true\" ]; then\n        if [[ -z $cmvn || -z $lang_model || -z $recog_model || -z $decode_config ]]; then\n            echo 'Error: models or set of cmvn, lang_model, recog_model and decode_config are required.' 
>&2\n            exit 1\n        fi\n    else\n        if [[ -z $cmvn || -z $recog_model || -z $decode_config ]]; then\n            echo 'Error: models or set of cmvn, recog_model and decode_config are required.' >&2\n            exit 1\n        fi\n    fi\nfi\n\ndir=${download_dir}/${models}\nmkdir -p ${dir}\n\nfunction download_models () {\n    if [ -z $models ]; then\n        return\n    fi\n\n    file_ext=\"tar.gz\"\n    case \"${models}\" in\n        \"tedlium2.rnn.v1\") share_url=\"https://drive.google.com/open?id=1UqIY6WJMZ4sxNxSugUqp3mrGb3j6h7xe\"; api=v1 ;;\n        \"tedlium2.rnn.v2\") share_url=\"https://drive.google.com/open?id=1cac5Uc09lJrCYfWkLQsF8eapQcxZnYdf\"; api=v1 ;;\n        \"tedlium2.transformer.v1\") share_url=\"https://drive.google.com/open?id=1cVeSOYY1twOfL9Gns7Z3ZDnkrJqNwPow\" ;;\n        \"tedlium3.transformer.v1\") share_url=\"https://drive.google.com/open?id=1zcPglHAKILwVgfACoMWWERiyIquzSYuU\" ;;\n        \"librispeech.transformer.v1\") share_url=\"https://drive.google.com/open?id=1BtQvAnsFvVi-dp_qsaFP7n4A_5cwnlR6\" ;;\n        \"librispeech.transformer.v1.transformerlm.v1\") share_url=\"https://drive.google.com/open?id=17cOOSHHMKI82e1MXj4r2ig8gpGCRmG2p\" ;;\n        \"commonvoice.transformer.v1\") share_url=\"https://drive.google.com/open?id=1tWccl6aYU67kbtkm8jv5H6xayqg1rzjh\" ;;\n        \"csj.transformer.v1\") share_url=\"https://drive.google.com/open?id=120nUQcSsKeY5dpyMWw_kI33ooMRGT2uF\" ;;\n        *) echo \"No such models: ${models}\"; exit 1 ;;\n    esac\n\n    if [ ! 
-e ${dir}/.complete ]; then\n        download_from_google_drive.sh ${share_url} ${dir} ${file_ext}\n        touch ${dir}/.complete\n    fi\n}\n\n# Download trained models\nif [ -z \"${cmvn}\" ]; then\n    download_models\n    cmvn=$(find ${download_dir}/${models} -name \"cmvn.ark\" | head -n 1)\nfi\nif [ -z \"${lang_model}\" ] && ${use_lang_model}; then\n    download_models\n    lang_model=$(find ${download_dir}/${models} -name \"rnnlm*.best*\" | head -n 1)\nfi\nif [ -z \"${recog_model}\" ]; then\n    download_models\n    recog_model=$(find ${download_dir}/${models} -name \"model*.best*\" | head -n 1)\nfi\nif [ -z \"${decode_config}\" ]; then\n    download_models\n    decode_config=$(find ${download_dir}/${models} -name \"decode*.yaml\" | head -n 1)\nfi\nif [ -z \"${wav}\" ]; then\n    download_models\n    wav=$(find ${download_dir}/${models} -name \"*.wav\" | head -n 1)\nfi\n\n# Check file existence\nif [ ! -f \"${cmvn}\" ]; then\n    echo \"No such CMVN file: ${cmvn}\"\n    exit 1\nfi\nif [ ! -f \"${lang_model}\" ] && ${use_lang_model}; then\n    echo \"No such language model: ${lang_model}\"\n    exit 1\nfi\nif [ ! -f \"${recog_model}\" ]; then\n    echo \"No such E2E model: ${recog_model}\"\n    exit 1\nfi\nif [ ! -f \"${decode_config}\" ]; then\n    echo \"No such config file: ${decode_config}\"\n    exit 1\nfi\nif [ ! 
-f \"${wav}\" ]; then\n    echo \"No such WAV file: ${wav}\"\n    exit 1\nfi\n\nbase=$(basename $wav .wav)\ndecode_dir=${decode_dir}/${base}\n\nif [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n    echo \"stage 0: Data preparation\"\n\n    mkdir -p ${decode_dir}/data\n    echo \"$base sox $wav -R -r 16000 -c 1 -b 16 -t wav - dither |\" > ${decode_dir}/data/wav.scp\n    echo \"X $base\" > ${decode_dir}/data/spk2utt\n    echo \"$base X\" > ${decode_dir}/data/utt2spk\n    echo \"$base X\" > ${decode_dir}/data/text\nfi\n\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Feature Generation\"\n\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 1 --write_utt2num_frames true \\\n        ${decode_dir}/data ${decode_dir}/log ${decode_dir}/fbank\n\n    feat_recog_dir=${decode_dir}/dump; mkdir -p ${feat_recog_dir}\n    dump.sh --cmd \"$train_cmd\" --nj 1 --do_delta ${do_delta} \\\n        ${decode_dir}/data/feats.scp ${cmvn} ${decode_dir}/log \\\n        ${feat_recog_dir}\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Json Data Preparation\"\n\n    dict=${decode_dir}/dict\n    echo \"<unk> 1\" > ${dict}\n    feat_recog_dir=${decode_dir}/dump\n    data2json.sh --feat ${feat_recog_dir}/feats.scp \\\n        ${decode_dir}/data ${dict} > ${feat_recog_dir}/data.json\n    rm -f ${dict}\nfi\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: Decoding\"\n    if ${use_lang_model}; then\n        recog_opts=\"--rnnlm ${lang_model}\"\n    else\n        recog_opts=\"\"\n    fi\n    feat_recog_dir=${decode_dir}/dump\n\n    ${decode_cmd} ${decode_dir}/log/decode.log \\\n        asr_recog.py \\\n        --config ${decode_config} \\\n        --ngpu ${ngpu} \\\n        --backend ${backend} \\\n        --debugmode ${debugmode} \\\n        --verbose ${verbose} \\\n        --recog-json ${feat_recog_dir}/data.json \\\n        --result-label ${decode_dir}/result.json \\\n        --model 
${recog_model} \\\n        --api ${api} \\\n        ${recog_opts}\n\n    echo \"\"\n    recog_text=$(grep rec_text ${decode_dir}/result.json | sed -e 's/.*: \"\\(.*\\)\".*/\\1/' | sed -e 's/<eos>//')\n    echo \"Recognized text: ${recog_text}\"\n    echo \"\"\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/espnet_utils/reduce_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# koried, 10/29/2012\n\n# Reduce a data set based on a list of turn-ids\n\nhelp_message=\"usage: $0 srcdir turnlist destdir\"\n\nif [ $1 == \"--help\" ]; then\n    echo \"${help_message}\"\n    exit 0;\nfi\n\nif [ $# != 3 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nsrcdir=$1\nreclist=$2\ndestdir=$3\n\nif [ ! -f ${srcdir}/utt2spk ]; then\necho \"$0: no such file $srcdir/utt2spk\"\nexit 1;\nfi\n\nfunction do_filtering {\n# assumes the utt2spk and spk2utt files already exist.\n\t[ -f ${srcdir}/feats.scp ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/feats.scp >${destdir}/feats.scp\n\t[ -f ${srcdir}/wav.scp ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/wav.scp >${destdir}/wav.scp\n\t[ -f ${srcdir}/text ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/text >${destdir}/text\n\t[ -f ${srcdir}/utt2num_frames ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/utt2num_frames >${destdir}/utt2num_frames\n\t[ -f ${srcdir}/spk2gender ] && utils/filter_scp.pl ${destdir}/spk2utt <${srcdir}/spk2gender >${destdir}/spk2gender\n\t[ -f ${srcdir}/cmvn.scp ] && utils/filter_scp.pl ${destdir}/spk2utt <${srcdir}/cmvn.scp >${destdir}/cmvn.scp\n\tif [ -f ${srcdir}/segments ]; then\n\t\tutils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/segments >${destdir}/segments\n\t\tawk '{print $2;}' ${destdir}/segments | sort | uniq > ${destdir}/reco # recordings.\n\t\t# The next line would override the command above for wav.scp, which would be incorrect.\n\t\t[ -f ${srcdir}/wav.scp ] && utils/filter_scp.pl ${destdir}/reco <${srcdir}/wav.scp >${destdir}/wav.scp\n\t\t[ -f ${srcdir}/reco2file_and_channel ] && \\\n\t\t\tutils/filter_scp.pl ${destdir}/reco <${srcdir}/reco2file_and_channel >${destdir}/reco2file_and_channel\n\t\t\n\t\t# Filter the STM file for proper sclite scoring (this will also remove the comments lines)\n\t\t[ -f ${srcdir}/stm ] && utils/filter_scp.pl ${destdir}/reco < ${srcdir}/stm > ${destdir}/stm\n\t\trm 
${destdir}/reco\n\tfi\n\tsrcutts=$(wc -l < ${srcdir}/utt2spk)\n\tdestutts=$(wc -l < ${destdir}/utt2spk)\n\techo \"Reduced #utt from $srcutts to $destutts\"\n}\n\nmkdir -p ${destdir}\n\n# filter the utt2spk based on the set of recordings\nutils/filter_scp.pl ${reclist} < ${srcdir}/utt2spk > ${destdir}/utt2spk\n\nutils/utt2spk_to_spk2utt.pl < ${destdir}/utt2spk > ${destdir}/spk2utt\ndo_filtering;\n"
  },
  {
    "path": "egs/espnet_utils/remove_longshortdata.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n. ./path.sh\n\nmaxframes=2000\nminframes=10\nmaxchars=200\nminchars=0\nnlsyms=\"\"\nno_feat=false\ntrans_type=char\n\nhelp_message=\"usage: $0 olddatadir newdatadir\"\n\n. utils/parse_options.sh || exit 1;\n\nif [ $# != 2 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nsdir=$1\nodir=$2\nmkdir -p ${odir}/tmp\n\nif [ ${no_feat} = true ]; then\n    # for machine translation\n    cut -d' ' -f 1 ${sdir}/text > ${odir}/tmp/reclist1\nelse\n    echo \"extract utterances having fewer than $maxframes and more than $minframes frames\"\n    utils/data/get_utt2num_frames.sh ${sdir}\n    < ${sdir}/utt2num_frames  awk -v maxframes=\"$maxframes\" '{ if ($2 < maxframes) print }' \\\n        | awk -v minframes=\"$minframes\" '{ if ($2 > minframes) print }' \\\n        | awk '{print $1}' > ${odir}/tmp/reclist1\nfi\n\necho \"extract utterances having fewer than $maxchars and more than $minchars characters\"\n# counting number of chars. Use (NF - 1) instead of NF to exclude the utterance ID column\nif [ -z \"${nlsyms}\" ]; then\ntext2token.py -s 1 -n 1 ${sdir}/text --trans_type ${trans_type} \\\n    | awk -v maxchars=\"$maxchars\" '{ if (NF - 1 < maxchars) print }' \\\n    | awk -v minchars=\"$minchars\" '{ if (NF - 1 > minchars) print }' \\\n    | awk '{print $1}' > ${odir}/tmp/reclist2\nelse\ntext2token.py -l ${nlsyms} -s 1 -n 1 ${sdir}/text --trans_type ${trans_type} \\\n    | awk -v maxchars=\"$maxchars\" '{ if (NF - 1 < maxchars) print }' \\\n    | awk -v minchars=\"$minchars\" '{ if (NF - 1 > minchars) print }' \\\n    | awk '{print $1}' > ${odir}/tmp/reclist2\nfi\n\n# extract common lines\ncomm -12 <(sort ${odir}/tmp/reclist1) <(sort ${odir}/tmp/reclist2) > ${odir}/tmp/reclist\n\nreduce_data_dir.sh ${sdir} ${odir}/tmp/reclist ${odir}\nutils/fix_data_dir.sh ${odir}\n\noldnum=$(wc -l ${sdir}/feats.scp | awk '{print $1}')\nnewnum=$(wc -l ${odir}/feats.scp | awk '{print $1}')\necho \"change from $oldnum to $newnum\"\n"
  },
  {
    "path": "egs/espnet_utils/remove_punctuation.pl",
    "content": "#!/usr/bin/perl\n\nuse warnings;\nuse strict;\n\nbinmode(STDIN,\":utf8\");\nbinmode(STDOUT,\":utf8\");\n\nwhile(<STDIN>) {\n  $_ = \" $_ \";\n\n  # remove punctuation except apostrophe\n  s/<space>/spacemark/g;  # for scoring\n  s/'/apostrophe/g;\n  s/[[:punct:]]//g;\n  s/apostrophe/'/g;\n  s/spacemark/<space>/g;  # for scoring\n\n  # remove whitespace\n  s/\\s+/ /g;\n  s/^\\s+//;\n  s/\\s+$//;\n\n  print \"$_\\n\";\n}\n"
  },
  {
    "path": "egs/espnet_utils/rerank_mmi.py",
    "content": "import sys\nimport json\nimport codecs\n\n\n# Usage: rerank_mmi.py <input-json> <mmi-weight> <output-json>\n# Rescore each n-best hypothesis with its MMI score and re-sort the list.\njson_f = sys.argv[1]\nweight = float(sys.argv[2])\njson_f_out = sys.argv[3]\n\nwith codecs.open(json_f, \"r\", encoding=\"utf-8\") as f:\n    j = json.load(f)\n\nfor name in j[\"utts\"]:\n    hyp_lst = j[\"utts\"][name][\"output\"]\n    for hyp in hyp_lst:\n        # combined score = original score + weight * MMI total score\n        hyp[\"score\"] = float(hyp[\"score\"]) + float(hyp[\"mmi_tot_score\"]) * weight\n    # highest-scoring hypothesis first\n    hyp_lst.sort(key=lambda hyp: hyp[\"score\"], reverse=True)\n    j[\"utts\"][name][\"output\"] = hyp_lst\n\nwith open(json_f_out, \"wb\") as f:\n    f.write(\n        json.dumps(\n            j, indent=4, ensure_ascii=False, sort_keys=True\n        ).encode(\"utf_8\")\n    )\n"
  },
  {
    "path": "egs/espnet_utils/result2json.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#           2018 Xuankai Chang (Shanghai Jiao Tong University)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport json\nimport re\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert sclite's result.txt file to json\"\n    )\n    parser.add_argument(\"--key\", \"-k\", type=str, help=\"key\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n\n    key = re.findall(r\"r\\d+h\\d+\", args.key)[0]\n\n    re_id = r\"^id: \"\n    re_strings = {\n        \"Speaker\": r\"^Speaker sentences\",\n        \"Scores\": r\"^Scores: \",\n        \"REF\": r\"^REF: \",\n        \"HYP\": r\"^HYP: \",\n    }\n    re_id = re.compile(re_id)\n    re_patterns = {}\n    for p in re_strings.keys():\n        re_patterns[p] = re.compile(re_strings[p])\n\n    ret = {}\n    tmp_id = None\n    tmp_ret = {}\n\n    sys.stdin = codecs.getreader(\"utf-8\")(sys.stdin if is_python2 else sys.stdin.buffer)\n    sys.stdout = codecs.getwriter(\"utf-8\")(\n        sys.stdout if is_python2 else sys.stdout.buffer\n    )\n    line = sys.stdin.readline()\n    while line:\n        x = line.rstrip()\n        x_split = x.split()\n\n        if re_id.match(x):\n            if tmp_id:\n                ret[tmp_id] = {key: tmp_ret}\n                tmp_ret = {}\n            tmp_id = x_split[1]\n        for p in re_patterns.keys():\n            if re_patterns[p].match(x):\n                tmp_ret[p] = \" \".join(x_split[1:])\n        line = sys.stdin.readline()\n\n    if tmp_ret != {}:\n        ret[tmp_id] = {key: tmp_ret}\n\n    all_l = {\"utts\": ret}\n    # ensure \"ensure_ascii=False\", which is a bug\n    jsonstring = json.dumps(\n        all_l, indent=4, ensure_ascii=False, sort_keys=True, 
separators=(\",\", \": \")\n    )\n    print(jsonstring)\n"
  },
  {
    "path": "egs/espnet_utils/score_bleu.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nexport LC_ALL=C\n\n. ./path.sh\n\nnlsyms=\"\"\nbpe=\"\"\nbpemodel=\"\"\nfilter=\"\"\ncase=lc\nset=\"\"\nremove_nonverbal=true\n\n. utils/parse_options.sh\n\nif [ $# -lt 3 ]; then\n    echo \"Usage: $0 <decode-dir> <tgt_lang> <dict-tgt> <dict-src>\";\n    exit 1;\nfi\n\ndir=$1\ntgt_lang=$2\ndic_tgt=$3\ndic_src=$4\n\nconcatjson.py ${dir}/data.*.json > ${dir}/data.json\njson2trn_mt.py ${dir}/data.json ${dic_tgt} --refs ${dir}/ref.trn.org \\\n    --hyps ${dir}/hyp.trn.org --srcs ${dir}/src.trn.org --dict-src ${dic_src}\n\n# remove utterance id\nperl -pe 's/\\([^\\)]+\\)\\n/\\n/g;' ${dir}/ref.trn.org > ${dir}/ref.trn\nperl -pe 's/\\([^\\)]+\\)\\n/\\n/g;' ${dir}/hyp.trn.org > ${dir}/hyp.trn\nperl -pe 's/\\([^\\)]+\\)\\n/\\n/g;' ${dir}/src.trn.org > ${dir}/src.trn\n\n# remove non-verbal labels (optional)\nperl -pe 's/\\([^\\)]+\\)//g;' ${dir}/ref.trn > ${dir}/ref.rm.trn\nperl -pe 's/\\([^\\)]+\\)//g;' ${dir}/hyp.trn > ${dir}/hyp.rm.trn\nperl -pe 's/\\([^\\)]+\\)//g;' ${dir}/src.trn > ${dir}/src.rm.trn\n\nif [ -n \"$bpe\" ]; then\n    if ${remove_nonverbal}; then\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/ref.rm.trn | sed -e \"s/▁/ /g\" > ${dir}/ref.wrd.trn\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/hyp.rm.trn | sed -e \"s/▁/ /g\" > ${dir}/hyp.wrd.trn\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/src.rm.trn | sed -e \"s/▁/ /g\" > ${dir}/src.wrd.trn\n    else\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/ref.trn | sed -e \"s/▁/ /g\" > ${dir}/ref.wrd.trn\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/hyp.trn | sed -e \"s/▁/ /g\" > ${dir}/hyp.wrd.trn\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/src.trn | sed -e \"s/▁/ /g\" > ${dir}/src.wrd.trn\n    fi\nelse\n    if ${remove_nonverbal}; then\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" -e \"s/>/> /g\" ${dir}/ref.rm.trn > ${dir}/ref.wrd.trn\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" -e \"s/>/> /g\" ${dir}/hyp.rm.trn > ${dir}/hyp.wrd.trn\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" -e \"s/>/> /g\" ${dir}/src.rm.trn > ${dir}/src.wrd.trn\n    else\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" -e \"s/>/> /g\" ${dir}/ref.trn > ${dir}/ref.wrd.trn\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" -e \"s/>/> /g\" ${dir}/hyp.trn > ${dir}/hyp.wrd.trn\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" -e \"s/>/> /g\" ${dir}/src.trn > ${dir}/src.wrd.trn\n    fi\nfi\n\n# detokenize\ndetokenizer.perl -l ${tgt_lang} -q < ${dir}/ref.wrd.trn > ${dir}/ref.wrd.trn.detok\ndetokenizer.perl -l ${tgt_lang} -q < ${dir}/hyp.wrd.trn > ${dir}/hyp.wrd.trn.detok\ndetokenizer.perl -l ${tgt_lang} -q < ${dir}/src.wrd.trn > ${dir}/src.wrd.trn.detok\n\n# remove language IDs\nif [ -n \"${nlsyms}\" ]; then\n    cp ${dir}/ref.wrd.trn.detok ${dir}/ref.wrd.trn.detok.tmp\n    cp ${dir}/hyp.wrd.trn.detok ${dir}/hyp.wrd.trn.detok.tmp\n    cp ${dir}/src.wrd.trn.detok ${dir}/src.wrd.trn.detok.tmp\n    filt.py -v $nlsyms ${dir}/ref.wrd.trn.detok.tmp > ${dir}/ref.wrd.trn.detok\n    filt.py -v $nlsyms ${dir}/hyp.wrd.trn.detok.tmp > ${dir}/hyp.wrd.trn.detok\n    filt.py -v $nlsyms ${dir}/src.wrd.trn.detok.tmp > ${dir}/src.wrd.trn.detok\nfi\nif [ -n \"${filter}\" ]; then\n    sed -i.bak3 -f ${filter} ${dir}/hyp.wrd.trn.detok\n    sed -i.bak3 -f ${filter} ${dir}/ref.wrd.trn.detok\n    sed -i.bak3 -f ${filter} ${dir}/src.wrd.trn.detok\nfi\n# NOTE: this must be performed after detokenization so that punctuation marks are not removed\n\nif [ ${case} = tc ]; then\n    echo ${set} > ${dir}/result.tc.txt\n    multi-bleu-detok.perl ${dir}/ref.wrd.trn.detok < ${dir}/hyp.wrd.trn.detok >> ${dir}/result.tc.txt\n    echo \"write a case-sensitive BLEU result in ${dir}/result.tc.txt\"\n    cat ${dir}/result.tc.txt\nelse\n    echo ${set} > ${dir}/result.lc.txt\n    multi-bleu-detok.perl -lc ${dir}/ref.wrd.trn.detok < ${dir}/hyp.wrd.trn.detok >> ${dir}/result.lc.txt\n    echo \"write a case-insensitive BLEU result in ${dir}/result.lc.txt\"\n    cat ${dir}/result.lc.txt\nfi\n\n# TODO(hirofumi): add TER & METEOR metrics here\n"
  },
  {
    "path": "egs/espnet_utils/score_lang_id.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2021 Johns Hopkins University (Jiatong Shi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport sys\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"language identification scoring\")\n    parser.add_argument(\"--ref\", type=str, help=\"input reference\", required=True)\n    parser.add_argument(\"--hyp\", type=str, help=\"input hypotheses\", required=True)\n    parser.add_argument(\n        \"--out\",\n        type=argparse.FileType(\"w\"),\n        default=sys.stdout,\n        help=\"The output filename. \" \"If omitted, then output to sys.stdout\",\n    )\n    return parser\n\n\ndef main(args):\n    args = get_parser().parse_args(args)\n    scoring(args.ref, args.hyp, args.out)\n\n\ndef scoring(ref, hyp, out):\n    ref_file = codecs.open(ref, \"r\", encoding=\"utf-8\")\n    hyp_file = codecs.open(hyp, \"r\", encoding=\"utf-8\")\n\n    utt_num = 0\n    correct = 0\n\n    while True:\n        ref_utt = ref_file.readline()\n        hyp_utt = hyp_file.readline()\n\n        if not ref_utt or not hyp_utt:\n            break\n\n        [rec_id, lid, utt_id] = ref_utt.strip().split()\n        [hrec_id, hlid, hutt_id] = hyp_utt.strip().split()\n\n        # reference and hypothesis lines must describe the same utterance\n        assert rec_id == hrec_id and utt_id == hutt_id, \"Mismatch in trn id\"\n\n        if lid == hlid:\n            correct += 1\n        utt_num += 1\n    out.write(\n        \"Language Identification Scoring: Accuracy {:.4f} ({}/{})\".format(\n            (correct / float(utt_num)), correct, utt_num\n        )\n    )\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/score_sclite.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n[ -f ./path.sh ] && . ./path.sh\n\nnlsyms=\"\"\nwer=false\nmer=false\nbpe=\"\"\nbpemodel=\"\"\nremove_blank=true\nfilter=\"\"\nnum_spkrs=1\nhelp_message=\"Usage: $0 <data-dir> <dict>\"\nsppd3=false\n\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\ndir=$1\ndic=$2\n\nconcatjson.py ${dir}/data.*.json > ${dir}/data.json\n\nif [ $num_spkrs -eq 1 ]; then\n  json2trn.py ${dir}/data.json ${dic} --num-spkrs ${num_spkrs} --refs ${dir}/ref.trn --hyps ${dir}/hyp.trn\n\n  if ${remove_blank}; then\n      sed -i.bak2 -r 's/<blank> //g' ${dir}/hyp.trn\n  fi\n  if [ -n \"${nlsyms}\" ]; then\n      cp ${dir}/ref.trn ${dir}/ref.trn.org\n      cp ${dir}/hyp.trn ${dir}/hyp.trn.org\n      filt.py -v ${nlsyms} ${dir}/ref.trn.org > ${dir}/ref.trn\n      filt.py -v ${nlsyms} ${dir}/hyp.trn.org > ${dir}/hyp.trn\n  fi\n  if [ -n \"${filter}\" ]; then\n      sed -i.bak3 -f ${filter} ${dir}/hyp.trn\n      sed -i.bak3 -f ${filter} ${dir}/ref.trn\n  fi\n\n  if [ $sppd3 = true ]; then\n      cp ${dir}/hyp.trn ${dir}/hyp.trn.org \n      python3 espnet_utils/filter_trn.py $dir/hyp.trn.org > ${dir}/hyp.trn\n  fi\n\n  sclite -r ${dir}/ref.trn trn -h ${dir}/hyp.trn trn -i rm -o all stdout > ${dir}/result.txt\n  \n  echo \"write a CER (or TER) result in ${dir}/result.txt\"\n  grep -e Avg -e SPKR -m 2 ${dir}/result.txt\n  python3 espnet_utils/double_precious_cer.py ${dir}/result.txt\n\n  if ${wer}; then\n      if [ -n \"$bpe\" ]; then\n  \t    spm_decode --model=${bpemodel} --input_format=piece < ${dir}/ref.trn | sed -e \"s/▁/ /g\" > ${dir}/ref.wrd.trn\n  \t    spm_decode --model=${bpemodel} --input_format=piece < ${dir}/hyp.trn | sed -e \"s/▁/ /g\" > ${dir}/hyp.wrd.trn\n      else\n  \t    sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/ref.trn > ${dir}/ref.wrd.trn\n  \t    sed 
-e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/hyp.trn > ${dir}/hyp.wrd.trn\n      fi\n      sclite -r ${dir}/ref.wrd.trn trn -h ${dir}/hyp.wrd.trn trn -i rm -o all stdout > ${dir}/result.wrd.txt\n\n      echo \"write a WER result in ${dir}/result.wrd.txt\"\n      grep -e Avg -e SPKR -m 2 ${dir}/result.wrd.txt\n      python3 espnet_utils/double_precious_cer.py ${dir}/result.wrd.txt\n\n      if ${mer}; then\n         python3 espnet_utils/prepare_mer.py ${dir}/ref.wrd.trn ${dir}/ref.wrd.trn.chn ${dir}/ref.wrd.trn.eng\n         python3 espnet_utils/prepare_mer.py ${dir}/hyp.wrd.trn ${dir}/hyp.wrd.trn.chn ${dir}/hyp.wrd.trn.eng \n         sclite -r ${dir}/ref.wrd.trn.chn trn -h ${dir}/hyp.wrd.trn.chn trn -i rm -o all stdout > ${dir}/result.wrd.chn.txt\n         sclite -r ${dir}/ref.wrd.trn.eng trn -h ${dir}/hyp.wrd.trn.eng trn -i rm -o all stdout > ${dir}/result.wrd.eng.txt\n         \n         echo \"write a Mandarin CER result of code-switch data in ${dir}/result.wrd.chn.txt\"\n         grep -e Avg -e SPKR -m 2 ${dir}/result.wrd.chn.txt\n         echo \"write a English MER result of code-switch data in ${dir}/result.wrd.eng.txt\"\n         grep -e Avg -e SPKR -m 2 ${dir}/result.wrd.eng.txt\n      fi\n  fi\nelif [ ${num_spkrs} -lt 4 ]; then\n  ref_trns=\"\"\n  hyp_trns=\"\"\n  for i in $(seq ${num_spkrs}); do\n      ref_trns=${ref_trns}\"${dir}/ref${i}.trn \"\n      hyp_trns=${hyp_trns}\"${dir}/hyp${i}.trn \"\n  done\n  json2trn.py ${dir}/data.json ${dic} --num-spkrs ${num_spkrs} --refs ${ref_trns} --hyps ${hyp_trns}\n\n  for n in $(seq ${num_spkrs}); do\n      if ${remove_blank}; then\n          sed -i.bak2 -r 's/<blank> //g' ${dir}/hyp${n}.trn\n      fi\n      if [ -n \"${nlsyms}\" ]; then\n          cp ${dir}/ref${n}.trn ${dir}/ref${n}.trn.org\n          cp ${dir}/hyp${n}.trn ${dir}/hyp${n}.trn.org\n          filt.py -v ${nlsyms} ${dir}/ref${n}.trn.org > ${dir}/ref${n}.trn\n          filt.py -v ${nlsyms} ${dir}/hyp${n}.trn.org > ${dir}/hyp${n}.trn\n      
fi\n      if [ -n \"${filter}\" ]; then\n          sed -i.bak3 -f ${filter} ${dir}/hyp${n}.trn\n          sed -i.bak3 -f ${filter} ${dir}/ref${n}.trn\n      fi\n  done\n\n  results_str=\"\"\n  for (( i=0; i<$((num_spkrs * num_spkrs)); i++ )); do\n      ind_r=$((i / num_spkrs + 1))\n      ind_h=$((i % num_spkrs + 1))\n      results_str=${results_str}\"${dir}/result_r${ind_r}h${ind_h}.txt \"\n      sclite -r ${dir}/ref${ind_r}.trn trn -h ${dir}/hyp${ind_h}.trn trn -i rm -o all stdout > ${dir}/result_r${ind_r}h${ind_h}.txt\n  done\n\n  echo \"write CER (or TER) results in ${dir}/result_r*h*.txt\"\n  eval_perm_free_error.py --num-spkrs ${num_spkrs} \\\n      ${results_str} > ${dir}/min_perm_result.json\n  sed -n '2,4p' ${dir}/min_perm_result.json\n\n  if ${wer}; then\n      for n in $(seq ${num_spkrs}); do\n          if [ -n \"$bpe\" ]; then\n              spm_decode --model=${bpemodel} --input_format=piece < ${dir}/ref${n}.trn | sed -e \"s/▁/ /g\" > ${dir}/ref${n}.wrd.trn\n              spm_decode --model=${bpemodel} --input_format=piece < ${dir}/hyp${n}.trn | sed -e \"s/▁/ /g\" > ${dir}/hyp${n}.wrd.trn\n          else\n              sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/ref${n}.trn > ${dir}/ref${n}.wrd.trn\n              sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/hyp${n}.trn > ${dir}/hyp${n}.wrd.trn\n          fi\n      done\n      results_str=\"\"\n      for (( i=0; i<$((num_spkrs * num_spkrs)); i++ )); do\n          ind_r=$((i / num_spkrs + 1))\n          ind_h=$((i % num_spkrs + 1))\n          results_str=${results_str}\"${dir}/result_r${ind_r}h${ind_h}.wrd.txt \"\n          sclite -r ${dir}/ref${ind_r}.wrd.trn trn -h ${dir}/hyp${ind_h}.wrd.trn trn -i rm -o all stdout > ${dir}/result_r${ind_r}h${ind_h}.wrd.txt\n      done\n\n      echo \"write WER results in ${dir}/result_r*h*.wrd.txt\"\n      eval_perm_free_error.py --num-spkrs ${num_spkrs} \\\n          ${results_str} > ${dir}/min_perm_result.wrd.json\n      sed -n '2,4p' 
${dir}/min_perm_result.wrd.json\n  fi\nfi\n"
  },
  {
    "path": "egs/espnet_utils/score_sclite_case.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nexport LC_ALL=C\n\n. ./path.sh\n\nnlsyms=\"\"\nwer=false\nbpe=\"\"\nbpemodel=\"\"\nremove_blank=true\nfilter=\"\"\ncase=lc.rm\n\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n    echo \"Usage: $0 <data-dir> <dict>\";\n    exit 1;\nfi\n\ndir=$1\ndic=$2\n\nconcatjson.py ${dir}/data.*.json > ${dir}/data.json\njson2trn.py ${dir}/data.json ${dic} --refs ${dir}/ref.trn --hyps ${dir}/hyp.trn\n\nif ${remove_blank}; then\n    sed -i.bak2 -r 's/<blank> //g' ${dir}/hyp.trn\nfi\nif [ -n \"${nlsyms}\" ]; then\n    cp ${dir}/ref.trn ${dir}/ref.trn.org\n    cp ${dir}/hyp.trn ${dir}/hyp.trn.org\n    filt.py -v ${nlsyms} ${dir}/ref.trn.org > ${dir}/ref.trn\n    filt.py -v ${nlsyms} ${dir}/hyp.trn.org > ${dir}/hyp.trn\nfi\nif [ -n \"${filter}\" ]; then\n    sed -i.bak3 -f ${filter} ${dir}/hyp.trn\n    sed -i.bak3 -f ${filter} ${dir}/ref.trn\nfi\n\n# case-sensitive WER\nif [ ${case} = tc ]; then\n\n  # detokenize\n  detokenizer.perl -l en -q < ${dir}/ref.trn > ${dir}/ref.trn.detok\n  detokenizer.perl -l en -q < ${dir}/hyp.trn > ${dir}/hyp.trn.detok\n\n  sclite -s -r ${dir}/ref.trn.detok trn -h ${dir}/hyp.trn.detok trn -i rm -o all stdout > ${dir}/result.tc.txt\n\n  echo \"write a case-sensitive CER (or TER) result in ${dir}/result.tc.txt\"\n  grep -e Avg -e SPKR -m 2 ${dir}/result.tc.txt\n\n  if ${wer}; then\n      if [ -n \"$bpe\" ]; then\n          spm_decode --model=${bpemodel} --input_format=piece < ${dir}/ref.trn | sed -e \"s/▁/ /g\" > ${dir}/ref.wrd.trn\n          spm_decode --model=${bpemodel} --input_format=piece < ${dir}/hyp.trn | sed -e \"s/▁/ /g\" > ${dir}/hyp.wrd.trn\n      else\n          sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/ref.trn > ${dir}/ref.wrd.trn\n          sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/hyp.trn > ${dir}/hyp.wrd.trn\n      fi\n\n      # detokenize\n      
detokenizer.perl -l en -q < ${dir}/ref.wrd.trn > ${dir}/ref.wrd.trn.detok\n      detokenizer.perl -l en -q < ${dir}/hyp.wrd.trn > ${dir}/hyp.wrd.trn.detok\n\n      sclite -s -r ${dir}/ref.wrd.trn.detok trn -h ${dir}/hyp.wrd.trn.detok trn -i rm -o all stdout > ${dir}/result.wrd.tc.txt\n\n      echo \"write a case-sensitive WER result in ${dir}/result.wrd.tc.txt\"\n      grep -e Avg -e SPKR -m 2 ${dir}/result.wrd.tc.txt\n  fi\nfi\n\n# lowercasing\nlowercase.perl < ${dir}/hyp.trn > ${dir}/hyp.trn.lc\nlowercase.perl < ${dir}/ref.trn > ${dir}/ref.trn.lc\n\n# remove punctuation\npaste -d \"(\" <(cut -d '(' -f 1 ${dir}/hyp.trn.lc | remove_punctuation.pl | sed -e \"s/  / /g\") <(cut -d '(' -f 2- ${dir}/hyp.trn.lc) > ${dir}/hyp.trn.lc.rm\npaste -d \"(\" <(cut -d '(' -f 1 ${dir}/ref.trn.lc | remove_punctuation.pl | sed -e \"s/  / /g\") <(cut -d '(' -f 2- ${dir}/ref.trn.lc) > ${dir}/ref.trn.lc.rm\n\n# detokenize\ndetokenizer.perl -l en -q < ${dir}/ref.trn.lc.rm > ${dir}/ref.trn.lc.rm.detok\ndetokenizer.perl -l en -q < ${dir}/hyp.trn.lc.rm > ${dir}/hyp.trn.lc.rm.detok\n\nsclite -r ${dir}/ref.trn.lc.rm.detok trn -h ${dir}/hyp.trn.lc.rm.detok trn -i rm -o all stdout > ${dir}/result.txt\n\necho \"write a CER (or TER) result in ${dir}/result.txt\"\ngrep -e Avg -e SPKR -m 2 ${dir}/result.txt\n\nif ${wer}; then\n    if [ -n \"$bpe\" ]; then\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/ref.trn.lc.rm | sed -e \"s/▁/ /g\" > ${dir}/ref.wrd.trn.lc.rm\n        spm_decode --model=${bpemodel} --input_format=piece < ${dir}/hyp.trn.lc.rm | sed -e \"s/▁/ /g\" > ${dir}/hyp.wrd.trn.lc.rm\n    else\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/ref.trn.lc.rm > ${dir}/ref.wrd.trn.lc.rm\n        sed -e \"s/ //g\" -e \"s/(/ (/\" -e \"s/<space>/ /g\" ${dir}/hyp.trn.lc.rm > ${dir}/hyp.wrd.trn.lc.rm\n    fi\n\n    # detokenize\n    detokenizer.perl -l en -q < ${dir}/ref.wrd.trn.lc.rm > ${dir}/ref.wrd.trn.lc.rm.detok\n    detokenizer.perl -l en -q < 
${dir}/hyp.wrd.trn.lc.rm > ${dir}/hyp.wrd.trn.lc.rm.detok\n\n    sclite -r ${dir}/ref.wrd.trn.lc.rm.detok trn -h ${dir}/hyp.wrd.trn.lc.rm.detok trn -i rm -o all stdout > ${dir}/result.wrd.txt\n\n    echo \"write a WER result in ${dir}/result.wrd.txt\"\n    grep -e Avg -e SPKR -m 2 ${dir}/result.wrd.txt\nfi\n"
  },
  {
    "path": "egs/espnet_utils/score_sclite_wo_dict.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2019 Okayama University (Katsuki Inoue)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n[ -f ./path.sh ] && . ./path.sh\n\nwer=false\nnum_spkrs=1\nhelp_message=\"Usage: $0 <data-dir>\"\n\n. utils/parse_options.sh\n\nif [ $# != 1 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\ndir=$1\n\nconcatjson.py ${dir}/data.*.json > ${dir}/data.json\n\nif [ $num_spkrs -eq 1 ]; then\n    json2trn_wo_dict.py ${dir}/data.json --num-spkrs ${num_spkrs} --refs ${dir}/ref_org.wrd.trn --hyps ${dir}/hyp_org.wrd.trn\n   \n    cat < ${dir}/hyp_org.wrd.trn | sed -e 's/▁//' | sed -e 's/▁/ /g' > ${dir}/hyp.wrd.trn\n    cat < ${dir}/ref_org.wrd.trn | sed -e 's/\\.//g' -e 's/\\,//g' > ${dir}/ref.wrd.trn\n\n    cat < ${dir}/hyp.wrd.trn | awk -v FS='' '{a=0;for(i=1;i<=NF;i++){if($i==\"(\"){a=1};if(a==0){printf(\"%s \",$i)}else{printf(\"%s\",$i)}}printf(\"\\n\")}' > ${dir}/hyp.trn\n    cat < ${dir}/ref.wrd.trn | awk -v FS='' '{a=0;for(i=1;i<=NF;i++){if($i==\"(\"){a=1};if(a==0){printf(\"%s \",$i)}else{printf(\"%s\",$i)}}printf(\"\\n\")}' > ${dir}/ref.trn\n\n    sclite -r ${dir}/ref.trn trn -h ${dir}/hyp.trn -i rm -o all stdout > ${dir}/result.txt\n    echo \"write a CER result in ${dir}/result.txt\"\n    grep -e Avg -e SPKR -m 2 ${dir}/result.txt\n    \n    if ${wer}; then\n        sclite -r ${dir}/ref.wrd.trn trn -h ${dir}/hyp.wrd.trn -i rm -o all stdout > ${dir}/result.wrd.txt\n        echo \"write a WER result in ${dir}/result.wrd.txt\"\n        grep -e Avg -e SPKR -m 2 ${dir}/result.wrd.txt\n        \n        sclite -r ${dir}/ref_org.wrd.trn trn -h ${dir}/hyp.wrd.trn trn -i rm -o all stdout > ${dir}/result_w_punc.wrd.txt\n        echo \"write a WER result in ${dir}/result_w_punc.wrd.txt\"\n        grep -e Avg -e SPKR -m 2 ${dir}/result_w_punc.wrd.txt\n\n    fi\nfi\n\n\n"
  },
  {
    "path": "egs/espnet_utils/scp2json.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport json\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert scp to json\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--key\", \"-k\", type=str, help=\"key\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n\n    new_line = {}\n    sys.stdin = codecs.getreader(\"utf-8\")(sys.stdin if is_python2 else sys.stdin.buffer)\n    sys.stdout = codecs.getwriter(\"utf-8\")(\n        sys.stdout if is_python2 else sys.stdout.buffer\n    )\n    line = sys.stdin.readline()\n    while line:\n        x = line.rstrip().split()\n        v = {args.key: \" \".join(x[1:])}\n        new_line[x[0]] = v\n        line = sys.stdin.readline()\n\n    all_l = {\"utts\": new_line}\n\n    # ensure \"ensure_ascii=False\", which is a bug\n    jsonstring = json.dumps(\n        all_l, indent=4, ensure_ascii=False, sort_keys=True, separators=(\",\", \": \")\n    )\n    print(jsonstring)\n"
  },
  {
    "path": "egs/espnet_utils/show_result.sh",
    "content": "#!/usr/bin/env bash\nmindepth=0\nmaxdepth=1\n\n. utils/parse_options.sh\n\nif [ $# -gt 1 ]; then\n    echo \"Usage: $0 --mindepth 0 --maxdepth 1 [exp]\" 1>&2\n    echo \"\"\n    echo \"Show the system environments and the evaluation results in Markdown format.\"\n    echo 'The default of <exp> is \"exp/\".'\n    exit 1\nfi\n\n[ -f ./path.sh ] && . ./path.sh\nset -euo pipefail\nif [ $# -eq 1 ]; then\n    exp=$1\nelse\n    exp=exp\nfi\n\n\ncat << EOF\n<!-- Generated by $0 -->\n# RESULTS\n## Environments\n- date: \\`$(LC_ALL=C date)\\`\nEOF\n\npython3 << EOF\nimport sys, espnet, chainer, torch\npyversion = sys.version.replace('\\n', ' ')\n\nprint(f\"\"\"- python version: \\`{pyversion}\\`\n- espnet version: \\`espnet {espnet.__version__}\\`\n- chainer version: \\`chainer {chainer.__version__}\\`\n- pytorch version: \\`pytorch {torch.__version__}\\`\"\"\")\nEOF\n\ncat << EOF\n- Git hash: \\`$(git rev-parse HEAD)\\`\n  - Commit date: \\`$(git log -1 --format='%cd')\\`\n\nEOF\n\nwhile IFS= read -r expdir; do\n    if ls ${expdir}/decode_*/result.txt &> /dev/null; then\n    # 1. Show the result table\n    cat << EOF\n## $(basename ${expdir})\n### CER\n\n|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|\n|---|---|---|---|---|---|---|---|---|\nEOF\n        grep -e Avg ${expdir}/decode_*/result.txt \\\n            | sed -e \"s#${expdir}/\\([^/]*\\)/result.txt:#|\\1#g\" \\\n            | sed -e 's#Sum/Avg##g' | tr '|' ' ' | tr -s ' ' '|'\n        echo\n\n        # 2. 
Show the result table for WER\n        if ls ${expdir}/decode_*/result.wrd.txt &> /dev/null; then\n            cat << EOF\n### WER\n\n|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|\n|---|---|---|---|---|---|---|---|---|\nEOF\n            grep -e Avg ${expdir}/decode_*/result.wrd.txt \\\n                | sed -e \"s#${expdir}/\\([^/]*\\)/result.wrd.txt:#|\\1#g\" \\\n                | sed -e 's#Sum/Avg##g' | tr '|' ' ' | tr -s ' ' '|'\n            echo\n        fi\n    fi\ndone < <(find ${exp} -mindepth ${mindepth} -maxdepth ${maxdepth} -type d)\n"
  },
  {
    "path": "egs/espnet_utils/significant_test.sh",
    "content": "adir=$1 # reference\nbdir=$2 # tested \n\nfor part in trn wrd.trn wrd.trn.chn wrd.trn.eng; do\n    if [ -f $adir/ref.$part ] && [ -f $adir/ref.$part ]; then \n        (sclite -F -i wsj -r $adir/ref.$part -h $adir/hyp.$part -o sgml\n        sclite -F -i wsj -r $bdir/ref.$part -h $bdir/hyp.$part -o sgml\n\n        cat $adir/hyp.${part}.sgml $bdir/hyp.${part}.sgml | sc_stats -p -t mapsswe -v -u -n $bdir/result.${part}.mapsswe\n        ) &\n    fi\ndone\nwait \n"
  },
  {
    "path": "egs/espnet_utils/sort_scp_by_length.py",
    "content": "import sys\nimport os\n\nin_scp = sys.argv[1]\nin_frame = sys.argv[2]\nout_scp = sys.argv[3]\nout_frame = sys.argv[4]\n\n# read scp as dict\nscp_dict = {}\nfor line in open(in_scp, encoding=\"utf-8\"):\n    uttid, add = line.strip().split()\n    scp_dict[uttid] = add\n\n# read utt2frames\nframe_lst = []\nfor line in open(in_frame, encoding=\"utf-8\"):\n    uttid, length = line.strip().split()\n    length = int(length)\n    frame_lst.append([uttid, length])\n\nframe_lst.sort(key=lambda x: x[1])\n\nscp_writer = open(out_scp, 'w', encoding=\"utf-8\")\nframe_writer = open(out_frame, 'w', encoding='utf-8')\n\nfor e in frame_lst:\n    uttid, length = e\n    add = scp_dict[uttid]\n    scp_writer.write(f\"{uttid} {add}\\n\")\n    frame_writer.write(f\"{uttid} {length}\\n\")\nscp_writer.close()\nframe_writer.close()\n\n\"\"\"\nmax_utt = 256\ncount = 1\nout_scp_base = os.path.basename(out_scp)\nout_scp_base = out_scp_base.replace(\".\", f\".{count}.\")\nout_scp_dir = os.path.dirname(out_scp)\nout_scp = os.path.join(out_scp_dir, out_scp_base)\nfor i, e in enumerate(frame_lst):\n    if i % max_utt == 0:\n       scp_writer = open(out_scp, 'w', encoding='utf-8')\n       out_scp = out_scp.replace(f\".{count}.\", f\".{count+1}.\") \n       count += 1\n    uttid, _ = e\n    add = scp_dict[uttid]\n    scp_writer.write(f\"{uttid} {add}\\n\")\n\"\"\"\n"
  },
  {
    "path": "egs/espnet_utils/speed_perturb.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2021 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\ncases=\"\"\nspeeds=\"0.9 1.0 1.1\"\nlangs=\"\"\nwrite_utt2num_frames=true\nnj=32\ncmd=\"\"\n\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> <destination-dir> <fbankdir>\ne.g.: $0 data/train en de\nOptions:\n  --cases                              # target case information (e.g., lc.rm, lc, tc)\n  --speeds                             # speed used in speed perturbation (e.g., 0.9. 1.0, 1.1)\n  --langs                              # all languages (source + target)\n  --write_utt2num_frames               # write utt2num_frames in steps/make_fbank_pitch.sh\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs\n  --nj <nj>                            # number of parallel jobs\nEOF\n)\necho \"$0 $*\"  # Print the command line for logging\n\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\n\ndata_dir=$1\ndst=$2\nfbankdir=$3\n\ntmpdir=$(mktemp -d ${data_dir}/tmp-XXXXX)\ntrap 'rm -rf ${tmpdir}' EXIT\n\nfor sp in ${speeds}; do\n    utils/perturb_data_dir_speed.sh ${sp} ${data_dir} ${tmpdir}/temp.${sp}\ndone\nutils/combine_data.sh --extra-files utt2uniq ${dst} ${tmpdir}/temp.*\n\nsteps/make_fbank_pitch.sh --cmd ${cmd} --nj ${nj} --write_utt2num_frames ${write_utt2num_frames} \\\n    ${dst} exp/make_fbank/\"$(basename ${dst})\" ${fbankdir}\nutils/fix_data_dir.sh ${dst}\nutils/validate_data_dir.sh --no-text ${dst}\n\nif [ -n \"${langs}\" ]; then\n    # for ST/MT recipe + ASR recipe in ST recipe\n   for lang in ${langs}; do\n        for case in ${cases}; do\n            if [ -f ${dst}/text.${case}.${lang} ]; then\n                rm ${dst}/text.${case}.${lang}\n            fi\n        done\n        touch ${dst}/text.${case}.${lang}\n\n        for sp in ${speeds}; do\n            awk -v p=\"sp${sp}-\" '{printf(\"%s %s%s\\n\", 
$1, p, $1);}' ${data_dir}/utt2spk > ${dst}/utt_map\n\n            for case in ${cases}; do\n                utils/apply_map.pl -f 1 ${dst}/utt_map <${data_dir}/text.${case}.${lang} >> ${dst}/text.${case}.${lang}\n            done\n        done\n    done\nelse\n    # for ASR only recipe\n    touch ${dst}/text\n    for sp in ${speeds}; do\n        awk -v p=\"sp${sp}-\" '{printf(\"%s %s%s\\n\", $1, p, $1);}' ${data_dir}/utt2spk > ${dst}/utt_map\n        utils/apply_map.pl -f 1 ${dst}/utt_map <${data_dir}/text >>${dst}/text\n    done\nfi\n\nrm -rf ${tmpdir}*\n"
  },
  {
    "path": "egs/espnet_utils/split_scp.py",
    "content": "import sys\n\nin_f = sys.argv[1]\nwriters = []\nfor f in sys.argv[2:]:\n    writer = open(f, 'w', encoding='utf-8')\n    writers.append(writer)\nnum_writers = len(writers)\n\nfor i, line in enumerate(open(in_f, encoding='utf-8')):\n    writer = writers[i % num_writers]\n    writer.write(line)\n\nfor w in writers:\n    writer.close()\n    \n"
  },
  {
    "path": "egs/espnet_utils/split_scp_fix_length.py",
    "content": "import sys\nimport os\n\nin_f = sys.argv[1]\n\nmax_utt=360 # So batch size could be 1, 2, 3, 4, 5, 6, 8, 10, 12 etc.\ncount = 1\n\nout_scp_base = os.path.basename(in_f)\nout_scp_base = out_scp_base.replace(\".\", f\".{count}.\")\nout_scp_dir = os.path.dirname(in_f)\nout_scp = os.path.join(out_scp_dir, out_scp_base)\nfor i, line in enumerate(open(in_f, encoding='utf-8')):\n    if i % max_utt == 0:\n        scp_writer = open(out_scp, 'w', encoding=\"utf-8\")\n        out_scp = out_scp.replace(f\".{count}.\", f\".{count+1}.\") \n        count += 1\n    scp_writer.write(line)\n    \n"
  },
  {
    "path": "egs/espnet_utils/splitjson.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport argparse\nimport codecs\nimport json\nimport logging\nimport os\nimport sys\n\nimport numpy as np\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"split a json file for parallel processing\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"json\", type=str, help=\"json file\")\n    parser.add_argument(\n        \"--parts\", \"-p\", type=int, help=\"Number of subparts to be prepared\", default=0\n    )\n    parser.add_argument(\n        \"--original-order\", action=\"store_true\", help=\"If set, not sort utts by keys\"\n    )\n    return parser\n\n\nif __name__ == \"__main__\":\n    args = get_parser().parse_args()\n\n    # logging info\n    logging.basicConfig(\n        level=logging.INFO,\n        format=\"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\",\n    )\n    logging.info(get_commandline_args())\n\n    # check directory\n    filename = os.path.basename(args.json).split(\".\")[0]\n    dirname = os.path.dirname(args.json)\n    dirname = \"{}/split{}utt\".format(dirname, args.parts)\n    if not os.path.exists(dirname):\n        os.makedirs(dirname)\n\n    # load json and split keys\n    j = json.load(codecs.open(args.json, \"r\", encoding=\"utf-8\"))\n    if args.original_order:\n        utt_ids = list(j[\"utts\"].keys())\n    else:\n        utt_ids = sorted(list(j[\"utts\"].keys()))\n    logging.info(\"number of utterances = %d\" % len(utt_ids))\n    if len(utt_ids) < args.parts:\n        logging.error(\"#utterances < #splits. 
Use smaller split number.\")\n        sys.exit(1)\n    utt_id_lists = np.array_split(utt_ids, args.parts)\n    utt_id_lists = [utt_id_list.tolist() for utt_id_list in utt_id_lists]\n\n    for i, utt_id_list in enumerate(utt_id_lists):\n        new_dic = dict()\n        for utt_id in utt_id_list:\n            new_dic[utt_id] = j[\"utts\"][utt_id]\n        jsonstring = json.dumps(\n            {\"utts\": new_dic},\n            indent=4,\n            ensure_ascii=False,\n            sort_keys=not args.original_order,\n            separators=(\",\", \": \"),\n        )\n        fl = \"{}/{}.{}.json\".format(dirname, filename, i + 1)\n        sys.stdout = codecs.open(fl, \"w+\", encoding=\"utf-8\")\n        print(jsonstring)\n        sys.stdout.close()\n"
  },
  {
    "path": "egs/espnet_utils/spm_decode",
    "content": "#!/usr/bin/env python\n# Copyright (c) Facebook, Inc. and its affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# https://github.com/pytorch/fairseq/blob/master/LICENSE\n\n\nimport argparse\nimport sys\n\nimport sentencepiece as spm\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--model\", required=True,\n                        help=\"sentencepiece model to use for decoding\")\n    parser.add_argument(\"--input\", default=None, help=\"input file to decode\")\n    parser.add_argument(\"--input_format\", choices=[\"piece\", \"id\"], default=\"piece\")\n    args = parser.parse_args()\n\n    sp = spm.SentencePieceProcessor()\n    sp.Load(args.model)\n\n    if args.input_format == \"piece\":\n        def decode(l):\n            return \"\".join(sp.DecodePieces(l))\n    elif args.input_format == \"id\":\n        def decode(l):\n            return \"\".join(sp.DecodeIds(l))\n    else:\n        raise NotImplementedError\n\n    def tok2int(tok):\n        # remap reference-side <unk> (represented as <<unk>>) to 0\n        return int(tok) if tok != \"<<unk>>\" else 0\n\n    def multilingual_decode(line):\n        def process_segment(buf):\n            segment = \"\".join(buf).split() # string of bpes\n            segment = decode(segment).split() # list of words\n            return segment\n\n        ans, buf = [], []\n        for c in line:\n            if is_all_chinese(c):\n                if buf:\n                    ans.extend(process_segment(buf))\n                    buf = []\n                ans.append(c)\n            else:\n                buf.append(c)\n        if buf:\n            ans.extend(process_segment(buf))\n\n        ans = \" \".join(ans)\n        return ans\n                    \n\n    if args.input is None:\n   
     h = sys.stdin\n    else:\n        h = open(args.input, \"r\", encoding=\"utf-8\")\n    for line in h:\n        print(multilingual_decode(line))\n        # print(decode(line.split()))\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/spm_encode",
    "content": "#!/usr/bin/env python\n# Copyright (c) Facebook, Inc. and its affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the license found in\n# https://github.com/pytorch/fairseq/blob/master/LICENSE\n\n\nimport argparse\nimport contextlib\nimport sys\n\nimport sentencepiece as spm\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--model\", required=True,\n                        help=\"sentencepiece model to use for encoding\")\n    parser.add_argument(\"--inputs\", nargs=\"+\", default=['-'],\n                        help=\"input files to filter/encode\")\n    parser.add_argument(\"--outputs\", nargs=\"+\", default=['-'],\n                        help=\"path to save encoded outputs\")\n    parser.add_argument(\"--output_format\", choices=[\"piece\", \"id\"], default=\"piece\")\n    parser.add_argument(\"--min-len\", type=int, metavar=\"N\",\n                        help=\"filter sentence pairs with fewer than N tokens\")\n    parser.add_argument(\"--max-len\", type=int, metavar=\"N\",\n                        help=\"filter sentence pairs with more than N tokens\")\n    parser.add_argument(\"--split-chn\", action=\"store_true\",\n                        help=\"if true, remove all space between chn tokens\")\n    args = parser.parse_args()\n\n    assert len(args.inputs) == len(args.outputs), \\\n        \"number of input and output paths should match\"\n\n    sp = spm.SentencePieceProcessor()\n    sp.Load(args.model)\n\n    if args.output_format == \"piece\":\n        def encode(l):\n            return sp.EncodeAsPieces(l)\n    elif args.output_format == \"id\":\n        def encode(l):\n            return list(map(str, sp.EncodeAsIds(l)))\n    else:\n        raise NotImplementedError\n\n    if args.min_len is not None or args.max_len is not None:\n        
def valid(line):\n            return (\n                (args.min_len is None or len(line) >= args.min_len) and\n                (args.max_len is None or len(line) <= args.max_len)\n            )\n    else:\n        def valid(lines):\n            return True\n\n    with contextlib.ExitStack() as stack:\n        inputs = [\n            stack.enter_context(open(input, \"r\", encoding=\"utf-8\"))\n            if input != \"-\" else sys.stdin\n            for input in args.inputs\n        ]\n        outputs = [\n            stack.enter_context(open(output, \"w\", encoding=\"utf-8\"))\n            if output != \"-\" else sys.stdout\n            for output in args.outputs\n        ]\n\n        stats = {\n            \"num_empty\": 0,\n            \"num_filtered\": 0,\n        }\n\n        if args.split_chn:\n            process_chn = lambda x: \" \".join(list(x))\n        else:\n            process_chn = lambda x: x \n\n        def multilingual_encode(string):\n            ans = []\n            pieces = string.strip().split()\n  \n            for p in pieces:\n                if is_all_chinese(p):\n                    ans.append(process_chn(p))\n                else:\n                    ans.extend(encode(p))\n            \n            return ans\n\n        def encode_line(line):\n            line = line.strip()\n            if len(line) > 0:\n                # line = encode(line)\n                line = multilingual_encode(line)\n                if valid(line):\n                    return line\n                else:\n                    stats[\"num_filtered\"] += 1\n            else:\n                stats[\"num_empty\"] += 1\n            return None\n\n        for i, lines in enumerate(zip(*inputs), start=1):\n            enc_lines = list(map(encode_line, lines))\n            if not any(enc_line is None for enc_line in enc_lines):\n                for enc_line, output_h in zip(enc_lines, outputs):\n                    print(\" \".join(enc_line), file=output_h)\n        
    if i % 10000 == 0:\n                print(\"processed {} lines\".format(i), file=sys.stderr)\n\n        print(\"skipped {} empty lines\".format(stats[\"num_empty\"]), file=sys.stderr)\n        print(\"filtered {} lines\".format(stats[\"num_filtered\"]), file=sys.stderr)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/spm_train",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the license found in the\n# https://github.com/pytorch/fairseq/blob/master/LICENSE\nimport sys\n\nimport sentencepiece as spm\n\n\nif __name__ == \"__main__\":\n    spm.SentencePieceTrainer.Train(\" \".join(sys.argv[1:]))\n"
  },
  {
    "path": "egs/espnet_utils/stdout.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n\n# In general, doing\n#  run.pl some.log a b c is like running the command a b c in\n# the bash shell, and putting the standard error and output into some.log.\n# To run parallel jobs (backgrounded on the host machine), you can do (e.g.)\n#  run.pl JOB=1:4 some.JOB.log a b c JOB is like running the command a b c JOB\n# and putting it in some.JOB.log, for each one. [Note: JOB can be any identifier].\n# If any of the jobs fails, this script will fail.\n\n# A typical example is:\n#  run.pl some.log my-prog \"--opt=foo bar\" foo \\|  other-prog baz\n# and run.pl will run something like:\n# ( my-prog '--opt=foo bar' foo |  other-prog baz ) >& some.log\n#\n# Basically it takes the command-line arguments, quotes them\n# as necessary to preserve spaces, and evaluates them with bash.\n# In addition it puts the command line at the top of the log, and\n# the start and end times of the command at the beginning and end.\n# The reason why this is useful is so that we can create a different\n# version of this program that uses a queueing system instead.\n\n# use Data::Dumper;\n\n@ARGV < 2 && die \"usage: run.pl log-file command-line arguments...\";\n\n\n$max_jobs_run = -1;\n$jobstart = 1;\n$jobend = 1;\n$ignored_opts = \"\"; # These will be ignored.\n\n# First parse an option like JOB=1:4, and any\n# options that would normally be given to\n# queue.pl, which we will just discard.\n\nfor (my $x = 1; $x <= 2; $x++) { # This for-loop is to\n  # allow the JOB=1:n option to be interleaved with the\n  # options to qsub.\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {\n    # parse any options that would normally go to qsub, but which will be ignored here.\n    my $switch = shift @ARGV;\n    if ($switch eq \"-V\") {\n      $ignored_opts .= \"-V \";\n    } elsif ($switch eq \"--max-jobs-run\" || $switch eq \"-tc\") {\n      # we do support the option --max-jobs-run n, and its GridEngine form -tc n.\n  
    $max_jobs_run = shift @ARGV;\n      if (! ($max_jobs_run > 0)) {\n        die \"run.pl: invalid option --max-jobs-run $max_jobs_run\";\n      }\n    } else {\n      my $argument = shift @ARGV;\n      if ($argument =~ m/^--/) {\n        print STDERR \"run.pl: WARNING: suspicious argument '$argument' to $switch; starts with '-'\\n\";\n      }\n      if ($switch eq \"-sync\" && $argument =~ m/^[yY]/) {\n        $ignored_opts .= \"-sync \"; # Note: in the\n        # corresponding code in queue.pl it says instead, just \"$sync = 1;\".\n      } elsif ($switch eq \"-pe\") { # e.g. -pe smp 5\n        my $argument2 = shift @ARGV;\n        $ignored_opts .= \"$switch $argument $argument2 \";\n      } elsif ($switch eq \"--gpu\") {\n        $using_gpu = $argument;\n      } else {\n        # Ignore option.\n        $ignored_opts .= \"$switch $argument \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. JOB=1:20\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    shift;\n    if ($jobstart > $jobend) {\n      die \"run.pl: invalid job range $ARGV[0]\";\n    }\n    if ($jobstart <= 0) {\n      die \"run.pl: invalid job range $ARGV[0], start must be strictly positive (this is required for GridEngine compatibility).\";\n    }\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. 
JOB=1.\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"run.pl: Warning: suspicious first argument to run.pl: $ARGV[0]\\n\";\n  }\n}\n\n# Users found this message confusing so we are removing it.\n# if ($ignored_opts ne \"\") {\n#   print STDERR \"run.pl: Warning: ignoring options \\\"$ignored_opts\\\"\\n\";\n# }\n\nif ($max_jobs_run == -1) { # If --max-jobs-run option not set,\n                           # then work out the number of processors if possible,\n                           # and set it based on that.\n  $max_jobs_run = 0;\n  if ($using_gpu) {\n    if (open(P, \"nvidia-smi -L |\")) {\n      $max_jobs_run++ while (<P>);\n      close(P);\n    }\n    if ($max_jobs_run == 0) {\n      $max_jobs_run = 1;\n      print STDERR \"run.pl: Warning: failed to detect number of GPUs from nvidia-smi, using ${max_jobs_run}\\n\";\n    }\n  } elsif (open(P, \"</proc/cpuinfo\")) {  # Linux\n    while (<P>) { if (m/^processor/) { $max_jobs_run++; } }\n    if ($max_jobs_run == 0) {\n      print STDERR \"run.pl: Warning: failed to detect any processors from /proc/cpuinfo\\n\";\n      $max_jobs_run = 10;  # reasonable default.\n    }\n    close(P);\n  } elsif (open(P, \"sysctl -a |\")) {  # BSD/Darwin\n    while (<P>) {\n      if (m/hw\\.ncpu\\s*[:=]\\s*(\\d+)/) { # hw.ncpu = 4, or hw.ncpu: 4\n        $max_jobs_run = $1;\n        last;\n      }\n    }\n    close(P);\n    if ($max_jobs_run == 0) {\n      print STDERR \"run.pl: Warning: failed to detect any processors from sysctl -a\\n\";\n      $max_jobs_run = 10;  # reasonable default.\n    }\n  } else {\n    # allow at most 32 jobs at once, on non-UNIX systems; change this code\n    # if you need to change this default.\n    $max_jobs_run = 32;\n  }\n  # The just-computed value of $max_jobs_run is just the number of processors\n  # (or our best guess); and if it happens that the number of jobs we need to\n  # run is just slightly above 
$max_jobs_run, it will make sense to increase\n  # $max_jobs_run to equal the number of jobs, so we don't have a small number\n  # of leftover jobs.\n  $num_jobs = $jobend - $jobstart + 1;\n  if (!$using_gpu &&\n      $num_jobs > $max_jobs_run && $num_jobs < 1.4 * $max_jobs_run) {\n    $max_jobs_run = $num_jobs;\n  }\n}\n\n$logfile = shift @ARGV;\n\nif (defined $jobname && $logfile !~ m/$jobname/ &&\n    $jobend > $jobstart) {\n  print STDERR \"run.pl: you are trying to run a parallel job but \"\n    . \"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n$cmd = \"\";\n\nforeach $x (@ARGV) {\n    if ($x =~ m/^\\S+$/) { $cmd .=  $x . \" \"; }\n    elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; }\n    else { $cmd .= \"\\\"$x\\\" \"; }\n}\n\n#$Data::Dumper::Indent=0;\n$ret = 0;\n$numfail = 0;\n%active_pids=();\n\nuse POSIX \":sys_wait_h\";\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  if (scalar(keys %active_pids) >= $max_jobs_run) {\n\n    # Lets wait for a change in any child's status\n    # Then we have to work out which child finished\n    $r = waitpid(-1, 0);\n    $code = $?;\n    if ($r < 0 ) { die \"run.pl: Error waiting for child process\"; } # should never happen.\n    if ( defined $active_pids{$r} ) {\n        $jid=$active_pids{$r};\n        $fail[$jid]=$code;\n        if ($code !=0) { $numfail++;}\n        delete $active_pids{$r};\n        # print STDERR \"Finished: $r/$jid \" .  Dumper(\\%active_pids) . 
\"\\n\";\n    } else {\n        die \"run.pl: Cannot find the PID of the chold process that just finished.\";\n    }\n\n    # In theory we could do a non-blocking waitpid over all jobs running just\n    # to find out if only one or more jobs finished during the previous waitpid()\n    # However, we just omit this and will reap the next one in the next pass\n    # through the for(;;) cycle\n  }\n  $childpid = fork();\n  if (!defined $childpid) { die \"run.pl: Error forking in run.pl (writing to $logfile)\"; }\n  if ($childpid == 0) { # We're in the child... this branch\n    # executes the job and returns (possibly with an error status).\n    if (defined $jobname) {\n      $cmd =~ s/$jobname/$jobid/g;\n      $logfile =~ s/$jobname/$jobid/g;\n    }\n    system(\"mkdir -p `dirname $logfile` 2>/dev/null\");\n    open(F, \">$logfile\") || die \"run.pl: Error opening log file $logfile\";\n    print F \"# \" . $cmd . \"\\n\";\n    print F \"# Started at \" . `date`;\n    $starttime = `date +'%s'`;\n    print F \"#\\n\";\n    close(F);\n\n    # Pipe into bash.. make sure we're not using any other shell.\n    open(B, \"|bash\") || die \"run.pl: Error opening shell command\";\n    print B \"( \" . $cmd . \") |& tee -a $logfile\";\n    close(B);                   # If there was an error, exit status is in $?\n    $ret = $?;\n\n    $lowbits = $ret & 127;\n    $highbits = $ret >> 8;\n    if ($lowbits != 0) { $return_str = \"code $highbits; signal $lowbits\" }\n    else { $return_str = \"code $highbits\"; }\n\n    $endtime = `date +'%s'`;\n    open(F, \">>$logfile\") || die \"run.pl: Error opening log file $logfile (again)\";\n    $enddate = `date`;\n    chop $enddate;\n    print F \"# Accounting: time=\" . ($endtime - $starttime) . \" threads=1\\n\";\n    print F \"# Ended ($return_str) at \" . $enddate . \", elapsed time \" . ($endtime-$starttime) . \" seconds\\n\";\n    close(F);\n    exit($ret == 0 ? 
0 : 1);\n  } else {\n    $pid[$jobid] = $childpid;\n    $active_pids{$childpid} = $jobid;\n    # print STDERR \"Queued: \" .  Dumper(\\%active_pids) . \"\\n\";\n  }\n}\n\n# Now we have submitted all the jobs, lets wait until all the jobs finish\nforeach $child (keys %active_pids) {\n    $jobid=$active_pids{$child};\n    $r = waitpid($pid[$jobid], 0);\n    $code = $?;\n    if ($r == -1) { die \"run.pl: Error waiting for child process\"; } # should never happen.\n    if ($r != 0) { $fail[$jobid]=$code; $numfail++ if $code!=0; } # Completed successfully\n}\n\n# Some sanity checks:\n# The $fail array should not contain undefined codes\n# The number of non-zeros in that array  should be equal to $numfail\n# We cannot do foreach() here, as the JOB ids do not necessarily start by zero\n$failed_jids=0;\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $job_return = $fail[$jobid];\n  if (not defined $job_return ) {\n    # print Dumper(\\@fail);\n\n    die \"run.pl: Sanity check failed: we have indication that some jobs are running \" .\n      \"even after we waited for all jobs to finish\" ;\n  }\n  if ($job_return != 0 ){ $failed_jids++;}\n}\nif ($failed_jids != $numfail) {\n  die \"run.pl: Sanity check failed: cannot find out how many jobs failed ($failed_jids x $numfail).\"\n}\nif ($numfail > 0) { $ret = 1; }\n\nif ($ret != 0) {\n  $njobs = $jobend - $jobstart + 1;\n  if ($njobs == 1) {\n    if (defined $jobname) {\n      $logfile =~ s/$jobname/$jobstart/; # only one numbered job, so replace name with\n                                         # that job.\n    }\n    print STDERR \"run.pl: job failed, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"run.pl: probably you forgot to put JOB=1:\\$nj in your script.\";\n    }\n  }\n  else {\n    $logfile =~ s/$jobname/*/g;\n    print STDERR \"run.pl: $numfail / $njobs failed, log is in $logfile\\n\";\n  }\n}\n\n\nexit ($ret);\n"
  },
  {
    "path": "egs/espnet_utils/synth_wav.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2019 Nagoya University (Takenori Yoshimura)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nif [ ! -f path.sh ] || [ ! -f cmd.sh ]; then\n    echo \"Please change directory to e.g., egs/ljspeech/tts1\"\n    exit 1\nfi\n\n# shellcheck disable=SC1091\n. ./path.sh || exit 1;\n# shellcheck disable=SC1091\n. ./cmd.sh || exit 1;\n\n# general configuration\nbackend=pytorch\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=0         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\nverbose=1      # verbose option\n\n# feature configuration\nfs=22050      # sampling frequency\nfmax=\"\"       # maximum frequency\nfmin=\"\"       # minimum frequency\nn_mels=80     # number of mel basis\nn_fft=1024    # number of fft points\nn_shift=256   # number of shift points\nwin_length=\"\" # window length\ncmvn=\n\n# dictionary related\ndict=\ntrans_type=\"char\"\n\n# embedding related\ninput_wav=\n\n# decoding related\nsynth_model=\ndecode_config=\ndecode_dir=decode\ngriffin_lim_iters=64\n\n# download related\nmodels=ljspeech.transformer.v1\nvocoder_models=ljspeech.parallel_wavegan.v1\n\nhelp_message=$(cat <<EOF\nUsage:\n    $ $0 <text>\n\nNote:\n    This code does not include text frontend part. Please clean the input\n    text manually. Also, you need to modify feature configuration according\n    to the model. 
Default setting is for ljspeech models, so if you want to\n    use other pretrained models, please modify the parameters by yourself.\n    For our provided models, you can find them in the tables at\n    https://github.com/espnet/espnet#tts-demo.\n    If you are a beginner, instead of this script, I strongly recommend trying\n    the following colab notebook first, which includes the whole procedure\n    from text frontend, feature generation, and waveform generation.\n    https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb\n\nExample:\n    # make text file and then generate it\n    # (for the default model, ljspeech, we use upper-case char sequence as the input)\n    echo \"THIS IS A DEMONSTRATION OF TEXT TO SPEECH.\" > example.txt\n    $0 example.txt\n\n    # you can also use multiple lines of text\n    echo \"THIS IS A DEMONSTRATION OF TEXT TO SPEECH.\" > example.txt\n    echo \"TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH.\" >> example.txt\n    $0 example.txt\n\n    # you can specify the pretrained models\n    $0 --models ljspeech.transformer.v3 example.txt\n\n    # you can also specify the vocoder model\n    $0 --vocoder_models ljspeech.wavenet.mol.v2 example.txt\n\nAvailable models:\n    - ljspeech.tacotron2.v1\n    - ljspeech.tacotron2.v2\n    - ljspeech.tacotron2.v3\n    - ljspeech.transformer.v1\n    - ljspeech.transformer.v2\n    - ljspeech.transformer.v3\n    - ljspeech.fastspeech.v1\n    - ljspeech.fastspeech.v2\n    - ljspeech.fastspeech.v3\n    - libritts.tacotron2.v1\n    - libritts.transformer.v1\n    - jsut.transformer.v1\n    - jsut.tacotron2.v1\n    - csmsc.transformer.v1\n    - csmsc.fastspeech.v3\n\nAvailable vocoder models:\n    - ljspeech.wavenet.softmax.ns.v1\n    - ljspeech.wavenet.mol.v1\n    - ljspeech.parallel_wavegan.v1\n    - libritts.wavenet.mol.v1\n    - jsut.wavenet.mol.v1\n    - jsut.parallel_wavegan.v1\n    - csmsc.wavenet.mol.v1\n    - csmsc.parallel_wavegan.v1\n\nModel details:\n    | 
Model name              | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Input type |\n    | ----------------------- | ---- | ------- | -------------- | ---------------------- | ---------- |\n    | ljspeech.tacotron2.v1   | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.tacotron2.v2   | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.tacotron2.v3   | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.transformer.v1 | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.transformer.v2 | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.transformer.v3 | EN   | 22.05k  | None           | 1024 / 256 / None      | phn        |\n    | ljspeech.fastspeech.v1  | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.fastspeech.v2  | EN   | 22.05k  | None           | 1024 / 256 / None      | char       |\n    | ljspeech.fastspeech.v3  | EN   | 22.05k  | None           | 1024 / 256 / None      | phn        |\n    | libritts.tacotron2.v1   | EN   | 24k     | 80-7600        | 1024 / 256 / None      | char       |\n    | libritts.transformer.v1 | EN   | 24k     | 80-7600        | 1024 / 256 / None      | char       |\n    | jsut.tacotron2          | JP   | 24k     | 80-7600        | 2048 / 300 / 1200      | phn        |\n    | jsut.transformer        | JP   | 24k     | 80-7600        | 2048 / 300 / 1200      | phn        |\n    | csmsc.transformer.v1    | ZH   | 24k     | 80-7600        | 2048 / 300 / 1200      | pinyin     |\n    | csmsc.fastspeech.v3     | ZH   | 24k     | 80-7600        | 2048 / 300 / 1200      | pinyin     |\n\nVocoder model details:\n    | Model name                     | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type       |\n    | ------------------------------ | ---- | ------- | -------------- | 
---------------------- | ---------------- |\n    | ljspeech.wavenet.softmax.ns.v1 | EN   | 22.05k  | None           | 1024 / 256 / None      | Softmax WaveNet  |\n    | ljspeech.wavenet.mol.v1        | EN   | 22.05k  | None           | 1024 / 256 / None      | MoL WaveNet      |\n    | ljspeech.parallel_wavegan.v1   | EN   | 22.05k  | None           | 1024 / 256 / None      | Parallel WaveGAN |\n    | libritts.wavenet.mol.v1        | EN   | 24k     | None           | 1024 / 256 / None      | MoL WaveNet      |\n    | jsut.wavenet.mol.v1            | JP   | 24k     | 80-7600        | 2048 / 300 / 1200      | MoL WaveNet      |\n    | jsut.parallel_wavegan.v1       | JP   | 24k     | 80-7600        | 2048 / 300 / 1200      | Parallel WaveGAN |\n    | csmsc.wavenet.mol.v1           | ZH   | 24k     | 80-7600        | 2048 / 300 / 1200      | MoL WaveNet      |\n    | csmsc.parallel_wavegan.v1      | ZH   | 24k     | 80-7600        | 2048 / 300 / 1200      | Parallel WaveGAN |\n\nEOF\n)\n\n# shellcheck disable=SC1091\n. 
utils/parse_options.sh || exit 1;\n\ntxt=$1\ndownload_dir=${decode_dir}/download\n\nif [ $# -ne 1 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -e\nset -u\nset -o pipefail\n\nfunction download_models () {\n    case \"${models}\" in\n        \"ljspeech.tacotron2.v1\") share_url=\"https://drive.google.com/open?id=1dKzdaDpOkpx7kWZnvrvx2De7eZEdPHZs\" ;;\n        \"ljspeech.tacotron2.v2\") share_url=\"https://drive.google.com/open?id=11T9qw8rJlYzUdXvFjkjQjYrp3iGfQ15h\" ;;\n        \"ljspeech.tacotron2.v3\") share_url=\"https://drive.google.com/open?id=1hiZn14ITUDM1nkn-GkaN_M3oaTOUcn1n\" ;;\n        \"ljspeech.transformer.v1\") share_url=\"https://drive.google.com/open?id=13DR-RB5wrbMqBGx_MC655VZlsEq52DyS\" ;;\n        \"ljspeech.transformer.v2\") share_url=\"https://drive.google.com/open?id=1xxAwPuUph23RnlC5gym7qDM02ZCW9Unp\" ;;\n        \"ljspeech.transformer.v3\") share_url=\"https://drive.google.com/open?id=1M_w7nxI6AfbtSHpMO-exILnAc_aUYvXP\" ;;\n        \"ljspeech.fastspeech.v1\") share_url=\"https://drive.google.com/open?id=17RUNFLP4SSTbGA01xWRJo7RkR876xM0i\" ;;\n        \"ljspeech.fastspeech.v2\") share_url=\"https://drive.google.com/open?id=1zD-2GMrWM3thaDpS3h3rkTU4jIC0wc5B\";;\n        \"ljspeech.fastspeech.v3\") share_url=\"https://drive.google.com/open?id=1W86YEQ6KbuUTIvVURLqKtSNqe_eI2GDN\";;\n        \"libritts.tacotron2.v1\") share_url=\"https://drive.google.com/open?id=1iAXwC0AuWusa9AcFeUVkcNLG0I-hnSr3\" ;;\n        \"libritts.transformer.v1\") share_url=\"https://drive.google.com/open?id=1Xj73mDPuuPH8GsyNO8GnOC3mn0_OK4g3\";;\n        \"jsut.transformer.v1\") share_url=\"https://drive.google.com/open?id=1mEnZfBKqA4eT6Bn0eRZuP6lNzL-IL3VD\" ;;\n        \"jsut.tacotron2.v1\") share_url=\"https://drive.google.com/open?id=1kp5M4VvmagDmYckFJa78WGqh1drb_P9t\" ;;\n        \"csmsc.transformer.v1\") share_url=\"https://drive.google.com/open?id=1bTSygvonv5TS6-iuYsOIUWpN2atGnyhZ\";;\n        \"csmsc.fastspeech.v3\") 
share_url=\"https://drive.google.com/open?id=1T8thxkAxjGFPXPWPTcKLvHnd6lG0-82R\";;\n        *) echo \"No such models: ${models}\"; exit 1 ;;\n    esac\n\n    dir=${download_dir}/${models}\n    mkdir -p \"${dir}\"\n    if [ ! -e \"${dir}/.complete\" ]; then\n        download_from_google_drive.sh \"${share_url}\" \"${dir}\" \"tar.gz\"\n\ttouch \"${dir}/.complete\"\n    fi\n}\n\nfunction download_vocoder_models () {\n    case \"${vocoder_models}\" in\n        \"ljspeech.wavenet.softmax.ns.v1\") share_url=\"https://drive.google.com/open?id=1eA1VcRS9jzFa-DovyTgJLQ_jmwOLIi8L\";;\n        \"ljspeech.wavenet.mol.v1\") share_url=\"https://drive.google.com/open?id=1sY7gEUg39QaO1szuN62-Llst9TrFno2t\";;\n        \"ljspeech.parallel_wavegan.v1\") share_url=\"https://drive.google.com/open?id=1tv9GKyRT4CDsvUWKwH3s_OfXkiTi0gw7\";;\n        \"libritts.wavenet.mol.v1\") share_url=\"https://drive.google.com/open?id=1jHUUmQFjWiQGyDd7ZeiCThSjjpbF_B4h\";;\n        \"jsut.wavenet.mol.v1\") share_url=\"https://drive.google.com/open?id=187xvyNbmJVZ0EZ1XHCdyjZHTXK9EcfkK\";;\n        \"jsut.parallel_wavegan.v1\") share_url=\"https://drive.google.com/open?id=1OwrUQzAmvjj1x9cDhnZPp6dqtsEqGEJM\";;\n        \"csmsc.wavenet.mol.v1\") share_url=\"https://drive.google.com/open?id=1PsjFRV5eUP0HHwBaRYya9smKy5ghXKzj\";;\n        \"csmsc.parallel_wavegan.v1\") share_url=\"https://drive.google.com/open?id=10M6H88jEUGbRWBmU1Ff2VaTmOAeL8CEy\";;\n        *) echo \"No such models: ${vocoder_models}\"; exit 1 ;;\n    esac\n\n    dir=${download_dir}/${vocoder_models}\n    mkdir -p \"${dir}\"\n    if [ ! 
-e \"${dir}/.complete\" ]; then\n        download_from_google_drive.sh \"${share_url}\" \"${dir}\" \".tar.gz\"\n\ttouch \"${dir}/.complete\"\n    fi\n}\n\n# Download trained models\nif [ -z \"${cmvn}\" ]; then\n    download_models\n    cmvn=$(find \"${download_dir}/${models}\" -name \"cmvn.ark\" | head -n 1)\nfi\nif [ -z \"${dict}\" ]; then\n    download_models\n    dict=$(find \"${download_dir}/${models}\" -name \"*_units.txt\" | head -n 1)\nfi\nif [ -z \"${synth_model}\" ]; then\n    download_models\n    synth_model=$(find \"${download_dir}/${models}\" -name \"model*.best\" | head -n 1)\nfi\nif [ -z \"${decode_config}\" ]; then\n    download_models\n    decode_config=$(find \"${download_dir}/${models}\" -name \"decode*.yaml\" | head -n 1)\nfi\n\nsynth_json=$(basename \"${synth_model}\")\nmodel_json=\"$(dirname \"${synth_model}\")/${synth_json%%.*}.json\"\nuse_speaker_embedding=$(grep use_speaker_embedding \"${model_json}\" | sed -e \"s/.*: \\(.*\\),/\\1/\")\nif [ \"${use_speaker_embedding}\" = \"false\" ] || [ \"${use_speaker_embedding}\" = \"0\" ]; then\n    use_input_wav=false\nelse\n    use_input_wav=true\nfi\nif [ -z \"${input_wav}\" ] && \"${use_input_wav}\"; then\n    download_models\n    input_wav=$(find \"${download_dir}/${models}\" -name \"*.wav\" | head -n 1)\nfi\n\n# Check file existence\nif [ ! -f \"${cmvn}\" ]; then\n    echo \"No such CMVN file: ${cmvn}\"\n    exit 1\nfi\nif [ ! -f \"${dict}\" ]; then\n    echo \"No such dictionary: ${dict}\"\n    exit 1\nfi\nif [ ! -f \"${synth_model}\" ]; then\n    echo \"No such E2E model: ${synth_model}\"\n    exit 1\nfi\nif [ ! -f \"${decode_config}\" ]; then\n    echo \"No such config file: ${decode_config}\"\n    exit 1\nfi\nif [ ! -f \"${input_wav}\" ] && ${use_input_wav}; then\n    echo \"No such WAV file for extracting meta information: ${input_wav}\"\n    exit 1\nfi\nif [ ! 
-f \"${txt}\" ]; then\n    echo \"No such txt file: ${txt}\"\n    exit 1\nfi\n\nbase=$(basename \"${txt}\" .txt)\ndecode_dir=${decode_dir}/${base}\n\nif [ \"${stage}\" -le 0 ] && [ \"${stop_stage}\" -ge 0 ]; then\n    echo \"stage 0: Data preparation\"\n\n    [ -e \"${decode_dir}/data\" ] && rm -rf \"${decode_dir}/data\"\n    mkdir -p \"${decode_dir}/data\"\n    num_lines=$(wc -l < \"${txt}\")\n    for idx in $(seq \"${num_lines}\"); do\n        echo \"${base}_${idx} X\" >> \"${decode_dir}/data/wav.scp\"\n        echo \"X ${base}_${idx}\" >> \"${decode_dir}/data/spk2utt\"\n        echo \"${base}_${idx} X\" >> \"${decode_dir}/data/utt2spk\"\n        echo -n \"${base}_${idx} \" >> \"${decode_dir}/data/text\"\n        sed -n \"${idx}\"p \"${txt}\" >> \"${decode_dir}/data/text\"\n    done\n\n    mkdir -p \"${decode_dir}/dump\"\n    data2json.sh --trans_type \"${trans_type}\" \"${decode_dir}/data\" \"${dict}\" > \"${decode_dir}/dump/data.json\"\nfi\n\nif [ \"${stage}\" -le 1 ] && [ \"${stop_stage}\" -ge 1 ] && \"${use_input_wav}\"; then\n    echo \"stage 1: x-vector extraction\"\n\n    utils/copy_data_dir.sh \"${decode_dir}/data\" \"${decode_dir}/data2\"\n    sed -i -e \"s;X$;${input_wav};g\" \"${decode_dir}/data2/wav.scp\"\n    utils/data/resample_data_dir.sh 16000 \"${decode_dir}/data2\"\n    # shellcheck disable=SC2154\n    steps/make_mfcc.sh \\\n        --write-utt2num-frames true \\\n        --mfcc-config conf/mfcc.conf \\\n        --nj 1 --cmd \"${train_cmd}\" \\\n        \"${decode_dir}/data2\" \"${decode_dir}/log\" \"${decode_dir}/mfcc\"\n    utils/fix_data_dir.sh \"${decode_dir}/data2\"\n    sid/compute_vad_decision.sh --nj 1 --cmd \"$train_cmd\" \\\n        \"${decode_dir}/data2\" \"${decode_dir}/log\" \"${decode_dir}/mfcc\"\n    utils/fix_data_dir.sh \"${decode_dir}/data2\"\n\n    nnet_dir=${download_dir}/xvector_nnet_1a\n    if [ ! -e \"${nnet_dir}\" ]; then\n        echo \"X-vector model does not exist. 
Download pre-trained model.\"\n        wget http://kaldi-asr.org/models/8/0008_sitw_v2_1a.tar.gz\n        tar xvf 0008_sitw_v2_1a.tar.gz\n        mv 0008_sitw_v2_1a/exp/xvector_nnet_1a \"${download_dir}\"\n        rm -rf 0008_sitw_v2_1a.tar.gz 0008_sitw_v2_1a\n    fi\n    sid/nnet3/xvector/extract_xvectors.sh --cmd \"${train_cmd} --mem 4G\" --nj 1 \\\n        \"${nnet_dir}\" \"${decode_dir}/data2\" \\\n        \"${decode_dir}/xvectors\"\n\n    local/update_json.sh \"${decode_dir}/dump/data.json\" \"${decode_dir}/xvectors/xvector.scp\"\nfi\n\nif [ \"${stage}\" -le 2 ] && [ \"${stop_stage}\" -ge 2 ]; then\n    echo \"stage 2: Decoding\"\n\n    # shellcheck disable=SC2154\n    ${decode_cmd} \"${decode_dir}/log/decode.log\" \\\n        tts_decode.py \\\n        --config \"${decode_config}\" \\\n        --ngpu \"${ngpu}\" \\\n        --backend \"${backend}\" \\\n        --debugmode \"${debugmode}\" \\\n        --verbose \"${verbose}\" \\\n        --out \"${decode_dir}/outputs/feats\" \\\n        --json \"${decode_dir}/dump/data.json\" \\\n        --model \"${synth_model}\"\nfi\n\noutdir=${decode_dir}/outputs; mkdir -p \"${outdir}_denorm\"\nif [ \"${stage}\" -le 3 ] && [ \"${stop_stage}\" -ge 3 ]; then\n    echo \"stage 3: Synthesis with Griffin-Lim\"\n\n    apply-cmvn --norm-vars=true --reverse=true \"${cmvn}\" \\\n        scp:\"${outdir}/feats.scp\" \\\n        ark,scp:\"${outdir}_denorm/feats.ark,${outdir}_denorm/feats.scp\"\n\n    convert_fbank.sh --nj 1 --cmd \"${decode_cmd}\" \\\n        --fs \"${fs}\" \\\n        --fmax \"${fmax}\" \\\n        --fmin \"${fmin}\" \\\n        --n_fft \"${n_fft}\" \\\n        --n_shift \"${n_shift}\" \\\n        --win_length \"${win_length}\" \\\n        --n_mels \"${n_mels}\" \\\n        --iters \"${griffin_lim_iters}\" \\\n        \"${outdir}_denorm\" \\\n        \"${decode_dir}/log\" \\\n        \"${decode_dir}/wav\"\n\n    echo \"\"\n    echo \"Synthesized wav: ${decode_dir}/wav/${base}.wav\"\n    echo \"\"\n    echo 
\"Finished\"\nfi\n\nif [ \"${stage}\" -le 4 ] && [ \"${stop_stage}\" -ge 4 ]; then\n    echo \"stage 4: Synthesis with Neural Vocoder\"\n    model_corpus=$(echo ${models} | cut -d. -f 1)\n    vocoder_model_corpus=$(echo ${vocoder_models} | cut -d. -f 1)\n    if [ \"${model_corpus}\" != \"${vocoder_model_corpus}\" ]; then\n        echo \"${vocoder_models} does not support ${models} (Due to the sampling rate mismatch).\"\n        exit 1\n    fi\n    download_vocoder_models\n    dst_dir=${decode_dir}/wav_wnv\n\n    # This is hardcoded for now.\n    if [[ \"${vocoder_models}\" == *\".mol.\"* ]]; then\n        # Needs to use https://github.com/r9y9/wavenet_vocoder\n        # that supports mixture of logistics/gaussians\n        MDN_WAVENET_VOC_DIR=./local/r9y9_wavenet_vocoder\n        if [ ! -d \"${MDN_WAVENET_VOC_DIR}\" ]; then\n            git clone https://github.com/r9y9/wavenet_vocoder \"${MDN_WAVENET_VOC_DIR}\"\n            cd \"${MDN_WAVENET_VOC_DIR}\" && pip install . && cd -\n        fi\n        checkpoint=$(find \"${download_dir}/${vocoder_models}\" -name \"*.pth\" | head -n 1)\n        feats2npy.py \"${outdir}/feats.scp\" \"${outdir}_npy\"\n        python3 ${MDN_WAVENET_VOC_DIR}/evaluate.py \"${outdir}_npy\" \"${checkpoint}\" \"${dst_dir}\" \\\n            --hparams \"batch_size=1\" \\\n            --verbose \"${verbose}\"\n        rm -rf \"${outdir}_npy\"\n    elif [[ \"${vocoder_models}\" == *\".parallel_wavegan.\"* ]]; then\n        checkpoint=$(find \"${download_dir}/${vocoder_models}\" -name \"*.pkl\" | head -n 1)\n        if ! 
command -v parallel-wavegan-decode > /dev/null; then\n            pip install parallel-wavegan\n        fi\n        parallel-wavegan-decode \\\n            --scp \"${outdir}/feats.scp\" \\\n            --checkpoint \"${checkpoint}\" \\\n            --outdir \"${dst_dir}\" \\\n            --verbose \"${verbose}\"\n    else\n        checkpoint=$(find \"${download_dir}/${vocoder_models}\" -name \"checkpoint*\" | head -n 1)\n        generate_wav.sh --nj 1 --cmd \"${decode_cmd}\" \\\n            --fs \"${fs}\" \\\n            --n_fft \"${n_fft}\" \\\n            --n_shift \"${n_shift}\" \\\n            \"${checkpoint}\" \\\n            \"${outdir}_denorm\" \\\n            \"${decode_dir}/log\" \\\n            \"${dst_dir}\"\n    fi\n    echo \"\"\n    echo \"Synthesized wav: ${decode_dir}/wav_wnv/${base}.wav\"\n    echo \"\"\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/espnet_utils/text2token.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport argparse\nimport codecs\nimport re\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef exist_or_not(i, match_pos):\n    start_pos = None\n    end_pos = None\n    for pos in match_pos:\n        if pos[0] <= i < pos[1]:\n            start_pos = pos[0]\n            end_pos = pos[1]\n            break\n\n    return start_pos, end_pos\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"convert raw text to tokenized text\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--nchar\",\n        \"-n\",\n        default=1,\n        type=int,\n        help=\"number of characters to split, e.g., \\\n                        aabb -> a a b b with -n 1 and aa bb with -n 2\",\n    )\n    parser.add_argument(\n        \"--skip-ncols\", \"-s\", default=0, type=int, help=\"skip first n columns\"\n    )\n    parser.add_argument(\"--space\", default=\"<space>\", type=str, help=\"space symbol\")\n    parser.add_argument(\n        \"--non-lang-syms\",\n        \"-l\",\n        default=None,\n        type=str,\n        help=\"list of non-linguistic symbols, e.g., <NOISE> etc.\",\n    )\n    parser.add_argument(\"text\", type=str, default=False, nargs=\"?\", help=\"input text\")\n    parser.add_argument(\n        \"--trans_type\",\n        \"-t\",\n        type=str,\n        default=\"char\",\n        choices=[\"char\", \"phn\"],\n        help=\"\"\"Transcript type. char/phn. 
e.g., for TIMIT FADG0_SI1279 -\n                        If trans_type is char,\n                        read from SI1279.WRD file -> \"bricks are an alternative\"\n                        Else if trans_type is phn,\n                        read from SI1279.PHN file -> \"sil b r ih sil k s aa r er n aa l\n                        sil t er n ih sil t ih v sil\" \"\"\",\n    )\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    rs = []\n    if args.non_lang_syms is not None:\n        with codecs.open(args.non_lang_syms, \"r\", encoding=\"utf-8\") as f:\n            nls = [x.rstrip() for x in f.readlines()]\n            rs = [re.compile(re.escape(x)) for x in nls]\n\n    if args.text:\n        f = codecs.open(args.text, encoding=\"utf-8\")\n    else:\n        f = codecs.getreader(\"utf-8\")(sys.stdin if is_python2 else sys.stdin.buffer)\n\n    sys.stdout = codecs.getwriter(\"utf-8\")(\n        sys.stdout if is_python2 else sys.stdout.buffer\n    )\n    line = f.readline()\n    n = args.nchar\n    while line:\n        x = line.split()\n        print(\" \".join(x[: args.skip_ncols]), end=\" \")\n        a = \" \".join(x[args.skip_ncols :])\n\n        # get all matched positions\n        match_pos = []\n        for r in rs:\n            i = 0\n            while i >= 0:\n                m = r.search(a, i)\n                if m:\n                    match_pos.append([m.start(), m.end()])\n                    i = m.end()\n                else:\n                    break\n\n        if args.trans_type == \"phn\":\n            a = a.split(\" \")\n        else:\n            if len(match_pos) > 0:\n                chars = []\n                i = 0\n                while i < len(a):\n                    start_pos, end_pos = exist_or_not(i, match_pos)\n                    if start_pos is not None:\n                        chars.append(a[start_pos:end_pos])\n                        i = end_pos\n                    else:\n              
          chars.append(a[i])\n                        i += 1\n                a = chars\n\n            a = [a[j : j + n] for j in range(0, len(a), n)]\n\n        a_flat = []\n        for z in a:\n            a_flat.append(\"\".join(z))\n\n        a_chars = [z.replace(\" \", args.space) for z in a_flat]\n        if args.trans_type == \"phn\":\n            a_chars = [z.replace(\"sil\", args.space) for z in a_chars]\n        print(\" \".join(a_chars))\n        line = f.readline()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/text2vocabulary.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Mitsubishi Electric Research Laboratories (Takaaki Hori)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport codecs\nimport logging\nimport six\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"create a vocabulary file from text files\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--output\", \"-o\", default=\"\", type=str, help=\"output a vocabulary file\"\n    )\n    parser.add_argument(\"--cutoff\", \"-c\", default=0, type=int, help=\"cut-off frequency\")\n    parser.add_argument(\n        \"--vocabsize\", \"-s\", default=20000, type=int, help=\"vocabulary size\"\n    )\n    parser.add_argument(\"text_files\", nargs=\"*\", help=\"input text files\")\n    return parser\n\n\nif __name__ == \"__main__\":\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # count the word occurrences\n    counts = {}\n    exclude = [\"<sos>\", \"<eos>\", \"<unk>\"]\n    if len(args.text_files) == 0:\n        args.text_files.append(\"-\")\n    for fn in args.text_files:\n        fd = (\n            codecs.open(fn, \"r\", encoding=\"utf-8\")\n            if fn != \"-\"\n            else codecs.getreader(\"utf-8\")(\n                sys.stdin if is_python2 else sys.stdin.buffer\n            )\n        )\n        for ln in fd.readlines():\n            for tok in ln.split():\n                if tok not in exclude:\n                    if tok not in counts:\n                        counts[tok] = 1\n                    else:\n                        counts[tok] += 1\n        if fn != \"-\":\n            fd.close()\n\n    # limit the vocabulary size\n    total_count = sum(counts.values())\n    invocab_count = 0\n    vocabulary = []\n    for w, c in sorted(counts.items(), key=lambda x: -x[1]):\n        if c <= args.cutoff:\n          
  break\n        if len(vocabulary) >= args.vocabsize:\n            break\n        vocabulary.append(w)\n        invocab_count += c\n\n    logging.warning(\n        \"OOV rate = %.2f %%\" % (float(total_count - invocab_count) / total_count * 100)\n    )\n    # write the vocabulary\n    fd = (\n        codecs.open(args.output, \"w\", encoding=\"utf-8\")\n        if args.output\n        else codecs.getwriter(\"utf-8\")(sys.stdout if is_python2 else sys.stdout.buffer)\n    )\n    six.print_(\"<unk> 1\", file=fd)\n    for n, w in enumerate(sorted(vocabulary)):\n        six.print_(\"%s %d\" % (w, n + 2), file=fd)\n    if args.output:\n        fd.close()\n"
  },
  {
    "path": "egs/espnet_utils/text_norm.py",
    "content": "# author: tyriontian\n# tyriontian@tencent.com\n\nimport sys\nimport os\nimport jieba\nimport argparse\nimport cn2an\nfrom string import punctuation as en_pun\nfrom zhon.hanzi import punctuation as zh_pun\npun = en_pun + zh_pun\n\ndef remove_punc(s):\n    for c in pun:\n        s = s.replace(c, \"\")\n    return s\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n\ndef is_contain_chinese(check_str):\n    for ch in check_str:\n        if u'\\u4e00' <= ch <= u'\\u9fff':\n            return True\n    return False\n\ndef splice(s):\n    # the segmentation results are only on Chinese\n    # we do not splice English\n    buf = []\n    ans = []\n    for c in s:\n        if is_all_chinese(c):\n            if buf:\n                buf_str = \"\".join(buf)\n                buf = []\n                ans.append(buf_str)\n            ans.append(c)\n        else:\n            buf.append(c)\n\n    # incase the last c is not Chinese\n    if buf:\n        buf_str = \"\".join(buf)\n        ans.append(buf_str)\n\n    ans = \" \".join(ans)\n    return ans\n\ndigit_dict = {\"0\": \"零\",\n              \"1\": \"一\",\n              \"2\": \"二\",\n              \"3\": \"三\",\n              \"4\": \"四\",\n              \"5\": \"五\",\n              \"6\": \"六\",\n              \"7\": \"七\",\n              \"8\": \"八\",\n              \"9\": \"九\",}\ndef digit_norm(s):\n    out = \"\"\n    buf = \"\"\n    for c in s:\n        if not c.isdigit():\n            if buf:\n                try:\n                    digit_str = cn2an.an2cn(buf)\n                except:\n                    print(f\"cannot convert digit {buf}\")\n                    digit_str = \"\".join([digit_dict.get(x, \"\") for x in buf])\n                out += digit_str\n                buf = \"\"\n            out += c\n        else:\n            buf += c\n    if buf:\n        buf = cn2an.an2cn(buf)\n        out += 
buf\n    return out\n       \n\ndef remove_blank_chn(s):\n    s = s.strip()\n    out = \"\"\n    for i in range(len(s)):\n        if not s[i] == \" \":\n            out += s[i]\n        else:\n            a = is_all_chinese(s[i-1])\n            b = is_all_chinese(s[i+1])\n            # if not a and not b: \n            if not a or not b: # keep chn-eng <space> \n                out += s[i]\n    return out\n\ndef add_blank_boundary(s):\n    s = s.strip()\n    out = \"\"\n    for i in range(len(s) - 1):\n        out += s[i]\n        \n        a = is_all_chinese(s[i])\n        b = is_all_chinese(s[i+1])\n        if a ^ b:\n            out += \" \"\n    \n    out += s[-1]\n    out = out.strip()\n    return out\n\ndef split_eng_words(s):\n    out = []\n    for w in s.strip().split():\n        if is_all_chinese(w):\n            out.append(w)\n        else:\n            for c in w:\n                out.append(c)\n    out = \" \".join(out)\n    return out\n\ndef split_chn_words(s):\n    out = []\n    for w in s.strip().split():\n        if is_all_chinese(w):\n            out.extend(list(w))\n        else:\n            out.extend(w.split())\n    out = \" \".join(out)\n    return out\n\ndef upper_or_lower(s, upper=True):\n    if upper:\n        return s.upper()\n    else:\n        return s.lower()\n \ndef process_one_line(content, args):\n    # (1) remove punctuation\n    content = remove_punc(content)\n\n    # (2) remove ignore symbols\n    if args.ignore is not None:\n        ignores = args.ignore.split(\",\")\n        for c in ignores:\n            # str.replace returns a new string, so the result must be assigned back\n            content = content.replace(c, \"\")\n    \n    # (3) digit norm and upper/lower\n    content = digit_norm(content)\n    content = upper_or_lower(content, args.eng_upper)\n\n    # (4) remove all blank except those between eng words\n    #     This is for kaldi/text\n    content = remove_blank_chn(content)\n    \n    if args.segment_chn:\n        content = split_chn_words(content) \n\n    if not args.segment:\n        return 
content\n\n    # (5) split by jieba. There should be a blank\n    #     at any chn-eng boundary \n    else:\n        content = add_blank_boundary(content)\n        if args.segment_eng:\n            content = split_eng_words(content)\n        content = content.strip().split()\n        out = []\n        for p in content:\n            if is_all_chinese(p):\n                out.extend(jieba.lcut(p, HMM=False))\n            else:\n                out.append(p)\n        out = \" \".join(out)\n        return out\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Normalize the text\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\n        \"--in-f\", type=str, help=\"input file\")\n    parser.add_argument(\n        \"--out-f\", type=str, help=\"output file\")\n    parser.add_argument(\n        \"--freq-dict\", type=str, default=None, help=\"frequency dict\")\n    parser.add_argument(\n        \"--segment\", action='store_true', help=\"do segmentation in output file\")\n    parser.add_argument(\n        \"--segment-eng\", action='store_true', help=\"segment english into chars\")\n    parser.add_argument(\n        \"--segment-chn\", action='store_true', help=\"segment mandarin into chars\")\n    parser.add_argument(\n        \"--ignore\", type=str, default=None, help=\"symbol to remove in output file\")\n    parser.add_argument(\n        \"--eng-upper\", action='store_true', help=\"all english in upper class\")\n    return parser\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    if args.segment and args.freq_dict:\n        jieba.set_dictionary(args.freq_dict)\n\n    writer = open(args.out_f, 'w', encoding=\"utf-8\") \n    for line in open(args.in_f, encoding=\"utf-8\"):\n        elems = line.strip().split()\n\n        # we skip the empty string\n        if len(elems) == 1:\n            print(f\"Empty text found for {elems[0]}\")\n            continue\n\n        
uttid, content = elems[0], elems[1:]\n        content = \" \".join(content)\n\n        content = process_one_line(content, args)\n        writer.write(f\"{uttid} {content}\\n\")  \n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/trace_rnnt.py",
    "content": "# Author: Jinchuan Tian\n# tyriontian@tencent.com\n\n# A simple example to show the inference process of RNN-T modules\n# The only dependency for this script is pytorch (torch1.7.1+cuda101)\n#\n# run: python3 trace_rnnt.py <resources-dir>\n# egs: python3 trace_rnnt.py ./resources\n\n\nimport sys\nimport os\nimport torch\nimport json\nfrom argparse import Namespace\n\n# If you do not need the Espnet dependency, you can just copy the transducer directory\nfrom espnet.nets.pytorch_backend.transducer.custom_encoder import CustomEncoder\nfrom espnet.nets.pytorch_backend.transducer.rnn_decoder import DecoderRNNT\nfrom espnet.nets.pytorch_backend.transducer.joint_network import JointNetwork\n\n\ndef main():\n    \"\"\" parse configs \"\"\"\n    export_dir = sys.argv[1]\n    json_file = os.path.join(export_dir, \"model.json\")\n    idim, odim, args = json.load(open(json_file))\n    args = Namespace(**args) \n    device = torch.device(\"cuda\") # also works for CPU \n    \n    \"\"\" load modules \"\"\"\n    encoder = CustomEncoder(\n                idim,\n                args.enc_block_arch,\n                input_layer=args.custom_enc_input_layer,\n                repeat_block=args.enc_block_repeat,\n                self_attn_type=args.custom_enc_self_attn_type,\n                positional_encoding_type=args.custom_enc_positional_encoding_type,\n                positionwise_activation_type=args.custom_enc_pw_activation_type,\n                conv_mod_activation_type=args.custom_enc_conv_mod_activation_type,\n                aux_task_layer_list=args.aux_task_layer_list,\n    )\n    enc_pt = os.path.join(export_dir, \"encoder.pt\")\n    encoder.load_state_dict(torch.load(enc_pt))\n    encoder.eval().to(device)\n\n    decoder = DecoderRNNT(\n                odim,\n                args.dtype,\n                args.dlayers,\n                args.dunits,\n                args.char_list.index(\"<blank>\"),\n                args.dec_embed_dim,\n                
args.dropout_rate_decoder,\n                args.dropout_rate_embed_decoder,\n    )\n    dec_pt = os.path.join(export_dir, \"decoder.pt\")\n    decoder.load_state_dict(torch.load(dec_pt))\n    decoder.eval().to(device)\n\n    joint_network = JointNetwork(\n        odim, \n        encoder.enc_out, \n        args.dunits, \n        args.joint_dim, \n        args.joint_activation_type\n    )\n    joint_pt = os.path.join(export_dir, \"joint_net.pt\")\n    joint_network.load_state_dict(torch.load(joint_pt))\n    joint_network.eval().to(device)\n    print(\"INFO: Successfully load encoder, decoder and joint-network\")\n\n    \"\"\" Module Inference \"\"\"\n    B = 2                        # Batch_size\n    T = 400                      # Maximum time index\n    U = 4                        # Maximum word index\n    enc_idim = idim\n    enc_odim = encoder.enc_out\n    n_vocab = odim\n    dec_odim = args.dunits\n\n    # For batch-inference, you may want to pass masks to the encoder and call it like \n    # 'encoder(enc_in, masks)'. 
In this case, the paddings will not be considered.\n    # See espnet/nets/pytorch_backend/nets_utils.py:make_non_pad_mask for details.\n    # but it's ok if B = 1.\n    enc_in = torch.rand([B, T, enc_idim]).to(device)\n    enc_out, _  = encoder(enc_in, None)\n    print(\"encoder_out size: \", enc_out.size())  # enc_out: [B, sub(T), enc_odim], T is sub-sampled by a factor of 6\n     \n    # decoder inference\n    decoder.set_device(enc_out.device) # needed before inference\n    decoder.set_data_type(enc_out.dtype) # needed before inference\n    # The LSTM should work as long as the 'ey' is consistent with 'states'.\n    # So you may use a cache and a state-select method to save computation.\n    states = decoder.init_state(B)\n    for _ in range(U):\n        tokens = torch.randint(low=0, high=n_vocab, size=[B, 1]).to(device)\n        ey = decoder.embed(tokens)\n        dec_out, states = decoder.rnn_forward(ey, states)\n    print(\"decoder_out size: \", dec_out.size()) # dec_out: [B, 1, dec_odim]\n\n    # joint-network inference\n    # It is safe to feed two 4-dim tensors.\n    # However, the joint network should work as long as two conditions are met.\n    # (1) element-wise addition of enc_out and dec_out will not raise a shape error (broadcasting is allowed)\n    # (2) enc_out.size()[-1] == dec_out.size()[-1] == size_of_joint_net\n    # The size of the output should be the same as enc_out except for the last dimension:\n    # the last dimension is n_vocab\n    enc_out = enc_out.unsqueeze(2) # [B, sub(T), 1, enc_odim]\n    dec_out = dec_out.unsqueeze(1) # [B, 1, U, dec_odim]\n    joint_out = joint_network(enc_out, dec_out)\n    print(\"joint_out size: \", joint_out.size()) # [B, sub(T), U, n_vocab] \n    \n    # the output distribution is over this char_list: args.char_list\n \nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/train_lms_srilm.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)\n#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)\n# Apache 2.0\n\nunk=\"<unk>\"\nlm_opts=\"-wbdiscount\"\norder=3\n\n. ./path.sh\n. ./utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"train_lms.sh <lexicon> <word-segmented-text> <dir>\"\n  echo \" e.g train_lms.sh data/local/dict/lexicon.txt data/local/train/text data/local/lm\"\n  echo $@\n  exit 1;\nfi\n\nlexicon=$1\ntext=$2\ndir=$3\n\nfor f in \"$text\" \"$lexicon\"; do\n  [ ! -f \"$f\" ] && echo \"$0: No such file $f\" && exit 1;\ndone\n\nkaldi_lm=`which train_lm.sh`\nif [ -z \"$kaldi_lm\" ]; then\n  echo \"$0: train_lm.sh is not found. That might mean it's not installed\"\n  echo \"$0: or it is not added to PATH\"\n  echo \"$0: Use the script tools/extras/install_kaldi_lm.sh to install it\"\n  exit 1\nfi\n\nmkdir -p $dir\ncleantext=$dir/text.no_oov\n\ncat $text | awk -v lex=$lexicon 'BEGIN{while((getline<lex) >0){ seen[$1]=1; } }\n  {for(n=1; n<=NF;n++) {  if (seen[$n]) { printf(\"%s \", $n); } else {printf(\"'$unk' \");} } printf(\"\\n\");}' \\\n  > $cleantext || exit 1;\n\ncat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | sort | uniq -c | \\\n   sort -nr > $dir/word.counts || exit 1;\n\n# Get counts from acoustic training transcripts, and add  one-count\n# for each word in the lexicon (but not silence, we don't want it\n# in the LM-- we'll add it optionally later).\ncat $cleantext | awk '{for(n=2;n<=NF;n++) print $n; }' | \\\n  cat - <(grep -w -v '!SIL' $lexicon | awk '{print $1}') | \\\n   sort | uniq -c | sort -nr > $dir/unigram.counts || exit 1;\n\n# note: we probably won't really make use of <unk> as there aren't any OOVs\ncat $dir/unigram.counts  | awk '{print $2}' | get_word_map.pl \"<s>\" \"</s>\" $unk > $dir/word_map \\\n   || exit 1;\n\n# note: ignore 1st field of train.txt, it's the utterance-id.\ncat $cleantext | awk -v wmap=$dir/word_map 
'BEGIN{while((getline<wmap)>0)map[$1]=$2;}\n  { for(n=2;n<=NF;n++) { printf map[$n]; if(n<NF){ printf \" \"; } else { print \"\"; }}}' | gzip -c >$dir/train.gz \\\n   || exit 1;\n\n# From here is some commands to do a baseline with SRILM (assuming\n# you have it installed).\nheldout_sent=200 # Don't change this if you want result to be comparable with\n    # kaldi_lm results\nsdir=$dir/srilm # in case we want to use SRILM to double-check perplexities.\nmkdir -p $sdir\ncat $cleantext | awk '{for(n=2;n<=NF;n++){ printf $n; if(n<NF) printf \" \"; else print \"\"; }}' | \\\n  head -$heldout_sent > $sdir/heldout\ncat $cleantext | awk '{for(n=2;n<=NF;n++){ printf $n; if(n<NF) printf \" \"; else print \"\"; }}' | \\\n  tail -n +$heldout_sent > $sdir/train\n\ncat $dir/word_map | awk '{print $1}' | cat - <(echo \"<s>\"; echo \"</s>\" ) > $sdir/wordlist\n\nngram-count -text $sdir/train -order $order -limit-vocab -vocab $sdir/wordlist -unk \\\n  -map-unk $unk $lm_opts -interpolate -lm $sdir/srilm.o3g.kn.gz\nngram -lm $sdir/srilm.o3g.kn.gz -ppl $sdir/heldout\n# 0 zeroprobs, logprob= -250954 ppl= 90.5091 ppl1= 132.482\n\n# Note: perplexity SRILM gives to Kaldi-LM model is same as kaldi-lm reports above.\n# Difference in WSJ must have been due to different treatment of <UNK>.\n# ngram -lm $dir/3gram-mincount/lm_unpruned.gz  -ppl $sdir/heldout\n# 0 zeroprobs, logprob= -250913 ppl= 90.4439 ppl1= 132.379\n\necho \"local/train_lms.sh succeeded\"\nexit 0\n"
  },
  {
    "path": "egs/espnet_utils/translate_wav.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2020 The ESPnet Authors.\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nif [ ! -f path.sh ] || [ ! -f cmd.sh ]; then\n    echo \"Please change current directory to recipe directory e.g., egs/tedlium2/asr1\"\n    exit 1\nfi\n\n. ./path.sh\n\n# general configuration\nstage=0        # start from 0 if you need to start from data preparation\nstop_stage=100\nngpu=0         # number of gpus (\"0\" uses cpu, otherwise use gpu)\ndebugmode=1\ndetokenize=true\nverbose=1      # verbose option\n\n# feature configuration\ndo_delta=false\ncmvn=\n\n# decoding parameter\ntrans_model=\ndecode_config=\ndecode_dir=decode\napi=v2\n\n# download related\nmodels=must_c.transformer.v1.en-fr\n\nhelp_message=$(cat <<EOF\nUsage:\n    $0 [options] <wav_file>\n\nOptions:\n    --ngpu <ngpu>                   # Number of GPUs (Default: 0)\n    --decode_dir <directory_name>   # Name of directory to store decoding temporary data\n    --models <model_name>           # Model name (e.g. must_c.transformer.v1.en-fr)\n    --cmvn <path>                   # Location of cmvn.ark\n    --trans_model <path>            # Location of E2E model\n    --decode_config <path>          # Location of configuration file\n    --api <api_version>             # API version (v1 or v2)\n\nExample:\n    # Record audio from microphone input as example.wav\n    rec -c 1 -r 16000 example.wav trim 0 5\n\n    # Decode using model name\n    $0 --models must_c.transformer.v1.en-fr example.wav\n\n    # Decode using model file\n    $0 --cmvn cmvn.ark --trans_model model.acc.best --decode_config conf/decode.yaml example.wav\n\n    # Decode with GPU (require batchsize > 0 in configuration file)\n    $0 --ngpu 1 example.wav\n\nAvailable models:\n    - must_c.transformer.v1.en-fr\n    - fisher_callhome_spanish.transformer.v1.es-en\nEOF\n)\n. utils/parse_options.sh || exit 1;\n\n# make shellcheck happy\ntrain_cmd=\ndecode_cmd=\n\n. 
./cmd.sh\n\nwav=$1\ndownload_dir=${decode_dir}/download\n\nif [ $# -lt 1 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -e\nset -u\nset -o pipefail\n\n# Check model name or model file is set\nif [ -z $models ]; then\n    if [[ -z $cmvn || -z $trans_model || -z $decode_config ]]; then\n        echo \"Error: models or set of cmvn, trans_model and decode_config are required.\" >&2\n        exit 1\n    fi\nfi\n\ndir=${download_dir}/${models}\nmkdir -p ${dir}\n\nfunction download_models () {\n    if [ -z $models ]; then\n        return\n    fi\n    case \"${models}\" in\n        # TODO(karita): register more models\n        \"must_c.transformer.v1.en-fr\") share_url=\"https://drive.google.com/open?id=1wFIAqxoBUioTKTLRLv29KzvphkUm3qdo\" ;;\n        \"fisher_callhome_spanish.transformer.v1.es-en\") share_url=\"https://drive.google.com/open?id=1hawp5ZLw4_SIHIT3edglxbKIIkPVe8n3\" ;;\n        *) echo \"No such models: ${models}\"; exit 1 ;;\n    esac\n\n    if [ ! -e ${dir}/.complete ]; then\n        download_from_google_drive.sh ${share_url} ${dir} \".tar.gz\"\n        touch ${dir}/.complete\n    fi\n}\n\n# Download trained models\nif [ -z \"${cmvn}\" ]; then\n    download_models\n    cmvn=$(find ${download_dir}/${models} -name \"cmvn.ark\" | head -n 1)\nfi\nif [ -z \"${trans_model}\" ]; then\n    download_models\n    trans_model=$(find ${download_dir}/${models} -name \"model*.best*\" | head -n 1)\nfi\nif [ -z \"${decode_config}\" ]; then\n    download_models\n    decode_config=$(find ${download_dir}/${models} -name \"decode*.yaml\" | head -n 1)\nfi\nif [ -z \"${wav}\" ]; then\n    download_models\n    wav=$(find ${download_dir}/${models} -name \"*.wav\" | head -n 1)\nfi\n\n# Check file existence\nif [ ! -f \"${cmvn}\" ]; then\n    echo \"No such CMVN file: ${cmvn}\"\n    exit 1\nfi\nif [ ! -f \"${trans_model}\" ]; then\n    echo \"No such E2E model: ${trans_model}\"\n    exit 1\nfi\nif [ ! 
-f \"${decode_config}\" ]; then\n    echo \"No such config file: ${decode_config}\"\n    exit 1\nfi\nif [ ! -f \"${wav}\" ]; then\n    echo \"No such WAV file: ${wav}\"\n    exit 1\nfi\n\nbase=$(basename $wav .wav)\ndecode_dir=${decode_dir}/${base}\n\nif [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n    echo \"stage 0: Data preparation\"\n\n    mkdir -p ${decode_dir}/data\n    echo \"$base $wav\" > ${decode_dir}/data/wav.scp\n    echo \"X $base\" > ${decode_dir}/data/spk2utt\n    echo \"$base X\" > ${decode_dir}/data/utt2spk\n    echo \"$base X\" > ${decode_dir}/data/text\nfi\n\nif [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n    echo \"stage 1: Feature Generation\"\n\n    steps/make_fbank_pitch.sh --cmd \"$train_cmd\" --nj 1 --write_utt2num_frames true \\\n        ${decode_dir}/data ${decode_dir}/log ${decode_dir}/fbank\n\n    feat_trans_dir=${decode_dir}/dump; mkdir -p ${feat_trans_dir}\n    dump.sh --cmd \"$train_cmd\" --nj 1 --do_delta ${do_delta} \\\n        ${decode_dir}/data/feats.scp ${cmvn} ${decode_dir}/log \\\n        ${feat_trans_dir}\nfi\n\nif [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n    echo \"stage 2: Json Data Preparation\"\n\n    dict=${decode_dir}/dict\n    echo \"<unk> 1\" > ${dict}\n    feat_trans_dir=${decode_dir}/dump\n    data2json.sh --feat ${feat_trans_dir}/feats.scp \\\n        ${decode_dir}/data ${dict} > ${feat_trans_dir}/data.json\n    rm -f ${dict}\nfi\n\nif [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n    echo \"stage 3: Decoding\"\n    feat_trans_dir=${decode_dir}/dump\n\n\n    ${decode_cmd} ${decode_dir}/log/decode.log \\\n        st_trans.py \\\n        --config ${decode_config} \\\n        --ngpu ${ngpu} \\\n        --backend pytorch \\\n        --debugmode ${debugmode} \\\n        --verbose ${verbose} \\\n        --trans-json ${feat_trans_dir}/data.json \\\n        --result-label ${decode_dir}/result.json \\\n        --model ${trans_model} \\\n        --api ${api}\n\n    echo \"\"\n    
trans_text=$(grep rec_text ${decode_dir}/result.json | sed -e 's/.*: \"\\(.*\\)\".*/\\1/' | sed -e 's/<eos>//')\n    if $detokenize; then\n        trans_text=$(echo ${trans_text} | sed -e 's/▁/ /g' | detokenizer.perl)\n    fi\n    echo \"Translated text: ${trans_text}\"\n    echo \"\"\n    echo \"Finished\"\nfi\n"
  },
  {
    "path": "egs/espnet_utils/trim_silence.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport argparse\nimport codecs\nimport logging\nimport os\n\nimport kaldiio\nimport librosa\nimport matplotlib.pyplot as plt\nimport numpy\n\nfrom espnet.utils.cli_utils import get_commandline_args\n\n\ndef _time_to_str(time_idx):\n    time_idx = time_idx * 10 ** 4\n    return \"%06d\" % time_idx\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"Trim slience with simple power thresholding \"\n        \"and make segments file.\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n    )\n    parser.add_argument(\"--fs\", type=int, help=\"Sampling frequency\")\n    parser.add_argument(\n        \"--threshold\", type=float, default=60, help=\"Threshold in decibels\"\n    )\n    parser.add_argument(\n        \"--win_length\", type=int, default=1024, help=\"Analisys window length in point\"\n    )\n    parser.add_argument(\n        \"--shift_length\", type=int, default=256, help=\"Shift length in point\"\n    )\n    parser.add_argument(\n        \"--min_silence\", type=float, default=0.01, help=\"minimum silence length\"\n    )\n    parser.add_argument(\n        \"--figdir\", type=str, default=\"figs\", help=\"Directory to save figures\"\n    )\n    parser.add_argument(\"--verbose\", \"-V\", default=0, type=int, help=\"Verbose option\")\n    parser.add_argument(\n        \"--normalize\",\n        choices=[1, 16, 24, 32],\n        type=int,\n        default=None,\n        help=\"Give the bit depth of the PCM, \"\n        \"then normalizes data to scale in [-1,1]\",\n    )\n    parser.add_argument(\"rspecifier\", type=str, help=\"WAV scp file\")\n    parser.add_argument(\"wspecifier\", type=str, help=\"Segments file\")\n\n    return parser\n\n\ndef main():\n    parser = get_parser()\n    args = parser.parse_args()\n\n    # set logger\n    
logfmt = \"%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s\"\n    if args.verbose > 0:\n        logging.basicConfig(level=logging.INFO, format=logfmt)\n    else:\n        logging.basicConfig(level=logging.WARN, format=logfmt)\n    logging.info(get_commandline_args())\n\n    if not os.path.exists(args.figdir):\n        os.makedirs(args.figdir)\n\n    with kaldiio.ReadHelper(args.rspecifier) as reader, codecs.open(\n        args.wspecifier, \"w\", encoding=\"utf-8\"\n    ) as f:\n        for utt_id, (rate, array) in reader:\n            assert rate == args.fs\n            array = array.astype(numpy.float32)\n            if args.normalize is not None and args.normalize != 1:\n                array = array / (1 << (args.normalize - 1))\n            array_trim, idx = librosa.effects.trim(\n                y=array,\n                top_db=args.threshold,\n                frame_length=args.win_length,\n                hop_length=args.shift_length,\n            )\n            start, end = idx / args.fs\n\n            # save figure\n            plt.subplot(2, 1, 1)\n            plt.plot(array)\n            plt.title(\"Original\")\n            plt.subplot(2, 1, 2)\n            plt.plot(array_trim)\n            plt.title(\"Trim\")\n            plt.tight_layout()\n            plt.savefig(args.figdir + \"/\" + utt_id + \".png\")\n            plt.close()\n\n            # added minimum silence part\n            start = max(0.0, start - args.min_silence)\n            end = min(len(array) / args.fs, end + args.min_silence)\n\n            # write to segments file\n            segment = \"%s %s %f %f\\n\" % (utt_id, utt_id, start, end)\n            f.write(segment)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/trim_silence.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nfs=16000\nwin_length=1024\nshift_length=256\nthreshold=60\nmin_silence=0.01\nnormalize=16\ncmd=run.pl\nnj=32\n\nhelp_message=$(cat <<EOF\nUsage: $0 [options] <data-dir> <log-dir>\ne.g.: $0 data/train exp/trim_silence/train\nOptions:\n  --fs <fs>                      # sampling frequency (default=16000)\n  --win_length <win_length>      # window length in point (default=1024)\n  --shift_length <shift_length>  # shift length in point (default=256)\n  --threshold <threshold>        # power threshold in db (default=60)\n  --min_silence <sec>            # minimum silence lenght in sec (default=0.01)\n  --normalize <bit>              # audio bit (default=16)\n  --cmd <cmd>                    # how to run jobs (default=run.pl)\n  --nj <nj>                      # number of parallel jobs (default=32)\nEOF\n)\n\n. utils/parse_options.sh || exit 1;\n\nif [ ! 
$# -eq 2 ]; then\n    echo \"${help_message}\"\n    exit 1;\nfi\n\nset -euo pipefail\ndata=$1\nlogdir=$2\n\ntmpdir=$(mktemp -d ${data}/tmp-XXXX)\nsplit_scps=\"\"\nfor n in $(seq ${nj}); do\n    split_scps=\"${split_scps} ${tmpdir}/wav.${n}.scp\"\ndone\nutils/split_scp.pl ${data}/wav.scp ${split_scps} || exit 1;\n\n# make segments file describing start and end time\n${cmd} JOB=1:${nj} ${logdir}/trim_silence.JOB.log \\\n    MPLBACKEND=Agg trim_silence.py \\\n        --fs ${fs} \\\n        --win_length ${win_length} \\\n        --shift_length ${shift_length} \\\n        --threshold ${threshold} \\\n        --min_silence ${min_silence} \\\n        --normalize ${normalize} \\\n        --figdir ${logdir}/figs \\\n        scp:${tmpdir}/wav.JOB.scp \\\n        ${tmpdir}/segments.JOB\n\n# concatenate segments\nfor n in $(seq ${nj}); do\n    cat ${tmpdir}/segments.${n} || exit 1;\ndone > ${data}/segments || exit 1\nrm -rf ${tmpdir}\n\n# check\nutils/validate_data_dir.sh --no-feats ${data}\necho \"Successfully trimmed silence part.\"\n"
  },
  {
    "path": "egs/espnet_utils/trn2ctm.py",
    "content": "#!/usr/bin/python\n\nimport argparse\nimport codecs\nimport math\nimport re\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"convert trn to ctm\")\n    parser.add_argument(\"trn\", type=str, default=None, nargs=\"?\", help=\"input trn\")\n    parser.add_argument(\"ctm\", type=str, default=None, nargs=\"?\", help=\"output ctm\")\n    return parser\n\n\ndef main(args):\n    args = get_parser().parse_args(args)\n    convert(args.trn, args.ctm)\n\n\ndef convert(trn=None, ctm=None):\n    if trn is not None:\n        with codecs.open(trn, \"r\", encoding=\"utf-8\") as trn:\n            content = trn.readlines()\n    else:\n        trn = codecs.getreader(\"utf-8\")(sys.stdin if is_python2 else sys.stdin.buffer)\n        content = trn.readlines()\n    split_content = []\n    for i, line in enumerate(content):\n        idx = line.rindex(\"(\")\n        split = [line[:idx].strip().upper(), line[idx + 1 :].strip()[:-1]]\n        while \"((\" in split[0]:\n            split[0] = split[0].replace(\"((\", \"(\")\n        while \"  \" in split[0]:\n            split[0] = split[0].replace(\"  \", \" \")\n        segm_info = re.split(\"[-_]\", split[1])\n        segm_info = [s.strip() for s in segm_info]\n        col1 = segm_info[0] + \"_\" + segm_info[1]\n        col2 = segm_info[2]\n        start_time_int = int(segm_info[6])\n        end_time_int = int(segm_info[7])\n        diff_int = end_time_int - start_time_int\n        word_split = split[0].split(\" \")\n        word_split = list(\n            filter(lambda x: len(x) > 0 and any([c != \" \" for c in x]), word_split)\n        )\n        if len(word_split) > 0:\n            step_int = int(math.floor(float(diff_int) / len(word_split)))\n            step = str(step_int)\n            for j, word in enumerate(word_split):\n                start_time = str(int(start_time_int + step_int * j))\n                col3 = (\n                    
(start_time[:-2] if len(start_time) > 2 else \"0\")\n                    + \".\"\n                    + (start_time[-2:] if len(start_time) > 1 else \"00\")\n                )\n                if j == len(word_split) - 1:\n                    diff = str(int(end_time_int - int(start_time)))\n                else:\n                    diff = step\n                col4 = (diff[:-2] if len(diff) > 2 else \"0\") + \".\" + diff[-2:]\n                segm_info = [col1, col2, col3, col4]\n                split_content.append(\" \".join(segm_info) + \"  \" + word)\n    if ctm is not None:\n        sys.stdout = codecs.open(ctm, \"w\", encoding=\"utf-8\")\n    else:\n        sys.stdout = codecs.getwriter(\"utf-8\")(\n            sys.stdout if is_python2 else sys.stdout.buffer\n        )\n    for c_line in split_content:\n        print(c_line)\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/trn2stm.py",
    "content": "#!/usr/bin/python\n\nimport argparse\nimport codecs\nimport re\nimport sys\n\nis_python2 = sys.version_info[0] == 2\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"convert trn to stm\")\n    parser.add_argument(\n        \"--orig-stm\",\n        type=str,\n        default=None,\n        nargs=\"?\",\n        help=\"Original stm file to add additional information to the generated one\",\n    )\n    parser.add_argument(\"trn\", type=str, default=None, nargs=\"?\", help=\"input trn\")\n    parser.add_argument(\"stm\", type=str, default=None, nargs=\"?\", help=\"output stm\")\n    return parser\n\n\ndef main(args):\n    args = get_parser().parse_args(args)\n    convert(args.trn, args.stm, args.orig_stm)\n\n\ndef convert(trn=None, stm=None, orig_stm=None):\n    if orig_stm is not None:\n        with codecs.open(orig_stm, \"r\", encoding=\"utf-8\") as orig_stm:\n            orig_content = orig_stm.readlines()\n            has_orig = True\n            header = []\n            content = []\n            for line in orig_content:\n                (header if line.startswith(\";;\") else content).append(line.strip())\n            del orig_content\n            content = [x.split(\" \") for x in content]\n            mapping = {}\n            for x in content:\n                mapping[x[2]] = x[5]\n            del content\n    else:\n        has_orig = False\n        header = None\n        mapping = None\n\n    if trn is not None:\n        with codecs.open(trn, \"r\", encoding=\"utf-8\") as trn:\n            content = trn.readlines()\n    else:\n        trn = codecs.getreader(\"utf-8\")(sys.stdin if is_python2 else sys.stdin.buffer)\n        content = trn.readlines()\n\n    for i, line in enumerate(content):\n        idx = line.rindex(\"(\")\n        split = [line[:idx].strip().upper() + \" \", line[idx + 1 :].strip()[:-1]]\n        while \"((\" in split[0]:\n            split[0] = split[0].replace(\"((\", \"(\")\n        while \"  \" 
in split[0]:\n            split[0] = split[0].replace(\"  \", \" \")\n        segm_info = re.split(\"[-_]\", split[1])\n        segm_info = [s.strip() for s in segm_info]\n        col1 = segm_info[0] + \"_\" + segm_info[1]\n        col2 = segm_info[2]\n        col3 = segm_info[3] + \"_\" + segm_info[4] + \"_\" + segm_info[5]\n        start_time = str(int(segm_info[6]))\n        end_time = str(int(segm_info[7]))\n        col4 = (\n            (start_time[:-2] if len(start_time) > 2 else \"0\")\n            + \".\"\n            + (start_time[-2:] if len(start_time) > 1 else \"00\")\n        )\n        col5 = (\n            (end_time[:-2] if len(end_time) > 2 else \"0\")\n            + \".\"\n            + (end_time[-2:] if len(end_time) > 1 else \"00\")\n        )\n        col6 = mapping[col3] if has_orig else \"\"\n        segm_info = [col1, col2, col3, col4, col5, col6]\n        content[i] = \" \".join(segm_info) + \"  \" + split[0]\n    if stm is not None:\n        sys.stdout = codecs.open(stm, \"w\", encoding=\"utf-8\")\n    else:\n        sys.stdout = codecs.getwriter(\"utf-8\")(\n            sys.stdout if is_python2 else sys.stdout.buffer\n        )\n    if has_orig:\n        for h_line in header:\n            print(h_line)\n    for c_line in content:\n        print(c_line)\n\n\nif __name__ == \"__main__\":\n    main(sys.argv[1:])\n"
  },
  {
    "path": "egs/espnet_utils/update_json.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2020 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\necho \"$0 $*\" >&2 # Print the command line for logging\n. ./path.sh\n\nnlsyms=\"\"\noov=\"<unk>\"\nbpecode=\"\"\nverbose=0\ntrans_type=char\n\ntext=\"\"\nmultilingual=false\n\nhelp_message=$(cat << EOF\nUsage: $0 <json> <data-dir> <dict>\ne.g. $0 data/train data/lang_1char/train_units.txt\nOptions:\n  --oov <oov-word>                                 # Default: <unk>\n  --verbose <num>                                  # Default: 0\nEOF\n)\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n    echo \"${help_message}\" 1>&2\n    exit 1;\nfi\n\nset -euo pipefail\n\njson=$1\ndir=$2\ndic=$3\njson_dir=$(dirname ${json})\ntmpdir=$(mktemp -d ${dir}/tmp-XXXXX)\ntrap 'rm -rf ${tmpdir}' EXIT\n\nif [ -z ${text} ]; then\n    text=${dir}/text\nfi\n\n# 2. Create scp files for outputs\nmkdir -p ${tmpdir}/output\nif [ -n \"${bpecode}\" ]; then\n    if [ ${multilingual} = true ]; then\n        # remove a space before the language ID\n        paste -d \" \" <(awk '{print $1}' ${text}) <(cut -f 2- -d\" \" ${text} \\\n            | spm_encode --model=${bpecode} --output_format=piece | cut -f 2- -d\" \") \\\n            > ${tmpdir}/output/token.scp\n    else\n        paste -d \" \" <(awk '{print $1}' ${text}) <(cut -f 2- -d\" \" ${text} \\\n            | spm_encode --model=${bpecode} --output_format=piece) \\\n            > ${tmpdir}/output/token.scp\n    fi\nelif [ -n \"${nlsyms}\" ]; then\n    text2token.py -s 1 -n 1 -l ${nlsyms} ${text} --trans_type ${trans_type} > ${tmpdir}/output/token.scp\nelse\n    text2token.py -s 1 -n 1 ${text} --trans_type ${trans_type} > ${tmpdir}/output/token.scp\nfi\n< ${tmpdir}/output/token.scp utils/sym2int.pl --map-oov ${oov} -f 2- ${dic} > ${tmpdir}/output/tokenid.scp\nawk '{print $1 \" \" NF-1}' ${tmpdir}/output/tokenid.scp > ${tmpdir}/output/olen.scp\n# +2 comes from CTC blank and 
EOS\nvocsize=$(tail -n 1 ${dic} | awk '{print $2}')\nodim=$(echo \"$vocsize + 2\" | bc)\nawk -v odim=${odim} '{print $1 \" \" odim}' ${text} > ${tmpdir}/output/odim.scp\n\ncat ${text} > ${tmpdir}/output/text.scp\n\n\n# 4. Create JSON files from each scp files\nrm -f ${tmpdir}/*/*.json\nfor x in \"${tmpdir}\"/output/*.scp; do\n    k=$(basename ${x} .scp)\n    < ${x} scp2json.py --key ${k} > ${tmpdir}/output/${k}.json\ndone\n\n# add to json\naddjson.py --verbose ${verbose} -i false \\\n  ${json} ${tmpdir}/output/text.json ${tmpdir}/output/token.json ${tmpdir}/output/tokenid.json ${tmpdir}/output/olen.json ${tmpdir}/output/odim.json > ${tmpdir}/data.json\nmkdir -p ${json_dir}/.backup\necho \"json updated. original json is kept in ${json_dir}/.backup.\"\ncp ${json} ${json_dir}/.backup/\"$(basename ${json})\"\ncp ${tmpdir}/data.json ${json}\n\nrm -fr ${tmpdir}\n"
  },
  {
    "path": "egs/espnet_utils/word_ngram_rescore.py",
    "content": "# author: tyriontian\n# tyriontian@tencent.com\n\nimport json\nimport sys\nimport codecs\nimport torch\n\nfrom espnet.nets.scorers.word_ngram import WordNgram\n\ndef score_texts(ngram, texts, ignore_strs=[\"<eos>\", \" \"]):\n    for s in ignore_strs:\n        texts = [t.replace(s, \"\") for t in texts]\n    return ngram.score_texts(texts).cpu().tolist()\n\ndef main():\n    js_file = sys.argv[1]\n    ngram_dir = sys.argv[2]\n    weight = float(sys.argv[3])\n    js_file_tgt = sys.argv[4]\n\n    # read json\n    with codecs.open(js_file, \"r\", encoding=\"utf-8\") as f:\n        js = json.load(f)\n        js = js[\"utts\"]\n    \n    # load word-level N-gram LM\n    device = torch.device(\"cpu\")\n    ngram = WordNgram(ngram_dir, device)\n\n    # rescore each hypothesis and sort\n    for name in js.keys():\n        hyp_lst = js[name][\"output\"]\n\n        texts = []\n        for hyp in hyp_lst:\n            texts.append(hyp[\"rec_text\"])\n\n        text_scores = score_texts(ngram, texts)\n        \n        for j, hyp in enumerate(hyp_lst):\n            hyp_lst[j][\"score\"] += text_scores[j] * weight\n            hyp_lst[j][\"word_ngram_score\"] = text_scores[j]\n\n        hyp_lst.sort(key=lambda hyp: hyp[\"score\"], reverse=True)\n        js[name][\"output\"] = hyp_lst\n\n    js = {\"utts\": js}\n\n    # write new json\n    with open(js_file_tgt, \"wb\") as f:\n        f.write(\n            json.dumps(\n                js, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    \n    \n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/espnet_utils/word_ngram_rescore.sh",
    "content": "decode_dir=$1\nword_ngram=$2\ndict=$3\n\n. utils/parse_options.sh || exit 1;\n\nmkdir -p $decode_dir/rescore\n\nfor w in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9; do\n    (mkdir -p $decode_dir/word_ngram_rescore/$w; subdir=$decode_dir/word_ngram_rescore/$w\n    python3 espnet_utils/word_ngram_rescore.py $decode_dir/data.json \\\n      $word_ngram $w $subdir/data.1.json\n    bash espnet_utils/score_sclite.sh $subdir $dict \\\n      > $subdir/decode_result.txt) & \ndone\nwait \n"
  },
  {
    "path": "egs/steps/align_basis_fmllr.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Copyright 2013  GoVivace Inc (Author: Nagendra Goel)\n# Apache 2.0\n\n# Computes training alignments; assumes features are (LDA+MLLT or delta+delta-delta)\n# + fMLLR (probably with SAT models).\n# It first computes an alignment with the final.alimdl (or the final.mdl if final.alimdl\n# is not present), then does 2 iterations of fMLLR estimation.\n\n# If you supply the --use-graphs option, it will use the training\n# graphs from the source directory (where the model is).  In this\n# case the number of jobs must match the source directory.\n\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbasis_fmllr_opts=\"--fmllr-min-count=22  --num-iters=10 --size-scale=0.2 --step-size-iters=3\"\nbeam=10\nretry_beam=40\nboost_silence=1.5 # factor by which to boost silence during alignment.\nfmllr_update_type=full\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_basis_fmllr.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_basis_fmllr.sh data/train data/lang exp/tri4 exp/tri4_ali\"\n   echo \"Note: <src-dir> should ideally have been trained by steps/train_sat_basis.sh, or\"\n   echo \"if a non-SAT system (not recommended), the basis should have been computed\"\n   echo \"by steps/get_fmllr_basis.sh.\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --fmllr-update-type (full|diag|offset|none)      # default full.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\ngraphdir=$dir\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsdata=$data/split$nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n\nfor f in $srcdir/tree  $srcdir/final.mdl $srcdir/fmllr.basis \\\n                       $data/feats.scp $lang/phones.txt; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: expected file $f to exist\"\n    exit 1\n  fi\ndone\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\ncp $srcdir/delta_opts $dir 2>/dev/null\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! 
-f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; then\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-basis-fmllr-gpost $basis_fmllr_opts --spk2utt=ark:$sdata/JOB/spk2utt \\\n        $mdl $srcdir/fmllr.basis \"$sifeats\"  ark,s,cs:- \\\n        ark:$dir/trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-basis-fmllr $basis_fmllr_opts --spk2utt=ark:$sdata/JOB/spk2utt \\\n         $mdl $srcdir/fmllr.basis \"$sifeats\" \\\n        ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  echo \"$0: doing final 
alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\n#rm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/align_basis_fmllr_lats.sh",
    "content": "#!/usr/bin/env bash\n#\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Version of align_fmllr_lats.sh that uses \"basis fMLLR\", so it is suitable for\n# situations where there is very little data per speaker (e.g. when there is a\n# one-to-one mapping between utterances and speakers).  Intended for use where\n# the model was trained with basis-fMLLR (i.e.  when you trained the model with\n# train_sat_basis.sh where you normally would have trained with train_sat.sh),\n# or when it was trained with SAT but you ran get_fmllr_basis.sh on the\n# source-model directory.\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nacoustic_scale=0.1\nbeam=10\nretry_beam=40\nfinal_beam=20  # For the lattice-generation phase there is no retry-beam.  This\n               # is a limitation of gmm-latgen-faster.  We just use an\n               # intermediate beam.  We'll lose a little data and it will be\n               # slightly slower.  (however, the min-active of 200 that\n               # gmm-latgen-faster defaults to may help.)\nboost_silence=1.0 # factor by which to boost silence during alignment.\nbasis_fmllr_opts=\"--fmllr-min-count=22  --num-iters=10 --size-scale=0.2 --step-size-iters=3\"\n\ngenerate_ali_from_lats=false # If true, alingments generated from lattices.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_basis_fmllr_lats.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_basis_fmllr_lats.sh data/train data/lang exp/tri1 exp/tri1_lats\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nif [ ! -f $srcdir/fmllr.basis ]; then\n  echo \"$0: expected $srcdir/fmllr.basis to exist.   Run get_fmllr_basis.sh on $srcdir.\"\n  exit 1;\nfi\n\nfor f in $data/feats.scp $lang/phones.txt $srcdir/final.mdl; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1\ndone\n\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsdata=$data/split$nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.alimdl $dir 2>/dev/null\ncp $srcdir/final.occs $dir;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\ncp $srcdir/delta_opts $dir 2>/dev/null\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp 
ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n    cp $srcdir/full.mat $dir 2>/dev/null\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n\n## because gmm-latgen-faster doesn't support adding the transition-probs to the\n## graph itself, we need to bake them into the compiled graphs.  
This means we can't reuse previously compiled graphs,\n## because the other scripts write them without transition probs.\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling training graphs\"\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $scale_opts $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\n\nif [ $stage -le 1 ]; then\n  # Note: we need to set --transition-scale=0.0 --self-loop-scale=0.0 because,\n  # as explained above, we compiled the transition probs into the training\n  # graphs.\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled --transition-scale=0.0 --self-loop-scale=0.0 --acoustic-scale=$acoustic_scale \\\n        --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; then\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-basis-fmllr-gpost $basis_fmllr_opts \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl $srcdir/fmllr.basis \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-basis-fmllr $basis_fmllr_opts \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl $srcdir/fmllr.basis 
\"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  # Warning: gmm-latgen-faster doesn't support a retry-beam so you may get more\n  # alignment errors (however, it does have a default min-active=200 so this\n  # will tend to reduce alignment errors).\n  # --allow_partial=false makes sure we reach the end of the decoding graph.\n  # --word-determinize=false makes sure we retain the alternative pronunciations of\n  #   words (including alternatives regarding optional silences).\n  #  --lattice-beam=$beam keeps all the alternatives that were within the beam,\n  #    it means we do no pruning of the lattice (lattices from a training transcription\n  #    will be small anyway).\n  echo \"$0: generating lattices containing alternate pronunciations.\"\n  $cmd JOB=1:$nj $dir/log/generate_lattices.JOB.log \\\n    gmm-latgen-faster --acoustic-scale=$acoustic_scale --beam=$final_beam \\\n        --lattice-beam=$final_beam --allow-partial=false --word-determinize=false \\\n      \"$mdl_cmd\" \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 4 ] && $generate_ali_from_lats; then\n  # If generate_alignments is true, ali.*.gz is generated in lats dir\n  $cmd JOB=1:$nj $dir/log/generate_alignments.JOB.log \\\n    lattice-best-path --acoustic-scale=$acoustic_scale \"ark:gunzip -c $dir/lat.JOB.gz |\" \\\n    ark:/dev/null \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz 2>/dev/null || true\n\necho \"$0: done generating lattices from training transcripts.\"\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/align_fmllr.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments; assumes features are (LDA+MLLT or delta+delta-delta)\n# + fMLLR (probably with SAT models).\n# It first computes an alignment with the final.alimdl (or the final.mdl if final.alimdl\n# is not present), then does 2 iterations of fMLLR estimation.\n\n# If you supply the --use-graphs option, it will use the training\n# graphs from the source directory (where the model is).  In this\n# case the number of jobs must match the source directory.\n\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\ncareful=false\nboost_silence=1.0 # factor by which to boost silence during alignment.\nfmllr_update_type=full\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_fmllr.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_fmllr.sh data/train data/lang exp/tri1 exp/tri1_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --fmllr-update-type (full|diag|offset|none)      # default full.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsdata=$data/split$nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.alimdl $dir 2>/dev/null\ncp $srcdir/final.occs $dir;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\ncp $srcdir/delta_opts $dir 2>/dev/null\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts 
--utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n    cp $srcdir/full.mat $dir 2>/dev/null\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! -f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; 
then\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  echo \"$0: doing final alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/align_fmllr_lats.sh",
    "content": "#!/usr/bin/env bash\n#\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Version of align_fmllr.sh that generates lattices (lat.*.gz) with\n# alignments of alternative pronunciations in them.  Mainly intended\n# as a precursor to LF-MMI/chain training for now.\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nacoustic_scale=0.1\nbeam=10\nretry_beam=40\nfinal_beam=20  # For the lattice-generation phase there is no retry-beam.  This\n               # is a limitation of gmm-latgen-faster.  We just use an\n               # intermediate beam.  We'll lose a little data and it will be\n               # slightly slower.  (however, the min-active of 200 that\n               # gmm-latgen-faster defaults to may help.)\nboost_silence=1.0 # factor by which to boost silence during alignment.\nfmllr_update_type=full\ngenerate_ali_from_lats=false # If true, alingments generated from lattices.\nmax_active=7000\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_fmllr_lats.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_fmllr_lats.sh data/train data/lang exp/tri1 exp/tri1_lats\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --fmllr-update-type (full|diag|offset|none)      # default full.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsdata=$data/split$nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.alimdl $dir 2>/dev/null\ncp $srcdir/final.occs $dir;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\ncp $srcdir/delta_opts $dir 2>/dev/null\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats 
$splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n    cp $srcdir/full.mat $dir 2>/dev/null\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n\n## because gmm-latgen-faster doesn't support adding the transition-probs to the\n## graph itself, we need to bake them into the compiled graphs.  This means we can't reuse previously compiled graphs,\n## because the other scripts write them without transition probs.\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling training graphs\"\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $scale_opts $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\n\nif [ $stage -le 1 ]; then\n  # Note: we need to set --transition-scale=0.0 --self-loop-scale=0.0 because,\n  # as explained above, we compiled the transition probs into the training\n  # graphs.\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled --transition-scale=0.0 --self-loop-scale=0.0 --acoustic-scale=$acoustic_scale \\\n        --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: 
computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; then\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  # Warning: gmm-latgen-faster doesn't support a retry-beam so you may get more\n  # alignment errors (however, it does have a default min-active=200 so this\n  # will tend to reduce alignment errors).\n  # --allow-partial=false makes sure we reach the end of the decoding graph.\n  # --word-determinize=false makes sure we retain the alternative pronunciations of\n  #   words (including alternatives regarding optional silences).\n  #  --lattice-beam=$final_beam keeps all the alternatives that were within the beam,\n  #    i.e. we do no pruning of the lattice (lattices from a training transcription\n  #    will be small anyway).\n  echo \"$0: generating lattices containing alternate pronunciations.\"\n  $cmd JOB=1:$nj $dir/log/generate_lattices.JOB.log \\\n    gmm-latgen-faster --max-active=$max_active --acoustic-scale=$acoustic_scale --beam=$final_beam \\\n        --lattice-beam=$final_beam --allow-partial=false --word-determinize=false \\\n      \"$mdl_cmd\" \"ark:gunzip 
-c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 4 ] && $generate_ali_from_lats; then\n  # If generate_ali_from_lats is true, ali.*.gz is generated in the lats dir.\n  $cmd JOB=1:$nj $dir/log/generate_alignments.JOB.log \\\n    lattice-best-path --acoustic-scale=$acoustic_scale \"ark:gunzip -c $dir/lat.JOB.gz |\" \\\n    ark:/dev/null \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz 2>/dev/null || true\n\necho \"$0: done generating lattices from training transcripts.\"\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/align_lvtln.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Vimal Manohar\n\n# Computes training alignments; assumes features are (LDA+MLLT or delta+delta-delta)\n# Will ignore fMLLR.\n# Will estimate VTLN warping factors\n# as a by product, which can be used to extract VTLN-warped features.\n\n# Begin configuration section\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10.0\nretry_beam=40\nboost_silence=1.0 # factor by which to boost silence during alignment.\nlogdet_scale=1.0\ncleanup=false\n\n# End configuration section\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Wrong #arguments ($#, expected 4)\"\n   echo \"Usage: steps/align_lvtln.sh [options] <data-dir> <lang-dir> <src-dir>  <align-dir>\"\n   echo \" e.g.: steps/align_lvtln.sh data/train data/lang exp/tri2c exp/tri2c_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsdata=$data/split$nj\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0: file $data/spk2warp exists.  
This script expects non-VTLN features\"\n  exit 1;\nfi\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl,final.lvtln} $dir || exit 1;\ncp $srcdir/final.occs $dir;\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n\n## Set up the unadapted features \"$sifeats\"\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $srcdir/full.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! 
-f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! -f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";   \n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ -f $data/segments ]; then\n  subset_utts=\"ark:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists: using wav.scp directly.\"\n  subset_utts=\"ark:wav-copy scp:$sdata/JOB/wav.scp ark:- |\"\nfi\n\n## Get the first-pass LVTLN transforms\nif [ $stage -le 2 ]; then\n  echo \"$0: computing first-pass LVTLN transforms.\"\n  $cmd JOB=1:$nj $dir/log/lvtln_pass1.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n    gmm-post-to-gpost $alimdl 
\"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-lvtln-trans --verbose=1 --spk2utt=ark:$sdata/JOB/spk2utt --logdet-scale=$logdet_scale \\\n    $mdl $dir/final.lvtln \"$sifeats\" ark,s,cs:- ark:$dir/trans_pass1.JOB \\\n    ark,t:$dir/warp_pass1.JOB || exit 1;\nfi\n##\n\nfeats1=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans_pass1.JOB ark:- ark:- |\"\n\n## Do a second pass of estimating the LVTLN transform.\n\nif [ $stage -le 3 ]; then\n  echo \"$0: realigning with transformed features\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats1\" \"ark:|gzip -c >$dir/ali_pass2.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: re-estimating LVTLN transforms\"\n  $cmd JOB=1:$nj $dir/log/lvtln_pass1.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali_pass2.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n    gmm-post-to-gpost $alimdl \"$feats1\" ark:- ark:- \\| \\\n    gmm-est-lvtln-trans --verbose=1 --spk2utt=ark:$sdata/JOB/spk2utt --logdet-scale=$logdet_scale \\\n    $mdl $dir/final.lvtln \"$sifeats\" ark,s,cs:- ark:$dir/trans.JOB \\\n    ark,t:$dir/warp.JOB || exit 1;\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 5 ]; then\n  # This second alignment does not affect the transforms.\n  echo \"$0: realigning with the second-pass LVTLN transforms\"\n  $cmd JOB=1:$nj $dir/log/align.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ -f $dir/warp.1 ]; then\n  for j in $(seq $nj); do cat $dir/warp_pass1.$j; done > $dir/0.warp || exit 1;\n  for j in $(seq $nj); do cat $dir/warp.$j; done > $dir/final.warp || exit 
1;\n  ns1=$(cat $dir/0.warp | wc -l)\n  ns2=$(cat $dir/final.warp | wc -l)\n  ! [ \"$ns1\" == \"$ns2\" ] && echo \"$0: Number of speakers differ pass1 vs pass2, $ns1 != $ns2\" && exit 1;\n\n  paste $dir/0.warp $dir/final.warp | awk '{x=$2 - $4; if ((x>0?x:-x) > 0.010001) { print $1, $2, $4; }}' > $dir/warp_changed\n  nc=$(cat $dir/warp_changed | wc -l)\n  echo \"$0: For $nc speakers out of $ns1, warp changed pass1 vs pass2 by >0.01, see $dir/warp_changed for details\"\nfi\n\nif true; then # Diagnostics\n  if [ -f $data/spk2gender ]; then \n    # To make it easier to eyeball the male and female speakers' warps\n    # separately, separate them out.\n    for g in m f; do # means: for gender in male female\n      cat $dir/final.warp | \\\n        utils/filter_scp.pl <(grep -w $g $data/spk2gender | awk '{print $1}') > $dir/final.warp.$g\n      echo -n \"The last few warp factors for gender $g are: \"\n      tail -n 10 $dir/final.warp.$g | awk '{printf(\"%s \", $2);}'; \n      echo\n    done\n  fi\nfi\n\nif $cleanup; then\n  rm $dir/pre_ali.*.gz $dir/ali_pass?.*.gz $dir/trans_pass1.* $dir/warp_pass1.* $dir/warp.*\nfi\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/align_raw_fmllr.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments; assumes features are (LDA+MLLT or delta+delta-delta)\n# + fMLLR (probably with SAT models).\n# It first computes an alignment with the final.alimdl (or the final.mdl if final.alimdl\n# is not present), then does 2 iterations of fMLLR estimation.\n\n# If you supply the --use-graphs option, it will use the training\n# graphs from the source directory (where the model is).  In this\n# case the number of jobs must match the source directory.\n\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nboost_silence=1.0 # factor by which to boost silence during alignment.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_raw_fmllr.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_raw_fmllr.sh data/train data/lang exp/tri1 exp/tri1_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsdata=$data/split$nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n\nif [[ ! -f $srcdir/final.mat || ! -f $srcdir/full.mat ]]; then\n  echo \"$0: we require final.mat and full.mat in the source directory $srcdir\"\n  exit 1;\nfi\n\nfull_lda_mat=\"get-full-lda-mat --print-args=false $srcdir/final.mat $srcdir/full.mat -|\"\ncp $srcdir/full.mat $srcdir/final.mat $dir\n\nraw_dim=$(feat-to-dim scp:$data/feats.scp -) || exit 1;\n! 
[ \"$raw_dim\" -gt 0 ] && echo \"raw feature dim not set\" && exit 1;\n\nsplicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- |\"\nsifeats=\"$splicedfeats transform-feats $srcdir/final.mat ark:- ark:- |\"\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! -f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; then\n    $cmd 
JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-fmllr-raw-gpost --raw-feat-dim=$raw_dim --spk2utt=ark:$sdata/JOB/spk2utt \\\n       $mdl \"$full_lda_mat\" \"$splicedfeats\" ark,s,cs:- ark:$dir/raw_trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-fmllr-raw --raw-feat-dim=$raw_dim --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$full_lda_mat\" \\\n       \"$splicedfeats\" ark,s,cs:- ark:$dir/raw_trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  echo \"$0: doing final alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/align_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments and (if needed) speaker-vectors, given an\n# SGMM system.  If the system is built on top of SAT, you should supply\n# transforms with the --transform-dir option.\n\n# If you supply the --use-graphs option, it will use the training\n# graphs from the source directory.\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false # use graphs from srcdir\nuse_gselect=false # use gselect info from srcdir [regardless, we use\n   # Gaussian-selection info, we might have to compute it though.]\ngselect=15  # Number of Gaussian-selection indices for SGMMs.\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\ntransform_dir=  # directory to find fMLLR transforms in.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_sgmm2.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_sgmm2.sh --transform-dir exp/tri3b data/train data/lang \\\\\"\n   echo \"           exp/sgmm4a exp/sgmm5a_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --transform-dir <transform-dir>                  # directory to find fMLLR transforms\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\nsdata=$data/split$nj\n\nmkdir -p $dir/log\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\n[ -f $srcdir/final.alimdl ] && cp $srcdir/final.alimdl $dir\ncp $srcdir/final.occs $dir;\n\n## Set up features.\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts 
--utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option during alignment.\"\nfi\n##\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! 
-f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\n  ln.pl $srcdir/fsts.*.gz $dir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n## Work out where we're getting the Gaussian-selection info from\nif $use_gselect; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-gselect true, but #jobs mismatch.\" && exit 1;\n  [ ! -f $srcdir/gselect.1.gz ] && echo \"No gselect info in $srcdir\" && exit 1;\n  graphdir=$srcdir\n  gselect_opt=\"--gselect=ark,s,cs:gunzip -c $srcdir/gselect.JOB.gz|\"\n  ln.pl $srcdir/gselect.*.gz $dir\nelse\n  graphdir=$dir\n  if [ $stage -le 1 ]; then\n    echo \"$0: computing Gaussian-selection info\"\n    # Note: doesn't matter whether we use $alimdl or $mdl, they will\n    # have the same gselect info.\n    $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n      sgmm2-gselect --full-gmm-nbest=$gselect $alimdl \\\n      \"$feats\" \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\n  fi\n  gselect_opt=\"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\"\nfi\n\n\nif [ $alimdl == $mdl ]; then\n  # Speaker-independent decoding-- just one pass.  
Not normal.\n  T=`sgmm2-info $mdl | grep 'speaker vector space' | awk '{print $NF}'` || exit 1;\n  [ \"$T\" -ne 0 ] && echo \"No alignment model, yet speaker vector space nonempty\" && exit 1;\n\n  if [ $stage -le 2 ]; then\n    echo \"$0: aligning data in $data using model $mdl (no speaker-vectors)\"\n    $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n      sgmm2-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam $alimdl \\\n      \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n  echo \"$0: done aligning data.\"\n  exit 0;\nfi\n\n# Continue with system with speaker vectors.\nif [ $stage -le 2 ]; then\n  echo \"$0: aligning data in $data using model $alimdl\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    sgmm2-align-compiled $scale_opts \"$gselect_opt\" --beam=$beam --retry-beam=$retry_beam $alimdl \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: computing speaker vectors (1st pass)\"\n  $cmd JOB=1:$nj $dir/log/spk_vecs1.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n    sgmm2-post-to-gpost \"$gselect_opt\" $alimdl \"$feats\" ark:- ark:- \\| \\\n    sgmm2-est-spkvecs-gpost --spk2utt=ark:$sdata/JOB/spk2utt \\\n     $mdl \"$feats\" ark,s,cs:- ark:$dir/pre_vecs.JOB || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: computing speaker vectors (2nd pass)\"\n  $cmd JOB=1:$nj $dir/log/spk_vecs2.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n    sgmm2-est-spkvecs --spk2utt=ark:$sdata/JOB/spk2utt \"$gselect_opt\" \\\n     --spk-vecs=ark:$dir/pre_vecs.JOB $mdl \"$feats\" ark,s,cs:- ark:$dir/vecs.JOB || exit 1;\n  rm $dir/pre_vecs.*\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: 
doing final alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    sgmm2-align-compiled $scale_opts \"$gselect_opt\" --beam=$beam --retry-beam=$retry_beam \\\n     --utt2spk=ark:$sdata/JOB/utt2spk --spk-vecs=ark:$dir/vecs.JOB \\\n     $mdl \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/align_si.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments using a model with delta or\n# LDA+MLLT features.\n\n# If you supply the \"--use-graphs true\" option, it will use the training\n# graphs from the source directory (where the model is).  In this\n# case the number of jobs must match with the source directory.\n\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\ncareful=false\nboost_silence=1.0 # Factor by which to boost silence during alignment.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: steps/align_si.sh <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/align_si.sh data/train data/lang exp/tri1 exp/tri1_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\n\nfor f in $data/text $lang/oov.int $srcdir/tree $srcdir/final.mdl; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\ncp $srcdir/delta_opts $dir 2>/dev/null\n\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\n\n\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $srcdir/full.mat $dir\n   ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\necho \"$0: aligning data in $data using model from $srcdir, putting alignments in $dir\"\n\nmdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/final.mdl - |\"\n\nif $use_graphs; then\n  [ $nj != \"`cat $srcdir/num_jobs`\" ] && echo \"$0: mismatch in num-jobs\" && exit 1;\n  [ ! 
-f $srcdir/fsts.1.gz ] && echo \"$0: no such file $srcdir/fsts.1.gz\" && exit 1;\n\n  $cmd JOB=1:$nj $dir/log/align.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl\" \\\n      \"ark:gunzip -c $srcdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nelse\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n  # We could just use gmm-align in the next line, but it's less efficient as it compiles the\n  # training graphs one by one.\n  $cmd JOB=1:$nj $dir/log/align.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" ark:- \\| \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl\" ark:- \\\n      \"$feats\" \"ark,t:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\necho \"$0: done aligning data.\"\n"
  },
  {
    "path": "egs/steps/best_path_weights.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014-17 Vimal Manohar\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# This script gets from the lattice the best path alignments and frame-level\n# posteriors of the pdfs in the best path alignment.\n# The output directory has the format of an alignment directory.\n# It can optionally read alignments from a directory, in which case,\n# the script gets frame-level posteriors of the pdf corresponding to those\n# alignments.\n# The frame-level posteriors are in the form of Kaldi vectors and are\n# output in weights.scp.\n\nset -e\n\n# begin configuration section.\ncmd=run.pl\nstage=-10\nacwt=0.1\n#end configuration section.\n\nif [ -f ./path.sh ]; then . ./path.sh; fi\n. utils/parse_options.sh || exit 1;\n\nif [ $# -ne 3 ] && [ $# -ne 4 ]; then\n  cat <<EOF\n    Usage: $0 [options] <data-dir> <decode-dir> [<ali-dir>] <out-dir>\n      E.g. $0 data/train_unt.seg exp/tri1/decode exp/tri1/best_path\n    Options:\n      --cmd (run.pl|queue.pl...)      
# specify how to run the sub-processes.\nEOF\n  \n  exit 1;\nfi\n\ndata=$1\ndecode_dir=$2\ndir=${@: -1}  # last argument to the script\n\nali_dir=$dir\nif [ $# -eq 4 ]; then\n  ali_dir=$3\nfi\n\nmkdir -p $dir\n\nnj=$(cat $decode_dir/num_jobs)\necho $nj > $dir/num_jobs\n\nif [ $stage -le 1 ]; then\n  mkdir -p $dir/log\n  $cmd JOB=1:$nj $dir/log/best_path.JOB.log \\\n    lattice-best-path --acoustic-scale=$acwt \\\n      \"ark,s,cs:gunzip -c $decode_dir/lat.JOB.gz |\" \\\n      ark:/dev/null \"ark:| gzip -c > $dir/ali.JOB.gz\" || exit 1\nfi\n\n# Find where the final.mdl is.\nif [ -f $(dirname $decode_dir)/final.mdl ]; then\n  src_dir=$(dirname $decode_dir)\nelse\n  src_dir=$decode_dir\nfi\n\ncp $src_dir/cmvn_opts $dir/ || exit 1\nfor f in final.mat splice_opts frame_subsampling_factor; do\n  if [ -f $src_dir/$f ]; then cp $src_dir/$f $dir; fi\ndone\n\n# make $dir an absolute pathname.\nfdir=$(perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD})\n\nmodel=$src_dir/final.mdl\ntree=$src_dir/tree\n\nfor f in $model $decode_dir/lat.1.gz $tree; do\n  if [ ! -f $f ]; then echo \"$0: expecting file $f to exist\" && exit 1; fi\ndone\n\ncp $model $tree $dir || exit 1\n\nali_nj=$(cat $ali_dir/num_jobs) || exit 1\nif [ $nj -ne $ali_nj ]; then\n  echo \"$0: $decode_dir and $ali_dir have different number of jobs. 
Redo alignment with $nj jobs.\"\n  exit 1\nfi\n\nif [ $stage -lt 2 ]; then\n  $cmd JOB=1:$nj $dir/log/get_post.JOB.log \\\n    lattice-to-post --acoustic-scale=$acwt \\\n      \"ark,s,cs:gunzip -c $decode_dir/lat.JOB.gz|\" ark:- \\| \\\n    post-to-pdf-post $model ark,s,cs:- ark:- \\| \\\n    get-post-on-ali ark,s,cs:- \\\n    \"ark,s,cs:gunzip -c $ali_dir/ali.JOB.gz | convert-ali $dir/final.mdl $model $tree ark,s,cs:- ark:- | ali-to-pdf $model ark,s,cs:- ark:- |\" \\\n    \"ark,scp:$fdir/weights.JOB.ark,$fdir/weights.JOB.scp\" || exit 1\nfi\n\nfor n in `seq $nj`; do\n  cat $dir/weights.$n.scp \ndone > $dir/weights.scp\n\nrm $dir/weights.*.scp\n\nexit 0\n"
  },
  {
    "path": "egs/steps/cleanup/clean_and_segment_data.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Vimal Manohar\n#           2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script demonstrates how to re-segment training data selecting only the\n# \"good\" audio that matches the transcripts.\n# The basic idea is to decode with an existing in-domain GMM acoustic model, and\n# a biased language model built from the reference transcript, and then work out\n# the segmentation from a ctm-like file.\n\nset -e -o pipefail\n\nstage=0\n\ncmd=run.pl\ncleanup=true\nnj=4\ngraph_opts=\nsegmentation_opts=\n\n. ./path.sh\n. utils/parse_options.sh\n\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: $0 [options] <data> <lang> <srcdir> <dir> <cleaned-data>\"\n  echo \" This script does data cleanup to remove bad portions of transcripts and\"\n  echo \" may do other minor modifications of transcripts such as allowing repetitions\"\n  echo \" for disfluencies, and adding or removing non-scored words (by default:\"\n  echo \" words that map to 'silence phones')\"\n  echo \" Note: <srcdir> is expected to contain a GMM-based model, preferably a\"\n  echo \" SAT-trained one (see train_sat.sh).\"\n  echo \" If <srcdir> contains fMLLR transforms (trans.*) they are assumed to\"\n  echo \" be transforms corresponding to the data in <data>.  If <srcdir> is for a different\"\n  echo \" dataset, and you're using SAT models, you should align <data> with <srcdir>\"\n  echo \" using align_fmllr.sh, and supply that directory as <srcdir>\"\n  echo \"\"\n  echo \"e.g. $0 data/train data/lang exp/tri3 exp/tri3_cleanup data/train_cleaned\"\n  echo \"Options:\"\n  echo \"  --stage <n>             # stage to run from, to enable resuming from partially\"\n  echo \"                          # completed run (default: 0)\"\n  echo \"  --cmd '$cmd'            # command to submit jobs with (e.g. 
run.pl, queue.pl)\"\n  echo \"  --nj <n>                # number of parallel jobs to use in graph creation and\"\n  echo \"                          # decoding\"\n  echo \"  --segmentation-opts 'opts'  # Additional options to segment_ctm_edits.py.\"\n  echo \"                              # Please run steps/cleanup/internal/segment_ctm_edits.py\"\n  echo \"                              # without arguments to see allowed options.\"\n  echo \"  --graph-opts 'opts'         # Additional options to make_biased_lm_graphs.sh.\"\n  echo \"                              # Please run steps/cleanup/make_biased_lm_graphs.sh\"\n  echo \"                              # without arguments to see allowed options.\"\n  echo \"  --cleanup        <true|false>  # Clean up intermediate files afterward.  Default true.\"\n  exit 1\n\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\ndata_out=$5\n\n\nfor f in $srcdir/{final.mdl,tree,cmvn_opts} $data/utt2spk $data/feats.scp $lang/words.txt $lang/oov.txt; do\n  if [ ! -f $f ]; then\n    echo \"$0: expected file $f to exist.\"\n    exit 1\n  fi\ndone\n\nmkdir -p $dir\ncp $srcdir/final.mdl $dir\ncp $srcdir/tree $dir\ncp $srcdir/cmvn_opts $dir\ncp $srcdir/{splice_opts,delta_opts,final.mat,final.alimdl} $dir 2>/dev/null || true\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\ncp $lang/phones.txt $dir\n\nif [ $stage -le 1 ]; then\n  echo \"$0: Building biased-language-model decoding graphs...\"\n  steps/cleanup/make_biased_lm_graphs.sh $graph_opts \\\n    --nj $nj --cmd \"$cmd\" \\\n     $data $lang $dir $dir/graphs\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Decoding with biased language models...\"\n  transform_opt=\n  if [ -f $srcdir/trans.1 ]; then\n    # If srcdir contained trans.* then we assume they are fMLLR transforms for\n    # this data, and we use them.\n    transform_opt=\"--transform-dir $srcdir\"\n  fi\n  # Note: the --beam 15.0 (vs. 
the default 13.0) does actually slow it\n  # down substantially, around 0.35xRT to 0.7xRT on tedlium.\n  # I want to test at some point whether it's actually necessary to have\n  # this largish beam.\n  steps/cleanup/decode_segmentation.sh \\\n      --beam 15.0 --nj $nj --cmd \"$cmd --mem 4G\" $transform_opt \\\n      --skip-scoring true --allow-partial false \\\n       $dir/graphs $data $dir/lats\n\n  # the following is for diagnostics, e.g. it will give us the lattice depth.\n  steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $lang $dir/lats\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Doing oracle alignment of lattices...\"\n  steps/cleanup/lattice_oracle_align.sh \\\n    --cmd \"$cmd\" $data $lang $dir/lats $dir/lattice_oracle\nfi\n\n\nif [ $stage -le 4 ]; then\n  echo \"$0: using default values of non-scored words...\"\n\n  # At the level of this script we just hard-code it that non-scored words are\n  # those that map to silence phones (which is what get_non_scored_words.py\n  # gives us), although this could easily be made user-configurable.  This list\n  # of non-scored words affects the behavior of several of the data-cleanup\n  # scripts; essentially, we view the non-scored words as negotiable when it\n  # comes to the reference transcript, so we'll consider changing the reference\n  # to match the hyp when it comes to these words.\n  steps/cleanup/internal/get_non_scored_words.py $lang > $dir/non_scored_words.txt\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: modifying ctm-edits file to allow repetitions [for dysfluencies] and \"\n  echo \"   ... to fix reference mismatches involving non-scored words. \"\n\n  $cmd $dir/log/modify_ctm_edits.log \\\n    steps/cleanup/internal/modify_ctm_edits.py --verbose=3 $dir/non_scored_words.txt \\\n    $dir/lattice_oracle/ctm_edits $dir/ctm_edits.modified\n\n  echo \"   ... 
See $dir/log/modify_ctm_edits.log for details and stats, including\"\n  echo \" a list of commonly-repeated words.\"\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: applying 'taint' markers to ctm-edits file to mark silences and\"\n  echo \"  ... non-scored words that are next to errors.\"\n  $cmd $dir/log/taint_ctm_edits.log \\\n       steps/cleanup/internal/taint_ctm_edits.py $dir/ctm_edits.modified $dir/ctm_edits.tainted\n  echo \"... Stats, including global cor/ins/del/sub stats, are in $dir/log/taint_ctm_edits.log.\"\nfi\n\n\nif [ $stage -le 7 ]; then\n  echo \"$0: creating segmentation from ctm-edits file.\"\n\n  $cmd $dir/log/segment_ctm_edits.log \\\n    steps/cleanup/internal/segment_ctm_edits.py \\\n       $segmentation_opts \\\n       --oov-symbol-file=$lang/oov.txt \\\n      --ctm-edits-out=$dir/ctm_edits.segmented \\\n      --word-stats-out=$dir/word_stats.txt \\\n      $dir/non_scored_words.txt \\\n      $dir/ctm_edits.tainted $dir/text $dir/segments\n\n  echo \"$0: contents of $dir/log/segment_ctm_edits.log are:\"\n  cat $dir/log/segment_ctm_edits.log\n  echo \"For word-level statistics on p(not-being-in-a-segment), with 'worst' words at the top,\"\n  echo \"see $dir/word_stats.txt\"\n  echo \"For detailed utterance-level debugging information, see $dir/ctm_edits.segmented\"\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: working out required segment padding to account for feature-generation edge effects.\"\n  # make sure $data/utt2dur exists.\n  utils/data/get_utt2dur.sh $data\n  # utt2dur.from_ctm contains lines of the form 'utt dur',  e.g.\n  # AMI_EN2001a_H00_MEE068_0000557_0000594 0.35\n  # where the times are ultimately derived from the num-frames in the features.\n  cat $dir/lattice_oracle/ctm_edits | \\\n     awk '{utt=$1; t=$3+$4; if (t > dur[$1]) dur[$1] = t; } END{for (k in dur) print k, dur[k];}' | \\\n     sort > $dir/utt2dur.from_ctm\n  # the apply_map command below gives us lines of the form 'utt dur-from-$data/utt2dur 
dur-from-utt2dur.from_ctm',\n  # e.g. AMI_EN2001a_H00_MEE068_0000557_0000594 0.37 0.35\n  utils/apply_map.pl -f 1 <(awk '{print $1,$1,$2}' <$data/utt2dur) <$dir/utt2dur.from_ctm  | \\\n    awk '{printf(\"%.3f\\n\", $2 - $3); }' | sort | uniq -c | sort -nr > $dir/padding_frequencies\n  # there are values other than the most-frequent one (0.02) in there because\n  # of wav files that were shorter than the segment info.\n  padding=$(head -n 1 $dir/padding_frequencies | awk '{print $2}')\n  echo \"$0: we'll pad segments with $padding seconds at segment ends to correct for feature-generation end effects\"\n  echo $padding >$dir/segment_end_padding\nfi\n\n\nif [ $stage -le 8 ]; then\n  echo \"$0: based on the segments and text file in $dir/segments and $dir/text, creating new data-dir in $data_out\"\n  padding=$(cat $dir/segment_end_padding)  # e.g. 0.02\n  utils/data/subsegment_data_dir.sh --segment-end-padding $padding ${data} $dir/segments $dir/text $data_out\n  # utils/data/subsegment_data_dir.sh can output directories that have e.g. too many entries left in wav.scp\n  # Clean this up with the fix_data_dir.sh script\n  utils/fix_data_dir.sh $data_out\nfi\n\nif [ $stage -le 9 ]; then\n  echo \"$0: recomputing CMVN stats for the new data\"\n  # Caution: this script puts the CMVN stats in $data_out/data,\n  # e.g. data/train_cleaned/data.  This is not the general pattern we use.\n  steps/compute_cmvn_stats.sh $data_out $data_out/log $data_out/data\nfi\n\nif $cleanup; then\n  echo \"$0: cleaning up intermediate files\"\n  rm -r $dir/graphs/fsts $dir/graphs/HCLG.fsts.scp || true\n  rm -r $dir/lats/lat.*.gz $dir/lats/split_fsts || true\n  rm $dir/lattice_oracle/lat.*.gz || true\nfi\n\necho \"$0: done.\"\n"
  },
  {
    "path": "egs/steps/cleanup/clean_and_segment_data_nnet3.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Vimal Manohar\n#           2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script demonstrates how to re-segment training data selecting only the\n# \"good\" audio that matches the transcripts.\n# This script is like clean_and_segment_data.sh, but uses nnet3 model instead of\n# a GMM for decoding.\n# The basic idea is to decode with an existing in-domain nnet3 acoustic model,\n# and a biased language model built from the reference transcript, and then work\n# out the segmentation from a ctm like file.\n\nset -e\nset -o pipefail\nset -u\n\nstage=0\n\ncmd=run.pl\ncleanup=true  # remove temporary directories and files\nnj=4\n# Decode options\ngraph_opts=\nscale_opts=\nbeam=15.0\nlattice_beam=1.0\n\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\nlmwt=10\n\n# Contexts must ideally match training\nextra_left_context=0  # Set to some large value, typically 40 for LSTM (must match training)\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nframes_per_chunk=150\n\n# i-vector options\nextractor=    # i-Vector extractor. If provided, will extract i-vectors.\n              # Required if the network was trained with i-vector extractor.\nuse_vad=false # Use energy-based VAD for i-vector extraction\n\nsegmentation_opts=\n\n. ./path.sh\n. 
utils/parse_options.sh\n\n\nif [ $# -ne 5 ]; then\n  cat <<EOF\n  Usage: $0 [--extractor <ivector-extractor>] [options] <data> <lang> <srcdir> <dir> <cleaned-data>\n   This script does data cleanup to remove bad portions of transcripts and\n   may do other minor modifications of transcripts such as allowing repetitions\n   for disfluencies, and adding or removing non-scored words (by default:\n   words that map to 'silence phones')\n   Note: <srcdir> is expected to contain a nnet3-based model.\n   <ivector-extractor> and decoding options like --extra-left-context must match\n   the appropriate options used for training.\n\n  e.g. $0 data/train data/lang exp/tri3 exp/tri3_cleanup data/train_cleaned\n  main options (for others, see top of script file):\n    --stage <n>             # stage to run from, to enable resuming from partially\n                            # completed run (default: 0)\n    --cmd '$cmd'            # command to submit jobs with (e.g. run.pl, queue.pl)\n    --nj <n>                # number of parallel jobs to use in graph creation and\n                            # decoding\n    --graph-opts 'opts'         # Additional options to make_biased_lm_graphs.sh.\n                                # Please run steps/cleanup/make_biased_lm_graphs.sh\n                                # without arguments to see allowed options.\n    --segmentation-opts 'opts'  # Additional options to segment_ctm_edits.py.\n                                # Please run steps/cleanup/internal/segment_ctm_edits.py\n                                # without arguments to see allowed options.\n    --cleanup        <true|false>  # Clean up intermediate files afterward.  Default true.\n    --extractor <extractor>     # i-vector extractor directory if i-vector is\n                                # to be used during decoding. 
Must match\n                                # the extractor used for training neural-network.\n    --use-vad <true|false>      # If true, uses energy-based VAD to apply frame weights\n                                # for i-vector stats extraction\nEOF\n  exit 1\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\ndata_out=$5\n\n\nextra_files=\nif [ ! -z \"$extractor\" ]; then\n  extra_files=\"$extractor/final.ie\"\nfi\n\nfor f in $srcdir/{final.mdl,tree,cmvn_opts} $data/utt2spk $data/feats.scp \\\n  $lang/words.txt $lang/oov.txt $extra_files; do\n  if [ ! -f $f ]; then\n    echo \"$0: expected file $f to exist.\"\n    exit 1\n  fi\ndone\n\nmkdir -p $dir\ncp $srcdir/final.mdl $dir\ncp $srcdir/tree $dir\ncp $srcdir/cmvn_opts $dir\ncp $srcdir/{splice_opts,delta_opts,final.mat,final.alimdl} $dir 2>/dev/null || true\ncp $srcdir/frame_subsampling_factor $dir 2>/dev/null || true\n\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  echo \"$0: guessing that this is a chain system, checking parameters.\"\n  if [ -z $scale_opts ]; then\n    echo \"$0: setting scale_opts\"\n    scale_opts=\"--self-loop-scale=1.0 --transition-scale=1.0\"\n  fi\n  if [ $acwt == 0.1 ]; then\n    echo \"$0: setting acwt=1.0\"\n    acwt=1.0\n  fi\n  if [ $lmwt == 10 ]; then\n    echo \"$0: setting lmwt=1.0\"\n    lmwt=1\n  fi\nfi\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\ncp $lang/phones.txt $dir\n\nif [ $stage -le 1 ]; then\n  echo \"$0: Building biased-language-model decoding graphs...\"\n\n\n  steps/cleanup/make_biased_lm_graphs.sh $graph_opts --scale-opts \"$scale_opts\" \\\n    --nj $nj --cmd \"$cmd\" \\\n     $data $lang $dir $dir/graphs\nfi\n\nonline_ivector_dir=\nif [ ! 
-z \"$extractor\" ]; then\n  online_ivector_dir=$dir/ivectors_$(basename $data)\n\n  if [ $stage -le 2 ]; then\n    # Compute energy-based VAD\n    if $use_vad; then\n      steps/compute_vad_decision.sh $data \\\n        $data/log $data/data\n    fi\n\n    steps/online/nnet2/extract_ivectors_online.sh \\\n      --nj $nj --cmd \"$cmd --mem 4G\" --use-vad $use_vad \\\n      $data $extractor $online_ivector_dir\n  fi\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Decoding with biased language models...\"\n\n  steps/cleanup/decode_segmentation_nnet3.sh \\\n    --acwt $acwt  \\\n    --beam $beam --lattice-beam $lattice_beam --nj $nj --cmd \"$cmd --mem 4G\" \\\n    --skip-scoring true --allow-partial false \\\n    --extra-left-context $extra_left_context \\\n    --extra-right-context $extra_right_context \\\n    --extra-left-context-initial $extra_left_context_initial \\\n    --extra-right-context-final $extra_right_context_final \\\n    --frames-per-chunk $frames_per_chunk \\\n    ${online_ivector_dir:+--online-ivector-dir $online_ivector_dir} \\\n    $dir/graphs $data $dir/lats\n\n  # the following is for diagnostics, e.g. it will give us the lattice depth.\n  steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $lang $dir/lats\nfi\n\nframe_shift_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  frame_shift_opt=\"--frame-shift 0.0$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: Doing oracle alignment of lattices...\"\n  steps/cleanup/lattice_oracle_align.sh --cmd \"$cmd --mem 4G\" $frame_shift_opt \\\n    $data $lang $dir/lats $dir/lattice_oracle\nfi\n\n\nif [ $stage -le 4 ]; then\n  echo \"$0: using default values of non-scored words...\"\n\n  # At the level of this script we just hard-code it that non-scored words are\n  # those that map to silence phones (which is what get_non_scored_words.py\n  # gives us), although this could easily be made user-configurable.  
This list\n  # of non-scored words affects the behavior of several of the data-cleanup\n  # scripts; essentially, we view the non-scored words as negotiable when it\n  # comes to the reference transcript, so we'll consider changing the reference\n  # to match the hyp when it comes to these words.\n  steps/cleanup/internal/get_non_scored_words.py $lang > $dir/non_scored_words.txt\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: modifying ctm-edits file to allow repetitions [for dysfluencies] and \"\n  echo \"   ... to fix reference mismatches involving non-scored words. \"\n\n  $cmd $dir/log/modify_ctm_edits.log \\\n    steps/cleanup/internal/modify_ctm_edits.py --verbose=3 $dir/non_scored_words.txt \\\n    $dir/lattice_oracle/ctm_edits $dir/ctm_edits.modified\n\n  echo \"   ... See $dir/log/modify_ctm_edits.log for details and stats, including\"\n  echo \" a list of commonly-repeated words.\"\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: applying 'taint' markers to ctm-edits file to mark silences and\"\n  echo \"  ... non-scored words that are next to errors.\"\n  $cmd $dir/log/taint_ctm_edits.log \\\n       steps/cleanup/internal/taint_ctm_edits.py $dir/ctm_edits.modified $dir/ctm_edits.tainted\n  echo \"... 
Stats, including global cor/ins/del/sub stats, are in $dir/log/taint_ctm_edits.log.\"\nfi\n\n\nif [ $stage -le 7 ]; then\n  echo \"$0: creating segmentation from ctm-edits file.\"\n\n  $cmd $dir/log/segment_ctm_edits.log \\\n    steps/cleanup/internal/segment_ctm_edits.py \\\n      $segmentation_opts \\\n      --oov-symbol-file=$lang/oov.txt \\\n      --ctm-edits-out=$dir/ctm_edits.segmented \\\n      --word-stats-out=$dir/word_stats.txt \\\n      $dir/non_scored_words.txt \\\n      $dir/ctm_edits.tainted $dir/text $dir/segments\n\n  echo \"$0: contents of $dir/log/segment_ctm_edits.log are:\"\n  cat $dir/log/segment_ctm_edits.log\n  echo \"For word-level statistics on p(not-being-in-a-segment), with 'worst' words at the top,\"\n  echo \"see $dir/word_stats.txt\"\n  echo \"For detailed utterance-level debugging information, see $dir/ctm_edits.segmented\"\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: working out required segment padding to account for feature-generation edge effects.\"\n  # make sure $data/utt2dur exists.\n  utils/data/get_utt2dur.sh $data\n  # utt2dur.from_ctm contains lines of the form 'utt dur',  e.g.\n  # AMI_EN2001a_H00_MEE068_0000557_0000594 0.35\n  # where the times are ultimately derived from the num-frames in the features.\n  cat $dir/lattice_oracle/ctm_edits | \\\n     awk '{utt=$1; t=$3+$4; if (t > dur[$1]) dur[$1] = t; } END{for (k in dur) print k, dur[k];}' | \\\n     sort > $dir/utt2dur.from_ctm\n  # the apply_map command below gives us lines of the form 'utt dur-from-$data/utt2dur dur-from-utt2dur.from_ctm',\n  # e.g. 
AMI_EN2001a_H00_MEE068_0000557_0000594 0.37 0.35\n  utils/apply_map.pl -f 1 <(awk '{print $1,$1,$2}' <$data/utt2dur) <$dir/utt2dur.from_ctm  | \\\n    awk '{printf(\"%.3f\\n\", $2 - $3); }' | sort | uniq -c | sort -nr > $dir/padding_frequencies\n  # there are values other than the most-frequent one (0.02) in there because\n  # of wav files that were shorter than the segment info.\n  padding=$(head -n 1 $dir/padding_frequencies | awk '{print $2}')\n  echo \"$0: we'll pad segments with $padding seconds at segment ends to correct for feature-generation end effects\"\n  echo $padding >$dir/segment_end_padding\nfi\n\n\nif [ $stage -le 8 ]; then\n  echo \"$0: based on the segments and text file in $dir/segments and $dir/text, creating new data-dir in $data_out\"\n  padding=$(cat $dir/segment_end_padding)  # e.g. 0.02\n  utils/data/subsegment_data_dir.sh --segment-end-padding $padding ${data} $dir/segments $dir/text $data_out\n  # utils/data/subsegment_data_dir.sh can output directories that have e.g. too many entries left in wav.scp\n  # Clean this up with the fix_data_dir.sh script\n  utils/fix_data_dir.sh $data_out\nfi\n\nif [ $stage -le 9 ]; then\n  echo \"$0: recomputing CMVN stats for the new data\"\n  # Caution: this script puts the CMVN stats in $data_out/data,\n  # e.g. data/train_cleaned/data.  This is not the general pattern we use.\n  steps/compute_cmvn_stats.sh $data_out $data_out/log $data_out/data\nfi\n\nif $cleanup; then\n  echo \"$0: cleaning up intermediate files\"\n  rm -r $dir/graphs/fsts $dir/graphs/HCLG.fsts.scp || true\n  rm -r $dir/lats/lat.*.gz $dir/lats/split_fsts || true\n  rm $dir/lattice_oracle/lat.*.gz || true\nfi\n\necho \"$0: done.\"\n"
  },
  {
    "path": "egs/steps/cleanup/combine_short_segments.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016 Vijayaditya Peddinti\n# Apache 2.0\n\nfrom __future__ import print_function\nimport argparse\nimport sys\nimport os\nimport subprocess\nimport errno\nimport copy\nimport shutil\nimport warnings\n\ndef GetArgs():\n    # we add compulsary arguments as named arguments for readability\n    parser = argparse.ArgumentParser(description=\"\"\"\n    **Warning, this script is deprecated.  Please use utils/data/combine_short_segments.sh**\n    This script concatenates segments in the input_data_dir to ensure that\"\"\"\n    \" the segments in the output_data_dir have a specified minimum length.\",\n    formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\n\n    parser.add_argument(\"--minimum-duration\", type=float, required = True,\n                        help=\"Minimum duration of the segments in the output directory\")\n    parser.add_argument(\"--input-data-dir\", type=str, required = True)\n    parser.add_argument(\"--output-data-dir\", type=str, required = True)\n\n    print(' '.join(sys.argv))\n    args = parser.parse_args()\n    return args\n\ndef RunKaldiCommand(command, wait = True):\n    \"\"\" Runs commands frequently seen in Kaldi scripts. 
These are usually a\n        sequence of commands connected by pipes, so we use shell=True \"\"\"\n    p = subprocess.Popen(command, shell = True,\n                         stdout = subprocess.PIPE,\n                         stderr = subprocess.PIPE)\n\n    if wait:\n        [stdout, stderr] = p.communicate()\n        if p.returncode != 0:\n            raise Exception(\"There was an error while running the command {0}\\n\".format(command)+\"-\"*10+\"\\n\"+stderr)\n        return stdout, stderr\n    else:\n        return p\n\ndef MakeDir(dir):\n    try:\n        os.mkdir(dir)\n    except OSError as exc:\n        if exc.errno != errno.EEXIST:\n            raise exc\n        raise Exception(\"Directory {0} already exists\".format(dir))\n        pass\n\ndef CheckFiles(input_data_dir):\n    for file_name in ['spk2utt', 'text', 'utt2spk', 'feats.scp']:\n        file_name = '{0}/{1}'.format(input_data_dir, file_name)\n        if not os.path.exists(file_name):\n            raise Exception(\"There is no such file {0}\".format(file_name))\n\ndef ParseFileToDict(file, assert2fields = False, value_processor = None):\n    if value_processor is None:\n        value_processor = lambda x: x[0]\n\n    dict = {}\n    for line in open(file, 'r'):\n        parts = line.split()\n        if assert2fields:\n            assert(len(parts) == 2)\n\n        dict[parts[0]] = value_processor(parts[1:])\n    return dict\n\ndef WriteDictToFile(dict, file_name):\n    file = open(file_name, 'w')\n    # sorted() works on both Python 2 lists and Python 3 dict views\n    keys = sorted(dict.keys())\n    for key in keys:\n        value = dict[key]\n        if type(value) in [list, tuple] :\n            if type(value) is tuple:\n                value = list(value)\n            value.sort()\n            value = ' '.join(value)\n        file.write('{0}\\t{1}\\n'.format(key, value))\n    file.close()\n\n\ndef ParseDataDirInfo(data_dir):\n    data_dir_file = lambda file_name: '{0}/{1}'.format(data_dir, file_name)\n\n    utt2spk = 
ParseFileToDict(data_dir_file('utt2spk'))\n    spk2utt = ParseFileToDict(data_dir_file('spk2utt'), value_processor = lambda x: x)\n    text = ParseFileToDict(data_dir_file('text'), value_processor = lambda x: \" \".join(x))\n    # we want to assert feats.scp has just 2 fields, as we don't know how\n    # to process it otherwise\n    feat = ParseFileToDict(data_dir_file('feats.scp'), assert2fields = True)\n    utt2dur = ParseFileToDict(data_dir_file('utt2dur'), value_processor = lambda x: float(x[0]))\n    utt2uniq = None\n    if os.path.exists(data_dir_file('utt2uniq')):\n        utt2uniq = ParseFileToDict(data_dir_file('utt2uniq'))\n    return utt2spk, spk2utt, text, feat, utt2dur, utt2uniq\n\n\ndef GetCombinedUttIndexRange(utt_index, utts, utt_durs, minimum_duration):\n    # We want the minimum number of concatenations\n    # to reach the minimum_duration. If two concatenations satisfy\n    # the minimum duration constraint we choose the shorter one.\n    left_index = utt_index - 1\n    right_index = utt_index + 1\n    num_remaining_segments = len(utts) - 1\n    cur_utt_dur = utt_durs[utts[utt_index]]\n\n    while num_remaining_segments > 0:\n\n        left_utt_dur = 0\n        if left_index >= 0:\n            left_utt_dur = utt_durs[utts[left_index]]\n        right_utt_dur = 0\n        if right_index <= len(utts) - 1:\n            right_utt_dur = utt_durs[utts[right_index]]\n\n        right_combined_utt_dur = cur_utt_dur + right_utt_dur\n        left_combined_utt_dur = cur_utt_dur + left_utt_dur\n        left_right_combined_utt_dur = cur_utt_dur + left_utt_dur + right_utt_dur\n\n        combine_left_exit = False\n        combine_right_exit = False\n        if right_combined_utt_dur >= minimum_duration:\n            if left_combined_utt_dur >= minimum_duration:\n                if left_combined_utt_dur <= right_combined_utt_dur:\n                    combine_left_exit = True\n                else:\n                    combine_right_exit = True\n            else:\n 
               combine_right_exit = True\n        elif left_combined_utt_dur >= minimum_duration:\n            combine_left_exit = True\n        elif left_right_combined_utt_dur >= minimum_duration :\n            combine_left_exit = True\n            combine_right_exit = True\n\n        if combine_left_exit and combine_right_exit:\n            cur_utt_dur = left_right_combined_utt_dur\n            break\n        elif combine_left_exit:\n            cur_utt_dur = left_combined_utt_dur\n            # move back the right_index as we don't need to combine it\n            right_index = right_index - 1\n            break\n        elif combine_right_exit:\n            cur_utt_dur = right_combined_utt_dur\n            # move back the left_index as we don't need to combine it\n            left_index = left_index + 1\n            break\n\n        # couldn't satisfy minimum duration requirement so continue search\n        if left_index >= 0:\n            num_remaining_segments = num_remaining_segments - 1\n        if right_index <= len(utts) - 1:\n            num_remaining_segments = num_remaining_segments - 1\n\n        left_index = left_index - 1\n        right_index = right_index + 1\n\n        cur_utt_dur = left_right_combined_utt_dur\n    left_index = max(0, left_index)\n    right_index = min(len(utts)-1, right_index)\n    return left_index, right_index, cur_utt_dur\n\n\ndef WriteCombinedDirFiles(output_dir, utt2spk, spk2utt, text, feat, utt2dur, utt2uniq):\n    out_dir_file = lambda file_name: '{0}/{1}'.format(output_dir, file_name)\n    total_combined_utt_list = []\n    for speaker in spk2utt.keys():\n        utts = spk2utt[speaker]\n        for utt in utts:\n            if type(utt) is tuple:\n                #this is a combined utt\n                total_combined_utt_list.append((speaker, utt))\n\n    for speaker, combined_utt_tuple in total_combined_utt_list:\n        combined_utt_list = list(combined_utt_tuple)\n        combined_utt_list.sort()\n        
new_utt_name = \"-\".join(combined_utt_list)+'-appended'\n\n        # updating the utt2spk dict\n        for utt in combined_utt_list:\n            spk_name = utt2spk.pop(utt)\n        utt2spk[new_utt_name] = spk_name\n\n        # updating the spk2utt dict\n        spk2utt[speaker].remove(combined_utt_tuple)\n        spk2utt[speaker].append(new_utt_name)\n\n        # updating the text dict\n        combined_text = []\n        for utt in combined_utt_list:\n            combined_text.append(text.pop(utt))\n        text[new_utt_name] = ' '.join(combined_text)\n\n        # updating the feat dict\n        combined_feat = []\n        for utt in combined_utt_list:\n            combined_feat.append(feat.pop(utt))\n        feat_command = \"concat-feats --print-args=false {feats} - |\".format(feats = \" \".join(combined_feat))\n        feat[new_utt_name] = feat_command\n\n        # updating utt2dur\n        combined_dur = 0\n        for utt in combined_utt_list:\n            combined_dur += utt2dur.pop(utt)\n        utt2dur[new_utt_name] = combined_dur\n\n        # updating utt2uniq\n        if utt2uniq is not None:\n            combined_uniqs = []\n            for utt in combined_utt_list:\n                combined_uniqs.append(utt2uniq.pop(utt))\n            # utt2uniq file is used to map perturbed data to original unperturbed\n            # versions so that the training cross validation sets can avoid overlap\n            # of data however if perturbation changes the length of the utterance\n            # (e.g. speed perturbation) the utterance combinations in each\n            # perturbation of the original recording can be very different. 
So there\n            # is no good way to find the utt2uniq mapping so that we can avoid\n            # overlap.\n            utt2uniq[new_utt_name] = combined_uniqs[0]\n\n\n    WriteDictToFile(utt2spk, out_dir_file('utt2spk'))\n    WriteDictToFile(spk2utt, out_dir_file('spk2utt'))\n    WriteDictToFile(feat, out_dir_file('feats.scp'))\n    WriteDictToFile(text, out_dir_file('text'))\n    if utt2uniq is not None:\n        WriteDictToFile(utt2uniq, out_dir_file('utt2uniq'))\n    WriteDictToFile(utt2dur, out_dir_file('utt2dur'))\n\n\ndef CombineSegments(input_dir, output_dir, minimum_duration):\n    utt2spk, spk2utt, text, feat, utt2dur, utt2uniq = ParseDataDirInfo(input_dir)\n    total_combined_utt_list = []\n\n    # copy the duration dictionary so that we can modify it\n    utt_durs = copy.deepcopy(utt2dur)\n    # sorted() works on both Python 2 lists and Python 3 dict views\n    speakers = sorted(spk2utt.keys())\n    for speaker in speakers:\n\n        utts = spk2utt[speaker] # this is an assignment of the reference\n        # In WriteCombinedDirFiles the values of spk2utt will have the list\n        # of combined utts which will be used as reference\n\n        # we make an assumption that the sorted uttlist corresponds\n        # to contiguous segments. This is true only if utt naming\n        # is done according to accepted conventions\n        # this is an easily violatable assumption. 
Have to think of a better\n        # way to do this.\n        utts.sort()\n        utt_index = 0\n        while utt_index < len(utts):\n            if utt_durs[utts[utt_index]] < minimum_duration:\n                left_index, right_index, cur_utt_dur = GetCombinedUttIndexRange(utt_index, utts, utt_durs, minimum_duration)\n                if not cur_utt_dur >= minimum_duration:\n                    # this is a rare occurrence, better make the user aware of this\n                    # situation and let them deal with it\n                    warnings.warn('Speaker {0} does not have enough utterances to satisfy the minimum duration '\n                                  'constraint. Not modifying these utterances'.format(speaker))\n                    utt_index = utt_index + 1\n                    continue\n                combined_duration = 0\n                combined_utts = []\n                # update the utts_dur dictionary\n                for utt in utts[left_index:right_index + 1]:\n                    combined_duration += utt_durs.pop(utt)\n                    if type(utt) is tuple:\n                        for item in utt:\n                            combined_utts.append(item)\n                    else:\n                        combined_utts.append(utt)\n                combined_utts = tuple(combined_utts) # converting to immutable type to use as dictionary key\n                assert(cur_utt_dur == combined_duration)\n\n                # now modify the utts list\n                combined_indices = list(range(left_index, right_index + 1))\n                # start popping from the largest index so that the lower\n                # indexes are valid\n                for i in combined_indices[::-1]:\n                    utts.pop(i)\n                utts.insert(left_index, combined_utts)\n                utt_durs[combined_utts] = combined_duration\n                utt_index = left_index\n            utt_index = utt_index + 1\n    WriteCombinedDirFiles(output_dir, 
utt2spk, spk2utt, text, feat, utt2dur, utt2uniq)\n\ndef Main():\n    print(\"\"\"steps/cleanup/combine_short_segments.py: warning: this script is deprecated and will be removed.\n          Please use utils/data/combine_short_segments.sh\"\"\", file = sys.stderr)\n    args = GetArgs()\n\n    CheckFiles(args.input_data_dir)\n    MakeDir(args.output_data_dir)\n    feat_lengths = {}\n    segments_file = '{0}/segments'.format(args.input_data_dir)\n\n    RunKaldiCommand(\"utils/data/get_utt2dur.sh {0}\".format(args.input_data_dir))\n\n    CombineSegments(args.input_data_dir, args.output_data_dir, args.minimum_duration)\n\n    RunKaldiCommand(\"utils/utt2spk_to_spk2utt.pl {od}/utt2spk > {od}/spk2utt\".format(od = args.output_data_dir))\n    if os.path.exists('{0}/cmvn.scp'.format(args.input_data_dir)):\n        shutil.copy('{0}/cmvn.scp'.format(args.input_data_dir), args.output_data_dir)\n\n    RunKaldiCommand(\"utils/fix_data_dir.sh {0}\".format(args.output_data_dir))\nif __name__ == \"__main__\":\n    Main()\n\n"
  },
  {
    "path": "egs/steps/cleanup/create_segments_from_ctm.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2014  Guoguo Chen; 2015 Nagendra Kumar Goel\n# Apache 2.0\n\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\n# $SIG{__WARN__} = sub { $DB::single = 1 };\n\nmy $Usage = <<'EOU';\nThis script creates the segments file and text file for a data directory with\nnew segmentation. It takes a ctm file and an \"alignment\" file. The ctm file\ncorresponds to the audio that we want to make segmentations for, and is created\nby decoding the audio using existing in-domain models. The \"alignment\" file is\ngenerated by the binary align-text, and is Levenshtein alignment between the\noriginal transcript and the decoded output.\n\nInternally, the script first tries to find silence regions (gaps in the CTM).\nIf a silence region is found, and the neighboring words are free of errors\naccording to the alignment file, then this silence region will be taken as\na split point, and new segment will be created. If the new segment we are going\nto output is too long (longer than --max-seg-length), the script will split\nthe long segments into smaller pieces with length roughly --max-seg-length.\nIf you are going to use --wer-cutoff to filter out segments with high WER, make\nsure you set it to a reasonable value. 
If the value you set is lower than the\nWER from your alignment file, then most of the segments will be filtered out.\n\nUsage: steps/cleanup/create_segments_from_ctm.pl [options] \\\n                              <ctm> <aligned.txt> <segments> <text>\n e.g.: steps/cleanup/create_segments_from_ctm.pl \\\n          train_si284_split.ctm train_si284_split.aligned.txt \\\n          data/train_si284_reseg/segments data/train_si284_reseg/text\n\nAllowed options:\n  --max-seg-length  : Maximum length of new segments (default = 10.0)\n  --min-seg-length  : Minimum length of new segments (default = 2.0)\n  --min-sil-length  : Minimum length of silence as split point (default = 0.5)\n  --separator       : Separator for aligned pairs (default = \";\")\n  --special-symbol  : Special symbol to align with inserted or deleted words\n                      (default = \"<***>\")\n  --wer-cutoff      : Ignore segments with WER higher than the specified value.\n                      -1 means no segment will be ignored. (default = -1)\n  --use-silence-midpoints : Set to 1 if you want to use silence midpoints\n                      instead of min_sil_length for silence overhang. (default 0)\n  --force-correct-boundary-words : Set to zero if the segments will not be\n                      required to have boundary words to be correct. 
Default 1\n  --aligned-ctm-filename : If set, the intermediate aligned ctm\n                      is saved to this file\nEOU\n\nmy $max_seg_length = 10.0;\nmy $min_seg_length = 2.0;\nmy $min_sil_length = 0.5;\nmy $separator = \";\";\nmy $special_symbol = \"<***>\";\nmy $wer_cutoff = -1;\nmy $use_silence_midpoints = 0;\nmy $force_correct_boundary_words = 1;\nmy $aligned_ctm_filename = \"\";\nGetOptions(\n  'wer-cutoff=f' => \\$wer_cutoff,\n  'max-seg-length=f' => \\$max_seg_length,\n  'min-seg-length=f' => \\$min_seg_length,\n  'min-sil-length=f' => \\$min_sil_length,\n  'use-silence-midpoints=f' => \\$use_silence_midpoints,\n  'force-correct-boundary-words=f' => \\$force_correct_boundary_words,\n  'aligned-ctm-filename=s' => \\$aligned_ctm_filename,\n  'separator=s'      => \\$separator,\n  'special-symbol=s' => \\$special_symbol);\n\nif (@ARGV != 4) {\n  die $Usage;\n}\n\nmy ($ctm_in, $align_in, $segments_out, $text_out) = @ARGV;\n\nopen(CI, \"<$ctm_in\") || die \"Error: fail to open $ctm_in\\n\";\nopen(AI, \"<$align_in\") || die \"Error: fail to open $align_in\\n\";\nopen(my $SO, \">$segments_out\") || die \"Error: fail to open $segments_out\\n\";\nopen(my $TO, \">$text_out\") || die \"Error: fail to open $text_out\\n\";\nmy $ACT= undef;\nif ($aligned_ctm_filename ne \"\") {\n    open($ACT, \">$aligned_ctm_filename\");\n}\n# Prints the current segment to file.\nsub PrintSegment {\n  my ($aligned_ctm, $wav_id, $min_sil_length, $min_seg_length,\n      $seg_start_index, $seg_end_index, $seg_count, $SO, $TO) = @_;\n\n  if ($seg_start_index > $seg_end_index) {\n    return -1;\n  }\n\n  # Removes the surrounding silence.\n  while ($seg_start_index < scalar(@{$aligned_ctm}) &&\n         $aligned_ctm->[$seg_start_index]->[0] eq \"<eps>\") {\n    $seg_start_index += 1;\n  }\n  while ($seg_end_index >= 0 &&\n         $aligned_ctm->[$seg_end_index]->[0] eq \"<eps>\") {\n    $seg_end_index -= 1;\n  }\n  if ($seg_start_index > $seg_end_index) {\n    return -1;\n  }\n\n  # 
Filters out segments with high WER.\n  if ($wer_cutoff != -1) {\n    my $num_errors = 0; my $num_words = 0;\n    for (my $i = $seg_start_index; $i <= $seg_end_index; $i += 1) {\n      if ($aligned_ctm->[$i]->[0] ne \"<eps>\") {\n        $num_words += 1;\n      }\n      $num_errors += $aligned_ctm->[$i]->[3];\n    }\n    if ($num_errors / $num_words > $wer_cutoff || $num_words < 1) {\n      return -1;\n    }\n  }\n\n  # Works out the surrounding silence.\n  my $index = $seg_start_index - 1;\n  while ($index >= 0 && $aligned_ctm->[$index]->[0] eq\n         \"<eps>\" && $aligned_ctm->[$index]->[3] == 0) {\n    $index -= 1;\n  }\n  my $left_of_segment_has_deletion = \"false\";\n  $left_of_segment_has_deletion = \"true\"\n      if ($index > 0 && $aligned_ctm->[$index-1]->[0] ne \"<eps>\"\n          && $aligned_ctm->[$index-1]->[3] == 0);\n\n  my $pad_start_sil = ($aligned_ctm->[$seg_start_index]->[1] -\n                       $aligned_ctm->[$index + 1]->[1]) / 2.0;\n  if (($left_of_segment_has_deletion eq \"true\") || !$use_silence_midpoints) {\n      if ($pad_start_sil > $min_sil_length / 2.0) {\n          $pad_start_sil = $min_sil_length / 2.0;\n      }\n  }\n  my $right_of_segment_has_deletion = \"false\";\n  $index = $seg_end_index + 1;\n  while ($index < scalar(@{$aligned_ctm}) &&\n         $aligned_ctm->[$index]->[0] eq \"<eps>\" &&\n         $aligned_ctm->[$index]->[3] == 0) {\n    $index += 1;\n  }\n  $right_of_segment_has_deletion = \"true\"\n      if ($index < scalar(@{$aligned_ctm})-1 && $aligned_ctm->[$index+1]->[0] ne\n          \"<eps>\" && $aligned_ctm->[$index - 1]->[3] > 0);\n  my $pad_end_sil = ($aligned_ctm->[$index - 1]->[1] +\n                     $aligned_ctm->[$index - 1]->[2] -\n                     $aligned_ctm->[$seg_end_index]->[1] -\n                     $aligned_ctm->[$seg_end_index]->[2]) / 2.0;\n  if (($right_of_segment_has_deletion eq \"true\") || !$use_silence_midpoints) {\n      if ($pad_end_sil > $min_sil_length / 2.0) {\n          
$pad_end_sil = $min_sil_length / 2.0;\n      }\n  }\n\n  my $seg_start = $aligned_ctm->[$seg_start_index]->[1] - $pad_start_sil;\n  my $seg_end = $aligned_ctm->[$seg_end_index]->[1] +\n                $aligned_ctm->[$seg_end_index]->[2] + $pad_end_sil;\n  if ($seg_end - $seg_start < $min_seg_length) {\n      return -1;\n  }\n\n  $seg_start = sprintf(\"%.2f\", $seg_start);\n  $seg_end = sprintf(\"%.2f\", $seg_end);\n  my $seg_id = $wav_id . \"_\" . sprintf(\"%05d\", $seg_count);\n  print $SO \"$seg_id $wav_id $seg_start $seg_end\\n\";\n\n  print $TO \"$seg_id \";\n  for (my $x = $seg_start_index; $x <= $seg_end_index; $x += 1) {\n    if ($aligned_ctm->[$x]->[0] ne \"<eps>\") {\n      print $TO \"$aligned_ctm->[$x]->[0] \";\n    }\n  }\n  print $TO \"\\n\";\n  return 0;\n}\n\n# Computes split point.\nsub GetSplitPoint {\n  my ($aligned_ctm, $seg_start_index, $seg_end_index, $max_seg_length) = @_;\n\n  # Scan in the reversed order so we can maximize the length.\n  my $split_point = $seg_start_index;\n  for (my $x = $seg_end_index; $x > $seg_start_index; $x -= 1) {\n    my $current_seg_length = $aligned_ctm->[$x]->[1] +\n                             $aligned_ctm->[$x]->[2] -\n                             $aligned_ctm->[$seg_start_index]->[1];\n    if ($current_seg_length <= $max_seg_length) {\n      $split_point = $x;\n      last;\n    }\n  }\n  return $split_point;\n}\n\n# Computes segment length without surrounding silence.\nsub GetSegmentLengthNoSil {\n  my ($aligned_ctm, $seg_start_index, $seg_end_index) = @_;\n  while ($seg_start_index < scalar(@{$aligned_ctm}) &&\n         $aligned_ctm->[$seg_start_index]->[0] eq \"<eps>\") {\n    $seg_start_index += 1;\n  }\n  while ($seg_end_index >= 0 &&\n         $aligned_ctm->[$seg_end_index]->[0] eq \"<eps>\") {\n    $seg_end_index -= 1;\n  }\n  if ($seg_start_index > $seg_end_index) {\n    return 0;\n  }\n  my $current_seg_length = $aligned_ctm->[$seg_end_index]->[1] +\n                           
$aligned_ctm->[$seg_end_index]->[2] -\n                           $aligned_ctm->[$seg_start_index]->[1];\n  return $current_seg_length;\n}\n\n# Force splits long segments.\nsub SplitLongSegment {\n  my ($aligned_ctm, $wav_id, $max_seg_length, $min_sil_length,\n      $seg_start_index, $seg_end_index, $current_seg_count, $SO, $TO) = @_;\n  # If the segment is too long, we manually split it. We make sure that the\n  # resulting segments are at least ($max_seg_length / 2) seconds long.\n  my $current_seg_length = $aligned_ctm->[$seg_end_index]->[1] +\n                           $aligned_ctm->[$seg_end_index]->[2] -\n                           $aligned_ctm->[$seg_start_index]->[1];\n  my $current_seg_index = $seg_start_index;\n  my $aligned_ctm_size = scalar(@{$aligned_ctm});\n  while ($current_seg_length > 1.5 * $max_seg_length && $current_seg_index < $aligned_ctm_size-1) {\n    my $split_point = GetSplitPoint($aligned_ctm, $current_seg_index,\n                                    $seg_end_index, $max_seg_length);\n    my $ans = PrintSegment($aligned_ctm, $wav_id, $min_sil_length,\n                           $min_seg_length, $current_seg_index, $split_point,\n                           $current_seg_count, $SO, $TO);\n    $current_seg_count += 1 if ($ans != -1);\n    $current_seg_index = $split_point + 1;\n    $current_seg_length = $aligned_ctm->[$seg_end_index]->[1] +\n                          $aligned_ctm->[$seg_end_index]->[2] -\n                          $aligned_ctm->[$current_seg_index]->[1];\n  }\n\n  if ($current_seg_index eq $aligned_ctm_size-1) {\n      my $ans = PrintSegment($aligned_ctm, $wav_id, $min_sil_length,\n                             $min_seg_length, $current_seg_index, $current_seg_index,\n                             $current_seg_count, $SO, $TO);\n      $current_seg_count += 1 if ($ans != -1);\n      return ($current_seg_count, $current_seg_index);\n  }\n\n  if ($current_seg_length > $max_seg_length) {\n    my $split_point = 
GetSplitPoint($aligned_ctm, $current_seg_index,\n                                    $seg_end_index,\n                                    $current_seg_length / 2.0 + 0.01);\n    my $ans = PrintSegment($aligned_ctm, $wav_id, $min_sil_length,\n                           $min_seg_length, $current_seg_index, $split_point,\n                           $current_seg_count, $SO, $TO);\n    $current_seg_count += 1 if ($ans != -1);\n    $current_seg_index = $split_point + 1;\n  }\n\n  my $split_point = GetSplitPoint($aligned_ctm, $current_seg_index,\n                                  $seg_end_index, $max_seg_length + 0.01);\n  my $ans = PrintSegment($aligned_ctm, $wav_id, $min_sil_length,\n                         $min_seg_length, $current_seg_index, $split_point,\n                         $current_seg_count, $SO, $TO);\n  $current_seg_count += 1 if ($ans != -1);\n  $current_seg_index = $split_point + 1;\n\n  return ($current_seg_count, $current_seg_index);\n}\n\n# Processes each wav file.\nsub ProcessWav {\n  my ($max_seg_length, $min_seg_length, $min_sil_length, $special_symbol,\n      $current_ctm, $current_align, $SO, $TO, $ACT) = @_;\n\n  my $wav_id = $current_ctm->[0]->[0];\n  my $channel_id = $current_ctm->[0]->[1];\n  defined($wav_id) || die \"Error: empty wav section\\n\";\n\n  # First, we have to align the ctm file to the Levenshtein alignment.\n  # @aligned_ctm is a list of the following:\n  # [word, start_time, duration, num_errors]\n  my $ctm_index = 0;\n  my @aligned_ctm = ();\n  foreach my $entry (@{$current_align}) {\n    my $ref_word = $entry->[0];\n    my $hyp_word = $entry->[1];\n    if ($hyp_word eq $special_symbol) {\n      # Case 1: deletion, $hyp does not correspond to a word in the ctm file.\n      my $start = 0.0; my $dur = 0.0;\n      if (defined($aligned_ctm[-1])) {\n        $start = $aligned_ctm[-1]->[1] + $aligned_ctm[-1]->[2];\n      }\n      push(@aligned_ctm, [$ref_word, $start, $dur, 1]);\n    } else {\n      # Case 2: non-deletion, now $hyp 
corresponds to a word in ctm file.\n      while ($current_ctm->[$ctm_index]->[4] eq \"<eps>\") {\n        # Case 2.1: ctm contains silence at the corresponding place.\n        push(@aligned_ctm, [\"<eps>\", $current_ctm->[$ctm_index]->[2],\n                             $current_ctm->[$ctm_index]->[3], 0]);\n        $ctm_index += 1;\n      }\n      my $ctm_word = $current_ctm->[$ctm_index]->[4];\n      $hyp_word eq $ctm_word ||\n        die \"Error: got word $hyp_word in alignment but $ctm_word in ctm\\n\";\n      my $start = $current_ctm->[$ctm_index]->[2];\n      my $dur = $current_ctm->[$ctm_index]->[3];\n      if ($ref_word ne $ctm_word) {\n        if ($ref_word eq $special_symbol) {\n          # Case 2.2: insertion, we propagate the duration and error to the\n          #           previous one.\n          if (defined($aligned_ctm[-1])) {\n            $aligned_ctm[-1]->[2] += $dur;\n            $aligned_ctm[-1]->[3] += 1;\n          } else {\n            push(@aligned_ctm, [\"<eps>\", $start, $dur, 1]);\n          }\n        } else {\n          # Case 2.3: substitution.\n          push(@aligned_ctm, [$ref_word, $start, $dur, 1]);\n        }\n      } else {\n        # Case 2.4: correct.\n        push(@aligned_ctm, [$ref_word, $start, $dur, 0]);\n      }\n      $ctm_index += 1;\n    }\n  }\n\n  # Save the aligned CTM if needed\n  if(defined($ACT)){\n    for (my $i = 0; $i <= $#aligned_ctm; $i++) {\n      print $ACT \"$wav_id $channel_id $aligned_ctm[$i][1] $aligned_ctm[$i][2] \";\n      print $ACT \"$aligned_ctm[$i][0] $aligned_ctm[$i][3]\\n\";\n    }\n  }\n\n  # Second, we create segments from @align_ctm, using simple greedy method.\n  my $current_seg_index = 0;\n  my $current_seg_count = 0;\n  for (my $x = 0; $x < @aligned_ctm; $x += 1) {\n    my $lcorrect = \"true\"; my $rcorrect = \"true\";\n    $lcorrect = \"false\" if ($x > 0 && $aligned_ctm[$x - 1]->[3] > 0);\n    $rcorrect = \"false\" if ($x < @aligned_ctm - 1 &&\n                            
$aligned_ctm[$x + 1]->[3] > 0);\n\n    my $current_seg_length = GetSegmentLengthNoSil(\\@aligned_ctm,\n                                                   $current_seg_index, $x);\n\n    # We split the audio, if the silence is longer than the requested silence\n    # length, and if there are no alignment error around it. We also make sure\n    # that segment contains actual words, instead of pure silence.\n    if ($aligned_ctm[$x]->[0] eq \"<eps>\" &&\n        $aligned_ctm[$x]->[2] >= $min_sil_length\n       && (($force_correct_boundary_words && $lcorrect eq \"true\" &&\n            $rcorrect eq \"true\") || !$force_correct_boundary_words)) {\n      if ($current_seg_length <= $max_seg_length &&\n          $current_seg_length >= $min_seg_length) {\n        my $ans = PrintSegment(\\@aligned_ctm, $wav_id, $min_sil_length,\n                               $min_seg_length, $current_seg_index, $x,\n                               $current_seg_count, $SO, $TO);\n        $current_seg_count += 1 if ($ans != -1);\n        $current_seg_index = $x + 1;\n      } elsif ($current_seg_length > $max_seg_length) {\n        ($current_seg_count, $current_seg_index)\n          = SplitLongSegment(\\@aligned_ctm, $wav_id, $max_seg_length,\n                             $min_sil_length, $current_seg_index, $x,\n                             $current_seg_count, $SO, $TO);\n      }\n    }\n  }\n\n  # Last segment.\n  if ($current_seg_index <= @aligned_ctm - 1) {\n    SplitLongSegment(\\@aligned_ctm, $wav_id, $max_seg_length, $min_sil_length,\n                     $current_seg_index, @aligned_ctm - 1,\n                     $current_seg_count, $SO, $TO);\n  }\n}\n\n# Insert <eps> as silence so the down stream process will be easier. 
Example:\n#\n# Input ctm:\n# 011 A 3.39 0.23 SELL\n# 011 A 3.62 0.18 OFF\n# 011 A 3.83 0.45 ASSETS\n#\n# Output ctm:\n# 011 A 3.39 0.23 SELL\n# 011 A 3.62 0.18 OFF\n# 011 A 3.80 0.03 <eps>\n# 011 A 3.83 0.45 ASSETS\nsub InsertSilence {\n  my ($ctm_in, $ctm_out) = @_;\n  for (my $x = 1; $x < @{$ctm_in}; $x += 1) {\n    push(@{$ctm_out}, $ctm_in->[$x - 1]);\n\n    my $new_start = sprintf(\"%.2f\",\n                            $ctm_in->[$x - 1]->[2] + $ctm_in->[$x - 1]->[3]);\n    if ($new_start < $ctm_in->[$x]->[2]) {\n      my $new_dur = sprintf(\"%.2f\", $ctm_in->[$x]->[2] - $new_start);\n      push(@{$ctm_out}, [$ctm_in->[$x - 1]->[0], $ctm_in->[$x - 1]->[1],\n                         $new_start, $new_dur, \"<eps>\"]);\n    }\n  }\n  push(@{$ctm_out}, $ctm_in->[@{$ctm_in} - 1]);\n}\n\n# Reads the alignment.\nmy %aligned = ();\nwhile (<AI>) {\n  chomp;\n  my @col = split;\n  @col >= 2 || die \"Error: bad line $_\\n\";\n  my $wav = shift @col;\n  if ( (@col + 0) % 3 != 2) {\n    die \"Bad line in align-text output (unexpected number of fields): $_\";\n  }\n  my @pairs = ();\n\n  # Each pair occupies two fields followed by a separator, except the last\n  # pair, which has no trailing separator; hence the inclusive bound.\n  for (my $x = 0; $x * 3 + 2 <= @col; $x++) {\n    my $first_word = $col[$x * 3];\n    my $second_word = $col[$x * 3 + 1];\n    if ($x * 3 + 2 < @col) {\n      if ($col[$x * 3 + 2] ne $separator) {\n        die \"Bad line in align-text output (expected separator '$separator'): $_\";\n      }\n    }\n    # the [ ] expression returns a reference to a new anonymous array.\n    push(@pairs, [ $first_word, $second_word ]);\n  }\n  ! 
defined($aligned{$wav}) || die \"Error: $wav has already been processed\\n\";\n  $aligned{$wav} = \\@pairs;\n}\n\n# Reads the ctm file and creates the segmentation.\nmy $previous_wav_id = \"\";\nmy $previous_channel_id = \"\";\nmy @current_wav = ();\nwhile (<CI>) {\n  chomp;\n  my @col = split;\n  @col >= 5 || die \"Error: bad line $_\\n\";\n  if ($previous_wav_id eq $col[0] && $previous_channel_id eq $col[1]) {\n    push(@current_wav, \\@col);\n  } else {\n    if (@current_wav > 0) {\n      defined($aligned{$previous_wav_id}) ||\n        die \"Error: no alignment info for $previous_wav_id\\n\";\n      my @current_wav_silence = ();\n      InsertSilence(\\@current_wav, \\@current_wav_silence);\n      ProcessWav($max_seg_length, $min_seg_length, $min_sil_length,\n                 $special_symbol, \\@current_wav_silence,\n                 $aligned{$previous_wav_id}, $SO, $TO, $ACT);\n    }\n    @current_wav = ();\n    push(@current_wav, \\@col);\n    $previous_wav_id = $col[0];\n    $previous_channel_id = $col[1];\n  }\n}\n\n# The last wav file.\nif (@current_wav > 0) {\n  defined($aligned{$previous_wav_id}) ||\n    die \"Error: no alignment info for $previous_wav_id\\n\";\n  my @current_wav_silence = ();\n  InsertSilence(\\@current_wav, \\@current_wav_silence);\n  ProcessWav($max_seg_length, $min_seg_length, $min_sil_length, $special_symbol,\n             \\@current_wav_silence, $aligned{$previous_wav_id}, $SO, $TO, $ACT);\n}\n\nclose(CI);\nclose(AI);\nclose($SO);\nclose($TO);\nclose($ACT) if defined($ACT);\n"
  },
  {
    "path": "egs/steps/cleanup/debug_lexicon.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# this script gets some stats that will help you debug the lexicon.\n\n# Begin configuration section.\nstage=1\nremove_stress=false\nnj=10  # number of jobs for various decoding-type things that we run.\ncmd=run.pl\nalidir=\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"usage: $0 <data-dir> <lang-dir> <src-dir> <src-dict> <dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/tri4b data/local/dict/lexicon.txt exp/debug_lexicon\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd <cmd>                                      # command to run jobs, e.g. run.pl,queue.pl\"\n   echo \"  --stage <stage>                                  # use to control partial reruns.\"\n   echo \"  --remove-stress <true|false>                     # if true, remove stress before printing analysis\"\n   echo \"                                                   # note: if you change this, you only have to rerun\"\n   echo \"                                                   # from stage 10.\"\n   echo \"  --alidir <alignment-dir>                         # if supplied, training-data alignments and transforms\"\n   echo \"                                                   # are obtained from here instead of being generated.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrc=$3\nsrcdict=$4\ndir=$5\n\nset -e\n\nfor f in $data/feats.scp $lang/phones.txt $src/final.mdl $srcdict; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nmkdir -p $dir\nutils/lang/check_phones_compatible.sh $lang/phones.txt $src/phones.txt\ncp $lang/phones.txt $dir\n\nif [ -z $alidir ]; then\n  alidir=${src}_ali_$(basename $data)\n  if [ $stage -le 1 ]; then\n    steps/align_fmllr.sh --cmd \"$cmd\" --nj $nj $data $lang $src $alidir\n  fi\nfi\n\nphone_lang=data/$(basename $lang)_phone_bg\n\nif [ $stage -le 2 ]; then\n  utils/lang/make_phone_bigram_lang.sh $lang $alidir $phone_lang\nfi\n\nif [ $stage -le 3 ]; then\n  utils/mkgraph.sh $phone_lang $src $src/graph_phone_bg\nfi\n\nif [ $stage -le 4 ]; then\n  steps/decode_si.sh --skip-scoring true \\\n    --cmd \"$cmd\" --nj $nj --transform-dir $alidir \\\n    --acwt 0.25 --beam 10.0 --lattice-beam 5.0 --max-active 2500 \\\n    $src/graph_phone_bg $data $src/decode_$(basename $data)_phone_bg\nfi\n\nif [ $stage -le 5 ]; then\n  steps/get_train_ctm.sh --print-silence true --use-segments false \\\n     --cmd \"$cmd\" $data $lang $alidir\nfi\n\nif [ $stage -le 6 ]; then\n  steps/get_ctm.sh --use-segments false --cmd \"$cmd\" --min-lmwt 3 --max-lmwt 8 \\\n     $data $phone_lang $src/decode_$(basename $data)_phone_bg\nfi\n\nif [ $stage -le 7 ]; then\n  mkdir -p $dir\n  # lmwt=4 corresponds to the scale we decoded at.\n  cp $src/decode_$(basename $data)_phone_bg/score_4/$(basename $data).ctm $dir/phone.ctm\n\n  cp $alidir/ctm $dir/word.ctm\nfi\n\nif [ $stage -le 8 ]; then\n# we'll use 'sort' to do most of the heavy lifting when processing the data.\n# suppose word.ctm has an entry like\n# sw02054 A 213.32 0.24 and\n# we'll convert it into two entries like this, with the start and end separately:\n# sw02054-A 0021332 START and\n# sw02054-A 0021356 END and\n#\n# and suppose phone.ctm has lines like\n# sw02054 A 213.09 0.24 sil\n# sw02054 A 213.33 0.13 ae_B\n# we'll convert them into lines where the time is derived from the midpoint of the phone, like\n# sw02054 A 0021321 PHONE sil\n# sw02054 A 0021340 PHONE 
ae_B\n# and then we'll remove the optional-silence phones and, if needed, the word-boundary markers from\n# the phones, to get just\n# sw02054 A 0021340 PHONE ae\n# then after sorting and merge-sorting the two ctm files we can easily\n# work out for each word, what the phones were during that time.\n\n  grep -v '<eps>' $phone_lang/phones.txt | awk '{print $1, $1}' | \\\n    sed 's/_B$//' | sed 's/_I$//' | sed 's/_E$//' | sed 's/_S$//' >$dir/phone_map.txt\n\n\n  export LC_ALL=C\n\n  cat $dir/phone.ctm | utils/apply_map.pl -f 5 $dir/phone_map.txt > $dir/phone_mapped.ctm\n\n  cat $dir/word.ctm  | awk '{printf(\"%s-%s %010.0f START %s\\n\", $1, $2, 1000*$3, $5); printf(\"%s-%s %010.0f END %s\\n\", $1, $2, 1000*($3+$4), $5);}' | \\\n    sort > $dir/word_processed.ctm\n\n  # filter out those utterances which only appear in phone_processed.ctm but not in word_processed.ctm\n  cat $dir/phone_mapped.ctm | awk '{printf(\"%s-%s %010.0f PHONE %s\\n\", $1, $2, 1000*($3+(0.5*$4)), $5);}' | \\\n    awk 'NR==FNR{a[$1] = 1; next} {if($1 in a) print $0}' $dir/word_processed.ctm - | \\\n    sort > $dir/phone_processed.ctm\n\n  # merge-sort both ctm's\n  sort -m $dir/word_processed.ctm $dir/phone_processed.ctm > $dir/combined.ctm\nfi\n\n# after merge-sort of the two ctm's, we add <eps> to cover \"deserted\" phones due to precision limits, and then merge all consecutive <eps>'s.\nif [ $stage -le 9 ]; then\n  awk '{print $1, $3, $4}' $dir/combined.ctm | \\\n     perl -e ' while (<>) { chop; @A = split(\" \", $_); ($utt, $a,$b) = @A;\n     if ($a eq \"START\") { $cur_word = $b; @phones = (); }\n     if ($a eq \"END\") { print $utt, \" \", $cur_word, \" \", join(\" \", @phones), \"\\n\"; }\n     if ($a eq \"PHONE\") { if ($prev eq \"END\") {print $utt, \" \", \"<eps>\", \" \", $b, \"\\n\";} else {push @phones, $b;}} $prev = $a;} ' |\\\n     awk 'BEGIN{merge_prev=0;} {utt=$1;word=$2;pron=$3;for (i=4;i<=NF;i++) pron=pron\" \"$i;\n     if (word_prev == \"<eps>\" && word == \"<eps>\" && 
utt_prev == utt) {merge=0;pron_prev=pron_prev\" \"pron;} else {merge=1;}\n     if(merge_prev==1) {print utt_prev, word_prev, pron_prev;};\n     merge_prev=merge; utt_prev=utt; word_prev=word; pron_prev=pron;}\n     END{if(merge_prev==1) {print utt_prev, word_prev, pron_prev;}}' > $dir/ctm_prons.txt\n\n  steps/cleanup/internal/get_non_scored_words.py $lang > $dir/non_scored_words\n  steps/cleanup/internal/get_pron_stats.py $dir/ctm_prons.txt $phone_lang/phones/silence.txt $phone_lang/phones/optional_silence.txt $dir/non_scored_words - | \\\n    sort -nr > $dir/prons.txt\nfi\n\nif [ $stage -le 10 ]; then\n  if $remove_stress; then\n    perl -e 'while(<>) { @A=split(\" \", $_); for ($n=1;$n<@A;$n++) { $A[$n] =~ s/[0-9]$//; } print join(\" \", @A) . \"\\n\"; } ' \\\n      <$srcdict >$dir/lexicon.txt\n  else\n    cp $srcdict $dir/lexicon.txt\n  fi\n  silphone=$(cat $phone_lang/phones/optional_silence.txt)\n  echo \"<eps> $silphone\" >> $dir/lexicon.txt\n\n  awk '{count[$2] += $1;} END {for (w in count){print w, count[w];}}' \\\n      <$dir/prons.txt >$dir/counts.txt\n\n\n\n  cat $dir/prons.txt | \\\n    if $remove_stress; then\n      perl -e 'while(<>) { @A=split(\" \", $_); for ($n=1;$n<@A;$n++) { $A[$n] =~ s/[0-9]$//; } print join(\" \", @A) . \"\\n\"; } '\n    else\n      cat\n    fi | perl -e '\n     print \";; <count-of-this-pron> <rank-of-this-pron> <frequency-of-this-pron> CORRECT|INCORRECT <word> <pron>\\n\";\n     open(D, \"<$ARGV[0]\") || die \"opening dict file $ARGV[0]\";\n     # create a hash of all reference pronunciations, and for each word, record\n     # a list of the prons, separated by \" | \".\n     while (<D>) {\n        @A = split(\" \", $_); $is_pron{join(\" \",@A)} = 1;\n        $w = shift @A;\n        if (!defined $prons{$w}) { $prons{$w} = join(\" \", @A); }\n        else { $prons{$w} = $prons{$w} . \" | \" . 
join(\" \", @A); }\n     }\n     open(C, \"<$ARGV[1]\") || die \"opening counts file $ARGV[1];\";\n     while (<C>) { @A = split(\" \", $_); $word_count{$A[0]} = $A[1]; }\n     while (<STDIN>) { @A = split(\" \", $_);\n       $count = shift @A; $word = $A[0]; $freq = sprintf(\"%0.2f\", $count / $word_count{$word});\n       $rank = ++$wcount{$word}; # 1 if top observed pron of word, 2 if second...\n       $str = (defined $is_pron{join(\" \", @A)} ? \"CORRECT\" : \"INCORRECT\");\n       shift @A;\n       print \"$count $rank $freq $str $word \\\"\" . join(\" \", @A) . \"\\\", ref = \\\"$prons{$word}\\\"\\n\";\n     } ' $dir/lexicon.txt $dir/counts.txt  >$dir/pron_info.txt\n\n  grep -v '^;;' $dir/pron_info.txt | \\\n     awk '{ word=$5; count=$1; if (tot[word] == 0) { first_line[word] = $0; }\n            corr[word] += ($4 == \"CORRECT\" ? count : 0); tot[word] += count; }\n          END {for (w in tot) { printf(\"%s\\t%s\\t%s\\t\\t%s\\n\", tot[w], w, (corr[w]/tot[w]), first_line[w]); }} ' \\\n     | sort -k1 -nr | cat <( echo ';; <total-count-of-word> <word> <correct-proportion>      <first-corresponding-line-in-pron_info.txt>') - \\\n      > $dir/word_info.txt\nfi\n\nif [ $stage -le 11 ]; then\n  echo \"$0: some of the more interesting stuff in $dir/pron_info.txt follows.\"\n  echo \"# grep -w INCORRECT $dir/pron_info.txt  | grep -w 1 | head -n 20\"\n\n  grep -w INCORRECT $dir/pron_info.txt  | grep -w 1 | head -n 20\n\n  echo \"$0: here are some other interesting things..\"\n  echo \"# grep -w INCORRECT $dir/pron_info.txt  | grep -w 1 | awk '\\$3 > 0.4 && \\$1 > 10' | head -n 20\"\n  grep -w INCORRECT $dir/pron_info.txt  | grep -w 1 | awk '$3 > 0.4 && $1 > 10' | head -n 20\n\n  echo \"$0: here are some high-frequency words whose reference pronunciations rarely show up.\"\n  echo \"# awk '\\$3 < 0.1' $dir/word_info.txt  | head -n 20\"\n  awk '$3 < 0.1 || $1 == \";;\"' $dir/word_info.txt  | head -n 20\n\n\nfi\n"
  },
  {
    "path": "egs/steps/cleanup/decode_fmllr_segmentation.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen, 2015 GoVivace Inc. (Nagendra Goel)\n#           2017  Vimal Manohar\n# Apache 2.0\n\n# Similar to steps/cleanup/decode_segmentation.sh, but does fMLLR adaptation.\n# Decoding script with per-utterance graph that does fMLLR adaptation.\n# This can be on top of delta+delta-delta, or LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the\n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:\n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\nset -e\nset -o pipefail\n\n# Begin configuration section\nfirst_beam=10.0 # Beam used in initial, speaker-indep. 
pass\nfirst_max_active=2000 # max-active used in initial pass.\nalignment_model=\nadapt_model=\nfinal_model=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in\n              # lattice generation.\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nfmllr_update_type=full\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nskip_scoring=false\nscoring_opts=\nmax_fmllr_jobs=25  # I've seen the fMLLR jobs overload NFS badly if the decoding\n                   # was started with a large number of jobs, so we limit the number of\n                   # parallel jobs to 25 by default.\nallow_partial=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"$0: This is a special decoding script for segmentation where we\"\n   echo \"use one decoding graph per segment. 
We assume a file HCLG.fsts.scp exists\"\n   echo \"which is the scp file of the graphs for each segment.\"\n   echo \"This will normally be obtained by steps/cleanup/make_biased_lm_graphs.sh.\"\n   echo \"\"\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: $0 exp/tri2b/graph_train_si284_split \\\\\"\n   echo \"             data/train_si284_split exp/tri2b/decode_train_si284_split\"\n   echo \"\"\n   echo \"where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \"where the model is.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <final-mdl>                # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... 
used to get posteriors\"\n   echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n   echo \"  --scoring-opts <opts>                    # options to local/score.sh\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=`echo $3 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` || true  # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null` || true  # optional; do not abort under 'set -e' if absent.\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null` || true\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\nutils/lang/check_phones_compatible.sh $graphdir/phones.txt $srcdir/phones.txt\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fsts.scp $data/feats.scp $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# Split HCLG.fsts.scp by input utterance\nn1=$(cat $graphdir/HCLG.fsts.scp | wc -l)\nn2=$(cat $data/feats.scp | wc -l)\nif [ $n1 != $n2 ]; then\n  echo \"$0: expected $n2 graphs in $graphdir/HCLG.fsts.scp, got $n1\"\nfi\n\nmkdir -p $dir/split_fsts\nsort -k1,1 $graphdir/HCLG.fsts.scp > $dir/HCLG.fsts.sorted.scp\nutils/filter_scps.pl --no-warn -f 1 JOB=1:$nj \\\n  $sdata/JOB/feats.scp $dir/HCLG.fsts.sorted.scp $dir/split_fsts/HCLG.fsts.JOB.scp\nHCLG=scp:$dir/split_fsts/HCLG.fsts.JOB.scp\n\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! 
-f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n##\n\n## Do the speaker-independent decoding, if --si-dir option not present. ##\nif [ -z \"$si_dir\" ]; then # we need to do the speaker-independent decoding pass.\n  si_dir=${dir}.si # Name it as our decoding dir, but with suffix \".si\".\n  if [ $stage -le 0 ]; then\n    if [ -f \"$graphdir/num_pdfs\" ]; then\n      [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $alignment_model | grep pdfs | awk '{print $NF}'` ] || \\\n        { echo \"Mismatch in number of pdfs with $alignment_model\"; exit 1; }\n    fi\n    steps/cleanup/decode_segmentation.sh --scoring-opts \"$scoring_opts\" \\\n           --num-threads $num_threads --skip-scoring $skip_scoring \\\n           --acwt $acwt --nj $nj --cmd \"$cmd\" --beam $first_beam \\\n           --model $alignment_model --max-active \\\n           $first_max_active $graphdir $data $si_dir || exit 1;\n  fi\nfi\n##\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $si_dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ ! -f \"$si_dir/lat.1.gz\" ] && echo \"No such file $si_dir/lat.1.gz\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n## Set up the unadapted features \"$sifeats\"\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n##\n\n## Now get the first-pass fMLLR transforms.\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass fMLLR transforms.\"\n  $cmd --max-jobs-run $max_fmllr_jobs JOB=1:$nj $dir/log/fmllr_pass1.JOB.log \\\n    gunzip -c $si_dir/lat.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $alignment_model ark:- ark:- \\| \\\n    gmm-post-to-gpost $alignment_model \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$sifeats\" ark,s,cs:- \\\n    ark:$dir/pre_trans.JOB || exit 1;\nfi\n##\n\npass1feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/pre_trans.JOB ark:- ark:- |\"\n\n## Do the main lattice generation pass.  
Note: we don't determinize the lattices at\n## this stage, as we're going to use them in acoustic rescoring with the larger\n## model, and it's more correct to store the full state-level lattice for this purpose.\nif [ $stage -le 2 ]; then\n  echo \"$0: doing main lattice generation phase\"\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $adapt_model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $adapt_model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --determinize-lattice=false \\\n    --allow-partial=$allow_partial --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model \"$HCLG\" \"$pass1feats\" \"ark:|gzip -c > $dir/lat.tmp.JOB.gz\"\nfi\n##\n\n## Do a second pass of estimating the transform-- this time with the lattices\n## generated from the alignment model.  
Compose the transforms to get\n## $dir/trans.1, etc.\nif [ $stage -le 3 ]; then\n  echo \"$0: estimating fMLLR transforms a second time.\"\n  $cmd --max-jobs-run $max_fmllr_jobs JOB=1:$nj $dir/log/fmllr_pass2.JOB.log \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=4.0 \\\n    \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$pass1feats\" \\\n    ark,s,cs:- ark:$dir/trans_tmp.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans_tmp.JOB ark:$dir/pre_trans.JOB \\\n    ark:$dir/trans.JOB  || exit 1;\nfi\n##\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# At this point we prune and determinize the lattices and write them out, ready for\n# language model rescoring.\n\nif [ $stage -le 4 ]; then\n  echo \"$0: doing a final pass of acoustic rescoring.\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/acoustic_rescore.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" '&&' rm $dir/lat.tmp.JOB.gz || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"$0: Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. 
(ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nrm $dir/{trans_tmp,pre_trans}.*\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/cleanup/decode_segmentation.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen, 2015 GoVivace Inc. (Nagendra Goel)\n#           2017  Vimal Manohar\n# Apache 2.0\n\n# Some basic error checking, similar to steps/decode.sh is added.\n\nset -e\nset -o pipefail\n\n# Begin configuration section.\ntransform_dir=   # this option won't normally be used, but it can be used if you\n                 # want to supply existing fMLLR transforms when decoding.\niter=\nmodel= # You can specify the model to use (e.g. if you want to use the .alimdl)\nstage=0\nnj=4\ncmd=run.pl\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects pruning (scoring is on lattices).\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nscoring_opts=\nallow_partial=true\n# note: there are no more min-lmwt and max-lmwt options, instead use\n# e.g. --scoring-opts \"--min-lmwt 1 --max-lmwt 20\"\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"$0: This is a special decoding script for segmentation where we\"\n   echo \"use one decoding graph per segment. 
We assume a file HCLG.fsts.scp exists\"\n   echo \"which is the scp file of the graphs for each segment.\"\n   echo \"This will normally be obtained by steps/cleanup/make_biased_lm_graphs.sh.\"\n   echo \"This script does not estimate fMLLR transforms; you have to use\"\n   echo \"the --transform-dir option if you want to use fMLLR.\"\n   echo \"\"\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: $0 exp/tri2b/graph_train_si284_split \\\\\"\n   echo \"             data/train_si284_split exp/tri2b/decode_train_si284_split\"\n   echo \"\"\n   echo \"where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \"where the model is.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --model <model>                                  # which model to use (e.g. 
to\"\n   echo \"                                                   # specify the final.alimdl)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --transform-dir <trans-dir>                      # dir to find fMLLR transforms \"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"  --num-threads <n>                                # number of threads to use, default 1.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\n\nmkdir -p $dir/log\n\nif [ -e $dir/final.mdl ]; then\n  srcdir=$dir\nelif [ -e $dir/../final.mdl ]; then\n  srcdir=$(dirname $dir)\nelse\n  echo \"$0: expected either $dir/final.mdl or $dir/../final.mdl to exist\"\n  exit 1\nfi\nsdata=$data/split$nj;\n\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nif [ -z \"$model\" ]; then # if --model <mdl> was not specified on the command line...\n  if [ -z $iter ]; then model=$srcdir/final.mdl;\n  else model=$srcdir/$iter.mdl; fi\nfi\n\nif [ $(basename $model) != final.alimdl ] ; then\n  # Do not use the $srcpath -- look at the path where the model is\n  if [ -f $(dirname $model)/final.alimdl ] && [ -z \"$transform_dir\" ]; then\n    echo -e '\\n\\n'\n    echo $0 'WARNING: Running speaker independent system decoding using a SAT model!'\n    echo $0 'WARNING: This is OK if you know what you are doing...'\n    echo -e '\\n\\n'\n  fi\nfi\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $model $graphdir/HCLG.fsts.scp; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nutils/lang/check_phones_compatible.sh $graphdir/phones.txt $srcdir/phones.txt\n\n# Split HCLG.fsts.scp by input utterance\nn1=$(cat $graphdir/HCLG.fsts.scp | wc -l)\nn2=$(cat $data/feats.scp | wc -l)\nif [ $n1 != $n2 ]; then\n  echo \"$0: expected $n2 graphs in $graphdir/HCLG.fsts.scp, got $n1\"\nfi\n\n\nmkdir -p $dir/split_fsts\nsort -k1,1 $graphdir/HCLG.fsts.scp > $dir/HCLG.fsts.sorted.scp\nutils/filter_scps.pl --no-warn -f 1 JOB=1:$nj \\\n  $sdata/JOB/feats.scp $dir/HCLG.fsts.sorted.scp $dir/split_fsts/HCLG.fsts.JOB.scp\nHCLG=scp:$dir/split_fsts/HCLG.fsts.JOB.scp\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` || true # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null` || true\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null` || true\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist\" && exit 1\n  [ ! 
-s $transform_dir/num_jobs ] && \\\n    echo \"$0: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    echo \"$0: num-jobs for transforms mismatches, so copying them.\"\n    for n in $(seq $nj_orig); do cat $transform_dir/trans.$n; done | \\\n       copy-feats ark:- ark,scp:$dir/trans.ark,$dir/trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\n  fi\nfi\n\nif [ $stage -le 0 ]; then\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --allow-partial=$allow_partial --word-symbol-table=$graphdir/words.txt \\\n    $model \"$HCLG\" \"$feats\" \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"$0: Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/cleanup/decode_segmentation_nnet3.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen, 2015 GoVivace Inc. (Nagendra Goel)\n#           2017  Vimal Manohar\n# Apache 2.0\n\n# This script is similar to steps/cleanup/decode_segmentation.sh, but \n# does decoding using nnet3 model.\n\nset -e\nset -o pipefail\n\n# Begin configuration section.\nstage=-1\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\ncmd=run.pl\nbeam=15.0\nframes_per_chunk=50\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0  # Beam we use in lattice generation. We can reduce this if \n                  # we only need the best path\niter=final\nnum_threads=1 # if >1, will use nnet3-latgen-faster-parallel\nscoring_opts=\nskip_scoring=false\nallow_partial=true\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\nminimize=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. utils/parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n   echo \"$0: This is a special decoding script for segmentation where we\"\n   echo \"use one decoding graph per segment. 
We assume a file HCLG.fsts.scp exists\"\n   echo \"which is the scp file of the graphs for each segment.\"\n   echo \"This will normally be obtained by steps/cleanup/make_biased_lm_graphs.sh.\"\n   echo \"\"\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: $0 --online-ivector-dir exp/nnet3/ivectors_train_si284_split \"\n   echo \"             exp/nnet3/tdnn/graph_train_si284_split \\\\\"\n   echo \"             data/train_si284_split exp/nnet3/tdnn/decode_train_si284_split\"\n   echo \"\"\n   echo \"where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \"where the model is.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"  --num-threads <n>                                # number of threads to use, default 1.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\n\nmkdir -p $dir/log\n\nif [ -e $dir/$iter.mdl ]; then\n  srcdir=$dir\nelif [ -e $dir/../$iter.mdl ]; then\n  srcdir=$(dirname $dir)\nelse\n  echo \"$0: expected either $dir/$iter.mdl or $dir/../$iter.mdl to exist\"\n  exit 1\nfi\nmodel=$srcdir/$iter.mdl\n\n\nextra_files=\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nutils/lang/check_phones_compatible.sh $graphdir/phones.txt $srcdir/phones.txt || exit 1\n\nfor f in $graphdir/HCLG.fsts.scp $data/feats.scp $model $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n# Split HCLG.fsts.scp by input utterance\nn1=$(cat $graphdir/HCLG.fsts.scp | wc -l)\nn2=$(cat $data/feats.scp | wc -l)\nif [ $n1 != $n2 ]; then\n  echo \"$0: expected $n2 graphs in $graphdir/HCLG.fsts.scp, got $n1\"\nfi\n\nmkdir -p $dir/split_fsts\nsort -k1,1 $graphdir/HCLG.fsts.scp > $dir/HCLG.fsts.sorted.scp\nutils/filter_scps.pl --no-warn -f 1 JOB=1:$nj \\\n  $sdata/JOB/feats.scp $dir/HCLG.fsts.sorted.scp $dir/split_fsts/HCLG.fsts.JOB.scp\nHCLG=scp:$dir/split_fsts/HCLG.fsts.JOB.scp\n\n## Set up features.\necho \"$0: feature type is raw\"\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"ark:|gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. 
for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 1 ]; then\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet3-latgen-faster$thread_string $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=$allow_partial \\\n     --word-symbol-table=$graphdir/words.txt \"$model\" \\\n     \"$HCLG\" \"$feats\" \"$lat_wspecifier\" || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"$0: Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    iter_opt=\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $iter_opt $scoring_opts --cmd \"$cmd\" $data $graphdir $dir ||\n      { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/cleanup/find_bad_utts.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments using a model with delta or\n# LDA+MLLT features.  This version, rather than just using the\n# text to align, computes mini-language models (unigram) from the text\n# and a few common words in the LM.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nacoustic_scale=0.1\nbeam=15.0\nlattice_beam=8.0\nmax_active=750\ntransform_dir=  # directory to find fMLLR transforms in.\ntop_n_words=100 # Number of common words that we compile into each graph (most frequent\n                # in $lang/text.\nstage=-1\ncleanup=true\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"$0: Warning: this script is deprecated and will be removed.\"\n  echo \"  ... please use steps/cleanup/clean_and_segment_data.sh,\"\n  echo \" which produces the same output formats as this script\"\n  echo \" (e.g. all_info.sorted.txt)\"\n  echo \"Usage: $0 <data-dir> <lang-dir> <src-dir> <dir>\"\n  echo \"e.g.:  $0 data/train data/lang exp/tri1 exp/tri1_debug\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --use-graphs true                                # use graphs in src-dir\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nfor f in $data/text $lang/oov.int $srcdir/tree $srcdir/final.mdl \\\n    $lang/L_disambig.fst $lang/phones/disambig.int; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\n\n\nif [ $stage -le 0 ]; then\n  utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt <$data/text | \\\n    awk '{for(x=2;x<=NF;x++) print $x;}' | sort | uniq -c | \\\n    sort -rn > $dir/word_counts.int || exit 1;\n  num_words=$(awk '{x+=$1} END{print x}' < $dir/word_counts.int) || exit 1;\n  # print top-n words with their unigram probabilities.\n\n  head -n $top_n_words $dir/word_counts.int | awk -v tot=$num_words '{print $1/tot, $2;}' >$dir/top_words.int\n  utils/int2sym.pl -f 2 $lang/words.txt <$dir/top_words.int >$dir/top_words.txt\nfi\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $srcdir/full.mat $dir\n   ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\nif [ -z \"$transform_dir\" ] && [ -f $srcdir/trans.1 ]; then\n  
transform_dir=$srcdir\nfi\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    for n in $(seq $nj_orig); do cat $transform_dir/trans.$n; done | \\\n      copy-feats ark:- ark,scp:$dir/trans.ark,$dir/trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\n  fi\nelif [ -f $srcdir/final.alimdl ]; then\n  echo \"$0: **WARNING**: you seem to be using an fMLLR system as input,\"\n  echo \"  but you are not providing the --transform-dir option during alignment.\"\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: decoding $data using utterance-specific decoding graphs using model from $srcdir, output in $dir\"\n\n  rm $dir/edits.*.txt $dir/aligned_ref.*.txt 2>/dev/null\n\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text \\| \\\n    steps/cleanup/make_utterance_fsts.pl $dir/top_words.int \\| \\\n    compile-train-graphs-fsts $scale_opts --read-disambig-syms=$lang/phones/disambig.int \\\n     $dir/tree $dir/final.mdl $lang/L_disambig.fst ark:- ark:- \\| \\\n    gmm-latgen-faster --acoustic-scale=$acoustic_scale --beam=$beam \\\n      --max-active=$max_active --lattice-beam=$lattice_beam \\\n      --word-symbol-table=$lang/words.txt \\\n     $dir/final.mdl ark:- \"$feats\" ark:- \\| \\\n    lattice-oracle ark:- \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\" \\\n      ark,t:- ark,t:$dir/edits.JOB.txt \\| \\\n    utils/int2sym.pl -f 2- $lang/words.txt '>' 
$dir/aligned_ref.JOB.txt || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if [ -f $dir/edits.1.txt ]; then\n    # the awk commands below are to ensure that partially-written files don't confuse us.\n    for x in $(seq $nj); do cat $dir/edits.$x.txt; done | awk '{if(NF==2){print;}}' > $dir/edits.txt\n    for x in $(seq $nj); do cat $dir/aligned_ref.$x.txt; done | awk '{if(NF>=1){print;}}' > $dir/aligned_ref.txt\n  else\n    echo \"$0: warning: no file $dir/edits.1.txt, using previously concatenated file if present.\"\n  fi\n\n  # in case any utterances failed to align, get filtered copy of $data/text\n  utils/filter_scp.pl $dir/edits.txt < $data/text  > $dir/text\n  cat $dir/text | awk '{print $1, (NF-1);}' > $dir/length.txt\n\n  n1=$(wc -l < $dir/edits.txt)\n  n2=$(wc -l < $dir/aligned_ref.txt)\n  n3=$(wc -l < $dir/text)\n  n4=$(wc -l < $dir/length.txt)\n  if [ $n1 -ne $n2 ] || [ $n2 -ne $n3 ] || [ $n3 -ne $n4 ]; then\n    echo \"$0: mismatch in lengths of files:\"\n    wc $dir/edits.txt $dir/aligned_ref.txt $dir/text $dir/length.txt\n    exit 1;\n  fi\n\n  # note: the format of all_info.txt is:\n  # <utterance-id>   <number of errors>  <reference-length>  <decoded-output>   <reference>\n  # with the fields separated by tabs, e.g.\n  # adg04_sr009_trn 1 \t12\t SHOW THE GRIDLEY+S TRACK IN BRIGHT ORANGE WITH HORNE+S IN DIM RED AT\t SHOW THE GRIDLEY+S TRACK IN BRIGHT ORANGE WITH HORNE+S IN DIM RED\n\n  paste $dir/edits.txt \\\n      <(awk '{print $2}' $dir/length.txt) \\\n      <(awk '{$1=\"\";print;}' <$dir/aligned_ref.txt) \\\n      <(awk '{$1=\"\";print;}' <$dir/text) > $dir/all_info.txt\n\n  sort -nr -k2 $dir/all_info.txt > $dir/all_info.sorted.txt\n\n  if $cleanup; then\n    rm $dir/edits.*.txt $dir/aligned_ref.*.txt\n  fi\n\nfi\n\nif [ $stage -le 3 ]; then\n  ###\n  # These stats might help people figure out what is wrong with the data\n  # a)human-friendly and machine-parsable alignment in the file per_utt_details.txt\n  # b)evaluation of per-speaker 
performance to possibly find speakers with\n  #   distinctive accents/speech disorders and similar\n  # c)Global analysis on (Ins/Del/Sub) operation, which might be used to figure\n  #   out if there is systematic issue with lexicon, pronunciation or phonetic confusability\n\n  mkdir -p $dir/analysis\n  align-text --special-symbol=\"***\"  ark:$dir/text ark:$dir/aligned_ref.txt  ark,t:- | \\\n    utils/scoring/wer_per_utt_details.pl --special-symbol \"***\" > $dir/analysis/per_utt_details.txt\n\n  cat $dir/analysis/per_utt_details.txt | \\\n    utils/scoring/wer_per_spk_details.pl $data/utt2spk > $dir/analysis/per_spk_details.txt\n\n  cat $dir/analysis/per_utt_details.txt | \\\n    utils/scoring/wer_ops_details.pl --special-symbol \"***\" | \\\n    sort -i -b -k1,1 -k4,4nr -k2,2 -k3,3 > $dir/analysis/ops_details.txt\n\nfi\n\n"
  },
  {
    "path": "egs/steps/cleanup/find_bad_utts_nnet.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey)\n#           2016       Api.ai (Author: Ilya Platonov)      \n# Apache 2.0\n#\n# Tweaked version of find_bad_utts.sh to work with nnet2 and nnet3(supports chain models) non-ivector models.\n# This script uses nnet-info and nnet3-am-info to determine type of nnet (nnet2 or nnet3).\n# Use --acoustic-scale=1.0 for chain models.\n#\n# Begin configuration section.  \nnj=8\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nacoustic_scale=0.1\nbeam=15.0\nlattice_beam=8.0\nmax_active=750\ntransform_dir=  # directory to find fMLLR transforms in.\ntop_n_words=100 # Number of common words that we compile into each graph (most frequent\n                # in $lang/text.\nstage=-1\ncleanup=true\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: $0 <data-dir> <lang-dir> <src-dir> <dir>\"\n   echo \"e.g.:  $0 data/train data/lang exp/tri1 exp/tri1_debug\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nfor f in $data/text $lang/oov.int $srcdir/tree $srcdir/final.mdl \\\n    $lang/L_disambig.fst $lang/phones/disambig.int; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n#checking type of nnet\nif nnet-info 1>/dev/null 2>/dev/null $srcdir/final.mdl; then \n  nnet_type=\"nnet\";\n  latgen_cmd=\"nnet-latgen-faster\";\nelif nnet3-am-info 1>/dev/null 2>/dev/null $srcdir/final.mdl; then\n  nnet_type=\"nnet3\"\n  frame_subsampling_factor=1;\n  nnet3_opt=\n  if [ -f $srcdir/frame_subsampling_factor ]; then\n    frame_subsampling_factor=\"$(cat $srcdir/frame_subsampling_factor)\"\n  fi\n  if [ \"$frame_subsampling_factor\" != \"1\" ]; then\n    nnet3_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\";\n  fi\n  latgen_cmd=\"nnet3-latgen-faster $nnet3_opt\";\nelse\n  echo \"Unsupported type of nnet for $srcdir/final.mdl\";\nfi \n\necho \"nnet type is $nnet_type\";\n\n\nif [ $stage -le 0 ]; then\n  utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt <$data/text | \\\n    awk '{for(x=2;x<=NF;x++) print $x;}' | sort | uniq -c | \\\n    sort -rn > $dir/word_counts.int || exit 1;\n  num_words=$(awk '{x+=$1} END{print x}' < $dir/word_counts.int) || exit 1;\n  # print top-n words with their unigram probabilities.\n\n  head -n $top_n_words $dir/word_counts.int | awk -v tot=$num_words '{print $1/tot, $2;}' >$dir/top_words.int\n  utils/int2sym.pl -f 2 $lang/words.txt <$dir/top_words.int >$dir/top_words.txt\nfi\n\necho \"$0: feature type is 
raw\"\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\";\n\nif [ $stage -le 1 ]; then\n  echo \"$0: decoding $data using utterance-specific decoding graphs using model from $srcdir, output in $dir\"\n\n  rm $dir/edits.*.txt $dir/aligned_ref.*.txt 2>/dev/null\n\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text \\| \\\n    steps/cleanup/make_utterance_fsts.pl $dir/top_words.int \\| \\\n    compile-train-graphs-fsts $scale_opts --read-disambig-syms=$lang/phones/disambig.int \\\n     $dir/tree $dir/final.mdl $lang/L_disambig.fst ark:- ark:- \\| \\\n    $latgen_cmd --acoustic-scale=$acoustic_scale --beam=$beam \\\n      --max-active=$max_active --lattice-beam=$lattice_beam \\\n      --word-symbol-table=$lang/words.txt \\\n     $dir/final.mdl ark:- \"$feats\" ark:- \\| \\\n    lattice-oracle ark:- \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\" \\\n      ark,t:- ark,t:$dir/edits.JOB.txt \\| \\\n    utils/int2sym.pl -f 2- $lang/words.txt '>' $dir/aligned_ref.JOB.txt || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if [ -f $dir/edits.1.txt ]; then\n    # the awk commands below are to ensure that partially-written files don't confuse us.\n    for x in $(seq $nj); do cat $dir/edits.$x.txt; done | awk '{if(NF==2){print;}}' > $dir/edits.txt\n    for x in $(seq $nj); do cat $dir/aligned_ref.$x.txt; done | awk '{if(NF>=1){print;}}' > $dir/aligned_ref.txt\n  else\n    echo \"$0: warning: no file $dir/edits.1.txt, using previously concatenated file if present.\"\n  fi\n\n  # in case any utterances failed to align, get filtered copy of $data/text\n  utils/filter_scp.pl $dir/edits.txt < $data/text  > $dir/text\n  cat $dir/text | awk '{print $1, (NF-1);}' > $dir/length.txt\n\n  n1=$(wc -l < $dir/edits.txt)\n  n2=$(wc -l < $dir/aligned_ref.txt)\n  n3=$(wc -l < $dir/text)\n  n4=$(wc -l < 
$dir/length.txt)\n  if [ $n1 -ne $n2 ] || [ $n2 -ne $n3 ] || [ $n3 -ne $n4 ]; then\n    echo \"$0: mismatch in lengths of files:\"\n    wc $dir/edits.txt $dir/aligned_ref.txt $dir/text $dir/length.txt\n    exit 1;\n  fi\n\n  # note: the format of all_info.txt is:\n  # <utterance-id>   <number of errors>  <reference-length>  <decoded-output>   <reference>\n  # with the fields separated by tabs, e.g.\n  # adg04_sr009_trn 1 \t12\t SHOW THE GRIDLEY+S TRACK IN BRIGHT ORANGE WITH HORNE+S IN DIM RED AT\t SHOW THE GRIDLEY+S TRACK IN BRIGHT ORANGE WITH HORNE+S IN DIM RED\n  \n  paste $dir/edits.txt \\\n      <(awk '{print $2}' $dir/length.txt) \\\n      <(awk '{$1=\"\";print;}' <$dir/aligned_ref.txt) \\\n      <(awk '{$1=\"\";print;}' <$dir/text) > $dir/all_info.txt\n\n  sort -nr -k2 $dir/all_info.txt > $dir/all_info.sorted.txt\n\n  if $cleanup; then\n    rm $dir/edits.*.txt $dir/aligned_ref.*.txt\n  fi\n\nfi\n\nif [ $stage -le 3 ]; then\n  ###\n  # These stats might help people figure out what is wrong with the data\n  # a)human-friendly and machine-parsable alignment in the file per_utt_details.txt\n  # b)evaluation of per-speaker performance to possibly find speakers with \n  #   distinctive accents/speech disorders and similar\n  # c)Global analysis on (Ins/Del/Sub) operation, which might be used to figure\n  #   out if there is systematic issue with lexicon, pronunciation or phonetic confusability\n\n  mkdir -p $dir/analysis\n  align-text --special-symbol=\"***\"  ark:$dir/text ark:$dir/aligned_ref.txt  ark,t:- | \\\n    utils/scoring/wer_per_utt_details.pl --special-symbol \"***\" > $dir/analysis/per_utt_details.txt\n\n  cat $dir/analysis/per_utt_details.txt | \\\n    utils/scoring/wer_per_spk_details.pl $data/utt2spk > $dir/analysis/per_spk_details.txt\n\n  cat $dir/analysis/per_utt_details.txt | \\\n    utils/scoring/wer_ops_details.pl --special-symbol \"***\" | \\\n    sort -i -b -k1,1 -k4,4nr -k2,2 -k3,3 > $dir/analysis/ops_details.txt\n\nfi\n\n"
  },
  {
    "path": "egs/steps/cleanup/internal/align_ctm_ref.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2016    Vimal Manohar\n#           2020    Dongji Gao\n# Apache 2.0.\n\n\"\"\"This module aligns a hypothesis (CTM or text) with a reference to\nfind the best matching sub-sequence in the reference for the hypothesis\nusing Smith-Waterman like alignment.\n\ne.g.: align_ctm_ref.py --hyp-format=CTM --ref=data/train/text --hyp=foo/ctm\n        --output=foo/ctm_edits\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nhandler = logging.StreamHandler()\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.setLevel(logging.DEBUG)\n\nverbose_level = 0\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"\n    This module aligns a hypothesis (CTM or text) with a reference to find the\n    best matching sub-sequence in the reference for the hypothesis using\n    Smith-Waterman like alignment.\n\n    e.g.: align_ctm_ref.py --align-full-hyp=false --hyp-format=CTM\n    --reco2file-and-channel=data/foo/reco2file_and_channel --ref=data/train/text\n    --hyp=foo/ctm --output=foo/ctm_edits\n    \"\"\")\n\n    parser.add_argument(\"--hyp-format\", type=str, choices=[\"Text\", \"CTM\"],\n                        default=\"CTM\",\n                        help=\"Format used for the hypothesis\")\n    parser.add_argument(\"--reco2file-and-channel\", type=argparse.FileType('r'),\n                        help=\"\"\"reco2file_and_channel file.\n                        This will be used to match references that are usually\n                        indexed by the recording-id with the CTM lines that have\n                        file and channel. 
This option is typically not\n                        required.\"\"\")\n    parser.add_argument(\"--eps-symbol\", type=str, default=\"-\",\n                        help=\"Symbol used to contain alignment \"\n                        \"to empty symbol\")\n    parser.add_argument(\"--oov-word\", type=str, default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"Symbol of OOV word in hypothesis\")\n    parser.add_argument(\"--symbol-table\", type=argparse.FileType('r'),\n                        help=\"\"\"Symbol table for words in vocabulary. Used\n                        to determine if a word is an OOV or not\"\"\")\n\n    parser.add_argument(\"--correct-score\", type=int, default=1,\n                        help=\"Score for correct matches\")\n    parser.add_argument(\"--substitution-penalty\", type=int, default=1,\n                        help=\"Penalty for substitution errors\")\n    parser.add_argument(\"--deletion-penalty\", type=int, default=1,\n                        help=\"Penalty for deletion errors\")\n    parser.add_argument(\"--insertion-penalty\", type=int, default=1,\n                        help=\"Penalty for insertion errors\")\n\n    parser.add_argument(\"--align-full-hyp\", type=str,\n                        action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"], default=True,\n                        help=\"\"\"Align full hypothesis i.e. traceback from\n                        the end to get the alignment. 
This is different\n                        from the normal Smith-Waterman alignment, where the\n                        traceback will be from the maximum score.\"\"\")\n\n    parser.add_argument(\"--debug-only\", type=str, default=\"false\",\n                        choices=[\"true\", \"false\"],\n                        help=\"Run test functions only\")\n    parser.add_argument(\"--verbose\", type=int, default=0,\n                        choices=[0, 1, 2, 3],\n                        help=\"Use larger value for more verbose logging.\")\n\n    parser.add_argument(\"--ref\", dest='ref_in_file',\n                        type=argparse.FileType('r'), required=True,\n                        help=\"Reference text file\")\n    parser.add_argument(\"--hyp\", dest='hyp_in_file', required=True,\n                        type=argparse.FileType('r'),\n                        help=\"Hypothesis text or CTM file\")\n    parser.add_argument(\"--output\", dest='alignment_out_file', required=True,\n                        type=argparse.FileType('w'),\n                        help=\"\"\"File to write output alignment.\n                        If hyp-format=CTM, then the output is in the form of\n                        CTM, but with two additional columns of Edit-type and\n                        Reference-word matched to the hypothesis.\"\"\")\n\n    args = parser.parse_args()\n\n    args.debug_only = bool(args.debug_only == \"true\")\n\n    global verbose_level\n    verbose_level = args.verbose\n    if args.verbose > 2:\n        handler.setLevel(logging.DEBUG)\n    else:\n        handler.setLevel(logging.INFO)\n    logger.addHandler(handler)\n\n    return args\n\n\ndef read_text(text_file):\n    \"\"\"Reads a kaldi-format text file and yields elements of a dictionary\n        { utterance_id : transcript (as a list of words) }\n\n    The first column of the text file is the utterance-id, which will be\n    used as the key to index the dictionary elements.\n    The remaining columns of 
the file are the text of the transcript, and they are\n    returned as a list of words.\n    \"\"\"\n    for line in text_file:\n        parts = line.strip().split()\n        if len(parts) < 1:\n            raise RuntimeError(\n                \"Did not get enough columns; line {0} in {1}\"\n                \"\".format(line, text_file.name))\n        elif len(parts) == 1:\n            logger.warning(\"Empty transcript for utterance %s in %s\",\n                           parts[0], text_file.name)\n            yield parts[0], []\n        else:\n            yield parts[0], parts[1:]\n    text_file.close()\n\n\ndef read_ctm(ctm_file, file_and_channel2reco=None):\n    \"\"\"Reads a CTM file and yields elements of a dictionary\n        { utterance-id : CTM for the utterance },\n    where CTM for the utterance is stored as a list of lines\n    from a CTM corresponding to the utterance.\n\n    Note: *_reco in the variables usually correspond to utterances rather\n    than recordings.\n    \"\"\"\n    prev_reco = \"\"\n    ctm_lines = []\n    for line in ctm_file:\n        try:\n            parts = line.strip().split()\n            parts[2] = float(parts[2])\n            parts[3] = float(parts[3])\n\n            if len(parts) == 5:\n                parts.append(1.0)   # confidence defaults to 1.0.\n\n            if len(parts) != 6:\n                raise ValueError(\"CTM line must have 5 or 6 fields.\")\n\n            if file_and_channel2reco is None:\n                reco = parts[0]\n                if parts[1] != '1':\n                    raise ValueError(\"Channel should be 1, \"\n                                     \"got {0}\".format(parts[1]))\n            else:\n                reco = file_and_channel2reco[(parts[0], parts[1])]\n            if prev_reco != \"\" and reco != prev_reco:\n                # New recording\n                yield prev_reco, ctm_lines\n                ctm_lines = []\n            ctm_lines.append(parts[2:])\n            prev_reco = reco\n        except 
Exception:\n            logger.error(\"Error in processing CTM line {0}\".format(line))\n            raise\n    if prev_reco != \"\" and len(ctm_lines) > 0:\n        yield prev_reco, ctm_lines\n    ctm_file.close()\n\n\ndef smith_waterman_alignment(ref, hyp, similarity_score_function,\n                             del_score, ins_score,\n                             eps_symbol=\"<eps>\", align_full_hyp=True):\n    \"\"\"Does Smith-Waterman alignment of reference sequence and hypothesis\n    sequence.\n    This is a special case of the Smith-Waterman alignment that assumes that\n    the deletion and insertion costs are linear with number of incorrect words.\n\n    If align_full_hyp is True, then the traceback of the alignment\n    is started at the end of the hypothesis. This is when we want the\n    reference that aligns with the full hypothesis.\n    This differs from the normal Smith-Waterman alignment, where the traceback\n    is from the highest score in the alignment score matrix. This\n    can be obtained by setting align_full_hyp as False. This gets only the\n    sub-sequence of the hypothesis that best matches with a\n    sub-sequence of the reference.\n\n    Returns a list of tuples where each tuple has the format:\n        (ref_word, hyp_word, ref_word_from_index, hyp_word_from_index,\n         ref_word_to_index, hyp_word_to_index)\n    \"\"\"\n    output = []\n\n    ref_len = len(ref)\n    hyp_len = len(hyp)\n\n    bp = [[] for x in range(ref_len+1)]\n\n    # Score matrix of size (ref_len + 1) x (hyp_len + 1)\n    # The index m, n in this matrix corresponds to the score\n    # of the best matching sub-sequence pair between reference and hypothesis\n    # ending with the reference word ref[m-1] and hypothesis word hyp[n-1].\n    # If align_full_hyp is True, then the hypothesis sub-sequence is from\n    # the 0th word i.e. 
hyp[0].\n    H = [[] for x in range(ref_len+1)]\n\n    for ref_index in range(ref_len+1):\n        if align_full_hyp:\n            H[ref_index] = [-(hyp_len+2) for x in range(hyp_len+1)]\n            H[ref_index][0] = 0\n        else:\n            H[ref_index] = [0 for x in range(hyp_len+1)]\n        bp[ref_index] = [(0, 0) for x in range(hyp_len+1)]\n\n        if align_full_hyp and ref_index == 0:\n            for hyp_index in range(1, hyp_len+1):\n                H[0][hyp_index] = H[0][hyp_index-1] + ins_score\n                bp[ref_index][hyp_index] = (ref_index, hyp_index-1)\n                logger.debug(\n                    \"({0},{1}) -> ({2},{3}): {4}\"\n                    \"\".format(ref_index, hyp_index-1, ref_index, hyp_index,\n                              H[ref_index][hyp_index]))\n\n    max_score = -float(\"inf\")\n    max_score_element = (0, 0)\n\n    for ref_index in range(1, ref_len+1):     # Reference\n        for hyp_index in range(1, hyp_len+1):     # Hypothesis\n            sub_or_ok = (H[ref_index-1][hyp_index-1]\n                         + similarity_score_function(ref[ref_index-1],\n                                                     hyp[hyp_index-1]))\n\n            if ((not align_full_hyp and sub_or_ok > 0)\n                    or (align_full_hyp\n                        and sub_or_ok >= H[ref_index][hyp_index])):\n                H[ref_index][hyp_index] = sub_or_ok\n                bp[ref_index][hyp_index] = (ref_index-1, hyp_index-1)\n                logger.debug(\n                    \"({0},{1}) -> ({2},{3}): {4} ({5},{6})\"\n                    \"\".format(ref_index-1, hyp_index-1, ref_index, hyp_index,\n                              H[ref_index][hyp_index],\n                              ref[ref_index-1], hyp[hyp_index-1]))\n\n            if H[ref_index-1][hyp_index] + del_score > H[ref_index][hyp_index]:\n                H[ref_index][hyp_index] = H[ref_index-1][hyp_index] + del_score\n                bp[ref_index][hyp_index] = 
(ref_index-1, hyp_index)\n                logger.debug(\n                    \"({0},{1}) -> ({2},{3}): {4}\"\n                    \"\".format(ref_index-1, hyp_index, ref_index, hyp_index,\n                              H[ref_index][hyp_index]))\n\n            if H[ref_index][hyp_index-1] + ins_score > H[ref_index][hyp_index]:\n                H[ref_index][hyp_index] = H[ref_index][hyp_index-1] + ins_score\n                bp[ref_index][hyp_index] = (ref_index, hyp_index-1)\n                logger.debug(\n                    \"({0},{1}) -> ({2},{3}): {4}\"\n                    \"\".format(ref_index, hyp_index-1, ref_index, hyp_index,\n                              H[ref_index][hyp_index]))\n\n            #if hyp_index == hyp_len and H[ref_index][hyp_index] >= max_score:\n            if ((not align_full_hyp or hyp_index == hyp_len)\n                    and H[ref_index][hyp_index] >= max_score):\n                max_score = H[ref_index][hyp_index]\n                max_score_element = (ref_index, hyp_index)\n\n    ref_index, hyp_index = max_score_element\n    score = max_score\n    logger.debug(\"Alignment score: %s for (%d, %d)\",\n                 score, ref_index, hyp_index)\n\n    while ((not align_full_hyp and score >= 0)\n           or (align_full_hyp and hyp_index > 0)):\n        try:\n            prev_ref_index, prev_hyp_index = bp[ref_index][hyp_index]\n            if ((prev_ref_index, prev_hyp_index) == (ref_index, hyp_index)\n                    or (prev_ref_index, prev_hyp_index) == (0, 0)):\n                score = H[ref_index][hyp_index]\n                if score != 0:\n                    ref_word = ref[ref_index-1] if ref_index > 0 else eps_symbol\n                    hyp_word = hyp[hyp_index-1] if hyp_index > 0 else eps_symbol\n                    output.append((ref_word, hyp_word, prev_ref_index,\n                        prev_hyp_index, ref_index, hyp_index))\n\n                    ref_index, hyp_index = (prev_ref_index, prev_hyp_index)\n              
      score = H[ref_index][hyp_index]\n                break\n\n            if (ref_index == prev_ref_index + 1\n                    and hyp_index == prev_hyp_index + 1):\n                # Substitution or correct\n                output.append(\n                    (ref[ref_index-1] if ref_index > 0 else eps_symbol,\n                     hyp[hyp_index-1] if hyp_index > 0 else eps_symbol,\n                     prev_ref_index, prev_hyp_index, ref_index, hyp_index))\n            elif (prev_hyp_index == hyp_index):\n                # Deletion\n                assert prev_ref_index == ref_index - 1\n                output.append(\n                    (ref[ref_index-1] if ref_index > 0 else eps_symbol,\n                     eps_symbol,\n                     prev_ref_index, prev_hyp_index, ref_index, hyp_index))\n            elif (prev_ref_index == ref_index):\n                # Insertion\n                assert prev_hyp_index == hyp_index - 1\n                output.append(\n                    (eps_symbol,\n                     hyp[hyp_index-1] if hyp_index > 0 else eps_symbol,\n                     prev_ref_index, prev_hyp_index, ref_index, hyp_index))\n            else:\n                raise RuntimeError\n\n\n            ref_index, hyp_index = (prev_ref_index, prev_hyp_index)\n            score = H[ref_index][hyp_index]\n        except Exception:\n            logger.error(\"Unexpected entry (%d,%d) -> (%d,%d), %s, %s\",\n                         prev_ref_index, prev_hyp_index, ref_index, hyp_index,\n                         ref[prev_ref_index], hyp[prev_hyp_index])\n            raise RuntimeError(\"Unexpected result: Bug in code!!\")\n\n    assert (align_full_hyp or score == 0)\n\n    output.reverse()\n\n    if verbose_level > 2:\n        for ref_index in range(ref_len+1):\n            for hyp_index in range(hyp_len+1):\n                print (\"{0} \".format(H[ref_index][hyp_index]), end='',\n                       file=sys.stderr)\n            print (\"\", 
file=sys.stderr)\n\n    logger.debug(\"Aligned output:\")\n    logger.debug(\"  -  \".join([\"({0},{1})\".format(x[4], x[5])\n                               for x in output]))\n    logger.debug(\"REF: \")\n    logger.debug(\"    \".join(str(x[0]) for x in output))\n    logger.debug(\"HYP:\")\n    logger.debug(\"    \".join(str(x[1]) for x in output))\n\n    return (output, max_score)\n\n\ndef print_alignment(recording, alignment, out_file_handle):\n    out_text = [recording]\n    for line in alignment:\n        try:\n            out_text.append(line[1])\n        except Exception:\n            logger.error(\"Something wrong with alignment. \"\n                         \"Invalid line {0}\".format(line))\n            raise\n    print (\" \".join(out_text), file=out_file_handle)\n\n\ndef get_edit_type(hyp_word, ref_word, duration=-1, eps_symbol='<eps>',\n                  oov_word=None, symbol_table=None):\n    if hyp_word == ref_word and hyp_word != eps_symbol:\n        return 'cor'\n    if hyp_word != eps_symbol and ref_word == eps_symbol:\n        return 'ins'\n    if hyp_word == eps_symbol and ref_word != eps_symbol and duration == 0.0:\n        return 'del'\n    if (hyp_word == oov_word and symbol_table is not None\n            and len(symbol_table) > 0 and ref_word not in symbol_table):\n        return 'cor'    # this special case is treated as correct\n    if hyp_word == eps_symbol and ref_word == eps_symbol and duration > 0.0:\n        # silence in hypothesis; we don't match this up with any reference\n        # word.\n        return 'sil'\n    # The following assertion is because, based on how get_ctm_edits()\n    # works, we shouldn't hit this case.\n    assert hyp_word != eps_symbol and ref_word != eps_symbol\n    return 'sub'\n\n\ndef get_ctm_edits(alignment_output, ctm_array, eps_symbol=\"<eps>\",\n                  oov_word=None, symbol_table=None):\n    \"\"\"\n    This function takes two lists\n        alignment_output = The output of 
smith_waterman_alignment() which is a\n            list of tuples (ref_word, hyp_word, ref_word_from_index,\n            hyp_word_from_index, ref_word_to_index, hyp_word_to_index)\n        ctm_array = [ [ start1, duration1, hyp_word1, confidence1 ], ... ]\n    and pads them with new list elements so that the entries 'match up'.\n\n    Returns CTM edit lines, which are CTM lines appended with reference word\n    and edit type.\n\n    What we are aiming for is that for each i, ctm_array[i][2] ==\n    alignment_output[i][1].  The reasons why this is not automatically true\n    are:\n\n     (1) There may be insertions in the hypothesis sequence that are not\n         aligned with any reference words at the beginning of the\n         alignment_output.\n     (2) There may be deletions at the end of the alignment_output that\n         do not correspond to any additional hypothesis CTM lines.\n\n    We introduce suitable entries into alignment_output and ctm_array as\n    necessary to make them 'match up'.\n    \"\"\"\n    ctm_edits = []\n    ali_len = len(alignment_output)\n    ctm_len = len(ctm_array)\n    ali_pos = 0\n    ctm_pos = 0\n\n    # current_time is the end of the last ctm segment we processed.\n    current_time = ctm_array[0][0] if ctm_len > 0 else 0.0\n\n    for (ref_word, hyp_word, ref_prev_i, hyp_prev_i,\n         ref_i, hyp_i) in alignment_output:\n        try:\n            ctm_pos = hyp_prev_i\n            # This is true because we cannot have errors at the end because\n            # that will decrease the Smith-Waterman alignment score.\n            assert ctm_pos < ctm_len\n            assert len(ctm_array[ctm_pos]) == 4\n\n            if hyp_prev_i == hyp_i:\n                assert hyp_word == eps_symbol\n                # These are deletions as there are no CTM entries\n                # corresponding to these alignments.\n                edit_type = get_edit_type(\n                    hyp_word=eps_symbol, ref_word=ref_word,\n                    
duration=0.0, eps_symbol=eps_symbol,\n                    oov_word=oov_word, symbol_table=symbol_table)\n                ctm_line = [current_time, 0.0, eps_symbol, 1.0,\n                            ref_word, edit_type]\n                ctm_edits.append(ctm_line)\n            else:\n                assert hyp_i == hyp_prev_i + 1\n                assert hyp_word == ctm_array[ctm_pos][2]\n                # This is the normal case, where there are 2 entries where\n                # they hyp-words match up.\n                ctm_line = list(ctm_array[ctm_pos])\n                if hyp_word == eps_symbol and ref_word != eps_symbol:\n                    # This is a silence in hypothesis aligned with a reference\n                    # word. We split this into two ctm edit lines where the\n                    # first one is a deletion of duration 0 and the second\n                    # one is a silence of duration given by the ctm line.\n                    edit_type = get_edit_type(\n                        hyp_word=eps_symbol, ref_word=ref_word,\n                        duration=0.0, eps_symbol=eps_symbol,\n                        oov_word=oov_word, symbol_table=symbol_table)\n                    assert edit_type == 'del'\n                    ctm_edits.append([current_time, 0.0, eps_symbol, 1.0,\n                                      ref_word, edit_type])\n\n                    edit_type = get_edit_type(\n                        hyp_word=eps_symbol, ref_word=eps_symbol,\n                        duration=ctm_line[1], eps_symbol=eps_symbol,\n                        oov_word=oov_word, symbol_table=symbol_table)\n                    assert edit_type == 'sil'\n                    ctm_line.extend([eps_symbol, edit_type])\n                    ctm_edits.append(ctm_line)\n                else:\n                    edit_type = get_edit_type(\n                        hyp_word=hyp_word, ref_word=ref_word,\n                        duration=ctm_line[1], eps_symbol=eps_symbol,\n           
             oov_word=oov_word, symbol_table=symbol_table)\n                    ctm_line.extend([ref_word, edit_type])\n                    ctm_edits.append(ctm_line)\n                current_time = (ctm_array[ctm_pos][0]\n                                + ctm_array[ctm_pos][1])\n        except Exception:\n            logger.error(\"Could not get ctm edits for \"\n                         \"edits@{edits_pos} = {0}, ctm@{ctm_pos} = {1}\".format(\n                            (\"NONE\" if ali_pos >= ali_len\n                             else alignment_output[ali_pos]),\n                            (\"NONE\" if ctm_pos >= ctm_len\n                             else ctm_array[ctm_pos]),\n                            edits_pos=ali_pos, ctm_pos=ctm_pos))\n            logger.error(\"alignment = {0}\".format(alignment_output))\n            raise\n    return ctm_edits\n\n\ndef ctm_line_to_string(ctm_line):\n    if len(ctm_line) != 8:\n        raise RuntimeError(\"len(ctm_line) expected to be {0}. \"\n                           \"Invalid line {1}\".format(8, ctm_line))\n\n    return \" \".join([str(x) for x in ctm_line])\n\n\ndef test_alignment(align_full_hyp):\n    hyp = \"GCCAT\"\n    ref = \"AGCACACA\"\n\n    verbose = 3\n    logger.info(\"REF: %s\", ref)\n    logger.info(\"HYP: %s\", hyp)\n\n    output, score = smith_waterman_alignment(\n        ref, hyp, similarity_score_function=lambda x, y: 2 if (x == y) else -1,\n        del_score=-1, ins_score=-1, eps_symbol=\"-\", align_full_hyp=align_full_hyp)\n\n    print_alignment(\"Alignment\", output, out_file_handle=sys.stderr)\n\n\ndef run(args):\n    if args.debug_only:\n        test_alignment(args.align_full_hyp)\n        raise SystemExit(\"Exiting since --debug-only was true\")\n\n    def similarity_score_function(x, y):\n        if x == y:\n            return args.correct_score\n        return -args.substitution_penalty\n\n    del_score = -args.deletion_penalty\n    ins_score = -args.insertion_penalty\n\n    
reco2file_and_channel = {}\n    file_and_channel2reco = {}\n\n    if args.reco2file_and_channel is not None:\n        for line in args.reco2file_and_channel:\n            parts = line.strip().split()\n\n            reco2file_and_channel[parts[0]] = (parts[1], parts[2])\n            file_and_channel2reco[(parts[1], parts[2])] = parts[0]\n        args.reco2file_and_channel.close()\n    else:\n        file_and_channel2reco = None\n\n    symbol_table = {}\n    if args.symbol_table is not None:\n        for line in args.symbol_table:\n            parts = line.strip().split()\n            symbol_table[parts[0]] = int(parts[1])\n        args.symbol_table.close()\n\n    if args.hyp_format == \"Text\":\n        hyp_lines = {key: value\n                     for (key, value) in read_text(args.hyp_in_file)}\n    else:\n        hyp_lines = {key: value\n                     for (key, value) in read_ctm(args.hyp_in_file,\n                                                  file_and_channel2reco)}\n\n    num_err = 0\n    num_done = 0\n    for reco, ref_text in read_text(args.ref_in_file):\n        try:\n            if reco not in hyp_lines:\n                # Count and skip recordings that have no hypothesis.\n                num_err += 1\n                logger.warning(\"Could not find recording %s \"\n                               \"in hypothesis %s\",\n                               reco, args.hyp_in_file.name)\n                continue\n\n            if args.hyp_format == \"CTM\":\n                hyp_array = [x[2] for x in hyp_lines[reco]]\n            else:\n                hyp_array = hyp_lines[reco]\n\n            if args.reco2file_and_channel is None:\n                reco2file_and_channel[reco] = (reco, \"1\")\n\n            logger.debug(\"Running Smith-Waterman alignment for %s\", reco)\n\n            output, score = smith_waterman_alignment(\n                ref_text, hyp_array, eps_symbol=args.eps_symbol,\n                similarity_score_function=similarity_score_function,\n                del_score=del_score, 
ins_score=ins_score,\n                align_full_hyp=args.align_full_hyp)\n\n            if args.hyp_format == \"CTM\":\n                ctm_edits = get_ctm_edits(output, hyp_lines[reco],\n                                          eps_symbol=args.eps_symbol,\n                                          oov_word=args.oov_word,\n                                          symbol_table=symbol_table)\n                for line in ctm_edits:\n                    ctm_line = list(reco2file_and_channel[reco])\n                    ctm_line.extend(line)\n                    print(ctm_line_to_string(ctm_line),\n                          file=args.alignment_out_file)\n            else:\n                print_alignment(\n                    reco, output, out_file_handle=args.alignment_out_file)\n            num_done += 1\n        except:\n            logger.error(\"Alignment failed for recording {0} \"\n                         \"with ref = {1} and hyp = {2}\".format(\n                             reco, \" \".join(ref_text),\n                             \" \".join(hyp_array)))\n            raise\n\n    logger.info(\"Processed %d recordings; failed with %d\", num_done, num_err)\n\n    if num_done == 0:\n        raise RuntimeError(\"Processed 0 recordings.\")\n\n\ndef main():\n    args = get_args()\n\n    try:\n        run(args)\n    except Exception:\n        logger.error(\"Failed to align ref and hypotheses; \"\n                     \"got exception \", exc_info=True)\n        raise SystemExit(1)\n    finally:\n        if args.reco2file_and_channel is not None:\n            args.reco2file_and_channel.close()\n        args.ref_in_file.close()\n        args.hyp_in_file.close()\n        args.alignment_out_file.close()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/compute_tf_idf.py",
    "content": "#! /usr/bin/env python\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport sys\n\nimport tf_idf\nsys.path.insert(0, 'steps')\n\nlogger = logging.getLogger('tf_idf')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(filename)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef _get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script takes in a set of documents and computes the\n        TF-IDF for each n-gram up to the specified order.  The script can also\n        load IDF stats from a different file instead of computing them from the\n        input set of documents.\"\"\")\n\n    parser.add_argument(\"--tf-weighting-scheme\", type=str, default=\"raw\",\n                        choices=[\"binary\", \"raw\", \"log\", \"normalized\"],\n                        help=\"\"\"The function applied on the raw\n                        term-frequencies f(t,d) when computing tf(t,d).\n                        TF weighting schemes:-\n                        binary : tf(t,d) = 1 if t in d else 0\n                        raw    : tf(t,d) = f(t,d)\n                        log    : tf(t,d) = 1 + log(f(t,d))\n                        normalized : tf(t,d) = K + (1-K) * \"\"\"\n                        \"\"\"f(t,d) / max{f(t',d): t' in d}\"\"\")\n    parser.add_argument(\"--tf-normalization-factor\", type=float, default=0.5,\n                        help=\"K value for normalized TF weighting scheme\")\n    parser.add_argument(\"--idf-weighting-scheme\", type=str, default=\"log\",\n                        choices=[\"unary\", \"log\", \"log-smoothed\",\n                                 \"probabilistic\"],\n                        help=\"\"\"The function applied on the raw\n                        
document frequencies n(t) = |{d in D : t in d}|\n                        when computing idf(t,D).\n                        IDF weighting schemes:-\n                        unary  : idf(t,D) = 1\n                        log    : idf(t,D) = log(N / (1 + n(t)))\n                        log-smoothed : idf(t,D) = log(1 + N / n(t))\n                        probabilistic: idf(t,D) = log((N - n(t)) / n(t))\"\"\")\n    parser.add_argument(\"--ngram-order\", type=int, default=2,\n                        help=\"Accumulate for terms up to this n-gram order\")\n\n    parser.add_argument(\"--input-idf-stats\", type=argparse.FileType('r'),\n                        help=\"If provided, IDF stats are loaded from this \"\n                        \"file\")\n    parser.add_argument(\"--output-idf-stats\", type=argparse.FileType('w'),\n                        help=\"If provided, IDF stats are written to this \"\n                        \"file\")\n    parser.add_argument(\"--accumulate-over-docs\", type=str, default=\"true\",\n                        choices=[\"true\", \"false\"],\n                        help=\"If true, the stats are accumulated over all the \"\n                        \"documents and a single tf-idf-file is written out.\")\n    parser.add_argument(\"docs\", type=argparse.FileType('r'),\n                        help=\"Input documents in kaldi text format i.e. 
\"\n                        \"<document-id> <text>\")\n    parser.add_argument(\"tf_idf_file\", type=argparse.FileType('w'),\n                        help=\"Output tf-idf for each (t,d) pair in the \"\n                        \"input documents written in the format \"\n                        \"<terms> <document-id> <tf-idf>\")\n\n    args = parser.parse_args()\n\n    if args.tf_normalization_factor >= 1.0 or args.tf_normalization_factor < 0:\n        raise ValueError(\"--tf-normalization-factor must be in [0,1)\")\n\n    args.accumulate_over_docs = bool(args.accumulate_over_docs == \"true\")\n\n    if not args.accumulate_over_docs and args.input_idf_stats is None:\n        raise TypeError(\n            \"If --accumulate-over-docs=false is provided, \"\n            \"then --input-idf-stats must be provided.\")\n\n    return args\n\n\ndef _run(args):\n    tf_stats = tf_idf.TFStats()\n    idf_stats = tf_idf.IDFStats()\n\n    if args.input_idf_stats is not None:\n        idf_stats.read(args.input_idf_stats)\n\n    num_done = 0\n    for line in args.docs:\n        parts = line.strip().split()\n        doc = parts[0]\n        tf_stats.accumulate(doc, parts[1:], args.ngram_order)\n\n        if not args.accumulate_over_docs:\n            # Write the document-id and the corresponding tf-idf values.\n            print (doc, file=args.tf_idf_file, end=' ')\n            tf_idf.write_tfidf_from_stats(\n                tf_stats, idf_stats, args.tf_idf_file,\n                tf_weighting_scheme=args.tf_weighting_scheme,\n                idf_weighting_scheme=args.idf_weighting_scheme,\n                tf_normalization_factor=args.tf_normalization_factor,\n                expected_document_id=doc)\n            tf_stats = tf_idf.TFStats()\n        num_done += 1\n\n    if args.accumulate_over_docs:\n        tf_stats.compute_term_stats(idf_stats=idf_stats\n                                              if args.input_idf_stats is None\n                                              else 
None)\n\n        if args.output_idf_stats is not None:\n            idf_stats.write(args.output_idf_stats)\n            args.output_idf_stats.close()\n\n        tf_idf.write_tfidf_from_stats(\n            tf_stats, idf_stats, args.tf_idf_file,\n            tf_weighting_scheme=args.tf_weighting_scheme,\n            idf_weighting_scheme=args.idf_weighting_scheme,\n            tf_normalization_factor=args.tf_normalization_factor)\n\n    if num_done == 0:\n        raise RuntimeError(\"Could not compute TF-IDF for any query documents\")\n\ndef main():\n    args = _get_args()\n\n    try:\n        _run(args)\n    finally:\n        if args.input_idf_stats is not None:\n            args.input_idf_stats.close()\n        if args.output_idf_stats is not None:\n            args.output_idf_stats.close()\n        args.docs.close()\n        args.tf_idf_file.close()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/ctm_to_text.pl",
    "content": "#! /usr/bin/perl\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0.\n\nuse strict;\nuse warnings;\n\nif (scalar @ARGV != 1 && scalar @ARGV != 3) {\n  my $usage = <<END;\nThis script converts a CTM into kaldi text format by concatenating the words\nbelonging to the same utterance (or recording) and outputs the same to the\nstandard output.\nIf --non-scored-words list file is provided with one word per line, then \nthose words are not added to the text.\n\nThe CTM format is <file> <channel> <start-time> <duration> <word> [<conf>].\nThis script assumes the CTM to be in NIST sorted order given by UNIX\nsort command \"sort +0 -1 +1 -2 +2nb -3\"\n\nUsage: ctm_to_text.pl [--non-scored-words <file>] <ctm-file> > <text>\nEND\n  die $usage;\n}\n\nmy $non_scored_words_list = \"\";\nif (scalar @ARGV > 1) {\n  if ($ARGV[0] eq \"--non-scored-words\") {\n    shift @ARGV;\n    $non_scored_words_list = shift @ARGV;\n  } else {\n    die \"Unknown option $ARGV[0]\\n\";\n  }\n}\n\nmy %non_scored_words;\n$non_scored_words{\"<eps>\"} = 1;\n\nif ($non_scored_words_list ne \"\") {\n  open NONSCORED, $non_scored_words_list or die \"Failed to open $non_scored_words_list\";\n  \n  while (<NONSCORED>) {\n    chomp;\n    my @F = split;\n    $non_scored_words{$F[0]} = 1;\n  }\n\n  close NONSCORED;\n}\n\nmy $ctm_file = shift @ARGV;\nopen CTM, $ctm_file or die \"Failed to open $ctm_file\";\n\nmy $prev_utt = \"\";\nmy @text;\n\nwhile (<CTM>) {\n  chomp;\n  my @F = split;\n\n  my $utt = $F[0];\n  if ($utt ne $prev_utt && $prev_utt ne \"\") {\n    if (scalar @text > 0) {\n      print $prev_utt . \" \" . join(\" \", @text) . \"\\n\";\n    }\n    @text = ();\n  }\n  \n  if (scalar @F < 5 || scalar @F > 6) {\n    die \"Invalid line $_ in CTM $ctm_file\\n\";\n  }\n\n  if (!defined $non_scored_words{$F[4]}) {\n    push @text, $F[4];\n  }\n\n  $prev_utt = $utt;\n}\n\nclose CTM;\n    \nif (scalar @text > 0) {\n  print $prev_utt . \" \" . join(\" \", @text) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/steps/cleanup/internal/get_ctm_edits.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016   Vimal Manohar\n#           2016   Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys, operator, argparse\n\n# Modify the CTM to include for each token the information from Levenshtein\n# alignment of 'hypothesis' and 'reference'\n# (i.e. the output of 'align-text'.\n\n# The information added to each token in the CTM is the reference word and one\n# of the following edit-types:\n#  'cor' = correct  [note: as a special case we count as correct cases where\n#                    the hypothesis word is the OOV symbol and the reference\n#                    word is OOV w.r.t. the supplied vocabulary.]\n#  'sub' = substitution\n#  'del' = deletion\n#  'ins' = insertion\n#  'sil' = (silence in ctm; does not consume a reference word)\n# note: the script modify_ctm_edits.py will add the new\n# note: the following extra edit-type may be added by modify_ctm_edits.py:\n#  'fix'  ... this is like 'cor', but it means the reference has been modified\n#             to fix non-scoreable errors [typically errors that don't change the\n#             meaning], so we don't trust the word or value it as much as a 'cor'.\n#\n\n# Note: Additional lines are added to the CTM to account for deletions.\n\n# Input CTM:\n# (note: the <eps> is for silence in the input CTM that comes from\n# optional-silence in the graph.  
However, the input edits don't have anything\n# for these silences.\n# We assume (and check) that the channel will always be '1', because the\n# input CTMs are expected to be 'per utterance', not including real\n# recording-ids.\n\n# Input ctm format:\n# <file-id> <channel> <start-time> <duration> <hyp-word> [<confidence>]\n# note, the confidence defaults to 1 if not provided (these\n# scripts don't actually use the confidence field).\n\n## TimBrown_2008P-0007226-0007620 1 0.000 0.100 when\n## TimBrown_2008P-0007226-0007620 1 0.100 0.090 i\n## TimBrown_2008P-0007226-0007620 1 0.190 0.300 some\n## TimBrown_2008P-0007226-0007620 1 0.490 0.110 when\n## TimBrown_2008P-0007226-0007620 1 0.600 0.060 i\n## TimBrown_2008P-0007226-0007620 1 0.660 0.190 say\n## TimBrown_2008P-0007226-0007620 1 0.850 0.450 go\n## TimBrown_2008P-0007226-0007620 1 1.300 0.310 [COUGH]\n## TimBrown_2008P-0007226-0007620 1 1.610 0.130 you\n## TimBrown_2008P-0007226-0007620 1 1.740 0.180 got\n## TimBrown_2008P-0007226-0007620 1 1.920 0.370 thirty\n## TimBrown_2008P-0007226-0007620 1 2.290 0.830 seconds\n## TimBrown_2008P-0007226-0007620 1 3.120 0.330 <eps>\n## TimBrown_2008P-0007226-0007620 1 3.450 0.040 [BREATH]\n## TimBrown_2008P-0007226-0007620 1 3.490 0.110 to\n## TimBrown_2008P-0007226-0007620 1 3.600 0.320 [NOISE]\n\n# Input Levenshtein edits : (the output of 'align-text' post-processed by 'wer_per_utt_details.pl')\n\n# AJJacobs_2007P-0001605-0003029 i i ; thought thought ; i'd i'd ; tell tell ; you you ; a a ; little little ; about about ; [UH] [UH] ; what what ; i i ; like like ; to to ; write write ; and and ; [UH] [UH] ; i i ; like like ; to to ; [UH] [UH] ; immerse immerse ; myself myself ; [SMACK] [SMACK] ; in in ; my my ; topics topics ; [UM] [UM] ; i i ; just just ; like like ; to to ; [UH] [UH] ; dive dive ; [SMACK] [SMACK] ; right right ; in in ; and and ; become become ; [UH] [UH] ; sort sort ; of of ; a a ; human human ; guinea guinea ; pig pig ; [BREATH] [BREATH] ; and and ; [UH] 
[UH]\n# AJJacobs_2007P-0003133-0004110 i i ; see see ; my my ; life life ; as as ; a a ; series series ; of of ; experiments experiments ; [BREATH] [BREATH] ; so so ; [UH] [UH] ; i i ; [NOISE] [NOISE] ; work work ; for for ; esquire esquire ; magazine magazine ; <eps> and ; a a ; couple couple ; of of ; years years ; ago ago ; [BREATH] [BREATH] ; i i ; wrote wrote ; an an ; article article ; called called ; [NOISE] [NOISE] ; my my ; outsourced outsourced ; life life\n\n\n# Output format:\n# <file-id> <channel> <start-time> <duration> <hyp-word> <confidence> <ref-word> <edit-type>\n\n# AJJacobs_2007P-0001605-0003029 1 0 0.09 <eps> 1.0 <eps> sil\n# AJJacobs_2007P-0001605-0003029 1 0.09 0.15 i 1.0 i cor\n# AJJacobs_2007P-0001605-0003029 1 0.24 0.25 thought 1.0 thought cor\n# AJJacobs_2007P-0001605-0003029 1 0.49 0.14 i'd 1.0 i'd cor\n# AJJacobs_2007P-0001605-0003029 1 0.63 0.22 tell 1.0 tell cor\n# AJJacobs_2007P-0001605-0003029 1 0.85 0.11 you 1.0 you cor\n# AJJacobs_2007P-0001605-0003029 1 0.96 0.05 a 1.0 a cor\n# AJJacobs_2007P-0001605-0003029 1 1.01 0.24 little 1.0 little cor\n# AJJacobs_2007P-0001605-0003029 1 1.25 0.5 about 1.0 about cor\n# AJJacobs_2007P-0001605-0003029 1 1.75 0.48 [UH] 1.0 [UH] cor\n# AJJacobs_2007P-0001605-0003029 1 2.23 0.34 <eps> 1.0 <eps> sil\n# AJJacobs_2007P-0001605-0003029 1 2.57 0.21 what 1.0 what cor\n# AJJacobs_2007P-0001605-0003029 1 2.78 0.1 i 1.0 i cor\n# AJJacobs_2007P-0001605-0003029 1 2.88 0.22 like 1.0 like cor\n# AJJacobs_2007P-0001605-0003029 1 3.1 0.13 to 1.0 to cor\n# AJJacobs_2007P-0001605-0003029 1 3.23 0.37 write 1.0 write cor\n# AJJacobs_2007P-0001605-0003029 1 3.6 0.03 <eps> 1.0 <eps> sil\n# AJJacobs_2007P-0001605-0003029 1 3.63 0.36 and 1.0 and cor\n\n\n\nparser = argparse.ArgumentParser(\n    description = \"Append to the CTM the Levenshtein alignment of 'hypothesis' and 'reference'; \"\n    \"creates augmented CTM with extra fields (see script for details)\")\n\nparser.add_argument(\"--oov\", type = int, default = 
-1,\n                    help = \"The integer representation of the OOV symbol; substitutions \"\n                    \"by the OOV symbol for out-of-vocabulary reference words are treated \"\n                    \"as correct, if you also supply the --symbol-table option.\")\nparser.add_argument(\"--symbol-table\", type = str,\n                    help = \"The words.txt your system used; if supplied, it is used to \"\n                    \"determine OOV words (and such words will count as correct if \"\n                    \"substituted by the OOV symbol).  See also the --oov option\")\n# Required arguments\nparser.add_argument(\"edits_in\", metavar = \"<edits-in>\",\n                    help = \"Filename of output of 'align-text', which this program reads. \"\n                    \"Use /dev/stdin for standard input.\")\nparser.add_argument(\"ctm_in\", metavar = \"<ctm-in>\",\n                    help = \"Filename of input hypothesis in ctm format\")\nparser.add_argument(\"ctm_edits_out\", metavar = \"<ctm-edits-out>\",\n                    help = \"Filename of output (CTM appended with word-edit information)\")\nargs = parser.parse_args()\n\n\n\ndef OpenFiles():\n    global ctm_edits_out, edits_in, ctm_in, symbol_table, oov_word\n    try:\n        ctm_edits_out = open(args.ctm_edits_out, 'w', encoding='utf-8')\n    except:\n        sys.exit(\"get_ctm_edits.py: error opening ctm-edits file {0} for output\".format(\n                args.ctm_edits_out))\n    try:\n        edits_in = open(args.edits_in, encoding='utf-8')\n    except:\n        sys.exit(\"get_ctm_edits.py: error opening edits file {0} for input\".format(\n                args.edits_in))\n    try:\n        ctm_in = open(args.ctm_in, encoding='utf-8')\n    except:\n        sys.exit(\"get_ctm_edits.py: error opening ctm file {0} for input\".format(\n                args.ctm_in))\n\n    symbol_table = set()\n    oov_word = None\n    if args.symbol_table != None:\n        if args.oov == -1:\n            
print(\"get_ctm_edits.py: error: if you set the --symbol-table option \"\n                  \"you must also set the --oov option\", file = sys.stderr)\n        try:\n            f = open(args.symbol_table, 'r', encoding='utf-8')\n            for line in f.readlines():\n                [ word, integer ] = line.split()\n                if int(integer) == args.oov:\n                    oov_word = word\n                symbol_table.add(word)\n        except Exception as e:\n            sys.exit(\"get_ctm_edits.py: error opening symbol-table file {0} for \"\n                     \"input (or bad file), exception is: {1}\".format(args.symbol_table, str(e)))\n        f.close()\n        if oov_word == None:\n            sys.exit(\"get_ctm_edits.py: OOV word not found: check the values of \"\n                     \"--symbol-table={0} and --oov={1}\".format(args.symbol_table,\n                                                               args.oov))\n\n# This function takes two lists\n# edits_array = [ [ hyp_word1, ref_word1], [ hyp_word2, ref_word2 ], ... ]\n# ctm_array = [ [ start1, duration1, hyp_word1, confidence1 ], ... ]\n#\n# and pads them with new list elements so that the entries 'match up'.  What we\n# are aiming for is that for each i, ctm_array[i][2] == edits_array[i][0].  The\n# reasons why this is not automatically true are:\n#\n#  (1) There may be deletions in the hypothesis sequence, which would lead to\n#      pairs like [ '<eps>', ref_word ].\n#  (2) The ctm may have been written 'with silence', which will lead to\n#      ctm entries like [ 7.8, 0.9, '<eps>', 1.0 ] where the '<eps>' refers\n#      to the optional-silence from the lexicon.\n#\n# We introduce suitable entries in to edits_array and ctm_array as necessary\n# to make them 'match up'.  
This function returns the pair (new_edits_array,\n# new_ctm_array).\ndef PadArrays(edits_array, ctm_array):\n    new_edits_array = []\n    new_ctm_array = []\n    edits_len = len(edits_array)\n    ctm_len = len(ctm_array)\n    edits_pos = 0\n    ctm_pos = 0\n    # current_time is the end of the last ctm segment we processed.\n    current_time = ctm_array[0][0] if ctm_len > 0 else 0.0\n    while edits_pos < edits_len or ctm_pos < ctm_len:\n        if edits_pos < edits_len and ctm_pos < ctm_len and \\\n                edits_array[edits_pos][0] == ctm_array[ctm_pos][2] and \\\n                edits_array[edits_pos][0] != '<eps>':\n            # This is the normal case, where there are 2 entries where\n            # the hyp-words match up.\n            new_edits_array.append(edits_array[edits_pos])\n            edits_pos += 1\n            new_ctm_array.append(ctm_array[ctm_pos])\n            current_time = ctm_array[ctm_pos][0] + ctm_array[ctm_pos][1]\n            ctm_pos += 1\n        elif edits_pos < edits_len and edits_array[edits_pos][0] == '<eps>':\n            # There was a deletion.  
Pad with an empty ctm segment with '<eps>' as\n            # the word.\n            new_edits_array.append(edits_array[edits_pos])\n            edits_pos += 1\n            duration = 0.0\n            confidence = 1.0\n            new_ctm_array.append([ current_time, duration, '<eps>', confidence])\n        elif ctm_pos < ctm_len and ctm_array[ctm_pos][2] == '<eps>':\n            # There was silence in the ctm, and either we're reached the end of the\n            # edits sequence, or the hyp word was not '<eps>':\n\n            new_edits_array.append(['<eps>', '<eps>'])\n            new_ctm_array.append(ctm_array[ctm_pos])\n            current_time = ctm_array[ctm_pos][0] + ctm_array[ctm_pos][1]\n            ctm_pos += 1\n        else:\n            raise Exception(\"Could not align edits_array = {0} and ctm_array = {1}; \"\n                            \"edits-position = {2}, ctm-position = {3}, \"\n                            \"pending-edit={4}, pending-ctm-entry={5}\".format(\n                    edits_array, ctm_array, edits_pos, ctm_pos,\n                    edits_array[edits_pos] if edits_pos < edits_len else None,\n                    ctm_array[ctm_pos] if ctm_pos < ctm_len else None))\n    assert len(new_edits_array) == len(new_ctm_array)\n    return (new_edits_array, new_ctm_array)\n\n\n# This function returns the appropriate edit-type to output in the ctm-edits\n# file.  
The ref_word and hyp_word and duration are the values we'll print in\n# the ctm-edits file.\ndef GetEditType(hyp_word, ref_word, duration):\n    global oov_word\n    if hyp_word == ref_word and hyp_word != '<eps>':\n        return 'cor'\n    elif hyp_word != '<eps>' and ref_word == '<eps>':\n        return 'ins'\n    elif hyp_word == '<eps>' and ref_word != '<eps>' and duration == 0.0:\n        return 'del'\n    elif hyp_word == oov_word and \\\n         len(symbol_table) != 0 and not ref_word in symbol_table:\n        return 'cor'   # this special case is treated as correct.\n    elif hyp_word == '<eps>' == ref_word and duration > 0.0:\n        # silence in hypothesis; we don't match this up with any reference word.\n        return 'sil'\n    else:\n        # The following assertion is because, based on how PadArrays\n        # works, we shouldn't hit this case.\n        assert hyp_word != '<eps>' and ref_word != '<eps>'\n        return 'sub'\n\n# this prints a number with a certain number of digits after\n# the point, while removing trailing zeros.\ndef FloatToString(f):\n    num_digits = 6 # we want to print 6 digits after the decimal point\n    g = f\n    while abs(g) > 1.0:\n        g *= 0.1\n        num_digits += 1\n    format_str = '%.{0}g'.format(num_digits)\n    return format_str % f\n\n\ndef OutputCtm(utterance_id, edits_array, ctm_array):\n    global ctm_edits_out\n    # note: this function expects the padded entries created by PadArrays.\n    assert len(edits_array) == len(ctm_array)\n    channel = '1'  # this is hardcoded at both input and output, since this CTM\n                   # doesn't really represent recordings, only utterances.\n    for i in range(len(edits_array)):\n        ( hyp_word, ref_word ) = edits_array[i]\n        ( start_time, duration, hyp_word2, confidence ) = ctm_array[i]\n        if not hyp_word == hyp_word2:\n            print(\"Error producing output CTM for edit = {0} and ctm = {1}\".format(\n                    edits_array[i], 
ctm_array[i]), file = sys.stderr)\n            sys.exit(1)\n        assert hyp_word == hyp_word2\n        edit_type = GetEditType(hyp_word, ref_word, duration)\n        print(utterance_id, channel, FloatToString(start_time),\n              FloatToString(duration), hyp_word, confidence, ref_word,\n              edit_type, file = ctm_edits_out)\n\n\ndef ProcessOneUtterance(utterance_id, edits_line, ctm_lines):\n    try:\n        # Remove the utterance-id from the beginning of the edits line\n        edits_fields = edits_line[len(utterance_id) + 1:]\n\n        # e.g. if edits_fields is now 'i i ; see be ; my my ', edits_array will become\n        #  [ ['i', 'i'], ['see', 'be'], ['my', 'my'] ]\n        fields_split = edits_fields.split()\n        first_fields, second_fields = fields_split[0::3], fields_split[1::3]\n        if (\n            len(first_fields) != len(second_fields) or\n            (len(fields_split) >= 3 and set(fields_split[2::3]) != {';'})\n        ):\n            sys.exit(\"get_ctm_edits.py: could not make sense of edits line: \" + edits_line)\n\n        edits_array = list(zip(first_fields, second_fields))\n\n        # ctm_array will now become something like [ ['1', '1.010', '0.240', 'little ' ], ... 
]\n        ctm_array = [ x.split() for x in ctm_lines ]\n        ctm_array = []\n        for line in ctm_lines:\n            try:\n                # Strip off the utterance-id and split the remaining fields\n                # which should be: channel==1, start, dur, word, [confidence]\n                a = line[len(utterance_id) + 1:].split()\n                if len(a) == 4:\n                    a.append(1.0)  # confidence defaults to 1.0.\n                [ channel, start, dur, word, confidence ] = a\n                if channel != '1':\n                    raise Exception(\"Channel should be 1, got: \" + channel)\n                ctm_array.append([ float(start), float(dur), word, float(confidence) ])\n            except Exception as e:\n                sys.exit(\"get_ctm_edits.py: error processing ctm line {0} \"\n                         \"... exception is: {1} {2}\".format(line, type(e), str(e)))\n        # ctm_array will now be something like [ [ 1.010, 0.240, 'little', 1.0 ], ... ]\n\n        # The following call pads the edits and ctm arrays with appropriate\n        # entries so that they have the same length and the elements 'match up'.\n        (edits_array, ctm_array) = PadArrays(edits_array, ctm_array)\n    except Exception as e:\n        sys.exit(\"get_ctm_edits.py: error processing utterance {0}, error was: {1}\".format(\n                utterance_id, str(e)))\n    OutputCtm(utterance_id, edits_array, ctm_array)\n\ndef ProcessData():\n    num_utterances_processed = 0\n\n    pending_ctm_line = ctm_in.readline()\n\n    while True:\n        this_edits_line = edits_in.readline()\n        if this_edits_line == '':\n            if pending_ctm_line != '':\n                sys.exit(\"get_ctm_edits.py: edits_in input {0} ended before \"\n                         \"ctm input was ended.  
We processed {1} \"\n                         \"utterances.\".format(args.edits_in, num_utterances_processed))\n            break\n        a = this_edits_line.split()\n        if len(a) == 0:\n            sys.exit(\"get_ctm_edits.py: edits_input {0} had an empty line\".format(\n                    args.edits_in))\n        utterance_id = a[0]\n        utterance_id_len = len(utterance_id)\n        this_utterance_ctm_lines = []\n        while len(pending_ctm_line.strip()) > 0 and pending_ctm_line.split()[0] == utterance_id:\n            this_utterance_ctm_lines.append(pending_ctm_line)\n            pending_ctm_line = ctm_in.readline()\n        ProcessOneUtterance(utterance_id, this_edits_line,\n                            this_utterance_ctm_lines)\n        num_utterances_processed += 1\n    print(\"get_ctm_edits.py: processed {0} utterances\".format(\n            num_utterances_processed), file=sys.stderr)\n\n\nOpenFiles()\nProcessData()\n\n"
  },
  {
    "path": "egs/steps/cleanup/internal/get_non_scored_words.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016   Vimal Manohar\n#           2016   Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport operator\nimport os\nimport sys\nfrom collections import defaultdict\n\nimport io\nsys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding=\"utf8\")\nsys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding=\"utf8\")\n\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n# If you supply the <lang> directory (the one that corresponds to\n# how you decoded the data) to this script, it assumes that the <lang>\n# directory contains phones/align_lexicon.int, and it uses this to work\n# out a reasonable guess of the non-scored phones, based on which have\n# a single-word pronunciation that maps to a silence phone.\n# It then uses the words.txt to work out the written form of those words.\n\nparser = argparse.ArgumentParser(\n    description = \"This program works out a reasonable guess at a list of \"\n    \"non-scored words (words that won't affect the WER evaluation): \"\n    \"things like [COUGH], [NOISE] and so on.  This is useful because a list of \"\n    \"such words is required by some other scripts (e.g. modify_ctm_edits.py), \"\n    \"and it's inconvenient to have to specify the list manually for each language. \"\n    \"This program writes out the words in text form, one per line.\")\n\nparser.add_argument(\"lang\", type = str,\n                    help = \"The lang/ directory.  
This program expects \"\n                    \"lang/words.txt and lang/phones/silence.int and \"\n                    \"lang/phones/align_lexicon.int to exist, and will use them to work \"\n                    \"out a reasonable guess of the non-scored words  (as those whose \"\n                    \"pronunciations are a single phone in the 'silphones' list)\")\n\nargs = parser.parse_args()\n\nnon_scored_words = set()\n\n\ndef read_lang(lang_dir):\n    global non_scored_words\n\n    if not os.path.isdir(lang_dir):\n        logger.error(\"expected lang/ directory %s to \"\n                     \"exist.\", lang_dir)\n        raise RuntimeError\n\n    for f in [ '/words.txt', '/phones/silence.int', '/phones/align_lexicon.int' ]:\n        if not os.path.exists(lang_dir + f):\n            logger.error(\"expected file %s%s to exist.\", lang_dir, f)\n            raise RuntimeError\n\n    # read silence-phones.\n    try:\n        silence_phones = set()\n        for line in open(lang_dir + '/phones/silence.int').readlines():\n            silence_phones.add(int(line))\n    except Exception:\n        logger.error(\"problem reading file \"\n                     \"%s/phones/silence.int\", lang_dir)\n        raise\n\n    # read align_lexicon.int.\n    # format is: <word-index> <word-index> <phone-index1> <phone-index2> ..\n    # We're looking for line of the form:\n    # w w p\n    # where w > 0 and p is in the set 'silence_phones'\n    try:\n        silence_word_ints = set()\n        for line in open(lang_dir + '/phones/align_lexicon.int').readlines():\n            a = line.split()\n            if len(a) == 3 and a[0] == a[1] and int(a[0]) > 0 and \\\n                    int(a[2]) in silence_phones:\n                silence_word_ints.add(int(a[0]))\n    except Exception:\n        logger.error(\"problem reading file %s/phones/align_lexicon.int\",\n                     lang_dir)\n        raise\n\n    try:\n        for line in open(lang_dir + '/words.txt', 
encoding='utf-8').readlines():\n            [ word, integer ] = line.split()\n            if int(integer) in silence_word_ints:\n                non_scored_words.add(word)\n    except Exception:\n        logger.error(\"problem reading file %s/words.txt\", lang_dir)\n        raise\n\n    if not len(non_scored_words) == len(silence_word_ints):\n        raise RuntimeError(\"error getting silence words, len({0}) != len({1})\"\n                           \"\".format(non_scored_words, silence_word_ints))\n    for word in non_scored_words:\n        print(word)\n\n\nread_lang(args.lang)\n"
  },
  {
    "path": "egs/steps/cleanup/internal/get_pron_stats.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport sys\nimport warnings\n\n# Collect pronounciation stats from a ctm_prons.txt file of the form output\n# by steps/cleanup/debug_lexicon.sh.  This input file has lines of the form:\n#  utt_id word phone1 phone2 .. phoneN\n#  e.g.\n#  foo-bar123-342  hello h eh l l ow\n# (and this script does require that lines from the same utterance be ordered in\n# order of time).\n# The output of this program is word pronunciation stats of the form:\n#  count word phone1 .. phoneN\n#  e.g.:\n#  24.0  hello h ax l l ow\n# This program uses various heuristics to account for the fact that in the input ctm_prons.txt\n# file may not always be well aligned.  As a result of some of these heuristics the counts will\n# not always be integers.\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(description = \"Accumulate pronounciation statistics from \"\n                                     \"a ctm_prons.txt file.\",\n                                     epilog = \"See steps/cleanup/debug_lexicon.sh for example\")\n    parser.add_argument(\"ctm_prons_file\", metavar = \"<ctm-prons-file>\", type = str,\n                        help = \"File containing word-pronounciation alignments obtained from a ctm file; \"\n                        \"It represents phonetic decoding results, aligned with word boundaries obtained\"\n                        \"from forced alignments.\"\n                        \"each line must be <utt_id> <word> <phones>\")\n    parser.add_argument(\"silence_file\", metavar = \"<silphone-file>\", type = str,\n                        help = \"File containing a list of silence phones.\")\n    parser.add_argument(\"optional_silence_file\", metavar = \"<optional_silence>\", type = str,\n                        help = \"File containing the optional silence phone. 
We'll be replacing empty prons by this, \"\n                        \"because empty prons would cause a problem for lattice word alignment.\")\n    parser.add_argument(\"non_scored_words_file\", metavar = \"<non-scored-words-file>\", type = str,\n                        help = \"File containing a list of non-scored words.\")\n    parser.add_argument(\"stats_file\", metavar = \"<stats-file>\", type = str,\n                        help = \"Write accumulated statistics to this file; each line represents how many times \"\n                        \"a specific word-pronunciation pair appears in the phonetic decoding results (ctm_prons_file). \"\n                        \"Each line is <count> <word> <phones>\")\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.ctm_prons_file == \"-\":\n        args.ctm_prons_file_handle = sys.stdin\n    else:\n        args.ctm_prons_file_handle = open(args.ctm_prons_file)\n    args.non_scored_words_file_handle = open(args.non_scored_words_file)\n    args.silence_file_handle = open(args.silence_file)\n    args.optional_silence_file_handle = open(args.optional_silence_file)\n    if args.stats_file == \"-\":\n        args.stats_file_handle = sys.stdout\n    else:\n        args.stats_file_handle = open(args.stats_file, \"w\")\n    return args\n\ndef ReadEntries(file_handle):\n    entries = set()\n    for line in file_handle:\n        entries.add(line.strip())\n    return entries\n\n# Basically, this function generates an \"info\" list from a ctm_prons file.\n# Each entry in the list represents the pronunciation candidate(s) of a word.\n# For each non-<eps> word, the entry is a list: [utt_id, word, set(pronunciation_candidates)]. 
e.g:\n# [911Mothers_2010W-0010916-0012901-1, other, set('AH DH ER', 'AH DH ER K AH N')]\n# For each <eps>, we split the phones it aligns to into two parts: \"nonsil_left\",\n# which includes phones before the first silphone, and \"nonsil_right\", which includes\n# phones after the last silphone. For example, for <eps> : 'V SIL B AH SIL',\n# nonsil_left is 'V' and nonsil_right is empty ''. After processing an <eps> entry\n# in ctm_prons, we put it in \"info\" as an entry:  [utt_id, word, nonsil_right]\n# only if it's nonsil_right segment is not empty, which may be used when processing\n# the next word.\n#\n# Normally, one non-<eps> word is only aligned to one pronounciation candidate. However\n# when there is a preceding/following <eps>, like in the following example, we\n# assume the phones aligned to <eps> should be statistically distributed\n# to its neighboring words (BTW we assume there are no consecutive <eps> within an utterance.)\n# Thus we append the \"nonsil_left\" segment of these phones to the pronounciation\n# of the preceding word, if the last phone of this pronounciation is not a silence phone,\n# Similarly we can add a pron candidate to the following word.\n#\n# For example, for the following part of a ctm_prons file:\n# 911Mothers_2010W-0010916-0012901-1 other AH DH ER\n# 911Mothers_2010W-0010916-0012901-1 <eps> K AH N SIL B\n# 911Mothers_2010W-0010916-0012901-1 because IH K HH W AA Z AH\n# 911Mothers_2010W-0010916-0012901-1 <eps> V SIL\n# 911Mothers_2010W-0010916-0012901-1 when W EH N\n# 911Mothers_2010W-0010916-0012901-1 people P IY P AH L\n# 911Mothers_2010W-0010916-0012901-1 <eps> SIL\n# 911Mothers_2010W-0010916-0012901-1 heard HH ER\n# 911Mothers_2010W-0010916-0012901-1 <eps> D\n# 911Mothers_2010W-0010916-0012901-1 that SIL DH AH T\n# 911Mothers_2010W-0010916-0012901-1 my M AY\n#\n# The corresponding segment in the \"info\" list is:\n# [911Mothers_2010W-0010916-0012901-1, other, set('AH DH ER', 'AH DH ER K AH N')]\n# 
[911Mothers_2010W-0010916-0012901-1, <eps>, 'B'\n# [911Mothers_2010W-0010916-0012901-1, because, set('IH K HH W AA Z AH', 'B IH K HH W AA Z AH', 'IH K HH W AA Z AH V', 'B IH K HH W AA Z AH V')]\n# [911Mothers_2010W-0010916-0012901-1, when, set('W EH N')]\n# [911Mothers_2010W-0010916-0012901-1, people, set('P IY P AH L')]\n# [911Mothers_2010W-0010916-0012901-1, <eps>, 'D']\n# [911Mothers_2010W-0010916-0012901-1, that, set('SIL DH AH T')]\n# [911Mothers_2010W-0010916-0012901-1, my, set('M AY')]\n#\n# Then we accumulate pronouciation stats from \"info\". Basically, for each occurence\n# of a word, each pronounciation candidate gets equal soft counts. e.g. In the above\n# example, each pron candidate of \"because\" gets a count of 1/4. The stats is stored\n# in a dictionary (word, pron) : count.\n\ndef GetStatsFromCtmProns(silphones, optional_silence, non_scored_words, ctm_prons_file_handle):\n    info = []\n    for line in ctm_prons_file_handle.readlines():\n        splits = line.strip().split()\n        utt = splits[0]\n        word = splits[1]\n        phones = splits[2:]\n        if phones == []:\n            phones = [optional_silence]\n        # extract the nonsil_left and nonsil_right segments, and then try to\n        # append nonsil_left to the pron candidates of preceding word, getting\n        # extended pron candidates.\n        # Note: the ctm_pron file may have cases like:\n        # KevinStone_2010U-0024782-0025580-1 [UH] EH\n        # KevinStone_2010U-0024782-0025580-1 fda F T\n        # KevinStone_2010U-0024782-0025580-1 [NOISE] IY EY\n        # which means non-scored-words (except oov symbol <unk>/<UNK>) behaves like <eps>.\n        # So we apply the same merging method in these cases.\n        if word == '<eps>' or (word in non_scored_words and word != '<unk>' and word != '<UNK>'):\n            nonsil_left = []\n            nonsil_right = []\n            for phone in phones:\n                if phone in silphones:\n                    break\n         
       nonsil_left.append(phone)\n\n            for phone in reversed(phones):\n                if phone in silphones:\n                    break\n                nonsil_right.insert(0, phone)\n\n            # info[-1][0] is the utt_id of the last entry\n            if len(nonsil_left) > 0 and len(info) > 0 and utt == info[-1][0]:\n                # pron_ext is a set of extended pron candidates.\n                pron_ext = set()\n                # info[-1][2] is the set of pron candidates of the last entry.\n                for pron in info[-1][2]:\n                    # skip generating the extended pron candidate if\n                    # the pron ends with a silphone.\n                    ends_with_sil = False\n                    for sil in silphones:\n                        if pron.endswith(sil):\n                            ends_with_sil = True\n                    if not ends_with_sil:\n                        pron_ext.add(pron+\" \"+\" \".join(nonsil_left))\n                if isinstance(info[-1][2], set):\n                    info[-1][2] = info[-1][2].union(pron_ext)\n            if len(nonsil_right) > 0:\n                info.append([utt, word, \" \".join(nonsil_right)])\n        else:\n            prons = set()\n            prons.add(\" \".join(phones))\n            # If there's a preceding <eps>/non_scored_words (which means the third field is a string rather than a set of strings),\n            # we append it's nonsil_right segment to the pron candidates of the current word.\n            if len(info) > 0 and utt == info[-1][0] and isinstance(info[-1][2], str) and (phones == [] or phones[0] not in silphones):\n                # info[-1][2] is the nonsil_right segment of the phones aligned to the last <eps>/non_scored_words.\n                prons.add(info[-1][2]+' '+\" \".join(phones))\n            info.append([utt, word, prons])\n    stats = {}\n    for utt, word, prons in info:\n        # If the prons is not a set, the current word must be <eps> or an 
non_scored_word,\n        # where we just left the nonsil_right part as prons.\n        if isinstance(prons, set) and len(prons) > 0:\n            count = 1.0 / float(len(prons))\n            for pron in prons:\n                phones = pron.strip().split()\n                # post-processing: remove all beginning/trailing silence phones.\n                # we allow only candidates that either consist of a single silence\n                # phone, or the silence phones are inside non-silence phones.\n                if len(phones) > 1:\n                    begin = 0\n                    for phone in phones:\n                        if phone in silphones:\n                            begin += 1\n                        else:\n                            break\n                    if begin == len(phones):\n                        begin -= 1\n                    phones = phones[begin:]\n                    if len(phones) == 1:\n                        break\n                    end = len(phones)\n                    for phone in reversed(phones):\n                        if phone in silphones:\n                            end -= 1\n                        else:\n                            break\n                    phones = phones[:end]\n                phones = \" \".join(phones)\n                stats[(word, phones)] = stats.get((word, phones), 0) + count\n    return stats\n\ndef WriteStats(stats, file_handle):\n    for word_pron, count in stats.items():\n        print('{0} {1} {2}'.format(count, word_pron[0], word_pron[1]), file=file_handle)\n    file_handle.close()\n\ndef Main():\n    args = GetArgs()\n    silphones = ReadEntries(args.silence_file_handle)\n    non_scored_words = ReadEntries(args.non_scored_words_file_handle)\n    optional_silence = ReadEntries(args.optional_silence_file_handle)\n    stats = GetStatsFromCtmProns(silphones, optional_silence.pop(), non_scored_words, args.ctm_prons_file_handle)\n    WriteStats(stats, args.stats_file_handle)\n\nif 
__name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/make_one_biased_lm.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport sys\nimport argparse\nimport math\nfrom collections import defaultdict\n\nimport io\nsys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding=\"utf8\")\nsys.stderr = io.TextIOWrapper(sys.stderr.buffer,encoding=\"utf8\")\nsys.stdin = io.TextIOWrapper(sys.stdin.buffer,encoding=\"utf8\")\n\nparser = argparse.ArgumentParser(description=\"\"\"\nThis script creates a biased language model suitable for alignment and\ndata-cleanup purposes.   It reads (possibly multiple) lines of integerized text\nfrom the input and writes a text-form FST of a backoff language model to\nthe standard output, to be piped into fstcompile.\"\"\")\n\nparser.add_argument(\"--word-disambig-symbol\", type = int, required = True,\n                    help = \"Integer corresponding to the disambiguation \"\n                    \"symbol (normally #0) for backoff arcs\")\nparser.add_argument(\"--ngram-order\", type = int, default = 4,\n                    choices = [2,3,4,5,6,7],\n                    help = \"Maximum order of n-gram to use (but see also \"\n                    \"--min-lm-state-count; the effective order may be less.\")\nparser.add_argument(\"--min-lm-state-count\", type = int, default = 10,\n                    help = \"Minimum count below which we will completely \"\n                    \"discount an LM-state (if it is of order > 2, i.e. \"\n                    \"history-length > 1).\")\nparser.add_argument(\"--top-words\", type = str,\n                    help = \"File containing frequent words and probabilities to be added into \"\n                    \"the language model, with lines in the format '<integer-id-of-word> <prob>'. 
\"\n                    \"These probabilities will be added to the probabilities in the unigram \"\n                    \"backoff state and then renormalized; this option allows you to introduce \"\n                    \"common words to the LM with specified probabilities.\")\nparser.add_argument(\"--discounting-constant\", type = float, default = 0.3,\n                    help = \"Discounting constant D for standard (unmodified) Kneser-Ney; \"\n                    \"must be strictly between 0 and 1.  A value closer to 0 will give \"\n                    \"you a more-strongly-biased LM.\")\nparser.add_argument(\"--verbose\", type = int, default = 0,\n                    choices=[0,1,2,3,4,5], help = \"Verbose level\")\n\nargs = parser.parse_args()\n\nif args.verbose >= 1:\n    print(' '.join(sys.argv), file = sys.stderr)\n\n\n\n\nclass NgramCounts(object):\n    ## A note on data-structure.\n    ## Firstly, all words are represented as integers.\n    ## We store n-gram counts as an array, indexed by (history-length == n-gram order minus one)\n    ## (note: python calls arrays \"lists\")  of dicts from histories to counts, where\n    ## histories are arrays of integers and \"counts\" are dicts from integer to float.\n    ## For instance, when accumulating the 4-gram count for the '8' in the sequence '5 6 7 8',\n    ## we'd do as follows:\n    ##  self.counts[3][[5,6,7]][8] += 1.0\n    ## where the [3] indexes an array, the [[5,6,7]] indexes a dict, and\n    ## the [8] indexes a dict.\n    def __init__(self, ngram_order):\n        self.ngram_order = ngram_order\n        # Integerized counts will never contain negative numbers, so\n        # inside this program, we use -3 and -2 for the BOS and EOS symbols\n        # respectively.\n        # Note: it's actually important that the bos-symbol is the most negative;\n        # it helps ensure that we print the state with left-context <s> first\n        # when we print the FST, and this means that the start-state will 
have\n        # the correct value.\n        self.bos_symbol = -3\n        self.eos_symbol = -2\n        # backoff_symbol is kind of a pseudo-word, it's used in keeping track of\n        # the backoff counts in each state.\n        self.backoff_symbol = -1\n        self.counts = []\n        for n in range(ngram_order):\n            # The 'lambda: defaultdict(float)' is an anonymous function taking\n            # no arguments that returns a new defaultdict(float).\n            # If we index self.counts[n][history] for a history-length n < ngram_order\n            # and a previously unseen history, it will create a new defaultdict\n            # that defaults to 0.0 [since the function float() will return 0.0].\n            # This means that we can index self.counts without worrying about\n            # undefined values.\n            self.counts.append(defaultdict(lambda: defaultdict(float)))\n\n    # adds a raw count (called while processing input data).\n    # Suppose we see the sequence '6 7 8 9' and ngram_order=4, 'history'\n    # would be (6,7,8) and 'predicted_word' would be 9; 'count' would be\n    # 1.0.\n    def AddCount(self, history, predicted_word, count):\n        self.counts[len(history)][history][predicted_word] += count\n\n    # 'line' is a string containing a sequence of integer word-ids.\n    # This function adds the un-smoothed counts from this line of text.\n    def AddRawCountsFromLine(self, line):\n        try:\n            words = [self.bos_symbol] + [ int(x) for x in line.split() ] + [self.eos_symbol]\n        except:\n            sys.exit(\"make_one_biased_lm.py: bad input line {0} (expected a sequence \"\n                     \"of integers)\".format(line))\n\n        for n in range(1, len(words)):\n            predicted_word = words[n]\n            history_start = max(0, n + 1 - self.ngram_order)\n            history = tuple(words[history_start:n])\n            self.AddCount(history, predicted_word, 1.0)\n\n    def 
AddRawCountsFromStandardInput(self):\n        lines_processed = 0\n        while True:\n            line = sys.stdin.readline()\n            if line == '':\n                break\n            self.AddRawCountsFromLine(line)\n            lines_processed += 1\n        if lines_processed == 0 or args.verbose > 0:\n            print(\"make_one_biased_lm.py: processed {0} lines of input\".format(\n                    lines_processed), file = sys.stderr)\n\n\n    # This function returns a dict from history (as a tuple of integers of\n    # length > 1, ignoring lower-order histories), to the total count of this\n    # history state plus all history-states which back off to this history state.\n    # It's used inside CompletelyDiscountLowCountStates().\n    def GetHistToTotalCount(self):\n        ans = defaultdict(float)\n        for n in range(2, self.ngram_order):\n            for hist, word_to_count in self.counts[n].items():\n                total_count = sum(word_to_count.values())\n                while len(hist) >= 2:\n                    ans[hist] += total_count\n                    hist = hist[1:]\n        return ans\n\n\n    # This function will completely discount the counts in any LM-states of\n    # order > 2 (i.e. 
history-length > 1) that have total count below\n    # 'min_count'; when computing the total counts, we include higher-order\n    # LM-states that would back off to 'this' lm-state, in the total.\n    def CompletelyDiscountLowCountStates(self, min_count):\n        hist_to_total_count = self.GetHistToTotalCount()\n        for n in reversed(list(range(2, self.ngram_order))):\n            this_order_counts = self.counts[n]\n            to_delete = []\n            for hist in this_order_counts.keys():\n                if hist_to_total_count[hist] < min_count:\n                    # we need to completely back off this count.\n                    word_to_count = this_order_counts[hist]\n                    # mark this key for deleting\n                    to_delete.append(hist)\n                    backoff_hist = hist[1:]  # this will be a tuple not a list.\n                    for word, count in word_to_count.items():\n                        self.AddCount(backoff_hist, word, count)\n            for hist in to_delete:\n                del this_order_counts[hist]\n\n    # This backs off the counts according to Kneser-Ney (unmodified,\n    # with interpolation).\n    def ApplyBackoff(self, D):\n        assert D > 0.0 and D < 1.0\n        for n in reversed(list(range(1, self.ngram_order))):\n            this_order_counts = self.counts[n]\n            for hist, word_to_count in this_order_counts.items():\n                backoff_hist = hist[1:]\n                backoff_word_to_count = self.counts[n-1][backoff_hist]\n                this_discount_total = 0.0\n                for word in word_to_count:\n                    assert word_to_count[word] >= 1.0\n                    word_to_count[word] -= D\n                    this_discount_total += D\n                    # Interpret the following line as incrementing the\n                    # count-of-counts for the next-lower order.\n                    backoff_word_to_count[word] += 1.0\n                
word_to_count[self.backoff_symbol] += this_discount_total\n\n\n    # This function prints out to stderr the n-gram counts stored in this\n    # object; it's used for debugging.\n    def Print(self, info_string):\n        print(info_string, file=sys.stderr)\n        # these are useful for debug.\n        total = 0.0\n        total_excluding_backoff = 0.0\n        for this_order_counts in self.counts:\n            for hist, word_to_count in this_order_counts.items():\n                this_total_count = sum(word_to_count.values())\n                print('{0}: total={1} '.format(hist, this_total_count),\n                      end='', file=sys.stderr)\n                print(' '.join(['{0} -> {1} '.format(word, count)\n                                for word, count in word_to_count.items() ]),\n                      file = sys.stderr)\n                total += this_total_count\n                total_excluding_backoff += this_total_count\n                if self.backoff_symbol in word_to_count:\n                    total_excluding_backoff -= word_to_count[self.backoff_symbol]\n        print('total count = {0}, excluding discount = {1}'.format(\n                total, total_excluding_backoff), file = sys.stderr)\n\n    def AddTopWords(self, top_words_file):\n        empty_history = ()\n        word_to_count = self.counts[0][empty_history]\n        total = sum(word_to_count.values())\n        try:\n            f = open(top_words_file, mode='r', encoding='utf-8')\n        except:\n            sys.exit(\"make_one_biased_lm.py: error opening top-words file: \"\n                     \"--top-words=\" + top_words_file)\n        while True:\n            line = f.readline()\n            if line == '':\n                break\n            try:\n                [ word_index, prob ] = line.split()\n                word_index = int(word_index)\n                prob = float(prob)\n                assert word_index > 0 and prob > 0.0\n                word_to_count[word_index] += prob * 
total\n            except Exception as e:\n                sys.exit(\"make_one_biased_lm.py: could not make sense of the \"\n                         \"line '{0}' in top-words file: {1} \".format(line, str(e)))\n        f.close()\n\n\n    def GetTotalCountMap(self):\n        # This function, called from PrintAsFst, returns a map from\n        # history to the total-count for that state.\n        total_count_map = dict()\n        for n in range(0, self.ngram_order):\n            for hist, word_to_count in self.counts[n].items():\n                total_count_map[hist] = sum(word_to_count.values())\n        return total_count_map\n\n    def GetHistToStateMap(self):\n        # This function, called from PrintAsFst, returns a map from\n        # history to integer FST-state.\n        hist_to_state = dict()\n        fst_state_counter = 0\n        for n in range(0, self.ngram_order):\n            for hist in self.counts[n].keys():\n                hist_to_state[hist] = fst_state_counter\n                fst_state_counter += 1\n        return hist_to_state\n\n    def GetProb(self, hist, word, total_count_map):\n        total_count = total_count_map[hist]\n        word_to_count = self.counts[len(hist)][hist]\n        prob = float(word_to_count[word]) / total_count\n        if len(hist) > 0 and word != self.backoff_symbol:\n            prob_in_backoff = self.GetProb(hist[1:], word, total_count_map)\n            backoff_prob = float(word_to_count[self.backoff_symbol]) / total_count\n            prob += backoff_prob * prob_in_backoff\n        return prob\n\n    # This function prints the estimated language model as an FST.\n    def PrintAsFst(self, word_disambig_symbol):\n        # n is the history-length (== order - 1).  We iterate over the\n        # history-length in the order 1, 0, 2, 3, and then iterate over the\n        # histories of each order in sorted order.  
Putting order 1 first\n        # and sorting on the histories\n        # ensures that the bigram state with <s> as the left context comes first.\n        # (note: self.bos_symbol is the most negative symbol)\n\n        # History will map from history (as a tuple) to integer FST-state.\n        hist_to_state = self.GetHistToStateMap()\n        total_count_map = self.GetTotalCountMap()\n\n        for n in [ 1, 0 ] + list(range(2, self.ngram_order)):\n            this_order_counts = self.counts[n]\n            # For order 1, make sure the keys are sorted.\n            keys = this_order_counts.keys() if n != 1 else sorted(this_order_counts.keys())\n            for hist in keys:\n                word_to_count = this_order_counts[hist]\n                this_fst_state = hist_to_state[hist]\n\n                for word in word_to_count.keys():\n                    # work out this_cost.  Costs in OpenFst are negative logs.\n                    this_cost = -math.log(self.GetProb(hist, word, total_count_map))\n\n                    if word > 0: # a real word.\n                        next_hist = hist + (word,)  # appending tuples\n                        while not next_hist in hist_to_state:\n                            next_hist = next_hist[1:]\n                        next_fst_state = hist_to_state[next_hist]\n                        print(this_fst_state, next_fst_state, word, word,\n                              this_cost)\n                    elif word == self.eos_symbol:\n                        # print final-prob for this state.\n                        print(this_fst_state, this_cost)\n                    else:\n                        assert word == self.backoff_symbol\n                        backoff_fst_state = hist_to_state[hist[1:len(hist)]]\n                        print(this_fst_state, backoff_fst_state,\n                              word_disambig_symbol, 0, this_cost)\n\n\nngram_counts = 
NgramCounts(args.ngram_order)\nngram_counts.AddRawCountsFromStandardInput()\n\nif args.verbose >= 3:\n    ngram_counts.Print(\"Raw counts:\")\nngram_counts.CompletelyDiscountLowCountStates(args.min_lm_state_count)\nif args.verbose >= 3:\n    ngram_counts.Print(\"Counts after discounting low-count states:\")\nngram_counts.ApplyBackoff(args.discounting_constant)\nif args.verbose >= 3:\n    ngram_counts.Print(\"Counts after applying Kneser-Ney discounting:\")\nif args.top_words is not None:\n    ngram_counts.AddTopWords(args.top_words)\n    if args.verbose >= 3:\n        ngram_counts.Print(\"Counts after applying top-n-words\")\nngram_counts.PrintAsFst(args.word_disambig_symbol)\n\n\n# test command:\n# (echo 6 7 8 4; echo 7 8 9; echo 7 8) | ./make_one_biased_lm.py --word-disambig-symbol=1000 --min-lm-state-count=2 --verbose=3 --top-words=<(echo 1 0.5; echo 2 0.25)\n"
  },
  {
    "path": "egs/steps/cleanup/internal/modify_ctm_edits.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016   Vimal Manohar\n#           2016   Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport sys\nfrom collections import defaultdict\n\n\"\"\"\nThis script reads and writes the 'ctm-edits' file that is\nproduced by get_ctm_edits.py.\n\nIt modifies the ctm-edits so that non-scored words\nare not counted as errors: for instance, if there are things like\n[COUGH] and [NOISE] in the transcript, deletions, insertions and\nsubstitutions involving them are allowed, and we modify the reference\nto correspond to the hypothesis.\n\nIf you supply the <lang> directory (the one that corresponds to\nhow you decoded the data) to this script, it assumes that the <lang>\ndirectory contains phones/align_lexicon.int, and it uses this to work\nout a reasonable guess of the non-scored phones, based on which have\na single-word pronunciation that maps to a silence phone.\nIt then uses the words.txt to work out the written form of those words.\n\nAlternatively, you may specify a file containing the non-scored words one\nper line, with the --non-scored-words option.\n\nNon-scored words that were deleted (i.e. they were in the ref but not the\nhyp) are simply removed from the ctm.  
For non-scored words that\nwere inserted or substituted, we change the reference word to match the\nhyp word, but instead of marking the operation as 'cor' (correct), we\nmark it as 'fix' (fixed), so that it will not be positively counted as a correct\nword for purposes of finding the optimal segment boundaries.\n\ne.g.\n<file-id> <channel> <start-time> <duration> <hyp-word> <conf> <ref-word> <edit-type>\n[note: the <channel> will always be 1].\n\nAJJacobs_2007P-0001605-0003029 1 0 0.09 <eps> 1.0 <eps> sil\nAJJacobs_2007P-0001605-0003029 1 0.09 0.15 i 1.0 i cor\nAJJacobs_2007P-0001605-0003029 1 0.24 0.25 thought 1.0 thought cor\nAJJacobs_2007P-0001605-0003029 1 0.49 0.14 i'd 1.0 i'd cor\nAJJacobs_2007P-0001605-0003029 1 0.63 0.22 tell 1.0 tell cor\nAJJacobs_2007P-0001605-0003029 1 0.85 0.11 you 1.0 you cor\nAJJacobs_2007P-0001605-0003029 1 0.96 0.05 a 1.0 a cor\nAJJacobs_2007P-0001605-0003029 1 1.01 0.24 little 1.0 little cor\nAJJacobs_2007P-0001605-0003029 1 1.25 0.5 about 1.0 about cor\nAJJacobs_2007P-0001605-0003029 1 1.75 0.48 [UH] 1.0 [UH] cor\n\"\"\"\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter('%(asctime)s [%(filename)s:%(lineno)s - '\n                              '%(funcName)s - %(levelname)s ] %(message)s')\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\nparser = argparse.ArgumentParser(\n    description = \"This program modifies the reference in the ctm-edits which \"\n    \"is output by steps/cleanup/internal/get_ctm_edits.py, to allow insertions, deletions and \"\n    \"substitutions of non-scored words, and [if --allow-repetitions=true], \"\n    \"duplications of single words or pairs of scored words (to account for dysfluencies \"\n    \"that were not transcribed).  
Note: deletions and substitutions of non-scored words \"\n    \"after the reference is corrected, will be marked as operation 'fix' rather than \"\n    \"'cor' (correct) so that the downstream processing knows that this was not in \"\n    \"the original reference.  Also by default tags non-scored words as such when \"\n    \"they are correct; see the --tag-non-scored option.\")\n\nparser.add_argument(\"--verbose\", type = int, default = 1,\n                    choices=[0,1,2,3],\n                    help = \"Verbose level, higher = more verbose output\")\nparser.add_argument(\"--allow-repetitions\", type = str, default = 'true',\n                    choices=['true','false'],\n                    help = \"If true, allow repetitions in the transcript of one or \"\n                    \"two-word sequences: for instance if the ref says 'i' but \"\n                    \"the hyp says 'i i', or the ref says 'but then' and the hyp says \"\n                    \"'but then but then', fix the reference accordingly.  Intervening \"\n                    \"non-scored words are allowed between the repetitions.  These \"\n                    \"fixes will be marked as 'cor', not as 'fix', since there is \"\n                    \"generally no way to tell which repetition was the 'real' one \"\n                    \"(and since we're generally confident that such things were \"\n                    \"actually uttered).\")\nparser.add_argument(\"non_scored_words_in\", metavar = \"<non-scored-words-file>\",\n                    help=\"Filename of file containing a list of non-scored words, \"\n                    \"one per line. See steps/cleanup/get_nonscored_words.py.\")\nparser.add_argument(\"ctm_edits_in\", metavar = \"<ctm-edits-in>\",\n                    help = \"Filename of input ctm-edits file. 
\"\n                    \"Use /dev/stdin for standard input.\")\nparser.add_argument(\"ctm_edits_out\", metavar = \"<ctm-edits-out>\",\n                    help = \"Filename of output ctm-edits file. \"\n                    \"Use /dev/stdout for standard output.\")\n\nargs = parser.parse_args()\n\n\n\ndef ReadNonScoredWords(non_scored_words_file):\n    global non_scored_words\n    try:\n        f = open(non_scored_words_file, encoding='utf-8')\n    except:\n        sys.exit(\"modify_ctm_edits.py: error opening file: \"\n                 \"--non-scored-words=\" + non_scored_words_file)\n    for line in f.readlines():\n        a = line.split()\n        if not len(line.split()) == 1:\n            sys.exit(\"modify_ctm_edits.py: bad line in non-scored-words \"\n                     \"file {0}: {1}\".format(non_scored_words_file, line))\n        non_scored_words.add(a[0])\n    f.close()\n\n\n\n# The ctm-edits file format is as follows [note: file-id is really utterance-id\n# in this context].\n# <file-id> <channel> <start-time> <duration> <conf> <hyp-word> <ref-word> <edit>\n# e.g.:\n# AJJacobs_2007P-0001605-0003029 1 0 0.09 <eps> 1.0 <eps> sil\n# AJJacobs_2007P-0001605-0003029 1 0.09 0.15 i 1.0 i cor\n# ...\n# This function processes a single line of ctm-edits input for fixing\n# \"non-scored\" words.  The input 'a' is the split line as an array of fields.\n# It modifies the object 'a'.   
This function returns the modified array,\n# and please note that it is destructive of its input 'a'.\n# If it returns the empty array then the line is to be deleted.\ndef ProcessLineForNonScoredWords(a):\n    global num_lines, num_correct_lines, ref_change_stats\n    try:\n        assert len(a) == 8\n        num_lines += 1\n        # we could do:\n        # [ file, channel, start, duration, hyp_word, confidence, ref_word, edit_type ] = a\n        duration = a[3]\n        hyp_word = a[4]\n        ref_word = a[6]\n        edit_type = a[7]\n        if edit_type == 'ins':\n            assert ref_word == '<eps>'\n            if hyp_word in non_scored_words:\n                # insert this non-scored word into the reference.\n                ref_change_stats[ref_word + ' -> ' + hyp_word] += 1\n                ref_word = hyp_word\n                edit_type = 'fix'\n        elif edit_type == 'del':\n            assert hyp_word == '<eps>' and float(duration) == 0.0\n            if ref_word in non_scored_words:\n                ref_change_stats[ref_word + ' -> ' + hyp_word] += 1\n                return []\n        elif edit_type == 'sub':\n            assert hyp_word != '<eps>'\n            if hyp_word in non_scored_words and ref_word in non_scored_words:\n                # we also allow replacing one non-scored word with another.\n                ref_change_stats[ref_word + ' -> ' + hyp_word] += 1\n                ref_word = hyp_word\n                edit_type = 'fix'\n        else:\n            assert edit_type == 'cor' or edit_type == 'sil'\n            num_correct_lines += 1\n\n        a[4] = hyp_word\n        a[6] = ref_word\n        a[7] = edit_type\n        return a\n\n    except Exception:\n        logger.error(\"bad line in ctm-edits input: \"\n                     \"{0}\".format(a))\n        raise RuntimeError\n\n# This function processes the split lines of one utterance (as a\n# list of lists of fields), to allow repetitions of words, so if the\n# reference says 
'i' but the hyp says 'i i', or the ref says\n# 'you know' and the hyp says 'you know you know', we change the\n# ref to match.\n# It returns the modified list-of-lists [but note that the input\n# is actually modified].\ndef ProcessUtteranceForRepetitions(split_lines_of_utt):\n    global non_scored_words, repetition_stats\n    # The array 'selected_line_indexes' will contain the indexes of selected\n    # elements of 'split_lines_of_utt'.  Consider split_line =\n    # split_lines_of_utt[i].  If the hyp and ref words in split_line are both\n    # either '<eps>' or non-scoreable words, we discard the index.\n    # Otherwise we put it into selected_line_indexes.\n    selected_line_indexes = []\n    # selected_edits will contain, for each element of selected_line_indexes, the\n    # corresponding edit_type from the original utterance previous to\n    # this function call ('cor', 'ins', etc.).\n    #\n    # As a special case, if there was a substitution ('sub') where the\n    # reference word was a non-scored word and the hyp word was a real word,\n    # we mark it in this array as 'ins', because for purposes of this algorithm\n    # it behaves the same as an insertion.\n    #\n    # Whenever we do any operation that will change the reference, we change\n    # all the selected_edits in the array to None so that they won't match\n    # any further operations.\n    selected_edits = []\n    # selected_hyp_words will contain, for each element of selected_line_indexes, the\n    # corresponding hyp_word.\n    selected_hyp_words = []\n\n    for i in range(len(split_lines_of_utt)):\n        split_line = split_lines_of_utt[i]\n        hyp_word = split_line[4]\n        ref_word = split_line[6]\n        # keep_this_line will be True if we are going to keep this line in the\n        # 'selected lines' for further processing of repetitions.  
We only\n        # eliminate lines involving non-scored words or epsilon in both hyp\n        # and reference position\n        # [note: epsilon in hyp position for non-empty segments indicates\n        #  optional-silence, and it does make sense to make this 'invisible',\n        #  just like non-scored words, for the purposes of this code.]\n        keep_this_line = True\n        if (hyp_word == '<eps>' or hyp_word in non_scored_words) and \\\n           (ref_word == '<eps>' or ref_word in non_scored_words):\n            keep_this_line = False\n        if keep_this_line:\n            selected_line_indexes.append(i)\n            edit_type = split_line[7]\n            if edit_type == 'sub' and ref_word in non_scored_words:\n                assert not hyp_word in non_scored_words\n                # For purposes of this algorithm, substitution of, say,\n                # '[COUGH]' by 'hello' behaves like an insertion of 'hello',\n                # since we're willing to remove the '[COUGH]' from the\n                # transcript.\n                edit_type = 'ins'\n            selected_edits.append(edit_type)\n            selected_hyp_words.append(hyp_word)\n\n    # indexes_to_fix will be a list of indexes into 'selected_line_indexes' where we\n    # plan to fix the ref to match the hyp.\n    indexes_to_fix = []\n\n    # This loop scans for, and fixes, two-word insertions that follow,\n    # or precede, the corresponding correct words.\n    for i in range(0, len(selected_line_indexes) - 3):\n        this_indexes = selected_line_indexes[i:i+4]\n        this_hyp_words = selected_hyp_words[i:i+4]\n\n        if this_hyp_words[0] == this_hyp_words[2] and \\\n           this_hyp_words[1] == this_hyp_words[3] and \\\n           this_hyp_words[0] != this_hyp_words[1]:\n            # if the hyp words were of the form [ 'a', 'b', 'a', 'b' ]...\n            this_edits = selected_edits[i:i+4]\n            if this_edits == [ 'cor', 'cor', 'ins', 'ins' ] or \\\n                    
this_edits == [ 'ins', 'ins', 'cor', 'cor' ]:\n                if this_edits[0] == 'cor':\n                    indexes_to_fix += [ i+2, i+3 ]\n                else:\n                    indexes_to_fix += [ i, i+1 ]\n\n                word_pair = this_hyp_words[0] + ' '  + this_hyp_words[1]\n                # e.g. word_pair = 'hi there'\n                # add 2 because these stats are of words.\n                repetition_stats[word_pair] += 2\n                # the next line prevents this region of the text being used\n                # in any further edits.\n                selected_edits[i:i+4] = [ None, None, None, None ]\n\n    # This loop scans for, and fixes, one-word insertions that follow,\n    # or precede, the corresponding correct words.\n    for i in range(0, len(selected_line_indexes) - 1):\n        this_indexes = selected_line_indexes[i:i+2]\n        this_hyp_words = selected_hyp_words[i:i+2]\n\n        if this_hyp_words[0] == this_hyp_words[1]:\n            # if the hyp words were of the form [ 'a', 'a' ]...\n            this_edits = selected_edits[i:i+2]\n            if this_edits == [ 'cor', 'ins' ] or this_edits == [ 'ins', 'cor' ]:\n                if this_edits[0] == 'cor':\n                    indexes_to_fix.append(i+1)\n                else:\n                    indexes_to_fix.append(i)\n                repetition_stats[this_hyp_words[0]] += 1\n                # the next line prevents this region of the text being used\n                # in any further edits.\n                selected_edits[i:i+2] = [ None, None ]\n\n    for i in indexes_to_fix:\n        j = selected_line_indexes[i]\n        split_line = split_lines_of_utt[j]\n        ref_word = split_line[6]\n        hyp_word = split_line[4]\n        assert ref_word == '<eps>' or ref_word in non_scored_words\n 
       # we replace reference with the decoded word, which will be a\n        # repetition.\n        split_line[6] = hyp_word\n        split_line[7] = 'cor'\n\n    return split_lines_of_utt\n\n\n# note: split_lines_of_utt is a list of lists, one per line, each containing the\n# sequence of fields.\n# Returns the same format of data after processing.\ndef ProcessUtterance(split_lines_of_utt):\n    new_split_lines_of_utt = []\n    for split_line in split_lines_of_utt:\n        new_split_line = ProcessLineForNonScoredWords(split_line)\n        if new_split_line != []:\n            new_split_lines_of_utt.append(new_split_line)\n    if args.allow_repetitions == 'true':\n        new_split_lines_of_utt = ProcessUtteranceForRepetitions(new_split_lines_of_utt)\n    return new_split_lines_of_utt\n\n\ndef ProcessData():\n    try:\n        f_in = open(args.ctm_edits_in, encoding='utf-8')\n    except:\n        sys.exit(\"modify_ctm_edits.py: error opening ctm-edits input \"\n                 \"file {0}\".format(args.ctm_edits_in))\n    try:\n        f_out = open(args.ctm_edits_out, 'w', encoding='utf-8')\n    except:\n        sys.exit(\"modify_ctm_edits.py: error opening ctm-edits output \"\n                 \"file {0}\".format(args.ctm_edits_out))\n    num_lines_processed = 0\n\n\n    # Most of what we're doing in the lines below is splitting the input lines\n    # and grouping them per utterance, before giving them to ProcessUtterance()\n    # and then printing the modified lines.\n    first_line = f_in.readline()\n    if first_line == '':\n        sys.exit(\"modify_ctm_edits.py: empty input\")\n    split_pending_line = first_line.split()\n    if len(split_pending_line) == 0:\n        sys.exit(\"modify_ctm_edits.py: bad input line \" + first_line)\n    cur_utterance = split_pending_line[0]\n    split_lines_of_cur_utterance = []\n\n    while True:\n        if len(split_pending_line) == 0 or split_pending_line[0] != cur_utterance:\n            split_lines_of_cur_utterance = 
ProcessUtterance(split_lines_of_cur_utterance)\n            for split_line in split_lines_of_cur_utterance:\n                print(' '.join(split_line), file = f_out)\n            split_lines_of_cur_utterance = []\n            if len(split_pending_line) == 0:\n                break\n            else:\n                cur_utterance = split_pending_line[0]\n\n        split_lines_of_cur_utterance.append(split_pending_line)\n        next_line = f_in.readline()\n        split_pending_line = next_line.split()\n        if len(split_pending_line) == 0:\n            if next_line != '':\n                sys.exit(\"modify_ctm_edits.py: got an empty or whitespace input line\")\n    try:\n        f_out.close()\n    except:\n        sys.exit(\"modify_ctm_edits.py: error closing ctm-edits output \"\n                 \"(broken pipe or full disk?)\")\n\ndef PrintNonScoredStats():\n    if args.verbose < 1:\n        return\n    if num_lines == 0:\n        print(\"modify_ctm_edits.py: processed no input.\", file = sys.stderr)\n        # return here to avoid dividing by zero below.\n        return\n    num_lines_modified = sum(ref_change_stats.values())\n    num_incorrect_lines = num_lines - num_correct_lines\n    percent_lines_incorrect = '%.2f' % (num_incorrect_lines * 100.0 / num_lines)\n    percent_modified = '%.2f' % (num_lines_modified * 100.0 / num_lines)\n    if num_incorrect_lines > 0:\n        percent_of_incorrect_modified = '%.2f' % (num_lines_modified * 100.0 /\n                                                  num_incorrect_lines)\n    else:\n        percent_of_incorrect_modified = float('nan')\n    print(\"modify_ctm_edits.py: processed {0} lines of ctm ({1}% of which incorrect), \"\n          \"of which {2} were changed fixing the reference for non-scored words \"\n          \"({3}% of lines, or {4}% of incorrect lines)\".format(\n            num_lines, percent_lines_incorrect, num_lines_modified,\n            percent_modified, percent_of_incorrect_modified),\n          file = sys.stderr)\n\n    keys = sorted(ref_change_stats.keys(), 
reverse=True,\n                  key = lambda x: ref_change_stats[x])\n    num_keys_to_print = 40 if args.verbose >= 2 else 10\n\n    # note: the conditional expression is parenthesized so that the '...'\n    # suffix is optional; without the parentheses the whole message would be\n    # replaced by '' whenever all keys fit in the printed list.\n    print(\"modify_ctm_edits.py: most common edits (as percentages \"\n          \"of all such edits) are:\\n\" +\n          ('\\n'.join([ '%s [%.2f%%]' % (k, ref_change_stats[k]*100.0/num_lines_modified)\n                     for k in keys[0:num_keys_to_print]]))\n          + ('\\n...' if num_keys_to_print < len(keys) else ''),\n          file = sys.stderr)\n\n\ndef PrintRepetitionStats():\n    if args.verbose < 1 or sum(repetition_stats.values()) == 0:\n        return\n    num_lines_modified = sum(repetition_stats.values())\n    num_incorrect_lines = num_lines - num_correct_lines\n    percent_lines_incorrect = '%.2f' % (num_incorrect_lines * 100.0 / num_lines)\n    percent_modified = '%.2f' % (num_lines_modified * 100.0 / num_lines)\n    if num_incorrect_lines > 0:\n        percent_of_incorrect_modified = '%.2f' % (num_lines_modified * 100.0 /\n                                                  num_incorrect_lines)\n    else:\n        percent_of_incorrect_modified = float('nan')\n    print(\"modify_ctm_edits.py: processed {0} lines of ctm ({1}% of which incorrect), \"\n          \"of which {2} were changed fixing the reference for repetitions ({3}% of \"\n          \"lines, or {4}% of incorrect lines)\".format(\n            num_lines, percent_lines_incorrect, num_lines_modified,\n            percent_modified, percent_of_incorrect_modified),\n          file = sys.stderr)\n\n    keys = sorted(repetition_stats.keys(), reverse=True,\n                  key = lambda x: repetition_stats[x])\n    num_keys_to_print = 40 if args.verbose >= 2 else 10\n\n    print(\"modify_ctm_edits.py: most common repetitions inserted into reference (as percentages \"\n          \"of all words fixed in this way) are:\\n\" +\n          ('\\n'.join([ '%s [%.2f%%]' % (k, repetition_stats[k]*100.0/num_lines_modified)\n                     for k in 
keys[0:num_keys_to_print]]))\n          + ('\\n...' if num_keys_to_print < len(keys) else ''),\n          file = sys.stderr)\n\n\nnon_scored_words = set()\nReadNonScoredWords(args.non_scored_words_in)\n\nnum_lines = 0\nnum_correct_lines = 0\n# ref_change_stats will be a map from a string like\n# 'foo -> bar' to an integer count; it keeps track of how much we changed\n# the reference.\nref_change_stats = defaultdict(int)\n# repetition_stats will be a map from strings like\n# 'a', or 'a b' (the repeated strings), to an integer count; like\n# ref_change_stats, it keeps track of how many changes we made\n# in allowing repetitions.\nrepetition_stats = defaultdict(int)\n\nProcessData()\nPrintNonScoredStats()\nPrintRepetitionStats()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/resolve_ctm_edits_overlaps.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2014  Johns Hopkins University (Authors: Daniel Povey)\n#           2014  Vijayaditya Peddinti\n#           2016  Vimal Manohar\n# Apache 2.0.\n\n\"\"\"\nScript to combine ctms edits with overlapping segments obtained from\nsmith-waterman alignment. This script is similar to utils/ctm/resolve_ctm_edits.py,\nwhere the overlapping region is just split in two. The approach here is a\nlittle more advanced since we have access to the WER\n(w.r.t. the reference text). It finds the WER of the overlapped region\nin the two overlapping segments, and chooses the better one.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport collections\nimport logging\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\n    '%(asctime)s [%(pathname)s:%(lineno)s - '\n    '%(funcName)s - %(levelname)s ] %(message)s')\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    \"\"\"gets command line arguments\"\"\"\n\n    usage = \"\"\" Python script to resolve overlaps in ctms \"\"\"\n    parser = argparse.ArgumentParser(usage)\n    parser.add_argument('segments', type=argparse.FileType('r'),\n                        help='use segments to resolve overlaps')\n    parser.add_argument('ctm_edits_in', type=argparse.FileType('r'),\n                        help='input_ctm_file')\n    parser.add_argument('ctm_edits_out', type=argparse.FileType('w'),\n                        help='output_ctm_file')\n    parser.add_argument('--verbose', type=int, default=0,\n                        help=\"Higher value for more verbose logging.\")\n    args = parser.parse_args()\n\n    if args.verbose > 2:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n\n    return args\n\n\ndef read_segments(segments_file):\n    \"\"\"Read from 
segments and returns two dictionaries,\n    {utterance-id: (recording_id, start_time, end_time)}\n    {recording_id: list-of-utterances}\n    \"\"\"\n    segments = {}\n    reco2utt = collections.defaultdict(list)\n\n    num_lines = 0\n    for line in segments_file:\n        num_lines += 1\n        parts = line.strip().split()\n        assert len(parts) in [4, 5]\n        segments[parts[0]] = (parts[1], float(parts[2]), float(parts[3]))\n        reco2utt[parts[1]].append(parts[0])\n\n    logger.info(\"Read %d lines from segments file %s\",\n                num_lines, segments_file.name)\n    segments_file.close()\n\n    return segments, reco2utt\n\n\ndef read_ctm_edits(ctm_edits_file, segments):\n    \"\"\"Read CTM from ctm_edits_file into a dictionary of values indexed by the\n    recording.\n    It is assumed to be sorted by the recording-id and utterance-id.\n\n    Returns a dictionary {recording : ctm_edit_lines}\n        where ctm_lines is a list of lines of CTM corresponding to the\n        utterances in the recording.\n        The format is as follows:\n        [[(utteranceA, channelA, start_time1, duration1, hyp_word1, conf1, ref_word1, edit_type1),\n          (utteranceA, channelA, start_time2, duration2, hyp_word2, conf2, ref_word2, edit_type2),\n          ...\n          (utteranceA, channelA, start_timeN, durationN, hyp_wordN, confN, ref_wordN, edit_typeN)],\n         [(utteranceB, channelB, start_time1, duration1, hyp_word1, conf1, ref_word1, edit_type1),\n          (utteranceB, channelB, start_time2, duration2, hyp_word2, conf2, ref_word2, edit_type2),\n          ...],\n         ...\n         [...\n          (utteranceZ, channelZ, start_timeN, durationN, hyp_wordN, confN, ref_wordN, edit_typeN)]\n        ]\n\n    Arguments:\n        segments - Dictionary containing the output of read_segments()\n            { utterance_id: (recording_id, start_time, end_time) }\n    \"\"\"\n    ctm_edits = {}\n\n    num_lines = 0\n    for line in ctm_edits_file:\n      
  num_lines += 1\n        parts = line.split()\n\n        utt = parts[0]\n        reco = segments[utt][0]\n\n        if (reco, utt) not in ctm_edits:\n            ctm_edits[(reco, utt)] = []\n\n        ctm_edits[(reco, utt)].append(\n            [parts[0], parts[1], float(parts[2]), float(parts[3]),\n             parts[4], float(parts[5])] + parts[6:])\n\n    logger.info(\"Read %d lines from CTM %s\", num_lines, ctm_edits_file.name)\n\n    ctm_edits_file.close()\n    return ctm_edits\n\n\ndef wer(ctm_edit_lines):\n    \"\"\"Returns the fraction of non-silence CTM edit lines that are errors\n    (ins, del or sub).\n    \"\"\"\n    num_words = 0\n    num_incorrect_words = 0\n    for line in ctm_edit_lines:\n        if line[7] != 'sil':\n            num_words += 1\n            if line[7] in ['ins', 'del', 'sub']:\n                num_incorrect_words += 1\n    if num_words == 0 and num_incorrect_words > 0:\n        return float('inf')\n    if num_words == 0 and num_incorrect_words == 0:\n        return 0\n    return float(num_incorrect_words) / num_words\n\n\ndef choose_best_ctm_lines(first_lines, second_lines,\n                          window_length, overlap_length):\n    \"\"\"Returns the index (0 for first_lines, 1 for second_lines) of the set\n    of CTM edit lines that has the lower WER. If the WERs are equal, the\n    first set (index 0) is chosen.\n    \"\"\"\n    i, best_lines = min((0, first_lines),\n                        (1, second_lines),\n                        key=lambda x: wer(x[1]))\n    return i\n\n\ndef resolve_overlaps(ctm_edits, segments):\n    \"\"\"Resolve overlaps within segments of the same recording.\n\n    Returns new lines of CTM for the recording.\n\n    Arguments:\n        ctm_edits - The CTM lines for a single recording. This is one value\n            stored in the dictionary read by read_ctm_edits(). 
Assumes that the lines\n            are sorted by the utterance-ids.\n            The format is the following:\n            [[(utteranceA, channelA, start_time1, duration1, hyp_word1, conf1),\n              (utteranceA, channelA, start_time2, duration2, hyp_word2, conf2),\n              ...\n              (utteranceA, channelA, start_timeN, durationN, hyp_wordN, confN)\n             ],\n             [(utteranceB, channelB, start_time1, duration1, hyp_word1, conf1),\n              (utteranceB, channelB, start_time2, duration2, hyp_word2, conf2),\n              ...],\n             ...\n             [...\n              (utteranceZ, channelZ, start_timeN, durationN, hyp_wordN, confN)]\n            ]\n            Expects this to be non-empty.\n        segments - Dictionary containing the output of read_segments()\n            { utterance_id: (recording_id, start_time, end_time) }\n        \"\"\"\n    total_ctm_edits = []\n    assert len(ctm_edits) > 0\n\n    # First column of first line in CTM for first utterance\n    next_utt = ctm_edits[0][0][0]\n    for utt_index, ctm_edits_for_cur_utt in enumerate(ctm_edits):\n        if utt_index == len(ctm_edits) - 1:\n            break\n\n        if len(ctm_edits_for_cur_utt) == 0:\n            next_utt = ctm_edits[utt_index + 1][0][0]\n            continue\n\n        cur_utt = ctm_edits_for_cur_utt[0][0]\n        if cur_utt != next_utt:\n            logger.error(\n                \"Current utterance %s is not the same as the next \"\n                \"utterance %s in previous iteration.\\n\"\n                \"CTM is not sorted by utterance-id?\",\n                cur_utt, next_utt)\n            raise ValueError\n\n        # Assumption here is that the segments are written in\n        # consecutive order in time.\n        ctm_edits_for_next_utt = ctm_edits[utt_index + 1]\n        next_utt = ctm_edits_for_next_utt[0][0]\n        if segments[next_utt][1] < segments[cur_utt][1]:\n            logger.error(\n                \"Next 
utterance %s <= Current utterance %s. \"\n                \"CTM edits are not sorted by utterance-id.\",\n                next_utt, cur_utt)\n            raise ValueError\n\n        try:\n            # length of this utterance\n            window_length = segments[cur_utt][2] - segments[cur_utt][1]\n\n            # overlap of this segment with the next segment\n            # i.e. current_utterance_end_time - next_utterance_start_time\n            # Note: It is possible for this to be negative when there is\n            # actually no overlap between consecutive segments.\n            try:\n                overlap = segments[cur_utt][2] - segments[next_utt][1]\n            except KeyError:\n                logger.error(\"Could not find utterance %s in segments\",\n                             next_utt)\n                raise\n\n            # find the first word that is in the overlap\n            # at the end of the cur utt\n            try:\n                cur_utt_end_index = next(\n                    (i for i, line in enumerate(ctm_edits_for_cur_utt)\n                     if line[2] + line[3] / 2.0 > window_length - overlap))\n            except StopIteration:\n                cur_utt_end_index = len(ctm_edits_for_cur_utt)\n\n            cur_utt_end_lines = ctm_edits_for_cur_utt[cur_utt_end_index:]\n\n            # find the first word that is not in the overlap\n            # at the beginning of the next utt\n            try:\n                next_utt_start_index = next(\n                    (i for i, line in enumerate(ctm_edits_for_next_utt)\n                     if line[2] + line[3] / 2.0 > overlap))\n            except StopIteration:\n                next_utt_start_index = 0\n\n            next_utt_start_lines = ctm_edits_for_next_utt[:\n                                                          next_utt_start_index]\n\n            choose_index = choose_best_ctm_lines(\n                cur_utt_end_lines, next_utt_start_lines,\n                window_length, overlap)\n\n      
      # Ignore the hypotheses beyond this midpoint. They will be\n            # considered as part of the next segment.\n            if choose_index == 1:\n                total_ctm_edits.extend(\n                    ctm_edits_for_cur_utt[:cur_utt_end_index])\n            else:\n                total_ctm_edits.extend(ctm_edits_for_cur_utt)\n\n            if choose_index == 0 and next_utt_start_index > 0:\n                # Update the ctm_edits_for_next_utt to include only the lines\n                # starting from index.\n                ctm_edits[utt_index + 1] = (\n                    ctm_edits_for_next_utt[next_utt_start_index:])\n            # else leave the ctm_edits as is.\n        except:\n            logger.error(\"Could not resolve overlaps between CTM edits for \"\n                         \"%s and %s\", cur_utt, next_utt)\n            logger.error(\"Current CTM:\")\n            for line in ctm_edits_for_cur_utt:\n                logger.error(ctm_edit_line_to_string(line))\n            logger.error(\"Next CTM:\")\n            for line in ctm_edits_for_next_utt:\n                logger.error(ctm_edit_line_to_string(line))\n            raise\n\n    # merge the last ctm entirely\n    total_ctm_edits.extend(ctm_edits[-1])\n\n    return total_ctm_edits\n\n\ndef ctm_edit_line_to_string(line):\n    \"\"\"Converts a line of CTM edit to string.\"\"\"\n    return \"{0} {1} {2} {3} {4} {5} {6}\".format(line[0], line[1], line[2],\n                                                line[3], line[4], line[5],\n                                                \" \".join(line[6:]))\n\n\ndef write_ctm_edits(ctm_edit_lines, out_file):\n    \"\"\"Writes CTM lines stored in a list to file.\"\"\"\n    for line in ctm_edit_lines:\n        print(ctm_edit_line_to_string(line), file=out_file)\n\n\ndef run(args):\n    \"\"\"this method does everything in this script\"\"\"\n    segments, reco2utt = read_segments(args.segments)\n    ctm_edits = read_ctm_edits(args.ctm_edits_in, 
segments)\n\n    for reco, utts in reco2utt.items():\n        ctm_edits_for_reco = []\n        for utt in sorted(utts, key=lambda x: segments[x][1]):\n            if (reco, utt) in ctm_edits:\n                ctm_edits_for_reco.append(ctm_edits[(reco, utt)])\n        try:\n            if len(ctm_edits_for_reco) == 0:\n                logger.warning('CTM edits for recording %s are empty.',\n                               reco)\n                continue   # Go to the next recording\n\n            # Process CTMs in the recordings\n            ctm_edits_for_reco = resolve_overlaps(ctm_edits_for_reco, segments)\n            write_ctm_edits(ctm_edits_for_reco, args.ctm_edits_out)\n        except Exception:\n            logger.error(\"Failed to process CTM edits for recording %s\",\n                         reco)\n            raise\n    args.ctm_edits_out.close()\n    # ctm_edits is keyed by (recording, utterance), so this counts utterances.\n    logger.info(\"Wrote CTM edits for %d utterances.\", len(ctm_edits))\n\n\ndef main():\n    \"\"\"The main function which parses arguments and calls run().\"\"\"\n    args = get_args()\n    try:\n        run(args)\n    except:\n        logger.error(\"Failed to resolve overlaps\", exc_info=True)\n        raise RuntimeError\n    finally:\n        try:\n            for f in [args.segments, args.ctm_edits_in, args.ctm_edits_out]:\n                if f is not None:\n                    f.close()\n        except IOError:\n            logger.error(\"Could not close some files. \"\n                         \"Disk error or broken pipes?\")\n            raise\n        except UnboundLocalError:\n            raise SystemExit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/retrieve_similar_docs.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0.\n\n\"\"\"This script retrieves documents similar to the query documents\nusing a similarity score based on the total TFIDF for all the terms in the\nquery document.\n\nSome terminology:\n    original utterance-id = The utterance-id of the original long audio segments\n        and the corresponding reference transcript\n    source-text = reference transcript\n    source-text-id = original utterance-id\n    sub-segment = Approximately 30s long chunk of the original utterance\n    query-id = utterance-id of the sub-segment\n    document = Approximately 1000 words of a source-text\n    doc-id = Id of the document\n\ne.g.\nfoo1 A B C D E F is in the original text file\nand foo1 foo 100 200 is in the original segments file.\n\nHere foo1 is the source-text-id and \"A B C D\" is the reference transcript. It\nis a 100s long segment from the recording foo.\n\nfoo1 is split into 30s long sub-segments as follows:\nfoo1-1 foo1 100 130\nfoo1-2 foo1 125 155\nfoo1-3 foo1 150 180\nfoo1-4 foo1 175 200\n\nfoo1-{1,2,3,4} are query-ids.\n\nThe source-text for foo1 is split into two-word documents.\ndoc1 A B\ndoc2 C D\ndoc3 E F\n\ndoc{1,2,3} are doc-ids.\n\n--source-text2doc-ids option is given a mapping that contains\nfoo1 doc1 doc2 doc3\n\n--query-id2source-text-id option is given a mapping that contains\nfoo1-1 foo1\nfoo1-2 foo1\nfoo1-3 foo1\nfoo1-4 foo1\n\nThe query TF-IDFs are all indexed by the utterance-id of the sub-segments\nof the original utterances.\nThe source TF-IDFs use the document-ids created by splitting the source-text\n(corresponding to original utterances) into documents.\n\nFor each query (sub-segment), we need to retrieve the documents that were\ncreated from the same original utterance that the sub-segment was from. For\nthis, we have to load the source TF-IDF that has those documents. 
This\ninformation is provided using the option --source-text-id2tfidf, which\nis like an SCP file with the first column being the source-text-id and the\nsecond column being the location of the TF-IDF for the documents corresponding\nto that source-text-id.\n\nThe output of this script is a file where the first column is the\nquery-id (i.e. sub-segment-id) and the remaining columns, of which there are\nat least one and at most (1 + 2 * num-neighbors-to-search),\nare comma-separated tuples\n<doc-id>,<start-fraction>,<end-fraction>, where <doc-id> is the document-id,\n<start-fraction> is the proportion of the document from the beginning\nthat needs to be in the retrieved set, and\n<end-fraction> is the proportion of the document from the end\nthat needs to be in the retrieved set.\nIf both <start-fraction> and <end-fraction> are 1, then the full document is\nadded to the retrieved set.\nSome examples of the lines in the output file are:\nfoo1-1 doc1,1,1\nfoo1-2 doc1,0,0.2 doc2,1,1 doc3,0.2,0\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\n\nimport tf_idf\n\n\nlogger = logging.getLogger(__name__)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(filename)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\n\nfor l in [logger, logging.getLogger('tf_idf'), logging.getLogger('libs')]:\n    l.setLevel(logging.DEBUG)\n    l.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script retrieves documents similar to the\n        query documents using a similarity score based on the total TFIDF for\n        all the terms in the query document.\n        See the beginning of the script for more details about the\n        arguments to the script.\"\"\")\n\n    parser.add_argument(\"--verbose\", type=int, default=0, choices=[0, 1, 
2, 3],\n                        help=\"Higher for more logging statements\")\n\n    parser.add_argument(\"--num-neighbors-to-search\", type=int, default=0,\n                        help=\"\"\"Number of neighboring documents to search\n                        around the one retrieved based on maximum tf-idf\n                        similarity. A value of 0 means only the document\n                        with the maximum tf-idf similarity is retrieved in\n                        full; its immediate neighbors contribute only the\n                        fraction given by --partial-doc-fraction.\"\"\")\n    parser.add_argument(\"--neighbor-tfidf-threshold\", type=float, default=0.9,\n                        help=\"\"\"Ignore neighbors whose tf-idf similarity\n                        with the query document is less than this factor\n                        times the best score.\"\"\")\n    parser.add_argument(\"--partial-doc-fraction\", type=float, default=0.2,\n                        help=\"\"\"The fraction of a neighboring document that\n                        will be part of the retrieved document set.\n                        If this is greater than 0, then a fraction of words\n                        from the neighboring documents is added to the\n                        retrieved document.\"\"\")\n\n    parser.add_argument(\"--source-text-id2doc-ids\",\n                        type=argparse.FileType('r'), required=True,\n                        help=\"\"\"A mapping from the source text to a list of\n                        documents that it is broken into\n                        <text-utterance-id> <document-id-1> ...\n                        <document-id-N>\"\"\")\n    parser.add_argument(\"--query-id2source-text-id\",\n                        type=argparse.FileType('r'), required=True,\n                        help=\"\"\"A mapping from the query document-id to a\n                        source text from which a document needs to be\n                        retrieved.\"\"\")\n    parser.add_argument(\"--source-text-id2tfidf\", 
type=argparse.FileType('r'),\n                        required=True,\n                        help=\"\"\"An SCP file for the TF-IDF for source\n                        documents indexed by the source-text-id.\"\"\")\n    parser.add_argument(\"--query-tfidf\", type=argparse.FileType('r'),\n                        required=True,\n                        help=\"\"\"Archive of TF-IDF objects for query documents\n                        indexed by the query-id.\n                        The format is\n                        query-id <TFIDF> ... </TFIDF>\n                        \"\"\")\n    parser.add_argument(\"--relevant-docs\", type=argparse.FileType('w'),\n                        required=True,\n                        help=\"\"\"Output archive of a list of source documents\n                        similar to a query document, indexed by the\n                        query document id.\"\"\")\n\n    args = parser.parse_args()\n\n    if args.partial_doc_fraction < 0 or args.partial_doc_fraction > 1:\n        logger.error(\"--partial-doc-fraction must be in [0,1]\")\n        raise ValueError\n\n    return args\n\n\ndef read_map(file_handle, num_values_per_key=None,\n             min_num_values_per_key=None, must_contain_unique_key=True):\n    \"\"\"Reads a map from a file into a dictionary and returns it.\n    Expects the map is stored in the file in the following format:\n    <key> <value-1> <value-2> ... 
<value-N>\n    The values are returned as a tuple stored in a dictionary indexed by the\n    \"key\".\n\n    Arguments:\n        file_handle - A handle to an opened input file containing the map\n        num_values_per_key - If provided, the function raises an error if\n                             the number of values read for a key in the input\n                             file does not match the \"num_values_per_key\"\n        min_num_values_per_key - If provided, the function raises an error\n                                 if the number of values read for a key in the\n                                 input file is less than\n                                 \"min_num_values_per_key\"\n        must_contain_unique_key - If set to True, then it is required that the\n                                  file has a unique key; otherwise this\n                                  function will exit with error.\n\n    Returns:\n        { key: tuple(values) }\n    \"\"\"\n    dict_map = {}\n    for line in file_handle:\n        try:\n            parts = line.strip().split()\n            key = parts[0]\n\n            if (num_values_per_key is not None\n                    and len(parts) - 1 != num_values_per_key):\n                logger.error(\n                    \"Expecting {0} columns; Got {1}.\".format(\n                        num_values_per_key + 1, len(parts)))\n                raise TypeError\n\n            if (min_num_values_per_key is not None\n                    and len(parts) - 1 < min_num_values_per_key):\n                logger.error(\n                    \"Expecting at least {0} columns; Got {1}.\".format(\n                        min_num_values_per_key + 1, len(parts)))\n                raise TypeError\n\n            if must_contain_unique_key and key in dict_map:\n                logger.error(\"Found duplicate key %s\", key)\n                raise TypeError\n\n            if num_values_per_key is not None and num_values_per_key == 1:\n                
dict_map[key] = parts[1]\n            else:\n                dict_map[key] = parts[1:]\n        except Exception:\n            logger.error(\"Failed reading line %s in file %s\",\n                         line, file_handle.name)\n            raise\n    file_handle.close()\n    return dict_map\n\n\ndef get_document_ids(source_docs, indexes):\n    indexes = sorted(\n        [(key, value[0], value[1]) for key, value in indexes.items()],\n        key=lambda x: x[0])\n\n    doc_ids = []\n    for i, partial_start, partial_end in indexes:\n        try:\n            doc_ids.append((source_docs[i], partial_start, partial_end))\n        except IndexError:\n            pass\n    return doc_ids\n\n\ndef run(args):\n    \"\"\"The main function that does all the processing.\n    Takes as argument the Namespace object obtained from _get_args().\n    \"\"\"\n    query_id2source_text_id = read_map(args.query_id2source_text_id,\n                                       num_values_per_key=1)\n    source_text_id2doc_ids = read_map(args.source_text_id2doc_ids,\n                                      min_num_values_per_key=1)\n\n    source_text_id2tfidf = read_map(args.source_text_id2tfidf,\n                                    num_values_per_key=1)\n\n    num_queries = 0\n    prev_source_text_id = \"\"\n    for query_id, query_tfidf in tf_idf.read_tfidf_ark(args.query_tfidf):\n        num_queries += 1\n\n        # The source text from which a document is to be retrieved for the\n        # input query\n        source_text_id = query_id2source_text_id[query_id]\n\n        if prev_source_text_id != source_text_id:\n            source_tfidf = tf_idf.TFIDF()\n            source_tfidf.read(\n                open(source_text_id2tfidf[source_text_id]))\n            prev_source_text_id = source_text_id\n\n        # The source documents corresponding to the source text.\n        # This is set of documents which will be searched over for the query.\n        source_doc_ids = 
source_text_id2doc_ids[source_text_id]\n\n        scores = query_tfidf.compute_similarity_scores(\n            source_tfidf, source_docs=source_doc_ids, query_id=query_id)\n\n        assert len(scores) > 0, (\n            \"Did not get scores for query {0}\".format(query_id))\n\n        if args.verbose > 2:\n            for tup, score in scores.items():\n                logger.debug(\"Score, {num}: {0} {1} {2}\".format(\n                    tup[0], tup[1], score, num=num_queries))\n\n        best_index, best_doc_id = max(\n            enumerate(source_doc_ids), key=lambda x: scores[(query_id, x[1])])\n        best_score = scores[(query_id, best_doc_id)]\n\n        assert source_doc_ids[best_index] == best_doc_id\n        assert best_score == max([scores[(query_id, x)]\n                                  for x in source_doc_ids])\n\n        best_indexes = {}\n\n        if args.num_neighbors_to_search == 0:\n            best_indexes[best_index] = (1, 1)\n            if best_index > 0:\n                best_indexes[best_index - 1] = (0, args.partial_doc_fraction)\n            if best_index < len(source_doc_ids) - 1:\n                best_indexes[best_index + 1] = (args.partial_doc_fraction, 0)\n        else:\n            excluded_indexes = set()\n            for index in range(\n                    max(best_index - args.num_neighbors_to_search, 0),\n                    min(best_index + args.num_neighbors_to_search + 1,\n                        len(source_doc_ids))):\n                if (scores[(query_id, source_doc_ids[index])]\n                        >= args.neighbor_tfidf_threshold * best_score):\n                    best_indexes[index] = (1, 1)    # Type 2\n                    if index > 0 and index - 1 in excluded_indexes:\n                        try:\n                            # Type 1 and 3\n                            start_frac, end_frac = best_indexes[index - 1]\n                            assert end_frac == 0\n                            
best_indexes[index - 1] = (\n                                start_frac, args.partial_doc_fraction)\n                        except KeyError:\n                            # Type 1\n                            best_indexes[index - 1] = (\n                                0, args.partial_doc_fraction)\n                else:\n                    excluded_indexes.add(index)\n                    if index > 0 and index - 1 not in excluded_indexes:\n                        # Type 3\n                        best_indexes[index] = (args.partial_doc_fraction, 0)\n\n        best_docs = get_document_ids(source_doc_ids, best_indexes)\n\n        assert len(best_docs) > 0, (\n            \"Did not get best docs for query {0}\\n\"\n            \"Scores: {1}\\n\"\n            \"Source docs: {2}\\n\"\n            \"Best index: {best_index}, score: {best_score}\\n\".format(\n                query_id, scores, source_doc_ids,\n                best_index=best_index, best_score=best_score))\n        assert (best_doc_id, 1.0, 1.0) in best_docs\n\n        print (\"{0} {1}\".format(query_id, \" \".join(\n            [\"%s,%.2f,%.2f\" % x for x in best_docs])),\n               file=args.relevant_docs)\n\n    if num_queries == 0:\n        raise RuntimeError(\"Failed to retrieve any document.\")\n\n    logger.info(\"Retrieved similar documents for \"\n                \"%d queries\", num_queries)\n\n\ndef main():\n    args = get_args()\n\n    if args.verbose > 1:\n        handler.setLevel(logging.DEBUG)\n    try:\n        run(args)\n    finally:\n        for f in [args.query_id2source_text_id, args.source_text_id2doc_ids,\n                  args.relevant_docs, args.query_tfidf, args.source_text_id2tfidf]:\n            f.close()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/segment_ctm_edits.py",
    "content": "#!/usr/bin/env python3\n\n\n# Copyright 2016   Vimal Manohar\n#           2016   Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport sys, operator, argparse, os\nfrom collections import defaultdict\n\n# This script reads 'ctm-edits' file format that is produced by get_ctm_edits.py\n# and modified by modify_ctm_edits.py and taint_ctm_edits.py Its function is to\n# produce a segmentation and text from the ctm-edits input.\n\n# The ctm-edits file format that this script expects is as follows\n# <file-id> <channel> <start-time> <duration> <conf> <hyp-word> <ref-word> <edit> ['tainted']\n# [note: file-id is really utterance-id at this point].\n\nparser = argparse.ArgumentParser(\n    description = \"This program produces segmentation and text information \"\n    \"based on reading ctm-edits input format which is produced by \"\n    \"steps/cleanup/internal/get_ctm_edits.py, steps/cleanup/internal/modify_ctm_edits.py and \"\n    \"steps/cleanup/internal/taint_ctm_edits.py.\",\n formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\nparser.add_argument(\"--min-segment-length\", type = float, default = 0.5,\n                    help = \"Minimum allowed segment length (in seconds) for any \"\n                    \"segment; shorter segments than this will be discarded.\")\nparser.add_argument(\"--min-new-segment-length\", type = float, default = 1.0,\n                    help = \"Minimum allowed segment length (in seconds) for newly \"\n                    \"created segments (i.e. not identical to the input utterances). 
\"\n                    \"Expected to be >= --min-segment-length.\")\nparser.add_argument(\"--frame-length\", type = float, default = 0.01,\n                    help = \"This only affects rounding of the output times; they will \"\n                    \"be constrained to multiples of this value.\")\nparser.add_argument(\"--max-tainted-length\", type = float, default = 0.05,\n                    help = \"Maximum allowed length of any 'tainted' line.  Note: \"\n                    \"'tainted' lines may only appear at the boundary of a \"\n                    \"segment\")\nparser.add_argument(\"--max-edge-silence-length\", type = float, default = 0.5,\n                    help = \"Maximum allowed length of silence if it appears at the \"\n                    \"edge of a segment (will be truncated).  This rule is \"\n                    \"relaxed if such truncation would take a segment below \"\n                    \"the --min-segment-length or --min-new-segment-length.\")\nparser.add_argument(\"--max-edge-non-scored-length\", type = float, default = 0.5,\n                    help = \"Maximum allowed length of a non-scored word (noise, cough, etc.) \"\n                    \"if it appears at the edge of a segment (will be truncated). \"\n                    \"This rule is relaxed if such truncation would take a \"\n                    \"segment below the --min-segment-length.\")\nparser.add_argument(\"--max-internal-silence-length\", type = float, default = 2.0,\n                    help = \"Maximum allowed length of silence if it appears inside a segment \"\n                    \"(will cause the segment to be split).\")\nparser.add_argument(\"--max-internal-non-scored-length\", type = float, default = 2.0,\n                    help = \"Maximum allowed length of a non-scored word (noise, etc.) if \"\n                    \"it appears inside a segment (will cause the segment to be \"\n                    \"split).  
Note: reference words which are real words but OOV \"\n                    \"are not included in this category.\")\nparser.add_argument(\"--unk-padding\", type = float, default = 0.05,\n                    help = \"Amount of padding with <unk> that we do if a segment boundary is \"\n                    \"next to errors (ins, del, sub).  That is, we add this amount of \"\n                    \"time to the segment and add the <unk> word to cover the acoustics. \"\n                    \"If nonzero, the --oov-symbol-file option must be supplied.\")\nparser.add_argument(\"--max-junk-proportion\", type = float, default = 0.1,\n                    help = \"Maximum proportion of the time of the segment that may \"\n                    \"consist of potentially bad data, in which we include 'tainted' lines of \"\n                    \"the ctm-edits input and unk-padding.\")\nparser.add_argument(\"--min-split-point-duration\", type=float, default=0.1,\n                    help=\"\"\"Minimum duration of silence or non-scored word\n                    to be considered a viable split point when\n                    truncating based on junk proportion.\"\"\")\nparser.add_argument(\"--max-deleted-words-kept-when-merging\", type = int, default = 1,\n                    help = \"When merging segments that are found to be overlapping or \"\n                    \"adjacent after all other processing, keep in the transcript the \"\n                    \"reference words that were deleted between the segments [if any] \"\n                    \"as long as there were no more than this many reference words. 
\"\n                    \"Setting this to zero will mean that any reference words that \"\n                    \"were deleted between the segments we're about to reattach will \"\n                    \"not appear in the generated transcript (so we'll match the hyp).\")\nparser.add_argument(\"--oov-symbol-file\", type = str, default = None,\n                    help = \"Filename of file such as data/lang/oov.txt which contains \"\n                    \"the text form of the OOV word, normally '<unk>'.  Supplied as \"\n                    \"a file to avoid complications with escaping.  Necessary if \"\n                    \"the --unk-padding option has a nonzero value (which it does \"\n                    \"by default.\")\nparser.add_argument(\"--ctm-edits-out\", type = str,\n                    help = \"Filename to output an extended version of the ctm-edits format \"\n                    \"with segment start and end points noted.  This file is intended to be \"\n                    \"read by humans; there are currently no scripts that will read it.\")\nparser.add_argument(\"--word-stats-out\", type = str,\n                    help = \"Filename for output of word-level stats, of the form \"\n                    \"'<word> <bad-proportion> <total-count-in-ref>', e.g. 'hello 0.12 12408', \"\n                    \"where the <bad-proportion> is the proportion of the time that this \"\n                    \"reference word does not make it into a segment.  It can help reveal words \"\n                    \"that have problematic pronunciations or are associated with \"\n                    \"transcription errors.\")\n\n\nparser.add_argument(\"non_scored_words_in\", metavar = \"<non-scored-words-file>\",\n                    help=\"Filename of file containing a list of non-scored words, \"\n                    \"one per line. 
See steps/cleanup/internal/get_nonscored_words.py.\")\nparser.add_argument(\"ctm_edits_in\", metavar = \"<ctm-edits-in>\",\n                    help = \"Filename of input ctm-edits file. \"\n                    \"Use /dev/stdin for standard input.\")\nparser.add_argument(\"text_out\", metavar = \"<text-out>\",\n                    help = \"Filename of output text file (same format as data/train/text, i.e. \"\n                    \"<new-utterance-id> <word1> <word2> ... <wordN>\")\nparser.add_argument(\"segments_out\", metavar = \"<segments-out>\",\n                    help = \"Filename of output segments.  This has the same format as data/train/segments, \"\n                    \"but instead of <recording-id>, the second field is the old utterance-id, i.e \"\n                    \"<new-utterance-id> <old-utterance-id> <start-time> <end-time>\")\n\nargs = parser.parse_args()\n\n\n\n\ndef IsTainted(split_line_of_utt):\n    return len(split_line_of_utt) > 8 and split_line_of_utt[8] == 'tainted'\n\n# This function returns a list of pairs (start-index, end-index) representing\n# the cores of segments (so if a pair is (s, e), then the core of a segment\n# would span (s, s+1, ... e-1).\n#\n# By the 'core of a segment', we mean a sequence of ctm-edits lines including at\n# least one 'cor' line and a contiguous sequence of other lines of the type\n# 'cor', 'fix' and 'sil' that must be not tainted.  The segment core excludes\n# any tainted lines at the edge of a segment, which will be added later.\n#\n# We only initiate segments when it contains something correct and not realized\n# as unk (i.e. ref==hyp); and we extend it with anything that is 'sil' or 'fix'\n# or 'cor' that is not tainted.  
Contiguous regions of 'true' in the resulting\n# boolean array will then become the cores of prototype segments, and we'll add\n# any adjacent tainted words (or parts of them).\ndef ComputeSegmentCores(split_lines_of_utt):\n    num_lines = len(split_lines_of_utt)\n    line_is_in_segment_core = [ False] * num_lines\n    for i in range(num_lines):\n        if split_lines_of_utt[i][7] == 'cor' and \\\n            split_lines_of_utt[i][4] == split_lines_of_utt[i][6]:\n            line_is_in_segment_core[i] = True\n\n    # extend each proto-segment forwards as far as we can:\n    for i in range(1, num_lines):\n        if line_is_in_segment_core[i-1] and not line_is_in_segment_core[i]:\n            edit_type = split_lines_of_utt[i][7]\n            if not IsTainted(split_lines_of_utt[i]) and \\\n                (edit_type == 'cor' or edit_type == 'sil' or edit_type == 'fix'):\n                line_is_in_segment_core[i] = True\n\n    # extend each proto-segment backwards as far as we can:\n    for i in reversed(range(0, num_lines - 1)):\n        if line_is_in_segment_core[i+1] and not line_is_in_segment_core[i]:\n            edit_type = split_lines_of_utt[i][7]\n            if not IsTainted(split_lines_of_utt[i]) and \\\n               (edit_type == 'cor' or edit_type == 'sil' or edit_type == 'fix'):\n                line_is_in_segment_core[i] = True\n\n\n    segment_ranges = []\n    cur_segment_start = None\n    for i in range(0, num_lines):\n        if line_is_in_segment_core[i]:\n            if cur_segment_start == None:\n                cur_segment_start = i\n        else:\n            if cur_segment_start != None:\n                segment_ranges.append( (cur_segment_start, i) )\n                cur_segment_start = None\n    if cur_segment_start != None:\n        segment_ranges.append( (cur_segment_start, num_lines) )\n\n    return segment_ranges\n\nclass Segment(object):\n    def __init__(self, split_lines_of_utt, start_index, end_index, debug_str = None):\n        
self.split_lines_of_utt = split_lines_of_utt\n        # start_index is the index of the first line that appears in this\n        # segment, and end_index is one past the last line.  This does not\n        # include unk-padding.\n        self.start_index = start_index\n        self.end_index = end_index\n        # If the following values are nonzero, then when we create the segment\n        # we will add <unk> at the start and end of the segment [representing\n        # partial words], with this amount of additional audio.\n        self.start_unk_padding = 0.0\n        self.end_unk_padding = 0.0\n\n        # debug_str keeps track of the 'core' of the segment.\n        if debug_str is None:\n            debug_str = 'core-start={0},core-end={1}'.format(start_index,end_index)\n        self.debug_str = debug_str\n\n        # This gives the proportion of the time of the first line in the segment\n        # that we keep.  Usually 1.0 but may be less if we've trimmed away some\n        # proportion of the time.\n        self.start_keep_proportion = 1.0\n        # This gives the proportion of the time of the last line in the segment\n        # that we keep.  Usually 1.0 but may be less if we've trimmed away some\n        # proportion of the time.\n        self.end_keep_proportion = 1.0\n\n    # This is stage 1 of segment processing (after creating the boundaries of the\n    # core of the segment, which is done outside of this class).\n    #\n    # This function may reduce start_index and/or increase end_index by\n    # including a single adjacent 'tainted' line from the ctm-edits file.  This\n    # is only done if the lines at the boundaries of the segment are currently\n    # real non-silence words and not non-scored words.  
The idea is that we\n    # probably don't want to start or end the segment right at the boundary of a\n    # real word, we want to add some kind of padding.\n    def PossiblyAddTaintedLines(self):\n        global non_scored_words\n        split_lines_of_utt = self.split_lines_of_utt\n        # we're iterating over the segment (start, end)\n        for b in [False, True]:\n            if b:\n                boundary_index = self.end_index - 1\n                adjacent_index = self.end_index\n            else:\n                boundary_index = self.start_index\n                adjacent_index = self.start_index - 1\n            if adjacent_index >= 0 and adjacent_index < len(split_lines_of_utt):\n                # only consider merging the adjacent word into the segment if we're not\n                # at a segment boundary.\n                adjacent_line_is_tainted = IsTainted(split_lines_of_utt[adjacent_index])\n                # if the adjacent line wasn't tainted, then there must have been\n                # another stronger reason why we didn't include it in the core\n                # of the segment (probably that it was an ins, del or sub), so\n                # there is no point considering it.\n                if adjacent_line_is_tainted:\n                    boundary_edit_type = split_lines_of_utt[boundary_index][7]\n                    # field 4 holds the hyp word (the same field that\n                    # ComputeSegmentCores compares against the ref word).\n                    boundary_hyp_word = split_lines_of_utt[boundary_index][4]\n                    # we only add the tainted line to the segment if the word at\n                    # the boundary was a non-silence word that was correctly\n                    # decoded and not fixed [see modify_ctm_edits.py.]\n                    if boundary_edit_type == 'cor' and \\\n                       boundary_hyp_word not in non_scored_words:\n                        # Add the adjacent tainted line to the segment.\n                        if b:\n                            self.end_index += 1\n                        else:\n                            
self.start_index -= 1\n\n    # This is stage 2 of segment processing.\n    # This function will split a segment into multiple pieces if any of the\n    # internal [non-boundary] silences or non-scored words are longer\n    # than the allowed values --max-internal-silence-length and\n    # --max-internal-non-scored-length.  This function returns a\n    # list of segments.  In the normal case (where there is no splitting)\n    # it just returns an array with a single element 'self'.\n    def PossiblySplitSegment(self):\n        global non_scored_words, args\n        # make sure the segment hasn't been processed more than we expect.\n        assert self.start_unk_padding == 0.0 and self.end_unk_padding == 0.0 and \\\n              self.start_keep_proportion == 1.0 and self.end_keep_proportion == 1.0\n        segments = []  # the answer\n        cur_start_index = self.start_index\n        cur_start_is_split = False\n        # only consider splitting at non-boundary lines.  [we'd just truncate\n        # the boundary lines.]\n        for index_to_split_at in range(cur_start_index + 1, self.end_index - 1):\n            this_split_line = self.split_lines_of_utt[index_to_split_at]\n            this_duration = float(this_split_line[3])\n            this_edit_type = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            if (this_edit_type == 'sil' and this_duration > args.max_internal_silence_length) or \\\n               (this_ref_word in non_scored_words and this_duration > args.max_internal_non_scored_length):\n                # We split this segment at this index, dividing the word in two\n                # [later on, in PossiblyTruncateBoundaries, it may be further\n                # truncated.]\n                # Note: we use 'index_to_split_at + 1' because the Segment constructor\n                # takes an 'end-index' which is interpreted as one past the end.\n                new_segment = Segment(self.split_lines_of_utt, cur_start_index,\n   
                                   index_to_split_at + 1, self.debug_str)\n                if cur_start_is_split:\n                    new_segment.start_keep_proportion = 0.5\n                new_segment.end_keep_proportion = 0.5\n                cur_start_is_split = True\n                cur_start_index = index_to_split_at\n                segments.append(new_segment)\n        if len(segments) == 0:  # We did not split.\n            segments.append(self)\n        else:\n            # We did split.  Add the very last segment.\n            new_segment = Segment(self.split_lines_of_utt, cur_start_index,\n                                  self.end_index, self.debug_str)\n            assert cur_start_is_split\n            new_segment.start_keep_proportion = 0.5\n            segments.append(new_segment)\n        return segments\n\n\n    # This is stage 3 of segment processing.  It will truncate the silences and\n    # non-scored words at the segment boundaries if they are longer than the\n    # --max-edge-silence-length and --max-edge-non-scored-length respectively\n    # (and to the extent that this wouldn't take us below the\n    # --min-segment-length or --min-new-segment-length).\n    def PossiblyTruncateBoundaries(self):\n        for b in [True, False]:\n            if b:\n                this_index = self.start_index\n            else:\n                this_index = self.end_index - 1\n            this_split_line = self.split_lines_of_utt[this_index]\n            truncated_duration = None\n            this_duration = float(this_split_line[3])\n            this_edit = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            if this_edit == 'sil' and \\\n               this_duration > args.max_edge_silence_length:\n                truncated_duration = args.max_edge_silence_length\n            elif this_ref_word in non_scored_words and \\\n                 this_duration > args.max_edge_non_scored_length:\n                truncated_duration = 
args.max_edge_non_scored_length\n            if truncated_duration != None:\n                keep_proportion = truncated_duration / this_duration\n                if b:\n                    self.start_keep_proportion = keep_proportion\n                else:\n                    self.end_keep_proportion = keep_proportion\n\n    # This relaxes the segment-boundary truncation of\n    # PossiblyTruncateBoundaries(), if it would take us below\n    # min-new-segment-length or min-segment-length.  Note: this does not relax\n    # the boundary truncation for a particular boundary (start or end) if that\n    # boundary corresponds to a 'tainted' line of the ctm (because it's\n    # dangerous to include too much 'tainted' audio).\n    def RelaxBoundaryTruncation(self):\n        # this should be called before adding unk padding.\n        assert self.start_unk_padding == self.end_unk_padding == 0.0\n        if self.start_keep_proportion == self.end_keep_proportion == 1.0:\n            return  # nothing to do there was no truncation.\n        length_cutoff = max(args.min_new_segment_length, args.min_segment_length)\n        length_with_truncation = self.Length()\n        if length_with_truncation >= length_cutoff:\n            return  # Nothing to do.\n        orig_start_keep_proportion = self.start_keep_proportion\n        orig_end_keep_proportion = self.end_keep_proportion\n        if not IsTainted(self.split_lines_of_utt[self.start_index]):\n            self.start_keep_proportion = 1.0\n        if not IsTainted(self.split_lines_of_utt[self.end_index - 1]):\n            self.end_keep_proportion = 1.0\n        length_with_relaxed_boundaries = self.Length()\n        if length_with_relaxed_boundaries <= length_cutoff:\n            # Completely undo the truncation [to the extent allowed by the\n            # presence of tainted lines at the start/end] if, even without\n            # truncation, we'd be below the length cutoff.  
This segment may be\n            # removed later on (but it may not, if removing truncation makes us\n            # identical to the input utterance, and the length is between\n            # min_segment_length and min_new_segment_length).\n            return\n        # Next, compute an interpolation constant a such that the\n        # {start,end}_keep_proportion values will equal a *\n        # [values-computed-by-PossiblyTruncateBoundaries()] + (1-a) * [completely-relaxed-values].\n        # we're solving the equation:\n        # length_cutoff = a * length_with_truncation + (1-a) * length_with_relaxed_boundaries\n        # -> length_cutoff - length_with_relaxed_boundaries =\n        #        a * (length_with_truncation - length_with_relaxed_boundaries)\n        # -> a = (length_cutoff - length_with_relaxed_boundaries) / (length_with_truncation - length_with_relaxed_boundaries)\n        a = (length_cutoff - length_with_relaxed_boundaries) / \\\n            (length_with_truncation - length_with_relaxed_boundaries)\n        if a < 0.0 or a > 1.0:\n            print(\"segment_ctm_edits.py: bad 'a' value = {0}\".format(a), file = sys.stderr)\n            return\n        self.start_keep_proportion = \\\n           a * orig_start_keep_proportion + (1-a) * self.start_keep_proportion\n        self.end_keep_proportion = \\\n           a * orig_end_keep_proportion + (1-a) * self.end_keep_proportion\n        if not abs(self.Length() - length_cutoff) < 0.01:\n            print(\"segment_ctm_edits.py: possible problem relaxing boundary \"\n                  \"truncation, length is {0} vs {1}\".format(self.Length(), length_cutoff),\n                  file = sys.stderr)\n\n\n    # This is stage 4 of segment processing.\n    # This function may set start_unk_padding and end_unk_padding to nonzero\n    # values.  
This is done if the current boundary words are real, scored\n    # words and we're not next to the beginning or end of the utterance.\n    def PossiblyAddUnkPadding(self):\n        for b in [True, False]:\n            if b:\n                this_index = self.start_index\n            else:\n                this_index = self.end_index - 1\n            this_split_line = self.split_lines_of_utt[this_index]\n            this_start_time = float(this_split_line[2])\n            this_ref_word = this_split_line[6]\n            this_edit = this_split_line[7]\n            if this_edit == 'cor' and not this_ref_word in non_scored_words:\n                # we can consider adding unk-padding.\n                if b: # start of utterance.\n                    unk_padding = args.unk_padding\n                    if unk_padding > this_start_time:  # close to beginning of file\n                        unk_padding = this_start_time\n                    # If we could add less than half of the specified\n                    # unk-padding, don't add any (because when we add\n                    # unk-padding we add the unknown-word symbol '<unk>', and if\n                    # there isn't enough space to traverse the HMM we don't want\n                    # to do it at all.\n                    if unk_padding < 0.5 * args.unk_padding:\n                        unk_padding = 0.0\n                    self.start_unk_padding = unk_padding\n                else: # end of utterance.\n                    this_end_time = this_start_time + float(this_split_line[3])\n                    last_line = self.split_lines_of_utt[-1]\n                    utterance_end_time = float(last_line[2]) + float(last_line[3])\n                    max_allowable_padding = utterance_end_time - this_end_time\n                    assert max_allowable_padding > -0.01\n                    unk_padding = args.unk_padding\n                    if unk_padding > max_allowable_padding:\n                        unk_padding = 
max_allowable_padding\n                    # If we could add less than half of the specified\n                    # unk-padding, don't add any (because when we add\n                    # unk-padding we add the unknown-word symbol '<unk>', and if\n                    # there isn't enough space to traverse the HMM we don't want\n                    # to do it at all.\n                    if unk_padding < 0.5 * args.unk_padding:\n                        unk_padding = 0.0\n                    self.end_unk_padding = unk_padding\n\n    # This function will merge the segment in 'other' with the segment\n    # in 'self'.  It is only to be called when 'self' and 'other' are from\n    # the same utterance, 'other' is after 'self' in time order (based on\n    # the original segment cores), and self.EndTime() >= other.StartTime().\n    # Note: in this situation there will normally be deleted words\n    # between the two segments.  What this program does with the deleted\n    # words depends on '--max-deleted-words-kept-when-merging'.  
If there\n    # were any inserted words in the transcript (less likely), this\n    # program will keep the reference.\n    def MergeWithSegment(self, other):\n        assert self.EndTime() >= other.StartTime() and \\\n               self.StartTime() < other.EndTime() and \\\n               self.split_lines_of_utt is other.split_lines_of_utt\n        orig_self_end_index = self.end_index\n        self.debug_str = \"({0}/merged-with/{1})\".format(self.debug_str, other.debug_str)\n        # everything that relates to the end of this segment gets copied\n        # from 'other'.\n        self.end_index = other.end_index\n        self.end_unk_padding = other.end_unk_padding\n        self.end_keep_proportion = other.end_keep_proportion\n        # The next thing we have to do is to go over any lines of the ctm that\n        # appear between 'self' and 'other', or are shared between both (this\n        # would only happen for tainted silence or non-scored-word segments),\n        # and decide what to do with them.  We'll keep the reference for any\n        # substitutions or insertions (which anyway are unlikely to appear\n        # in these merged segments).  
Note: most of this happens in self.Text(),\n        # but at this point we need to decide whether to mark any deletions\n        # as 'discard-this-word'.\n        first_index_of_overlap = min(orig_self_end_index - 1, other.start_index)\n        last_index_of_overlap = max(orig_self_end_index - 1, other.start_index)\n        num_deleted_words = 0\n        for i in range(first_index_of_overlap, last_index_of_overlap + 1):\n            edit_type = self.split_lines_of_utt[i][7]\n            if edit_type == 'del':\n                num_deleted_words += 1\n        if num_deleted_words > args.max_deleted_words_kept_when_merging:\n            for i in range(first_index_of_overlap, last_index_of_overlap + 1):\n                if self.split_lines_of_utt[i][7] == 'del':\n                    self.split_lines_of_utt[i].append('do-not-include-in-text')\n\n    # Returns the start time of the segment (within the enclosing utterance).\n    # This is before any rounding.\n    def StartTime(self):\n        first_line = self.split_lines_of_utt[self.start_index]\n        first_line_start = float(first_line[2])\n        first_line_duration = float(first_line[3])\n        first_line_end = first_line_start + first_line_duration\n        return first_line_end - self.start_unk_padding \\\n              - (first_line_duration * self.start_keep_proportion)\n\n    # Returns some string-valued information about 'this' that is useful for debugging.\n    def DebugInfo(self):\n        return 'start=%d,end=%d,unk-padding=%.2f,%.2f,keep-proportion=%.2f,%.2f,' % \\\n            (self.start_index, self.end_index, self.start_unk_padding,\n             self.end_unk_padding, self.start_keep_proportion, self.end_keep_proportion) + \\\n         self.debug_str\n\n    # Returns the end time of the segment (within the enclosing utterance).\n    # This is before any rounding.\n    def EndTime(self):\n        last_line = self.split_lines_of_utt[self.end_index - 1]\n        last_line_start = float(last_line[2])\n        last_line_duration = 
float(last_line[3])\n        return last_line_start + (last_line_duration * self.end_keep_proportion) \\\n             + self.end_unk_padding\n\n    # Returns the segment length in seconds.\n    def Length(self):\n        return self.EndTime() - self.StartTime()\n\n    def IsWholeUtterance(self):\n        # returns true if this segment corresponds to the whole utterance that\n        # it's a part of (i.e. its start/end time are zero and the end-time of\n        # the last segment.\n        last_line_of_utt = self.split_lines_of_utt[-1]\n        last_line_end_time = float(last_line_of_utt[2]) + float(last_line_of_utt[3])\n        return abs(self.StartTime() - 0.0) < 0.001 and \\\n               abs(self.EndTime() - last_line_end_time) < 0.001\n\n    # Returns the proportion of the duration of this segment that consists of\n    # unk-padding and tainted lines of input (will be between 0.0 and 1.0).\n    def JunkProportion(self):\n        # Note: only the first and last lines could possibly be tainted as\n        # that's how we create the segments; and if either or both are tainted\n        # the utterance must contain other lines, so double-counting is not a\n        # problem.\n        junk_duration = self.start_unk_padding + self.end_unk_padding\n        first_split_line = self.split_lines_of_utt[self.start_index]\n        if IsTainted(first_split_line):\n            first_duration = float(first_split_line[3])\n            junk_duration += first_duration * self.start_keep_proportion\n        last_split_line = self.split_lines_of_utt[self.end_index - 1]\n        if IsTainted(last_split_line):\n            last_duration = float(last_split_line[3])\n            junk_duration += last_duration * self.end_keep_proportion\n        return junk_duration / self.Length()\n\n    # This function will remove something from the beginning of the\n    # segment if it's possible to cleanly lop off a bit that contains\n    # more junk, as a proportion of its length, than 
'args.max_junk_proportion'.\n    # Junk is defined as unk-padding and/or tainted segments.\n    # It considers as a potential split point the first silence\n    # segment or non-tainted non-scored-word segment in the\n    # utterance.  See also PossiblyTruncateEndForJunkProportion().\n    def PossiblyTruncateStartForJunkProportion(self):\n        begin_junk_duration = self.start_unk_padding\n        first_split_line = self.split_lines_of_utt[self.start_index]\n        if IsTainted(first_split_line):\n            first_duration = float(first_split_line[3])\n            begin_junk_duration += first_duration * self.start_keep_proportion\n        if begin_junk_duration == 0.0:\n            # nothing to do.\n            return\n\n        candidate_start_index = None\n        # the following iterates over all lines internal to the utterance.\n        for i in range(self.start_index + 1, self.end_index - 1):\n            this_split_line = self.split_lines_of_utt[i]\n            this_edit_type = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            # We'll consider splitting on silence and on non-scored words.\n            # (i.e. 
making the silence or non-scored word the left boundary of\n            # the new utterance and discarding the piece to the left of that).\n            if ((this_edit_type == 'sil'\n                 or (this_edit_type == 'cor'\n                     and this_ref_word in non_scored_words))\n                and (float(this_split_line[3])\n                     > args.min_split_point_duration)):\n                candidate_start_index = i\n                candidate_start_time = float(this_split_line[2])\n                break  # Consider only the first potential truncation.\n        if candidate_start_index is None:\n            return  # Nothing to do as there is no place to split.\n        candidate_removed_piece_duration = candidate_start_time - self.StartTime()\n        if float(begin_junk_duration) / candidate_removed_piece_duration < args.max_junk_proportion:\n            return  # Nothing to do as the candidate piece to remove has too\n                    # little junk.\n        # OK, remove the piece.\n        self.start_index = candidate_start_index\n        self.start_unk_padding = 0.0\n        self.start_keep_proportion = 1.0\n        self.debug_str += ',truncated-start-for-junk'\n\n    # This is like PossiblyTruncateStartForJunkProportion(), but\n    # acts on the end of the segment; see comments there.\n    def PossiblyTruncateEndForJunkProportion(self):\n        end_junk_duration = self.end_unk_padding\n        last_split_line = self.split_lines_of_utt[self.end_index - 1]\n        if IsTainted(last_split_line):\n            last_duration = float(last_split_line[3])\n            end_junk_duration += last_duration * self.end_keep_proportion\n        if end_junk_duration == 0.0:\n            # nothing to do.\n            return\n\n        candidate_end_index = None\n        # the following iterates over all lines internal to the utterance\n        # (starting from the end).\n        for i in reversed(range(self.start_index + 1, self.end_index - 1)):\n          
  this_split_line = self.split_lines_of_utt[i]\n            this_edit_type = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            # We'll consider splitting on silence and on non-scored words.\n            # (i.e. making the silence or non-scored word the right boundary of\n            # the new utterance and discarding the piece to the right of that).\n            if ((this_edit_type == 'sil'\n                 or (this_edit_type == 'cor'\n                     and this_ref_word in non_scored_words))\n                and (float(this_split_line[3])\n                     > args.min_split_point_duration)):\n                candidate_end_index = i + 1  # note: end-indexes are one past the last.\n                candidate_end_time = float(this_split_line[2]) + float(this_split_line[3])\n                break  # Consider only the latest potential truncation.\n        if candidate_end_index is None:\n            return  # Nothing to do as there is no place to split.\n        candidate_removed_piece_duration = self.EndTime() - candidate_end_time\n        if float(end_junk_duration) / candidate_removed_piece_duration < args.max_junk_proportion:\n            return  # Nothing to do as the candidate piece to remove has too\n                    # little junk.\n        # OK, remove the piece.\n        self.end_index = candidate_end_index\n        self.end_unk_padding = 0.0\n        self.end_keep_proportion = 1.0\n        self.debug_str += ',truncated-end-for-junk'\n\n\n    # this will return true if there is at least one word in the utterance\n    # that's a scored word (not a non-scored word) and not an OOV word that's\n    # realized as unk.  
This becomes a filter on keeping segments.\n    def ContainsAtLeastOneScoredNonOovWord(self):\n        global non_scored_words\n        for i in range(self.start_index, self.end_index):\n            this_split_line = self.split_lines_of_utt[i]\n            this_hyp_word = this_split_line[4]\n            this_ref_word = this_split_line[6]\n            this_edit = this_split_line[7]\n            if this_edit == 'cor' and not this_ref_word in non_scored_words \\\n               and this_ref_word == this_hyp_word:\n                return True\n        return False\n\n    # Returns the text corresponding to this utterance, as a string.\n    def Text(self):\n        global oov_symbol\n        text_array = []\n        if self.start_unk_padding != 0.0:\n            text_array.append(oov_symbol)\n        for i in range(self.start_index, self.end_index):\n            this_split_line = self.split_lines_of_utt[i]\n            this_edit = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            if this_ref_word != '<eps>' and this_split_line[-1] != 'do-not-include-in-text':\n                text_array.append(this_ref_word)\n        if self.end_unk_padding != 0.0:\n            text_array.append(oov_symbol)\n        return ' '.join(text_array)\n\n\n# Here, 'text' will be something that indicates the stage of processing,\n# e.g. 
'Stage 0: segment cores', 'Stage 1: add tainted lines',\n#, etc.\ndef AccumulateSegmentStats(segment_list, text):\n    global segment_total_length, num_segments\n    for segment in segment_list:\n        num_segments[text] += 1\n        segment_total_length[text] += segment.Length()\n\ndef PrintSegmentStats():\n    global segment_total_length, num_segments, \\\n       num_utterances, num_utterances_without_segments, \\\n       total_length_of_utterances\n\n    print('Number of utterances is %d, of which %.2f%% had no segments after '\n          'all processing; total length of data in original utterances (in seconds) '\n          'was %d' % (num_utterances,\n                      num_utterances_without_segments * 100.0 / num_utterances,\n                      total_length_of_utterances),\n          file = sys.stderr)\n\n\n    keys = sorted(segment_total_length.keys())\n    for i in range(len(keys)):\n        key = keys[i]\n        if i > 0:\n            delta_percentage = '[%+.2f%%]' % ((segment_total_length[key] - segment_total_length[keys[i-1]])\n                                              * 100.0 / total_length_of_utterances)\n        print('At %s, num-segments is %d, total length %.2f%% of original total %s' % (\n                key, num_segments[key],\n                segment_total_length[key] * 100.0 / total_length_of_utterances,\n                delta_percentage if i > 0 else ''),\n              file = sys.stderr)\n\n# This function creates the segments for an utterance as a list\n# of class Segment.\n# It returns a 2-tuple (list-of-segments, list-of-deleted-segments)\n# where the deleted segments are only useful for diagnostic printing.\n# Note: split_lines_of_utt is a list of lists, one per line, each containing the\n# sequence of fields.\ndef GetSegmentsForUtterance(split_lines_of_utt):\n    global num_utterances, num_utterances_without_segments, total_length_of_utterances\n\n    num_utterances += 1\n\n    segment_ranges = 
ComputeSegmentCores(split_lines_of_utt)\n\n    utterance_end_time = float(split_lines_of_utt[-1][2]) + float(split_lines_of_utt[-1][3])\n    total_length_of_utterances += utterance_end_time\n\n    segments = [ Segment(split_lines_of_utt, x[0], x[1])\n                 for x in segment_ranges ]\n\n    AccumulateSegmentStats(segments, 'stage  0 [segment cores]')\n    for segment in segments:\n        segment.PossiblyAddTaintedLines()\n    AccumulateSegmentStats(segments, 'stage  1 [add tainted lines]')\n    new_segments = []\n    for s in segments:\n        new_segments += s.PossiblySplitSegment()\n    segments = new_segments\n    AccumulateSegmentStats(segments, 'stage  2 [split segments]')\n    for s in segments:\n        s.PossiblyTruncateBoundaries()\n    AccumulateSegmentStats(segments, 'stage  3 [truncate boundaries]')\n    for s in segments:\n        s.RelaxBoundaryTruncation()\n    AccumulateSegmentStats(segments, 'stage  4 [relax boundary truncation]')\n    for s in segments:\n        s.PossiblyAddUnkPadding()\n    AccumulateSegmentStats(segments, 'stage  5 [unk-padding]')\n\n    deleted_segments = []\n    new_segments = []\n    for s in segments:\n        # the 0.999 allows for roundoff error.\n        if (not s.IsWholeUtterance() and s.Length() < 0.999 * args.min_new_segment_length):\n            s.debug_str += '[deleted-because-of--min-new-segment-length]'\n            deleted_segments.append(s)\n        else:\n            new_segments.append(s)\n    segments = new_segments\n    AccumulateSegmentStats(segments, 'stage  6 [remove new segments under --min-new-segment-length')\n\n    new_segments = []\n    for s in segments:\n        # the 0.999 allows for roundoff error.\n        if s.Length() < 0.999 * args.min_segment_length:\n            s.debug_str += '[deleted-because-of--min-segment-length]'\n            deleted_segments.append(s)\n        else:\n            new_segments.append(s)\n    segments = new_segments\n    AccumulateSegmentStats(segments, 
'stage  7 [remove segments under --min-segment-length')\n\n    for s in segments:\n        s.PossiblyTruncateStartForJunkProportion()\n    AccumulateSegmentStats(segments, 'stage  8 [truncate segment-starts for --max-junk-proportion')\n\n    for s in segments:\n        s.PossiblyTruncateEndForJunkProportion()\n    AccumulateSegmentStats(segments, 'stage  9 [truncate segment-ends for --max-junk-proportion')\n\n    new_segments = []\n    for s in segments:\n        if s.ContainsAtLeastOneScoredNonOovWord():\n            new_segments.append(s)\n        else:\n            s.debug_str += '[deleted-because-no-scored-non-oov-words]'\n            deleted_segments.append(s)\n\n    segments = new_segments\n    AccumulateSegmentStats(segments, 'stage 10 [remove segments without scored,non-OOV words]')\n\n    new_segments = []\n    for s in segments:\n        j = s.JunkProportion()\n        if j <= args.max_junk_proportion:\n            new_segments.append(s)\n        else:\n            s.debug_str += '[deleted-because-junk-proportion={0}]'.format(j)\n            deleted_segments.append(s)\n\n    segments = new_segments\n    AccumulateSegmentStats(segments, 'stage 11 [remove segments with junk exceeding --max-junk-proportion]')\n\n    new_segments = []\n    if len(segments) > 0:\n        new_segments.append(segments[0])\n        for i in range(1, len(segments)):\n            if new_segments[-1].EndTime() >= segments[i].StartTime():\n                new_segments[-1].MergeWithSegment(segments[i])\n            else:\n                new_segments.append(segments[i])\n    segments = new_segments\n    AccumulateSegmentStats(segments, 'stage 12 [merge overlapping or touching segments]')\n\n    for i in range(len(segments) - 1):\n        if segments[i].EndTime() > segments[i+1].StartTime():\n            # this just adds something to --ctm-edits-out output\n            segments[i+1].debug_str += \",overlaps-previous-segment\"\n\n    if len(segments) == 0:\n        
num_utterances_without_segments += 1\n\n    return (segments, deleted_segments)\n\n# this prints a number with a certain number of digits after\n# the point, while removing trailing zeros.\ndef FloatToString(f):\n    num_digits = 6 # we want to print 6 digits after the zero\n    g = f\n    while abs(g) > 1.0:\n        g *= 0.1\n        num_digits += 1\n    format_str = '%.{0}g'.format(num_digits)\n    return format_str % f\n\n# Gives time in string form as an exact multiple of the frame-length, e.g. 0.01\n# (after rounding).\ndef TimeToString(time, frame_length):\n    n = round(time / frame_length)\n    assert n >= 0\n    # The next function call will remove trailing zeros while printing it, so\n    # that e.g. 0.01 will be printed as 0.01 and not 0.0099999999999999.  It\n    # seems that doing this in a simple way is not really possible (at least,\n    # not without assuming that frame_length is of the form 10^-n, which we\n    # don't really want to do).\n    return FloatToString(n * frame_length)\n\ndef WriteSegmentsForUtterance(text_output_handle, segments_output_handle,\n                              old_utterance_name, segments):\n    num_digits = len('{}'.format(len(segments)))\n    for n in range(len(segments)):\n        segment = segments[n]\n        # split utterances will be named foo-bar-1 foo-bar-2, etc.\n        new_utterance_name = \"{old}-{index:0{width}}\".format(\n                                 old=old_utterance_name, index=n+1,\n                                 width=num_digits)\n        # print a line to the text output of the form like\n        # <new-utterance-id> <text>\n        # like:\n        # foo-bar-1 hello this is dan\n        print(new_utterance_name, segment.Text(), file = text_output_handle)\n        # print a line to the segments output of the form\n        # <new-utterance-id> <old-utterance-id> <start-time> <end-time>\n        # like:\n        # foo-bar-1 foo-bar 5.1 7.2\n        print(new_utterance_name, old_utterance_name,\n  
            TimeToString(segment.StartTime(), args.frame_length),\n              TimeToString(segment.EndTime(), args.frame_length),\n              file = segments_output_handle)\n\n\n\n# Note, this is destrutive of 'segments_for_utterance', but it won't matter.\ndef PrintDebugInfoForUtterance(ctm_edits_out_handle,\n                               split_lines_of_cur_utterance,\n                               segments_for_utterance,\n                               deleted_segments_for_utterance):\n    # info_to_print will be list of 2-tuples (time, 'start-segment-n'|'end-segment-n')\n    # representing the start or end times of segments.\n    info_to_print = []\n    for n in range(len(segments_for_utterance)):\n        segment = segments_for_utterance[n]\n        start_string = 'start-segment-{0}[{1}]'.format(n+1, segment.DebugInfo())\n        info_to_print.append( (segment.StartTime(), start_string) )\n        end_string = 'end-segment-{}'.format(n+1)\n        info_to_print.append( (segment.EndTime(), end_string) )\n    # for segments that were deleted we print info like start-deleted-segment-1, and\n    # otherwise similar info to segments that were retained.\n    for n in range(len(deleted_segments_for_utterance)):\n        segment = deleted_segments_for_utterance[n]\n        start_string = 'start-deleted-segment-{0}[{1}]'.format(n+1, segment.DebugInfo())\n        info_to_print.append( (segment.StartTime(), start_string) )\n        end_string = 'end-deleted-segment-{}'.format(n+1)\n        info_to_print.append( (segment.EndTime(), end_string) )\n\n    info_to_print = sorted(info_to_print)\n\n    for i in range(len(split_lines_of_cur_utterance)):\n        split_line=split_lines_of_cur_utterance[i]\n        split_line[0] += '[{}]'.format(i)    # add an index like [0], [1], to\n                                             # the utterance-id so we can easily\n                                             # look up segment indexes.\n        start_time = 
float(split_line[2])\n        end_time = start_time + float(split_line[3])\n        split_line_copy = list(split_line)\n        while len(info_to_print) > 0 and info_to_print[0][0] <= end_time:\n            (segment_start, string) = info_to_print[0]\n            # shift the first element off of info_to_print.\n            info_to_print = info_to_print[1:]\n            # add a field like 'start-segment1[...]=3.21' to what we're about to print.\n            split_line_copy.append(string + \"=\" + TimeToString(segment_start, args.frame_length))\n        print(' '.join(split_line_copy), file = ctm_edits_out_handle)\n\n# This accumulates word-level stats about, for each reference word, with what\n# probability it will end up in the core of a segment.  Words with low\n# probabilities of being in segments will generally be associated with some kind\n# of error (there is a higher probability of having a wrong lexicon entry).\ndef AccWordStatsForUtterance(split_lines_of_utt,\n                             segments_for_utterance):\n    # word_count_pair is a map from a string (the word) to\n    # a list [total-count, count-not-within-segments]\n    global word_count_pair\n    line_is_in_segment = [ False ] * len(split_lines_of_utt)\n    for segment in segments_for_utterance:\n        for i in range(segment.start_index, segment.end_index):\n            line_is_in_segment[i] = True\n    for i in range(len(split_lines_of_utt)):\n        this_ref_word = split_lines_of_utt[i][6]\n        if this_ref_word != '<eps>':\n            word_count_pair[this_ref_word][0] += 1\n            if not line_is_in_segment[i]:\n                word_count_pair[this_ref_word][1] += 1\n\ndef PrintWordStats(word_stats_out):\n    try:\n        f = open(word_stats_out, 'w', encoding='utf-8')\n    except:\n        sys.exit(\"segment_ctm_edits.py: error opening word-stats file --word-stats-out={0} \"\n                 \"for writing\".format(word_stats_out))\n    global word_count_pair\n    # Sort from most 
to least problematic.  We want to give more prominence to\n    # words that are most frequently not in segments, but also to high-count\n    # words.  Define badness = pair[1] / pair[0], and total_count = pair[0],\n    # where 'pair' is a value of word_count_pair.  We'll reverse sort on\n    # badness^3 * total_count = pair[1]^3 / pair[0]^2.\n    for key, pair in sorted(word_count_pair.items(),\n                      key = lambda item: (item[1][1] ** 3) * 1.0 / (item[1][0] ** 2),\n                      reverse = True):\n        badness = pair[1] * 1.0 / pair[0]\n        total_count = pair[0]\n        print(key, badness, total_count, file = f)\n    try:\n        f.close()\n    except:\n        sys.exit(\"segment_ctm_edits.py: error closing file --word-stats-out={0} \"\n                 \"(full disk?)\".format(word_stats_out))\n    print(\"segment_ctm_edits.py: please see the file {0} for word-level statistics \"\n          \"saying how frequently each word was excluded for a segment; format is \"\n          \"<word> <proportion-of-time-excluded> <total-count>.  
Particularly \"\n          \"problematic words appear near the top of the file.\".format(word_stats_out),\n          file = sys.stderr)\n\n\ndef ProcessData():\n    try:\n        f_in = open(args.ctm_edits_in, encoding='utf-8')\n    except:\n        sys.exit(\"segment_ctm_edits.py: error opening ctm-edits input \"\n                 \"file {0}\".format(args.ctm_edits_in))\n    try:\n        text_output_handle = open(args.text_out, 'w', encoding='utf-8')\n    except:\n        sys.exit(\"segment_ctm_edits.py: error opening text output \"\n                 \"file {0}\".format(args.text_out))\n    try:\n        segments_output_handle = open(args.segments_out, 'w', encoding='utf-8')\n    except:\n        sys.exit(\"segment_ctm_edits.py: error opening segments output \"\n                 \"file {0}\".format(args.segments_out))\n    if args.ctm_edits_out != None:\n        try:\n            ctm_edits_output_handle = open(args.ctm_edits_out, 'w', encoding='utf-8')\n        except:\n            sys.exit(\"segment_ctm_edits.py: error opening ctm-edits output \"\n                     \"file {0}\".format(args.ctm_edits_out))\n\n    # Most of what we're doing in the lines below is splitting the input lines\n    # and grouping them per utterance, before giving them to\n    # GetSegmentsForUtterance() and then printing the modified lines.\n    first_line = f_in.readline()\n    if first_line == '':\n        sys.exit(\"segment_ctm_edits.py: empty input\")\n    split_pending_line = first_line.split()\n    if len(split_pending_line) == 0:\n        sys.exit(\"segment_ctm_edits.py: bad input line \" + first_line)\n    cur_utterance = split_pending_line[0]\n    split_lines_of_cur_utterance = []\n\n    while True:\n        if len(split_pending_line) == 0 or split_pending_line[0] != cur_utterance:\n            (segments_for_utterance,\n             deleted_segments_for_utterance) = GetSegmentsForUtterance(split_lines_of_cur_utterance)\n            AccWordStatsForUtterance(split_lines_of_cur_utterance, 
segments_for_utterance)\n            WriteSegmentsForUtterance(text_output_handle, segments_output_handle,\n                                      cur_utterance, segments_for_utterance)\n            if args.ctm_edits_out != None:\n                PrintDebugInfoForUtterance(ctm_edits_output_handle,\n                                           split_lines_of_cur_utterance,\n                                           segments_for_utterance,\n                                           deleted_segments_for_utterance)\n            split_lines_of_cur_utterance = []\n            if len(split_pending_line) == 0:\n                break\n            else:\n                cur_utterance = split_pending_line[0]\n\n        split_lines_of_cur_utterance.append(split_pending_line)\n        next_line = f_in.readline()\n        split_pending_line = next_line.split()\n        if len(split_pending_line) == 0:\n            if next_line != '':\n                sys.exit(\"segment_ctm_edits.py: got an empty or whitespace input line\")\n    try:\n        text_output_handle.close()\n        segments_output_handle.close()\n        if args.ctm_edits_out != None:\n            ctm_edits_output_handle.close()\n    except:\n        sys.exit(\"segment_ctm_edits.py: error closing one or more outputs \"\n                 \"(broken pipe or full disk?)\")\n\n\ndef ReadNonScoredWords(non_scored_words_file):\n    global non_scored_words\n    try:\n        f = open(non_scored_words_file, encoding='utf-8')\n    except:\n        sys.exit(\"segment_ctm_edits.py: error opening file: \"\n                 \"--non-scored-words=\" + non_scored_words_file)\n    for line in f.readlines():\n        a = line.split()\n        if not len(line.split()) == 1:\n            sys.exit(\"segment_ctm_edits.py: bad line in non-scored-words \"\n                     \"file {0}: {1}\".format(non_scored_words_file, line))\n        non_scored_words.add(a[0])\n    f.close()\n\n\n\n\nnon_scored_words = 
set()\nReadNonScoredWords(args.non_scored_words_in)\n\noov_symbol = None\nif args.oov_symbol_file != None:\n    try:\n        with open(args.oov_symbol_file, encoding='utf-8') as f:\n            line = f.readline()\n            assert len(line.split()) == 1\n            oov_symbol = line.split()[0]\n            assert f.readline() == ''\n    except Exception as e:\n        sys.exit(\"segment_ctm_edits.py: error reading file --oov-symbol-file=\" +\n                 args.oov_symbol_file + \", error is: \" + str(e))\nelif args.unk_padding != 0.0:\n    sys.exit(\"segment_ctm_edits.py: if the --unk-padding option is nonzero (which \"\n             \"it is by default), the --oov-symbol-file option must be supplied.\")\n\n# segment_total_length and num_segments are maps from\n# 'stage' strings; see AccumulateSegmentStats for details.\nsegment_total_length = defaultdict(int)\nnum_segments = defaultdict(int)\n# the lambda expression below is an anonymous function that takes no arguments\n# and returns the new list [0, 0].\nword_count_pair = defaultdict(lambda: [0, 0])\nnum_utterances = 0\nnum_utterances_without_segments = 0\ntotal_length_of_utterances = 0\n\n\nProcessData()\nPrintSegmentStats()\nif args.word_stats_out != None:\n    PrintWordStats(args.word_stats_out)\nif args.ctm_edits_out != None:\n    print(\"segment_ctm_edits.py: detailed utterance-level debug information \"\n          \"is in \" + args.ctm_edits_out, file = sys.stderr)\n\n"
  },
  {
    "path": "egs/steps/cleanup/internal/segment_ctm_edits_mild.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2016   Vimal Manohar\n#           2016   Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport copy\nimport logging\nimport heapq\nimport sys\nfrom collections import defaultdict\n\n\"\"\"\nThis script reads 'ctm-edits' file format that is produced by align_ctm_ref.py\nand modified by modify_ctm_edits.py and taint_ctm_edits.py. Its function is to\nproduce a segmentation and text from the ctm-edits input.\n\nIt is a milder version of the script segment_ctm_edits.py i.e. it allows\nto keep more of the reference. This is useful for segmenting long-audio\nbased on imperfect transcripts.\n\nThe ctm-edits file format that this script expects is as follows\n<file-id> <channel> <start-time> <duration> <conf> <hyp-word> <ref-word> <edit>\n['tainted']\n[note: file-id is really utterance-id at this point].\n\"\"\"\n\n_global_logger = logging.getLogger(__name__)\n_global_logger.setLevel(logging.INFO)\n_global_handler = logging.StreamHandler()\n_global_handler.setLevel(logging.INFO)\n_global_formatter = logging.Formatter(\n    '%(asctime)s [%(pathname)s:%(lineno)s - '\n    '%(funcName)s - %(levelname)s ] %(message)s')\n_global_handler.setFormatter(_global_formatter)\n_global_logger.addHandler(_global_handler)\n\n_global_non_scored_words = {}\n\n\ndef non_scored_words():\n    return _global_non_scored_words\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This program produces segmentation and text information\n        based on reading ctm-edits input format which is produced by\n        steps/cleanup/internal/get_ctm_edits.py,\n        steps/cleanup/internal/modify_ctm_edits.py and\n        steps/cleanup/internal/taint_ctm_edits.py.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\n    parser.add_argument(\"--min-segment-length\", type=float, default=0.5,\n  
                      help=\"\"\"Minimum allowed segment length (in seconds) for\n                        any segment; shorter segments than this will be\n                        discarded.\"\"\")\n    parser.add_argument(\"--min-new-segment-length\", type=float, default=1.0,\n                        help=\"\"\"Minimum allowed segment length (in seconds) for\n                        newly created segments (i.e. not identical to the input\n                        utterances).\n                        Expected to be >= --min-segment-length.\"\"\")\n    parser.add_argument(\"--frame-length\", type=float, default=0.01,\n                        help=\"\"\"This only affects rounding of the output times;\n                        they will be constrained to multiples of this\n                        value.\"\"\")\n    parser.add_argument(\"--max-tainted-length\", type=float, default=0.05,\n                        help=\"\"\"Maximum allowed length of any 'tainted' line.\n                        Note: 'tainted' lines may only appear at the boundary\n                        of a segment\"\"\")\n    parser.add_argument(\"--max-edge-silence-length\", type=float, default=0.5,\n                        help=\"\"\"Maximum allowed length of silence if it appears\n                        at the edge of a segment (will be truncated).  This\n                        rule is relaxed if such truncation would take a segment\n                        below the --min-segment-length or\n                        --min-new-segment-length.\"\"\")\n    parser.add_argument(\"--max-edge-non-scored-length\", type=float,\n                        default=0.5,\n                        help=\"\"\"Maximum allowed length of a non-scored word\n                        (noise, cough, etc.) if it appears at the edge of a\n                        segment (will be truncated).  
This rule is relaxed if\n                        such truncation would take a segment below the\n                        --min-segment-length.\"\"\")\n    parser.add_argument(\"--max-internal-silence-length\", type=float,\n                        default=2.0,\n                        help=\"\"\"Maximum allowed length of silence if it appears\n                        inside a segment (will cause the segment to be\n                        split).\"\"\")\n    parser.add_argument(\"--max-internal-non-scored-length\", type=float,\n                        default=2.0,\n                        help=\"\"\"Maximum allowed length of a non-scored word\n                        (noise, etc.) if it appears inside a segment (will\n                        cause the segment to be split).\n                        Note: reference words which are real words but OOV are\n                        not included in this category.\"\"\")\n    parser.add_argument(\"--unk-padding\", type=float, default=0.05,\n                        help=\"\"\"Amount of padding with <unk> that we do if a\n                        segment boundary is next to errors (ins, del, sub).\n                        That is, we add this amount of time to the segment and\n                        add the <unk> word to cover the acoustics.  
If nonzero,\n                        the --oov-symbol-file option must be supplied.\"\"\")\n    parser.add_argument(\"--max-junk-proportion\", type=float, default=0.1,\n                        help=\"\"\"Maximum proportion of the time of the segment\n                        that may consist of potentially bad data, in which we\n                        include 'tainted' lines of the ctm-edits input and\n                        unk-padding.\"\"\")\n    parser.add_argument(\"--min-split-point-duration\", type=float, default=0.0,\n                        help=\"\"\"Minimum duration of silence or non-scored word\n                        to be considered a viable split point when\n                        truncating based on junk proportion.\"\"\")\n    parser.add_argument(\"--max-deleted-words-kept-when-merging\",\n                        dest='max_deleted_words', type=int, default=1,\n                        help=\"\"\"When merging segments that are found to be\n                        overlapping or adjacent after all other processing,\n                        keep in the transcript the reference words that were\n                        deleted between the segments [if any] as long as there\n                        were no more than this many reference words.  
Setting\n                        this to zero will mean that any reference words that\n                        were deleted between the segments we're about to\n                        reattach will not appear in the generated transcript\n                        (so we'll match the hyp).\"\"\")\n\n    parser.add_argument(\"--splitting.min-silence-length\",\n                        dest=\"min_silence_length_to_split\",\n                        type=float, default=0.3,\n                        help=\"\"\"Only considers silences that are at least this\n                        long as potential split points\"\"\")\n    parser.add_argument(\"--splitting.min-non-scored-length\",\n                        dest=\"min_non_scored_length_to_split\",\n                        type=float, default=0.1,\n                        help=\"\"\"Only considers non-scored words that are at\n                        least this long as potential split points\"\"\")\n    parser.add_argument(\"--splitting.max-segment-length\",\n                        dest=\"max_segment_length_for_splitting\",\n                        type=float, default=10,\n                        help=\"\"\"Try to split long segments into segments that\n                        are smaller than this size. 
See\n                        possibly_split_long_segment() in Segment class.\"\"\")\n    parser.add_argument(\"--splitting.hard-max-segment-length\",\n                        dest=\"hard_max_segment_length\",\n                        type=float, default=15,\n                        help=\"\"\"Split all segments that are longer than this\n                        uniformly into segments of size\n                        --splitting.max-segment-length\"\"\")\n\n    parser.add_argument(\"--merging-score.silence-factor\",\n                        dest=\"silence_factor\",\n                        type=float, default=1,\n                        help=\"\"\"Weightage on the silence length when merging\n                        segments\"\"\")\n    parser.add_argument(\"--merging-score.incorrect-words-factor\",\n                        dest=\"incorrect_words_factor\",\n                        type=float, default=1,\n                        help=\"\"\"Weightage on the incorrect_words_length when\n                        merging segments\"\"\")\n    parser.add_argument(\"--merging-score.tainted-words-factor\",\n                        dest=\"tainted_words_factor\",\n                        type=float, default=1,\n                        help=\"\"\"Weightage on the WER including the\n                        tainted words as incorrect words.\"\"\")\n\n    parser.add_argument(\"--merging.max-wer\",\n                        dest=\"max_wer\",\n                        type=float, default=10.0,\n                        help=\"Max WER%% of merged segments when merging\")\n    parser.add_argument(\"--merging.max-bad-proportion\",\n                        dest=\"max_bad_proportion\",\n                        type=float, default=0.2,\n                        help=\"\"\"Maximum length of silence, junk and incorrect\n                        words in a merged segment allowed as a fraction of the\n                        total length of merged segment.\"\"\")\n    
parser.add_argument(\"--merging.max-segment-length\",\n                        dest='max_segment_length_for_merging',\n                        type=float, default=10,\n                        help=\"\"\"Maximum segment length allowed for merged\n                        segment\"\"\")\n    parser.add_argument(\"--merging.max-intersegment-incorrect-words-length\",\n                        dest='max_intersegment_incorrect_words_length',\n                        type=float, default=0.2,\n                        help=\"\"\"Maximum length of the intersegment region that\n                        may consist of incorrect words. This is to\n                        allow cases where there may be a lot of silence in the\n                        segment but the incorrect words are few, while\n                        preventing regions that have a lot of incorrect\n                        words.\"\"\")\n\n    parser.add_argument(\"--oov-symbol-file\", type=argparse.FileType('r'),\n                        help=\"\"\"Filename of file such as data/lang/oov.txt\n                        which contains the text form of the OOV word, normally\n                        '<unk>'.  Supplied as a file to avoid complications\n                        with escaping.  Necessary if the --unk-padding option\n                        has a nonzero value (which it does by default).\"\"\")\n    parser.add_argument(\"--ctm-edits-out\", type=argparse.FileType('w'),\n                        help=\"\"\"Filename to output an extended version of the\n                        ctm-edits format with segment start and end points\n                        noted.  
This file is intended to be read by humans;\n                        there are currently no scripts that will read it.\"\"\")\n    parser.add_argument(\"--word-stats-out\", type=argparse.FileType('w'),\n                        help=\"\"\"Filename for output of word-level stats, of the\n                        form '<word> <bad-proportion> <total-count-in-ref>',\n                        e.g. 'hello 0.12 12408', where the <bad-proportion> is\n                        the proportion of the time that this reference word\n                        does not make it into a segment.  It can help reveal\n                        words that have problematic pronunciations or are\n                        associated with transcription errors.\"\"\")\n\n    parser.add_argument(\"non_scored_words_in\",\n                        metavar=\"<non-scored-words-file>\",\n                        type=argparse.FileType('r'),\n                        help=\"\"\"Filename of file containing a list of\n                        non-scored words, one per line. See\n                        steps/cleanup/internal/get_nonscored_words.py.\"\"\")\n    parser.add_argument(\"ctm_edits_in\", metavar=\"<ctm-edits-in>\",\n                        type=argparse.FileType('r'),\n                        help=\"\"\"Filename of input ctm-edits file.  Use\n                        /dev/stdin for standard input.\"\"\")\n    parser.add_argument(\"text_out\", metavar=\"<text-out>\",\n                        type=argparse.FileType('w'),\n                        help=\"\"\"Filename of output text file (same format as\n                        data/train/text, i.e.  <new-utterance-id> <word1>\n                        <word2> ... <wordN>\"\"\")\n    parser.add_argument(\"segments_out\", metavar=\"<segments-out>\",\n                        type=argparse.FileType('w'),\n                        help=\"\"\"Filename of output segments.  
This has the same\n                        format as data/train/segments, but instead of\n                        <recording-id>, the second field is the old\n                        utterance-id, i.e. <new-utterance-id>\n                        <old-utterance-id> <start-time> <end-time>\"\"\")\n\n    parser.add_argument(\"--verbose\", type=int, default=0,\n                        help=\"Use higher verbosity for more debugging output\")\n\n    args = parser.parse_args()\n\n    if args.verbose > 2:\n        _global_handler.setLevel(logging.DEBUG)\n        _global_logger.setLevel(logging.DEBUG)\n\n    return args\n\n\ndef is_tainted(split_line_of_utt):\n    \"\"\"Returns True if this line in ctm-edits is 'tainted'.\"\"\"\n    return len(split_line_of_utt) > 8 and split_line_of_utt[8] == 'tainted'\n\n\ndef compute_segment_cores(split_lines_of_utt):\n    \"\"\"\n    This function returns a list of pairs (start-index, end-index) representing\n    the cores of segments (so if a pair is (s, e), then the core of a segment\n    would span (s, s+1, ... e-1)).\n\n    The argument 'split_lines_of_utt' is a list of lines from a ctm-edits file\n    corresponding to a single utterance.\n\n    By the 'core of a segment', we mean a sequence of ctm-edits lines including\n    at least one 'cor' line and a contiguous sequence of other lines of the\n    type 'cor', 'fix' and 'sil' that must not be tainted.  The segment core\n    excludes any tainted lines at the edge of a segment, which will be added\n    later.\n\n    We only initiate a segment when it contains something correct and not\n    realized as unk (i.e. ref==hyp); and we extend it with anything that is\n    'sil' or 'fix' or 'cor' that is not tainted.  
Contiguous regions of 'true'\n    in the resulting boolean array will then become the cores of prototype\n    segments, and we'll add any adjacent tainted words (or parts of them).\n    \"\"\"\n    num_lines = len(split_lines_of_utt)\n    line_is_in_segment_core = [False] * num_lines\n    # include only the correct lines\n    for i in range(num_lines):\n        if (split_lines_of_utt[i][7] == 'cor'\n                and split_lines_of_utt[i][4] == split_lines_of_utt[i][6]):\n            line_is_in_segment_core[i] = True\n\n    # extend each proto-segment forwards as far as we can:\n    for i in range(1, num_lines):\n        if line_is_in_segment_core[i - 1] and not line_is_in_segment_core[i]:\n            edit_type = split_lines_of_utt[i][7]\n            if (not is_tainted(split_lines_of_utt[i])\n                    and (edit_type == 'cor' or edit_type == 'sil'\n                         or edit_type == 'fix')):\n                line_is_in_segment_core[i] = True\n\n    # extend each proto-segment backwards as far as we can:\n    for i in reversed(range(0, num_lines - 1)):\n        if line_is_in_segment_core[i + 1] and not line_is_in_segment_core[i]:\n            edit_type = split_lines_of_utt[i][7]\n            if (not is_tainted(split_lines_of_utt[i])\n                    and (edit_type == 'cor' or edit_type == 'sil'\n                         or edit_type == 'fix')):\n                line_is_in_segment_core[i] = True\n\n    # Get contiguous regions of line in the form of a list\n    # of (start_index, end_index)\n    segment_ranges = []\n    cur_segment_start = None\n    for i in range(0, num_lines):\n        if line_is_in_segment_core[i]:\n            if cur_segment_start is None:\n                cur_segment_start = i\n        else:\n            if cur_segment_start is not None:\n                segment_ranges.append((cur_segment_start, i))\n                cur_segment_start = None\n    if cur_segment_start is not None:\n        
segment_ranges.append((cur_segment_start, num_lines))\n\n    return segment_ranges\n\n\nclass SegmentStats(object):\n    \"\"\"Class to store various statistics of segments.\"\"\"\n\n    def __init__(self):\n        self.num_incorrect_words = 0\n        self.num_tainted_words = 0\n        self.incorrect_words_length = 0\n        self.tainted_nonsilence_length = 0\n        self.silence_length = 0\n        self.num_words = 0\n        self.total_length = 0\n\n    def wer(self):\n        \"\"\"Returns WER%\"\"\"\n        try:\n            return float(self.num_incorrect_words) * 100.0 / self.num_words\n        except ZeroDivisionError:\n            return float(\"inf\")\n\n    def bad_proportion(self):\n        assert self.total_length > 0\n        proportion = float(self.silence_length + self.tainted_nonsilence_length\n                           + self.incorrect_words_length) / self.total_length\n        if proportion > 1.00005:\n            raise RuntimeError(\"Error in segment stats {0}\".format(self))\n        return proportion\n\n    def incorrect_proportion(self):\n        assert self.total_length > 0\n        proportion = float(self.incorrect_words_length) / self.total_length\n        if proportion > 1.00005:\n            raise RuntimeError(\"Error in segment stats {0}\".format(self))\n        return proportion\n\n    def combine(self, other, scale=1):\n        \"\"\"Merges this stats with another stats object.\"\"\"\n        self.num_incorrect_words += scale * other.num_incorrect_words\n        self.num_tainted_words += scale * other.num_tainted_words\n        self.num_words += scale * other.num_words\n        self.incorrect_words_length += scale * other.incorrect_words_length\n        self.tainted_nonsilence_length += (scale\n                                           * other.tainted_nonsilence_length)\n        self.silence_length += scale * other.silence_length\n        self.total_length += scale * other.total_length\n\n    def assert_equal(self, other):\n    
    try:\n            assert self.num_incorrect_words == other.num_incorrect_words\n            assert self.num_tainted_words == other.num_tainted_words\n            assert (abs(self.incorrect_words_length\n                        - other.incorrect_words_length) < 0.01)\n            assert (abs(self.tainted_nonsilence_length\n                        - other.tainted_nonsilence_length) < 0.01)\n            assert abs(self.silence_length - other.silence_length) < 0.01\n            assert self.num_words == other.num_words\n            assert abs(self.total_length - other.total_length) < 0.01\n        except AssertionError:\n            _global_logger.error(\"self %s != other %s\", self, other)\n            raise\n\n    def compare(self, other):\n        \"\"\"Returns true if this stats is same as another stats object.\"\"\"\n        if self.num_incorrect_words != other.num_incorrect_words:\n            return False\n        if self.num_tainted_words != other.num_tainted_words:\n            return False\n        if self.incorrect_words_length != other.incorrect_words_length:\n            return False\n        if self.tainted_nonsilence_length != other.tainted_nonsilence_length:\n            return False\n        if self.silence_length != other.silence_length:\n            return False\n        if self.num_words != other.num_words:\n            return False\n        if self.total_length != other.total_length:\n            return False\n        return True\n\n    def __str__(self):\n        return (\"num-incorrect-words={num_incorrect:d},\"\n                \"num-tainted-words={num_tainted:d},\"\n                \"num-words={num_words:d},\"\n                \"incorrect-length={incorrect_length:.2f},\"\n                \"silence-length={sil_length:.2f},\"\n                \"tainted-nonsilence-length={tainted_nonsilence_length:.2f},\"\n                \"total-length={total_length:.2f}\".format(\n                    num_incorrect=self.num_incorrect_words,\n                   
 num_tainted=self.num_tainted_words,\n                    num_words=self.num_words,\n                    incorrect_length=self.incorrect_words_length,\n                    sil_length=self.silence_length,\n                    tainted_nonsilence_length=self.tainted_nonsilence_length,\n                    total_length=self.total_length))\n\n\nclass Segment(object):\n    \"\"\"Class to store segments.\"\"\"\n\n    def __init__(self, split_lines_of_utt, start_index, end_index,\n                 debug_str=None, compute_segment_stats=False,\n                 segment_stats=None):\n        self.split_lines_of_utt = split_lines_of_utt\n\n        # start_index is the index of the first line that appears in this\n        # segment, and end_index is one past the last line.  This does not\n        # include unk-padding.\n        self.start_index = start_index\n        self.end_index = end_index\n        assert end_index > start_index\n\n        # If the following values are nonzero, then when we create the segment\n        # we will add <unk> at the start and end of the segment [representing\n        # partial words], with this amount of additional audio.\n        self.start_unk_padding = 0.0\n        self.end_unk_padding = 0.0\n\n        # debug_str keeps track of the 'core' of the segment.\n        if debug_str is None:\n            debug_str = 'core-start={0},core-end={1}'.format(start_index,\n                                                             end_index)\n        else:\n            assert type(debug_str) == str\n        self.debug_str = debug_str\n\n        # This gives the proportion of the time of the first line in the\n        # segment that we keep.  Usually 1.0 but may be less if we've trimmed\n        # away some proportion of the time.\n        self.start_keep_proportion = 1.0\n        # This gives the proportion of the time of the last line in the segment\n        # that we keep.  
Usually 1.0 but may be less if we've trimmed away some\n        # proportion of the time.\n        self.end_keep_proportion = 1.0\n\n        self.stats = None\n\n        if compute_segment_stats:\n            self.compute_stats()\n\n        if segment_stats is not None:\n            self.compute_stats()\n            self.stats.assert_equal(segment_stats)\n            self.stats = segment_stats\n\n    def copy(self, copy_stats=True):\n        segment = Segment(self.split_lines_of_utt, self.start_index,\n                          self.end_index, debug_str=self.debug_str,\n                          segment_stats=(None if not copy_stats\n                                         else copy.deepcopy(self.stats)))\n        segment.start_keep_proportion = self.start_keep_proportion\n        segment.end_keep_proportion = self.end_keep_proportion\n        segment.start_unk_padding = self.start_unk_padding\n        segment.end_unk_padding = self.end_unk_padding\n        return segment\n\n    def __str__(self):\n        return self.debug_info()\n\n    def compute_stats(self):\n        \"\"\"Compute stats for this segment and store them in SegmentStats\n        structure.\n        This is typically called just before merging segments.\n        \"\"\"\n        self.stats = SegmentStats()\n        for i in range(self.start_index, self.end_index):\n            this_duration = float(self.split_lines_of_utt[i][3])\n            assert self.start_keep_proportion == 1.0\n            assert self.end_keep_proportion == 1.0\n            # TODO(vimal): Decide if keep proportion must be applied\n            # if i == self.start_index:\n            #     this_duration *= self.start_keep_proportion\n            # if i == self.end_index - 1:\n            #     this_duration *= self.end_keep_proportion\n            if self.end_index - 1 == self.start_index:\n                # TODO(vimal): Is this true?\n                assert self.start_keep_proportion == self.end_keep_proportion\n\n            
try:\n                if self.split_lines_of_utt[i][7] not in ['cor', 'fix', 'sil']:\n                    # TODO(vimal): The commented part below is apparently\n                    # not true in modify_ctm_edits.py.\n                    # Need to check this or change comments there.\n                    # assert (self.split_lines_of_utt[i][6]\n                    #         not in non_scored_words)\n                    assert not is_tainted(self.split_lines_of_utt[i])\n                    self.stats.num_incorrect_words += 1\n                    self.stats.incorrect_words_length += this_duration\n                if self.split_lines_of_utt[i][7] == 'sil':\n                    self.stats.silence_length += this_duration\n                else:\n                    if (self.split_lines_of_utt[i][6]\n                            not in non_scored_words()):\n                        self.stats.num_words += 1\n                if (is_tainted(self.split_lines_of_utt[i])\n                        and self.split_lines_of_utt[i][7] != 'sil'\n                        and (self.split_lines_of_utt[i][6]\n                             not in non_scored_words())):\n                    # If ref_word is not a non-scored word, this would be\n                    # counted as an incorrect word.\n                    self.stats.num_tainted_words += 1\n                    self.stats.tainted_nonsilence_length += this_duration\n            except Exception:\n                _global_logger.error(\n                    \"Something went wrong when computing stats at \"\n                    \"ctm line %s\", self.split_lines_of_utt[i])\n                raise\n        self.stats.total_length = self.length()\n\n        try:\n            assert (self.stats.tainted_nonsilence_length\n                    + self.stats.silence_length\n                    + self.stats.incorrect_words_length - 0.001\n                    <= self.stats.total_length)\n        except AssertionError:\n            
_global_logger.error(\n                \"Something wrong with the stats for segment %s\", self)\n            raise\n\n    def possibly_add_tainted_lines(self):\n        \"\"\"\n        This is stage 1 of segment processing (after creating the boundaries of\n        the core of the segment, which is done outside of this class).\n\n        This function may reduce start_index and/or increase end_index by\n        including a single adjacent 'tainted' line from the ctm-edits file.\n        This is only done if the lines at the boundaries of the segment are\n        currently real non-silence words and not non-scored words.  The idea is\n        that we probably don't want to start or end the segment right at the\n        boundary of a real word, we want to add some kind of padding.\n        \"\"\"\n        split_lines_of_utt = self.split_lines_of_utt\n        # we're iterating over the segment (start, end)\n        for b in [False, True]:\n            if b:\n                boundary_index = self.end_index - 1\n                adjacent_index = self.end_index\n            else:\n                boundary_index = self.start_index\n                adjacent_index = self.start_index - 1\n            if (adjacent_index >= 0\n                    and adjacent_index < len(split_lines_of_utt)):\n                # only consider merging the adjacent word into the segment if\n                # we're not at the boundary of the utterance.\n                adjacent_line_is_tainted = is_tainted(\n                    split_lines_of_utt[adjacent_index])\n                # if the adjacent line wasn't tainted, then there must have\n                # been another stronger reason why we didn't include it in the\n                # core of the segment (probably that it was an ins, del or\n                # sub), so there is no point considering it.\n                if adjacent_line_is_tainted:\n                    boundary_edit_type = split_lines_of_utt[boundary_index][7]\n                    
boundary_ref_word = split_lines_of_utt[boundary_index][6]\n                    # Even if the edit_type is 'cor', it is possible that\n                    # column 4 (hyp_word) is not the same as column 6\n                    # (ref_word) because the ref_word is an OOV and the\n                    # hyp_word is OOV symbol.\n\n                    # we only add the tainted line to the segment if the word\n                    # at the boundary was a non-silence word that was correctly\n                    # decoded and not fixed [see modify_ctm_edits.py.]\n                    if (boundary_edit_type == 'cor'\n                            and (boundary_ref_word\n                                 not in non_scored_words())):\n                        # Add the adjacent tainted line to the segment.\n                        if b:\n                            self.end_index += 1\n                        else:\n                            self.start_index -= 1\n\n    def possibly_split_segment(self, max_internal_silence_length,\n                               max_internal_non_scored_length):\n        \"\"\"\n        This is stage 3 of segment processing.\n        This function will split a segment into multiple pieces if any of the\n        internal [non-boundary] silences or non-scored words are longer\n        than the allowed values --max-internal-silence-length and\n        --max-internal-non-scored-length.\n        This function returns a list of segments.\n        In the normal case (where there is no splitting) it just returns an\n        array with a single element 'self'.\n\n        Note: --max-internal-silence-length and\n        --max-internal-non-scored-length can be set to very large values\n        to avoid any splitting.\n        \"\"\"\n        # make sure the segment hasn't been processed more than we expect.\n        assert (self.start_unk_padding == 0.0 and self.end_unk_padding == 0.0\n                and self.start_keep_proportion == 1.0\n                and 
self.end_keep_proportion == 1.0)\n        segments = []  # the answer\n        cur_start_index = self.start_index\n        cur_start_is_split = False\n        # only consider splitting at non-boundary lines.  [we'd just truncate\n        # the boundary lines.]\n        for index_to_split_at in range(cur_start_index + 1,\n                                       self.end_index - 1):\n            this_split_line = self.split_lines_of_utt[index_to_split_at]\n            this_duration = float(this_split_line[3])\n            this_edit_type = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            if ((this_edit_type == 'sil' and\n                 this_duration > max_internal_silence_length)\n                    or (this_ref_word in non_scored_words()\n                        and (this_duration\n                             > max_internal_non_scored_length))):\n                # We split this segment at this index, dividing the word in two\n                # [later on, in possibly_truncate_boundaries, it may be further\n                # truncated.]\n                # Note: we use 'index_to_split_at + 1' because the Segment\n                # constructor takes an 'end-index' which is interpreted as one\n                # past the end.\n                new_segment = Segment(self.split_lines_of_utt, cur_start_index,\n                                      index_to_split_at + 1,\n                                      debug_str=self.debug_str)\n                if cur_start_is_split:\n                    new_segment.start_keep_proportion = 0.5\n                new_segment.end_keep_proportion = 0.5\n                cur_start_is_split = True\n                cur_start_index = index_to_split_at\n                segments.append(new_segment)\n        if len(segments) == 0:  # We did not split.\n            segments.append(self)\n        else:\n            # We did split.  
Add the very last segment.\n            new_segment = Segment(self.split_lines_of_utt, cur_start_index,\n                                  self.end_index,\n                                  debug_str=self.debug_str)\n            assert cur_start_is_split\n            new_segment.start_keep_proportion = 0.5\n            segments.append(new_segment)\n        return segments\n\n    def possibly_split_long_segment(self, max_segment_length,\n                                    hard_max_segment_length,\n                                    min_silence_length_to_split,\n                                    min_non_scored_length_to_split):\n        \"\"\"\n        This is stage 4 of segment processing.\n        This function will split a segment into multiple pieces if it is\n        longer than the value --splitting.max-segment-length.\n        It tries to split at silences and non-scored words that are\n        longer than --splitting.min-silence-length or\n        --splitting.min-non-scored-length respectively.\n        If this is not possible and the segments are still longer than\n        --splitting.hard-max-segment-length, then the segment is split into\n        pieces approximately --splitting.max-segment-length long.\n        This function returns a list of segments.\n        In the normal case (where there is no splitting) it just returns an\n        array with a single element 'self'.\n        \"\"\"\n        # make sure the segment hasn't been processed more than we expect.\n        assert self.start_unk_padding == 0.0 and self.end_unk_padding == 0.0\n        if self.length() < max_segment_length:\n            return [self]\n\n        segments = [self]  # the answer\n        cur_start_index = self.start_index\n\n        split_indexes = []\n        # only consider splitting at non-boundary lines.  
[we'd just truncate\n        # the boundary lines.]\n        for index_to_split_at in range(cur_start_index + 1,\n                                       self.end_index - 1):\n            this_split_line = self.split_lines_of_utt[index_to_split_at]\n            this_duration = float(this_split_line[3])\n            this_edit_type = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            this_is_tainted = is_tainted(this_split_line)\n            if (this_edit_type == 'sil'\n                    and this_duration > min_silence_length_to_split):\n                split_indexes.append((index_to_split_at, this_duration,\n                                      this_is_tainted))\n\n            if (this_ref_word in non_scored_words()\n                    and (this_duration > min_non_scored_length_to_split)):\n                split_indexes.append((index_to_split_at, this_duration,\n                                      this_is_tainted))\n        split_indexes.sort(key=lambda x: x[1], reverse=True)\n        split_indexes.sort(key=lambda x: x[2])\n\n        while True:\n            if len(split_indexes) == 0:\n                break\n\n            new_segments = []\n\n            for segment in segments:\n                if segment.length() < max_segment_length:\n                    new_segments.append(segment)\n                    continue\n\n                try:\n                    index_to_split_at = next(\n                        (x[0] for x in split_indexes\n                         if (x[0] > segment.start_index\n                             and x[0] < segment.end_index - 1)))\n                except StopIteration:\n                    _global_logger.debug(\n                        \"Could not find an index in the range (%d, %d) in \"\n                        \"split-indexes %s\", segment.start_index,\n                        segment.end_index - 1, split_indexes)\n                    new_segments.append(segment)\n                    continue\n\n      
          # We split this segment at this index, dividing the word in two\n                # [later on, in possibly_truncate_boundaries, it may be further\n                # truncated.]\n                # Note: we use 'index_to_split_at + 1' because the Segment\n                # constructor takes an 'end-index' which is interpreted as one\n                # past the end.\n                new_segment = Segment(\n                    self.split_lines_of_utt, segment.start_index,\n                    index_to_split_at + 1, debug_str=self.debug_str)\n                new_segment.end_keep_proportion = 0.5\n                new_segments.append(new_segment)\n\n                new_segment = Segment(\n                    self.split_lines_of_utt, index_to_split_at,\n                    segment.end_index, debug_str=self.debug_str)\n                new_segment.start_keep_proportion = 0.5\n                new_segments.append(new_segment)\n\n            if len(segments) == len(new_segments):\n                # No splitting done\n                break\n            segments = new_segments\n\n            for i, x in enumerate(segments):\n                _global_logger.debug(\"Segment %d = %s\", i, x)\n\n        new_segments = []\n        # Split segments that are still very long\n        for segment in segments:\n            if segment.length() < hard_max_segment_length:\n                new_segments.append(segment)\n                continue\n\n            cur_start_index = segment.start_index\n            cur_start = segment.start_time()\n\n            index_to_split_at = None\n            try:\n                while True:\n                    index_to_split_at = next(\n                        (i for i in range(cur_start_index, segment.end_index)\n                         if (float(self.split_lines_of_utt[i][2])\n                             >= cur_start + max_segment_length)))\n\n                    new_segment = Segment(\n                        self.split_lines_of_utt, 
cur_start_index,\n                        index_to_split_at)\n                    new_segments.append(new_segment)\n\n                    cur_start_index = index_to_split_at\n                    cur_start = float(\n                        self.split_lines_of_utt[cur_start_index][2])\n                    index_to_split_at = None\n\n                    if (segment.end_time() - cur_start\n                            < hard_max_segment_length):\n                        raise StopIteration\n            except StopIteration:\n                if index_to_split_at is None:\n                    _global_logger.debug(\n                        \"Could not find an index in the range (%d, %d) with \"\n                        \"start time >= %.2f\", cur_start_index,\n                        segment.end_index, cur_start + max_segment_length)\n                # Add the remaining piece, then move on to the next segment\n                # (a 'break' here would drop all remaining segments).\n                new_segment = Segment(\n                    self.split_lines_of_utt, cur_start_index,\n                    segment.end_index)\n                new_segments.append(new_segment)\n        segments = new_segments\n        return segments\n\n    def possibly_truncate_boundaries(self, max_edge_silence_length,\n                                     max_edge_non_scored_length):\n        \"\"\"\n        This is stage 5 of segment processing.\n        It will truncate the silences and non-scored words at the segment\n        boundaries if they are longer than the --max-edge-silence-length and\n        --max-edge-non-scored-length respectively\n        (and to the extent that this wouldn't take us below the\n        --min-segment-length or --min-new-segment-length. 
See\n        relax_boundary_truncation()).\n\n        Note: --max-edge-silence-length and --max-edge-non-scored-length\n        can be set to very large values to avoid any truncation.\n        \"\"\"\n        for b in [True, False]:\n            if b:\n                this_index = self.start_index\n            else:\n                this_index = self.end_index - 1\n            this_split_line = self.split_lines_of_utt[this_index]\n            truncated_duration = None\n            this_duration = float(this_split_line[3])\n            this_edit = this_split_line[7]\n            this_ref_word = this_split_line[6]\n            if (this_edit == 'sil'\n                    and this_duration > max_edge_silence_length):\n                truncated_duration = max_edge_silence_length\n            elif (this_ref_word in non_scored_words()\n                  and this_duration > max_edge_non_scored_length):\n                truncated_duration = max_edge_non_scored_length\n            if truncated_duration is not None:\n                keep_proportion = truncated_duration / this_duration\n                if b:\n                    self.start_keep_proportion = keep_proportion\n                else:\n                    self.end_keep_proportion = keep_proportion\n\n    def relax_boundary_truncation(self, min_segment_length,\n                                  min_new_segment_length):\n        \"\"\"\n        This relaxes the segment-boundary truncation of\n        possibly_truncate_boundaries(), if it would take us below\n        min-new-segment-length or min-segment-length.\n\n        Note: this does not relax the boundary truncation for a particular\n        boundary (start or end) if that boundary corresponds to a 'tainted'\n        line of the ctm (because it's dangerous to include too much 'tainted'\n        audio).\n        \"\"\"\n        # this should be called before adding unk padding.\n        assert self.start_unk_padding == self.end_unk_padding == 0.0\n        if 
self.start_keep_proportion == self.end_keep_proportion == 1.0:\n            return  # nothing to do there was no truncation.\n        length_cutoff = max(min_new_segment_length, min_segment_length)\n        length_with_truncation = self.length()\n        if length_with_truncation >= length_cutoff:\n            return  # Nothing to do.\n        orig_start_keep_proportion = self.start_keep_proportion\n        orig_end_keep_proportion = self.end_keep_proportion\n        if not is_tainted(self.split_lines_of_utt[self.start_index]):\n            self.start_keep_proportion = 1.0\n        if not is_tainted(self.split_lines_of_utt[self.end_index - 1]):\n            self.end_keep_proportion = 1.0\n        length_with_relaxed_boundaries = self.length()\n        if length_with_relaxed_boundaries <= length_cutoff:\n            # Completely undo the truncation [to the extent allowed by the\n            # presence of tainted lines at the start/end] if, even without\n            # truncation, we'd be below the length cutoff.  
This segment may be\n            # removed later on (but it may not, if removing truncation makes us\n            # identical to the input utterance, and the length is between\n            # min_segment_length and min_new_segment_length).\n            return\n        # Next, compute an interpolation constant a such that the\n        # {start,end}_keep_proportion values will equal\n        # a\n        # * [values-computed-by-possibly_truncate_boundaries()]\n        # + (1-a) * [completely-relaxed-values].\n        # we're solving the equation:\n        # length_cutoff = a * length_with_truncation\n        #                 + (1-a) * length_with_relaxed_boundaries\n        # -> length_cutoff - length_with_relaxed_boundaries =\n        #        a * (length_with_truncation - length_with_relaxed_boundaries)\n        # -> a = (length_cutoff - length_with_relaxed_boundaries)\n        #        / (length_with_truncation - length_with_relaxed_boundaries)\n        a = ((length_cutoff - length_with_relaxed_boundaries)\n             / (length_with_truncation - length_with_relaxed_boundaries))\n        if a < 0.0 or a > 1.0:\n            # TODO(vimal): Should this be an error?\n            _global_logger.warning(\"bad 'a' value = %.4f\", a)\n            return\n        self.start_keep_proportion = (\n            a * orig_start_keep_proportion\n            + (1 - a) * self.start_keep_proportion)\n        self.end_keep_proportion = (\n            a * orig_end_keep_proportion + (1 - a) * self.end_keep_proportion)\n        if abs(self.length() - length_cutoff) >= 0.01:\n            # TODO(vimal): Should this be an error?\n            _global_logger.warning(\n                \"possible problem relaxing boundary \"\n                \"truncation, length is %.2f vs %.2f\", self.length(),\n                length_cutoff)\n\n    def possibly_add_unk_padding(self, max_unk_padding):\n        \"\"\"\n        This is stage 7 of segment processing.\n        This function may set start_unk_padding and end_unk_padding to 
nonzero\n        values.  This is done if the current boundary words are real, scored\n        words and we're not next to the beginning or end of the utterance.\n        \"\"\"\n        for b in [True, False]:\n            if b:\n                this_index = self.start_index\n            else:\n                this_index = self.end_index - 1\n            this_split_line = self.split_lines_of_utt[this_index]\n            this_start_time = float(this_split_line[2])\n            this_ref_word = this_split_line[6]\n            this_edit = this_split_line[7]\n            if this_edit == 'cor' and this_ref_word not in non_scored_words():\n                # we can consider adding unk-padding.\n                if b:   # start of segment.\n                    unk_padding = max_unk_padding\n                    # clamp when close to the beginning of the file\n                    if unk_padding > this_start_time:\n                        unk_padding = this_start_time\n                    # If we could add less than half of the specified\n                    # unk-padding, don't add any (because when we add\n                    # unk-padding we add the unknown-word symbol '<unk>', and\n                    # if there isn't enough space to traverse the HMM we don't\n                    # want to do it at all).\n                    if unk_padding < 0.5 * max_unk_padding:\n                        unk_padding = 0.0\n                    self.start_unk_padding = unk_padding\n                else:   # end of segment.\n                    this_end_time = this_start_time + float(this_split_line[3])\n                    last_line = self.split_lines_of_utt[-1]\n                    utterance_end_time = (float(last_line[2])\n                                          + float(last_line[3]))\n                    max_allowable_padding = utterance_end_time - this_end_time\n                    assert max_allowable_padding > -0.01\n                    unk_padding = max_unk_padding\n                    if 
unk_padding > max_allowable_padding:\n                        unk_padding = max_allowable_padding\n                    # If we could add less than half of the specified\n                    # unk-padding, don't add any (because when we add\n                    # unk-padding we add the unknown-word symbol '<unk>',\n                    # and if there isn't enough space to traverse the HMM we\n                    # don't want to do it at all).\n                    if unk_padding < 0.5 * max_unk_padding:\n                        unk_padding = 0.0\n                    self.end_unk_padding = unk_padding\n\n    def start_time(self):\n        \"\"\"Returns the start time of the segment (within the enclosing\n        utterance).\n        This is before any rounding.\n        \"\"\"\n        if self.start_index == len(self.split_lines_of_utt):\n            assert self.end_index == len(self.split_lines_of_utt)\n            return self.end_time()\n        first_line = self.split_lines_of_utt[self.start_index]\n        first_line_start = float(first_line[2])\n        first_line_duration = float(first_line[3])\n        first_line_end = first_line_start + first_line_duration\n        return (first_line_end - self.start_unk_padding\n                - (first_line_duration * self.start_keep_proportion))\n\n    def debug_info(self, include_stats=True):\n        \"\"\"Returns some string-valued information about 'this' that is useful\n        for debugging.\"\"\"\n        if include_stats and self.stats is not None:\n            stats = 'wer={wer:.2f},{stats},'.format(\n                wer=self.stats.wer(), stats=self.stats)\n        else:\n            stats = ''\n\n        return ('start={start:d},end={end:d},'\n                'unk-padding={start_unk_padding:.2f},{end_unk_padding:.2f},'\n                'keep-proportion={start_prop:.2f},{end_prop:.2f},'\n                'start-time={start_time:.2f},end-time={end_time:.2f},'\n                '{stats}'\n                
'debug-str={debug_str}'.format(\n                    start=self.start_index, end=self.end_index,\n                    start_unk_padding=self.start_unk_padding,\n                    end_unk_padding=self.end_unk_padding,\n                    start_prop=self.start_keep_proportion,\n                    end_prop=self.end_keep_proportion,\n                    start_time=self.start_time(), end_time=self.end_time(),\n                    stats=stats, debug_str=self.debug_str))\n\n    def end_time(self):\n        \"\"\"Returns the end time of the segment (within the enclosing\n        utterance).\"\"\"\n        if self.end_index == 0:\n            assert self.start_index == 0\n            return self.start_time()\n        last_line = self.split_lines_of_utt[self.end_index - 1]\n        last_line_start = float(last_line[2])\n        last_line_duration = float(last_line[3])\n        return (last_line_start\n                + (last_line_duration * self.end_keep_proportion)\n                + self.end_unk_padding)\n\n    def length(self):\n        \"\"\"Returns the segment length in seconds.\"\"\"\n        return self.end_time() - self.start_time()\n\n    def is_whole_utterance(self):\n        \"\"\"returns true if this segment corresponds to the whole utterance that\n        it's a part of (i.e. 
its start/end time are zero and the end time of\n        the utterance's last line, respectively).\"\"\"\n        last_line_of_utt = self.split_lines_of_utt[-1]\n        last_line_end_time = (float(last_line_of_utt[2])\n                              + float(last_line_of_utt[3]))\n        return (abs(self.start_time() - 0.0) < 0.001\n                and abs(self.end_time() - last_line_end_time) < 0.001)\n\n    def get_junk_proportion(self):\n        \"\"\"Returns the proportion of the duration of this segment that consists\n        of unk-padding and tainted lines of input (will be between 0.0 and\n        1.0).\"\"\"\n        # Note: only the first and last lines could possibly be tainted as\n        # that's how we create the segments; and if either or both are tainted\n        # the utterance must contain other lines, so double-counting is not a\n        # problem.\n        junk_duration = self.start_unk_padding + self.end_unk_padding\n        first_split_line = self.split_lines_of_utt[self.start_index]\n        if is_tainted(first_split_line):\n            first_duration = float(first_split_line[3])\n            junk_duration += first_duration * self.start_keep_proportion\n        last_split_line = self.split_lines_of_utt[self.end_index - 1]\n        if is_tainted(last_split_line):\n            last_duration = float(last_split_line[3])\n            junk_duration += last_duration * self.end_keep_proportion\n        return junk_duration / self.length()\n\n    def get_junk_duration(self):\n        \"\"\"Returns duration of junk\"\"\"\n        return self.get_junk_proportion() * self.length()\n\n    def merge_adjacent_segment(self, other):\n        \"\"\"\n        This function will merge the segment in 'other' with the segment\n        in 'self'.  It is only to be called when 'self' and 'other' are from\n        the same utterance, 'other' is after 'self' in time order (based on\n        the original segment cores), and\n        self.end_index <= other.start_index + 1,\n        i.e. 
the two segments might have at most one index in common,\n        which is usually a tainted word or silence.\n        \"\"\"\n        try:\n            assert self.end_index <= other.start_index + 1\n            assert self.start_time() < other.end_time()\n            assert self.split_lines_of_utt is other.split_lines_of_utt\n        except AssertionError:\n            _global_logger.error(\"self: %s\", self)\n            _global_logger.error(\"other: %s\", other)\n            raise\n\n        assert self.start_index == 0 or self.start_index != other.start_index\n\n        _global_logger.debug(\"Before merging: %s\", self)\n\n        assert not self.stats.compare(other.stats), \"%s %s\" % (self, other)\n        self.stats.combine(other.stats)\n\n        if self.end_index == other.start_index + 1:\n            overlapping_segment = Segment(\n                self.split_lines_of_utt, other.start_index,\n                self.end_index, compute_segment_stats=True)\n            self.stats.combine(overlapping_segment.stats, scale=-1)\n\n        _global_logger.debug(\"Other segment: %s\", other)\n\n        self.debug_str = \"({0}/merged-with-adjacent/{1})\".format(\n            self.debug_str, other.debug_str)\n\n        # everything that relates to the end of this segment gets copied\n        # from 'other'.\n        self.end_index = other.end_index\n        self.end_unk_padding = other.end_unk_padding\n        self.end_keep_proportion = other.end_keep_proportion\n\n        _global_logger.debug(\"After merging %s\", self)\n        return\n\n    def merge_with_segment(self, other, max_deleted_words):\n        \"\"\"\n        This function will merge the segment in 'other' with the segment\n        in 'self'.  
It is only to be called when 'self' and 'other' are from\n        the same utterance, 'other' is after 'self' in time order (based on\n        the original segment cores), and self.end_time() >= other.start_time().\n        Note: in this situation there will normally be deleted words\n        between the two segments.  What this program does with the deleted\n        words depends on '--max-deleted-words-kept-when-merging'.  If there\n        were any inserted words in the transcript (less likely), this\n        program will keep the reference.\n\n        Note: --max-deleted-words-kept-when-merging can be set to a very\n        large value to keep all the words.\n        \"\"\"\n        try:\n            assert self.end_time() >= other.start_time()\n            assert self.start_time() < other.end_time()\n            assert self.split_lines_of_utt is other.split_lines_of_utt\n        except AssertionError:\n            _global_logger.error(\"self: %s\", self)\n            _global_logger.error(\"other: %s\", other)\n            raise\n\n        assert self.start_index == 0 or self.start_index != other.start_index\n\n        _global_logger.debug(\"Before merging: %s\", self)\n\n        assert (not self.stats.compare(other.stats)\n                or self.start_time() != other.start_time()\n                or self.end_time() != other.end_time()\n                ), \"%s %s\" % (self, other)\n        self.stats.combine(other.stats)\n\n        _global_logger.debug(\"Other segment: %s\", other)\n\n        orig_self_end_index = self.end_index\n        self.debug_str = \"({0}/merged-with/{1})\".format(\n            self.debug_str, other.debug_str)\n\n        # everything that relates to the end of this segment gets copied\n        # from 'other'.\n        self.end_index = other.end_index\n        self.end_unk_padding = other.end_unk_padding\n        self.end_keep_proportion = other.end_keep_proportion\n\n        _global_logger.debug(\"After merging %s\", self)\n\n        # 
The next thing we have to do is to go over any lines of the ctm that\n        # appear between 'self' and 'other', or are shared between both (this\n        # would only happen for tainted silence or non-scored-word segments),\n        # and decide what to do with them.  We'll keep the reference for any\n        # substitutions or insertions (which anyway are unlikely to appear\n        # in these merged segments).  Note: most of this happens in\n        # self.text(), but at this point we need to decide whether to mark any\n        # deletions as 'do-not-include-in-text'.\n        try:\n            if orig_self_end_index <= other.start_index:\n                # No overlap in indexes\n                first_index_of_overlap = orig_self_end_index\n                last_index_of_overlap = other.start_index - 1\n                segment = Segment(\n                    self.split_lines_of_utt, orig_self_end_index,\n                    other.start_index, compute_segment_stats=True)\n                self.stats.combine(segment.stats)\n            else:\n                first_index_of_overlap = other.start_index\n                last_index_of_overlap = orig_self_end_index - 1\n\n            num_deleted_words = 0\n            for i in range(first_index_of_overlap, last_index_of_overlap + 1):\n                edit_type = self.split_lines_of_utt[i][7]\n                if edit_type == 'del':\n                    num_deleted_words += 1\n            if num_deleted_words > max_deleted_words:\n                for i in range(first_index_of_overlap,\n                               last_index_of_overlap + 1):\n                    if self.split_lines_of_utt[i][7] == 'del':\n                        self.split_lines_of_utt[i].append(\n                            'do-not-include-in-text')\n        except Exception:\n            _global_logger.error(\n                \"first-index-of-overlap = %d\", first_index_of_overlap)\n            _global_logger.error(\n                \"last-index-of-overlap = %d\", 
last_index_of_overlap)\n            _global_logger.error(\"line = %d = %s\", i,\n                                 self.split_lines_of_utt[i])\n            raise\n        _global_logger.debug(\"After merging %s\", self)\n\n    def contains_atleast_one_scored_non_oov_word(self):\n        \"\"\"\n        this will return true if there is at least one word in the utterance\n        that's a scored word (not a non-scored word) and not an OOV word that's\n        realized as unk.  This becomes a filter on keeping segments.\n        \"\"\"\n        for i in range(self.start_index, self.end_index):\n            this_split_line = self.split_lines_of_utt[i]\n            this_hyp_word = this_split_line[4]\n            this_ref_word = this_split_line[6]\n            this_edit = this_split_line[7]\n            if (this_edit == 'cor' and this_ref_word not in non_scored_words()\n                    and this_ref_word == this_hyp_word):\n                return True\n        return False\n\n    def text(self, oov_symbol, eps_symbol=\"<eps_symbol>\"):\n        \"\"\"Returns the text corresponding to this utterance, as a string.\"\"\"\n        text_array = []\n        if self.start_unk_padding != 0.0:\n            text_array.append(oov_symbol)\n        for i in range(self.start_index, self.end_index):\n            this_split_line = self.split_lines_of_utt[i]\n            this_ref_word = this_split_line[6]\n            if (this_ref_word != eps_symbol\n                    and this_split_line[-1] != 'do-not-include-in-text'):\n                text_array.append(this_ref_word)\n        if self.end_unk_padding != 0.0:\n            text_array.append(oov_symbol)\n        return ' '.join(text_array)\n\n\nclass SegmentsMerger(object):\n    \"\"\"This class contains methods for merging segments. 
It stores the\n    appropriate statistics required for this process in objects of the\n    SegmentStats class.\n\n    Parameters:\n        segments - a reference to the list of initial segments\n        merged_segments - stores all the initial segments as well\n                          as the newly created segments\n        between_segments - stores the inter-segment \"segments\"\n                           for the initial segments\n        split_lines_of_utt - a reference to the CTM lines\n    \"\"\"\n\n    def __init__(self, segments):\n        self.segments = segments\n\n        try:\n            self.split_lines_of_utt = segments[0].split_lines_of_utt\n        except IndexError as e:\n            _global_logger.error(\"No input segments found!\")\n            raise e\n\n        self.merged_segments = {}\n        self.between_segments = [None for i in range(len(segments) + 1)]\n\n        if segments[0].start_index > 0:\n            self.between_segments[0] = Segment(\n                self.split_lines_of_utt, 0, segments[0].start_index,\n                compute_segment_stats=True)\n\n        for i, x in enumerate(segments):\n            x.compute_stats()\n            self.merged_segments[(i, )] = x\n\n            if i > 0 and segments[i].start_index > segments[i - 1].end_index:\n                self.between_segments[i] = Segment(\n                    self.split_lines_of_utt, segments[i - 1].end_index,\n                    segments[i].start_index, compute_segment_stats=True)\n\n        if segments[-1].end_index < len(self.split_lines_of_utt):\n            self.between_segments[-1] = Segment(\n                self.split_lines_of_utt, segments[-1].end_index,\n                len(self.split_lines_of_utt), compute_segment_stats=True)\n\n    def _get_merged_cluster(self, cluster1, cluster2, rejected_clusters=None,\n                            max_intersegment_incorrect_words_length=1):\n        try:\n            assert cluster2[0] > cluster1[-1]\n            new_cluster = 
cluster1 + cluster2\n            new_cluster_tup = tuple(new_cluster)\n\n            if (rejected_clusters is not None\n                    and new_cluster_tup in rejected_clusters):\n                return (None, new_cluster, True)\n\n            if new_cluster_tup in self.merged_segments:\n                return (self.merged_segments[new_cluster_tup],\n                        new_cluster, False)\n\n            if cluster1[-1] == -1:\n                assert len(cluster1) == 1\n                # Consider merging cluster2 with the region before the 0^th\n                # segment\n                if (self.between_segments[0] is None\n                        or self.between_segments[0].stats.total_length == 0\n                        or (self.between_segments[0]\n                            .stats.incorrect_words_length\n                            > max_intersegment_incorrect_words_length)):\n                    # Reject zero length or bad start region\n                    return (None, new_cluster, True)\n                merged_segment = self.between_segments[0].copy()\n            else:\n                merged_segment = self.merged_segments[tuple(cluster1)].copy()\n\n                if cluster2[0] == len(self.segments):\n                    assert len(cluster2) == 1\n                    if (self.between_segments[-1] is None\n                            or (self.between_segments[-1]\n                                .stats.total_length == 0)\n                            or (self.between_segments[-1]\n                                .stats.incorrect_words_length\n                                > max_intersegment_incorrect_words_length)):\n                        # Reject zero length or bad end region\n                        return (None, new_cluster, True)\n                if self.between_segments[cluster2[0]] is not None:\n                    if (self.between_segments[cluster2[0]]\n                            .stats.incorrect_words_length\n                         
   > max_intersegment_incorrect_words_length):\n                        return (None, new_cluster, True)\n                    merged_segment.merge_adjacent_segment(\n                        self.between_segments[cluster2[0]])\n\n            if cluster2[0] < len(self.segments):\n                merged_segment.merge_adjacent_segment(\n                    self.merged_segments[tuple(cluster2)])\n            # else:\n            # Already done\n            # merged_segment.merge_adjacent_segment(self.between_segments[-1])\n\n            self.merged_segments[new_cluster_tup] = merged_segment\n            return (merged_segment, new_cluster, False)\n        except:\n            _global_logger.error(\"Failed merging cluster1 %s and cluster2 %s\",\n                                 cluster1, cluster2)\n            for i in (cluster1 + cluster2):\n                if i >= 0 and i < len(self.segments):\n                    _global_logger.error(\"Segment %d = %s\", i,\n                                         self.segments[i])\n            raise\n\n    def merge_clusters(self, scoring_function,\n                       max_wer=10, max_bad_proportion=0.3,\n                       max_segment_length=10,\n                       max_intersegment_incorrect_words_length=1):\n        for i, x in enumerate(self.segments):\n            _global_logger.debug(\"before agglomerative clustering, segment %d\"\n                                 \" = %s\", i, x)\n\n        # Initial clusters are the individual segments themselves.\n        clusters = [[x] for x in range(-1, len(self.segments) + 1)]\n\n        rejected_clusters = set()\n\n        while len(clusters) > 1:\n            try:\n                _global_logger.debug(\"Current clusters: %s\", clusters)\n\n                heap = []\n\n                for i in range(len(clusters) - 1):\n                    merged_segment, new_cluster, reject = (\n                        self._get_merged_cluster(\n                            clusters[i], 
clusters[i + 1], rejected_clusters,\n                            max_intersegment_incorrect_words_length=(\n                                max_intersegment_incorrect_words_length)))\n                    if reject:\n                        rejected_clusters.add(tuple(new_cluster))\n                        continue\n                    heapq.heappush(heap, ((-scoring_function(merged_segment), i),\n                                          (merged_segment, i, new_cluster)))\n\n                candidate_index = -1\n                candidate_cluster = None\n\n                while True:\n                    try:\n                        score, tup = heapq.heappop(heap)\n                    except IndexError:\n                        break\n\n                    segment, index, cluster = tup\n\n                    _global_logger.debug(\n                        \"Considering new cluster: (%d, %s)\", index, cluster)\n\n                    if segment.stats.wer() > max_wer:\n                        _global_logger.debug(\n                            \"Rejecting cluster with \"\n                            \"WER%% %.2f > %.2f\", segment.stats.wer(), max_wer)\n                        rejected_clusters.add(tuple(cluster))\n                        continue\n\n                    if segment.stats.bad_proportion() > max_bad_proportion:\n                        _global_logger.debug(\n                            \"Rejecting cluster with bad-proportion \"\n                            \"%.2f > %.2f\", segment.stats.bad_proportion(),\n                            max_bad_proportion)\n                        rejected_clusters.add(tuple(cluster))\n                        continue\n\n                    if segment.stats.total_length > max_segment_length:\n                        _global_logger.debug(\n                            \"Rejecting cluster with length \"\n                            \"%.2f > %.2f\", segment.stats.total_length,\n                            max_segment_length)\n     
                   rejected_clusters.add(tuple(cluster))\n                        continue\n\n                    candidate_index, candidate_cluster = tup[1:]\n                    _global_logger.debug(\"Accepted cluster (%d, %s)\",\n                                         candidate_index, candidate_cluster)\n                    break\n\n                if candidate_index == -1:\n                    return clusters\n\n                new_clusters = []\n\n                for i in range(candidate_index):\n                    new_clusters.append(clusters[i])\n                new_clusters.append(candidate_cluster)\n                for i in range(candidate_index + 2, len(clusters)):\n                    new_clusters.append(clusters[i])\n\n                if len(new_clusters) >= len(clusters):\n                    raise RuntimeError(\"Old: {0}; New: {1}\".format(\n                        clusters, new_clusters))\n                clusters = new_clusters\n            except Exception:\n                _global_logger.error(\n                    \"Failed merging clusters %s\", clusters)\n                raise\n\n        return clusters\n\n\ndef merge_segments(segments, args):\n    if len(segments) == 0:\n        _global_logger.debug(\"Got no segments at merging segments stage\")\n        return []\n\n    def scoring_function(segment):\n        stats = segment.stats\n        try:\n            return (-stats.wer() - args.silence_factor * stats.silence_length\n                    - args.incorrect_words_factor\n                    * stats.incorrect_words_length\n                    - args.tainted_words_factor\n                    * stats.num_tainted_words * 100.0 / stats.num_words)\n        except ZeroDivisionError:\n            return float(\"-inf\")\n\n    # Do agglomerative clustering on the initial segments with the score\n    # for combining neighboring segments being the scoring_function on the\n    # stats of the combined segment.\n    merger = SegmentsMerger(segments)\n  
  clusters = merger.merge_clusters(\n        scoring_function, max_wer=args.max_wer,\n        max_bad_proportion=args.max_bad_proportion,\n        max_segment_length=args.max_segment_length_for_merging,\n        max_intersegment_incorrect_words_length=(\n            args.max_intersegment_incorrect_words_length))\n\n    _global_logger.debug(\"Clusters to be merged: %s\", clusters)\n\n    # Do the actual merging based on the clusters.\n    new_segments = []\n    for cluster_index, cluster in enumerate(clusters):\n        _global_logger.debug(\n            \"Merging cluster (%d, %s)\", cluster_index, cluster)\n\n        try:\n            if cluster_index == 0 and len(cluster) == 1:\n                assert cluster[0] == -1\n                _global_logger.debug(\n                    \"Not adding region before the first segment\")\n                # skip adding the lines before the initial segment if they\n                # are not merged with the initial segment\n                continue\n            elif cluster_index == len(clusters) - 1 and len(cluster) == 1:\n                _global_logger.debug(\n                    \"Not adding remaining end region %s\",\n                    cluster[0])\n                assert cluster[0] == len(segments)\n                # skip adding the lines after the last segment if they\n                # are not merged with the last segment\n                break\n\n            new_segments.append(merger.merged_segments[tuple(cluster)])\n        except Exception:\n            _global_logger.error(\"Error with cluster (%d, %s)\",\n                                 cluster_index, cluster)\n            raise\n\n    segments = new_segments\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"after agglomerative clustering: segment %d = %s\", i, x)\n\n    assert len(segments) > 0\n    segment_index = 0\n    # Ignore all the initial segments that have WER >= max_wer\n    while segment_index < len(segments):\n        segment = 
segments[segment_index]\n        if segment.stats.wer() < args.max_wer:\n            break\n        segment_index += 1\n\n    if segment_index == len(segments):\n        _global_logger.debug(\"No merged segments were below \"\n                             \"WER%% %.2f\", args.max_wer)\n        return []\n\n    _global_logger.debug(\"Merging overlapping segments starting from the \"\n                         \"first segment with WER%% < max_wer i.e. %d = %s\",\n                         segment_index, segments[segment_index])\n\n    new_segments = [segments[segment_index]]\n    segment_index += 1\n    while segment_index < len(segments):\n        if segments[segment_index].stats.wer() > args.max_wer:\n            # ignore this segment\n            segment_index += 1\n            continue\n        if new_segments[-1].end_time() >= segments[segment_index].start_time():\n            new_segments[-1].merge_with_segment(\n                segments[segment_index], args.max_deleted_words)\n        else:\n            new_segments.append(segments[segment_index])\n        segment_index += 1\n    segments = new_segments\n\n    return segments\n\n\ndef get_segments_for_utterance(split_lines_of_utt, args, utterance_stats):\n    \"\"\"\n    This function creates the segments for an utterance as a list\n    of class Segment.\n    It returns a 2-tuple (list-of-segments, list-of-deleted-segments)\n    where the deleted segments are only useful for diagnostic printing.\n    Note: split_lines_of_utt is a list of lists, one per line, each containing\n    the sequence of fields.\n    \"\"\"\n    utterance_stats.num_utterances += 1\n\n    segment_ranges = compute_segment_cores(split_lines_of_utt)\n\n    utterance_end_time = (float(split_lines_of_utt[-1][2])\n                          + float(split_lines_of_utt[-1][3]))\n    utterance_stats.total_length_of_utterances += utterance_end_time\n\n    segments = [Segment(split_lines_of_utt, x[0], x[1])\n                for x in 
segment_ranges]\n\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  0 [segment cores]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\"stage 0: segment %d = %s\", i, x)\n\n    if args.verbose > 4:\n        print(\"Stage 0 [segment cores]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    for segment in segments:\n        segment.possibly_add_tainted_lines()\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  1 [add tainted lines]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\"stage 1: segment %d = %s\", i, x)\n\n    if args.verbose > 4:\n        print(\"Stage 1 [add tainted lines]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    segments = merge_segments(segments, args)\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  2 [merge segments]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\"stage 2: segment %d = %s\", i, x)\n\n    if args.verbose > 4:\n        print(\"Stage 2 [merge segments]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    new_segments = []\n    for s in segments:\n        new_segments += s.possibly_split_segment(\n            args.max_internal_silence_length,\n            args.max_internal_non_scored_length)\n    segments = new_segments\n    
utterance_stats.accumulate_segment_stats(\n        segments, 'stage  3 [split segments]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 3: segment %d, %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 3 [split segments]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    new_segments = []\n    for s in segments:\n        new_segments += s.possibly_split_long_segment(\n            args.max_segment_length_for_splitting,\n            args.hard_max_segment_length,\n            args.min_silence_length_to_split,\n            args.min_non_scored_length_to_split)\n    segments = new_segments\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  4 [split long segments]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 4: segment %d, %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 4 [split long segments]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    for s in segments:\n        s.possibly_truncate_boundaries(args.max_edge_silence_length,\n                                       args.max_edge_non_scored_length)\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  5 [truncate boundaries]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 5: segment %d = %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 5 [truncate boundaries]:\", file=sys.stderr)\n        segments_copy = [x.copy() for 
x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    for s in segments:\n        s.relax_boundary_truncation(args.min_segment_length,\n                                    args.min_new_segment_length)\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  6 [relax boundary truncation]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 6: segment %d = %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 6 [relax boundary truncation]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    for s in segments:\n        s.possibly_add_unk_padding(args.unk_padding)\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  7 [unk-padding]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 7: segment %d = %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 7 [unk-padding]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    deleted_segments = []\n    new_segments = []\n    for s in segments:\n        # the 0.999 allows for roundoff error.\n        if (not s.is_whole_utterance()\n                and s.length() < 0.999 * args.min_new_segment_length):\n            s.debug_str += '[deleted-because-of--min-new-segment-length]'\n            deleted_segments.append(s)\n        else:\n            new_segments.append(s)\n    segments = 
new_segments\n    utterance_stats.accumulate_segment_stats(\n        segments,\n        'stage  8 [remove new segments under --min-new-segment-length')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 8: segment %d = %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 8 [remove new segments under \"\n              \"--min-new-segment-length]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    new_segments = []\n    for s in segments:\n        # the 0.999 allows for roundoff error.\n        if s.length() < 0.999 * args.min_segment_length:\n            s.debug_str += '[deleted-because-of--min-segment-length]'\n            deleted_segments.append(s)\n        else:\n            new_segments.append(s)\n    segments = new_segments\n    utterance_stats.accumulate_segment_stats(\n        segments, 'stage  9 [remove segments under --min-segment-length]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 9: segment %d = %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 9 [remove segments under \"\n              \"--min-segment-length]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    new_segments = []\n    for s in segments:\n        if s.contains_atleast_one_scored_non_oov_word():\n            new_segments.append(s)\n        else:\n            s.debug_str += '[deleted-because-no-scored-non-oov-words]'\n            deleted_segments.append(s)\n    segments = new_segments\n    
utterance_stats.accumulate_segment_stats(\n        segments, 'stage 10 [remove segments without scored,non-OOV words]')\n\n    for i, x in enumerate(segments):\n        _global_logger.debug(\n            \"stage 10: segment %d = %s\", i, x.debug_info(False))\n\n    if args.verbose > 4:\n        print(\"Stage 10 [remove segments without scored, non-OOV words\"\n              \"]:\", file=sys.stderr)\n        segments_copy = [x.copy() for x in segments]\n        print_debug_info_for_utterance(sys.stderr,\n                                       copy.deepcopy(split_lines_of_utt),\n                                       segments_copy, [])\n\n    for i in range(len(segments) - 1):\n        if segments[i].end_time() > segments[i + 1].start_time():\n            # this just adds something to --ctm-edits-out output\n            segments[i + 1].debug_str += \",overlaps-previous-segment\"\n\n    if len(segments) == 0:\n        utterance_stats.num_utterances_without_segments += 1\n\n    return (segments, deleted_segments)\n\n\ndef float_to_string(f):\n    \"\"\" this prints a number with a certain number of digits after the point,\n    while removing trailing zeros.\n    \"\"\"\n    num_digits = 6  # we want to print 6 digits after the decimal point\n    g = f\n    while abs(g) > 1.0:\n        g *= 0.1\n        num_digits += 1\n    format_str = '%.{0}g'.format(num_digits)\n    return format_str % f\n\n\ndef time_to_string(time, frame_length):\n    \"\"\" Gives time in string form as an exact multiple of the frame-length,\n    e.g. 0.01 (after rounding).\n    \"\"\"\n    n = round(time / frame_length)\n    assert n >= 0\n    # The next function call will remove trailing zeros while printing it, so\n    # that e.g. 0.01 will be printed as 0.01 and not 0.0099999999999999.
It\n    # seems that doing this in a simple way is not really possible (at least,\n    # not without assuming that frame_length is of the form 10^-n, which we\n    # don't really want to do).\n    return float_to_string(n * frame_length)\n\n\ndef write_segments_for_utterance(text_output_handle, segments_output_handle,\n                                 old_utterance_name, segments, oov_symbol,\n                                 eps_symbol=\"<eps>\", frame_length=0.01):\n    num_digits = len(str(len(segments)))\n    for n, segment in enumerate(segments):\n        # split utterances will be named foo-bar-1 foo-bar-2, etc.\n        new_utterance_name = \"{old}-{index:0{width}}\".format(\n                                 old=old_utterance_name, index=n+1,\n                                 width=num_digits)\n        # print a line to the text output of the form like\n        # <new-utterance-id> <text>\n        # like:\n        # foo-bar-1 hello this is dan\n        print(new_utterance_name, segment.text(oov_symbol, eps_symbol),\n              file=text_output_handle)\n        # print a line to the segments output of the form\n        # <new-utterance-id> <old-utterance-id> <start-time> <end-time>\n        # like:\n        # foo-bar-1 foo-bar 5.1 7.2\n        print(new_utterance_name, old_utterance_name,\n              time_to_string(segment.start_time(), frame_length),\n              time_to_string(segment.end_time(), frame_length),\n              file=segments_output_handle)\n\n\n# Note, this is destrutive of 'segments_for_utterance', but it won't matter.\ndef print_debug_info_for_utterance(ctm_edits_out_handle,\n                                   split_lines_of_cur_utterance,\n                                   segments_for_utterance,\n                                   deleted_segments_for_utterance,\n                                   frame_length=0.01):\n    # info_to_print will be list of 2-tuples\n    # (time, 'start-segment-n'|'end-segment-n')\n    # representing 
the start or end times of segments.\n    info_to_print = []\n    for n, segment in enumerate(segments_for_utterance):\n        start_string = 'start-segment-{0}[{1}]'.format(n + 1,\n                                                       segment.debug_info())\n        info_to_print.append((segment.start_time(), start_string))\n        end_string = 'end-segment-{0}'.format(n + 1)\n        info_to_print.append((segment.end_time(), end_string))\n    # for segments that were deleted we print info like\n    # start-deleted-segment-1, and otherwise similar info to segments that were\n    # retained.\n    for n, segment in enumerate(deleted_segments_for_utterance):\n        start_string = 'start-deleted-segment-{0}[{1}]'.format(\n            n + 1, segment.debug_info(False))\n        info_to_print.append((segment.start_time(), start_string))\n        end_string = 'end-deleted-segment-{0}'.format(n + 1)\n        info_to_print.append((segment.end_time(), end_string))\n\n    info_to_print = sorted(info_to_print)\n\n    for i, split_line in enumerate(split_lines_of_cur_utterance):\n        # add an index like [0], [1], to the utterance-id so we can easily look\n        # up segment indexes.\n        split_line[0] += '[{0}]'.format(i)\n        start_time = float(split_line[2])\n        end_time = start_time + float(split_line[3])\n        split_line_copy = list(split_line)\n        while len(info_to_print) > 0 and info_to_print[0][0] <= end_time:\n            (segment_start, string) = info_to_print[0]\n            # shift the first element off of info_to_print.\n            info_to_print = info_to_print[1:]\n            # add a field like 'start-segment1[...]=3.21' to what we're about\n            # to print.\n            split_line_copy.append(\n                '{0}={1}'.format(string,\n                                 time_to_string(segment_start, frame_length)))\n        print(' '.join(split_line_copy), file=ctm_edits_out_handle)\n\n\nclass WordStats(object):\n    \"\"\"\n   
 This accumulates word-level stats about, for each reference word, with\n    what probability it will end up in the core of a segment.  Words with\n    low probabilities of being in segments will generally be associated\n    with some kind of error (there is a higher probability of having a\n    wrong lexicon entry).\n    \"\"\"\n    def __init__(self):\n        self.word_count_pair = defaultdict(lambda: [0, 0])\n\n    def accumulate_for_utterance(self, split_lines_of_utt,\n                                 segments_for_utterance,\n                                 eps_symbol=\"<eps>\"):\n        # word_count_pair is a map from a string (the word) to\n        # a list [total-count, count-not-within-segments]\n        line_is_in_segment = [False] * len(split_lines_of_utt)\n        for segment in segments_for_utterance:\n            for i in range(segment.start_index, segment.end_index):\n                line_is_in_segment[i] = True\n        for i, split_line in enumerate(split_lines_of_utt):\n            this_ref_word = split_line[6]\n            if this_ref_word != eps_symbol:\n                self.word_count_pair[this_ref_word][0] += 1\n                if not line_is_in_segment[i]:\n                    self.word_count_pair[this_ref_word][1] += 1\n\n    def print(self, word_stats_out):\n        # Sort from most to least problematic.  We want to give more prominence\n        # to words that are most frequently not in segments, but also to\n        # high-count words.  
Define badness = pair[1] / pair[0], and\n        # total_count = pair[0], where 'pair' is a value of word_count_pair.\n        # We'll reverse sort on badness^3 * total_count = pair[1]^3 /\n        # pair[0]^2.\n        for key, pair in sorted(\n                self.word_count_pair.items(),\n                key=lambda item: (item[1][1] ** 3) * 1.0 / (item[1][0] ** 2),\n                reverse=True):\n            badness = pair[1] * 1.0 / pair[0]\n            total_count = pair[0]\n            print(key, badness, total_count, file=word_stats_out)\n        try:\n            word_stats_out.close()\n        except:\n            _global_logger.error(\"error closing file --word-stats-out=%s \"\n                                 \"(full disk?)\", word_stats_out.name)\n            raise\n\n        _global_logger.info(\n            \"\"\"please see the file %s for word-level\n            statistics saying how frequently each word was excluded for a\n            segment; format is <word> <proportion-of-time-excluded>\n            <total-count>.  
Particularly problematic words appear near the top\n            of the file.\"\"\", word_stats_out.name)\n\n\ndef process_data(args, oov_symbol, utterance_stats, word_stats):\n    \"\"\"\n    Most of what we're doing in the lines below is splitting the input lines\n    and grouping them per utterance, before giving them to\n    get_segments_for_utterance() and then printing the modified lines.\n    \"\"\"\n    first_line = args.ctm_edits_in.readline()\n    if first_line == '':\n        sys.exit(\"segment_ctm_edits.py: empty input\")\n    split_pending_line = first_line.split()\n    if len(split_pending_line) == 0:\n        sys.exit(\"segment_ctm_edits.py: bad input line \" + first_line)\n    cur_utterance = split_pending_line[0]\n    split_lines_of_cur_utterance = []\n\n    while True:\n        try:\n            if (len(split_pending_line) == 0\n                    or split_pending_line[0] != cur_utterance):\n                # Read one whole utterance. Now process it.\n                (segments_for_utterance,\n                 deleted_segments_for_utterance) = get_segments_for_utterance(\n                     split_lines_of_cur_utterance, args=args,\n                     utterance_stats=utterance_stats)\n                word_stats.accumulate_for_utterance(\n                    split_lines_of_cur_utterance, segments_for_utterance)\n                write_segments_for_utterance(\n                    args.text_out, args.segments_out, cur_utterance,\n                    segments_for_utterance, oov_symbol=oov_symbol,\n                    frame_length=args.frame_length)\n                if args.ctm_edits_out is not None:\n                    print_debug_info_for_utterance(\n                        args.ctm_edits_out, split_lines_of_cur_utterance,\n                        segments_for_utterance, deleted_segments_for_utterance,\n                        frame_length=args.frame_length)\n\n                split_lines_of_cur_utterance = []\n                if 
len(split_pending_line) == 0:\n                    break\n                else:\n                    cur_utterance = split_pending_line[0]\n\n            split_lines_of_cur_utterance.append(split_pending_line)\n            next_line = args.ctm_edits_in.readline()\n            split_pending_line = next_line.split()\n            if len(split_pending_line) == 0:\n                if next_line != '':\n                    sys.exit(\"segment_ctm_edits.py: got an \"\n                             \"empty or whitespace input line\")\n        except Exception:\n            _global_logger.error(\n                \"Error with utterance %s\", cur_utterance)\n            raise\n\n\ndef read_non_scored_words(non_scored_words_file):\n    for line in non_scored_words_file.readlines():\n        parts = line.split()\n        if not len(parts) == 1:\n            raise RuntimeError(\n                \"segment_ctm_edits.py: bad line in non-scored-words \"\n                \"file {0}: {1}\".format(non_scored_words_file, line))\n        _global_non_scored_words.add(parts[0])\n    non_scored_words_file.close()\n\n\nclass UtteranceStats(object):\n\n    def __init__(self):\n        # segment_total_length and num_segments are maps from\n        # 'stage' strings; see accumulate_segment_stats for details.\n        self.segment_total_length = defaultdict(int)\n        self.num_segments = defaultdict(int)\n        # the lambda expression below is an anonymous function that takes no\n        # arguments and returns the new list [0, 0].\n        self.num_utterances = 0\n        self.num_utterances_without_segments = 0\n        self.total_length_of_utterances = 0\n\n    def accumulate_segment_stats(self, segment_list, text):\n        \"\"\"\n        Here, 'text' will be something that indicates the stage of processing,\n        e.g. 
'Stage 0: segment cores', 'Stage 1: add tainted lines', etc.\n        \"\"\"\n        for segment in segment_list:\n            self.num_segments[text] += 1\n            self.segment_total_length[text] += segment.length()\n\n    def print_segment_stats(self):\n        _global_logger.info(\n            \"\"\"Number of utterances is %d, of which %.2f%% had no segments\n            after all processing; total length of data in original utterances\n            (in seconds) was %d\"\"\",\n            self.num_utterances,\n            (self.num_utterances_without_segments * 100.0\n             / self.num_utterances),\n            self.total_length_of_utterances)\n\n        keys = sorted(self.segment_total_length.keys())\n        for i, key in enumerate(keys):\n            if i > 0:\n                delta_percentage = '[%+.2f%%]' % (\n                    (self.segment_total_length[key]\n                     - self.segment_total_length[keys[i - 1]])\n                    * 100.0 / self.total_length_of_utterances)\n            _global_logger.info(\n                'At %s, num-segments is %d, total length %.2f%% of '\n                'original total %s',\n                key, self.num_segments[key],\n                (self.segment_total_length[key]\n                 * 100.0 / self.total_length_of_utterances),\n                delta_percentage if i > 0 else '')\n\n\ndef main():\n    args = get_args()\n\n    try:\n        global _global_non_scored_words\n        _global_non_scored_words = set()\n        read_non_scored_words(args.non_scored_words_in)\n\n        oov_symbol = None\n        if args.oov_symbol_file is not None:\n            try:\n                line = args.oov_symbol_file.readline()\n                assert len(line.split()) == 1\n                oov_symbol = line.split()[0]\n                assert args.oov_symbol_file.readline() == ''\n                args.oov_symbol_file.close()\n            except Exception:\n                _global_logger.error(\"error reading 
file \"\n                                     \"--oov-symbol-file=%s\",\n                                     args.oov_symbol_file.name)\n                raise\n        elif args.unk_padding != 0.0:\n            raise ValueError(\n                \"if the --unk-padding option is nonzero (which \"\n                \"it is by default, \"\n                \"the --oov-symbol-file option must be supplied.\")\n\n        utterance_stats = UtteranceStats()\n        word_stats = WordStats()\n        process_data(args,\n                     oov_symbol=oov_symbol, utterance_stats=utterance_stats,\n                     word_stats=word_stats)\n\n        try:\n            args.text_out.close()\n            args.segments_out.close()\n            if args.ctm_edits_out is not None:\n                args.ctm_edits_out.close()\n        except:\n            _global_logger.error(\"error closing one or more outputs \"\n                                 \"(broken pipe or full disk?)\")\n            raise\n\n        utterance_stats.print_segment_stats()\n        if args.word_stats_out is not None:\n            word_stats.print(args.word_stats_out)\n        if args.ctm_edits_out is not None:\n            _global_logger.info(\"detailed utterance-level debug information \"\n                                \"is in %s\", args.ctm_edits_out.name)\n    except:\n        _global_logger.error(\"Failed segmenting CTM edits\")\n        raise\n    finally:\n        try:\n            args.text_out.close()\n            args.segments_out.close()\n            if args.ctm_edits_out is not None:\n                args.ctm_edits_out.close()\n        except:\n            _global_logger.error(\"error closing one or more outputs \"\n                                 \"(broken pipe or full disk?)\")\n            raise\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/split_text_into_docs.pl",
    "content": "#! /usr/bin/perl\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0.\n\n# If 'text' contains:\n#  utterance1 A B C D\n#  utterance2 C B\n#  and you ran:\n#  split_text_into_docs.pl --max-words 2 text doc2text docs\n#  then 'doc2text' would contain:\n#  utterance1-0 utterance1\n#  utterance1-1 utterance1\n#  utterance2 utterance2\n#  and 'docs' would contain:\n#  utterance1-0 A B\n#  utterance1-1 C D\n#  utterance2 C B\n#  (utterances with at most --max-words words keep their original id;\n#  longer ones are split into documents numbered from 0).\n\nuse warnings;\nuse strict;\n\nmy $max_words = 1000;\n\nmy $usage = \"Usage: steps/cleanup/internal/split_text_into_docs.pl [--max-words <int>] text doc2text docs\\n\";\n\nwhile (@ARGV > 3) {\n    if ($ARGV[0] eq \"--max-words\") {\n        shift @ARGV;\n        $max_words = shift @ARGV;\n    } else {\n        print STDERR \"$usage\";\n        exit (1);\n    }\n}\n\nif (scalar @ARGV != 3) {\n  print STDERR \"$usage\";\n  exit (1);\n}\n\nsub min ($$) { $_[$_[0] > $_[1]] }\n\nopen TEXT, $ARGV[0] or die \"$0: Could not open file $ARGV[0] for reading\\n\";\nopen DOC2TEXT, \">\", $ARGV[1] or die \"$0: Could not open file $ARGV[1] for writing\\n\";\nopen DOCS, \">\", $ARGV[2] or die \"$0: Could not open file $ARGV[2] for writing\\n\";\n\nwhile (<TEXT>) {\n  chomp;\n  my @F = split;\n  my $utt = shift @F;\n  my $num_words = scalar @F;\n\n  if ($num_words  <= $max_words) {\n    print DOCS \"$_\\n\";\n    print DOC2TEXT \"$utt $utt\\n\";\n    next;\n  }\n\n  my $num_docs = int($num_words / $max_words) + 1;\n  my $num_words_shift = int($num_words / $num_docs) + 1;\n  my $words_per_doc = $num_words_shift;\n\n  #print STDERR (\"$utt num-words=$num_words num-docs=$num_docs words-per-doc=$words_per_doc\\n\");\n  \n  for (my $i = 0; $i < $num_docs; $i++) {\n    my $st = $i*$num_words_shift;\n    # Skip the empty trailing document that would otherwise be written,\n    # e.g. when $num_words is an exact multiple of $max_words.\n    last if $st >= $num_words;\n    my $end = min($st + $words_per_doc, $num_words) - 1;\n    print DOCS (\"$utt-$i \" . join(\" \", @F[$st..$end]) . \"\\n\");\n    print DOC2TEXT \"$utt-$i $utt\\n\";\n  }\n}\n"
  },
  {
    "path": "egs/steps/cleanup/internal/stitch_documents.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\"This script reads an archive of mappings from query to\ndocuments and stitches the documents for each query into a\nnew document.\nHere \"document\" is just a list of words.\n\nquery2docs is a mapping from query-id to a list of tuples\n(document-id, start-fraction, end-fraction)\nThe tuple can be just the document-id, which is equivalent to\nspecifying a start-fraction and end-fraction of 1.0.\nThe start and end fractions are used to stitch only a part of the\ndocument to the retrieved set for the query.\n\ne.g.\nquery1 doc1 doc2\nquery2 doc1,0,0.3 doc2,1,1\n\ninput-documents\ndoc1 A B C\ndoc2 D E\noutput-documents\nquery1 A B C D E\nquery2 C D E\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\n\nlogger = logging.getLogger(__name__)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\n\nfor l in [logger, logging.getLogger('libs')]:\n    l.setLevel(logging.DEBUG)\n    l.addHandler(handler)\n\n\ndef get_args():\n    \"\"\"Returns arguments parsed from command-line.\"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script reads an archive of mappings from query to\n        documents and stitches the documents for each query into a new\n        document.\"\"\")\n\n    parser.add_argument(\"--query2docs\", type=argparse.FileType('r'),\n                        required=True,\n                        help=\"\"\"Input file containing an archive\n                        of list of documents indexed by a query document\n                        id.\"\"\")\n    parser.add_argument(\"--input-documents\", type=argparse.FileType('r'),\n                        required=True,\n                        help=\"\"\"Input file
containing the documents\n                        indexed by the document id.\"\"\")\n    parser.add_argument(\"--output-documents\", type=argparse.FileType('w'),\n                        required=True,\n                        help=\"\"\"Output documents indexed by the query\n                        document-id, obtained by stitching input documents\n                        corresponding to the query.\"\"\")\n    parser.add_argument(\"--check-sorted-docs-per-query\", type=str,\n                        choices=[\"true\", \"false\"], default=\"false\",\n                        help=\"If specified, the script will expect \"\n                        \"the document ids in --query2docs to be \"\n                        \"sorted.\")\n\n    args = parser.parse_args()\n\n    args.check_sorted_docs_per_query = bool(\n        args.check_sorted_docs_per_query == \"true\")\n\n    return args\n\n\ndef run(args):\n    documents = {}\n    for line in args.input_documents:\n        parts = line.strip().split()\n        key = parts[0]\n        documents[key] = parts[1:]\n    args.input_documents.close()\n\n    for line in args.query2docs:\n        try:\n            parts = line.strip().split()\n            query = parts[0]\n            document_infos = parts[1:]\n\n            output_document = []\n            prev_doc_id = ''\n            for doc_info in document_infos:\n                try:\n                    doc_id, start_fraction, end_fraction = doc_info.split(',')\n                    start_fraction = float(start_fraction)\n                    end_fraction = float(end_fraction)\n                except ValueError:\n                    doc_id = doc_info\n                    start_fraction = 1.0\n                    end_fraction = 1.0\n\n                if args.check_sorted_docs_per_query:\n                    if prev_doc_id != '':\n                        if doc_id <= prev_doc_id:\n                            raise RuntimeError(\n                                \"Documents not 
sorted and \"\n                                \"--check-sorted-docs-per-query was True; \"\n                                \"{0} <= {1}\".format(doc_id, prev_doc_id))\n                    prev_doc_id = doc_id\n\n                doc = documents[doc_id]\n                num_words = len(doc)\n\n                if start_fraction == 1.0 or end_fraction == 1.0:\n                    # Either fraction being 1.0 means \"use the whole\n                    # document\", so both are expected to be 1.0.\n                    assert start_fraction == end_fraction\n                    output_document.extend(doc)\n                else:\n                    assert (start_fraction + end_fraction < 1.0)\n                    if start_fraction > 0:\n                        output_document.extend(\n                            doc[0:int(start_fraction * num_words)])\n                    if end_fraction > 0:\n                        output_document.extend(\n                            doc[int(end_fraction * num_words):])\n\n            print(\"{0} {1}\".format(query, \" \".join(output_document)),\n                  file=args.output_documents)\n        except Exception:\n            logger.error(\"Error processing line %s in file %s\", line,\n                         args.query2docs.name)\n            raise\n\n\ndef main():\n    args = get_args()\n\n    try:\n        run(args)\n    except:\n        logger.error(\"Failed to stitch documents; got error \",\n                     exc_info=True)\n        raise SystemExit(1)\n    finally:\n        for f in [args.query2docs, args.input_documents,\n                  args.output_documents]:\n            if f is not None:\n                f.close()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/taint_ctm_edits.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016   Vimal Manohar\n#           2016   Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys, operator, argparse, os\nfrom collections import defaultdict\n\nimport io\nsys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding=\"utf8\")\n\n\n# This script reads and writes the 'ctm-edits' file that is\n# produced by get_ctm_edits.py.\n#\n# It is to be applied after modify_ctm_edits.py.  Its function is to add, in\n# certain circumstances, an optional extra field with the word 'tainted' to the\n# ctm-edits format, e.g an input line like:\n#\n# AJJacobs_2007P-0001605-0003029 1 0 0.09 <eps> 1.0 <eps> sil\n# might become:\n# AJJacobs_2007P-0001605-0003029 1 0 0.09 <eps> 1.0 <eps> sil tainted\n#\n# It also deletes certain lines, representing deletions, from the ctm (if they\n# were next to taintable lines... their presence could then be inferred from the\n# 'tainted' flag).\n#\n# You should interpret the 'tainted' flag as \"we're not sure what's going on here;\n# don't trust this.\"\n#\n# One of the problem this script is trying to solve is that if we have errors\n# that are adjacent to silence or non-scored words\n# it's not at all clear whether the silence or non-scored words were really such,\n# or might have contained actual words.\n# Also, if we have words in the reference that were realized as '<unk>' in the\n# hypothesis, and they are adjacent to errors, it's almost always the case\n# that the '<unk>' doesn't really correspond to the word in the reference, so\n# we mark these as 'tainted'.\n#\n# The rule for tainting is quite simple; see the code.\n\n\n\nparser = argparse.ArgumentParser(\n    description = \"This program modifies the ctm-edits format to identify \"\n    \"silence and 'fixed' non-scored-word lines, and lines where the hyp is \"\n    \"<unk> and the reference is a real but OOV word, where there is a relatively \"\n    \"high 
probability that something is going wrong so we shouldn't trust \"\n    \"this line.  It adds the field 'tainted' to such \"\n    \"lines.  Lines in the ctm representing deletions from the reference will \"\n    \"be removed if they have 'tainted' adjacent lines (since it won't be clear \"\n    \"where such reference words were really realized, if at all). \"\n    \"See comments at the top of the script for more information.\")\n\nparser.add_argument(\"--verbose\", type = int, default = 1,\n                    choices=[0,1,2,3],\n                    help = \"Verbose level, higher = more verbose output\")\nparser.add_argument(\"--remove-deletions\", type=str, default=\"true\",\n                    choices=[\"true\", \"false\"],\n                    help = \"Remove deletions next to taintable lines\")\nparser.add_argument(\"ctm_edits_in\", metavar = \"<ctm-edits-in>\",\n                    help = \"Filename of input ctm-edits file. \"\n                    \"Use /dev/stdin for standard input.\")\nparser.add_argument(\"ctm_edits_out\", metavar = \"<ctm-edits-out>\",\n                    help = \"Filename of output ctm-edits file. \"\n                    \"Use /dev/stdout for standard output.\")\n\nargs = parser.parse_args()\nargs.remove_deletions = bool(args.remove_deletions == \"true\")\n\n\n\n# This function is the core of the program, that does the tainting and\n# removes some lines representing deletions.\n# split_lines_of_utt is a list of lists, one per line, each containing the\n# sequence of fields.  Returns the same format of data after processing to add\n# the 'tainted' field.  Note: this function is destructive of its input; the\n# input will not have the same value afterwards.\ndef ProcessUtterance(split_lines_of_utt, remove_deletions=True):\n    global num_lines_of_type, num_tainted_lines, \\\n           num_del_lines_giving_taint, num_sub_lines_giving_taint, \\\n           num_ins_lines_giving_taint\n\n    # work out whether each line is taintable [i.e. 
silence or fix or unk replacing\n    # real-word].\n    taintable = [ False ] * len(split_lines_of_utt)\n    for i in range(len(split_lines_of_utt)):\n        edit_type = split_lines_of_utt[i][7]\n        if edit_type == 'sil' or edit_type == 'fix':\n            taintable[i] = True\n        elif edit_type == 'cor' and split_lines_of_utt[i][4] != split_lines_of_utt[i][6]:\n            # this is the case when <unk> replaces a real word that was out of\n            # the vocabulary; we mark it as correct because such words do\n            # translate to <unk> if we don't have a pronunciations.  However we\n            # don't have good confidence that the alignments of such words are\n            # accurate if they are adjacent to errors.\n            taintable[i] = True\n\n\n    for i in range(len(split_lines_of_utt)):\n        edit_type = split_lines_of_utt[i][7]\n        num_lines_of_type[edit_type] += 1\n        if edit_type == 'del' or edit_type == 'sub' or edit_type == 'ins':\n            tainted_an_adjacent_line = False\n            # First go backwards tainting lines\n            j = i - 1\n            while j >= 0 and taintable[j]:\n                tainted_an_adjacent_line = True\n                if len(split_lines_of_utt[j]) == 8:\n                    num_tainted_lines += 1\n                    split_lines_of_utt[j].append('tainted')\n                j -= 1\n            # Next go forwards tainting lines\n            j = i + 1\n            while j < len(split_lines_of_utt) and taintable[j]:\n                tainted_an_adjacent_line = True\n                if len(split_lines_of_utt[j]) == 8:\n                    num_tainted_lines += 1\n                    split_lines_of_utt[j].append('tainted')\n                j += 1\n            if tainted_an_adjacent_line:\n                if edit_type == 'del':\n                    if remove_deletions:\n                        split_lines_of_utt[i][7] = 'remove-this-line'\n                    num_del_lines_giving_taint += 
1\n                elif edit_type == 'sub':\n                    num_sub_lines_giving_taint += 1\n                else:\n                    num_ins_lines_giving_taint += 1\n\n    new_split_lines_of_utt = []\n    for i in range(len(split_lines_of_utt)):\n        if (not remove_deletions\n                or split_lines_of_utt[i][7] != 'remove-this-line'):\n            new_split_lines_of_utt.append(split_lines_of_utt[i])\n    return new_split_lines_of_utt\n\n\ndef ProcessData():\n    try:\n        f_in = open(args.ctm_edits_in, encoding=\"utf8\")\n    except:\n        sys.exit(\"taint_ctm_edits.py: error opening ctm-edits input \"\n                 \"file {0}\".format(args.ctm_edits_in))\n    try:\n        f_out = open(args.ctm_edits_out, 'w', encoding=\"utf8\")\n    except:\n        sys.exit(\"taint_ctm_edits.py: error opening ctm-edits output \"\n                 \"file {0}\".format(args.ctm_edits_out))\n    num_lines_processed = 0\n\n\n    # Most of what we're doing in the lines below is splitting the input lines\n    # and grouping them per utterance, before giving them to ProcessUtterance()\n    # and then printing the modified lines.\n    first_line = f_in.readline()\n    if first_line == '':\n        sys.exit(\"taint_ctm_edits.py: empty input\")\n    split_pending_line = first_line.split()\n    if len(split_pending_line) == 0:\n        sys.exit(\"taint_ctm_edits.py: bad input line \" + first_line)\n    cur_utterance = split_pending_line[0]\n    split_lines_of_cur_utterance = []\n\n    while True:\n        if len(split_pending_line) == 0 or split_pending_line[0] != cur_utterance:\n            split_lines_of_cur_utterance = ProcessUtterance(\n                split_lines_of_cur_utterance, args.remove_deletions)\n            for split_line in split_lines_of_cur_utterance:\n                print(' '.join(split_line), file = f_out)\n            split_lines_of_cur_utterance = []\n            if len(split_pending_line) == 0:\n                break\n            else:\n 
               cur_utterance = split_pending_line[0]\n\n        split_lines_of_cur_utterance.append(split_pending_line)\n        next_line = f_in.readline()\n        split_pending_line = next_line.split()\n        if len(split_pending_line) == 0:\n            if next_line != '':\n                sys.exit(\"taint_ctm_edits.py: got an empty or whitespace input line\")\n    try:\n        f_out.close()\n    except:\n        sys.exit(\"taint_ctm_edits.py: error closing ctm-edits output \"\n                 \"(broken pipe or full disk?)\")\n\ndef PrintNonScoredStats():\n    if args.verbose < 1:\n        return\n    if num_lines == 0:\n        print(\"taint_ctm_edits.py: processed no input.\", file = sys.stderr)\n    num_lines_modified = sum(ref_change_stats.values())\n    num_incorrect_lines = num_lines - num_correct_lines\n    percent_lines_incorrect= '%.2f' % (num_incorrect_lines * 100.0 / num_lines)\n    percent_modified = '%.2f' % (num_lines_modified * 100.0 / num_lines);\n    percent_of_incorrect_modified = '%.2f' % (num_lines_modified * 100.0 / num_incorrect_lines)\n    print(\"taint_ctm_edits.py: processed {0} lines of ctm ({1}% of which incorrect), \"\n          \"of which {2} were changed fixing the reference for non-scored words \"\n          \"({3}% of lines, or {4}% of incorrect lines)\".format(\n            num_lines, percent_lines_incorrect, num_lines_modified,\n            percent_modified, percent_of_incorrect_modified),\n          file = sys.stderr)\n\n    keys = sorted(list(ref_change_stats.keys()), reverse=True,\n                  key = lambda x: ref_change_stats[x])\n    num_keys_to_print = 40 if args.verbose >= 2 else 10\n\n    print(\"taint_ctm_edits.py: most common edits (as percentages \"\n          \"of all such edits) are:\\n\" +\n          ('\\n'.join([ '%s [%.2f%%]' % (k, ref_change_stats[k]*100.0/num_lines_modified)\n                     for k in keys[0:num_keys_to_print]]))\n          + '\\n...'if num_keys_to_print < len(keys) else '',\n     
     file = sys.stderr)\n\n\ndef PrintStats():\n    tot_lines = sum(num_lines_of_type.values())\n    if args.verbose < 1 or tot_lines == 0:\n        return\n    print(\"taint_ctm_edits.py: processed {0} input lines, whose edit-types were: \".format(tot_lines) +\n          ', '.join([ '%s = %.2f%%' % (k, num_lines_of_type[k] * 100.0 / tot_lines)\n                      for k in sorted(list(num_lines_of_type.keys()), reverse = True,\n                                      key = lambda k: num_lines_of_type[k])  ]),\n          file = sys.stderr)\n\n\n    del_giving_taint_percent = num_del_lines_giving_taint * 100.0 / tot_lines\n    sub_giving_taint_percent = num_sub_lines_giving_taint * 100.0 / tot_lines\n    ins_giving_taint_percent = num_ins_lines_giving_taint * 100.0 / tot_lines\n    tainted_lines_percent = num_tainted_lines * 100.0 / tot_lines\n\n    print(\"taint_ctm_edits.py: as a percentage of all lines, (%.2f%%, %.2f%%, %.2f%%) were \"\n          \"(deletions, substitutions, insertions) that tainted adjacent lines.  %.2f%% of all \"\n          \"lines were tainted.\" % (del_giving_taint_percent, sub_giving_taint_percent,\n                                   ins_giving_taint_percent, tainted_lines_percent),\n          file = sys.stderr)\n\n\n\n# num_lines_of_type will map from line-type ('cor', 'sub', etc.) to count.\nnum_lines_of_type = defaultdict(int)\nnum_tainted_lines = 0\nnum_del_lines_giving_taint = 0\nnum_sub_lines_giving_taint = 0\nnum_ins_lines_giving_taint = 0\n\nProcessData()\nPrintStats()\n"
  },
  {
    "path": "egs/steps/cleanup/internal/tf_idf.py",
    "content": "# Copyright 2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\"This module contains structures to accumulate, store and use stats\nfor Term-frequency and Inverse-document-frequency values.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport logging\nimport math\nimport re\nimport sys\n\nsys.path.insert(0, 'steps')\n\nlogger = logging.getLogger('__name__')\nlogger.addHandler(logging.NullHandler())\n\n\nclass IDFStats(object):\n    \"\"\"Stores stats for computing inverse-document-frequencies.\n    \"\"\"\n    def __init__(self):\n        self.num_docs_for_term = {}\n        self.num_docs = 0\n\n    def get_inverse_document_frequency(self, term, weighting_scheme=\"log\"):\n        \"\"\"Get IDF for a term.\n\n        Weighting scheme is the function applied on the raw\n        inverse-document frequencies n(t) = |d in D: t in d|\n        when computing idf(t,d).\n        Let N = Total number of documents.\n\n        IDF weighting schemes:-\n        unary  : idf(t,D) = 1\n        log    : idf(t,D) = log (N / (1 + n(t)))\n        log-smoothed : idf(t,D) = log(1 + N / n(t))\n        probabilistic: idf(t,D) = log((N - n(t)) / n(t))\n        \"\"\"\n        n_t = float(self.num_docs_for_term.get(term, 0))\n        num_terms = len(self.num_docs_for_term)\n\n        if num_terms == 0:\n            raise RuntimeError(\"No IDF stats have been accumulated.\")\n\n        if weighting_scheme == \"unary\":\n            return 1\n        if weighting_scheme == \"log\":\n            return math.log(float(self.num_docs) / (1.0 + n_t))\n        if weighting_scheme == \"log-smoothed\":\n            return math.log(1.0 + float(self.num_docs) / (1.0 + n_t))\n        if weighting_scheme == \"probabilitic\":\n            return math.log((self.num_docs - n_t - 1) / (1.0 + n_t))\n\n    def accumulate(self, term):\n        \"\"\"Adds one count to the number of docs containing the term \"term\".\n        \"\"\"\n        
self.num_docs_for_term[term] = self.num_docs_for_term.get(term, 0) + 1\n        if len(term) == 1:\n            self.num_docs += 1\n\n    def write(self, file_handle):\n        \"\"\"Writes the IDF stats to file using the format:\n        <term-1> <term-2> ... <term-N> <num-docs>\n        for n-gram (<term-1>, ... <term-N>)\n        \"\"\"\n        for term, num in self.num_docs_for_term.items():\n            if num == 0:\n                continue\n            assert isinstance(term, tuple)\n            print (\"{term} {n}\".format(term=\" \".join(term), n=num),\n                   file=file_handle)\n\n    def read(self, file_handle):\n        \"\"\"Loads IDF stats from file. \"\"\"\n        for line in file_handle:\n            parts = line.strip().split()\n            term = tuple(parts[0:-1])\n            self.num_docs_for_term[term] = float(parts[-1])\n            if len(term) == 1:\n                self.num_docs += 1\n\n        if len(self.num_docs_for_term) == 0:\n            raise RuntimeError(\"Read no IDF stats.\")\n\n\nclass TFStats(object):\n    \"\"\"Store stats for TF-IDF computation.\n    A separate object of IDFStats is stored within this object.\n    \"\"\"\n    def __init__(self):\n        self.raw_counts = {}\n        self.max_counts_for_term = {}\n\n    def get_term_frequency(self, term, doc, weighting_scheme=\"raw\",\n                           normalization_factor=0.5):\n        \"\"\"Returns the term-frequency for (term, document) pair.\n\n        The function applied on the raw term-frequencies f(t,d) when computing\n        tf(t,d) is specified by the weighting_scheme.\n        binary : tf(t,d) = 1 if t in d else 0\n        raw    : tf(t,d) = f(t,d)\n        log    : tf(t,d) = 1 + log(f(t,d))\n        normalized : tf(t,d) = K + (1-K) * f(t,d) / max{f(t',d): t' in d}\n        \"\"\"\n        if weighting_scheme == \"binary\":\n            return 1 if (term, doc) in self.raw_counts else 0\n        if weighting_scheme == \"raw\":\n            
return self.raw_counts.get((term, doc), 0)\n        if weighting_scheme == \"log\":\n            if (term, doc) in self.raw_counts:\n                return 1 + math.log(self.raw_counts[(term, doc)])\n            return 0\n        if weighting_scheme == \"normalized\":\n            return (normalization_factor\n                    + (1 - normalization_factor)\n                    * self.raw_counts.get((term, doc), 0)\n                    / (1.0 + self.max_counts_for_term.get(term, 0)))\n        raise KeyError(\"Unknown tf-weighting-scheme {0}\".format(\n            weighting_scheme))\n\n    def accumulate(self, doc, text, ngram_order):\n        \"\"\"Accumulate raw stats from a document for upto the specified\n        ngram-order.\"\"\"\n        for n in range(1, ngram_order + 1):\n            for i in range(len(text)):\n                term = tuple(text[i:(i+n)])\n                self.raw_counts.setdefault((term, doc), 0)\n                self.raw_counts[(term, doc)] += 1\n\n    def compute_term_stats(self, idf_stats=None):\n        \"\"\"Compute the maximum counts for each term over all the documents\n        based on the stored raw counts.\"\"\"\n        if len(self.raw_counts) == 0:\n            raise RuntimeError(\"No (term, doc) found in tf-stats.\")\n        for tup, counts in self.raw_counts.items():\n            term = tup[0]\n\n            if counts > self.max_counts_for_term.get(term, 0):\n                self.max_counts_for_term[term] = counts\n\n            if idf_stats is not None:\n                idf_stats.accumulate(term)\n\n    def __str__(self):\n        \"\"\"Returns a string with all the stats in the following format:\n        <n-gram order> <term-1> <term-2> ... 
<term-n> <document-id> <counts>\n        \"\"\"\n        lines = []\n        for tup, counts in self.raw_counts.items():\n            term, doc = tup\n            lines.append(\"{order} {term} {doc} {counts}\".format(\n                order=len(term), term=\" \".join(term),\n                doc=doc, counts=counts))\n        return \"\\n\".join(lines)\n\n    def read(self, file_handle, ngram_order=None, idf_stats=None):\n        \"\"\"Reads the TF stats stored in a file in the following format:\n        <ngram-order> <term-1> <term-2> ... <term-n> <document-id> <counts>\n\n        If idf_stats is provided then idf_stats is accumulated simultaneously.\n        \"\"\"\n        for line in file_handle:\n            parts = line.strip().split()\n            # The n-gram order is stored as text; convert it before using it\n            # in comparisons and slices.\n            order = int(parts[0])\n            assert len(parts) - 3 == order\n            if ngram_order is not None and order > ngram_order:\n                continue\n            term = tuple(parts[1:(order+1)])\n            doc = parts[-2]\n            counts = float(parts[-1])\n\n            self.raw_counts[(term, doc)] = counts\n\n            if counts > self.max_counts_for_term.get(term, 0):\n                self.max_counts_for_term[term] = counts\n\n            if idf_stats is not None:\n                idf_stats.accumulate(term)\n\n        if len(self.raw_counts) == 0:\n            raise RuntimeError(\"Read no TF stats.\")\n\n\nclass TFIDF(object):\n    \"\"\"Class to store TF-IDF values for term-document pairs.\n\n    Parameters:\n        tf_idf - A dictionary of TF-IDF values indexed by (term, document)\n                 tuple as key\n    \"\"\"\n\n    def __init__(self):\n        self.tf_idf = {}\n\n    def get_value(self, term, doc):\n        \"\"\"Returns TF-IDF value for (term, doc) tuple if it exists.\n        Otherwise raises KeyError; callers decide on the fallback value.\n        \"\"\"\n        return self.tf_idf[(term, doc)]\n\n    def compute_similarity_scores(self, source_tfidf, source_docs=None,\n                                  
do_length_normalization=False,\n                                  query_id=None):\n        \"\"\"Computes the TF-IDF similarity score between each query\n        document contained in this object and each source document\n        in the source_tfidf object.\n\n        Arguments:\n            source_docs - If provided, the similarity scores are computed\n                          for only the source documents contained in\n                          source_docs.\n            do_length_normalization - If True, then each similarity\n                          score is normalized by the number of terms in\n                          the query document. This is usually\n                          not required when the scores are only utilized\n                          for ranking the source documents.\n            query_id - If provided, check that this tf_idf object\n                       contains values only for document with id 'query_id'\n\n        Returns a dictionary\n            { (query_document_id, source_document_id): similarity_score }\n        \"\"\"\n        num_terms_per_doc = {}\n        similarity_scores = {}\n\n        for tup, value in self.tf_idf.items():\n            term, doc = tup\n            num_terms_per_doc[doc] = num_terms_per_doc.get(doc, 0) + 1\n\n            if query_id is not None and doc != query_id:\n                raise RuntimeError(\"TF-IDF contains document {0}, which is \"\n                                   \"not the required query {1}. 
\\n\"\n                                   \"Something wrong in how this TF-IDF object \"\n                                   \"was created or a bug in the \"\n                                   \"calling script.\".format(\n                                       doc, query_id))\n\n            if source_docs is not None:\n                for src_doc in source_docs:\n                    try:\n                        src_value = source_tfidf.get_value(term, src_doc)\n                    except KeyError:\n                        logger.debug(\n                            \"Could not find ({term}, {src}) in \"\n                            \"source_tfidf. \"\n                            \"Choosing a tf-idf value of 0.\".format(\n                                term=term, src=src_doc))\n                        src_value = 0\n\n                    similarity_scores[(doc, src_doc)] = (\n                        similarity_scores.get((doc, src_doc), 0)\n                        + src_value * value)\n            else:\n                for src_tup, src_value in source_tfidf.tf_idf.items():\n                    src_term, src_doc = src_tup\n                    if src_term != term:\n                        # Only terms shared with the query contribute.\n                        continue\n                    similarity_scores[(doc, src_doc)] = (\n                        similarity_scores.get((doc, src_doc), 0)\n                        + src_value * value)\n\n        if do_length_normalization:\n            for doc_pair, value in similarity_scores.items():\n                doc, src_doc = doc_pair\n                similarity_scores[(doc, src_doc)] = value / num_terms_per_doc[doc]\n\n        if logger.isEnabledFor(logging.DEBUG):\n            for doc, count in num_terms_per_doc.items():\n                logger.debug(\n                    'Seen {0} terms in query document {1}'.format(count, doc))\n\n        return similarity_scores\n\n    def read(self, tf_idf_file):\n        \"\"\"Loads TFIDF object from file.\"\"\"\n\n        if len(self.tf_idf) != 0:\n            raise RuntimeError(\"TF-IDF object is not empty.\")\n        seen_footer = False\n        line = 
tf_idf_file.readline()\n        parts = line.strip().split()\n        if re.search('^<TFIDF>', line) is None:\n            raise TypeError(\n                \"Invalid format of TF-IDF object. \"\n                \"Missing header <TFIDF>; got {0}\".format(line))\n        assert parts[0] == \"<TFIDF>\"\n        if len(parts) > 1:\n            # Read header; go to the rest of line\n            line = \" \".join(parts[1:])\n        else:\n            # Nothing in this line. Read the next lines.\n            line = tf_idf_file.readline()\n        while line:\n            parts = line.strip().split()\n            if re.search('</TFIDF>', line):\n                if len(parts) > 1:\n                    raise TypeError(\n                        \"Expecting footer </TFIDF> \"\n                        \"to be on a separate line; got {0}\".format(line))\n                assert parts[0] == \"</TFIDF>\"\n                seen_footer = True\n                break\n            if re.search('<TFIDF>', line):\n                raise TypeError(\"Got unexpected header <TFIDF> in line \"\n                                \"{0}\".format(line))\n\n            order = int(parts[0])\n            term = tuple(parts[1:(order + 1)])\n            doc = parts[-2]\n            tfidf = float(parts[-1])\n\n            entry = (term, doc)\n            if entry in self.tf_idf:\n                raise RuntimeError(\"Duplicate entry {0} found while reading \"\n                                   \"TFIDF object.\".format(entry))\n            self.tf_idf[entry] = tfidf\n\n            line = tf_idf_file.readline()\n        if not seen_footer:\n            raise TypeError(\n                \"Did not see footer </TFIDF> \"\n                \"in TFIDF object; got {0}\".format(line))\n\n        if len(self.tf_idf) == 0:\n            raise RuntimeError(\n                \"Read no TF-IDF values from file {0}\".format(tf_idf_file.name))\n\n    def write(self, tf_idf_file):\n        \"\"\"Writes TFIDF object to 
file.\"\"\"\n\n        print (\"<TFIDF>\", file=tf_idf_file)\n        for tup, value in self.tf_idf.items():\n            term, doc = tup\n            print(\"{order} {term} {doc} {tfidf}\".format(\n                order=len(term), term=\" \".join(term),\n                doc=doc, tfidf=value),\n                  file=tf_idf_file)\n        print (\"</TFIDF>\", file=tf_idf_file)\n\n\ndef write_tfidf_from_stats(\n        tf_stats, idf_stats, tf_idf_file, tf_weighting_scheme=\"raw\",\n        idf_weighting_scheme=\"log\", tf_normalization_factor=0.5,\n        expected_document_id=None):\n    \"\"\"Writes TF-IDF values to file args.tf_idf_file.\n    The format used is\n    <ngram-order> <term> <document> <tfidf>.\n    Markers \"<TFIDF>\" and \"</TFIDF>\" are added for parsing this file\n    easily.\n\n    Arguments:\n        tf_stats - A TFStats object\n        idf_stats - An IDFStats object\n        tf_idf_file - Output file to which the TF-IDF values will be written\n        tf_weighting_scheme - See doc_string in TFStats class\n        idf_weighting_scheme - See doc_string in IDFStats class\n        tf_normalization_factor - See doc_string in TFStats class\n        document_id - If provided, checks that the TFStats object contains\n                      stats only for this document_id.\n    \"\"\"\n    if len(tf_stats.raw_counts) == 0:\n        raise RuntimeError(\"Supplied tf-stats object is empty.\")\n\n    if idf_stats.num_docs == 0:\n        raise RuntimeError(\"Supplied idf-stats object is empty.\")\n\n    print (\"<TFIDF>\", file=tf_idf_file)\n    for tup in tf_stats.raw_counts:\n        term, doc = tup\n\n        if expected_document_id is not None and doc != expected_document_id:\n            raise RuntimeError(\"TFStats object contains stats with \"\n                               \"document {0}, \"\n                               \"which is not the specified \"\n                               \"document {1}.\".format(doc,\n                                   
                   expected_document_id))\n\n        tf_value = tf_stats.get_term_frequency(\n            term, doc,\n            weighting_scheme=tf_weighting_scheme,\n            normalization_factor=tf_normalization_factor)\n\n        idf_value = idf_stats.get_inverse_document_frequency(\n            term, weighting_scheme=idf_weighting_scheme)\n\n        print(\"{order} {term} {doc} {tfidf}\".format(\n            order=len(term), term=\" \".join(term),\n            doc=doc, tfidf=tf_value * idf_value),\n              file=tf_idf_file)\n    print (\"</TFIDF>\", file=tf_idf_file)\n\n\ndef read_key(fd):\n  \"\"\" [str] = read_key(fd)\n   Read the utterance-key from the opened ark/stream descriptor 'fd'.\n  \"\"\"\n  str = ''\n  while 1:\n    char = fd.read(1)\n    if char == '' : break\n    if char == ' ' : break\n    str += char\n  str = str.strip()\n  if str == '': return None # end of file,\n  return str\n\n\ndef read_tfidf_ark(file_handle):\n    \"\"\"Read a kaldi archive of TFIDF objects indexed by a key (document-id).\n    <document-id1> <tf-idf-object1>\n    <document-id2> <tf-idf-object2>\n    ...\n    \"\"\"\n    try:\n        key = read_key(file_handle)\n        while key:\n            tf_idf = TFIDF()\n            try:\n                tf_idf.read(file_handle)\n            except RuntimeError:\n                raise\n            yield key, tf_idf\n            key = read_key(file_handle)\n    finally:\n        file_handle.close()\n"
  },
  {
    "path": "egs/steps/cleanup/lattice_oracle_align.sh",
    "content": "#! /bin/bash\n\n# Copyright 2016  Vimal Manohar\n#           2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nset -e\nset -o pipefail\n\ncleanup=true\nstage=0\ncmd=run.pl\nspecial_symbol=\"***\"    # Special symbol to be aligned with the inserted or\n                        # deleted words. Your sentences should not contain this\n                        # symbol.\nprint_silence=true      # True if we want the silences in the ctm.  We do.\nframe_shift=0.01\n\n. ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  echo \"This script computes oracle paths for lattices (against a reference \"\n  echo \"transcript) and does various kinds of processing of that, for use by \"\n  echo \"steps/cleanup/cleanup_with_segmentation.sh.\"\n  echo \"Its main input is <latdir>/lat.*.gz.\"\n  echo \"This script outputs a human-readable word alignment of the oracle path\"\n  echo \"through the lattice in <dir>/oracle_hyp.txt, and a time-aligned ctm version of\"\n  echo \"the same in <dir>/ctm.\"\n  echo \"It also creates <dir>/edits.txt (the number of edits per utterance),\"\n  echo \"<dir>/text (which is <data>/text but filtering out any utterances that\"\n  echo \"were not decoded for some reason), and <dir>/length.txt, which is the length\"\n  echo \"of the reference transcript, and <dir>/all_info.txt and <dir>/all_info.sorted.txt\"\n  echo \"which contain all the info in a way that's easier to scan for humans.\"\n  echo \"Note: most of this is the same as is done in steps/cleanup/find_bad_utts.sh,\"\n  echo \"except it runs from pre-existing lattices.\"\n  echo \"\"\n  echo \"Usage: $0 <data> <lang> <latdir> <dir>\"\n  echo \" e.g.: $0 data/train_si284 data/lang exp/tri4_bad_utts/lats exp/tri4_bad_utts/lattice_oracle\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>            # config containing options\"\n  echo \"  --cleanup <true|false>            # set this to false to disable 
cleanup of \"\n  echo \"                                    # temporary files (default: true)\"\n  echo \"  --cmd <command-string>            # how to run jobs (default: run.pl).\"\n  echo \"  --special-symbol <special-symbol> # Symbol to pad with in insertions and deletions in the\"\n  echo \"                                    # output produced in <dir>/analysis/ (default: '***')\"\n  echo \"  --print-silence <true|false>      # Affects ctm generation; default is true (recommended)\"\n  echo \"  --frame-shift <frame-shift>       # Frame shift in seconds; default: 0.01.  Affects ctm generation.\"\n  exit 1\nfi\n\ndata=$1\nlang=$2\nlatdir=$3\ndir=$4\n\nfor f in $lang/oov.int $lang/words.txt $data/text $latdir/lat.1.gz $latdir/num_jobs; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nmkdir -p $dir/log\n\nif [ -e $dir/final.mdl ]; then\n  model=$dir/final.mdl\nelif [ -e $dir/../final.mdl ]; then\n  model=$dir/../final.mdl\nelse\n  echo \"$0: expected $dir/final.mdl or $dir/../final.mdl to exist\"\n  exit 1\nfi\n\nnj=$(cat $latdir/num_jobs)\noov=$(cat $lang/oov.int)\n\nutils/split_data.sh $data $nj\n\nsdata=$data/split${nj}\n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $dir/log/get_oracle.JOB.log \\\n    lattice-oracle --write-lattices=\"ark:|gzip -c > $dir/lat.JOB.gz\" \\\n    \"ark:gunzip -c $latdir/lat.JOB.gz |\" \\\n    \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\" \\\n    ark,t:- \\| utils/int2sym.pl -f 2- $lang/words.txt '>' $dir/oracle_hyp.JOB.txt || exit 1;\n\n  echo -n \"lattice_oracle_align.sh: overall oracle %WER is: \"\n  grep 'Overall %WER'  $dir/log/get_oracle.*.log  | \\\n    perl -e 'while (<>){ if (m: (\\d+) / (\\d+):) { $x += $1; $y += $2}}  printf(\"%.2f%%\\n\", $x*100.0/$y); ' | \\\n    tee $dir/log/oracle_overall_wer.log\n\n  # the awk commands below are to ensure that partially-written files don't confuse us.\n  for x in $(seq $nj); do cat $dir/oracle_hyp.$x.txt; done | awk 
'{if(NF>=1){print;}}' > $dir/oracle_hyp.txt\n  if $cleanup; then\n    rm $dir/oracle_hyp.*.txt\n  fi\nfi\n\necho $nj > $dir/num_jobs\n\n\nif [ $stage -le 2 ]; then\n  # The following command gets the time-aligned ctm as $dir/ctm.JOB.\n\n  if [ -f $lang/phones/word_boundary.int ]; then\n    $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n      set -o pipefail '&&' \\\n      lattice-align-words $lang/phones/word_boundary.int $model \"ark:gunzip -c $dir/lat.JOB.gz|\" ark:- \\| \\\n      nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt '>' $dir/ctm.JOB || exit 1;\n  elif [ -f $lang/phones/align_lexicon.int ]; then\n    $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n      set -o pipefail '&&' \\\n      lattice-align-words-lexicon $lang/phones/align_lexicon.int $model  \"ark:gunzip -c $dir/lat.JOB.gz|\" ark:- \\| \\\n      lattice-1best ark:- ark:- \\| \\\n      nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt '>' $dir/ctm.JOB || exit 1;\n  else\n    echo \"$0: neither $lang/phones/word_boundary.int nor $lang/phones/align_lexicon.int exists: cannot align.\"\n    exit 1;\n  fi\n  for j in $(seq $nj); do cat $dir/ctm.$j; done > $dir/ctm\n  if $cleanup; then rm $dir/ctm.*; fi\n  echo \"$0: oracle ctm is in $dir/ctm\"\nfi\n\n\n# Stages below are really just to satisfy your curiosity; the output is the same\n# as that of find_bad_utts.sh.\n\nif [ $stage -le 3 ]; then\n  # in case any utterances failed to align, get filtered copy of $data/text\n  utils/filter_scp.pl $dir/oracle_hyp.txt < $data/text  > $dir/text\n  cat $dir/text | awk '{print $1, (NF-1);}' > $dir/length.txt\n\n  mkdir -p $dir/analysis\n\n  align-text --special-symbol=\"$special_symbol\"  ark:$dir/text ark:$dir/oracle_hyp.txt  ark,t:- | \\\n    utils/scoring/wer_per_utt_details.pl --special-symbol \"$special_symbol\" > $dir/analysis/per_utt_details.txt\n\n  echo 
\"$0: human-readable alignments are in $dir/analysis/per_utt_details.txt\"\n\n  awk '{if ($2 == \"#csid\") print $1\" \"($4+$5+$6)}' $dir/analysis/per_utt_details.txt > $dir/edits.txt\n\n  n1=$(wc -l < $dir/edits.txt)\n  n2=$(wc -l < $dir/oracle_hyp.txt)\n  n3=$(wc -l < $dir/text)\n  n4=$(wc -l < $dir/length.txt)\n  if [ $n1 -ne $n2 ] || [ $n2 -ne $n3 ] || [ $n3 -ne $n4 ]; then\n    echo \"$0: mismatch in lengths of files:\"\n    wc $dir/edits.txt $dir/oracle_hyp.txt $dir/text $dir/length.txt\n    exit 1;\n  fi\n\n  # note: the format of all_info.txt is:\n  # <utterance-id>   <number of errors>  <reference-length>  <decoded-output>   <reference>\n  # with the fields separated by tabs, e.g.\n  # adg04_sr009_trn 1 \t12\t SHOW THE GRIDLEY+S TRACK IN BRIGHT ORANGE WITH HORNE+S IN DIM RED AT\t SHOW THE GRIDLEY+S TRACK IN BRIGHT ORANGE WITH HORNE+S IN DIM RED\n\n  paste $dir/edits.txt \\\n      <(awk '{print $2}' $dir/length.txt) \\\n      <(awk '{$1=\"\";print;}' <$dir/oracle_hyp.txt) \\\n      <(awk '{$1=\"\";print;}' <$dir/text) > $dir/all_info.txt\n\n  sort -nr -k2 $dir/all_info.txt > $dir/all_info.sorted.txt\n\n  echo \"$0: per-utterance details sorted from worst to best utts are in $dir/all_info.sorted.txt\"\n  echo \"$0: format is: utt-id num-errs ref-length decoded-output (tab) reference\"\nfi\n\nif [ $stage -le 4 ]; then\n  ###\n  # These stats might help people figure out what is wrong with the data\n  # a)human-friendly and machine-parsable alignment in the file per_utt_details.txt\n  # b)evaluation of per-speaker performance to possibly find speakers with\n  #   distinctive accents/speech disorders and similar\n  # c)Global analysis on (Ins/Del/Sub) operation, which might be used to figure\n  #   out if there is systematic issue with lexicon, pronunciation or phonetic confusability\n\n  cat $dir/analysis/per_utt_details.txt | \\\n    utils/scoring/wer_per_spk_details.pl $data/utt2spk > $dir/analysis/per_spk_details.txt\n\n  echo \"$0: per-speaker details are 
in $dir/analysis/per_spk_details.txt\"\n\n  cat $dir/analysis/per_utt_details.txt | \\\n    utils/scoring/wer_ops_details.pl --special-symbol \"$special_symbol\" | \\\n    sort -i -b -k1,1 -k4,4nr -k2,2 -k3,3 > $dir/analysis/ops_details.txt\n\n  echo \"$0: per-word statistics [corr,sub,ins,del] are in $dir/analysis/ops_details.txt\"\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: obtaining ctm edits\"\n\n  $cmd $dir/log/get_ctm_edits.log \\\n    align-text ark:$dir/oracle_hyp.txt ark:$dir/text ark,t:-  \\| \\\n      steps/cleanup/internal/get_ctm_edits.py --oov=$oov --symbol-table=$lang/words.txt \\\n       /dev/stdin $dir/ctm $dir/ctm_edits || exit 1\n\n  echo \"$0: ctm with edits information appended is in $dir/ctm_edits\"\nfi\n"
  },
  {
    "path": "egs/steps/cleanup/make_biased_lm_graphs.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2016     Johns Hopkins University (Author: Daniel Povey)\n#                2016     Vimal Manohar\n# Apache 2.0\n\n\n# This script creates biased decoding graphs based on the data transcripts as\n# HCLG.fsts.scp, in the specified directory; this can be consumed by\n# decode_segmentation.sh.\n# This is for use in data-cleanup and data-filtering.\n\n\nset -u\nset -o pipefail\nset -e\n\n# Begin configuration section.\nnj=10\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\ntop_n_words=100 # Number of common words that we compile into each graph (most frequent\n                # in $data/text.orig.\ntop_n_words_weight=1.0  # this weight is before renormalization; it can be more\n                        # or less than 1.\nmin_words_per_graph=100  # Utterances will be grouped so that they have at least\n                         # this many words, before making the graph.\nstage=0\n\n### options for make_one_biased_lm.py.\nngram_order=4  # maximum n-gram order to use (but see also --min-lm-state-cout).\nmin_lm_state_count=10  # make this smaller (e.g. 2) for more strongly biased LM.\ndiscounting_constant=0.3  # strictly between 0 and 1.  Make this closer to 0 for\n                          # more strongly biased LM.\n\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"usage: $0 <data-dir|text> <lang-dir> <dir> <graph-dir>\"\n   echo \"e.g.:  $0 data/train data/lang exp/tri3_cleanup exp/tri3_cleanup/graphs\"\n   echo \"  This script creates biased decoding graphs per utterance (or possibly\"\n   echo \"  groups of utterances, depending on --min-words-per-graph).  Its output\"\n   echo \"  goes to <dir>/HCLG.fsts.scp, indexed by utterance.  
Directory <dir> is\"\n   echo \"  required to be a model or alignment directory, containing 'tree' and 'final.mdl'.\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --scale-opts <scale-opts>                 # Options relating to language\"\n   echo \"                                            # model scale; default is \"\n   echo \"                                            # '--transition-scale=1.0 --self-loop-scale=0.1'\"\n   echo \"  --top-n-words <N>                         # Number of most-common-words to add with\"\n   echo \"                                            # unigram probabilities into graph (default: 100)\"\n   echo \"  --top-n-words-weight <float>              # Weight given to top-n-words portion of graph\"\n   echo \"                                            # (before renormalizing); may be any positive\"\n   echo \"                                            # number (default: 1.0)\"\n   echo \"  --min-words-per-graph <N>                 # A constant that controls grouping of utterances\"\n   echo \"                                            # (we make the LMs for groups of utterances).\"\n   echo \"                                            # Default: 100.\"\n   echo \"  --ngram-order <N>                         # N-gram order in range [2,7].  Maximum n-gram order \"\n   echo \"                                            # that may be used (but also see --min-lm-state-count).\"\n   echo \"                                            # Default 4\"\n   echo \"  --min-lm-state-count <N>                  # Minimum state count for an LM-state of order >2 to \"\n   echo \"                                            # be completely pruned away [bigrams will always be kept]\"\n   echo \"                                            # Default 10.  
Smaller -> more strongly biased LM\"\n   echo \"  --discounting-constant <float>            # Discounting constant for Kneser-Ney, strictly between 0\"\n   echo \"                                            # and 1.  Default 0.3.  Smaller -> more strongly biased LM.\"\n   echo \"  --config <config-file>                    # config containing options\"\n   echo \"  --nj <nj>                                 # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata_or_text=$1\nlang=$2\ndir=$3\ngraph_dir=$4\n\nif [ -d $data_or_text ]; then\n  text=$data_or_text/text\nelse\n  text=$data_or_text\nfi\n\nmkdir -p $graph_dir\n\nfor f in $text $lang/oov.int $dir/tree $dir/final.mdl \\\n    $lang/L_disambig.fst $lang/phones/disambig.int; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $dir/phones.txt\ncp $lang/phones.txt $graph_dir\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $graph_dir/log\n\n# create top_words.{int,txt}\nif [ $stage -le 0 ]; then\n  export LC_ALL=C\n  # the following pipe will be broken due to the 'head'; don't fail.\n  set +o pipefail\n  utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $text | \\\n    awk '{for(x=2;x<=NF;x++) print $x;}' | sort | uniq -c | \\\n     sort -nr | head -n $top_n_words > $graph_dir/word_counts.int\n  set -o pipefail\n  total_count=$(awk '{x+=$1} END{print x}' < $graph_dir/word_counts.int)\n  # print top-n words with their unigram probabilities.\n  awk -v tot=$total_count -v weight=$top_n_words_weight '{print $2, ($1*weight)/tot;}' \\\n     <$graph_dir/word_counts.int >$graph_dir/top_words.int\n  utils/int2sym.pl -f 1 $lang/words.txt <$graph_dir/top_words.int >$graph_dir/top_words.txt\nfi\n\nword_disambig_symbol=$(cat $lang/words.txt | grep -w \"#0\" | awk '{print $2}')\nif [ -z \"$word_disambig_symbol\" ]; then\n  echo \"$0: error getting word disambiguation 
symbol\"\n  exit 1\nfi\n\nmkdir -p $graph_dir/texts\nsplit_text=\nfor n in `seq $nj`; do\n  split_text=\"$split_text $graph_dir/texts/text.$n\"\ndone\n\nutils/split_scp.pl $text $split_text\n\nmkdir -p $graph_dir/log $graph_dir/fsts\n\n# Make $dir an absolute pathname\ndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nif [ $stage -le 1 ]; then\n  echo \"$0: creating utterance-group-specific decoding graphs with biased LMs\"\n\n  # These options are passed through directly to make_one_biased_lm.py.\n  lm_opts=\"--word-disambig-symbol=$word_disambig_symbol --ngram-order=$ngram_order --min-lm-state-count=$min_lm_state_count --discounting-constant=$discounting_constant --top-words=$graph_dir/top_words.int\"\n\n  $cmd JOB=1:$nj $graph_dir/log/compile_decoding_graphs.JOB.log \\\n    utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $graph_dir/texts/text.JOB \\| \\\n    steps/cleanup/make_biased_lms.py --min-words-per-graph=$min_words_per_graph \\\n      --lm-opts=\"$lm_opts\" $graph_dir/fsts/utt2group.JOB \\| \\\n    compile-train-graphs-fsts $scale_opts --read-disambig-syms=$lang/phones/disambig.int \\\n      $dir/tree $dir/final.mdl $lang/L_disambig.fst ark:- \\\n    ark,scp:$graph_dir/fsts/HCLG.fsts.JOB.ark,$graph_dir/fsts/HCLG.fsts.JOB.scp || exit 1\nfi\n\nfor j in $(seq $nj); do cat $graph_dir/fsts/HCLG.fsts.$j.scp; done > $graph_dir/fsts/HCLG.fsts.per_utt.scp\nfor j in $(seq $nj); do cat $graph_dir/fsts/utt2group.$j; done > $graph_dir/fsts/utt2group\n\n\ncp $lang/words.txt $graph_dir/\ncp -r $lang/phones $graph_dir/\n\n# The following command gives us an scp file relative to utterance-id.\nutils/apply_map.pl -f 2 $graph_dir/fsts/HCLG.fsts.per_utt.scp <$graph_dir/fsts/utt2group > $graph_dir/HCLG.fsts.scp\n\nn1=$(cat $text | wc -l)\nn2=$(cat $graph_dir/HCLG.fsts.scp | wc -l)\n\nif [ $[$n1*9] -gt $[$n2*10] ]; then\n  echo \"$0: too many utterances have no scp, something seems to have gone wrong.\"\n  exit 
1\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/cleanup/make_biased_lms.py",
    "content": "#!/usr/bin/env python3\n\nfrom __future__ import print_function\nimport sys\nimport argparse\nimport math\nimport subprocess\nfrom collections import defaultdict\n\nimport io\nsys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding=\"utf8\")\nsys.stderr = io.TextIOWrapper(sys.stderr.buffer,encoding=\"utf8\")\nsys.stdin = io.TextIOWrapper(sys.stdin.buffer,encoding=\"utf8\")\n\nparser = argparse.ArgumentParser(description=\"\"\"\nThis script is a wrapper for make_one_biased_lm.py that reads a Kaldi archive\nof (integerized) text data from the standard input and writes a Kaldi archive of\nbackoff-language-model FSTs to the standard-output.  It takes care of\ngrouping utterances to respect the --min-words-per-graph option.  It writes\nthe graphs to the standard output and also outputs a map from input utterance-ids\nto the per-group utterance-ids that index the output graphs.\"\"\")\n\nparser.add_argument(\"--lm-opts\", type = str, default = \"\",\n                    help = \"Options to pass in to make_one_biased_lm.py (which \"\n                    \"creates the individual LM graphs), e.g. 
'--word-disambig-symbol=8721'.\")\nparser.add_argument(\"--min-words-per-graph\", type = int, default = 100,\n                    help = \"Minimum number of words per utterance group; this program \"\n                    \"will try to arrange the input utterances into groups such that each \"\n                    \"one has at least this many words in total.\")\nparser.add_argument(\"utterance_map\", type = str,\n                    help = \"Filename to which a map from input utterances to grouped \"\n                    \"utterances, is written\")\n\nargs = parser.parse_args()\n\n\n\ntry:\n    utterance_map_file = open(args.utterance_map, \"w\", encoding=\"utf-8\")\nexcept:\n    sys.exit(\"make_biased_lms.py: error opening {0} to write utterance map\".format(\n            args.utterance_map))\n\n# This processes one group of input lines; 'group_of_lines' is\n# an array of lines of input integerized text, e.g.\n# [ 'utt1 67 89 432', 'utt2 89 48 62' ]\ndef ProcessGroupOfLines(group_of_lines):\n    num_lines = len(group_of_lines)\n    try:\n        first_utterance_id = group_of_lines[0].split()[0]\n    except:\n        sys.exit(\"make_biased_lms.py: empty input line\")\n\n    group_utterance_id = '{0}-group-of-{1}'.format(first_utterance_id, num_lines)\n    # print the group utterance-id to the stdout; it forms the name in\n    # the text-form archive.\n    print(group_utterance_id)\n    sys.stdout.flush()\n\n    try:\n        command = \"steps/cleanup/internal/make_one_biased_lm.py \" + args.lm_opts\n        p = subprocess.Popen(command, shell = True, stdin = subprocess.PIPE,\n                             stdout = sys.stdout, stderr = sys.stderr)\n        for line in group_of_lines:\n            a = line.split()\n            if len(a) == 0:\n                sys.exit(\"make_biased_lms.py: empty input line\")\n            utterance_id = a[0]\n            # print <utt> <utt-group> to utterance-map file\n            print(utterance_id, group_utterance_id, file = 
utterance_map_file)\n            rest_of_line = ' '.join(a[1:]) + '\\n' # get rid of utterance id.\n            p.stdin.write(rest_of_line.encode('utf-8'))\n        p.stdin.close()\n        assert p.wait() == 0\n    except Exception:\n        sys.stderr.write(\n            \"make_biased_lms.py: error calling subprocess, command was: \" +\n            command)\n        raise\n    # Print a blank line; this terminates the FST in the Kaldi fst-archive\n    # format.\n    print(\"\")\n    sys.stdout.flush()\n\n\n\nnum_words_this_group = 0\nthis_group_of_lines = []  # An array of strings, one per line\n\nwhile True:\n    line = sys.stdin.readline()\n    num_words_this_group += len(line.split())\n    if line != '':\n        this_group_of_lines.append(line)\n    if num_words_this_group >= args.min_words_per_graph or \\\n        (line == '' and len(this_group_of_lines) != 0):\n        ProcessGroupOfLines(this_group_of_lines)\n        num_words_this_group = 0\n        this_group_of_lines = []\n    if line == '':\n        break\n\n\n# test command [to be run from ../..]\n#\n\n# (echo 1 0.5; echo 2 0.25) > top_words.txt\n# (echo utt1 6 7 8 4; echo utt2 7 8 9; echo utt3 7 8) | steps/cleanup/make_biased_lms.py --lm-opts='--word-disambig-symbol=1000 --top-words=top_words.txt' foo; cat foo\n\n# (echo utt1 6 7 8 4; echo utt2 7 8 9; echo utt3 7 8) | steps/cleanup/make_biased_lms.py --min-words-per-graph=4 --lm-opts='--word-disambig-symbol=1000 --top-words=top_words.txt' foo; cat foo\n"
  },
  {
    "path": "egs/steps/cleanup/make_segmentation_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n# Apache 2.0\n\n# Begin configuration section.\nmax_seg_length=10\nmin_seg_length=2\nmin_sil_length=0.5\ntime_precision=0.05\nspecial_symbol=\"<***>\"\nseparator=\";\"\nwer_cutoff=-1\n# End configuration section.\n\nset -e\n\necho \"$0 $@\"\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"This script takes the ctm file that corresponds to the data directory\"\n  echo \"created by steps/cleanup/split_long_utterance.sh, works out a new\"\n  echo \"segmentation and creates a new data directory for the new segmentation.\"\n  echo \"\"\n  echo \"Usage: $0 [options] <ctm-file> <old-data-dir> <new-data-dir>\"\n  echo \" e.g.: $0 train_si284_split.ctm \\\\\"\n  echo \"                          data/train_si284_split data/train_si284_reseg\"\n  echo \"Options:\"\n  echo \"    --wer-cutoff            # ignore segments with WER higher than the\"\n  echo \"                            # specified value. -1 means no segment will\"\n  echo \"                            # be ignored.\"\n  echo \"    --max-seg-length        # maximum length of new segments\"\n  echo \"    --min-seg-length        # minimum length of new segments\"\n  echo \"    --min-sil-length        # minimum length of silence as split point\"\n  echo \"    --time-precision        # precision for determining \\\"same time\\\"\"\n  echo \"    --special-symbol        # special symbol to be aligned with\"\n  echo \"                            # inserted or deleted words\"\n  echo \"    --separator             # separator for aligned pairs\"\n  exit 1;\nfi\n\nctm=$1\nold_data_dir=$2\nnew_data_dir=$3\n\nfor f in $ctm $old_data_dir/text.orig $old_data_dir/utt2spk \\\n  $old_data_dir/wav.scp $old_data_dir/segments; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: expected $f to exist\"\n    exit 1;\n  fi\ndone\n\nmkdir -p $new_data_dir/tmp/\ncp -f $old_data_dir/wav.scp $new_data_dir\n[ -f $old_data_dir/spk2gender ] &&  cp -f $old_data_dir/spk2gender $new_data_dir\n\n# Removes the overlapping region (in steps/cleanup/split_long_utterance.sh we\n# create the segmentation with overlapping region).\n#\n# Note that for each audio file, we expect its segments have been sorted in time\n# ascending order (if we ignore the overlap).\ncat $ctm | perl -e '\n  $precision = $ARGV[0];\n  @ctm = ();\n  %processed_ids = ();\n  $previous_id = \"\";\n  while (<STDIN>) {\n    chomp;\n    my @current = split;\n    @current >= 5 || die \"Error: bad line $_\\n\";\n    $id = join(\"_\", ($current[0], $current[1]));\n    @previous = @{$ctm[-1]};\n\n    # Start of a new audio file.\n    if ($previous_id ne $id) {\n      # Prints existing information.\n      if (@ctm > 0) {\n        foreach $line (@ctm) {\n          print \"$line->[0] $line->[1] $line->[2] $line->[3] $line->[4]\\n\";\n        }\n      }\n\n      # Checks if the ctm file is sorted.\n      if (defined($processed_ids{$id})) {\n        die \"Error: \\\"$current[0] $current[1]\\\" has already been processed\\n\";\n      } else {\n        $processed_ids{$id} = 1;\n      }\n\n      @ctm = ();\n      push(@ctm, \\@current);\n      $previous_id = $id;\n      next;\n    }\n\n    $new_start = sprintf(\"%.2f\", $previous[2] + $previous[3]);\n\n    if ($new_start > $current[2]) {\n      # Case 2: scans for a splice point.\n      $index = -1;\n      while (defined($ctm[$index])\n             && $ctm[$index]->[2] + $ctm[$index]->[3] > $current[2]) {\n        if ($ctm[$index]->[4] eq $current[4]\n            && abs($ctm[$index]->[2] - $current[2]) < $precision\n            && abs($ctm[$index]->[3] - $current[3]) < $precision) {\n          pop @ctm for 2..abs($index);\n          last;\n        } else {\n          $index -= 1;\n        }\n      }\n    } else {\n      push(@ctm, 
\\@current);\n    }\n  }\n\n  if (@ctm > 0) {\n    foreach $line (@ctm) {\n      print \"$line->[0] $line->[1] $line->[2] $line->[3] $line->[4]\\n\";\n    }\n  }' $time_precision > $new_data_dir/tmp/ctm\n\n# Creates a text file from the ctm, which will be used in Levenshtein alignment.\n# Note that we remove <eps> in the text file.\ncat $new_data_dir/tmp/ctm | perl -e '\n  $previous_wav = \"\";\n  $previous_channel = \"\";\n  $text = \"\";\n  while (<STDIN>) {\n    chomp;\n    @col = split;\n    @col >= 5 || die \"Error: bad line $_\\n\";\n    if ($previous_wav eq $col[0]) {\n      $previous_channel eq $col[1] ||\n        die \"Error: more than one channels detected\\n\";\n      if ($col[4] ne \"<eps>\") {\n        $text .= \" $col[4]\";\n      }\n    } else {\n      if ($text ne \"\") {\n        print \"$previous_wav $text\\n\";\n      }\n      $text = $col[4];\n      $previous_wav = $col[0];\n      $previous_channel = $col[1];\n    }\n  }\n  if ($text ne \"\") {\n    print \"$previous_wav $text\\n\";\n  }' > $new_data_dir/tmp/text\n\n# Computes the Levenshtein alignment.\nalign-text --special-symbol=$special_symbol --separator=$separator \\\n  ark:$old_data_dir/text.orig ark:$new_data_dir/tmp/text \\\n  ark,t:$new_data_dir/tmp/aligned.txt\n\n# Creates new segmentation.\nsteps/cleanup/create_segments_from_ctm.pl \\\n  --max-seg-length $max_seg_length --min-seg-length $min_seg_length \\\n  --min-sil-length $min_sil_length \\\n  --separator $separator --special-symbol $special_symbol \\\n  --wer-cutoff $wer_cutoff \\\n  $new_data_dir/tmp/ctm $new_data_dir/tmp/aligned.txt \\\n  $new_data_dir/segments $new_data_dir/text\n\n# Now creates the new utt2spk and spk2utt file.\ncat $old_data_dir/utt2spk | perl -e '\n  ($old_seg_file, $new_seg_file, $utt2spk_file_out) = @ARGV;\n  open(OS, \"<$old_seg_file\") || die \"Error: fail to open $old_seg_file\\n\";\n  open(NS, \"<$new_seg_file\") || die \"Error: fail to open $new_seg_file\\n\";\n  open(UO, \">$utt2spk_file_out\") ||\n 
   die \"Error: fail to open $utt2spk_file_out\\n\";\n  while (<STDIN>) {\n    chomp;\n    @col = split;\n    @col == 2 || die \"Error: bad line $_\\n\";\n    $utt2spk{$col[0]} = $col[1];\n  }\n  while (<OS>) {\n    chomp;\n    @col = split;\n    @col == 4 || die \"Error: bad line $_\\n\";\n    if (defined($wav2spk{$col[1]})) {\n      $wav2spk{$col[1]} == $utt2spk{$col[0]} ||\n        die \"Error: multiple speakers detected for wav file $col[1]\\n\";\n    } else {\n      $wav2spk{$col[1]} = $utt2spk{$col[0]};\n    }\n  }\n  while (<NS>) {\n    chomp;\n    @col = split;\n    @col == 4 || die \"Error: bad line $_\\n\";\n    defined($wav2spk{$col[1]}) ||\n      die \"Error: could not find speaker for wav file $col[1]\\n\";\n    print UO \"$col[0] $wav2spk{$col[1]}\\n\";\n  } ' $old_data_dir/segments $new_data_dir/segments $new_data_dir/utt2spk\nutils/utt2spk_to_spk2utt.pl $new_data_dir/utt2spk > $new_data_dir/spk2utt\n\nutils/fix_data_dir.sh $new_data_dir\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/cleanup/make_segmentation_graph.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n# Apache 2.0\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\ntscale=1.0      # transition scale.\nloopscale=0.1   # scale for self-loops.\ncleanup=true\nngram_order=1\nsrilm_options=\"-wbdiscount\"   # By default, use Witten-Bell discounting in SRILM\n# End configuration section.\n\nset -e\n\necho \"$0 $@\"\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"This script builds one decoding graph for each truncated utterance in\"\n  echo \"segmentation. It first calls steps/cleanup/make_utterance_graph.sh to\"\n  echo \"build one decoding graph for each original utterance, which will be\"\n  echo \"shared by the truncated utterances from the same original utterance.\"\n  echo \"We assign the decoding graph to each truncated utterance using the scp\"\n  echo \"file so that we can avoid duplicating the graphs on the disk.\"\n  echo \"\"\n  echo \"Usage: $0 [options] <data-dir> <lang-dir> <model-dir> <graph-dir>\"\n  echo \" e.g.: $0 data/train_si284_split/ \\\\\"\n  echo \"                data/lang exp/tri2b/ exp/tri2b/graph_train_si284_split\"\n  echo \"\"\n  echo \"Options:\"\n  echo \"    --ngram-order           # order of n-gram language model\"\n  echo \"    --srilm-options         # options for ngram-count in SRILM tool\"\n  echo \"    --tscale                # transition scale\"\n  echo \"    --loopscale             # scale for self-loops\"\n  echo \"    --cleanup               # if true, removes the intermediate files\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nmodel_dir=$3\ngraph_dir=$4\n\nfor f in $data/text.orig $data/orig2utt $lang/L_disambig.fst \\\n  $lang/words.txt $lang/oov.int $model_dir/final.mdl $model_dir/tree; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: expected $f to exist\"\n    exit 1;\n  fi\ndone\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $model_dir/phones.txt\n\n# If --ngram-order is larger than 1, we will have to use SRILM\nif [ $ngram_order -gt 1 ]; then\n  ngram_count=`which ngram-count` || true\n  if [ -z $ngram_count ]; then\n    if uname -a | grep 64 >/dev/null; then # some kind of 64 bit...\n      sdir=$KALDI_ROOT/tools/srilm/bin/i686-m64\n    else\n      sdir=$KALDI_ROOT/tools/srilm/bin/i686\n    fi\n    if [ -f $sdir/ngram-count ]; then\n      echo Using SRILM tools from $sdir\n      export PATH=$PATH:$sdir\n    else\n      echo You appear to not have SRILM tools installed, either on your path,\n      echo or installed in $sdir.  See tools/install_srilm.sh for installation\n      echo instructions.\n      exit 1\n    fi\n  fi\nfi\n\n# Creates one graph for each transcript. We parallelize the process a little\n# bit.\nnum_lines=`cat $data/text.orig | wc -l`\nif [ $nj -gt $num_lines ]; then\n  nj=$num_lines\n  echo \"$0: Too many number of jobs, using $nj instead\"\nfi\n\nmkdir -p $graph_dir/split$nj\nmkdir -p $graph_dir/log\n \nsplit_texts=\"\"\nfor n in $(seq $nj); do\n  mkdir -p $graph_dir/split$nj/$n\n  split_texts=\"$split_texts $graph_dir/split$nj/$n/text\"\ndone\nutils/split_scp.pl $data/text.orig $split_texts\n\n$cmd JOB=1:$nj $graph_dir/log/make_utterance_graph.JOB.log \\\n  steps/cleanup/make_utterance_graph.sh --cleanup $cleanup \\\n  --tscale $tscale --loopscale $loopscale \\\n  --ngram-order $ngram_order --srilm-options \"$srilm_options\" \\\n  $graph_dir/split$nj/JOB/text $lang \\\n  $model_dir $graph_dir/split$nj/JOB || exit 1;\n\n# Copies files from lang directory.\nmkdir -p $graph_dir\ncp -r $lang/* $graph_dir\n\nam-info --print-args=false $model_dir/final.mdl |\\\n grep pdfs | awk '{print $NF}' > $graph_dir/num_pdfs\n\n# Creates the graph table.\ncat $graph_dir/split$nj/*/HCLG.fsts.scp > $graph_dir/split$nj/HCLG.fsts.scp\nfstcopy 
scp:$graph_dir/split$nj/HCLG.fsts.scp \\\n  \"ark,scp:$graph_dir/HCLG.fsts,$graph_dir/tmp.HCLG.fsts.scp\"\n\n# The graphs we created above were indexed by the old utterance id. We have to\n# duplicate them for the new utterance id. We do this in the scp file so we do\n# not have to store the duplicated graphs on the disk.\ncat $graph_dir/tmp.HCLG.fsts.scp | perl -e '\n  open(O2U, \"<$ARGV[0]\") || die \"Error: fail to open $ARGV[0]\\n\";\n  while (<STDIN>) {\n    chomp;\n    @col = split;\n    @col == 2 || die \"Error: bad line $_\\n\";\n    $scp{$col[0]} = $col[1];\n  }\n  while (<O2U>) {\n    chomp;\n    @col = split;\n    @col >= 2 || die \"Error: bad line $_\\n\";\n    defined($scp{$col[0]}) ||\n      die \"Error: $col[0] not defined in original scp file\\n\";\n    for ($i = 1; $i < @col; $i += 1) {\n      print \"$col[$i] $scp{$col[0]}\\n\"\n    }\n  }' $data/orig2utt > $graph_dir/HCLG.fsts.scp\nrm $graph_dir/tmp.HCLG.fsts.scp\n\nif $cleanup; then\n  rm -r $graph_dir/split$nj\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/cleanup/make_utterance_fsts.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n\n# makes unigram decoding-graph FSTs specific to each utterances, where the\n# supplied top-n-words list together with the supervision text of the utterance are\n# combined.\n\nif (@ARGV != 1) {\n  print STDERR \"** Warning: this script is deprecated and will be removed.  See\\n\" .\n               \"** steps/cleanup/make_biased_lm_graphs.sh.\\n\" .\n               \"Usage: make_utterance_fsts.pl top-words-file.txt < text-archive > fsts-archive\\n\" .\n               \"e.g.: utils/sym2int.pl -f 2- data/lang/words.txt data/train/text | \\\\\\n\" .\n               \"  make_utterance_fsts.pl exp/foo/top_words.int | compile-train-graphs-fsts ... \\n\";\n  exit(1);\n}\n\n($top_words_file) = @ARGV;\n\nopen(F, \"<$top_words_file\") || die \"opening $top_words_file\";\n\n%top_word_probs = ( );\n\nwhile(<F>) {\n  @A = split;\n  (@A == 2 && $A[0] > 0.0) || die \"Bad line $_ in $top_words_file\";\n  $A[1] =~ m/^[0-9]+$/ || die \"Expecting numeric word-ids in $top_words_file: $_\\n\";\n  $top_word_probs{$A[1]} += $A[0];\n}\n\nwhile (<STDIN>) {\n  @A = split;\n  $utterance_id = shift @A;\n  print \"$utterance_id\\n\";\n  $num_words = @A + 0;  # length of array @A\n  %word_probs = %top_word_probs;\n  foreach $w (@A) {\n    $w =~ m/^[0-9]+$/ || die \"Expecting numeric word-ids as stdin: $_\";\n    $word_probs{$w} += 1.0 / $num_words;\n  }\n  foreach $w (keys %word_probs) {\n    $prob = $word_probs{$w};\n    $prob > 0.0 || die \"Word $w with bad probability $prob, utterance-id = $utterance_id\\n\";\n    $cost = -log($prob);\n    print \"0 0 $w $w $cost\\n\";\n  }\n  $final_cost = -log(1.0 / $num_words);\n  print \"0 $final_cost\\n\";\n  print \"\\n\"; # Empty line terminates the FST in the text-archive format.\n}\n"
  },
  {
    "path": "egs/steps/cleanup/make_utterance_graph.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n# Apache 2.0\n\n# Begin configuration section.\ntscale=1.0      # transition scale.\nloopscale=0.1   # scale for self-loops.\ncleanup=true\nngram_order=1\nsrilm_options=\"-wbdiscount\"   # By default, use Witten-Bell discounting in SRILM\n# End configuration section.\n\nset -e\n\necho \"$0 $@\"\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"This script builds one decoding graph for each utterance using the\"\n  echo \"corresponding text in the given <text> file. If --ngram-order is 1,\"\n  echo \"then utils/make_unigram_grammar.pl will be used to build the unigram\"\n  echo \"language model. Otherwise SRILM will be used instead. You are supposed\"\n  echo \"to have SRILM installed if --ngram-order is larger than 1. The format\"\n  echo \"of the given <text> file is same as the transcript text files in data\"\n  echo \"directory.\"\n  echo \"\"\n  echo \"Usage: $0 [options] <text> <lang-dir> <model-dir> <graph-dir>\"\n  echo \" e.g.: $0 data/train_si284_split/text \\\\\"\n  echo \"                data/lang exp/tri2b/ exp/tri2b/graph_train_si284_split\"\n  echo \"\"\n  echo \"Options:\"\n  echo \"    --ngram-order           # order of n-gram language model\"\n  echo \"    --srilm-options         # options for ngram-count in SRILM tool\"\n  echo \"    --tscale                # transition scale\"\n  echo \"    --loopscale             # scale for self-loops\"\n  echo \"    --cleanup               # if true, removes the intermediate files\"\n  exit 1;\nfi\n\ntext=$1\nlang=$2\nmodel_dir=$3\ngraph_dir=$4\n\nfor f in $lang/L_disambig.fst $lang/words.txt $lang/oov.int \\\n  $model_dir/final.mdl $model_dir/tree; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: expected $f to exist\"\n    exit 1;\n  fi\ndone\n\nmkdir -p $graph_dir/sub_graphs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $model_dir/phones.txt\n\n# If --ngram-order is larger than 1, we will have to use SRILM\nif [ $ngram_order -gt 1 ]; then\n  ngram_count=`which ngram-count` || true\n  if [ -z $ngram_count ]; then\n    if uname -a | grep 64 >/dev/null; then # some kind of 64 bit...\n      sdir=$KALDI_ROOT/tools/srilm/bin/i686-m64\n    else\n      sdir=$KALDI_ROOT/tools/srilm/bin/i686\n    fi\n    if [ -f $sdir/ngram-count ]; then\n      echo Using SRILM tools from $sdir\n      export PATH=$PATH:$sdir\n    else\n      echo You appear to not have SRILM tools installed, either on your path,\n      echo or installed in $sdir.  See tools/install_srilm.sh for installation\n      echo instructions.\n      exit 1\n    fi\n  fi\nfi\n\n# Maps OOV words to the oov symbol.\noov=`cat $lang/oov.int`\noov_txt=`cat $lang/oov.txt`\n\nN=`tree-info --print-args=false $model_dir/tree |\\\n  grep \"context-width\" | awk '{print $NF}'`\nP=`tree-info --print-args=false $model_dir/tree |\\\n  grep \"central-position\" | awk '{print $NF}'`\n\n# Loops over all utterances.\nif [ -f $graph_dir/sub_graphs/HCLG.fsts.scp ]; then\n  rm $graph_dir/sub_graphs/HCLG.fsts.scp\nfi\n\ncat $text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n utils/int2sym.pl -f 2- $lang/words.txt | \\\n while read line; do\n  uttid=`echo $line | cut -d ' ' -f 1`\n  words=`echo $line | cut -d ' ' -f 2-`\n\n  echo \"$0: processing utterance $uttid.\"\n\n  wdir=$graph_dir/sub_graphs/$uttid\n  mkdir -p $wdir\n\n  # Compiles G.fst\n  if [ $ngram_order -eq 1 ]; then\n    echo $words > $wdir/text\n    cat $wdir/text | utils/sym2int.pl --map-oov $oov -f 1- $lang/words.txt | \\\n      utils/make_unigram_grammar.pl | fstcompile |\\\n      fstarcsort --sort_type=ilabel > $wdir/G.fst || exit 1;\n  else\n     echo $words | \\\n     perl -ane '@A = split; for 
($n=0;$n<@A;$n++) { print \"$A[$n] \"; if(($n+1)%30000 == 0 || $n+1==@A) {print \"\\n\";} }' \\\n     > $wdir/text\n     ngram-count -text $wdir/text -order $ngram_order \"$srilm_options\" -lm - | \\\n      arpa2fst --disambig-symbol=#0 \\\n             --read-symbol-table=$lang/words.txt - $wdir/G.fst || exit 1;\n  fi\n  fstisstochastic $wdir/G.fst || echo \"$0: $uttid/G.fst not stochastic.\"\n\n  # Builds LG.fst\n  fsttablecompose $lang/L_disambig.fst $wdir/G.fst |\\\n    fstdeterminizestar --use-log=true | fstminimizeencoded |\\\n    fstarcsort --sort_type=ilabel > $wdir/LG.fst || exit 1;\n  fstisstochastic $wdir/LG.fst || echo \"$0: $uttid/LG.fst not stochastic.\"\n\n  # Builds CLG.fst\n  clg=$wdir/CLG_${N}_${P}.fst\n  fstcomposecontext --context-size=$N --central-position=$P \\\n    --read-disambig-syms=$lang/phones/disambig.int \\\n    --write-disambig-syms=$wdir/disambig_ilabels_${N}_${P}.int \\\n    $wdir/ilabels_${N}_${P} < $wdir/LG.fst | fstdeterminize > $wdir/CLG.fst\n  fstisstochastic $wdir/CLG.fst  || echo \"$0: $uttid/CLG.fst not stochastic.\"\n\n  make-h-transducer --disambig-syms-out=$wdir/disambig_tid.int \\\n    --transition-scale=$tscale $wdir/ilabels_${N}_${P} \\\n    $model_dir/tree $model_dir/final.mdl > $wdir/Ha.fst\n\n  # Builds HCLGa.fst\n  fsttablecompose $wdir/Ha.fst $wdir/CLG.fst | \\\n    fstdeterminizestar --use-log=true | \\\n    fstrmsymbols $wdir/disambig_tid.int | fstrmepslocal | \\\n    fstminimizeencoded > $wdir/HCLGa.fst\n  fstisstochastic $wdir/HCLGa.fst ||\\\n    echo \"$0: $uttid/HCLGa.fst is not stochastic\"\n\n  add-self-loops --self-loop-scale=$loopscale --reorder=true \\\n    $model_dir/final.mdl < $wdir/HCLGa.fst > $wdir/HCLG.fst\n\n  if [ $tscale == 1.0 -a $loopscale == 1.0 ]; then\n    fstisstochastic $wdir/HCLG.fst ||\\\n      echo \"$0: $uttid/HCLG.fst is not stochastic.\"\n  fi\n\n  echo \"$uttid $wdir/HCLG.fst\" >> $graph_dir/sub_graphs/HCLG.fsts.scp\n  echo\n done\n\n# Copies files from lang directory.\nmkdir -p 
$graph_dir\ncp -r $lang/* $graph_dir\n\nam-info --print-args=false $model_dir/final.mdl |\\\n grep pdfs | awk '{print $NF}' > $graph_dir/num_pdfs\n\n# Creates the graph table.\nfstcopy scp:$graph_dir/sub_graphs/HCLG.fsts.scp \\\n  \"ark,scp:$graph_dir/HCLG.fsts,$graph_dir/HCLG.fsts.scp\"\n\nif $cleanup; then\n  rm -r $graph_dir/sub_graphs\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/cleanup/segment_long_utterances.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n#           2016  Vimal Manohar\n# Apache 2.0\n\n# This script performs segmentation of the input data based on the transcription\n# and outputs segmented data along with the corresponding aligned transcription.\n# The purpose of this script is to divide up the input data (which may consist\n# of long recordings such as television shows or audiobooks) into segments which\n# are of manageable length for further processing, along with the portion of the\n# transcript that seems to match (aligns with) each segment.\n# This the light-supervised training scenario where the input transcription is\n# not expected to be completely clean and may have significant errors. \n# See \"JHU Kaldi System for Arabic MGB-3 ASR Challenge using Diarization,\n# Audio-transcript Alignment and Transfer Learning\": Vimal Manohar, Daniel\n# Povey, Sanjeev Khudanpur, ASRU 2017\n# (http://www.danielpovey.com/files/2017_asru_mgb3.pdf) for details.\n# The output data is not necessarily particularly clean; you can run\n# steps/cleanup/clean_and_segment_data.sh on the output in order to\n# further clean it and eliminate data where the transcript doesn't seem to\n# match.\n\n. ./path.sh\n\nset -e\nset -o pipefail\nset -u\n\n# Uniform segmentation options\nmax_segment_duration=30\noverlap_duration=5\nseconds_per_spk_max=30\n\n# Decode options\ngraph_opts=\nbeam=15.0\nlattice_beam=1.0\nnj=4\nlmwt=10\n\n# TF-IDF similarity search options\nmax_words=1000\nnum_neighbors_to_search=1   # Number of neighboring documents to search around the one retrieved based on maximum tf-idf similarity.\nneighbor_tfidf_threshold=0.5\n\nalign_full_hyp=false  # Align full hypothesis i.e. 
traceback from the end to get the alignment.\n\n# First-pass segmentation opts\n# These options are passed to the script\n# steps/cleanup/internal/segment_ctm_edits_mild.py\nsegmentation_extra_opts=\nmin_split_point_duration=0.1\nmax_deleted_words_kept_when_merging=1\nmax_wer=50\nmax_segment_length_for_merging=60\nmax_bad_proportion=0.75\nmax_intersegment_incorrect_words_length=1\nmax_segment_length_for_splitting=10\nhard_max_segment_length=15\nmin_silence_length_to_split_at=0.3\nmin_non_scored_length_to_split_at=0.3\n\nstage=-1\n\ncmd=run.pl\n\n. utils/parse_options.sh\n\nif [ $# -ne 5 ] && [ $# -ne 7 ]; then\n    cat <<EOF\nUsage: $0 [options] <model-dir> <lang> <data-in> [<text-in> <utt2text>] <segmented-data-out> <work-dir>\n e.g.: $0 exp/wsj_tri2b data/lang_nosp data/train_long data/train_long/text data/train_reseg exp/segment_wsj_long_utts_train\nThis script performs segmentation of the data in <data-in> and writes out the\nsegmented data (with a segments file) to\n<segmented-data-out> along with the corresponding aligned transcription.\nNote: If <utt2text> is not provided, the \"text\" file in <data-in> is used as the\nraw transcripts to train biased LM for the utterances.\nIf <utt2text> is provided, then it should be a mapping from the utterance-ids in\n<data-in> to the transcript-keys in the file <text-in>, which will be\nused to train biased LMs for the utterances.\nThe purpose of this script is to divide up the input data (which may consist of\nlong recordings such as television shows or audiobooks) into segments which are\nof manageable length for further processing, along with the portion of the\ntranscript that seems to match each segment.\nThe output data is not necessarily particularly clean; you are advised to run\nsteps/cleanup/clean_and_segment_data.sh on the output in order to further clean\nit and eliminate data where the transcript doesn't seem to match.\nEOF\n    exit 
1\nfi\n\nsrcdir=$1\nlang=$2\ndata=$3\n\nextra_files=\nutt2text=\ntext=$data/text\nif [ $# -eq 7 ]; then\n  text=$4\n  utt2text=$5\n  out_data=$6\n  dir=$7\n  extra_files=\"$utt2text\"\nelse\n  out_data=$4\n  dir=$5\nfi\n\nfor f in $data/feats.scp $text $extra_files $srcdir/tree \\\n  $srcdir/final.mdl $srcdir/cmvn_opts; do\n  if [ ! -f $f ]; then\n    echo \"$0: Could not find file $f\"\n    exit 1\n  fi\ndone\n\ndata_id=`basename $data`\nmkdir -p $dir\n\ndata_uniform_seg=$dir/${data_id}_uniform_seg\n\nframe_shift=`utils/data/get_frame_shift.sh $data`\n\n# First we split the data into segments of around 30s long, on which\n# it would be possible to do a decoding.\n# A diarization step will be added in the future.\nif [ $stage -le 1 ]; then\n  echo \"$0: Stage 1 (Splitting data directory $data into uniform segments)\"\n\n  utils/data/get_utt2dur.sh $data\n  if [ ! -f $data/segments ]; then\n    utils/data/get_segments_for_data.sh $data > $data/segments\n  fi\n\n  utils/data/get_uniform_subsegments.py \\\n    --max-segment-duration=$max_segment_duration \\\n    --overlap-duration=$overlap_duration \\\n    --max-remaining-duration=$(perl -e \"print $max_segment_duration / 2.0\") \\\n    $data/segments > $dir/uniform_sub_segments\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Stage 2 (Prepare uniform sub-segmented data directory)\"\n  rm -r $data_uniform_seg || true\n\n  if [ ! 
-z \"$seconds_per_spk_max\" ]; then\n    utils/data/subsegment_data_dir.sh \\\n      $data $dir/uniform_sub_segments $dir/${data_id}_uniform_seg.temp\n\n    utils/data/modify_speaker_info.sh --seconds-per-spk-max $seconds_per_spk_max \\\n      $dir/${data_id}_uniform_seg.temp $data_uniform_seg\n  else\n    utils/data/subsegment_data_dir.sh \\\n      $data $dir/uniform_sub_segments $data_uniform_seg\n  fi\n\n  utils/fix_data_dir.sh $data_uniform_seg\n\n  # Compute new cmvn stats for the segmented data directory\n  steps/compute_cmvn_stats.sh $data_uniform_seg/\nfi\n\ngraph_dir=$dir/graphs_uniform_seg\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Stage 3 (Building biased-language-model decoding graphs)\"\n\n  cp $srcdir/final.mdl $dir\n  cp $srcdir/tree $dir\n  cp $srcdir/cmvn_opts $dir\n  cp $srcdir/{splice_opts,delta_opts,final.mat,final.alimdl} $dir 2>/dev/null || true\n  cp $srcdir/phones.txt $dir 2>/dev/null || true\n\n  mkdir -p $graph_dir\n  \n  n_reco=$(cat $text | wc -l) || exit 1\n  nj_reco=$nj\n\n  if [ $nj -gt $n_reco ]; then\n    nj_reco=$n_reco\n  fi\n\n  # Make graphs w.r.t. 
to the original text (usually recording-level)\n  steps/cleanup/make_biased_lm_graphs.sh $graph_opts \\\n    --nj $nj_reco --cmd \"$cmd\" $text \\\n    $lang $dir $dir/graphs\n  if [ -z \"$utt2text\" ]; then\n    # and then copy it to the sub-segments.\n    cat $dir/uniform_sub_segments | awk '{print $1\" \"$2}' | \\\n      utils/apply_map.pl -f 2 $dir/graphs/HCLG.fsts.scp | \\\n      sort -k1,1 > \\\n      $graph_dir/HCLG.fsts.scp\n  else\n    # and then copy it to the sub-segments.\n    cat $dir/uniform_sub_segments | awk '{print $1\" \"$2}' | \\\n      utils/apply_map.pl -f 2 $utt2text | \\\n      utils/apply_map.pl -f 2 $dir/graphs/HCLG.fsts.scp | \\\n      sort -k1,1 > \\\n      $graph_dir/HCLG.fsts.scp\n  fi\n\n  cp $lang/words.txt $graph_dir\n  cp -r $lang/phones $graph_dir\n  [ -f $dir/graphs/num_pdfs ] && cp $dir/graphs/num_pdfs $graph_dir/\nfi\n\ndecode_dir=$dir/lats\nmkdir -p $decode_dir\n\nif [ $stage -le 4 ]; then\n  echo \"$0: Decoding with biased language models...\"\n\n  if [ -f $srcdir/trans.1 ]; then\n    steps/cleanup/decode_fmllr_segmentation.sh \\\n      --beam $beam --lattice-beam $lattice_beam --nj $nj --cmd \"$cmd --mem 4G\" \\\n      --skip-scoring true --allow-partial false \\\n      $graph_dir $data_uniform_seg $decode_dir\n  else\n    steps/cleanup/decode_segmentation.sh \\\n      --beam $beam --lattice-beam $lattice_beam --nj $nj --cmd \"$cmd --mem 4G\" \\\n      --skip-scoring true --allow-partial false \\\n      $graph_dir $data_uniform_seg $decode_dir\n  fi\nfi\n\nif [ $stage -le 5 ]; then\n  steps/get_ctm_fast.sh --frame_shift $frame_shift --lmwt $lmwt --cmd \"$cmd --mem 4G\" \\\n    --print-silence true \\\n    $data_uniform_seg $lang $decode_dir $decode_dir/ctm_$lmwt\nfi\n\n# Split the original text into documents, over which we can do\n# searching reasonably efficiently. Also get a mapping from the original\n# text to the created documents (i.e. 
text2doc)\n# Since the Smith-Waterman alignment is linear in the length of the\n# text, we want to keep it reasonably small (a few thousand words).\n\nif [ $stage -le 6 ]; then\n  # Split the reference text into documents.\n  mkdir -p $dir/docs\n\n  # text2doc is a mapping from the original transcript to the documents\n  # it is split into.\n  # The format is\n  # <original-transcript> <doc1> <doc2> ...\n  steps/cleanup/internal/split_text_into_docs.pl --max-words $max_words \\\n    $text $dir/docs/doc2text $dir/docs/docs.txt\n  utils/utt2spk_to_spk2utt.pl $dir/docs/doc2text > $dir/docs/text2doc\nfi\n\nif [ $stage -le 7 ]; then\n  # Get TF-IDF for the reference documents.\n  echo $nj > $dir/docs/num_jobs\n\n  utils/split_data.sh $data_uniform_seg $nj\n\n  mkdir -p $dir/docs/split$nj/\n\n  # First compute IDF stats\n  $cmd $dir/log/compute_source_idf_stats.log \\\n    steps/cleanup/internal/compute_tf_idf.py \\\n    --tf-weighting-scheme=\"raw\" \\\n    --idf-weighting-scheme=\"log\" \\\n    --output-idf-stats=$dir/docs/idf_stats.txt \\\n    $dir/docs/docs.txt $dir/docs/src_tf_idf.txt\n\n  # Split documents so that they can be accessed easily by parallel jobs.\n  mkdir -p $dir/docs/split$nj/\n  sdir=$dir/docs/split$nj\n  for n in `seq $nj`; do\n\n    # old2new_utts is a mapping from the original segments to the\n    # new segments created by uniformly segmenting.\n    # The format is <old-utterance> <new-utt1> <new-utt2> ...\n    utils/filter_scp.pl $data_uniform_seg/split$nj/$n/utt2spk $dir/uniform_sub_segments | \\\n      cut -d ' ' -f 1,2 | utils/utt2spk_to_spk2utt.pl > $sdir/old2new_utts.$n.txt\n\n    if [ ! 
-z \"$utt2text\" ]; then\n      # utt2text, if provided, is a mapping from the <old-utterance> to\n      # <original-transcript>.\n      # Since text2doc is a mapping from <original-transcript> to documents, we\n      # first have to find the original-transcripts that are in the current\n      # split.\n      utils/filter_scp.pl $sdir/old2new_utts.$n.txt $utt2text | \\\n        cut -d ' ' -f 2 | sort -u | \\\n        utils/filter_scp.pl /dev/stdin $dir/docs/text2doc > $sdir/text2doc.$n\n    else\n      utils/filter_scp.pl $sdir/old2new_utts.$n.txt \\\n        $dir/docs/text2doc > $sdir/text2doc.$n\n    fi\n\n    utils/spk2utt_to_utt2spk.pl $sdir/text2doc.$n | \\\n      utils/filter_scp.pl /dev/stdin $dir/docs/docs.txt > \\\n      $sdir/docs.$n.txt\n  done\n\n  # Compute TF-IDF for the source documents.\n  $cmd JOB=1:$nj $dir/docs/log/get_tfidf_for_source_texts.JOB.log \\\n    steps/cleanup/internal/compute_tf_idf.py \\\n      --tf-weighting-scheme=\"raw\" \\\n      --idf-weighting-scheme=\"log\" \\\n      --input-idf-stats=$dir/docs/idf_stats.txt \\\n      $sdir/docs.JOB.txt $sdir/src_tf_idf.JOB.txt\n\n  sdir=$dir/docs/split$nj\n  # Make $sdir an absolute pathname.\n  sdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $sdir ${PWD}`\n\n  for n in `seq $nj`; do\n    awk -v f=\"$sdir/src_tf_idf.$n.txt\" '{print $1\" \"f}' \\\n      $sdir/text2doc.$n\n  done | perl -ane 'BEGIN { %tfidfs = (); }\n  {\n    if (!defined $tfidfs{$F[0]}) {\n      $tfidfs{$F[0]} = $F[1];\n    }\n  }\n  END {\n  while(my ($k, $v) = each %tfidfs) {\n    print \"$k $v\\n\";\n  } }' > $dir/docs/source2tf_idf.scp\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: using default values of non-scored words...\"\n\n  # At the level of this script we just hard-code it that non-scored words are\n  # those that map to silence phones (which is what get_non_scored_words.py\n  # gives us), although this could easily be made user-configurable.  
This list\n  # of non-scored words affects the behavior of several of the data-cleanup\n  # scripts; essentially, we view the non-scored words as negotiable when it\n  # comes to the reference transcript, so we'll consider changing the reference\n  # to match the hyp when it comes to these words.\n  steps/cleanup/internal/get_non_scored_words.py $lang > $dir/non_scored_words.txt\nfi\n\nif [ $stage -le 9 ]; then\n  sdir=$dir/query_docs/split$nj\n  mkdir -p $sdir\n\n  # Compute TF-IDF for the query documents (decode hypotheses).\n  # The output is an archive of TF-IDF indexed by the query.\n  $cmd JOB=1:$nj $decode_dir/ctm_$lmwt/log/compute_query_tf_idf.JOB.log \\\n    steps/cleanup/internal/ctm_to_text.pl --non-scored-words $dir/non_scored_words.txt \\\n      $decode_dir/ctm_$lmwt/ctm.JOB \\| \\\n    steps/cleanup/internal/compute_tf_idf.py \\\n      --tf-weighting-scheme=\"normalized\" \\\n      --idf-weighting-scheme=\"log\" \\\n      --input-idf-stats=$dir/docs/idf_stats.txt \\\n      --accumulate-over-docs=false \\\n      - $sdir/query_tf_idf.JOB.ark.txt\n\n  # The relevant documents can be found using TF-IDF similarity and nearby\n  # documents can also be picked for the Smith-Waterman alignment stage.\n\n  # Get a mapping from the new utterance-ids to original transcripts\n  if [ -z \"$utt2text\" ]; then\n    awk '{print $1\" \"$2}' $dir/uniform_sub_segments > \\\n      $dir/new2orig_utt\n  else\n    awk '{print $1\" \"$2}' $dir/uniform_sub_segments | \\\n      utils/apply_map.pl -f 2 $utt2text > \\\n      $dir/new2orig_utt\n  fi\n\n  # The query TF-IDFs are all indexed by the utterance-id of the sub-segments.\n  # The source TF-IDFs use the document-ids created by splitting the reference\n  # text into documents.\n  # For each query, we need to retrieve the documents that were created from\n  # the same original utterance that the sub-segment was from. For this,\n  # we have to load the source TF-IDF that has those documents. 
This\n  # information is provided using the option --source-text-id2tfidf.\n  # The output of this script is a file where the first column is the\n  # query-id (i.e. sub-segment-id) and the remaining columns, at least one\n  # and at most (1 + 2 * num-neighbors-to-search) in number, are the\n  # document-ids of the retrieved documents.\n  $cmd JOB=1:$nj $dir/log/retrieve_similar_docs.JOB.log \\\n    steps/cleanup/internal/retrieve_similar_docs.py \\\n      --query-tfidf=$dir/query_docs/split$nj/query_tf_idf.JOB.ark.txt \\\n      --source-text-id2tfidf=$dir/docs/source2tf_idf.scp \\\n      --source-text-id2doc-ids=$dir/docs/text2doc \\\n      --query-id2source-text-id=$dir/new2orig_utt \\\n      --num-neighbors-to-search=$num_neighbors_to_search \\\n      --neighbor-tfidf-threshold=$neighbor_tfidf_threshold \\\n      --relevant-docs=$dir/query_docs/split$nj/relevant_docs.JOB.txt\n\n  $cmd JOB=1:$nj $decode_dir/ctm_$lmwt/log/get_ctm_edits.JOB.log \\\n    steps/cleanup/internal/stitch_documents.py \\\n      --query2docs=$dir/query_docs/split$nj/relevant_docs.JOB.txt \\\n      --input-documents=$dir/docs/split$nj/docs.JOB.txt \\\n      --output-documents=- \\| \\\n    steps/cleanup/internal/align_ctm_ref.py --eps-symbol='\"<eps>\"' \\\n      --oov-word=\"'`cat $lang/oov.txt`'\" --symbol-table=$lang/words.txt \\\n      --hyp-format=CTM --align-full-hyp=$align_full_hyp \\\n      --hyp=$decode_dir/ctm_$lmwt/ctm.JOB --ref=- \\\n      --output=$decode_dir/ctm_$lmwt/ctm_edits.JOB\n\n  for n in `seq $nj`; do\n    cat $decode_dir/ctm_$lmwt/ctm_edits.$n\n  done > $decode_dir/ctm_$lmwt/ctm_edits\n\nfi\n\nif [ $stage -le 10 ]; then\n  $cmd $dir/log/resolve_ctm_edits.log \\\n    steps/cleanup/internal/resolve_ctm_edits_overlaps.py \\\n    ${data_uniform_seg}/segments $decode_dir/ctm_$lmwt/ctm_edits $dir/ctm_edits\nfi\n\nif [ $stage -le 11 ]; then\n  echo \"$0: modifying ctm-edits file to allow repetitions [for dysfluencies] and \"\n  echo \"   ... 
to fix reference mismatches involving non-scored words. \"\n\n  $cmd $dir/log/modify_ctm_edits.log \\\n    steps/cleanup/internal/modify_ctm_edits.py --verbose=3 $dir/non_scored_words.txt \\\n    $dir/ctm_edits $dir/ctm_edits.modified\n\n  echo \"   ... See $dir/log/modify_ctm_edits.log for details and stats, including\"\n  echo \" a list of commonly-repeated words.\"\nfi\n\nif [ $stage -le 12 ]; then\n  echo \"$0: applying 'taint' markers to ctm-edits file to mark silences and\"\n  echo \"  ... non-scored words that are next to errors.\"\n  $cmd $dir/log/taint_ctm_edits.log \\\n       steps/cleanup/internal/taint_ctm_edits.py --remove-deletions=false \\\n       $dir/ctm_edits.modified $dir/ctm_edits.tainted\n  echo \"... Stats, including global cor/ins/del/sub stats, are in $dir/log/taint_ctm_edits.log.\"\nfi\n\nif [ $stage -le 13 ]; then\n  echo \"$0: creating segmentation from ctm-edits file.\"\n\n  segmentation_opts=(\n  --min-split-point-duration=$min_split_point_duration\n  --max-deleted-words-kept-when-merging=$max_deleted_words_kept_when_merging\n  --merging.max-wer=$max_wer\n  --merging.max-segment-length=$max_segment_length_for_merging\n  --merging.max-bad-proportion=$max_bad_proportion\n  --merging.max-intersegment-incorrect-words-length=$max_intersegment_incorrect_words_length\n  --splitting.max-segment-length=$max_segment_length_for_splitting\n  --splitting.hard-max-segment-length=$hard_max_segment_length\n  --splitting.min-silence-length=$min_silence_length_to_split_at\n  --splitting.min-non-scored-length=$min_non_scored_length_to_split_at\n  )\n\n  $cmd $dir/log/segment_ctm_edits.log \\\n    steps/cleanup/internal/segment_ctm_edits_mild.py \\\n      ${segmentation_opts[@]} $segmentation_extra_opts \\\n      --oov-symbol-file=$lang/oov.txt \\\n      --ctm-edits-out=$dir/ctm_edits.segmented \\\n      --word-stats-out=$dir/word_stats.txt \\\n      $dir/non_scored_words.txt \\\n      $dir/ctm_edits.tainted $dir/text $dir/segments\n\n  echo \"$0: contents 
of $dir/log/segment_ctm_edits.log are:\"\n  cat $dir/log/segment_ctm_edits.log\n  echo \"For word-level statistics on p(not-being-in-a-segment), with 'worst' words at the top,\"\n  echo \"see $dir/word_stats.txt\"\n  echo \"For detailed utterance-level debugging information, see $dir/ctm_edits.segmented\"\nfi\n\nmkdir -p $out_data\nif [ $stage -le 14 ]; then\n  utils/data/subsegment_data_dir.sh $data_uniform_seg \\\n    $dir/segments $dir/text $out_data\nfi\n"
  },
  {
    "path": "egs/steps/cleanup/segment_long_utterances_nnet3.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n#           2016  Vimal Manohar\n# Apache 2.0\n\n\n# This script is similar to steps/cleanup/segment_long_utterances.sh, but\n# uses nnet3 acoustic model instead of GMM acoustic model for decoding.\n# This script performs segmentation of the input data based on the transcription\n# and outputs segmented data along with the corresponding aligned transcription.\n# The purpose of this script is to divide up the input data (which may consist\n# of long recordings such as television shows or audiobooks) into segments which\n# are of manageable length for further processing, along with the portion of the\n# transcript that seems to match (aligns with) each segment.\n# This the light-supervised training scenario where the input transcription is\n# not expected to be completely clean and may have significant errors.\n# See \"JHU Kaldi System for Arabic MGB-3 ASR Challenge using Diarization,\n# Audio-transcript Alignment and Transfer Learning\": Vimal Manohar, Daniel\n# Povey, Sanjeev Khudanpur, ASRU 2017\n# (http://www.danielpovey.com/files/2017_asru_mgb3.pdf) for details.\n# The output data is not necessarily particularly clean; you can run\n# steps/cleanup/clean_and_segment_data_nnet3.sh on the output in order to\n# further clean it and eliminate data where the transcript doesn't seem to\n# match.\n\n\nset -e\nset -o pipefail\nset -u\n\nstage=-1\ncmd=run.pl\nnj=4\n\n# Uniform segmentation options\nmax_segment_duration=30\noverlap_duration=5\nseconds_per_spk_max=30\n\n# Decode options\ngraph_opts=\nscale_opts=  # for making the graphs\nbeam=15.0\nlattice_beam=1.0\nlmwt=10\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\n\n# Contexts must ideally match training\nextra_left_context=0  # Set to some large value, typically 40 for LSTM (must match training)\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nframes_per_chunk=150\n\n# i-vector 
options\nextractor=    # i-Vector extractor. If provided, will extract i-vectors.\n              # Required if the network was trained with i-vector extractor.\nuse_vad=false # Use energy-based VAD for i-vector extraction\n\n# TF-IDF similarity search options\nmax_words=1000\nnum_neighbors_to_search=1   # Number of neighboring documents to search around the one retrieved based on maximum tf-idf similarity.\nneighbor_tfidf_threshold=0.5\n\nalign_full_hyp=false  # Align full hypothesis i.e. traceback from the end to get the alignment.\n\n# First-pass segmentation opts\n# These options are passed to the script\n# steps/cleanup/internal/segment_ctm_edits_mild.py\nsegmentation_extra_opts=\nmin_split_point_duration=0.1\nmax_deleted_words_kept_when_merging=1\nmax_wer=50\nmax_segment_length_for_merging=60\nmax_bad_proportion=0.75\nmax_intersegment_incorrect_words_length=1\nmax_segment_length_for_splitting=10\nhard_max_segment_length=15\nmin_silence_length_to_split_at=0.3\nmin_non_scored_length_to_split_at=0.3\n\n\n. ./path.sh\n. 
utils/parse_options.sh\n\nif [ $# -ne 5 ] && [ $# -ne 7 ]; then\n  cat <<EOF\nUsage: $0 [--extractor <ivector-extractor>] [options] <model-dir> <lang> <data-in> [<text-in> <utt2text>] <segmented-data-out> <work-dir>\n e.g.: $0 exp/wsj_tri2b data/lang_nosp data/train_long data/train_long/text data/train_reseg exp/segment_wsj_long_utts_train\nThis script performs segmentation of the data in <data-in> and writes out the\nsegmented data (with a segments file) to\n<segmented-data-out> along with the corresponding aligned transcription.\nNote: If <utt2text> is not provided, the \"text\" file in <data-in> is used as the\nraw transcripts to train biased LM for the utterances.\nIf <utt2text> is provided, then it should be a mapping from the utterance-ids in\n<data-in> to the transcript-keys in the file <text-in>, which will be\nused to train biased LMs for the utterances.\nThe purpose of this script is to divide up the input data (which may consist of\nlong recordings such as television shows or audiobooks) into segments which are\nof manageable length for further processing, along with the portion of the\ntranscript that seems to match each segment.\nThe output data is not necessarily particularly clean; you are advised to run\nsteps/cleanup/clean_and_segment_data.sh on the output in order to further clean\nit and eliminate data where the transcript doesn't seem to match.\n  main options (for others, see top of script file):\n    --stage <n>             # stage to run from, to enable resuming from partially\n                            # completed run (default: 0)\n    --cmd '$cmd'            # command to submit jobs with (e.g. 
run.pl, queue.pl)\n    --nj <n>                # number of parallel jobs to use in graph creation and\n                            # decoding\n    --graph-opts 'opts'         # Additional options to make_biased_lm_graphs.sh.\n                                # Please run steps/cleanup/make_biased_lm_graphs.sh\n                                # without arguments to see allowed options.\n    --segmentation-extra-opts 'opts'  # Additional options to segment_ctm_edits_mild.py.\n                                # Please run steps/cleanup/internal/segment_ctm_edits_mild.py\n                                # without arguments to see allowed options.\n    --align-full-hyp <true|false>  # If true, align full hypothesis\n                                   i.e. traceback from the end to get the alignment.\n                                   This is different from the normal\n                                   Smith-Waterman alignment, where the\n                                   traceback will be from the maximum score.\n    --extractor <extractor>     # i-vector extractor directory if i-vector is\n                                # to be used during decoding. Must match\n                                # the extractor used for training neural-network.\n    --use-vad <true|false>      # If true, uses energy-based VAD to apply frame weights\n                                # for i-vector stats extraction\nEOF\n  exit 1\nfi\n\nsrcdir=$1\nlang=$2\ndata=$3\n\nextra_files=\nutt2text=\ntext=$data/text\nif [ $# -eq 7 ]; then\n  text=$4\n  utt2text=$5\n  out_data=$6\n  dir=$7\n  extra_files=\"$utt2text\"\nelse\n  out_data=$4\n  dir=$5\nfi\n\nif [ ! -z \"$extractor\" ]; then\n  extra_files=\"$extra_files $extractor/final.ie\"\nfi\n\nfor f in $data/feats.scp $text $extra_files $srcdir/tree \\\n  $srcdir/final.mdl $srcdir/cmvn_opts; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: Could not find file $f\"\n    exit 1\n  fi\ndone\n\ndata_id=`basename $data`\nmkdir -p $dir\ncp $srcdir/final.mdl $dir\ncp $srcdir/tree $dir\ncp $srcdir/cmvn_opts $dir\ncp $srcdir/{splice_opts,delta_opts,final.mat,final.alimdl} $dir 2>/dev/null || true\ncp $srcdir/frame_subsampling_factor $dir 2>/dev/null || true\n\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  echo \"$0: guessing that this is a chain system, checking parameters.\"\n  # Quote scale_opts: a user-supplied value may contain several options.\n  if [ -z \"$scale_opts\" ]; then\n    echo \"$0: setting scale_opts\"\n    scale_opts=\"--self-loop-scale=1.0 --transition-scale=1.0\"\n  fi\n  if [ $acwt == 0.1 ]; then\n    echo \"$0: setting acwt=1.0\"\n    acwt=1.0\n  fi\n  if [ $lmwt == 10 ]; then\n    echo \"$0: setting lmwt=1\"\n    lmwt=1\n  fi\nfi\n\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\ncp $lang/phones.txt $dir\n\ndata_uniform_seg=$dir/${data_id}_uniform_seg\n\n# First we split the data into segments of around 30s long, on which\n# it would be possible to do a decoding.\n# A diarization step will be added in the future.\nif [ $stage -le 1 ]; then\n  echo \"$0: Stage 1 (Splitting data directory $data into uniform segments)\"\n\n  utils/data/get_utt2dur.sh $data\n  if [ ! -f $data/segments ]; then\n    utils/data/get_segments_for_data.sh $data > $data/segments\n  fi\n\n  utils/data/get_uniform_subsegments.py \\\n    --max-segment-duration=$max_segment_duration \\\n    --overlap-duration=$overlap_duration \\\n    --max-remaining-duration=$(perl -e \"print $max_segment_duration / 2.0\") \\\n    $data/segments > $dir/uniform_sub_segments\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Stage 2 (Prepare uniform sub-segmented data directory)\"\n  rm -r $data_uniform_seg || true\n\n  if [ ! 
-z \"$seconds_per_spk_max\" ]; then\n    utils/data/subsegment_data_dir.sh \\\n      $data $dir/uniform_sub_segments $dir/${data_id}_uniform_seg.temp\n\n    utils/data/modify_speaker_info.sh --seconds-per-spk-max $seconds_per_spk_max \\\n      $dir/${data_id}_uniform_seg.temp $data_uniform_seg\n  else\n    utils/data/subsegment_data_dir.sh \\\n      $data $dir/uniform_sub_segments $data_uniform_seg\n  fi\n\n  utils/fix_data_dir.sh $data_uniform_seg\n\n  # Compute new cmvn stats for the segmented data directory\n  steps/compute_cmvn_stats.sh $data_uniform_seg/\nfi\n\ngraph_dir=$dir/graphs_uniform_seg\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Stage 3 (Building biased-language-model decoding graphs)\"\n\n  mkdir -p $graph_dir\n\n  n_reco=$(cat $text | wc -l) || exit 1\n  nj_reco=$nj\n\n  if [ $nj -gt $n_reco ]; then\n    nj_reco=$n_reco\n  fi\n\n  # Make graphs w.r.t. to the original text (usually recording-level)\n  steps/cleanup/make_biased_lm_graphs.sh $graph_opts \\\n    --scale-opts \"$scale_opts\" \\\n    --nj $nj_reco --cmd \"$cmd\" $text \\\n    $lang $dir $dir/graphs\n  if [ -z \"$utt2text\" ]; then\n    # and then copy it to the sub-segments.\n    cat $dir/uniform_sub_segments | awk '{print $1\" \"$2}' | \\\n      utils/apply_map.pl -f 2 $dir/graphs/HCLG.fsts.scp | \\\n      sort -k1,1 > \\\n      $graph_dir/HCLG.fsts.scp\n  else\n    # and then copy it to the sub-segments.\n    cat $dir/uniform_sub_segments | awk '{print $1\" \"$2}' | \\\n      utils/apply_map.pl -f 2 $utt2text | \\\n      utils/apply_map.pl -f 2 $dir/graphs/HCLG.fsts.scp | \\\n      sort -k1,1 > \\\n      $graph_dir/HCLG.fsts.scp\n  fi\n\n  cp $lang/words.txt $graph_dir\n  cp -r $lang/phones $graph_dir\n  [ -f $dir/graphs/num_pdfs ] && cp $dir/graphs/num_pdfs $graph_dir/\nfi\n\ndecode_dir=$dir/lats\nmkdir -p $decode_dir\n\nonline_ivector_dir=\nif [ ! 
-z \"$extractor\" ]; then\n  online_ivector_dir=$dir/ivectors_$(basename $data_uniform_seg)\n\n  if [ $stage -le 4 ]; then\n    # Compute energy-based VAD\n    if $use_vad; then\n      steps/compute_vad_decision.sh $data_uniform_seg \\\n        $data_uniform_seg/log $data_uniform_seg/data\n    fi\n\n    steps/online/nnet2/extract_ivectors_online.sh \\\n      --nj $nj --cmd \"$cmd --mem 4G\" --use-vad $use_vad \\\n      $data_uniform_seg $extractor $online_ivector_dir\n  fi\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: Decoding with biased language models...\"\n\n  steps/cleanup/decode_segmentation_nnet3.sh \\\n    --acwt $acwt \\\n    --beam $beam --lattice-beam $lattice_beam --nj $nj --cmd \"$cmd --mem 4G\" \\\n    --skip-scoring true --allow-partial false \\\n    --extra-left-context $extra_left_context \\\n    --extra-right-context $extra_right_context \\\n    --extra-left-context-initial $extra_left_context_initial \\\n    --extra-right-context-final $extra_right_context_final \\\n    --frames-per-chunk $frames_per_chunk \\\n    ${online_ivector_dir:+--online-ivector-dir $online_ivector_dir} \\\n    $graph_dir $data_uniform_seg $decode_dir\nfi\n\nframe_shift_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  frame_shift_opt=\"--frame-shift 0.0$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 6 ]; then\n  steps/get_ctm_fast.sh --lmwt $lmwt --cmd \"$cmd --mem 4G\" \\\n    --print-silence true $frame_shift_opt \\\n    $data_uniform_seg $lang $decode_dir $decode_dir/ctm_$lmwt\nfi\n\n# Split the original text into documents, over which we can do\n# searching reasonably efficiently. Also get a mapping from the original\n# text to the created documents (i.e. 
text2doc)\n# Since the Smith-Waterman alignment is linear in the length of the\n# text, we want to keep it reasonably small (a few thousand words).\n\nif [ $stage -le 7 ]; then\n  # Split the reference text into documents.\n  mkdir -p $dir/docs\n\n  # text2doc is a mapping from the original transcript to the documents\n  # it is split into.\n  # The format is\n  # <original-transcript> <doc1> <doc2> ...\n  steps/cleanup/internal/split_text_into_docs.pl --max-words $max_words \\\n    $text $dir/docs/doc2text $dir/docs/docs.txt\n  utils/utt2spk_to_spk2utt.pl $dir/docs/doc2text > $dir/docs/text2doc\nfi\n\nif [ $stage -le 8 ]; then\n  # Get TF-IDF for the reference documents.\n  echo $nj > $dir/docs/num_jobs\n\n  utils/split_data.sh $data_uniform_seg $nj\n\n  mkdir -p $dir/docs/split$nj/\n\n  # First compute IDF stats\n  $cmd $dir/log/compute_source_idf_stats.log \\\n    steps/cleanup/internal/compute_tf_idf.py \\\n    --tf-weighting-scheme=\"raw\" \\\n    --idf-weighting-scheme=\"log\" \\\n    --output-idf-stats=$dir/docs/idf_stats.txt \\\n    $dir/docs/docs.txt $dir/docs/src_tf_idf.txt\n\n  # Split documents so that they can be accessed easily by parallel jobs.\n  mkdir -p $dir/docs/split$nj/\n  sdir=$dir/docs/split$nj\n  for n in `seq $nj`; do\n\n    # old2new_utts is a mapping from the original segments to the\n    # new segments created by uniformly segmenting.\n    # The format is <old-utterance> <new-utt1> <new-utt2> ...\n    utils/filter_scp.pl $data_uniform_seg/split$nj/$n/utt2spk $dir/uniform_sub_segments | \\\n      cut -d ' ' -f 1,2 | utils/utt2spk_to_spk2utt.pl > $sdir/old2new_utts.$n.txt\n\n    if [ ! 
-z \"$utt2text\" ]; then\n      # utt2text, if provided, is a mapping from the <old-utterance> to\n      # <original-transcript>.\n      # Since text2doc is a mapping from <original-transcript> to documents, we\n      # first have to find the original-transcripts that are in the current\n      # split.\n      utils/filter_scp.pl $sdir/old2new_utts.$n.txt $utt2text | \\\n        cut -d ' ' -f 2 | sort -u | \\\n        utils/filter_scp.pl /dev/stdin $dir/docs/text2doc > $sdir/text2doc.$n\n    else\n      utils/filter_scp.pl $sdir/old2new_utts.$n.txt \\\n        $dir/docs/text2doc > $sdir/text2doc.$n\n    fi\n\n    utils/spk2utt_to_utt2spk.pl $sdir/text2doc.$n | \\\n      utils/filter_scp.pl /dev/stdin $dir/docs/docs.txt > \\\n      $sdir/docs.$n.txt\n  done\n\n  # Compute TF-IDF for the source documents.\n  $cmd JOB=1:$nj $dir/docs/log/get_tfidf_for_source_texts.JOB.log \\\n    steps/cleanup/internal/compute_tf_idf.py \\\n      --tf-weighting-scheme=\"raw\" \\\n      --idf-weighting-scheme=\"log\" \\\n      --input-idf-stats=$dir/docs/idf_stats.txt \\\n      $sdir/docs.JOB.txt $sdir/src_tf_idf.JOB.txt\n\n  sdir=$dir/docs/split$nj\n  # Make $sdir an absolute pathname.\n  sdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $sdir ${PWD}`\n\n  for n in `seq $nj`; do\n    awk -v f=\"$sdir/src_tf_idf.$n.txt\" '{print $1\" \"f}' \\\n      $sdir/text2doc.$n\n  done | perl -ane 'BEGIN { %tfidfs = (); }\n  {\n    if (!defined $tfidfs{$F[0]}) {\n      $tfidfs{$F[0]} = $F[1];\n    }\n  }\n  END {\n  while(my ($k, $v) = each %tfidfs) {\n    print \"$k $v\\n\";\n  } }' > $dir/docs/source2tf_idf.scp\nfi\n\nif [ $stage -le 9 ]; then\n  echo \"$0: using default values of non-scored words...\"\n\n  # At the level of this script we just hard-code it that non-scored words are\n  # those that map to silence phones (which is what get_non_scored_words.py\n  # gives us), although this could easily be made user-configurable.  
This list\n  # of non-scored words affects the behavior of several of the data-cleanup\n  # scripts; essentially, we view the non-scored words as negotiable when it\n  # comes to the reference transcript, so we'll consider changing the reference\n  # to match the hyp when it comes to these words.\n  steps/cleanup/internal/get_non_scored_words.py $lang > $dir/non_scored_words.txt\nfi\n\nif [ $stage -le 10 ]; then\n  sdir=$dir/query_docs/split$nj\n  mkdir -p $sdir\n\n  # Compute TF-IDF for the query documents (decode hypotheses).\n  # The output is an archive of TF-IDF indexed by the query.\n  $cmd JOB=1:$nj $decode_dir/ctm_$lmwt/log/compute_query_tf_idf.JOB.log \\\n    steps/cleanup/internal/ctm_to_text.pl --non-scored-words $dir/non_scored_words.txt \\\n      $decode_dir/ctm_$lmwt/ctm.JOB \\| \\\n    steps/cleanup/internal/compute_tf_idf.py \\\n      --tf-weighting-scheme=\"normalized\" \\\n      --idf-weighting-scheme=\"log\" \\\n      --input-idf-stats=$dir/docs/idf_stats.txt \\\n      --accumulate-over-docs=false \\\n      - $sdir/query_tf_idf.JOB.ark.txt\n\n  # The relevant documents can be found using TF-IDF similarity and nearby\n  # documents can also be picked for the Smith-Waterman alignment stage.\n\n  # Get a mapping from the new utterance-ids to original transcripts\n  if [ -z \"$utt2text\" ]; then\n    awk '{print $1\" \"$2}' $dir/uniform_sub_segments > \\\n      $dir/new2orig_utt\n  else\n    awk '{print $1\" \"$2}' $dir/uniform_sub_segments | \\\n      utils/apply_map.pl -f 2 $utt2text > \\\n      $dir/new2orig_utt\n  fi\n\n  # The query TF-IDFs are all indexed by the utterance-id of the sub-segments.\n  # The source TF-IDFs use the document-ids created by splitting the reference\n  # text into documents.\n  # For each query, we need to retrieve the documents that were created from\n  # the same original utterance that the sub-segment was from. For this,\n  # we have to load the source TF-IDF that has those documents. 
This\n  # information is provided using the option --source-text-id2tfidf.\n  # The output of this script is a file where the first column is the\n  # query-id (i.e. sub-segment-id) and the remaining columns (at least one\n  # and at most 1 + 2 * num-neighbors-to-search of them) are the\n  # document-ids of the retrieved documents.\n  $cmd JOB=1:$nj $dir/log/retrieve_similar_docs.JOB.log \\\n    steps/cleanup/internal/retrieve_similar_docs.py \\\n      --query-tfidf=$dir/query_docs/split$nj/query_tf_idf.JOB.ark.txt \\\n      --source-text-id2tfidf=$dir/docs/source2tf_idf.scp \\\n      --source-text-id2doc-ids=$dir/docs/text2doc \\\n      --query-id2source-text-id=$dir/new2orig_utt \\\n      --num-neighbors-to-search=$num_neighbors_to_search \\\n      --neighbor-tfidf-threshold=$neighbor_tfidf_threshold \\\n      --relevant-docs=$dir/query_docs/split$nj/relevant_docs.JOB.txt\n\n  $cmd JOB=1:$nj $decode_dir/ctm_$lmwt/log/get_ctm_edits.JOB.log \\\n    steps/cleanup/internal/stitch_documents.py \\\n      --query2docs=$dir/query_docs/split$nj/relevant_docs.JOB.txt \\\n      --input-documents=$dir/docs/split$nj/docs.JOB.txt \\\n      --output-documents=- \\| \\\n    steps/cleanup/internal/align_ctm_ref.py --eps-symbol='\"<eps>\"' \\\n      --oov-word=\"'`cat $lang/oov.txt`'\" --symbol-table=$lang/words.txt \\\n      --hyp-format=CTM --align-full-hyp=$align_full_hyp \\\n      --hyp=$decode_dir/ctm_$lmwt/ctm.JOB --ref=- \\\n      --output=$decode_dir/ctm_$lmwt/ctm_edits.JOB\n\n  for n in `seq $nj`; do\n    cat $decode_dir/ctm_$lmwt/ctm_edits.$n\n  done > $decode_dir/ctm_$lmwt/ctm_edits\n\nfi\n\nif [ $stage -le 11 ]; then\n  $cmd $dir/log/resolve_ctm_edits.log \\\n    steps/cleanup/internal/resolve_ctm_edits_overlaps.py \\\n    ${data_uniform_seg}/segments $decode_dir/ctm_$lmwt/ctm_edits $dir/ctm_edits\nfi\n\nif [ $stage -le 12 ]; then\n  echo \"$0: modifying ctm-edits file to allow repetitions [for dysfluencies] and \"\n  echo \"   ... 
to fix reference mismatches involving non-scored words. \"\n\n  $cmd $dir/log/modify_ctm_edits.log \\\n    steps/cleanup/internal/modify_ctm_edits.py --verbose=3 $dir/non_scored_words.txt \\\n    $dir/ctm_edits $dir/ctm_edits.modified\n\n  echo \"   ... See $dir/log/modify_ctm_edits.log for details and stats, including\"\n  echo \" a list of commonly-repeated words.\"\nfi\n\nif [ $stage -le 13 ]; then\n  echo \"$0: applying 'taint' markers to ctm-edits file to mark silences and\"\n  echo \"  ... non-scored words that are next to errors.\"\n  $cmd $dir/log/taint_ctm_edits.log \\\n       steps/cleanup/internal/taint_ctm_edits.py --remove-deletions=false \\\n       $dir/ctm_edits.modified $dir/ctm_edits.tainted\n  echo \"... Stats, including global cor/ins/del/sub stats, are in $dir/log/taint_ctm_edits.log.\"\nfi\n\nif [ $stage -le 14 ]; then\n  echo \"$0: creating segmentation from ctm-edits file.\"\n\n  segmentation_opts=(\n  --min-split-point-duration=$min_split_point_duration\n  --max-deleted-words-kept-when-merging=$max_deleted_words_kept_when_merging\n  --merging.max-wer=$max_wer\n  --merging.max-segment-length=$max_segment_length_for_merging\n  --merging.max-bad-proportion=$max_bad_proportion\n  --merging.max-intersegment-incorrect-words-length=$max_intersegment_incorrect_words_length\n  --splitting.max-segment-length=$max_segment_length_for_splitting\n  --splitting.hard-max-segment-length=$hard_max_segment_length\n  --splitting.min-silence-length=$min_silence_length_to_split_at\n  --splitting.min-non-scored-length=$min_non_scored_length_to_split_at\n  )\n\n  $cmd $dir/log/segment_ctm_edits.log \\\n    steps/cleanup/internal/segment_ctm_edits_mild.py \\\n      ${segmentation_opts[@]} $segmentation_extra_opts \\\n      --oov-symbol-file=$lang/oov.txt \\\n      --ctm-edits-out=$dir/ctm_edits.segmented \\\n      --word-stats-out=$dir/word_stats.txt \\\n      $dir/non_scored_words.txt \\\n      $dir/ctm_edits.tainted $dir/text $dir/segments\n\n  echo \"$0: contents 
of $dir/log/segment_ctm_edits.log are:\"\n  cat $dir/log/segment_ctm_edits.log\n  echo \"For word-level statistics on p(not-being-in-a-segment), with 'worst' words at the top,\"\n  echo \"see $dir/word_stats.txt\"\n  echo \"For detailed utterance-level debugging information, see $dir/ctm_edits.segmented\"\nfi\n\nmkdir -p $out_data\nif [ $stage -le 15 ]; then\n  utils/data/subsegment_data_dir.sh $data_uniform_seg \\\n    $dir/segments $dir/text $out_data\nfi\n"
  },
  {
    "path": "egs/steps/cleanup/split_long_utterance.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n# Apache 2.0\n\n# Begin configuration section.\nseg_length=30\nmin_seg_length=10\noverlap_length=5\n# End configuration section.\n\necho \"$0 $@\"\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 2 ]; then\n  echo \"This script truncates the long audio into smaller overlapping segments\"\n  echo \"\"\n  echo \"Usage: $0 [options] <input-dir> <output-dir>\"\n  echo \" e.g.: $0 data/train_si284_long data/train_si284_split\"\n  echo \"\"\n  echo \"Options:\"\n  echo \"    --min-seg-length        # minimal segment length\"\n  echo \"    --seg-length            # length of segments in seconds.\"\n  echo \"    --overlap-length        # length of overlap in seconds.\"\n  exit 1;\nfi\n\ninput_dir=$1\noutput_dir=$2\n\nfor f in spk2utt text utt2spk wav.scp; do\n  [ ! -f $input_dir/$f ] && echo \"$0: no such file $input_dir/$f\" && exit 1;\ndone\n\n[ ! $seg_length -gt $overlap_length ] \\\n  && echo \"$0: --seg-length should be longer than --overlap-length.\" && exit 1;\n\n# Checks if sox is on the path.\nsox=`which sox`\n[ $? -ne 0 ] && echo \"$0: sox command not found.\" && exit 1;\nsph2pipe=$KALDI_ROOT/tools/sph2pipe_v2.5/sph2pipe\n[ ! -x $sph2pipe ] && echo \"$0: sph2pipe command not found.\" && exit 1;\n\nmkdir -p $output_dir\ncp -f $input_dir/spk2gender $output_dir/spk2gender 2>/dev/null\ncp -f $input_dir/text $output_dir/text.orig\ncp -f $input_dir/wav.scp $output_dir/wav.scp\n\n# We assume the audio length in header is correct and get it from there. It is\n# a little bit annoying that old version of sox does not support the following:\n#   $audio_cmd | sox --i -D\n# we have to put it in the following format for the old versions:\n#   $sox --i -D \"|$audio_cmd\"\n# Another way is to count all the samples to get the duration, but it takes\n# longer time, so we do not use it here.. 
The command is:\n#   $audio_cmd | sox -t wav - -n stat | grep -P \"^Length\" | awk '{print $1;}'\n#\n# Note: in the wsj example the process takes a couple of minutes because of the\n#       audio file concatenation; in a real case it should be much faster since\n#       it just reads the header.\ncat $output_dir/wav.scp | perl -e '\n  $no_orig_seg = \"false\";       # Original segment file may or may not exist.\n  ($u2s_in, $u2s_out, $seg_in,\n   $seg_out, $orig2utt, $sox, $slen, $mslen, $olen) = @ARGV;\n  open(UI, \"<$u2s_in\") || die \"Error: fail to open $u2s_in\\n\";\n  open(UO, \">$u2s_out\") || die \"Error: fail to open $u2s_out\\n\";\n  open(SI, \"<$seg_in\") || ($no_orig_seg = \"true\");\n  open(SO, \">$seg_out\") || die \"Error: fail to open $seg_out\\n\";\n  open(UMAP, \">$orig2utt\") || die \"Error: fail to open $orig2utt\\n\";\n  # If the original segment file exists, we have to work out the segment\n  # duration from the segment file. Otherwise we work that out from the wav.scp\n  # file.\n  if ($no_orig_seg eq \"false\") {\n    while (<SI>) {\n      chomp;\n      @col = split;\n      @col == 4 || die \"Error: bad line $_\\n\";\n      ($seg_id, $wav_id, $seg_start, $seg_end) = @col;\n      $seg2wav{$seg_id} = $wav_id;\n      $seg_start{$seg_id} = $seg_start;\n      $seg_end{$seg_id} = $seg_end;\n    }\n  } else {\n    while (<STDIN>) {\n      chomp;\n      @col = split;\n      @col >= 2 || die \"Error: bad line $_\\n\";\n      if ((@col > 2) &&  ($col[-1] eq \"|\")) {\n        $wav_id = shift @col; pop @col;\n        $audio_cmd = join(\" \", @col);\n        $duration = `$sox --i -D '\\''|$audio_cmd'\\''`;\n      } else {\n        @col == 2 || die \"Error: bad line $_\\n in wav.scp\";\n        $wav_id = $col[0];\n        $audio_file = $col[1];\n        $duration = `$sox --i -D $audio_file`;\n      }\n      chomp($duration);\n      $seg2wav{$wav_id} = $wav_id;\n      $seg_start{$wav_id} = 0;\n      $seg_end{$wav_id} = $duration;\n    }\n  }\n  while (<UI>) {\n    
chomp;\n    @col = split;\n    @col == 2 || die \"Error: bad line $_\\n\";\n    $utt2spk{$col[0]} = $col[1];\n  }\n  foreach $seg (sort keys %seg2wav) {\n    $index = 0;\n    $step = $slen - $olen;\n    print UMAP \"$seg\";\n    while ($seg_start{$seg} + $index * $step < $seg_end{$seg}) {\n      $new_seg = $seg . \"_\" . sprintf(\"%05d\", $index);\n      $start = $seg_start{$seg} + $index * $step;\n      $end = $start + $slen;\n      defined($utt2spk{$seg}) || die \"Error: speaker not found for $seg\\n\";\n      print UO \"$new_seg $utt2spk{$seg}\\n\";\n      print UMAP \" $new_seg\"; \n      $index += 1;\n      if ($end - $olen + $mslen >= $seg_end{$seg}) {\n        # last segment will have at least $mslen seconds.\n        $end = $seg_end{$seg};\n        print SO \"$new_seg $seg2wav{$seg} $start $end\\n\";\n        last;\n      } else {\n        print SO \"$new_seg $seg2wav{$seg} $start $end\\n\";\n      }\n    }\n    print UMAP \"\\n\";\n  }' $input_dir/utt2spk $output_dir/utt2spk \\\n    $input_dir/segments $output_dir/segments $output_dir/orig2utt \\\n    $sox $seg_length $min_seg_length $overlap_length\n\n# CAVEAT: We are not dealing with channels here. Each channel should have a\n# unique file name in wav.scp.\npaste -d ' ' <(cut -d ' ' -f 1 $output_dir/wav.scp) \\\n  <(cut -d ' ' -f 1 $output_dir/wav.scp) | awk '{print $1\" \"$2\" A\";}' \\\n  > $output_dir/reco2file_and_channel\n\nutils/fix_data_dir.sh $output_dir\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/combine_ali_dirs.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2016  Xiaohui Zhang  Apache 2.0.\n# Copyright 2019  SmartAction (kkm)\n\n# This script combines alignment directories, such as exp/tri4a_ali, and\n# validates matching of the utterances and alignments after combining.\n\n# Begin configuration section.\ncmd=run.pl\nnj=4\ncombine_lat=true\ncombine_ali=true\ntolerance=10\n# End configuration section.\necho \"$0 $@\"  # Print the command line for logging.\n\n[[ -f path.sh ]] && . ./path.sh\n. parse_options.sh || exit 1\n\nexport LC_ALL=C\n\nif [[ $# -lt 3 ]]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data> <dest-dir> <src-dir1> <src-dir2> ...\n e.g.: $0 --nj 32 data/train exp/tri3_ali_combined exp/tri3_ali_1 exp_tri3_ali_2\nOptions:\n --nj <nj>              # number of jobs to split combined archives [4]\n --combine_ali false    # merge ali.*.gz if present [true]\n --combine_lat false    # merge lat.*.gz if present [true]\n --tolerance <int,%>    # maximum percentage of missing alignments or lattices\n                        # w.r.t. total utterances in <data> before error is\n                        # reported [10]\n\nThe script checks that certain important files are present and compatible in all\nsource directories (phones.txt, tree); other are copied from the first source\n(cmvn_opts, final.mdl) without much checking.\n\nBoth --combine_ali and --combine_lat are true by default, but the script\nproceeds with a warning if directories do not contain either alignments or\nalignment lattices. Check for files ali.1.gz and/or lat.1.gz in the <dest-dir>\nafter the script completes if additional programmatic check is required.\nEOF\n  exit 1;\nfi\n\nif [[ ! $combine_lat && ! $combine_ali ]]; then\n  echo \"$0: at least one of --combine_lat and --combine_ali must be true\"\n  exit 1\nfi\n\ndata=$1\ndest=$2\nshift 2\nfirst_src=$1\n\ndo_ali=$combine_ali\ndo_lat=$combine_lat\n\n# Check if alignments and/or lattices are present. 
Since we combine both,\n# whichever is present, issue a warning only. Also verify that the target is\n# different from any source; we cannot combine in-place, and a lot of damage\n# could result.\nfor src in $@; do\n  if [[ \"$(cd 2>/dev/null -P -- \"$src\" && pwd)\" = \\\n        \"$(cd 2>/dev/null -P -- \"$dest\" && pwd)\" ]]; then\n    echo \"$0: error: Source $src is the same as target $dest.\"\n    exit 1\n  fi\n  if $do_ali && [[ ! -f $src/ali.1.gz ]]; then\n    echo \"$0: warning: Alignments (ali.*.gz) are not present in $src, not\" \\\n         \"combining. Consider '--combine_ali false' to suppress this warning.\"\n    do_ali=false\n  fi\n  if $do_lat && [[ ! -f $src/lat.1.gz ]]; then\n    echo \"$0: warning: Alignment lattices (lat.*.gz) are not present in $src,\"\\\n      \"not combining. Consider '--combine_lat false' to suppress this warning.\"\n    do_lat=false\n  fi\ndone\n\nif ! $do_ali && ! $do_lat; then\n  echo \"$0: error: Cannot combine directories.\"\n  exit 1\nfi\n\n# Verify that required files are present in the first directory.\nfor f in cmvn_opts final.mdl num_jobs phones.txt tree; do\n  if [ ! -f $first_src/$f ]; then\n    echo \"$0: error: Required source file $first_src/$f is missing.\"\n    exit 1\n  fi\ndone\n\n# Verify that phones and trees are compatible in all directories, and that\n# num_jobs files are present, too.\nfor src in $@; do\n  if [[ $src != $first_src ]]; then\n    if [[ ! -f $src/num_jobs ]]; then\n      echo \"$0: error: Required source file $src/num_jobs is missing.\"\n      exit 1\n    fi\n    if ! cmp -s $first_src/tree $src/tree; then\n      echo \"$0: error: tree $src/tree is either missing or not the\" \\\n           \"same as $first_src/tree.\"\n      exit 1\n    fi\n    if [[ ! 
-f $src/phones.txt ]]; then\n      echo \"$0: error: Required source file $src/phones.txt is missing.\"\n      exit 1\n    fi\n    utils/lang/check_phones_compatible.sh $first_src/phones.txt \\\n                                          $src/phones.txt || exit 1\n  fi\ndone\n\n# All checks passed, ok to prepare directory. Copy model and other files from\n# the first source, as they either checked to be compatible, or we do not care\n# if they are.\nmkdir -p $dest || exit 1\nrm -f $dest/{cmvn_opts,final.mdl,num_jobs,phones.txt,tree}\n$do_ali && rm -f $dest/ali.*.{gz,scp}\n$do_lat && rm -f $dest/lat.*.{gz,scp}\ncp $first_src/{cmvn_opts,final.mdl,phones.txt,tree} $dest/ || exit 1\ncp $first_src/frame_subsampling_factor $dest/ 2>/dev/null  # If present.\necho $nj > $dest/num_jobs || exit 1\n\n# Make temporary directory, delete on signal, but not on 'exit 1'.\ntemp_dir=$(mktemp -d $dest/temp.XXXXXX) || exit 1\ncleanup() { rm -rf \"$temp_dir\"; }\ntrap cleanup HUP INT TERM\necho \"$0: note: Temporary directory $temp_dir will not be deleted in case of\" \\\n     \"script failure, so you could examine it for troubleshooting.\"\n\n\n# This function may be called twice, once to combine alignments and the second\n# time to combine lattices. 
The two invocations are as follows:\n#   do_combine ali alignments copy-int-vector $@\n#   do_combine lat lattices   lattice-copy $@\n# where 'ali'/'lat' is a prefix to archive name, 'alignments'/'lattices' go into\n# log messages and logfile names, and 'copy-int-vector'/'lattice-copy' is the\n# program used to copy corresponding objects.\ndo_combine() {\n  local ark=$1 entities=$2 copy_program=$3\n  shift 3\n\n  echo \"$0: Gathering $entities from each source directory.\"\n  # Write all source gzipped archive names to a list file, one file\n  # per source directory, so that we can copy archives in a job per source.\n  src_id=0\n  for src in $@; do\n    src_id=$((src_id + 1))\n    nj_src=$(cat $src/num_jobs) || exit 1\n    # Write the list file $temp_dir/src_arks.${src_id} for the job runner.\n    # Each numbered file will contain the list of archives, e.g.:\n    # exp/tri3_ali/ali.1.gz exp/tri3_ali/ali.2.gz ...\n    # ('printf' repeats its format as long as there are more arguments).\n    printf \"$src/$ark.%d.gz \" $(seq $nj_src) > $temp_dir/src_arks.${src_id}\n  done\n  \n  # Gather archives in parallel jobs.\n  $cmd JOB=1:$src_id $dest/log/gather_$entities.JOB.log \\\n    $copy_program \\\n      \"ark:gunzip -c \\$(cat $temp_dir/src_arks.JOB) |\" \\\n      \"ark,scp:$temp_dir/$ark.JOB.ark,$temp_dir/$ark.JOB.scp\" || exit 1\n\n  # Merge (presumed already sorted) scp's into a single script.\n  sort -m $temp_dir/$ark.*.scp > $temp_dir/$ark.scp || exit 1\n\n  inputs=$(for n in `seq $nj`; do echo $temp_dir/$ark.$n.scp; done)\n  utils/split_scp.pl --utt2spk=$data/utt2spk $temp_dir/$ark.scp $inputs\n\n  echo \"$0: Splitting combined $entities into $nj archives on speaker boundary.\"\n  $cmd JOB=1:$nj $dest/log/chop_combined_$entities.JOB.log \\\n    $copy_program \\\n      \"scp:$temp_dir/$ark.JOB.scp\" \\\n      \"ark:| gzip -c > $dest/$ark.JOB.gz\" || exit 1\n\n  # Get some interesting stats, and signal an error if error threshold 
exceeded.\n  n_utt=$(wc -l <$data/utt2spk)\n  n_ali=$(wc -l <$temp_dir/$ark.scp)\n  n_ali_no_utt=$(join -j1 -v2 $data/utt2spk $temp_dir/$ark.scp | wc -l)\n  n_utt_no_ali=$(join -j1 -v1 $data/utt2spk $temp_dir/$ark.scp | wc -l)\n  n_utt_no_ali_pct=$(perl -e \"print int($n_utt_no_ali/$n_utt * 100 + .5);\")\n  echo \"$0: Combined $n_ali $entities for $n_utt utterances.\" \\\n       \"There were $n_utt_no_ali utterances (${n_utt_no_ali_pct}%) without\" \\\n       \"$entities, and $n_ali_no_utt $entities not matching any utterance.\"\n\n  if (( $n_utt_no_ali_pct >= $tolerance )); then\n    echo \"$0: error: Percentage of utterances missing $entities,\" \\\n         \"${n_utt_no_ali_pct}%, is at or above error tolerance ${tolerance}%.\"\n    exit 1\n  fi\n\n  return 0\n}\n\n# Do the actual combining. Do not check returned exit code, as\n# the function always calls 'exit 1' on failure.\n$do_ali && do_combine ali 'alignments' copy-int-vector \"$@\"\n$do_lat && do_combine lat 'lattices' lattice-copy \"$@\"\n\n# Delete the temporary directory on success.\ncleanup\n\nwhat=\n$do_ali && what+='alignments '\n$do_ali && $do_lat && what+='and '\n$do_lat && what+='lattices '\necho \"$0: Stored combined ${what}in $dest\"  # No period, interferes with\n                                            # copy/paste from tty emulator.\nexit 0\n"
  },
  {
    "path": "egs/steps/combine_trans_dirs.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2016  Xiaohui Zhang  Apache 2.0.\n# Copyright 2019  SmartAction (kkm)\n# Copyright 2019  manhong wang (marvin)\n\n# This script only combines transform file in the aligments dirs, egs: trans.1,  and\n# validates matching of the utterances and alignments after combining. you would need this fmllr trans\n# files after you combine ali or lat dirs(combine_ali_dirs.sh or combine_lat_dis.sh).\n\n# Begin configuration section.\ncmd=run.pl\ntolerance=10\n# End configuration section.\necho \"$0 $@\"  # Print the command line for logging.\n\n[[ -f path.sh ]] && . ./path.sh\n. parse_options.sh || exit 1\n\nexport LC_ALL=C\n\nif [[ $# -lt 3 ]]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data> <dest-dir> <src-dir1> <src-dir2> ...\n e.g.: $0 data/train exp/tri3_trans_combined exp/tri3_trans_1 exp_tri3_trans_2\nOptions:\n --tolerance <int,%>    # maximum percentage of missing trans\n                        # w.r.t. total utterances in <data> before error is\n                        # reported [10]\n\nNote:we do not checks that certain important files are present and compatible in all\nsource directories (phones.txt, tree) here.Because you would run combine_trans_dirs.sh \nor combine_lat_dis.sh first.\n\nEOF\n  exit 1;\nfi\n\n\ndata=$1\ndest=$2\nshift 2\nfirst_src=$1\n\ndo_trans=true    \n\n\n# All checks passed, ok to prepare directory. but we do not Copy model and other files from\n# the first source.\n\nfor src in $@; do\n  if [[ \"$(cd 2>/dev/null -P -- \"$src\" && pwd)\" = \\\n        \"$(cd 2>/dev/null -P -- \"$dest\" && pwd)\" ]]; then\n    echo \"$0: error: Source $src is same as target $dest.\"\n    exit 1\n  fi\n  if $do_trans && [[ ! -f $src/trans.1 ]]; then\n    echo \"$0: warning: transform (trans.*) are not present in $src, not\" \\\n         \"combining. please check you files\" \n    exit 1\n  fi\ndone\n\nif [ ! -f $dest/ali.1.gz  ] && [ ! 
-f $dest/lat.1.gz ] ; then \n    echo \"$0: error: we assume you have already combined the ali or lat dirs;\" \\\n         \"please run combine_ali_dirs.sh or combine_lat_dirs.sh first.\"\n    exit 1\nfi\n\nnj=$(cat $dest/num_jobs)\n\nif [ -f $dest/trans.1 ] ; then rm $dest/trans.* ;fi    #remove old trans.*\n\n# Make temporary directory, delete on signal, but not on 'exit 1'.\ntemp_dir=$(mktemp -d $dest/temp.XXXXXX) || exit 1\ncleanup() { rm -rf \"$temp_dir\"; }\ntrap cleanup HUP INT TERM\necho \"$0: note: Temporary directory $temp_dir will not be deleted in case of\" \\\n     \"script failure, so you could examine it for troubleshooting.\"\n\ndo_combine_trans() {\n  local ark=$1 entities=$2 copy_program=$3\n  shift 3\n\n  echo \"$0: Gathering $entities from each source directory.\"\n  # Assign all source gzipped archive names to an exported variable, one each\n  # per source directory, so that we can copy archives in a job per source.\n  src_id=0\n  for src in $@; do\n    src_id=$((src_id + 1))\n    nj_src=$(cat $src/num_jobs) || exit 1\n    # Create and export variable src_arcs_${src_id} for the job runner.\n    # Each numbered variable will contain the list of archives, e. 
g.:\n    # src_arcs_1=\"exp/tri3_ali/trans.1 exp/tri3_ali/trans.1 ...\"\n    # ('printf' repeats its format as long as there are more arguments).\n    printf \"$src/$ark.%d \" $(seq $nj_src) > $temp_dir/src_arks.${src_id}\n  done\n  \n  # Gather archives in parallel jobs.\n  $cmd JOB=1:$src_id $dest/log/gather_$entities.JOB.log \\\n    $copy_program \\\n      \"ark:cat \\$(cat $temp_dir/src_arks.JOB) |\" \\\n      \"ark,scp:$temp_dir/$ark.JOB,$temp_dir/$ark.JOB.scp\" || exit 1\n\n  # Merge (presumed already sorted) scp's into a single script.\n  sort -m $temp_dir/$ark.*.scp > $temp_dir/$ark.scp || exit 1\n\n  echo \"$0: Splitting combined $entities into $nj archives on speaker boundary.\"\n  $cmd JOB=1:$nj $dest/log/chop_combined_$entities.JOB.log \\\n    $copy_program \\\n      \"scp:utils/split_scp.pl  -j $nj JOB --one-based $temp_dir/$ark.scp |\" \\\n      \"ark:$dest/$ark.JOB\" || exit 1\n\n  # Get some interesting stats.\n  n_utt=$(wc -l <$data/spk2utt)\n  n_trans=$(wc -l <$temp_dir/$ark.scp)\n  n_utt_no_trans_pct=$(perl -e \"print int(($n_utt - $n_trans)/$n_utt * 100 + .5);\")\n  echo \"$0: Combined $n_trans $entities for $n_utt utterances.\" \n\n  if (( $n_utt_no_trans_pct >= $tolerance )); then\n    echo \"$0: error: Percentage of utterances missing $entities,\" \\\n         \"${n_utt_no_trans_pct}%, is at or above error tolerance ${tolerance}%.\"\n    exit 1\n  fi\n\n  return 0\n}\n\n$do_trans && do_combine_trans trans 'transforms' copy-matrix \"$@\"\n\ncleanup     # Delete the temporary directory on success.\n\necho \"$0: Stored combined fmllr trans in $dest\"  \nexit 0\n"
  },
  {
    "path": "egs/steps/compare_alignments.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\nset -e\nstage=0\ncmd=run.pl   # We use this only for get_ctm.sh, which can be a little slow.\nnum_to_sample=1000  # We sample this many utterances for human-readable display, starting from the worst and then\n                    # starting from the middle.\ncleanup=true\n\nif [ -f ./path.sh ]; then . ./path.sh; fi\n\n. ./utils/parse_options.sh\n\nif [ $# -ne 5 ] && [ $# -ne 7 ]; then\n  cat <<EOF\n  This script compares two directories containing data alignments, and\n  creates statistics showing how much the phone and word alignments differ,\n  including breakdown by phones and words; and which utterances differ the\n  most.  This is intended for diagnostic purposes.  Both alignment directories\n  should be for the same data (or at least the data sets should overlap).\n  The word alignment stats may not be correctly obtained if the data-dirs are\n  not the same.\n\n  Usage: $0 [options] <lang-directory> <data-directory> <ali-dir1> <ali-dir2> <work-dir>\n    or:  $0 [options] <lang1> <lang2> <data1> <data2> <ali-dir1> <ali-dir2> <work-dir>\n   e.g.: $0 data/lang data/train exp/tri2_ali exp/tri3_ali exp/compare_ali_2_3\n\n  Options:\n              --cmd (run.pl|queue.pl...)      
# specify how to run the sub-processes.\n                                              # (passed through to get_train_ctm.sh)\n              --cleanup <true|false>          # Specify --cleanup false to prevent\n                                              # cleanup of temporary files.\n              --stage  <n>                    # Enables you to run part of the script.\n\nEOF\n  exit 1\nfi\n\nif [ $# -eq 5 ]; then\n  lang1=$1\n  lang2=$1\n  data1=$2\n  data2=$2\n  ali_dir1=$3\n  ali_dir2=$4\n  dir=$5\nelse\n  lang1=$1\n  lang2=$2\n  data1=$3\n  data2=$4\n  ali_dir1=$5\n  ali_dir2=$6\n  dir=$7\nfi\n\nfor f in $lang1/phones.txt $lang2/phones.txt $data1/utt2spk $data2/utt2spk \\\n         $ali_dir1/ali.1.gz $ali_dir2/ali.1.gz; do\n  if [ ! -f $f ]; then\n    echo \"$0: expected file $f to exist\"\n    exit 1\n  fi\ndone\n\n# This will exit if the phone symbol id's are different, due to\n# `set -e` above.\nutils/lang/check_phones_compatible.sh $lang1/phones.txt $lang2/phones.txt\n\nnj1=$(cat $ali_dir1/num_jobs)\nnj2=$(cat $ali_dir2/num_jobs)\n\nmkdir -p $dir/log\n\n\nif [ $stage -le 0 ]; then\n  echo \"$0: converting alignments to phones.\"\n\n  for j in $(seq $nj1); do gunzip -c $ali_dir1/ali.$j.gz; done | \\\n    ali-to-phones --per-frame=true $ali_dir1/final.mdl ark:- ark:- | gzip -c > $dir/phones1.gz\n\n  for j in $(seq $nj2); do gunzip -c $ali_dir2/ali.$j.gz; done | \\\n    ali-to-phones --per-frame=true $ali_dir2/final.mdl ark:- ark:- | gzip -c > $dir/phones2.gz\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: getting comparison stats and utterance stats.\"\n  compare-int-vector --binary=false --write-confusion-matrix=$dir/conf.mat \\\n            \"ark:gunzip -c $dir/phones1.gz|\" \"ark:gunzip -c $dir/phones2.gz|\" 2>$dir/log/compare_phones.log > $dir/utt_stats.phones\n  tail -n 8 $dir/log/compare_phones.log\nfi\n\nif [ $stage -le 3 ]; then\n  cat $dir/conf.mat | grep -v -F '[' | sed 's/]//' | awk '{n=NF; for (k=1;k<=n;k++) { conf[NR,k] = $k; row_tot[NR] += $k; 
col_tot[k] += $k; } } END{\n   for (row=1;row<=n;row++) for (col=1;col<=n;col++) {\n     val = conf[row,col]; this_row_tot = row_tot[row]; this_col_tot = col_tot[col];\n     rval=conf[col,row]\n     min_tot = (this_row_tot < this_col_tot ? this_row_tot : this_col_tot);\n     if (val != 0) {\n       phone1 = row-1; phone2 = col-1;\n       if (row == col) printf(\"COR %d %d %.2f%%\\n\", phone1, val, (val * 100 / this_row_tot));\n       else {\n         norm_prob = val * val / min_tot;  # heuristic for sorting.\n         printf(\"SUB %d %d %d %d %.2f%% %.2f%%\\n\",\n                 norm_prob, phone1, phone2, val, (val * 100 / min_tot), (rval * 100 / min_tot)); }}}}' > $dir/phone_stats.all\n\n   (\n     echo \"# Format: <phone> <frame-count> <percent-correct>\"\n     grep '^COR' $dir/phone_stats.all | sort -n -k4,4 | awk '{print $2, $3, $4}' | utils/int2sym.pl -f 1 $lang1/phones.txt\n   ) > $dir/phones_correct.txt\n\n   (\n     echo \"# Format: <phone1> <phone2> <num-frames> <prob-wrong%> <reverse-prob-wrong%>\"\n     echo \"# <num-frames> is the number of frames that were labeled <phone1> in the first\"\n     echo \"# set of alignments and <phone2> in the second.\"\n     echo \"# <prob-wrong> is <num-frames> divided by the smaller of the total num-frames of\"\n     echo \"# phone1 and phone2, expressed as a percentage.\"\n     echo \"# <reverse-prob-wrong> is the same but for the reverse substitution, from\"\n     echo \"# <phone2> to <phone1>; comparing it with <prob-wrong> shows how symmetric the substitutions are.\"\n     grep '^SUB' $dir/phone_stats.all | sort -nr -k2,2 | awk '{print $3,$4,$5,$6,$7}' | utils/int2sym.pl -f 1-2 $lang1/phones.txt\n   ) > $dir/phone_subs.txt\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: getting CTMs\"\n  steps/get_train_ctm.sh --use-segments false --print-silence true --cmd \"$cmd\" --frame-shift 1.0 $data1 $lang1 $ali_dir1 $dir/ctm1\n  steps/get_train_ctm.sh --use-segments false --print-silence true --cmd \"$cmd\" --frame-shift 
1.0 $data2 $lang2 $ali_dir2 $dir/ctm2\nfi\n\nif [ $stage -le 5 ]; then\n  oov=$(cat $lang1/oov.int)\n  # Note: below, we use $lang1 for both setups; this is by design as compare-int-vector\n  # assumes they use the same symbol table.\n  for n in 1 2; do\n    cat $dir/ctm${n}/ctm | utils/sym2int.pl --map-oov $oov -f 5 $lang1/words.txt | \\\n      awk 'BEGIN{utt_id=\"\";} { if (utt_id != $1) { if (utt_id != \"\") printf(\"\\n\"); utt_id=$1; printf(\"%s \", utt_id); } t_start=int($3); t_end=t_start + int($4); word=$5; for (t=t_start; t<t_end; t++) printf(\"%s \", word); } END{printf(\"\\n\")}' | \\\n      copy-int-vector ark:- ark:- | gzip -c >$dir/words${n}.gz\n  done\nfi\n\nif [ $stage -le 5 ]; then\n  compare-int-vector --binary=false --write-tot-counts=$dir/words_tot.vec --write-diff-counts=$dir/words_diff.vec \\\n         \"ark:gunzip -c $dir/words1.gz|\" \"ark:gunzip -c $dir/words2.gz|\" 2>$dir/log/compare_words.log >$dir/utt_stats.words\n  tail -n 8 $dir/log/compare_words.log\nfi\n\nif [ $stage -le 6 ]; then\n\n  ( echo \"# Word stats.  Format:\";\n    echo \"<proportion-of-wrong-frames> <num-wrong-frames> <num-correct-frames> <word>\"\n\n    paste <(awk '{for (n=2;n<NF;n++) print $n;}' <$dir/words_diff.vec) \\\n      <(awk '{for (n=2;n<NF;n++) print $n;}' <$dir/words_tot.vec) | \\\n       awk '{ if($2 > 0) print $1*$1/$2, $1/$2, $1, $2, (NR-1)}' | utils/int2sym.pl -f 5 $lang1/words.txt | \\\n      sort -nr | awk '{print $2, $3, $4, $5;}'\n  ) > $dir/word_stats.txt\n\nfi\n\nif [ $stage -le 7 ]; then\n  for type in phones words; do\n    num_utts=$(wc -l <$dir/utt_stats.$type)\n    cat $dir/utt_stats.$type | awk -v type=$type 'BEGIN{print \"Utterance-id proportion-\"type\"-changed num-frames num-wrong-frames\"; }\n          {print $1, $3 * 1.0 / $2, $2, $3; }' | sort -nr -k2,2 > $dir/utt_stats.$type.sorted\n    (\n      echo \"$0: Percentiles 100, 90, .. 
0 of proportion-$type-changed distribution (over utterances) are:\"\n    cat $dir/utt_stats.$type.sorted | awk -v n=$num_utts 'BEGIN{k=int((n-1)/10);} {if (NR % k == 1) printf(\"%s \", $2); } END{print \"\";}'\n    ) | tee $dir/utt_stats.$type.percentiles\n  done\nfi\n\n\nif [ $stage -le 8 ]; then\n  # Display the 1000 worst utterances, and 1000 utterances from the middle of the pack, in a readable format.\n  num_utts=$(wc -l <$dir/utt_stats.words.sorted)\n  half_num_utts=$[$num_utts/2];\n  if [ $num_to_sample -gt $half_num_utts ]; then\n    num_to_sample=$half_num_utts\n  fi\n  head -n $num_to_sample $dir/utt_stats.words.sorted | awk '{print $1}' > $dir/utt_ids.worst\n  tail -n +$half_num_utts $dir/utt_stats.words.sorted | head -n $num_to_sample | awk '{print $1}' > $dir/utt_ids.mid\n\n  for suf in worst mid; do\n    for n in 1 2; do\n      gunzip -c $dir/phones${n}.gz | copy-int-vector ark:- ark,t:- | utils/filter_scp.pl $dir/utt_ids.$suf  >$dir/temp\n      # the next command reorders them, and duplicates the utterance-id, which we'll later use\n      # to display the word sequence.\n      awk '{print $1,$1,$1}' <$dir/utt_ids.$suf | utils/apply_map.pl -f 3 $dir/temp > $dir/phones${n}.$suf\n      rm $dir/temp\n    done\n    # the stuff with 0 and <eps> below is a kind of hack so that if the phones are the same, we end up\n    # with just the phone, but if different, we end up with p1/p2.\n    # The apply_map.pl stuff is to put the transcript there.\n\n    (\n      echo \"# Format: <utterance-id> <word1> <word2> ... <wordN>  <frame1-phone> ... <frameN-phone>\"\n      echo \"# If the two alignments have the same phone, just that phone will be printed;\"\n      echo \"# otherwise the two phones will be printed, as in 'phone1/phone2'.  
So '/' is present\"\n      echo \"# whenever there is a mismatch.\"\n\n      paste $dir/phones1.$suf $dir/phones2.$suf | perl -ane ' @A = split(\"\\t\", $_); @A1 = split(\" \", $A[0]); @A2 = split(\" \", $A[1]);\n            $utt = shift @A1; shift @A2; print $utt, \" \";\n            for ($n = 0; $n < @A1 && $n < @A2; $n++) { $a1=$A1[$n]; $a2=$A2[$n];  if ($a1 eq $a2) { print \"$a1 \"; } else { print \"$a1 0 $a2 \"; }}\n            print \"\\n\" ' | utils/int2sym.pl -f 3- $lang1/phones.txt | sed 's: <eps> :/:g' | \\\n        utils/apply_map.pl -f 2 $data1/text\n    )  > $dir/compare_phones_${suf}.txt\n  done\nfi\n\n\nif [ $stage -le 9 ] && $cleanup; then\n  rm $dir/phones{1,2}.gz $dir/words{1,2}.gz $dir/ctm*/ctm $dir/*.vec $dir/conf.mat \\\n     $dir/utt_ids.*  $dir/phones{1,2}.{mid,worst} $dir/utt_stats.{phones,words} \\\n     $dir/phone_stats.all\nfi\n\n# clean up\nexit 0\n"
  },
  {
    "path": "egs/steps/compute_cmvn_stats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Compute cepstral mean and variance statistics per speaker.\n# We do this in just one job; it's fast.\n# This script takes no options.\n#\n# Note: there is no option to do CMVN per utterance.  The idea is\n# that if you did it per utterance it would not make sense to do\n# per-speaker fMLLR on top of that (since you'd be doing fMLLR on\n# top of different offsets).  Therefore what would be the use\n# of the speaker information?  In this case you should probably\n# make the speaker-ids identical to the utterance-ids.  The\n# speaker information does not have to correspond to actual\n# speakers, it's just the level you want to adapt at.\n\necho \"$0 $@\"  # Print the command line for logging\n\nfake=false   # If specified, can generate fake/dummy CMVN stats (that won't normalize)\nfake_dims=   # as the \"fake\" option, but you can generate \"fake\" stats only for certain\n             # dimensions.\ntwo_channel=false\n\nif [ \"$1\" == \"--fake\" ]; then\n  fake=true\n  shift\nfi\nif [ \"$1\" == \"--fake-dims\" ]; then\n  fake_dims=$2\n  shift\n  shift\nfi\nif [ \"$1\" == \"--two-channel\" ]; then\n  two_channel=true\n  shift\nfi\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n   echo \"Usage: $0 [options] <data-dir> [<log-dir> [<cmvn-dir>] ]\";\n   echo \"e.g.: $0 data/train exp/make_mfcc/train mfcc\"\n   echo \"Note: <log-dir> defaults to <data-dir>/log, and <cmvn-dir> defaults to <data-dir>/data\"\n   echo \"Options:\"\n   echo \" --fake          gives you fake cmvn stats that do no normalization.\"\n   echo \" --two-channel   is for two-channel telephone data, there must be no segments \"\n   echo \"                 file and reco2file_and_channel must be present.  
It will take\"\n   echo \"                 only frames that are louder than the other channel.\"\n   echo \" --fake-dims <n1:n2>  Generate stats that won't cause normalization for these\"\n   echo \"                  dimensions (e.g. 13:14:15)\"\n   exit 1;\nfi\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  cmvndir=$3\nelse\n  cmvndir=$data/data\nfi\n\n# make $cmvndir an absolute pathname.\ncmvndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $cmvndir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $cmvndir || exit 1;\nmkdir -p $logdir || exit 1;\n\n\nrequired=\"$data/feats.scp $data/spk2utt\"\n\nfor f in $required; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nif $fake; then\n  dim=`feat-to-dim scp:$data/feats.scp -`\n  ! cat $data/spk2utt | awk -v dim=$dim '{print $1, \"[\"; for (n=0; n < dim; n++) { printf(\"0 \"); } print \"1\";\n                                                        for (n=0; n < dim; n++) { printf(\"1 \"); } print \"0 ]\";}' | \\\n    copy-matrix ark:- ark,scp:$cmvndir/cmvn_$name.ark,$cmvndir/cmvn_$name.scp && \\\n     echo \"Error creating fake CMVN stats.  See $logdir/cmvn_$name.log.\" && exit 1;\nelif $two_channel; then\n  ! compute-cmvn-stats-two-channel $data/reco2file_and_channel scp:$data/feats.scp \\\n       ark,scp:$cmvndir/cmvn_$name.ark,$cmvndir/cmvn_$name.scp \\\n    2> $logdir/cmvn_$name.log && echo \"Error computing CMVN stats (using two-channel method). See $logdir/cmvn_$name.log.\" && exit 1;\nelif [ ! -z \"$fake_dims\" ]; then\n  ! compute-cmvn-stats --spk2utt=ark:$data/spk2utt scp:$data/feats.scp ark:- | \\\n    modify-cmvn-stats \"$fake_dims\" ark:- ark,scp:$cmvndir/cmvn_$name.ark,$cmvndir/cmvn_$name.scp && \\\n    echo \"Error computing (partially fake) CMVN stats.  
See $logdir/cmvn_$name.log\" && exit 1;\nelse\n  ! compute-cmvn-stats --spk2utt=ark:$data/spk2utt scp:$data/feats.scp ark,scp:$cmvndir/cmvn_$name.ark,$cmvndir/cmvn_$name.scp \\\n    2> $logdir/cmvn_$name.log && echo \"Error computing CMVN stats. See $logdir/cmvn_$name.log\" && exit 1;\nfi\n\ncp $cmvndir/cmvn_$name.scp $data/cmvn.scp || exit 1;\n\nnc=`cat $data/cmvn.scp | wc -l`\nnu=`cat $data/spk2utt | wc -l`\nif [ $nc -ne $nu ]; then\n  echo \"$0: warning: it seems not all of the speakers got cmvn stats ($nc != $nu);\"\n  [ $nc -eq 0 ] && exit 1;\nfi\n\necho \"Succeeded creating CMVN stats for $name\"\n"
  },
  {
    "path": "egs/steps/compute_vad_decision.sh",
    "content": "#!/bin/bash \n\n# Copyright    2017  Vimal Manohar\n# Apache 2.0\n\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Compute energy based VAD output\n\nnj=4\ncmd=run.pl\nvad_config=conf/vad.conf\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n   echo \"Usage: $0 [options] <data-dir> [<log-dir> [<vad-dir>]]\";\n   echo \"e.g.: $0 data/train exp/make_vad mfcc\"\n   echo \"Note: <log-dir> defaults to <data-dir>/log, and <vad-dir> defaults to <data-dir>/data\"\n   echo \" Options:\"\n   echo \"  --vad-config <config-file>                       # config passed to compute-vad-energy\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  vaddir=$3\nelse\n  vaddir=$data/data\nfi\n\n\n# make $vaddir an absolute pathname.\nvaddir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $vaddir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $vaddir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/vad.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/vad.scp to $data/.backup\"\n  mv $data/vad.scp $data/.backup\nfi\n\nfor f in $data/feats.scp \"$vad_config\"; do\n  if [ ! 
-f $f ]; then\n    echo \"compute_vad_decision.sh: no such file $f\"\n    exit 1;\n  fi\ndone\n\nutils/split_data.sh $data $nj || exit 1;\nsdata=$data/split$nj;\n\n$cmd JOB=1:$nj $logdir/vad_${name}.JOB.log \\\n  compute-vad --config=$vad_config scp:$sdata/JOB/feats.scp \\\n  ark,scp:$vaddir/vad_${name}.JOB.ark,$vaddir/vad_${name}.JOB.scp || exit 1\n\nfor ((n=1; n<=nj; n++)); do\n  cat $vaddir/vad_${name}.$n.scp || exit 1;\ndone > $data/vad.scp\n\nnc=`cat $data/vad.scp | wc -l` \nnu=`cat $data/feats.scp | wc -l` \nif [ $nc -ne $nu ]; then\n  echo \"**Warning: it seems not all of the utterances got VAD output ($nc != $nu);\"\n  echo \"**validate_data_dir.sh will fail; you might want to use fix_data_dir.sh\"\n  [ $nc -eq 0 ] && exit 1;\nfi\n\n\necho \"Created VAD output for $name\"\n"
  },
  {
    "path": "egs/steps/conf/append_eval_to_ctm.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys,operator\n\n# Append Levenshtein alignment of 'hypothesis' and 'reference' into 'CTM':\n# (i.e. the output of 'align-text' post-processed by 'wer_per_utt_details.pl')\n\n# The tags in the appended column are:\n#  'C' = correct\n#  'S' = substitution\n#  'I' = insertion\n#  'U' = unknown (not part of scored segment)\n\nif len(sys.argv) != 4:\n  print('Usage: %s eval-in ctm-in ctm-eval-out' % __file__)\n  sys.exit(1)\ndummy, eval_in, ctm_in, ctm_eval_out = sys.argv\n\nif ctm_eval_out == '-': ctm_eval_out = '/dev/stdout'\n\n# Read the evalutation,\neval_vec = dict()\nwith open(eval_in, 'r') as f:\n  while True:\n    # Reading 4 lines encoding one utterance,\n    ref = f.readline()\n    hyp = f.readline()\n    op = f.readline()\n    csid = f.readline()\n    if not ref: break\n    # Parse the input,\n    utt,tag,hyp_vec = hyp.split(' ',2)\n    assert(tag == 'hyp')\n    utt,tag,op_vec = op.split(' ',2)\n    assert(tag == 'op')\n    hyp_vec = hyp_vec.split()\n    op_vec = op_vec.split()\n    # Fill create eval vector with symbols 'C', 'S', 'I'\n    assert(utt not in eval_vec)\n    eval_vec[utt] = []\n    for op,hyp in zip(op_vec, hyp_vec):\n      if op != 'D': eval_vec[utt].append((op,hyp))\n\n# Load the 'ctm' into dictionary,\nctm = dict()\nwith open(ctm_in) as f:\n  for l in f:\n    utt, ch, beg, dur, wrd, conf = l.split()\n    if not utt in ctm: ctm[utt] = []\n    ctm[utt].append((utt, ch, float(beg), float(dur), wrd, float(conf)))\n\n# Build the 'ctm' with 'eval' column added,\nctm_eval = []\nfor utt,ctm_part in ctm.items():\n  ctm_part.sort(key = operator.itemgetter(2)) # Sort by 'beg' time,\n  try:\n    # merging 'tuples' by '+', the record has format:\n    # (utt, ch, beg, dur, ctm_wrd, conf, op, hyp_wrd)\n    merged = [ ctm_tup + evl_tup for ctm_tup,evl_tup in zip(ctm_part,eval_vec[utt]) 
]\n    # check,\n    for j in range(len(merged)):\n      hyp_wrd = merged[j][-1]\n      ctm_wrd = merged[j][-4]\n      assert hyp_wrd == ctm_wrd, \"We failed with words: hyp_wrd %s, ctm_wrd %s\" % (hyp_wrd,ctm_wrd) # Check that words in 'ctm' and 'utt_stats' match!\n      merged[j] = merged[j][:-1] # dropping the 'hyp_wrd' (the last element of tuple),\n    # append,\n    ctm_eval.extend(merged)\n  except KeyError:\n    print('Missing key', utt, 'in the word-evaluation stats from scoring')\n\n# Sort again,\nctm_eval.sort(key = operator.itemgetter(0,1,2))\n\n# Store,\nwith open(ctm_eval_out,'w') as f:\n  for tup in ctm_eval:\n    f.write('%s %s %f %f %s %f %s\\n' % tup)\n\n"
  },
  {
    "path": "egs/steps/conf/append_prf_to_ctm.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys\n\n# Append Levenshtein alignment of 'hypothesis' and 'reference' into 'CTM':\n# (parsed from the 'prf' output of 'sclite')\n\n# The tags in appended column are:\n#  'C' = correct\n#  'S' = substitution\n#  'I' = insertion\n#  'U' = unknown (not part of scored segment)\n\n# Parse options,\nif len(sys.argv) != 4:\n  print(\"Usage: %s prf ctm_in ctm_out\" % __file__)\n  sys.exit(1)\nprf_file, ctm_file, ctm_out_file = sys.argv[1:]\n\nif ctm_out_file == '-': ctm_out_file = '/dev/stdout'\n\n# Load the prf file,\nprf = []\nwith open(prf_file) as f:\n  for l in f:\n    # Store the data,\n    if l[:5] == 'File:':\n      file_id = l.split()[1]\n    if l[:8] == 'Channel:':\n      chan = l.split()[1]\n    if l[:5] == 'H_T1:':\n      h_t1 = l\n    if l[:5] == 'Eval:':\n      evl = l\n      prf.append((file_id,chan,h_t1,evl))\n\n# Parse the prf records into dictionary,\nprf_dict = dict()\nfor (f,c,t,e) in prf:\n  t_pos = 0 # position in the 't' string,\n  while t_pos < len(t):\n    t1 = t[t_pos:].split(' ',1)[0] # get 1st token at 't_pos'\n    try:\n      # get word evaluation letter 'C,S,I',\n      evl = e[t_pos] if e[t_pos] != ' ' else 'C' \n      # add to dictionary,\n      key='%s,%s' % (f,c) # file,channel\n      if key not in prf_dict: prf_dict[key] = dict()\n      prf_dict[key][float(t1)] = evl\n    except ValueError:\n      pass\n    t_pos += len(t1)+1 # advance position for parsing,\n\n# Load the ctm file (with confidences),\nwith open(ctm_file) as f:\n  ctm = [ l.split() for l in f ]\n\n# Append the sclite alignment tags to ctm,\nctm_out = []\nfor f, chan, beg, dur, wrd, conf in ctm:\n  # U = unknown, C = correct, S = substitution, I = insertion,\n  sclite_tag = 'U' \n  try:\n    sclite_tag = prf_dict[('%s,%s'%(f,chan)).lower()][float(beg)]\n  except KeyError:\n    pass\n  
ctm_out.append([f,chan,beg,dur,wrd,conf,sclite_tag])\n\n# Save the augmented ctm file,\nwith open(ctm_out_file, 'w') as f:\n  f.writelines([' '.join(ctm_record)+'\\n' for ctm_record in ctm_out])\n\n"
  },
  {
    "path": "egs/steps/conf/apply_calibration.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2015, Brno University of Technology (Author: Karel Vesely). Apache 2.0.\n\n# Trains logistic regression, which calibrates the per-word confidences,\n# which are extracted by the Minimum Bayes Risk decoding.\n\n# begin configuration section.\ncmd=\nstage=0\n# end configuration section.\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: $0 [opts] <data-dir> <lang-dir|graph-dir> <decode-dir> <calibration-dir> <output-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\nlatdir=$3\ncaldir=$4\ndir=$5\n\nmodel=$latdir/../final.mdl # assume model one level up from decoding dir.\ncalibration=$caldir/calibration.mdl\nword_feats=$caldir/word_feats\nword_categories=$caldir/word_categories\n\nfor f in $lang/words.txt $word_feats $word_categories $latdir/lat.1.gz $calibration $model; do\n  [ ! 
-f $f ] && echo \"$0: Missing file $f\" && exit 1\ndone\n[ -z \"$cmd\" ] && echo \"$0: Missing --cmd '...'\" && exit 1\n\n[ -d $dir/log ] || mkdir -p $dir/log\nnj=$(cat $latdir/num_jobs)\nlmwt=$(cat $caldir/lmwt)\ndecode_mbr=$(cat $caldir/decode_mbr)\n\n# Store the setup,\necho $lmwt >$dir/lmwt\necho $decode_mbr >$dir/decode_mbr \ncp $calibration $dir/calibration.mdl\ncp $word_feats $dir/word_feats\ncp $word_categories $dir/word_categories\n\n# Create the ctm with raw confidences,\n# - we keep the timing relative to the utterance,\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n    lattice-scale --inv-acoustic-scale=$lmwt \"ark:gunzip -c $latdir/lat.JOB.gz|\" ark:- \\| \\\n    lattice-limit-depth ark:- ark:- \\| \\\n    lattice-push --push-strings=false ark:- ark:- \\| \\\n    lattice-align-words-lexicon --max-expand=10.0 \\\n     $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n    lattice-to-ctm-conf --decode-mbr=$decode_mbr ark:- - \\| \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    '>' $dir/JOB.ctm\n  # Merge and clean,\n  for ((n=1; n<=nj; n++)); do cat $dir/${n}.ctm; done > $dir/ctm\n  rm $dir/*.ctm\n  cat $dir/ctm | utils/sym2int.pl -f 5 $lang/words.txt >$dir/ctm_int\nfi\n\n# Compute lattice-depth,\nlatdepth=$dir/lattice_frame_depth.ark\nif [ $stage -le 1 ]; then\n  [ -e $latdepth ] || steps/conf/lattice_depth_per_frame.sh --cmd \"$cmd\" $latdir $dir\nfi\n\n# Create the forwarding data for logistic regression,\nif [ $stage -le 2 ]; then\n  steps/conf/prepare_calibration_data.py --conf-feats $dir/forward_feats.ark \\\n    --lattice-depth $latdepth $dir/ctm_int $word_feats $word_categories\nfi\n\n# Apply calibration model to dev,\nif [ $stage -le 3 ]; then\n  logistic-regression-eval --apply-log=false $calibration \\\n    ark:$dir/forward_feats.ark ark,t:- | \\\n    awk '{ key=$1; p_corr=$4; sub(/,.*/,\"\",key); gsub(/\\^/,\" \",key); print key,p_corr }' | \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    
>$dir/ctm_calibrated\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/steps/conf/convert_ctm_to_tra.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys, operator\n\n# This scripts loads a 'ctm' file and converts it into the 'tra' format:\n# \"utt-key word1 word2 word3 ... wordN\"\n# The 'utt-key' is the 1st column in the CTM.\n\n# Typically the CTM contains:\n# - utterance-relative timimng (i.e. prepared without 'utils/convert_ctm.pl')\n# - confidences\n\nif len(sys.argv) != 3:\n  print('Usage: %s ctm-in tra-out' % __file__)\n  sys.exit(1)\ndummy, ctm_in, tra_out = sys.argv\n\nif ctm_in == '-': ctm_in = '/dev/stdin'\nif tra_out == '-': tra_out = '/dev/stdout'\n\n# Load the 'ctm' into dictionary,\ntra = dict()\nwith open(ctm_in) as f:\n  for l in f:\n    utt, ch, beg, dur, wrd, conf = l.split()\n    if not utt in tra: tra[utt] = []\n    tra[utt].append((float(beg),wrd))\n\n# Store the in 'tra' format,\nwith open(tra_out,'w') as f:\n  for utt,tuples in tra.items():\n    tuples.sort(key = operator.itemgetter(0)) # Sort by 'beg' time,\n    f.write('%s %s\\n' % (utt,' '.join([t[1] for t in tuples])))\n\n"
  },
  {
    "path": "egs/steps/conf/get_ctm_conf.sh",
    "content": "#!/usr/bin/env bash\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2012.  Apache 2.0.\n\n# This script produces CTM files from a decoding directory that has lattices\n# present.  This version gives you confidence scores using MBR decoding.\n# See also steps/get_ctm.sh\n\n\n# begin configuration section.\ncmd=run.pl\nstage=0\nmin_lmwt=5\nmax_lmwt=20\nuse_segments=true # if we have a segments file, use it to convert\n                  # the segments to be relative to the original files.\niter=final\nbeam=5  # pruning beam before MBR decoding\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"This script produces CTM files from a decoding directory that has lattices \"\n  echo \"present.  This version gives you confidence scores using MBR decoding.\"\n  echo \"Usage: $0 [options] <data-dir> <lang-dir|graph-dir> <decode-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --use-segments (true|false)     # use segments and reco2file_and_channel files \"\n  echo \"                                    # to produce a ctm relative to the original audio\"\n  echo \"                                    # files, with channel information (typically needed\"\n  echo \"                                    # for NIST scoring).\"\n  echo \"e.g.:\"\n  echo \"$0 data/train data/lang exp/tri4a/decode/\"\n  echo \"See also: steps/get_ctm.sh, steps/get_ctm_conf_fast.sh\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\ndir=$3\n\nmodel=$dir/../$iter.mdl # assume model one level up from decoding dir.\n\n\nfor f in $lang/words.txt $model $dir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\nname=`basename $data`; # e.g. eval2000\n\nmkdir -p $dir/scoring/log\n\nframe_shift_opt=\nif [ -f $dir/../frame_shift ]; then\n  frame_shift_opt=\"--frame-shift=$(cat $dir/../frame_shift)\"\n  echo \"$0: $dir/../frame_shift exists, using $frame_shift_opt\"\nelif [ -f $dir/../frame_subsampling_factor ]; then\n  factor=$(cat $dir/../frame_subsampling_factor) || exit 1\n  frame_shift_opt=\"--frame-shift=0.0$factor\"\n  echo \"$0: $dir/../frame_subsampling_factor exists, using $frame_shift_opt\"\nfi\n\nif [ $stage -le 0 ]; then\n  if [ -f $data/segments ] && $use_segments; then\n    f=$data/reco2file_and_channel\n    [ ! -f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\n    filter_cmd=\"utils/convert_ctm.pl $data/segments $data/reco2file_and_channel\"\n  else\n    filter_cmd=cat\n  fi\n\n  nj=$(cat $dir/num_jobs)\n  lats=$(for n in $(seq $nj); do echo -n \"$dir/lat.$n.gz \"; done)\n  if [ -f $lang/phones/word_boundary.int ]; then\n    $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring/log/get_ctm.LMWT.log \\\n      set -o pipefail '&&' mkdir -p $dir/score_LMWT/ '&&' \\\n      lattice-prune --inv-acoustic-scale=LMWT --beam=$beam \"ark:gunzip -c $lats|\" ark:- \\| \\\n      lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \\| \\\n      lattice-to-ctm-conf $frame_shift_opt --decode-mbr=true --inv-acoustic-scale=LMWT ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt \\| \\\n      $filter_cmd '>' $dir/score_LMWT/$name.ctm || exit 1;\n  else\n    if [ ! 
-f $lang/phones/align_lexicon.int ]; then\n      echo \"$0: neither $lang/phones/word_boundary.int nor $lang/phones/align_lexicon.int exists: cannot align.\"\n      exit 1;\n    fi\n    $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring/log/get_ctm.LMWT.log \\\n      set -o pipefail '&&' mkdir -p $dir/score_LMWT/ '&&' \\\n      lattice-prune --inv-acoustic-scale=LMWT --beam=$beam \"ark:gunzip -c $lats|\" ark:- \\| \\\n      lattice-align-words-lexicon $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n      lattice-to-ctm-conf $frame_shift_opt --decode-mbr=true --inv-acoustic-scale=LMWT ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt \\| \\\n      $filter_cmd '>' $dir/score_LMWT/$name.ctm || exit 1;\n  fi\nfi\n\n"
  },
  {
    "path": "egs/steps/conf/lattice_depth_per_frame.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2015  Brno University of Technology (Author: Karel Vesely)\n# Licensed under the Apache License, Version 2.0 (the \"License\")\n\n# Extract lattice-depth for each frame.\n\n# Begin configuration\ncmd=run.pl\n# End configuration\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 2 ]; then\n   echo \"usage: $0 [opts] <dir-with-lats> <out-dir>\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>          # config containing options\"\n   echo \"  --cmd\"\n   exit 1;\nfi\n\nset -euo pipefail\n\nlatdir=$1\ndir=$2\n\n[ ! -f $latdir/lat.1.gz ] && echo \"Missing $latdir/lat.1.gz\" && exit 1\nnj=$(cat $latdir/num_jobs)\n\n# Get the pdf-posterior vectors,\n$cmd JOB=1:$nj $dir/log/lattice_depth_per_frame.JOB.log \\\n  lattice-depth-per-frame \"ark:gunzip -c $latdir/lat.JOB.gz |\" ark,t:$dir/lattice_frame_depth.JOB.ark\n# Merge,\nfor ((n=1; n<=nj; n++)); do cat $dir/lattice_frame_depth.${n}.ark; done >$dir/lattice_frame_depth.ark\nrm $dir/lattice_frame_depth.*.ark\n\n# Done!\n"
  },
  {
    "path": "egs/steps/conf/parse_arpa_unigrams.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys, gzip, re\n\n# Parse options,\nif len(sys.argv) != 4:\n  print(\"Usage: %s <words.txt> <arpa-gz> <unigrams>\" % __file__)\n  sys.exit(0)\nwords_txt, arpa_gz, unigrams_out = sys.argv[1:]\n\nif arpa_gz == '-': arpa_gz = '/dev/stdin'\nif unigrams_out == '-': unigrams_out = '/dev/stdout'\n\n# Load the words.txt,\nwords = [ l.split() for l in open(words_txt) ]\n\n# Load the unigram probabilities in 10log from ARPA,\nwrd_log10 = dict()\nwith gzip.open(arpa_gz,'r') as f:\n  read = False\n  for l in f:\n    if l.strip() == '\\\\1-grams:': read = True\n    if l.strip() == '\\\\2-grams:': break\n    if read and len(l.split())>=2:\n      log10_p_unigram, wrd = re.split('[\\t ]+',l.strip(),2)[:2]\n      wrd_log10[wrd] = float(log10_p_unigram)\n\n# Create list, 'wrd id log_p_unigram',\nwords_unigram = [[wrd, id, (wrd_log10[wrd] if wrd in wrd_log10 else -99)] for wrd,id in words ]\n\nprint(words_unigram[0], file=sys.stderr)\n# Store,\nwith open(unigrams_out,'w') as f:\n  f.writelines(['%s %s %g\\n' % (w,i,p) for (w,i,p) in words_unigram])\n\n"
  },
  {
    "path": "egs/steps/conf/prepare_calibration_data.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\nfrom __future__ import division\nimport sys, math\n\nfrom optparse import OptionParser\ndesc = \"\"\"\nPrepare input features and training targets for logistic regression,\nwhich calibrates the Minimum Bayes Risk posterior confidences.\n\nThe logisitc-regression input features are: \n- posteriors from 'ctm' transformed by logit,\n- logarithm of word-length in letters,\n- 10base logarithm of unigram probability of a word from language model,\n- logarithm of average lattice-depth at position of the word (optional),\n\nThe logistic-regresion targets are:\n- 1 for correct word,\n- 0 for incorrect word (substitution, insertion),\n\nThe iput 'ctm' is augmented by per-word tags (or 'U' is added if no tags),\n'C' = correct\n'S' = substitution\n'I' = insertion\n'U' = unknown (not part of scored segment)\n\nThe script can be used both to prepare the training data,\nor to prepare input features for forwarding through trained model.\n\"\"\"\nusage = \"%prog [opts] ctm word-filter word-length unigrams depth-per-frame-ascii.ark word-categories\"\nparser = OptionParser(usage=usage, description=desc)\nparser.add_option(\"--conf-targets\", help=\"Targets file for logistic regression (no targets generated if '') [default %default]\", default='')\nparser.add_option(\"--conf-feats\", help=\"Feature file for logistic regression. [default %default]\", default='')\nparser.add_option(\"--lattice-depth\", help=\"Per-frame lattice depths, ascii-ark (optional). 
[default %default]\", default='')\n(o, args) = parser.parse_args()\n\nif len(args) != 3:\n  parser.print_help()\n  sys.exit(1)\nctm_file, word_feats_file, word_categories_file = args\n\nassert(o.conf_feats != '')\n\n# Load the ctm (optionally add eval colmn with 'U'):\nctm = [ l.split() for l in open(ctm_file) ]\nif len(ctm[0]) == 6: [ l.append('U') for l in ctm ]\nassert(len(ctm[0]) == 7)\n\n# Load the word-features, the format: \"wrd wrd_id filter length other_feats\"\n# (typically 'other_feats' are unigram log-probabilities),\nword_feats = [ l.split(None,4) for l in open(word_feats_file) ]\n\n# Prepare filtering dict,\nword_filter = { wrd_id:bool(int(filter)) for (wrd,wrd_id,filter,length,other_feats) in word_feats }\n# Prepare the lenght dict,\nword_length = { wrd_id:float(length) for (wrd,wrd_id,filter,length,other_feats) in word_feats }\n# Prepare other_feats dict,\nother_feats = { wrd_id:other_feats.strip() for (wrd,wrd_id,filter,length,other_feats) in word_feats }\n\n# Build the targets,\nif o.conf_targets != '':\n  with open(o.conf_targets,'w') as f:\n    for (utt, chan, beg, dur, wrd_id, conf, score_tag) in ctm:\n      # Skip the words we don't know if being correct, \n      if score_tag == 'U': continue \n      # Some words are excluded from training (partial words, hesitations, etc.),\n      # (Value: 1 == keep word, 0 == exclude word from the targets),\n      if not word_filter[wrd_id]: continue \n      # Build the key,\n      key = \"%s^%s^%s^%s^%s,%s,%s\" % (utt, chan, beg, dur, wrd_id, conf, score_tag)\n      # Build the target,\n      tgt = 1 if score_tag == 'C' else 0 # Correct = 1, else 0,\n      # Write,\n      f.write('%s %d\\n' % (key,tgt))\n\n# Load the per-frame lattice-depth,\n# - we assume, the 1st column in 'ctm' is the 'utterance-key' in depth file,\n# - if the 'ctm' and 'ark' keys don't match, we leave this feature out,\nif o.lattice_depth:\n  depths = dict()\n  for l in open(o.lattice_depth):\n    utt,d = l.split(' ',1)\n    
depths[utt] = [int(i) for i in d.split()]\n\n# Load the 'word_categories' mapping for categorical input features derived from 'lang/words.txt',\nwrd_to_cat = [ l.split() for l in open(word_categories_file) ]\nwrd_to_cat = { wrd_id:int(category) for wrd,wrd_id,category in wrd_to_cat }\nwrd_cat_num = max(wrd_to_cat.values()) + 1\n\n# Build the input features,\nwith open(o.conf_feats,'w') as f:\n  for (utt, chan, beg, dur, wrd_id, conf, score_tag) in ctm:\n    # Build the key, same as previously,\n    key = \"%s^%s^%s^%s^%s,%s,%s\" % (utt, chan, beg, dur, wrd_id, conf, score_tag)\n\n    # Build input features,\n    # - logit of MBR posterior,\n    damper = 0.001 # avoid -inf,+inf from log,\n    logit = math.log(float(conf)+damper) - math.log(1.0 - float(conf)+damper)\n    # - log of word-length,\n    log_word_length = math.log(word_length[wrd_id]) # i.e. number of phones in a word,\n    # - categorical distribution of words (with frequency higher than min-count),\n    wrd_1_of_k = [0]*wrd_cat_num; \n    wrd_1_of_k[wrd_to_cat[wrd_id]] = 1;\n\n    # Compose the input feature vector,\n    feats = [ logit, log_word_length, other_feats[wrd_id] ] + wrd_1_of_k\n\n    # Optionally add average-depth of lattice at the word position,\n    if o.lattice_depth != '':\n      depth_slice = depths[utt][int(round(100.0*float(beg))):int(round(100.0*(float(beg)+float(dur))))]\n      log_avg_depth = math.log(float(sum(depth_slice))/len(depth_slice))\n      feats += [ log_avg_depth ]\n\n    # Store the input features, \n    f.write(key + ' [ ' + ' '.join(map(str,feats)) + ' ]\\n')\n\n"
  },
  {
    "path": "egs/steps/conf/prepare_word_categories.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\nimport sys\n\nfrom optparse import OptionParser\ndesc = \"\"\"\nPrepare mapping of words into categories. Each word with minimal frequency \nhas its own category, the rest is merged into single class.\n\"\"\"\nusage = \"%prog [opts] words.txt ctm category_mapping\"\nparser = OptionParser(usage=usage, description=desc)\nparser.add_option(\"--min-count\", help=\"Minimum word-count to have a single word category. [default %default]\", type='int', default=20)\n(o, args) = parser.parse_args()\n\nif len(args) != 3:\n  parser.print_help()\n  sys.exit(1)\nwords_file, text_file, category_mapping_file = args\n\nif text_file == '-': text_file = '/dev/stdin'\nif category_mapping_file == '-': category_mapping_file = '/dev/stdout'\n\n# Read the words from the 'tra' file,\nwith open(text_file) as f:\n  text_words = [ l.split()[1:] for l in f ]\n\n# Flatten the array of arrays of words,\nimport itertools\ntext_words = list(itertools.chain.from_iterable(text_words))\n\n# Count the words (regardless if correct or incorrect),\nword_counts = dict()\nfor w in text_words:\n  if w not in word_counts: word_counts[w] = 0\n  word_counts[w] += 1\n\n# Read the words.txt,\nwith open(words_file) as f:\n  word_id = [ l.split() for l in f ]\n\n# Append the categories,\nn=1\nword_id_cat=[]\nfor word, idx in word_id:\n  cat = 0 \n  if word in word_counts:\n    if word_counts[word] > o.min_count:\n      cat = n; n += 1\n  word_id_cat.append([word, idx, str(cat)])\n\n# Store the mapping,\nwith open(category_mapping_file,'w') as f:\n  f.writelines([' '.join(record)+'\\n' for record in word_id_cat])\n"
  },
  {
    "path": "egs/steps/conf/train_calibration.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2015, Brno University of Technology (Author: Karel Vesely). Apache 2.0.\n\n# Trains logistic regression, which calibrates the per-word confidences in 'CTM'.\n# The 'raw' confidences are obtained by Minimum Bayes Risk decoding.\n\n# The input features of logistic regression are:\n# - logit of Minumum Bayer Risk posterior\n# - log of word-length in characters\n# - log of average-depth depth of a lattice at words' position\n# - log of frames per character ratio\n# (- categorical distribution of 'lang/words.txt', DISABLED)\n\n# begin configuration section.\ncmd=\nlmwt=12\ndecode_mbr=true\nword_min_count=10 # Minimum word-count for single-word category,\nnormalizer=0.0025 # L2 regularization constant,\ncategory_text= # Alternative corpus for counting words to get word-categories (by default using 'ctm'),\nstage=0\n# end configuration section.\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: $0 [opts] <data-dir> <lang-dir|graph-dir> <word-feats> <decode-dir> <calibration-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --lmwt <int>                    # scaling for confidence extraction\"\n  echo \"    --decode-mbr <bool>             # use Minimum Bayes Risk decoding\"\n  echo \"    --grep-filter <str>             # remove words from calibration targets\"\n  exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\nword_feats=$3\nlatdir=$4\ndir=$5\n\nmodel=$latdir/../final.mdl # assume model one level up from decoding dir.\n\nfor f in $data/text $lang/words.txt $word_feats $latdir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: Missing file $f\" && exit 1\ndone\n[ -z \"$cmd\" ] && echo \"$0: Missing --cmd '...'\" && exit 1\n\n[ -d $dir/log ] || mkdir -p $dir/log\nnj=$(cat $latdir/num_jobs)\n\n# Store the setup,\necho $lmwt >$dir/lmwt\necho $decode_mbr >$dir/decode_mbr\ncp $word_feats $dir/word_feats\n\n# Create the ctm with raw confidences,\n# - we keep the timing relative to the utterance,\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n    lattice-scale --inv-acoustic-scale=$lmwt \"ark:gunzip -c $latdir/lat.JOB.gz|\" ark:- \\| \\\n    lattice-limit-depth ark:- ark:- \\| \\\n    lattice-push --push-strings=false ark:- ark:- \\| \\\n    lattice-align-words-lexicon --max-expand=10.0 \\\n     $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n    lattice-to-ctm-conf --decode-mbr=$decode_mbr ark:- - \\| \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    '>' $dir/JOB.ctm\n  # Merge and clean,\n  for ((n=1; n<=nj; n++)); do cat $dir/${n}.ctm; done > $dir/ctm\n  rm $dir/*.ctm\nfi\n\n# Get evaluation of the 'ctm' using the 'text' reference,\nif [ $stage -le 1 ]; then\n  steps/conf/convert_ctm_to_tra.py $dir/ctm - | \\\n  align-text --special-symbol=\"<eps>\" ark:$data/text ark:- ark,t:- | \\\n  utils/scoring/wer_per_utt_details.pl --special-symbol \"<eps>\" \\\n  >$dir/align_text \n  # Append alignment to ctm,\n  steps/conf/append_eval_to_ctm.py $dir/align_text $dir/ctm $dir/ctm_aligned\n  # Convert words to 'ids',\n  cat $dir/ctm_aligned | utils/sym2int.pl -f 5 $lang/words.txt >$dir/ctm_aligned_int\nfi\n\n# Prepare word-categories (based on word frequencies in 'ctm'),\nif [ -z \"$category_text\" ]; then\n  steps/conf/convert_ctm_to_tra.py $dir/ctm - | \\\n  steps/conf/prepare_word_categories.py --min-count $word_min_count $lang/words.txt - $dir/word_categories\nelse\n  steps/conf/prepare_word_categories.py --min-count $word_min_count $lang/words.txt \"$category_text\" $dir/word_categories\nfi\n\n# Compute 
lattice-depth,\nlatdepth=$dir/lattice_frame_depth.ark\nif [ $stage -le 2 ]; then\n  [ -e $latdepth ] || steps/conf/lattice_depth_per_frame.sh --cmd \"$cmd\" $latdir $dir\nfi\n\n# Create the training data for logistic regression,\nif [ $stage -le 3 ]; then\n  steps/conf/prepare_calibration_data.py \\\n    --conf-targets $dir/train_targets.ark --conf-feats $dir/train_feats.ark \\\n    --lattice-depth $latdepth $dir/ctm_aligned_int $word_feats $dir/word_categories\nfi\n\n# Train the logistic regression,\nif [ $stage -le 4 ]; then\n  logistic-regression-train --binary=false --normalizer=$normalizer ark:$dir/train_feats.ark \\\n    ark:$dir/train_targets.ark $dir/calibration.mdl 2>$dir/log/logistic-regression-train.log\nfi\n\n# Apply calibration model to dev,\nif [ $stage -le 5 ]; then\n  logistic-regression-eval --apply-log=false $dir/calibration.mdl \\\n    ark:$dir/train_feats.ark ark,t:- | \\\n    awk '{ key=$1; p_corr=$4; sub(/,.*/,\"\",key); gsub(/\\^/,\" \",key); print key,p_corr }' | \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    >$dir/ctm_calibrated_int\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/steps/copy_ali_dir.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2019   Phani Sankar Nidadavolu\n# Apache 2.0.\n\nprefixes=\"reverb1 babble music noise\"\ninclude_original=true\nmax_jobs_run=50\nnj=100\ncmd=queue.pl\nwrite_binary=true\n\n. ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 <out-data> <src-ali-dir> <out-ali-dir>\"\n  echo \"This script creates alignments for the aug dirs by copying \"\n  echo \" the alignments of original train dir\"\n  echo \"While copying it adds prefix to the utterances specified by prefixes option\"\n  echo \"Note that the original train dir does not have any prefix\"\n  echo \"To include the original training directory in the copied \"\n  echo \"version set the --include-original option to true\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --prefixes <string of prefixes to add>    # All the prefixes of aug data to be included\"\n  echo \"  --include-original <true/false>           # If true, will copy the alignements of original dir\"\n  echo \"  --write-compact <true/false>              # Write lattices in compact mode\"\n  exit 1\nfi\n\ndata=$1\nsrc_dir=$2\ndir=$3\n\nmkdir -p $dir\n\nnum_jobs=$(cat $src_dir/num_jobs)\n\nrm -f $dir/ali_tmp.*.{ark,scp} 2>/dev/null\n\n# Copy the alignments temporarily\necho \"creating temporary alignments in $dir\"\n$cmd --max-jobs-run $max_jobs_run JOB=1:$num_jobs $dir/log/copy_ali_temp.JOB.log \\\n  copy-int-vector --binary=$write_binary \\\n  \"ark:gunzip -c $src_dir/ali.JOB.gz |\" \\\n  ark,scp:$dir/ali_tmp.JOB.ark,$dir/ali_tmp.JOB.scp || exit 1\n\n# Make copies of utterances for perturbed data\nfor p in $prefixes; do\n  cat $dir/ali_tmp.*.scp | awk -v p=$p '{print p\"-\"$0}'\ndone | sort -k1,1 > $dir/ali_out.scp.aug\n\nif [ \"$include_original\" == \"true\" ]; then\n  cat $dir/ali_tmp.*.scp | awk '{print $0}' | sort -k1,1 > $dir/ali_out.scp.clean\n  cat $dir/ali_out.scp.clean $dir/ali_out.scp.aug | sort -k1,1 > $dir/ali_out.scp\nelse\n  cat 
$dir/ali_out.scp.aug | sort -k1,1 > $dir/ali_out.scp\nfi\n\nutils/split_data.sh ${data} $nj\n\n# Copy and dump the alignments for perturbed data\necho Creating alignments for augmented data by copying alignments from clean data\n$cmd --max-jobs-run $max_jobs_run JOB=1:$nj $dir/log/copy_out_ali.JOB.log \\\n  copy-int-vector --binary=$write_binary \\\n  \"scp:utils/filter_scp.pl ${data}/split$nj/JOB/utt2spk $dir/ali_out.scp |\" \\\n  \"ark:| gzip -c > $dir/ali.JOB.gz\" || exit 1\n\nrm $dir/ali_out.scp.{aug,clean} $dir/ali_out.scp\nrm $dir/ali_tmp.*\n\necho $nj > $dir/num_jobs\n\nfor f in cmvn_opts tree splice_opts phones.txt final.mdl frame_subsampling_factor; do\n  if [ -f $src_dir/$f ]; then cp $src_dir/$f $dir/$f; fi\ndone\n"
  },
  {
    "path": "egs/steps/copy_lat_dir.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2019   Phani Sankar Nidadavolu\n# Apache 2.0.\n\nprefixes=\"reverb1 babble music noise\"\ninclude_original=true\nmax_jobs_run=50\nnj=100\ncmd=queue.pl\nwrite_compact=true\n\n. ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <out-data> <src-lat-dir> <out-lat-dir>\"\n  echo \"This script creates lattices for the aug dirs by copying the lattices of original train dir\"\n  echo \"While copying it adds prefix to the utterances specified by prefixes option\"\n  echo \"Note that the original train dir does not have any prefix\"\n  echo \"To include the original training directory in the copied \"\n  echo \"version set the --include-original option to true\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --prefixes <string of prefixes to add>             # All the prefixes of aug data to be included\"\n  echo \"  --include-original <true/false>                    # If true, will copy the lattices of original dir\"\n  echo \"  --write-compact <true/false>                       # Write lattices in compact mode\"\n  exit 1\nfi\n\ndata=$1\nsrc_dir=$2\ndir=$3\n\nmkdir -p $dir\n\nnum_jobs=$(cat $src_dir/num_jobs)\n\nrm -f $dir/lat_tmp.*.{ark,scp} 2>/dev/null\n\n# Copy the alignments temporarily\necho \"creating temporary lattices in $dir\"\n$cmd --max-jobs-run $max_jobs_run JOB=1:$num_jobs $dir/log/copy_lat_temp.JOB.log \\\n  lattice-copy --write-compact=$write_compact \\\n  \"ark:gunzip -c $src_dir/lat.JOB.gz |\" \\\n  ark,scp:$dir/lat_tmp.JOB.ark,$dir/lat_tmp.JOB.scp || exit 1\n\n# Make copies of utterances for perturbed data\nfor p in $prefixes; do\n  cat $dir/lat_tmp.*.scp | awk -v p=$p '{print p\"-\"$0}'\ndone | sort -k1,1 > $dir/lat_out.scp.aug\n\nif [ \"$include_original\" == \"true\" ]; then\n  cat $dir/lat_tmp.*.scp | awk '{print $0}' | sort -k1,1 > $dir/lat_out.scp.clean\n  cat $dir/lat_out.scp.clean $dir/lat_out.scp.aug | sort -k1,1 > 
$dir/lat_out.scp\nelse\n  cat $dir/lat_out.scp.aug | sort -k1,1 > $dir/lat_out.scp\nfi\n\nutils/split_data.sh ${data} $nj\n\n# Copy and dump the lattices for perturbed data\necho Creating lattices for augmented data by copying lattices from clean data\n$cmd --max-jobs-run $max_jobs_run JOB=1:$nj $dir/log/copy_out_lat.JOB.log \\\n  lattice-copy --write-compact=$write_compact \\\n  \"scp:utils/filter_scp.pl ${data}/split$nj/JOB/utt2spk $dir/lat_out.scp |\" \\\n  \"ark:| gzip -c > $dir/lat.JOB.gz\" || exit 1\n\nrm $dir/lat_out.scp.{aug,clean} $dir/lat_out.scp\nrm $dir/lat_tmp.*\n\necho $nj > $dir/num_jobs\n\nfor f in phones.txt cmvn_opts splice_opts final.mdl splice_opts tree frame_subsampling_factor; do\n  if [ -f $src_dir/$f ]; then cp $src_dir/$f $dir/$f; fi\ndone\n"
  },
  {
    "path": "egs/steps/copy_trans_dir.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2019   Phani Sankar Nidadavolu\n# Copyright 2019   manhong wang(marvin)\n# Apache 2.0.\n\n#This script creates fmllr transform for the aug dirs by copying \n#the trans of original train dir after you copy_ali_dirs.sh or copy_lat_dirs.sh\n#Note :  wo do not accept --nj here ,which shoud keep same as ali file\nprefixes=\"reverb1 babble music noise\"\ninclude_original=true\ncmd=run.pl\nwrite_binary=true\n\n. ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 <out-data> <src-ali-dir> <out-ali-dir>\"\n  echo \"This script creates fmllr transform for the aug dirs by copying \"\n  echo \" the trans of original train dir\"\n  echo \"While copying it adds prefix to the utterances specified by prefixes option\"\n  echo \"Note that the original train dir does not have any prefix\"\n  echo \"To include the original training directory in the copied \"\n  echo \"version set the --include-original option to true\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --prefixes <string of prefixes to add>    # All the prefixes of aug data to be included\"\n  echo \"  --include-original <true/false>           # If true, will copy the alignements of original dir\"\n  exit 1\nfi\n\ndata=$1\nsrc_dir=$2\ndir=$3\n\nif [ ! -d $dir ]; then\n    echo \"$0: warning : you may need combine ali or lat first !\" && exit 1\nfi\n\nif [ ! 
-f $src_dir/trans.1 ] ; then\n    echo \"$0: no trans exist in $src_dir dir\"  && exit 1\nfi\n\n\nnj=$(cat $dir/num_jobs)\nrm -f $dir/trans* 2>/dev/null\n\n# Copy the fmllr trans temporarily\necho \"creating temporary trans in $dir\"\n$cmd  JOB=1:$nj $dir/log/copy_trans_temp.JOB.log \\\n  copy-matrix --binary=$write_binary \\\n  \"ark:cat $src_dir/trans.JOB |\" \\\n  ark,scp:$dir/trans_tmp.JOB.ark,$dir/trans_tmp.JOB.scp || exit 1\n\n# Make copies of utterances for perturbed data\nfor p in $prefixes; do\n  cat $dir/trans_tmp.*.scp | awk -v p=$p '{print p\"-\"$0}'\ndone | sort -k1,1 > $dir/trans_out.scp.aug\n\nif [ \"$include_original\" == \"true\" ]; then\n  cat $dir/trans_tmp.*.scp | awk '{print $0}' | sort -k1,1 > $dir/trans_out.scp.clean\n  cat $dir/trans_out.scp.clean $dir/trans_out.scp.aug | sort -k1,1 > $dir/trans_out.scp.old\nelse\n  cat $dir/trans_out.scp.aug | sort -k1,1 > $dir/trans_out.scp.old\nfi\n\nutils/filter_scp.pl  ${data}/spk2utt  $dir/trans_out.scp.old  >  $dir/trans_out.scp\nutils/split_data.sh ${data} $nj\n\n# Copy and dump the trans for perturbed data\necho Creating fmllr trans for augmented data by copying fmllr trans from clean data\n$cmd  JOB=1:$nj $dir/log/copy_out_trans.JOB.log \\\n  copy-matrix --binary=$write_binary \\\n  \"scp:utils/split_scp.pl  --one-based -j $nj JOB $dir/trans_out.scp |\" \\\n  ark:$dir/trans.JOB || exit 1\n\nn_aug_trans=`wc -l $data/spk2utt`\nn_copy_trans=`wc -l $dir/trans_out.scp`\necho \"copy $n_copy_trans speaker's  fmllr trans of total $n_aug_trans\"\nrm $dir/trans_out.scp.aug  $dir/trans_out.scp.old $dir/trans_out.scp   $dir/trans_tmp.*\nexit 0\n"
  },
  {
    "path": "egs/steps/data/augment_data_dir.py",
    "content": "#!/usr/bin/env python3\n# Copyright 2017  David Snyder\n#           2017  Ye Bai\n#           2019  Phani Sankar Nidadavolu\n# Apache 2.0\n#\n# This script generates augmented data.  It is based on\n# steps/data/reverberate_data_dir.py but doesn't handle reverberation.\n# It is designed to be somewhat simpler and more flexible for augmenting with\n# additive noise.\nfrom __future__ import print_function\nimport sys, random, argparse, os, imp\nsys.path.append(\"steps/data/\")\nsys.path.insert(0, 'steps/')\n\nfrom reverberate_data_dir import parse_file_to_dict\nfrom reverberate_data_dir import write_dict_to_file\nimport libs.common as common_lib\ndata_lib = imp.load_source('dml', 'steps/data/data_dir_manipulation_lib.py')\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"Augment the data directory with additive noises. \"\n        \"Noises are separated into background and foreground noises which are added together or \"\n        \"separately.  Background noises are added to the entire recording, and repeated as necessary \"\n        \"to cover the full length.  Multiple overlapping background noises can be added, to simulate \"\n        \"babble, for example.  Foreground noises are added sequentially, according to a specified \"\n        \"interval.  See also steps/data/reverberate_data_dir.py \"\n        \"Usage: augment_data_dir.py [options...] 
<in-data-dir> <out-data-dir> \"\n        \"E.g., steps/data/augment_data_dir.py --utt-suffix aug --fg-snrs 20:10:5:0 --bg-snrs 20:15:10 \"\n        \"--num-bg-noise 1:2:3 --fg-interval 3 --fg-noise-dir data/musan_noise --bg-noise-dir \"\n        \"data/musan_music data/train data/train_aug\", formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n    parser.add_argument('--fg-snrs', type=str, dest = \"fg_snr_str\", default = '20:10:0',\n                        help='When foreground noises are being added, the script will iterate through these SNRs.')\n    parser.add_argument('--bg-snrs', type=str, dest = \"bg_snr_str\", default = '20:10:0',\n                        help='When background noises are being added, the script will iterate through these SNRs.')\n    parser.add_argument('--num-bg-noises', type=str,\n                        dest = \"num_bg_noises\", default = '1',\n                        help='Number of overlapping background noises that we iterate over.'\n                            ' For example, if the input is \"1:2:3\" then the output wavs will have either '\n                            '1, 2, or 3 randomly chosen background noises overlapping the entire recording')\n    parser.add_argument('--fg-interval', type=int,\n                        dest = \"fg_interval\", default = 0,\n                        help='Number of seconds between the end of one '\n                            'foreground noise and the beginning of the next.')\n    parser.add_argument('--utt-suffix', type=str,\n                        dest = \"utt_suffix\", default = None,\n                        help='Suffix added to utterance IDs.')\n    parser.add_argument('--utt-prefix', type=str,\n                        dest = \"utt_prefix\", default = None,\n                        help='Prefix added to utterance IDs.')\n    parser.add_argument('--random-seed', type=int, dest = \"random_seed\",\n                        default = 123, help='Random seed.')\n    
parser.add_argument(\"--modify-spk-id\", type=str,\n                        dest='modify_spk_id', default=False,\n                        action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"],\n                        help='Utt prefix or suffix would be added to the spk id '\n                            'also (used in ASR), in speaker id it is left unmodifed')\n    parser.add_argument(\"--bg-noise-dir\", type=str, dest=\"bg_noise_dir\",\n                        help=\"Background noise data directory\")\n    parser.add_argument(\"--fg-noise-dir\", type=str, dest=\"fg_noise_dir\",\n                        help=\"Foreground noise data directory\")\n    parser.add_argument(\"input_dir\", help=\"Input data directory\")\n    parser.add_argument(\"output_dir\", help=\"Output data directory\")\n\n    print(' '.join(sys.argv))\n    args = parser.parse_args()\n    args = check_args(args)\n    return args\n\ndef check_args(args):\n    # Check args\n    if args.utt_suffix is None and args.utt_prefix is None:\n        args.utt_modifier_type = None\n        args.utt_modifier = \"\"\n    elif args.utt_suffix is None and args.utt_prefix is not None:\n        args.utt_modifier_type = \"prefix\"\n        args.utt_modifier = args.utt_prefix\n    elif args.utt_suffix is not None and args.utt_prefix is None:\n        args.utt_modifier_type = \"suffix\"\n        args.utt_modifier = args.utt_suffix\n    else:\n        raise Exception(\"Trying to add both prefix and suffix. 
Choose either of them\")\n\n    if not os.path.exists(args.output_dir):\n        os.makedirs(args.output_dir)\n    if not args.fg_interval >= 0:\n        raise Exception(\"--fg-interval must be 0 or greater\")\n    if args.bg_noise_dir is None and args.fg_noise_dir is None:\n        raise Exception(\"Either --fg-noise-dir or --bg-noise-dir must be specified\")\n    return args\n\ndef get_noise_list(noise_wav_scp_filename):\n    noise_wav_scp_file = open(noise_wav_scp_filename, 'r', encoding='utf-8').readlines()\n    noise_wavs = {}\n    noise_utts = []\n    for line in noise_wav_scp_file:\n        toks=line.split(\" \")\n        wav = \" \".join(toks[1:])\n        noise_utts.append(toks[0])\n        noise_wavs[toks[0]] = wav.rstrip()\n    return noise_utts, noise_wavs\n\ndef augment_wav(utt, wav, dur, fg_snr_opts, bg_snr_opts, fg_noise_utts, \\\n    bg_noise_utts, noise_wavs, noise2dur, interval, num_opts):\n    # This section is common to both foreground and background noises\n    new_wav = \"\"\n    dur_str = str(dur)\n    noise_dur = 0\n    tot_noise_dur = 0\n    snrs=[]\n    noises=[]\n    start_times=[]\n\n    # Now handle the background noises\n    if len(bg_noise_utts) > 0:\n        num = random.choice(num_opts)\n        for i in range(0, num):\n            noise_utt = random.choice(bg_noise_utts)\n            noise = \"wav-reverberate --duration=\" \\\n            + dur_str + \" \\\"\" + noise_wavs[noise_utt] + \"\\\" - |\"\n            snr = random.choice(bg_snr_opts)\n            snrs.append(snr)\n            start_times.append(0)\n            noises.append(noise)\n\n    # Now handle the foreground noises\n    if len(fg_noise_utts) > 0:\n        while tot_noise_dur < dur:\n            noise_utt = random.choice(fg_noise_utts)\n            noise = noise_wavs[noise_utt]\n            snr = random.choice(fg_snr_opts)\n            snrs.append(snr)\n            noise_dur = noise2dur[noise_utt]\n            start_times.append(tot_noise_dur)\n            
tot_noise_dur += noise_dur + interval\n            noises.append(noise)\n\n    start_times_str = \"--start-times='\" + \",\".join([str(i) for i in start_times]) + \"'\"\n    snrs_str = \"--snrs='\" + \",\".join([str(i) for i in snrs]) + \"'\"\n    noises_str = \"--additive-signals='\" + \",\".join(noises).strip() + \"'\"\n\n    # If the wav is just a file\n    if wav.strip()[-1] != \"|\":\n        new_wav = \"wav-reverberate --shift-output=true \" + noises_str + \" \" \\\n            + start_times_str + \" \" + snrs_str + \" \" + wav + \" - |\"\n    # Else if the wav is in a pipe\n    else:\n        new_wav = wav + \" wav-reverberate --shift-output=true \" + noises_str + \" \" \\\n            + start_times_str + \" \" + snrs_str + \" - - |\"\n    return new_wav\n\ndef get_new_id(utt, utt_modifier_type, utt_modifier):\n    \"\"\" This function generates a new id from the input id by attaching\n        the modifier as a prefix or suffix, joined with a hyphen.\n        E.g. get_new_id(\"swb0035\", \"prefix\", \"rvb\") returns the string \"rvb-swb0035\"\n    \"\"\"\n    if utt_modifier_type == \"suffix\" and len(utt_modifier) > 0:\n        new_utt = utt + \"-\" + utt_modifier\n    elif utt_modifier_type == \"prefix\" and len(utt_modifier) > 0:\n        new_utt = utt_modifier + \"-\" + utt\n    else:\n        new_utt = utt\n\n    return new_utt\n\ndef copy_file_if_exists(input_file, output_file, utt_modifier_type,\n                        utt_modifier, fields=[0]):\n    if os.path.isfile(input_file):\n        clean_dict = parse_file_to_dict(input_file,\n            value_processor = lambda x: \" \".join(x))\n        new_dict = {}\n        for key in clean_dict.keys():\n            modified_key = get_new_id(key, utt_modifier_type, utt_modifier)\n            if len(fields) > 1:\n                values = clean_dict[key].split(\" \")\n                modified_values = values\n                for idx in range(1, len(fields)):\n                    modified_values[idx-1] 
= get_new_id(values[idx-1],\n                                            utt_modifier_type, utt_modifier)\n                new_dict[modified_key] = \" \".join(modified_values)\n            else:\n                new_dict[modified_key] = clean_dict[key]\n        write_dict_to_file(new_dict, output_file)\n\ndef create_augmented_utt2uniq(input_dir, output_dir,\n                            utt_modifier_type, utt_modifier):\n    clean_utt2spk_file = input_dir + \"/utt2spk\"\n    clean_utt2spk_dict = parse_file_to_dict(clean_utt2spk_file,\n                            value_processor = lambda x: \" \".join(x))\n    augmented_utt2uniq_dict = {}\n    for key in clean_utt2spk_dict.keys():\n        modified_key = get_new_id(key, utt_modifier_type, utt_modifier)\n        augmented_utt2uniq_dict[modified_key] = key\n    write_dict_to_file(augmented_utt2uniq_dict, output_dir + \"/utt2uniq\")\n\ndef main():\n    args = get_args()\n    input_dir = args.input_dir\n    output_dir = args.output_dir\n\n    fg_snrs = [int(i) for i in args.fg_snr_str.split(\":\")]\n    bg_snrs = [int(i) for i in args.bg_snr_str.split(\":\")]\n    num_bg_noises = [int(i) for i in args.num_bg_noises.split(\":\")]\n    reco2dur = parse_file_to_dict(input_dir + \"/reco2dur\",\n        value_processor = lambda x: float(x[0]))\n    wav_scp_file = open(input_dir + \"/wav.scp\", 'r', encoding='utf-8').readlines()\n\n    noise_wavs = {}\n    noise_reco2dur = {}\n    bg_noise_utts = []\n    fg_noise_utts = []\n\n    # Load background noises\n    if args.bg_noise_dir:\n        bg_noise_wav_filename = args.bg_noise_dir + \"/wav.scp\"\n        bg_noise_utts, bg_noise_wavs = get_noise_list(bg_noise_wav_filename)\n        bg_noise_reco2dur = parse_file_to_dict(args.bg_noise_dir + \"/reco2dur\",\n            value_processor = lambda x: float(x[0]))\n        noise_wavs.update(bg_noise_wavs)\n        noise_reco2dur.update(bg_noise_reco2dur)\n\n    # Load foreground noises\n    if args.fg_noise_dir:\n        
fg_noise_wav_filename = args.fg_noise_dir + \"/wav.scp\"\n        fg_noise_reco2dur_filename = args.fg_noise_dir + \"/reco2dur\"\n        fg_noise_utts, fg_noise_wavs = get_noise_list(fg_noise_wav_filename)\n        fg_noise_reco2dur = parse_file_to_dict(args.fg_noise_dir + \"/reco2dur\",\n            value_processor = lambda x: float(x[0]))\n        noise_wavs.update(fg_noise_wavs)\n        noise_reco2dur.update(fg_noise_reco2dur)\n\n    random.seed(args.random_seed)\n    new_utt2wav = {}\n    new_utt2spk = {}\n\n    # Augment each line in the wav file\n    for line in wav_scp_file:\n        toks = line.rstrip().split(\" \")\n        utt = toks[0]\n        wav = \" \".join(toks[1:])\n        dur = reco2dur[utt]\n        new_wav = augment_wav(utt, wav, dur, fg_snrs, bg_snrs, fg_noise_utts,\n            bg_noise_utts, noise_wavs, noise_reco2dur, args.fg_interval,\n            num_bg_noises)\n\n        new_utt = get_new_id(utt, args.utt_modifier_type, args.utt_modifier)\n\n        new_utt2wav[new_utt] = new_wav\n\n    if not os.path.exists(output_dir):\n        os.makedirs(output_dir)\n\n    write_dict_to_file(new_utt2wav, output_dir + \"/wav.scp\")\n    copy_file_if_exists(input_dir + \"/reco2dur\", output_dir + \"/reco2dur\",\n                                args.utt_modifier_type, args.utt_modifier)\n    copy_file_if_exists(input_dir + \"/utt2dur\", output_dir + \"/utt2dur\",\n                                args.utt_modifier_type, args.utt_modifier)\n\n    # Check whether to modify the speaker id or not while creating utt2spk file\n    fields = ([0, 1] if args.modify_spk_id else [0])\n    copy_file_if_exists(input_dir + \"/utt2spk\", output_dir + \"/utt2spk\",\n                        args.utt_modifier_type, args.utt_modifier, fields=fields)\n    copy_file_if_exists(input_dir + \"/utt2lang\", output_dir + \"/utt2lang\",\n                        args.utt_modifier_type, args.utt_modifier)\n    copy_file_if_exists(input_dir + \"/utt2num_frames\", output_dir + 
\"/utt2num_frames\",\n                        args.utt_modifier_type, args.utt_modifier)\n    copy_file_if_exists(input_dir + \"/text\", output_dir + \"/text\", args.utt_modifier_type,\n                        args.utt_modifier)\n    copy_file_if_exists(input_dir + \"/segments\", output_dir + \"/segments\",\n                        args.utt_modifier_type, args.utt_modifier, fields=[0, 1])\n    copy_file_if_exists(input_dir + \"/vad.scp\", output_dir + \"/vad.scp\",\n                        args.utt_modifier_type, args.utt_modifier)\n    copy_file_if_exists(input_dir + \"/reco2file_and_channel\",\n                        output_dir + \"/reco2file_and_channel\",\n                        args.utt_modifier_type, args.utt_modifier, fields=[0, 1])\n\n    if args.modify_spk_id:\n        copy_file_if_exists(input_dir + \"/spk2gender\", output_dir + \"/spk2gender\",\n                        args.utt_modifier_type, args.utt_modifier)\n    else:\n        copy_file_if_exists(input_dir + \"/spk2gender\", output_dir + \"/spk2gender\", None, \"\")\n\n    # Create utt2uniq file\n    if os.path.isfile(input_dir + \"/utt2uniq\"):\n        copy_file_if_exists(input_dir + \"/utt2uniq\", output_dir + \"/utt2uniq\",\n                        args.utt_modifier_type, args.utt_modifier, fields=[0])\n    else:\n        create_augmented_utt2uniq(input_dir, output_dir,\n                        args.utt_modifier_type, args.utt_modifier)\n\n    data_lib.RunKaldiCommand(\"utils/utt2spk_to_spk2utt.pl <{output_dir}/utt2spk >{output_dir}/spk2utt\"\n                    .format(output_dir = output_dir))\n\n    data_lib.RunKaldiCommand(\"utils/fix_data_dir.sh {output_dir}\".format(output_dir = output_dir))\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/data/data_dir_manipulation_lib.py",
    "content": "import subprocess\n\ndef RunKaldiCommand(command, wait = True):\n    \"\"\" Runs commands frequently seen in Kaldi scripts. These are usually a\n        sequence of commands connected by pipes, so we use shell=True \"\"\"\n    #logger.info(\"Running the command\\n{0}\".format(command))\n    p = subprocess.Popen(command, shell = True,\n                         stdout = subprocess.PIPE,\n                         stderr = subprocess.PIPE)\n\n    if wait:\n        [stdout, stderr] = p.communicate()\n        if p.returncode is not 0:\n            raise Exception(\"There was an error while running the command {0}\\n------------\\n{1}\".format(command, stderr))\n        return stdout, stderr\n    else:\n        return p\n"
  },
  {
    "path": "egs/steps/data/make_musan.py",
    "content": "#!/usr/bin/env python3\n# Copyright 2015   David Snyder\n#           2019   Phani Sankar Nidadavolu\n# Apache 2.0.\n#\n# This file is meant to be invoked by make_musan.sh.\n\nimport os, sys, argparse\nsys.path.append(\"steps/data/\")\nsys.path.insert(0, 'steps/')\nimport libs.common as common_lib\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"Create MUSAN corpus\",\n                        formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n    parser.add_argument(\"--use-vocals\", type=str,\n                        dest='use_vocals', default=True,\n                        action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"],\n                        help='use vocals from the music corpus')\n    parser.add_argument('--sampling-rate', type=int, default=16000,\n                        help=\"Sampling rate of the source data. If a positive integer is specified with this option, \"\n                        \"the MUSAN corpus will be resampled to the rate of the source data.\"\n                        \"Original MUSAN corpus is sampled at 16KHz. 
Defaults to 16000 Hz\")\n    parser.add_argument(\"in_dir\", help=\"Input data directory\")\n    parser.add_argument(\"out_dir\", help=\"Output data directory\")\n\n    print(' '.join(sys.argv))\n    args = parser.parse_args()\n    args = check_args(args)\n\n    return args\n\ndef check_args(args):\n    if not os.path.exists(args.in_dir):\n        raise Exception('input dir {0} does not exist'.format(args.in_dir))\n    if not os.path.exists(args.out_dir):\n        print(\"Preparing {0}/musan...\".format(args.out_dir))\n        os.makedirs(args.out_dir)\n\n    return args\n\ndef process_music_annotations(path):\n    utt2spk = {}\n    utt2vocals = {}\n    with open(path, 'r') as f:\n        lines = f.readlines()\n    for line in lines:\n        utt, genres, vocals, musician = line.rstrip().split()[:4]\n        # For this application, the musician ID isn't important\n        utt2spk[utt] = utt\n        utt2vocals[utt] = vocals == \"Y\"\n    return utt2spk, utt2vocals\n\ndef prepare_music(root_dir, use_vocals, sampling_rate):\n    utt2vocals = {}\n    utt2spk = {}\n    utt2wav = {}\n    num_good_files = 0\n    num_bad_files = 0\n    music_dir = os.path.join(root_dir, \"music\")\n    for root, dirs, files in os.walk(music_dir):\n        for file in files:\n            file_path = os.path.join(root, file)\n            if file.endswith(\".wav\"):\n                utt = str(file).replace(\".wav\", \"\")\n                utt2wav[utt] = file_path\n            elif str(file) == \"ANNOTATIONS\":\n                utt2spk_part, utt2vocals_part = process_music_annotations(file_path)\n                utt2spk.update(utt2spk_part)\n                utt2vocals.update(utt2vocals_part)\n\n    utt2spk_str = \"\"\n    utt2wav_str = \"\"\n    for utt in utt2vocals:\n        if utt in utt2wav:\n            if use_vocals or not utt2vocals[utt]:\n                utt2spk_str = utt2spk_str + utt + \" \" + utt2spk[utt] + \"\\n\"\n                if sampling_rate == 16000:\n                    utt2wav_str = utt2wav_str 
+ utt + \" \" + utt2wav[utt] + \"\\n\"\n                else:\n                    utt2wav_str = utt2wav_str + utt + \" sox -t wav \" + utt2wav[utt] + \" -r\" \\\n                                    \" {fs} -t wav - |\\n\".format(fs=sampling_rate)\n            num_good_files += 1\n        else:\n            print(\"Missing file {}\".format(utt))\n            num_bad_files += 1\n    print(\"In music directory, processed {} files; {} had missing wav data\".format(\n                                                    num_good_files, num_bad_files))\n    return utt2spk_str, utt2wav_str\n\n\ndef prepare_speech(root_dir, sampling_rate):\n    utt2spk = {}\n    utt2wav = {}\n    num_good_files = 0\n    num_bad_files = 0\n    speech_dir = os.path.join(root_dir, \"speech\")\n    for root, dirs, files in os.walk(speech_dir):\n        for file in files:\n            file_path = os.path.join(root, file)\n            if file.endswith(\".wav\"):\n                utt = str(file).replace(\".wav\", \"\")\n                utt2wav[utt] = file_path\n                utt2spk[utt] = utt\n\n    utt2spk_str = \"\"\n    utt2wav_str = \"\"\n    for utt in utt2spk:\n        if utt in utt2wav:\n            utt2spk_str = utt2spk_str + utt + \" \" + utt2spk[utt] + \"\\n\"\n            if sampling_rate == 16000:\n                utt2wav_str = utt2wav_str + utt + \" \" + utt2wav[utt] + \"\\n\"\n            else:\n                utt2wav_str = utt2wav_str + utt + \" sox -t wav \" + utt2wav[utt] + \" -r\" \\\n                                    \" {fs} -t wav - |\\n\".format(fs=sampling_rate)\n            num_good_files += 1\n        else:\n            print(\"Missing file {}\".format(utt))\n            num_bad_files += 1\n    print(\"In speech directory, processed {} files; {} had missing wav data\".format(\n                                                    num_good_files, num_bad_files))\n    return utt2spk_str, utt2wav_str\n\n\ndef prepare_noise(root_dir, sampling_rate):\n    utt2spk = {}\n    
utt2wav = {}\n    num_good_files = 0\n    num_bad_files = 0\n    noise_dir = os.path.join(root_dir, \"noise\")\n    for root, dirs, files in os.walk(noise_dir):\n        for file in files:\n            file_path = os.path.join(root, file)\n            if file.endswith(\".wav\"):\n                utt = str(file).replace(\".wav\", \"\")\n                utt2wav[utt] = file_path\n                utt2spk[utt] = utt\n\n    utt2spk_str = \"\"\n    utt2wav_str = \"\"\n    for utt in utt2spk:\n        if utt in utt2wav:\n            utt2spk_str = utt2spk_str + utt + \" \" + utt2spk[utt] + \"\\n\"\n            if sampling_rate == 16000:\n                utt2wav_str = utt2wav_str + utt + \" \" + utt2wav[utt] + \"\\n\"\n            else:\n                utt2wav_str = utt2wav_str + utt + \" sox -t wav \" + utt2wav[utt] + \" -r\" \\\n                                    \" {fs} -t wav - |\\n\".format(fs=sampling_rate)\n            num_good_files += 1\n        else:\n            print(\"Missing file {}\".format(utt))\n            num_bad_files += 1\n    print(\"In noise directory, processed {} files; {} had missing wav data\".format(\n                                    num_good_files, num_bad_files))\n    return utt2spk_str, utt2wav_str\n\n\ndef main():\n    args = get_args()\n    in_dir = args.in_dir\n    out_dir = args.out_dir\n    use_vocals = args.use_vocals\n    sampling_rate = args.sampling_rate\n\n    utt2spk_music, utt2wav_music = prepare_music(in_dir, use_vocals, sampling_rate)\n    utt2spk_speech, utt2wav_speech = prepare_speech(in_dir, sampling_rate)\n    utt2spk_noise, utt2wav_noise = prepare_noise(in_dir, sampling_rate)\n\n    utt2spk = utt2spk_speech + utt2spk_music + utt2spk_noise\n    utt2wav = utt2wav_speech + utt2wav_music + utt2wav_noise\n    # Use context managers so both files are flushed and closed before returning\n    with open(os.path.join(out_dir, \"wav.scp\"), 'w') as wav_fi:\n        wav_fi.write(utt2wav)\n    with open(os.path.join(out_dir, \"utt2spk\"), 'w') as utt2spk_fi:\n        utt2spk_fi.write(utt2spk)\n\n\nif __name__==\"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/data/make_musan.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2015   David Snyder\n#           2019   Phani Sankar Nidadavolu\n# Apache 2.0.\n#\n# This script creates the MUSAN data directory.\n# Consists of babble, music and noise files.\n# Used to create augmented data\n# The required dataset is freely available at http://www.openslr.org/17/\n\n# The corpus can be cited as follows:\n# @misc{musan2015,\n#  author = {David Snyder and Guoguo Chen and Daniel Povey},\n#  title = {{MUSAN}: {A} {M}usic, {S}peech, and {N}oise {C}orpus},\n#  year = {2015},\n#  eprint = {1510.08484},\n#  note = {arXiv:1510.08484v1}\n# }\n\nset -e\nuse_vocals=true\nsampling_rate=16000\nstage=0\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -ne 2 ]; then\n    echo USAGE: $0 input_dir output_dir\n    echo input_dir is the path where the MUSAN corpus is located\n    echo e.g: $0 /export/corpora/JHU/musan data\n    echo \"main options (for others, see top of script file)\"\n    echo \"  --sampling-rate <sampling frequency>        # Sampling frequency of source dir\"\n    echo \"  --use-vocals <true/false>        # Use vocals from music portion of MUSAN corpus\"\n    exit 1;\nfi\n\nin_dir=$1\ndata_dir=$2\n\nmkdir -p local/musan.tmp\n\n# The below script will create the musan corpus\nsteps/data/make_musan.py --use-vocals ${use_vocals} \\\n                        --sampling-rate ${sampling_rate} \\\n                        ${in_dir} ${data_dir}/musan || exit 1;\n\nutils/fix_data_dir.sh ${data_dir}/musan\n\ngrep \"music\" ${data_dir}/musan/utt2spk > local/musan.tmp/utt2spk_music\ngrep \"speech\" ${data_dir}/musan/utt2spk > local/musan.tmp/utt2spk_speech\ngrep \"noise\" ${data_dir}/musan/utt2spk > local/musan.tmp/utt2spk_noise\n\nutils/subset_data_dir.sh --utt-list local/musan.tmp/utt2spk_music \\\n        ${data_dir}/musan ${data_dir}/musan_music\nutils/subset_data_dir.sh --utt-list local/musan.tmp/utt2spk_speech \\\n  
      ${data_dir}/musan ${data_dir}/musan_speech\nutils/subset_data_dir.sh --utt-list local/musan.tmp/utt2spk_noise \\\n        ${data_dir}/musan ${data_dir}/musan_noise\n\nutils/fix_data_dir.sh ${data_dir}/musan_music\nutils/fix_data_dir.sh ${data_dir}/musan_speech\nutils/fix_data_dir.sh ${data_dir}/musan_noise\n\nrm -rf local/musan.tmp\n\nfor name in speech noise music; do\n    utils/data/get_reco2dur.sh ${data_dir}/musan_${name}\ndone\n"
  },
  {
    "path": "egs/steps/data/reverberate_data_dir.py",
    "content": "#!/usr/bin/env python3\n# Copyright 2016  Tom Ko\n#           2018  David Snyder\n#           2019  Phani Sankar Nidadavolu\n# Apache 2.0\n# script to generate reverberated data\n\nimport argparse, shlex, glob, math, os, random, sys, warnings, copy, imp, ast\n\ndata_lib = imp.load_source('dml', 'steps/data/data_dir_manipulation_lib.py')\n\ndef get_args():\n    # we add required arguments as named arguments for readability\n    parser = argparse.ArgumentParser(description=\"Reverberate the data directory with an option \"\n                                                 \"to add isotropic and point source noises. \"\n                                                 \"Usage: reverberate_data_dir.py [options...] <in-data-dir> <out-data-dir> \"\n                                                 \"E.g. reverberate_data_dir.py --rir-set-parameters rir_list \"\n                                                 \"--foreground-snrs 20:10:15:5:0 --background-snrs 20:10:15:5:0 \"\n                                                 \"--noise-list-file noise_list --speech-rvb-probability 1 --num-replications 2 \"\n                                                 \"--random-seed 1 data/train data/train_rvb\",\n                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\n    parser.add_argument(\"--rir-set-parameters\", type=str, action='append', required = True, dest = \"rir_set_para_array\",\n                        help=\"Specifies the parameters of an RIR set. \"\n                        \"Supports the specification of  mixture_weight and rir_list_file_name. The mixture weight is optional. \"\n                        \"The default mixture weight is the probability mass remaining after adding the mixture weights \"\n                        \"of all the RIR lists, uniformly divided among the RIR lists without mixture weights. \"\n                        \"E.g. 
--rir-set-parameters '0.3, rir_list' or 'rir_list' \"\n                        \"the format of the RIR list file is \"\n                        \"--rir-id <string,required> --room-id <string,required> \"\n                        \"--receiver-position-id <string,optional> --source-position-id <string,optional> \"\n                        \"--rt-60 <float,optional> --drr <float, optional> location <rspecifier> \"\n                        \"E.g. --rir-id 00001 --room-id 001 --receiver-position-id 001 --source-position-id 00001 \"\n                        \"--rt60 0.58 --drr -4.885 data/impulses/Room001-00001.wav\")\n    parser.add_argument(\"--noise-set-parameters\", type=str, action='append', default = None, dest = \"noise_set_para_array\",\n                        help=\"Specifies the parameters of an noise set. \"\n                        \"Supports the specification of mixture_weight and noise_list_file_name. The mixture weight is optional. \"\n                        \"The default mixture weight is the probability mass remaining after adding the mixture weights \"\n                        \"of all the noise lists, uniformly divided among the noise lists without mixture weights. \"\n                        \"E.g. --noise-set-parameters '0.3, noise_list' or 'noise_list' \"\n                        \"the format of the noise list file is \"\n                        \"--noise-id <string,required> --noise-type <choices = {isotropic, point source},required> \"\n                        \"--bg-fg-type <choices = {background, foreground}, default=background> \"\n                        \"--room-linkage <str, specifies the room associated with the noise file. Required if isotropic> \"\n                        \"location <rspecifier> \"\n                        \"E.g. 
--noise-id 001 --noise-type isotropic --rir-id 00019 iso_noise.wav\")\n    parser.add_argument(\"--num-replications\", type=int, dest = \"num_replicas\", default = 1,\n                        help=\"Number of replicas to generate for the data\")\n    parser.add_argument('--foreground-snrs', type=str, dest = \"foreground_snr_string\", default = '20:10:0', help='When foreground noises are being added the script will iterate through these SNRs.')\n    parser.add_argument('--background-snrs', type=str, dest = \"background_snr_string\", default = '20:10:0', help='When background noises are being added the script will iterate through these SNRs.')\n    parser.add_argument('--prefix', type=str, default = None, help='This prefix will be modified for each reverberated copy, by adding additional affixes.')\n    parser.add_argument(\"--speech-rvb-probability\", type=float, default = 1.0,\n                        help=\"Probability of reverberating a speech signal, e.g. 0 <= p <= 1\")\n    parser.add_argument(\"--pointsource-noise-addition-probability\", type=float, default = 1.0,\n                        help=\"Probability of adding point-source noises, e.g. 0 <= p <= 1\")\n    parser.add_argument(\"--isotropic-noise-addition-probability\", type=float, default = 1.0,\n                        help=\"Probability of adding isotropic noises, e.g. 0 <= p <= 1\")\n    parser.add_argument(\"--rir-smoothing-weight\", type=float, default = 0.3,\n                        help=\"Smoothing weight for the RIR probabilities, e.g. 0 <= p <= 1. If p = 0, no smoothing will be done. \"\n                        \"The RIR distribution will be mixed with a uniform distribution according to the smoothing weight\")\n    parser.add_argument(\"--noise-smoothing-weight\", type=float, default = 0.3,\n                        help=\"Smoothing weight for the noise probabilities, e.g. 0 <= p <= 1. If p = 0, no smoothing will be done. 
\"\n                        \"The noise distribution will be mixed with a uniform distribution according to the smoothing weight\")\n    parser.add_argument(\"--max-noises-per-minute\", type=int, default = 2,\n                        help=\"This controls the maximum number of point-source noises that could be added to a recording according to its duration\")\n    parser.add_argument('--random-seed', type=int, default=0, help='seed to be used in the randomization of impulses and noises')\n    parser.add_argument(\"--shift-output\", type=str, help=\"If true, the reverberated waveform will be shifted by the amount of the peak position of the RIR\",\n                         choices=['true', 'false'], default = \"true\")\n    parser.add_argument('--source-sampling-rate', type=int, default=None,\n                        help=\"Sampling rate of the source data. If a positive integer is specified with this option, \"\n                        \"the RIRs/noises will be resampled to the rate of the source data.\")\n    parser.add_argument(\"--include-original-data\", type=str, help=\"If true, the output data includes one copy of the original data\",\n                         choices=['true', 'false'], default = \"false\")\n    parser.add_argument(\"input_dir\",\n                        help=\"Input data directory\")\n    parser.add_argument(\"output_dir\",\n                        help=\"Output data directory\")\n\n    print(' '.join(sys.argv))\n\n    args = parser.parse_args()\n    args = check_args(args)\n\n    return args\n\ndef check_args(args):\n    if args.prefix is None:\n        if args.num_replicas > 1 or args.include_original_data == \"true\":\n            args.prefix = \"rvb\"\n            warnings.warn(\"--prefix is set to 'rvb' as more than one copy of data is generated\")\n\n    if not args.num_replicas > 0:\n        raise Exception(\"--num-replications cannot be non-positive\")\n\n    if args.speech_rvb_probability < 0 or args.speech_rvb_probability > 1:\n     
   raise Exception(\"--speech-rvb-probability must be between 0 and 1\")\n\n    if args.pointsource_noise_addition_probability < 0 or args.pointsource_noise_addition_probability > 1:\n        raise Exception(\"--pointsource-noise-addition-probability must be between 0 and 1\")\n\n    if args.isotropic_noise_addition_probability < 0 or args.isotropic_noise_addition_probability > 1:\n        raise Exception(\"--isotropic-noise-addition-probability must be between 0 and 1\")\n\n    if args.rir_smoothing_weight < 0 or args.rir_smoothing_weight > 1:\n        raise Exception(\"--rir-smoothing-weight must be between 0 and 1\")\n\n    if args.noise_smoothing_weight < 0 or args.noise_smoothing_weight > 1:\n        raise Exception(\"--noise-smoothing-weight must be between 0 and 1\")\n\n    if args.max_noises_per_minute < 0:\n        raise Exception(\"--max-noises-per-minute cannot be negative\")\n\n    if args.source_sampling_rate is not None and args.source_sampling_rate <= 0:\n        raise Exception(\"--source-sampling-rate cannot be non-positive\")\n\n    return args\n\n\nclass list_cyclic_iterator(object):\n    def __init__(self, list):\n        self.list_index = 0\n        self.list = list\n        random.shuffle(self.list)\n\n    def __next__(self):\n        item = self.list[self.list_index]\n        self.list_index = (self.list_index + 1) % len(self.list)\n        return item\n\n    next = __next__  # for Python 2\n\ndef pick_item_with_probability(x):\n    \"\"\" This functions picks an item from the collection according to the associated\n        probability distribution. The probability estimate of each item in the collection\n        is stored in the \"probability\" field of the particular item. 
x : a\n        collection (list or dictionary) where the values contain a field called probability\n    \"\"\"\n    if isinstance(x, dict):\n        keylist = list(x.keys())\n        keylist.sort()\n        random.shuffle(keylist)\n        plist = [x[k] for k in keylist]\n    else:\n        plist = x\n    total_p = sum(item.probability for item in plist)\n    p = random.uniform(0, total_p)\n    accumulate_p = 0\n    for item in plist:\n        if accumulate_p + item.probability >= p:\n            return item\n        accumulate_p += item.probability\n    assert False, \"Shouldn't get here as the accumulated probability should always equal to 1\"\n\n\ndef parse_file_to_dict(file, assert2fields = False, value_processor = None):\n    \"\"\" This function parses a file and packs the data into a dictionary\n        It is useful for parsing files like wav.scp, utt2spk, text...etc\n    \"\"\"\n    if value_processor is None:\n        value_processor = lambda x: x[0]\n    dict = {}\n    for line in open(file, 'r', encoding='utf-8'):\n        parts = line.split()\n        if assert2fields:\n            assert(len(parts) == 2)\n\n        dict[parts[0]] = value_processor(parts[1:])\n    return dict\n\ndef write_dict_to_file(dict, file_name):\n    \"\"\" This function creates a file and writes the content of a dictionary into it\n    \"\"\"\n    file = open(file_name, 'w', encoding='utf-8')\n    keys = sorted(dict.keys())\n    for key in keys:\n        value = dict[key]\n        if type(value) in [list, tuple] :\n            if type(value) is tuple:\n                value = list(value)\n            value = sorted(value)\n            # join the elements, not the characters of the list's repr\n            value = ' '.join(str(v) for v in value)\n        file.write('{0} {1}\\n'.format(key, value))\n    file.close()\n\n\ndef create_corrupted_utt2uniq(input_dir, output_dir, num_replicas, include_original, prefix):\n    \"\"\"This function creates the utt2uniq file from the utterance id in utt2spk file\n    \"\"\"\n    corrupted_utt2uniq = {}\n    # Parse the utt2spk 
to get the utterance id\n    utt2spk = parse_file_to_dict(input_dir + \"/utt2spk\", value_processor = lambda x: \" \".join(x))\n    keys = sorted(utt2spk.keys())\n    if include_original:\n        start_index = 0\n    else:\n        start_index = 1\n\n    for i in range(start_index, num_replicas+1):\n        for utt_id in keys:\n            new_utt_id = get_new_id(utt_id, prefix, i)\n            corrupted_utt2uniq[new_utt_id] = utt_id\n\n    write_dict_to_file(corrupted_utt2uniq, output_dir + \"/utt2uniq\")\n\n\ndef add_point_source_noise(noise_addition_descriptor,  # descriptor to store the information of the noise added\n                        room,  # the room selected\n                        pointsource_noise_list, # the point source noise list\n                        pointsource_noise_addition_probability, # Probability of adding point-source noises\n                        foreground_snrs, # the SNR for adding the foreground noises\n                        background_snrs, # the SNR for adding the background noises\n                        speech_dur,  # duration of the recording\n                        max_noises_recording  # Maximum number of point-source noises that can be added\n                        ):\n    if len(pointsource_noise_list) > 0 and random.random() < pointsource_noise_addition_probability and max_noises_recording >= 1:\n        for k in range(random.randint(1, max_noises_recording)):\n            # pick the RIR to reverberate the point-source noise\n            noise = pick_item_with_probability(pointsource_noise_list)\n            noise_rir = pick_item_with_probability(room.rir_list)\n            # If it is a background noise, the noise will be extended and be added to the whole speech\n            # if it is a foreground noise, the noise will not extended and be added at a random time of the speech\n            if noise.bg_fg_type == \"background\":\n                noise_rvb_command = \"\"\"wav-reverberate --impulse-response=\"{0}\" 
--duration={1}\"\"\".format(noise_rir.rir_rspecifier, speech_dur)\n                noise_addition_descriptor['start_times'].append(0)\n                noise_addition_descriptor['snrs'].append(next(background_snrs))\n            else:\n                noise_rvb_command = \"\"\"wav-reverberate --impulse-response=\"{0}\" \"\"\".format(noise_rir.rir_rspecifier)\n                noise_addition_descriptor['start_times'].append(round(random.random() * speech_dur, 2))\n                noise_addition_descriptor['snrs'].append(next(foreground_snrs))\n\n            # check if the rspecifier is a pipe or not\n            if len(noise.noise_rspecifier.split()) == 1:\n                noise_addition_descriptor['noise_io'].append(\"{1} {0} - |\".format(noise.noise_rspecifier, noise_rvb_command))\n            else:\n                noise_addition_descriptor['noise_io'].append(\"{0} {1} - - |\".format(noise.noise_rspecifier, noise_rvb_command))\n\n    return noise_addition_descriptor\n\n\ndef generate_reverberation_opts(room_dict,  # the room dictionary, please refer to make_room_dict() for the format\n                              pointsource_noise_list, # the point source noise list\n                              iso_noise_dict, # the isotropic noise dictionary\n                              foreground_snrs, # the SNR for adding the foreground noises\n                              background_snrs, # the SNR for adding the background noises\n                              speech_rvb_probability, # Probability of reverberating a speech signal\n                              isotropic_noise_addition_probability, # Probability of adding isotropic noises\n                              pointsource_noise_addition_probability, # Probability of adding point-source noises\n                              speech_dur,  # duration of the recording\n                              max_noises_recording  # Maximum number of point-source noises that can be added\n                              ):\n    
\"\"\" This function randomly decides whether to reverberate, and sample a RIR if it does\n        It also decides whether to add the appropriate noises\n        This function return the string of options to the binary wav-reverberate\n    \"\"\"\n    reverberate_opts = \"\"\n    noise_addition_descriptor = {'noise_io': [],\n                                 'start_times': [],\n                                 'snrs': []}\n    # Randomly select the room\n    # Here the room probability is a sum of the probabilities of the RIRs recorded in the room.\n    room = pick_item_with_probability(room_dict)\n    # Randomly select the RIR in the room\n    speech_rir = pick_item_with_probability(room.rir_list)\n    if random.random() < speech_rvb_probability:\n        # pick the RIR to reverberate the speech\n        reverberate_opts += \"\"\"--impulse-response=\"{0}\" \"\"\".format(speech_rir.rir_rspecifier)\n\n    rir_iso_noise_list = []\n    if speech_rir.room_id in iso_noise_dict:\n        rir_iso_noise_list = iso_noise_dict[speech_rir.room_id]\n    # Add the corresponding isotropic noise associated with the selected RIR\n    if len(rir_iso_noise_list) > 0 and random.random() < isotropic_noise_addition_probability:\n        isotropic_noise = pick_item_with_probability(rir_iso_noise_list)\n        # extend the isotropic noise to the length of the speech waveform\n        # check if the rspecifier is a pipe or not\n        if len(isotropic_noise.noise_rspecifier.split()) == 1:\n            noise_addition_descriptor['noise_io'].append(\"wav-reverberate --duration={1} {0} - |\".format(isotropic_noise.noise_rspecifier, speech_dur))\n        else:\n            noise_addition_descriptor['noise_io'].append(\"{0} wav-reverberate --duration={1} - - |\".format(isotropic_noise.noise_rspecifier, speech_dur))\n        noise_addition_descriptor['start_times'].append(0)\n        noise_addition_descriptor['snrs'].append(next(background_snrs))\n\n    noise_addition_descriptor = 
add_point_source_noise(noise_addition_descriptor,  # descriptor to store the information of the noise added\n                                                    room,  # the room selected\n                                                    pointsource_noise_list, # the point source noise list\n                                                    pointsource_noise_addition_probability, # Probability of adding point-source noises\n                                                    foreground_snrs, # the SNR for adding the foreground noises\n                                                    background_snrs, # the SNR for adding the background noises\n                                                    speech_dur,  # duration of the recording\n                                                    max_noises_recording  # Maximum number of point-source noises that can be added\n                                                    )\n\n    assert len(noise_addition_descriptor['noise_io']) == len(noise_addition_descriptor['start_times'])\n    assert len(noise_addition_descriptor['noise_io']) == len(noise_addition_descriptor['snrs'])\n    if len(noise_addition_descriptor['noise_io']) > 0:\n        reverberate_opts += \"--additive-signals='{0}' \".format(','.join(noise_addition_descriptor['noise_io']))\n        reverberate_opts += \"--start-times='{0}' \".format(','.join([str(x) for x in noise_addition_descriptor['start_times']]))\n        reverberate_opts += \"--snrs='{0}' \".format(','.join([str(x) for x in noise_addition_descriptor['snrs']]))\n\n    return reverberate_opts\n\ndef get_new_id(id, prefix=None, copy=0):\n    \"\"\" This function generates a new id from the input id\n        This is needed when we have to create multiple copies of the original data\n        E.g. 
get_new_id(\"swb0035\", prefix=\"rvb\", copy=1) returns a string \"rvb1-swb0035\"\n    \"\"\"\n    if prefix is not None:\n        new_id = prefix + str(copy) + \"-\" + id\n    else:\n        new_id = id\n\n    return new_id\n\n\ndef generate_reverberated_wav_scp(wav_scp,  # a dictionary whose values are the Kaldi-IO strings of the speech recordings\n                               durations, # a dictionary whose values are the duration (in sec) of the speech recordings\n                               output_dir, # output directory to write the corrupted wav.scp\n                               room_dict,  # the room dictionary, please refer to make_room_dict() for the format\n                               pointsource_noise_list, # the point source noise list\n                               iso_noise_dict, # the isotropic noise dictionary\n                               foreground_snr_array, # the SNR for adding the foreground noises\n                               background_snr_array, # the SNR for adding the background noises\n                               num_replicas, # Number of replicate to generated for the data\n                               include_original, # include a copy of the original data\n                               prefix, # prefix for the id of the corrupted utterances\n                               speech_rvb_probability, # Probability of reverberating a speech signal\n                               shift_output, # option whether to shift the output waveform\n                               isotropic_noise_addition_probability, # Probability of adding isotropic noises\n                               pointsource_noise_addition_probability, # Probability of adding point-source noises\n                               max_noises_per_minute # maximum number of point-source noises that can be added to a recording according to its duration\n                               ):\n    \"\"\" This is the main function to generate pipeline command for the 
corruption\n        The generic command of wav-reverberate will be like:\n        wav-reverberate --duration=t --impulse-response=rir.wav\n        --additive-signals='noise1.wav,noise2.wav' --snrs='snr1,snr2' --start-times='s1,s2' input.wav output.wav\n    \"\"\"\n    foreground_snrs = list_cyclic_iterator(foreground_snr_array)\n    background_snrs = list_cyclic_iterator(background_snr_array)\n    corrupted_wav_scp = {}\n    keys = sorted(wav_scp.keys())\n    if include_original:\n        start_index = 0\n    else:\n        start_index = 1\n\n    for i in range(start_index, num_replicas+1):\n        for recording_id in keys:\n            wav_original_pipe = wav_scp[recording_id]\n            # check if it is really a pipe\n            if len(wav_original_pipe.split()) == 1:\n                wav_original_pipe = \"cat {0} |\".format(wav_original_pipe)\n            speech_dur = durations[recording_id]\n            max_noises_recording = math.floor(max_noises_per_minute * speech_dur / 60)\n\n            reverberate_opts = generate_reverberation_opts(room_dict,  # the room dictionary, please refer to make_room_dict() for the format\n                                                         pointsource_noise_list, # the point source noise list\n                                                         iso_noise_dict, # the isotropic noise dictionary\n                                                         foreground_snrs, # the SNR for adding the foreground noises\n                                                         background_snrs, # the SNR for adding the background noises\n                                                         speech_rvb_probability, # Probability of reverberating a speech signal\n                                                         isotropic_noise_addition_probability, # Probability of adding isotropic noises\n                                                         pointsource_noise_addition_probability, # Probability of adding 
point-source noises\n                                                         speech_dur,  # duration of the recording\n                                                         max_noises_recording  # Maximum number of point-source noises that can be added\n                                                         )\n\n            # prefix using index 0 is reserved for original data e.g. rvb0_swb0035 corresponds to the swb0035 recording in original data\n            if reverberate_opts == \"\" or i == 0:\n                wav_corrupted_pipe = \"{0}\".format(wav_original_pipe)\n            else:\n                wav_corrupted_pipe = \"{0} wav-reverberate --shift-output={1} {2} - - |\".format(wav_original_pipe, shift_output, reverberate_opts)\n\n            new_recording_id = get_new_id(recording_id, prefix, i)\n            corrupted_wav_scp[new_recording_id] = wav_corrupted_pipe\n\n    write_dict_to_file(corrupted_wav_scp, output_dir + \"/wav.scp\")\n\n\ndef add_prefix_to_fields(input_file, output_file, num_replicas, include_original, prefix, field = [0]):\n    \"\"\" This function replicate the entries in files like segments, utt2spk, text\n    \"\"\"\n    list = [x.strip() for x in open(input_file, encoding='utf-8')]\n    f = open(output_file, \"w\", encoding='utf-8')\n    if include_original:\n        start_index = 0\n    else:\n        start_index = 1\n\n    for i in range(start_index, num_replicas+1):\n        for line in list:\n            if len(line) > 0 and line[0] != ';':\n                split1 = line.split()\n                for j in field:\n                    split1[j] = get_new_id(split1[j], prefix, i)\n                print(\" \".join(split1), file=f)\n            else:\n                print(line, file=f)\n    f.close()\n\n\ndef create_reverberated_copy(input_dir,\n                           output_dir,\n                           room_dict,  # the room dictionary, please refer to make_room_dict() for the format\n                           
pointsource_noise_list, # the point source noise list\n                           iso_noise_dict, # the isotropic noise dictionary\n                           foreground_snr_string, # the SNR for adding the foreground noises\n                           background_snr_string, # the SNR for adding the background noises\n                           num_replicas, # Number of replicate to generated for the data\n                           include_original, # include a copy of the original data\n                           prefix, # prefix for the id of the corrupted utterances\n                           speech_rvb_probability, # Probability of reverberating a speech signal\n                           shift_output, # option whether to shift the output waveform\n                           isotropic_noise_addition_probability, # Probability of adding isotropic noises\n                           pointsource_noise_addition_probability, # Probability of adding point-source noises\n                           max_noises_per_minute  # maximum number of point-source noises that can be added to a recording according to its duration\n                           ):\n    \"\"\" This function creates multiple copies of the necessary files,\n        e.g. 
utt2spk, wav.scp ...\n    \"\"\"\n    if not os.path.exists(output_dir):\n        os.makedirs(output_dir)\n    wav_scp = parse_file_to_dict(input_dir + \"/wav.scp\", value_processor = lambda x: \" \".join(x))\n    if not os.path.isfile(input_dir + \"/reco2dur\"):\n        print(\"Getting the duration of the recordings...\");\n        data_lib.RunKaldiCommand(\"utils/data/get_reco2dur.sh {}\".format(input_dir))\n    durations = parse_file_to_dict(input_dir + \"/reco2dur\", value_processor = lambda x: float(x[0]))\n    foreground_snr_array = [float(x) for x in foreground_snr_string.split(':')]\n    background_snr_array = [float(x) for x in background_snr_string.split(':')]\n\n    generate_reverberated_wav_scp(wav_scp, durations, output_dir, room_dict, pointsource_noise_list, iso_noise_dict,\n               foreground_snr_array, background_snr_array, num_replicas, include_original, prefix,\n               speech_rvb_probability, shift_output, isotropic_noise_addition_probability,\n               pointsource_noise_addition_probability, max_noises_per_minute)\n\n    add_prefix_to_fields(input_dir + \"/utt2spk\", output_dir + \"/utt2spk\", num_replicas, include_original, prefix, field = [0,1])\n    data_lib.RunKaldiCommand(\"utils/utt2spk_to_spk2utt.pl <{output_dir}/utt2spk >{output_dir}/spk2utt\"\n                    .format(output_dir = output_dir))\n\n    if os.path.isfile(input_dir + \"/utt2uniq\"):\n        add_prefix_to_fields(input_dir + \"/utt2uniq\", output_dir + \"/utt2uniq\", num_replicas, include_original, prefix, field =[0])\n    else:\n        # Create the utt2uniq file\n        create_corrupted_utt2uniq(input_dir, output_dir, num_replicas, include_original, prefix)\n\n    if os.path.isfile(input_dir + \"/text\"):\n        add_prefix_to_fields(input_dir + \"/text\", output_dir + \"/text\", num_replicas, include_original, prefix, field =[0])\n    if os.path.isfile(input_dir + \"/segments\"):\n        add_prefix_to_fields(input_dir + \"/segments\", output_dir 
+ \"/segments\", num_replicas, include_original, prefix, field = [0,1])\n    if os.path.isfile(input_dir + \"/reco2file_and_channel\"):\n        add_prefix_to_fields(input_dir + \"/reco2file_and_channel\", output_dir + \"/reco2file_and_channel\", num_replicas, include_original, prefix, field = [0,1])\n    if os.path.isfile(input_dir + \"/vad.scp\"):\n        add_prefix_to_fields(input_dir + \"/vad.scp\", output_dir + \"/vad.scp\", num_replicas, include_original, prefix, field=[0])\n\n    data_lib.RunKaldiCommand(\"utils/validate_data_dir.sh --no-feats --no-text {output_dir}\"\n                    .format(output_dir = output_dir))\n\n\ndef smooth_probability_distribution(set_list, smoothing_weight=0.0, target_sum=1.0):\n    \"\"\" This function smooths the probability distribution in the list\n    \"\"\"\n    if len(list(set_list)) > 0:\n      num_unspecified = 0\n      accumulated_prob = 0\n      for item in set_list:\n          if item.probability is None:\n              num_unspecified += 1\n          else:\n              accumulated_prob += item.probability\n\n      # Compute the probability for the items without specifying their probability\n      uniform_probability = 0\n      if num_unspecified > 0 and accumulated_prob < 1:\n          uniform_probability = (1 - accumulated_prob) / float(num_unspecified)\n      elif num_unspecified > 0 and accumulated_prob >= 1:\n          warnings.warn(\"The sum of probabilities specified by user is larger than or equal to 1. 
\"\n                        \"The items without probabilities specified will be given zero to their probabilities.\")\n\n      for item in set_list:\n          if item.probability is None:\n              item.probability = uniform_probability\n          else:\n              # smooth the probability\n              item.probability = (1 - smoothing_weight) * item.probability + smoothing_weight * uniform_probability\n\n      # Normalize the probability\n      sum_p = sum(item.probability for item in set_list)\n      for item in set_list:\n          item.probability = item.probability / sum_p * target_sum\n\n    return set_list\n\n\ndef parse_set_parameter_strings(set_para_array):\n    \"\"\" This function parse the array of rir set parameter strings.\n        It will assign probabilities to those rir sets which don't have a probability\n        It will also check the existence of the rir list files.\n    \"\"\"\n    set_list = []\n    for set_para in set_para_array:\n        set = lambda: None\n        setattr(set, \"filename\", None)\n        setattr(set, \"probability\", None)\n        parts = set_para.split(',')\n        if len(parts) == 2:\n            set.probability = float(parts[0])\n            set.filename = parts[1].strip()\n        else:\n            set.filename = parts[0].strip()\n        if not os.path.isfile(set.filename):\n            raise Exception(set.filename + \" not found\")\n        set_list.append(set)\n\n    return smooth_probability_distribution(set_list)\n\n\ndef parse_rir_list(rir_set_para_array, smoothing_weight, sampling_rate = None):\n    \"\"\" This function creates the RIR list\n        Each rir object in the list contains the following attributes:\n        rir_id, room_id, receiver_position_id, source_position_id, rt60, drr, probability\n        Please refer to the help messages in the parser for the meaning of these attributes\n    \"\"\"\n    rir_parser = argparse.ArgumentParser()\n    rir_parser.add_argument('--rir-id', type=str, 
required=True, help='This id is unique for each RIR and the noise may associate with a particular RIR by referring to this id')\n    rir_parser.add_argument('--room-id', type=str, required=True, help='This is the room where the RIR is generated')\n    rir_parser.add_argument('--receiver-position-id', type=str, default=None, help='receiver position id')\n    rir_parser.add_argument('--source-position-id', type=str, default=None, help='source position id')\n    rir_parser.add_argument('--rt60', type=float, default=None, help='RT60 is the time required for reflections of a direct sound to decay 60 dB.')\n    rir_parser.add_argument('--drr', type=float, default=None, help='Direct-to-reverberant-ratio of the impulse response.')\n    rir_parser.add_argument('--cte', type=float, default=None, help='Early-to-late index of the impulse response.')\n    rir_parser.add_argument('--probability', type=float, default=None, help='probability of the impulse response.')\n    rir_parser.add_argument('rir_rspecifier', type=str, help=\"\"\"rir rspecifier, it can be either a filename or a piped command.\n                            E.g. 
data/impulses/Room001-00001.wav or \"sox data/impulses/Room001-00001.wav -t wav - |\" \"\"\")\n\n    set_list = parse_set_parameter_strings(rir_set_para_array)\n\n    rir_list = []\n    for rir_set in set_list:\n        current_rir_list = [rir_parser.parse_args(shlex.split(x.strip())) for x in open(rir_set.filename)]\n        for rir in current_rir_list:\n            if sampling_rate is not None:\n                # check if the rspecifier is a pipe or not\n                if len(rir.rir_rspecifier.split()) == 1:\n                    rir.rir_rspecifier = \"sox {0} -r {1} -t wav - |\".format(rir.rir_rspecifier, sampling_rate)\n                else:\n                    rir.rir_rspecifier = \"{0} sox -t wav - -r {1} -t wav - |\".format(rir.rir_rspecifier, sampling_rate)\n\n        rir_list += smooth_probability_distribution(current_rir_list, smoothing_weight, rir_set.probability)\n\n    return rir_list\n\n\ndef almost_equal(value_1, value_2, accuracy = 10**-8):\n    \"\"\" This function checks if the inputs are approximately equal assuming they are floats.\n    \"\"\"\n    return abs(value_1 - value_2) < accuracy\n\n\ndef make_room_dict(rir_list):\n    \"\"\" This function converts a list of RIRs into a dictionary of RIRs indexed by the room-id.\n        Its values are objects with two attributes: a local RIR list\n        and the probability of the corresponding room\n        Please look at the comments at parse_rir_list() for the attributes that a RIR object contains\n    \"\"\"\n    room_dict = {}\n    for rir in rir_list:\n        if rir.room_id not in room_dict:\n            # add new room\n            room_dict[rir.room_id] = lambda: None\n            setattr(room_dict[rir.room_id], \"rir_list\", [])\n            setattr(room_dict[rir.room_id], \"probability\", 0)\n        room_dict[rir.room_id].rir_list.append(rir)\n\n    # the probability of the room is the sum of probabilities of its RIR\n    for key in room_dict.keys():\n        room_dict[key].probability = 
sum(rir.probability for rir in room_dict[key].rir_list)\n\n    assert almost_equal(sum(room_dict[key].probability for key in room_dict.keys()), 1.0)\n\n    return room_dict\n\ndef parse_noise_list(noise_set_para_array, smoothing_weight, sampling_rate = None):\n    \"\"\" This function creates the point-source noise list\n         and the isotropic noise dictionary from the noise information file\n         The isotropic noise dictionary is indexed by the room\n         and its value is the corresponding isotropic noise list\n         Each noise object in the list contains the following attributes:\n         noise_id, noise_type, bg_fg_type, room_linkage, probability, noise_rspecifier\n         Please refer to the help messages in the parser for the meaning of these attributes\n    \"\"\"\n    noise_parser = argparse.ArgumentParser()\n    noise_parser.add_argument('--noise-id', type=str, required=True, help='noise id')\n    noise_parser.add_argument('--noise-type', type=str, required=True, help='the type of noise; i.e. isotropic or point-source', choices = [\"isotropic\", \"point-source\"])\n    noise_parser.add_argument('--bg-fg-type', type=str, default=\"background\", help='background or foreground noise, for background noises, '\n                              'they will be extended before addition to cover the whole speech; for foreground noise, they will be kept '\n                              'to their original duration and added at a random point of the speech.', choices = [\"background\", \"foreground\"])\n    noise_parser.add_argument('--room-linkage', type=str, default=None, help='required if isotropic, should not be specified if point-source.')\n    noise_parser.add_argument('--probability', type=float, default=None, help='probability of the noise.')\n    noise_parser.add_argument('noise_rspecifier', type=str, help=\"\"\"noise rspecifier, it can be either a filename or a piped command.\n                              E.g. 
type5_noise_cirline_ofc_ambient1.wav or \"sox type5_noise_cirline_ofc_ambient1.wav -t wav - |\" \"\"\")\n\n    set_list = parse_set_parameter_strings(noise_set_para_array)\n\n    pointsource_noise_list = []\n    iso_noise_dict = {}\n    for noise_set in set_list:\n        current_noise_list = [noise_parser.parse_args(shlex.split(x.strip())) for x in open(noise_set.filename)]\n        current_pointsource_noise_list = []\n        for noise in current_noise_list:\n            if sampling_rate is not None:\n                # check if the rspecifier is a pipe or not\n                if len(noise.noise_rspecifier.split()) == 1:\n                    noise.noise_rspecifier = \"sox {0} -r {1} -t wav - |\".format(noise.noise_rspecifier, sampling_rate)\n                else:\n                    noise.noise_rspecifier = \"{0} sox -t wav - -r {1} -t wav - |\".format(noise.noise_rspecifier, sampling_rate)\n\n            if noise.noise_type == \"isotropic\":\n                if noise.room_linkage is None:\n                    raise Exception(\"--room-linkage must be specified if --noise-type is isotropic\")\n                else:\n                    if noise.room_linkage not in iso_noise_dict:\n                        iso_noise_dict[noise.room_linkage] = []\n                    iso_noise_dict[noise.room_linkage].append(noise)\n            else:\n                current_pointsource_noise_list.append(noise)\n\n        pointsource_noise_list += smooth_probability_distribution(current_pointsource_noise_list, smoothing_weight, noise_set.probability)\n\n    # ensure the point-source noise probabilities sum to 1\n    pointsource_noise_list = smooth_probability_distribution(pointsource_noise_list, smoothing_weight, 1.0)\n    if len(pointsource_noise_list) > 0:\n        assert almost_equal(sum(noise.probability for noise in pointsource_noise_list), 1.0)\n\n    # ensure the isotropic noise source probabilities for a given room sum to 1\n    for key in iso_noise_dict.keys():\n        
iso_noise_dict[key] = smooth_probability_distribution(iso_noise_dict[key])\n        assert almost_equal(sum(noise.probability for noise in iso_noise_dict[key]), 1.0)\n\n    return (pointsource_noise_list, iso_noise_dict)\n\n\ndef main():\n    args = get_args()\n\n    random.seed(args.random_seed)\n    rir_list = parse_rir_list(args.rir_set_para_array, args.rir_smoothing_weight, args.source_sampling_rate)\n    print(\"Number of RIRs is {0}\".format(len(rir_list)))\n    pointsource_noise_list = []\n    iso_noise_dict = {}\n    if args.noise_set_para_array is not None:\n        pointsource_noise_list, iso_noise_dict = parse_noise_list(args.noise_set_para_array,\n                                                                args.noise_smoothing_weight,\n                                                                args.source_sampling_rate)\n        print(\"Number of point-source noises is {0}\".format(len(pointsource_noise_list)))\n        print(\"Number of isotropic noises is {0}\".format(sum(len(iso_noise_dict[key]) for key in iso_noise_dict.keys())))\n    room_dict = make_room_dict(rir_list)\n\n    if args.include_original_data == \"true\":\n        include_original = True\n    else:\n        include_original = False\n    create_reverberated_copy(input_dir = args.input_dir,\n                           output_dir = args.output_dir,\n                           room_dict = room_dict,\n                           pointsource_noise_list = pointsource_noise_list,\n                           iso_noise_dict = iso_noise_dict,\n                           foreground_snr_string = args.foreground_snr_string,\n                           background_snr_string = args.background_snr_string,\n                           num_replicas = args.num_replicas,\n                           include_original = include_original,\n                           prefix = args.prefix,\n                           speech_rvb_probability = args.speech_rvb_probability,\n                           
shift_output = args.shift_output,\n                           isotropic_noise_addition_probability = args.isotropic_noise_addition_probability,\n                           pointsource_noise_addition_probability = args.pointsource_noise_addition_probability,\n                           max_noises_per_minute = args.max_noises_per_minute)\n\n\n    data_lib.RunKaldiCommand(\"utils/validate_data_dir.sh --no-feats --no-text {output_dir}\"\n                    .format(output_dir = args.output_dir))\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration section.\ntransform_dir=   # this option won't normally be used, but it can be used if you want to\n                 # supply existing fMLLR transforms when decoding.\niter=\nmodel= # You can specify the model to use (e.g. if you want to use the .alimdl)\nstage=0\nnj=4\ncmd=run.pl\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects pruning (scoring is on lattices).\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nscoring_opts=\n# note: there are no more min-lmwt and max-lmwt options, instead use\n# e.g. --scoring-opts \"--min-lmwt 1 --max-lmwt 20\"\nskip_scoring=false\ndecode_extra_opts=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: steps/decode.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the model is.\"\n   echo \"e.g.: steps/decode.sh exp/mono/graph_tgpr data/test_dev93 exp/mono/decode_dev93_tgpr\"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --model <model>                                  # which model to use (e.g. 
to\"\n   echo \"                                                   # specify the final.alimdl)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --transform-dir <trans-dir>                      # dir to find fMLLR transforms \"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"  --num-threads <n>                                # number of threads to use, default 1.\"\n   echo \"  --parallel-opts <opts>                           # ignored now, present for historical reasons.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nif [ -z \"$model\" ]; then # if --model <mdl> was not specified on the command line...\n  if [ -z $iter ]; then model=$srcdir/final.mdl;\n  else model=$srcdir/$iter.mdl; fi\nfi\n\nif [ $(basename $model) != final.alimdl ] ; then\n  # Do not use the $srcpath -- look at the path where the model is\n  if [ -f $(dirname $model)/final.alimdl ] && [ -z \"$transform_dir\" ]; then\n    echo -e '\\n\\n'\n    echo $0 'WARNING: Running speaker independent system decoding using a SAT model!'\n    echo $0 'WARNING: This is OK if you know what you are doing...'\n    echo -e '\\n\\n'\n  fi\nfi\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $model $graphdir/HCLG.fst; do\n  [ ! 
-f $f ] && echo \"$0: Error: no such file $f\" && exit 1;\ndone\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"decode.sh: feature type is $feat_type\";\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"$0: Error: Invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\"\n  [ ! 
-s $transform_dir/num_jobs ] && \\\n    echo \"$0: Error: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    echo \"$0: num-jobs for transforms mismatches, so copying them.\"\n    for n in $(seq $nj_orig); do cat $transform_dir/trans.$n; done | \\\n       copy-feats ark:- ark,scp:$dir/trans.ark,$dir/trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\n  fi\nfi\n\nif [ $stage -le 0 ]; then\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"$0: Error: Mismatch in number of pdfs with $model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt $decode_extra_opts \\\n    $model $graphdir/HCLG.fst \"$feats\" \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  [ ! -z $iter ] && iter_opt=\"--iter $iter\"\n  steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"$0: Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir ||\n    { echo \"$0: Error: scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_basis_fmllr.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012   Carnegie Mellon University (Author: Yajie Miao)\n#                  Johns Hopkins University (Author: Daniel Povey)\n#           2014   David Snyder\n\n# Decoding script that does basis fMLLR.  This can be on top of delta+delta-delta,\n# or LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the \n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:                 \n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nfirst_beam=10.0 # Beam used in initial, speaker-indep. 
pass\nfirst_max_active=2000 # max-active used in initial pass.\nalignment_model=\nadapt_model=\nfinal_model=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in \n              # lattice generation.\n\n# Parameters in alignment of training data\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nalign_beam=10\nretry_beam=40\n\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored, present for historical reasons.\nskip_scoring=false\nscoring_opts=\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: steps/decode_basis_fmllr.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: steps/decode_basis_fmllr.sh exp/tri2b/graph_tgpr data/test_dev93 exp/tri2b/decode_dev93_tgpr\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <final-mdl>                # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # 
default 0.08333 ... used to get posteriors\"\n   echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n   echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n   echo \"  --parallel-opts <opts>                   # ignored, present for historical reasons.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=`echo $3 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data/feats.scp $srcdir/tree $srcdir/fmllr.basis; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n##\n\n## Do the speaker-independent decoding, if --si-dir option not present. 
##\nif [ -z \"$si_dir\" ]; then # we need to do the speaker-independent decoding pass.\n  si_dir=${dir}.si # Name it as our decoding dir, but with suffix \".si\".\n  if [ $stage -le 0 ]; then\n    if [ -f \"$graphdir/num_pdfs\" ]; then\n      [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $alignment_model | grep pdfs | awk '{print $NF}'` ] || \\\n        { echo \"Mismatch in number of pdfs with $alignment_model\"; exit 1; }\n    fi\n\n    steps/decode.sh --scoring-opts \"$scoring_opts\" \\\n              --num-threads $num_threads --skip-scoring $skip_scoring \\\n              --acwt $acwt --nj $nj --cmd \"$cmd\" --beam $first_beam \\\n              --model $alignment_model --max-active \\\n              $first_max_active $graphdir $data $si_dir || exit 1;\n  fi\nfi\n##\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $si_dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ ! -f \"$si_dir/lat.1.gz\" ] && echo \"No such file $si_dir/lat.1.gz\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n## Set up the unadapted features \"$sifeats\" for testing set\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n##\n\n## Now get the first-pass fMLLR transforms.\n## We give all the default parameters in gmm-est-basis-fmllr\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass fMLLR transforms.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass1.JOB.log \\\n    gunzip -c $si_dir/lat.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $alignment_model ark:- ark:- \\| \\\n    gmm-post-to-gpost $alignment_model \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-basis-fmllr-gpost --spk2utt=ark:$sdata/JOB/spk2utt \\\n    --fmllr-min-count=200  --num-iters=10 --size-scale=0.2 \\\n    --step-size-iters=3 --write-weights=ark:$dir/pre_wgt.JOB \\\n     $adapt_model $srcdir/fmllr.basis \"$sifeats\" ark,s,cs:- \\\n    ark:$dir/pre_trans.JOB || exit 1;\nfi\n##\n\npass1feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/pre_trans.JOB ark:- ark:- |\"\n\n## Do the main lattice generation pass.  
Note: we don't determinize the lattices at\n## this stage, as we're going to use them in acoustic rescoring with the larger \n## model, and it's more correct to store the full state-level lattice for this purpose.\nif [ $stage -le 2 ]; then\n  echo \"$0: doing main lattice generation phase\"\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $adapt_model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $adapt_model\"; exit 1; }\n  fi\n  $cmd JOB=1:$nj --num-threads $num_threads $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt  \\\n    --determinize-lattice=false --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model $graphdir/HCLG.fst \"$pass1feats\" \"ark:|gzip -c > $dir/lat.tmp.JOB.gz\" \\\n    || exit 1;\nfi\n##\n\n## Do a second pass of estimating the transform-- this time with the lattices\n## generated from the alignment model.  
Compose the transforms to get\n## $dir/trans.1, etc.\nif [ $stage -le 3 ]; then\n  echo \"$0: estimating fMLLR transforms a second time.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass2.JOB.log \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=4.0 \\\n    \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-basis-fmllr --fmllr-min-count=200 \\\n    --spk2utt=ark:$sdata/JOB/spk2utt --write-weights=ark:$dir/trans_tmp_wgt.JOB \\\n    $adapt_model $srcdir/fmllr.basis \"$pass1feats\" ark,s,cs:- ark:$dir/trans_tmp.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans_tmp.JOB ark:$dir/pre_trans.JOB \\\n    ark:$dir/trans.JOB  || exit 1;\nfi\n##\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# At this point we prune and determinize the lattices and write them out, ready for \n# language model rescoring.\n\nif [ $stage -le 4 ]; then\n  echo \"$0: doing a final pass of acoustic rescoring.\"\n  $cmd JOB=1:$nj $dir/log/acoustic_rescore.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" '&&' rm $dir/lat.tmp.JOB.gz || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"$0: not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. 
(ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nrm $dir/{trans_tmp,pre_trans}.*\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_biglm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration.\nnj=4\ncmd=run.pl\nmaxactive=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333\nskip_scoring=false\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"Usage: steps/decode_si_biglm.sh [options] <graph-dir> <old-LM-fst> <new-LM-fst> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the model is.\"\n   echo \"e.g.: steps/decode_si.sh exp/mono/graph_tgpr data/test_dev93 exp/mono/decode_dev93_tgpr\"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\noldlm_fst=$2\nnewlm_fst=$3\ndata=$4\ndir=$5\n\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $srcdir/final.mdl $graphdir/HCLG.fst $oldlm_fst $newlm_fst; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n# Warn if either LM's words.txt differs from the one used to build the graph.\n[ -f `dirname $oldlm_fst`/words.txt ] && ! cmp `dirname $oldlm_fst`/words.txt $graphdir/words.txt && \\\n  echo \"Warning: old LM words.txt does not match with that in $graphdir .. probably will not work.\";\n[ -f `dirname $newlm_fst`/words.txt ] && ! cmp `dirname $newlm_fst`/words.txt $graphdir/words.txt && \\\n  echo \"Warning: new LM words.txt does not match with that in $graphdir .. probably will not work.\";\n\n# fstproject replaces the disambiguation symbol #0, which only appears on the\n# input side, with the <eps> that appears in the corresponding arcs on the output side.\noldlm_cmd=\"fstproject --project_output=true $oldlm_fst | fstarcsort --sort_type=ilabel |\"\nnewlm_cmd=\"fstproject --project_output=true $newlm_fst | fstarcsort --sort_type=ilabel |\"\n\n$cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n gmm-latgen-biglm-faster --max-active=$maxactive --beam=$beam --lattice-beam=$lattice_beam \\\n   --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n  $srcdir/final.mdl $graphdir/HCLG.fst \"$oldlm_cmd\" \"$newlm_cmd\" \"$feats\" \\\n  \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n\nif ! $skip_scoring ; then\n  [ ! 
-x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_combine.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# Combine two decoding directories by composing the lattices (we\n# apply a weight to each of the original weights, by default 0.5 each).\n# Note, this is not the only combination method, or the most normal combination\n# method.  See also egs/wsj/s5/local/score_combine.sh.\n\n# Begin configuration section.\nweight1=0.5 # Weight on 1st set of lattices.\ncmd=run.pl\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: steps/decode_combine.sh [options] <data> <lang-dir|graph-dir> <decode-dir1> <decode-dir2> <decode-dir-out>\"\n  echo \" e.g.: steps/decode_combine.sh data/lang data/test exp/dir1/decode exp/dir2/decode exp/combine_1_2/decode\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --weight1 <weight>                       # Weight on 1st set of lattices (default 0.5)\"\n  exit 1;\nfi\n\ndata=$1\nlang_or_graphdir=$2\nsrcdir1=$3\nsrcdir2=$4\ndir=$5\n\nfor f in $data/utt2spk $lang_or_graphdir/phones.txt $srcdir1/lat.1.gz $srcdir2/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj1=`cat $srcdir1/num_jobs` || exit 1;\nnj2=`cat $srcdir2/num_jobs` || exit 1;\n[ $nj1 -ne $nj2 ] && echo \"$0: mismatch in number of jobs $nj1 versus $nj2\" && exit 1;\nnj=$nj1\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\n# The lattice-interp command does the score interpolation (with composition),\n# and the lattice-copy-backoff replaces the result with the 1st lattice, in\n# cases where the composed result was empty.\n$cmd JOB=1:$nj $dir/log/interp.JOB.log \\\n  lattice-interp --alpha=$weight1 \"ark:gunzip -c $srcdir1/lat.JOB.gz|\" \\\n   \"ark,s,cs:gunzip -c $srcdir2/lat.JOB.gz|\" ark:- \\| \\\n  lattice-copy-backoff \"ark,s,cs:gunzip -c $srcdir1/lat.JOB.gz|\" ark,s,cs:- \\\n   \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $lang_or_graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_fmllr.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey)\n\n# Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or\n# LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the\n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:\n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nfirst_beam=10.0 # Beam used in initial, speaker-indep. pass\nfirst_max_active=2000 # max-active used in initial pass.\nalignment_model=\nadapt_model=\nfinal_model=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in\n              # lattice generation.\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nfmllr_update_type=full\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nskip_scoring=false\nscoring_opts=\nmax_fmllr_jobs=25  # I've seen the fMLLR jobs overload NFS badly if the decoding\n                   # was started with a lot of many jobs, so we limit the number of\n                   # parallel jobs to 25 by default.  
End configuration section\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Wrong #arguments ($#, expected 3)\"\n   echo \"Usage: steps/decode_fmllr.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: steps/decode_fmllr.sh exp/tri2b/graph_tgpr data/test_dev93 exp/tri2b/decode_dev93_tgpr\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <finald-mdl>               # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... 
used to get posteriors\"\n   echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n   echo \"  --scoring-opts <opts>                    # options to local/score.sh\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=`echo $3 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data/feats.scp $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n##\n\n## Do the speaker-independent decoding, if --si-dir option not present. 
##\nif [ -z \"$si_dir\" ]; then # we need to do the speaker-independent decoding pass.\n  si_dir=${dir}.si # Name it as our decoding dir, but with suffix \".si\".\n  if [ $stage -le 0 ]; then\n    if [ -f \"$graphdir/num_pdfs\" ]; then\n      [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $alignment_model | grep pdfs | awk '{print $NF}'` ] || \\\n        { echo \"Mismatch in number of pdfs with $alignment_model\"; exit 1; }\n    fi\n    steps/decode.sh --scoring-opts \"$scoring_opts\" \\\n           --num-threads $num_threads --skip-scoring $skip_scoring \\\n           --acwt $acwt --nj $nj --cmd \"$cmd\" --beam $first_beam \\\n           --model $alignment_model --max-active \\\n           $first_max_active $graphdir $data $si_dir || exit 1;\n  fi\nfi\n##\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $si_dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ ! -f \"$si_dir/lat.1.gz\" ] && echo \"No such file $si_dir/lat.1.gz\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n## Set up the unadapted features \"$sifeats\"\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n##\n\n## Now get the first-pass fMLLR transforms.\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass fMLLR transforms.\"\n  $cmd --max-jobs-run $max_fmllr_jobs JOB=1:$nj $dir/log/fmllr_pass1.JOB.log \\\n    gunzip -c $si_dir/lat.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $alignment_model ark:- ark:- \\| \\\n    gmm-post-to-gpost $alignment_model \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$sifeats\" ark,s,cs:- \\\n    ark:$dir/pre_trans.JOB || exit 1;\nfi\n##\n\npass1feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/pre_trans.JOB ark:- ark:- |\"\n\n## Do the main lattice generation pass.  
Note: we don't determinize the lattices at\n## this stage, as we're going to use them in acoustic rescoring with the larger\n## model, and it's more correct to store the full state-level lattice for this purpose.\nif [ $stage -le 2 ]; then\n  echo \"$0: doing main lattice generation phase\"\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $adapt_model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $adapt_model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --determinize-lattice=false \\\n    --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model $graphdir/HCLG.fst \"$pass1feats\" \"ark:|gzip -c > $dir/lat.tmp.JOB.gz\" \\\n    || exit 1;\nfi\n##\n\n## Do a second pass of estimating the transform-- this time with the lattices\n## generated from the alignment model.  
Compose the transforms to get\n## $dir/trans.1, etc.\nif [ $stage -le 3 ]; then\n  echo \"$0: estimating fMLLR transforms a second time.\"\n  $cmd --max-jobs-run $max_fmllr_jobs JOB=1:$nj $dir/log/fmllr_pass2.JOB.log \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=4.0 \\\n    \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$pass1feats\" \\\n    ark,s,cs:- ark:$dir/trans_tmp.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans_tmp.JOB ark:$dir/pre_trans.JOB \\\n    ark:$dir/trans.JOB  || exit 1;\nfi\n##\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# At this point we prune and determinize the lattices and write them out, ready for\n# language model rescoring.\n\nif [ $stage -le 4 ]; then\n  echo \"$0: doing a final pass of acoustic rescoring.\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/acoustic_rescore.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" '&&' rm $dir/lat.tmp.JOB.gz || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $graphdir $dir\nfi\n\nif ! $skip_scoring ; then\n  [ ! 
-x local/score.sh ] && \\\n    echo \"$0: not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\nfi\n\nrm $dir/{trans_tmp,pre_trans}.*\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/decode_fmllr_extra.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n\n# Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or\n# LDA+MLLT features.\n# This script does an extra pass of lattice generation over and above what the original\n# script did-- it's for robustness in the case where your original cepstral mean\n# normalization was way off.\n# We also added a new option --distribute=true (by default) to \n# weight-silence-post.  This weights the silence frames in a different way,\n# weighting all posteriors on the frame rather than just the silence ones, which\n# removes a particular kind of bias that the old approach suffered from.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the \n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:                 \n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nfirst_beam=10.0 # Beam used in initial, speaker-indep. 
pass\nfirst_max_active=2000 # max-active used in first two passes.\nfirst_lattice_beam=4.0 # lattice pruning beam for si decode and first-pass fMLLR decode.\n                # the different spelling from lattice_beam is unfortunate; these scripts\n                # have a history.\nalignment_model=\nadapt_model=\nfinal_model=\ncleanup=true\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in \n              # lattice generation.\nmax_active=7000\nmax_mem=50000000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ndistribute=true # option to weight-silence-post.\ncmd=run.pl\nsi_dir=\nfmllr_update_type=full\nskip_scoring=false\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nscoring_opts=\n\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: steps/decode_fmllr.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: steps/decode_fmllr.sh exp/tri2b/graph_tgpr data/test_dev93 exp/tri2b/decode_dev93_tgpr\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <finald-mdl>               # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                             
              # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... used to get posteriors\"\n   echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n   echo \"  --scoring-opts <opts>                    # options to local/score.sh\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=`echo $3 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data/feats.scp $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n##\n\n## Do the speaker-independent decoding, if --si-dir option not present. 
##\nif [ -z \"$si_dir\" ]; then # we need to do the speaker-independent decoding pass.\n  si_dir=${dir}.si # Name it as our decoding dir, but with suffix \".si\".\n  if [ $stage -le 0 ]; then\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $alignment_model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $alignment_model\"; exit 1; }\n  fi\n    steps/decode.sh --acwt $acwt --nj $nj --cmd \"$cmd\" --beam $first_beam --model $alignment_model\\\n      --max-active $first_max_active --num-threads $num_threads\\\n      --skip-scoring true $graphdir $data $si_dir || exit 1;\n  fi\nfi\n##\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $si_dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ ! -f \"$si_dir/lat.1.gz\" ] && echo \"No such file $si_dir/lat.1.gz\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n## Set up the unadapted features \"$sifeats\"\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n##\n\n## Now get the first-pass fMLLR transforms.\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass fMLLR transforms.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass1.JOB.log \\\n    gunzip -c $si_dir/lat.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post --distribute=$distribute $silence_weight $silphonelist $alignment_model ark:- ark:- \\| \\\n    gmm-post-to-gpost $alignment_model \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$sifeats\" ark,s,cs:- \\\n    ark:$dir/trans1.JOB || exit 1;\nfi\n##\n\npass1feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans1.JOB ark:- ark:- |\"\n\n## Do the first adapted lattice generation pass. 
\nif [ $stage -le 2 ]; then\n  echo \"$0: doing first adapted lattice generation phase\"\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $adapt_model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $adapt_model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode1.JOB.log\\\n    gmm-latgen-faster$thread_string --max-active=$first_max_active --max-mem=$max_mem --beam=$first_beam --lattice-beam=$first_lattice_beam \\\n    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model $graphdir/HCLG.fst \"$pass1feats\" \"ark:|gzip -c > $dir/lat1.JOB.gz\" \\\n    || exit 1;\nfi\n\n\n## Do a second pass of estimating the transform.  Compose the transforms to get\n## $dir/trans2.*.\nif [ $stage -le 3 ]; then\n  echo \"$0: estimating fMLLR transforms a second time.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass2.JOB.log \\\n    lattice-to-post --acoustic-scale=$acwt \"ark:gunzip -c $dir/lat1.JOB.gz|\" ark:- \\| \\\n    weight-silence-post --distribute=$distribute $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$pass1feats\" \\\n    ark,s,cs:- ark:$dir/trans1b.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans1b.JOB ark:$dir/trans1.JOB \\\n    ark:$dir/trans2.JOB  || exit 1;\n  if $cleanup; then\n    rm $dir/trans1b.* $dir/trans1.* $dir/lat1.*.gz\n  fi\nfi\n##\n\npass2feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans2.JOB ark:- ark:- |\"\n\n# Generate a 3rd set of lattices, with the \"adaptation model\"; we'll use these\n# to adapt a 3rd time, and we'll rescore them.  
Since we should be close to the final\n# fMLLR, we don't bother dumping un-determinized lattices to disk.\n\n## Do the final lattice generation pass (but we'll rescore these lattices\n## after another stage of adaptation.)\nif [ $stage -le 4 ]; then\n  echo \"$0: doing final lattice generation phase\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode2.JOB.log\\\n    gmm-latgen-faster$thread_string --max-active=$max_active --max-mem=$max_mem --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model $graphdir/HCLG.fst \"$pass2feats\" \"ark:|gzip -c > $dir/lat2.JOB.gz\" \\\n    || exit 1;\nfi\n\n\n## Do a third pass of estimating the transform.  Compose the transforms to get\n## $dir/trans.*.\nif [ $stage -le 5 ]; then\n  echo \"$0: estimating fMLLR transforms a third time.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass3.JOB.log \\\n    lattice-to-post --acoustic-scale=$acwt \"ark:gunzip -c $dir/lat2.JOB.gz|\" ark:- \\| \\\n    weight-silence-post --distribute=$distribute $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$pass2feats\" \\\n    ark,s,cs:- ark:$dir/trans2b.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans2b.JOB ark:$dir/trans2.JOB \\\n    ark:$dir/trans.JOB  || exit 1;\n  if $cleanup; then\n    rm $dir/trans2b.* $dir/trans2.*\n  fi\nfi\n##\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 6 ]; then\n  echo \"$0: doing a final pass of acoustic rescoring.\"\n  $cmd JOB=1:$nj $dir/log/acoustic_rescore.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat2.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n  if $cleanup; then\n    rm $dir/lat2.*.gz\n  fi\nfi\n\nif ! $skip_scoring ; then\n  [ ! 
-x local/score.sh ] && \\\n    echo \"$0: not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_fmmi.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# Decoding of fMMI or fMPE models (feature-space discriminative training).\n# If transform-dir supplied, expects e.g. fMLLR transforms in that dir.\n\n# Begin configuration section.  \nstage=1\niter=final\nnj=4\ncmd=run.pl\nmaxactive=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects pruning (scoring is on lattices).\nngselect=2; # Just use the 2 top Gaussians for fMMI/fMPE.  Should match train.\ntransform_dir=\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nscoring_opts=\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: steps/decode_fmmi.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the model is.\"\n   echo \"e.g.: steps/decode_fmmi.sh exp/mono/graph_tgpr data/test_dev93 exp/mono/decode_dev93_tgpr\"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"You can also use fMLLR features-- you have to supply --transform-dir option.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --acwt <float>                                   # acoustic scale used for 
lattice generation \"\n   echo \"  --transform-dir <transform-dir>                  # where to find fMLLR transforms.\"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"                                                   # speaker-adapted decoding\"\n   echo \"  --num-threads <n>                                # number of threads to use, default 1.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\" \n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nmodel=$srcdir/$iter.mdl\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $model $graphdir/HCLG.fst; do\n  [ ! -f $f ] && echo \"decode_fmmi.sh: no such file $f\" && exit 1;\ndone\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"decode_fmmi.sh: feature type is $feat_type\";\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\"\n  [ \"`cat $transform_dir/num_jobs`\" -ne $nj ] && \\\n     echo \"Mismatch in number of jobs with $transform_dir\";\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nfi\n\nfmpefeats=\"$feats fmpe-apply-transform $srcdir/$iter.fmpe ark:- 'ark,s,cs:gunzip -c $dir/gselect.JOB.gz|' ark:- |\" \n\nif [ $stage -le 1 ]; then\n  # Get Gaussian selection info.\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$ngselect $srcdir/$iter.fmpe \"$feats\" \\\n    \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n  \nif [ $stage -le 2 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$maxactive --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $model $graphdir/HCLG.fst \"$fmpefeats\" \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    \n    local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir || \n      { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\n  fi\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_fromlats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Decode, limited to the word-sequences that were present in a set\n# of lattices on disk.  The other lattices do not have to be built\n# with the same tree or the same context size-- however, you do\n# have to be using the same vocabulary (words.txt)-- if not you'd\n# have to map the vocabulary somehow.\n\n# Note: if the trees are identical, you can use gmm-rescore-lattice.\n\n# Mechanism: create an unweighted acceptor (on words) for each utterance,\n# compose that with G, determinize, and then use compile-train-graphs-fsts\n# to compile a graph for each utterance, to decode with.  \n\n# Begin configuration.\ncmd=run.pl\nmaxactive=7000\nbeam=20.0\nlattice_beam=7.0\nacwt=0.083333\nbatch_size=75 # Limits memory blowup in compile-train-graphs-fsts\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nskip_scoring=false\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\n\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/decode_fromlats.sh [options] <data-dir> <lang> <old-decode-dir> <decode-dir>\"\n   echo \"e.g.: steps/decode_fromlats.sh data/test_dev93 data/lang_test_tg exp/tri2b/decode_tgpr_dev93 exp/tri2a/decode_tgpr_dev93_fromlats\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\n\ndata=$1\nlang=$2\nolddir=$3\ndir=$4\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nmkdir -p $dir/log\n\nnj=`cat $olddir/num_jobs` || exit 1;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\nsdata=$data/split$nj\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj >$dir/num_jobs\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $srcdir/final.mdl $olddir/lat.1.gz \\\n    $srcdir/tree $lang/L_disambig.fst $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"decode_fromlats.sh: no such file $f\" && exit 1;\ndone\n\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"decode_fromlats.sh: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n\n$cmd JOB=1:$nj $dir/log/decode_lats.JOB.log \\\n lattice-to-fst \"ark:gunzip -c $olddir/lat.JOB.gz|\" ark:- \\| \\\n  fsttablecompose \"fstproject --project_output=true $lang/G.fst | fstarcsort |\" ark:- ark:- \\| \\\n  fstdeterminizestar ark:- ark:- \\| \\\n  compile-train-graphs-fsts --read-disambig-syms=$lang/phones/disambig.int \\\n    --batch-size=$batch_size $scale_opts $srcdir/tree $srcdir/final.mdl $lang/L_disambig.fst ark:- ark:- \\|  \\\n  gmm-latgen-faster --max-active=$maxactive --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n    --allow-partial=true --word-symbol-table=$lang/words.txt \\\n    $srcdir/final.mdl ark:- \"$feats\" \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $lang $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_lvtln.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Copyright 2014  Vimal Manohar\n\n# Decoding script for LVTLN models.  Will estimate VTLN warping factors\n# as a by product, which can be used to extract VTLN-warped features.\n\n# Begin configuration section\nstage=0\nacwt=0.083333 \nmax_active=3000 # Have a smaller than normal max-active, to limit decoding time.\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.0\nlogdet_scale=0.0\ncmd=run.pl\nskip_scoring=false\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nscoring_opts=\ncleanup=true\n# End configuration section\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Wrong #arguments ($#, expected 3)\"\n   echo \"Usage: steps/decode_lvtln.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: steps/decode_lvtln.sh exp/tri2d/graph_tgpr data/test_dev93 exp/tri2d/decode_dev93_tgpr\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... used to get posteriors\"\n   echo \"  --scoring-opts <opts>                    # options to local/score.sh\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=`echo $3 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0: file $data/spk2warp exists.  
This script expects non-VTLN features\"\n  exit 1;\nfi\n\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data/feats.scp $srcdir/tree $srcdir/final.mdl \\\n  $srcdir/final.alimdl $srcdir/final.lvtln; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n## Set up the unadapted features \"$sifeats\"\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n\n## Generate lattices.\nif [ $stage -le 0 ]; then\n  echo \"$0: doing main lattice generation phase\"\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $srcdir/final.alimdl | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $srcdir/final.alimdl\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n     --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $srcdir/final.alimdl $graphdir/HCLG.fst \"$sifeats\" 
\"ark:|gzip -c > $dir/lat_pass1.JOB.gz\" \\\n    || exit 1;\nfi\n\n\n## Get the first-pass LVTLN transforms\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass LVTLN transforms.\"\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $srcdir/final.mdl | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $srcdir/final.mdl\"; exit 1; }\n  fi\n  $cmd JOB=1:$nj $dir/log/lvtln_pass1.JOB.log \\\n    gunzip -c $dir/lat_pass1.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $srcdir/final.alimdl ark:- ark:- \\| \\\n    gmm-post-to-gpost $srcdir/final.alimdl \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-lvtln-trans --logdet-scale=$logdet_scale --verbose=1 --spk2utt=ark:$sdata/JOB/spk2utt \\\n       $srcdir/final.mdl $srcdir/final.lvtln \"$sifeats\" ark,s,cs:- ark:$dir/trans_pass1.JOB \\\n       ark,t:$dir/warp_pass1.JOB || exit 1;\nfi\n##\n\nfeats1=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans_pass1.JOB ark:- ark:- |\"\n\n## Do a second pass of estimating the LVTLN transform.\n\nif [ $stage -le 3 ]; then\n  echo \"$0: rescoring the lattices with first-pass LVTLN transforms\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/rescore.JOB.log \\\n    gmm-rescore-lattice $srcdir/final.mdl \"ark:gunzip -c $dir/lat_pass1.JOB.gz|\" \"$feats1\" \\\n     \"ark:|gzip -c > $dir/lat_pass2.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: re-estimating LVTLN transforms\"\n  $cmd JOB=1:$nj $dir/log/lvtln_pass2.JOB.log \\\n    gunzip -c $dir/lat_pass2.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n    gmm-post-to-gpost $srcdir/final.mdl \"$feats1\" ark:- ark:- \\| \\\n    gmm-est-lvtln-trans --logdet-scale=$logdet_scale --verbose=1 
--spk2utt=ark:$sdata/JOB/spk2utt \\\n      $srcdir/final.mdl $srcdir/final.lvtln \"$sifeats\" ark,s,cs:- ark:$dir/trans.JOB \\\n      ark,t:$dir/warp.JOB || exit 1;\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 5 ]; then\n  # This second rescoring is only really necessary for scoring purposes,\n  # it does not affect the transforms.\n  echo \"$0: rescoring the lattices with second-pass LVTLN transforms\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/rescore.JOB.log \\\n    gmm-rescore-lattice $srcdir/final.mdl \"ark:gunzip -c $dir/lat_pass2.JOB.gz|\" \"$feats\" \\\n     \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ -f $dir/warp.1 ]; then\n  for j in $(seq $nj); do cat $dir/warp_pass1.$j; done > $dir/0.warp || exit 1;\n  for j in $(seq $nj); do cat $dir/warp.$j; done > $dir/final.warp || exit 1;\n  ns1=$(cat $dir/0.warp | wc -l)\n  ns2=$(cat $dir/final.warp | wc -l)\n  ! [ \"$ns1\" == \"$ns2\" ] && echo \"$0: Number of speakers differ pass1 vs pass2, $ns1 != $ns2\" && exit 1;\n\n  paste $dir/0.warp $dir/final.warp | awk '{x=$2 - $4; if ((x>0?x:-x) > 0.010001) { print $1, $2, $4; }}' > $dir/warp_changed\n  nc=$(cat $dir/warp_changed | wc -l)\n  echo \"$0: For $nc speakers out of $ns1, warp changed pass1 vs pass2 by >0.01, see $dir/warp_changed for details\"\nfi\n\nif true; then # Diagnostics\n  if [ -f $data/spk2gender ]; then \n    # To make it easier to eyeball the male and female speakers' warps\n    # separately, separate them out.\n    for g in m f; do # means: for gender in male female\n      cat $dir/final.warp | \\\n        utils/filter_scp.pl <(grep -w $g $data/spk2gender | awk '{print $1}') > $dir/final.warp.$g\n      echo -n \"The last few warp factors for gender $g are: \"\n      tail -n 10 $dir/final.warp.$g | awk '{printf(\"%s \", $2);}'; \n      echo\n    done\n  fi\nfi\n\nif ! $skip_scoring ; then\n  [ ! 
-x local/score.sh ] && \\\n    echo \"$0: not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\nfi\n\nif $cleanup; then\n  rm $dir/lat_pass?.*.gz $dir/trans_pass1.* $dir/warp_pass1.* $dir/warp.*\nfi\n\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_nolats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey)\n#                      Vimal Manohar\n# Apache 2.0\n\n##Changes\n# Vimal Manohar (Jan 2014):\n# Added options to boost silence probabilities in the model before\n# decoding. This can help in favoring the silence phones when \n# some silence regions are wrongly decoded as speech phones like glottal stops\n\n# Begin configuration section.  \ntransform_dir=\niter=\nmodel= # You can specify the model to use (e.g. if you want to use the .alimdl)\nboost_silence=1.0         # Boost silence pdfs in the model by this factor before decoding\nsilence_phones_list=      # List of silence phones that would be boosted before decoding\nstage=0\nnj=4\ncmd=run.pl\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects pruning (scoring is on lattices).\nwrite_alignments=false  # The output directory is treated like an alignment directory\nwrite_words=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\n[ -z $silence_phones_list ] && boost_silence=1.0\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the model is.  
This version produces just linear output, no lattices\"\n   echo \"\"\n   echo \"e.g.: steps/decode_nolats.sh exp/mono/graph_tgpr data/test_dev93 exp/mono/decode_dev93_tgpr\"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --model <model>                                  # which model to use (e.g. to\"\n   echo \"                                                   # specify the final.alimdl)\"\n   echo \"  --write-alignments <true|false>                  # if true, output ali.*.gz\"\n   echo \"  --write-words <true|false>                       # if true, output words.*.gz\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --transform-dir <trans-dir>                      # dir to find fMLLR transforms \"\n   echo \"  --acwt <float>                                   # acoustic scale used for decoding \"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nif [ -z \"$model\" ]; then # if --model <mdl> was not specified on the command line...\n  if [ -z $iter ]; then model=$srcdir/final.mdl; \n  else model=$srcdir/$iter.mdl; fi\nfi\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $model $graphdir/HCLG.fst; do\n  [ ! 
-f $f ] && echo \"decode.sh: no such file $f\" && exit 1;\ndone\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"decode.sh: feature type is $feat_type\";\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\nutils/lang/check_phones_compatible.sh $graphdir/phones.txt $srcdir/phones.txt || exit 1;\n\nif $write_alignments; then\n  # Copy model and options that are generally expected in an alignment \n  # directory.\n  cp $graphdir/phones.txt $dir || exit 1;\n\n  cp $srcdir/{tree,final.mdl} $dir || exit 1;\n  cp $srcdir/final.alimdl $dir 2>/dev/null\n  cp $srcdir/final.occs $dir;\n  cp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\n  cp $srcdir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n  cp $srcdir/delta_opts $dir 2>/dev/null\nfi\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    if $write_alignments; then\n      cp $srcdir/final.mat $dir\n      cp $srcdir/full.mat $dir 2>/dev/null\n    fi\n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\"\n  [ \"`cat $transform_dir/num_jobs`\" -ne $nj ] && \\\n     echo \"Mismatch in number of jobs with $transform_dir\";\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nfi\n\nif [ $stage -le 0 ]; then\n  if $write_alignments; then\n    ali=\"ark:|gzip -c > $dir/ali.JOB.gz\"\n  else\n    ali=\"ark:/dev/null\"\n  fi\n  if $write_words; then\n    words=\"ark:|gzip -c > $dir/words.JOB.gz\"\n  else\n    words=\"ark:/dev/null\"\n  fi\n\n  [ ! -z \"$silence_phones_list\" ]  && \\\n    model=\"gmm-boost-silence --boost=$boost_silence $silence_phones_list $model - |\"\n\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $model\"; exit 1; }\n  fi\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-decode-faster --max-active=$max_active --beam=$beam  \\\n    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    \"$model\" $graphdir/HCLG.fst \"$feats\" \"$words\" \"$ali\" || exit 1;\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_raw_fmllr.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey)\n\n# This decoding script is like decode_fmllr.sh, but it does the fMLLR on\n# the raw cepstra, using the model in the LDA+MLLT space\n# \n# Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or\n# LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the \n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:                 \n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nfirst_beam=10.0 # Beam used in initial, speaker-indep. 
pass\nfirst_max_active=2000 # max-active used in initial pass.\nalignment_model=\nadapt_model=\nfinal_model=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in \n              # lattice generation.\nmax_active=7000\nuse_normal_fmllr=false\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nskip_scoring=false\nscoring_opts=\n# End configuration section\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Wrong #arguments ($#, expected 3)\"\n   echo \"Usage: steps/decode_raw_fmllr.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \" e.g.: steps/decode_raw_fmllr.sh exp/tri2b/graph_tgpr data/test_dev93 exp/tri2b/decode_dev93_tgpr\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <final-mdl>                # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... 
used to get posteriors\"\n   echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n   echo \"  --scoring-opts <opts>                    # options to local/score.sh\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=`echo $3 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\nraw_dim=$(feat-to-dim scp:$data/feats.scp -) || exit 1;\n! [ \"$raw_dim\" -gt 0 ] && echo \"raw feature dim not set\" && exit 1;\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data/feats.scp $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n##\n\n## Do the speaker-independent decoding, if --si-dir option not present. 
##\nif [ -z \"$si_dir\" ]; then # we need to do the speaker-independent decoding pass.\n  si_dir=${dir}.si # Name it as our decoding dir, but with suffix \".si\".\n  if [ $stage -le 0 ]; then\n    steps/decode.sh --scoring-opts \"$scoring_opts\" \\\n              --num-threads $num_threads --skip-scoring $skip_scoring \\\n              --acwt $acwt --nj $nj --cmd \"$cmd\" --beam $first_beam \\\n              --model $alignment_model --max-active \\\n              $first_max_active $graphdir $data $si_dir || exit 1;\n  fi\nfi\n##\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $si_dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ ! -f \"$si_dir/lat.1.gz\" ] && echo \"No such file $si_dir/lat.1.gz\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\nif [[ ! -f $srcdir/final.mat || ! 
-f $srcdir/full.mat ]]; then\n  echo \"$0: we require final.mat and full.mat in the source directory $srcdir\"\n  exit 1;\nfi\n\nsplicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- |\"\nsifeats=\"$splicedfeats transform-feats $srcdir/final.mat ark:- ark:- |\"\n\nfull_lda_mat=\"get-full-lda-mat --print-args=false $srcdir/final.mat $srcdir/full.mat -|\"\n\n##\n\n## Now get the first-pass fMLLR transforms.\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass raw-fMLLR transforms.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass1.JOB.log \\\n    gunzip -c $si_dir/lat.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $alignment_model ark:- ark:- \\| \\\n    gmm-post-to-gpost $alignment_model \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-fmllr-raw-gpost --raw-feat-dim=$raw_dim --spk2utt=ark:$sdata/JOB/spk2utt $adapt_model \"$full_lda_mat\" \\\n      \"$splicedfeats\" ark,s,cs:- ark:$dir/pre_trans.JOB || exit 1;\nfi\n##\n\npass1splicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/pre_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- |\"\npass1feats=\"$pass1splicedfeats transform-feats $srcdir/final.mat ark:- ark:- |\"\n\n## Do the main lattice generation pass.  
Note: we don't determinize the lattices at\n## this stage, as we're going to use them in acoustic rescoring with the larger \n## model, and it's more correct to store the full state-level lattice for this purpose.\nif [ $stage -le 2 ]; then\n  echo \"$0: doing main lattice generation phase\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --determinize-lattice=false \\\n    --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model $graphdir/HCLG.fst \"$pass1feats\" \"ark:|gzip -c > $dir/lat.tmp.JOB.gz\" \\\n    || exit 1;\nfi\n##\n\n## Do a second pass of estimating the transform-- this time with the lattices\n## generated from the alignment model.  Compose the transforms to get\n## $dir/trans.1, etc.\nif [ $stage -le 3 ]; then\n  echo \"$0: estimating raw-fMLLR transforms a second time.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass2.JOB.log \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=4.0 \\\n    \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr-raw --raw-feat-dim=$raw_dim --spk2utt=ark:$sdata/JOB/spk2utt \\\n     $adapt_model \"$full_lda_mat\" \"$pass1splicedfeats\" ark,s,cs:- ark:$dir/trans_tmp.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans_tmp.JOB ark:$dir/pre_trans.JOB \\\n    ark:$dir/raw_trans.JOB  || exit 1;\nfi\n##\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n\nif [ $stage -le 4 ] && $use_normal_fmllr; then\n  
echo \"$0: estimating normal fMLLR transforms\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass3.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=4.0 ark:- ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr --spk2utt=ark:$sdata/JOB/spk2utt \\\n     $adapt_model \"$feats\" ark,s,cs:- ark:$dir/trans.JOB || exit 1;\nfi\n\nif $use_normal_fmllr; then\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\nfi\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# At this point we prune and determinize the lattices and write them out, ready for \n# language model rescoring.\n\nif [ $stage -le 5 ]; then\n  echo \"$0: doing a final pass of acoustic rescoring.\"\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/acoustic_rescore.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" '&&' rm $dir/lat.tmp.JOB.gz || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"$0: not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\nfi\n\n#rm $dir/{trans_tmp,pre_trans}.*\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/decode_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does decoding with an SGMM system, with speaker vectors.\n# If the SGMM system was\n# built on top of fMLLR transforms from a conventional system, you should\n# provide the --transform-dir option.\n\n# Begin configuration section.\nstage=1\ntransform_dir=    # dir to find fMLLR transforms.\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\ncmd=run.pl\nbeam=13.0\ngselect=15  # Number of Gaussian-selection indices for SGMMs.  [Note:\n            # the first_pass_gselect variable is used for the 1st pass of\n            # decoding and can be tighter.\nfirst_pass_gselect=3 # Use a smaller number of Gaussian-selection indices in\n            # the 1st pass of decoding (lattice generation).\nmax_active=7000\nmax_mem=50000000\n#WARNING: This option is renamed lattice_beam (it was renamed to follow the naming\n#         in the other scripts\nlattice_beam=6.0 # Beam we use in lattice generation.\nvecs_beam=4.0 # Beam we use to prune lattices while getting posteriors for\n    # speaker-vector computation.  Can be quite tight (actually we could\n    # probably just do best-path.\nuse_fmllr=false\nfmllr_iters=10\nfmllr_min_count=1000\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nparallel_opts=  # ignored now.\nskip_scoring=false\nscoring_opts=\n# note: there are no more min-lmwt and max-lmwt options, instead use\n# e.g. --scoring-opts \"--min-lmwt 1 --max-lmwt 20\"\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: steps/decode_sgmm2.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \" e.g.: steps/decode_sgmm2.sh --transform-dir exp/tri3b/decode_dev93_tgpr \\\\\"\n  echo \"      exp/sgmm3a/graph_tgpr data/test_dev93 exp/sgmm3a/decode_dev93_tgpr\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 13.0\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nfor f in $graphdir/HCLG.fst $data/feats.scp $srcdir/final.mdl; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1\ngselect_opt=\"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\"\ngselect_opt_1stpass=\"$gselect_opt copy-gselect --n=$first_pass_gselect ark:- ark:- |\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\n## Set up features.\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! 
-z \"$transform_dir\" ]; then\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  if [ -f $transform_dir/trans.1 ]; then\n    echo \"$0: using transforms from $transform_dir\"\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\n  elif [ -f $transform_dir/raw_trans.1 ]; then\n    feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n  else\n    echo \"$0: no such file $transform_dir/trans.1 or $transform_dir/raw_trans.1, invalid --transform-dir option?\"\n    exit 1;\n  fi\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option in test time.\"\nfi\n##\n\n## Save Gaussian-selection info to disk.\n# Note: we can use final.mdl regardless of whether there is an alignment model--\n# they use the same UBM.\n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    sgmm2-gselect --full-gmm-nbest=$gselect $srcdir/final.mdl \\\n    \"$feats\" \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n\n\n# Generate state-level lattice which we can rescore.  
This is done with the alignment\n# model and no speaker-vectors.\nif [ $stage -le 2 ]; then\n  if [ -f \"$graphdir/num_pdfs\" ]; then\n    [ \"`cat $graphdir/num_pdfs`\" -eq `am-info --print-args=false $alignment_model | grep pdfs | awk '{print $NF}'` ] || \\\n      { echo \"Mismatch in number of pdfs with $alignment_model\"; exit 1; }\n  fi\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode_pass1.JOB.log \\\n    sgmm2-latgen-faster$thread_string --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --determinize-lattice=false --allow-partial=true \\\n    --word-symbol-table=$graphdir/words.txt --max-mem=$max_mem \"$gselect_opt_1stpass\" $alignment_model \\\n    $graphdir/HCLG.fst \"$feats\" \"ark:|gzip -c > $dir/pre_lat.JOB.gz\" || exit 1;\nfi\n\n# Estimate speaker vectors (1st pass).  Prune before determinizing\n# because determinization can take a while on un-pruned lattices.\n# Note: the sgmm2-post-to-gpost stage is necessary because we have\n# a separate alignment-model and final model, otherwise we'd skip it\n# and use sgmm2-est-spkvecs.\nif [ $stage -le 3 ]; then\n  $cmd JOB=1:$nj $dir/log/vecs_pass1.JOB.log \\\n    gunzip -c $dir/pre_lat.JOB.gz \\| \\\n    lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alignment_model ark:- ark:- \\| \\\n    sgmm2-post-to-gpost \"$gselect_opt\" $alignment_model \"$feats\" ark:- ark:- \\| \\\n    sgmm2-est-spkvecs-gpost --spk2utt=ark:$sdata/JOB/spk2utt \\\n     $srcdir/final.mdl \"$feats\" ark,s,cs:- \"ark:$dir/pre_vecs.JOB\" || exit 1;\nfi\n\n# Estimate speaker vectors (2nd pass).  
Since we already have spk vectors,\n# at this point we need to rescore the lattice to get the correct posteriors.\nif [ $stage -le 4 ]; then\n  $cmd JOB=1:$nj $dir/log/vecs_pass2.JOB.log \\\n    gunzip -c $dir/pre_lat.JOB.gz \\| \\\n    sgmm2-rescore-lattice --speedup=true --spk-vecs=ark:$dir/pre_vecs.JOB \\\n           --utt2spk=ark:$sdata/JOB/utt2spk \\\n      \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n    lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n    sgmm2-est-spkvecs --spk2utt=ark:$sdata/JOB/spk2utt \"$gselect_opt\" --spk-vecs=ark:$dir/pre_vecs.JOB \\\n     $srcdir/final.mdl \"$feats\" ark,s,cs:- \"ark:$dir/vecs.JOB\" || exit 1;\nfi\nrm $dir/pre_vecs.*\n\nif $use_fmllr; then\n  # Estimate fMLLR transforms (note: these may be on top of any\n  # fMLLR transforms estimated with the baseline GMM system.\n  if [ $stage -le 5 ]; then # compute fMLLR transforms.\n    echo \"$0: computing fMLLR transforms.\"\n    if [ ! 
-f $srcdir/final.fmllr_mdl ] || [ $srcdir/final.fmllr_mdl -ot $srcdir/final.mdl ]; then\n      echo \"$0: computing pre-transform for fMLLR computation.\"\n      sgmm2-comp-prexform $srcdir/final.mdl $srcdir/final.occs $srcdir/final.fmllr_mdl || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      gunzip -c $dir/pre_lat.JOB.gz \\| \\\n      sgmm2-rescore-lattice --speedup=true --spk-vecs=ark:$dir/vecs.JOB \\\n        --utt2spk=ark:$sdata/JOB/utt2spk \\\n      \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n      lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n      sgmm2-est-fmllr --spk2utt=ark:$sdata/JOB/spk2utt \"$gselect_opt\" --spk-vecs=ark:$dir/vecs.JOB \\\n       --fmllr-iters=$fmllr_iters --fmllr-min-count=$fmllr_min_count \\\n      $srcdir/final.fmllr_mdl \"$feats\" ark,s,cs:- \"ark:$dir/trans.JOB\" || exit 1;\n  fi\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\nfi\n\n# Now rescore the state-level lattices with the adapted features and the\n# corresponding model.  
Prune and determinize the lattices to limit\n# their size.\nif [ $stage -le 6 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/rescore.JOB.log \\\n    sgmm2-rescore-lattice \"$gselect_opt\" --utt2spk=ark:$sdata/JOB/utt2spk --spk-vecs=ark:$dir/vecs.JOB \\\n    $srcdir/final.mdl \"ark:gunzip -c $dir/pre_lat.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned$thread_string --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\nrm $dir/pre_lat.*.gz\n\n\nif [ $stage -le 7 ]; then\n  steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $graphdir $dir\nfi\n\nif [ $stage -le 8 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n  fi\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_sgmm2_fromlats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does decoding with an SGMM2 system, with speaker vectors.  If the\n# SGMM2 system was built on top of fMLLR transforms from a conventional system,\n# you should provide the --transform-dir option.\n\n# This script does not use a decoding graph, but instead you provide\n# a previous decoding directory with lattices in it.  This script will only\n# make use of the word sequences in the lattices; it limits the decoding\n# to those sequences.  You should also provide a \"lang\" directory from\n# which this script will use the G.fst and L.fst.\n\n# Begin configuration section.\nstage=1\nalignment_model=\ntransform_dir=    # dir to find fMLLR transforms.\nacwt=0.08333  # Just a default value, used for adaptation and beam-pruning..\nbatch_size=75 # Limits memory blowup in compile-train-graphs-fsts\ncmd=run.pl\nbeam=20.0\ngselect=15  # Number of Gaussian-selection indices for SGMMs.  [Note:\n            # the first_pass_gselect variable is used for the 1st pass of\n            # decoding and can be tighter.\nfirst_pass_gselect=3 # Use a smaller number of Gaussian-selection indices in\n            # the 1st pass of decoding (lattice generation).\nmax_active=7000\nlattice_beam=8.0 # Beam we use in lattice generation.\nvecs_beam=4.0 # Beam we use to prune lattices while getting posteriors for\n    # speaker-vector computation.  Can be quite tight (actually we could\n    # probably just do best-path.\nuse_fmllr=false\nfmllr_iters=10\nfmllr_min_count=1000\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: steps/decode_sgmm2_fromlats.sh [options] <data-dir> <lang-dir> <old-decode-dir> <decode-dir>\"\n  echo \"\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --alignment-model <ali-mdl>              # Model for the first-pass decoding.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 20.0\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nolddir=$3\ndir=$4\nsrcdir=`dirname $dir`\n\nfor f in $data/feats.scp $lang/G.fst $lang/L_disambig.fst $lang/phones/disambig.int \\\n    $srcdir/final.mdl $srcdir/tree $olddir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=`cat $olddir/num_jobs` || exit 1;\nsdata=$data/split$nj;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ngselect_opt=\"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\"\ngselect_opt_1stpass=\"$gselect_opt copy-gselect --n=$first_pass_gselect ark:- ark:- |\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n## Set up features\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\nif [ -z \"$transform_dir\" ] && [ -f $olddir/trans.1 ]; then\n  transform_dir=$olddir\nfi\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option in test time.\"\nfi\n\n## Calculate FMLLR pre-transforms if needed. We are doing this here since this\n## step is required by models both with and without speaker vectors\nif $use_fmllr; then\n  if [ ! -f $srcdir/final.fmllr_mdl ] || [ $srcdir/final.fmllr_mdl -ot $srcdir/final.mdl ]; then\n    echo \"$0: computing pre-transform for fMLLR computation.\"\n    sgmm2-comp-prexform $srcdir/final.mdl $srcdir/final.occs $srcdir/final.fmllr_mdl || exit 1;\n  fi\nfi\n\n## Save Gaussian-selection info to disk.\n# Note: we can use final.mdl regardless of whether there is an alignment model--\n# they use the same UBM.\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    sgmm2-gselect --full-gmm-nbest=$gselect $srcdir/final.mdl \\\n    \"$feats\" \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n\n# Generate state-level lattice which we can rescore.  
This is done with the\n# alignment model and no speaker-vectors.\nif [ $stage -le 2 ]; then\n  $cmd JOB=1:$nj $dir/log/decode_pass1.JOB.log \\\n    lattice-to-fst \"ark:gunzip -c $olddir/lat.JOB.gz|\" ark:- \\| \\\n    fsttablecompose \"fstproject --project_output=true $lang/G.fst | fstarcsort |\" ark:- ark:- \\| \\\n    fstdeterminizestar ark:- ark:- \\| \\\n    compile-train-graphs-fsts --read-disambig-syms=$lang/phones/disambig.int \\\n      --batch-size=$batch_size $scale_opts \\\n      $srcdir/tree $srcdir/final.mdl $lang/L_disambig.fst ark:- ark:- \\| \\\n    sgmm2-latgen-faster --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n      --acoustic-scale=$acwt --determinize-lattice=false --allow-partial=true \\\n      --word-symbol-table=$lang/words.txt \"$gselect_opt_1stpass\" $alignment_model \\\n      \"ark:-\" \"$feats\" \"ark:|gzip -c > $dir/pre_lat.JOB.gz\" || exit 1;\nfi\n\n## Check if the model has speaker vectors\nspkdim=`sgmm2-info $srcdir/final.mdl | grep 'speaker vector' | awk '{print $NF}'`\n\nif [ $spkdim -gt 0 ]; then  ### For models with speaker vectors:\n\n# Estimate speaker vectors (1st pass).  
Prune before determinizing\n# because determinization can take a while on un-pruned lattices.\n# Note: the sgmm2-post-to-gpost stage is necessary because we have\n# a separate alignment-model and final model, otherwise we'd skip it\n# and use sgmm2-est-spkvecs.\n  if [ $stage -le 3 ]; then\n    $cmd JOB=1:$nj $dir/log/vecs_pass1.JOB.log \\\n      gunzip -c $dir/pre_lat.JOB.gz \\| \\\n      lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alignment_model ark:- ark:- \\| \\\n      sgmm2-post-to-gpost \"$gselect_opt\" $alignment_model \"$feats\" ark:- ark:- \\| \\\n      sgmm2-est-spkvecs-gpost --spk2utt=ark:$sdata/JOB/spk2utt \\\n        $srcdir/final.mdl \"$feats\" ark,s,cs:- \"ark:$dir/pre_vecs.JOB\" || exit 1;\n  fi\n\n# Estimate speaker vectors (2nd pass).  Since we already have spk vectors,\n# at this point we need to rescore the lattice to get the correct posteriors.\n  if [ $stage -le 4 ]; then\n    $cmd JOB=1:$nj $dir/log/vecs_pass2.JOB.log \\\n      gunzip -c $dir/pre_lat.JOB.gz \\| \\\n      sgmm2-rescore-lattice --spk-vecs=ark:$dir/pre_vecs.JOB --utt2spk=ark:$sdata/JOB/utt2spk \\\n        \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n      lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n      sgmm2-est-spkvecs --spk2utt=ark:$sdata/JOB/spk2utt \"$gselect_opt\" --spk-vecs=ark:$dir/pre_vecs.JOB \\\n        $srcdir/final.mdl \"$feats\" ark,s,cs:- \"ark:$dir/vecs.JOB\" || exit 1;\n  fi\n  rm $dir/pre_vecs.*\n\n  if $use_fmllr; then\n    # Estimate fMLLR 
transforms (note: these may be on top of any\n    # fMLLR transforms estimated with the baseline GMM system.\n    if [ $stage -le 5 ]; then # compute fMLLR transforms.\n      echo \"$0: computing fMLLR transforms.\"\n      $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n        gunzip -c $dir/pre_lat.JOB.gz \\| \\\n        sgmm2-rescore-lattice --spk-vecs=ark:$dir/vecs.JOB --utt2spk=ark:$sdata/JOB/utt2spk \\\n          \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n        lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n        lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n        lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n        weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n        sgmm2-est-fmllr --spk2utt=ark:$sdata/JOB/spk2utt \"$gselect_opt\" --spk-vecs=ark:$dir/vecs.JOB \\\n          --fmllr-iters=$fmllr_iters --fmllr-min-count=$fmllr_min_count \\\n          $srcdir/final.fmllr_mdl \"$feats\" ark,s,cs:- \"ark:$dir/trans.JOB\" || exit 1;\n    fi\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\n  fi\n\n# Now rescore the state-level lattices with the adapted features and the\n# corresponding model.  
Prune and determinize the lattices to limit\n# their size.\n  if [ $stage -le 6 ]; then\n    $cmd JOB=1:$nj $dir/log/rescore.JOB.log \\\n      sgmm2-rescore-lattice \"$gselect_opt\" --utt2spk=ark:$sdata/JOB/utt2spk --spk-vecs=ark:$dir/vecs.JOB \\\n        $srcdir/final.mdl \"ark:gunzip -c $dir/pre_lat.JOB.gz|\" \"$feats\" ark:- \\| \\\n      lattice-determinize-pruned --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n        \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n  fi\n  rm $dir/pre_lat.*.gz\n\nelse  ### For models without speaker vectors:\n\n  if $use_fmllr; then\n    # Estimate fMLLR transforms (note: these may be on top of any\n    # fMLLR transforms estimated with the baseline GMM system.\n    if [ $stage -le 5 ]; then # compute fMLLR transforms.\n      echo \"$0: computing fMLLR transforms.\"\n      $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n        gunzip -c $dir/pre_lat.JOB.gz \\| \\\n        sgmm2-rescore-lattice --utt2spk=ark:$sdata/JOB/utt2spk \\\n        \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n        lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n        lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n        lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n        weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n        sgmm2-est-fmllr --spk2utt=ark:$sdata/JOB/spk2utt \"$gselect_opt\" \\\n        --fmllr-iters=$fmllr_iters --fmllr-min-count=$fmllr_min_count \\\n        $srcdir/final.fmllr_mdl \"$feats\" ark,s,cs:- \"ark:$dir/trans.JOB\" || exit 1;\n    fi\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\n  fi\n\n# Now rescore the state-level lattices with the adapted features and the\n# corresponding model.  
Prune and determinize the lattices to limit\n# their size.\n  if [ $stage -le 6 ] && $use_fmllr; then\n    $cmd JOB=1:$nj $dir/log/rescore.JOB.log \\\n      sgmm2-rescore-lattice \"$gselect_opt\" --utt2spk=ark:$sdata/JOB/utt2spk \\\n        $srcdir/final.mdl \"ark:gunzip -c $dir/pre_lat.JOB.gz|\" \"$feats\" ark:- \\| \\\n      lattice-determinize-pruned --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n        \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n    rm $dir/pre_lat.*.gz\n  else  # Already done with decoding if no adaptation needed.\n    for n in `seq 1 $nj`; do\n      mv $dir/pre_lat.${n}.gz $dir/lat.${n}.gz\n    done\n  fi\n\nfi\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\n\n\nif [ $stage -le 7 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    local/score.sh --cmd \"$cmd\" $data $lang $dir ||\n      { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_sgmm2_rescore.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does decoding with an SGMM system, by rescoring lattices\n# generated from a previous SGMM system.  The directory with the lattices\n# is assumed to contain speaker vectors, if used.  Basically it rescores\n# the lattices one final time, using the same setup as the final decoding\n# pass of the source dir.  The assumption is that the model may have\n# been discriminatively trained.\n\n# If the system was built on top of fMLLR transforms from a conventional system,\n# you should provide the --transform-dir option.\n\n# Begin configuration section.\ntransform_dir=    # dir to find fMLLR transforms.\ncmd=run.pl\niter=final\nskip_scoring=false\nscoring_opts=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: steps/decode_sgmm2_rescore.sh [options] <graph-dir|lang-dir> <data-dir> <old-decode-dir> <decode-dir>\"\n  echo \" e.g.: steps/decode_sgmm2_rescore.sh --transform-dir exp/tri3b/decode_dev93_tgpr \\\\\"\n  echo \"      exp/sgmm3a/graph_tgpr data/test_dev93 exp/sgmm3a/decode_dev93_tgpr exp/sgmm3a_mmi/decode_dev93_tgpr\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --iter <iter>                            # iteration of model to use (default: final)\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\nolddir=$3\ndir=$4\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding 
directory.\n\nfor f in $graphdir/words.txt $data/feats.scp $olddir/lat.1.gz $olddir/gselect.1.gz \\\n   $srcdir/$iter.mdl; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=`cat $olddir/num_jobs` || exit 1;\nsdata=$data/split$nj;\ngselect_opt=\"--gselect=ark,s,cs:gunzip -c $olddir/gselect.JOB.gz|\"\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nif [ -f $olddir/vecs.1 ]; then\n  echo \"$0: using speaker vectors from $olddir\"\n  spkvecs_opt=\"--spk-vecs=ark:$olddir/vecs.JOB --utt2spk=ark:$sdata/JOB/utt2spk\"\nelse\n  echo \"$0: no speaker vectors found.\"\n  spkvecs_opt=\nfi\n\n\n## Set up features.\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option in test time.\"\nfi\n\nif [ -f $olddir/trans.1 ]; then\n  echo \"$0: using (in addition to any previous transforms) transforms from $olddir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$olddir/trans.JOB ark:- ark:- |\"\nfi\n##\n\n# Rescore the state-level lattices with the model provided.  Just\n# one command in this script.\necho \"$0: rescoring lattices with SGMM model in $srcdir/$iter.mdl\"\n$cmd JOB=1:$nj $dir/log/rescore.JOB.log \\\n  sgmm2-rescore-lattice \"$gselect_opt\" $spkvecs_opt \\\n  $srcdir/$iter.mdl \"ark:gunzip -c $olddir/lat.JOB.gz|\" \"$feats\" \\\n  \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_sgmm2_rescore_project.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does decoding with an SGMM system, by rescoring lattices\n# generated from a previous SGMM system.  This version does the \"predictive\"\n# SGMM, where we subtract some constant times the log-prob of the left\n# few spliced frames, and the same for the right few.\n# The directory with the lattices\n# is assumed to contain any speaker vectors, if used.  This script just\n# adds into the acoustic scores, (some constant, default -0.25) times\n# the acoustic score of the left model, and the same for the right model.\n\n# the lattices one final time, using the same setup as the final decoding\n# pass of the source dir.  The assumption is that the model may have\n# been discriminatively trained.\n\n# If the system was built on top of fMLLR transforms from a conventional system,\n# you should provide the --transform-dir option.\n\n# Begin configuration section.\nstage=0\ntransform_dir=    # dir to find fMLLR transforms.\ncmd=run.pl\niter=final\nprob_scale=-0.25\ndimensions=0:13:104:117\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: steps/decode_sgmm2_rescore_project.sh [options] <full-lda-mat> <graph-dir|lang-dir> <data-dir> <old-decode-dir> <decode-dir>\"\n  echo \" e.g.: steps/decode_sgmm2_rescore_project.sh --transform-dir exp/tri3b/decode_dev93_tgpr \\\\\"\n  echo \"     exp/tri2b/full.mat exp/sgmm3a/graph_tgpr data/test_dev93 exp/sgmm3a/decode_dev93_tgpr exp/sgmm3a/decode_dev93_tgpr_predict\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --prob-scale <scale>                     # Default -0.25, scale on left and right models.\"\n  exit 1;\nfi\n\nfull_lda_mat=$1\ngraphdir=$2\ndata=$3\nolddir=$4\ndir=$5\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nfor f in $full_lda_mat $graphdir/words.txt $data/feats.scp $olddir/lat.1.gz \\\n   $olddir/gselect.1.gz $srcdir/$iter.mdl; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=`cat $olddir/num_jobs` || exit 1;\nsdata=$data/split$nj;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nif [ -f $olddir/vecs.1 ]; then\n  echo \"$0: using speaker vectors from $olddir\"\n  spkvecs_opt=\"--spk-vecs=ark:$olddir/vecs.JOB --utt2spk=ark:$sdata/JOB/utt2spk\"\nelse\n  echo \"$0: no speaker vectors found.\"\n  spkvecs_opt=\nfi\n\nif [ $stage -le 0 ]; then\n  # Get full LDA+MLLT mat and its inverse.  
Note: the full LDA+MLLT mat is\n  # the LDA+MLLT mat, plus the \"rejected\" rows of the LDA matrix.\n  $cmd $dir/log/get_full_lda.log \\\n    get-full-lda-mat $srcdir/final.mat $full_lda_mat $dir/full.mat $dir/full_inv.mat || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  left_start=`echo $dimensions | cut '-d:' -f 1`;\n  left_end=`echo $dimensions | cut '-d:' -f 2`;\n  right_start=`echo $dimensions | cut '-d:' -f 3`;\n  right_end=`echo $dimensions | cut '-d:' -f 4`;\n\n  # Prepare left and right models.  For now, the dimensions are hardwired (e.g., 13 MFCCs and splice 9 frames).\n  # Note: the choice of dividing by the prob of the left 4 and the right 4 frames is a bit arbitrary and\n  # we could investigate different configurations.\n  $cmd $dir/log/left.log \\\n    sgmm2-project --start-dim=$left_start --end-dim=$left_end $srcdir/final.mdl $dir/full.mat $dir/left.mdl $dir/left.mat || exit 1;\n  $cmd $dir/log/right.log \\\n    sgmm2-project --start-dim=$right_start --end-dim=$right_end $srcdir/final.mdl $dir/full.mat $dir/right.mdl $dir/right.mat || exit 1;\nfi\n\n\n# we apply the scaling on the new acoustic probs by adding the inverse\n# of that to the old acoustic probs, and then later inverting again.\n# this has to do with limitations in sgmm2-rescore-lattice: we can only\n# scale the *old* acoustic probs, not the new ones.\ninverse_prob_scale=`perl -e \"print (1.0 / $prob_scale);\"`\ncur_lats=\"ark:gunzip -c $olddir/lat.JOB.gz | lattice-scale --acoustic-scale=$inverse_prob_scale ark:- ark:- |\"\n\n## Set up features.  Note: we only support LDA+MLLT features, this\n## is inherent in the method, we could not support deltas.\n\nfor model_type in left right; do\n\n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- |\" # spliced features.\n  if [ ! 
-z \"$transform_dir\" ]; then  # using speaker-specific transforms.\n     # we want to transform in the sequence: $dir/full.mat, then the result of\n     # (extend-transform-dim $transform_dir/trans.JOB), then $dir/full_inv.mat to\n     # get back to the spliced space, then the left.mat or right.mat.  But\n     # note that compose-transforms operates in matrix-multiplication order,\n     # which is opposite from the \"order of applying the transforms\" order.\n     new_dim=$[`copy-matrix --binary=false $dir/full.mat - | wc -l` - 1]; # 117 in normal case.\n     feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk 'ark:extend-transform-dim --new-dimension=$new_dim ark:$transform_dir/trans.JOB ark:- | compose-transforms ark:- $dir/full.mat ark:- | compose-transforms $dir/full_inv.mat ark:- ark:- | compose-transforms $dir/${model_type}.mat ark:- ark:- |' ark:- ark:- |\"\n  else  # else, we transform with the \"left\" or \"right\" matrix; these transform from the\n        # spliced space.\n     feats=\"$feats transform-feats $dir/${model_type}.mat |\"\n     # If we don't have the --transform-dir option, make sure the model was\n     # trained in the same way.\n     if grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n       echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n       echo \"  but you are not providing the --transform-dir option in test time.\"\n     fi\n  fi\n  if [ -f $olddir/trans.1 ]; then\n     echo \"$0: warning: not using transforms in $olddir (this is just a \"\n     echo \" limitation of the script right now, and could be fixed).\"\n  fi\n  \n  if [ $stage -le 2 ]; then\n    echo \"Getting gselect info for $model_type model.\"\n    $cmd JOB=1:$nj $dir/log/gselect.$model_type.JOB.log \\\n       sgmm2-gselect $dir/$model_type.mdl \"$feats\" \\\n       \"ark,t:|gzip -c >$dir/gselect.$model_type.JOB.gz\" || exit 1;\n  fi\n  gselect_opt=\"--gselect=ark,s,cs:gunzip -c 
$dir/gselect.$model_type.JOB.gz|\"\n\n\n  # Rescore the state-level lattices with the model provided.  Just\n  # one command in this script.\n  # The --old-acoustic-scale=1.0 option means we just add the scores\n  # to the old scores.\n  if [ $stage -le 3 ]; then\n    echo \"$0: rescoring lattices with $model_type model\"\n    $cmd JOB=1:$nj $dir/log/rescore.${model_type}.JOB.log \\\n      sgmm2-rescore-lattice --old-acoustic-scale=1.0 \"$gselect_opt\" $spkvecs_opt \\\n      $dir/$model_type.mdl \"$cur_lats\" \"$feats\" \\\n      \"ark:|gzip -c > $dir/lat.${model_type}.JOB.gz\" || exit 1;\n  fi\n  cur_lats=\"ark:gunzip -c $dir/lat.${model_type}.JOB.gz |\"\ndone\n\nif [ $stage -le 4 ]; then\n  echo \"$0: getting final lattices.\"\n  $cmd JOB=1:$nj $dir/log/scale_lats.JOB.log \\\n    lattice-scale --acoustic-scale=$prob_scale \"$cur_lats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" \\\n   || exit 1;\nfi\n\nrm $dir/lat.{left,right}.*.gz 2>/dev/null  # note: if these still exist, it will\n # confuse the scoring script.\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/decode_with_map.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Neha Agrawal, Cisco Systems;\n#                 Johns Hopkins University (Author: Daniel Povey);\n#                 \n# Apache 2.0\n\n# Begin configuration section.  \ntransform_dir=\niter=\nmodel= # You can specify the model to use (e.g. if you want to use the .alimdl)\nnj=4\ncmd=run.pl\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects pruning (scoring is on lattices).\nmean_tau=20\nweight_tau=10\nflags=mw  # could also contain \"v\" for variance; the default\n          # tau for that is 50.\nstage=1\nskip_scoring=false\n# End configuration section.\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\nif [ $# != 3 ]; then\n   echo \"Usage: steps/decode.sh [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the model is.\"\n   echo \"e.g.: steps/decode.sh exp/mono/graph_tgpr data/test_dev93 exp/mono/decode_dev93_tgpr\"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --model <model>                                  # which model to use (e.g. 
to\"\n   echo \"                                                   # specify the final.alimdl)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --transform-dir <trans-dir>                      # dir to find fMLLR transforms \"\n   echo \"                                                   # speaker-adapted decoding\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nif [ -z \"$model\" ]; then # if --model <mdl> was not specified on the command line...\n  if [ -z $iter ]; then model=$srcdir/final.mdl; \n  else model=$srcdir/$iter.mdl; fi\nfi\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $model $graphdir/HCLG.fst; do\n  [ ! -f $f ] && echo \"decode.sh: no such file $f\" && exit 1;\ndone\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"decode.sh: feature type is $feat_type\";\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\"\n  [ \"`cat $transform_dir/num_jobs`\" -ne $nj ] && \\\n     echo \"Mismatch in number of jobs with $transform_dir\";\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"Doing first-pass decoding before MAP decoding.\"\n  $cmd JOB=1:$nj $dir/log/decode_pass1.JOB.log \\\n    gmm-decode-faster --max-active=$max_active --beam=$beam \\\n    --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $model $graphdir/HCLG.fst \"$feats\" ark:$dir/tmp.JOB.tra ark:$dir/pass1_decode.JOB.ali || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"Computing MAP stats and doing MAP-adapted decoding\"\n  $cmd JOB=1:$nj $dir/log/decode_pass2.JOB.log \\\n    ali-to-post ark:$dir/pass1_decode.JOB.ali ark:- \\| \\\n  gmm-adapt-map --mean-tau=$mean_tau --weight-tau=$weight_tau \\\n       --update-flags=$flags --spk2utt=ark:$sdata/JOB/spk2utt \\\n     $model \"$feats\" ark:- ark:- \\| \\\n  gmm-latgen-map --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n   --utt2spk=ark:$sdata/JOB/utt2spk --max-active=$max_active --beam=$beam \\\n   --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n   $model ark,s,cs:- $graphdir/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\"\nfi\n#rm -f $dir/pass1_decode.*.ali\n#rm -f $dir/tmp.*.tra\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/diagnostic/analyze_alignments.sh",
    "content": "#!/usr/bin/env bash\n#\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2016.  Apache 2.0.\n\n# This script performs some analysis of alignments on disk, currently in terms\n# of phone lengths, including lengths of leading and trailing silences\n\n\n# begin configuration section.\ncmd=run.pl\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [options] <lang-dir> <ali-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"e.g.:\"\n  echo \"$0 data/lang exp/tri4b\"\n  echo \"This script writes some diagnostics to <ali-dir>/log/alignments.log\"\n  exit 1;\nfi\n\nlang=$1\ndir=$2\n\nmodel=$dir/final.mdl\n\nfor f in $lang/words.txt $model $dir/ali.1.gz $dir/num_jobs; do\n  [ ! -f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\nnum_jobs=$(cat $dir/num_jobs) || exit 1\n\nmkdir -p $dir/log\n\nrm $dir/phone_stats.*.gz 2>/dev/null || true\n\n$cmd JOB=1:$num_jobs $dir/log/get_phone_alignments.JOB.log \\\n  set -o pipefail '&&' ali-to-phones --write-lengths=true \"$model\"  \\\n      \"ark:gunzip -c $dir/ali.JOB.gz|\" ark,t:- \\| \\\n   sed -E 's/^[^ ]+ //' \\| \\\n   awk 'BEGIN{FS=\" ; \"; OFS=\"\\n\";} {print \"begin \" $1; if (NF>1) print \"end \" $NF; for (n=1;n<=NF;n++) print \"all \" $n; }' \\| \\\n   sort \\| uniq -c \\| gzip -c '>' $dir/phone_stats.JOB.gz || exit 1\n\nif ! $cmd $dir/log/analyze_alignments.log \\\n  gunzip -c \"$dir/phone_stats.*.gz\" \\| \\\n  steps/diagnostic/analyze_phone_length_stats.py $lang; then\n  echo \"$0: analyze_phone_length_stats.py failed, but ignoring the error (it's just for diagnostics)\"\nfi\n\ngrep WARNING $dir/log/analyze_alignments.log\necho \"$0: see stats in $dir/log/analyze_alignments.log\"\n\nrm $dir/phone_stats.*.gz\n\nexit 0\n"
  },
  {
    "path": "egs/steps/diagnostic/analyze_lats.sh",
    "content": "#!/usr/bin/env bash\n#\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2016.  Apache 2.0.\n\n# This script does the same type of diagnostics as analyze_alignments.sh, except\n# it starts from lattices (so it has to convert the lattices to alignments\n# first).\n\n# begin configuration section.\niter=final\ncmd=run.pl\nacwt=0.1\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [options] (<lang-dir>|<graph-dir>) <decode-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --acwt <acoustic-scale>         # Acoustic scale for getting best-path (default: 0.1)\"\n  echo \"e.g.:\"\n  echo \"$0 data/lang exp/tri4b/decode_dev\"\n  echo \"This script writes some diagnostics to <decode-dir>/log/alignments.log\"\n  exit 1;\nfi\n\nlang=$1\ndir=$2\n\nmodel=$dir/../${iter}.mdl\n\nfor f in $lang/words.txt $model $dir/lat.1.gz $dir/num_jobs; do\n  [ ! 
-f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\nnum_jobs=$(cat $dir/num_jobs) || exit 1\n\nmkdir -p $dir/log\n\nrm $dir/phone_stats.*.gz 2>/dev/null || true\n\n# this writes two archives of depth_tmp and ali_tmp of (depth per frame, alignment per frame).\n$cmd JOB=1:$num_jobs $dir/log/lattice_best_path.JOB.log \\\n  lattice-depth-per-frame \"ark:gunzip -c $dir/lat.JOB.gz|\" \"ark,t:|gzip -c > $dir/depth_tmp.JOB.gz\" ark:- \\| \\\n  lattice-best-path --acoustic-scale=$acwt ark:- ark:/dev/null \"ark,t:|gzip -c >$dir/ali_tmp.JOB.gz\" || exit 1\n\n$cmd JOB=1:$num_jobs $dir/log/get_lattice_stats.JOB.log \\\n  ali-to-phones --write-lengths=true \"$model\" \"ark:gunzip -c $dir/ali_tmp.JOB.gz|\" ark,t:- \\| \\\n  perl -ne 'chomp;s/^\\S+\\s*//;@a=split /\\s;\\s/, $_;$count{\"begin \".$a[0].\"\\n\"}++;\n  if(@a>1){$count{\"end \".$a[-1].\"\\n\"}++;}for($i=0;$i<@a;$i++){$count{\"all \".$a[$i].\"\\n\"}++;}\n  END{for $k (sort keys %count){print \"$count{$k} $k\"}}' \\| \\\n  gzip -c '>' $dir/phone_stats.JOB.gz || exit 1\n\n$cmd $dir/log/analyze_alignments.log \\\n  gunzip -c \"$dir/phone_stats.*.gz\" \\| \\\n  steps/diagnostic/analyze_phone_length_stats.py $lang || exit 1\n\ngrep WARNING $dir/log/analyze_alignments.log\necho \"$0: see stats in $dir/log/analyze_alignments.log\"\n\n$cmd $dir/log/dump_ali_frame.log \\\n  ali-to-phones --per-frame=true \"$model\" \"ark:gunzip -c $dir/ali_tmp.*.gz|\" \"ark,t:|gzip -c >$dir/ali_frame_tmp.gz\"\n\n$cmd $dir/log/analyze_lattice_depth_stats.log \\\n  gunzip -c \"$dir/depth_tmp.*.gz\" \\| \\\n  steps/diagnostic/analyze_lattice_depth_stats.py $lang \"$dir/ali_frame_tmp.gz\" || exit 1\n\ngrep Overall $dir/log/analyze_lattice_depth_stats.log\necho \"$0: see stats in $dir/log/analyze_lattice_depth_stats.log\"\n\n\nrm $dir/phone_stats.*.gz\nrm $dir/depth_tmp.*.gz\nrm $dir/ali_frame_tmp.gz\nrm $dir/ali_tmp.*.gz\n\nexit 0\n"
  },
  {
    "path": "egs/steps/diagnostic/analyze_lattice_depth_stats.py",
    "content": "#!/usr/bin/env python3\n\n\n# Copyright 2016 Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport sys, os\nfrom collections import defaultdict\nfrom io import open\nimport codecs\nimport gzip\n\n# reference: http://www.macfreek.nl/memory/Encoding_of_Python_stdout\nif sys.version_info.major == 2:\n    sys.stdout = codecs.getwriter('utf-8')(sys.stdout, 'strict')\nelse:\n    assert sys.version_info.major == 3\n    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')\n\n\nparser = argparse.ArgumentParser(description=\"This script reads stats created in analyze_lats.sh \"\n                                 \"to print information about lattice depths broken down per phone. \"\n                                 \"The normal output of this script is written to the standard output \"\n                                 \"and is human readable (on crashes, we'll print an error to stderr.\")\n\nparser.add_argument(\"--frequency-cutoff-percentage\", type = float,\n                    default = 0.5, help=\"Cutoff, expressed as a percentage \"\n                    \"(between 0 and 100), of frequency at which we print stats \"\n                    \"for a phone.\")\n\nparser.add_argument(\"lang\",\n                    help=\"Language directory, e.g. data/lang.\")\n\nparser.add_argument(\"ali_per_frame\",\n                    help=\"Gzipped alignment per frame, e.g. 
ali_frame_tmp.gz\")\n\nargs = parser.parse_args()\n\n# set up phone_int2text to map from phone to printed form.\nphone_int2text = {}\ntry:\n    f = open(args.lang + \"/phones.txt\", \"r\", encoding='utf-8')\n    for line in f.readlines():\n        [ word, number] = line.split()\n        phone_int2text[int(number)] = word\n    f.close()\nexcept:\n    sys.exit(u\"analyze_lattice_depth_stats.py: error opening or reading {0}/phones.txt\".format(\n            args.lang))\n# this is a special case... for begin- and end-of-sentence stats,\n# we group all nonsilence phones together.\nphone_int2text[0] = 'nonsilence'\n\n# populate the set 'nonsilence', which will contain the integer phone-ids of\n# nonsilence phones (and disambig phones, which won't matter).\nnonsilence = set(phone_int2text.keys())\nnonsilence.remove(0)\ntry:\n    # open lang/phones/silence.csl-- while there are many ways of obtaining the\n    # silence/nonsilence phones, we read this because it's present in graph\n    # directories as well as lang directories.\n    filename = u\"{0}/phones/silence.csl\".format(args.lang)\n    f = open(filename, \"r\")\n    line = f.readline()\n    for silence_phone in line.split(\":\"):\n        nonsilence.remove(int(silence_phone))\n    f.close()\nexcept Exception as e:\n    sys.exit(u\"analyze_lattice_depth_stats.py: error processing {0}/phones/silence.csl: {1}\".format(\n            args.lang, str(e)))\n\n# phone_depth_counts is a dict of dicts.\n# for each integer phone-id 'phone',\n# phone_depth_counts[phone] is a map from depth to count (of frames on which\n# that was the 1-best phone in the alignment, and the lattice depth\n# had that value).  
So we'd access it as\n# count = phone_depth_counts[phone][depth].\n\nphone_depth_counts = dict()\n\n# note: -1 is for all phones put in one bucket.\nfor p in [ -1 ] + list(phone_int2text.keys()):\n    phone_depth_counts[p] = defaultdict(int)\n\ntotal_frames = 0\n\nali_per_frame = {}\nfor line in gzip.open(args.ali_per_frame, mode='rt', encoding='utf-8'):\n   uttid, ali = line.split(\" \", 1)\n   ali_per_frame[uttid] = ali\n\nfor line in sys.stdin:\n    uttid, depth = line.split(\" \", 1)\n    if uttid in ali_per_frame:\n        apf = ali_per_frame[uttid].split()\n        dpf = depth.split()\n        for p, d in zip(apf, dpf):\n              p, d = int(p), int(d)\n              phone_depth_counts[p][d] += 1\n              total_frames += 1\n              if p in nonsilence:\n                  nonsilence_phone = 0\n                  phone_depth_counts[nonsilence_phone][d] += 1\n              universal_phone = -1\n              phone_depth_counts[universal_phone][d] += 1\n\nif total_frames == 0:\n    sys.exit(u\"analyze_lattice_depth_stats.py: read no input\")\n\n# If depth_to_count is a map from depth-in-frames to count,\n# return the depth-in-frames that equals the (fraction * 100)'th\n# percentile of the distribution.\ndef GetPercentile(depth_to_count, fraction):\n    this_total_frames = sum(depth_to_count.values())\n    if this_total_frames == 0:\n        return 0\n    else:\n        items = sorted(depth_to_count.items())\n        count_cutoff = int(fraction * this_total_frames)\n        cur_count_total = 0\n        for depth,count in items:\n            assert count >= 0\n            cur_count_total += count\n            if cur_count_total >= count_cutoff:\n                return depth\n        assert False # we shouldn't reach here.\n\ndef GetMean(depth_to_count):\n    this_total_frames = sum(depth_to_count.values())\n    if this_total_frames == 0:\n        return 0.0\n    this_total_depth = sum([ float(l * c) for l,c in depth_to_count.items() ])\n    return 
this_total_depth / this_total_frames\n\n\nprint(u\"The total amount of data analyzed assuming 100 frames per second \"\n      u\"is {0} hours\".format(\"%.1f\" % (total_frames / 360000.0)))\n\n# the next block prints lines like (to give some examples):\n# Nonsilence phones as a group account for 74.4% of phone occurrences, with lattice depth (10,50,90-percentile)=(1,2,7) and mean=3.1\n# Phone SIL accounts for 25.5% of phone occurrences, with lattice depth (10,50,90-percentile)=(1,1,4) and mean=2.5\n# Phone Z_E accounts for 2.5% of phone occurrences, with lattice depth (10,50,90-percentile)=(1,2,6) and mean=2.9\n# ...\n\n\n# sort the phones in decreasing order of count.\nfor phone,depths in sorted(phone_depth_counts.items(), key = lambda x : -sum(x[1].values())):\n\n    frequency_percentage = sum(depths.values()) * 100.0 / total_frames\n    if frequency_percentage < args.frequency_cutoff_percentage:\n        continue\n\n\n    depth_percentile_10 = GetPercentile(depths, 0.1)\n    depth_percentile_50 = GetPercentile(depths, 0.5)\n    depth_percentile_90 = GetPercentile(depths, 0.9)\n    depth_mean = GetMean(depths)\n\n    if phone > 0:\n        try:\n            phone_text = phone_int2text[phone]\n        except:\n            sys.exit(u\"analyze_lattice_depth_stats.py: phone {0} is not covered on phones.txt \"\n                     u\"(lang/alignment mismatch?)\".format(phone))\n        preamble = u\"Phone {phone_text} accounts for {percent}% of frames, with\".format(\n            phone_text = phone_text, percent = \"%.1f\" % frequency_percentage)\n    elif phone == 0:\n        preamble = u\"Nonsilence phones as a group account for {percent}% of frames, with\".format(\n            percent = \"%.1f\" % frequency_percentage)\n    else:\n        assert phone == -1\n        preamble = \"Overall,\";\n\n    print(u\"{preamble} lattice depth (10,50,90-percentile)=({p10},{p50},{p90}) and mean={mean}\".format(\n            preamble = preamble,\n            p10 = 
depth_percentile_10,\n            p50 = depth_percentile_50,\n            p90 = depth_percentile_90,\n            mean = \"%.1f\" % depth_mean))\n"
  },
  {
    "path": "egs/steps/diagnostic/analyze_phone_length_stats.py",
    "content": "#!/usr/bin/env python\n\n\n# Copyright 2016 Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\nfrom __future__ import print_function\nimport argparse\nimport sys, os\nfrom collections import defaultdict\nfrom io import open\nimport codecs\n\n# reference: http://www.macfreek.nl/memory/Encoding_of_Python_stdout\nif sys.version_info.major == 2:\n    sys.stdout = codecs.getwriter('utf-8')(sys.stdout, 'strict')\nelse:\n    assert sys.version_info.major == 3\n    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')\n\n\nparser = argparse.ArgumentParser(description=\"This script reads stats created in analyze_alignments.sh \"\n                                 \"to print information about phone lengths in alignments.  It's principally \"\n                                 \"useful in order to see whether there is a reasonable amount of silence \"\n                                 \"at the beginning and ends of segments.  The normal output of this script \"\n                                 \"is written to the standard output and is human readable (on crashes, \"\n                                 \"we'll print an error to stderr.\")\n\nparser.add_argument(\"--frequency-cutoff-percentage\", type = float,\n                    default = 0.5, help=\"Cutoff, expressed as a percentage \"\n                    \"(between 0 and 100), of frequency at which we print stats \"\n                    \"for a phone.\")\n\nparser.add_argument(\"lang\",\n                    help=\"Language directory, e.g. 
data/lang.\")\n\nargs = parser.parse_args()\n\n\n# set up phone_int2text to map from phone to printed form.\nphone_int2text = {}\ntry:\n    f = open(args.lang + \"/phones.txt\", \"r\", encoding='utf-8')\n    for line in f.readlines():\n        [ word, number] = line.split()\n        phone_int2text[int(number)] = word\n    f.close()\nexcept:\n    sys.exit(\"analyze_phone_length_stats.py: error opening or reading {0}/phones.txt\".format(\n            args.lang))\n# this is a special case... for begin- and end-of-sentence stats,\n# we group all nonsilence phones together.\nphone_int2text[0] = 'nonsilence'\n\n\n# populate the set 'nonsilence', which will contain the integer phone-ids of\n# nonsilence phones (and disambig phones, which won't matter).\nnonsilence = set(phone_int2text.keys())\nnonsilence.remove(0)\ntry:\n    # open lang/phones/silence.csl-- while there are many ways of obtaining the\n    # silence/nonsilence phones, we read this because it's present in graph\n    # directories as well as lang directories.\n    filename = \"{0}/phones/silence.csl\".format(args.lang)\n    f = open(filename, \"r\")\n    line = f.readline()\n    f.close()\n    for silence_phone in line.split(\":\"):\n        nonsilence.remove(int(silence_phone))\nexcept Exception as e:\n    sys.exit(\"analyze_phone_length_stats.py: error processing {0}/phones/silence.csl: {1}\".format(\n            args.lang, str(e)))\n\n\n# phone_length is a dict of dicts of dicts;\n# phone_lengths[boundary_type] for boundary_type in [ 'begin', 'end', 'all' ] is\n# a dict indexed by phone, containing dicts from length to a count of occurrences.\n# Phones are ints and lengths are integers representing numbers of frames.\n# So: count == phone_lengths[boundary_type][phone][length].\n# note: for the 'begin' and 'end' boundary-types, we group all nonsilence phones\n# into phone-id zero.\nphone_lengths = dict()\nfor boundary_type in [ 'begin', 'end', 'all' ]:\n    phone_lengths[boundary_type] = dict()\n    for p 
in phone_int2text.keys():\n        phone_lengths[boundary_type][p] = defaultdict(int)\n\n# total_phones is a dict from boundary_type to total count [of phone occurrences]\ntotal_phones = defaultdict(int)\n# total_frames is a dict from boundary_type to total number of frames.\ntotal_frames = defaultdict(int)\n\nwhile True:\n    line = sys.stdin.readline()\n    if line == '':\n        break\n    a = line.split()\n    if len(a) != 4:\n        sys.exit(\"analyze_phone_length_stats.py: reading stdin, could not interpret line: \" + line)\n    try:\n        count, boundary_type, phone, length = a\n        total_phones[boundary_type] += int(count)\n        total_frames[boundary_type] += int(count) * int(length)\n        phone_lengths[boundary_type][int(phone)][int(length)] += int(count)\n        if int(phone) in nonsilence:\n            nonsilence_phone = 0\n            phone_lengths[boundary_type][nonsilence_phone][int(length)] += int(count)\n    except Exception as e:\n        sys.exit(\"analyze_phone_length_stats.py: unexpected phone {0} \"\n                 \"seen (lang directory mismatch?): {1}\".format(phone, str(e)))\n\n# phone_lengths always has its three boundary-type keys, so check total_phones\n# (which is only populated from stdin) to detect empty input.\nif len(total_phones) == 0:\n    sys.exit(\"analyze_phone_length_stats.py: read no input\")\n\n# work out the optional-silence phone\ntry:\n    f = open(args.lang + \"/phones/optional_silence.int\", \"r\")\n    optional_silence_phone = int(f.readline())\n    optional_silence_phone_text = phone_int2text[optional_silence_phone]\n    f.close()\n    if optional_silence_phone in nonsilence:\n        print(u\"analyze_phone_length_stats.py: was expecting the optional-silence phone to \"\n              u\"be a member of the silence phones, it is not.  
This script won't work correctly.\")\nexcept:\n    largest_count = 0\n    optional_silence_phone = 1\n    for p in phone_int2text.keys():\n        if p > 0 and not p in nonsilence:\n            this_count = sum([ l * c for l,c in phone_lengths['all'][p].items() ])\n            if this_count > largest_count:\n                largest_count = this_count\n                optional_silence_phone = p\n    optional_silence_phone_text = phone_int2text[optional_silence_phone]\n    print(u\"analyze_phone_length_stats.py: could not get optional-silence phone from \"\n          u\"{0}/phones/optional_silence.int, guessing that it's {1} from the stats. \".format(\n            args.lang, optional_silence_phone_text))\n\n\n\n# If length_to_count is a map from length-in-frames to count,\n# return the length-in-frames that equals the (fraction * 100)'th\n# percentile of the distribution.\ndef GetPercentile(length_to_count, fraction):\n    total_phones = sum(length_to_count.values())\n    if total_phones == 0:\n        return 0\n    else:\n        items = sorted(length_to_count.items())\n        count_cutoff = int(fraction * total_phones)\n        cur_count_total = 0\n        for length,count in items:\n            assert count >= 0\n            cur_count_total += count\n            if cur_count_total >= count_cutoff:\n                return length\n        assert False # we shouldn't reach here.\n\ndef GetMean(length_to_count):\n    total_phones = sum(length_to_count.values())\n    if total_phones == 0:\n        return 0.0\n    total_frames = sum([ float(l * c) for l,c in length_to_count.items() ])\n    return total_frames / total_phones\n\n\n# Analyze frequency, median and mean of optional-silence at beginning and end of utterances.\n# The next block will print something like\n#  \"At utterance begin, SIL is seen 15.0% of the time; when seen, duration (median, mean) is (5, 7.6) frames.\"\n#  \"At utterance end, SIL is seen 14.6% of the time; when seen, duration (median, mean) is 
(4, 6.1) frames.\"\n\n\n# This block will print warnings if silence is seen less than 80% of the time at utterance\n# beginning and end.\nfor boundary_type in 'begin', 'end':\n    phone_to_lengths = phone_lengths[boundary_type]\n    num_utterances = total_phones[boundary_type]\n    assert num_utterances > 0\n    opt_sil_lengths = phone_to_lengths[optional_silence_phone]\n    frequency_percentage = sum(opt_sil_lengths.values()) * 100.0 / num_utterances\n    # The reason for this warning is that the tradition in speech recognition is\n    # to supply a little silence at the beginning and end of utterances... up to\n    # maybe half a second.  If your database is not like this, you should know;\n    # you may want to mess with the segmentation to add more silence.\n    if frequency_percentage < 80.0:\n        print(u\"analyze_phone_length_stats.py: WARNING: optional-silence {0} is seen only {1}% \"\n              u\"of the time at utterance {2}.  This may not be optimal.\".format(\n                optional_silence_phone_text, frequency_percentage, boundary_type))\n\n\n\n# this will control a sentence that we print..\nboundary_to_text = { }\nboundary_to_text['begin'] = 'At utterance begin'\nboundary_to_text['end'] = 'At utterance end'\nboundary_to_text['all'] = 'Overall'\n\n# the next block prints lines like (to give some examples):\n# At utterance begin, SIL accounts for 98.4% of phone occurrences, with duration (median, mean, 95-percentile) is (57,59.9,113) frames.\n# ...\n# At utterance end, nonsilence accounts for 4.2% of phone occurrences, with duration (median, mean, 95-percentile) is (13,13.3,22) frames.\n# ...\n# Overall, R_I accounts for 3.2% of phone occurrences, with duration (median, mean, 95-percentile) is (6,6.9,12) frames.\n\nfor boundary_type in 'begin', 'end', 'all':\n    phone_to_lengths = phone_lengths[boundary_type]\n    tot_num_phones = total_phones[boundary_type]\n    # sort the phones in decreasing order of count.\n    for phone,lengths in 
sorted(phone_to_lengths.items(), key = lambda x : -sum(x[1].values())):\n        frequency_percentage = sum(lengths.values()) * 100.0 / tot_num_phones\n        if frequency_percentage < args.frequency_cutoff_percentage:\n            continue\n\n        duration_median = GetPercentile(lengths, 0.5)\n        duration_percentile_95 = GetPercentile(lengths, 0.95)\n        duration_mean = GetMean(lengths)\n\n        text = boundary_to_text[boundary_type]  # e.g. 'At utterance begin'.\n        try:\n            phone_text = phone_int2text[phone]\n        except:\n            sys.exit(\"analyze_phone_length_stats.py: phone {0} is not covered on phones.txt \"\n                     \"(lang/alignment mismatch?)\".format(phone))\n        print(u\"{text}, {phone_text} accounts for {percent}% of phone occurrences, with \"\n              u\"duration (median, mean, 95-percentile) is ({median},{mean},{percentile95}) frames.\".format(\n                text = text, phone_text = phone_text,\n                percent = \"%.1f\" % frequency_percentage,\n                median = duration_median, mean = \"%.1f\" % duration_mean,\n                percentile95 = duration_percentile_95))\n\n\n## Print stats on frequency and average length of word-internal optional-silences.\n## For optional-silence only, subtract the begin and end-utterance stats from the 'all'\n## stats, to get the stats excluding initial and final phones.\ntotal_frames['internal'] = total_frames['all'] - total_frames['begin'] - total_frames['end']\ntotal_phones['internal'] = total_phones['all'] - total_phones['begin'] - total_phones['end']\n\ninternal_opt_sil_phone_lengths = dict(phone_lengths['all'][optional_silence_phone])\n# internal_opt_sil_phone_lenghts is a dict from length to count.\nfor length in list(internal_opt_sil_phone_lengths.keys()):\n    # subtract the counts for begin and end from the overall counts to get the\n    # word-internal count.\n    internal_opt_sil_phone_lengths[length] -= 
(phone_lengths['begin'][optional_silence_phone][length] +\n                                               phone_lengths['end'][optional_silence_phone][length])\n    if internal_opt_sil_phone_lengths[length] == 0:\n        del internal_opt_sil_phone_lengths[length]\n\nif total_phones['internal'] != 0.0:\n    total_internal_optsil_frames = sum([ float(l * c) for l,c in internal_opt_sil_phone_lengths.items() ])\n    total_optsil_frames = sum([ float(l * c)\n                                for l,c in phone_lengths['all'][optional_silence_phone].items() ])\n    opt_sil_internal_frame_percent = total_internal_optsil_frames * 100.0 / total_frames['internal']\n    opt_sil_total_frame_percent = total_optsil_frames * 100.0 / total_frames['all']\n    internal_frame_percent = total_frames['internal'] * 100.0 / total_frames['all']\n\n    print(u\"The optional-silence phone {0} occupies {1}% of frames overall \".format(\n            optional_silence_phone_text, \"%.1f\" % opt_sil_total_frame_percent))\n    hours_total = total_frames['all'] / 360000.0;\n    hours_nonsil = (total_frames['all'] - total_optsil_frames) / 360000.0\n    print(u\"Limiting the stats to the {0}% of frames not covered by an utterance-[begin/end] phone, \"\n          u\"optional-silence {1} occupies {2}% of frames.\".format(\"%.1f\" % internal_frame_percent,\n                                                                 optional_silence_phone_text,\n                                                                 \"%.1f\" % opt_sil_internal_frame_percent))\n    print(u\"Assuming 100 frames per second, the alignments represent {0} hours of data, \"\n          u\"or {1} hours if {2} frames are excluded.\".format(\n            \"%.1f\" % hours_total, \"%.1f\" % hours_nonsil, optional_silence_phone_text))\n\n    opt_sil_internal_phone_percent = (sum(internal_opt_sil_phone_lengths.values()) *\n                                      100.0 / total_phones['internal'])\n    duration_median = 
GetPercentile(internal_opt_sil_phone_lengths, 0.5)\n    duration_mean = GetMean(internal_opt_sil_phone_lengths)\n    duration_percentile_95 = GetPercentile(internal_opt_sil_phone_lengths, 0.95)\n    print(u\"Utterance-internal optional-silences {0} comprise {1}% of utterance-internal phones, with duration \"\n          u\"(median, mean, 95-percentile) = ({2},{3},{4})\".format(\n                optional_silence_phone_text, \"%.1f\" % opt_sil_internal_phone_percent,\n                duration_median, \"%0.1f\" % duration_mean, duration_percentile_95))\n"
  },
  {
    "path": "egs/steps/dict/apply_g2p.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2014  Johns Hopkins University (Author: Yenda Trmal)\n# Copyright 2016  Xiaohui Zhang\n# Apache 2.0\n\n# Begin configuration section.  \nstage=0\nencoding='utf-8'\nvar_counts=3  #Generate upto N variants\nvar_mass=0.9  #Generate so many variants to produce 90 % of the prob mass\ncmd=run.pl\nnj=10          #Split the task into several parallel, to speedup things\nmodel=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nset -u\nset -e\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <word-list> <g2p-model-dir> <output-dir>\"\n   echo \"... where <word-list> is a list of words whose pronunciation is to be generated\"\n   echo \"          <g2p-model-dir> is a directory used as a target during training of G2P\"\n   echo \"          <output-dir> is the directory where the output lexicon should be stored\"\n   echo \"e.g.: $0 oov_words exp/g2p exp/g2p/oov_lex\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --nj <int>                                    # How many tasks should be spawn (to speedup things)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\nwordlist=$1\nmodeldir=$2\noutput=$3\n\n\nmkdir -p $output/log\n\nmodel=$modeldir/g2p.model.final\n[ ! -f ${model:-} ] && echo \"File $model not found in the directory $modeldir.\" && exit 1\n#[ ! -x $wordlist ] && echo \"File $wordlist not found!\" && exit 1\n\ncp $wordlist $output/wordlist.txt\n\nif ! 
g2p=`which g2p.py` ; then\n  echo \"Sequitur was not found!\"\n  echo \"Go to $KALDI_ROOT/tools and execute extras/install_sequitur.sh\"\n  exit 1\nfi\n\necho \"Applying the G2P model to wordlist $wordlist\"\n\nif [ $stage -le 0 ]; then\n  $cmd JOBS=1:$nj $output/log/apply.JOBS.log \\\n    split -n l/JOBS/$nj $output/wordlist.txt \\| \\\n    g2p.py -V $var_mass --variants-number $var_counts --encoding $encoding \\\n      --model $modeldir/g2p.model.final --apply - \\\n    \\> $output/output.JOBS\nfi\ncat $output/output.* > $output/output\n\n# Remap the words from output file back to the original casing\n# Conversion of some of them might have failed, so we have to be careful\n# and use the transform_map file we generated beforehand\n# Also, because the sequitur output is not readily usable as lexicon (it adds \n# one more column with ordering of the pron. variants) convert it into the proper lexicon form\noutput_lex=$output/lexicon.lex\n\n# Just convert it to a proper lexicon format\ncut -f 1,3,4 $output/output > $output_lex\n\n# Some words might have been removed or skipped during the process,\n# let's check it and warn the user if so...\nnlex=`cut -f 1 $output_lex | sort -u | wc -l`\nnwlist=`cut -f 1 $output/wordlist.txt | sort -u | wc -l`\nif [ $nlex -ne $nwlist ] ; then\n  echo \"WARNING: Unable to generate pronunciation for all words. \";\n  echo \"WARNING:   Wordlist: $nwlist words\"\n  echo \"WARNING:   Lexicon : $nlex words\"\n  echo \"WARNING: Diff example: \"\n  diff <(cut -f 1 $output_lex | sort -u ) \\\n       <(cut -f 1 $output/wordlist.txt | sort -u ) || true\nfi\nexit 0\n"
  },
  {
    "path": "egs/steps/dict/apply_g2p_phonetisaurus.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2014  Johns Hopkins University (Author: Yenda Trmal)\n# Copyright 2016  Xiaohui Zhang\n#           2018  Ruizhe Huang\n# Apache 2.0\n\n# This script applies a trained Phonetisarus G2P model to\n# synthesize pronunciations for missing words (i.e., words in\n# transcripts but not the lexicon), and output the expanded lexicon.\n# The user could specify either nbest or pmass option \n# to determine the number of output pronunciation variants, \n# or use them together to get the intersection of two options.\n\n# Begin configuration section.  \nstage=0\nnbest=      # Generate up to N, like N=3, pronunciation variants for each word\n            # (The maximum size of the nbest list, not considering pruning and taking the prob-mass yet). \nthresh=5    # Pruning threshold for the n-best list, in (0, 99], which is a -log-probability value.\n            # A large threshold makes the nbest list shorter, and less likely to hit the max size.\n            # This value corresponds to the weight_threshold in shortest-path.h of openfst.\npmass=      # Select the top variants from the pruned nbest list,\n            # summing up to this total prob-mass for a word.\n            # On the \"boundary\", it's greedy by design, e.g. if pmass = 0.8,\n            # and we have prob(pron_1) = 0.5, and prob(pron_2) = 0.4, then we get both.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. utils/parse_options.sh || exit 1;\n\nset -u\nset -e\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [options] <word-list> <g2p-model-dir> <output-dir>\"\n  echo \"... 
where <word-list> is a list of words whose pronunciation is to be generated.\"\n  echo \"          <g2p-model-dir> is a directory used as a target during training of G2P\"\n  echo \"          <output-dir> is the directory where the output lexicon should be stored.\"\n  echo \"                       The format of the output lexicon output-dir/lexicon.lex is\" \n  echo \"                       <word>\\t<prob>\\t<pronunciation> per line.\"\n  echo \"e.g.: $0 --nbest 1 exp/g2p/oov_words.txt exp/g2p exp/g2p/oov_lex\"\n  echo \"\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --nbest <int>    # Generate upto N pronunciation variants for each word.\" \n  echo \"  --pmass <float>  # Select the top variants from the pruned nbest list,\" \n  echo \"                   # summing up to this total prob-mass, within [0, 1], for a word.\" \n  echo \"  --thresh <int>   # Pruning threshold for n-best.\"\n  exit 1;\nfi\n\nwordlist=$1\nmodeldir=$2\noutdir=$3\n\nmodel=$modeldir/model.fst\noutput_lex=$outdir/lexicon.lex\nmkdir -p $outdir\n\n[ ! -f ${model:-} ] && echo \"$0: File $model not found in the directory $modeldir.\" && exit 1\n[ ! -f $wordlist ] && echo \"$0: File $wordlist not found!\" && exit 1\n[ -z $pmass ] && [ -z $nbest ] && echo \"$0: nbest or/and pmass should be specified.\" && exit 1;\nif ! phonetisaurus=`which phonetisaurus-apply` ; then\n  echo \"Phonetisarus was not found !\"\n  echo \"Go to $KALDI_ROOT/tools and execute extras/install_phonetisaurus.sh\"\n  exit 1\nfi\n\ncp $wordlist $outdir/wordlist.txt\n\n# three options: 1) nbest, 2) pmass, 3) nbest+pmass,\nnbest=${nbest:-20}   # if nbest is not specified, set it to 20, due to Phonetisaurus mechanism\npmass=${pmass:-1.0}  # if pmass is not specified, set it to 1.0, due to Phonetisaurus mechanism\n\n[[ ! 
$nbest =~ ^[1-9][0-9]*$ ]] && echo \"$0: nbest should be a positive integer.\" && exit 1;\n\necho \"Applying the G2P model to wordlist $wordlist\"\nphonetisaurus-apply --pmass $pmass --nbest $nbest --thresh $thresh \\\n  --word_list $wordlist --model $model \\\n  --accumulate --verbose --prob \\\n  1>$output_lex\n\necho \"Completed. Synthesized lexicon for new words is in $output_lex\"\n\n# Some words might have been removed or skipped during the process,\n# let's check it and warn the user if so...\nnlex=`cut -f 1 $output_lex | sort -u | wc -l`\nnwlist=`cut -f 1 $wordlist | sort -u | wc -l`\nif [ $nlex -ne $nwlist ] ; then\n  failed_wordlist=$outdir/lexicon.failed\n  echo \"WARNING: Unable to generate pronunciation for all words. \";\n  echo \"WARNING:   Wordlist: $nwlist words\"\n  echo \"WARNING:   Lexicon : $nlex words\"\n  comm -13 <(cut -f 1 $output_lex | sort -u ) \\\n           <(cut -f 1 $wordlist | sort -u ) \\\n           >$failed_wordlist && echo \"WARNING: The list of failed words is in $failed_wordlist\"\nfi\nexit 0\n\n"
  },
  {
    "path": "egs/steps/dict/apply_lexicon_edits.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nimport argparse\nimport sys\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(description = \"Apply an lexicon edits file (output from steps/dict/select_prons_bayesian.py)to an input lexicon\"\n                                     \"to produce a learned lexicon.\",\n                                     epilog = \"See steps/dict/learn_lexicon_greedy.sh for example\")\n\n    parser.add_argument(\"in_lexicon\", metavar='<in-lexicon>', type = str,\n                        help = \"Input lexicon. Each line must be <word> <phones>.\")\n    parser.add_argument(\"lexicon_edits_file\", metavar='<lexicon-edits-file>', type = str,\n                        help = \"Input lexicon edits file containing human-readable & editable\"\n                               \"pronounciation info.  The info for each word is like:\"\n                         \"------------ an 4086.0 --------------\"\n                         \"R  | Y |  2401.6 |  AH N\"\n                         \"R  | Y |  640.8 |  AE N\"\n                         \"P  | Y |  1035.5 |  IH N\"\n                         \"R(ef), P(hone-decoding) represents the pronunciation source\"\n                         \"Y/N means the recommended decision of including this pron or not\"\n                         \"and the numbers are soft counts accumulated from lattice-align-word outputs. 
See steps/dict/select_prons_bayesian.py for more details.\")\n    parser.add_argument(\"out_lexicon\", metavar='<out-lexicon>', type = str,\n                        help = \"Output lexicon to this file.\")\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.in_lexicon == \"-\":\n        args.in_lexicon_handle = sys.stdin\n    else:\n        args.in_lexicon_handle = open(args.in_lexicon)\n    args.lexicon_edits_file_handle = open(args.lexicon_edits_file)\n\n    if args.out_lexicon == \"-\":\n        args.out_lexicon_handle = sys.stdout\n    else:\n        args.out_lexicon_handle = open(args.out_lexicon, \"w\")\n\n    return args\n\ndef ReadLexicon(lexicon_file_handle):\n    lexicon = set()\n    if lexicon_file_handle:\n        for line in lexicon_file_handle.readlines():\n            splits = line.strip().split()\n            if len(splits) == 0:\n                continue\n            if len(splits) < 2:\n                raise Exception('Invalid format of line ' + line\n                                    + ' in lexicon file.')\n            word = splits[0]\n            phones = ' '.join(splits[1:])\n            lexicon.add((word, phones))\n    return lexicon\n\ndef ApplyLexiconEdits(lexicon, lexicon_edits_file_handle):\n    if lexicon_edits_file_handle:\n        for line in lexicon_edits_file_handle.readlines():\n            # skip all commented lines\n            if line.startswith('#'):\n                continue\n            # read a word from a line like \"---- MICROPHONES 200.0 ----\".\n            if line.startswith('---'):\n                splits = line.strip().strip('-').strip().split()\n                if len(splits) != 2:\n                    print(splits, file=sys.stderr)\n                    raise Exception('Invalid format of line ' + line\n                                        + ' in lexicon edits file.')\n                word = 
splits[0].strip()\n            else:\n            # parse the pron and decision 'Y/N' of accepting the pron or not,\n            # from a line like: 'P  | Y |  42.0 |  M AY K R AH F OW N Z'\n                splits = line.split('|')\n                if len(splits) != 4:\n                    raise Exception('Invalid format of line ' + line\n                                        + ' in lexicon edits file.')\n                pron = splits[3].strip()\n                if splits[1].strip() == 'Y':\n                    lexicon.add((word, pron))\n                elif splits[1].strip() == 'N':\n                    lexicon.discard((word, pron))\n                else:\n                    raise Exception('Invalid format of line ' + line\n                                        + ' in lexicon edits file.')\n    return lexicon\n\n\ndef WriteLexicon(lexicon, out_lexicon_handle):\n    for word, pron in lexicon:\n        print('{0} {1}'.format(word, pron), file=out_lexicon_handle)\n    out_lexicon_handle.close()\n\ndef Main():\n    args = GetArgs()\n    lexicon = ReadLexicon(args.in_lexicon_handle)\n    ApplyLexiconEdits(lexicon, args.lexicon_edits_file_handle)\n    WriteLexicon(lexicon, args.out_lexicon_handle)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/get_pron_stats.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Xiaohui Zhang\n#           2016  Vimal Manohar\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom collections import defaultdict\nimport argparse\nimport sys\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description = \"Accumulate statistics from lattice-alignment outputs for lexicon\"\n        \"learning. The inputs are a file containing arc level information from lattice-align-words,\"\n        \"and a map which maps word-position-dependent phones to word-position-independent phones\"\n        \"(output from steps/cleanup/debug_lexicon.txt). The output contains accumulated soft-counts\"\n        \"of pronunciations\",\n        epilog = \"cat exp/tri3_lex_0.4_work/lats/arc_info_sym.*.txt \\\\|\"\n        \"  steps/dict/get_pron_stats.py - exp/tri3_lex_0.4_work/phone_decode/phone_map.txt \\\\\"\n        \"  exp/tri3_lex_0.4_work/lats/pron_stats.txt\"\n        \"See steps/dict/learn_lexicon_greedy.sh for examples in detail.\")\n\n    parser.add_argument(\"arc_info_file\", metavar = \"<arc-info-file>\", type = str,\n                        help = \"Input file containing per arc statistics; \"\n                        \"each line must be <counts> <word> <phones>\")\n    parser.add_argument(\"phone_map\", metavar = \"<phone-map>\", type = str,\n                        help = \"An input phone map used to remove word boundary markers from phones;\"\n                        \"generated in steps/cleanup/debug_lexicon.sh\")\n    parser.add_argument(\"stats_file\", metavar = \"<stats_file>\", type = str,\n                        help = \"Write accumulated statitistics to this file;\"\n                        \"each line is <count> <word> <phones>\")\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.arc_info_file == \"-\":\n        args.arc_info_file_handle = sys.stdin\n 
   else:\n        args.arc_info_file_handle = open(args.arc_info_file)\n    args.phone_map_handle = open(args.phone_map)\n\n    if args.stats_file == \"-\":\n        args.stats_file_handle = sys.stdout\n    else:\n        args.stats_file_handle = open(args.stats_file, \"w\")\n\n    return args\n\n\ndef GetStatsFromArcInfo(arc_info_file_handle, phone_map_handle):\n    prons = defaultdict(set)\n    # need to map the phones to remove word boundary markers.\n    phone_map = {}\n    stats_unmapped = {} \n    stats = {} \n    for line in phone_map_handle.readlines():\n        splits = line.strip().split()\n        phone_map[splits[0]] = splits[1]\n\n    for line in arc_info_file_handle.readlines():\n        splits = line.strip().split()\n        if (len(splits) == 0):\n            continue\n        if (len(splits) < 6):\n            raise Exception('Invalid format of line ' + line\n                                + ' in arc_info_file')\n        word = splits[4]\n        count = float(splits[3])\n        phones = \" \".join(splits[5:])        \n        prons[word].add(phones)\n        stats_unmapped[(word, phones)] = stats_unmapped.get((word, phones), 0) + count\n     \n    for word_pron, count in stats_unmapped.items():\n        phones_unmapped = word_pron[1].split()\n        phones = [phone_map[phone] for phone in phones_unmapped]\n        stats[(word_pron[0], \" \".join(phones))] = count\n    return stats\n\ndef WriteStats(stats, file_handle):\n    for word_pron, count in stats.items():\n        print('{2} {0} {1}'.format(word_pron[0], word_pron[1], count),\n              file=file_handle)\n    file_handle.close()\n\ndef Main():\n    args = GetArgs()\n    stats = GetStatsFromArcInfo(args.arc_info_file_handle, args.phone_map_handle)\n    WriteStats(stats, args.stats_file_handle)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/internal/get_subsegments.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2018 Xiaohui Zhang\n# Apache 2.0.\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nimport argparse\nimport sys\nimport string\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description = \"The purpose of this script is to use a ctm and a vocab file\"\n        \"to extract sub-utterances and a sub-segmentation. Extracted sub-utterances\"\n        \"are all the strings of consecutive in-vocab words from the ctm\"\n        \"surrounded by an out-of-vocab word at each end if present.\",\n        epilog = \"e.g. steps/dict/internal/get_subsegments.py exp/tri3_lex_0.4_work/phonetic_decoding/word.ctm \\\\\"\n        \"exp/tri3_lex_0.4_work/learn_vocab.txt exp/tri3_lex_0.4_work/resegmentation/subsegments \\\\\"\n        \"exp/tri3_lex_0.4_work/resegmentation/text\"\n        \"See steps/dict/learn_lexicon_greedy.sh for an example.\")\n\n    parser.add_argument(\"ctm\", metavar='<ctm>', type = str,\n                        help = \"Input ctm file.\"\n                        \"each line must be <utt-id> <chanel> <start-time> <duration> <word>\")\n    parser.add_argument(\"vocab\", metavar='<vocab>', type = str,\n                        help = \"Vocab file.\"\n                        \"each line must be <word>\")\n    parser.add_argument(\"subsegment\", metavar='<subsegtment>', type = str,\n                        help = \"Subsegment file. Each line is in format:\"\n                        \"<new-utt> <old-utt> <start-time-within-old-utt> <end-time-within-old-utt>\")\n    parser.add_argument(\"text\", metavar='<text>', type = str,\n                        help = \"Text file. Each line is in format:\"\n                        \" <new-utt> <word1> <word2> ... 
<wordN>.\")\n  \n    print (' '.join(sys.argv), file = sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.ctm == \"-\":\n        args.ctm_handle = sys.stdin\n    else:\n        args.ctm_handle = open(args.ctm)\n\n    if args.vocab != '':\n        if args.vocab == \"-\":\n            args.vocab_handle = sys.stdin\n        else:\n            args.vocab_handle = open(args.vocab)\n\n    args.subsegment_handle = open(args.subsegment, 'w')\n    args.text_handle = open(args.text, 'w')\n\n    return args\n\ndef GetSubsegments(args, vocab):\n    sub_utt = list()\n    is_oov_last = False\n    is_oov = False\n    utt_id_last = None\n    start_times = {}\n    end_times = {}\n    sub_utts = {}\n    sub_utt_id = 1\n    sub_utt_id_last = 1\n    end_time_last = 0.0\n    for line in args.ctm_handle:\n        splits = line.strip().split()\n        if len(splits) < 5:\n            raise Exception(\"problematic line\",line)\n\n        utt_id = splits[0]\n        start = float(splits[2])\n        dur = float(splits[3])\n        word = splits[4]\n        if utt_id != utt_id_last:\n            sub_utt_id = 1\n            if len(sub_utt)>1:\n                sub_utts[utt_id_last+'-'+str(sub_utt_id_last)] = (utt_id_last, sub_utt)\n                end_times[utt_id_last+'-'+str(sub_utt_id_last)] = end_time_last\n            sub_utt = []\n            start_times[utt_id+'-'+str(sub_utt_id)] = start\n            is_oov_last = False\n        if word == '<eps>':\n            is_oov = True\n            end_times[utt_id+'-'+str(sub_utt_id)] = start + dur\n        elif word in vocab:\n            is_oov = True\n            sub_utt.append(word)\n            end_times[utt_id+'-'+str(sub_utt_id)] = start + dur\n        else:\n            is_oov = False\n            if is_oov_last == True:\n                sub_utt.append(word)\n                sub_utts[utt_id+'-'+str(sub_utt_id_last)] = (utt_id, sub_utt)\n                
end_times[utt_id+'-'+str(sub_utt_id_last)] = start + dur\n                sub_utt_id += 1\n            sub_utt = [word]\n            start_times[utt_id+'-'+str(sub_utt_id)] = start\n        utt_id_last = utt_id\n        sub_utt_id_last = sub_utt_id\n        is_oov_last = is_oov\n        end_time_last = start + dur\n        \n    if is_oov:\n        if word != '<eps>':\n            sub_utt.append(word)\n        sub_utts[utt_id+'-'+str(sub_utt_id_last)] = (utt_id, sub_utt)\n        end_times[utt_id+'-'+str(sub_utt_id_last)] = start + dur\n\n    for utt,v in sorted(sub_utts.items()):\n        print(utt, ' '.join(sub_utts[utt][1]), file=args.text_handle)\n        print(utt, sub_utts[utt][0], start_times[utt], end_times[utt], file=args.subsegment_handle)\n\ndef ReadVocab(vocab_file_handle):\n    vocab = set()\n    if vocab_file_handle:\n        for line in vocab_file_handle.readlines():\n            splits = line.strip().split()\n            if len(splits) == 0:\n                continue\n            if len(splits) > 1:\n                raise Exception('Invalid format of line ' + line\n                                    + ' in vocab file.')\n            word = splits[0]\n            vocab.add(word)\n    return vocab\n\ndef Main():\n    args = GetArgs()\n\n    vocab = ReadVocab(args.vocab_handle)\n    GetSubsegments(args, vocab)\n   \nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/internal/prune_pron_candidates.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2018  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom collections import defaultdict\nimport argparse\nimport sys\nimport math\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description = \"Prune pronunciation candidates based on soft-counts from lattice-alignment\"\n        \"outputs, and a reference lexicon. Basically, for each word we sort all pronunciation\"\n        \"cadidates according to their soft-counts, and then select the top variant-counts-ratio * N candidates\"\n        \"(For words in the reference lexicon, N = # pron variants given by the reference\"\n        \"lexicon; For oov words, N = avg. # pron variants per word in the reference lexicon).\",\n        epilog = \"See steps/dict/learn_lexicon_greedy.sh for example\")\n\n    parser.add_argument(\"--variant-counts-ratio\", type = float, default = \"3.0\",\n                        help = \"A user-specified ratio parameter which determines how many\"\n                        \"pronunciation candidates we want to keep for each word at most.\")\n    parser.add_argument(\"pron_stats\", metavar = \"<pron-stats>\", type = str,\n                        help = \"File containing soft-counts of pronounciation candidates; \"\n                        \"each line must be <soft-counts> <word> <phones>\")\n    parser.add_argument(\"lexicon_phonetic_decoding\", metavar = \"<lexicon-phonetic-decoding>\", type = str,\n                        help = \"Lexicon containing pronunciation candidates from phonetic decoding.\"\n                        \"each line must be <word> <phones>\")\n    parser.add_argument(\"lexiconp_g2p\", metavar = \"<lexiconp-g2p>\", type = str,\n                        help = \"Lexicon with probabilities for pronunciation candidates from G2P.\"\n                        \"each line must be <prob> <word> <phones>\")\n    parser.add_argument(\"ref_lexicon\", metavar = \"<ref-lexicon>\", type = str,\n  
                      help = \"Reference lexicon file, where we obtain # pron variants for\"\n                        \"each word, based on which we prune the pron candidates.\"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"lexicon_phonetic_decoding_pruned\", metavar = \"<lexicon-phonetic-decoding-pruned>\", type = str,\n                        help = \"Output lexicon containing pronunciation candidates from phonetic decoding after pruning.\"\n                        \"each line must be <word> <phones>\")\n    parser.add_argument(\"lexicon_g2p_pruned\", metavar = \"<lexicon-g2p-pruned>\", type = str,\n                        help = \"Output lexicon containing pronunciation candidates from G2P after pruning.\"\n                        \"each line must be <word> <phones>\")\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    print(args)\n    args.pron_stats_handle = open(args.pron_stats)\n    args.lexicon_phonetic_decoding_handle = open(args.lexicon_phonetic_decoding)\n    args.lexiconp_g2p_handle = open(args.lexiconp_g2p)\n    args.ref_lexicon_handle = open(args.ref_lexicon)\n    args.lexicon_phonetic_decoding_pruned_handle = open(args.lexicon_phonetic_decoding_pruned, \"w\")\n    args.lexicon_g2p_pruned_handle = open(args.lexicon_g2p_pruned, \"w\")\n    return args\n\ndef ReadStats(pron_stats_handle):\n    stats = defaultdict(list)\n    for line in pron_stats_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in stats file.')\n        count = float(splits[0])\n        word = splits[1]\n        phones = ' '.join(splits[2:])\n        stats[word].append((phones, count))\n\n    return stats\n\ndef ReadLexicon(lexicon_handle):\n    
lexicon = defaultdict(set)\n    for line in lexicon_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in lexicon file.')\n        word = splits[0]\n        phones = ' '.join(splits[1:])\n        lexicon[word].add(phones)\n    return lexicon\n\ndef ReadLexiconp(lexiconp_handle):\n    lexicon = defaultdict(set)\n    pron_probs = defaultdict(float)\n    for line in lexiconp_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 3:\n            raise Exception('Invalid format of line ' + line\n                                + ' in lexicon file.')\n        word = splits[1]\n        prob = float(splits[0])\n        phones = ' '.join(splits[2:])\n        pron_probs[(word, phones)] = prob\n        lexicon[word].add(phones)\n    return lexicon, pron_probs\n\ndef PruneProns(args, stats, ref_lexicon, lexicon_phonetic_decoding, lexicon_g2p, lexicon_g2p_probs):\n    # For those pron candidates from G2P which don't have stats, we append them to\n    # the \"stats\" dict with a non-positive score (G2P prob - 1.0), so that they\n    # rank below candidates that have soft-counts.\n    # Note: .items() is used instead of .iteritems() so this runs on Python 2 and 3.\n    for word, entry in stats.items():\n        prons_with_stats = set()\n        for (pron, count) in entry:\n            prons_with_stats.add(pron)\n        for pron in lexicon_g2p[word]:\n            if pron not in prons_with_stats:\n                entry.append((pron, lexicon_g2p_probs[(word, pron)]-1.0))\n        entry.sort(key=lambda x: x[1])\n    \n    # Compute the average # pron variants counts per word in the reference lexicon.\n    num_words_ref = 0\n    num_prons_ref = 0\n    for word, prons in ref_lexicon.items():\n        num_words_ref += 1\n        num_prons_ref += len(prons)\n    avg_variant_counts_ref = round(float(num_prons_ref) / float(num_words_ref))\n    for word, entry in 
stats.items():\n        if word in ref_lexicon:\n            variant_counts = args.variant_counts_ratio * len(ref_lexicon[word])\n        else:\n            variant_counts = args.variant_counts_ratio * avg_variant_counts_ref\n        num_variants = 0\n        count = 0.0\n        while num_variants < variant_counts:\n            try:\n                pron, count = entry.pop()\n                if word in ref_lexicon and pron in ref_lexicon[word]:\n                    continue\n                if pron in lexicon_phonetic_decoding[word]:\n                    num_variants += 1\n                    print('{0} {1}'.format(word, pron), file=args.lexicon_phonetic_decoding_pruned_handle)\n                if pron in lexicon_g2p[word]:\n                    num_variants += 1\n                    print('{0} {1}'.format(word, pron), file=args.lexicon_g2p_pruned_handle)\n            except IndexError:\n                break\n\ndef Main():\n    args = GetArgs()\n    ref_lexicon = ReadLexicon(args.ref_lexicon_handle)\n    lexicon_phonetic_decoding = ReadLexicon(args.lexicon_phonetic_decoding_handle)\n    lexicon_g2p, lexicon_g2p_probs = ReadLexiconp(args.lexiconp_g2p_handle)\n    stats = ReadStats(args.pron_stats_handle)\n\n    PruneProns(args, stats, ref_lexicon, lexicon_phonetic_decoding, lexicon_g2p, lexicon_g2p_probs)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/internal/sum_arc_info.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2018   Xiaohui Zhang\n# Apache 2.0\n\nfrom __future__ import print_function\nfrom collections import defaultdict\nimport argparse\nimport sys\n\nclass StrToBoolAction(argparse.Action):\n    \"\"\" A custom action to convert bools from shell format i.e., true/false\n        to python format i.e., True/False \"\"\"\n    def __call__(self, parser, namespace, values, option_string=None):\n        if values == \"true\":\n            setattr(namespace, self.dest, True)\n        elif values == \"false\":\n            setattr(namespace, self.dest, False)\n        else:\n            raise Exception(\"Unknown value {0} for --{1}\".format(values, self.dest))\n\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description = \"Accumulate statistics from per arc lattice statitics\"\n        \"for lexicon learning\",\n        epilog = \"See steps/dict/learn_lexicon_greedy.sh for example\")\n\n    parser.add_argument(\"--set-sum-to-one\", type = str, default = True,\n                        action = StrToBoolAction, choices = [\"true\", \"false\"],\n                        help = \"If normalize posteriors such that the sum of \"\n                        \"pronunciation posteriors of a word in an utterance is 1.\")\n    parser.add_argument(\"arc_info_file\", metavar = \"<arc-info-file>\", type = str,\n                        help = \"File containing per arc statistics; \"\n                        \"each line must be <utt-id> <word> <start-frame> <duration> <posterior>\"\n                        \"<phones-with-word-boundary-markers>\")\n    parser.add_argument(\"phone_map\", metavar = \"<phone-map>\", type = str,\n                        help = \"An input phone map used to remove word boundary markers from phones;\"\n                        \"generated in steps/cleanup/debug_lexicon.sh\")\n    parser.add_argument(\"stats_file\", metavar = \"<out-stats-file>\", type = str,\n                        help = \"Write 
accumulated statitistics to this file\"\n                        \"each line is <utt-id> <word> <start-frame> <posterior>\"\n                        \"<phones-without-word-boundary-markers>\")\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.arc_info_file == \"-\":\n        args.arc_info_file_handle = sys.stdin\n    else:\n        args.arc_info_file_handle = open(args.arc_info_file)\n    \n    args.phone_map_handle = open(args.phone_map)\n\n    if args.stats_file == \"-\":\n        args.stats_file_handle = sys.stdout\n    else:\n        args.stats_file_handle = open(args.stats_file, \"w\")\n\n    return args\n\ndef Main():\n    args = GetArgs()\n\n    lexicon = defaultdict(list)\n    prons = defaultdict(list)\n    start_frames = {}\n    stats = defaultdict(lambda : defaultdict(float))\n    sum_tot = defaultdict(float)\n\n    phone_map = {}\n    for line in args.phone_map_handle.readlines():\n        splits = line.strip().split()\n        phone_map[splits[0]] = splits[1]\n\n    for line in args.arc_info_file_handle.readlines():\n        splits = line.strip().split()\n\n        if (len(splits) == 0):\n            continue\n\n        if (len(splits) < 6):\n            raise Exception('Invalid format of line ' + line\n                                + ' in ' + args.arc_info_file)\n\n        utt = splits[0]\n        start_frame = int(splits[1])\n        word = splits[4]\n        count = float(splits[3])\n        phones_unmapped = splits[5:]   \n        phones = [phone_map[phone] for phone in phones_unmapped]\n        phones = ' '.join(phones)\n        overlap = False\n        if word == '<eps>':\n            continue\n        if (word, utt) not in start_frames:\n            start_frames[(word, utt)] = start_frame\n\n        if (word, utt) in stats:\n            stats[word, utt][phones] = stats[word, utt].get(phones, 0) + count\n        else:\n          
  stats[(word, utt)][phones] = count\n        sum_tot[(word, utt)] += count\n\n        if phones not in prons[word]:\n            prons[word].append(phones)\n\n    for (word, utt) in stats:\n       count_sum = 0.0\n       counts = dict()\n       for phones in stats[(word, utt)]:\n           count = stats[(word, utt)][phones]\n           count_sum += count\n           counts[phones] = count\n       # By default we normalize the pron posteriors of each word in each utterance,\n       # so that they sum up exactly to one. If a word occurs two times in a utterance,\n       # the effect of this operation is to average the posteriors of these two occurences\n       # so that there's only one \"equivalent occurence\" of this word in the utterance.\n       # However, this case should be extremely rare if the utterances are already\n       # short sub-utterances produced by steps/dict/internal/get_subsegments.py\n       for phones in stats[(word, utt)]:\n           count = counts[phones] / count_sum\n           print(word, utt, start_frames[(word, utt)], count, phones, file=args.stats_file_handle)\n       # # Diagnostics info implying incomplete arc_info or multiple occurences of a word in a utterance:\n       # if count_sum < 0.9 or count_sum > 1.1:\n       #    print(word, utt, start_frame, count_sum, stats[word, utt], file=sys.stderr)\n\n    args.stats_file_handle.close()\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/learn_lexicon_bayesian.sh",
    "content": "#! /bin/bash\n\n# Copyright 2016  Xiaohui Zhang\n#           2016  Vimal Manohar\n# Apache 2.0\n\n# This script demonstrate how to expand a existing lexicon using a combination\n# of acoustic evidence and G2P to learn a lexicon that covers words in a target \n# vocab, and agrees sufficiently with the acoustics. The basic idea is to \n# run phonetic decoding on acoustic training data using an existing\n# acoustice model (possibly re-trained using a G2P-expanded lexicon) to get \n# alternative pronunciations for words in training data. Then we combine three\n# exclusive sources of pronunciations: the reference lexicon (supposedly \n# hand-derived), phonetic decoding, and G2P (optional) into one lexicon and then run \n# lattice alignment on the same data, to collect acoustic evidence (soft\n# counts) of all pronunciations. Based on these statistics, and\n# user-specified prior-counts (parameterized by prior mean and prior-counts-tot,\n# assuming the prior follows a Dirichlet distribution), we then use a Bayesian\n# framework to compute posteriors of all pronunciations for each word,\n# and then select best pronunciations for each word. The output is a final learned lexicon\n# whose vocab matches the user-specified target-vocab, and two intermediate resultis:\n# an edits file which records the recommended changes to all in-ref-vocab words'\n# prons, and a half-learned lexicon where all in-ref-vocab words' prons were untouched\n# (on top of which we apply the edits file to produce the final learned lexicon).\n# The user can always modify the edits file manually and then re-apply it on the\n# half-learned lexicon using steps/dict/apply_lexicon_edits to produce the final\n# learned lexicon. See the last stage in this script for details.\n\n\n# Begin configuration section.  
\ncmd=run.pl\nnj=4\nstage=0\n\noov_symbol=\nlexicon_g2p=\n\nmin_prob=0.3\nvariant_counts_ratio=8 \nvariants_prob_mass=0.7\nvariants_prob_mass_ref=0.9\n\nprior_counts_tot=15\nprior_mean=\"0.7,0.2,0.1\"\nnum_gauss=\nnum_leaves=\nretrain_src_mdl=false\n\ncleanup=true\n# End configuration section.  \n\n. ./path.sh\n. utils/parse_options.sh\n\nif [ $# -lt 6 ] || [ $# -gt 7 ]; then\n  echo \"Usage: $0 [options] <ref-dict> <target-vocab> <data> \\\\\"\n  echo \"                    <src-mdl-dir> <ref-lang> <dest-dict> [ <tmp-dir> ]\"\n  echo \"e.g.: $0 --oov-symbol \\\"<UNK>\\\" data/local/dict data/local/lm/librispeech-vocab.txt data/train \\\\\"\n  echo \"                               exp/tri3 data/lang data/local/dict_learned\"\n  echo \"\" \n  echo \"  This script does lexicon expansion using a combination of acoustic\"\n  echo \"  evidence and G2P to produce a lexicon that covers words of a target vocab:\"\n  echo \"\"               \n  echo \"Arguments:\"\n  echo \" <ref-dict>     the dir which contains the reference lexicon (most probably hand-derived)\"\n  echo \"                we want to expand/improve, and nonsilence_phones.txt, etc., which we need \" \n  echo \"                for building new dict dirs.\"\n  echo \" <target-vocab> the vocabulary we want the final learned lexicon to cover (one word per line).\"\n  echo \" <data>         acoustic training data we use to get alternative\"\n  echo \"                pronunciations and collect acoustic evidence.\"\n  echo \" <src-mdl-dir>  The dir containing an SAT-GMM acoustic model (we optionally re-train it\" \n  echo \"                using G2P expanded lexicon) to do phonetic decoding (to get alternative\"\n  echo \"                pronunciations) and lattice-alignment (to collect acoustic evidence for\"\n  echo \"                evaluating all pronunciations)\"\n  echo \" <ref-lang>     the reference lang dir which we use to get non-scored-words\"\n  echo \"                like <UNK> for building new dict 
dirs\"\n  echo \" <dest-dict>    the dict dir where we put the final learned lexicon, whose vocab\"\n  echo \"                matches <target-vocab>\"\n  echo \" [ <tmp-dir> ]  the temporary dir where most of the intermediate outputs are stored\"\n  echo \"                (default: \\${src-mdl-dir}_lex_learn_work)\"\n  echo \"\"\n  echo \"Note: <target-vocab> and the vocab of <data> don't have to match. For words\"\n  echo \"     who are in <target-vocab> but not seen in <data>, their pronunciations\" \n  echo \"     will be given by G2P at the end.\"\n  echo \"\"\n  echo \"Options:\"\n  echo \"  --stage <n>                  # stage to run from, to enable resuming from partially\"\n  echo \"                               # completed run (default: 0)\"\n  echo \"  --cmd '$cmd'                 # command to submit jobs with (e.g. run.pl, queue.pl)\"\n  echo \"  --nj <nj>                    # number of parallel jobs\"\n  echo \"  --oov-symbol <unk_symbol>    # (required option) oov symbol, like <UNK>.\"\n  echo \"  --lexicon-g2p                # A lexicon file containing g2p generated pronunciations, for words in acoustic training \"\n  echo \"                               # data / target vocabulary. It's optional.\"\n  echo \"  --min-prob <float>           # The cut-off parameter used to select pronunciation candidates from phonetic\"\n  echo \"                               # decoding. We remove pronunciations with probabilities less than this value\"\n  echo \"                               # after normalizing the probs s.t. the max-prob is 1.0 for each word.\"\n  echo \"  --variant-counts-ratio <int> # This ratio parameter determines the maximum number of pronunciation\"\n  echo \"                               # candidates we will keep for each word, after pruning according to lattice statistics from\"\n  echo \"                               # the first iteration of lattice generation. 
See steps/dict/internal/prune_pron_candidates.py\"\n  echo \"                               # for details.\"\n  echo \"  --prior-mean                 # Mean of priors (summing up to 1) assigned to three exclusive pronunciation\"\n  echo \"         <float,float,float>   # sources: reference lexicon, g2p, and phonetic decoding (used in the Bayesian\"\n  echo \"                               # pronunciation selection procedure). We recommend setting a larger prior\"\n  echo \"                               # mean for the reference lexicon, e.g. '0.6,0.2,0.2'.\"\n  echo \"  --prior-counts-tot <float>   # Total amount of prior counts we add to all pronunciation candidates of\"\n  echo \"                               # each word. By multiplying it by the prior mean of a source, and then dividing\"\n  echo \"                               # by the number of candidates (for a word) from this source, we get the\"\n  echo \"                               # prior counts we actually add to each candidate.\"\n  echo \"  --variants-prob-mass <float> # In the Bayesian pronunciation selection procedure, for each word, we\"\n  echo \"                               # choose candidates (from all three sources) with highest posteriors\"\n  echo \"                               # until the total prob mass hits this amount.\"\n  echo \"                               # It's used in a similar fashion when we apply G2P.\"\n  echo \"  --variants-prob-mass-ref     # In the Bayesian pronunciation selection procedure, for each word,\"\n  echo \"                               # after the total prob mass of selected candidates hits variants-prob-mass,\"\n  echo \"                               # we continue to pick up reference candidates with highest posteriors\"\n  echo \"                               # until the total prob mass hits this amount (must be >= variants-prob-mass).\"\n  echo \"  --num-gauss                  # number of gaussians for the re-trained SAT model (on top of <src-mdl-dir>).\"    
        \n  echo \"  --num-leaves                 # number of leaves for the re-trained SAT model (on top of <src-mdl-dir>).\" \n  echo \"  --retrain-src-mdl            # true if you want to re-train the src_mdl before phone decoding (default false).\"\n  exit 1\nfi\n\necho \"$0 $@\"  # Print the command line for logging\n\nref_dict=$1\ntarget_vocab=$2\ndata=$3\nsrc_mdl_dir=$4\nref_lang=$5\ndest_dict=$6\n\nif [ -z \"$oov_symbol\" ]; then\n   echo \"$0: the --oov-symbol option is required.\"\n   exit 1\nfi\n\nif [ $# -gt 6 ]; then\n  dir=$7 \nelse\n  dir=${src_mdl_dir}_lex_learn_work\nfi\n\nmkdir -p $dir\n\nif [ $stage -le 0 ]; then\n  echo \"$0: Some preparatory work.\"\n  # Get the word counts of training data.\n  awk '{for (n=2;n<=NF;n++) counts[$n]++;} END{for (w in counts) printf \"%s %d\\n\",w, counts[w];}' \\\n    $data/text | sort > $dir/train_counts.txt\n  \n  # Get the non-scored entries and exclude them from the reference lexicon/vocab, and target_vocab.\n  steps/cleanup/internal/get_non_scored_words.py $ref_lang > $dir/non_scored_words\n  awk 'NR==FNR{a[$1] = 1; next} {if($1 in a) print $0}' $dir/non_scored_words \\\n    $ref_dict/lexicon.txt > $dir/non_scored_entries \n\n  # Remove non-scored-words from the reference lexicon.\n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $0}' $dir/non_scored_words \\\n    $ref_dict/lexicon.txt | tr -s '\\t' ' ' | awk '$1=$1' > $dir/ref_lexicon.txt\n\n  cat $dir/ref_lexicon.txt | awk '{print $1}' | sort | uniq > $dir/ref_vocab.txt\n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $0}' $dir/non_scored_words \\\n    $target_vocab | sort | uniq > $dir/target_vocab.txt\n    \n  # From the reference lexicon, we estimate the target_num_prons_per_word as,\n  # round(avg. # prons per word in the reference lexicon). 
This'll be used as \n  # the upper bound of # pron variants per word when we apply G2P or select prons to\n  # construct the learned lexicon in later stages.\n  python -c 'import sys; import math; print(int(round(float(sys.argv[1])/float(sys.argv[2]))))' \\\n    `wc -l $dir/ref_lexicon.txt | awk '{print $1}'` `wc -l $dir/ref_vocab.txt | awk '{print $1}'` \\\n    > $dir/target_num_prons_per_word || exit 1;\n\n  if [ -z \"$lexicon_g2p\" ]; then\n    # create an empty list of g2p generated prons, if it's not given.\n    touch $dir/lexicon_g2p.txt\n  else\n    cat $lexicon_g2p | awk '{if (NF<2) {print \"There is an empty pronunciation in lexicon_g2p.txt. Exit.\" \\\n      > \"/dev/stderr\"; exit 1} print $0}' - > $dir/lexicon_g2p.txt || exit 1;\n  fi\nfi\n\nif [ $stage -le 1 ] && $retrain_src_mdl; then\n  echo \"$0: Expand the reference lexicon to cover all words in the target vocab, and then\"\n  echo \"   ... re-train the source acoustic model for phonetic decoding. \"\n  mkdir -p $dir/dict_expanded_target_vocab\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_expanded_target_vocab  2>/dev/null\n  rm $dir/dict_expanded_target_vocab/lexiconp.txt $dir/dict_expanded_target_vocab/lexicon.txt 2>/dev/null\n  \n  # Get the oov words list (w.r.t ref vocab) which are in the target vocab. \n  awk 'NR==FNR{a[$1] = 1; next} !($1 in a)' $dir/ref_lexicon.txt \\\n    $dir/target_vocab.txt | sort | uniq > $dir/oov_target_vocab.txt\n\n  # Assign pronunciations from lexicon_g2p.txt to oov_target_vocab. 
For words which\n  # cannot be found in lexicon_g2p.txt, we simply ignore them.\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/oov_target_vocab.txt \\\n    $dir/lexicon_g2p.txt > $dir/lexicon_g2p_oov_target_vocab.txt\n  \n  cat $dir/lexicon_g2p_oov_target_vocab.txt $dir/ref_lexicon.txt | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/target_vocab.txt - | \\\n    cat $dir/non_scored_entries - | \n    sort | uniq > $dir/dict_expanded_target_vocab/lexicon.txt\n   \n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt $dir/dict_expanded_target_vocab \\\n    \"$oov_symbol\" $dir/lang_expanded_target_vocab_tmp $dir/lang_expanded_target_vocab || exit 1;\n  \n  # Align the acoustic training data using the given src_mdl_dir.\n  alidir=${src_mdl_dir}_ali_$(basename $data) \n  steps/align_fmllr.sh --nj $nj --cmd \"$train_cmd\" \\\n    $data $dir/lang_expanded_target_vocab $src_mdl_dir $alidir || exit 1;\n\n  # Train another SAT system on the given data and put it in $dir/${src_mdl_dir}_retrained\n  # this model will be used for phonetic decoding and lattice alignment later on.\n  if [ -z $num_leaves ] || [ -z $num_gauss ] ; then\n    # infer the model parameters using the inital GMM\n    num_leaves=`gmm-info ${src_mdl_dir}/final.mdl  | grep 'pdfs' | awk '{print $NF-1}'`\n    num_gauss=`gmm-info ${src_mdl_dir}/final.mdl  | grep 'gaussians' | awk '{print $NF-1}'`\n  fi\n  steps/train_sat.sh --cmd \"$train_cmd\" $num_leaves $num_gauss \\\n    $data $dir/lang_expanded_target_vocab $alidir $dir/${src_mdl_dir}_retrained || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Expand the reference lexicon to cover all words seen in,\"\n  echo \"  ... acoustic training data, and prepare corresponding dict and lang directories.\"\n  echo \"  ... 
This is needed when generate pron candidates from phonetic decoding.\"\n  mkdir -p $dir/dict_expanded_train\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_expanded_train 2>/dev/null\n  rm $dir/dict_expanded_train/lexiconp.txt $dir/dict_expanded_train/lexicon.txt 2>/dev/null\n\n  # Get the oov words list (w.r.t ref vocab) which are in training data. \n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $1}' $dir/ref_lexicon.txt \\\n    $dir/train_counts.txt | awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $0}' \\\n    $dir/non_scored_words - | sort > $dir/oov_train.txt || exit 1; \n  \n  awk 'NR==FNR{a[$1] = 1; next} {if(($1 in a)) b+=$2; else c+=$2} END{print c/(b+c)}' \\\n    $dir/ref_vocab.txt $dir/train_counts.txt > $dir/train_oov_rate || exit 1;\n  \n  echo \"OOV rate (w.r.t. the reference lexicon) of the acoustic training data is:\"\n  cat $dir/train_oov_rate\n\n  # Assign pronunciations from lexicon_g2p to oov_train. 
For words which\n  # cannot be found in lexicon_g2p, we simply assign oov_symbol's pronunciation\n  # (like NSN) to them, in order to get phonetic decoding pron candidates for them later on.\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/oov_train.txt \\\n    $dir/lexicon_g2p.txt > $dir/g2p_prons_for_oov_train.txt || exit 1;\n  \n  # Get the pronunciation of oov_symbol.\n  oov_pron=`cat $dir/non_scored_entries | grep $oov_symbol | awk '{print $2}'`\n  # For oov words in training data for which we don't even have G2P pron candidates,\n  # we simply assign them the pronunciation of the oov symbol (like <unk>).\n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $1}' $dir/g2p_prons_for_oov_train.txt \\\n    $dir/oov_train.txt | awk -v op=\"$oov_pron\" '{print $0\" \"op}' > $dir/oov_train_no_pron.txt || exit 1;\n    \n  cat $dir/oov_train_no_pron.txt $dir/g2p_prons_for_oov_train.txt $dir/ref_lexicon.txt | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/train_counts.txt - | \\\n    cat - $dir/non_scored_entries | \\\n    sort | uniq > $dir/dict_expanded_train/lexicon.txt || exit 1;\n  \n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt $dir/dict_expanded_train \"$oov_symbol\" \\\n    $dir/lang_expanded_train_tmp $dir/lang_expanded_train || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Generate pronunciation candidates from phonetic decoding on acoustic training data..\"\n  if $retrain_src_mdl; then mdl_dir=$dir/${src_mdl_dir}_retrained; else mdl_dir=$src_mdl_dir; fi\n  steps/cleanup/debug_lexicon.sh --nj $nj --cmd \"$decode_cmd\" $data $dir/lang_expanded_train \\\n    $mdl_dir $dir/dict_expanded_train/lexicon.txt $dir/phonetic_decoding || exit 1;\n  \n  # We prune the phonetic decoding generated prons relative to the largest count, by setting \"min_prob\",\n  # and only keep prons which are not present in the reference lexicon / g2p-generated lexicon.\n  cat $dir/ref_lexicon.txt $dir/lexicon_g2p.txt | sort -u > 
$dir/phonetic_decoding/filter_lexicon.txt \n  \n  $cmd $dir/phonetic_decoding/log/prons_to_lexicon.log steps/dict/prons_to_lexicon.py \\\n    --min-prob=$min_prob --filter-lexicon=$dir/phonetic_decoding/filter_lexicon.txt \\\n    $dir/phonetic_decoding/prons.txt $dir/lexicon_phonetic_decoding_with_eps.txt\n  cat $dir/lexicon_phonetic_decoding_with_eps.txt | grep -vP \"<eps>|<UNK>|<unk>|\\[.*\\]\" | \\\n    sort | uniq > $dir/lexicon_phonetic_decoding.txt || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: Combine the reference lexicon and pronunciations from phone-decoding/G2P into one\"\n  echo \"  ... lexicon, and run lattice alignment using this lexicon on acoustic training data\"\n  echo \"  ... to collect acoustic evidence.\"\n  # Combine the reference lexicon, pronunciations from G2P and phonetic decoding into one lexicon.\n  mkdir -p $dir/dict_combined_iter1\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_combined_iter1/ 2>/dev/null\n  rm $dir/dict_combined_iter1/lexiconp.txt $dir/dict_combined_iter1/lexicon.txt 2>/dev/null\n\n  # Filter out words which don't appear in the acoustic training data\n  cat $dir/lexicon_phonetic_decoding.txt $dir/lexicon_g2p.txt \\\n    $dir/ref_lexicon.txt | tr -s '\\t' ' ' | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/train_counts.txt - | \\\n    cat $dir/non_scored_entries - | \\\n    sort | uniq > $dir/dict_combined_iter1/lexicon.txt\n  \n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt \\\n    $dir/dict_combined_iter1 \"$oov_symbol\" \\\n    $dir/lang_combined_iter1_tmp $dir/lang_combined_iter1 || exit 1;\n  \n  # Generate lattices for the acoustic training data with the combined lexicon.\n  if $retrain_src_mdl; then mdl_dir=$dir/${src_mdl_dir}_retrained; else mdl_dir=$src_mdl_dir; fi\n  steps/align_fmllr_lats.sh --acoustic-scale 0.05 --cmd \"$decode_cmd\" --nj $nj \\\n    $data $dir/lang_combined_iter1 $mdl_dir 
$dir/lats_iter1 || exit 1;\n\n  # Get arc level information from the lattice.\n  $cmd JOB=1:$nj $dir/lats_iter1/log/get_arc_info.JOB.log \\\n    lattice-align-words $dir/lang_combined_iter1/phones/word_boundary.int \\\n    $dir/lats_iter1/final.mdl \\\n    \"ark:gunzip -c $dir/lats_iter1/lat.JOB.gz |\" ark:- \\| \\\n    lattice-arc-post --acoustic-scale=0.1 $dir/lats_iter1/final.mdl ark:- - \\| \\\n    utils/int2sym.pl -f 5 $dir/lang_combined_iter1/words.txt \\| \\\n    utils/int2sym.pl -f 6- $dir/lang_combined_iter1/phones.txt '>' \\\n    $dir/lats_iter1/arc_info_sym.JOB.txt || exit 1;\n  \n  # Get soft counts of all pronunciations from arc level information.\n  cat $dir/lats_iter1/arc_info_sym.*.txt | steps/dict/get_pron_stats.py - \\\n    $dir/phonetic_decoding/phone_map.txt $dir/lats_iter1/pron_stats.txt || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: Prune the pronunciation candidates generated from G2P/phonetic decoding, and re-do lattice-alignment.\"\n  mkdir -p $dir/dict_combined_iter2\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_combined_iter2/ 2>/dev/null\n  rm $dir/dict_combined_iter2/lexiconp.txt $dir/dict_combined_iter2/lexicon.txt 2>/dev/null\n\n  # Prune away pronunciations which have low acoustic evidence from the first pass of lattice alignment.\n  $cmd $dir/lats_iter1/log/prune_pron_candidates.log steps/dict/internal/prune_pron_candidates.py \\\n    --variant-counts-ratio $variant_counts_ratio \\\n    $dir/lats_iter1/pron_stats.txt $dir/lexicon_phonetic_decoding.txt $dir/lexiconp_g2p.txt $dir/ref_lexicon.txt \\\n    $dir/lexicon_phonetic_decoding_pruned.txt $dir/lexicon_g2p_pruned.txt\n\n  # Filter out words which don't appear in the acoustic training data\n  cat $dir/lexicon_phonetic_decoding_pruned.txt $dir/lexicon_g2p_pruned.txt \\\n    $dir/ref_lexicon.txt | tr -s '\\t' ' ' | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/train_counts.txt - | \\\n    cat 
$dir/non_scored_entries - | \\\n    sort | uniq > $dir/dict_combined_iter2/lexicon.txt\n\n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt \\\n    $dir/dict_combined_iter2 \"$oov_symbol\" \\\n    $dir/lang_combined_iter2_tmp $dir/lang_combined_iter2 || exit 1;\n  \n  if $retrain_src_mdl; then mdl_dir=$dir/${src_mdl_dir}_retrained; else mdl_dir=$src_mdl_dir; fi\n  steps/align_fmllr_lats.sh --cmd \"$decode_cmd\" --nj $nj \\\n    $data $dir/lang_combined_iter2 $mdl_dir $dir/lats_iter2 || exit 1;\n\n  # Get arc level information from the lattice.\n  $cmd JOB=1:$nj $dir/lats_iter2/log/get_arc_info.JOB.log \\\n    lattice-align-words $dir/lang_combined_iter2/phones/word_boundary.int \\\n    $dir/lats_iter2/final.mdl \\\n    \"ark:gunzip -c $dir/lats_iter2/lat.JOB.gz |\" ark:- \\| \\\n    lattice-arc-post --acoustic-scale=0.1 $dir/lats_iter2/final.mdl ark:- - \\| \\\n    utils/int2sym.pl -f 5 $dir/lang_combined_iter2/words.txt \\| \\\n    utils/int2sym.pl -f 6- $dir/lang_combined_iter2/phones.txt '>' \\\n    $dir/lats_iter2/arc_info_sym.JOB.txt || exit 1;\n  \n  # Get soft counts of all pronunciations from arc level information.\n  cat $dir/lats_iter2/arc_info_sym.*.txt | steps/dict/get_pron_stats.py - \\\n    $dir/phonetic_decoding/phone_map.txt $dir/lats_iter2/pron_stats.txt || exit 1;\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: Select pronunciations according to the acoustic evidence from lattice alignment.\"\n  # Given the acoustic evidence (soft-counts), we use a Bayesian framework to select pronunciations \n  # from three exclusive candidate sources: reference (hand-derived) lexicon, G2P and phonetic decoding.\n  # The posteriors for all candidate prons for all words are printed into pron_posteriors.txt\n  # For words which are out of the ref. 
vocab, the learned prons are written into out_of_ref_vocab_prons_learned.txt.\n  # Among them, for words without acoustic evidence, we just ignore them (even if pron candidates from G2P were provided).\n  # For words in the ref. vocab, we instead output a human readable & editable \"edits\" file called\n  # ref_lexicon_edits.txt, which records all proposed changes to the prons (if any). Also, a \n  # summary is printed into the log file.\n  \n  variants_counts=`cat $dir/target_num_prons_per_word` || exit 1;\n  $cmd $dir/lats_iter2/log/select_prons_bayesian.log \\\n    steps/dict/select_prons_bayesian.py --prior-mean=$prior_mean --prior-counts-tot=$prior_counts_tot \\\n    --variants-counts=$variants_counts --variants-prob-mass=$variants_prob_mass --variants-prob-mass-ref=$variants_prob_mass_ref \\\n    $ref_dict/silence_phones.txt $dir/lats_iter2/pron_stats.txt $dir/train_counts.txt $dir/ref_lexicon.txt \\\n    $dir/lexicon_g2p_pruned.txt $dir/lexicon_phonetic_decoding_pruned.txt \\\n    $dir/lats_iter2/pron_posteriors.temp $dir/lats_iter2/out_of_ref_vocab_prons_learned.txt $dir/lats_iter2/ref_lexicon_edits.txt\n\n  # We reformat the pron_posterior file and add some comments.\n  paste <(cat $dir/lats_iter2/pron_posteriors.temp | cut -d' ' -f1-3 | column -t) \\\n    <(cat $dir/lats_iter2/pron_posteriors.temp | cut -d' ' -f4-) | sort -nr -k1,3 | \\\n    cat <( echo ';; <word> <source: R(eference)/G(2P)/P(hone-decoding)> <posterior> <pronunciation>') -  \\\n    > $dir/lats_iter2/pron_posteriors.txt\n  rm $dir/lats_iter2/pron_posteriors.temp 2>/dev/null\n\n  # Remove some stuff that takes up space and is unlikely to be useful later on.\n  if $cleanup; then\n    rm -r $dir/lats_iter*/{fsts*,lat*} 2>/dev/null\n  fi\nfi\n\nif [ $stage -le 7 ]; then\n  echo \"$0: Expand the learned lexicon further to cover words in target vocab that are\"\n  echo \"  ... 
not seen in acoustic training data.\"\n  mkdir -p $dest_dict\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dest_dict  2>/dev/null\n  rm $dest_dict/lexiconp.txt $dest_dict/lexicon.txt 2>/dev/null\n  # Get the list of oov (w.r.t. ref vocab) without acoustic evidence, which are in the\n  # target vocab. We'll just assign to them pronunciations from lexicon_g2p, if any.\n  cat $dir/lats_iter2/out_of_ref_vocab_prons_learned.txt $dir/ref_lexicon.txt | \\\n    awk 'NR==FNR{a[$1] = 1; next} !($1 in a)' - \\\n    $dir/target_vocab.txt | sort | uniq > $dir/oov_no_acoustics.txt || exit 1;\n\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/oov_no_acoustics.txt \\\n    $dir/lexicon_g2p.txt > $dir/g2p_prons_for_oov_no_acoustics.txt\n \n  # We concatenate three lexicons together: G2P lexicon for oov words without acoustics,\n  # learned lexicon for oov words with acoustics, and the original reference lexicon (for\n  # this part, we'll later apply recommended changes using steps/dict/apply_lexicon_edits.py).\n  cat $dir/g2p_prons_for_oov_no_acoustics.txt $dir/lats_iter2/out_of_ref_vocab_prons_learned.txt \\\n    $dir/ref_lexicon.txt | tr -s '\\t' ' ' | sort | uniq > $dest_dict/lexicon.temp\n\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/target_vocab.txt \\\n    $dest_dict/lexicon.temp | sort | uniq > $dest_dict/lexicon.nosil\n\n  cat $dir/non_scored_entries $dest_dict/lexicon.nosil | sort | uniq >$dest_dict/lexicon0.txt\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: Apply the ref_lexicon_edits file to the reference lexicon.\"\n  echo \"  ... The user can inspect/modify the edits file and then re-run:\"\n  echo \"  ... steps/dict/apply_lexicon_edits.py $dest_dict/lexicon0.txt $dir/lats_iter2/ref_lexicon_edits.txt  - | \\\\\"\n  echo \"  ...   
sort -u \\> $dest_dict/lexicon.txt to re-produce the final learned lexicon.\"\n  cp $dir/lats_iter2/ref_lexicon_edits.txt $dest_dict/lexicon_edits.txt 2>/dev/null\n  steps/dict/apply_lexicon_edits.py $dest_dict/lexicon0.txt $dir/lats_iter2/ref_lexicon_edits.txt - | \\\n    sort | uniq > $dest_dict/lexicon.txt || exit 1;\nfi\n"
  },
  {
    "path": "egs/steps/dict/learn_lexicon_greedy.sh",
    "content": "#! /bin/bash\n\n# Copyright 2018  Xiaohui Zhang\n# Apache 2.0\n\n# This recipe has similar inputs and outputs as steps/dict/learn_lexicon.sh\n# The major difference is, instead of using a Bayesian framework for \n# pronunciation selection, we use a likelihood-reduction based greedy \n# pronunciation selection framework presented in the paper:\n# \"Acoustic data-driven lexicon learning based on a greedy pronunciation \"\n# \"selection framework, by X. Zhang, V. Manohar, D. Povey and S. Khudanpur,\"\n# \"Interspeech 2017.\"\n\n# This script demonstrates how to expand an existing lexicon using a combination\n# of acoustic evidence and G2P to learn a lexicon that covers words in a target \n# vocab, and agrees sufficiently with the acoustics. The basic idea is to \n# run phonetic decoding on acoustic training data using an existing\n# acoustic model (possibly re-trained using a G2P-expanded lexicon) to get \n# alternative pronunciations for words in training data. Then we combine three\n# exclusive sources of pronunciations: the reference lexicon (supposedly \n# hand-derived), phonetic decoding, and G2P (optional) into one lexicon and then run \n# lattice alignment on the same data, to collect acoustic evidence (soft\n# counts) of all pronunciations. Based on these statistics, we use a greedy\n# framework (see steps/dict/select_prons_greedy.py for details) to select an\n# informative subset of pronunciations for each word with acoustic evidence. \n# Two important parameters are alpha and beta. Basically, the three dimensions of alpha\n# and beta correspond to three pronunciation sources: phonetic-decoding, G2P and\n# the reference lexicon, and the larger a value is, the more aggressively we'll\n# prune pronunciations from that source. The valid range of each dim. is [0, 1]\n# for alpha (0 means we never prune prons from that source) and [0, 100] for beta. 
\n# The output of this script is a learned lexicon whose vocab \n# matches the user-specified target-vocab, and two intermediate outputs which were\n# used to generate the learned lexicon: an edits file which records the recommended\n# changes to all in-ref-vocab words' prons, and a half-learned lexicon\n# ($dest_dict/lexicon0.txt) where all in-ref-vocab words' prons were untouched\n# (on top of which we apply the edits file to produce the final learned lexicon). \n# The user can always modify the edits file manually and then re-apply it on the \n# half-learned lexicon using steps/dict/apply_lexicon_edits.py to produce the \n# final learned lexicon. See the last stage in this script for details.\n\n# Begin configuration section.  \ncmd=run.pl\nnj=\nstage=0\noov_symbol=\nlexiconp_g2p=\nmin_prob=0.3\nvariant_counts_ratio=8 \nvariant_counts_no_acoustics=1 \nalpha=\"0,0,0\"\nbeta=\"0,0,0\"\ndelta=0.0000001\nnum_gauss=\nnum_leaves=\nretrain_src_mdl=true\ncleanup=true\nnj_select_prons=200\nlearn_iv_prons=false # whether we want to learn the prons of IV words (w.r.t. ref_vocab).\n\n# End configuration section.  \n\n. ./path.sh\n. 
utils/parse_options.sh\n\nif [ $# -lt 6 ] || [ $# -gt 7 ]; then\n  echo \"Usage: $0 [options] <ref-dict> <target-vocab> <data> <src-mdl-dir> \\\\\"\n  echo \"          <ref-lang> <dest-dict> [<dir>]\"\n  echo \"  This script does lexicon expansion using a combination of acoustic\"\n  echo \"  evidence and G2P to produce a lexicon that covers words of a target vocab:\"\n  echo \"\"               \n  echo \"Arguments:\"\n  echo \" <ref-dict>     The dir which contains the reference lexicon (most probably hand-derived)\"\n  echo \"                we want to expand/improve, and nonsilence_phones.txt, etc., which we need \" \n  echo \"                for building new dict dirs.\"\n  echo \" <target-vocab> The vocabulary we want the final learned lexicon to cover (one word per line).\"\n  echo \" <data>         acoustic training data we use to get alternative\"\n  echo \"                pronunciations and collect acoustic evidence.\"\n  echo \" <src-mdl-dir>  The dir containing an SAT-GMM acoustic model (we optionally re-train it\" \n  echo \"                using G2P expanded lexicon) to do phonetic decoding (to get alternative\"\n  echo \"                pronunciations) and lattice-alignment (to collect acoustic evidence for\"\n  echo \"                evaluating all pronunciations)\"\n  echo \" <ref-lang>     The reference lang dir which we use to get non-scored-words\"\n  echo \"                like <UNK> for building new dict dirs\"\n  echo \" <dest-dict>    The dict dir where we put the final learned lexicon, whose vocab\"\n  echo \"                matches <target-vocab>.\"\n  echo \" <dir>          The dir which contains all the intermediate outputs of this script\"\n  echo \"                (optional; default: <src-mdl-dir>_lex_learn_work).\"\n  echo \"\"\n  echo \"Note: <target-vocab> and the vocab of <data> don't have to match. For words\"\n  echo \"     which are in <target-vocab> but not seen in <data>, their pronunciations\" \n  echo \"     will be given by G2P at the end.\"\n  echo \"\"\n  echo \"e.g. 
$0 data/local/dict data/local/lm/librispeech-vocab.txt data/train \\\\\"\n  echo \"          exp/tri3 data/lang data/local/dict_learned\"\n  echo \"Options:\"\n  echo \"  --stage <n>                         # stage to run from, to enable resuming from partially\"\n  echo \"                                      # completed run (default: 0)\"\n  echo \"  --cmd '$cmd'                        # command to submit jobs with (e.g. run.pl, queue.pl)\"\n  echo \"  --nj <nj>                           # number of parallel jobs\"\n  echo \"  --oov-symbol '$oov_symbol'          # oov symbol, like <UNK>.\"\n  echo \"  --lexiconp-g2p                      # a lexicon (with prob in the second column) file containing g2p generated\"\n  echo \"                                      # pronunciations, for words in acoustic training data / target vocabulary. It's optional.\"\n  echo \"  --min-prob <float>                  # The cut-off parameter used to select pronunciation candidates from phonetic\"\n  echo \"                                      # decoding. We remove pronunciations with probabilities less than this value\"\n  echo \"                                      # after normalizing the probs s.t. the max-prob is 1.0 for each word.\"\n  echo \"  --variant-counts-ratio <int>        # This ratio parameter determines the maximum number of pronunciation\"\n  echo \"                                      # candidates we will keep for each word, after pruning according to lattice statistics from\"\n  echo \"                                      # the first iteration of lattice generation. 
See steps/dict/internal/prune_pron_candidates.py\"\n  echo \"                                      # for details.\"\n  echo \"  --variant-counts-no-acoustics <int> # how many G2P prons to include for each word unseen in acoustic training data.\"\n  echo \"  --alpha <float>,<float>,<float>     # scaling factors used in the greedy pronunciation selection framework, \"\n  echo \"                                      # see steps/dict/select_prons_greedy.py for details.\"\n  echo \"  --beta <int>,<int>,<int>            # smoothing factors used in the greedy pronunciation selection framework, \"\n  echo \"                                      # see steps/dict/select_prons_greedy.py for details.\"\n  echo \"  --delta <float>                     # a floor value used in the greedy pronunciation selection framework, \"\n  echo \"                                      # see steps/dict/select_prons_greedy.py for details.\"\n  echo \"  --num-gauss                         # number of gaussians for the re-trained SAT model (on top of <src-mdl-dir>).\"\n  echo \"  --num-leaves                        # number of leaves for the re-trained SAT model (on top of <src-mdl-dir>).\"\n  echo \"  --retrain-src-mdl                   # true if you want to re-train the src_mdl before phone decoding (default: true).\"\n  exit 1\nfi\n\necho \"$0 $@\"  # Print the command line for logging\n\nref_dict=$1\ntarget_vocab=$2\ndata=$3\nsrc_mdl_dir=$4\nref_lang=$5\ndest_dict=$6\n\nif [ -z \"$oov_symbol\" ]; then\n   echo \"$0: the --oov-symbol option is required.\"\n   exit 1\nfi\n\nif [ $# -gt 6 ]; then\n  dir=$7 # Most intermediate outputs will be put here. 
\nelse\n  dir=${src_mdl_dir}_lex_learn_work\nfi\n\nmkdir -p $dir\nif [ $stage -le 0 ]; then\n  echo \"$0: Some preparatory work.\"\n  # Get the word counts of training data.\n  awk '{for (n=2;n<=NF;n++) counts[$n]++;} END{for (w in counts) printf \"%s %d\\n\",w, counts[w];}' \\\n    $data/text | sort > $dir/train_counts.txt\n  \n  # Get the non-scored entries and exclude them from the reference lexicon/vocab, and target_vocab.\n  steps/cleanup/internal/get_non_scored_words.py $ref_lang > $dir/non_scored_words\n  awk 'NR==FNR{a[$1] = 1; next} {if($1 in a) print $0}' $dir/non_scored_words \\\n    $ref_dict/lexicon.txt > $dir/non_scored_entries \n\n  # Remove non-scored-words from the reference lexicon.\n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $0}' $dir/non_scored_words \\\n    $ref_dict/lexicon.txt | tr -s '\\t' ' ' | awk '$1=$1' > $dir/ref_lexicon.txt\n\n  cat $dir/ref_lexicon.txt | awk '{print $1}' | sort | uniq > $dir/ref_vocab.txt\n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $0}' $dir/non_scored_words \\\n    $target_vocab | sort | uniq > $dir/target_vocab.txt\n    \n  # From the reference lexicon, we estimate the target_num_prons_per_word as,\n  # round(avg. # prons per word in the reference lexicon). 
This'll be used as \n  # the upper bound of # pron variants per word when we apply G2P or select prons to\n  # construct the learned lexicon in later stages.\n  python -c 'import sys; import math; print int(round(float(sys.argv[1])/float(sys.argv[2])))' \\\n    `wc -l $dir/ref_lexicon.txt | awk '{print $1}'` `wc -l $dir/ref_vocab.txt | awk '{print $1}'` \\\n    > $dir/target_num_prons_per_word || exit 1;\n\n  if [ -z $lexiconp_g2p ]; then\n    # create an empty list of g2p generated prons, if it's not given.\n    touch $dir/lexicon_g2p.txt\n    touch $dir/lexiconp_g2p.txt\n  else\n    # Exchange the 1st column (word) and 2nd column (prob) and remove pronunciations\n    # which are already in the reference lexicon.\n    cat $lexiconp_g2p | awk '{a=$1;b=$2; $1=\"\";$2=\"\";print b\" \"a$0}' | \\\n      awk 'NR==FNR{a[$0] = 1; next} {w=$2;for (n=3;n<=NF;n++) w=w\" \"$n; if(!(w in a)) print $0}' \\\n      $dir/ref_lexicon.txt - > $dir/lexiconp_g2p.txt 2>/dev/null\n    \n    # make a copy where we remove the first column (probabilities).\n    cat $dir/lexiconp_g2p.txt | cut -f1,3- > $dir/lexicon_g2p.txt 2>/dev/null\n  fi\n  variant_counts=`cat $dir/target_num_prons_per_word` || exit 1;\n  $cmd $dir/log/prune_g2p_lexicon.log steps/dict/prons_to_lexicon.py \\\n    --top-N=$variant_counts $dir/lexiconp_g2p.txt \\\n    $dir/lexicon_g2p_variant_counts${variant_counts}.txt || exit 1;\nfi\n\nif [ $stage -le 1 ] && $retrain_src_mdl; then\n  echo \"$0: Expand the reference lexicon to cover all words in the target vocab. and then\"\n  echo \"   ... re-train the source acoustic model for phonetic decoding. 
\"\n  mkdir -p $dir/dict_expanded_target_vocab\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_expanded_target_vocab  2>/dev/null\n  rm $dir/dict_expanded_target_vocab/lexiconp.txt $dir/dict_expanded_target_vocab/lexicon.txt 2>/dev/null\n  \n  # Get the oov words list (w.r.t ref vocab) which are in the target vocab. \n  awk 'NR==FNR{a[$1] = 1; next} !($1 in a)' $dir/ref_lexicon.txt \\\n    $dir/target_vocab.txt | sort | uniq > $dir/oov_target_vocab.txt\n\n  # Assign pronunciations from lexicon_g2p.txt to oov_target_vocab. For words which\n  # cannot be found in lexicon_g2p.txt, we simply ignore them.\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/oov_target_vocab.txt \\\n    $dir/lexicon_g2p.txt > $dir/lexicon_g2p_oov_target_vocab.txt\n  \n  cat $dir/lexicon_g2p_oov_target_vocab.txt $dir/ref_lexicon.txt | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/target_vocab.txt - | \\\n    cat $dir/non_scored_entries - | \n    sort | uniq > $dir/dict_expanded_target_vocab/lexicon.txt\n  \n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt $dir/dict_expanded_target_vocab \\\n    $oov_symbol $dir/lang_expanded_target_vocab_tmp $dir/lang_expanded_target_vocab || exit 1;\n  \n  # Align the acoustic training data using the given src_mdl_dir.\n  alidir=${src_mdl_dir}_ali_$(basename $data) \n  steps/align_fmllr.sh --nj $nj --cmd \"$train_cmd\" \\\n    $data $dir/lang_expanded_target_vocab $src_mdl_dir $alidir || exit 1;\n  \n  # Train another SAT system on the given data and put it in $dir/${src_mdl_dir}_retrained\n  # this model will be used for phonetic decoding and lattice alignment later on.\n  if [ -z $num_leaves ] || [ -z $num_gauss ] ; then\n    echo \"num_leaves and num_gauss need to be specified.\" && exit 1;\n  fi\n  steps/train_sat.sh --cmd \"$train_cmd\" $num_leaves $num_gauss \\\n    $data $dir/lang_expanded_target_vocab $alidir $dir/${src_mdl_dir}_retrained || exit 
1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Expand the reference lexicon to cover all words seen in,\"\n  echo \"  ... acoustic training data, and prepare corresponding dict and lang directories.\"\n  echo \"  ... This is needed when generate pron candidates from phonetic decoding.\"\n  mkdir -p $dir/dict_expanded_train\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_expanded_train 2>/dev/null\n  rm $dir/dict_expanded_train/lexiconp.txt $dir/dict_expanded_train/lexicon.txt 2>/dev/null\n\n  # Get the oov words list (w.r.t ref vocab) which are in training data. \n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $1}' $dir/ref_lexicon.txt \\\n    $dir/train_counts.txt | awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $0}' \\\n    $dir/non_scored_words - | sort > $dir/oov_train.txt || exit 1; \n  \n  awk 'NR==FNR{a[$1] = 1; next} {if(($1 in a)) b+=$2; else c+=$2} END{print c/(b+c)}' \\\n    $dir/ref_vocab.txt $dir/train_counts.txt > $dir/train_oov_rate || exit 1;\n  \n  echo \"OOV rate (w.r.t. the reference lexicon) of the acoustic training data is:\"\n  cat $dir/train_oov_rate\n\n  # Assign pronunciations from lexicon_g2p to oov_train. 
For words which\n  # cannot be found in lexicon_g2p, we simply assign oov_symbol's pronunciaiton\n  # (like NSN) to them, in order to get phonetic decoding pron candidates for them later on.\n  variant_counts=`cat $dir/target_num_prons_per_word` || exit 1;\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/oov_train.txt \\\n    $dir/lexicon_g2p_variant_counts${variant_counts}.txt > $dir/g2p_prons_for_oov_train.txt || exit 1;\n  \n  # Get the pronunciation of oov_symbol.\n  oov_pron=`cat $dir/non_scored_entries | grep $oov_symbol | awk '{print $2}'`\n  # For oov words in training data for which we don't even have G2P pron candidates,\n  # we simply assign them the pronunciation of the oov symbol (like <unk>),\n  # so that we can get pronunciations for them from phonetic decoding.\n  awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $1}' $dir/g2p_prons_for_oov_train.txt \\\n    $dir/oov_train.txt | awk -v op=\"$oov_pron\" '{print $0\" \"op}' > $dir/oov_train_no_pron.txt || exit 1;\n    \n  cat $dir/oov_train_no_pron.txt $dir/g2p_prons_for_oov_train.txt $dir/ref_lexicon.txt | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/train_counts.txt - | \\\n    cat - $dir/non_scored_entries | \\\n    sort | uniq > $dir/dict_expanded_train/lexicon.txt || exit 1;\n  \n  utils/prepare_lang.sh $dir/dict_expanded_train $oov_symbol \\\n    $dir/lang_expanded_train_tmp $dir/lang_expanded_train || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Generate pronunciation candidates from phonetic decoding on acoustic training data..\"\n  if $retrain_src_mdl; then mdl_dir=$dir/${src_mdl_dir}_retrained; else mdl_dir=$src_mdl_dir; fi\n  steps/cleanup/debug_lexicon.sh  --nj $nj \\\n    --cmd \"$decode_cmd\" $data $dir/lang_expanded_train \\\n    $mdl_dir $dir/dict_expanded_train/lexicon.txt $dir/phonetic_decoding || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: Combine the reference lexicon and pronunciations from phone-decoding/G2P into one\"\n  echo \"  ... 
lexicon, and run lattice alignment using this lexicon on acoustic training data\"\n  echo \"  ... to collect acoustic evidence.\"\n  # We first prune the phonetic decoding generated prons relative to the largest count, by setting \"min_prob\",\n  # and only leave prons who are not present in the reference lexicon / g2p-generated lexicon.\n  cat $dir/ref_lexicon.txt $dir/lexicon_g2p.txt | sort -u > $dir/phonetic_decoding/filter_lexicon.txt \n  \n  $cmd $dir/phonetic_decoding/log/prons_to_lexicon.log steps/dict/prons_to_lexicon.py \\\n    --min-prob=$min_prob --filter-lexicon=$dir/phonetic_decoding/filter_lexicon.txt \\\n    $dir/phonetic_decoding/prons.txt $dir/lexicon_pd_with_eps.txt\n\n  # We abandon phonetic-decoding candidates for infrequent words.\n  awk '{if($2 < 3) print $1}' $dir/train_counts.txt > $dir/pd_candidates_to_exclude.txt \n  awk 'NR==FNR{a[$1] = $2; next} {if(a[$1]<10) print $1}' $dir/train_counts.txt \\\n    $dir/oov_train_no_pron.txt >> $dir/pd_candidates_to_exclude.txt \n\n  if [ -s $dir/pd_candidates_to_exclude.txt ]; then\n    cat $dir/lexicon_pd_with_eps.txt | grep -vP \"<eps>|<UNK>|<unk>|\\[.*\\]\" | \\\n      awk 'NR==FNR{a[$0] = 1; next} {if(!($1 in a)) print $0}' $dir/pd_candidates_to_exclude.txt - | \\\n      sort | uniq > $dir/lexicon_pd.txt || exit 1;\n  else\n    cat $dir/lexicon_pd_with_eps.txt | grep -vP \"<eps>|<UNK>|<unk>|\\[.*\\]\" | \\\n      sort | uniq > $dir/lexicon_pd.txt || exit 1;\n  fi\n\n  # Combine the reference lexicon, pronunciations from G2P and phonetic decoding into one lexicon.\n  mkdir -p $dir/dict_combined_iter1\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_combined_iter1/ 2>/dev/null\n  rm $dir/dict_combined_iter1/lexiconp.txt $dir/dict_combined_iter1/lexicon.txt 2>/dev/null\n\n  # Filter out words which don't appear in the acoustic training data\n  cat $dir/lexicon_pd.txt $dir/lexicon_g2p.txt \\\n    $dir/ref_lexicon.txt | tr -s '\\t' ' ' 
| \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/train_counts.txt - | \\\n    cat $dir/non_scored_entries - | \\\n    sort | uniq > $dir/dict_combined_iter1/lexicon.txt\n  \n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt \\\n    $dir/dict_combined_iter1 $oov_symbol \\\n    $dir/lang_combined_iter1_tmp $dir/lang_combined_iter1 || exit 1;\n  \n  # Generate lattices for the acoustic training data with the combined lexicon.\n  if $retrain_src_mdl; then mdl_dir=$dir/${src_mdl_dir}_retrained; else mdl_dir=$src_mdl_dir; fi\n\n  # Get the vocab for words for which we want to learn pronunciations.\n  if $learn_iv_prons; then\n    # If we want to learn the prons of IV words (w.r.t. ref_vocab), the learn_vocab is just the intersection of\n    # target_vocab and the vocab of words seen in acoustic training data (first col. of train_counts.txt)\n    awk 'NR==FNR{a[$1] = 1; next} {if($1 in a) print $1}' $dir/target_vocab.txt $dir/train_counts.txt \\\n      > $dir/learn_vocab.txt\n  else\n    # Exclude words from the ref_vocab if we don't want to learn the pronunciations of IV words.\n    awk 'NR==FNR{a[$1] = 1; next} {if($1 in a) print $1}' $dir/target_vocab.txt $dir/train_counts.txt | \\\n      awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $1}' $dir/ref_vocab.txt - > $dir/learn_vocab.txt\n  fi\n  \n  # In order to get finer lattice stats of alternative prons, we want to make lattices deeper.\n  # To speed up lattice generation, we use a ctm to create sub-utterances and a sub-segmentation\n  # for each instance of a word within learn_vocab (or a string of consecutive words within learn_vocab),\n  # including a single out-of-learn-vocab word at the boundary if present.\n  mkdir -p $dir/resegmentation\n  steps/dict/internal/get_subsegments.py $dir/phonetic_decoding/word.ctm $dir/learn_vocab.txt \\\n    $dir/resegmentation/subsegments $dir/resegmentation/text || exit 1;\n  utils/data/subsegment_data_dir.sh $data $dir/resegmentation/subsegments 
$dir/resegmentation/text \\\n    $dir/resegmentation/data || exit 1;\n  steps/compute_cmvn_stats.sh $dir/resegmentation/data || exit 1;\n\n  steps/align_fmllr_lats.sh --beam 20 --retry-beam 50 --final-beam 30 --acoustic-scale 0.05 --cmd \"$decode_cmd\" --nj $nj \\\n    $dir/resegmentation/data $dir/lang_combined_iter1 $mdl_dir $dir/lats_iter1 || exit 1;\n\n  # Get arc level information from the lattice.\n  $cmd JOB=1:$nj $dir/lats_iter1/log/get_arc_info.JOB.log \\\n    lattice-align-words $dir/lang_combined_iter1/phones/word_boundary.int \\\n    $dir/lats_iter1/final.mdl \\\n    \"ark:gunzip -c $dir/lats_iter1/lat.JOB.gz |\" ark:- \\| \\\n    lattice-arc-post --acoustic-scale=0.1 $dir/lats_iter1/final.mdl ark:- - \\| \\\n    utils/int2sym.pl -f 5 $dir/lang_combined_iter1/words.txt \\| \\\n    utils/int2sym.pl -f 6- $dir/lang_combined_iter1/phones.txt '>' \\\n    $dir/lats_iter1/arc_info_sym.JOB.txt || exit 1;\n  \n  # Compute soft counts (pron_stats) of every particular word-pronunciation pair by\n  # summing up arc level information over all utterances. We'll use this to prune\n  # pronunciation candidates before the next iteration of lattice generation.\n  cat $dir/lats_iter1/arc_info_sym.*.txt | steps/dict/get_pron_stats.py - \\\n    $dir/phonetic_decoding/phone_map.txt $dir/lats_iter1/pron_stats.txt || exit 1;\n  \n  # Accumlate utterance-level pronunciation posteriors (into arc_stats) by summing up\n  # posteriors of arcs representing the same word & pronunciation and starting\n  # from roughly the same location. 
See steps/dict/internal/sum_arc_info.py for details.\n  for i in `seq 1 $nj`;do\n    cat $dir/lats_iter1/arc_info_sym.${i}.txt | sort -n -k1 -k2 -k3r | \\\n      steps/dict/internal/sum_arc_info.py - $dir/phonetic_decoding/phone_map.txt $dir/lats_iter1/arc_info_summed.${i}.txt\n  done \n  cat $dir/lats_iter1/arc_info_summed.*.txt | sort -k1 -k2 > $dir/lats_iter1/arc_stats.txt \n\n  # Prune the phonetic_decoding lexicon so that any pronunciation that only has non-zero posterior at one word example will be removed.\n  # The pruned lexicon is put in $dir/lats_iter1. After further pruning in the next stage it'll be put back to $dir.\n  awk 'NR==FNR{w=$1;for (n=5;n<=NF;n++) w=w\" \"$n;a[w]+=1;next} {if($0 in a && a[$0]>1) print $0}' \\\n    $dir/lats_iter1/arc_stats.txt $dir/lexicon_pd.txt > $dir/lats_iter1/lexicon_pd_pruned.txt\nfi\n\n# Here we re-generate lattices (with a wider beam and a pruned combined lexicon) and re-collect pronunciation statistics \nif [ $stage -le 5 ]; then\n  echo \"$0: Prune the pronunciation candidates generated from G2P/phonetic decoding, and re-do lattice-alignment.\"\n  mkdir -p $dir/dict_combined_iter2\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dir/dict_combined_iter2/ 2>/dev/null\n  rm $dir/dict_combined_iter2/lexiconp.txt $dir/dict_combined_iter2/lexicon.txt 2>/dev/null\n\n  # Prune away pronunciations which have low acoustic evidence from the first pass of lattice generation.\n  $cmd $dir/lats_iter1/log/prune_pron_candidates.log steps/dict/internal/prune_pron_candidates.py \\\n    --variant-counts-ratio $variant_counts_ratio \\\n    $dir/lats_iter1/pron_stats.txt $dir/lats_iter1/lexicon_pd_pruned.txt $dir/lexiconp_g2p.txt $dir/ref_lexicon.txt \\\n    $dir/lexicon_pd_pruned.txt $dir/lexicon_g2p_pruned.txt\n\n  # Filter out words which don't appear in the acoustic training data.\n  cat $dir/lexicon_pd_pruned.txt $dir/lexicon_g2p_pruned.txt \\\n    $dir/ref_lexicon.txt | tr 
-s '\\t' ' ' | \\\n    awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/train_counts.txt - | \\\n    cat $dir/non_scored_entries - | \\\n    sort | uniq > $dir/dict_combined_iter2/lexicon.txt\n\n  utils/prepare_lang.sh --phone-symbol-table $ref_lang/phones.txt \\\n    $dir/dict_combined_iter2 $oov_symbol \\\n    $dir/lang_combined_iter2_tmp $dir/lang_combined_iter2 || exit 1;\n  \n  # Re-generate lattices with a wider beam, so that we'll get deeper lattices.\n  if $retrain_src_mdl; then mdl_dir=$dir/${src_mdl_dir}_retrained; else mdl_dir=$src_mdl_dir; fi\n  steps/align_fmllr_lats.sh  --beam 30 --retry-beam 60 --final-beam 50 --acoustic-scale 0.05 --cmd \"$decode_cmd\" --nj $nj \\\n    $dir/resegmentation/data $dir/lang_combined_iter2 $mdl_dir $dir/lats_iter2 || exit 1;\n\n  # Get arc level information from the lattice as we did in the last stage.\n  $cmd JOB=1:$nj $dir/lats_iter2/log/get_arc_info.JOB.log \\\n    lattice-align-words $dir/lang_combined_iter2/phones/word_boundary.int \\\n    $dir/lats_iter2/final.mdl \\\n    \"ark:gunzip -c $dir/lats_iter2/lat.JOB.gz |\" ark:- \\| \\\n    lattice-arc-post --acoustic-scale=0.1 $dir/lats_iter2/final.mdl ark:- - \\| \\\n    utils/int2sym.pl -f 5 $dir/lang_combined_iter2/words.txt \\| \\\n    utils/int2sym.pl -f 6- $dir/lang_combined_iter2/phones.txt '>' \\\n    $dir/lats_iter2/arc_info_sym.JOB.txt || exit 1;\n  \n  # Compute soft counts (pron_stats) of every particular word-pronunciation pair as\n  # we did in the last stage. 
The stats will only be used as diagnostics.\n  cat $dir/lats_iter2/arc_info_sym.*.txt | steps/dict/get_pron_stats.py - \\\n    $dir/phonetic_decoding/phone_map.txt $dir/lats_iter2/pron_stats.txt || exit 1;\n  \n  # Accumlate utterance-level pronunciation posteriors as we did in the last stage.\n  for i in `seq 1 $nj`;do\n    cat $dir/lats_iter2/arc_info_sym.${i}.txt | sort -n -k1 -k2 -k3r | \\\n      steps/dict/internal/sum_arc_info.py - $dir/phonetic_decoding/phone_map.txt $dir/lats_iter2/arc_info_summed.${i}.txt\n  done \n  cat $dir/lats_iter2/arc_info_summed.*.txt | sort -k1 -k2 > $dir/lats_iter2/arc_stats.txt \n\n  # The pron_stats are the acoustic evidence which the likelihood-reduction-based pronunciation\n  # selection procedure will be based on.\n  # Split the utterance-level pronunciation posterior stats into $nj_select_prons pieces,\n  # so that the following pronunciation selection stage can be parallelized.\n  numsplit=$nj_select_prons\n  awk '{print $1\"-\"$2\" \"$1}' $dir/lats_iter2/arc_stats.txt > $dir/lats_iter2/utt2word\n  utt2words=$(for n in `seq $numsplit`; do echo $dir/lats_iter2/utt2word.$n; done)\n  utils/split_scp.pl --utt2spk=$dir/lats_iter2/utt2word $dir/lats_iter2/utt2word $utt2words || exit 1\n  for n in `seq $numsplit`; do \n    (cat $dir/lats_iter2/utt2word.$n | awk '{$1=substr($1,length($2)+2);print $2\" \"$1}' - > $dir/lats_iter2/word2utt.$n\n     awk 'NR==FNR{a[$0] = 1; next} {b=$1\" \"$2; if(b in a) print $0}' $dir/lats_iter2/word2utt.$n \\\n       $dir/lats_iter2/arc_stats.txt > $dir/lats_iter2/arc_stats.${n}.txt\n    ) &\n  done\n  wait\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: Select pronunciations according to the acoustic evidence from lattice alignment.\"\n  # Given the acoustic evidence (soft-counts), we use a Bayesian framework to select pronunciations \n  # from three exclusive candidate sources: reference (hand-derived) lexicon, G2P and phonetic decoding.\n  # The posteriors for all candidate prons for all words are 
printed into pron_posteriors.txt\n  # For words which are out of the ref. vocab, the learned prons are written into out_of_ref_vocab_prons_learned.txt.\n  # Among them, we just ignore words without acoustic evidence (even if pron candidates from G2P were provided).\n  # For words in the ref. vocab, we instead output a human readable & editable \"edits\" file called\n  # ref_lexicon_edits.txt, which records all proposed changes to the prons (if any). Also, a \n  # summary is printed into the log file.\n  \n  $cmd JOB=1:$nj_select_prons $dir/lats_iter2/log/generate_learned_lexicon.JOB.log \\\n    steps/dict/select_prons_greedy.py \\\n      --alpha=${alpha} --beta=${beta} \\\n      --delta=${delta} \\\n      $ref_dict/silence_phones.txt $dir/lats_iter2/arc_stats.JOB.txt $dir/train_counts.txt $dir/ref_lexicon.txt \\\n      $dir/lexicon_g2p_pruned.txt $dir/lexicon_pd_pruned.txt \\\n      $dir/lats_iter2/learned_lexicon.JOB.txt || exit 1;\n\n  cat $dir/lats_iter2/learned_lexicon.*.txt > $dir/lats_iter2/learned_lexicon.txt\n  rm $dir/lats_iter2/learned_lexicon.*.txt\n\n  $cmd $dir/lats_iter2/log/lexicon_learning_summary.log \\\n    steps/dict/merge_learned_lexicons.py \\\n      $dir/lats_iter2/arc_stats.txt $dir/train_counts.txt $dir/ref_lexicon.txt \\\n      $dir/lexicon_g2p_pruned.txt $dir/lexicon_pd_pruned.txt \\\n      $dir/lats_iter2/learned_lexicon.txt \\\n      $dir/lats_iter2/out_of_ref_vocab_prons_learned.txt $dir/lats_iter2/ref_lexicon_edits.txt || exit 1;\n\n  # Remove some stuff that takes up space and is unlikely to be useful later on.\n  if $cleanup; then\n    rm -r $dir/lats_iter*/{fsts*,lat*} 2>/dev/null\n  fi\nfi\n\nif [ $stage -le 7 ]; then\n  echo \"$0: Expand the learned lexicon further to cover words in target vocab that are\"\n  echo \"  ... 
not seen in acoustic training data.\"\n  mkdir -p $dest_dict\n  cp $ref_dict/{extra_questions.txt,optional_silence.txt,nonsilence_phones.txt,silence_phones.txt} \\\n    $dest_dict  2>/dev/null\n  rm $dest_dict/lexiconp.txt $dest_dict/lexicon.txt 2>/dev/null\n  # Get the list of oov (w.r.t. ref vocab) without acoustic evidence, which are in the\n  # target vocab. We'll just assign to them pronunciations from lexicon_g2p, if any.\n  cat $dir/lats_iter2/out_of_ref_vocab_prons_learned.txt $dir/ref_lexicon.txt | \\\n    awk 'NR==FNR{a[$1] = 1; next} !($1 in a)' - \\\n    $dir/target_vocab.txt | sort | uniq > $dir/oov_no_acoustics.txt || exit 1;\n  \n  variant_counts=$variant_counts_no_acoustics\n  \n  $cmd $dir/log/prune_g2p_lexicon.log steps/dict/prons_to_lexicon.py \\\n    --top-N=$variant_counts $dir/lexiconp_g2p.txt \\\n    $dir/lexicon_g2p_variant_counts${variant_counts}.txt || exit 1;\n  \n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/oov_no_acoustics.txt \\\n    $dir/lexicon_g2p_variant_counts${variant_counts}.txt > $dir/g2p_prons_for_oov_no_acoustics.txt|| exit 1;\n\n  # Get the pronunciation of oov_symbol.\n  oov_pron=`cat $dir/non_scored_entries | grep $oov_symbol | awk '{print $2}'` || exit 1;\n  # For oov words in target_vocab for which we don't even have G2P pron candidates,\n  # we simply assign them the pronunciation of the oov symbol (like <unk>),\n  if [ -s $dir/g2p_prons_for_oov_no_acoustics.txt ]; then\n    awk 'NR==FNR{a[$1] = 1; next} {if(!($1 in a)) print $1}' $dir/g2p_prons_for_oov_no_acoustics.txt \\\n      $dir/oov_no_acoustics.txt | awk -v op=\"$oov_pron\" '{print $0\" \"op}' > $dir/oov_target_vocab_no_pron.txt || exit 1;\n  else\n    awk -v op=\"$oov_pron\" '{print $0\" \"op}' $dir/oov_no_acoustics.txt > $dir/oov_target_vocab_no_pron.txt || exit 1\n  fi\n\n  # We concatenate three lexicons togethers: G2P lexicon for oov words without acoustics,\n  # learned lexicon for oov words with acoustics, and the original reference lexicon (for\n  # 
this part, later on we'll apply recommended changes using steps/dict/apply_lexicon_edits.py).\n  cat $dir/g2p_prons_for_oov_no_acoustics.txt $dir/lats_iter2/out_of_ref_vocab_prons_learned.txt \\\n    $dir/oov_target_vocab_no_pron.txt $dir/ref_lexicon.txt | tr -s '\\t' ' ' | sort | uniq > $dest_dict/lexicon.temp\n\n  awk 'NR==FNR{a[$1] = 1; next} ($1 in a)' $dir/target_vocab.txt \\\n    $dest_dict/lexicon.temp | sort | uniq > $dest_dict/lexicon.nosil\n\n  cat $dir/non_scored_entries $dest_dict/lexicon.nosil | sort | uniq >$dest_dict/lexicon0.txt\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: Apply the ref_lexicon_edits file to the reference lexicon.\"\n  echo \"  ... The user can inspect/modify the edits file and then re-run:\"\n  echo \"  ... steps/dict/apply_lexicon_edits.py $dest_dict/lexicon0.txt $dir/lats_iter2/ref_lexicon_edits.txt  - | \\\\\"\n  echo \"  ...   sort -u \\> $dest_dict/lexicon.txt to reproduce the final learned lexicon.\"\n  cp $dir/lats_iter2/ref_lexicon_edits.txt $dest_dict/lexicon_edits.txt 2>/dev/null\n  steps/dict/apply_lexicon_edits.py $dest_dict/lexicon0.txt $dir/lats_iter2/ref_lexicon_edits.txt - | \\\n    sort | uniq > $dest_dict/lexicon.txt || exit 1;\nfi\n\necho \"Lexicon learning finished successfully. Please refer to $dir/lats_iter2/log/lexicon_learning_summary.log\"\necho \"  for a summary. The learned lexicon, whose vocab matches the target_vocab, is $dest_dict/lexicon.txt\"\n"
  },
  {
    "path": "egs/steps/dict/merge_learned_lexicons.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2018  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom collections import defaultdict\nimport argparse\nimport sys\nimport math\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description = \"Convert a learned lexicon produced by steps/dict/select_prons_greedy.sh\"\n        \"into a lexicon for OOV words (w.r.t. ref. vocab) and a human editable lexicon-edit file.\"\n        \"for in-vocab words, and generate detailed summaries of the lexicon learning results\"\n        \"The inputs are a learned lexicon, an arc-stats file, and three source lexicons \"\n        \"(phonetic-decoding(PD)/G2P/ref). The outputs are: a learned lexicon for OOVs\"\n        \"(learned_lexicon_oov), and a lexicon_edits file (ref_lexicon_edits) containing\"\n        \"suggested modifications of prons, for in-vocab words.\",\n        epilog = \"See steps/dict/learn_lexicon_greedy.sh for example.\")\n    parser.add_argument(\"arc_stats_file\", metavar = \"<arc-stats-file>\", type = str,\n                        help = \"File containing word-pronunciation statistics obtained from lattices; \"\n                        \"each line must be <word> <utt-id> <start-frame> <count> <phones>\")\n    parser.add_argument(\"word_counts_file\", metavar = \"<counts-file>\", type = str,\n                        help = \"File containing word counts in acoustic training data; \"\n                        \"each line must be <word> <count>.\")\n    parser.add_argument(\"ref_lexicon\", metavar = \"<reference-lexicon>\", type = str,\n                        help = \"The reference lexicon (most probably hand-derived).\"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"g2p_lexicon\", metavar = \"<g2p-expanded-lexicon>\", type = str,\n                        help = \"Candidate ronouciations from G2P results.\"\n                        \"Each line must be <word> <phones>\")\n    
parser.add_argument(\"pd_lexicon\", metavar = \"<prons-in-acoustic-evidence>\", type = str,\n                        help = \"Candidate ronouciations from phonetic decoding results.\"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"learned_lexicon\", metavar = \"<learned-lexicon>\", type = str,\n                        help = \"Learned lexicon.\"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"learned_lexicon_oov\", metavar = \"<learned-lexicon-oov>\", type = str,\n                        help = \"Output file which is the learned lexicon for words out of the ref. vocab.\")\n    parser.add_argument(\"ref_lexicon_edits\", metavar = \"<lexicon-edits>\", type = str,\n                        help = \"Output file containing human-readable & editable pronounciation info (and the\"\n                        \"accept/reject decision made by our algorithm) for those words in ref. vocab,\" \n                        \"to which any change has been recommended. The info for each word is like:\" \n                        \"------------ an 4086.0 --------------\"\n                        \"R  | Y |  2401.6 |  AH N\"\n                        \"R  | Y |  640.8 |  AE N\"\n                        \"P  | Y |  1035.5 |  IH N\"\n                        \"R(ef), P(hone-decoding) represents the pronunciation source\"\n                        \"Y/N means the recommended decision of including this pron or not\"\n                        \"and the numbers are soft counts accumulated from lattice-align-word outputs. 
\"\n                        \"See the function WriteEditsAndSummary for more details.\")\n \n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.arc_stats_file == \"-\":\n        args.arc_stats_file_handle = sys.stdin\n    else:\n        args.arc_stats_file_handle = open(args.arc_stats_file)\n    args.word_counts_file_handle = open(args.word_counts_file)\n    args.ref_lexicon_handle = open(args.ref_lexicon)\n    args.g2p_lexicon_handle = open(args.g2p_lexicon)\n    args.pd_lexicon_handle = open(args.pd_lexicon)\n    args.learned_lexicon_handle = open(args.learned_lexicon)\n    args.learned_lexicon_oov_handle = open(args.learned_lexicon_oov, \"w\")\n    args.ref_lexicon_edits_handle = open(args.ref_lexicon_edits, \"w\")\n    \n    return args\n\ndef ReadArcStats(arc_stats_file_handle):\n    stats = defaultdict(lambda : defaultdict(dict))\n    stats_summed = defaultdict(float)\n    for line in arc_stats_file_handle.readlines():\n        splits = line.strip().split()\n\n        if (len(splits) == 0):\n            continue\n\n        if (len(splits) < 5):\n            raise Exception('Invalid format of line ' + line\n                                + ' in ' + arc_stats_file)\n        utt = splits[1]\n        start_frame = int(splits[2])\n        word = splits[0]\n        count = float(splits[3])\n        phones = splits[4:]\n        phones = ' '.join(phones)\n        stats[word][(utt, start_frame)][phones] = count\n        stats_summed[(word, phones)] += count\n    return stats, stats_summed\n\ndef ReadWordCounts(word_counts_file_handle):\n    counts = {}\n    for line in word_counts_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in counts file.')\n        word = splits[0]\n        count = int(splits[1])\n        
counts[word] = count\n    return counts\n\ndef ReadLexicon(args, lexicon_file_handle, counts):\n    # we're skipping any word not in counts (not seen in training data),\n    # cause we're only learning prons for words who have acoustic examples.\n    lexicon = defaultdict(set)\n    for line in lexicon_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in lexicon file.')\n        word = splits[0]\n        if word not in counts:\n            continue\n        phones = ' '.join(splits[1:])\n        lexicon[word].add(phones)\n    return lexicon\n\ndef WriteEditsAndSummary(args, learned_lexicon, ref_lexicon, pd_lexicon, g2p_lexicon, counts, stats, stats_summed):\n    # Note that learned_lexicon and ref_lexicon are dicts of sets of prons, while the other two lexicons are sets of (word, pron) pairs.\n    threshold = 2\n    words = [defaultdict(set) for i in range(4)] # \"words\" contains four bins, where we\n    # classify each word into, according to whether it's count > threshold,\n    # and whether it's OOVs w.r.t the reference lexicon.\n\n    src = {}\n    print(\"# Note: This file contains pronunciation info for words who have candidate \"\n          \"prons from G2P/phonetic-decoding accepted in the learned lexicon\"\n          \", sorted by their counts in acoustic training data, \"\n          ,file=args.ref_lexicon_edits_handle)\n    print(\"# 1st Col: source of the candidate pron: G(2P) / P(hone-decoding) / R(eference).\"\n          ,file=args.ref_lexicon_edits_handle)\n    print(\"# 2nd Col: accepted or not in the learned lexicon (Y/N).\", file=args.ref_lexicon_edits_handle)\n    print(\"# 3rd Col: soft counts from lattice-alignment (not augmented by prior-counts).\"\n          ,file=args.ref_lexicon_edits_handle)\n    print(\"# 4th Col: the pronunciation cadidate.\", 
file=args.ref_lexicon_edits_handle)\n    \n    # words which are to be printed into the edits file.\n    words_to_edit = [] \n    num_prons_tot = 0\n    for word in learned_lexicon:\n        num_prons_tot += len(learned_lexicon[word])\n        count = len(stats[word]) # This count could be smaller than the count read from the dict \"counts\",\n        # since in each sub-utterance, multiple occurences (which is rare) of the same word are compressed into one.\n        # We use this count here so that in the edit-file, soft counts for each word sum up to one. \n        flags = ['0' for i in range(3)] # \"flags\" contains three binary indicators, \n        # indicating where this word's pronunciations come from.\n        for pron in learned_lexicon[word]:\n            if word in pd_lexicon and pron in pd_lexicon[word]:\n                flags[0] = '1'\n                src[(word, pron)] = 'P'\n            elif word in ref_lexicon and pron in ref_lexicon[word]:\n                flags[1] = '1'\n                src[(word, pron)] = 'R'\n            elif word in g2p_lexicon and pron in g2p_lexicon[word]:\n                flags[2] = '1'\n                src[(word, pron)] = 'G'\n        if word in ref_lexicon:\n            all_ref_prons_accepted = True\n            for pron in ref_lexicon[word]:\n                if pron not in learned_lexicon[word]:\n                    all_ref_prons_accepted = False\n                    break\n            if not all_ref_prons_accepted or flags[0] == '1' or flags[2] == '1':\n                words_to_edit.append((word, len(stats[word])))\n            if count > threshold:\n                words[0][flags[0] + flags[1] + flags[2]].add(word)\n            else:\n                words[1][flags[0] + flags[1] + flags[2]].add(word)\n        else:\n            if count > threshold: \n                words[2][flags[0] + flags[2]].add(word)\n            else:\n                words[3][flags[0] + flags[2]].add(word)\n\n    words_to_edit_sorted = 
sorted(words_to_edit, key=lambda entry: entry[1], reverse=True)\n    for word, count in words_to_edit_sorted:\n        print(\"------------\",word, \"%2.1f\" % count, \"--------------\", file=args.ref_lexicon_edits_handle)\n        learned_prons = []\n        for pron in learned_lexicon[word]:\n            learned_prons.append((src[(word, pron)], 'Y', stats_summed[(word, pron)], pron))\n        for pron in ref_lexicon[word]:\n            if pron not in learned_lexicon[word]:\n                learned_prons.append(('R', 'N', stats_summed[(word, pron)], pron))\n        learned_prons_sorted = sorted(learned_prons, key=lambda item: item[2], reverse=True)\n        for item in learned_prons_sorted:\n            print('{} | {} |  {:.2f} | {}'.format(item[0], item[1], item[2], item[3]), file=args.ref_lexicon_edits_handle)\n\n    num_oovs_with_acoustic_evidence = len(set(learned_lexicon.keys()).difference(set(ref_lexicon.keys())))\n    num_oovs = len(set(counts.keys()).difference(set(ref_lexicon.keys())))\n    num_ivs = len(learned_lexicon) - num_oovs_with_acoustic_evidence\n    print(\"Average num. 
prons per word in the learned lexicon is {}\".format(float(num_prons_tot)/float(len(learned_lexicon))), file=sys.stderr)\n    # print(\"Here are the words whose reference pron candidates were all declined\", words[0]['100'], file=sys.stderr)\n    print(\"-------------------------------------------------Summary------------------------------------------\", file=sys.stderr)\n    print(\"We have acoustic evidence for {} out of {} in-vocab (w.r.t the reference lexicon) words from the acoustic training data.\".format(num_ivs, len(ref_lexicon)), file=sys.stderr) \n    print(\"  Among those frequent words whose counts in the training text > \", threshold, \":\", file=sys.stderr) \n    num_freq_ivs_from_all_sources = len(words[0]['111']) + len(words[0]['110']) + len(words[0]['011'])\n    num_freq_ivs_from_g2p_or_phonetic_decoding = len(words[0]['101']) + len(words[0]['001']) + len(words[0]['100'])\n    num_freq_ivs_from_ref = len(words[0]['010'])\n    num_infreq_ivs_from_all_sources = len(words[1]['111']) + len(words[1]['110']) + len(words[1]['011'])\n    num_infreq_ivs_from_g2p_or_phonetic_decoding = len(words[1]['101']) + len(words[1]['001']) + len(words[1]['100'])\n    num_infreq_ivs_from_ref = len(words[1]['010'])\n    print('    {} words\\' selected prons came from the reference lexicon, G2P/phonetic-decoding.'.format(num_freq_ivs_from_all_sources), file=sys.stderr)\n    print('    {} words\\' selected prons come from G2P/phonetic-decoding-generated.'.format(num_freq_ivs_from_g2p_or_phonetic_decoding), file=sys.stderr) \n    print('    {} words\\' selected prons came from the reference lexicon only.'.format(num_freq_ivs_from_ref), file=sys.stderr) \n    print('  For those words whose counts in the training text <= {}:'.format(threshold), file=sys.stderr) \n    print('    {} words\\' selected prons came from the reference lexicon, G2P/phonetic-decoding.'.format(num_infreq_ivs_from_all_sources), file=sys.stderr)\n    print('    {} words\\' selected prons come from 
G2P/phonetic-decoding-generated.'.format(num_infreq_ivs_from_g2p_or_phonetic_decoding), file=sys.stderr) \n    print('    {} words\\' selected prons came from the reference lexicon only.'.format(num_infreq_ivs_from_ref), file=sys.stderr) \n    print(\"---------------------------------------------------------------------------------------------------\", file=sys.stderr)\n    num_freq_oovs_from_both_sources = len(words[2]['11'])\n    num_freq_oovs_from_phonetic_decoding = len(words[2]['10'])\n    num_freq_oovs_from_g2p = len(words[2]['01'])\n    num_infreq_oovs_from_both_sources = len(words[3]['11'])\n    num_infreq_oovs_from_phonetic_decoding = len(words[3]['10'])\n    num_infreq_oovs_from_g2p = len(words[3]['01'])\n    print('We have acoustic evidence for {} out of {} OOV (w.r.t the reference lexicon) words from the acoustic training data.'.format(num_oovs_with_acoustic_evidence, num_oovs), file=sys.stderr)\n    print('  Among those words whose counts in the training text > {}:'.format(threshold), file=sys.stderr)\n    print('    {} words\\' selected prons came from G2P and phonetic-decoding.'.format(num_freq_oovs_from_both_sources), file=sys.stderr)\n    print('    {} words\\' selected prons came from phonetic decoding only.'.format(num_freq_oovs_from_phonetic_decoding), file=sys.stderr) \n    print('    {} words\\' selected prons came from G2P only.'.format(num_freq_oovs_from_g2p), file=sys.stderr) \n    print('  For those words whose counts in the training text <= {}:'.format(threshold), file=sys.stderr) \n    print('    {} words\\' selected prons came from G2P and phonetic-decoding.'.format(num_infreq_oovs_from_both_sources), file=sys.stderr)\n    print('    {} words\\' selected prons came from phonetic decoding only.'.format(num_infreq_oovs_from_phonetic_decoding), file=sys.stderr) \n    print('    {} words\\' selected prons came from G2P only.'.format(num_infreq_oovs_from_g2p), file=sys.stderr) \n\ndef WriteLearnedLexiconOov(learned_lexicon, ref_lexicon, 
file_handle):\n    for word, prons in learned_lexicon.items():\n        if word not in ref_lexicon:\n            for pron in prons:\n                print('{0} {1}'.format(word, pron), file=file_handle)\n    file_handle.close()\n\ndef Main():\n    args = GetArgs()\n\n    # Read in three lexicon sources, word counts, and pron stats.\n    counts = ReadWordCounts(args.word_counts_file_handle)\n    ref_lexicon = ReadLexicon(args, args.ref_lexicon_handle, counts)\n    g2p_lexicon = ReadLexicon(args, args.g2p_lexicon_handle, counts)\n    pd_lexicon =  ReadLexicon(args, args.pd_lexicon_handle, counts)\n    stats, stats_summed = ReadArcStats(args.arc_stats_file_handle)\n    learned_lexicon =  ReadLexicon(args, args.learned_lexicon_handle, counts)\n    \n    # Write the learned prons for words out of the ref. vocab into learned_lexicon_oov.\n    WriteLearnedLexiconOov(learned_lexicon, ref_lexicon, args.learned_lexicon_oov_handle)\n    # Edits will be printed into ref_lexicon_edits, and the summary will be printed into stderr.\n    WriteEditsAndSummary(args, learned_lexicon, ref_lexicon, pd_lexicon, g2p_lexicon, counts, stats, stats_summed)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/prons_to_lexicon.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Vimal Manohar\n#           2016  Xiaohui Zhang\n# Apache 2.0.\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nfrom collections import defaultdict\nimport argparse\nimport sys\n\nclass StrToBoolAction(argparse.Action):\n    \"\"\" A custom action to convert bools from shell format i.e., true/false\n        to python format i.e., True/False \"\"\"\n    def __call__(self, parser, namespace, values, option_string=None):\n        if values == \"true\":\n            setattr(namespace, self.dest, True)\n        elif values == \"false\":\n            setattr(namespace, self.dest, False)\n        else:\n            raise Exception(\"Unknown value {0} for --{1}\".format(values, self.dest))\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(description = \"Converts pronunciation statistics (from phonetic decoding or g2p) \"\n                                     \"into a lexicon for. We prune the pronunciations \"\n                                     \"based on a provided stats file, and optionally filter out entries which are present \"\n                                     \"in a filter lexicon.\",\n                                     epilog = \"e.g. 
steps/dict/prons_to_lexicon.py --min-prob=0.4 \\\\\"\n                                     \"--filter-lexicon=exp/tri3_lex_0.4_work/phone_decode/filter_lexicon.txt \\\\\"\n                                     \"exp/tri3_lex_0.4_work/phone_decode/prons.txt \\\\\"\n                                     \"exp/tri3_lex_0.4_work/lexicon_phone_decoding.txt\"\n                                     \"See steps/dict/learn_lexicon_greedy.sh for examples in detail.\")\n\n    parser.add_argument(\"--set-sum-to-one\", type = str, default = False,\n                        action = StrToBoolAction, choices = [\"true\", \"false\"],\n                        help = \"If normalize lexicon such that the sum of \"\n                        \"probabilities is 1.\")\n    parser.add_argument(\"--set-max-to-one\", type = str, default = True,\n                        action = StrToBoolAction, choices = [\"true\", \"false\"],\n                        help = \"If normalize lexicon such that the max \"\n                        \"probability is 1.\")\n    parser.add_argument(\"--top-N\", type = int, default = 0,\n                        help = \"If non-zero, we just take the top N pronunciations (according to stats/pron-probs) for each word.\")\n    parser.add_argument(\"--min-prob\", type = float, default = 0.1,\n                        help = \"Remove pronunciation with probabilities less \"\n                        \"than this value after normalization.\")\n    parser.add_argument(\"--filter-lexicon\", metavar='<filter-lexicon>', type = str, default = '',\n                        help = \"Exclude entries in this filter lexicon from the output lexicon.\"\n                        \"each line must be <word> <phones>\")\n    parser.add_argument(\"stats_file\", metavar='<stats-file>', type = str,\n                        help = \"Input lexicon file containing pronunciation statistics/probs in the first column.\"\n                        \"each line must be <counts> <word> <phones>\")\n    
parser.add_argument(\"out_lexicon\", metavar='<out-lexicon>', type = str,\n                        help = \"Output lexicon.\")\n\n    print (' '.join(sys.argv), file = sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if args.stats_file == \"-\":\n        args.stats_file_handle = sys.stdin\n    else:\n        args.stats_file_handle = open(args.stats_file)\n\n    if args.filter_lexicon != '':\n        if args.filter_lexicon == \"-\":\n            # The filter lexicon is an input, so \"-\" means stdin.\n            args.filter_lexicon_handle = sys.stdin\n        else:\n            args.filter_lexicon_handle = open(args.filter_lexicon)\n    \n    if args.out_lexicon == \"-\":\n        args.out_lexicon_handle = sys.stdout\n    else:\n        args.out_lexicon_handle = open(args.out_lexicon, \"w\")\n\n    if args.set_max_to_one == args.set_sum_to_one:\n        raise Exception(\"Cannot have both \"\n            \"set-max-to-one and set-sum-to-one as true or false.\")\n\n    return args\n\ndef ReadStats(args):\n    lexicon = {}\n    word_count = {}\n    for line in args.stats_file_handle:\n        splits = line.strip().split()\n        if len(splits) < 3:\n            continue\n\n        word = splits[1]\n        count = float(splits[0])\n        phones = ' '.join(splits[2:])\n\n        lexicon[(word, phones)] = lexicon.get((word, phones), 0) + count\n        word_count[word] = word_count.get(word, 0) + count\n\n    return [lexicon, word_count]\n\ndef ReadLexicon(lexicon_file_handle):\n    lexicon = set()\n    if lexicon_file_handle:\n        for line in lexicon_file_handle.readlines():\n            splits = line.strip().split()\n            if len(splits) == 0:\n                continue\n            if len(splits) < 2:\n                raise Exception('Invalid format of line ' + line\n                                    + ' in lexicon file.')\n            word = splits[0]\n            phones = ' '.join(splits[1:])\n            lexicon.add((word, phones))\n    
return lexicon\n\ndef ConvertWordCountsToProbs(args, lexicon, word_count):\n    word_probs = {}\n    for entry, count in lexicon.items():\n        word = entry[0]\n        phones = entry[1]\n        prob = float(count) / float(word_count[word])\n        if word in word_probs:\n            word_probs[word].append((phones, prob))\n        else:\n            word_probs[word] = [(phones, prob)]\n\n    return word_probs\n\ndef ConvertWordProbsToLexicon(word_probs):\n    lexicon = {}\n    for word, entry in word_probs.items():\n        for x in entry:\n            lexicon[(word, x[0])] = lexicon.get((word,x[0]), 0) + x[1]\n    return lexicon\n\ndef NormalizeLexicon(lexicon, set_max_to_one = True,\n                     set_sum_to_one = False, min_prob = 0):\n    word_probs = {}\n    for entry, prob in lexicon.items():\n        t = word_probs.get(entry[0], (0,0))\n        word_probs[entry[0]] = (t[0] + prob, max(t[1], prob))\n\n    # Iterate over a snapshot, since the values are overwritten in the loop.\n    for entry, prob in list(lexicon.items()):\n        if set_max_to_one:\n            prob = prob / word_probs[entry[0]][1]\n        elif set_sum_to_one:\n            prob = prob / word_probs[entry[0]][0]\n        if prob < min_prob:\n            prob = 0\n        lexicon[entry] = prob\n\ndef TakeTopN(lexicon, top_N):\n    lexicon_reshaped = defaultdict(list) \n    lexicon_pruned = {}\n    for entry, prob in lexicon.items():\n        lexicon_reshaped[entry[0]].append([entry[1], prob])\n    for word in lexicon_reshaped:\n        prons = lexicon_reshaped[word]\n        sorted_prons = sorted(prons, reverse=True, key=lambda prons: prons[1])\n        for i in range(len(sorted_prons)):\n            if i >= top_N:\n                lexicon[(word, sorted_prons[i][0])] = 0\n        \ndef WriteLexicon(args, lexicon, filter_lexicon):\n    words = set()\n    num_removed = 0\n    num_filtered = 0\n    for entry, prob in lexicon.items():\n        if prob == 0:\n            num_removed += 1\n            continue\n        if entry in 
filter_lexicon:\n            num_filtered += 1\n            continue\n        words.add(entry[0])\n        print(\"{0} {1}\".format(entry[0], entry[1]),\n                file = args.out_lexicon_handle)\n    print (\"Before pruning, the total num. pronunciations is: {}\".format(len(lexicon)), file=sys.stderr)\n    print (\"Removed {0} pronunciations by setting min_prob {1}\".format(num_removed, args.min_prob), file=sys.stderr)\n    print (\"Filtered out {} pronunciations in the filter lexicon.\".format(num_filtered), file=sys.stderr)\n    num_prons_from_phone_decoding = len(lexicon) - num_removed - num_filtered\n    print (\"Num. pronunciations in the output lexicon, which solely come from phone decoding \"\n           \"is {0}. num. words is {1}\".format(num_prons_from_phone_decoding, len(words)), file=sys.stderr)\n\ndef Main():\n    args = GetArgs()\n\n    [lexicon, word_count] = ReadStats(args)\n\n    word_probs = ConvertWordCountsToProbs(args, lexicon, word_count)\n\n    lexicon = ConvertWordProbsToLexicon(word_probs)\n    filter_lexicon = set()\n    if args.filter_lexicon != '':\n        filter_lexicon = ReadLexicon(args.filter_lexicon_handle)\n    if args.top_N > 0:\n        TakeTopN(lexicon, args.top_N)\n    else:\n        NormalizeLexicon(lexicon, set_max_to_one = args.set_max_to_one,\n                         set_sum_to_one = args.set_sum_to_one,\n                         min_prob = args.min_prob)\n    WriteLexicon(args, lexicon, filter_lexicon)\n    args.out_lexicon_handle.close()\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/prune_pron_candidates.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nfrom collections import defaultdict\nimport argparse\nimport sys\nimport math\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(description = \"Prune pronunciation candidates based on soft-counts from lattice-alignment\"\n                                     \"outputs, and a reference lexicon. Basically, for each word we sort all pronunciation\"\n                                     \"cadidates according to their soft-counts, and then select the top r * N candidates\"\n                                     \"(For words in the reference lexicon, N = # pron variants given by the reference\"\n                                     \"lexicon; For oov words, N = avg. # pron variants per word in the reference lexicon).\"\n                                     \"r is a user-specified constant, like 2.\",\n                                     epilog = \"See steps/dict/learn_lexicon_greedy.sh for example\")\n\n    parser.add_argument(\"--r\", type = float, default = \"2.0\",\n                        help = \"a user-specified ratio parameter which determines how many\"\n                        \"pronunciation candidates we want to keep for each word.\")\n    parser.add_argument(\"pron_stats\", metavar = \"<pron-stats>\", type = str,\n                        help = \"File containing soft-counts of all pronounciation candidates; \"\n                        \"each line must be <soft-counts> <word> <phones>\")\n    parser.add_argument(\"ref_lexicon\", metavar = \"<ref-lexicon>\", type = str,\n                        help = \"Reference lexicon file, where we obtain # pron variants for\"\n                        \"each word, based on which we prune the pron candidates.\")\n    parser.add_argument(\"pruned_prons\", metavar = \"<pruned-prons>\", type = str,\n                        help = \"A file in lexicon format, 
which contains prons we want to\" \n                        \"prune away from the pron_stats file.\")\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    args.pron_stats_handle = open(args.pron_stats)\n    args.ref_lexicon_handle = open(args.ref_lexicon)\n    if args.pruned_prons == \"-\":\n        args.pruned_prons_handle = sys.stdout\n    else:\n        args.pruned_prons_handle = open(args.pruned_prons, \"w\")\n    return args\n\ndef ReadStats(pron_stats_handle):\n    stats = defaultdict(list)\n    for line in pron_stats_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in stats file.')\n        count = float(splits[0])\n        word = splits[1]\n        phones = ' '.join(splits[2:])\n        stats[word].append((phones, count))\n\n    for word, entry in stats.items():\n        entry.sort(key=lambda x: x[1])\n    return stats\n\ndef ReadLexicon(ref_lexicon_handle):\n    ref_lexicon = defaultdict(set)\n    for line in ref_lexicon_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in lexicon file.')\n        word = splits[0]\n        try:\n            phones = ' '.join(splits[2:])\n        except ValueError:\n            phones = ' '.join(splits[1:])\n        ref_lexicon[word].add(phones)\n    return ref_lexicon\n\ndef PruneProns(args, stats, ref_lexicon):\n    # Compute the average # pron variants counts per word in the reference lexicon.\n    num_words_ref = 0\n    num_prons_ref = 0\n    for word, prons in ref_lexicon.items():\n        num_words_ref += 1\n        num_prons_ref += 
len(prons)\n    avg_variants_counts_ref = math.ceil(float(num_prons_ref) / float(num_words_ref))\n\n    for word, entry in stats.items():\n        if word in ref_lexicon:\n            variants_counts = args.r * len(ref_lexicon[word])\n        else:\n            variants_counts = args.r * avg_variants_counts_ref\n        num_variants = 0\n        while num_variants < variants_counts:\n            try:\n                pron, prob = entry.pop()\n                if word not in ref_lexicon or pron not in ref_lexicon[word]:\n                    num_variants += 1\n            except IndexError:\n                break\n        \n    for word, entry in stats.items():\n        for pron, prob in entry:\n            if word not in ref_lexicon or pron not in ref_lexicon[word]:\n                print('{0} {1}'.format(word, pron), file=args.pruned_prons_handle)\n\ndef Main():\n    args = GetArgs()\n    ref_lexicon = ReadLexicon(args.ref_lexicon_handle)\n    stats = ReadStats(args.pron_stats_handle)\n    PruneProns(args, stats, ref_lexicon)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/select_prons_bayesian.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nfrom collections import defaultdict\nimport argparse\nimport sys\nimport math\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(description = \"Use a Bayesian framework to select\"\n                                     \"pronunciation candidates from three sources: reference lexicon\"\n                                     \", G2P lexicon and phonetic-decoding lexicon. The inputs are a word-stats file,\"\n                                     \"a pron-stats file, and three source lexicons (ref/G2P/phonetic-decoding).\"\n                                     \"We assume the pronunciations for each word follow a Categorical distribution\"\n                                     \"with Dirichlet priors. Thus, with user-specified prior counts (parameterized by\"\n                                     \"prior-mean and prior-count-tot) and observed counts from the pron-stats file, \"\n                                     \"we can compute posterior for each pron, and select candidates with highest\"\n                                     \"posteriors, until we hit user-specified variants-prob-mass/counts thresholds.\"\n                                     \"The outputs are: a file specifiying posteriors of all candidate (pron_posteriors),\"\n                                     \"a learned lexicon for words out of the ref. vocab (learned_lexicon_oov),\"\n                                     \"and a lexicon_edits file containing suggested modifications of prons, for\"\n                                     \"words within the ref. 
vocab (ref_lexicon_edits).\",\n                                     epilog = \"See steps/dict/learn_lexicon_bayesian.sh for example.\")\n    parser.add_argument(\"--prior-mean\", type = str, default = \"0,0,0\",\n                        help = \"Mean of priors (summing up to 1) assigned to three exclusive n\"\n                        \"pronunciatio sources: reference lexicon, g2p, and phonetic decoding. We \"\n                        \"recommend setting a larger prior mean for the reference lexicon, e.g. '0.6,0.2,0.2'\")\n    parser.add_argument(\"--prior-counts-tot\", type = float, default = 15.0,\n                        help = \"Total amount of prior counts we add to all pronunciation candidates of\"\n                        \"each word. By timing it with the prior mean of a source, and then dividing\"\n                        \"by the number of candidates (for a word) from this source, we get the\"\n                        \"prior counts we actually add to each candidate.\")\n    parser.add_argument(\"--variants-prob-mass\", type = float, default = 0.7,\n                        help = \"For each word, we pick up candidates (from all three sources)\"\n                        \"with highest posteriors until the total prob mass hit this amount.\")\n    parser.add_argument(\"--variants-prob-mass-ref\", type = float, default = 0.9,\n                        help = \"For each word, after the total prob mass of selected candidates \"\n                        \"hit variants-prob-mass, we continue to pick up reference candidates\"\n                        \"with highest posteriors until the total prob mass hit this amount (must >= variants-prob-mass).\")\n    parser.add_argument(\"--variants-counts\", type = int, default = 1,\n                        help = \"Generate upto this many variants of prons for each word out\"\n                        \"of the ref. 
lexicon.\")\n    parser.add_argument(\"silence_file\", metavar = \"<silphonetic-file>\", type = str,\n                        help = \"File containing a list of silence phones.\")\n    parser.add_argument(\"pron_stats_file\", metavar = \"<stats-file>\", type = str,\n                        help = \"File containing pronunciation statistics from lattice alignment; \"\n                        \"each line must be <count> <word> <phones>.\")\n    parser.add_argument(\"word_counts_file\", metavar = \"<counts-file>\", type = str,\n                        help = \"File containing word counts in acoustic training data; \"\n                        \"each line must be <word> <count>.\")\n    parser.add_argument(\"ref_lexicon\", metavar = \"<reference-lexicon>\", type = str,\n                        help = \"The reference lexicon (most probably hand-derived). \"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"g2p_lexicon\", metavar = \"<g2p-expanded-lexicon>\", type = str,\n                        help = \"Candidate pronunciations from G2P results. \"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"phonetic_decoding_lexicon\", metavar = \"<prons-in-acoustic-evidence>\", type = str,\n                        help = \"Candidate pronunciations from phonetic decoding results. \"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"pron_posteriors\", metavar = \"<pron-posteriors>\", type = str,\n                        help = \"Output file containing posteriors of all candidate prons for each word, \"\n                        \"based on which we select prons to construct the learned lexicon. \"\n                        \"Each line is <word> <pronunciation-source: one of R(ef)/G(2P)/P(hone-decoding)> <posterior> <pronunciation> \")\n    parser.add_argument(\"learned_lexicon_oov\", metavar = \"<learned-lexicon-oov>\", type = str,\n                        help = \"Output 
file which is the learned lexicon for words out of the ref. vocab.\")\n    parser.add_argument(\"ref_lexicon_edits\", metavar = \"<lexicon-edits>\", type = str,\n                        help = \"Output file containing human-readable & editable pronounciation info (and the\"\n                        \"accept/reject decision made by our algorithm) for those words in ref. vocab,\" \n                        \"to which any change has been recommended. The info for each word is like:\" \n                        \"------------ an 4086.0 --------------\"\n                        \"R  | Y |  2401.6 |  AH N\"\n                        \"R  | Y |  640.8 |  AE N\"\n                        \"P  | Y |  1035.5 |  IH N\"\n                        \"R(ef), P(hone-decoding) represents the pronunciation source\"\n                        \"Y/N means the recommended decision of including this pron or not\"\n                        \"and the numbers are soft counts accumulated from lattice-align-word outputs. \"\n                        \"See the function WriteEditsAndSummary for more details.\")\n\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    args.silence_file_handle = open(args.silence_file)\n    if args.pron_stats_file == \"-\":\n        args.pron_stats_file_handle = sys.stdin\n    else:\n        args.pron_stats_file_handle = open(args.pron_stats_file)\n    args.word_counts_file_handle = open(args.word_counts_file)\n    args.ref_lexicon_handle = open(args.ref_lexicon)\n    args.g2p_lexicon_handle = open(args.g2p_lexicon)\n    args.phonetic_decoding_lexicon_handle = open(args.phonetic_decoding_lexicon)\n    args.pron_posteriors_handle = open(args.pron_posteriors, \"w\")\n    args.learned_lexicon_oov_handle = open(args.learned_lexicon_oov, \"w\")\n    args.ref_lexicon_edits_handle = open(args.ref_lexicon_edits, \"w\")\n    \n    prior_mean = 
args.prior_mean.strip().split(',')\n    if len(prior_mean) != 3:\n        raise Exception('Invalid Dirichlet prior mean ', args.prior_mean)\n    for i in range(0,3):\n        if float(prior_mean[i]) <= 0 or float(prior_mean[i]) >= 1:\n            raise Exception('Dirichlet prior mean', prior_mean[i], 'is invalid, it must be between 0 and 1.')\n    args.prior_mean = [float(prior_mean[0]), float(prior_mean[1]), float(prior_mean[2])]\n\n    return args\n\ndef ReadPronStats(pron_stats_file_handle):\n    stats = {}\n    for line in pron_stats_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in stats file.')\n        count = float(splits[0])\n        word = splits[1]\n        phones = ' '.join(splits[2:])\n        stats[(word, phones)] = count\n    return stats\n\ndef ReadWordCounts(word_counts_file_handle):\n    counts = {}\n    for line in word_counts_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in counts file.')\n        word = splits[0]\n        count = int(splits[1])\n        counts[word] = count\n    return counts\n\ndef ReadLexicon(args, lexicon_file_handle, counts):\n    # we're skipping any word not in counts (not seen in training data),\n    # cause we're only learning prons for words who have acoustic examples.\n    lexicon = defaultdict(set)\n    for line in lexicon_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in lexicon file.')\n        word = splits[0]\n        if word not in counts:\n            continue\n        phones = 
' '.join(splits[1:])\n        lexicon[word].add(phones)\n    return lexicon\n\ndef FilterPhoneticDecodingLexicon(args, phonetic_decoding_lexicon, stats):\n    # We want to remove all candidates which contains silence phones\n    silphones = set()\n    for line in args.silence_file_handle:\n        silphones.add(line.strip())\n    rejected_candidates = set()\n    for word, prons in phonetic_decoding_lexicon.items():\n        for pron in prons:\n            for phone in pron.split():\n                if phone in silphones:\n                   if (word, pron) in stats:\n                       count = stats[(word, pron)]\n                       del stats[(word, pron)]\n                   else:\n                       count = 0\n                   rejected_candidates.add((word, pron))\n                   print('WARNING: removing the candidate pronunciation from phonetic-decoding: {0}: '\n                         '\"{1}\" whose soft-count from lattice-alignment is {2}, cause it contains at'\n                         ' least one silence phone.'.format(word, pron, count), file=sys.stderr)\n                   break\n    for word, pron in rejected_candidates:\n        phonetic_decoding_lexicon[word].remove(pron)\n    return phonetic_decoding_lexicon, stats\n\ndef ComputePriorCounts(args, counts, ref_lexicon, g2p_lexicon, phonetic_decoding_lexicon):\n    prior_counts = defaultdict(list)\n    # In case one source is absent for a word, we set zero prior to this source, \n    # and then re-normalize the prior mean parameters s.t. 
they sum up to one.\n    for word in counts:\n        prior_mean = [args.prior_mean[0], args.prior_mean[1], args.prior_mean[2]]\n        if word not in ref_lexicon:\n            prior_mean[0] = 0\n        if word not in g2p_lexicon:\n            prior_mean[1] = 0\n        if word not in phonetic_decoding_lexicon:\n            prior_mean[2] = 0\n        prior_mean_sum = sum(prior_mean)\n        try:\n            prior_mean = [float(t) / prior_mean_sum for t in prior_mean] \n        except ZeroDivisionError:\n            print('WARNING: word {} appears in train_counts but not in any lexicon.'.format(word), file=sys.stderr)\n        prior_counts[word] = [t * args.prior_counts_tot for t in prior_mean] \n    return prior_counts\n\ndef ComputePosteriors(args, stats, ref_lexicon, g2p_lexicon, phonetic_decoding_lexicon, prior_counts):\n    posteriors = defaultdict(list) # This dict stores a list of (pronunciation, posterior)\n    # pairs for each word, where the posteriors are normalized soft counts. 
Before normalization,\n    # The soft-counts were augmented by a user-specified prior count, according the source \n    # (ref/G2P/phonetic-decoding) of this pronunciation.\n\n    for word, prons in ref_lexicon.items():\n        for pron in prons:\n            # c is the augmented soft count (observed count + prior count)\n            c = float(prior_counts[word][0]) / len(ref_lexicon[word]) + stats.get((word, pron), 0)\n            posteriors[word].append((pron, c))\n\n    for word, prons in g2p_lexicon.items():\n        for pron in prons:\n            c = float(prior_counts[word][1]) / len(g2p_lexicon[word]) + stats.get((word, pron), 0)\n            posteriors[word].append((pron, c))\n\n    for word, prons in phonetic_decoding_lexicon.items():\n        for pron in prons:\n            c = float(prior_counts[word][2]) / len(phonetic_decoding_lexicon[word]) + stats.get((word, pron), 0)\n            posteriors[word].append((pron, c))\n\n    num_prons_from_ref = sum(len(ref_lexicon[i]) for i in ref_lexicon)\n    num_prons_from_g2p = sum(len(g2p_lexicon[i]) for i in g2p_lexicon)\n    num_prons_from_phonetic_decoding = sum(len(phonetic_decoding_lexicon[i]) for i in phonetic_decoding_lexicon)\n    print (\"---------------------------------------------------------------------------------------------------\", file=sys.stderr)\n    print ('Total num. 
words is {}:'.format(len(posteriors)), file=sys.stderr)\n    print ('{0} candidate prons came from the reference lexicon; {1} came from G2P;{2} came from'\n           'phonetic_decoding'.format(num_prons_from_ref, num_prons_from_g2p, num_prons_from_phonetic_decoding), file=sys.stderr)\n    print (\"---------------------------------------------------------------------------------------------------\", file=sys.stderr)\n\n    # Normalize the augmented soft counts to get posteriors.\n    count_sum = defaultdict(float) # This dict stores the pronunciation which has \n    # the sum of augmented soft counts for each word.\n    \n    for word in posteriors:\n        # each entry is a pair: (prounciation, count)\n        count_sum[word] = sum([entry[1] for entry in posteriors[word]])\n    \n    for word, entry in posteriors.items():\n        new_entry = []\n        for pron, count in entry:      \n            post = float(count) / count_sum[word]\n            new_entry.append((pron, post))\n            source = 'R'\n            if word in g2p_lexicon and pron in g2p_lexicon[word]:\n                source = 'G'\n            elif word in phonetic_decoding_lexicon and pron in phonetic_decoding_lexicon[word]:\n                source = 'P'\n            print(word, source, \"%3.2f\" % post, pron, file=args.pron_posteriors_handle)\n        del entry[:]\n        entry.extend(sorted(new_entry, key=lambda new_entry: new_entry[1]))\n    return posteriors\n\ndef SelectPronsBayesian(args, counts, posteriors, ref_lexicon, g2p_lexicon, phonetic_decoding_lexicon):\n    reference_selected = 0\n    g2p_selected = 0\n    phonetic_decoding_selected = 0\n    learned_lexicon = defaultdict(set)\n\n    for word, entry in posteriors.items():\n        num_variants = 0\n        post_tot = 0.0\n        variants_counts = args.variants_counts\n        variants_prob_mass = args.variants_prob_mass\n        if word in ref_lexicon:\n            # the variants count of the current word's prons in the ref 
lexicon.\n            variants_counts_ref = len(ref_lexicon[word])\n            # For words who don't appear in acoustic training data at all, we simply accept all ref prons.\n            # For words in ref. vocab, we set the max num. variants \n            if counts.get(word, 0) > 0:\n                variants_counts = math.ceil(1.5 * variants_counts_ref)\n            else:\n                variants_counts = variants_counts_ref\n                variants_prob_mass = 1.0\n        last_post = 0.0\n        while ((num_variants < variants_counts and post_tot < variants_prob_mass)\n               or (len(entry) > 0 and entry[-1][1] == last_post)): # this conditions \n               # means the posterior of the current pron is the same as the one we just included.\n            try:\n                pron, post = entry.pop()\n                last_post = post\n            except IndexError:\n                break\n            post_tot += post\n            learned_lexicon[word].add(pron)\n            num_variants += 1\n            if word in ref_lexicon and pron in ref_lexicon[word]:\n                reference_selected += 1\n            elif word in g2p_lexicon and pron in g2p_lexicon[word]:\n                g2p_selected += 1\n            else:\n                phonetic_decoding_selected += 1\n\n        while (num_variants < variants_counts and post_tot < args.variants_prob_mass_ref):\n            try:\n                pron, post = entry.pop()\n            except IndexError:\n                break\n            if word in ref_lexicon and pron in ref_lexicon[word]:\n                post_tot += post\n                learned_lexicon[word].add(pron)\n                num_variants += 1\n                reference_selected += 1\n\n    num_prons_tot = reference_selected + g2p_selected + phonetic_decoding_selected\n    print('---------------------------------------------------------------------------------------------------', file=sys.stderr)\n    print ('Num. 
words in the learned lexicon: {0} num. selected prons: {1}'.format(len(learned_lexicon), num_prons_tot), file=sys.stderr)\n    print ('{0} selected prons came from reference candidate prons; {1} came from G2P candidate prons;'\n           '{2} came from phonetic-decoding candidate prons.'.format(reference_selected, g2p_selected, phonetic_decoding_selected), file=sys.stderr) \n    return learned_lexicon\n\ndef WriteEditsAndSummary(args, learned_lexicon, ref_lexicon, phonetic_decoding_lexicon, g2p_lexicon, counts, stats):\n    # Note that learned_lexicon and ref_lexicon are dicts of sets of prons, while the other two lexicons are sets of (word, pron) pairs.\n    threshold = 3\n    words = [defaultdict(set) for i in range(4)] # \"words\" contains four bins, where we\n    # classify each word into, according to whether it's count > threshold,\n    # and whether it's OOVs w.r.t the reference lexicon.\n\n    src = {}\n    print(\"# Note: This file contains pronunciation info for words who have candidate\"\n          \"prons from G2P/phonetic-decoding accepted in the learned lexicon.\"\n          \", sorted by their counts in acoustic training data, \"\n          ,file=args.ref_lexicon_edits_handle)\n    print(\"# 1st Col: source of the candidate pron: G(2P) / P(hone-decoding) / R(eference).\"\n          ,file=args.ref_lexicon_edits_handle)\n    print(\"# 2nd Col: accepted or not in the learned lexicon (Y/N).\", file=args.ref_lexicon_edits_handle)\n    print(\"# 3rd Col: soft counts from lattice-alignment (not augmented by prior-counts).\"\n          ,file=args.ref_lexicon_edits_handle)\n    print(\"# 4th Col: the pronunciation cadidate.\", file=args.ref_lexicon_edits_handle)\n    \n    # words which are to be printed into the edits file.\n    words_to_edit = [] \n    for word in learned_lexicon:\n        count = counts.get(word, 0)\n        flags = ['0' for i in range(3)] # \"flags\" contains three binary indicators, \n        # indicating where this word's 
pronunciations come from.\n        for pron in learned_lexicon[word]:\n            if word in phonetic_decoding_lexicon and pron in phonetic_decoding_lexicon[word]:\n                flags[0] = '1'\n                src[(word, pron)] = 'P'\n            if word in ref_lexicon and pron in ref_lexicon[word]:\n                flags[1] = '1'\n                src[(word, pron)] = 'R'\n            if word in g2p_lexicon and pron in g2p_lexicon[word]:\n                flags[2] = '1'\n                src[(word, pron)] = 'G'\n        if word in ref_lexicon:\n            all_ref_prons_accepted = True\n            for pron in ref_lexicon[word]:\n                if pron not in learned_lexicon[word]:\n                    all_ref_prons_accepted = False\n                    break\n            if not all_ref_prons_accepted or flags[0] == '1' or flags[2] == '1':\n                words_to_edit.append((word, counts[word]))\n            if count > threshold:\n                words[0][flags[0] + flags[1] + flags[2]].add(word)\n            else:\n                words[1][flags[0] + flags[1] + flags[2]].add(word)\n        else:\n            if count > threshold: \n                words[2][flags[0] + flags[2]].add(word)\n            else:\n                words[3][flags[0] + flags[2]].add(word)\n\n    words_to_edit_sorted = sorted(words_to_edit, key=lambda entry: entry[1], reverse=True)\n    for word, count in words_to_edit_sorted:\n        print(\"------------\",word, \"%2.1f\" % count, \"--------------\", file=args.ref_lexicon_edits_handle)\n        for pron in learned_lexicon[word]:\n            print(src[(word, pron)], ' | Y | ', \"%2.1f | \" % stats.get((word, pron), 0), pron, \n                  file=args.ref_lexicon_edits_handle)\n        for pron in ref_lexicon[word]:\n            if pron not in learned_lexicon[word]:\n                soft_count = stats.get((word, pron), 0)\n                print('R  | N |  {:.2f} | {} '.format(soft_count, pron), file=args.ref_lexicon_edits_handle)\n  
  print(\"Here are the words whose reference pron candidates were all declined\", words[0]['100'], file=sys.stderr)\n    print(\"-------------------------------------------------Summary------------------------------------------\", file=sys.stderr)\n    print(\"In the learned lexicon, out of those\", len(ref_lexicon), \"words from the vocab of the reference lexicon:\", file=sys.stderr) \n    print(\"  For those frequent words whose counts in the training text > \", threshold, \":\", file=sys.stderr) \n    num_freq_ivs_from_all_sources = len(words[0]['111']) + len(words[0]['110']) + len(words[0]['011'])\n    num_freq_ivs_from_g2p_or_phonetic_decoding = len(words[0]['101']) + len(words[0]['001']) + len(words[0]['100'])\n    num_freq_ivs_from_ref = len(words[0]['010'])\n    num_infreq_ivs_from_all_sources = len(words[1]['111']) + len(words[1]['110']) + len(words[1]['011'])\n    num_infreq_ivs_from_g2p_or_phonetic_decoding = len(words[1]['101']) + len(words[1]['001']) + len(words[1]['100'])\n    num_infreq_ivs_from_ref = len(words[1]['010'])\n    print(' {} words\\' selected prons came from the reference lexicon, G2P/phonetic-decoding.'.format(num_freq_ivs_from_all_sources), file=sys.stderr)\n    print(' {} words\\' selected prons come from G2P/phonetic-decoding-generated.'.format(num_freq_ivs_from_g2p_or_phonetic_decoding), file=sys.stderr) \n    print(' {} words\\' selected prons came from the reference lexicon only.'.format(num_freq_ivs_from_ref), file=sys.stderr) \n    print('  For those words whose counts in the training text <= {}:'.format(threshold), file=sys.stderr) \n    print(' {} words\\' selected prons came from the reference lexicon, G2P/phonetic-decoding.'.format(num_infreq_ivs_from_all_sources), file=sys.stderr)\n    print(' {} words\\' selected prons come from G2P/phonetic-decoding-generated.'.format(num_infreq_ivs_from_g2p_or_phonetic_decoding), file=sys.stderr) \n    print(' {} words\\' selected prons came from the reference lexicon 
only.'.format(num_infreq_ivs_from_ref), file=sys.stderr) \n    print(\"---------------------------------------------------------------------------------------------------\", file=sys.stderr)\n    num_oovs = len(learned_lexicon) - len(ref_lexicon)\n    num_freq_oovs_from_both_sources = len(words[2]['11'])\n    num_freq_oovs_from_phonetic_decoding = len(words[2]['10'])\n    num_freq_oovs_from_g2p = len(words[2]['01'])\n    num_infreq_oovs_from_both_sources = len(words[3]['11'])\n    num_infreq_oovs_from_phonetic_decoding = len(words[3]['10'])\n    num_infreq_oovs_from_g2p = len(words[3]['01'])\n    print('  In the learned lexicon, out of those {} OOV words (w.r.t the reference lexicon):'.format(num_oovs), file=sys.stderr)\n    print('  For those words whose counts in the training text > {}:'.format(threshold), file=sys.stderr)\n    print('    {} words\\' selected prons came from G2P and phonetic-decoding.'.format(num_freq_oovs_from_both_sources), file=sys.stderr)\n    print('    {} words\\' selected prons came from phonetic decoding only.'.format(num_freq_oovs_from_phonetic_decoding), file=sys.stderr) \n    print('    {} words\\' selected prons came from G2P only.'.format(num_freq_oovs_from_g2p), file=sys.stderr) \n    print('  For those words whose counts in the training text <= {}:'.format(threshold), file=sys.stderr) \n    print('    {} words\\' selected prons came from G2P and phonetic-decoding.'.format(num_infreq_oovs_from_both_sources), file=sys.stderr)\n    print('    {} words\\' selected prons came from phonetic decoding only.'.format(num_infreq_oovs_from_phonetic_decoding), file=sys.stderr) \n    print('    {} words\\' selected prons came from G2P only.'.format(num_infreq_oovs_from_g2p), file=sys.stderr) \n\ndef WriteLearnedLexiconOov(learned_lexicon, ref_lexicon, file_handle):\n    for word, prons in learned_lexicon.items():\n        if word not in ref_lexicon:\n            for pron in prons:\n                print('{0} {1}'.format(word, pron), 
file=file_handle)\n    file_handle.close()\n\ndef Main():\n    args = GetArgs()\n\n    # Read in three lexicon sources, word counts, and pron stats.\n    counts = ReadWordCounts(args.word_counts_file_handle)\n    ref_lexicon = ReadLexicon(args, args.ref_lexicon_handle, counts)\n    g2p_lexicon = ReadLexicon(args, args.g2p_lexicon_handle, counts)\n    phonetic_decoding_lexicon =  ReadLexicon(args, args.phonetic_decoding_lexicon_handle, counts)\n    stats = ReadPronStats(args.pron_stats_file_handle)\n    phonetic_decoding_lexicon, stats = FilterPhoneticDecodingLexicon(args, phonetic_decoding_lexicon, stats)\n   \n    # Compute prior counts\n    prior_counts = ComputePriorCounts(args, counts, ref_lexicon, g2p_lexicon, phonetic_decoding_lexicon)\n    # Compute posteriors, and then select prons to construct the learned lexicon.\n    posteriors = ComputePosteriors(args, stats, ref_lexicon, g2p_lexicon, phonetic_decoding_lexicon, prior_counts)\n\n    # Select prons to construct the learned lexicon.\n    learned_lexicon = SelectPronsBayesian(args, counts, posteriors, ref_lexicon, g2p_lexicon, phonetic_decoding_lexicon)\n    \n    # Write the learned prons for words out of the ref. vocab into learned_lexicon_oov.\n    WriteLearnedLexiconOov(learned_lexicon, ref_lexicon, args.learned_lexicon_oov_handle)\n    # Edits will be printed into ref_lexicon_edits, and the summary will be printed into stderr.\n    WriteEditsAndSummary(args, learned_lexicon, ref_lexicon, phonetic_decoding_lexicon, g2p_lexicon, counts, stats)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/select_prons_greedy.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2018  Xiaohui Zhang\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom collections import defaultdict\nimport argparse\nimport sys\nimport math\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description = \"Use a greedy framework to select pronunciation candidates\"\n        \"from three sources: a reference lexicon, G2P lexicon and phonetic-decoding\"\n        \"(PD) lexicon. Basically, this script implements the Alg. 1 in the paper:\"\n        \"Acoustic data-driven lexicon learning based on a greedy pronunciation \"\n        \"selection framework, by X. Zhang, V. Mahonar, D. Povey and S. Khudanpur,\"\n        \"Interspeech 2017. The inputs are an arc-stats file, containing \"\n        \"acoustic evidence (tau_{uwb} in the paper) and three source lexicons \"\n        \"(phonetic-decoding(PD)/G2P/ref). The outputs is the learned lexicon for\"\n        \"all words in the arc_stats (acoustic evidence) file.\",\n        epilog = \"See steps/dict/learn_lexicon_greedy.sh for example.\")\n    parser.add_argument(\"--alpha\", type = str, default = \"0,0,0\",\n                        help = \"Scaling factors for the likelihood reduction threshold.\"\n                        \"of three pronunciaiton candidate sources: phonetic-decoding (PD),\"\n                        \"G2P and reference. The valid range of each dimension is [0, 1], and\"\n                        \"a large value means we prune pronunciations from this source more\"\n                        \"aggressively. Setting a dimension to zero means we never want to remove\"\n                        \"pronunciaiton from that source. 
See Section 4.3 in the paper for details.\")\n    parser.add_argument(\"--beta\", type = str, default = \"0,0,0\",\n                        help = \"Smoothing factors for the likelihood reduction term \"\n                        \"of three pronunciation candidate sources: phonetic-decoding (PD), \"\n                        \"G2P and reference. The valid range of each dimension is [0, 100], and \"\n                        \"a large value means we prune pronunciations from this source more \"\n                        \"aggressively. See Section 4.3 in the paper for details.\")\n    parser.add_argument(\"--delta\", type = float, default = 0.000000001,\n                        help = \"Floor value of the pronunciation posterior statistics. \"\n                        \"The valid range is (0, 0.01). \"\n                        \"See Section 3 in the paper for details.\")\n    parser.add_argument(\"silence_phones_file\", metavar = \"<silphone-file>\", type = str,\n                        help = \"File containing a list of silence phones.\")\n    parser.add_argument(\"arc_stats_file\", metavar = \"<arc-stats-file>\", type = str,\n                        help = \"File containing word-pronunciation statistics obtained from lattices; \"\n                        \"each line must be <word> <utt-id> <start-frame> <count> <phones>\")\n    parser.add_argument(\"word_counts_file\", metavar = \"<counts-file>\", type = str,\n                        help = \"File containing word counts in acoustic training data; \"\n                        \"each line must be <word> <count>.\")\n    parser.add_argument(\"ref_lexicon\", metavar = \"<reference-lexicon>\", type = str,\n                        help = \"The reference lexicon (most probably hand-derived). \"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"g2p_lexicon\", metavar = \"<g2p-expanded-lexicon>\", type = str,\n                        help = \"Candidate pronunciations from G2P results. \"\n       
                 \"Each line must be <word> <phones>\")\n    parser.add_argument(\"pd_lexicon\", metavar = \"<phonetic-decoding-lexicon>\", type = str,\n                        help = \"Candidate pronunciations from phonetic decoding results. \"\n                        \"Each line must be <word> <phones>\")\n    parser.add_argument(\"learned_lexicon\", metavar = \"<learned-lexicon>\", type = str,\n                        help = \"Learned lexicon.\")\n\n\n    print (' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    args.silence_phones_file_handle = open(args.silence_phones_file)\n    if args.arc_stats_file == \"-\":\n        args.arc_stats_file_handle = sys.stdin\n    else:\n        args.arc_stats_file_handle = open(args.arc_stats_file)\n    args.word_counts_file_handle = open(args.word_counts_file)\n    args.ref_lexicon_handle = open(args.ref_lexicon)\n    args.g2p_lexicon_handle = open(args.g2p_lexicon)\n    args.pd_lexicon_handle = open(args.pd_lexicon)\n    args.learned_lexicon_handle = open(args.learned_lexicon, \"w\")\n    \n    alpha = args.alpha.strip().split(',')\n    if len(alpha) != 3:\n        raise Exception('Invalid alpha ', args.alpha)\n    for i in range(0,3):\n        if float(alpha[i]) < 0 or float(alpha[i]) > 1:\n            raise Exception('alpha ', alpha[i], \n                            ' is invalid, it must be within [0, 1].')\n        if float(alpha[i]) == 0:\n            alpha[i] = -1e-3\n        # The absolute likelihood loss (search for loss_abs) is supposed to be positive.\n        # But it could be negative near zero because of numerical precision limit.\n        # In this case, even if alpha is set to be zero, which means we never want to\n        # remove pronunciation from that source, the quality score (search for q_b)\n        # could still be negative, which means this pron could be potentially removed.\n        # To prevent this, we 
set alpha as a negative value near zero to ensure\n        # q_b is always positive.\n\n    args.alpha = [float(alpha[0]), float(alpha[1]), float(alpha[2])]\n    print(\"[alpha_{pd}, alpha_{g2p}, alpha_{ref}] is: \", args.alpha)\n    beta = args.beta.strip().split(',')\n    if len(beta) != 3:\n        raise Exception('Invalid beta ', args.beta)\n    for i in range(0,3):\n        if float(beta[i]) < 0 or float(beta[i]) > 100:\n            raise Exception('beta ', beta[i], \n                            ' is invalid, it must be within [0, 100].')\n    args.beta = [float(beta[0]), float(beta[1]), float(beta[2])]\n    print(\"[beta_{pd}, beta_{g2p}, beta_{ref}] is: \", args.beta)\n\n    # The documented valid range of delta is the open interval (0, 0.01).\n    if args.delta <= 0 or args.delta >= 0.01:\n        raise Exception('delta ', args.delta, ' is invalid, it must be within '\n                        '(0, 0.01).')\n    print(\"delta is: \", args.delta)\n\n    return args\n\ndef ReadArcStats(arc_stats_file_handle):\n    stats = defaultdict(lambda : defaultdict(dict))\n    stats_summed = defaultdict(float)\n    for line in arc_stats_file_handle.readlines():\n        splits = line.strip().split()\n\n        if (len(splits) == 0):\n            continue\n\n        if (len(splits) < 5):\n            raise Exception('Invalid format of line ' + line\n                                + ' in the arc-stats file.')\n        utt = splits[1]\n        start_frame = int(splits[2])\n        word = splits[0]\n        count = float(splits[3])\n        phones = splits[4:]\n        phones = ' '.join(phones)\n        stats[word][(utt, start_frame)][phones] = count\n        stats_summed[(word, phones)] += count\n    return stats, stats_summed\n\ndef ReadWordCounts(word_counts_file_handle):\n    counts = {}\n    for line in word_counts_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in counts file.')\n        
word = splits[0]\n        count = int(splits[1])\n        counts[word] = count\n    return counts\n\ndef ReadLexicon(args, lexicon_file_handle, counts):\n    # we're skipping any word not in counts (not seen in training data),\n    # cause we're only learning prons for words who have acoustic examples.\n    lexicon = defaultdict(set)\n    for line in lexicon_file_handle.readlines():\n        splits = line.strip().split()\n        if len(splits) == 0:\n            continue\n        if len(splits) < 2:\n            raise Exception('Invalid format of line ' + line\n                                + ' in lexicon file.')\n        word = splits[0]\n        if word not in counts:\n            continue\n        phones = ' '.join(splits[1:])\n        lexicon[word].add(phones)\n    return lexicon\n\ndef FilterPhoneticDecodingLexicon(args, pd_lexicon):\n    # We want to remove all candidates which contain silence phones\n    silphones = set()\n    for line in args.silence_phones_file_handle:\n        silphones.add(line.strip())\n    rejected_candidates = set()\n    for word, prons in pd_lexicon.items():\n        for pron in prons:\n            for phone in pron.split():\n                if phone in silphones:\n                   rejected_candidates.add((word, pron))\n                   break\n    for word, pron in rejected_candidates:\n        pd_lexicon[word].remove(pron)\n    return pd_lexicon\n\n# One iteration of Expectation-Maximization computation (Eq. 
3-4 in the paper).\ndef OneEMIter(args, word, stats, prons, pron_probs, debug=False):\n    prob_acc = [0.0 for i in range(len(prons[word]))]\n    s = sum(pron_probs)\n    for i in range(len(pron_probs)):\n        pron_probs[i] = pron_probs[i] / s\n    log_like = 0.0\n    for (utt, start_frame) in stats[word]:\n        prob = []\n        soft_counts = []\n        for i in range(len(prons[word])):\n            phones = prons[word][i]\n            soft_count = stats[word][(utt, start_frame)].get(phones, 0)\n            if soft_count < args.delta: \n                soft_count = args.delta\n            soft_counts.append(soft_count)\n        prob = [i[0] * i[1] for i in zip(soft_counts, pron_probs)]\n        for i in range(len(prons[word])):\n            prob_acc[i] += prob[i] / sum(prob)\n        log_like += math.log(sum(prob))\n    pron_probs = [1.0 / float(len(stats[word])) * p for p in prob_acc]\n    log_like = 1.0 / float(len(stats[word])) * log_like\n    if debug:\n        print(\"Log_like of the word: \", log_like, \"pron probs: \", pron_probs)\n    return pron_probs, log_like\n\ndef SelectPronsGreedy(args, stats, counts, ref_lexicon, g2p_lexicon, pd_lexicon, dianostic_info=False):\n    prons = defaultdict(list) # Put all possible prons from three source lexicons into this dictionary\n    src = {} # Source of each (word, pron) pair: 'P' = phonetic-decoding, 'G' = G2P, 'R' = reference\n    learned_lexicon = defaultdict(set) # Put all selected prons in this dictionary\n    for lexicon in ref_lexicon, g2p_lexicon, pd_lexicon:\n        for word in lexicon:\n            for pron in lexicon[word]:\n                prons[word].append(pron)\n    for word in prons:\n        for pron in prons[word]:\n            if word in pd_lexicon and pron in pd_lexicon[word]:\n                src[(word, pron)] = 'P'\n            if word in g2p_lexicon and pron in g2p_lexicon[word]:\n                src[(word, pron)] = 'G'\n            if word in ref_lexicon and pron in 
ref_lexicon[word]:\n                src[(word, pron)] = 'R'\n   \n    for word in prons:\n        if word not in stats:\n            continue\n        n = len(prons[word])\n        pron_probs = [1/float(n) for i in range(n)]\n        if dianostic_info:\n            print(\"pronunciations of word '{}': {}\".format(word, prons[word]))\n        active_indexes = set(range(len(prons[word])))\n       \n        deleted_prons = [] # indexes of prons to be deleted\n        soft_counts_normalized = []\n        while len(active_indexes) > 1:\n            log_like = 1.0\n            log_like_last = -1.0\n            num_iters = 0\n            while abs(log_like - log_like_last) > 1e-7:\n                num_iters += 1\n                log_like_last = log_like\n                pron_probs, log_like = OneEMIter(args, word, stats, prons, pron_probs, False)\n                if log_like_last == 1.0 and len(soft_counts_normalized) == 0: # the first iteration\n                    soft_counts_normalized = pron_probs\n                    if dianostic_info: \n                        print(\"Avg.(over all egs) soft counts: {}\".format(soft_counts_normalized))\n            if dianostic_info:\n                print(\"\\n Log_like after {} iters of EM: {}, estimated pron_probs: {} \\n\".format(\n                        num_iters, log_like, pron_probs))\n            candidates_to_delete = []\n            \n            for i in active_indexes:\n                pron_probs_mod = [p for p in pron_probs]\n                pron_probs_mod[i] = 0.0\n                for j in range(len(pron_probs_mod)):\n                    if j in active_indexes and j != i:\n                        pron_probs_mod[j] += 0.01\n                pron_probs_mod = [s / sum(pron_probs_mod) for s in pron_probs_mod]\n                log_like2 = 1.0\n                log_like2_last = -1.0\n                num_iters2 = 0\n                # Running EM until convengence\n                while abs(log_like2 - log_like2_last) > 0.001 
:\n                    num_iters2 += 1\n                    log_like2_last = log_like2\n                    pron_probs_mod, log_like2 = OneEMIter(args, word, stats,\n                                                          prons, pron_probs_mod, False)\n                \n                loss_abs = log_like - log_like2 # absolute likelihood loss before normalization\n                # (supposed to be positive, but could be negative near zero because of numerical precision limit).\n                log_delta = math.log(args.delta)\n                thr = -log_delta\n                loss = loss_abs\n                source = src[(word, prons[word][i])]\n                if dianostic_info:\n                    print(\"\\n set the pron_prob of '{}' whose source is {}, to zero results in {}\"\n                    \" loss in avg. log-likelihood; Num. iters until converging:{}. \".format(\n                      prons[word][i], source, loss, num_iters2))\n                # Compute quality score q_b = loss_abs * M_w / (M_w + beta_s(b)) + alpha_s(b) * log_delta,\n                # where M_w = len(stats[word]) is the number of acoustic examples of the word.\n                # See Sec. 4.3 and Alg. 1 in the paper.\n                if source == 'P':\n                   thr *= args.alpha[0]\n                   loss *= float(len(stats[word])) / (float(len(stats[word])) + args.beta[0])\n                if source == 'G':\n                   thr *= args.alpha[1]\n                   loss *= float(len(stats[word])) / (float(len(stats[word])) + args.beta[1])\n                if source == 'R':\n                   thr *= args.alpha[2]\n                   loss *= float(len(stats[word])) / (float(len(stats[word])) + args.beta[2])\n                if loss - thr < 0: # loss - thr here is just q_b\n                   if dianostic_info:\n                       print(\"Smoothed log-like loss {} is smaller than threshold {} so that the quality \"\n                             \"score {} is negative, adding the pron to the list of candidates to delete\"\n                             \". 
\".format(loss, thr, loss-thr))\n                   candidates_to_delete.append((loss-thr, i))\n            if len(candidates_to_delete) == 0:\n                break\n            candidates_to_delete_sorted = sorted(candidates_to_delete, \n                                                 key=lambda candidate: candidate[0])\n\n            deleted_candidate = candidates_to_delete_sorted[0]\n            active_indexes.remove(deleted_candidate[1])\n            pron_probs[deleted_candidate[1]] = 0.0\n            for i in range(len(pron_probs)):\n                if i in active_indexes:\n                    pron_probs[i] += 0.01\n            pron_probs = [s / sum(pron_probs) for s in pron_probs]\n            source = src[(word, prons[word][deleted_candidate[1]])]\n            pron = prons[word][deleted_candidate[1]]\n            soft_count = soft_counts_normalized[deleted_candidate[1]]\n            quality_score = deleted_candidate[0]\n            # This part of diagnostic info provides hints to the user on how to adjust the parameters.\n            # Literal braces in the messages are doubled so that str.format() does not treat them as fields.\n            if dianostic_info:\n                print(\"removed pron {}, from source {} with quality score {:.5f}\".format(\n                        pron, source, quality_score)) \n                if (source == 'P' and soft_count > 0.7 and len(stats[word]) > 5):\n                    print(\"WARNING: alpha_{{pd}} or beta_{{pd}} may be too large!\"\n                          \"    For the word '{}' whose count is {}, the candidate \"\n                          \"    pronunciation from phonetic decoding '{}' with normalized \"\n                          \"    soft count {} (out of 1) is rejected. 
It shouldn't have been\"\n                          \"    rejected if alpha_{{pd}} is smaller than {}\".format(\n                            word, len(stats[word]), pron, soft_count, -loss / log_delta),\n                            file=sys.stderr)\n                    if loss_abs > thr:\n                        print(\"    or beta_{{pd}} is smaller than {}\".format(\n                                (loss_abs / thr - 1) * len(stats[word])), file=sys.stderr)\n                if (source == 'G' and soft_count > 0.7 and len(stats[word]) > 5):\n                    print(\"WARNING: alpha_{{g2p}} or beta_{{g2p}} may be too large!\"\n                          \"    For the word '{}' whose count is {}, the candidate \"\n                          \"    pronunciation from G2P '{}' with normalized \"\n                          \"    soft count {} (out of 1) is rejected. It shouldn't have been\"\n                          \"    rejected if alpha_{{g2p}} is smaller than {} \".format(\n                            word, len(stats[word]), pron, soft_count, -loss / log_delta),\n                          file=sys.stderr)\n                    if loss_abs > thr:\n                        print(\"    or beta_{{g2p}} is smaller than {}.\".format((\n                                loss_abs / thr - 1) * len(stats[word])), file=sys.stderr)\n            deleted_prons.append(deleted_candidate[1])\n        for i in range(len(prons[word])):\n            if i not in deleted_prons:\n                learned_lexicon[word].add(prons[word][i])\n\n    return learned_lexicon\n\ndef WriteLearnedLexicon(learned_lexicon, file_handle):\n    for word, prons in learned_lexicon.items():\n        for pron in prons:\n            print('{0} {1}'.format(word, pron), file=file_handle)\n    file_handle.close()\n\ndef Main():\n    args 
= GetArgs()\n    \n    # Read in three lexicon sources, word counts, and pron stats.\n    counts = ReadWordCounts(args.word_counts_file_handle)\n    ref_lexicon = ReadLexicon(args, args.ref_lexicon_handle, counts)\n    g2p_lexicon = ReadLexicon(args, args.g2p_lexicon_handle, counts)\n    pd_lexicon =  ReadLexicon(args, args.pd_lexicon_handle, counts)\n    stats, stats_summed = ReadArcStats(args.arc_stats_file_handle)\n    pd_lexicon = FilterPhoneticDecodingLexicon(args, pd_lexicon)\n                  \n    # Select prons to construct the learned lexicon.\n    learned_lexicon = SelectPronsGreedy(args, stats, counts, ref_lexicon, g2p_lexicon, pd_lexicon)\n    \n    # Write the learned prons for words out of the ref. vocab into learned_lexicon_oov.\n    WriteLearnedLexicon(learned_lexicon, args.learned_lexicon_handle)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/dict/train_g2p.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2014  Johns Hopkins University (Author: Yenda Trmal)\n# Copyright 2016  Xiaohui Zhang\n# Apache 2.0\n\n# Begin configuration section.\niters=5\nstage=0\nencoding='utf-8'\nonly_words=true\ncmd=run.pl\n# a list of silence phones, like data/local/dict/silence_phones.txt\nsilence_phones=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. utils/parse_options.sh || exit 1;\n\nset -u\nset -e\n\nif [ $# != 2 ]; then\n   echo \"Usage: $0 [options] <lexicon-in> <work-dir>\"\n   echo \"    where <lexicon-in> is the training lexicon (one pronunciation per \"\n   echo \"    word per line, with lines like 'hello h uh l ow') and\"\n   echo \"    <work-dir> is the directory where the models will be stored\"\n   echo \"e.g.: train_g2p.sh --silence-phones data/local/dict/silence_phones.txt data/local/dict/lexicon.txt exp/g2p/\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --iters <int>                                    # How many iterations. Relates to N-gram order\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --silence-phones <silphones-list>                # e.g. data/local/dict/silence_phones.txt.\"\n   echo \"                                                   # A list of silence phones, one or more per line\"\n   echo \"                                                   # Relates to  --only-words option\"\n   echo \"  --only-words (true|false)    (default: true)     # If true, exclude silence words, i.e.\"\n   echo \"                                                   # words with 1 phone which is a silence.\"\n   exit 1;\nfi\n\nlexicon=$1\nwdir=$2\n\n\nmkdir -p $wdir/log\n\n[ ! 
-f $lexicon ] && echo \"$0: Training lexicon does not exist.\" && exit 1\n\n# Optionally remove words that are mapped to a single silence phone from the lexicon.\nif $only_words && [ ! -z \"$silence_phones\" ]; then\n  awk -v s=$silence_phones \\\n    'BEGIN{while((getline<s)>0) {for(i=1;i<=NF;i++) sil[$i]=1;}}\n    {if (!(NF == 2 && $2 in sil)) print;}' $lexicon > $wdir/lexicon_onlywords.txt\n  lexicon=$wdir/lexicon_onlywords.txt\nfi\n\nif ! g2p=`which g2p.py` ; then\n  echo \"Sequitur was not found !\"\n  echo \"Go to $KALDI_ROOT/tools and execute extras/install_sequitur.sh\"\n  exit 1\nfi\n\necho \"Training the G2P model (iter 0)\"\n\nif [ $stage -le 0 ]; then\n  $cmd $wdir/log/g2p.0.log \\\n    g2p.py -S --encoding $encoding --train $lexicon --devel 5% --write-model $wdir/g2p.model.0\nfi\n\nfor i in `seq 0 $(($iters-2))`; do\n\n  echo \"Training the G2P model (iter $[$i + 1] )\"\n\n  if [ $stage -le $i ]; then\n    $cmd $wdir/log/g2p.$(($i + 1)).log \\\n      g2p.py -S --encoding $encoding --model $wdir/g2p.model.$i --ramp-up --train $lexicon --devel 5% --write-model $wdir/g2p.model.$(($i+1))\n  fi\n\ndone\n\n! (set -e; cd $wdir; ln -sf g2p.model.$[$iters-1] g2p.model.final ) && echo \"Problem finalizing training... \" && exit 1\n\nif [ $stage -le $(($i + 2)) ]; then\n  echo \"Running test...\"\n  $cmd $wdir/log/test.log \\\n    g2p.py --encoding $encoding --model $wdir/g2p.model.final --test $lexicon\nfi\n"
  },
  {
    "path": "egs/steps/dict/train_g2p_phonetisaurus.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Intellisist, Inc. (Author: Navneeth K)\n#           2017  Xiaohui Zhang\n#           2018  Ruizhe Huang\n# Apache License 2.0\n\n# This script trains a g2p model using Phonetisaurus.\n\nstage=0\nencoding='utf-8'\nonly_words=true\nsilence_phones=\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. utils/parse_options.sh || exit 1;\n\nset -u\nset -e\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [options] <lexicon-in> <work-dir>\"\n  echo \"    where <lexicon-in> is the training lexicon (one pronunciation per \"\n  echo \"    word per line, with lines like 'hello h uh l ow') and\"\n  echo \"    <work-dir> is directory where the models will be stored\"\n  echo \"e.g.: $0 --silence-phones data/local/dict/silence_phones.txt data/local/dict/lexicon.txt exp/g2p/\"\n  echo \"\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --silence-phones <silphones-list>                # e.g. data/local/dict/silence_phones.txt.\"\n  echo \"                                                   # A list of silence phones, one or more per line\"\n  echo \"                                                   # Relates to  --only-words option\"\n  echo \"  --only-words (true|false)    (default: true)     # If true, exclude silence words, i.e.\"\n  echo \"                                                   # words with one or multiple phones which are all silence.\"\n  exit 1;\nfi\n\nlexicon=$1\nwdir=$2\n\n# Exit with a non-zero status so that callers notice the missing lexicon.\n[ ! -f $lexicon ] && echo \"Cannot find $lexicon\" && exit 1\n\nisuconv=`which uconv`\nif [ -z $isuconv ]; then\n  echo \"uconv was not found. You must install the icu4c package.\"\n  exit 1;\nfi\n\nif ! 
phonetisaurus=`which phonetisaurus-apply` ; then\n  echo \"Phonetisaurus was not found !\"\n  echo \"Go to $KALDI_ROOT/tools and execute extras/install_phonetisaurus.sh\"\n  exit 1\nfi\n\nmkdir -p $wdir\n\n\n# For input lexicon, remove pronunciations containing non-utf-8-encodable characters,\n# and optionally remove words that are mapped to a single silence phone from the lexicon.\nif [ $stage -le 0 ]; then\n  if $only_words && [ ! -z \"$silence_phones\" ]; then\n    awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s\" \"$i; if(!(s in a)) print $1\" \"s}' \\\n      $silence_phones $lexicon | \\\n      awk '{printf(\"%s\\t\",$1); for (i=2;i<NF;i++){printf(\"%s \",$i);} printf(\"%s\\n\",$NF);}' | \\\n      uconv -f \"$encoding\"  -t \"$encoding\" -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt\n  else\n    awk '{printf(\"%s\\t\",$1); for (i=2;i<NF;i++){printf(\"%s \",$i);} printf(\"%s\\n\",$NF);}' $lexicon | \\\n      uconv -f \"$encoding\" -t \"$encoding\" -x Any-NFC - | awk 'NF > 0'> $wdir/lexicon_tab_separated.txt\n  fi\nfi\n\nif [ $stage -le 1 ]; then\n  # Align lexicon stage. Lexicon is assumed to have first column tab separated\n  phonetisaurus-align --input=$wdir/lexicon_tab_separated.txt --ofile=${wdir}/aligned_lexicon.corpus || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  # Convert aligned lexicon to arpa using make_kn_lm.py, a re-implementation of srilm's ngram-count functionality.\n  ./utils/lang/make_kn_lm.py -ngram-order 7 -text ${wdir}/aligned_lexicon.corpus -lm ${wdir}/aligned_lexicon.arpa\nfi\n\nif [ $stage -le 3 ]; then\n  # Convert the arpa file to FST.\n  phonetisaurus-arpa2wfst --lm=${wdir}/aligned_lexicon.arpa --ofile=${wdir}/model.fst\nfi\n\n"
  },
  {
    "path": "egs/steps/get_ctm.sh",
    "content": "#!/usr/bin/env bash\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2012.  Apache 2.0.\n\n# This script produces CTM files from a decoding directory that has lattices                                                                         \n# present. It does this for a range of language model weights; see also \n# get_ctm_fast.sh which does it for just one LM weight and also supports\n# the word insertion penalty, and get_ctm_conf.sh which outputs CTM files\n# with confidence scores.\n\n\n# begin configuration section.\ncmd=run.pl\nstage=0\nframe_shift=0.01\nmin_lmwt=5\nmax_lmwt=20\nuse_segments=true # if we have a segments file, use it to convert\n                  # the segments to be relative to the original files.\nprint_silence=false\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <data-dir> <lang-dir|graph-dir> <decode-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      
# specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --use-segments (true|false)     # use segments and reco2file_and_channel files \"\n  echo \"                                    # to produce a ctm relative to the original audio\"\n  echo \"                                    # files, with channel information (typically needed\"\n  echo \"                                    # for NIST scoring).\"\n  echo \"    --frame-shift (default=0.01)    # specify this if your lattices have a frame-shift\"\n  echo \"                                    # not equal to 0.01 seconds\"\n  echo \"e.g.:\"\n  echo \"$0 data/train data/lang exp/tri4a/decode/\"\n  echo \"See also: steps/get_train_ctm.sh, steps/get_ctm_fast.sh, steps/get_ctm_conf.sh\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\ndir=$3\n\nmodel=$dir/../final.mdl # assume model one level up from decoding dir.\n\n\nfor f in $lang/words.txt $model $dir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\nname=`basename $data`; # e.g. eval2000\n\nmkdir -p $dir/scoring/log\n\nif [ $stage -le 0 ]; then\n  if [ -f $data/segments ] && $use_segments; then\n    f=$data/reco2file_and_channel\n    [ ! 
-f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\n    filter_cmd=\"utils/convert_ctm.pl $data/segments $data/reco2file_and_channel\"\n  else\n    filter_cmd=cat\n  fi\n\n  nj=$(cat $dir/num_jobs)\n  lats=$(for n in $(seq $nj); do echo -n \"$dir/lat.$n.gz \"; done)\n  if [ -f $lang/phones/word_boundary.int ]; then\n    $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring/log/get_ctm.LMWT.log \\\n      set -o pipefail '&&' mkdir -p $dir/score_LMWT/ '&&' \\\n      lattice-1best --lm-scale=LMWT \"ark:gunzip -c $lats|\" ark:- \\| \\\n      lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \\| \\\n      nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt \\| \\\n      $filter_cmd '>' $dir/score_LMWT/$name.ctm || exit 1;\n  elif [ -f $lang/phones/align_lexicon.int ]; then\n    $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring/log/get_ctm.LMWT.log \\\n      set -o pipefail '&&' mkdir -p $dir/score_LMWT/ '&&' \\\n      lattice-1best --lm-scale=LMWT \"ark:gunzip -c $lats|\" ark:- \\| \\\n      lattice-align-words-lexicon $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n      lattice-1best ark:- ark:- \\| \\\n      nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt \\| \\\n      $filter_cmd '>' $dir/score_LMWT/$name.ctm || exit 1;\n  else\n    echo \"$0: neither $lang/phones/word_boundary.int nor $lang/phones/align_lexicon.int exists: cannot align.\"\n    exit 1;\n  fi\nfi\n\n"
  },
  {
    "path": "egs/steps/get_ctm_conf_fast.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#           2017  Vimal Manohar\n#           2018  Xiaohui Zhang\n#           2018  Music Technology Group, Universitat Pompeu Fabra.\n# Apache 2.0\n\n# This script produces CTM files with confidence scores\n# from a decoding directory that has lattices\n# present. It does this for one LM weight and also supports \n# the word insertion penalty.\n# This is similar to get_ctm_conf.sh, but gets the CTM at the utterance-level.\n# It can be faster than steps/get_ctm_conf.sh --use-segments false as it splits\n# the process across many jobs. \n\n# begin configuration section.\ncmd=run.pl\nstage=0\nframe_shift=0.01\nlmwt=10\nwip=0.0\nprint_silence=false\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: $0 [options] <data-dir> <lang-dir|graph-dir> <decode-dir> <ctm-out-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --frame-shift (default=0.01)    # specify this if your lattices have a frame-shift\"\n  echo \"                                    # not equal to 0.01 seconds\"\n  echo \"e.g.:\"\n  echo \"$0 data/train data/lang exp/tri4a/decode/\"\n  echo \"See also: steps/get_ctm.sh, steps/get_ctm_conf.sh\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\ndecode_dir=$3\ndir=$4\n\nif [ -f $decode_dir/final.mdl ]; then\n  model=$decode_dir/final.mdl\nelse\n  model=$decode_dir/../final.mdl # assume model one level up from decoding dir.\nfi\n\nfor f in $lang/words.txt $model $decode_dir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\nmkdir -p $dir\n\nnj=$(cat $decode_dir/num_jobs)\necho $nj > $dir/num_jobs\n\nif [ -f $lang/phones/word_boundary.int ]; then\n  $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n    set -o pipefail '&&' \\\n    lattice-add-penalty --word-ins-penalty=$wip \"ark:gunzip -c $decode_dir/lat.JOB.gz|\" ark:- \\| \\\n    lattice-prune --inv-acoustic-scale=$lmwt --beam=5 ark:- ark:- \\| \\\n    lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \\| \\\n    lattice-to-ctm-conf --frame-shift=$frame_shift --decode-mbr=true --inv-acoustic-scale=$lmwt ark:- - \\| \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    '>' $dir/ctm.JOB || exit 1;\nelif [ -f $lang/phones/align_lexicon.int ]; then\n  # Run the pipeline through $cmd, as in the word_boundary.int branch above.\n  $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n    set -o pipefail '&&' \\\n    lattice-add-penalty --word-ins-penalty=$wip \"ark:gunzip -c $decode_dir/lat.JOB.gz|\" ark:- \\| \\\n    lattice-prune --inv-acoustic-scale=$lmwt --beam=5 ark:- ark:- \\| \\\n    lattice-align-words-lexicon $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n    lattice-to-ctm-conf --frame-shift=$frame_shift --decode-mbr=true --inv-acoustic-scale=$lmwt ark:- - \\| \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    '>' $dir/ctm.JOB || exit 1;\nelse\n  echo \"$0: neither $lang/phones/word_boundary.int nor $lang/phones/align_lexicon.int exists: cannot align.\"\n  exit 1;\nfi\n\nfor n in `seq $nj`; do \n  cat $dir/ctm.$n\ndone > $dir/ctm\n"
  },
  {
    "path": "egs/steps/get_ctm_fast.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#           2017  Vimal Manohar\n#           2018  Xiaohui Zhang\n#           2018  Music Technology Group, Universitat Pompeu Fabra.\n# Apache 2.0\n\n# This script produces CTM files from a decoding directory that has lattices\n# present. It does this for one LM weight and also supports \n# the word insertion penalty.\n# This is similar to get_ctm.sh, but gets the CTM at the utterance-level.\n# It can be faster than steps/get_ctm.sh --use-segments false as it splits\n# the process across many jobs. \n\n# begin configuration section.\ncmd=run.pl\nstage=0\nframe_shift=0.01\nlmwt=10\nwip=0.0\nprint_silence=false\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: $0 [options] <data-dir> <lang-dir|graph-dir> <decode-dir> <ctm-out-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --frame-shift (default=0.01)    # specify this if your lattices have a frame-shift\"\n  echo \"                                    # not equal to 0.01 seconds\"\n  echo \"e.g.:\"\n  echo \"$0 data/train data/lang exp/tri4a/decode/\"\n  echo \"See also: steps/get_ctm.sh, steps/get_ctm_conf.sh\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\ndecode_dir=$3\ndir=$4\n\nif [ -f $decode_dir/final.mdl ]; then\n  model=$decode_dir/final.mdl\nelse\n  model=$decode_dir/../final.mdl # assume model one level up from decoding dir.\nfi\n\nfor f in $lang/words.txt $model $decode_dir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\nmkdir -p $dir\n\nnj=$(cat $decode_dir/num_jobs)\necho $nj > $dir/num_jobs\n\nif [ -f $lang/phones/word_boundary.int ]; then\n  $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n    set -o pipefail '&&' \\\n    lattice-1best --lm-scale=$lmwt --word-ins-penalty=$wip \"ark:gunzip -c $decode_dir/lat.JOB.gz|\" ark:- \\| \\\n    lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \\| \\\n    nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    '>' $dir/ctm.JOB || exit 1;\nelif [ -f $lang/phones/align_lexicon.int ]; then\n  $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n    set -o pipefail '&&' \\\n    lattice-1best --lm-scale=$lmwt --word-ins-penalty=$wip \"ark:gunzip -c $decode_dir/lat.JOB.gz|\" ark:- \\| \\\n    lattice-align-words-lexicon $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n    lattice-1best ark:- ark:- \\| \\\n    nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n    utils/int2sym.pl -f 5 $lang/words.txt \\\n    '>' $dir/ctm.JOB || exit 1;\nelse\n  echo \"$0: neither $lang/phones/word_boundary.int nor $lang/phones/align_lexicon.int exists: cannot align.\"\n  exit 1;\nfi\n\nfor n in `seq $nj`; do \n  cat $dir/ctm.$n\ndone > $dir/ctm\n"
  },
  {
    "path": "egs/steps/get_fmllr_basis.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012   Carnegie Mellon University (Author: Yajie Miao)\n#                  Johns Hopkins University (Author: Daniel Povey)\n\n# Decoding script that computes basis for basis-fMLLR (see decode_fmllr_basis.sh).\n# This can be on top of delta+delta-delta, or LDA+MLLT features.\n\nstage=0\n# Parameters in alignment of training data\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nper_utt=true # If true, then treat each utterance as a separate speaker for purposes of\n  # basis training... this is recommended if the number of actual speakers in your\n  # training set is less than (feature-dim) * (feature-dim+1).\nsilence_weight=0.01\ncmd=run.pl\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: steps/get_fmllr_basis.sh [options] <data-dir> <lang-dir> <exp-dir>\"\n   echo \" e.g.: steps/get_fmllr_basis.sh data/train_si84 data/lang exp/tri3b/\"\n   echo \"Note: we currently assume that this is the same data you trained the model with.\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\ndir=$3\n# $srcdir was used below without being set; we assume it is meant to be the\n# experiment directory, which also holds the alignments and final.mdl.\nsrcdir=$dir\n\nnj=`cat $dir/num_jobs` || exit 1;\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nsplice_opts=`cat $dir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\nfor f in $data/feats.scp $dir/final.mdl $dir/ali.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $dir/phones.txt || exit 1;\n# Set up the unadapted features \"$sifeats\".\nif [ -f $dir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n# Set up the adapted features \"$feats\" for training set.\nif [ -f $srcdir/trans.1 ]; then \n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$srcdir/trans.JOB ark:- ark:- |\";\nelse\n  feats=\"$sifeats\";\nfi\n\n\nif $per_utt; then\n  spk2utt_opt=  # treat each utterance as separate speaker when computing basis.\n  echo \"Doing per-utterance adaptation for purposes of computing the basis.\"\nelse\n  echo \"Doing per-speaker adaptation for purposes of computing the basis.\"\n  [ `cat $sdata/spk2utt | wc -l` -lt $[41*40] ] && \\\n    echo \"Warning: number of speakers is small, might be better to use --per-utt=true.\"\n  spk2utt_opt=\"--spk2utt=ark:$sdata/JOB/spk2utt\"\nfi\n\n# Note: we get Gaussian level alignments with the \"final.mdl\" and the\n# speaker adapted features. 
\n$cmd JOB=1:$nj $dir/log/basis_acc.JOB.log \\\n  ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n  weight-silence-post $silence_weight $silphonelist $dir/final.mdl ark:- ark:- \\| \\\n  gmm-post-to-gpost $dir/final.mdl \"$feats\" ark:- ark:- \\| \\\n  gmm-basis-fmllr-accs-gpost $spk2utt_opt \\\n    $dir/final.mdl \"$sifeats\" ark,s,cs:- $dir/basis.acc.JOB || exit 1; \n\n# Compute the basis matrices.\n$cmd $dir/log/basis_training.log \\\n  gmm-basis-fmllr-training $dir/final.mdl $dir/fmllr.basis $dir/basis.acc.* || exit 1;\nrm $dir/basis.acc.* 2>/dev/null\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/get_lexicon_probs.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2013  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n\n# From a training or alignment directory, and an original lexicon.txt and lang/\n# directory, obtain a new lexicon with pronunciation probabilities.\n# Note: this script is currently deprecated, the recipes are using a different\n# script in utils/dict_dir_add_pronprobs.sh.\n\n\n# Begin configuration section.  \nstage=0\nsmooth_count=1.0 # Amount of count to add corresponding to each original lexicon entry;\n                 # this corresponds to add-one smoothing of the pron-probs.\nmax_one=true   # If true, normalize the pron-probs so the maximum value for each word is 1.0,\n               # rather than summing to one.  This is quite standard.\n\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n   echo \"Usage: steps/get_lexicon_probs.sh <data-dir> <lang-dir> <src-dir|ali-dir> <old-lexicon> <exp-dir> <new-lexicon>\"\n   echo \"e.g.: steps/get_lexicon_probs.sh data/train data/lang exp/tri5 data/local/lexicon.txt \\\\\"\n   echo \"                      exp/tri5_lexprobs data/local_withprob/lexicon.txt\"\n   echo \"Note: we assume you ran using word-position-dependent phones but both the old and new lexicon will not have\"\n   echo \"these markings.  
We also assume the new lexicon will have pron-probs but the old one does not; this limitation\"\n   echo \"of the script can be removed later.\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --stage <stage>                                  # used to control partial re-running.\"\n   echo \"  --max-one <true|false>                           # If true, normalize so max prob of each\"\n   echo \"                                                   # word is one.  Default: true\"\n   echo \"  --smooth-count <smooth-count>                    # Amount to smooth each count by (default: 1.0)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\nold_lexicon=$4\ndir=$5\nnew_lexicon=$6\n\noov=`cat $lang/oov.int` || exit 1;\nnj=`cat $srcdir/num_jobs` || exit 1;\n\nfor f in $data/text $lang/L.fst $lang/phones/word_boundary.int $srcdir/ali.1.gz $old_lexicon; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log\nutils/split_data.sh $data $nj # Make sure split data-dir exists.\nsdata=$data/split$nj\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nif [ $stage -le 0 ]; then\n\n  ( ( for n in `seq $nj`; do gunzip -c $srcdir/ali.$n.gz; done ) | \\\n    linear-to-nbest ark:- \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $data/text |\" '' '' ark:- | \\\n    lattice-align-words $lang/phones/word_boundary.int $srcdir/final.mdl ark:- ark:- | \\\n    lattice-to-phone-lattice --replace-words=false $srcdir/final.mdl ark:- ark,t:- | \\\n    awk '{ if (NF == 4) { word_phones = sprintf(\"%s %s\", $3, $4); count[word_phones]++; } } \n        END { for(key in count) { print count[key], key; } }' | \\\n          sed s:0,0,:: | awk '{print $2, $1, $3;}' | sed 's/_/ /g' | \\\n          utils/int2sym.pl -f 3- $lang/phones.txt  | \\\n          sed -E 's/_I( |$)/ /g' |  sed -E 's/_E( |$)/ /g' | sed -E 's/_B( |$)/ /g' | sed -E 's/_S( |$)/ /g' | \\\n          utils/int2sym.pl -f 1 $lang/words.txt > $dir/lexicon_counts.txt\n  ) 2>&1 | tee $dir/log/get_fsts.log\n\nfi\n\ncat $old_lexicon | awk '{if (!($2 > 0.0 && $2 < 1.0)) { exit(1); }}' && \\\n  echo \"Error: old lexicon $old_lexicon appears to have pron-probs; we don't expect this.\" && \\\n  exit 1;\n\nmkdir -p `dirname $new_lexicon` || exit 1;\n\nif [ $stage -le 1 ]; then\n  grep -v -w '^<eps>' $dir/lexicon_counts.txt | \\\n  perl -e ' ($old_lexicon, $smooth_count, $max_one) = @ARGV;\n    ($smooth_count >= 0) || die \"Invalid smooth_count $smooth_count\";\n    ($max_one eq \"true\" || $max_one eq \"false\") || die \"Invalid max_one variable $max_one\";\n    open(O, \"<$old_lexicon\")||die \"Opening old-lexicon file $old_lexicon\"; \n    while(<O>) {\n      $_ =~ m/(\\S+)\\s+(.+)/ || die \"Bad old-lexicon line $_\";\n      $word = $1;\n      $orig_pron = $2;\n      # Remember 
the mapping from canonical prons to original prons: in the case of\n      # syllable based systems we want to remember the locations of tabs in\n      # the original lexicon.\n      $pron = join(\" \", split(\" \", $orig_pron));\n      $orig_pron{$word,$pron} = $orig_pron;\n      $count{$word,$pron} += $smooth_count;\n      $tot_count{$word} += $smooth_count;\n    }\n    while (<STDIN>) {\n      $_ =~ m/(\\S+)\\s+(\\S+)\\s+(.+)/ || die \"Bad new-lexicon line $_\";\n      $word = $1;\n      $this_count = $2;\n      $pron = join(\" \", split(\" \", $3));\n      $count{$word,$pron} += $this_count;\n      $tot_count{$word} += $this_count;\n    }\n    if ($max_one eq \"true\") {  # replace $tot_count{$word} with max count\n       # of any pron.\n      %tot_count = (); # set to empty assoc array.\n      foreach $key (keys %count) {\n        ($word, $pron) = split($; , $key); # $; is separator for strings that index assoc. arrays.\n        $this_count = $count{$key};\n        if (!defined $tot_count{$word} || $this_count > $tot_count{$word}) {\n          $tot_count{$word} = $this_count;\n        }\n      }\n    }\n    foreach $key (keys %count) {\n       ($word, $pron) = split($; , $key); # $; is separator for strings that index assoc. 
arrays.\n       $this_orig_pron = $orig_pron{$key};\n       if (!defined $this_orig_pron) { die \"Word $word and pron $pron did not appear in original lexicon.\"; }\n       if (!defined $tot_count{$word}) { die \"Tot-count not defined for word $word.\"; }\n       $prob = $count{$key} / $tot_count{$word};\n       print \"$word\\t$prob\\t$this_orig_pron\\n\";  # Output happens here.\n    } '  $old_lexicon $smooth_count $max_one > $new_lexicon || exit 1;\nfi\n\nexit 0;\n\necho $nj > $dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\ncp $srcdir/splice_opts $dir 2>/dev/null # frame-splicing options.\n\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir    \n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! 
-f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! -f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";   \n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: aligning data in $data using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; then\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post 
\"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  echo \"$0: doing final alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/get_prons.sh",
    "content": "#!/usr/bin/env bash\n# Copyright  2014  Johns Hopkins University (Author: Daniel Povey)\n#            2014  Guoguo Chen\n# Apache 2.0\n\n# Begin configuration section.\ncmd=run.pl\nstage=1\nlmwt=10\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"usage: $0 <data-dir> <lang-dir> <dir>\"\n   echo \"e.g.:  $0 data/train data/lang exp/tri3\"\n   echo \"or:  $0 data/train data/lang exp/tri3/decode_dev\"\n   echo \"This script writes files prons.*.gz in the directory provided, which must\"\n   echo \"contain alignments (ali.*.gz) or lattices (lat.*.gz).  These files are as\"\n   echo \"output by nbest-to-prons (see its usage message).\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --lmwt <lm-weight>                               # scale for LM, only applicable\"\n   echo \"                                                   # for lattice input (default: 10)\"\n   exit 1;\nfi\n\n# As the usage message of nbest-to-prons says, its output has lines that can be interpreted as\n#  <utterance-id> <begin-frame> <num-frames> <word> <phone1> <phone2> ... <phoneN>\n# and you could convert these into text form using a command like:\n# gunzip -c prons.*.gz | utils/sym2int.pl -f 4 words.txt | utils/sym2int.pl -f 5- phones.txt\n\n\n\ndata=$1\nlang=$2\ndir=$3\n\nfor f in $data/utt2spk $lang/words.txt $dir/num_jobs; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nnj=$(cat $dir/num_jobs) || exit 1;\nsdata=$data/split$nj\noov=`cat $lang/oov.int` || exit 1;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n\nif [ -f $dir/final.mdl ]; then\n  mdl=$dir/final.mdl\nelse\n  if [ -f $dir/../final.mdl ]; then\n    mdl=$dir/../final.mdl  # e.g. decoding directories.\n  else\n    echo \"$0: expected $dir/final.mdl or $dir/../final.mdl to exist.\"\n    exit 1;\n  fi\nfi\n\nif [ -f $lang/phones/word_boundary.int ]; then\n  align_words_cmd=\"lattice-align-words $lang/phones/word_boundary.int $mdl ark:- ark:-\"\nelse\n  if [ ! -f $lang/phones/align_lexicon.int ]; then\n    echo \"$0: expected either $lang/phones/word_boundary.int or $lang/phones/align_lexicon.int to exist.\"\n    exit 1;\n  fi\n  align_words_cmd=\"lattice-align-words-lexicon $lang/phones/align_lexicon.int $mdl ark:- ark:-\"\nfi\n\nif [ -f $dir/ali.1.gz ]; then\n  echo \"$0: $dir/ali.1.gz exists, so starting from alignments.\"\n\n  if [ $stage -le 1 ]; then\n    rm $dir/prons.*.gz 2>/dev/null\n    $cmd JOB=1:$nj $dir/log/nbest_to_prons.JOB.log \\\n      linear-to-nbest \"ark:gunzip -c $dir/ali.JOB.gz|\" \\\n      \"ark:sym2int.pl --map-oov $oov -f 2- $lang/words.txt <$sdata/JOB/text |\" \\\n      \"\" \"\" ark:- \\| $align_words_cmd \\| \\\n      nbest-to-prons $mdl ark:- \"|gzip -c >$dir/prons.JOB.gz\" || exit 1;\n  fi\nelse\n  if [ ! 
-f $dir/lat.1.gz ]; then\n    echo \"$0: expected either $dir/ali.1.gz or $dir/lat.1.gz to exist.\"\n    exit 1;\n  fi\n  echo \"$0: $dir/lat.1.gz exists, so starting from lattices.\"\n\n  if [ $stage -le 1 ]; then\n    rm $dir/prons.*.gz 2>/dev/null\n    $cmd JOB=1:$nj $dir/log/nbest_to_prons.JOB.log \\\n      lattice-1best --lm-scale=$lmwt \"ark:gunzip -c $dir/lat.JOB.gz|\" ark:- \\| \\\n      $align_words_cmd \\| \\\n      nbest-to-prons $mdl ark:- \"|gzip -c >$dir/prons.JOB.gz\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le 2 ]; then\n  gunzip -c $dir/prons.*.gz | \\\n    awk '{ $1=\"\"; $2=\"\"; $3=\"\"; count[$0]++; } END{for (k in count) { print count[k], k; }}' > $dir/pron_counts.int || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  cat $dir/pron_counts.int | utils/int2sym.pl -f 2 $lang/words.txt | \\\n    utils/int2sym.pl -f 3- $lang/phones.txt | sort -nr > $dir/pron_counts.txt\nfi\n\nif [ $stage -le 4 ]; then\n  if [ -f $lang/phones/word_boundary.int ]; then\n    # remove the _B, _I, _S, _E markers from phones; this is often convenient\n    # if we want to go back to a word-position-independent source lexicon.\n    cat $dir/pron_counts.txt | perl -ane '@A = split(\" \", $_);\n     for ($n=2;$n<@A;$n++) { $A[$n] =~ s/_[BISE]$//; } print join(\" \", @A) . \"\\n\"; ' >$dir/pron_counts_nowb.txt\n  fi\nfi\n\nif [ $stage -le 5 ]; then\n  # Here we figure the count of silence before and after words (actually prons)\n  # 1. Create a text like file, but instead of putting words, we write\n  #    \"word pron\" pairs. We change the format of prons.*.gz from pron-per-line\n  #    to utterance-per-line (with \"word pron\" pairs tab-separated), and add\n  #    <s> and </s> at the begin and end of each sentence. 
The _B, _I, _S, _E\n  #    markers are removed from phones.\n  gunzip -c $dir/prons.*.gz | utils/int2sym.pl -f 4 $lang/words.txt | \\\n    utils/int2sym.pl -f 5- $lang/phones.txt | cut -d ' ' -f 1,4- | awk '\n    BEGIN { utter_id = \"\"; }\n    {\n      if (utter_id == \"\") { utter_id = $1; printf(\"%s\\t<s>\", utter_id); }\n      else if (utter_id != $1) {\n        printf \"\\t</s>\\n\"; utter_id = $1; printf(\"%s\\t<s>\", utter_id);\n      }\n      printf(\"\\t%s\", $2);\n      for (n = 3; n <= NF; n++) { sub(\"_[BISE]$\", \"\", $n); printf(\" %s\", $n); }\n    }\n    END { printf \"\\t</s>\\n\"; }' > $dir/pron_perutt_nowb.txt\n\n  # 2. Collect bigram counts for words. To be more specific, we are actually\n  #    collecting counts for \"v ? w\", where \"?\" represents silence or\n  #    non-silence.\n  cat $dir/pron_perutt_nowb.txt | perl -ape 's/<eps>[^\\t]*\\t//g;' | perl -e '\n    while (<>) {\n      chomp; @col = split(\"\\t\");\n      for($i = 1; $i < scalar(@col) - 1; $i += 1) {\n        $bigram{$col[$i] . \"\\t\" . $col[$i + 1]} += 1;\n      }\n    }\n    foreach $key (keys %bigram) {\n      print \"$bigram{$key}\\t$key\\n\";\n    }' > $dir/pron_bigram_counts_nowb.txt\n\n  # 3. Collect bigram counts for silence and words. the count file has 4 fields\n  #    for counts, followed by the \"word pron\" pair. 
All fields are separated by\n  #    spaces:\n  #    <sil-before-count> <nonsil-before-count> <sil-after-count> <nonsil-after-count> <word> <phone1> <phone2 >...\n  cat $dir/pron_perutt_nowb.txt | cut -f 2- | perl -e '\n    %sil_wpron = (); %nonsil_wpron = (); %wpron_sil = (); %wpron_nonsil = ();\n    %words = ();\n    while (<STDIN>) {\n      chomp;\n      @col = split(/[\\t]+/, $_); @col >= 2 || die \"'$0': bad line \\\"$_\\\"\\n\";\n      for ($n = 0; $n < @col - 1; $n++) {\n        # First word is not silence, collect the wpron_sil and wpron_nonsil\n        # stats.\n        if ($col[$n] !~ m/^<eps> /) {\n          if ($col[$n + 1] =~ m/^<eps> /) { $wpron_sil{$col[$n]} += 1; }\n          else { $wpron_nonsil{$col[$n]} += 1; }\n          $words{$col[$n]} = 1;\n        }\n        # Second word is not silence, collect the sil_wpron and nonsil_wpron\n        # stats.\n        if ($col[$n + 1] !~ m/^<eps> /) {\n          if ($col[$n] =~ m/^<eps> /) { $sil_wpron{$col[$n + 1]} += 1; }\n          else { $nonsil_wpron{$col[$n + 1]} += 1; }\n          $words{$col[$n + 1]} = 1;\n        }\n      }\n    }\n    foreach $wpron (sort keys %words) {\n      $sil_wpron{$wpron} += 0; $nonsil_wpron{$wpron} += 0;\n      $wpron_sil{$wpron} += 0; $wpron_nonsil{$wpron} += 0;;\n      print \"$sil_wpron{$wpron} $nonsil_wpron{$wpron} \";\n      print \"$wpron_sil{$wpron} $wpron_nonsil{$wpron} $wpron\\n\";\n    }\n  '> $dir/sil_counts_nowb.txt\nfi\n\necho \"$0: done writing prons to $dir/prons.*.gz, silence counts in \"\necho \"$0: $dir/sil_counts_nowb.txt and pronunciation counts in \"\necho \"$0: $dir/pron_counts.{int,txt}\"\nif [ -f $lang/phones/word_boundary.int ]; then\n  echo \"$0: ... and also in $dir/pron_counts_nowb.txt\"\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/get_train_ctm.sh",
    "content": "#!/usr/bin/env bash\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2012.  Apache 2.0.\n\n# This script produces CTM files from a training directory that has alignments\n# present.\n\n\n# begin configuration section.\ncmd=run.pl\nframe_shift=0.01\nstage=0\nuse_segments=true # if we have a segments file, use it to convert\n                  # the segments to be relative to the original files.\nprint_silence=false # if true, will print <eps> (optional-silence) arcs.\n\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ] && [ $# -ne 4 ]; then\n  echo \"Usage: $0 [options] <data-dir> <lang-dir> <ali-dir|model-dir> [<output-dir>]\"\n  echo \"(<output-dir> defaults to  <ali-dir|model-dir>.)\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --use-segments (true|false)     # use segments and reco2file_and_channel files \"\n  echo \"                                    # to produce a ctm relative to the original audio\"\n  echo \"                                    # files, with channel information (typically needed\"\n  echo \"                                    # for NIST scoring).\"\n  echo \"    --frame-shift (default=0.01)    # specify this if your alignments have a frame-shift\"\n  echo \"                                    # not equal to 0.01 seconds\"\n  echo \"e.g.:\"\n  echo \"$0 data/train data/lang exp/tri3a_ali\"\n  echo \"Produces ctm in: exp/tri3a_ali/ctm\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\nali_dir=$3\ndir=$4\nif [ -z $dir ]; then\n  dir=$ali_dir\nfi\n\n\nmodel=$ali_dir/final.mdl # assume model one level up from decoding dir.\n\n\nfor f in $lang/words.txt $model 
$ali_dir/ali.1.gz $lang/oov.int; do\n  [ ! -f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\noov=`cat $lang/oov.int` || exit 1;\nnj=`cat $ali_dir/num_jobs` || exit 1;\nsplit_data.sh $data $nj || exit 1;\nsdata=$data/split$nj\n\nmkdir -p $dir/log || exit 1;\n\nif [ $stage -le 0 ]; then\n  if [ -f $lang/phones/word_boundary.int ]; then\n    $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n      set -o pipefail '&&' linear-to-nbest \"ark:gunzip -c $ali_dir/ali.JOB.gz|\" \\\n      \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n      '' '' ark:- \\| \\\n      lattice-align-words $lang/phones/word_boundary.int $model ark:- ark:- \\| \\\n      nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt \\| \\\n      gzip -c '>' $dir/ctm.JOB.gz || exit 1\n  else\n    if [ ! -f $lang/phones/align_lexicon.int ]; then\n      echo \"$0: neither $lang/phones/word_boundary.int nor $lang/phones/align_lexicon.int exists: cannot align.\"\n      exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/get_ctm.JOB.log \\\n      set -o pipefail '&&' linear-to-nbest \"ark:gunzip -c $ali_dir/ali.JOB.gz|\" \\\n      \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n      '' '' ark:- \\| \\\n      lattice-align-words-lexicon $lang/phones/align_lexicon.int $model ark:- ark:- \\| \\\n      lattice-1best ark:- ark:- \\| \\\n      nbest-to-ctm --frame-shift=$frame_shift --print-silence=$print_silence ark:- - \\| \\\n      utils/int2sym.pl -f 5 $lang/words.txt \\| \\\n      gzip -c '>' $dir/ctm.JOB.gz || exit 1\n  fi\nfi\n\nif [ $stage -le 1 ]; then\n  if [ -f $data/segments ] && $use_segments; then\n    f=$data/reco2file_and_channel\n    [ ! 
-f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\n    for n in `seq $nj`; do gunzip -c $dir/ctm.$n.gz; done | \\\n      utils/convert_ctm.pl $data/segments $data/reco2file_and_channel > $dir/ctm || exit 1;\n  else\n    for n in `seq $nj`; do gunzip -c $dir/ctm.$n.gz; done > $dir/ctm || exit 1;\n  fi\n  rm $dir/ctm.*.gz\nfi\n"
  },
  {
    "path": "egs/steps/info/chain_dir_info.pl",
    "content": "#!/usr/bin/perl -w\n\nuse Fcntl;\n\n# we may at some point support options.\n\n$debug = 0;  # we set it to 1 for debugging the script itself.\n\nif ($ARGV[0] eq \"--debug\") {\n  $debug = 1;\n  shift @ARGV;\n}\n\nif (@ARGV == 0) {\n  print STDERR \"Usage: steps/info/nnet3_dir_info.pl [--debug] <nnet3-dir1> [<nnet3-dir2> ... ]\\n\" .\n               \"e.g: steps/info/nnet3_dir_info.pl exp/nnet3/tdnn_sp\\n\" .\n               \"This script extracts some important information from the logs\\n\" .\n               \"and displays it on a single (rather long) line.\\n\" .\n               \"The --debug option is just to debug the script itself.\\n\" .\n               \"This program exits with status 0 if it seems like the arguments\\n\" .\n               \"really were of the expected directory type, and 1 otherwise.\\n\";\n  exit(1);\n}\n\nif (@ARGV > 1) {\n  # repeatedly invoke this program with each of the remaining args.\n  $exit_status = 0;\n  if ($debug) { $debug_opt = \"--debug \" } else { $debug_opt = \"\"; }\n  foreach $dir (@ARGV) {\n    if (system(\"$0 $debug_opt$dir\") != 0) {\n      $exit_status = 1;\n    }\n  }\n  exit($exit_status);\n}\n\n$nnet_dir = shift @ARGV;\n\nsub list_all_log_files {\n  my @ans = ();\n  my $dh;\n  if (!opendir($dh, \"$nnet_dir/log\")) { return (); }\n  @ans = readdir $dh;\n  closedir $dh;\n  return @ans;\n}\n\n\n# returns 1 if the diagnostics are finished on this iter, else 0.\nsub diagnostics_are_finished_on_iter {\n  my $ans = 1;\n  my $iter = shift @_;\n  if (!open(F, \"<$nnet_dir/log/compute_prob_train.$iter.log\")) {\n    return 0;\n  }\n  $found_loglike = 0;\n  while (<F>) {\n    if (m/Overall log-probability/) { $found_loglike = 1; }\n  }\n  if (!$found_loglike) { $ans = 0; }\n  close(F);\n  if (!open(F, \"<$nnet_dir/log/compute_prob_valid.$iter.log\")) {\n    return 0;\n  }\n  $found_loglike = 0;\n  while (<F>) {\n    if (m/Overall log-probability/) { $found_loglike = 1; }\n  }\n  if (!$found_loglike) { $ans = 
0; }\n  close(F);\n  return $ans;\n}\n\n# get the number of iterations.\n# note: the iterations go from 0 to num-iters-1.\n# if num_iters = 0 this program will just exit with status 1.\n# we may return a number slightly less than the number of iterations\n# in order to ensure that the compute_prob_train and compute_prob_valid\n# processes have finished.\nsub get_num_iters {\n  my $iter = 0;\n  while (defined $log_file_hash{\"train.$iter.1.log\"}) {\n    $iter++;\n  }\n  if ($iter == 0) {\n    die \"$nnet_dir does not seem to be an nnet3 neural net training directory.\";\n  }\n  my $last_iter = $iter - 1;\n  # find an iteration where the diagnostic jobs compute_prob_{train,valid}.$last_iter.log are done.\n  for (my $chosen_last_iter = $last_iter;\n       $chosen_last_iter >= $last_iter - 6 && $chosen_last_iter >= 0;\n       $chosen_last_iter--) {\n    if (! diagnostics_are_finished_on_iter($chosen_last_iter)) {\n      if ($debug) {\n        print STDERR \"nnet3_dir_info.pl: diagnostics not finished running on iteration $chosen_last_iter\\n\";\n      }\n    } else {\n      return $chosen_last_iter + 1;\n    }\n  }\n  # OK, something's not right, just return the original iteration.\n  return $iter;\n}\n\nsub get_num_jobs_initial {\n  my $num_jobs = 1;\n  while (defined $log_file_hash{\"train.0.$num_jobs.log\"}) {\n    $num_jobs++;\n  }\n  $num_jobs--;\n  if ($num_jobs == 0) {\n    die \"$nnet_dir does not seem to be an nnet3 neural net training directory.\";\n  }\n  return $num_jobs;\n}\n\n\nsub get_num_jobs_final {  # expects $num_iters to exist as a global variable.\n  my $final_iter = $num_iters - 1;\n  my $num_jobs = 1;\n  while (defined $log_file_hash{\"train.$final_iter.$num_jobs.log\"}) {\n    $num_jobs++;\n  }\n  $num_jobs--;\n  if ($num_jobs == 0) {\n    die \"$nnet_dir does not seem to be an nnet3 neural net training directory.\";\n  }\n  return $num_jobs;\n}\n\nsub get_combine_info {\n  # returns a string with info about the combination stage, or the 
empty\n  # string if there wasn't one.\n  if (defined $log_file_hash{\"combine.log\"} &&\n      open(F, \"<$nnet_dir/log/combine.log\")) {\n    while (<F>) {\n      if (m/Combining nnets, objective function changed from (\\S+) to (\\S+)/) {\n        close(F);\n        return sprintf(\" combine=%.3f->%.3f\", $1, $2);\n      } elsif (m/Combining (\\S+) nnets, objective function changed from (\\S+) to (\\S+)/) {\n        close(F);\n        return sprintf(\" combine=%.3f->%.3f (over %d)\", $2, $3, $1);\n      }\n    }\n  }\n  return \"\";\n}\n\nsub format_float_as_string {\n  my $float = shift @_;\n  if (abs($float) >= 1.0) {\n    return sprintf(\"%.2f\", $float);\n  } else {\n    return sprintf(\"%.3f\", $float);\n  }\n}\n\n# this is used in get_loglike_and_accuracy to format\n# strings like ' loglike[32,48,final],train/valid=(-2.43,-2.32,-2.21/-2.84,-2.71,-2.68)'.\nsub get_printed_string {\n  # $name might be 'loglike', for example.\n  my ($name, $iters_array_ref, $train_hash_ref, $valid_hash_ref) = @_;\n  my @iters_array = @$iters_array_ref;\n  my %train_hash = %$train_hash_ref;  # hash from iter-string to value.\n  my %valid_hash = %$valid_hash_ref;  # hash from iter-string to value.\n  my @iters_to_print = ();\n  my @train_values_to_print = ();\n  my @valid_values_to_print = ();\n  foreach my $iter (@iters_array) {\n    if (defined($train_hash{$iter}) && defined($valid_hash{$iter})) {\n      push @iters_to_print, $iter;\n      push @train_values_to_print, format_float_as_string($train_hash{$iter});\n      push @valid_values_to_print, format_float_as_string($valid_hash{$iter});\n    }\n  }\n  if (@iters_to_print == 0) {  return \"\"; }\n  my $joined_iters = join(\",\", @iters_to_print);\n  my $joined_train_values = join(\",\", @train_values_to_print);\n  my $joined_valid_values = join(\",\", @valid_values_to_print);\n  return \" ${name}:train/valid[$joined_iters]=($joined_train_values/$joined_valid_values)\";\n}\n\n\n# invoke this as get_objf_iter($iter1, 
$iter2,..) where $iterN is the string-valued\n# iteration, e.g. \"92\", or \"final\", or \"combined\", such that we expect\n# $nnet_dir/log/compute_prob_{train,valid}.$iterN.log to exist.\nsub get_logprob_and_accuracy_info {\n  my @iters_array = @_;\n  my %iter_to_train_logprob = ();\n  my %iter_to_train_penalty = ();\n  my %iter_to_train_xent = ();\n  my %iter_to_valid_logprob = ();\n  my %iter_to_valid_penalty = ();\n  my %iter_to_valid_xent = ();\n\n\n\n  foreach my $iter (@iters_array) {\n     if (defined $log_file_hash{\"compute_prob_train.$iter.log\"} &&\n        defined $log_file_hash{\"compute_prob_valid.$iter.log\"} &&\n        open(F, \"<$nnet_dir/log/compute_prob_train.$iter.log\") &&\n        open(G, \"<$nnet_dir/log/compute_prob_valid.$iter.log\")) {\n      while (<F>) {\n        if (m/Overall log-probability for 'output' is (\\S+) \\+ (\\S+)/) {\n          $iter_to_train_logprob{$iter} = $1;\n          $iter_to_train_penalty{$iter} = $2;\n        } elsif (m/Overall log-probability for 'output' is (\\S+)/) {\n          $iter_to_train_logprob{$iter} = $1;\n          $iter_to_train_penalty{$iter} = 0.0;\n        } elsif (m/Overall log-probability for 'output-xent' is (\\S+) per frame/) {\n          $iter_to_train_xent{$iter} = $1;\n        }\n      }\n      close(F);\n      while (<G>) {\n        if (m/Overall log-probability for 'output' is (\\S+) \\+ (\\S+)/) {\n          $iter_to_valid_logprob{$iter} = $1;\n          $iter_to_valid_penalty{$iter} = $2;\n        } elsif (m/Overall log-probability for 'output' is (\\S+)/) {\n          $iter_to_valid_logprob{$iter} = $1;\n          $iter_to_valid_penalty{$iter} = 0.0;\n        } elsif (m/Overall log-probability for 'output-xent' is (\\S+) per frame/) {\n          $iter_to_valid_xent{$iter} = $1;\n        }\n      }\n      close(G);\n    }\n  }\n  $ans = \"\";\n  $ans .= get_printed_string(\"xent\", \\@iters_array, \\%iter_to_train_xent,\n                             \\%iter_to_valid_xent);\n  $ans .= 
get_printed_string(\"logprob\", \\@iters_array, \\%iter_to_train_logprob,\n                             \\%iter_to_valid_logprob);\n  # we don't do anything with the l2 penalties.\n  return $ans;\n}\n\n# invoke this as get_progress_info($iter), e.g. set $iter to the last\n# iteration number.\nsub get_progress_info {\n  my $iter = shift @_;\n  if (!defined $log_file_hash{\"progress.$iter.log\"} ||\n      !open(F, \"<$nnet_dir/log/progress.$iter.log\")) {\n    return \"\";\n  }\n  my $num_parameters = \"0\";\n  my $output_dim = 0;\n  my $input_dim = 0;\n  my $ivector_dim = 0;\n  my $max_clipped_proportion = 0.0;\n  while (<F>) {\n    if (m/clipped-proportion=([^,]+)/ && $1 > $max_clipped_proportion) {\n      $max_clipped_proportion = $1;\n    }\n    if (m/^num-parameters: (\\S+)/) {\n      $num_parameters = sprintf(\"%.1fM\", $1 / 1000000.0);\n    }\n    if (m/^output-node.* name=output .*dim=(\\S+)/) {\n      $output_dim = $1;\n    }\n    if (m/^input-node.* name=input .*dim=(\\S+)/) {\n      $input_dim = $1;\n    }\n    if (m/^input-node.* name=ivector .*dim=(\\S+)/) {\n      $ivector_dim = $1;\n    }\n  }\n  close(F);\n  $ans = \"\";\n  if ($num_parameters ne \"0\") {  $ans .= \" num-params=$num_parameters\"; }\n  if ($max_clipped_proportion > 0.1) {\n    if ($max_clipped_proportion > 0.3) {\n      $ans .= \" **max-clipped-proportion=$max_clipped_proportion**\";  # for emphasis; this generally isn't good.\n    } else {\n      $ans .= \" max-clipped-proportion=$max_clipped_proportion\";\n    }\n  }\n  if ($output_dim > 0 && $input_dim > 0 && $ivector_dim > 0) {\n    $ans .= \" dim=$input_dim+$ivector_dim->$output_dim\";\n  } elsif ($output_dim > 0 && $input_dim > 0) {\n    $ans .= \" dim=$input_dim->$output_dim\";\n  } elsif ($output_dim > 0) {\n    $ans .= \" output-dim=$output_dim\";\n  }\n  return $ans;\n}\n\n# return 1 if we seem to have finished training, else 0.\nsub finished_training {\n  return defined $log_file_hash{\"compute_prob_train.final.log\"} ||\n   
 defined $log_file_hash{\"compute_prob_train.combined.log\"};\n}\n\n@log_files = list_all_log_files();\nif (@log_files == 0) {  exit(1); }\n%log_file_hash = ();\nforeach $f (@log_files) { $log_file_hash{$f} = 1; }\n\n$num_iters = get_num_iters();\n$num_jobs_initial = get_num_jobs_initial();\n$num_jobs_final = get_num_jobs_final();\n$last_iter = $num_iters - 1;\n$two_thirds_iter = int($last_iter * 0.666);\n\n$output_string = \"$nnet_dir: num-iters=$num_iters\";\n\n$output_string .= \" nj=$num_jobs_initial..$num_jobs_final\";\n\n$output_string .= get_progress_info(\"$last_iter\");\n\n$output_string .= get_combine_info();\n\n\n\n# note: IIRC some of the scripts use the name 'combined' for the model after\n# combination, and some 'final', so we try both; only one of these will\n# actually produce any output.\n\n\n@iters_array = (\"$two_thirds_iter\", \"$last_iter\", \"final\", \"combined\");\n\n$output_string .= get_logprob_and_accuracy_info(@iters_array);\n\nprint \"$output_string\\n\";\n\nexit(0);\n"
  },
  {
    "path": "egs/steps/info/gmm_dir_info.pl",
    "content": "#!/usr/bin/perl -w\n\nuse Fcntl;\n\n# we may at some point support options.\n\n$debug = 0;  # we set it to 1 for debugging the script itself.\n\nif ($ARGV[0] eq \"--debug\") {\n  $debug = 1;\n  shift @ARGV;\n}\n\nif (@ARGV == 0) {\n  print STDERR \"Usage: steps/info/gmm_dir_info.pl [--debug] <gmm-dir1> [<gmm-dir2> ... ]\\n\" .\n               \"e.g: steps/info/gmm_dir_info.pl exp/tri3\\n\" .\n               \"This script extracts some important information from the logs\\n\" .\n               \"and displays it on a single (rather long) line.\\n\" .\n               \"The --debug option is just to debug the script itself.\\n\" .\n               \"This program exits with status 0 if it seems like the argument\\n\" .\n               \"really was a GMM dir, and 1 otherwise.\\n\";\n  exit(1);\n}\n\nif (@ARGV > 1) {\n  # repeatedly invoke this program with each of the remaining args.\n  $exit_status = 0;\n  if ($debug) { $debug_opt = \"--debug \" } else { $debug_opt = \"\"; }\n  foreach $dir (@ARGV) {\n    if (system(\"$0 $debug_opt$dir\") != 0) {\n      $exit_status = 1;\n    }\n  }\n  exit($exit_status);\n}\n\n\n$gmm_dir = shift @ARGV;\n\nsub list_all_log_files {\n  my @ans = ();\n  my $dh;\n  if (!opendir($dh, \"$gmm_dir/log\")) { return (); }\n  @ans = readdir $dh;\n  closedir $dh;\n  return @ans;\n}\n\n\nsub get_num_jobs {\n  if (! -d $gmm_dir) {\n    print STDERR \"steps/info/gmm_dir_info.pl: no such directory $gmm_dir\\n\";\n    exit(1);\n  }\n  if (!open(F, \"<$gmm_dir/num_jobs\")) {\n    print STDERR \"steps/info/gmm_dir_info.pl: no such file $gmm_dir/num_jobs\\n\";\n  }\n  my $num_jobs = <F>;\n  if (!($num_jobs > 0)) {\n    print STDERR \"steps/info/gmm_dir_info.pl: bad contents of file $gmm_dir/num_jobs\\n\";\n  }\n  close(F);\n  return 0 + $num_jobs;  # force conversion to integer.\n}\n\n# this function returns a string containing info from the last set of alignment\n# jobs.  
it may be empty if no alignment info was found, or if it didn't have the\n# expected contents.\nsub get_last_align_info {\n  $max_align_iter = -1;\n  foreach $f (@log_files) {\n    if ($f =~ m:^align.(\\d+).1.log$: && $1 > $max_align_iter) {\n      $max_align_iter = $1;\n    }\n  }\n  if ($debug) {\n    print STDERR \"max-align-iter=$max_align_iter\\n\";\n  }\n  if ($max_align_iter == -1) { return \"\"; }  # something went wrong; return no info.\n\n  $num_utts = 0;\n  $num_utts_err = 0;\n  $num_utts_retry = 0;\n  $num_frames = 0;\n  $tot_loglike = 0;\n  if ($debug) {\n    print STDERR \"Starting reading alignment logs\\n\";\n  }\n  for ($j = 1; $j <= $num_jobs; $j++) {\n    if (open(F, \"${gmm_dir}/log/align.$max_align_iter.$j.log\")) {\n      # we only need the last few lines of the file, e.g. the last 5 lines which\n      # would normally be about 400 characters... so the last 1000 characters\n      # should be enough.\n      seek(F, -1000, Fcntl::SEEK_END);\n      while (<F>) {\n        if (m/Overall log-likelihood per frame is (\\S+) over (\\S+) frames./) {\n          $tot_loglike += $1 * $2;\n          $num_frames += $2;\n        } elsif (m/Retried (\\S+) out of (\\S+) utterances/) {\n          $num_utts_retry += $1;\n          $num_utts += $2;\n        } elsif (m/Done \\S+, errors on (\\S+)/) {\n          $num_utts_err += $1;\n        }\n      }\n      close(F);\n    }\n  }\n  if ($debug) {\n    print STDERR \"Done reading alignment logs\\n\";\n  }\n  if ($num_utts == 0 || $num_frames == 0) { return \"\"; }  # something went wrong.\n\n  # note: the number of hours of data, e.g. 
\"3.23h data\", assumes 100 frames\n  # per second, which is almost always true for GMM-based systems.\n  return sprintf(\" align prob=%.2f over %.2fh [retry=%.1f%%, fail=%.1f%%]\",\n                 ($tot_loglike / $num_frames), ($num_frames / 360000.0),\n                 ($num_utts_retry * 100.0 / $num_utts), ($num_utts_err * 100.0 / $num_utts));\n}\n\n\n# this function returns a string containing info from the last update\n# job.  Right now it includes info about the num-states and num-gauss\n# and the percentage of Gaussians that had variances floored; we\n# also say how much data was used if this\n# the string may be empty if no such job was found.\nsub get_last_update_info {\n  $max_update_iter = -1;\n  foreach $f (@log_files) {\n    if ($f =~ m:^update.(\\d+).log$: && $1 > $max_update_iter) {\n      $max_update_iter = $1;\n    }\n  }\n  if ($debug) {\n    print STDERR \"max-update-iter=$max_update_iter\\n\";\n  }\n  if ($max_update_iter == -1) { return \"\"; }  # something went wrong; return no info.\n\n\n  $num_gauss = 0;\n  $num_gauss_floored = 0;  # number of Gaussians with at least one variance floored.\n  $num_gauss_removed = 0;  # number of Gaussians removed due to low-counts.\n  $num_gauss_tot = 0;     # total number of Gaussians before splitting.\n  $num_gauss_after_split = 0;  # total number of Gaussians after splitting [will\n                               # usually be same as before, on last iter.]\n  $num_states = 0;  # total number of states [pdf-ids]\n  $num_frames = 0;  # total number of frames.\n  $loglike = 0;  # log-likelihood [from auxf].\n\n  if (open(F, \"<${gmm_dir}/log/update.$max_update_iter.log\")) {\n    while (<F>) {\n      if (m/variance elements floored in (\\S+) Gaussians, out of (\\S+)/) {\n        $num_gauss_floored = $1;\n        $num_gauss_tot = $2;\n      } elsif (m/Overall avg like per frame = (\\S+) over (\\S+) frames/) {\n        $loglike = $1;\n        $num_frames = $2;\n      } elsif (m/Split (\\S+) states .+ split 
#Gauss from \\S+ to (\\S+)/) {\n        $num_states = $1;\n        $num_gauss_after_split = $2;\n      }\n    }\n    close(F);\n  } else {\n    return \"\";  # something went wrong.\n  }\n  $ans = \"\";\n\n  if ((! defined $align_info || $align_info eq \"\")) {\n    # add some info that we'd otherwise get from the alignment jobs.\n    if ($num_frames != 0) {\n      # add info about how much data we trained on.\n      $ans .= sprintf(\" %.2fh data\", $num_frames / 360000.0);\n    }\n    if ($loglike != 0) {\n      $ans .= sprintf(\" log-like=%.2f\", $loglike);\n    }\n  }\n\n  if ($num_states != 0) {\n    $ans .= sprintf(\" states=%d\", $num_states);\n  }\n\n  # the next line is really just in case there was no splitting done-- in that\n  # case we get the num-gauss from the line about the variance flooring.\n  $max_num_gauss = ($num_gauss_tot > $num_gauss_after_split ? $num_gauss_tot : $num_gauss_after_split);\n  if ($max_num_gauss > 0) { $ans .= \" gauss=$max_num_gauss\"; }\n\n  if ($num_gauss_tot > 0 && $num_gauss_removed > 0) {\n    $ans .= sprintf(\" lowcount-gauss-removed=%d\", $num_gauss_removed);\n  }\n\n  if ($num_gauss_tot > 0 && $num_gauss_floored > 0) {\n    $ans .= sprintf(\" gauss-floored=%.2f%%\", $num_gauss_floored * 100.0 / $num_gauss_tot);\n  }\n  return $ans;\n}\n\n\nsub get_fmllr_info {\n  my %fmllr_num_frames = ();  # maps from fmllr iteration to num-frames\n  my %fmllr_auxf_impr = ();  # maps from fmllr iteration to total auxf impr times num-frames.\n  foreach $log_file (@log_files) {\n    if ($log_file =~ m/^fmllr.(\\d+).(\\d+).log$/) {\n      $iter = $1;\n      $job_number = $2;\n      if ($job_number <= $num_jobs && open(F, \"<$gmm_dir/log/$log_file\")) {\n        while (<F>) {\n          if (m/Overall fMLLR auxf impr per frame is (\\S+) over (\\S+) frames/) {\n            $fmllr_num_frames{$iter} += $2;\n            $fmllr_auxf_impr{$iter} += $1 * $2;\n          }\n        }\n        close(F);\n      }\n    }\n  }\n  my $tot_auxf_impr = 0.0;\n  my $num_frames 
= 0.0;\n  # the fMLLR auxf impr will be summed over the fMLLR iterations.\n  foreach $iter (sort(keys %fmllr_auxf_impr)) {\n    if ($debug) {\n      print STDERR \"fmllr iter $iter: $fmllr_auxf_impr{$iter} / $fmllr_num_frames{$iter}\\n\";\n    }\n    $tot_auxf_impr += $fmllr_auxf_impr{$iter} / $fmllr_num_frames{$iter};\n    $num_frames = $fmllr_num_frames{$iter};  # take the num-frames from the final iteration.\n  }\n  if ($tot_auxf_impr != 0.0 && $num_frames != 0.0) {\n    return sprintf(\" fmllr-impr=%.2f over %.2fh\", $tot_auxf_impr, $num_frames / 360000.0);\n  } else {\n    return \"\";\n  }\n}\n\nsub get_mllt_info {\n  # note: both the objective improvement and logdet are summed over\n  # all the iterations of MLLT update.\n  my $mllt_objf_impr = 0.0;\n  my $mllt_logdet = 0.0;\n\n  foreach $log_file (@log_files) {\n    if ($log_file =~ m/^mupdate.\\d+.log$/) {\n      if (open(F, \"<$gmm_dir/log/$log_file\")) {\n        while (<F>) {\n          if (m/Overall objective function improvement for MLLT is (\\S+) over \\S+ frames, logdet is (\\S+)/) {\n            $mllt_objf_impr += $1;\n            $mllt_logdet += $2;\n          }\n        }\n        close(F);\n      }\n    }\n  }\n  if ($mllt_objf_impr != 0.0 && $mllt_logdet != 0.0) {\n    return sprintf(\" mllt:impr,logdet=%.2f,%.2f\", $mllt_objf_impr, $mllt_logdet);\n  } else {\n    return \"\";\n  }\n}\n\nsub get_tree_info {\n  $ans = \"\";\n  if (open(F, \"<$gmm_dir/log/build_tree.log\")) {\n    while (<F>) {\n      if (m/Including just phones that were split, improvement is (\\S+) per frame/) {\n        $ans = sprintf(\" tree-impr=%.2f\", $1);\n      }\n    }\n    close(F);\n  }\n  return $ans;\n}\n\nsub get_lda_info {\n  $ans = \"\";\n  if (open(F, \"<$gmm_dir/log/lda_est.log\")) {\n    while (<F>) {\n      if (m/Sum of selected singular values is (\\S+)/) {\n        $ans = sprintf(\" lda-sum=%.2f\", $1);\n      }\n    }\n    close(F);\n  }\n  return $ans;\n}\n\n\n@log_files = list_all_log_files();\n\nif 
(@log_files == 0) {\n  exit(1);\n}\n\n$output_string = \"$gmm_dir:\";\n\n$num_jobs = get_num_jobs();  # will crash on failure.\n\n$output_string .= \" nj=$num_jobs\";\n\n$insufficient_output_string = $output_string;\n\n$align_info =  get_last_align_info();\n$output_string .= $align_info;\n\n$output_string .= get_last_update_info($align_info);\n\n$output_string .= get_fmllr_info();\n\n$output_string .= get_tree_info();\n\n$output_string .= get_lda_info();\n\n$output_string .= get_mllt_info();\n\nprint $output_string . \"\\n\";\n\nif ($output_string eq $insufficient_output_string) {\n  # if we only had \"$gmm_dir: nj=$num_jobs\", then it's probably not a GMM dir:\n  # exit with status 1.\n  exit(1);\n}\n\nexit(0);\n\n"
  },
  {
    "path": "egs/steps/info/nnet2_dir_info.pl",
    "content": "#!/usr/bin/perl -w\n\nuse Fcntl;\n\n# we may at some point support options.\n\n$debug = 0;  # we set it to 1 for debugging the script itself.\n\nif ($ARGV[0] eq \"--debug\") {\n  $debug = 1;\n  shift @ARGV;\n}\n\nif (@ARGV == 0) {\n  print STDERR \"Usage: steps/info/nnet2_dir_info.pl [--debug] <nnet3-dir1> [<nnet3-dir2> ... ]\\n\" .\n               \"e.g: steps/info/nnet2_dir_info.pl exp/nnet3/tdnn_sp\\n\" .\n               \"This script extracts some important information from the logs\\n\" .\n               \"and displays it on a single (rather long) line.\\n\" .\n               \"The --debug option is just to debug the script itself.\\n\" .\n               \"This program exits with status 0 if it seems like the arguments\\n\" .\n               \"really were of the expected directory type, and 1 otherwise.\\n\";\n  exit(1);\n}\n\nif (@ARGV > 1) {\n  # repeatedly invoke this program with each of the remaining args.\n  $exit_status = 0;\n  if ($debug) { $debug_opt = \"--debug \" } else { $debug_opt = \"\"; }\n  foreach $dir (@ARGV) {\n    if (system(\"$0 $debug_opt$dir\") != 0) {\n      $exit_status = 1;\n    }\n  }\n  exit($exit_status);\n}\n\n$nnet_dir = shift @ARGV;\n\nsub list_all_log_files {\n  my @ans = ();\n  my $dh;\n  if (!opendir($dh, \"$nnet_dir/log\")) { return (); }\n  @ans = readdir $dh;\n  closedir $dh;\n  return @ans;\n}\n\n\n# returns 1 if the diagnostics are finished on this iter, else 0.\nsub diagnostics_are_finished_on_iter {\n  my $ans = 1;\n  my $iter = shift @_;\n  if (!open(F, \"<$nnet_dir/log/compute_prob_train.$iter.log\")) {\n    return 0;\n  }\n  $found_loglike = 0;\n  while (<F>) {\n    if (m/Overall log-likelihood/) { $found_loglike = 1; }\n  }\n  if (!$found_loglike) { $ans = 0; }\n  close(F);\n  if (!open(F, \"<$nnet_dir/log/compute_prob_valid.$iter.log\")) {\n    return 0;\n  }\n  $found_loglike = 0;\n  while (<F>) {\n    if (m/Overall log-likelihood/) { $found_loglike = 1; }\n  }\n  if (!$found_loglike) { $ans = 0; 
}\n  close(F);\n  return $ans;\n}\n\n# get the number of iterations.\n# note: the iterations go from 0 to num-iters-1.\n# if num_iters = 0 this program will just exit with status 1.\n# we may return a number slightly less than the number of iterations\n# in order to ensure that the compute_prob_train and compute_prob_valid\n# processes have finished.\nsub get_num_iters {\n  my $iter = 0;\n  while (defined $log_file_hash{\"train.$iter.1.log\"}) {\n    $iter++;\n  }\n  if ($iter == 0) {\n    die \"$nnet_dir does not seem to be an nnet2 neural net training directory.\";\n  }\n  my $last_iter = $iter - 1;\n  # find an iteration where the diagnostic jobs compute_prob_{train,valid}.$last_iter.log are done.\n  for (my $chosen_last_iter = $last_iter;\n       $chosen_last_iter >= $last_iter - 6 && $chosen_last_iter >= 0;\n       $chosen_last_iter--) {\n    if (! diagnostics_are_finished_on_iter($chosen_last_iter)) {\n      if ($debug) {\n        print STDERR \"nnet2_dir_info.pl: diagnostics not finished running on iteration $chosen_last_iter\\n\";\n      }\n    } else {\n      return $chosen_last_iter + 1;\n    }\n  }\n  # OK, something's not right, just return the original iteration.\n  return $iter;\n}\n\nsub get_num_jobs_initial {\n  my $num_jobs = 1;\n  while (defined $log_file_hash{\"train.0.$num_jobs.log\"}) {\n    $num_jobs++;\n  }\n  $num_jobs--;\n  if ($num_jobs == 0) {\n    die \"$nnet_dir does not seem to be an nnet2 neural net training directory.\";\n  }\n  return $num_jobs;\n}\n\n\nsub get_num_jobs_final {  # expects $num_iters to exist as a global variable.\n  my $final_iter = $num_iters - 1;\n  my $num_jobs = 1;\n  while (defined $log_file_hash{\"train.$final_iter.$num_jobs.log\"}) {\n    $num_jobs++;\n  }\n  $num_jobs--;\n  if ($num_jobs == 0) {\n    die \"$nnet_dir does not seem to be an nnet2 neural net training directory.\";\n  }\n  return $num_jobs;\n}\n\nsub get_combine_info {\n  # returns a string with info about the combination stage, or the empty\n  
# string if there wasn't one.\n  if (defined $log_file_hash{\"combine.log\"} &&\n      open(F, \"<$nnet_dir/log/combine.log\")) {\n    while (<F>) {\n      if (m/Combining nnets, objective function changed from (\\S+) to (\\S+)/) {\n        close(F);\n        return sprintf(\" combine=%.2f->%.2f\", $1, $2);\n      }\n    }\n  }\n  return \"\";\n}\n\n# this is used in get_loglike_and_accuracy to format\n# strings like ' loglike[32,48,final],train/valid=(-2.43,-2.32,-2.21/-2.84,-2.71,-2.68)'.\nsub get_printed_string {\n  # $name might be 'loglike', for example.\n  my ($name, $iters_array_ref, $train_hash_ref, $valid_hash_ref) = @_;\n  my @iters_array = @$iters_array_ref;\n  my %train_hash = %$train_hash_ref;  # hash from iter-string to value.\n  my %valid_hash = %$valid_hash_ref;  # hash from iter-string to value.\n  my @iters_to_print = ();\n  my @train_values_to_print = ();\n  my @valid_values_to_print = ();\n  foreach my $iter (@iters_array) {\n    if (defined($train_hash{$iter}) && defined($valid_hash{$iter})) {\n      push @iters_to_print, $iter;\n      push @train_values_to_print, sprintf(\"%.2f\", $train_hash{$iter});\n      push @valid_values_to_print, sprintf(\"%.2f\", $valid_hash{$iter});\n    }\n  }\n  if (@iters_to_print == 0) {  return \"\"; }\n  my $joined_iters = join(\",\", @iters_to_print);\n  my $joined_train_values = join(\",\", @train_values_to_print);\n  my $joined_valid_values = join(\",\", @valid_values_to_print);\n  return \" ${name}:train/valid[$joined_iters]=($joined_train_values/$joined_valid_values)\";\n}\n\n# invoke this as get_objf_iter($iter1, $iter2,..) where $iterN is the string-valued\n# iteration, e.g. 
\"92\", or \"final\", or \"combined\", such that we expect\n# $nnet_dir/log/compute_prob_{train,valid}.$iterN.log to exist.\nsub get_loglike_and_accuracy_info {\n  my @iters_array = @_;\n  my %iter_to_train_loglike = ();\n  my %iter_to_valid_loglike = ();\n  my %iter_to_train_accuracy = ();\n  my %iter_to_valid_accuracy = ();\n\n\n  foreach my $iter (@iters_array) {\n    if (defined $log_file_hash{\"compute_prob_train.$iter.log\"} &&\n        defined $log_file_hash{\"compute_prob_valid.$iter.log\"} &&\n        open(F, \"<$nnet_dir/log/compute_prob_train.$iter.log\") &&\n        open(G, \"<$nnet_dir/log/compute_prob_valid.$iter.log\")) {\n      while (<F>) {\n        if (m/average probability is (\\S+) and accuracy is (\\S+) with total weight \\S+/) {\n          $iter_to_train_loglike{$iter} = $1;\n          $iter_to_train_accuracy{$iter} = $2;\n        }\n      }\n      close(F);\n      while (<G>) {\n        if (m/average probability is (\\S+) and accuracy is (\\S+) with total weight \\S+/) {\n          $iter_to_valid_loglike{$iter} = $1;\n          $iter_to_valid_accuracy{$iter} = $2;\n        }\n      }\n      close(G);\n    }\n  }\n  $ans = \"\";\n  $ans .= get_printed_string(\"loglike\", \\@iters_array, \\%iter_to_train_loglike,\n                             \\%iter_to_valid_loglike);\n  $ans .= get_printed_string(\"accuracy\", \\@iters_array, \\%iter_to_train_accuracy,\n                             \\%iter_to_valid_accuracy);\n  return $ans;\n}\n\n# invoke this as get_progress_info($iter), e.g. 
set $iter to the last\n# iteration number.\nsub get_progress_info {\n  my $iter = shift @_;\n  if (!defined $log_file_hash{\"progress.$iter.log\"} ||\n      !open(F, \"<$nnet_dir/log/progress.$iter.log\")) {\n    return \"\";\n  }\n  my $num_parameters = \"0\";\n  my $output_dim = 0;\n  my $input_dim = 0;\n  while (<F>) {\n    if (m/^parameter-dim (\\S+)/) {\n      $num_parameters = sprintf(\"%.1fM\", $1 / 1000000.0);\n    }\n    if (m/^input-dim (\\S+)/) {\n      $input_dim = $1;\n    }\n    if (m/^output-dim (\\S+)/) {\n      $output_dim = $1;\n    }\n  }\n  close(F);\n  $ans = \"\";\n  if ($num_parameters ne \"0\") {  $ans .= \" num-params=$num_parameters\"; }\n  if ($output_dim > 0 && $input_dim > 0) {\n    $ans .= \" dim=$input_dim->$output_dim\";\n  } elsif ($output_dim > 0) {\n    $ans .= \" output-dim=$output_dim\";\n  }\n  return $ans;\n}\n\n# return 1 if we seem to have finished training, else 0.\nsub finished_training {\n  return defined $log_file_hash{\"compute_prob_train.final.log\"} ||\n    defined $log_file_hash{\"compute_prob_train.combined.log\"};\n}\n\n@log_files = list_all_log_files();\nif (@log_files == 0) {  exit(1); }\n%log_file_hash = ();\nforeach $f (@log_files) { $log_file_hash{$f} = 1; }\n\n$num_iters = get_num_iters();\n$num_jobs_initial = get_num_jobs_initial();\n$num_jobs_final = get_num_jobs_final();\n$last_iter = $num_iters - 1;\n$two_thirds_iter = int($last_iter * 0.666);\n\n$output_string = \"$nnet_dir: num-iters=$num_iters\";\n\n$output_string .= \" nj=$num_jobs_initial..$num_jobs_final\";\n\n$output_string .= get_progress_info(\"$last_iter\");\n\n$output_string .= get_combine_info();\n\n\n\n# note: IIRC some of the scripts use the name 'combined' for the model after\n# combination, and some 'final', so we try both; only one of these will\n# actually produce any output.\n\n\n@iters_array = (\"$two_thirds_iter\", \"$last_iter\", \"final\", \"combined\");\n\n$output_string .= get_loglike_and_accuracy_info(@iters_array);\n\nprint 
\"$output_string\\n\";\n\nexit(0);\n"
  },
  {
    "path": "egs/steps/info/nnet3_dir_info.pl",
    "content": "#!/usr/bin/perl -w\n\nuse Fcntl;\n\n# we may at some point support options.\n\n$debug = 0;  # we set it to 1 for debugging the script itself.\n\nif ($ARGV[0] eq \"--debug\") {\n  $debug = 1;\n  shift @ARGV;\n}\n\nif (@ARGV == 0) {\n  print STDERR \"Usage: steps/info/nnet3_dir_info.pl [--debug] <nnet3-dir1> [<nnet3-dir2> ... ]\\n\" .\n               \"e.g: steps/info/nnet3_dir_info.pl exp/nnet3/tdnn_sp\\n\" .\n               \"This script extracts some important information from the logs\\n\" .\n               \"and displays it on a single (rather long) line.\\n\" .\n               \"The --debug option is just to debug the script itself.\\n\" .\n               \"This program exits with status 0 if it seems like the arguments\\n\" .\n               \"really were of the expected directory type, and 1 otherwise.\\n\";\n  exit(1);\n}\n\nif (@ARGV > 1) {\n  # repeatedly invoke this program with each of the remaining args.\n  $exit_status = 0;\n  if ($debug) { $debug_opt = \"--debug \" } else { $debug_opt = \"\"; }\n  foreach $dir (@ARGV) {\n    if (system(\"$0 $debug_opt$dir\") != 0) {\n      $exit_status = 1;\n    }\n  }\n  exit($exit_status);\n}\n\n$nnet_dir = shift @ARGV;\n\nsub list_all_log_files {\n  my @ans = ();\n  my $dh;\n  if (!opendir($dh, \"$nnet_dir/log\")) { return (); }\n  @ans = readdir $dh;\n  closedir $dh;\n  return @ans;\n}\n\n\n# returns 1 if the diagnostics are finished on this iter, else 0.\nsub diagnostics_are_finished_on_iter {\n  my $ans = 1;\n  my $iter = shift @_;\n  if (!open(F, \"<$nnet_dir/log/compute_prob_train.$iter.log\")) {\n    return 0;\n  }\n  $found_loglike = 0;\n  while (<F>) {\n    if (m/Overall log-likelihood/) { $found_loglike = 1; }\n  }\n  if (!$found_loglike) { $ans = 0; }\n  close(F);\n  if (!open(F, \"<$nnet_dir/log/compute_prob_valid.$iter.log\")) {\n    return 0;\n  }\n  $found_loglike = 0;\n  while (<F>) {\n    if (m/Overall log-likelihood/) { $found_loglike = 1; }\n  }\n  if (!$found_loglike) { $ans = 0; 
}\n  close(F);\n  return $ans;\n}\n\n# get the number of iterations.\n# note: the iterations go from 0 to num-iters-1.\n# if num_iters = 0 this program will just exit with status 1.\n# we may return a number slightly less than the number of iterations\n# in order to ensure that the compute_prob_train and compute_prob_valid\n# processes have finished.\nsub get_num_iters {\n  my $iter = 0;\n  while (defined $log_file_hash{\"train.$iter.1.log\"}) {\n    $iter++;\n  }\n  if ($iter == 0) {\n    die \"$nnet_dir does not seem to be an nnet3 neural net training directory.\";\n  }\n  my $last_iter = $iter - 1;\n  # find an iteration where the diagnostic jobs compute_prob_{train,valid}.$last_iter.log are done.\n  for (my $chosen_last_iter = $last_iter;\n       $chosen_last_iter >= $last_iter - 6 && $chosen_last_iter >= 0;\n       $chosen_last_iter--) {\n    if (! diagnostics_are_finished_on_iter($chosen_last_iter)) {\n      if ($debug) {\n        print STDERR \"nnet3_dir_info.pl: diagnostics not finished running on iteration $chosen_last_iter\\n\";\n      }\n    } else {\n      return $chosen_last_iter + 1;\n    }\n  }\n  # OK, something's not right, just return the original iteration.\n  return $iter;\n}\n\nsub get_num_jobs_initial {\n  my $num_jobs = 1;\n  while (defined $log_file_hash{\"train.0.$num_jobs.log\"}) {\n    $num_jobs++;\n  }\n  $num_jobs--;\n  if ($num_jobs == 0) {\n    die \"$nnet_dir does not seem to be an nnet3 neural net training directory.\";\n  }\n  return $num_jobs;\n}\n\n\nsub get_num_jobs_final {  # expects $num_iters to exist as a global variable.\n  my $final_iter = $num_iters - 1;\n  my $num_jobs = 1;\n  while (defined $log_file_hash{\"train.$final_iter.$num_jobs.log\"}) {\n    $num_jobs++;\n  }\n  $num_jobs--;\n  if ($num_jobs == 0) {\n    die \"$nnet_dir does not seem to be an nnet3 neural net training directory.\";\n  }\n  return $num_jobs;\n}\n\nsub get_combine_info {\n  # returns a string with info about the combination stage, or the empty\n  
# string if there wasn't one.\n  if (defined $log_file_hash{\"combine.log\"} &&\n      open(F, \"<$nnet_dir/log/combine.log\")) {\n    while (<F>) {\n      if (m/Combining nnets, objective function changed from (\\S+) to (\\S+)/) {\n        close(F);\n        return sprintf(\" combine=%.2f->%.2f\", $1, $2);\n      } elsif (m/Combining (\\S+) nnets, objective function changed from (\\S+) to (\\S+)/) {\n        close(F);\n        return sprintf(\" combine=%.2f->%.2f (over %d)\", $2, $3, $1); \n      }\n    }\n  }\n  return \"\";\n}\n\nsub number_to_string {\n  my ($value, $name) = @_;\n  my $precision;\n  if (abs($value) < 0.02 or ($name eq \"accuracy\" and abs($value) > 0.97)) {\n    $precision = 4;\n  } elsif (abs($value) < 0.2 or ($name eq \"accuracy\" and abs($value) > 0.7)) {\n    $precision = 3;\n  } else {\n    $precision = 2;\n  }\n  my $format = \"%.${precision}f\";  # e.g. \"%.2f\"\n  return sprintf($format, $value);\n}\n\n# this is used in get_loglike_and_accuracy to format\n# strings like ' loglike[32,48,final],train/valid=(-2.43,-2.32,-2.21/-2.84,-2.71,-2.68)'.\nsub get_printed_string {\n  # $name might be 'loglike', for example.\n  my ($name, $iters_array_ref, $train_hash_ref, $valid_hash_ref) = @_;\n  my @iters_array = @$iters_array_ref;\n  my %train_hash = %$train_hash_ref;  # hash from iter-string to value.\n  my %valid_hash = %$valid_hash_ref;  # hash from iter-string to value.\n  my @iters_to_print = ();\n  my @train_values_to_print = ();\n  my @valid_values_to_print = ();\n  foreach my $iter (@iters_array) {\n    if (defined($train_hash{$iter}) && defined($valid_hash{$iter})) {\n      push @iters_to_print, $iter;\n      push @train_values_to_print, number_to_string($train_hash{$iter}, $name);\n      push @valid_values_to_print, number_to_string($valid_hash{$iter}, $name);\n    }\n  }\n  if (@iters_to_print == 0) {  return \"\"; }\n  my $joined_iters = join(\",\", @iters_to_print);\n  my $joined_train_values = join(\",\", @train_values_to_print);\n 
 my $joined_valid_values = join(\",\", @valid_values_to_print);\n  return \" ${name}:train/valid[$joined_iters]=($joined_train_values/$joined_valid_values)\";\n}\n\n# invoke this as get_objf_iter($iter1, $iter2,..) where $iterN is the string-valued\n# iteration, e.g. \"92\", or \"final\", or \"combined\", such that we expect\n# $nnet_dir/log/compute_prob_{train,valid}.$iterN.log to exist.\nsub get_loglike_and_accuracy_info {\n  my @iters_array = @_;\n  my %iter_to_train_loglike = ();\n  my %iter_to_valid_loglike = ();\n  my %iter_to_train_accuracy = ();\n  my %iter_to_valid_accuracy = ();\n\n\n  foreach my $iter (@iters_array) {\n    if (defined $log_file_hash{\"compute_prob_train.$iter.log\"} &&\n        defined $log_file_hash{\"compute_prob_valid.$iter.log\"} &&\n        open(F, \"<$nnet_dir/log/compute_prob_train.$iter.log\") &&\n        open(G, \"<$nnet_dir/log/compute_prob_valid.$iter.log\")) {\n      while (<F>) {\n        if (m/Overall log-likelihood for 'output' is (\\S+) per frame/) {\n          $iter_to_train_loglike{$iter} = $1;\n        } elsif (m/Overall accuracy for 'output' is (\\S+) per frame/) {\n          $iter_to_train_accuracy{$iter} = $1;\n        }\n      }\n      close(F);\n      while (<G>) {\n        if (m/Overall log-likelihood for 'output' is (\\S+) per frame/) {\n          $iter_to_valid_loglike{$iter} = $1;\n        } elsif (m/Overall accuracy for 'output' is (\\S+) per frame/) {\n          $iter_to_valid_accuracy{$iter} = $1;\n        }\n      }\n      close(G);\n    }\n  }\n  $ans = \"\";\n  $ans .= get_printed_string(\"loglike\", \\@iters_array, \\%iter_to_train_loglike,\n                             \\%iter_to_valid_loglike);\n  $ans .= get_printed_string(\"accuracy\", \\@iters_array, \\%iter_to_train_accuracy,\n                             \\%iter_to_valid_accuracy);\n  return $ans;\n}\n\n# invoke this as get_progress_info($iter), e.g. 
set $iter to the last\n# iteration number.\nsub get_progress_info {\n  my $iter = shift @_;\n  if (!defined $log_file_hash{\"progress.$iter.log\"} ||\n      !open(F, \"<$nnet_dir/log/progress.$iter.log\")) {\n    return \"\";\n  }\n  my $num_parameters = \"0\";\n  my $output_dim = 0;\n  my $input_dim = 0;\n  my $ivector_dim = 0;\n  my $max_clipped_proportion = 0.0;\n  while (<F>) {\n    if (m/clipped-proportion=([^,]+)/ && $1 > $max_clipped_proportion) {\n      $max_clipped_proportion = $1;\n    }\n    if (m/^num-parameters: (\\S+)/) {\n      $num_parameters = sprintf(\"%.1fM\", $1 / 1000000.0);\n    }\n    if (m/^output-node.* name=output .*dim=(\\S+)/) {\n      $output_dim = $1;\n    }\n    if (m/^input-node.* name=input .*dim=(\\S+)/) {\n      $input_dim = $1;\n    }\n    if (m/^input-node.* name=ivector .*dim=(\\S+)/) {\n      $ivector_dim = $1;\n    }\n  }\n  close(F);\n  $ans = \"\";\n  if ($num_parameters ne \"0\") {  $ans .= \" num-params=$num_parameters\"; }\n  if ($max_clipped_proportion > 0.1) {\n    if ($max_clipped_proportion > 0.3) {\n      $ans .= \" **max-clipped-proportion=$max_clipped_proportion**\";  # for emphasis; this generally isn't good.\n    } else {\n      $ans .= \" max-clipped-proportion=$max_clipped_proportion\";\n    }\n  }\n  if ($output_dim > 0 && $input_dim > 0 && $ivector_dim > 0) {\n    $ans .= \" dim=$input_dim+$ivector_dim->$output_dim\";\n  } elsif ($output_dim > 0 && $input_dim > 0) {\n    $ans .= \" dim=$input_dim->$output_dim\";\n  } elsif ($output_dim > 0) {\n    $ans .= \" output-dim=$output_dim\";\n  }\n  return $ans;\n}\n\n# return 1 if we seem to have finished training, else 0.\nsub finished_training {\n  return defined $log_file_hash{\"compute_prob_train.final.log\"} ||\n    defined $log_file_hash{\"compute_prob_train.combined.log\"};\n}\n\n@log_files = list_all_log_files();\nif (@log_files == 0) {  exit(1); }\n%log_file_hash = ();\nforeach $f (@log_files) { $log_file_hash{$f} = 1; }\n\n$num_iters = 
get_num_iters();\n$num_jobs_initial = get_num_jobs_initial();\n$num_jobs_final = get_num_jobs_final();\n$last_iter = $num_iters - 1;\n$two_thirds_iter = int($last_iter * 0.666);\n\n$output_string = \"$nnet_dir: num-iters=$num_iters\";\n\n$output_string .= \" nj=$num_jobs_initial..$num_jobs_final\";\n\n$output_string .= get_progress_info(\"$last_iter\");\n\n$output_string .= get_combine_info();\n\n\n\n# note: IIRC some of the scripts use the name 'combined' for the model after\n# combination, and some 'final', so we try both; only one of these will\n# actually produce any output.\n\n\n@iters_array = (\"$two_thirds_iter\", \"$last_iter\", \"final\", \"combined\");\n\n$output_string .= get_loglike_and_accuracy_info(@iters_array);\n\nprint \"$output_string\\n\";\n\nexit(0);\n"
  },
  {
    "path": "egs/steps/info/nnet3_disc_dir_info.pl",
    "content": "#!/usr/bin/perl -w\n\nuse Fcntl;\n\n# we may at some point support options.\n\n$debug = 0;  # we set it to 1 for debugging the script itself.\n\nif ($ARGV[0] eq \"--debug\") {\n  $debug = 1;\n  shift @ARGV;\n}\n\nif (@ARGV == 0) {\n  print STDERR \"Usage: steps/info/nnet3_disc_dir_info.pl [--debug] <nnet3-disc-dir1> [<nnet3-disc-dir2> ... ]\\n\" .\n               \"e.g: steps/info/nnet3_dir_info.pl exp/nnet3/tdnn_sp_smbr\\n\" .\n               \"This script extracts some important information from the logs\\n\" .\n               \"and displays it on a few lines.\\n\" .\n               \"The --debug option is just to debug the script itself.\\n\" .\n               \"This program exits with status 0 if it seems like the argument\\n\" .\n               \"really was a GMM dir, and 1 otherwise.\\n\";\n  exit(1);\n}\n\nif (@ARGV > 1) {\n  # repeatedly invoke this program with each of the remaining args.\n  $exit_status = 0;\n  if ($debug) { $debug_opt = \"--debug \" } else { $debug_opt = \"\"; }\n  foreach $dir (@ARGV) {\n    if (system(\"$0 $debug_opt$dir\") != 0) {\n      $exit_status = 1;\n    }\n  }\n  exit($exit_status);\n}\n\n# from this point we can assume we're invoked with one argument.\n$nnet_dir = shift @ARGV;\n\n# This function returns an array of iteration numbers, one\n# for each epoch that has already completed (but including\n# epoch zero)... e.g.\n# it might return (0, 194, 388, 582).\n# This is done by reading the soft links, e.g. 
epoch1.mdl -> 194.mdl\nsub get_iters_for_epochs {\n  my @ans = ();\n  for (my $n = 0; 1; $n++) {\n    if (-l \"$nnet_dir/epoch$n.mdl\") {\n      my $link_name = readlink(\"$nnet_dir/epoch$n.mdl\");\n      if ($link_name =~ m/^(\\d+)\\.mdl/) {\n        my $iter = $1;\n        push @ans, $iter;\n      } else {\n        die \"unexpected link name $nnet_dir/epoch$n.mdl -> $link_name\";\n      }\n    } else {\n      if (@ans == 0) {\n        die \"$nnet_dir does not seem to be a discriminative-training dir \" .\n          \"(expected a link $nnet_dir/epoch0.mdl)\";\n      }\n      return @ans;\n    }\n  }\n}\n\n\n# returns the number of parallel training jobs, inferred from the\n# train.0.<job>.log files of iteration 0.\nsub get_num_jobs {\n  for (my $j = 1; 1; $j++) {\n    if (! -f \"$nnet_dir/log/train.0.$j.log\") {\n      if ($j == 1) {\n        die \"$nnet_dir does not seem to be a discriminative-training dir \" .\n          \"(expected $nnet_dir/log/train.0.1.log to exist)\";\n      } else {\n        return $j - 1;\n      }\n    }\n  }\n}\n\n# returns a string describing the effective learning rate and possibly\n# any final-layer-factor.\nsub get_effective_learning_rate_str {\n  # effective learning rate is the actual learning rate divided by the\n  # number of jobs.\n  my $convert_log = \"$nnet_dir/log/convert.log\";\n  if (-f $convert_log) {\n    open(F, \"<$convert_log\") || die \"Could not open $convert_log\";\n    while (<F>) {\n      if (m/--edits/) {\n        if (m/set-learning-rate learning-rate=(\\S+); set-learning-rate name=output.affine learning-rate=([^\"']+)[\"']/) {\n          my $learning_rate = $1;\n          my $last_layer_factor = sprintf(\"%.2f\", $2 / $1);\n          my $num_jobs = get_num_jobs();\n          my $effective_learning_rate = sprintf(\"%.3g\", $learning_rate / $num_jobs);\n          close(F);\n          return \"effective-lrate=$effective_learning_rate;last-layer-factor=$last_layer_factor\";\n        } elsif (m/set-learning-rate learning-rate=([^\"']+)[\"']/) {\n          my $learning_rate = $1;\n          my $num_jobs = get_num_jobs();\n          my 
$effective_learning_rate = sprintf(\"%.3g\", $learning_rate / $num_jobs);\n          close(F);\n          return \"effective-lrate=$effective_learning_rate\";\n        }\n      }\n    }\n  } else {\n    die(\"Expected file $convert_log to exist\");\n  }\n  close(F);\n  return \"lrate=??\";  # could not parse it from the log.\n}\n\n\n# prints some info about the objective function...\nsub get_objf_str {\n  my @iters_for_epochs = get_iters_for_epochs();\n  if (@iters_for_epochs == 1) {\n    die(\"No epochs have finished in directory $nnet_dir\")\n  }\n  # will produce output like:\n  # iters-per-epoch=123;epoch[0,1,2,3,4]:train-objf=[0.89,0.92,0.93,0.94],valid-objf=[...],train-counts=[...],valid-counts=[...]\"\n  # the \"counts\" are the average num+den occupation counts in the lattices; it's a measure of how much confusability\n  # there still is in the lattices.\n  my $iters_per_epoch = $iters_for_epochs[1] - $iters_for_epochs[0];\n  my $ans = \"iters-per-epoch=$iters_per_epoch\";\n  $ans .= \";epoch[\" . join(\",\", 0..$#iters_for_epochs) . 
\"]:\";\n  my @train_objfs = ();\n  my @train_counts = ();\n  my @valid_objfs = ();\n  my @valid_counts = ();\n  foreach $iter (@iters_for_epochs) {\n    if ($iter > 0) { $iter -= 1; }  # last iter will not exist.\n    my $train_log = \"$nnet_dir/log/compute_objf_train.$iter.log\";\n    my $valid_log = \"$nnet_dir/log/compute_objf_valid.$iter.log\";\n    if (!open (T, \"<$train_log\")){  print STDERR \"$0: warning: Expected file $train_log to exist\\n\"; }\n    if (!open (V, \"<$valid_log\")){  print STDERR \"$0: warning: Expected file $valid_log to exist\\n\"; }\n    my $train_count = \"??\";\n    my $valid_count = \"??\";\n    my $train_objf = \"??\";\n    my $valid_objf = \"??\";\n    while (<T>) {\n      if (m/num\\+den count.+is (\\S+) per frame/) { $train_count = sprintf(\"%.2f\", $1); }\n      if (m/Overall.+ is (\\S+) per frame/) { $train_objf = sprintf(\"%.2f\", $1); }\n    }\n    close(T);\n    while (<V>) {\n      if (m/num\\+den count.+is (\\S+) per frame/) { $valid_count = sprintf(\"%.2f\", $1); }\n      if (m/Overall.+ is (\\S+) per frame/) { $valid_objf = sprintf(\"%.2f\", $1); }\n    }\n    push @train_objfs, $train_objf;\n    push @train_counts, $train_count;\n    push @valid_objfs, $valid_objf;\n    push @valid_counts, $valid_count;\n    close(V);\n  }\n  $ans .= \"train-objf=[\" . join(\",\", @train_objfs) .\n       \"],valid-objf=[\" . join(\",\", @valid_objfs) .\n       \"],train-counts=[\" . join(\",\", @train_counts) .\n       \"],valid-counts=[\" . join(\",\", @valid_counts) . \"]\";\n  return $ans;\n}\n\n\n\n\n$output_string = \"$nnet_dir:num-jobs=\".get_num_jobs().\";\" .\n     get_effective_learning_rate_str() . \";\" . get_objf_str();\n\nprint \"$output_string\\n\";\n\nexit(0);\n"
  },
  {
    "path": "egs/steps/libs/__init__.py",
    "content": "\n\n# Copyright 2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This package contains modules and subpackages used in kaldi scripts.\n\"\"\"\n\nfrom . import common\n\n__all__ = [\"common\"]\n"
  },
  {
    "path": "egs/steps/libs/common.py",
    "content": "\n\n# Copyright 2016 Vijayaditya Peddinti.\n#           2016 Vimal Manohar\n#           2017 Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n\"\"\" This module contains several utility functions and classes that are\ncommonly used in many kaldi python scripts.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport logging\nimport math\nimport os\nimport subprocess\nimport sys\nimport threading\n\ntry:\n    import thread as thread_module\nexcept:\n    import _thread as thread_module\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\ndef send_mail(message, subject, email_id):\n    try:\n        subprocess.Popen(\n            'echo \"{message}\" | mail -s \"{subject}\" {email}'.format(\n                message=message,\n                subject=subject,\n                email=email_id), shell=True)\n    except Exception as e:\n        logger.info(\"Unable to send mail due to error:\\n {error}\".format(\n                        error=str(e)))\n        pass\n\n\ndef str_to_bool(value):\n    if value == \"true\":\n        return True\n    elif value == \"false\":\n        return False\n    else:\n        raise ValueError\n\n\nclass StrToBoolAction(argparse.Action):\n    \"\"\" A custom action to convert bools from shell format i.e., true/false\n        to python format i.e., True/False \"\"\"\n\n    def __call__(self, parser, namespace, values, option_string=None):\n        try:\n            setattr(namespace, self.dest, str_to_bool(values))\n        except ValueError:\n            raise Exception(\n                \"Unknown value {0} for --{1}\".format(values, self.dest))\n\n\nclass NullstrToNoneAction(argparse.Action):\n    \"\"\" A custom action to convert empty strings passed by shell to None in\n    python. This is necessary as shell scripts print null strings when a\n    variable is not specified. We could use the more apt None in python. 
\"\"\"\n\n    def __call__(self, parser, namespace, values, option_string=None):\n        if values.strip() == \"\":\n            setattr(namespace, self.dest, None)\n        else:\n            setattr(namespace, self.dest, values)\n\n\nclass smart_open(object):\n    \"\"\"\n    This class is designed to be used with the \"with\" construct in python\n    to open files. It is similar to the python open() function, but\n    treats the input \"-\" specially to return either sys.stdout or sys.stdin\n    depending on whether the mode is \"w\" or \"r\".\n\n    e.g.: with smart_open(filename, 'w') as fh:\n            print (\"foo\", file=fh)\n    \"\"\"\n    def __init__(self, filename, mode=\"r\"):\n        self.filename = filename\n        self.mode = mode\n        assert self.mode == \"w\" or self.mode == \"r\"\n\n    def __enter__(self):\n        if self.filename == \"-\" and self.mode == \"w\":\n            self.file_handle = sys.stdout\n        elif self.filename == \"-\" and self.mode == \"r\":\n            self.file_handle = sys.stdin\n        else:\n            self.file_handle = open(self.filename, self.mode)\n        return self.file_handle\n\n    def __exit__(self, *args):\n        if self.filename != \"-\":\n            self.file_handle.close()\n\n\ndef check_if_cuda_compiled():\n    p = subprocess.Popen(\"cuda-compiled\")\n    p.communicate()\n    if p.returncode == 1:\n        return False\n    else:\n        return True\n\n\ndef execute_command(command):\n    \"\"\" Runs a kaldi job in the foreground and waits for it to complete; raises an\n        exception if its return status is nonzero.  The command is executed in\n        'shell' mode so 'command' can involve things like pipes.  Often,\n        'command' will start with 'run.pl' or 'queue.pl'.  
The stdout and stderr\n        are merged with the calling process's stdout and stderr so they will\n        appear on the screen.\n\n        See also: get_command_stdout, background_command\n    \"\"\"\n    p = subprocess.Popen(command, shell=True)\n    p.communicate()\n    # compare by value: 'is not 0' tests object identity, not equality.\n    if p.returncode != 0:\n        raise Exception(\"Command exited with status {0}: {1}\".format(\n                p.returncode, command))\n\n\ndef get_command_stdout(command, require_zero_status = True):\n    \"\"\" Executes a command and returns its stdout output as a string.  The\n        command is executed with shell=True, so it may contain pipes and\n        other shell constructs.\n\n        If require_zero_status is True, this function will raise an exception if\n        the command has nonzero exit status.  If False, it just logs a warning\n        if the exit status is nonzero.\n\n        See also: execute_command, background_command\n    \"\"\"\n    p = subprocess.Popen(command, shell=True,\n                         stdout=subprocess.PIPE)\n\n    stdout = p.communicate()[0]\n    if p.returncode != 0:\n        output = \"Command exited with status {0}: {1}\".format(\n            p.returncode, command)\n        if require_zero_status:\n            raise Exception(output)\n        else:\n            logger.warning(output)\n    return stdout if type(stdout) is str else stdout.decode()\n\n\n\n\ndef wait_for_background_commands():\n    \"\"\" This waits for all threads to exit.  
You will often want to\n        run this at the end of programs that have launched background\n        threads, so that the program will wait for its child processes\n        to terminate before it dies.\"\"\"\n    for t in threading.enumerate():\n        if not t == threading.current_thread():\n            t.join()\n\ndef background_command(command, require_zero_status = False):\n    \"\"\"Executes a command in a separate thread, like running with '&' in the shell.\n       If you want the program to die if the command eventually returns with\n       nonzero status, then set require_zero_status to True.  'command' will be\n       executed in 'shell' mode, so it's OK for it to contain pipes and other\n       shell constructs.\n\n       This function returns the Thread object created, just in case you want\n       to wait for that specific command to finish.  For example, you could do:\n             thread = background_command('foo | bar')\n             # do something else while waiting for it to finish\n             thread.join()\n\n       See also:\n         - wait_for_background_commands(), which can be used\n           at the end of the program to wait for all these commands to terminate.\n         - execute_command() and get_command_stdout(), which allow you to\n           execute commands in the foreground.\n\n    \"\"\"\n\n    p = subprocess.Popen(command, shell=True)\n    thread = threading.Thread(target=background_command_waiter,\n                              args=(command, p, require_zero_status))\n    thread.daemon=True  # make sure it exits if main thread is terminated\n                        # abnormally.\n    thread.start()\n    return thread\n\n\ndef background_command_waiter(command, popen_object, require_zero_status):\n    \"\"\" This is the function that is called from background_command, in\n        a separate thread.\"\"\"\n\n    popen_object.communicate()\n    if popen_object.returncode != 0:\n        str = \"Command exited with status {0}: 
{1}\".format(\n            popen_object.returncode, command)\n        if require_zero_status:\n            logger.error(str)\n            # thread.interrupt_main() sends a KeyboardInterrupt to the main\n            # thread, which will generally terminate the program.\n            thread_module.interrupt_main()\n        else:\n            logger.warning(str)\n\n\ndef get_number_of_leaves_from_tree(alidir):\n    stdout = get_command_stdout(\n        \"tree-info {0}/tree 2>/dev/null | grep num-pdfs\".format(alidir))\n    parts = stdout.split()\n    assert(parts[0] == \"num-pdfs\")\n    num_leaves = int(parts[1])\n    if num_leaves == 0:\n        raise Exception(\"Number of leaves is 0\")\n    return num_leaves\n\n\ndef get_number_of_leaves_from_model(dir):\n    stdout = get_command_stdout(\n        \"am-info {0}/final.mdl 2>/dev/null | grep -w pdfs\".format(dir))\n    parts = stdout.split()\n    # number of pdfs 7115\n    assert(' '.join(parts[0:3]) == \"number of pdfs\")\n    num_leaves = int(parts[3])\n    if num_leaves == 0:\n        raise Exception(\"Number of leaves is 0\")\n    return num_leaves\n\n\ndef get_number_of_jobs(alidir):\n    try:\n        num_jobs = int(open('{0}/num_jobs'.format(alidir)).readline().strip())\n    except (IOError, ValueError) as e:\n        logger.error(\"Exception while reading the \"\n                     \"number of alignment jobs: \", exc_info=True)\n        raise SystemExit(1)\n    return num_jobs\n\n\ndef get_ivector_dim(ivector_dir=None):\n    if ivector_dir is None:\n        return 0\n    stdout_val = get_command_stdout(\n        \"feat-to-dim --print-args=false \"\n        \"scp:{dir}/ivector_online.scp -\".format(dir=ivector_dir))\n    ivector_dim = int(stdout_val)\n    return ivector_dim\n\ndef get_ivector_extractor_id(ivector_dir=None):\n    if ivector_dir is None:\n        return None\n    stdout_val = get_command_stdout(\n        \"steps/nnet2/get_ivector_id.sh {dir}\".format(dir=ivector_dir))\n\n    if 
(stdout_val.strip() == \"\") or (stdout_val is None):\n        return None\n\n    return stdout_val.strip()\n\ndef get_feat_dim(feat_dir):\n    if feat_dir is None:\n        return 0\n    stdout_val = get_command_stdout(\n        \"feat-to-dim --print-args=false \"\n        \"scp:{data}/feats.scp -\".format(data=feat_dir))\n    feat_dim = int(stdout_val)\n    return feat_dim\n\n\ndef get_feat_dim_from_scp(feat_scp):\n    stdout_val = get_command_stdout(\n        \"feat-to-dim --print-args=false \"\n        \"scp:{feat_scp} -\".format(feat_scp=feat_scp))\n    feat_dim = int(stdout_val)\n    return feat_dim\n\n\ndef read_kaldi_matrix(matrix_file):\n    \"\"\"This function reads a kaldi matrix stored in text format from\n    'matrix_file' and stores it as a list of rows, where each row is a list.\n    \"\"\"\n    try:\n        lines = [x.split() for x in open(matrix_file).readlines()]\n        first_field = lines[0][0]\n        last_field = lines[-1][-1]\n        lines[0] = lines[0][1:]\n        lines[-1] = lines[-1][:-1]\n        if not (first_field == \"[\" and last_field == \"]\"):\n            raise Exception(\n                \"Kaldi matrix file has incorrect format, \"\n                \"only text format matrix files can be read by this script\")\n        for i in range(len(lines)):\n            lines[i] = [int(float(x)) for x in lines[i]]\n        return lines\n    except IOError:\n        raise Exception(\"Error while reading the kaldi matrix file \"\n                        \"{0}\".format(matrix_file))\n\n\ndef write_kaldi_matrix(output_file, matrix):\n    \"\"\"This function writes the matrix stored as a list of lists\n    into 'output_file' in kaldi matrix text format.\n    \"\"\"\n    with open(output_file, 'w') as f:\n        f.write(\"[ \")\n        num_rows = len(matrix)\n        if num_rows == 0:\n            raise Exception(\"Matrix is empty\")\n        num_cols = len(matrix[0])\n\n        for row_index in range(len(matrix)):\n            if num_cols 
!= len(matrix[row_index]):\n                raise Exception(\"All the rows of a matrix are expected to \"\n                                \"have the same length\")\n            f.write(\" \".join([str(x) for x in matrix[row_index]]))\n            if row_index != num_rows - 1:\n                f.write(\"\\n\")\n        f.write(\" ]\")\n\n\ndef write_matrix_ascii(file_or_fd, mat, key=None):\n    \"\"\"This function writes the matrix 'mat' stored as a list of lists\n    in kaldi matrix text format.\n    The destination can be a file or an opened file descriptor.\n    If key is provided, then matrix is written to an archive with the 'key'\n    as the index field.\n    \"\"\"\n    try:\n        fd = open(file_or_fd, 'w')\n    except TypeError:\n        # 'file_or_fd' is opened file descriptor,\n        fd = file_or_fd\n\n    try:\n        if key is not None:\n            print (\"{0} [\".format(key),\n                   file=fd)  # ark-files have keys (utterance-id)\n        else:\n            print (\" [\", file=fd)\n\n        num_cols = 0\n        for i, row in enumerate(mat):\n            line = ' '.join([\"{0:f}\".format(x) for x in row])\n            if i == 0:\n                num_cols = len(row)\n            elif len(row) != num_cols:\n                raise Exception(\"All the rows of a matrix are expected to \"\n                                \"have the same length\")\n\n            if i == len(mat) - 1:\n                line += \" ]\"\n            print (line, file=fd)\n    finally:\n        if fd is not file_or_fd : fd.close()\n\n\ndef read_matrix_ascii(file_or_fd):\n    \"\"\"This function reads a matrix in kaldi matrix text format\n    and stores it as a list of lists.\n    The input can be a file or an opened file descriptor.\n    \"\"\"\n    try:\n        fd = open(file_or_fd, 'r')\n        fname = file_or_fd\n    except TypeError:\n        # 'file_or_fd' is opened file descriptor,\n        fd = file_or_fd\n        fname = file_or_fd.name\n\n    first = 
fd.read(2)\n    if first != ' [':\n        logger.error(\n            \"Kaldi matrix file %s has incorrect format, \"\n            \"only text format matrix files can be read by this script\",\n            fname)\n        raise RuntimeError\n\n    rows = []\n    while True:\n        line = fd.readline()\n        if len(line) == 0:\n            logger.error(\"Kaldi matrix file %s has incorrect format; \"\n                         \"got EOF before end of matrix\", fname)\n            # without this, readline() keeps returning '' and the loop\n            # never terminates.\n            raise RuntimeError\n        if len(line.strip()) == 0 : continue # skip empty line\n        arr = line.strip().split()\n        if arr[-1] != ']':\n            rows.append([float(x) for x in arr])  # not last line\n        else:\n            rows.append([float(x) for x in arr[:-1]])  # lastline\n            return rows\n    if fd is not file_or_fd:\n        fd.close()\n\n\ndef read_key(fd):\n  \"\"\" [str] = read_key(fd)\n   Read the utterance-key from the opened ark/stream descriptor 'fd'.\n  \"\"\"\n  str_ = ''\n  while True:\n    char = fd.read(1)\n    if char == '':\n        break\n    if char == ' ':\n        break\n    str_ += char\n  str_ = str_.strip()\n  if str_ == '':\n      return None   # end of file,\n  return str_\n\n\ndef read_mat_ark(file_or_fd):\n    \"\"\"This function reads a kaldi matrix archive in text format\n    and yields (key, matrix) pairs, where the key is the utterance-id.\n    The input can be a file or an opened file descriptor.\n\n    Example usage:\n    mat_dict = { key: mat for key, mat in read_mat_ark(file) }\n    \"\"\"\n    try:\n        fd = open(file_or_fd, 'r')\n        fname = file_or_fd\n    except TypeError:\n        # 'file_or_fd' is opened file descriptor,\n        fd = file_or_fd\n        fname = file_or_fd.name\n\n    try:\n        key = read_key(fd)\n        while key:\n          mat = read_matrix_ascii(fd)\n          yield key, mat\n          key = read_key(fd)\n    finally:\n        if fd is not file_or_fd:\n            fd.close()\n\n\ndef force_symlink(file1, file2):\n   
 import errno\n    try:\n        os.symlink(file1, file2)\n    except OSError as e:\n        if e.errno == errno.EEXIST:\n            os.remove(file2)\n            os.symlink(file1, file2)\n\n\ndef compute_lifter_coeffs(lifter, dim):\n    coeffs = [0] * dim\n    for i in range(0, dim):\n        coeffs[i] = 1.0 + 0.5 * lifter * math.sin(math.pi * i / float(lifter))\n\n    return coeffs\n\n\ndef compute_idct_matrix(K, N, cepstral_lifter=0):\n    matrix = [[0] * K for i in range(N)]\n    # normalizer for X_0\n    normalizer = math.sqrt(1.0 / float(N))\n    for j in range(0, N):\n        matrix[j][0] = normalizer\n    # normalizer for other elements\n    normalizer = math.sqrt(2.0 / float(N))\n    for k in range(1, K):\n        for n in range(0, N):\n            matrix[n][\n                k] = normalizer * math.cos(math.pi / float(N) * (n + 0.5) * k)\n\n    if cepstral_lifter != 0:\n        lifter_coeffs = compute_lifter_coeffs(cepstral_lifter, K)\n        for k in range(0, K):\n            for n in range(0, N):\n                matrix[n][k] = float(matrix[n][k]) / lifter_coeffs[k]\n\n    return matrix\n\n\ndef write_idct_matrix(feat_dim, cepstral_lifter, file_path):\n    # generate the IDCT matrix and write to the file\n    idct_matrix = compute_idct_matrix(feat_dim, feat_dim, cepstral_lifter)\n    # append a zero column to the matrix, this is the bias of the fixed affine\n    # component\n    for k in range(0, feat_dim):\n        idct_matrix[k].append(0)\n    write_kaldi_matrix(file_path, idct_matrix)\n"
  },
  {
    "path": "egs/steps/libs/nnet3/__init__.py",
    "content": "\n\n# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vimal Manohar\n#           2016    Vijayaditya Peddinti\n#           2016    Yiming Wang\n# Apache 2.0.\n\n\n# This module has the python functions which facilitate the use of nnet3 toolkit\n# It has two sub-modules\n# xconfig : Library for parsing high level description of neural networks\n# train : Library for training scripts\n"
  },
  {
    "path": "egs/steps/libs/nnet3/report/__init__.py",
    "content": "\n\n# Copyright 2016    Vimal Manohar\n# Apache 2.0.\n\nfrom . import log_parse\n\n__all__ = [\"log_parse\"]\n"
  },
  {
    "path": "egs/steps/libs/nnet3/report/log_parse.py",
    "content": "\n\n# Copyright 2016    Vijayaditya Peddinti\n#                   Vimal Manohar\n# Apache 2.0.\n\nfrom __future__ import division\nfrom __future__ import print_function\nimport traceback\nimport datetime\nimport logging\nimport re\n\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\ng_lstmp_nonlin_regex_pattern = ''.join([\".*progress.([0-9]+).log:component name=(.+) \",\n    \"type=(.*)Component,.*\",\n    \"i_t_sigmoid.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"f_t_sigmoid.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"c_t_tanh.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"o_t_sigmoid.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"m_t_tanh.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\]\"])\n\ng_normal_nonlin_regex_pattern = ''.join([\".*progress.([0-9]+).log:component name=(.+) \",\n    \"type=(.*)Component,.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\]\"])\n\ng_normal_nonlin_regex_pattern_with_oderiv = ''.join([\".*progress.([0-9]+).log:component name=(.+) \",\n    \"type=(.*)Component,.*\",\n    \"value-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), 
stddev=([0-9\\.e\\-]+)\\].*\",\n    \"deriv-avg=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\].*\",\n    \"oderiv-rms=\\[.*=\\((.+)\\), mean=([0-9\\.\\-e]+), stddev=([0-9\\.e\\-]+)\\]\"])\n\nclass KaldiLogParseException(Exception):\n    \"\"\" An Exception class that throws an error when there is an issue in\n    parsing the log files. Extend this class if more granularity is needed.\n    \"\"\"\n    def __init__(self, message = None):\n        if message is not None and message.strip() == \"\":\n            message = None\n\n        Exception.__init__(self,\n                           \"There was an error while trying to parse the logs.\"\n                           \" Details : \\n{0}\\n\".format(message))\n\n# This function is used to fill stats_per_component_per_iter table with the\n# results of regular expression.\n\ndef fill_nonlin_stats_table_with_regex_result(groups, gate_index, stats_table):\n    iteration = int(groups[0])\n    component_name = groups[1]\n    component_type = groups[2]\n    # for value-avg\n    value_percentiles = groups[3+gate_index*6]\n    value_mean = float(groups[4+gate_index*6])\n    value_stddev = float(groups[5+gate_index*6])\n    value_percentiles_split = re.split(',| ',value_percentiles)\n    assert len(value_percentiles_split) == 13\n    value_5th = float(value_percentiles_split[4])\n    value_50th = float(value_percentiles_split[6])\n    value_95th = float(value_percentiles_split[9])\n    # for deriv-avg\n    deriv_percentiles = groups[6+gate_index*6]\n    deriv_mean = float(groups[7+gate_index*6])\n    deriv_stddev = float(groups[8+gate_index*6])\n    deriv_percentiles_split = re.split(',| ',deriv_percentiles)\n    assert len(deriv_percentiles_split) == 13\n    deriv_5th = float(deriv_percentiles_split[4])\n    deriv_50th = float(deriv_percentiles_split[6])\n    deriv_95th = float(deriv_percentiles_split[9])\n\n    if len(groups) <= 9:\n        try:\n            if iteration in 
stats_table[component_name]['stats']:\n                stats_table[component_name]['stats'][iteration].extend(\n                        [value_mean,  value_stddev,\n                         deriv_mean,  deriv_stddev,\n                         value_5th,  value_50th,  value_95th,\n                         deriv_5th,  deriv_50th,  deriv_95th])\n            else:\n                stats_table[component_name]['stats'][iteration] = [\n                        value_mean,  value_stddev,\n                        deriv_mean,  deriv_stddev,\n                        value_5th,  value_50th,  value_95th,\n                        deriv_5th,  deriv_50th,  deriv_95th]\n        except KeyError:\n            stats_table[component_name] = {}\n            stats_table[component_name]['type'] = component_type\n            stats_table[component_name]['stats'] = {}\n            stats_table[component_name][\n                    'stats'][iteration] = [value_mean,  value_stddev,\n                                           deriv_mean,  deriv_stddev,\n                                           value_5th,  value_50th,  value_95th,\n                                           deriv_5th,  deriv_50th,  deriv_95th]\n    else:\n        #for oderiv-rms\n        oderiv_percentiles = groups[9+gate_index*6]\n        oderiv_mean = float(groups[10+gate_index*6])\n        oderiv_stddev = float(groups[11+gate_index*6])\n        oderiv_percentiles_split = re.split(',| ',oderiv_percentiles)\n        assert len(oderiv_percentiles_split) == 13\n        oderiv_5th = float(oderiv_percentiles_split[4])\n        oderiv_50th = float(oderiv_percentiles_split[6])\n        oderiv_95th = float(oderiv_percentiles_split[9])\n        try:\n            if iteration in stats_table[component_name]['stats']:\n                stats_table[component_name]['stats'][iteration].extend(\n                        [value_mean,  value_stddev,\n                         deriv_mean,  deriv_stddev,\n                         oderiv_mean, 
oderiv_stddev,\n                         value_5th,  value_50th,  value_95th,\n                         deriv_5th,  deriv_50th,  deriv_95th,\n                         oderiv_5th, oderiv_50th, oderiv_95th])\n            else:\n                stats_table[component_name]['stats'][iteration] = [\n                        value_mean,  value_stddev,\n                        deriv_mean,  deriv_stddev,\n                        oderiv_mean, oderiv_stddev,\n                        value_5th,  value_50th,  value_95th,\n                        deriv_5th,  deriv_50th,  deriv_95th,\n                        oderiv_5th, oderiv_50th, oderiv_95th]\n        except KeyError:\n            stats_table[component_name] = {}\n            stats_table[component_name]['type'] = component_type\n            stats_table[component_name]['stats'] = {}\n            stats_table[component_name][\n                    'stats'][iteration] = [value_mean,  value_stddev,\n                                           deriv_mean,  deriv_stddev,\n                                           oderiv_mean, oderiv_stddev,\n                                           value_5th,  value_50th,  value_95th,\n                                           deriv_5th,  deriv_50th,  deriv_95th,\n                                           oderiv_5th, oderiv_50th, oderiv_95th]\n\ndef parse_progress_logs_for_nonlinearity_stats(exp_dir):\n\n    \"\"\" Parse progress logs for mean and std stats for non-linearities.\n    e.g. 
for a line that is parsed from progress.*.log:\n    exp/nnet3/lstm_self_repair_ld5_sp/log/progress.9.log:component name=Lstm3_i\n    type=SigmoidComponent, dim=1280, self-repair-scale=1e-05, count=1.96e+05,\n    value-avg=[percentiles(0,1,2,5 10,20,50,80,90\n    95,98,99,100)=(0.05,0.09,0.11,0.15 0.19,0.27,0.50,0.72,0.83\n    0.88,0.92,0.94,0.99), mean=0.502, stddev=0.23],\n    deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90\n    95,98,99,100)=(0.009,0.04,0.05,0.06 0.08,0.10,0.14,0.17,0.18\n    0.19,0.20,0.20,0.21), mean=0.134, stddev=0.0397]\n    \"\"\"\n\n    progress_log_files = \"%s/log/progress.*.log\" % (exp_dir)\n    stats_per_component_per_iter = {}\n\n    progress_log_lines = common_lib.get_command_stdout(\n        'grep -e \"value-avg.*deriv-avg.*oderiv\" {0}'.format(progress_log_files),\n        require_zero_status = False)\n\n    if progress_log_lines:\n        # cases with oderiv-rms\n        parse_regex = re.compile(g_normal_nonlin_regex_pattern_with_oderiv)\n    else:\n        # cases with only value-avg and deriv-avg\n        progress_log_lines = common_lib.get_command_stdout(\n        'grep -e \"value-avg.*deriv-avg\" {0}'.format(progress_log_files),\n        require_zero_status = False)\n        parse_regex = re.compile(g_normal_nonlin_regex_pattern)\n\n    for line in progress_log_lines.split(\"\\n\"):\n        mat_obj = parse_regex.search(line)\n        if mat_obj is None:\n            continue\n        # groups = ('9', 'Lstm3_i', 'Sigmoid', '0.05...0.99', '0.502', '0.23',\n        # '0.009...0.21', '0.134', '0.0397')\n        groups = mat_obj.groups()\n        component_type = groups[2]\n        if component_type == 'LstmNonlinearity':\n            parse_regex_lstmp = re.compile(g_lstmp_nonlin_regex_pattern)\n            mat_obj = parse_regex_lstmp.search(line)\n            groups = mat_obj.groups()\n            assert len(groups) == 33\n            for i in list(range(0,5)):\n                fill_nonlin_stats_table_with_regex_result(groups, i,\n 
                       stats_per_component_per_iter)\n        else:\n            fill_nonlin_stats_table_with_regex_result(groups, 0,\n                    stats_per_component_per_iter)\n    return stats_per_component_per_iter\n\n\ndef parse_difference_string(string):\n    dict = {}\n    for parts in string.split():\n        sub_parts = parts.split(\":\")\n        dict[sub_parts[0]] = float(sub_parts[1])\n    return dict\n\n\nclass MalformedClippedProportionLineException(Exception):\n    def __init__(self, line):\n        Exception.__init__(self,\n                           \"Malformed line encountered while trying to \"\n                           \"extract clipped-proportions.\\n{0}\".format(line))\n\n\ndef parse_progress_logs_for_clipped_proportion(exp_dir):\n    \"\"\" Parse progress logs for clipped proportion stats.\n\n    e.g. for a line that is parsed from progress.*.log:\n    exp/chain/cwrnn_trial2_ld5_sp/log/progress.245.log:component\n    name=BLstm1_forward_c type=ClipGradientComponent, dim=512,\n    norm-based-clipping=true, clipping-threshold=30,\n    clipped-proportion=0.000565527,\n    self-repair-clipped-proportion-threshold=0.01, self-repair-target=0,\n    self-repair-scale=1\n    \"\"\"\n\n    progress_log_files = \"%s/log/progress.*.log\" % (exp_dir)\n    component_names = set([])\n    progress_log_lines = common_lib.get_command_stdout(\n        'grep -e \"{0}\" {1}'.format(\n            \"clipped-proportion\", progress_log_files),\n        require_zero_status=False)\n    parse_regex = re.compile(\".*progress\\.([0-9]+)\\.log:component \"\n                             \"name=(.*) type=.* \"\n                             \"clipped-proportion=([0-9\\.e\\-]+)\")\n\n    cp_per_component_per_iter = {}\n\n    max_iteration = 0\n    component_names = set([])\n    for line in progress_log_lines.split(\"\\n\"):\n        mat_obj = parse_regex.search(line)\n        if mat_obj is None:\n            if line.strip() == \"\":\n                continue\n         
   raise MalformedClippedProportionLineException(line)\n        groups = mat_obj.groups()\n        iteration = int(groups[0])\n        max_iteration = max(max_iteration, iteration)\n        name = groups[1]\n        clipped_proportion = float(groups[2])\n        if clipped_proportion > 1:\n            raise MalformedClippedProportionLineException(line)\n        if iteration not in cp_per_component_per_iter:\n            cp_per_component_per_iter[iteration] = {}\n        cp_per_component_per_iter[iteration][name] = clipped_proportion\n        component_names.add(name)\n    component_names = list(component_names)\n    component_names.sort()\n\n    # re arranging the data into an array\n    # and into an cp_per_iter_per_component\n    cp_per_iter_per_component = {}\n    for component_name in component_names:\n        cp_per_iter_per_component[component_name] = []\n    data = []\n    data.append([\"iteration\"]+component_names)\n    for iter in range(max_iteration+1):\n        if iter not in cp_per_component_per_iter:\n            continue\n        comp_dict = cp_per_component_per_iter[iter]\n        row = [iter]\n        for component in component_names:\n            try:\n                row.append(comp_dict[component])\n                cp_per_iter_per_component[component].append(\n                    [iter, comp_dict[component]])\n            except KeyError:\n                # if clipped proportion is not available for a particular\n                # component it is set to None\n                # this usually happens during layer-wise discriminative\n                # training\n                row.append(None)\n        data.append(row)\n\n    return {'table': data,\n            'cp_per_component_per_iter': cp_per_component_per_iter,\n            'cp_per_iter_per_component': cp_per_iter_per_component}\n\n\ndef parse_progress_logs_for_param_diff(exp_dir, pattern):\n    \"\"\" Parse progress logs for per-component parameter differences.\n\n    e.g. 
for a line that is parsed from progress.*.log:\n    exp/chain/cwrnn_trial2_ld5_sp/log/progress.245.log:LOG\n    (nnet3-show-progress:main():nnet3-show-progress.cc:144) Relative parameter\n    differences per layer are [ Cwrnn1_T3_W_r:0.0171537\n    Cwrnn1_T3_W_x:1.33338e-07 Cwrnn1_T2_W_r:0.048075 Cwrnn1_T2_W_x:1.34088e-07\n    Cwrnn1_T1_W_r:0.0157277 Cwrnn1_T1_W_x:0.0212704 Final_affine:0.0321521\n    Cwrnn2_T3_W_r:0.0212082 Cwrnn2_T3_W_x:1.33691e-07 Cwrnn2_T2_W_r:0.0212978\n    Cwrnn2_T2_W_x:1.33401e-07 Cwrnn2_T1_W_r:0.014976 Cwrnn2_T1_W_x:0.0233588\n    Cwrnn3_T3_W_r:0.0237165 Cwrnn3_T3_W_x:1.33184e-07 Cwrnn3_T2_W_r:0.0239754\n    Cwrnn3_T2_W_x:1.3296e-07 Cwrnn3_T1_W_r:0.0194809 Cwrnn3_T1_W_x:0.0271934 ]\n    \"\"\"\n\n    if pattern not in set([\"Relative parameter differences\",\n                           \"Parameter differences\"]):\n        raise Exception(\"Unknown value for pattern : {0}\".format(pattern))\n\n    progress_log_files = \"%s/log/progress.*.log\" % (exp_dir)\n    progress_per_iter = {}\n    component_names = set([])\n    progress_log_lines = common_lib.get_command_stdout(\n        'grep -e \"{0}\" {1}'.format(pattern, progress_log_files))\n    parse_regex = re.compile(\".*progress\\.([0-9]+)\\.log:\"\n                             \"LOG.*{0}.*\\[(.*)\\]\".format(pattern))\n    for line in progress_log_lines.split(\"\\n\"):\n        mat_obj = parse_regex.search(line)\n        if mat_obj is None:\n            continue\n        groups = mat_obj.groups()\n        iteration = groups[0]\n        differences = parse_difference_string(groups[1])\n        component_names = component_names.union(list(differences.keys()))\n        progress_per_iter[int(iteration)] = differences\n\n    component_names = list(component_names)\n    component_names.sort()\n    # rearranging the parameter differences available per iter\n    # into parameter differences per component\n    progress_per_component = {}\n    for cn in component_names:\n        
progress_per_component[cn] = {}\n\n    max_iter = max(progress_per_iter.keys())\n    total_missing_iterations = 0\n    gave_user_warning = False\n    for iter in range(max_iter + 1):\n        try:\n            component_dict = progress_per_iter[iter]\n        except KeyError:\n            continue\n\n        for component_name in component_names:\n            try:\n                progress_per_component[component_name][iter] = component_dict[\n                    component_name]\n            except KeyError:\n                total_missing_iterations += 1\n                # the component was not found this iteration, may be because of\n                # layerwise discriminative training\n                pass\n        if (total_missing_iterations/len(component_names) > 20\n                and not gave_user_warning and logger is not None):\n            logger.warning(\"There are more than {0} missing iterations per \"\n                           \"component. Something might be wrong.\".format(\n                                total_missing_iterations/len(component_names)))\n            gave_user_warning = True\n\n    return {'progress_per_component': progress_per_component,\n            'component_names': component_names,\n            'max_iter': max_iter}\n\n\ndef get_train_times(exp_dir):\n    train_log_files = \"%s/log/\" % (exp_dir)\n    train_log_names = \"train.*.log\"\n    train_log_lines = common_lib.get_command_stdout(\n        'find {0} -name \"{1}\" | xargs grep -H -e Accounting'.format(train_log_files,train_log_names))\n    parse_regex = re.compile(\".*train\\.([0-9]+)\\.([0-9]+)\\.log:# \"\n                             \"Accounting: time=([0-9]+) thread.*\")\n\n    train_times = {}\n    for line in train_log_lines.split('\\n'):\n        mat_obj = parse_regex.search(line)\n        if mat_obj is not None:\n            groups = mat_obj.groups()\n            try:\n                train_times[int(groups[0])][int(groups[1])] = float(groups[2])\n            
except KeyError:\n                train_times[int(groups[0])] = {}\n                train_times[int(groups[0])][int(groups[1])] = float(groups[2])\n    iters = train_times.keys()\n    for iter in iters:\n        values = train_times[iter].values()\n        train_times[iter] = max(values)\n    return train_times\n\ndef parse_prob_logs(exp_dir, key='accuracy', output=\"output\"):\n    train_prob_files = \"%s/log/compute_prob_train.*.log\" % (exp_dir)\n    valid_prob_files = \"%s/log/compute_prob_valid.*.log\" % (exp_dir)\n    train_prob_strings = common_lib.get_command_stdout(\n        'grep -e {0} {1}'.format(key, train_prob_files))\n    valid_prob_strings = common_lib.get_command_stdout(\n        'grep -e {0} {1}'.format(key, valid_prob_files))\n\n    # LOG\n    # (nnet3-chain-compute-prob:PrintTotalStats():nnet-chain-diagnostics.cc:149)\n    # Overall log-probability for 'output' is -0.399395 + -0.013437 = -0.412832\n    # per frame, over 20000 fra\n\n    # LOG\n    # (nnet3-chain-compute-prob:PrintTotalStats():nnet-chain-diagnostics.cc:144)\n    # Overall log-probability for 'output' is -0.307255 per frame, over 20000\n    # frames.\n\n    parse_regex = re.compile(\n        \".*compute_prob_.*\\.([0-9]+).log:LOG \"\n        \".nnet3.*compute-prob.*:PrintTotalStats..:\"\n        \"nnet.*diagnostics.cc:[0-9]+. 
Overall ([a-zA-Z\\-]+) for \"\n        \"'{output}'.*is ([0-9.\\-e]+) .*per frame\".format(output=output))\n\n    train_objf = {}\n    valid_objf = {}\n\n    for line in train_prob_strings.split('\\n'):\n        mat_obj = parse_regex.search(line)\n        if mat_obj is not None:\n            groups = mat_obj.groups()\n            if groups[1] == key:\n                train_objf[int(groups[0])] = groups[2]\n    if not train_objf:\n        raise KaldiLogParseException(\"Could not find any lines with {k} in \"\n                \"{l}\".format(k=key, l=train_prob_files))\n\n    for line in valid_prob_strings.split('\\n'):\n        mat_obj = parse_regex.search(line)\n        if mat_obj is not None:\n            groups = mat_obj.groups()\n            if groups[1] == key:\n                valid_objf[int(groups[0])] = groups[2]\n\n    if not valid_objf:\n        raise KaldiLogParseException(\"Could not find any lines with {k} in \"\n                \"{l}\".format(k=key, l=valid_prob_files))\n\n    iters = list(set(valid_objf.keys()).intersection(list(train_objf.keys())))\n    if not iters:\n        raise KaldiLogParseException(\"Could not find any common iterations with\"\n                \" key {k} in both {tl} and {vl}\".format(\n                    k=key, tl=train_prob_files, vl=valid_prob_files))\n    iters.sort()\n    return list([(int(x), float(train_objf[x]),\n                               float(valid_objf[x])) for x in iters])\n\ndef parse_rnnlm_prob_logs(exp_dir, key='objf'):\n    train_prob_files = \"%s/log/train.*.*.log\" % (exp_dir)\n    valid_prob_files = \"%s/log/compute_prob.*.log\" % (exp_dir)\n    train_prob_strings = common_lib.get_command_stdout(\n        'grep -e {0} {1}'.format(key, train_prob_files))\n    valid_prob_strings = common_lib.get_command_stdout(\n        'grep -e {0} {1}'.format(key, valid_prob_files))\n\n    # LOG\n    # (rnnlm-train[5.3.36~8-2ec51]:PrintStatsOverall():rnnlm-core-training.cc:118)\n    # Overall objf is (-4.426 + -0.008287) = 
-4.435 over 4.503e+06 words (weighted)\n    # in 1117 minibatches; exact = (-4.426 + 0) = -4.426\n\n    # LOG\n    # (rnnlm-compute-prob[5.3.36~8-2ec51]:PrintStatsOverall():rnnlm-core-training.cc:118)\n    # Overall objf is (-4.677 + -0.002067) = -4.679 over 1.08e+05 words (weighted)\n    # in 27 minibatches; exact = (-4.677 + 0.002667) = -4.674\n\n    parse_regex_train = re.compile(\n        \".*train\\.([0-9]+).1.log:LOG \"\n        \".rnnlm-train.*:PrintStatsOverall..:\"\n        \"rnnlm.*training.cc:[0-9]+. Overall ([a-zA-Z\\-]+) is \"\n        \".*exact = \\(.+\\) = ([0-9.\\-\\+e]+)\")\n\n    parse_regex_valid = re.compile(\n        \".*compute_prob\\.([0-9]+).log:LOG \"\n        \".rnnlm.*compute-prob.*:PrintStatsOverall..:\"\n        \"rnnlm.*training.cc:[0-9]+. Overall ([a-zA-Z\\-]+) is \"\n        \".*exact = \\(.+\\) = ([0-9.\\-\\+e]+)\")\n\n    train_objf = {}\n    valid_objf = {}\n\n    for line in train_prob_strings.split('\\n'):\n        mat_obj = parse_regex_train.search(line)\n        if mat_obj is not None:\n            groups = mat_obj.groups()\n            if groups[1] == key:\n                train_objf[int(groups[0])] = groups[2]\n    if not train_objf:\n        raise KaldiLogParseException(\"Could not find any lines with {k} in \"\n                \" {l}\".format(k=key, l=train_prob_files))\n\n    for line in valid_prob_strings.split('\\n'):\n        mat_obj = parse_regex_valid.search(line)\n        if mat_obj is not None:\n            groups = mat_obj.groups()\n            if groups[1] == key:\n                valid_objf[int(groups[0])] = groups[2]\n\n    if not valid_objf:\n        raise KaldiLogParseException(\"Could not find any lines with {k} in \"\n                \" {l}\".format(k=key, l=valid_prob_files))\n\n    iters = list(set(valid_objf.keys()).intersection(list(train_objf.keys())))\n    if not iters:\n        raise KaldiLogParseException(\"Could not any common iterations with\"\n                \" key {k} in both {tl} and 
{vl}\".format(\n                    k=key, tl=train_prob_files, vl=valid_prob_files))\n    iters.sort()\n    return [(int(x), float(train_objf[x]),\n                          float(valid_objf[x])) for x in iters]\n\n\n\ndef generate_acc_logprob_report(exp_dir, key=\"accuracy\", output=\"output\"):\n    try:\n        times = get_train_times(exp_dir)\n    except:\n        tb = traceback.format_exc()\n        logger.warning(\"Error getting info from logs, exception was: \" + tb)\n        times = {}\n\n    report = []\n    report.append(\"%Iter\\tduration\\ttrain_objective\\tvalid_objective\\tdifference\")\n    try:\n        if key == \"rnnlm_objective\":\n            data = list(parse_rnnlm_prob_logs(exp_dir, 'objf'))\n        else:\n            data = list(parse_prob_logs(exp_dir, key, output))\n    except:\n        tb = traceback.format_exc()\n        logger.warning(\"Error getting info from logs, exception was: \" + tb)\n        data = []\n    for x in data:\n        try:\n            report.append(\"%d\\t%s\\t%g\\t%g\\t%g\" % (x[0], str(times[x[0]]),\n                                                  x[1], x[2], x[2]-x[1]))\n        except (KeyError, IndexError):\n            continue\n\n    total_time = 0\n    for iter in times.keys():\n        total_time += times[iter]\n    report.append(\"Total training time is {0}\\n\".format(\n                    str(datetime.timedelta(seconds=total_time))))\n    return [\"\\n\".join(report), times, data]\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/__init__.py",
    "content": "\n# Copyright 2016 Vimal Manohar\n# Apache 2.0\n\n\"\"\" This library has classes and methods commonly used for training nnet3\nneural networks.\n\nIt has separate submodules for frame-level objectives and chain objective:\nframe_level_objf -- For both recurrent and non-recurrent architectures\nchain_objf -- LF-MMI objective training\n\"\"\"\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/chain_objf/__init__.py",
    "content": "\n\n# Copyright 2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This is a subpackage containing modules for training of\ndeep neural network acoustic model with chain objective.\n\"\"\"\n\nfrom . import acoustic_model\n\n__all__ = [\"acoustic_model\"]\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/chain_objf/acoustic_model.py",
    "content": "\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This is a module with methods which will be used by scripts for training of\ndeep neural network acoustic model with chain objective.\n\"\"\"\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport logging\nimport math\nimport os\nimport sys\n\nimport libs.common as common_lib\nimport libs.nnet3.train.common as common_train_lib\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\ndef create_phone_lm(dir, tree_dir, run_opts, lm_opts=None):\n    \"\"\"Create a phone LM for chain training\n\n    This method trains a phone LM for chain training using the alignments\n    in \"tree_dir\"\n    \"\"\"\n    try:\n        f = open(tree_dir + \"/num_jobs\", 'r')\n        num_ali_jobs = int(f.readline())\n        assert num_ali_jobs > 0\n    except:\n        raise Exception(\"\"\"There was an error getting the number of alignment\n                        jobs from {0}/num_jobs\"\"\".format(tree_dir))\n\n    alignments=' '.join(['{0}/ali.{1}.gz'.format(tree_dir, job)\n                         for job in range(1, num_ali_jobs + 1)])\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/make_phone_lm.log \\\n            gunzip -c {alignments} \\| \\\n            ali-to-phones {tree_dir}/final.mdl ark:- ark:- \\| \\\n            chain-est-phone-lm {lm_opts} ark:- {dir}/phone_lm.fst\"\"\".format(\n                command=run_opts.command, dir=dir,\n                alignments=alignments,\n                lm_opts=lm_opts if lm_opts is not None else '',\n                tree_dir=tree_dir))\n\n\ndef create_denominator_fst(dir, tree_dir, run_opts):\n    common_lib.execute_command(\n        \"\"\"copy-transition-model {tree_dir}/final.mdl \\\n                {dir}/0.trans_mdl\"\"\".format(dir=dir, tree_dir=tree_dir))\n    common_lib.execute_command(\n        \"\"\"{command} 
{dir}/log/make_den_fst.log \\\n                   chain-make-den-fst {dir}/tree {dir}/0.trans_mdl \\\n                   {dir}/phone_lm.fst \\\n                   {dir}/den.fst {dir}/normalization.fst\"\"\".format(\n                       dir=dir, command=run_opts.command))\n\n\ndef generate_chain_egs(dir, data, lat_dir, egs_dir,\n                       left_context, right_context,\n                       run_opts, stage=0,\n                       left_tolerance=None, right_tolerance=None,\n                       left_context_initial=-1, right_context_final=-1,\n                       frame_subsampling_factor=3,\n                       alignment_subsampling_factor=3,\n                       online_ivector_dir=None,\n                       frames_per_iter=20000, frames_per_eg_str=\"20\", srand=0,\n                       egs_opts=None, cmvn_opts=None):\n    \"\"\"Wrapper for steps/nnet3/chain/get_egs.sh\n\n    See options in that script.\n    \"\"\"\n\n    common_lib.execute_command(\n        \"\"\"steps/nnet3/chain/get_egs.sh {egs_opts} \\\n                --cmd \"{command}\" \\\n                --cmvn-opts \"{cmvn_opts}\" \\\n                --online-ivector-dir \"{ivector_dir}\" \\\n                --left-context {left_context} \\\n                --right-context {right_context} \\\n                --left-context-initial {left_context_initial} \\\n                --right-context-final {right_context_final} \\\n                --left-tolerance '{left_tolerance}' \\\n                --right-tolerance '{right_tolerance}' \\\n                --frame-subsampling-factor {frame_subsampling_factor} \\\n                --alignment-subsampling-factor {alignment_subsampling_factor} \\\n                --stage {stage} \\\n                --frames-per-iter {frames_per_iter} \\\n                --frames-per-eg {frames_per_eg_str} \\\n                --srand {srand} \\\n                {data} {dir} {lat_dir} {egs_dir}\"\"\".format(\n                    
command=run_opts.egs_command,\n                    cmvn_opts=cmvn_opts if cmvn_opts is not None else '',\n                    ivector_dir=(online_ivector_dir\n                                 if online_ivector_dir is not None\n                                 else ''),\n                    left_context=left_context,\n                    right_context=right_context,\n                    left_context_initial=left_context_initial,\n                    right_context_final=right_context_final,\n                    left_tolerance=(left_tolerance\n                                    if left_tolerance is not None\n                                    else ''),\n                    right_tolerance=(right_tolerance\n                                     if right_tolerance is not None\n                                     else ''),\n                    frame_subsampling_factor=frame_subsampling_factor,\n                    alignment_subsampling_factor=alignment_subsampling_factor,\n                    stage=stage, frames_per_iter=frames_per_iter,\n                    frames_per_eg_str=frames_per_eg_str, srand=srand,\n                    data=data, lat_dir=lat_dir, dir=dir, egs_dir=egs_dir,\n                    egs_opts=egs_opts if egs_opts is not None else ''))\n\n\ndef train_new_models(dir, iter, srand, num_jobs,\n                     num_archives_processed, num_archives,\n                     raw_model_string, egs_dir,\n                     apply_deriv_weights,\n                     min_deriv_time, max_deriv_time_relative,\n                     l2_regularize, xent_regularize, leaky_hmm_coefficient,\n                     momentum, max_param_change,\n                     shuffle_buffer_size, num_chunk_per_minibatch_str,\n                     frame_subsampling_factor, run_opts, train_opts,\n                     backstitch_training_scale=0.0, backstitch_training_interval=1,\n                     use_multitask_egs=False):\n    \"\"\"\n    Called from train_one_iteration(), this 
method trains new models\n    with 'num_jobs' jobs, and\n    writes files like exp/tdnn_a/24.{1,2,3,..<num_jobs>}.raw\n\n    We cannot easily use a single parallel SGE job to do the main training,\n    because the computation of which archive and which --frame option\n    to use for each job is a little complex, so we spawn each one separately.\n    This is no longer true for RNNs, as we do not use the --frame option,\n    but we use the same script for consistency with the FF-DNN code.\n\n    use_multitask_egs : True if different examples are used to train multiple\n                        tasks or outputs, e.g. multilingual training.\n                        Multilingual egs can be generated using get_egs.sh and\n                        steps/nnet3/multilingual/allocate_multilingual_examples.py,\n                        which are the top-level scripts.\n    \"\"\"\n\n    deriv_time_opts = []\n    if min_deriv_time is not None:\n        deriv_time_opts.append(\"--optimization.min-deriv-time={0}\".format(\n                                    min_deriv_time))\n    if max_deriv_time_relative is not None:\n        deriv_time_opts.append(\"--optimization.max-deriv-time-relative={0}\".format(\n                                    int(max_deriv_time_relative)))\n\n    threads = []\n    # the GPU timing info is only printed if we use the --verbose=1 flag; this\n    # slows down the computation slightly, so don't accumulate it on every\n    # iteration.  
Don't do it on iteration 0 either, because we use a smaller\n    # than normal minibatch size, and people may get confused thinking it's\n    # slower for iteration 0 because of the verbose option.\n    verbose_opt = (\"--verbose=1\" if iter % 20 == 0 and iter > 0 else \"\")\n\n    for job in range(1, num_jobs+1):\n        # k is a zero-based index that we will derive the other indexes from.\n        k = num_archives_processed + job - 1\n        # work out the 1-based archive index.\n        archive_index = (k % num_archives) + 1\n        # previous : frame_shift = (k/num_archives) % frame_subsampling_factor\n        frame_shift = ((archive_index + k//num_archives)\n                       % frame_subsampling_factor)\n\n        multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n            egs_dir,\n            egs_prefix=\"cegs.\",\n            archive_index=archive_index,\n            use_multitask_egs=use_multitask_egs)\n        scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n        cache_io_opts = ((\"--read-cache={dir}/cache.{iter}\".format(dir=dir,\n                                                                  iter=iter)\n                          if iter > 0 else \"\") +\n                         (\" --write-cache={0}/cache.{1}\".format(dir, iter + 1)\n                          if job == 1 else \"\"))\n\n        thread = common_lib.background_command(\n            \"\"\"{command} {train_queue_opt} {dir}/log/train.{iter}.{job}.log \\\n                    nnet3-chain-train {parallel_train_opts} {verbose_opt} \\\n                    --apply-deriv-weights={app_deriv_wts} \\\n                    --l2-regularize={l2} --leaky-hmm-coefficient={leaky} \\\n                    {cache_io_opts}  --xent-regularize={xent_reg} \\\n                    {deriv_time_opts} \\\n                    --print-interval=10 --momentum={momentum} \\\n                    --max-param-change={max_param_change} \\\n                    
--backstitch-training-scale={backstitch_training_scale} \\\n                    --backstitch-training-interval={backstitch_training_interval} \\\n                    --l2-regularize-factor={l2_regularize_factor} {train_opts} \\\n                    --srand={srand} \\\n                    \"{raw_model}\" {dir}/den.fst \\\n                    \"ark,bg:nnet3-chain-copy-egs {multitask_egs_opts} \\\n                        --frame-shift={fr_shft} \\\n                        {scp_or_ark}:{egs_dir}/cegs.{archive_index}.{scp_or_ark} ark:- | \\\n                        nnet3-chain-shuffle-egs --buffer-size={buf_size} \\\n                        --srand={srand} ark:- ark:- | nnet3-chain-merge-egs \\\n                        --minibatch-size={num_chunk_per_mb} ark:- ark:- |\" \\\n                    {dir}/{next_iter}.{job}.raw\"\"\".format(\n                        command=run_opts.command,\n                        train_queue_opt=run_opts.train_queue_opt,\n                        dir=dir, iter=iter, srand=iter + srand,\n                        next_iter=iter + 1, job=job,\n                        deriv_time_opts=\" \".join(deriv_time_opts),\n                        app_deriv_wts=apply_deriv_weights,\n                        fr_shft=frame_shift, l2=l2_regularize,\n                        train_opts=train_opts,\n                        xent_reg=xent_regularize, leaky=leaky_hmm_coefficient,\n                        cache_io_opts=cache_io_opts,\n                        parallel_train_opts=run_opts.parallel_train_opts,\n                        verbose_opt=verbose_opt,\n                        momentum=momentum, max_param_change=max_param_change,\n                        backstitch_training_scale=backstitch_training_scale,\n                        backstitch_training_interval=backstitch_training_interval,\n                        l2_regularize_factor=1.0/num_jobs,\n                        raw_model=raw_model_string,\n                        egs_dir=egs_dir, 
archive_index=archive_index,\n                        buf_size=shuffle_buffer_size,\n                        num_chunk_per_mb=num_chunk_per_minibatch_str,\n                        multitask_egs_opts=multitask_egs_opts,\n                        scp_or_ark=scp_or_ark),\n            require_zero_status=True)\n\n        threads.append(thread)\n\n    for thread in threads:\n        thread.join()\n\n\ndef train_one_iteration(dir, iter, srand, egs_dir,\n                        num_jobs, num_archives_processed, num_archives,\n                        learning_rate, shrinkage_value,\n                        num_chunk_per_minibatch_str,\n                        apply_deriv_weights, min_deriv_time,\n                        max_deriv_time_relative,\n                        l2_regularize, xent_regularize,\n                        leaky_hmm_coefficient,\n                        momentum, max_param_change, shuffle_buffer_size,\n                        frame_subsampling_factor,\n                        run_opts, dropout_edit_string=\"\", train_opts=\"\",\n                        backstitch_training_scale=0.0, backstitch_training_interval=1,\n                        use_multitask_egs=False):\n    \"\"\" Called from steps/nnet3/chain/train.py for one iteration for\n    neural network training with LF-MMI objective\n\n    \"\"\"\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    # check if different iterations use the same random seed\n    if os.path.exists('{0}/srand'.format(dir)):\n        try:\n            saved_srand = int(open('{0}/srand'.format(dir)).readline().strip())\n        except (IOError, ValueError):\n            logger.error(\"Exception while reading the random seed \"\n                         \"for training\")\n            raise\n        if srand != saved_srand:\n            logger.warning(\"The random seed provided to this iteration \"\n                           \"(srand={0}) 
is different from the one saved last \"\n                           \"time (srand={1}). Using srand={0}.\".format(\n                               srand, saved_srand))\n    else:\n        with open('{0}/srand'.format(dir), 'w') as f:\n            f.write(str(srand))\n\n    # Sets off some background jobs to compute train and\n    # validation set objectives\n    compute_train_cv_probabilities(\n        dir=dir, iter=iter, egs_dir=egs_dir,\n        l2_regularize=l2_regularize, xent_regularize=xent_regularize,\n        leaky_hmm_coefficient=leaky_hmm_coefficient, run_opts=run_opts,\n        use_multitask_egs=use_multitask_egs)\n\n    if iter > 0:\n        # Runs in the background\n        compute_progress(dir, iter, run_opts)\n\n    do_average = (iter > 0)\n\n    raw_model_string = (\"nnet3-am-copy --raw=true --learning-rate={0} \"\n                        \"--scale={1} {2}/{3}.mdl - |\".format(\n                            learning_rate, shrinkage_value, dir, iter))\n\n    if do_average:\n        cur_num_chunk_per_minibatch_str = num_chunk_per_minibatch_str\n        cur_max_param_change = max_param_change\n    else:\n        # on iteration zero, use a smaller minibatch size (and we will later\n        # choose the output of just one of the jobs): the model-averaging isn't\n        # always helpful when the model is changing too fast (i.e. 
it can worsen\n        # the objective function), and the smaller minibatch size will help to\n        # keep the update stable.\n        cur_num_chunk_per_minibatch_str = common_train_lib.halve_minibatch_size_str(\n            num_chunk_per_minibatch_str)\n        cur_max_param_change = float(max_param_change) / math.sqrt(2)\n\n    raw_model_string = raw_model_string + dropout_edit_string\n    train_new_models(dir=dir, iter=iter, srand=srand, num_jobs=num_jobs,\n                     num_archives_processed=num_archives_processed,\n                     num_archives=num_archives,\n                     raw_model_string=raw_model_string,\n                     egs_dir=egs_dir,\n                     apply_deriv_weights=apply_deriv_weights,\n                     min_deriv_time=min_deriv_time,\n                     max_deriv_time_relative=max_deriv_time_relative,\n                     l2_regularize=l2_regularize,\n                     xent_regularize=xent_regularize,\n                     leaky_hmm_coefficient=leaky_hmm_coefficient,\n                     momentum=momentum,\n                     max_param_change=cur_max_param_change,\n                     shuffle_buffer_size=shuffle_buffer_size,\n                     num_chunk_per_minibatch_str=cur_num_chunk_per_minibatch_str,\n                     frame_subsampling_factor=frame_subsampling_factor,\n                     run_opts=run_opts, train_opts=train_opts,\n                     # linearly increase backstitch_training_scale during the\n                     # first few iterations (hard-coded as 15)\n                     backstitch_training_scale=(backstitch_training_scale *\n                         iter / 15 if iter < 15 else backstitch_training_scale),\n                     backstitch_training_interval=backstitch_training_interval,\n                     use_multitask_egs=use_multitask_egs)\n\n    [models_to_average, best_model] = common_train_lib.get_successful_models(\n         num_jobs, 
'{0}/log/train.{1}.%.log'.format(dir, iter))\n    nnets_list = []\n    for n in models_to_average:\n        nnets_list.append(\"{0}/{1}.{2}.raw\".format(dir, iter + 1, n))\n\n    if do_average:\n        # average the output of the different jobs.\n        common_train_lib.get_average_nnet_model(\n            dir=dir, iter=iter,\n            nnets_list=\" \".join(nnets_list),\n            run_opts=run_opts)\n\n    else:\n        # choose the best model from different jobs\n        common_train_lib.get_best_nnet_model(\n            dir=dir, iter=iter,\n            best_model_index=best_model,\n            run_opts=run_opts)\n\n    try:\n        for i in range(1, num_jobs + 1):\n            os.remove(\"{0}/{1}.{2}.raw\".format(dir, iter + 1, i))\n    except OSError:\n        raise Exception(\"Error while trying to delete the raw models\")\n\n    new_model = \"{0}/{1}.mdl\".format(dir, iter + 1)\n\n    if not os.path.isfile(new_model):\n        raise Exception(\"Could not find {0}, at the end of \"\n                        \"iteration {1}\".format(new_model, iter))\n    elif os.stat(new_model).st_size == 0:\n        raise Exception(\"{0} has size 0. 
Something went wrong in \"\n                        \"iteration {1}\".format(new_model, iter))\n    if os.path.exists(\"{0}/cache.{1}\".format(dir, iter)):\n        os.remove(\"{0}/cache.{1}\".format(dir, iter))\n\n\ndef check_for_required_files(feat_dir, tree_dir, lat_dir=None):\n    files = ['{0}/feats.scp'.format(feat_dir), '{0}/ali.1.gz'.format(tree_dir),\n             '{0}/final.mdl'.format(tree_dir), '{0}/tree'.format(tree_dir)]\n    if lat_dir is not None:\n        files += [\n             '{0}/lat.1.gz'.format(lat_dir), '{0}/final.mdl'.format(lat_dir),\n             '{0}/num_jobs'.format(lat_dir)]\n    for file in files:\n        if not os.path.isfile(file):\n            raise Exception('Expected {0} to exist.'.format(file))\n\n\ndef compute_preconditioning_matrix(dir, egs_dir, num_lda_jobs, run_opts,\n                                   max_lda_jobs=None, rand_prune=4.0,\n                                   lda_opts=None, use_multitask_egs=False):\n    \"\"\" Function to estimate and write LDA matrix from cegs\n\n    This function is exactly similar to the version in module\n    libs.nnet3.train.frame_level_objf.common except this uses cegs instead of\n    egs files.\n    \"\"\"\n    if max_lda_jobs is not None:\n        if num_lda_jobs > max_lda_jobs:\n            num_lda_jobs = max_lda_jobs\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n        egs_dir,\n        egs_prefix=\"cegs.\",\n        archive_index=\"JOB\",\n        use_multitask_egs=use_multitask_egs)\n    scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n    egs_rspecifier = (\n        \"ark:nnet3-chain-copy-egs {multitask_egs_opts} \"\n        \"{scp_or_ark}:{egs_dir}/cegs.JOB.{scp_or_ark} ark:- |\"\n        \"\".format(egs_dir=egs_dir, scp_or_ark=scp_or_ark,\n                  multitask_egs_opts=multitask_egs_opts))\n\n    # Write stats with the same format as stats for LDA.\n    common_lib.execute_command(\n        \"\"\"{command} JOB=1:{num_lda_jobs} 
{dir}/log/get_lda_stats.JOB.log \\\n                nnet3-chain-acc-lda-stats --rand-prune={rand_prune} \\\n                {dir}/init.raw \"{egs_rspecifier}\" \\\n                {dir}/JOB.lda_stats\"\"\".format(\n                    command=run_opts.command,\n                    num_lda_jobs=num_lda_jobs,\n                    dir=dir,\n                    egs_rspecifier=egs_rspecifier,\n                    rand_prune=rand_prune))\n\n    # the above command would have generated dir/{1..num_lda_jobs}.lda_stats\n    lda_stat_files = ['{0}/{1}.lda_stats'.format(dir, x) for x in range(1, num_lda_jobs + 1)]\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/sum_transform_stats.log \\\n                sum-lda-accs {dir}/lda_stats {lda_stat_files}\"\"\".format(\n                    command=run_opts.command,\n                    dir=dir, lda_stat_files=\" \".join(lda_stat_files)))\n\n    for file in lda_stat_files:\n        try:\n            os.remove(file)\n        except OSError:\n            raise Exception(\"There was error while trying to remove \"\n                            \"lda stat files.\")\n    # this computes a fixed affine transform computed in the way we described\n    # in Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled\n    # variant of an LDA transform but without dimensionality reduction.\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/get_transform.log \\\n                nnet-get-feature-transform {lda_opts} {dir}/lda.mat \\\n                {dir}/lda_stats\"\"\".format(\n                    command=run_opts.command, dir=dir,\n                    lda_opts=lda_opts if lda_opts is not None else \"\"))\n\n    common_lib.force_symlink(\"../lda.mat\", \"{0}/configs/lda.mat\".format(dir))\n\n\ndef prepare_initial_acoustic_model(dir, run_opts, srand=-1, input_model=None):\n    \"\"\" This function adds the first layer; It will also prepare the acoustic\n        model with the transition model.\n   
     If 'input_model' is specified, no initial network preparation(adding\n        the first layer) is done and this model is used as initial 'raw' model\n        instead of '0.raw' model to prepare '0.mdl' as acoustic model by adding the\n        transition model.\n    \"\"\"\n    if input_model is None:\n        common_train_lib.prepare_initial_network(dir, run_opts,\n                                                 srand=srand)\n\n    # The model-format for a 'chain' acoustic model is just the transition\n    # model and then the raw nnet, so we can use 'cat' to create this, as\n    # long as they have the same mode (binary or not binary).\n    # We ensure that they have the same mode (even if someone changed the\n    # script to make one or both of them text mode) by copying them both\n    # before concatenating them.\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/init_mdl.log \\\n                nnet3-am-init {dir}/0.trans_mdl {raw_mdl} \\\n                {dir}/0.mdl\"\"\".format(command=run_opts.command, dir=dir,\n                                      raw_mdl=(input_model if input_model is not None\n                                      else '{0}/0.raw'.format(dir))))\n\n\ndef compute_train_cv_probabilities(dir, iter, egs_dir, l2_regularize,\n                                   xent_regularize, leaky_hmm_coefficient,\n                                   run_opts,\n                                   use_multitask_egs=False):\n    model = '{0}/{1}.mdl'.format(dir, iter)\n    scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n    egs_suffix = \".scp\" if use_multitask_egs else \".cegs\"\n\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n                             egs_dir,\n                             egs_prefix=\"valid_diagnostic.\",\n                             use_multitask_egs=use_multitask_egs)\n\n\n    common_lib.background_command(\n        \"\"\"{command} {dir}/log/compute_prob_valid.{iter}.log \\\n          
      nnet3-chain-compute-prob --l2-regularize={l2} \\\n                --leaky-hmm-coefficient={leaky} --xent-regularize={xent_reg} \\\n                {model} {dir}/den.fst \\\n                \"ark,bg:nnet3-chain-copy-egs {multitask_egs_opts} {scp_or_ark}:{egs_dir}/valid_diagnostic{egs_suffix} \\\n                    ark:- | nnet3-chain-merge-egs --minibatch-size=1:64 ark:- ark:- |\" \\\n        \"\"\".format(command=run_opts.command, dir=dir, iter=iter, model=model,\n                   l2=l2_regularize, leaky=leaky_hmm_coefficient,\n                   xent_reg=xent_regularize,\n                   egs_dir=egs_dir,\n                   multitask_egs_opts=multitask_egs_opts,\n                   scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))\n\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n                             egs_dir,\n                             egs_prefix=\"train_diagnostic.\",\n                             use_multitask_egs=use_multitask_egs)\n\n    common_lib.background_command(\n        \"\"\"{command} {dir}/log/compute_prob_train.{iter}.log \\\n                nnet3-chain-compute-prob --l2-regularize={l2} \\\n                --leaky-hmm-coefficient={leaky} --xent-regularize={xent_reg} \\\n                {model} {dir}/den.fst \\\n                \"ark,bg:nnet3-chain-copy-egs {multitask_egs_opts} {scp_or_ark}:{egs_dir}/train_diagnostic{egs_suffix} \\\n                    ark:- | nnet3-chain-merge-egs --minibatch-size=1:64 ark:- ark:- |\" \\\n        \"\"\".format(command=run_opts.command, dir=dir, iter=iter, model=model,\n                   l2=l2_regularize, leaky=leaky_hmm_coefficient,\n                   xent_reg=xent_regularize,\n                   egs_dir=egs_dir,\n                   multitask_egs_opts=multitask_egs_opts,\n                   scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))\n\n\ndef compute_progress(dir, iter, run_opts):\n\n    prev_model = '{0}/{1}.mdl'.format(dir, iter - 1)\n    model = 
'{0}/{1}.mdl'.format(dir, iter)\n\n    common_lib.background_command(\n        \"\"\"{command} {dir}/log/progress.{iter}.log \\\n                nnet3-am-info {model} '&&' \\\n                nnet3-show-progress --use-gpu=no {prev_model} {model}\n        \"\"\".format(command=run_opts.command,\n                   dir=dir,\n                   iter=iter,\n                   model=model,\n                   prev_model=prev_model))\n    if iter % 10 == 0 and iter > 0:\n        # Every 10 iters, print some more detailed information.\n        # full_progress.X.log contains some diagnostics of the difference in\n        # parameters, printed in the same format as from nnet3-info.\n        common_lib.background_command(\n            \"\"\"{command} {dir}/log/full_progress.{iter}.log \\\n            nnet3-show-progress --use-gpu=no --verbose=2 {prev_model} {model}\n        \"\"\".format(command=run_opts.command,\n                   dir=dir,\n                   iter=iter,\n                   model=model,\n                   prev_model=prev_model))\n        # full_info.X.log is just the nnet3-info of the model, with the --verbose=2\n        # option which includes stats on the singular values of the parameter matrices.\n        common_lib.background_command(\n            \"\"\"{command} {dir}/log/full_info.{iter}.log \\\n            nnet3-info --verbose=2 {model}\n        \"\"\".format(command=run_opts.command,\n                   dir=dir,\n                   iter=iter,\n                   model=model))\n\n\n\ndef combine_models(dir, num_iters, models_to_combine, num_chunk_per_minibatch_str,\n                   egs_dir, leaky_hmm_coefficient, l2_regularize,\n                   xent_regularize, run_opts,\n                   max_objective_evaluations=30,\n                   use_multitask_egs=False):\n    \"\"\" Function to do model combination\n\n    In the nnet3 setup, the logic\n    for doing averaging of subsets of the models in the case where\n    there are too many models 
to reliably estimate interpolation\n    factors (max_models_combine) is moved into the nnet3-combine.\n    \"\"\"\n    raw_model_strings = []\n    logger.info(\"Combining {0} models.\".format(models_to_combine))\n\n    models_to_combine.add(num_iters)\n\n    for iter in sorted(models_to_combine):\n        model_file = '{0}/{1}.mdl'.format(dir, iter)\n        if os.path.exists(model_file):\n            # we used to copy them with nnet3-am-copy --raw=true, but now\n            # the raw-model-reading code discards the other stuff itself.\n            raw_model_strings.append(model_file)\n        else:\n            print(\"{0}: warning: model file {1} does not exist \"\n                  \"(final combination)\".format(sys.argv[0], model_file))\n\n    scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n    egs_suffix = \".scp\" if use_multitask_egs else \".cegs\"\n\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n                             egs_dir,\n                             egs_prefix=\"combine.\",\n                             use_multitask_egs=use_multitask_egs)\n\n    # We reverse the order of the raw model strings so that the freshest one\n    # goes first.  
This is important for systems that include batch\n    # normalization-- it means that the freshest batch-norm stats are used.\n    # Since the batch-norm stats are not technically parameters, they are not\n    # combined in the combination code, they are just obtained from the first\n    # model.\n    raw_model_strings = list(reversed(raw_model_strings))\n\n    common_lib.execute_command(\n        \"\"\"{command} {combine_queue_opt} {dir}/log/combine.log \\\n                nnet3-chain-combine \\\n                --max-objective-evaluations={max_objective_evaluations} \\\n                --l2-regularize={l2} --leaky-hmm-coefficient={leaky} \\\n                --verbose=3 {combine_gpu_opt} {dir}/den.fst {raw_models} \\\n                \"ark,bg:nnet3-chain-copy-egs {multitask_egs_opts} {scp_or_ark}:{egs_dir}/combine{egs_suffix} ark:- | \\\n                    nnet3-chain-merge-egs --minibatch-size={num_chunk_per_mb} \\\n                    ark:- ark:- |\" - \\| \\\n                nnet3-am-copy --set-raw-nnet=- {dir}/{num_iters}.mdl \\\n                {dir}/final.mdl\"\"\".format(\n                    command=run_opts.command,\n                    combine_queue_opt=run_opts.combine_queue_opt,\n                    combine_gpu_opt=run_opts.combine_gpu_opt,\n                    max_objective_evaluations=max_objective_evaluations,\n                    l2=l2_regularize, leaky=leaky_hmm_coefficient,\n                    dir=dir, raw_models=\" \".join(raw_model_strings),\n                    num_chunk_per_mb=num_chunk_per_minibatch_str,\n                    num_iters=num_iters,\n                    egs_dir=egs_dir,\n                    multitask_egs_opts=multitask_egs_opts,\n                    scp_or_ark=scp_or_ark, egs_suffix=egs_suffix))\n\n    # Compute the probability of the final, combined model with\n    # the same subset we used for the previous compute_probs, as the\n    # different subsets will lead to different probs.\n    compute_train_cv_probabilities(\n      
  dir=dir, iter='final', egs_dir=egs_dir,\n        l2_regularize=l2_regularize, xent_regularize=xent_regularize,\n        leaky_hmm_coefficient=leaky_hmm_coefficient,\n        run_opts=run_opts,\n        use_multitask_egs=use_multitask_egs)\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/common.py",
    "content": "\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0\n\n\"\"\"This module contains classes and methods common to training of\nnnet3 neural networks.\n\"\"\"\nfrom __future__ import division\n\nimport argparse\nimport glob\nimport logging\nimport os\nimport math\nimport re\nimport shutil\n\nimport libs.common as common_lib\nfrom libs.nnet3.train.dropout_schedule import *\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\nclass RunOpts(object):\n    \"\"\"A structure to store run options.\n\n    Run options like queue.pl and run.pl, along with their memory\n    and parallel training options for various types of commands such\n    as the ones for training, parallel-training, running on GPU etc.\n    \"\"\"\n\n    def __init__(self):\n        self.command = None\n        self.train_queue_opt = None\n        self.combine_gpu_opt = None\n        self.combine_queue_opt = None\n        self.prior_gpu_opt = None\n        self.prior_queue_opt = None\n        self.parallel_train_opts = None\n\ndef get_outputs_list(model_file, get_raw_nnet_from_am=True):\n    \"\"\" Generates list of output-node-names used in nnet3 model configuration.\n        It will normally return 'output'.\n    \"\"\"\n    if get_raw_nnet_from_am:\n        outputs_list = common_lib.get_command_stdout(\n            \"nnet3-am-info --print-args=false {0} | \"\n            \"grep -e 'output-node' | cut -f2 -d' ' | cut -f2 -d'=' \".format(model_file))\n    else:\n        outputs_list = common_lib.get_command_stdout(\n            \"nnet3-info --print-args=false {0} | \"\n            \"grep -e 'output-node' | cut -f2 -d' ' | cut -f2 -d'=' \".format(model_file))\n\n    return outputs_list.split()\n\n\ndef get_multitask_egs_opts(egs_dir, egs_prefix=\"\",\n                           archive_index=-1,\n                           use_multitask_egs=False):\n    \"\"\" Generates egs option for multitask(or multilingual) 
training setup,\n        if {egs_prefix}output.*.ark or {egs_prefix}weight.*.ark files exist in egs_dir.\n        Each line in {egs_prefix}*.scp has a corresponding line containing\n        the name of the output-node in the network and the language-dependent weight in\n        {egs_prefix}output.*.ark or {egs_prefix}weight.*.ark respectively.\n        Returns the empty string ('') if use_multitask_egs == False,\n        otherwise something like:\n        '--outputs=ark:foo/egs/output.3.ark --weights=ark:foo/egs/weight.3.ark'\n        egs_prefix is \"\" for train and\n        \"valid_diagnostic.\" for validation.\n\n        Caution: archive_index is usually an integer, but may be a string (\"JOB\")\n        in some cases.\n    \"\"\"\n    multitask_egs_opts = \"\"\n    egs_suffix = \".{0}\".format(archive_index) if archive_index != -1 else \"\"\n\n    if use_multitask_egs:\n        output_file_name = (\"{egs_dir}/{egs_prefix}output{egs_suffix}.ark\"\n                            \"\".format(egs_dir=egs_dir,\n                                      egs_prefix=egs_prefix,\n                                      egs_suffix=egs_suffix))\n        output_rename_opt = \"\"\n        if os.path.isfile(output_file_name):\n            output_rename_opt = (\"--outputs=ark:{output_file_name}\".format(\n                output_file_name=output_file_name))\n\n        weight_file_name = (\"{egs_dir}/{egs_prefix}weight{egs_suffix}.ark\"\n                            \"\".format(egs_dir=egs_dir,\n                                      egs_prefix=egs_prefix,\n                                      egs_suffix=egs_suffix))\n        weight_opt = \"\"\n        if os.path.isfile(weight_file_name):\n            weight_opt = (\"--weights=ark:{weight_file_name}\"\n                          \"\".format(weight_file_name=weight_file_name))\n\n        multitask_egs_opts = (\n            \"{output_rename_opt} {weight_opt}\".format(\n                output_rename_opt=output_rename_opt,\n                
weight_opt=weight_opt))\n\n    return multitask_egs_opts\n\n\ndef get_successful_models(num_models, log_file_pattern,\n                          difference_threshold=1.0):\n    assert num_models > 0\n\n    parse_regex = re.compile(\n        \"LOG .* Overall average objective function for \"\n        \"'output' is ([0-9e.\\-+= ]+) over ([0-9e.\\-+]+) frames\")\n    objf = []\n    for i in range(num_models):\n        model_num = i + 1\n        logfile = re.sub('%', str(model_num), log_file_pattern)\n        lines = open(logfile, 'r').readlines()\n        this_objf = -100000.0\n        for line_num in range(1, len(lines) + 1):\n            # we search from the end as this would result in\n            # lesser number of regex searches. Python regex is slow !\n            mat_obj = parse_regex.search(lines[-1 * line_num])\n            if mat_obj is not None:\n                this_objf = float(mat_obj.groups()[0].split()[-1])\n                break\n        objf.append(this_objf)\n    max_index = objf.index(max(objf))\n    accepted_models = []\n    for i in range(num_models):\n        if (objf[max_index] - objf[i]) <= difference_threshold:\n            accepted_models.append(i + 1)\n\n    if len(accepted_models) != num_models:\n        logger.warn(\"Only {0}/{1} of the models have been accepted \"\n                    \"for averaging, based on log files {2}.\".format(\n                        len(accepted_models),\n                        num_models, log_file_pattern))\n\n    return [accepted_models, max_index + 1]\n\n\ndef get_average_nnet_model(dir, iter, nnets_list, run_opts,\n                           get_raw_nnet_from_am=True):\n\n    next_iter = iter + 1\n    if get_raw_nnet_from_am:\n        out_model = (\"\"\"- \\| nnet3-am-copy --set-raw-nnet=-  \\\n                        {dir}/{iter}.mdl {dir}/{next_iter}.mdl\"\"\".format(\n                            dir=dir, iter=iter,\n                            next_iter=next_iter))\n    else:\n        out_model = 
\"{dir}/{next_iter}.raw\".format(\n            dir=dir, next_iter=next_iter)\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/average.{iter}.log \\\n                nnet3-average {nnets_list} \\\n                {out_model}\"\"\".format(command=run_opts.command,\n                                      dir=dir,\n                                      iter=iter,\n                                      nnets_list=nnets_list,\n                                      out_model=out_model))\n\n\ndef get_best_nnet_model(dir, iter, best_model_index, run_opts,\n                        get_raw_nnet_from_am=True):\n\n    best_model = \"{dir}/{next_iter}.{best_model_index}.raw\".format(\n        dir=dir,\n        next_iter=iter + 1,\n        best_model_index=best_model_index)\n\n    if get_raw_nnet_from_am:\n        out_model = (\"\"\"- \\| nnet3-am-copy --set-raw-nnet=- \\\n                        {dir}/{iter}.mdl {dir}/{next_iter}.mdl\"\"\".format(\n                            dir=dir, iter=iter, next_iter=iter + 1))\n    else:\n        out_model = \"{dir}/{next_iter}.raw\".format(dir=dir,\n                                                   next_iter=iter + 1)\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/select.{iter}.log \\\n                nnet3-copy {best_model} \\\n                {out_model}\"\"\".format(command=run_opts.command,\n                                      dir=dir, iter=iter,\n                                      best_model=best_model,\n                                      out_model=out_model))\n\n\ndef validate_chunk_width(chunk_width):\n    \"\"\"Validate a chunk-width string , returns boolean.\n    Expected to be a string representing either an integer, like '20',\n    or a comma-separated list of integers like '20,30,16'\"\"\"\n    if not isinstance(chunk_width, str):\n        return False\n    a = chunk_width.split(\",\")\n    assert len(a) != 0  # would be code error\n    for elem in a:\n        try:\n          
  i = int(elem)\n            if i < 1 and i != -1:\n                return False\n        except:\n            return False\n    return True\n\n\ndef principal_chunk_width(chunk_width):\n    \"\"\"Given a chunk-width string like \"20\" or \"50,70,40\", returns the principal\n    chunk-width which is the first element, as an int.  E.g. 20, or 50.\"\"\"\n    if not validate_chunk_width(chunk_width):\n        raise Exception(\"Invalid chunk-width {0}\".format(chunk_width))\n    return int(chunk_width.split(\",\")[0])\n\n\ndef validate_range_str(range_str):\n    \"\"\"Helper function used inside validate_minibatch_size_str().\n    Returns true if range_str is a comma-separated list of\n    positive integers and ranges of integers, like '128',\n    '128,256', or '64:128,256'.\"\"\"\n    if not isinstance(range_str, str):\n        return False\n    ranges = range_str.split(\",\")\n    assert len(ranges) > 0\n    for r in ranges:\n        # a range may be either e.g. '64', or '128:256'\n        try:\n            c = [int(x) for x in r.split(\":\")]\n        except:\n            return False\n        # c should be either e.g. [ 128 ], or  [64,128].\n        if len(c) == 1:\n            if c[0] <= 0:\n                return False\n        elif len(c) == 2:\n            if c[0] <= 0 or c[1] < c[0]:\n                return False\n        else:\n            return False\n    return True\n\n\ndef validate_minibatch_size_str(minibatch_size_str):\n    \"\"\"Validate a minibatch-size string (returns bool).\n    A minibatch-size string might either be an integer, like '256',\n    a comma-separated set of integers or ranges like '128,256' or\n    '64:128,256',  or a rule like '128=64:128/256=32,64', whose format\n    is: eg-length1=size-range1/eg-length2=size-range2/....\n    where a size-range is a comma-separated list of either integers like '16'\n    or ranges like '16:32'.  
An arbitrary eg will be mapped to the size-range\n    for the closest of the listed eg-lengths (the eg-length is defined\n    as the number of input frames, including context frames).\"\"\"\n    if not isinstance(minibatch_size_str, str):\n        return False\n    a = minibatch_size_str.split(\"/\")\n    assert len(a) != 0  # would be code error\n\n    for elem in a:\n        b = elem.split('=')\n        # We expect b to have length 2 in the normal case.\n        if len(b) != 2:\n            # one-element 'b' is OK if len(a) is 1 (so there is only\n            # one choice)... this would mean somebody just gave \"25\"\n            # or something like that for the minibatch size.\n            if len(a) == 1 and len(b) == 1:\n                return validate_range_str(elem)\n            else:\n                return False\n        # check that the thing before the '=' sign is a positive integer\n        try:\n            if int(b[0]) <= 0:\n                return False\n        except:\n            return False  # not an integer at all.\n\n        if not validate_range_str(b[1]):\n            return False\n    return True\n\n\ndef halve_range_str(range_str):\n    \"\"\"Helper function used inside halve_minibatch_size_str().\n    Returns half of a range [but converting resulting zeros to\n    ones], e.g. '16'->'8', '16,32'->'8,16', '64:128'->'32:64'.\n    Expects range_str to be a comma-separated list of\n    positive integers and ranges of integers, like '128',\n    '128,256', or '64:128,256'.\"\"\"\n\n    ranges = range_str.split(\",\")\n    halved_ranges = []\n    for r in ranges:\n        # a range may be either e.g. '64', or '128:256'\n        c = [str(max(1, int(x)//2)) for x in r.split(\":\")]\n        halved_ranges.append(\":\".join(c))\n    return ','.join(halved_ranges)\n\n\ndef halve_minibatch_size_str(minibatch_size_str):\n    \"\"\"Halve a minibatch-size string, as would be validated by\n    validate_minibatch_size_str (see docs for that).  
This halves\n    all the integer elements of minibatch_size_str that represent minibatch\n    sizes (as opposed to chunk-lengths) and that are >1.\"\"\"\n\n    if not validate_minibatch_size_str(minibatch_size_str):\n        raise Exception(\"Invalid minibatch-size string '{0}'\".format(minibatch_size_str))\n\n    a = minibatch_size_str.split(\"/\")\n    ans = []\n    for elem in a:\n        b = elem.split('=')\n        # We expect b to have length 2 in the normal case.\n        if len(b) == 1:\n            return halve_range_str(elem)\n        else:\n            assert len(b) == 2\n            ans.append('{0}={1}'.format(b[0], halve_range_str(b[1])))\n    return '/'.join(ans)\n\n\ndef copy_egs_properties_to_exp_dir(egs_dir, dir):\n    try:\n        for file in ['cmvn_opts', 'splice_opts', 'info/final.ie.id', 'final.mat',\n                     'global_cmvn.stats', 'online_cmvn']:\n            file_name = '{dir}/{file}'.format(dir=egs_dir, file=file)\n            if os.path.isfile(file_name):\n                shutil.copy(file_name, dir)\n    except IOError:\n        logger.error(\"Error while trying to copy egs \"\n                     \"property files to {dir}\".format(dir=dir))\n        raise\n\n\ndef parse_generic_config_vars_file(var_file):\n    variables = {}\n    try:\n        var_file_handle = open(var_file, 'r')\n        for line in var_file_handle:\n            parts = line.split('=')\n            field_name = parts[0].strip()\n            field_value = parts[1].strip()\n            if field_name in ['model_left_context', 'left_context']:\n                variables['model_left_context'] = int(field_value)\n            elif field_name in ['model_right_context', 'right_context']:\n                variables['model_right_context'] = int(field_value)\n            elif field_name == 'num_hidden_layers':\n                if int(field_value) > 1:\n                    raise Exception(\n                        \"You have num_hidden_layers={0} (real meaning: your 
config files \"\n                        \"are intended to do discriminative pretraining).  Since Kaldi 5.2, \"\n                        \"this is no longer supported --> use newer config-creation scripts, \"\n                        \"i.e. xconfig_to_configs.py.\".format(field_value))\n            else:\n                variables[field_name] = field_value\n\n        return variables\n    except ValueError:\n        # we will throw an error at the end of the function so I will just pass\n        pass\n\n    raise Exception('Error while parsing the file {0}'.format(var_file))\n\n\ndef get_input_model_info(input_model):\n    \"\"\" This function returns a dictionary with keys \"model_left_context\" and\n        \"model_right_context\" and values equal to the left/right model contexts\n        for input_model.\n        This function is useful when using the --trainer.input-model option\n        instead of initializing the model using configs.\n    \"\"\"\n    variables = {}\n    try:\n        out = common_lib.get_command_stdout(\"\"\"nnet3-info {0} | \"\"\"\n                                            \"\"\"head -4 \"\"\".format(input_model))\n        # out looks like this\n        # left-context: 7\n        # right-context: 0\n        # num-parameters: 90543902\n        # modulus: 1\n        for line in out.split(\"\\n\"):\n            parts = line.split(\":\")\n            if len(parts) != 2:\n                continue\n            if parts[0].strip() ==  'left-context':\n                variables['model_left_context'] = int(parts[1].strip())\n            elif parts[0].strip() ==  'right-context':\n                variables['model_right_context'] = int(parts[1].strip())\n\n    except ValueError:\n        pass\n    return variables\n\n\ndef verify_egs_dir(egs_dir, feat_dim, ivector_dim, ivector_extractor_id,\n                   left_context, right_context,\n                   left_context_initial=-1, right_context_final=-1):\n    try:\n        egs_feat_dim = 
int(open('{0}/info/feat_dim'.format(\n                                    egs_dir)).readline())\n\n        egs_ivector_id = None\n        try:\n            egs_ivector_id = open('{0}/info/final.ie.id'.format(\n                                        egs_dir)).readline().strip()\n            if (egs_ivector_id == \"\"):\n                egs_ivector_id = None;\n        except:\n            # it could actually happen that the file is not there\n            # for example in cases where the egs were dumped by\n            # an older version of the script\n            pass\n\n        try:\n            egs_ivector_dim = int(open('{0}/info/ivector_dim'.format(\n                egs_dir)).readline())\n        except:\n            egs_ivector_dim = 0\n        egs_left_context = int(open('{0}/info/left_context'.format(\n                                    egs_dir)).readline())\n        egs_right_context = int(open('{0}/info/right_context'.format(\n                                    egs_dir)).readline())\n        try:\n            egs_left_context_initial = int(open('{0}/info/left_context_initial'.format(\n                        egs_dir)).readline())\n        except:  # older scripts didn't write this, treat it as -1 in that case.\n            egs_left_context_initial = -1\n        try:\n            egs_right_context_final = int(open('{0}/info/right_context_final'.format(\n                        egs_dir)).readline())\n        except:  # older scripts didn't write this, treat it as -1 in that case.\n            egs_right_context_final = -1\n\n        # if feat_dim was supplied as 0, it means the --feat-dir option was not\n        # supplied to the script, so we simply don't know what the feature dim is.\n        if (feat_dim != 0 and feat_dim != egs_feat_dim) or (ivector_dim != egs_ivector_dim):\n            raise Exception(\"There is mismatch between featdim/ivector_dim of \"\n                            \"the current experiment and the provided \"\n                          
  \"egs directory\")\n\n        if (((egs_ivector_id is None) and (ivector_extractor_id is not None)) or\n            ((egs_ivector_id is not None) and (ivector_extractor_id is None))):\n            logger.warning(\"The ivector ids are used inconsistently. It's your \"\n                          \"responsibility to make sure the ivector extractor \"\n                          \"has been used consistently\")\n            logger.warning(\"ivector id for egs: {0} in dir {1}\".format(egs_ivector_id, egs_dir))\n            logger.warning(\"ivector id for extractor: {0}\".format(ivector_extractor_id))\n        elif ((egs_ivector_dim > 0) and (egs_ivector_id is None) and (ivector_extractor_id is None)):\n            logger.warning(\"The ivector ids are not used. It's your \"\n                          \"responsibility to make sure the ivector extractor \"\n                          \"has been used consistently\")\n        elif ivector_extractor_id != egs_ivector_id:\n            raise Exception(\"The egs were generated using a different ivector \"\n                            \"extractor. id1 = {0}, id2={1}\".format(\n                                ivector_extractor_id, egs_ivector_id));\n\n        if (egs_left_context < left_context or\n            egs_right_context < right_context):\n            raise Exception('The egs have insufficient (l,r) context ({0},{1}) '\n                            'versus expected ({2},{3})'.format(\n                                egs_left_context, egs_right_context,\n                                left_context, right_context))\n\n        # the condition on the initial/final context is an equality condition,\n        # not an inequality condition, as there is no mechanism to 'correct' the\n        # context (by subtracting context) while copying the egs, like there is\n        # for the regular left-right context.  
If the user is determined to use\n        # previously dumped egs, they may be able to slightly adjust the\n        # --egs.chunk-left-context-initial and --egs.chunk-right-context-final\n        # options to make things matched up.  [note: the model l/r context gets\n        # added in, so you have to correct for changes in that.]\n        if (egs_left_context_initial != left_context_initial or\n            egs_right_context_final != right_context_final):\n            raise Exception('The egs have incorrect initial/final (l,r) context '\n                            '({0},{1}) versus expected ({2},{3}).  See code from '\n                            'where this exception was raised for more info'.format(\n                                egs_left_context_initial, egs_right_context_final,\n                                left_context_initial, right_context_final))\n\n        frames_per_eg_str = open('{0}/info/frames_per_eg'.format(\n                             egs_dir)).readline().rstrip()\n        if not validate_chunk_width(frames_per_eg_str):\n            raise Exception(\"Invalid frames_per_eg in directory {0}/info\".format(\n                    egs_dir))\n        num_archives = int(open('{0}/info/num_archives'.format(\n                                    egs_dir)).readline())\n\n        return [egs_left_context, egs_right_context,\n                frames_per_eg_str, num_archives]\n    except (IOError, ValueError):\n        logger.error(\"The egs dir {0} has missing or \"\n                     \"malformed files.\".format(egs_dir))\n        raise\n\n\ndef compute_presoftmax_prior_scale(dir, alidir, num_jobs, run_opts,\n                                   presoftmax_prior_scale_power=-0.25):\n\n    # getting the raw pdf count\n    common_lib.execute_command(\n        \"\"\"{command} JOB=1:{num_jobs} {dir}/log/acc_pdf.JOB.log \\\n                ali-to-post \"ark:gunzip -c {alidir}/ali.JOB.gz|\" ark:- \\| \\\n                post-to-tacc --per-pdf=true  
{alidir}/final.mdl ark:- \\\n                {dir}/pdf_counts.JOB\"\"\".format(command=run_opts.command,\n                                               num_jobs=num_jobs,\n                                               dir=dir,\n                                               alidir=alidir))\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/sum_pdf_counts.log \\\n                vector-sum --binary=false {dir}/pdf_counts.* {dir}/pdf_counts \\\n        \"\"\".format(command=run_opts.command, dir=dir))\n\n    for file in glob.glob('{0}/pdf_counts.*'.format(dir)):\n        os.remove(file)\n    pdf_counts = common_lib.read_kaldi_matrix('{0}/pdf_counts'.format(dir))[0]\n    scaled_counts = smooth_presoftmax_prior_scale_vector(\n        pdf_counts,\n        presoftmax_prior_scale_power=presoftmax_prior_scale_power,\n        smooth=0.01)\n\n    output_file = \"{0}/presoftmax_prior_scale.vec\".format(dir)\n    common_lib.write_kaldi_matrix(output_file, [scaled_counts])\n    common_lib.force_symlink(\"../presoftmax_prior_scale.vec\",\n                             \"{0}/configs/presoftmax_prior_scale.vec\".format(\n                                dir))\n\n\ndef smooth_presoftmax_prior_scale_vector(pdf_counts,\n                                         presoftmax_prior_scale_power=-0.25,\n                                         smooth=0.01):\n    total = sum(pdf_counts)\n    average_count = float(total) / len(pdf_counts)\n    scales = []\n    for i in range(len(pdf_counts)):\n        scales.append(math.pow(pdf_counts[i] + smooth * average_count,\n                               presoftmax_prior_scale_power))\n    num_pdfs = len(pdf_counts)\n    scaled_counts = [x * float(num_pdfs) / sum(scales) for x in scales]\n    return scaled_counts\n\n\ndef prepare_initial_network(dir, run_opts, srand=-3, input_model=None):\n    if input_model is not None:\n        shutil.copy(input_model, \"{0}/0.raw\".format(dir))\n        return\n    if 
os.path.exists(dir+\"/configs/init.config\"):\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/add_first_layer.log \\\n                    nnet3-init --srand={srand} {dir}/init.raw \\\n                    {dir}/configs/final.config {dir}/0.raw\"\"\".format(\n                        command=run_opts.command, srand=srand,\n                        dir=dir))\n    else:\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/init_model.log \\\n           nnet3-init --srand={srand} {dir}/configs/final.config {dir}/0.raw\"\"\".format(\n                        command=run_opts.command, srand=srand,\n                        dir=dir))\n\n\ndef get_model_combine_iters(num_iters, num_epochs,\n                      num_archives, max_models_combine,\n                      num_jobs_final):\n    \"\"\" Figures out the list of iterations for which we'll use those models\n        in the final model-averaging phase.  (note: it's a weighted average\n        where the weights are worked out from a subset of training data.)\"\"\"\n\n    approx_iters_per_epoch_final = float(num_archives) / num_jobs_final\n    # Note: it used to be that we would combine over an entire epoch,\n    # but in practice we very rarely would use any weights from towards\n    # the end of that range, so we are changing it to use not\n    # approx_iters_per_epoch_final, but instead:\n    # approx_iters_per_epoch_final/2 + 1,\n    # dividing by 2 to use half an epoch, and adding 1 just to make sure\n    # it's not zero.\n\n    # First work out how many iterations we want to combine over in the final\n    # nnet3-combine-fast invocation.\n    # The number we use is:\n    # min(max(max_models_combine, approx_iters_per_epoch_final/2+1),\n    #     iters/2)\n    # But if this value is > max_models_combine, then the models\n    # are subsampled to get these many models to combine.\n\n    num_iters_combine_initial = min(int(approx_iters_per_epoch_final/2) + 1,\n              
                      int(num_iters/2))\n\n    if num_iters_combine_initial > max_models_combine:\n        subsample_model_factor = int(\n            float(num_iters_combine_initial) / max_models_combine)\n        num_iters_combine = num_iters_combine_initial\n        models_to_combine = set(range(\n            num_iters - num_iters_combine_initial + 1,\n            num_iters + 1, subsample_model_factor))\n        models_to_combine.add(num_iters)\n    else:\n        subsample_model_factor = 1\n        num_iters_combine = min(max_models_combine, num_iters//2)\n        models_to_combine = set(range(num_iters - num_iters_combine + 1,\n                                      num_iters + 1))\n\n    return models_to_combine\n\n\ndef get_current_num_jobs(it, num_it, start, step, end):\n    \"Get number of jobs for iteration number 'it' of range('num_it')\"\n\n    ideal = float(start) + (end - start) * float(it) / num_it\n    if ideal < step:\n        return int(0.5 + ideal)\n    else:\n        return int(0.5 + ideal / step) * step\n\n\ndef get_learning_rate(iter, num_jobs, num_iters, num_archives_processed,\n                      num_archives_to_process,\n                      initial_effective_lrate, final_effective_lrate):\n    if iter + 1 >= num_iters:\n        effective_learning_rate = final_effective_lrate\n    else:\n        effective_learning_rate = (\n                initial_effective_lrate\n                * math.exp(num_archives_processed\n                           * math.log(float(final_effective_lrate) / initial_effective_lrate)\n                           / num_archives_to_process))\n\n    return num_jobs * effective_learning_rate\n\n\ndef should_do_shrinkage(iter, model_file, shrink_saturation_threshold,\n                        get_raw_nnet_from_am=True):\n\n    if iter == 0:\n        return True\n\n    if get_raw_nnet_from_am:\n        output = common_lib.get_command_stdout(\n            \"nnet3-am-info {0} 2>/dev/null | \"\n            
\"steps/nnet3/get_saturation.pl\".format(model_file))\n    else:\n        output = common_lib.get_command_stdout(\n            \"nnet3-info 2>/dev/null {0} | \"\n            \"steps/nnet3/get_saturation.pl\".format(model_file))\n    output = output.strip().split(\"\\n\")\n    try:\n        assert len(output) == 1\n        saturation = float(output[0])\n        assert saturation >= 0 and saturation <= 1\n    except:\n        raise Exception(\"Something went wrong, could not get \"\n                        \"saturation from the output '{0}' of \"\n                        \"get_saturation.pl on the info of \"\n                        \"model {1}\".format(output, model_file))\n    return saturation > shrink_saturation_threshold\n\n\ndef remove_nnet_egs(egs_dir):\n    common_lib.execute_command(\"steps/nnet2/remove_egs.sh {egs_dir}\".format(\n            egs_dir=egs_dir))\n\n\ndef clean_nnet_dir(nnet_dir, num_iters, egs_dir,\n                   preserve_model_interval=100,\n                   remove_egs=True,\n                   get_raw_nnet_from_am=True):\n    try:\n        if remove_egs:\n            remove_nnet_egs(egs_dir)\n\n        for iter in range(num_iters):\n            remove_model(nnet_dir, iter, num_iters, None,\n                         preserve_model_interval,\n                         get_raw_nnet_from_am=get_raw_nnet_from_am)\n    except (IOError, OSError):\n        logger.error(\"Error while cleaning up the nnet directory\")\n        raise\n\n\ndef remove_model(nnet_dir, iter, num_iters, models_to_combine=None,\n                 preserve_model_interval=100,\n                 get_raw_nnet_from_am=True):\n    if iter % preserve_model_interval == 0:\n        return\n    if models_to_combine is not None and iter in models_to_combine:\n        return\n    if get_raw_nnet_from_am:\n        file_name = '{0}/{1}.mdl'.format(nnet_dir, iter)\n    else:\n        file_name = '{0}/{1}.raw'.format(nnet_dir, iter)\n\n    if os.path.isfile(file_name):\n        
os.remove(file_name)\n\n\ndef positive_int(arg):\n    val = int(arg)\n    if val <= 0:\n        raise argparse.ArgumentTypeError(\"must be positive int: '%s'\" % arg)\n    return val\n\n\nclass CommonParser(object):\n    \"\"\"Parser for parsing common options related to nnet3 training.\n\n    This argument parser adds common options related to nnet3 training\n    such as egs creation, training optimization options.\n    These are used in the nnet3 train scripts\n    in steps/nnet3/train*.py and steps/nnet3/chain/train.py\n    \"\"\"\n\n    parser = argparse.ArgumentParser(add_help=False)\n\n    def __init__(self,\n                 include_chunk_context=True,\n                 default_chunk_left_context=0):\n        # feat options\n        self.parser.add_argument(\"--feat.online-ivector-dir\", type=str,\n                                 dest='online_ivector_dir', default=None,\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"\"\"directory with the ivectors extracted\n                                 in an online fashion.\"\"\")\n        self.parser.add_argument(\"--feat.cmvn-opts\", type=str,\n                                 dest='cmvn_opts', default=None,\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"A string specifying '--norm-means' \"\n                                 \"and '--norm-vars' values\")\n\n        # egs extraction options.  
there is no point adding the chunk context\n        # option for non-RNNs (by which we mean basic TDNN-type topologies), as\n        # it wouldn't affect anything, so we disable them if we know in advance\n        # that we're not supporting RNN-type topologies (as in train_dnn.py).\n        if include_chunk_context:\n            self.parser.add_argument(\"--egs.chunk-left-context\", type=int,\n                                     dest='chunk_left_context',\n                                     default=default_chunk_left_context,\n                                     help=\"\"\"Number of additional frames of input\n                                 to the left of the input chunk. This extra\n                                 context will be used in the estimation of RNN\n                                 state before prediction of the first label. In\n                                 the case of FF-DNN this extra context will be\n                                 used to allow for frame-shifts\"\"\")\n            self.parser.add_argument(\"--egs.chunk-right-context\", type=int,\n                                     dest='chunk_right_context', default=0,\n                                     help=\"\"\"Number of additional frames of input\n                                     to the right of the input chunk. This extra\n                                     context will be used in the estimation of\n                                     bidirectional RNN state before prediction of\n                                 the first label.\"\"\")\n            self.parser.add_argument(\"--egs.chunk-left-context-initial\", type=int,\n                                     dest='chunk_left_context_initial', default=-1,\n                                     help=\"\"\"Number of additional frames of input\n                                 to the left of the *first* input chunk extracted\n                                 from an utterance.  
If negative, defaults to\n                                 the same as --egs.chunk-left-context\"\"\")\n            self.parser.add_argument(\"--egs.chunk-right-context-final\", type=int,\n                                     dest='chunk_right_context_final', default=-1,\n                                     help=\"\"\"Number of additional frames of input\n                                 to the right of the *last* input chunk extracted\n                                 from an utterance.  If negative, defaults to the\n                                 same as --egs.chunk-right-context\"\"\")\n        self.parser.add_argument(\"--egs.dir\", type=str, dest='egs_dir',\n                                 default=None,\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"\"\"Directory with egs. If specified this\n                                 directory will be used rather than extracting\n                                 egs\"\"\")\n        self.parser.add_argument(\"--egs.stage\", type=int, dest='egs_stage',\n                                 default=0,\n                                 help=\"Stage at which get_egs.sh should be \"\n                                 \"restarted\")\n        self.parser.add_argument(\"--egs.opts\", type=str, dest='egs_opts',\n                                 default=None,\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"\"\"String to provide options directly\n                                 to steps/nnet3/get_egs.sh script\"\"\")\n\n        # trainer options\n        self.parser.add_argument(\"--trainer.srand\", type=int, dest='srand',\n                                 default=0,\n                                 help=\"\"\"Sets the random seed for model\n                                 initialization and egs shuffling.\n                                 Warning: This random seed does not control all\n             
                    aspects of this experiment.  There might be\n                                 other random seeds used in other stages of the\n                                 experiment like data preparation (e.g. volume\n                                 perturbation).\"\"\")\n        self.parser.add_argument(\"--trainer.num-epochs\", type=float,\n                                 dest='num_epochs', default=8.0,\n                                 help=\"Number of epochs to train the model\")\n        self.parser.add_argument(\"--trainer.shuffle-buffer-size\", type=int,\n                                 dest='shuffle_buffer_size', default=5000,\n                                 help=\"\"\" Controls randomization of the samples\n                                 on each iteration. If 0 or a large value the\n                                 randomization is complete, but this will\n                                 consume memory and cause spikes in disk I/O.\n                                 Smaller is easier on disk and memory but less\n                                 random.  It's not a huge deal though, as\n                                 samples are anyway randomized right at the\n                                 start.  
(the point of this is to get data in\n                                 different minibatches on different iterations,\n                                 since in the preconditioning method, 2 samples\n                                 in the same minibatch can affect each other's\n                                 gradients.)\"\"\")\n        self.parser.add_argument(\"--trainer.max-param-change\", type=float,\n                                 dest='max_param_change', default=2.0,\n                                 help=\"\"\"The maximum change in parameters\n                                 allowed per minibatch, measured in Frobenius\n                                 norm over the entire model\"\"\")\n        self.parser.add_argument(\"--trainer.samples-per-iter\", type=int,\n                                 dest='samples_per_iter', default=400000,\n                                 help=\"This is really the number of egs in \"\n                                 \"each archive.\")\n        self.parser.add_argument(\"--trainer.lda.rand-prune\", type=float,\n                                 dest='rand_prune', default=4.0,\n                                 help=\"Value used in preconditioning \"\n                                 \"matrix estimation\")\n        self.parser.add_argument(\"--trainer.lda.max-lda-jobs\", type=int,\n                                 dest='max_lda_jobs', default=10,\n                                 help=\"Max number of jobs used for \"\n                                 \"LDA stats accumulation\")\n        self.parser.add_argument(\"--trainer.presoftmax-prior-scale-power\",\n                                 type=float,\n                                 dest='presoftmax_prior_scale_power',\n                                 default=-0.25,\n                                 help=\"Scale on presoftmax prior\")\n        self.parser.add_argument(\"--trainer.optimization.proportional-shrink\", type=float,\n                                 
dest='proportional_shrink', default=0.0,\n                                 help=\"\"\"If nonzero, this will set a shrinkage (scaling)\n                        factor for the parameters, whose value is set as:\n                        shrink-value=(1.0 - proportional-shrink * learning-rate), where\n                        'learning-rate' is the learning rate being applied\n                        on the current iteration, which will vary from\n                        initial-effective-lrate*num-jobs-initial to\n                        final-effective-lrate*num-jobs-final.\n                        Unlike for train_rnn.py, this is applied unconditionally,\n                        it does not depend on saturation of nonlinearities.\n                        Can be used to roughly approximate l2 regularization.\"\"\")\n\n        # Parameters for the optimization\n        self.parser.add_argument(\n            \"--trainer.optimization.initial-effective-lrate\", type=float,\n            dest='initial_effective_lrate', default=0.0003,\n            help=\"Learning rate used during the initial iteration\")\n        self.parser.add_argument(\n            \"--trainer.optimization.final-effective-lrate\", type=float,\n            dest='final_effective_lrate', default=0.00003,\n            help=\"Learning rate used during the final iteration\")\n        self.parser.add_argument(\"--trainer.optimization.num-jobs-initial\",\n                                 type=int, dest='num_jobs_initial', default=1,\n                                 help=\"Number of neural net jobs to run in \"\n                                 \"parallel at the start of training\")\n        self.parser.add_argument(\"--trainer.optimization.num-jobs-final\",\n                                 type=int, dest='num_jobs_final', default=8,\n                                 help=\"Number of neural net jobs to run in \"\n                                 \"parallel at the end of training\")\n        
self.parser.add_argument(\"--trainer.optimization.num-jobs-step\",\n            type=positive_int, metavar='N', dest='num_jobs_step', default=1,\n            help=\"\"\"Granularity of the number of jobs once it reaches this\n            number: from N onward, only multiples of N are used. For\n            example, if N=3, the number of jobs may progress as 1, 2, 3, 6, 9...\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.max-models-combine\",\n                                 \"--trainer.max-models-combine\",\n                                 type=int, dest='max_models_combine',\n                                 default=20,\n                                 help=\"\"\"The maximum number of models used in\n                                 the final model combination stage.  These\n                                 models will themselves be averages of\n                                 iteration-number ranges\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.max-objective-evaluations\",\n                                 \"--trainer.max-objective-evaluations\",\n                                 type=int, dest='max_objective_evaluations',\n                                 default=30,\n                                 help=\"\"\"The maximum number of objective\n                                 evaluations in order to figure out the\n                                 best number of models to combine. 
It helps to\n                                 speed things up if the number of models provided to the\n                                 model combination binary is quite large (e.g.\n                                 several hundred).\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.do-final-combination\",\n                                 dest='do_final_combination', type=str,\n                                 action=common_lib.StrToBoolAction,\n                                 choices=[\"true\", \"false\"], default=True,\n                                 help=\"\"\"Set this to false to disable the final\n                                 'combine' stage (in this case we just use the\n                                 last-numbered model as the final.mdl).\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.combine-sum-to-one-penalty\",\n                                 type=float, dest='combine_sum_to_one_penalty', default=0.0,\n                                 help=\"\"\"This option is deprecated and does nothing.\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.momentum\", type=float,\n                                 dest='momentum', default=0.0,\n                                 help=\"\"\"Momentum used in update computation.\n                                 Note: we implemented it in such a way that it\n                                 doesn't increase the effective learning\n                                 rate.\"\"\")\n        self.parser.add_argument(\"--trainer.dropout-schedule\", type=str,\n                                 action=common_lib.NullstrToNoneAction,\n                                 dest='dropout_schedule', default=None,\n                                 help=\"\"\"Use this to specify the dropout\n                                 schedule.  
You specify a piecewise linear\n                                 function on the domain [0,1], where 0 is the\n                                 start and 1 is the end of training; the\n                                 function-argument (x) rises linearly with the\n                                 amount of data you have seen, not iteration\n                                 number (this improves invariance to\n                                 num-jobs-{initial-final}).  E.g. '0,0.2,0'\n                                 means 0 at the start; 0.2 after seeing half\n                                 the data; and 0 at the end.  You may specify\n                                 the x-value of selected points, e.g.\n                                 '0,0.2@0.25,0' means that the 0.2\n                                 dropout-proportion is reached a quarter of the\n                                 way through the data.   The start/end x-values\n                                 are at x=0/x=1, and other unspecified x-values\n                                 are interpolated between known x-values.  You\n                                 may specify different rules for different\n                                 component-name patterns using 'pattern1=func1\n                                 pattern2=func2', e.g. 'relu*=0,0.1,0\n                                 lstm*=0,0.2,0'.  
More general patterns should precede\n                                 less general ones, as they are applied\n                                 sequentially.\"\"\")\n        self.parser.add_argument(\"--trainer.add-option\", type=str,\n                                 dest='train_opts', action='append', default=[],\n                                 help=\"\"\"You can use this to add arbitrary options that\n                                 will be passed through to the core training code (nnet3-train\n                                 or nnet3-chain-train)\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.backstitch-training-scale\",\n                                 type=float, dest='backstitch_training_scale',\n                                 default=0.0, help=\"\"\"Scale of parameter changes\n                                 used in the backstitch training step.\"\"\")\n        self.parser.add_argument(\"--trainer.optimization.backstitch-training-interval\",\n                                 type=int, dest='backstitch_training_interval',\n                                 default=1, help=\"\"\"Interval, in minibatches,\n                                 at which backstitch training is applied.\"\"\")\n        self.parser.add_argument(\"--trainer.compute-per-dim-accuracy\",\n                                 dest='compute_per_dim_accuracy',\n                                 type=str, choices=['true', 'false'],\n                                 default=False,\n                                 action=common_lib.StrToBoolAction,\n                                 help=\"Compute train and validation \"\n                                 \"accuracy per-dim\")\n\n        # General options\n        self.parser.add_argument(\"--stage\", type=int, default=-4,\n                                 help=\"Specifies the stage of the experiment \"\n                                 \"to execute from\")\n        self.parser.add_argument(\"--exit-stage\", type=int, 
default=None,\n                                 help=\"If specified, training exits before \"\n                                 \"running this stage\")\n        self.parser.add_argument(\"--cmd\", type=str, dest=\"command\",\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"\"\"Specifies the script to launch jobs.\n                                 e.g. queue.pl for launching on SGE cluster\n                                        run.pl for launching on local machine\n                                 \"\"\", default=\"queue.pl\")\n        self.parser.add_argument(\"--egs.cmd\", type=str, dest=\"egs_command\",\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"Script to launch egs jobs\")\n        self.parser.add_argument(\"--use-gpu\", type=str,\n                                 choices=[\"true\", \"false\", \"yes\", \"no\", \"wait\"],\n                                 help=\"Use GPU for training. 
\"\n                                 \"Note 'true' and 'false' are deprecated.\",\n                                 default=\"yes\")\n        self.parser.add_argument(\"--cleanup\", type=str,\n                                 action=common_lib.StrToBoolAction,\n                                 choices=[\"true\", \"false\"], default=True,\n                                 help=\"Clean up models after training\")\n        self.parser.add_argument(\"--cleanup.remove-egs\", type=str,\n                                 dest='remove_egs', default=True,\n                                 action=common_lib.StrToBoolAction,\n                                 choices=[\"true\", \"false\"],\n                                 help=\"If true, remove egs after experiment\")\n        self.parser.add_argument(\"--cleanup.preserve-model-interval\",\n                                 dest=\"preserve_model_interval\",\n                                 type=int, default=100,\n                                 help=\"\"\"Determines iterations for which models\n                                 will be preserved during cleanup.\n                                 If mod(iter,preserve_model_interval) == 0\n                                 model will be preserved.\"\"\")\n\n        self.parser.add_argument(\"--reporting.email\", dest=\"email\",\n                                 type=str, default=None,\n                                 action=common_lib.NullstrToNoneAction,\n                                 help=\"\"\" Email-id to report about the progress\n                                 of the experiment.  NOTE: It assumes the\n                                 machine on which the script is being run can\n                                 send emails from command line via. mail\n                                 program. The Kaldi mailing list will not\n                                 support this feature.  It might require local\n                                 expertise to setup. 
\"\"\")\n        self.parser.add_argument(\"--reporting.interval\",\n                                 dest=\"reporting_interval\",\n                                 type=float, default=0.1,\n                                 help=\"\"\"Frequency with which reports have to\n                                 be sent, measured in terms of fraction of\n                                 iterations.\n                                 If 0 and reporting mail has been specified\n                                 then only failure notifications are sent\"\"\")\n\n\nimport unittest\n\nclass SelfTest(unittest.TestCase):\n\n    def test_halve_minibatch_size_str(self):\n        self.assertEqual('32', halve_minibatch_size_str('64'))\n        self.assertEqual('32,8:16', halve_minibatch_size_str('64,16:32'))\n        self.assertEqual('1', halve_minibatch_size_str('1'))\n        self.assertEqual('128=32/256=20,40:50', halve_minibatch_size_str('128=64/256=40,80:100'))\n\n\n    def test_validate_chunk_width(self):\n        for s in [ '64', '64,25,128' ]:\n            self.assertTrue(validate_chunk_width(s), s)\n\n\n    def test_validate_minibatch_size_str(self):\n        # Good descriptors.\n        for s in [ '32', '32,64', '1:32', '1:32,64', '64,1:32', '1:5,10:15',\n                   '128=64:128/256=32,64', '1=2/3=4', '1=1/2=2/3=3/4=4' ]:\n            self.assertTrue(validate_minibatch_size_str(s), s)\n        # Bad descriptors.\n        for s in [ None, 42, (43,), '', '1:', ':2', '3,', ',4', '5:6,', ',7:8',\n                   '9=', '10=10/', '11=11/11', '12=1:2//13=1:3' '14=/15=15',\n                   '16/17=17', '/18=18', '/18', '//19', '/' ]:\n            self.assertFalse(validate_minibatch_size_str(s), s)\n\n\n    def test_get_current_num_jobs(self):\n        niters = 12\n        self.assertEqual([2, 3, 3, 4, 4, 5, 6, 6, 7, 7, 8, 8],\n                         [get_current_num_jobs(i, niters, 2, 1, 9)\n                              for i in range(niters)])\n        
self.assertEqual([2, 3, 3, 3, 3, 6, 6, 6, 6, 6, 9, 9],\n                         [get_current_num_jobs(i, niters, 2, 3, 9)\n                              for i in range(niters)])\n\n\nif __name__ == '__main__':\n    unittest.main()\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/dropout_schedule.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2016    Vimal Manohar\n# Apache 2.0\n\n\"\"\"This module contains methods related to scheduling dropout.\nSee _self_test() for examples of how the functions work.\n\"\"\"\n\nimport logging\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\n_debug_dropout = False\n\ndef _parse_dropout_option(dropout_option):\n    \"\"\"Parses the string option to --trainer.dropout-schedule and\n    returns a list of dropout schedules for different component name patterns.\n    Calls _parse_dropout_string() function for each component name pattern\n    in the option.\n\n    Arguments:\n        dropout_option: The string option passed to --trainer.dropout-schedule.\n            See its help for details.\n            See _self_test() for examples.\n        num_archive_to_process: See _parse_dropout_string() for details.\n\n    Returns a list of (component_name, dropout_schedule) tuples,\n    where dropout_schedule is itself a list of\n    (data_fraction, dropout_proportion) tuples sorted in reverse order of\n    data_fraction.\n    A data fraction of 0 corresponds to beginning of training\n    and 1 corresponds to all data.\n    \"\"\"\n    components = dropout_option.strip().split(' ')\n    dropout_schedule = []\n    for component in components:\n        parts = component.split('=')\n\n        if len(parts) == 2:\n            component_name = parts[0]\n            this_dropout_str = parts[1]\n        elif len(parts) == 1:\n            component_name = '*'\n            this_dropout_str = parts[0]\n        else:\n            raise Exception(\"The dropout schedule must be specified in the \"\n                            \"format 'pattern1=func1 patter2=func2' where \"\n                            \"the pattern can be omitted for a global function \"\n                            \"for all components.\\n\"\n                            \"Got {0} in {1}\".format(component, dropout_option))\n\n        
this_dropout_values = _parse_dropout_string(this_dropout_str)\n        dropout_schedule.append((component_name, this_dropout_values))\n\n    if _debug_dropout:\n        logger.info(\"Dropout schedules for component names are as follows:\")\n        logger.info(\"<component-name-pattern>: [(data_fraction, \"\n                    \"dropout_proportion), ...]\")\n        for name, schedule in dropout_schedule:\n            logger.info(\"{0}: {1}\".format(name, schedule))\n\n    return dropout_schedule\n\n\ndef _parse_dropout_string(dropout_str):\n    \"\"\"Parses the dropout schedule from the string corresponding to a\n    single component in --trainer.dropout-schedule.\n    This is a module-internal function called by _parse_dropout_option().\n\n    Arguments:\n        dropout_str: Specifies dropout schedule for a particular component\n            name pattern.\n            See help for the option --trainer.dropout-schedule.\n\n    Returns a list of (data_fraction_processed, dropout_proportion) tuples\n    sorted in descending order of data_fraction_processed.\n    A data fraction of 1 corresponds to all data.\n    \"\"\"\n    dropout_values = []\n    parts = dropout_str.strip().split(',')\n\n    try:\n        if len(parts) < 2:\n            raise Exception(\"dropout proportion string must specify \"\n                            \"at least the start and end dropouts\")\n\n        # Starting dropout proportion\n        dropout_values.append((0, float(parts[0])))\n        for i in range(1, len(parts) - 1):\n            value_x_pair = parts[i].split('@')\n            if len(value_x_pair) == 1:\n                # Dropout proportion at half of training\n                dropout_proportion = float(value_x_pair[0])\n                data_fraction = 0.5\n            else:\n                assert len(value_x_pair) == 2\n\n                dropout_proportion = float(value_x_pair[0])\n                data_fraction = float(value_x_pair[1])\n\n            if (data_fraction < 
dropout_values[-1][0]\n                    or data_fraction > 1.0):\n                logger.error(\n                    \"Failed while parsing value %s in dropout-schedule. \"\n                    \"dropout-schedule must be in increasing \"\n                    \"order of data fractions.\", value_x_pair)\n                raise ValueError\n\n            dropout_values.append((data_fraction, float(dropout_proportion)))\n\n        dropout_values.append((1.0, float(parts[-1])))\n    except Exception:\n        logger.error(\"Unable to parse dropout proportion string %s. \"\n                     \"See help for option \"\n                     \"--trainer.dropout-schedule.\", dropout_str)\n        raise\n\n    # reverse sort so that it's easy to retrieve the dropout proportion\n    # for a particular data fraction\n    dropout_values.reverse()\n    for data_fraction, proportion in dropout_values:\n        assert data_fraction <= 1.0 and data_fraction >= 0.0\n        assert proportion <= 1.0 and proportion >= 0.0\n\n    return dropout_values\n\n\ndef _get_component_dropout(dropout_schedule, data_fraction):\n    \"\"\"Retrieve dropout proportion from schedule when data_fraction\n    proportion of data is seen. 
This value is obtained by using a\n    piecewise linear function on the dropout schedule.\n    This is a module-internal function called by _get_dropout_proportions().\n\n    See help for --trainer.dropout-schedule for how the dropout value\n    is obtained from the options.\n\n    Arguments:\n        dropout_schedule: A list of (data_fraction, dropout_proportion) values\n            sorted in descending order of data_fraction.\n        data_fraction: The fraction of data seen until this stage of\n            training.\n    \"\"\"\n    if data_fraction == 0:\n        # Dropout at start of the iteration is in the last index of\n        # dropout_schedule\n        assert dropout_schedule[-1][0] == 0\n        return dropout_schedule[-1][1]\n    try:\n        # Find lower bound of the data_fraction. This is the\n        # lower end of the piecewise linear function.\n        (dropout_schedule_index, initial_data_fraction,\n         initial_dropout) = next((i, tup[0], tup[1])\n                                 for i, tup in enumerate(dropout_schedule)\n                                 if tup[0] <= data_fraction)\n    except StopIteration:\n        raise RuntimeError(\n            \"Could not find data_fraction in dropout schedule \"\n            \"corresponding to data_fraction {0}.\\n\"\n            \"Maybe something wrong with the parsed \"\n            \"dropout schedule {1}.\".format(data_fraction, dropout_schedule))\n\n    if dropout_schedule_index == 0:\n        assert dropout_schedule[0][0] == 1 and data_fraction == 1\n        return dropout_schedule[0][1]\n\n    # The upper bound of data_fraction is at the index before the\n    # lower bound.\n    final_data_fraction, final_dropout = dropout_schedule[\n        dropout_schedule_index - 1]\n\n    if final_data_fraction == initial_data_fraction:\n        assert data_fraction == initial_data_fraction\n        return initial_dropout\n\n    assert (data_fraction >= initial_data_fraction\n            and data_fraction < 
final_data_fraction)\n\n    return ((data_fraction - initial_data_fraction)\n            * (final_dropout - initial_dropout)\n            / (final_data_fraction - initial_data_fraction)\n            + initial_dropout)\n\n\ndef _get_dropout_proportions(dropout_schedule, data_fraction):\n    \"\"\"Returns dropout proportions based on the dropout_schedule for the\n    fraction of data seen at this stage of training.  Returns a list of\n    pairs (pattern, dropout_proportion); for instance, it might return\n    the list [('*', 0.625)] meaning a dropout proportion of 0.625 is to\n    be applied to all dropout components.\n\n    Returns None if dropout_schedule is None.\n\n    dropout_schedule might be (in the sample case using the default pattern of\n    '*'): '0.1,0.5@0.5,0.1', meaning a piecewise linear function that starts at\n    0.1 when data_fraction=0.0, rises to 0.5 when data_fraction=0.5, and falls\n    again to 0.1 when data_fraction=1.0.   It can also contain space-separated\n    items of the form 'pattern=schedule', for instance:\n       '*=0.0,0.5,0.0 lstm.*=0.0,0.3@0.75,0.0'\n    The more specific patterns should go later, otherwise they will be overridden\n    by the less specific patterns' commands.\n\n    Calls _get_component_dropout() for the different component name patterns\n    in dropout_schedule.\n\n    Arguments:\n        dropout_schedule: Value for the --trainer.dropout-schedule option.\n            See help for --trainer.dropout-schedule.\n            See _self_test() for examples.\n        data_fraction: The fraction of data seen until this stage of\n            training.\n\n    \"\"\"\n    if dropout_schedule is None:\n        return None\n    dropout_schedule = _parse_dropout_option(dropout_schedule)\n    dropout_proportions = []\n    for component_name, component_dropout_schedule in dropout_schedule:\n        dropout_proportions.append(\n            (component_name, _get_component_dropout(\n                component_dropout_schedule, 
data_fraction)))\n    return dropout_proportions\n\ndef get_dropout_edit_option(dropout_schedule, data_fraction, iter_):\n    \"\"\"Return an option to be passed to nnet3-copy (or nnet3-am-copy)\n    that will set the appropriate dropout proportion.  If no dropout\n    is being used (dropout_schedule is None), returns the empty\n    string, otherwise returns something like\n    \"--edits='set-dropout-proportion name=* proportion=0.625'\"\n    Arguments:\n        dropout_schedule: Value for the --trainer.dropout-schedule option.\n            See help for --trainer.dropout-schedule.\n            See _self_test() for examples.\n        data_fraction: real number in [0,1] that says how far along\n            in training we are.\n        iter_: iteration number (needed for debug printing only)\n    See ReadEditConfig() in nnet3/nnet-utils.h to see how\n    set-dropout-proportion directive works.\n    \"\"\"\n\n    if data_fraction > 1.0:\n        data_fraction = 1.0\n\n    if dropout_schedule is None:\n        return \"\"\n\n    dropout_proportions = _get_dropout_proportions(\n        dropout_schedule, data_fraction)\n\n    edit_config_lines = []\n    dropout_info = []\n\n    for component_name, dropout_proportion in dropout_proportions:\n        edit_config_lines.append(\n            \"set-dropout-proportion name={0} proportion={1}\".format(\n                component_name, dropout_proportion))\n        dropout_info.append(\"pattern/dropout-proportion={0}/{1}\".format(\n            component_name, dropout_proportion))\n\n    if _debug_dropout:\n        logger.info(\"On iteration %d, %s\", iter_, ', '.join(dropout_info))\n\n    return \"--edits='{0}'\".format(\";\".join(edit_config_lines))\n\n\n\ndef get_dropout_edit_string(dropout_schedule, data_fraction, iter_):\n    \"\"\"Return an nnet3-copy --edits line to modify raw_model_string to\n    set dropout proportions according to dropout_proportions.\n    E.g. 
if _get_dropout_proportions(dropout_schedule, data_fraction)\n    returns [('*', 0.625)],  this will return the string:\n     \"nnet3-copy --edits='set-dropout-proportion name=* proportion=0.625' - - |\"\n\n\n    Arguments:\n        dropout_schedule: Value for the --trainer.dropout-schedule option.\n            See help for --trainer.dropout-schedule.\n            See _self_test() for examples.\n        data_fraction: real number in [0,1] that says how far along\n            in training we are.\n        iter_: iteration number (needed for debug printing only)\n\n    See ReadEditConfig() in nnet3/nnet-utils.h to see how\n    set-dropout-proportion directive works.\n    \"\"\"\n\n    if dropout_schedule is None:\n        return \"\"\n\n    dropout_proportions = _get_dropout_proportions(\n        dropout_schedule, data_fraction)\n\n    edit_config_lines = []\n    dropout_info = []\n\n    for component_name, dropout_proportion in dropout_proportions:\n        edit_config_lines.append(\n            \"set-dropout-proportion name={0} proportion={1}\".format(\n                component_name, dropout_proportion))\n        dropout_info.append(\"pattern/dropout-proportion={0}/{1}\".format(\n            component_name, dropout_proportion))\n\n    if _debug_dropout:\n        logger.info(\"On iteration %d, %s\", iter_, ', '.join(dropout_info))\n    return (\"\"\"nnet3-copy --edits='{edits}' - - |\"\"\".format(\n        edits=\";\".join(edit_config_lines)))\n\n\ndef _self_test():\n    \"\"\"Run self-test.\n    This method is called if the module is run as a standalone script.\n    \"\"\"\n\n    def assert_approx_equal(list1, list2):\n        \"\"\"Checks that the two dropout proportions lists are approximately equal.\"\"\"\n        assert len(list1) == len(list2)\n        for i in range(0, len(list1)):\n            assert len(list1[i]) == 2\n            assert len(list2[i]) == 2\n            assert list1[i][0] == list2[i][0]\n            assert abs(list1[i][1] - list2[i][1]) < 1e-8\n\n    assert (_parse_dropout_option('*=0.0,0.5,0.0 lstm.*=0.0,0.3@0.75,0.0')\n            == [ ('*', [ (1.0, 0.0), (0.5, 0.5), (0.0, 0.0) ]),\n                 ('lstm.*', [ (1.0, 0.0), 
(0.75, 0.3), (0.0, 0.0) ]) ])\n    assert_approx_equal(_get_dropout_proportions(\n                           '*=0.0,0.5,0.0 lstm.*=0.0,0.3@0.75,0.0', 0.75),\n                        [ ('*', 0.25), ('lstm.*', 0.3) ])\n    assert_approx_equal(_get_dropout_proportions(\n                            '*=0.0,0.5,0.0 lstm.*=0.0,0.3@0.75,0.0', 0.5),\n                        [ ('*', 0.5), ('lstm.*', 0.2) ])\n    assert_approx_equal(_get_dropout_proportions(\n                            '*=0.0,0.5,0.0 lstm.*=0.0,0.3@0.75,0.0', 0.25),\n                        [ ('*', 0.25), ('lstm.*', 0.1) ])\n\n    assert (_parse_dropout_option('0.0,0.3,0.0')\n            == [ ('*', [ (1.0, 0.0), (0.5, 0.3), (0.0, 0.0) ]) ])\n    assert_approx_equal(_get_dropout_proportions('0.0,0.3,0.0', 0.5),\n                        [ ('*', 0.3) ])\n    assert_approx_equal(_get_dropout_proportions('0.0,0.3,0.0', 0.0),\n                        [ ('*', 0.0) ])\n    assert_approx_equal(_get_dropout_proportions('0.0,0.3,0.0', 1.0),\n                        [ ('*', 0.0) ])\n    assert_approx_equal(_get_dropout_proportions('0.0,0.3,0.0', 0.25),\n                        [ ('*', 0.15) ])\n\n    assert (_parse_dropout_option('0.0,0.5@0.25,0.0,0.6@0.75,0.0')\n            == [ ('*', [ (1.0, 0.0), (0.75, 0.6), (0.5, 0.0), (0.25, 0.5), (0.0, 0.0) ]) ])\n    assert_approx_equal(_get_dropout_proportions(\n                            '0.0,0.5@0.25,0.0,0.6@0.75,0.0', 0.25),\n                        [ ('*', 0.5) ])\n    assert_approx_equal(_get_dropout_proportions(\n                            '0.0,0.5@0.25,0.0,0.6@0.75,0.0', 0.1),\n                        [ ('*', 0.2) ])\n\n    assert (_parse_dropout_option('lstm.*=0.0,0.3,0.0@0.75,1.0')\n            == [ ('lstm.*', [ (1.0, 1.0), (0.75, 0.0), (0.5, 0.3), (0.0, 0.0) ]) ])\n    assert_approx_equal(_get_dropout_proportions(\n                            'lstm.*=0.0,0.3,0.0@0.75,1.0', 0.25),\n                        [ ('lstm.*', 0.15) ])\n    
assert_approx_equal(_get_dropout_proportions(\n                            'lstm.*=0.0,0.3,0.0@0.75,1.0', 0.5),\n                        [ ('lstm.*', 0.3) ])\n    assert_approx_equal(_get_dropout_proportions(\n                            'lstm.*=0.0,0.3,0.0@0.75,1.0', 0.9),\n                        [ ('lstm.*', 0.6) ])\n\n\nif __name__ == '__main__':\n    try:\n        _self_test()\n    except Exception:\n        logger.error(\"Failed self test\")\n        raise\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/frame_level_objf/__init__.py",
    "content": "\n\n# Copyright 2016 Vimal Manohar\n# Apache 2.0\n\n\"\"\" This library has classes and methods commonly used for training nnet3\nneural networks with frame-level objectives.\n\"\"\"\n\nfrom . import common\nfrom . import raw_model\nfrom . import acoustic_model\n\n__all__ = [\"common\", \"raw_model\", \"acoustic_model\"]\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/frame_level_objf/acoustic_model.py",
    "content": "\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This is a module with method which will be used by scripts for\ntraining of deep neural network acoustic model with frame-level objective.\n\"\"\"\n\nimport logging\n\nimport libs.common as common_lib\nimport libs.nnet3.train.common as common_train_lib\n\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\ndef generate_egs(data, alidir, egs_dir,\n                 left_context, right_context,\n                 run_opts, stage=0,\n                 left_context_initial=-1, right_context_final=-1,\n                 online_ivector_dir=None,\n                 samples_per_iter=20000, frames_per_eg_str=\"20\", srand=0,\n                 egs_opts=None, cmvn_opts=None):\n\n    \"\"\" Wrapper for calling steps/nnet3/get_egs.sh\n\n    Generates targets from alignment directory 'alidir', which contains\n    the model final.mdl and alignments.\n    \"\"\"\n\n    common_lib.execute_command(\n        \"\"\"steps/nnet3/get_egs.sh {egs_opts} \\\n                --cmd \"{command}\" \\\n                --cmvn-opts \"{cmvn_opts}\" \\\n                --online-ivector-dir \"{ivector_dir}\" \\\n                --left-context {left_context} \\\n                --right-context {right_context} \\\n                --left-context-initial {left_context_initial} \\\n                --right-context-final {right_context_final} \\\n                --stage {stage} \\\n                --samples-per-iter {samples_per_iter} \\\n                --frames-per-eg {frames_per_eg_str} \\\n                --srand {srand} \\\n                {data} {alidir} {egs_dir}\n        \"\"\".format(command=run_opts.egs_command,\n                   cmvn_opts=cmvn_opts if cmvn_opts is not None else '',\n                   ivector_dir=(online_ivector_dir\n                                if online_ivector_dir is not None\n                                else ''),\n     
              left_context=left_context,\n                   right_context=right_context,\n                   left_context_initial=left_context_initial,\n                   right_context_final=right_context_final,\n                   stage=stage, samples_per_iter=samples_per_iter,\n                   frames_per_eg_str=frames_per_eg_str, srand=srand, data=data,\n                   alidir=alidir, egs_dir=egs_dir,\n                   egs_opts=egs_opts if egs_opts is not None else ''))\n\n\ndef prepare_initial_acoustic_model(dir, alidir, run_opts,\n                                   srand=-3, input_model=None):\n    \"\"\" Adds the first layer; this will also add in the lda.mat and\n        presoftmax_prior_scale.vec. It will also prepare the acoustic model\n        with the transition model.\n        If 'input_model' is specified, no initial network preparation(adding\n        the first layer) is done and this model is used as initial 'raw' model\n        instead of '0.raw' model to prepare '0.mdl' as acoustic model by adding the\n        transition model.\n    \"\"\"\n\n    if input_model is None:\n        common_train_lib.prepare_initial_network(dir, run_opts,\n                                                 srand=srand)\n\n    # Convert to .mdl, train the transitions, set the priors.\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/init_mdl.log \\\n                nnet3-am-init {alidir}/final.mdl {raw_mdl} - \\| \\\n                nnet3-am-train-transitions - \\\n                \"ark:gunzip -c {alidir}/ali.*.gz|\" {dir}/0.mdl\n        \"\"\".format(command=run_opts.command,\n                   dir=dir, alidir=alidir,\n                   raw_mdl=(input_model if input_model is not None\n                            else '{0}/0.raw'.format(dir))))\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/frame_level_objf/common.py",
    "content": "\n# Copyright 2016 Vijayaditya Peddinti.\n#           2016 Vimal Manohar\n#           2017 Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\n\"\"\" This is a module with methods which will be used by scripts for training of\ndeep neural network acoustic model and raw model (i.e., generic neural\nnetwork without transition model) with frame-level objectives.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport glob\nimport logging\nimport math\nimport os\nimport random\nimport time\n\nimport libs.common as common_lib\nimport libs.nnet3.train.common as common_train_lib\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\ndef train_new_models(dir, iter, srand, num_jobs,\n                     num_archives_processed, num_archives,\n                     raw_model_string, egs_dir,\n                     momentum, max_param_change,\n                     shuffle_buffer_size, minibatch_size_str,\n                     image_augmentation_opts,\n                     run_opts, frames_per_eg=-1,\n                     min_deriv_time=None, max_deriv_time_relative=None,\n                     use_multitask_egs=False, train_opts=\"\",\n                     backstitch_training_scale=0.0, backstitch_training_interval=1):\n    \"\"\" Called from train_one_iteration(), this model does one iteration of\n    training with 'num_jobs' jobs, and writes files like\n    exp/tdnn_a/24.{1,2,3,..<num_jobs>}.raw\n\n    We cannot easily use a single parallel SGE job to do the main training,\n    because the computation of which archive and which --frame option\n    to use for each job is a little complex, so we spawn each one separately.\n    this is no longer true for RNNs as we use do not use the --frame option\n    but we use the same script for consistency with FF-DNN code\n\n    Selected args:\n        frames_per_eg:\n            The frames_per_eg, in the context of (non-chain) nnet3 training,\n 
           is normally the number of output (supervised) frames in each training\n            example.  However, the frames_per_eg argument to this function should\n            only be set to that number (greater than zero) if you intend to\n            train on a single frame of each example, on each minibatch.  If you\n            provide this argument >0, then for each training job a different\n            frame from the dumped example is selected to train on, based on\n            the option --frame=n to nnet3-copy-egs.\n            If you leave frames_per_eg at its default value (-1), then the\n            entire sequence of frames is used for supervision.  This is suitable\n            for RNN training, where it helps to amortize the cost of computing\n            the activations for the frames of context needed for the recurrence.\n        use_multitask_egs : True if different examples are used to train multiple\n            tasks or outputs, e.g. multilingual training.  Multilingual egs can\n            be generated using get_egs.sh and\n            steps/nnet3/multilingual/allocate_multilingual_examples.py, which\n            are the top-level scripts.\n    \"\"\"\n\n    chunk_level_training = False if frames_per_eg > 0 else True\n\n    deriv_time_opts = []\n    if min_deriv_time is not None:\n        deriv_time_opts.append(\"--optimization.min-deriv-time={0}\".format(\n                           min_deriv_time))\n    if max_deriv_time_relative is not None:\n        deriv_time_opts.append(\"--optimization.max-deriv-time-relative={0}\".format(\n                           max_deriv_time_relative))\n\n    threads = []\n\n    # the GPU timing info is only printed if we use the --verbose=1 flag; this\n    # slows down the computation slightly, so don't accumulate it on every\n    # iteration.  
Don't do it on iteration 0 either, because we use a smaller\n    # than normal minibatch size, and people may get confused thinking it's\n    # slower for iteration 0 because of the verbose option.\n    verbose_opt = (\"--verbose=1\" if iter % 20 == 0 and iter > 0 else \"\")\n\n    for job in range(1, num_jobs+1):\n        # k is a zero-based index that we will derive the other indexes from.\n        k = num_archives_processed + job - 1\n\n        # work out the 1-based archive index.\n        archive_index = (k % num_archives) + 1\n\n        if not chunk_level_training:\n            frame = (k // num_archives + archive_index) % frames_per_eg\n\n        cache_io_opts = ((\"--read-cache={dir}/cache.{iter}\".format(dir=dir,\n                                                                  iter=iter)\n                          if iter > 0 else \"\") +\n                         (\" --write-cache={0}/cache.{1}\".format(dir, iter + 1)\n                          if job == 1 else \"\"))\n\n        if image_augmentation_opts:\n            image_augmentation_cmd = (\n                'nnet3-egs-augment-image --srand={srand} {aug_opts} ark:- ark:- |'.format(\n                    srand=k+srand,\n                    aug_opts=image_augmentation_opts))\n        else:\n            image_augmentation_cmd = ''\n\n\n        multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n            egs_dir,\n            egs_prefix=\"egs.\",\n            archive_index=archive_index,\n            use_multitask_egs=use_multitask_egs)\n\n        scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n\n        egs_rspecifier = (\n            \"\"\"ark,bg:nnet3-copy-egs {frame_opts} {multitask_egs_opts} \\\n            {scp_or_ark}:{egs_dir}/egs.{archive_index}.{scp_or_ark} ark:- | \\\n            nnet3-shuffle-egs --buffer-size={shuffle_buffer_size} \\\n            --srand={srand} ark:- ark:- | {aug_cmd} \\\n            nnet3-merge-egs --minibatch-size={minibatch_size} ark:- ark:- 
|\"\"\".format(\n                frame_opts=(\"\" if chunk_level_training\n                            else \"--frame={0}\".format(frame)),\n                egs_dir=egs_dir, archive_index=archive_index,\n                shuffle_buffer_size=shuffle_buffer_size,\n                minibatch_size=minibatch_size_str,\n                aug_cmd=image_augmentation_cmd,\n                srand=iter+srand,\n                scp_or_ark=scp_or_ark,\n                multitask_egs_opts=multitask_egs_opts))\n\n        # note: the thread waits on that process's completion.\n        thread = common_lib.background_command(\n            \"\"\"{command} {train_queue_opt} {dir}/log/train.{iter}.{job}.log \\\n                    nnet3-train {parallel_train_opts} {cache_io_opts} \\\n                     {verbose_opt} --print-interval=10 \\\n                    --momentum={momentum} \\\n                    --max-param-change={max_param_change} \\\n                    --backstitch-training-scale={backstitch_training_scale} \\\n                    --l2-regularize-factor={l2_regularize_factor} \\\n                    --backstitch-training-interval={backstitch_training_interval} \\\n                    --srand={srand} {train_opts} \\\n                    {deriv_time_opts} \"{raw_model}\" \"{egs_rspecifier}\" \\\n                    {dir}/{next_iter}.{job}.raw\"\"\".format(\n                command=run_opts.command,\n                train_queue_opt=run_opts.train_queue_opt,\n                dir=dir, iter=iter,\n                next_iter=iter + 1, srand=iter + srand,\n                job=job,\n                parallel_train_opts=run_opts.parallel_train_opts,\n                cache_io_opts=cache_io_opts,\n                verbose_opt=verbose_opt,\n                momentum=momentum, max_param_change=max_param_change,\n                l2_regularize_factor=1.0/num_jobs,\n                backstitch_training_scale=backstitch_training_scale,\n                
backstitch_training_interval=backstitch_training_interval,\n                train_opts=train_opts,\n                deriv_time_opts=\" \".join(deriv_time_opts),\n                raw_model=raw_model_string,\n                egs_rspecifier=egs_rspecifier),\n            require_zero_status=True)\n\n        threads.append(thread)\n\n    for thread in threads:\n        thread.join()\n\n\ndef train_one_iteration(dir, iter, srand, egs_dir,\n                        num_jobs, num_archives_processed, num_archives,\n                        learning_rate, minibatch_size_str,\n                        momentum, max_param_change, shuffle_buffer_size,\n                        run_opts, image_augmentation_opts=None,\n                        frames_per_eg=-1,\n                        min_deriv_time=None, max_deriv_time_relative=None,\n                        shrinkage_value=1.0, dropout_edit_string=\"\",  train_opts=\"\",\n                        get_raw_nnet_from_am=True, use_multitask_egs=False,\n                        backstitch_training_scale=0.0, backstitch_training_interval=1,\n                        compute_per_dim_accuracy=False):\n    \"\"\" Called from steps/nnet3/train_*.py scripts for one iteration of neural\n    network training\n\n    Selected args:\n        frames_per_eg: The default value -1 implies chunk_level_training, which\n            is particularly applicable to RNN training. If it is > 0, then it\n            implies frame-level training, which is applicable for DNN training.\n            If it is > 0, then for each parallel SGE job created, a different\n            frame numbered 0..frames_per_eg-1 is used.\n        shrinkage_value: If value is 1.0, no shrinkage is done; otherwise\n            parameter values are scaled by this value.\n        get_raw_nnet_from_am: If True, then the network is read and stored as\n            an acoustic model, i.e. along with the transition model (e.g. 10.mdl),\n            as opposed to a raw network e.g. 
10.raw when the value is False.\n    \"\"\"\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n\n    # check if different iterations use the same random seed\n    if os.path.exists('{0}/srand'.format(dir)):\n        try:\n            saved_srand = int(open('{0}/srand'.format(dir)).readline().strip())\n        except (IOError, ValueError):\n            logger.error(\"Exception while reading the random seed \"\n                         \"for training\")\n            raise\n        if srand != saved_srand:\n            logger.warning(\"The random seed provided to this iteration \"\n                           \"(srand={0}) is different from the one saved last \"\n                           \"time (srand={1}). Using srand={0}.\".format(\n                               srand, saved_srand))\n    else:\n        with open('{0}/srand'.format(dir), 'w') as f:\n            f.write(str(srand))\n\n    # Sets off some background jobs to compute train and\n    # validation set objectives\n    compute_train_cv_probabilities(\n        dir=dir, iter=iter, egs_dir=egs_dir,\n        run_opts=run_opts,\n        get_raw_nnet_from_am=get_raw_nnet_from_am,\n        use_multitask_egs=use_multitask_egs,\n        compute_per_dim_accuracy=compute_per_dim_accuracy)\n\n    if iter > 0:\n        # Runs in the background\n        compute_progress(dir=dir, iter=iter, egs_dir=egs_dir,\n                         run_opts=run_opts,\n                         get_raw_nnet_from_am=get_raw_nnet_from_am)\n\n    do_average = (iter > 0)\n\n\n    raw_model_string = (\"nnet3-copy --learning-rate={lr} --scale={s} \"\n                        \"{dir}/{iter}.{suf} - |\".format(\n                            lr=learning_rate, s=shrinkage_value,\n                            suf=\"mdl\" if get_raw_nnet_from_am else \"raw\",\n                            dir=dir, iter=iter))\n\n    raw_model_string = raw_model_string + 
dropout_edit_string\n\n    if do_average:\n        cur_minibatch_size_str = minibatch_size_str\n        cur_max_param_change = max_param_change\n    else:\n        # on iteration zero, use a smaller minibatch size (and we will later\n        # choose the output of just one of the jobs): the model-averaging isn't\n        # always helpful when the model is changing too fast (i.e. it can worsen\n        # the objective function), and the smaller minibatch size will help to\n        # keep the update stable.\n        cur_minibatch_size_str = common_train_lib.halve_minibatch_size_str(minibatch_size_str)\n        cur_max_param_change = float(max_param_change) / math.sqrt(2)\n\n    train_new_models(dir=dir, iter=iter, srand=srand, num_jobs=num_jobs,\n                     num_archives_processed=num_archives_processed,\n                     num_archives=num_archives,\n                     raw_model_string=raw_model_string, egs_dir=egs_dir,\n                     momentum=momentum, max_param_change=cur_max_param_change,\n                     shuffle_buffer_size=shuffle_buffer_size,\n                     minibatch_size_str=cur_minibatch_size_str,\n                     run_opts=run_opts,\n                     frames_per_eg=frames_per_eg,\n                     min_deriv_time=min_deriv_time,\n                     max_deriv_time_relative=max_deriv_time_relative,\n                     image_augmentation_opts=image_augmentation_opts,\n                     use_multitask_egs=use_multitask_egs,\n                     train_opts=train_opts,\n                     backstitch_training_scale=backstitch_training_scale,\n                     backstitch_training_interval=backstitch_training_interval)\n\n    [models_to_average, best_model] = common_train_lib.get_successful_models(\n         num_jobs, '{0}/log/train.{1}.%.log'.format(dir, iter))\n    nnets_list = []\n    for n in models_to_average:\n        nnets_list.append(\"{0}/{1}.{2}.raw\".format(dir, iter + 1, n))\n\n    if do_average:\n   
     # average the output of the different jobs.\n        common_train_lib.get_average_nnet_model(\n            dir=dir, iter=iter,\n            nnets_list=\" \".join(nnets_list),\n            run_opts=run_opts,\n            get_raw_nnet_from_am=get_raw_nnet_from_am)\n\n    else:\n        # choose the best model from different jobs\n        common_train_lib.get_best_nnet_model(\n            dir=dir, iter=iter,\n            best_model_index=best_model,\n            run_opts=run_opts,\n            get_raw_nnet_from_am=get_raw_nnet_from_am)\n\n    try:\n        for i in range(1, num_jobs + 1):\n            os.remove(\"{0}/{1}.{2}.raw\".format(dir, iter + 1, i))\n    except OSError:\n        logger.error(\"Error while trying to delete the raw models\")\n        raise\n\n    if get_raw_nnet_from_am:\n        new_model = \"{0}/{1}.mdl\".format(dir, iter + 1)\n    else:\n        new_model = \"{0}/{1}.raw\".format(dir, iter + 1)\n\n    if not os.path.isfile(new_model):\n        raise Exception(\"Could not find {0}, at the end of \"\n                        \"iteration {1}\".format(new_model, iter))\n    elif os.stat(new_model).st_size == 0:\n        raise Exception(\"{0} has size 0. 
Something went wrong in \"\n                        \"iteration {1}\".format(new_model, iter))\n    if os.path.exists(\"{0}/cache.{1}\".format(dir, iter)):\n        os.remove(\"{0}/cache.{1}\".format(dir, iter))\n\n\ndef compute_preconditioning_matrix(dir, egs_dir, num_lda_jobs, run_opts,\n                                   max_lda_jobs=None, rand_prune=4.0,\n                                   lda_opts=None, use_multitask_egs=False):\n    if max_lda_jobs is not None:\n        if num_lda_jobs > max_lda_jobs:\n            num_lda_jobs = max_lda_jobs\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n        egs_dir,\n        egs_prefix=\"egs.\",\n        archive_index=\"JOB\",\n        use_multitask_egs=use_multitask_egs)\n    scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n    egs_rspecifier = (\n        \"ark:nnet3-copy-egs {multitask_egs_opts} \"\n        \"{scp_or_ark}:{egs_dir}/egs.JOB.{scp_or_ark} ark:- |\"\n        \"\".format(egs_dir=egs_dir, scp_or_ark=scp_or_ark,\n                  multitask_egs_opts=multitask_egs_opts))\n\n    # Write stats with the same format as stats for LDA.\n    common_lib.execute_command(\n        \"\"\"{command} JOB=1:{num_lda_jobs} {dir}/log/get_lda_stats.JOB.log \\\n                nnet3-acc-lda-stats --rand-prune={rand_prune} \\\n                {dir}/init.raw \"{egs_rspecifier}\" \\\n                {dir}/JOB.lda_stats\"\"\".format(\n                    command=run_opts.command,\n                    num_lda_jobs=num_lda_jobs,\n                    dir=dir,\n                    egs_rspecifier=egs_rspecifier,\n                    rand_prune=rand_prune))\n\n    # the above command would have generated dir/{1..num_lda_jobs}.lda_stats\n    lda_stat_files = ['{0}/{1}.lda_stats'.format(dir, x) for x in range(1, num_lda_jobs + 1)]\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/sum_transform_stats.log \\\n                sum-lda-accs {dir}/lda_stats {lda_stat_files}\"\"\".format(\n           
         command=run_opts.command,\n                    dir=dir, lda_stat_files=\" \".join(lda_stat_files)))\n\n    for file in lda_stat_files:\n        try:\n            os.remove(file)\n        except OSError:\n            logger.error(\"There was an error while trying to remove \"\n                         \"lda stat files.\")\n            raise\n    # This computes a fixed affine transform in the way described in\n    # Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled\n    # variant of an LDA transform but without dimensionality reduction.\n\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/get_transform.log \\\n                nnet-get-feature-transform {lda_opts} {dir}/lda.mat \\\n                {dir}/lda_stats\"\"\".format(\n                    command=run_opts.command, dir=dir,\n                    lda_opts=lda_opts if lda_opts is not None else \"\"))\n\n    common_lib.force_symlink(\"../lda.mat\", \"{0}/configs/lda.mat\".format(dir))\n\n\ndef compute_train_cv_probabilities(dir, iter, egs_dir, run_opts,\n                                   get_raw_nnet_from_am=True,\n                                   use_multitask_egs=False,\n                                   compute_per_dim_accuracy=False):\n    if get_raw_nnet_from_am:\n        model = \"{dir}/{iter}.mdl\".format(dir=dir, iter=iter)\n    else:\n        model = \"{dir}/{iter}.raw\".format(dir=dir, iter=iter)\n\n    scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n    egs_suffix = \".scp\" if use_multitask_egs else \".egs\"\n    egs_rspecifier = (\"{0}:{1}/valid_diagnostic{2}\".format(\n        scp_or_ark, egs_dir, egs_suffix))\n\n    opts = []\n    if compute_per_dim_accuracy:\n        opts.append(\"--compute-per-dim-accuracy\")\n\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n                             egs_dir,\n                             egs_prefix=\"valid_diagnostic.\",\n                             
use_multitask_egs=use_multitask_egs)\n\n    common_lib.background_command(\n        \"\"\" {command} {dir}/log/compute_prob_valid.{iter}.log \\\n                nnet3-compute-prob \"{model}\" \\\n                \"ark,bg:nnet3-copy-egs {multitask_egs_opts} \\\n                    {egs_rspecifier} ark:- | \\\n                    nnet3-merge-egs --minibatch-size=1:64 ark:- \\\n                    ark:- |\" \"\"\".format(command=run_opts.command,\n                                        dir=dir,\n                                        iter=iter,\n                                        egs_rspecifier=egs_rspecifier,\n                                        opts=' '.join(opts), model=model,\n                                        multitask_egs_opts=multitask_egs_opts))\n\n    egs_rspecifier = (\"{0}:{1}/train_diagnostic{2}\".format(\n        scp_or_ark, egs_dir, egs_suffix))\n\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n                             egs_dir,\n                             egs_prefix=\"train_diagnostic.\",\n                             use_multitask_egs=use_multitask_egs)\n\n    common_lib.background_command(\n        \"\"\"{command} {dir}/log/compute_prob_train.{iter}.log \\\n                nnet3-compute-prob {opts} \"{model}\" \\\n                \"ark,bg:nnet3-copy-egs {multitask_egs_opts} \\\n                    {egs_rspecifier} ark:- | \\\n                    nnet3-merge-egs --minibatch-size=1:64 ark:- \\\n                    ark:- |\" \"\"\".format(command=run_opts.command,\n                                        dir=dir,\n                                        iter=iter,\n                                        egs_rspecifier=egs_rspecifier,\n                                        opts=' '.join(opts), model=model,\n                                        multitask_egs_opts=multitask_egs_opts))\n\n\ndef compute_progress(dir, iter, egs_dir,\n                     run_opts,\n                     
get_raw_nnet_from_am=True):\n    suffix = \"mdl\" if get_raw_nnet_from_am else \"raw\"\n    prev_model = '{0}/{1}.{2}'.format(dir, iter - 1, suffix)\n    model = '{0}/{1}.{2}'.format(dir, iter, suffix)\n\n    common_lib.background_command(\n            \"\"\"{command} {dir}/log/progress.{iter}.log \\\n                    nnet3-info {model} '&&' \\\n                    nnet3-show-progress --use-gpu=no {prev_model} {model} \"\"\"\n        ''.format(command=run_opts.command, dir=dir,\n                  iter=iter, model=model, prev_model=prev_model))\n\n    if iter % 10 == 0 and iter > 0:\n        # Every 10 iters, print some more detailed information.\n        # full_progress.X.log contains some diagnostics of the difference in\n        # parameters, printed in the same format as from nnet3-info.\n        common_lib.background_command(\n            \"\"\"{command} {dir}/log/full_progress.{iter}.log \\\n            nnet3-show-progress --use-gpu=no --verbose=2 {prev_model} {model}\n        \"\"\".format(command=run_opts.command,\n                   dir=dir,\n                   iter=iter,\n                   model=model,\n                   prev_model=prev_model))\n        # full_info.X.log is just the nnet3-info of the model, with the --verbose=2\n        # option which includes stats on the singular values of the parameter matrices.\n        common_lib.background_command(\n            \"\"\"{command} {dir}/log/full_info.{iter}.log \\\n            nnet3-info --verbose=2 {model}\n        \"\"\".format(command=run_opts.command,\n                   dir=dir,\n                   iter=iter,\n                   model=model))\n\n\n\ndef combine_models(dir, num_iters, models_to_combine, egs_dir,\n                   minibatch_size_str,\n                   run_opts,\n                   chunk_width=None, get_raw_nnet_from_am=True,\n                   max_objective_evaluations=30,\n                   use_multitask_egs=False,\n                   compute_per_dim_accuracy=False):\n    
\"\"\" Function to do model combination\n\n    In the nnet3 setup, the logic\n    for doing averaging of subsets of the models in the case where\n    there are too many models to reliably estimate interpolation\n    factors (max_models_combine) is moved into nnet3-combine.\n    \"\"\"\n    raw_model_strings = []\n    logger.info(\"Combining {0} models.\".format(models_to_combine))\n\n    models_to_combine.add(num_iters)\n\n    for iter in sorted(models_to_combine):\n        suffix = \"mdl\" if get_raw_nnet_from_am else \"raw\"\n        model_file = '{0}/{1}.{2}'.format(dir, iter, suffix)\n        if not os.path.exists(model_file):\n            raise Exception('Model file {0} missing'.format(model_file))\n        raw_model_strings.append(model_file)\n\n    if get_raw_nnet_from_am:\n        out_model = (\"| nnet3-am-copy --set-raw-nnet=- {dir}/{num_iters}.mdl \"\n                     \"{dir}/combined.mdl\".format(dir=dir, num_iters=num_iters))\n    else:\n        out_model = '{dir}/final.raw'.format(dir=dir)\n\n\n    # We reverse the order of the raw model strings so that the freshest one\n    # goes first.  
This is important for systems that include batch\n    # normalization-- it means that the freshest batch-norm stats are used.\n    # Since the batch-norm stats are not technically parameters, they are not\n    # combined in the combination code, they are just obtained from the first\n    # model.\n    raw_model_strings = list(reversed(raw_model_strings))\n\n    scp_or_ark = \"scp\" if use_multitask_egs else \"ark\"\n    egs_suffix = \".scp\" if use_multitask_egs else \".egs\"\n\n    egs_rspecifier = \"{0}:{1}/combine{2}\".format(scp_or_ark,\n                                                 egs_dir, egs_suffix)\n\n    multitask_egs_opts = common_train_lib.get_multitask_egs_opts(\n                             egs_dir,\n                             egs_prefix=\"combine.\",\n                             use_multitask_egs=use_multitask_egs)\n    common_lib.execute_command(\n        \"\"\"{command} {combine_queue_opt} {dir}/log/combine.log \\\n                nnet3-combine {combine_gpu_opt} \\\n                --max-objective-evaluations={max_objective_evaluations} \\\n                --verbose=3 {raw_models} \\\n                \"ark,bg:nnet3-copy-egs {multitask_egs_opts} \\\n                    {egs_rspecifier} ark:- | \\\n                      nnet3-merge-egs --minibatch-size=1:{mbsize} ark:- ark:- |\" \\\n                \"{out_model}\"\n        \"\"\".format(command=run_opts.command,\n                   combine_queue_opt=run_opts.combine_queue_opt,\n                   combine_gpu_opt=run_opts.combine_gpu_opt,\n                   dir=dir, raw_models=\" \".join(raw_model_strings),\n                   max_objective_evaluations=max_objective_evaluations,\n                   egs_rspecifier=egs_rspecifier,\n                   mbsize=minibatch_size_str,\n                   out_model=out_model,\n                   multitask_egs_opts=multitask_egs_opts))\n\n    # Compute the probability of the final, combined model with\n    # the same subset we used for the previous 
compute_probs, as the\n    # different subsets will lead to different probs.\n    if get_raw_nnet_from_am:\n        compute_train_cv_probabilities(\n            dir=dir, iter='combined', egs_dir=egs_dir,\n            run_opts=run_opts, use_multitask_egs=use_multitask_egs,\n            compute_per_dim_accuracy=compute_per_dim_accuracy)\n    else:\n        compute_train_cv_probabilities(\n            dir=dir, iter='final', egs_dir=egs_dir,\n            run_opts=run_opts, get_raw_nnet_from_am=False,\n            use_multitask_egs=use_multitask_egs,\n            compute_per_dim_accuracy=compute_per_dim_accuracy)\n\n\ndef get_realign_iters(realign_times, num_iters,\n                      num_jobs_initial, num_jobs_final):\n    \"\"\" Takes the realign_times string and identifies the approximate\n        iterations at which realignments have to be done.\n\n    realign_times is a space-separated string of values between 0 and 1\n    \"\"\"\n\n    realign_iters = []\n    for realign_time in realign_times.split():\n        realign_time = float(realign_time)\n        assert 0 < realign_time < 1\n        if num_jobs_initial == num_jobs_final:\n            realign_iter = int(0.5 + num_iters * realign_time)\n        else:\n            realign_iter = math.sqrt((1 - realign_time)\n                                     * math.pow(num_jobs_initial, 2)\n                                     + realign_time * math.pow(num_jobs_final,\n                                                               2))\n            realign_iter = realign_iter - num_jobs_initial\n            # Use true division: the quotient is a fraction in (0, 1), so\n            # floor division would always yield 0.\n            realign_iter = realign_iter / (num_jobs_final - num_jobs_initial)\n            realign_iter = realign_iter * num_iters\n        realign_iters.append(int(realign_iter))\n\n    return realign_iters\n\n\ndef align(dir, data, lang, run_opts, iter=None,\n          online_ivector_dir=None):\n\n    alidir = '{dir}/ali{ali_suffix}'.format(\n            dir=dir,\n            
ali_suffix=\"_iter_{0}\".format(iter) if iter is not None else \"\")\n\n    logger.info(\"Aligning the data{gpu}with {num_jobs} jobs.\".format(\n        gpu=\" using gpu \" if run_opts.realign_use_gpu else \" \",\n        num_jobs=run_opts.realign_num_jobs))\n    common_lib.execute_command(\n        \"\"\"steps/nnet3/align.sh --nj {num_jobs_align} \\\n                --cmd \"{align_cmd} {align_queue_opt}\" \\\n                --use-gpu {align_use_gpu} \\\n                --online-ivector-dir \"{online_ivector_dir}\" \\\n                --iter \"{iter}\" {data} {lang} {dir} {alidir}\"\"\".format(\n                    dir=dir, align_use_gpu=(\"yes\"\n                                            if run_opts.realign_use_gpu\n                                            else \"no\"),\n                    align_cmd=run_opts.realign_command,\n                    align_queue_opt=run_opts.realign_queue_opt,\n                    num_jobs_align=run_opts.realign_num_jobs,\n                    online_ivector_dir=(online_ivector_dir\n                                        if online_ivector_dir is not None\n                                        else \"\"),\n                    iter=iter if iter is not None else \"\",\n                    alidir=alidir,\n                    lang=lang, data=data))\n    return alidir\n\n\ndef realign(dir, iter, feat_dir, lang, prev_egs_dir, cur_egs_dir,\n            prior_subset_size, num_archives,\n            run_opts, online_ivector_dir=None):\n    raise Exception(\"Realignment stage has not been implemented in nnet3\")\n    logger.info(\"Getting average posterior for purposes of adjusting \"\n                \"the priors.\")\n    # Note: this just uses CPUs, using a smallish subset of data.\n    # always use the first egs archive, which makes the script simpler;\n    # we're using different random subsets of it.\n\n    avg_post_vec_file = compute_average_posterior(\n            dir=dir, iter=iter, egs_dir=prev_egs_dir,\n            
num_archives=num_archives, prior_subset_size=prior_subset_size,\n            run_opts=run_opts)\n\n    avg_post_vec_file = \"{dir}/post.{iter}.vec\".format(dir=dir, iter=iter)\n    logger.info(\"Re-adjusting priors based on computed posteriors\")\n    model = '{0}/{1}.mdl'.format(dir, iter)\n    adjust_am_priors(dir, model, avg_post_vec_file, model, run_opts)\n\n    alidir = align(dir, feat_dir, lang, run_opts, iter,\n                   online_ivector_dir)\n    common_lib.execute_command(\n        \"\"\"steps/nnet3/relabel_egs.sh --cmd \"{command}\" --iter {iter} \\\n                {alidir} {prev_egs_dir} {cur_egs_dir}\"\"\".format(\n                    command=run_opts.command,\n                    iter=iter,\n                    dir=dir,\n                    alidir=alidir,\n                    prev_egs_dir=prev_egs_dir,\n                    cur_egs_dir=cur_egs_dir))\n\n\ndef adjust_am_priors(dir, input_model, avg_posterior_vector, output_model,\n                     run_opts):\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/adjust_priors.final.log \\\n                nnet3-am-adjust-priors \"{input_model}\" {avg_posterior_vector} \\\n                \"{output_model}\" \"\"\".format(\n                    command=run_opts.command,\n                    dir=dir, input_model=input_model,\n                    avg_posterior_vector=avg_posterior_vector,\n                    output_model=output_model))\n\n\ndef compute_average_posterior(dir, iter, egs_dir, num_archives,\n                              prior_subset_size,\n                              run_opts, get_raw_nnet_from_am=True):\n    \"\"\" Computes the average posterior of the network\n    \"\"\"\n    for file in glob.glob('{0}/post.{1}.*.vec'.format(dir, iter)):\n        os.remove(file)\n\n    if run_opts.num_jobs_compute_prior > num_archives:\n        egs_part = 1\n    else:\n        egs_part = 'JOB'\n\n    suffix = \"mdl\" if get_raw_nnet_from_am else \"raw\"\n    model = 
\"{0}/{1}.{2}\".format(dir, iter, suffix)\n\n    common_lib.execute_command(\n        \"\"\"{command} JOB=1:{num_jobs_compute_prior} {prior_queue_opt} \\\n                {dir}/log/get_post.{iter}.JOB.log \\\n                nnet3-copy-egs \\\n                ark:{egs_dir}/egs.{egs_part}.ark ark:- \\| \\\n                nnet3-subset-egs --srand=JOB --n={prior_subset_size} \\\n                ark:- ark:- \\| \\\n                nnet3-merge-egs --minibatch-size=128 ark:- ark:- \\| \\\n                nnet3-compute-from-egs {prior_gpu_opt} --apply-exp=true \\\n                \"{model}\" ark:- ark:- \\| \\\n                matrix-sum-rows ark:- ark:- \\| vector-sum ark:- \\\n                {dir}/post.{iter}.JOB.vec\"\"\".format(\n                    command=run_opts.command,\n                    dir=dir, model=model,\n                    num_jobs_compute_prior=run_opts.num_jobs_compute_prior,\n                    prior_queue_opt=run_opts.prior_queue_opt,\n                    iter=iter, prior_subset_size=prior_subset_size,\n                    egs_dir=egs_dir, egs_part=egs_part,\n                    prior_gpu_opt=run_opts.prior_gpu_opt))\n\n    # make sure there is time for $dir/post.{iter}.*.vec to appear.\n    time.sleep(5)\n    avg_post_vec_file = \"{dir}/post.{iter}.vec\".format(dir=dir, iter=iter)\n    common_lib.execute_command(\n        \"\"\"{command} {dir}/log/vector_sum.{iter}.log \\\n                vector-sum {dir}/post.{iter}.*.vec {output_file}\n        \"\"\".format(command=run_opts.command,\n                   dir=dir, iter=iter, output_file=avg_post_vec_file))\n\n    for file in glob.glob('{0}/post.{1}.*.vec'.format(dir, iter)):\n        os.remove(file)\n    return avg_post_vec_file\n"
  },
  {
    "path": "egs/steps/libs/nnet3/train/frame_level_objf/raw_model.py",
    "content": "\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This is a module with method which will be used by scripts for\ntraining of deep neural network raw model (i.e. without acoustic model)\nwith frame-level objective.\n\"\"\"\n\nimport logging\n\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.addHandler(logging.NullHandler())\n\n\ndef generate_egs_using_targets(data, targets_scp, egs_dir,\n                               left_context, right_context,\n                               run_opts, stage=0,\n                               left_context_initial=-1, right_context_final=-1,\n                               online_ivector_dir=None,\n                               target_type='dense', num_targets=-1,\n                               samples_per_iter=20000, frames_per_eg_str=\"20\",\n                               srand=0, egs_opts=None, cmvn_opts=None):\n    \"\"\" Wrapper for calling steps/nnet3/get_egs_targets.sh\n\n    This method generates egs directly from an scp file of targets, instead of\n    getting them from the alignments (as with the method generate_egs() in\n    module nnet3.train.frame_level_objf.acoustic_model).\n\n    Args:\n        target_type: \"dense\" if the targets are in matrix format\n                     \"sparse\" if the targets are in posterior format\n        num_targets: must be explicitly specified for \"sparse\" targets.\n            For \"dense\" targets, this option is ignored and the target dim\n            is computed from the target matrix dimension\n        For other options, see the file steps/nnet3/get_egs_targets.sh\n    \"\"\"\n\n    if target_type == 'dense':\n        num_targets = common_lib.get_feat_dim_from_scp(targets_scp)\n    else:\n        if num_targets == -1:\n            raise Exception(\"--num-targets is required if \"\n                            \"target-type is sparse\")\n\n    common_lib.execute_command(\n    
    \"\"\"steps/nnet3/get_egs_targets.sh {egs_opts} \\\n                --cmd \"{command}\" \\\n                --cmvn-opts \"{cmvn_opts}\" \\\n                --online-ivector-dir \"{ivector_dir}\" \\\n                --left-context {left_context} \\\n                --right-context {right_context} \\\n                --left-context-initial {left_context_initial} \\\n                --right-context-final {right_context_final} \\\n                --stage {stage} \\\n                --samples-per-iter {samples_per_iter} \\\n                --frames-per-eg {frames_per_eg_str} \\\n                --srand {srand} \\\n                --target-type {target_type} \\\n                --num-targets {num_targets} \\\n                {data} {targets_scp} {egs_dir}\n        \"\"\".format(command=run_opts.egs_command,\n                   cmvn_opts=cmvn_opts if cmvn_opts is not None else '',\n                   ivector_dir=(online_ivector_dir\n                                if online_ivector_dir is not None\n                                else ''),\n                   left_context=left_context,\n                   right_context=right_context,\n                   left_context_initial=left_context_initial,\n                   right_context_final=right_context_final,\n                   stage=stage, samples_per_iter=samples_per_iter,\n                   frames_per_eg_str=frames_per_eg_str, srand=srand,\n                   num_targets=num_targets,\n                   data=data,\n                   targets_scp=targets_scp, target_type=target_type,\n                   egs_dir=egs_dir,\n                   egs_opts=egs_opts if egs_opts is not None else ''))\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/__init__.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2016    Yiming Wang\n# Apache 2.0.\n\n\"\"\"This library has classes and methods to form neural network computation graphs,\nin the nnet3 framework, using higher level abstractions called 'layers'\n(e.g. sub-graphs like LSTMs).\n\nNote: We use the term 'layer' even though the computation graph can have a\nhighly non-linear structure, because other terms such as nodes/components are\nalready used in the C++ codebase of nnet3.\n\nThis is basically a config parser module, where the configs have very concise\ndescriptions of a neural network.\n\nThis module has methods to convert the xconfigs into configs interpretable by\nthe nnet3 C++ library.\n\nIt generates three different configs:\n 'init.config' : which is the config with the info necessary for computing\n               the preconditioning matrix i.e., LDA transform\n               e.g.\n                 input-node name=input dim=40\n                 input-node name=ivector dim=100\n                 output-node name=output input=Append(Offset(input, -2), Offset(input, -1), input, Offset(input, 1), Offset(input, 2), ReplaceIndex(ivector, t, 0)) objective=linear\n\n 'ref.config' : which is a version of the config file used to generate\n                a model for getting left and right context (it doesn't read\n                anything for the LDA-like transform and/or\n                presoftmax-prior-scale components)\n\n 'final.config' : which has the actual config used to initialize the model used\n                 in training i.e., it has file paths for LDA transform and\n                 other initialization files\n\"\"\"\n\n\n__all__ = [\"utils\", \"layers\", \"parser\"]\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/attention.py",
    "content": "# Copyright 2017    Johns Hopkins University (Dan Povey)\n#           2017    Hossein Hadian\n# Apache 2.0.\n\n\"\"\" This module has the implementation of attention layers.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport math\nimport re\nimport sys\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n# This class is for parsing lines like\n#  'attention-renorm-layer num-heads=10 value-dim=50 key-dim=50 time-stride=3 num-left-inputs=5 num-right-inputs=2.'\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'               [Descriptor giving the input of the layer.]\n#   self-repair-scale=1.0e-05  [Affects relu, sigmoid and tanh layers.]\n#   learning-rate-factor=1.0   [This can be used to make the affine component\n#                               train faster or slower].\n#   Documentation for the rest of the parameters (related to the\n#   attention component) can be found in nnet-attention-component.h\n\n\nclass XconfigAttentionLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        # Here we just list some likely combinations.. 
you can just add any\n        # combinations you want to use, to this list.\n        assert first_token in ['attention-renorm-layer',\n                               'attention-relu-renorm-layer',\n                               'attention-relu-batchnorm-layer',\n                               'relu-renorm-attention-layer']\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        self.config = { 'input':'[-1]',\n                        'dim': -1,\n                        'max-change' : 0.75,\n                        'self-repair-scale' : 1.0e-05,\n                        'target-rms' : 1.0,\n                        'learning-rate-factor' : 1.0,\n                        'ng-affine-options' : '',\n                        'l2-regularize': 0.0,\n                        'num-left-inputs-required': -1,\n                        'num-right-inputs-required': -1,\n                        'output-context': True,\n                        'time-stride': 1,\n                        'num-heads': 1,\n                        'key-dim': -1,\n                        'key-scale': 0.0,\n                        'value-dim': -1,\n                        'num-left-inputs': -1,\n                        'num-right-inputs': -1,\n                        'dropout-proportion': 0.5}  # dropout-proportion only\n                                                    # affects layers with\n                                                    # 'dropout' in the name.\n\n    def check_configs(self):\n        if self.config['self-repair-scale'] < 0.0 or self.config['self-repair-scale'] > 1.0:\n            raise RuntimeError(\"self-repair-scale has invalid value {0}\"\n                               .format(self.config['self-repair-scale']))\n        if self.config['target-rms'] < 0.0:\n            raise RuntimeError(\"target-rms 
has invalid value {0}\"\n                               .format(self.config['target-rms']))\n        if self.config['learning-rate-factor'] <= 0.0:\n            raise RuntimeError(\"learning-rate-factor has invalid value {0}\"\n                               .format(self.config['learning-rate-factor']))\n        for conf in ['value-dim', 'key-dim',\n                     'num-left-inputs', 'num-right-inputs']:\n            if self.config[conf] < 0:\n                raise RuntimeError(\"{0} has invalid value {1}\"\n                                   .format(conf, self.config[conf]))\n        if self.config['key-scale'] == 0.0:\n            self.config['key-scale'] = 1.0 / math.sqrt(self.config['key-dim'])\n\n    def output_name(self, auxiliary_output=None):\n        # at a later stage we might want to expose even the pre-nonlinearity\n        # vectors\n        assert auxiliary_output is None\n\n        split_layer_name = self.layer_type.split('-')\n        assert split_layer_name[-1] == 'layer'\n        last_nonlinearity = split_layer_name[-2]\n        # return something like: layer3.renorm\n        return '{0}.{1}'.format(self.name, last_nonlinearity)\n\n    def attention_input_dim(self):\n        context_dim = (self.config['num-left-inputs'] +\n                       self.config['num-right-inputs'] + 1)\n        num_heads = self.config['num-heads']\n        key_dim = self.config['key-dim']\n        value_dim = self.config['value-dim']\n        query_dim = key_dim + context_dim\n        return num_heads * (key_dim + value_dim + query_dim)\n\n    def attention_output_dim(self):\n        context_dim = (self.config['num-left-inputs'] +\n                       self.config['num-right-inputs'] + 1)\n        num_heads = self.config['num-heads']\n        value_dim = self.config['value-dim']\n        return (num_heads *\n                (value_dim +\n                 (context_dim if self.config['output-context'] else 0)))\n\n    def output_dim(self, auxiliary_output = 
None):\n        return self.attention_output_dim()\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n\n    def _generate_config(self):\n        split_layer_name = self.layer_type.split('-')\n        assert split_layer_name[-1] == 'layer'\n        nonlinearities = split_layer_name[:-1]\n\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n\n        # the child classes e.g. tdnn might want to process the input\n        # before adding the other components\n\n        return self._add_components(input_desc, input_dim, nonlinearities)\n\n    def _add_components(self, input_desc, input_dim, nonlinearities):\n        dim = self.attention_input_dim()\n        self_repair_scale = self.config['self-repair-scale']\n        target_rms = self.config['target-rms']\n        max_change = self.config['max-change']\n        ng_affine_options = self.config['ng-affine-options']\n        l2_regularize = self.config['l2-regularize']\n        learning_rate_factor = self.config['learning-rate-factor']\n        learning_rate_option = ('learning-rate-factor={0}'.format(learning_rate_factor)\n                                if learning_rate_factor != 1.0 else '')\n        l2_regularize_option = ('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n        configs = []\n        # First the affine node.\n        line = ('component name={0}.affine'\n                ' 
type=NaturalGradientAffineComponent'\n                ' input-dim={1}'\n                ' output-dim={2}'\n                ' max-change={3}'\n                ' {4} {5} {6}'\n                ''.format(self.name, input_dim, dim,\n                          max_change, ng_affine_options,\n                          learning_rate_option, l2_regularize_option))\n        configs.append(line)\n\n        line = ('component-node name={0}.affine'\n                ' component={0}.affine input={1}'\n                ''.format(self.name, input_desc))\n        configs.append(line)\n        cur_node = '{0}.affine'.format(self.name)\n\n        for nonlinearity in nonlinearities:\n            if nonlinearity == 'relu':\n                line = ('component name={0}.{1}'\n                        ' type=RectifiedLinearComponent dim={2}'\n                        ' self-repair-scale={3}'\n                        ''.format(self.name, nonlinearity, dim,\n                            self_repair_scale))\n\n            elif nonlinearity == 'attention':\n                line = ('component name={0}.{1}'\n                        ' type=RestrictedAttentionComponent'\n                        ' value-dim={2}'\n                        ' key-dim={3}'\n                        ' num-left-inputs={4}'\n                        ' num-right-inputs={5}'\n                        ' num-left-inputs-required={6}'\n                        ' num-right-inputs-required={7}'\n                        ' output-context={8}'\n                        ' time-stride={9}'\n                        ' num-heads={10}'\n                        ' key-scale={11}'\n                        ''.format(self.name, nonlinearity,\n                                  self.config['value-dim'],\n                                  self.config['key-dim'],\n                                  self.config['num-left-inputs'],\n                                  self.config['num-right-inputs'],\n                                  
self.config['num-left-inputs-required'],\n                                  self.config['num-right-inputs-required'],\n                                  self.config['output-context'],\n                                  self.config['time-stride'],\n                                  self.config['num-heads'],\n                                  self.config['key-scale']))\n                dim = self.attention_output_dim()\n\n            elif nonlinearity == 'sigmoid':\n                line = ('component name={0}.{1}'\n                        ' type=SigmoidComponent dim={2}'\n                        ' self-repair-scale={3}'\n                        ''.format(self.name, nonlinearity, dim,\n                            self_repair_scale))\n\n            elif nonlinearity == 'tanh':\n                line = ('component name={0}.{1}'\n                        ' type=TanhComponent dim={2}'\n                        ' self-repair-scale={3}'\n                        ''.format(self.name, nonlinearity, dim,\n                            self_repair_scale))\n\n            elif nonlinearity == 'renorm':\n                line = ('component name={0}.{1}'\n                        ' type=NormalizeComponent dim={2}'\n                        ' target-rms={3}'\n                        ''.format(self.name, nonlinearity, dim,\n                            target_rms))\n\n            elif nonlinearity == 'batchnorm':\n                line = ('component name={0}.{1}'\n                        ' type=BatchNormComponent dim={2}'\n                        ' target-rms={3}'\n                        ''.format(self.name, nonlinearity, dim,\n                            target_rms))\n\n            elif nonlinearity == 'dropout':\n                line = ('component name={0}.{1} type=DropoutComponent '\n                           'dim={2} dropout-proportion={3}'.format(\n                               self.name, nonlinearity, dim,\n                               self.config['dropout-proportion']))\n\n          
  else:\n                raise RuntimeError(\"Unknown nonlinearity type: {0}\"\n                                   .format(nonlinearity))\n\n            configs.append(line)\n            line = ('component-node name={0}.{1}'\n                    ' component={0}.{1} input={2}'\n                    ''.format(self.name, nonlinearity, cur_node))\n\n            configs.append(line)\n            cur_node = '{0}.{1}'.format(self.name, nonlinearity)\n        return configs\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/basic_layers.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2017    Google Inc. (vpeddinti@google.com)\n#           2017    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This module contains the parent class from which all layers are inherited\nand some basic layer definitions.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport math\nimport re\nimport sys\nimport libs.nnet3.xconfig.utils as xutils\nimport libs.common as common_lib\n\n\nclass XconfigLayerBase(object):\n    \"\"\" A base-class for classes representing layers of xconfig files.\n    \"\"\"\n\n    def __init__(self, first_token, key_to_value, all_layers):\n        \"\"\"\n         first_token: first token on the xconfig line, e.g. 'affine-layer'.\n         key_to_value: dictionary with parameter values\n             { 'name':'affine1',\n               'input':'Append(0, 1, 2, ReplaceIndex(ivector, t, 0))',\n               'dim':'1024' }.\n             The only required and 'special' values that are dealt with directly\n             at this level, are 'name' and 'input'. 
The rest are put in\n             self.config and are dealt with by the child classes' init functions.\n         all_layers: An array of objects inheriting XconfigLayerBase for all\n                    previously parsed layers.\n        \"\"\"\n\n        self.layer_type = first_token\n        if 'name' not in key_to_value:\n            raise RuntimeError(\"Expected 'name' to be specified.\")\n        self.name = key_to_value['name']\n        if not xutils.is_valid_line_name(self.name):\n            raise RuntimeError(\"Invalid value: name={0}\".format(\n                key_to_value['name']))\n\n        # It is possible to have two layers with a same name in 'all_layer', if\n        # the layer type for one of them is 'existing'.\n        # Layers of type 'existing' are corresponding to the component-node names\n        # in the existing model, which we are adding layers to them.\n        # 'existing' layers are not presented in any config file, and new layer\n        # with the same name can exist in 'all_layers'.\n        # e.g. 
It is possible to have 'output-node' with name 'output' in the\n        # existing model, which is added to all_layers using layer type 'existing',\n        # and 'output-node' of type 'output-layer' with the same name 'output' in\n        # 'all_layers'.\n        for prev_layer in all_layers:\n            if (self.name == prev_layer.name and\n                prev_layer.layer_type != 'existing'):\n                raise RuntimeError(\"Name '{0}' is used for more than one \"\n                                   \"layer.\".format(self.name))\n\n        self.config = {}\n        # the following, which should be overridden in the child class, sets\n        # default config parameters in self.config.\n        self.set_default_configs()\n        # The following is not to be reimplemented in child classes;\n        # it sets the config values to those specified by the user, and\n        # parses any Descriptors.\n        self.set_configs(key_to_value, all_layers)\n        # This method sets the derived default config values,\n        # i.e., config values when not specified can be derived from\n        # other values. 
It can be overridden in the child class.\n        self.set_derived_configs()\n        # the following, which should be overridden in the child class, checks\n        # that the config parameters that have been set are reasonable.\n        self.check_configs()\n\n\n    def set_configs(self, key_to_value, all_layers):\n        \"\"\" Sets the config variables.\n            We broke this code out of __init__ for clarity.\n            the child-class constructor will deal with the configuration values\n            in a more specific way.\n        \"\"\"\n\n        # First check that there are no keys that don't correspond to any config\n        # parameter of this layer, and if so, raise an exception with an\n        # informative message saying what configs are allowed.\n        for key, value in key_to_value.items():\n            if key != 'name':\n                if key not in self.config:\n                    configs = ' '.join([('{0}->\"{1}\"'.format(x, y) if isinstance(y, str)\n                                         else '{0}->{1}'.format(x, y))\n                                        for x, y in self.config.items()])\n                    raise RuntimeError(\"Configuration value {0}={1} was not \"\n                                       \"expected in layer of type {2}; allowed \"\n                                       \"configs with their defaults: {3}\"\n                                       \"\" .format(key, value, self.layer_type, configs))\n\n        for key, value in key_to_value.items():\n            if key != 'name':\n                assert key in self.config  # we checked above.\n                self.config[key] = xutils.convert_value_to_type(key,\n                                                                type(self.config[key]),\n                                                                value)\n        self.descriptors = dict()\n        self.descriptor_dims = dict()\n        # Parse Descriptors and get their dims and their 'final' string 
form.\n        # in self.descriptors[key]\n        for key in self.get_input_descriptor_names():\n            if key not in self.config:\n                raise RuntimeError(\"{0}: object of type {1} needs to override\"\n                                   \" get_input_descriptor_names().\"\n                                   \"\".format(sys.argv[0], str(type(self))))\n\n            descriptor_string = self.config[key]  # input string.\n            assert isinstance(descriptor_string, str)\n            desc = self.convert_to_descriptor(descriptor_string, all_layers)\n            desc_dim = self.get_dim_for_descriptor(desc, all_layers)\n            desc_norm_str = desc.str()\n\n            # desc_output_str contains the \"final\" component names, those that\n            # appear in the actual config file (i.e. not names like\n            # 'layer.auxiliary_output'); that's how it differs from desc_norm_str.\n            # Note: it's possible that the two strings might be the same in\n            # many, even most, cases-- it depends whether\n            # output_name(self, auxiliary_output)\n            # returns self.get_name() + '.' 
+ auxiliary_output\n            # when auxiliary_output is not None.\n            # That's up to the designer of the layer type.\n            desc_output_str = self.get_string_for_descriptor(desc, all_layers)\n            self.descriptors[key] = {'string': desc,\n                                     'normalized-string': desc_norm_str,\n                                     'final-string': desc_output_str,\n                                     'dim': desc_dim}\n\n            # the following helps to check the code by parsing it again.\n            desc2 = self.convert_to_descriptor(desc_norm_str, all_layers)\n            desc_norm_str2 = desc2.str()\n            # if the following ever fails we'll have to do some debugging.\n            if desc_norm_str != desc_norm_str2:\n                raise RuntimeError(\"Likely code error: '{0}' != '{1}'\"\n                                   \"\".format(desc_norm_str, desc_norm_str2))\n\n    def str(self):\n        \"\"\"Converts 'this' to a string which could be printed to\n        an xconfig file; in xconfig_to_configs.py we actually expand all the\n        lines to strings and write it as xconfig.expanded as a reference\n        (so users can see any defaults).\n        \"\"\"\n\n        list_of_entries = ['{0} name={1}'.format(self.layer_type, self.name)]\n        for key, value in sorted(self.config.items()):\n            if isinstance(value, str) and re.search('=', value):\n                # the value is a string that contains an '=' sign, so we need to\n                # enclose it in double-quotes, otherwise we wouldn't be able to\n                # parse from that output.\n                if re.search('\"', value):\n                    print(\"Warning: config '{0}={1}' contains both double-quotes \"\n                          \"and equals sign; it will not be possible to parse it \"\n                          \"from the file.\".format(key, value), file=sys.stderr)\n                
list_of_entries.append('{0}=\"{1}\"'.format(key, value))\n            else:\n                list_of_entries.append('{0}={1}'.format(key, value))\n\n        return ' '.join(list_of_entries)\n\n    def __str__(self):\n        return self.str()\n\n    def normalize_descriptors(self):\n        \"\"\"Converts any config variables in self.config which correspond to\n        Descriptors, into a 'normalized form' derived from parsing them as\n        Descriptors, replacing things like [-1] with the actual layer names,\n        and regenerating them as strings.  We stored this when the object was\n        initialized, in self.descriptors; this function just copies them back\n        to the config.\n        \"\"\"\n\n        for key, desc_str_dict in self.descriptors.items():\n            self.config[key] = desc_str_dict['normalized-string']\n\n    def convert_to_descriptor(self, descriptor_string, all_layers):\n        \"\"\"Convenience function intended to be called from child classes,\n        converts a string representing a descriptor ('descriptor_string')\n        into an object of type Descriptor, and returns it. 
It needs 'self' and\n        'all_layers' (where 'all_layers' is a list of objects of type\n        XconfigLayerBase) so that it can work out a list of the names of other\n        layers, and get dimensions from them.\n        \"\"\"\n\n        prev_names = xutils.get_prev_names(all_layers, self)\n        tokens = xutils.tokenize_descriptor(descriptor_string, prev_names)\n        pos = 0\n        (descriptor, pos) = xutils.parse_new_descriptor(tokens, pos, prev_names)\n        # note: 'pos' should point to the 'end of string' marker\n        # that terminates 'tokens'.\n        if pos != len(tokens) - 1:\n            raise RuntimeError(\"Parsing Descriptor, saw junk at end: {0}\"\n                               \"\".format(' '.join(tokens[pos:-1])))\n        return descriptor\n\n    def get_dim_for_descriptor(self, descriptor, all_layers):\n        \"\"\"Returns the dimension of a Descriptor object. This is a convenience\n        function used in set_configs.\n        \"\"\"\n\n        layer_to_dim_func = \\\n                lambda name: xutils.get_dim_from_layer_name(all_layers, self,\n                                                            name)\n        return descriptor.dim(layer_to_dim_func)\n\n    def get_string_for_descriptor(self, descriptor, all_layers):\n        \"\"\"Returns the 'final' string form of a Descriptor object,\n        as could be used in config files. This is a convenience function\n        provided for use in child classes;\n        \"\"\"\n\n        layer_to_string_func = \\\n                lambda name: xutils.get_string_from_layer_name(all_layers,\n                                                               self, name)\n        return descriptor.config_string(layer_to_string_func)\n\n    def get_name(self):\n        \"\"\"Returns the name of this layer, e.g. 'affine1'.  
It does not\n        necessarily correspond to a component name.\n        \"\"\"\n\n        return self.name\n\n    ######  Functions that might be overridden by the child class: #####\n\n    def set_default_configs(self):\n        \"\"\"Child classes should override this.\n        \"\"\"\n\n        raise Exception(\"Child classes must override set_default_configs().\")\n\n    def set_derived_configs(self):\n        \"\"\"This is expected to be called after set_configs and before\n        check_configs().\n        \"\"\"\n        if 'dim' in self.config and self.config['dim'] <= 0:\n            self.config['dim'] = self.descriptors['input']['dim']\n\n    def check_configs(self):\n        \"\"\"child classes should override this.\n        \"\"\"\n\n        pass\n\n    def get_input_descriptor_names(self):\n        \"\"\"This function, which may be (but usually will not have to be)\n        overridden by child classes, returns a list of names of the input\n        descriptors expected by this component. Typically this would just\n        return ['input'] as most layers just have one 'input'. However some\n        layers might require more inputs (e.g. cell state of previous LSTM layer\n        in Highway LSTMs). It is used in the function 'normalize_descriptors()'.\n        This implementation will work for layer types whose only\n        Descriptor-valued config is 'input'.\n        If a child class adds more inputs, or does not have an input\n        (e.g. the XconfigInputLayer), it should override this function's\n        implementation to something like: `return ['input', 'input2']`\n        \"\"\"\n\n        return ['input']\n\n    def auxiliary_outputs(self):\n        \"\"\"Returns a list of all auxiliary outputs that this layer supports.\n        These are either 'None' for the regular output, or a string\n        (e.g. 'projection' or 'memory_cell') for any auxiliary outputs that\n        the layer might provide.  
Most layer types will not need to override\n        this.\n        \"\"\"\n\n        return [None]\n\n    def output_name(self, auxiliary_output=None):\n        \"\"\"Called with auxiliary_output is None, this returns the component-node\n        name of the principal output of the layer (or if you prefer, the text\n        form of a descriptor that gives you such an output; such as\n        Append(some_node, some_other_node)).\n        The 'auxiliary_output' argument is a text value that is designed for\n        extensions to layers that have additional auxiliary outputs.\n        For example, to implement a highway LSTM you need the memory-cell of a\n        layer, so you might allow auxiliary_output='memory_cell' for such a\n        layer type, and it would return the component node or a suitable\n        Descriptor: something like 'lstm3.c_t'\n        \"\"\"\n\n        raise Exception(\"Child classes must override output_name()\")\n\n    def output_dim(self, auxiliary_output=None):\n        \"\"\"The dimension that this layer outputs.  The 'auxiliary_output'\n        parameter is for layer types which support auxiliary outputs.\n        \"\"\"\n\n        raise Exception(\"Child classes must override output_dim()\")\n\n    def get_full_config(self):\n        \"\"\"This function returns lines destined for the 'full' config format, as\n        would be read by the C++ programs. Since the program\n        xconfig_to_configs.py writes several config files, this function returns\n        a list of pairs of the form (config_file_basename, line),\n        e.g. 
something like\n         [  ('init', 'input-node name=input dim=40'),\n            ('ref', 'input-node name=input dim=40') ]\n        which would be written to config_dir/init.config and config_dir/ref.config.\n        \"\"\"\n\n        raise Exception(\"Child classes must override get_full_config()\")\n\n\nclass XconfigInputLayer(XconfigLayerBase):\n    \"\"\"This class is for lines like\n    'input name=input dim=40'\n    or\n    'input name=ivector dim=100'\n    in the config file.\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n\n        assert first_token == 'input'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n\n        self.config = {'dim': -1}\n\n    def check_configs(self):\n\n        if self.config['dim'] <= 0:\n            raise RuntimeError(\"Dimension of input-layer '{0}'\"\n                               \" should be positive.\".format(self.name))\n\n    def get_input_descriptor_names(self):\n\n        return []  # there is no 'input' field in self.config.\n\n    def output_name(self, auxiliary_outputs=None):\n\n        # there are no auxiliary outputs as this layer will just pass the input\n        assert auxiliary_outputs is None\n        return self.name\n\n    def output_dim(self, auxiliary_outputs=None):\n\n        # there are no auxiliary outputs as this layer will just pass the input\n        assert auxiliary_outputs is None\n        return self.config['dim']\n\n    def get_full_config(self):\n\n        # unlike other layers the input layers need to be printed in\n        # 'init.config' (which initializes the neural network prior to the LDA)\n        ans = []\n        for config_name in ['init', 'ref', 'final']:\n            ans.append((config_name,\n                        'input-node name={0} dim={1}'.format(self.name,\n                                                             self.config['dim'])))\n        return ans\n\n\nclass 
XconfigTrivialOutputLayer(XconfigLayerBase):\n    \"\"\"\n    This class is for lines like\n    'output name=output input=Append(input@-1, input@0, input@1, ReplaceIndex(ivector, t, 0))'\n    This is for outputs that are not really output \"layers\"\n    (there is no affine transform or nonlinearity), they just directly map to an\n    output-node in nnet3.\n\n    Parameters of the class, and their defaults:\n        input='[-1]'    :   Descriptor giving the input of the layer.\n        objective-type=linear   :   the only other choice currently is\n            'quadratic', for use in regression problems\n        output-delay=0    :  Can be used to shift the frames on the output, equivalent\n             to delaying labels by this many frames (positive value increases latency\n             in online decoding but may help if you're using unidirectional LSTMs.\n    \"\"\"\n\n    def __init__(self, first_token, key_to_value, prev_names=None):\n\n        assert first_token == 'output'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        self.config = {'input': '[-1]', 'dim': -1,\n                       'objective-type': 'linear',\n                       'output-delay': 0}\n\n    def check_configs(self):\n\n        if self.config['objective-type'] != 'linear' and \\\n                self.config['objective-type'] != 'quadratic':\n            raise RuntimeError(\"In output, objective-type has\"\n                               \" invalid value {0}\"\n                               \"\".format(self.config['objective-type']))\n\n    def output_name(self, auxiliary_outputs=None):\n\n        # there are no auxiliary outputs as this layer will just pass the output\n        # of the previous layer\n        assert auxiliary_outputs is None\n        return self.name\n\n    def output_dim(self, 
auxiliary_outputs=None):\n\n        assert auxiliary_outputs is None\n        # note: each value of self.descriptors is (descriptor, dim, normalized-string, output-string).\n        return self.descriptors['input']['dim']\n\n    def get_full_config(self):\n\n        # the input layers need to be printed in 'init.config' (which\n        # initializes the neural network prior to the LDA), in 'ref.config',\n        # which is a version of the config file used for getting left and right\n        # context (it doesn't read anything for the LDA-like transform).\n        # In 'full.config' we write everything, this is just for reference,\n        # and also for cases where we don't use the LDA-like transform.\n        ans = []\n\n        # note: each value of self.descriptors is (descriptor, dim,\n        # normalized-string, output-string).\n        # by 'output-string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        descriptor_final_str = self.descriptors['input']['final-string']\n        objective_type = self.config['objective-type']\n        output_delay = self.config['output-delay']\n\n        if output_delay != 0:\n            descriptor_final_str = (\n                'Offset({0}, {1})'.format(descriptor_final_str, output_delay))\n\n        for config_name in ['ref', 'final']:\n            ans.append((config_name,\n                        'output-node name={0} input={1} '\n                        'objective={2}'.format(\n                            self.name, descriptor_final_str,\n                            objective_type)))\n        return ans\n\n\nclass XconfigOutputLayer(XconfigLayerBase):\n    \"\"\"This class is for lines like\n    'output-layer name=output dim=4257 input=Append(input@-1, input@0, input@1, ReplaceIndex(ivector, t, 0))'\n    By default this includes a log-softmax component.  
The parameters are\n    initialized to zero, as this empirically tends to be the best approach for output layers.\n\n    Parameters of the class, and their defaults:\n        input='[-1]'    :   Descriptor giving the input of the layer.\n        dim=None    :   Output dimension of layer, will normally equal the number of pdfs.\n        bottleneck-dim=None    :   Bottleneck dimension of layer: if supplied, instead of\n                        an affine component we'll have a linear then affine, so a linear\n                        bottleneck, with the linear part constrained to be orthonormal.\n        include-log-softmax=true    :   setting it to false will omit the\n            log-softmax component- useful for chain models.\n        objective-type=linear   :   the only other choice currently is\n            'quadratic', for use in regression problems\n        learning-rate-factor=1.0    :   Learning rate factor for the final\n            affine component, multiplies the standard learning rate. normally\n            you'll leave this as-is, but for xent regularization output layers\n            for chain models you'll want to set\n            learning-rate-factor=(0.5/xent_regularize),\n            normally learning-rate-factor=5.0 since xent_regularize is\n            normally 0.1.\n        max-change=1.5 :  Can be used to change the max-change parameter in the\n            affine component; this affects how much the matrix can change on each\n            iteration.\n        l2-regularize=0.0:  Set this to a nonzero value (e.g. 
1.0e-05) to\n            add l2 regularization on the parameter norm for the affine component.\n        output-delay=0    :  Can be used to shift the frames on the output, equivalent\n             to delaying labels by this many frames (positive value increases latency\n             in online decoding but may help if you're using unidirectional LSTMs.\n        ng-affine-options=''  :   Can be used supply non-default options to the affine\n             layer (intended for the natural gradient but can be an arbitrary string\n             to be added to the config line.  e.g. 'update-period=2'.).\n        ng-linear-options=''  :   Options, like ng-affine-options, that are passed to\n             the LinearComponent, only in bottleneck layers (i.e. if bottleneck-dim\n             is supplied).\n    \"\"\"\n\n    def __init__(self, first_token, key_to_value, prev_names=None):\n\n        assert first_token == 'output-layer'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'bottleneck-dim': -1,\n                       'orthonormal-constraint': 1.0,\n                            # orthonormal-constraint only matters if bottleneck-dim is set.\n                       'include-log-softmax': True,\n                            # this would be false for chain models\n                       'objective-type': 'linear',\n                            # see Nnet::ProcessOutputNodeConfigLine in\n                            # nnet-nnet.cc for other options\n                       'output-delay': 0,\n                       'ng-affine-options': '',\n                       'ng-linear-options': '',    # only affects bottleneck output layers.\n\n                       # The following are just passed through to the 
affine\n                       # component, and (in the bottleneck case) the linear\n                       # component.\n                       'learning-rate-factor': '',  # effective default: 1.0\n                       'l2-regularize': '',         # effective default: 0.0\n                       'max-change': 1.5,\n\n                       # The following are passed through to the affine component only.\n                       # It tends to be beneficial to initialize the output layer with\n                       # zero values, unlike the hidden layers.\n                       'param-stddev': 0.0,\n                       'bias-stddev': 0.0,\n                      }\n\n    def check_configs(self):\n\n        if self.config['dim'] <= -1:\n            raise RuntimeError(\"In output-layer, dim has invalid value {0}\"\n                               \"\".format(self.config['dim']))\n\n        if self.config['objective-type'] != 'linear' and \\\n                self.config['objective-type'] != 'quadratic':\n            raise RuntimeError(\"In output-layer, objective-type has\"\n                               \" invalid value {0}\"\n                               \"\".format(self.config['objective-type']))\n\n        if self.config['orthonormal-constraint'] <= 0.0:\n            raise RuntimeError(\"output-layer does not support negative (floating) \"\n                               \"orthonormal constraint; use a separate linear-component \"\n                               \"followed by batchnorm-component.\")\n\n    def auxiliary_outputs(self):\n\n        auxiliary_outputs = ['affine']\n        if self.config['include-log-softmax']:\n            auxiliary_outputs.append('log-softmax')\n\n        return auxiliary_outputs\n\n    def output_name(self, auxiliary_output=None):\n\n        if auxiliary_output is None:\n            # Note: nodes of type output-node in nnet3 may not be accessed in\n            # Descriptors, so calling this with auxiliary_outputs=None 
doesn't\n            # make sense.\n            raise RuntimeError(\"Outputs of output-layer may not be used by other\"\n                               \" layers\")\n\n        if auxiliary_output in self.auxiliary_outputs():\n            return '{0}.{1}'.format(self.name, auxiliary_output)\n        else:\n            raise RuntimeError(\"Unknown auxiliary output name {0}\"\n                               \"\".format(auxiliary_output))\n\n    def output_dim(self, auxiliary_output=None):\n\n        if auxiliary_output is None:\n            # Note: nodes of type output-node in nnet3 may not be accessed in\n            # Descriptors, so calling this with auxiliary_outputs=None doesn't\n            # make sense.\n            raise RuntimeError(\"Outputs of output-layer may not be used by other\"\n                               \" layers\")\n        return self.config['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n\n    def _generate_config(self):\n\n        configs = []\n\n        # note: each value of self.descriptors is (descriptor, dim,\n        # normalized-string, output-string).\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. 
it contains the 'final' names of nodes.\n        descriptor_final_string = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.config['dim']\n        bottleneck_dim = self.config['bottleneck-dim']\n        objective_type = self.config['objective-type']\n        include_log_softmax = self.config['include-log-softmax']\n        output_delay = self.config['output-delay']\n\n        affine_options = self.config['ng-affine-options']\n        for opt in [ 'learning-rate-factor', 'l2-regularize', 'max-change',\n                     'param-stddev', 'bias-stddev' ]:\n            if self.config[opt] != '':\n                affine_options += ' {0}={1}'.format(opt, self.config[opt])\n\n        cur_node = descriptor_final_string\n        cur_dim = input_dim\n\n        if bottleneck_dim >= 0:\n            if bottleneck_dim == 0 or bottleneck_dim >= input_dim or bottleneck_dim >= output_dim:\n                raise RuntimeError(\"Bottleneck dim has value that does not make sense: {0}\".format(\n                    bottleneck_dim))\n            # This is the bottleneck case (it doesn't necessarily imply we\n            # will be using the features from the bottleneck; it's just a factorization\n            # of the matrix into two pieces without a nonlinearity in between).\n            # We don't include the l2-regularize option because it's useless\n            # given the orthonormality constraint.\n            linear_options = self.config['ng-linear-options']\n            for opt in [ 'learning-rate-factor', 'l2-regularize', 'max-change' ]:\n                if self.config[opt] != '':\n                    linear_options += ' {0}={1}'.format(opt, self.config[opt])\n\n\n            # note: by default the LinearComponent uses natural gradient.\n            line = ('component name={0}.linear type=LinearComponent '\n                    'orthonormal-constraint={1} param-stddev={2} '\n                    
'input-dim={3} output-dim={4} max-change=0.75 {5}'\n                    ''.format(self.name, self.config['orthonormal-constraint'],\n                              self.config['orthonormal-constraint'] / math.sqrt(input_dim),\n                              input_dim, bottleneck_dim, linear_options))\n            configs.append(line)\n            line = ('component-node name={0}.linear component={0}.linear input={1}'\n                    ''.format(self.name, cur_node))\n            configs.append(line)\n            cur_node = '{0}.linear'.format(self.name)\n            cur_dim = bottleneck_dim\n\n\n        line = ('component name={0}.affine'\n                ' type=NaturalGradientAffineComponent'\n                ' input-dim={1} output-dim={2} {3}'\n                ''.format(self.name, cur_dim, output_dim, affine_options))\n        configs.append(line)\n        line = ('component-node name={0}.affine'\n                ' component={0}.affine input={1}'\n                ''.format(self.name, cur_node))\n        configs.append(line)\n        cur_node = '{0}.affine'.format(self.name)\n\n        if include_log_softmax:\n            line = ('component name={0}.log-softmax'\n                    ' type=LogSoftmaxComponent dim={1}'\n                    ''.format(self.name, output_dim))\n            configs.append(line)\n\n            line = ('component-node name={0}.log-softmax'\n                    ' component={0}.log-softmax input={1}'\n                    ''.format(self.name, cur_node))\n            configs.append(line)\n            cur_node = '{0}.log-softmax'.format(self.name)\n\n        if output_delay != 0:\n            cur_node = 'Offset({0}, {1})'.format(cur_node, output_delay)\n\n        line = ('output-node name={0} input={1} '\n                'objective={2}'.format(\n                    self.name, cur_node, objective_type))\n        configs.append(line)\n        return configs\n\n\nclass XconfigBasicLayer(XconfigLayerBase):\n    \"\"\"This class is for parsing 
lines like\n     'relu-renorm-layer name=layer1 dim=1024 input=Append(-3,0,3)'\n    or:\n     'sigmoid-layer name=layer1 dim=1024 input=Append(-3,0,3)'\n    which specify addition of an affine component and a sequence of non-linearities.\n    Here, the name of the layer itself dictates the sequence of nonlinearities\n    that are applied after the affine component; the name should contain some\n    combination of 'relu', 'renorm', 'sigmoid', 'tanh', 'batchnorm', 'so' and 'dropout',\n    and these nonlinearities will be added along with the affine component.\n\n    The dimension specified is the output dim; the input dim is worked out from the input descriptor.\n    This class supports only nonlinearity types that do not change the dimension; we can create\n    another layer type to enable the use of p-norm and similar dimension-reducing nonlinearities.\n\n    See other configuration values below.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=-1                   [Output dimension of layer, e.g. 1024]\n      bottleneck-dim=-1        [If you set this, a linear bottleneck is added, so\n                                we project first to bottleneck-dim, then to dim.  The\n                                first of the two matrices is constrained to be\n                                orthonormal.]\n      self-repair-scale=1.0e-05  [Affects relu, sigmoid and tanh layers.]\n      learning-rate-factor=1.0   [This can be used to make the affine component\n                                  train faster or slower.]\n      add-log-stddev=False     [If true, the log of the stddev of the output of\n                                renorm layer is appended as an\n                                additional dimension of the layer's output]\n      l2-regularize=0.0       [Set this to a nonzero value (e.g. 
1.0e-05) to\n                               add l2 regularization on the parameter norm for\n                               this component.]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'bottleneck-dim': -1,  # Deprecated!  Use tdnnf-layer for\n                                              # factorized TDNNs, or prefinal-layer\n                                              # for bottlenecks just before the output.\n                       'self-repair-scale': 1.0e-05,\n                       'target-rms': 1.0,\n                       'ng-affine-options': '',\n                       'ng-linear-options': '',    # only affects bottleneck layers.\n                       'dropout-proportion': 0.5,  # dropout-proportion only\n                                                   # affects layers with\n                                                   # 'dropout' in the name\n                       'dropout-per-dim': False,  # if dropout-per-dim=true, the dropout\n                                                  # mask is shared across time.\n                       'dropout-per-dim-continuous':  False, # if you set this, it's\n                                                    # like dropout-per-dim but with a\n                                                    # continuous-valued (not zero-one) mask.\n                       'add-log-stddev': False,\n                       # the following are not really inspected by this level of\n                       # code, just passed through to the affine component if\n                       # their value is not ''.\n                       'bias-stddev': '',\n                
       'l2-regularize': '',\n                       'learning-rate-factor': '',\n                       'max-change': 0.75 }\n\n    def check_configs(self):\n        if self.config['dim'] < 0:\n            raise RuntimeError(\"dim has invalid value {0}\".format(self.config['dim']))\n        b = self.config['bottleneck-dim']\n        if b >= 0 and (b >= self.config['dim'] or b == 0):\n            raise RuntimeError(\"bottleneck-dim has an invalid value {0}\".format(b))\n\n        if self.config['self-repair-scale'] < 0.0 or self.config['self-repair-scale'] > 1.0:\n            raise RuntimeError(\"self-repair-scale has invalid value {0}\"\n                               .format(self.config['self-repair-scale']))\n        if self.config['target-rms'] < 0.0:\n            raise RuntimeError(\"target-rms has invalid value {0}\"\n                               .format(self.config['target-rms']))\n        if (self.config['learning-rate-factor'] != '' and\n            self.config['learning-rate-factor'] <= 0.0):\n            raise RuntimeError(\"learning-rate-factor has invalid value {0}\"\n                               .format(self.config['learning-rate-factor']))\n\n    def output_name(self, auxiliary_output=None):\n        # at a later stage we might want to expose even the pre-nonlinearity\n        # vectors\n        assert auxiliary_output is None\n\n        split_layer_name = self.layer_type.split('-')\n        assert split_layer_name[-1] == 'layer'\n        last_nonlinearity = split_layer_name[-2]\n        # return something like: layer3.renorm\n        return '{0}.{1}'.format(self.name, last_nonlinearity)\n\n    def output_dim(self, auxiliary_output=None):\n        output_dim = self.config['dim']\n        # If not set, the output-dim defaults to the input-dim.  Return the\n        # resolved value (not the unset one) and cache it in self.config.\n        if output_dim <= 0:\n            output_dim = self.descriptors['input']['dim']\n            self.config['dim'] = output_dim\n\n        return output_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = 
self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        split_layer_name = self.layer_type.split('-')\n        assert split_layer_name[-1] == 'layer'\n        nonlinearities = split_layer_name[:-1]\n\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n\n        # the child classes e.g. tdnn might want to process the input\n        # before adding the other components\n\n        return self._add_components(input_desc, input_dim, nonlinearities)\n\n    def _add_components(self, input_desc, input_dim, nonlinearities):\n        output_dim = self.output_dim()\n        self_repair_scale = self.config['self-repair-scale']\n        target_rms = self.config['target-rms']\n\n        affine_options = self.config['ng-affine-options']\n        for opt_name in [ 'max-change', 'learning-rate-factor',\n                          'bias-stddev', 'l2-regularize' ]:\n            value = self.config[opt_name]\n            if value != '':\n                affine_options += ' {0}={1}'.format(opt_name, value)\n\n        # The output of the affine component needs to have one dimension fewer in order to\n        # get the required output dim, if the final 'renorm' component has 'add-log-stddev' set\n        # (since in that case it increases the dimension by one).\n        if self.config['add-log-stddev']:\n            output_dim -= 1\n            if not self.layer_type.split('-')[-2] == \"renorm\":\n                raise RuntimeError(\"add-log-stddev cannot be true unless \"\n       
                            \"there is a final 'renorm' component.\")\n\n        configs = []\n        cur_dim = input_dim\n        cur_node = input_desc\n\n        # First the affine node (or linear then affine, if bottleneck).\n        if self.config['bottleneck-dim'] > 0:\n            # The 'bottleneck-dim' option is deprecated and may eventually be\n            # removed.  Best to use tdnnf-layer if you want factorized TDNNs.\n\n            # This is the bottleneck case (it doesn't necessarily imply we\n            # will be using the features from the bottleneck; it's just a factorization\n            # of the matrix into two pieces without a nonlinearity in between).\n            # We don't include the l2-regularize option because it's useless\n            # given the orthonormality constraint.\n            linear_options = self.config['ng-linear-options']\n            for opt_name in [ 'max-change', 'learning-rate-factor' ]:\n                value = self.config[opt_name]\n                if value != '':\n                    linear_options += ' {0}={1}'.format(opt_name, value)\n\n            bottleneck_dim = self.config['bottleneck-dim']\n            # note: by default the LinearComponent uses natural gradient.\n            line = ('component name={0}.linear type=LinearComponent '\n                    'input-dim={1} orthonormal-constraint=1.0 output-dim={2} {3}'\n                    ''.format(self.name, input_dim, bottleneck_dim, linear_options))\n            configs.append(line)\n            line = ('component-node name={0}.linear component={0}.linear input={1}'\n                    ''.format(self.name, cur_node))\n            configs.append(line)\n            cur_node = '{0}.linear'.format(self.name)\n            cur_dim = bottleneck_dim\n\n\n        line = ('component name={0}.affine type=NaturalGradientAffineComponent'\n                ' input-dim={1} output-dim={2} {3}'\n                ''.format(self.name, cur_dim, output_dim, affine_options))\n        
configs.append(line)\n        line = ('component-node name={0}.affine component={0}.affine input={1}'\n                ''.format(self.name, cur_node))\n        configs.append(line)\n        cur_node = '{0}.affine'.format(self.name)\n\n        for i, nonlinearity in enumerate(nonlinearities):\n            if nonlinearity == 'relu':\n                line = ('component name={0}.{1} type=RectifiedLinearComponent dim={2}'\n                        ' self-repair-scale={3}'\n                        ''.format(self.name, nonlinearity, output_dim,\n                                  self_repair_scale))\n\n            elif nonlinearity == 'sigmoid':\n                line = ('component name={0}.{1}'\n                        ' type=SigmoidComponent dim={2}'\n                        ' self-repair-scale={3}'\n                        ''.format(self.name, nonlinearity, output_dim,\n                                  self_repair_scale))\n\n            elif nonlinearity == 'tanh':\n                line = ('component name={0}.{1}'\n                        ' type=TanhComponent dim={2}'\n                        ' self-repair-scale={3}'\n                        ''.format(self.name, nonlinearity, output_dim,\n                                  self_repair_scale))\n\n            elif nonlinearity == 'renorm':\n                add_log_stddev = \"false\"\n                if i == len(nonlinearities) - 1:\n                    add_log_stddev = (\"true\" if self.config['add-log-stddev']\n                                      else \"false\")\n                line = ('component name={0}.{1}'\n                        ' type=NormalizeComponent dim={2}'\n                        ' target-rms={3}'\n                        ' add-log-stddev={4}'\n                        ''.format(self.name, nonlinearity, output_dim,\n                                  target_rms, add_log_stddev))\n\n            elif nonlinearity == 'batchnorm':\n                line = ('component name={0}.{1}'\n                        ' 
type=BatchNormComponent dim={2} target-rms={3}'\n                        ''.format(self.name, nonlinearity, output_dim,\n                                  target_rms))\n\n            elif nonlinearity == 'so':\n                line = ('component name={0}.{1}'\n                        ' type=ScaleAndOffsetComponent dim={2} max-change=0.5 '\n                        ''.format(self.name, nonlinearity, output_dim))\n\n            elif nonlinearity == 'dropout':\n                if not (self.config['dropout-per-dim'] or\n                        self.config['dropout-per-dim-continuous']):\n                    line = ('component name={0}.{1} type=DropoutComponent '\n                            'dim={2} dropout-proportion={3}'.format(\n                                self.name, nonlinearity, output_dim,\n                                self.config['dropout-proportion']))\n                else:\n                    continuous_opt='continuous=true' if self.config['dropout-per-dim-continuous'] else ''\n\n                    line = ('component name={0}.dropout type=GeneralDropoutComponent '\n                            'dim={1} dropout-proportion={2} {3}'.format(\n                                self.name, output_dim, self.config['dropout-proportion'],\n                                continuous_opt))\n            else:\n                raise RuntimeError(\"Unknown nonlinearity type: {0}\"\n                                   .format(nonlinearity))\n\n            configs.append(line)\n            line = ('component-node name={0}.{1}'\n                    ' component={0}.{1} input={2}'\n                    ''.format(self.name, nonlinearity, cur_node))\n\n            configs.append(line)\n            cur_node = '{0}.{1}'.format(self.name, nonlinearity)\n        return configs\n\n\nclass XconfigFixedAffineLayer(XconfigLayerBase):\n    \"\"\"\n    This class is for lines like\n     'fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) 
affine-transform-file=foo/bar/lda.mat'\n\n    The output dimension of the layer may be specified via 'dim=xxx', but if not specified,\n    the dimension defaults to the same as the input.  Note: we don't attempt to read that\n    file at the time the config is created, because in the recipes, that file is created\n    after the config files.\n\n    See other configuration values below.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=None                   [Output dimension of layer; defaults to the same as the input dim.]\n      affine-transform-file='' [Must be specified.]\n      delay=0                  [Optional delay for the output-node in init.config]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        assert first_token == 'fixed-affine-layer'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'affine-transform-file': '',\n                       'delay': 0,\n                       'write-init-config': True}\n\n    def check_configs(self):\n        if self.config['affine-transform-file'] is None:\n            raise RuntimeError(\"affine-transform-file must be set.\")\n\n    def output_name(self, auxiliary_output=None):\n        # Fixed affine layer computes only one vector, there are no intermediate\n        # vectors.\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        output_dim = self.config['dim']\n        # If not set, the output-dim defaults to the input-dim.\n        if output_dim <= 0:\n            output_dim = self.descriptors['input']['dim']\n        return output_dim\n\n    def 
get_full_config(self):\n        ans = []\n\n        # note: each value of self.descriptors is (descriptor, dim,\n        # normalized-string, output-string).\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        descriptor_final_string = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.output_dim()\n        transform_file = self.config['affine-transform-file']\n\n        if self.config['write-init-config']:\n            if self.config['delay'] != 0:\n                line = 'component name={0}.delayed type=NoOpComponent dim={1}'.format(self.name, input_dim)\n                ans.append(('init', line))\n                line = 'component-node name={0}.delayed component={0}.delayed input={1}'.format(self.name, descriptor_final_string)\n                ans.append(('init', line))\n                line = 'output-node name=output input=Offset({0}.delayed, {1})'.format(self.name, self.config['delay'])\n                ans.append(('init', line))\n            else:\n                # to init.config we write an output-node with the name 'output' and\n                # with a Descriptor equal to the descriptor that's the input to this\n                # layer.  
This will be used to accumulate stats to learn the LDA transform.\n                line = 'output-node name=output input={0}'.format(descriptor_final_string)\n                ans.append(('init', line))\n\n        # write the 'real' component to final.config\n        line = 'component name={0} type=FixedAffineComponent matrix={1}'.format(\n            self.name, transform_file)\n        ans.append(('final', line))\n        # write a random version of the component, with the same dims, to ref.config\n        line = 'component name={0} type=FixedAffineComponent input-dim={1} output-dim={2}'.format(\n            self.name, input_dim, output_dim)\n        ans.append(('ref', line))\n        # the component-node gets written to final.config and ref.config.\n        line = 'component-node name={0} component={0} input={1}'.format(\n            self.name, descriptor_final_string)\n        ans.append(('final', line))\n        ans.append(('ref', line))\n        return ans\n\n\nclass XconfigAffineLayer(XconfigLayerBase):\n    \"\"\"\n    This class is for lines like\n     'affine-layer name=affine input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0))'\n\n    The output dimension of the layer may be specified via 'dim=xxx', but if not specified,\n    the dimension defaults to the same as the input.  Note: we don't attempt to read that\n    file at the time the config is created, because in the recipes, that file is created\n    after the config files.\n\n    See other configuration values below.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=None                 [Output dimension of layer; defaults to the same as the input dim.]\n\n      l2-regularize=0.0       [Set this to a nonzero value (e.g. 
1.0e-05) to\n                               add l2 regularization on the parameter norm\n                               for the affine component.]\n    \"\"\"\n\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        assert first_token == 'affine-layer'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        # use None for optional parameters as we want to default to the C++ defaults\n        # C++ component provides more options but I will just expose these for now\n        # Note : The type of the parameter is determined based on the value assigned\n        #        so please use decimal point if your parameter is a float\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'param-stddev': -1.0,  # this has to be initialized to 1/sqrt(input_dim)\n                       'bias-stddev': 1.0,\n                       'bias-mean': 0.0,\n                       'max-change': 0.75,\n                       'l2-regularize': 0.0,\n                       'learning-rate-factor': 1.0,\n                       'ng-affine-options': ''}\n\n    def set_derived_configs(self):\n        super(XconfigAffineLayer, self).set_derived_configs()\n        if self.config['param-stddev'] < 0:\n            self.config['param-stddev'] = 1.0 / math.sqrt(self.descriptors['input']['dim'])\n\n    def check_configs(self):\n        if self.config['dim'] <= 0:\n            raise RuntimeError(\"dim specified is invalid\")\n\n    def output_name(self, auxiliary_output=None):\n        # affine layer computes only one vector, there are no intermediate\n        # vectors.\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        output_dim = self.config['dim']\n        # If not set, the 
output-dim defaults to the input-dim.\n        if output_dim <= 0:\n            output_dim = self.descriptors['input']['dim']\n\n        return output_dim\n\n    def get_full_config(self):\n        ans = []\n\n        # note: each value of self.descriptors is (descriptor, dim,\n        # normalized-string, output-string).\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        descriptor_final_string = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.output_dim()\n\n        option_string = ''\n        for key in ['param-stddev', 'bias-stddev', 'bias-mean', 'max-change',\n                    'l2-regularize']:\n            option_string += ' {0}={1}'.format(key, self.config[key])\n        option_string += self.config['ng-affine-options']\n\n        conf_lines = []\n        # write the 'real' component to final.config\n        conf_lines.append('component name={n} type=NaturalGradientAffineComponent '\n                          'input-dim={i} output-dim={o} {opts}'.format(n=self.name,\n                                                                       i=input_dim,\n                                                                       o=output_dim,\n                                                                       opts=option_string))\n        # the component-node gets written to final.config and ref.config.\n        conf_lines.append('component-node name={0} component={0} input={1}'.format(self.name,\n                                                                                   descriptor_final_string))\n\n        # the config is same for both final and ref configs\n        for conf_name in ['final', 'ref']:\n            for line in conf_lines:\n                ans.append((conf_name, line))\n        return ans\n\n\nclass XconfigIdctLayer(XconfigLayerBase):\n    \"\"\"\n    This class 
is for lines like\n     'idct-layer name=idct dim=40 cepstral-lifter=22 affine-transform-file=foo/bar/idct.mat'\n\n    This is used to convert input MFCC features to filterbank features. The\n    affine transformation is written out to the file specified via\n    'affine-transform-file=xxx'.\n    The output dimension of the layer may be specified via 'dim=xxx', but if not specified,\n    the dimension defaults to the same as the input.\n\n    See other configuration values below.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=-1                   [Output dimension of layer; defaults to the same as the input dim.]\n      cepstral-lifter=22       [Cepstral liftering coefficient to apply.]\n      affine-transform-file='' [Must be specified.]\n      include-in-init=false     [You should set this to true if this precedes a\n                                `fixed-affine-layer` that is to be initialized\n                                 via LDA]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        assert first_token == 'idct-layer'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        # note: self.config['input'] is a descriptor, '[-1]' means output\n        # the most recent layer.\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'cepstral-lifter': 22.0,\n                       'affine-transform-file': '',\n                       'include-in-init': False}\n\n    def check_configs(self):\n        # the default is '' (not None), so test for an empty value.\n        if not self.config['affine-transform-file']:\n            raise RuntimeError(\"affine-transform-file must be set.\")\n\n    def output_name(self, auxiliary_output=None):\n        # Fixed affine layer computes only one vector, there are no intermediate\n        # vectors.\n        assert auxiliary_output is None\n        return 
self.name\n\n    def output_dim(self, auxiliary_output=None):\n        output_dim = self.config['dim']\n        # If not set, the output-dim defaults to the input-dim.\n        if output_dim <= 0:\n            output_dim = self.descriptors['input']['dim']\n        return output_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n            if self.config['include-in-init']:\n                ans.append(('init', line))\n        return ans\n\n\n    def _generate_config(self):\n\n        # note: each value of self.descriptors is (descriptor, dim,\n        # normalized-string, output-string).\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. 
it contains the 'final' names of nodes.\n        descriptor_final_string = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.output_dim()\n        transform_file = self.config['affine-transform-file']\n\n        idct_mat = common_lib.compute_idct_matrix(\n            input_dim, output_dim, self.config['cepstral-lifter'])\n        # append a zero column to the matrix, this is the bias of the fixed\n        # affine component\n        for n in range(0, output_dim):\n            idct_mat[n].append(0)\n        common_lib.write_kaldi_matrix(transform_file, idct_mat)\n\n        configs = []\n\n        # write the 'real' component to final.config\n        line = 'component name={0} type=FixedAffineComponent matrix={1}'.format(\n            self.name, transform_file)\n        configs.append(line)\n        line = 'component-node name={0} component={0} input={1}'.format(\n            self.name, descriptor_final_string)\n        configs.append(line)\n        return configs\n\n\nclass XconfigExistingLayer(XconfigLayerBase):\n    \"\"\"\n    This class is used to internally convert component-nodes in an existing\n    model into lines like\n    'existing name=tdnn1.affine dim=40'.\n\n    Layers of this type are not presented in any actual xconfig or config\n    files, but are created internally for all component nodes\n    in an existing neural net model to use as input to other layers in xconfig.\n    (i.e. 
get_model_component_info function, which is called in\n     steps/nnet3/xconfig_to_configs.py, parses the name and\n     dimension of component-nodes used in the existing model\n     using the nnet3-info and returns a list of 'existing' layers.)\n\n    This class is useful in cases like transferring existing model\n    and using {input, output, component}-nodes in this model as\n    input to new layers.\n    \"\"\"\n\n    def __init__(self, first_token, key_to_value, prev_names=None):\n\n        assert first_token == 'existing'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n\n    def set_default_configs(self):\n        self.config = { 'dim': -1}\n\n    def check_configs(self):\n        if self.config['dim'] <= 0:\n            raise RuntimeError(\"Dimension of existing-layer '{0}'\"\n                                \"should be positive.\".format(self.name))\n\n    def get_input_descriptor_names(self):\n        return []  # there is no 'input' field in self.config.\n\n    def output_name(self, auxiliary_outputs=None):\n        # there are no auxiliary outputs as this layer will just pass the input\n        assert auxiliary_outputs is None\n        return self.name\n\n    def output_dim(self, auxiliary_outputs=None):\n        # there are no auxiliary outputs as this layer will just pass the input\n        assert auxiliary_outputs is None\n        return self.config['dim']\n\n    def get_full_config(self):\n        # unlike other layers the existing layers should not to be printed in\n        # any '*.config'\n        ans = []\n        return ans\n\n\nclass XconfigSpecAugmentLayer(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'spec-augment-layer name=spec-augment freq-max-proportion=0.5 time-zeroed-proportion=0.2 time-mask-max-frames=10'\n\n    which will produce a component of type GeneralDropoutComponent (to do the\n    frequency-domain part) and then one of type SpecaugmentTimeMaskComponent (to\n    do 
the time part).\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      freq-max-proportion=0.5  [The maximum proportion of the frequency space that\n                                might be zeroed out]\n      time-zeroed-proportion=0.2  [The proportion of time frames that will be zeroed\n                                  out]\n      time-mask-max-frames=20   [The maximum length of a zeroed region in the time\n                                axis, in frames.]\n      include-in-init=false     [You should set this to true if this precedes a\n                                `fixed-affine-layer` that is to be initialized\n                                 via LDA]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'freq-max-proportion': 0.5,\n                       'time-zeroed-proportion': 0.2,\n                       'time-mask-max-frames': 20,\n                       'include-in-init': False}\n\n\n    def check_configs(self):\n        assert (self.config['freq-max-proportion'] > 0.0 and self.config['freq-max-proportion'] < 1.0\n                and self.config['time-zeroed-proportion'] > 0.0 and self.config['time-zeroed-proportion'] < 1.0\n                and self.config['time-mask-max-frames'] >= 1)\n\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return '{0}.time-mask'.format(self.name)\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n        return input_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in 
['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n            if self.config['include-in-init']:\n                ans.append(('init', line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        freq_max_proportion = self.config['freq-max-proportion']\n        time_zeroed_proportion = self.config['time-zeroed-proportion']\n        time_mask_max_frames = self.config['time-mask-max-frames']\n\n        configs = []\n        line = ('component name={0}.freq-mask type=GeneralDropoutComponent dim={1} specaugment-max-proportion={2}'.format(\n            self.name, input_dim, freq_max_proportion))\n        configs.append(line)\n        line = ('component-node name={0}.freq-mask component={0}.freq-mask input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        line = ('component name={0}.time-mask type=SpecAugmentTimeMaskComponent dim={1} '\n                'zeroed-proportion={2} time-mask-max-frames={3}'.format(\n                    self.name, input_dim, time_zeroed_proportion, time_mask_max_frames))\n        configs.append(line)\n        line = ('component-node name={0}.time-mask component={0}.time-mask input={0}.freq-mask'.format(\n            self.name))\n        configs.append(line)\n        return configs\n\n\ndef test_layers():\n    # for some config lines that should be printed the same way as they\n    # are read, check that this is the case.\n    for x in ['input name=input dim=30']:\n        assert str(config_line_to_object(x, [])) == x\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/composite_layers.py",
    "content": "# Copyright 2018    Johns Hopkins University (Dan Povey)\n# Apache 2.0.\n\n\"\"\" This module contains some composite layers, which is basically a catch-all\n    term for things like TDNN-F that contain several affine or linear comopnents.\n\"\"\"\nfrom __future__ import print_function\nimport math\nimport re\nimport sys\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n# This class is intended to implement an extension of the factorized TDNN\n# (TDNN-F) that supports resnet-type 'bypass' connections.  It is for lines like\n# the following:\n#\n# tdnnf-layer name=tdnnf2 dim=1024 bottleneck-dim=128 dropout-proportion=0.0 time-stride=3\n#\n# The line above would be roughly equivalent to the following four lines (except\n# for different naming, and the use of TdnnComponent, for efficiency, in place\n# of AffineComponent).  Assume that the previous layer (the default input) was tdnnf1:\n#\n#  linear-component name=tdnnf2.linear dim=128 orthonormal-constraint=-1.0 input=Append(Offset(-3, tdnnf1), tdnnf1)\n#  relu-batchnorm-dropout-layer name=tdnnf2.affine dim=1024 dropout-proportion=0.0 \\\n#    dropout-per-dim-continuous=true input=Append(0,3)\n#  no-op-component name=tdnnf2 input=Sum(Scale(0.66,tdnnf1), tdnn2.affine)\n\n#  Documentation of some of the important options:\n#\n#   - dropout-proportion\n# This gets passed through to the dropout component.  If you don't set\n# 'dropout-proportion', no dropout component will be included; it would be like\n# using a relu-batchnorm-layer in place of a relu-batchnorm-dropout-layer.  You\n# should only set 'dropout-proportion' if you intend to use dropout (it would\n# usually be combined with the --dropout-schedule option to train.py).  If you\n# use the --dropout-schedule option, the value doesn't really matter since it\n# will be changed during training, and 0 is recommended.\n#\n#  - time-stride\n# Controls the time offsets in the splicing, e.g. 
if you set time-stride to\n# 1 instead of the 3 in the example, the time-offsets would be -1 and 1 instead\n# of -3 and 3.\n# If you set time-stride=0, as a special case no splicing over time will be\n# performed (so no Append() expressions): both the linear and the affine\n# component then use the single time offset 0.\n# You can set time-stride to a negative number which will negate all the\n# time indexes; it might potentially be useful to alternate negative and positive\n# time-stride if you wanted to force the overall network to have symmetric\n# context, since with positive time stride, this layer has more negative\n# than positive time context (i.e. more left than right).\n#\n#  - bypass-scale\n\n# A scale on the previous layer's output, used in bypass (resnet-type)\n# connections.  Should not exceed 1.0.  The default is 0.66.  If you set it to\n# zero, the layer won't use a bypass connection at all, so it would be like\n# conventional TDNN-F (but we don't recommend this).  Note: the\n# layer outputs are added together after the batchnorm so the model cannot\n# control their relative magnitudes and this does actually affect what it can\n# model.  When we experimented with having this scale trainable it did not seem\n# to give an advantage.\n#\n#  - l2-regularize\n# This is passed through to the linear and affine components.  You'll normally\n# want this to be set to a nonzero value, e.g. 
0.004.\n\nclass XconfigTdnnfLayer(XconfigLayerBase):\n\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"tdnnf-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                       'dim':-1,\n                       'bottleneck-dim':-1,\n                       'bypass-scale':0.66,\n                       'dropout-proportion':-1.0,\n                       'time-stride':1,\n                       'l2-regularize':0.0,\n                       'max-change': 0.75,\n                       'self-repair-scale': 1.0e-05,\n                       'context': 'default'}\n\n    def set_derived_configs(self):\n        pass\n\n    def check_configs(self):\n        if self.config['bottleneck-dim'] <= 0:\n            raise RuntimeError(\"bottleneck-dim must be set and >0.\")\n        if self.config['dim'] <= self.config['bottleneck-dim']:\n            raise RuntimeError(\"dim must be greater than bottleneck-dim\")\n\n        dropout = self.config['dropout-proportion']\n        if dropout != -1.0 and not (dropout >= 0.0 and dropout < 1.0):\n            raise RuntimeError(\"invalid value for dropout-proportion\")\n\n        if abs(self.config['bypass-scale']) > 1.0:\n            raise RuntimeError(\"bypass-scale has invalid value\")\n\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.config['dim']\n        if output_dim != input_dim and self.config['bypass-scale'] != 0.0:\n            raise RuntimeError('bypass-scale is nonzero but output-dim != input-dim: {0} != {1}'\n                               ''.format(output_dim, input_dim))\n\n        if not self.config['context'] in ['default', 'left-only', 'shift-left', 'none']:\n            raise RuntimeError('context must be default, left-only shift-left or none, got {}'.format(\n                self.config['context']))\n\n\n    def 
output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        output_component = ''\n        if self.config['bypass-scale'] != 0.0:\n            # the no-op component is used to cache something that we don't want\n            # to have to recompute.\n            output_component = 'noop'\n        elif self.config['dropout-proportion'] != -1.0:\n            output_component = 'dropout'\n        else:\n            output_component = 'batchnorm'\n        return '{0}.{1}'.format(self.name, output_component)\n\n\n    def output_dim(self, auxiliary_output=None):\n        return self.config['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                ans.append((config_name, line))\n        return ans\n\n\n    def _generate_config(self):\n        configs = []\n        name = self.name\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        output_dim = self.config['dim']\n        bottleneck_dim = self.config['bottleneck-dim']\n        bypass_scale = self.config['bypass-scale']\n        dropout_proportion = self.config['dropout-proportion']\n        time_stride = self.config['time-stride']\n        context = self.config['context']\n        if time_stride != 0 and context != 'none':\n            time_offsets1 = '{0},0'.format(-time_stride)\n            if context == 'default':\n                time_offsets2 = '0,{0}'.format(time_stride)\n            elif context == 'shift-left':\n                time_offsets2 = '{0},0'.format(-time_stride)\n            else:\n                assert context == 'left-only'\n                time_offsets2 = '0'\n        else:\n            time_offsets1 = '0'\n            time_offsets2 = '0'\n        l2_regularize = self.config['l2-regularize']\n        max_change = self.config['max-change']\n 
       self_repair_scale = self.config['self-repair-scale']\n\n        # The first linear layer, from input-dim (spliced x2) to bottleneck-dim\n        configs.append('component name={0}.linear type=TdnnComponent input-dim={1} '\n                       'output-dim={2} l2-regularize={3} max-change={4} use-bias=false '\n                       'time-offsets={5} orthonormal-constraint=-1.0'.format(\n                           name, input_dim, bottleneck_dim, l2_regularize,\n                           max_change, time_offsets1))\n        configs.append('component-node name={0}.linear component={0}.linear '\n                       'input={1}'.format(name, input_descriptor))\n\n        # The affine layer, from bottleneck-dim (spliced x2) to output-dim\n        configs.append('component name={0}.affine type=TdnnComponent '\n                       'input-dim={1} output-dim={2} l2-regularize={3} max-change={4} '\n                       'time-offsets={5}'.format(\n                           name, bottleneck_dim, output_dim, l2_regularize,\n                           max_change, time_offsets2))\n        configs.append('component-node name={0}.affine component={0}.affine '\n                       'input={0}.linear'.format(name))\n\n        # The ReLU layer\n        configs.append('component name={0}.relu type=RectifiedLinearComponent dim={1} '\n                       'self-repair-scale={2}'.format(\n                           name, output_dim, self_repair_scale))\n        configs.append('component-node name={0}.relu component={0}.relu '\n                       'input={0}.affine'.format(name))\n\n        # The BatchNorm layer\n        configs.append('component name={0}.batchnorm type=BatchNormComponent '\n                       'dim={1}'.format(name, output_dim))\n        configs.append('component-node name={0}.batchnorm component={0}.batchnorm '\n                       'input={0}.relu'.format(name))\n\n        if dropout_proportion != -1:\n            # This is not normal 
dropout.  It's dropout where the mask is shared\n            # across time, and (thanks to continuous=true), instead of a\n            # zero-or-one scale, it's a continuously varying scale whose\n            # expected value is 1, drawn from a uniform distribution over an\n            # interval of a size that varies with dropout-proportion.\n            configs.append('component name={0}.dropout type=GeneralDropoutComponent '\n                           'dim={1} dropout-proportion={2} continuous=true'.format(\n                               name, output_dim, dropout_proportion))\n            configs.append('component-node name={0}.dropout component={0}.dropout '\n                           'input={0}.batchnorm'.format(name))\n            cur_component_type = 'dropout'\n        else:\n            cur_component_type = 'batchnorm'\n\n        if bypass_scale != 0.0:\n            # Add a NoOpComponent to cache the weighted sum of the input and the\n            # output.  We could easily have the output of the component be a\n            # Descriptor like 'Sum(Scale(0.66, tdnn1.batchnorm), tdnn2.batchnorm)',\n            # but if we did that and you used many of these components in sequence,\n            # the weighted sums would have more and more terms as you went deeper\n            # in the network.\n            configs.append('component name={0}.noop type=NoOpComponent '\n                           'dim={1}'.format(name, output_dim))\n            configs.append('component-node name={0}.noop component={0}.noop '\n                           'input=Sum(Scale({1}, {2}), {0}.{3})'.format(\n                               name, bypass_scale, input_descriptor,\n                               cur_component_type))\n\n        return configs\n\n# This is for lines like the following:\n#  prefinal-layer name=prefinal-chain input=prefinal-l l2-regularize=0.02 big-dim=1024 small-dim=256\n#\n# which is equivalent to the following sequence of components (except for\n# name 
differences):\n#  relu-batchnorm-layer name=prefinal-chain input=prefinal-l l2-regularize=0.02 dim=1024\n#  linear-component name=prefinal-chain-l dim=256 l2-regularize=0.02 orthonormal-constraint=-1.0\n#  batchnorm-component name=prefinal-chain-batchnorm\n#\n# This layer is really just for convenience in writing config files: it doesn't\n# do anything that's particularly hard or unusual, but it encapsulates a commonly\n# repeated pattern.\nclass XconfigPrefinalLayer(XconfigLayerBase):\n\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"prefinal-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                       'big-dim':-1,\n                       'small-dim':-1,\n                       'l2-regularize':0.0,\n                       'max-change': 0.75,\n                       'self-repair-scale': 1.0e-05}\n\n    def set_derived_configs(self):\n        pass\n\n    def check_configs(self):\n        if self.config['small-dim'] <= 0:\n            raise RuntimeError(\"small-dim must be set and >0.\")\n        if self.config['big-dim'] <= self.config['small-dim']:\n            raise RuntimeError(\"big-dim must be greater than small-dim\")\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return '{0}.batchnorm2'.format(self.name)\n\n    def output_dim(self, auxiliary_output=None):\n        return self.config['small-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                ans.append((config_name, line))\n        return ans\n\n\n    def _generate_config(self):\n        configs = []\n        name = self.name\n\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = 
self.descriptors['input']['final-string']\n        small_dim = self.config['small-dim']\n        big_dim = self.config['big-dim']\n        l2_regularize = self.config['l2-regularize']\n        max_change = self.config['max-change']\n        self_repair_scale = self.config['self-repair-scale']\n\n        # The affine layer, from input-dim to big-dim.\n        configs.append('component name={0}.affine type=NaturalGradientAffineComponent '\n                       'input-dim={1} output-dim={2} l2-regularize={3} max-change={4}'.format(\n                           name, input_dim, big_dim, l2_regularize, max_change))\n        configs.append('component-node name={0}.affine component={0}.affine '\n                       'input={1}'.format(name, input_descriptor))\n\n        # The ReLU layer\n        configs.append('component name={0}.relu type=RectifiedLinearComponent dim={1} '\n                       'self-repair-scale={2}'.format(\n                           name, big_dim, self_repair_scale))\n        configs.append('component-node name={0}.relu component={0}.relu '\n                       'input={0}.affine'.format(name))\n\n        # The first BatchNorm layer\n        configs.append('component name={0}.batchnorm1 type=BatchNormComponent '\n                       'dim={1}'.format(name, big_dim))\n        configs.append('component-node name={0}.batchnorm1 component={0}.batchnorm1 '\n                       'input={0}.relu'.format(name))\n\n        # The linear layer, from big-dim to small-dim, with orthonormal-constraint=-1\n        # (\"floating\" orthonormal constraint).\n        configs.append('component name={0}.linear type=LinearComponent '\n                       'input-dim={1} output-dim={2} l2-regularize={3} max-change={4} '\n                       'orthonormal-constraint=-1 '.format(\n                           name, big_dim, small_dim,\n                           l2_regularize, max_change))\n        configs.append('component-node name={0}.linear 
component={0}.linear '\n                       'input={0}.batchnorm1'.format(name))\n\n        # The second BatchNorm layer\n        configs.append('component name={0}.batchnorm2 type=BatchNormComponent '\n                       'dim={1}'.format(name, small_dim))\n        configs.append('component-node name={0}.batchnorm2 component={0}.batchnorm2 '\n                       'input={0}.linear'.format(name))\n\n        return configs\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/convolution.py",
    "content": "# Copyright 2018    Johns Hopkins University (Author: Dan Povey)\n#           2016    Vijayaditya Peddinti\n# Apache 2.0.\n\n\n\n\"\"\" This module has the implementation of convolutional layers.\n\"\"\"\nfrom __future__ import print_function\nfrom __future__ import division\nimport math\nimport re\nimport sys\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n\n# This class is for lines like the following:\n#\n\n#  conv-batchnorm-layer name=conv2 height-in=40 height-out=40 \\\n#      num-filters-out=64 height-offsets=-1,0,1 time-offsets=-1,0,1 \\\n#      required-time-offsets=0\n#  or (with NormalizeLayer instead of batch-norm, and with subsampling on the height axis):\n#  conv-renorm-layer name=conv3 height-in=40 height-out=20 \\\n#      height-subsample-out=2 num-filters-out=128 height-offsets=-1,0,1 \\\n#       time-offsets=-1,0,1 required-time-offsets=0\n#\n# You don't specify subsampling on the time axis explicitly, it's implicit\n# in the 'time-offsets' which are the same as the splicing indexes in a TDNN,\n# and which, unlike the height offsets, operate relative to a fixed clock,\n# so that after subsampling by a factor of 2, we'd expect all time-offsets\n# of subsequent layers to be a factor of 2.  
You don't specify the input\n# num-filters either; it's worked out from the input height and the input dim.\n#\n# The layer-name encodes the use (or not) of batch normalization, so that if you\n# want to skip batch normalization you could just call it 'conv-layer'.\n#\n# If batch-normalization is used, it's *spatial* batch-normalization, meaning\n# that the offset and scale is specific to the output filter, but shared across\n# all time and height offsets.\n#\n# Most of the configuration values mirror same-named values in class\n# TimeHeightConvolutionComponent, and for a deeper understanding of what's going\n# on you should look at the comment by its declaration, in\n# src/nnet3/nnet-convolutional-component.h.\n#\n# Parameters of the class, and their defaults if they have defaults:\n#\n#   input='[-1]'             Descriptor giving the input of the layer.\n#   height-in                The height of the input image, e.g. 40 if the input\n#                            is MFCCs.  The num-filters-in is worked out as\n#                            (dimension of input) / height-in.  If the preceding\n#                            layer is a convolutional layer, height-in should be\n#                            the same as the height-out of the preceding layer.\n#   height-subsample-out=1   The height subsampling factor, will be e.g. 2 if you\n#                            want to subsample by a factor of 2 on the height\n#                            axis.\n#   height-out               The height of the output image.  This will normally\n#                            be <= (height-in / height-subsample-out).\n#                            Zero-padding on the height axis may be implied by a\n#                            combination of this and height-offsets-in, e.g. 
if\n#                            height-out==height-in and height-subsample-out=1\n#                            and height-offsets=-2,-1,0,1 then we'd be padding\n#                            by 2 pixels on the bottom and 1 on the top; see\n#                            comments in nnet-convolutional-component.h for more\n#                            details.\n#   height-offsets           The offsets on the height axis that define which\n#                            inputs are required for each output pixel; will\n#                            often be something like -1,0,1 (if zero-padding\n#                            on height axis) or 0,1,2 otherwise.  These are\n#                            comparable to TDNN splicing offsets; e.g. if\n#                            height-offsets=-1,0,1 then height 10 at the output\n#                            would take input from heights 9,10,11 at the input.\n#   num-filters-out          The number of output filters.  The output dimension\n#                            of this layer is num-filters-out * height-out; the\n#                            filter dim varies the fastest (filter-stride == 1).\n#   time-offsets             The input offsets on the time axis; these are\n#                            interpreted just like the splicing indexes in TDNNs.\n#                            E.g. if time-offsets=-2,0,2 then time 100 at the\n#                            output would require times 98,100,102 at the input.\n#   required-time-offsets    The subset of 'time-offsets' that are required in\n#                            order to produce an output; if the set has fewer\n#                            elements than 'time-offsets' then it implies some\n#                            kind of zero-padding on the time axis is allowed.\n#                            Defaults to the same as 'time-offsets'.  
For speech\n#                            tasks we recommend not to set this, as the normal\n#                            padding approach is to pad with copies of the\n#                            first/last frame, which is handled automatically in\n#                            the calling code.\n#   target-rms=1.0           Only applicable if the layer type is\n#                            conv-batchnorm-layer or\n#                            conv-normalize-layer.  This will affect the\n#                            scaling of the output features (larger -> larger),\n#                            and sometimes we set target-rms=0.5 for the layer\n#                            prior to the final layer to make the final layer\n#                            train more slowly.\n#   self-repair-scale=2.0e-05  This affects the ReLu's.  It is a scale on the\n#                            'self-repair' mechanism that nudges the inputs to the\n#                            ReLUs into the appropriate range in cases where\n#                            the unit is active either too little of the time\n#                            (<10%) or too much of the time (>90%).\n#\n# The following initialization and natural-gradient related options are, if\n# provided, passed through to the config file; if not, they are left at the\n# defaults in the code.  
See nnet-convolutional-component.h for more information.\n#\n#  param-stddev, bias-stddev, max-change, learning-rate-factor (float)\n#  use-natural-gradient (bool)\n#  rank-in, rank-out    (int)\n#  num-minibatches-history (float)\n#  alpha-in, alpha-out (float)\n# the following is also passed into the convolution components, if specified:\n#  l2-regularize (float)\n\nclass XconfigConvLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        for operation in first_token.split('-')[:-1]:\n            assert operation in ['conv', 'renorm', 'batchnorm', 'relu',\n                                 'noconv', 'dropout', 'so']\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                       'height-in':-1,\n                       'height-subsample-out':1,\n                       'height-out':-1,\n                       'height-offsets':'',\n                       'num-filters-out':-1,\n                       'time-offsets':'',\n                       'required-time-offsets':'',\n                       'target-rms':1.0,\n                       'self-repair-scale': 2.0e-05,\n                       'self-repair-lower-threshold': 0.05,\n                       # the following are not really inspected by this level of\n                       # code, just passed through (but not if left at '').\n                       'param-stddev':'', 'bias-stddev':'',\n                       'max-change': 0.75, 'learning-rate-factor':'',\n                       'use-natural-gradient':'',\n                       'rank-in':'', 'rank-out':'', 'num-minibatches-history':'',\n                       'alpha-in':'', 'alpha-out':'', 'l2-regularize':'',\n                       'dropout-proportion': 0.5}\n\n    def set_derived_configs(self):\n        # sets 'num-filters-in'.\n        input_dim = self.descriptors['input']['dim']\n        
height_in = self.config['height-in']\n        if height_in <= 0:\n            raise RuntimeError(\"height-in must be specified\");\n        if input_dim % height_in != 0:\n            raise RuntimeError(\"Input dimension {0} is not a multiple of height-in={1}\".format(\n                input_dim, height_in))\n        self.config['num-filters-in'] = input_dim // height_in\n\n\n    # Check whether 'str' is a sorted, unique, nonempty list of integers, like -1,0,1.,\n    # returns true if so.\n    def check_offsets_var(self, str):\n        try:\n            a = [ int(x) for x in str.split(\",\") ]\n            if len(a) == 0:\n                return False\n            for i in range(len(a) - 1):\n                if a[i] >= a[i+1]:\n                    return False\n            return True\n        except:\n            return False\n\n    def check_configs(self):\n        # Do some basic checking of the configs.  The component-level code does\n        # some more thorough checking, but if you set the height-out too small it\n        # prints it as a warning, which the user may not see, so at a minimum we\n        # want to check for that here.\n        height_subsample_out = self.config['height-subsample-out']\n        height_in = self.config['height-in']\n        height_out = self.config['height-out']\n        if height_subsample_out <= 0:\n            raise RuntimeError(\"height-subsample-out has invalid value {0}.\".format(\n                height_subsample_out))\n        # we already checked height-in in set_derived_configs.\n        if height_out <= 0:\n            raise RuntimeError(\"height-out has invalid value {0}.\".format(\n                height_out))\n        if height_out * height_subsample_out > height_in:\n            raise RuntimeError(\"The combination height-in={0}, height-out={1} and \"\n                               \"height-subsample-out={2} does not look right \"\n                               \"(height-out too large).\".format(\n                
                   height_in, height_out, height_subsample_out))\n        height_offsets = self.config['height-offsets']\n        time_offsets = self.config['time-offsets']\n        required_time_offsets = self.config['required-time-offsets']\n\n        if 'noconv' not in self.layer_type.split('-'):\n            # only check height-offsets, time-offsets and required-time-offsets if there\n            # is actually a convolution in this layer.\n            if not self.check_offsets_var(height_offsets):\n                raise RuntimeError(\"height-offsets={0} is not valid\".format(height_offsets))\n            if not self.check_offsets_var(time_offsets):\n                raise RuntimeError(\"time-offsets={0} is not valid\".format(time_offsets))\n            if required_time_offsets != \"\" and not self.check_offsets_var(required_time_offsets):\n                raise RuntimeError(\"required-time-offsets={0} is not valid\".format(\n                    required_time_offsets))\n\n        if height_out * height_subsample_out < \\\n           height_in - len(height_offsets.split(',')):\n            raise RuntimeError(\"The combination height-in={0}, height-out={1} and \"\n                               \"height-subsample-out={2} and height-offsets={3} \"\n                               \"does not look right (height-out too small).\".format(\n                                   height_in, height_out, height_subsample_out,\n                                   height_offsets))\n\n        if self.config['target-rms'] <= 0.0:\n            raise RuntimeError(\"Config value target-rms={0} is not valid\".format(\n                self.config['target-rms']))\n\n    def auxiliary_outputs(self):\n        return []\n\n    def output_name(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        # note: the [:-1] is to remove the '-layer'.\n        operations = self.layer_type.split('-')[:-1]\n        if operations[-1] == 'noconv':\n            operations = operations[:-1]\n        assert len(operations) >= 1\n        last_operation = operations[-1]\n        assert last_operation in ['relu', 'conv', 
'renorm', 'batchnorm', 'dropout', 'so']\n        # we'll return something like 'layer1.batchnorm'.\n        return '{0}.{1}'.format(self.name, last_operation)\n\n    def output_dim(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        return self.config['num-filters-out'] * self.config['height-out']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_cnn_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in CNN initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the CNN config\n    def _generate_cnn_config(self):\n        configs = []\n\n        name = self.name\n\n        # These 3 variables will be updated as we add components.\n        cur_num_filters = self.config['num-filters-in']\n        cur_height = self.config['height-in']\n        cur_descriptor = self.descriptors['input']['final-string']\n\n        # note: the [:-1] is to remove the '-layer'.\n        operations = self.layer_type.split('-')[:-1]\n        if operations[-1] == 'noconv':\n            operations = operations[:-1]\n        # e.g.:\n        # operations = [ 'conv', 'relu', 'batchnorm' ]\n        # or:\n        # operations = [ 'relu', 'conv', 'renorm' ]\n\n        for operation in operations:\n            if operation == 'conv':\n                a = []\n                for opt_name in [\n                        'param-stddev', 'bias-stddev', 'use-natural-gradient',\n                        'max-change', 'rank-in', 'rank-out', 'num-minibatches-history',\n                        'alpha-in', 'alpha-out', 'num-filters-in', 'num-filters-out',\n                        'height-in','height-out', 'height-subsample-out',\n                        'height-offsets', 'time-offsets', 
'required-time-offsets',\n                        'learning-rate-factor', 'l2-regularize' ]:\n                    value = self.config[opt_name]\n                    if value != '':\n                        a.append('{0}={1}'.format(opt_name, value))\n                conv_opts = ' '.join(a)\n\n                configs.append('component name={0}.conv type=TimeHeightConvolutionComponent '\n                               '{1}'.format(name, conv_opts))\n                configs.append('component-node name={0}.conv component={0}.conv '\n                               'input={1}'.format(name, cur_descriptor))\n                cur_num_filters = self.config['num-filters-out']\n                cur_height = self.config['height-out']\n            elif operation == 'batchnorm':\n                configs.append('component name={0}.batchnorm  type=BatchNormComponent dim={1} '\n                               'block-dim={2} target-rms={3}'.format(\n                                   name, cur_num_filters * cur_height, cur_num_filters,\n                                   self.config['target-rms']))\n                configs.append('component-node name={0}.batchnorm component={0}.batchnorm '\n                               'input={1}'.format(name, cur_descriptor))\n            elif operation == 'renorm':\n                configs.append('component name={0}.renorm type=NormalizeComponent '\n                           'dim={1} target-rms={2}'.format(\n                               name, cur_num_filters * cur_height,\n                               self.config['target-rms']))\n                configs.append('component-node name={0}.renorm component={0}.renorm '\n                               'input={1}'.format(name, cur_descriptor))\n            elif operation == 'relu':\n                configs.append('component name={0}.relu type=RectifiedLinearComponent '\n                               'dim={1} block-dim={2} self-repair-scale={3} '\n                               
'self-repair-lower-threshold={4}'.format(\n                                   name, cur_num_filters * cur_height, cur_num_filters,\n                                   self.config['self-repair-scale'],\n                                   self.config['self-repair-lower-threshold']))\n                configs.append('component-node name={0}.relu component={0}.relu '\n                               'input={1}'.format(name, cur_descriptor))\n            elif operation == 'dropout':\n                configs.append('component name={0}.dropout type=DropoutComponent '\n                           'dim={1} dropout-proportion={2}'.format(\n                               name, cur_num_filters * cur_height,\n                               self.config['dropout-proportion']))\n                configs.append('component-node name={0}.dropout component={0}.dropout '\n                               'input={1}'.format(name, cur_descriptor))\n            elif operation == 'so':\n                configs.append('component name={0}.so type=ScaleAndOffsetComponent '\n                           'dim={1} block-dim={2}'.format(\n                               name, cur_num_filters * cur_height, cur_num_filters))\n                configs.append('component-node name={0}.so component={0}.so '\n                               'input={1}'.format(name, cur_descriptor))\n            else:\n                raise RuntimeError(\"Un-handled operation type: \" + operation)\n\n            cur_descriptor = '{0}.{1}'.format(name, operation)\n\n        return configs\n\n\n# This class is for lines like the following:\n#\n# res-block name=res1 num-filters=64 height=32 time-period=1\n#\n# It implements a residual block as in ResNets, with pre-activation, and with\n# some small differences-- basically, instead of adding the input to the output,\n# we put a convolutional layer in there but initialize it to the unit matrix and\n# if you want you can give it a relatively small (or even zero) learning rate\n# and 
max-change.  And there is batch-norm in that path also.\n#\n# The number of filters is the same on the input and output; it is actually\n# redundant to write it in the config file, because given that we know the\n# height, we can work it out from the dimension of the input (as dimension =\n# height * num-filters).  But we allow it to be specified anyway, for clarity.\n#\n# Note: the res-block does not support subsampling or changing the number of\n# filters.  If you want to do that, we recommend that you should do it with a\n# single relu-batchnorm-conv-layer.\n#\n# Here are the most important configuration values, with defaults shown if\n# defaults exist:\n#\n# input='[-1]'    Descriptor giving the input of the layer.\n# height          The input and output height of the image, e.g. 40.  Note: the width\n#                 is associated with the time dimension and is dealt with\n#                 implicitly, so it's not specified here.\n# num-filters     The number of filters on the input and output, e.g. 64.\n#                 It does not have to be specified; if it is not specified,\n#                 we work it out from the input dimension.\n# num-bottleneck-filters   If specified then this will be a 'bottleneck'\n#                 ResBlock, in which there is a 1x1 convolution from\n#                 num-filters->num-bottleneck-filters, a 3x3 convolution\n#                 from num-bottleneck-filters->num-bottleneck-filters, and\n#                 a 1x1 convolution from num-bottleneck-filters->num-filters.\n#\n# time-period=1   Think of this as the stride in the time dimension.  At the\n#                 input of the network will always have time-period=1; then\n#                 after subsampling once in time we'd have time-period=2; then\n#                 after subsampling again we'd have time-period=4.  
Because of\n#                 the way nnet3 works, subsampling on the time axis is an\n#                 implicit, not explicit, operation.\n# height-period=1  This will almost always be left at the default (1).  It is\n#                 analogous to time-period, but because the height, unlike the\n#                 time, is explicitly subsampled, in normal topologies this should\n#                 be left at 1.\n#\n# bypass-source=noop\n#                       The output of this component is Sum(convolution, x), and\n#                       this option controls what 'x' is.  There are 4 options\n#                       here: 'noop', 'input', 'relu' or 'batchnorm'.  'noop' is\n#                       equivalent to 'input' in what it computes; it just\n#                       inserts a 'noop' component in order to make the\n#                       computation more efficient.  For both 'noop' and\n#                       'input', x is the input to this component.  If\n#                       bypass-source=relu then we use the relu of the\n#                       input; if 'batchnorm', then we use the relu+batchnorm of\n#                       the input.\n# allow-zero-padding=true By default this will allow zero-padding in the time\n#                       dimension, meaning that you don't need extra frames at\n#                       the input to compute the output.  There may be ASR\n#                       applications where you want to pad in the time dimension\n#                       with repeats of the first or last frame (as we do for\n#                       TDNNs), where it would be appropriate to write\n#                       allow-zero-padding=false.  Note: the way we have\n#                       set it up, it does zero-padding on the height axis\n#                       regardless.\n#\n# Less important config variables:\n#  self-repair-scale=2.0e-05  This affects the ReLUs.  
It is a scale on the\n#                            'self-repair' mechanism that nudges the inputs to the\n#                            ReLUs into the appropriate range in cases where\n#                            the unit is active either too little of the time\n#                            (<10%) or too much of the time (>90%).\n#  max-change=0.75           Max-parameter-change constant (per minibatch)\n#                            used for convolutional components.\n#\n#\n# The following natural-gradient-related configuration variables are passed in\n# to the convolution components, if specified:\n#  use-natural-gradient (bool)\n#  rank-in, rank-out    (int)\n#  num-minibatches-history (float)\n#  alpha-in, alpha-out (float)\n# the following is also passed into the convolution components, if specified:\n#  l2-regularize (float)\n#\n\nclass XconfigResBlock(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == 'res-block'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                       'height':-1,\n                       'num-filters':-1,\n                       'num-bottleneck-filters':-1,\n                       'time-period':1,\n                       'height-period':1,\n                       'self-repair-scale': 2.0e-05,\n                       'self-repair-lower-threshold1': 0.05,\n                       'self-repair-lower-threshold2': 0.05,\n                       'self-repair-lower-threshold3': 0.05,\n                       'max-change': 0.75,\n                       'allow-zero-padding': True,\n                       'bypass-source' : 'noop',\n                       # the following are not really inspected by this level of\n                       # code, just passed through (but not if left at '').\n                       'param-stddev':'', 'bias-stddev':'',\n                
       'use-natural-gradient':'',\n                       'rank-in':'', 'rank-out':'',\n                       'num-minibatches-history':'',\n                       'alpha-in':'', 'alpha-out':'', 'l2-regularize':'' }\n\n    def set_derived_configs(self):\n        # set 'num-filters' or check it..\n        input_dim = self.descriptors['input']['dim']\n        height = self.config['height']\n\n        cur_num_filters = self.config['num-filters']\n        if cur_num_filters == -1:\n            if input_dim % height != 0:\n                raise RuntimeError(\"Specified image height {0} does not \"\n                                   \"divide the input dim {1}\".format(\n                                       height, input_dim))\n            # use integer division so that num-filters stays an int in Python 3.\n            self.config['num-filters'] = input_dim // height\n        elif input_dim != cur_num_filters * height:\n            raise RuntimeError(\"Expected the input-dim to equal \"\n                               \"height={0} * num-filters={1} = {2}, but \"\n                               \"it is {3}\".format(\n                                   height, cur_num_filters,\n                                   height * cur_num_filters,\n                                   input_dim))\n\n    def check_configs(self):\n        # we checked the dimensions in set_derived_configs.\n        if self.config['bypass-source'] not in [\n                'input', 'noop', 'relu', 'batchnorm' ]:\n            raise RuntimeError(\"Expected bypass-source to \"\n                               \"be input, noop, relu or batchnorm, got: {0}\".format(\n                                   self.config['bypass-source']))\n\n    def auxiliary_outputs(self):\n        return []\n\n    def output_name(self, auxiliary_output = None):\n        bypass_source = self.config['bypass-source']\n        b = self.config['num-bottleneck-filters']\n        conv = ('{0}.conv2' if b <= 0 else '{0}.conv3').format(self.name)\n        if bypass_source == 'input':\n            
residual = self.descriptors['input']['final-string']\n        elif bypass_source == 'noop':\n            # we let the noop be the sum of the convolutional part and the\n            # input, so just return the output of the no-op component.\n            return '{0}.noop'.format(self.name)\n        elif bypass_source == 'relu':\n            residual = '{0}.relu1'.format(self.name)\n        else:\n            assert bypass_source == 'batchnorm'\n            residual = '{0}.batchnorm1'.format(self.name)\n\n        return 'Sum({0}, {1})'.format(conv, residual)\n\n    def output_dim(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n        return input_dim\n\n    def get_full_config(self):\n        ans = []\n        b = self.config['num-bottleneck-filters']\n        if b <= 0:\n            config_lines = self._generate_normal_resblock_config()\n        else:\n            config_lines = self._generate_bottleneck_resblock_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in CNN initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # _generate_normal_resblock_config is a convenience function to generate the\n    # res-block config (the non-bottleneck version).\n    #\n    # The main path inside the res-block in the non-bottleneck case is as\n    # follows:\n    #\n    # input -> relu1 -> batchnorm1 -> conv1 -> relu2 -> batchnorm2 -> conv2\n    #\n    # We put the relu before the batchnorm because we think it makes more sense;\n    # because the Torch people seemed to find that this works better\n    # (https://github.com/gcr/torch-residual-networks/issues/5);\n    # and because in our batchnorm component we haven't implemented the beta and\n    # gamma; these would be essential to having it work 
before relu, but\n    # when before a convolution or linear component, they add no extra modeling\n    # power.\n    #\n    # The output of the res-block can be the sum of the last convolutional\n    # component (conv2), with the input.  However, the option ('bypass-source')\n    # controls whether we sum with the raw input, or its relu or relu+batchnorm.\n    # If the term is going to be the raw input, the 'noop' option caches\n    # the output sum via a NoOpComponent, because due to how nnet3\n    # works, if we didn't do this, redundant summing operations would take\n    # place.\n    def _generate_normal_resblock_config(self):\n        configs = []\n\n        name = self.name\n        num_filters = self.config['num-filters']\n        # get_full_config() dispatches here whenever num-bottleneck-filters <= 0.\n        assert self.config['num-bottleneck-filters'] <= 0\n        height = self.config['height']\n        input_descriptor = self.descriptors['input']['final-string']\n        allow_zero_padding = self.config['allow-zero-padding']\n        height_period = self.config['height-period']\n        time_period = self.config['time-period']\n\n        # input -> relu1 -> batchnorm1 -> conv1 -> relu2 -> batchnorm2 -> conv2\n        cur_descriptor = input_descriptor\n        for n in [1, 2]:\n            # the ReLU\n            configs.append('component name={0}.relu{1} type=RectifiedLinearComponent '\n                           'dim={2} block-dim={3} self-repair-scale={4} '\n                           'self-repair-lower-threshold={5}'.format(\n                               name, n, num_filters * height, num_filters,\n                               self.config['self-repair-scale'],\n                               self.config['self-repair-lower-threshold{0}'.format(n)]))\n            configs.append('component-node name={0}.relu{1} component={0}.relu{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n\n            cur_descriptor = '{0}.relu{1}'.format(name, n)\n\n            # the batch-norm\n           
 configs.append('component name={0}.batchnorm{1}  type=BatchNormComponent dim={2} '\n                               'block-dim={3}'.format(\n                                   name, n, num_filters * height,\n                                   num_filters))\n            configs.append('component-node name={0}.batchnorm{1} component={0}.batchnorm{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.batchnorm{1}'.format(name, n)\n\n\n            # the convolution.\n            a = []\n            for opt_name in [\n                    'param-stddev', 'bias-stddev', 'use-natural-gradient',\n                    'max-change', 'rank-in', 'rank-out', 'num-minibatches-history',\n                    'alpha-in', 'alpha-out', 'l2-regularize' ]:\n                value = self.config[opt_name]\n                if value != '':\n                        a.append('{0}={1}'.format(opt_name, value))\n            conv_opts = ('height-in={h} height-out={h} height-offsets=-{hp},0,{hp} '\n                         'time-offsets=-{p},0,{p} '\n                         'num-filters-in={f} num-filters-out={f} {r} {o}'.format(\n                             h=height, hp=height_period, p=time_period, f=num_filters,\n                             r=('required-time-offsets=0' if allow_zero_padding else ''),\n                             o=' '.join(a)))\n\n            configs.append('component name={0}.conv{1} type=TimeHeightConvolutionComponent '\n                           '{2}'.format(name, n, conv_opts))\n            configs.append('component-node name={0}.conv{1} component={0}.conv{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.conv{1}'.format(name, n)\n\n\n\n        if self.config['bypass-source'] == 'noop':\n            dim = self.descriptors['input']['dim']\n            configs.append('component name={0}.noop dim={1} type=NoOpComponent'.format(\n                name, 
dim))\n            configs.append('component-node name={0}.noop component={0}.noop '\n                           'input=Sum({1}, {0}.conv2)'.format(name,\n                                                              input_descriptor))\n\n        # Note: the function 'output_name' is responsible for returning the\n        # descriptor corresponding to the output of the network.\n        return configs\n\n\n\n    # _generate_bottleneck_resblock_config is a convenience function to generate the\n    # res-block config (this is the bottleneck version, where there is\n    # a 3x3 kernel with a smaller number of filters than at the input and output,\n    # sandwiched between two 1x1 kernels).\n    #\n    # The main path inside the res-block in the bottleneck case is as follows:\n    #\n    # input -> relu1 -> batchnorm1 -> conv1 -> relu2 -> batchnorm2 -> conv2 ->\n    #   relu3 -> batchnorm3 -> conv3\n    #\n    # The output of the res-block can be the sum of the last convolutional\n    # component (conv3), with the input.  However we give the option\n    # ('bypass-source') to sum with the raw input, or its relu or\n    # relu+batchnorm.  
If the term is going to be the raw input, the 'noop'\n    # option caches the output sum via a NoOpComponent, because\n    # due to how nnet3 works, if we didn't do this, redundant summing operations\n    # would take place.\n    def _generate_bottleneck_resblock_config(self):\n        configs = []\n\n        name = self.name\n        num_filters = self.config['num-filters']\n        num_bottleneck_filters = self.config['num-bottleneck-filters']\n        assert num_bottleneck_filters > 0\n        height = self.config['height']\n        input_descriptor = self.descriptors['input']['final-string']\n        allow_zero_padding = self.config['allow-zero-padding']\n        height_period = self.config['height-period']\n        time_period = self.config['time-period']\n\n        # input -> relu1 -> batchnorm1 -> conv1 -> relu2 -> batchnorm2 -> conv2 ->\n        #   relu3 -> batchnorm3 -> conv3\n        cur_descriptor = input_descriptor\n        cur_num_filters = num_filters\n\n        for n in [1, 2, 3]:\n            # the ReLU\n            configs.append('component name={0}.relu{1} type=RectifiedLinearComponent '\n                           'dim={2} block-dim={3} self-repair-scale={4} '\n                           'self-repair-lower-threshold={5}'.format(\n                               name, n, cur_num_filters * height, cur_num_filters,\n                               self.config['self-repair-scale'],\n                               self.config['self-repair-lower-threshold{0}'.format(n)]))\n            configs.append('component-node name={0}.relu{1} component={0}.relu{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n\n            cur_descriptor = '{0}.relu{1}'.format(name, n)\n\n            # the batch-norm\n            configs.append('component name={0}.batchnorm{1}  type=BatchNormComponent dim={2} '\n                               'block-dim={3}'.format(\n                                   name, n, cur_num_filters * height,\n                                   
cur_num_filters))\n            configs.append('component-node name={0}.batchnorm{1} component={0}.batchnorm{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.batchnorm{1}'.format(name, n)\n\n\n            # the convolution.\n            a = []\n            for opt_name in [\n                    'param-stddev', 'bias-stddev', 'use-natural-gradient',\n                    'max-change', 'rank-in', 'rank-out', 'num-minibatches-history',\n                    'alpha-in', 'alpha-out', 'l2-regularize' ]:\n                value = self.config[opt_name]\n                if value != '':\n                        a.append('{0}={1}'.format(opt_name, value))\n\n            height_offsets = ('-{hp},0,{hp}'.format(hp=height_period) if n == 2 else '0')\n            time_offsets = ('-{t},0,{t}'.format(t=time_period) if n == 2 else '0')\n            next_num_filters = (num_filters if n == 3 else num_bottleneck_filters)\n            conv_opts = ('height-in={h} height-out={h} height-offsets={ho} time-offsets={to} '\n                         'num-filters-in={fi} num-filters-out={fo} {r} {o}'.format(\n                             h=height, ho=height_offsets, to=time_offsets,\n                             fi=cur_num_filters, fo=next_num_filters,\n                             r=('required-time-offsets=0' if allow_zero_padding else ''),\n                             o=' '.join(a)))\n\n            configs.append('component name={0}.conv{1} type=TimeHeightConvolutionComponent '\n                           '{2}'.format(name, n, conv_opts))\n            configs.append('component-node name={0}.conv{1} component={0}.conv{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.conv{1}'.format(name, n)\n            cur_num_filters = next_num_filters\n\n\n        if self.config['bypass-source'] == 'noop':\n            dim = self.descriptors['input']['dim']\n            
configs.append('component name={0}.noop dim={1} type=NoOpComponent'.format(\n                name, dim))\n            configs.append('component-node name={0}.noop component={0}.noop '\n                           'input=Sum({1}, {0}.conv3)'.format(name,\n                                                              input_descriptor))\n\n        # Note: the function 'output_name' is responsible for returning the\n        # descriptor corresponding to the output of the network.\n        return configs\n\n\n# This class is for lines like the following:\n#\n# res2-block name=res1 num-filters=64 height=32 time-period=1\n#\n# It is a residual block with post-activations, which does not support\n# downsampling (strided convolution) or changing the number of filters;\n# for that, see res2-downsample-block.\n# It's a pretty standard res-block, more standard than \"res-block\" (XconfigResBlock).\n#\n# The number of filters is the same on the input and output; it is actually\n# redundant to write it in the config file, because given that we know the\n# height, we can work it out from the dimension of the input (as dimension =\n# height * num-filters).  But we allow it to be specified anyway, for clarity.\n#\n\n# Here are the most important configuration values, with defaults shown if\n# defaults exist:\n#\n# input='[-1]'    Descriptor giving the input of the layer.\n# height          The input and output height of the image, e.g. 40.  Note: the width\n#                 is associated with the time dimension and is dealt with\n#                 implicitly, so it's not specified here.\n# num-filters     The number of filters on the input and output, e.g. 
64.\n#                 It does not have to be specified; if it is not specified,\n#                 we work it out from the input dimension.\n# num-bottleneck-filters   If specified then this will be a 'bottleneck'\n#                 ResBlock, in which there is a 1x1 convolution from\n#                 num-filters->num-bottleneck-filters, a 3x3 convolution\n#                 from num-bottleneck-filters->num-bottleneck-filters, and\n#                 a 1x1 convolution from num-bottleneck-filters->num-filters.\n# time-period=1   Think of this as the stride in the time dimension.  At the\n#                 input of the network will always have time-period=1; then\n#                 after subsampling once in time we'd have time-period=2; then\n#                 after subsampling again we'd have time-period=4.  Because of\n#                 the way nnet3 works, subsampling on the time axis is an\n#                 implicit, not explicit, operation.\n# allow-zero-padding=true By default this will allow zero-padding in the time\n#                       dimension, meaning that you don't need extra frames at\n#                       the input to compute the output.  There may be ASR\n#                       applications where you want to pad in the time dimension\n#                       with repeats of the first or last frame (as we do for\n#                       TDNNs), where it would be appropriate to write\n#                       allow-zero-padding=false.  Note: the way we have\n#                       set it up, it does zero-padding on the height axis\n#                       regardless\n#\n# Less important config variables:\n#  self-repair-scale=2.0e-05  This affects the ReLu's.  
It is a scale on the\n#                            'self-repair' mechanism that nudges the inputs to the\n#                            ReLUs into the appropriate range in cases where\n#                            the unit is active either too little of the time\n#                            (<10%) or too much of the time (>90%).\n#  max-change=0.75           Max-parameter-change constant (per minibatch)\n#                            used for convolutional components.\n#\n#\n# The following natural-gradient-related configuration variables are passed in\n# to the convolution components, if specified:\n#  use-natural-gradient (bool)\n#  rank-in, rank-out    (int)\n#  num-minibatches-history (float)\n#  alpha-in, alpha-out (float)\n# the following is also passed into the convolution components, if specified:\n#  l2-regularize (float)\n\nclass XconfigRes2Block(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == 'res2-block'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                       'height':-1,  # sets height-in and height-out\n                       'height-in':-1,\n                       'height-out':-1,\n                       'num-filters':-1, # interpreted as num-filters-out.\n                       'num-bottleneck-filters':-1,\n                       'time-period':1,\n                       'self-repair-scale': 2.0e-05,\n                       'self-repair-lower-threshold1': 0.05,\n                       'self-repair-lower-threshold2': 0.05,\n                       'self-repair-lower-threshold3': 0.05,\n                       'max-change': 0.75,\n                       'allow-zero-padding': True,\n                       # the following are not really inspected by this level of\n                       # code, just passed through (but not if left at '').\n                       
'param-stddev':'', 'bias-stddev':'',\n                       'use-natural-gradient':'',\n                       'rank-in':'', 'rank-out':'',\n                       'num-minibatches-history':'',\n                       'alpha-in':'', 'alpha-out':'',\n                       'l2-regularize':'' }\n\n    def set_derived_configs(self):\n        input_dim = self.descriptors['input']['dim']\n\n        if not ((self.config['height'] > 0  and self.config['height-in'] == -1 and\n                 self.config['height-out'] == -1) or\n                (self.config['height-out'] > 0 and self.config['height-in'] > 0)):\n            raise RuntimeError(\"You must specify height, or height-in and height-out, for res2-block.\")\n\n        if not (self.config['height-in'] > 0 and self.config['height-out'] > 0):\n            height = self.config['height']\n            if not height > 0:\n                raise RuntimeError(\"You must specify either height, or height-in and height-out, for \"\n                                   \"res2-block.\")\n            self.config['height-in'] = height\n            self.config['height-out'] = height\n\n        height_in = self.config['height-in']\n        if input_dim % height_in != 0:\n            raise RuntimeError(\"Specified input image height {0} does not \"\n                                   \"divide the input dim {1}\".format(\n                                       height_in, input_dim))\n\n    def check_configs(self):\n        if self.config['num-filters'] == -1:\n            raise RuntimeError(\"You must specify num-filters for res2-block.\")\n\n    def auxiliary_outputs(self):\n        return []\n\n    def output_name(self, auxiliary_output = None):\n        b = self.config['num-bottleneck-filters']\n        return ('{0}.relu2' if b <= 0 else '{0}.relu3').format(self.name)\n\n    def output_dim(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        
return self.config['height-out'] * self.config['num-filters']\n\n    def get_full_config(self):\n        ans = []\n        b = self.config['num-bottleneck-filters']\n        if b <= 0:\n            config_lines = self._generate_normal_resblock_config()\n        else:\n            config_lines = self._generate_bottleneck_resblock_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in CNN initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # _generate_normal_resblock_config is a convenience function to generate the\n    # res-block config (the non-bottleneck version).\n    #\n    # The main path inside the res-block in the non-bottleneck case is as\n    # follows:\n    #\n    # input -> conv1 -> batchnorm1 -> scaleoffset1 -> relu1 -> conv2 -> batchnorm2 -> scaleoffset2 -> relu2\n    #\n    # where the 'scaleoffsetN' are ScaleAndOffsetComponent, which conventionally would be\n    # considered part of the BatchNorm.\n    #\n    # The relu2 actually sees the sum of the input and 'scaleoffset2', which gives us the bypass\n    # connection.\n    def _generate_normal_resblock_config(self):\n        configs = []\n        name = self.name\n        # get_full_config() dispatches here whenever num-bottleneck-filters <= 0.\n        assert self.config['num-bottleneck-filters'] <= 0\n        input_dim = self.descriptors['input']['dim']\n        height_in = self.config['height-in']\n        height_out = self.config['height-out']\n        time_period_out = self.config['time-period']\n        if not input_dim % height_in == 0:\n            raise RuntimeError(\"height-in {1} does not divide input-dim {0}\".format(\n                input_dim, height_in))\n        # integer division, so that the value is written to the config as an int.\n        num_filters_in = input_dim // height_in\n        num_filters_out = self.config['num-filters']\n\n        if height_out != height_in:\n            if height_out < height_in / 2 - 1 or height_out > 
height_in / 2 + 1:\n                raise RuntimeError(\"Expected height-out to be about half height-in, or the same: \"\n                                   \"height-in={0} height-out={1}\".format(height_in, height_out))\n            if not time_period_out % 2 == 0:\n                raise RuntimeError(\"Expected time-period to be a multiple of 2 if you are subsampling \"\n                                   \"on height.\")\n            # integer division, so that time-offsets are written as ints.\n            time_period_in = time_period_out // 2\n            height_subsample = 2\n        else:\n            time_period_in = time_period_out\n            height_subsample = 1\n\n\n        cur_time_period = time_period_in\n        cur_num_filters = num_filters_in\n        cur_height = height_in\n\n        input_descriptor = self.descriptors['input']['final-string']\n        allow_zero_padding = self.config['allow-zero-padding']\n        if height_subsample == 1 and num_filters_in == num_filters_out:\n            bypass_descriptor = input_descriptor\n        else:\n            bypass_descriptor = '{0}.conv_bypass'.format(name)\n\n        cur_descriptor = input_descriptor\n\n        # get miscellaneous convolution options passed in from the xconfig line\n        a = []\n        for opt_name in [\n                'param-stddev', 'bias-stddev', 'use-natural-gradient',\n                'max-change', 'rank-in', 'rank-out', 'num-minibatches-history',\n                'alpha-in', 'alpha-out', 'l2-regularize' ]:\n            value = self.config[opt_name]\n            if value != '':\n                a.append('{0}={1}'.format(opt_name, value))\n        misc_conv_opts = ' '.join(a)\n\n        for n in [1, 2]:\n            # the convolution.\n            conv_opts = ('height-in={hi} height-out={ho} height-offsets=-1,0,1 '\n                         'height-subsample-out={hs} '\n                         'time-offsets=-{p},0,{p} '\n                         'num-filters-in={fi} num-filters-out={fo} {r} {o}'.format(\n                             hi=cur_height, 
ho=height_out,\n                             p=cur_time_period,\n                             hs=(height_subsample if n == 1 else 1),\n                             fi=cur_num_filters,\n                             fo=num_filters_out,\n                             r=('required-time-offsets=0' if allow_zero_padding else ''),\n                             o=misc_conv_opts))\n\n            configs.append('component name={0}.conv{1} type=TimeHeightConvolutionComponent '\n                           '{2}'.format(name, n, conv_opts))\n            configs.append('component-node name={0}.conv{1} component={0}.conv{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.conv{1}'.format(name, n)\n\n            cur_num_filters = num_filters_out\n            cur_height = height_out\n            cur_time_period = time_period_out\n\n            # the batch-norm\n            configs.append('component name={0}.batchnorm{1}  type=BatchNormComponent dim={2} '\n                               'block-dim={3}'.format(\n                                   name, n, cur_num_filters * cur_height,\n                                   cur_num_filters))\n            configs.append('component-node name={0}.batchnorm{1} component={0}.batchnorm{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.batchnorm{1}'.format(name, n)\n\n            # the scale-and-offset\n            configs.append('component name={0}.scaleoffset{1}  type=ScaleAndOffsetComponent dim={2} '\n                               'block-dim={3}'.format(\n                                   name, n, cur_num_filters * cur_height,\n                                   cur_num_filters))\n            configs.append('component-node name={0}.scaleoffset{1} component={0}.scaleoffset{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.scaleoffset{1}'.format(name, n)\n\n\n 
           if n == 2:\n                # the bypass connection\n                cur_descriptor = 'Sum({0}, {1})'.format(cur_descriptor, bypass_descriptor)\n\n\n            # the ReLU\n            configs.append('component name={0}.relu{1} type=RectifiedLinearComponent '\n                           'dim={2} block-dim={3} self-repair-scale={4} '\n                           'self-repair-lower-threshold={5}'.format(\n                               name, n, cur_num_filters * cur_height, cur_num_filters,\n                               self.config['self-repair-scale'],\n                               self.config['self-repair-lower-threshold{0}'.format(n)]))\n            configs.append('component-node name={0}.relu{1} component={0}.relu{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n\n            cur_descriptor = '{0}.relu{1}'.format(name, n)\n\n        if bypass_descriptor != input_descriptor:\n            # We need to add the 1x1 bypass convolution because we're either doing height\n            # subsampling or changing the number of filters.\n            conv_opts = ('height-in={hi} height-out={ho} height-offsets=0 '\n                         'time-offsets=0 height-subsample-out={hs} '\n                         'num-filters-in={fi} num-filters-out={fo} {o}'.format(\n                             hi=height_in, ho=height_out, hs=height_subsample,\n                             fi=num_filters_in, fo=num_filters_out, o=misc_conv_opts))\n            configs.append('component name={0}.conv_bypass type=TimeHeightConvolutionComponent '\n                           '{1}'.format(name, conv_opts))\n            configs.append('component-node name={0}.conv_bypass component={0}.conv_bypass '\n                           'input={1}'.format(name, input_descriptor))\n\n\n\n        # Note: the function 'output_name' is responsible for returning the\n        # descriptor corresponding to the output of the network, which in\n        # this case would be 
'{0}.relu2'.format(name).\n        return configs\n\n\n    # _generate_bottleneck_resblock_config is a convenience function to generate the\n    # res-block config (this is the bottleneck version, where there is\n    # a 3x3 kernel with a smaller number of filters than at the input and output,\n    # sandwiched between two 1x1 kernels).\n    #\n    # The main path inside the res-block in the bottleneck case is as follows:\n    #\n    # input -> conv1 -> batchnorm1 -> scaleoffset1 -> relu1 ->\n    #          conv2 -> batchnorm2 -> scaleoffset2 -> relu2 ->\n    #          conv3 -> batchnorm3 -> scaleoffset3 -> relu3\n    #\n    #  but the relu3 takes as its input the sum of 'input' and 'scaleoffset3'.\n    #\n    def _generate_bottleneck_resblock_config(self):\n        configs = []\n\n        name = self.name\n        num_bottleneck_filters = self.config['num-bottleneck-filters']\n        assert num_bottleneck_filters > 0\n        input_dim = self.descriptors['input']['dim']\n        height_in = self.config['height-in']\n        height_out = self.config['height-out']\n        input_descriptor = self.descriptors['input']['final-string']\n        allow_zero_padding = self.config['allow-zero-padding']\n        time_period_out = self.config['time-period']\n        if not input_dim % height_in == 0:\n            raise RuntimeError(\"height-in={1} does not divide input-dim={0}\".format(\n                input_dim, height_in))\n        # integer division, so that the value is written to the config as an int.\n        num_filters_in = input_dim // height_in\n        num_filters_out = self.config['num-filters']\n\n        if height_out != height_in:\n            if height_out < height_in / 2 - 1 or height_out > height_in / 2 + 1:\n                raise RuntimeError(\"Expected height-out to be about half height-in, or the same: \"\n                                   \"height-in={0} height-out={1}\".format(height_in, height_out))\n            height_subsample = 2\n        else:\n            height_subsample = 1\n\n        cur_descriptor = input_descriptor\n     
   cur_num_filters = num_filters_in\n        cur_height = height_in\n        if height_subsample == 1 and num_filters_in == num_filters_out:\n            bypass_descriptor = input_descriptor\n        else:\n            bypass_descriptor = '{0}.conv_bypass'.format(name)\n\n        # get miscellaneous convolution options passed in from the xconfig line\n        a = []\n        for opt_name in [\n                'param-stddev', 'bias-stddev', 'use-natural-gradient',\n                'max-change', 'rank-in', 'rank-out', 'num-minibatches-history',\n                'alpha-in', 'alpha-out', 'l2-regularize' ]:\n            value = self.config[opt_name]\n            if value != '':\n                a.append('{0}={1}'.format(opt_name, value))\n        misc_conv_opts = ' '.join(a)\n\n\n        for n in [1, 2, 3]:\n            # the convolution.\n            height_offsets = ('-1,0,1' if n == 2 else '0')\n            this_height_subsample = height_subsample if n == 1 else 1\n            time_offsets = ('-{t},0,{t}'.format(t=time_period_out) if n == 2 else '0')\n            next_num_filters = (num_filters_out if n == 3 else num_bottleneck_filters)\n\n            conv_opts = ('height-in={h_in} height-out={h_out} height-offsets={ho} time-offsets={to} '\n                         'num-filters-in={fi} num-filters-out={fo} height-subsample-out={hs} '\n                         '{r} {o}'.format(\n                             h_in=cur_height, h_out=height_out,\n                             to=time_offsets, ho=height_offsets,\n                             hs=this_height_subsample,\n                             fi=cur_num_filters, fo=next_num_filters,\n                             r=('required-time-offsets=0' if allow_zero_padding else ''),\n                             o=misc_conv_opts))\n\n            configs.append('component name={0}.conv{1} type=TimeHeightConvolutionComponent '\n                           '{2}'.format(name, n, conv_opts))\n            configs.append('component-node 
name={0}.conv{1} component={0}.conv{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n\n            cur_num_filters = next_num_filters\n            cur_height = height_out\n            cur_descriptor = '{0}.conv{1}'.format(name, n)\n\n            # the batch-norm\n            configs.append('component name={0}.batchnorm{1}  type=BatchNormComponent dim={2} '\n                               'block-dim={3}'.format(\n                                   name, n, cur_num_filters * cur_height,\n                                   cur_num_filters))\n            configs.append('component-node name={0}.batchnorm{1} component={0}.batchnorm{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.batchnorm{1}'.format(name, n)\n\n            # the scale and offset\n            configs.append('component name={0}.scaleoffset{1}  type=ScaleAndOffsetComponent dim={2} '\n                               'block-dim={3}'.format(\n                                   name, n, cur_num_filters * cur_height,\n                                   cur_num_filters))\n            configs.append('component-node name={0}.scaleoffset{1} component={0}.scaleoffset{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n            cur_descriptor = '{0}.scaleoffset{1}'.format(name, n)\n\n            if n == 3:\n                # the bypass connection\n                cur_descriptor = 'Sum({0}, {1})'.format(cur_descriptor, bypass_descriptor)\n\n            # the ReLU\n            configs.append('component name={0}.relu{1} type=RectifiedLinearComponent '\n                           'dim={2} block-dim={3} self-repair-scale={4} '\n                           'self-repair-lower-threshold={5}'.format(\n                               name, n, cur_num_filters * cur_height, cur_num_filters,\n                               self.config['self-repair-scale'],\n                               
self.config['self-repair-lower-threshold{0}'.format(n)]))\n            configs.append('component-node name={0}.relu{1} component={0}.relu{1} '\n                           'input={2}'.format(name, n, cur_descriptor))\n\n            cur_descriptor = '{0}.relu{1}'.format(name, n)\n\n        if bypass_descriptor != input_descriptor:\n            # We need to add the 1x1 bypass convolution because we're either doing height\n            # subsampling or changing the number of filters.\n            conv_opts = ('height-in={hi} height-out={ho} height-offsets=0 '\n                         'time-offsets=0 height-subsample-out={hs} '\n                         'num-filters-in={fi} num-filters-out={fo} {o}'.format(\n                             hi=height_in, ho=height_out, hs=height_subsample,\n                             fi=num_filters_in, fo=num_filters_out, o=misc_conv_opts))\n            configs.append('component name={0}.conv_bypass type=TimeHeightConvolutionComponent '\n                           '{1}'.format(name, conv_opts))\n            configs.append('component-node name={0}.conv_bypass component={0}.conv_bypass '\n                           'input={1}'.format(name, input_descriptor))\n\n        # Note: the function 'output_name' is responsible for returning the\n        # descriptor corresponding to the output of the network, which\n        # in this case will be '{0}.relu3'.format(name).\n        return configs\n\n\n# This layer just maps to a single component, a SumBlockComponent.  It's for\n# doing channel averaging at the end of neural networks.  See scripts for\n# examples of how to use it.\n# An example line using this layer is:\n# channel-average-layer name=channel-average input=Append(2, 4, 6, 8) dim=64\n\n# the configuration value 'dim' is the output dimension of this layer.\n# The input dimension is expected to be a multiple of 'dim'.  
The output\n# will be the average of 'dim'-sized blocks of the input.\nclass ChannelAverageLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"channel-average-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                       'dim': -1 }\n\n    def set_derived_configs(self):\n        pass\n\n    def check_configs(self):\n        input_dim = self.descriptors['input']['dim']\n        dim = self.config['dim']\n        if dim <= 0:\n            raise RuntimeError(\"dim must be specified and > 0.\")\n        if input_dim % dim != 0:\n            raise RuntimeError(\"input-dim={0} is not a multiple of dim={1}\".format(\n                input_dim, dim))\n\n    def auxiliary_outputs(self):\n        return []\n\n    def output_name(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        return self.config['dim']\n\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_channel_average_config()\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_channel_average_config(self):\n        configs = []\n        name = self.name\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        dim = self.config['dim']\n        # choose the scale that makes it an average rather than a sum.\n        scale = dim * 1.0 / input_dim\n        configs.append('component name={0} type=SumBlockComponent input-dim={1} '\n                       'output-dim={2} scale={3}'.format(name, input_dim,\n                                       
                  dim, scale))\n        configs.append('component-node name={0} component={0} input={1}'.format(\n            name, input_descriptor))\n        return configs\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/gru.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2017    Gaofeng Cheng (UCAS)\n#           2017    Lu Huang (THU)\n#           2018    Hang Lyu\n# Apache 2.0.\n\n\n\"\"\" This module has the implementations of different GRU layers.\n\"\"\"\nfrom __future__ import print_function\nimport math\nimport re\nimport sys\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n# This class is for lines like\n#   'gru-layer name=gru1 input=[-1] delay=-3'\n# It generates an GRU sub-graph without output projections.\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n# decay-time is deprecated under GRU or PGRU, as I found the PGRUs do not need the decay-time option to get generalized to unseen sequence length\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   delay=-1                 [Delay in the recurrent connections of the GRU/LSTM ]\n#   clipping-threshold=30    [similar to LSTMs ,nnet3 GRUs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''     [Additional options used for the diagonal matrices in the GRU/LSTM ]\n#   ng-affine-options=''              
  [Additional options used for the full matrices in the GRU/LSTM, can be used to do things like set biases to initialize to 1]\nclass XconfigGruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"gru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0\n                        }\n\n    def set_derived_configs(self):\n        if self.config['cell-dim'] <= 0:\n            self.config['cell-dim'] = self.descriptors['input']['dim']\n\n    def check_configs(self):\n        key = 'cell-dim'\n        if self.config['cell-dim'] <= 0:\n            raise RuntimeError(\"cell-dim has invalid value {0}.\".format(self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(key, self.config[key]))\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 's_t'\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        return self.config['cell-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_gru_config()\n\n        for line in config_lines:\n       
     for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the GRU config\n    def generate_gru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        delay = self.config['delay']\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'], abs(delay)))\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        affine_str = self.config['ng-affine-options']\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        ng_per_element_scale_options = self.config['ng-per-element-scale-options']\n        if re.search('param-mean', ng_per_element_scale_options) is None and \\\n           re.search('param-stddev', ng_per_element_scale_options) is None:\n           ng_per_element_scale_options += \" param-mean=0.0 param-stddev=1.0 \"\n        pes_str = ng_per_element_scale_options\n\n        # formulation like:\n        # z_t = \\sigmoid ( x_t * U^z + h_{t-1} * W^z ) // update gate\n    
    # r_t = \\sigmoid ( x_t * U^r + h_{t-1} * W^r ) // reset gate\n        # \\tilde{h}_t = \\tanh ( x_t * U^h + ( h_{t-1} \\dot r_t ) * W^h )\n        # h_t = ( 1 - z_t ) \\dot \\tilde{h}_t + z_t \\dot h_{t-1}\n        # y_t = h_t // y_t is the output\n\n        configs = []\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xs_z type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + cell_dim, cell_dim, affine_str))\n\n        configs.append(\"# Reset gate control : W_r* matrices\")\n        configs.append(\"component name={0}.W_z.xs_r type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + cell_dim, cell_dim, affine_str))\n\n        configs.append(\"# h related matrix : W_h* matrices\")\n        configs.append(\"component name={0}.W_h.UW type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + cell_dim, cell_dim, affine_str))\n\n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.r type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n        configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.h1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, 
cell_dim))\n        configs.append(\"component name={0}.y type=NoOpComponent dim={1}\".format(name, cell_dim))\n\n        recurrent_connection = '{0}.s_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs_z input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n\n        configs.append(\"# r_t\")\n        configs.append(\"component-node name={0}.r_t_pre component={0}.W_z.xs_r input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.r_t component={0}.r input={0}.r_t_pre\".format(name))\n        \n        configs.append(\"# h_t\")\n        configs.append(\"component-node name={0}.h1_t component={0}.h1 input=Append({0}.r_t, IfDefined(Offset({1}, {2})))\".format(name, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.h_t_pre component={0}.W_h.UW input=Append({1}, {0}.h1_t)\".format(name, input_descriptor))\n        configs.append(\"component-node name={0}.h_t component={0}.h input={0}.h_t_pre\".format(name))\n        \n        configs.append(\"# y_t\")\n        configs.append(\"# The following two lines are to implement (1 - z_t)\")\n        configs.append(\"component-node name={0}.y1_t component={0}.y1 input=Append({0}.h_t, Sum(Scale(-1.0,{0}.z_t), Const(1.0, {1})))\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.y2_t component={0}.y2 input=Append(IfDefined(Offset({1}, {2})), {0}.z_t)\".format(name, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.y_t component={0}.y input=Sum({0}.y1_t, {0}.y2_t)\".format(name))\n\n        configs.append(\"# s_t : recurrence\")\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent 
dim={1} {2}\".format(name, cell_dim, bptrunc_str))\n\n        configs.append(\"# s_t will be output and recurrence\")\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.y_t\".format(name))\n        return configs\n\n\n# This class is for lines like\n#   'pgru-layer name=pgru1 input=[-1] delay=-3'\n# It generates a PGRU sub-graph with output projections. It can also generate\n# outputs without projection, but you could use the XconfigGruLayer for this\n# simple RNN.\n# The cell dimension must be specified via 'cell-dim=xxx'; unlike gru-layer,\n# it does not default to the input dimension.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   recurrent-projection-dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. 
cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRUs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent,\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=' max-change=0.75 '   [Additional options used for the diagonal matrices in the GRU ]\n#   ng-affine-options=' max-change=0.75 '              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\n\nclass XconfigPgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"pgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n             
           'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0}: {1} has invalid value {2}.\"\n                                   .format(self.layer_type, key,\n                                           self.config[key]))\n\n    def auxiliary_outputs(self):\n        return ['h_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'sn_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def 
output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'h_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in GRU initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the PGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" 
recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n\n        # formulation like:\n        # z_t = \\sigmoid ( x_t * U^z + s_{t-1} * W^z ) // update gate\n        # r_t = \\sigmoid ( x_t * U^r + s_{t-1} * W^r ) // reset gate\n        # \\tilde{h}_t = \\tanh ( x_t * U^h + ( s_{t-1} \\dot r_t ) * W^h )\n        # h_t = ( 1 - z_t ) \\dot \\tilde{h}_t + z_t \\dot h_{t-1}\n        # y_t = h_t * W^y\n        # s_t = y_t (0:rec_proj_dim-1)\n        \n        configs = []\n        configs.append(\"# Update gate control : W_z* matrics\")\n        configs.append(\"component name={0}.W_z.xs_z type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        \n        configs.append(\"# Reset gate control : W_r* matrics\")\n        configs.append(\"component name={0}.W_z.xs_r type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, rec_proj_dim, affine_str))\n\n        configs.append(\"# h related matrix : W_h* matrics\")\n        configs.append(\"component name={0}.W_h.UW type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim , affine_str))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z 
type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.r type=SigmoidComponent dim={1} {2}\".format(name, rec_proj_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n        configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.h1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * rec_proj_dim, rec_proj_dim))\n        configs.append(\"component name={0}.y1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y type=NoOpComponent dim={1}\".format(name, cell_dim))\n\n        recurrent_connection = '{0}.s_t'.format(name)\n        recurrent_connection_y = '{0}.y_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs_z input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n\n        configs.append(\"# r_t\")\n        configs.append(\"component-node name={0}.r_t_pre component={0}.W_z.xs_r input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.r_t component={0}.r input={0}.r_t_pre\".format(name))\n\n        configs.append(\"# h_t\")\n        configs.append(\"component-node name={0}.h1_t component={0}.h1 input=Append({0}.r_t, IfDefined(Offset({1}, {2})))\".format(name, recurrent_connection, delay))\n        configs.append(\"component-node 
name={0}.h_t_pre component={0}.W_h.UW input=Append({1}, {0}.h1_t)\".format(name, input_descriptor))\n        configs.append(\"component-node name={0}.h_t component={0}.h input={0}.h_t_pre\".format(name))\n\n        configs.append(\"component-node name={0}.y1_t component={0}.y1 input=Append({0}.h_t, Sum(Scale(-1.0,{0}.z_t), Const(1.0, {1})))\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.y2_t component={0}.y2 input=Append(IfDefined(Offset({1}, {2})), {0}.z_t)\".format(name, recurrent_connection_y, delay))\n        \n        configs.append(\"component-node name={0}.y_t component={0}.y input=Sum({0}.y1_t, {0}.y2_t)\".format(name))\n\n        configs.append(\"# s_t recurrent\")\n        configs.append(\"component name={0}.W_s.ys type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n\n        configs.append(\"# s_t and n_t : sn_t will be the output\")\n        configs.append(\"component-node name={0}.sn_t component={0}.W_s.ys input={0}.y_t\".format(name))\n        configs.append(\"dim-range-node name={0}.s_t_preclip input-node={0}.sn_t dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_preclip\".format(name))\n\n        return configs\n\n\n# This class is for lines like\n#   'norm-pgru-layer name=norm-pgru1 input=[-1] delay=-3'\n\n# Different from the vanilla PGRU, the NormPGRU uses batchnorm in the forward direction\n# and renorm in the recurrence.\n\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the 
input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection-dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\n\nclass XconfigNormPgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"norm-pgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                            
                 # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        'dropout-proportion' : -1.0, # If -1.0, no dropout components will be added\n                        'dropout-per-frame' : True # If False, regular dropout, not per frame.\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0}: {1} has invalid value {2}.\"\n                                   .format(self.layer_type, key,\n            
                               self.config[key]))\n        if ((self.config['dropout-proportion'] > 1.0 or\n             self.config['dropout-proportion'] < 0.0) and\n             self.config['dropout-proportion'] != -1.0 ):\n             raise RuntimeError(\"dropout-proportion has invalid value {0}.\"\n                                .format(self.config['dropout-proportion']))\n\n    def auxiliary_outputs(self):\n        return ['h_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'sn_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'h_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in GRU initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the Norm-PGRU config\n    def generate_pgru_config(self):\n\n        
# assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n        dropout_proportion = self.config['dropout-proportion']\n        dropout_per_frame = 'true' if self.config['dropout-per-frame'] else 'false' \n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n\n        # formulation like:\n        # z_t = \\sigmoid ( x_t * U^z + s_{t-1} * W^z ) // update gate\n        # r_t = \\sigmoid ( x_t * U^r + s_{t-1} * W^r ) // reset gate\n        # \\tilde{h}_t = \\tanh ( x_t * U^h + ( s_{t-1} \\dot r_t ) * W^h )\n        # h_t = ( 1 - z_t ) 
\\dot \\tilde{h}_t + z_t \\dot h_{t-1}\n        # y_t_tmp = h_t * W^y\n        # s_t = renorm ( y_t_tmp (0:rec_proj_dim-1) )\n        # y_t = batchnorm ( y_t_tmp )\n        \n        configs = []\n        configs.append(\"# Update gate control : W_z* matrics\")\n        configs.append(\"component name={0}.W_z.xs_z type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        \n        configs.append(\"# Reset gate control : W_r* matrics\")\n        configs.append(\"component name={0}.W_z.xs_r type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, rec_proj_dim, affine_str))\n\n        configs.append(\"# h related matrix : W_h* matrics\")\n        configs.append(\"component name={0}.W_h.UW type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim , affine_str))\n        \n        if dropout_proportion != -1.0:\n            configs.append(\"component name={0}.dropout_z type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, cell_dim, dropout_proportion, dropout_per_frame))\n            configs.append(\"component name={0}.dropout_r type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, rec_proj_dim, dropout_proportion, dropout_per_frame))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.r type=SigmoidComponent dim={1} {2}\".format(name, rec_proj_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n      
  configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.h1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * rec_proj_dim, rec_proj_dim))\n        configs.append(\"component name={0}.y1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y type=NoOpComponent dim={1}\".format(name, cell_dim))\n\n        recurrent_connection = '{0}.s_t'.format(name)\n        recurrent_connection_y = '{0}.y_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs_z input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.z_predrop_t component={0}.z input={0}.z_t_pre\".format(name))\n            configs.append(\"component-node name={0}.z_t component={0}.dropout_z input={0}.z_predrop_t\".format(name))\n        else:\n            configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name, input_descriptor, recurrent_connection, delay))\n\n        configs.append(\"# r_t\")\n        configs.append(\"component-node name={0}.r_t_pre component={0}.W_z.xs_r input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.r_predrop_t component={0}.r input={0}.r_t_pre\".format(name))\n            configs.append(\"component-node name={0}.r_t component={0}.dropout_r input={0}.r_predrop_t\".format(name))            \n        else:\n            configs.append(\"component-node 
name={0}.r_t component={0}.r input={0}.r_t_pre\".format(name))\n\n        configs.append(\"# h_t\")\n        configs.append(\"component-node name={0}.h1_t component={0}.h1 input=Append({0}.r_t, IfDefined(Offset({1}, {2})))\".format(name, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.h_t_pre component={0}.W_h.UW input=Append({1}, {0}.h1_t)\".format(name, input_descriptor))\n        configs.append(\"component-node name={0}.h_t component={0}.h input={0}.h_t_pre\".format(name))\n\n        configs.append(\"component-node name={0}.y1_t component={0}.y1 input=Append({0}.h_t, Sum(Scale(-1.0,{0}.z_t), Const(1.0, {1})))\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.y2_t component={0}.y2 input=Append(IfDefined(Offset({1}, {2})), {0}.z_t)\".format(name, recurrent_connection_y, delay))\n        configs.append(\"component-node name={0}.y_t component={0}.y input=Sum({0}.y1_t, {0}.y2_t)\".format(name))\n\n        configs.append(\"# s_t recurrent\")\n        configs.append(\"component name={0}.W_s.ys type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n        \n        configs.append(\"component name={0}.batchnorm type=BatchNormComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim + nonrec_proj_dim))\n        configs.append(\"component name={0}.renorm type=NormalizeComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim))\n\n        configs.append(\"# s_t and n_t : sn_t will be the output\")\n        configs.append(\"component-node name={0}.sn_nobatchnorm_t component={0}.W_s.ys input={0}.y_t\".format(name))\n        configs.append(\"dim-range-node name={0}.s_t_preclip input-node={0}.sn_nobatchnorm_t dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        
configs.append(\"component-node name={0}.sn_t component={0}.batchnorm input={0}.sn_nobatchnorm_t\".format(name))\n\n        configs.append(\"component-node name={0}.s_renorm_t component={0}.renorm input={0}.s_t_preclip\".format(name))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_renorm_t\".format(name))\n\n        return configs\n\n\n\n# This class is for lines like\n#   'opgru-layer name=opgru1 input=[-1] delay=-3'\n# It generates an OPGRU sub-graph with output projections.\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection-dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. 
cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\nclass XconfigOpgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"opgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n             
           'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0}: {1} has invalid value {2}.\"\n                                   .format(self.layer_type, key,\n                                           self.config[key]))\n\n    def auxiliary_outputs(self):\n        return ['h_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'sn_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def 
output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'h_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in GRU initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the OPGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" 
recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n\n        # formulation for OPGRU like:\n        # z_t = \\sigmoid ( x_t * U^z + s_{t-1} * W^z ) // update gate\n        # o_t = \\sigmoid ( x_t * U^o + s_{t-1} * W^o ) // output gate\n        # \\tilde{h}_t = \\tanh ( x_t * U^h + h_{t-1} \\dot W^h ) // W^h is learnable vector\n        # h_t = ( 1 - z_t ) \\dot \\tilde{h}_t + z_t \\dot h_{t-1}\n        # y_t = (h_t \\dot o_t) * W^y\n        # s_t = y_t(0:rec_proj_dim-1)\n        \n        configs = []\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xs_z type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        \n        configs.append(\"# Output gate control : W_o* matrices\")\n        configs.append(\"component name={0}.W_z.xs_o type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n\n        configs.append(\"# h related matrix : W_h* matrices\")\n        configs.append(\"component name={0}.W_h.UW type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim , cell_dim , affine_str))\n        configs.append(\"component name={0}.W_h.UW_elementwise 
type=NaturalGradientPerElementScaleComponent dim={1} {2}\".format(name, cell_dim , pes_str))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n        configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.o1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y type=NoOpComponent dim={1}\".format(name, cell_dim))\n\n        recurrent_connection = '{0}.s_t'.format(name)\n        recurrent_connection_y = '{0}.y_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs_z input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n\n        configs.append(\"# o_t\")\n        configs.append(\"component-node name={0}.o_t_pre component={0}.W_z.xs_o input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.o_t component={0}.o input={0}.o_t_pre\".format(name))\n        \n        configs.append(\"# h_t\")\n   
     configs.append(\"component-node name={0}.h_t_pre component={0}.W_h.UW input={1}\".format(name, input_descriptor))\n        configs.append(\"component-node name={0}.h_t_pre2 component={0}.W_h.UW_elementwise input=IfDefined(Offset({1}, {2}))\".format(name, recurrent_connection_y, delay))\n        configs.append(\"component-node name={0}.h_t component={0}.h input=Sum({0}.h_t_pre, {0}.h_t_pre2)\".format(name))\n\n        configs.append(\"component-node name={0}.y1_t component={0}.y1 input=Append({0}.h_t, Sum(Scale(-1.0,{0}.z_t), Const(1.0, {1})))\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.y2_t component={0}.y2 input=Append(IfDefined(Offset({1}, {2})), {0}.z_t)\".format(name, recurrent_connection_y, delay))\n        configs.append(\"component-node name={0}.y_t component={0}.y input=Sum({0}.y1_t, {0}.y2_t)\".format(name))\n        configs.append(\"component-node name={0}.y_o_t component={0}.o1 input=Append({0}.o_t, {0}.y_t)\".format(name))\n\n        configs.append(\"# s_t recurrent\")\n        configs.append(\"component name={0}.W_s.ys type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n\n        configs.append(\"# s_t and n_t : sn_t will be the output\")\n        configs.append(\"component-node name={0}.sn_t component={0}.W_s.ys input={0}.y_o_t\".format(name))\n        configs.append(\"dim-range-node name={0}.s_t_preclip input-node={0}.sn_t dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_preclip\".format(name))\n\n        return configs\n\n# This class is for lines like\n#   'norm-opgru-layer name=norm-opgru1 input=[-1] delay=-3'\n# It generates a norm-OPGRU sub-graph with output projections.\n\n# Different from the vanilla 
OPGRU, the NormOPGRU uses batchnorm in the forward direction\n# and renorm in the recurrence.\n\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection-dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\nclass XconfigNormOpgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"norm-opgru-layer\"\n        XconfigLayerBase.__init__(self, 
first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim // 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        'dropout-proportion' : -1.0, # If -1.0, no dropout components will be added\n                        'l2-regularize': 0.0,\n                        'dropout-per-frame' : True  # If false, regular dropout, not per frame.\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so the dimension stays an int under Python 3\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            
self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0}: {1} has invalid value {2}.\"\n                                   .format(self.layer_type, key,\n                                           self.config[key]))\n        if ((self.config['dropout-proportion'] > 1.0 or\n             self.config['dropout-proportion'] < 0.0) and\n             self.config['dropout-proportion'] != -1.0 ):\n             raise RuntimeError(\"dropout-proportion has invalid value {0}.\"\n                                .format(self.config['dropout-proportion']))\n\n    def auxiliary_outputs(self):\n        return ['h_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'sn_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                # h_t is the output of the TanhComponent, whose dim is cell-dim\n                if auxiliary_output == 'h_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def 
get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the Norm-OPGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n        dropout_proportion = self.config['dropout-proportion']\n        dropout_per_frame = 'true' if self.config['dropout-per-frame'] else 'false' \n\n        l2_regularize = self.config['l2-regularize']\n        
l2_regularize_option = ('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n\n        # formulation for OPGRU like:\n        # z_t = \\sigmoid ( x_t * U^z + s_{t-1} * W^z ) // update gate\n        # o_t = \\sigmoid ( x_t * U^o + s_{t-1} * W^o ) // output gate\n        # \\tilde{h}_t = \\tanh ( x_t * U^h + h_{t-1} \\dot W^h ) // W^h is learnable vector\n        # h_t = ( 1 - z_t ) \\dot \\tilde{h}_t + z_t \\dot h_{t-1}\n        # y_t_tmp = ( h_t \\dot o_t) * W^y\n        # s_t = renorm ( y_t_tmp(0:rec_proj_dim-1) )\n        # y_t = batchnorm ( y_t_tmp )\n        \n        configs = []\n        configs.append(\"# Update gate control : W_z* matrics\")\n        configs.append(\"component name={0}.W_z.xs_z type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str, l2_regularize_option))\n        \n        configs.append(\"# Output gate control : W_r* matrics\")\n        configs.append(\"component name={0}.W_z.xs_o type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str, l2_regularize_option))\n\n        configs.append(\"# h related matrix : W_h* matrics\")\n        configs.append(\"component name={0}.W_h.UW type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim , cell_dim , affine_str, l2_regularize_option))\n        configs.append(\"component name={0}.W_h.UW_elementwise type=NaturalGradientPerElementScaleComponent dim={1} {2}\".format(name, cell_dim , pes_str))\n        \n        configs.append(\"# Defining the 
non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n        configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.o1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.y type=NoOpComponent dim={1}\".format(name, cell_dim))\n\n        if dropout_proportion != -1.0:\n            configs.append(\"component name={0}.dropout type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, cell_dim, dropout_proportion, dropout_per_frame))\n\n        recurrent_connection = '{0}.s_t'.format(name)\n        recurrent_connection_y = '{0}.y_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs_z input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.z_predrop_t component={0}.z input={0}.z_t_pre\".format(name))\n            configs.append(\"component-node name={0}.z_t component={0}.dropout input={0}.z_predrop_t\".format(name))\n        else:\n            
configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n\n        configs.append(\"# o_t\")\n        configs.append(\"component-node name={0}.o_t_pre component={0}.W_z.xs_o input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.o_predrop_t component={0}.o input={0}.o_t_pre\".format(name))\n            configs.append(\"component-node name={0}.o_t component={0}.dropout input={0}.o_predrop_t\".format(name))\n        else:\n            configs.append(\"component-node name={0}.o_t component={0}.o input={0}.o_t_pre\".format(name))\n        \n        configs.append(\"# h_t\")\n        configs.append(\"component-node name={0}.h_t_pre component={0}.W_h.UW input={1}\".format(name, input_descriptor))\n        configs.append(\"component-node name={0}.h_t_pre2 component={0}.W_h.UW_elementwise input=IfDefined(Offset({1}, {2}))\".format(name, recurrent_connection_y, delay))\n        configs.append(\"component-node name={0}.h_t component={0}.h input=Sum({0}.h_t_pre, {0}.h_t_pre2)\".format(name))\n\n        configs.append(\"# The following two lines are to implement (1 - z_t)\")\n        configs.append(\"component-node name={0}.y1_t component={0}.y1 input=Append({0}.h_t, Sum(Scale(-1.0,{0}.z_t), Const(1.0, {1})))\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.y2_t component={0}.y2 input=Append(IfDefined(Offset({1}, {2})), {0}.z_t)\".format(name, recurrent_connection_y, delay))\n        configs.append(\"component-node name={0}.y_t component={0}.y input=Sum({0}.y1_t, {0}.y2_t)\".format(name))\n        configs.append(\"component-node name={0}.y_o_t component={0}.o1 input=Append({0}.o_t, {0}.y_t)\".format(name))\n\n        configs.append(\"# s_t recurrent\")\n        configs.append(\"component name={0}.W_s.ys type=NaturalGradientAffineComponent input-dim={1} 
output-dim={2} {3} {4}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str, l2_regularize_option))\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n        configs.append(\"component name={0}.batchnorm type=BatchNormComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim + nonrec_proj_dim))\n        configs.append(\"component name={0}.renorm type=NormalizeComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim))\n\n        configs.append(\"# s_t and n_t : sn_t will be the output\")\n        configs.append(\"component-node name={0}.sn_nobatchnorm_t component={0}.W_s.ys input={0}.y_o_t\".format(name))\n        configs.append(\"component-node name={0}.sn_t component={0}.batchnorm input={0}.sn_nobatchnorm_t\".format(name))\n        configs.append(\"dim-range-node name={0}.s_t_preclip input-node={0}.sn_nobatchnorm_t dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.s_t_preclip_renorm component={0}.renorm input={0}.s_t_preclip\".format(name))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_preclip_renorm\".format(name))\n\n        return configs\n\n# This class is for lines like\n#   'fast-gru-layer name=gru1 input=[-1] delay=-3'\n# It generates an GRU sub-graph without output projections.\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n# decay-time is deprecated under GRU or PGRU, as I found the PGRUs do not need the decay-time option to get generalized to unseen sequence length\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   delay=-1                 [Delay in the recurrent connections of the 
GRU/LSTM ]\n#   clipping-threshold=30    [similar to LSTMs ,nnet3 GRUs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''     [Additional options used for the diagonal matrices in the GRU/LSTM ]\n#   gru-nonlinearity-options=' max-change=0.75' [options for GruNonlinearityComponent, see below for detail]\n#   ng-affine-options=''                [Additional options used for the full matrices in the GRU/LSTM, can be used to do things like set biases to initialize to 1]\nclass XconfigFastGruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"fast-gru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        # if you want to set 'self-repair-scale', ' self-repair-threshold'\n    
                    # or 'param-stddev' for GruNonlinearityComponent\n                        # By default, they are 1.0e-05, 0.2 and  1.0 / sqrt(d) where d is cell-dim.\n                        # you can add something like 'self-repair-scale=xxx' to gru-nonlinearity-options.\n                        # you can also see src/nnet3/nnet-combined-component.h for detail\n                        'gru-nonlinearity-options' : ' max-change=0.75'\n                        }\n\n    def set_derived_configs(self):\n        if self.config['cell-dim'] <= 0:\n            self.config['cell-dim'] = self.descriptors['input']['dim']\n\n    def check_configs(self):\n        key = 'cell-dim'\n        if self.config['cell-dim'] <= 0:\n            raise RuntimeError(\"cell-dim has invalid value {0}.\".format(self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(key, self.config[key]))\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'y_t'\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        return self.config['cell-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_gru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the GRU config\n    def generate_gru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call 
descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        delay = self.config['delay']\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'], abs(delay)))\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        affine_str = self.config['ng-affine-options']\n\n        # string for GruNonlinearityComponent\n        gru_nonlin_str = self.config['gru-nonlinearity-options']\n        \n        # formulation like:\n        # z_t = \\sigmoid ( U^z x_t + W^z y_{t-1} )   # update gate\n        # r_t = \\sigmoid ( U^r x_t + W^r y_{t-1} )   # reset gate\n        # h_t = \\tanh ( U^h x_t + W^h ( y_{t-1} \\dot r_t ) )\n        # y_t = ( 1 - z_t ) \\dot h_t  +  z_t \\dot y_{t-1}\n        # Note:\n        # naming convention:\n        # <layer-name>.W_<outputname>.<inputname> e.g. 
Gru1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n        # notation convention:\n        # In order to be consistent with the notations which are used in\n        # nnet-combined-component.cc, we map \"\\tilde{h_t}\" and \"h_t\" which are\n        # used in paper to \"h_t\" and \"c_t\"\n\n        configs = []\n\n        configs.append(\"### Begin Gru layer '{0}'\".format(name))\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xh type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + cell_dim, cell_dim, affine_str))\n        configs.append(\"# Reset gate control : W_r* matrices\")\n        configs.append(\"component name={0}.W_r.xh type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + cell_dim, cell_dim, affine_str))\n\n        configs.append(\"# hpart_t related matrix : W_hpart matrice\")\n        configs.append(\"component name={0}.W_hpart.x type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim, cell_dim , affine_str))\n        \n        configs.append(\"# Defining the non-linearities for z_t and r_t\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.r type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        \n        recurrent_connection = '{0}.s_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xh input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n        configs.append(\"# r_t\")\n        configs.append(\"component-node 
name={0}.r_t_pre component={0}.W_r.xh input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.r_t component={0}.r input={0}.r_t_pre\".format(name))\n\n        configs.append(\"# hpart_t\")\n        configs.append(\"component-node name={0}.hpart_t component={0}.W_hpart.x input={1}\".format(name, input_descriptor))\n        \n        configs.append(\"# y_t\")\n        configs.append(\"# Note: the output of GruNonlinearityComponent is (h_t, c_t), we just get the second half. Otherwise, in non-projection gru layer, y_t = c_t\")\n        configs.append(\"component name={0}.gru_nonlin type=GruNonlinearityComponent cell-dim={1} {2}\".format(name, cell_dim, gru_nonlin_str))\n        configs.append(\"component-node name={0}.gru_nonlin_t component={0}.gru_nonlin input=Append({0}.z_t, {0}.r_t, {0}.hpart_t, IfDefined(Offset({1}, {2})))\".format(name, recurrent_connection, delay))\n        configs.append(\"dim-range-node name={0}.y_t input-node={0}.gru_nonlin_t dim-offset={1} dim={1}\".format(name, cell_dim))\n\n        configs.append(\"# s_t : recurrence\")\n        configs.append(\"# Note: in non-projection gru layer, the recurrent part equals the output, namely y_t.\")\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, cell_dim, bptrunc_str))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.y_t\".format(name))\n        return configs\n\n\n# This class is for lines like\n#   'fast-pgru-layer name=pgru1 input=[-1] delay=-3'\n# It generates an PGRU sub-graph with output projections. 
It can also generate\n# outputs without projection, but you could use the XconfigGruLayer for this\n# simple RNN.\n# 'cell-dim=xxx' is a compulsory argument for this layer; recurrent-projection-dim defaults to\n# cell-dim / 4 and non-recurrent-projection-dim defaults to recurrent-projection-dim.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection-dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent,\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   gru-nonlinearity-options=' max-change=0.75' [options for GruNonlinearityComponent, see below for detail]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\nclass XconfigFastPgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, 
key_to_value, prev_names = None):\n        assert first_token == \"fast-pgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim // 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        # if you want to set 'self-repair-scale', ' self-repair-threshold'\n                        # or 'param-stddev' for GruNonlinearityComponent\n                        # By default, they are 1.0e-05, 0.2 and  1.0 / sqrt(d) where d is cell-dim.\n                        # you can add something like 'self-repair-scale=xxx' to gru-nonlinearity-options.\n                        # you can also see src/nnet3/nnet-combined-component.h for detail\n                        'gru-nonlinearity-options' : ' max-change=0.75'\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so the dimension stays an int under Python 3\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 
'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has invalid value {1}.\"\n                                   .format(key, self.config[key]))\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'y_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] 
+ self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the PGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is 
None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n\n        # string for GruNonlinearityComponent\n        gru_nonlin_str = self.config['gru-nonlinearity-options']\n        \n        # formulation like:\n        # z_t = \\sigmoid ( U^z x_t + W^z s_{t-1} )   # update gate\n        # r_t = \\sigmoid ( U^r x_t + W^r s_{t-1} )   # reset gate\n        # h_t = \\tanh ( U^h x_t + W^h ( s_{t-1} \\dot r_t ) )\n        # c_t = ( 1 - z_t ) \\dot h_t  +  z_t \\dot c_{t-1}\n        # y_t = W^y c_t  # dim(y_t) = recurrent_dim + non_recurrent_dim.\n                         #  This is the output of the GRU.\n        # s_t = y_t[0:recurrent_dim-1]  # dimension range of y_t \n                                        # dim(s_t) = recurrent_dim.\n        # Note:\n        # naming convention:\n        # <layer-name>.W_<outputname>.<inputname> e.g. Gru1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n        # notation convention:\n        # In order to be consistent with the notations which are used in\n        # nnet-combined-component.cc, we map \"\\tilde{h_t}\" and \"h_t\" which are\n        # used in paper to \"h_t\" and \"c_t\"\n\n        configs = []\n        configs.append(\"### Begin Gru layer '{0}'\".format(name))\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        configs.append(\"# Reset gate control : W_r* matrices\")\n        configs.append(\"component name={0}.W_r.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, rec_proj_dim, affine_str))\n\n\n        configs.append(\"# hpart_t related matrix : W_hpart matric\")\n        configs.append(\"component name={0}.W_hpart.x 
type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim, cell_dim , affine_str))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.r type=SigmoidComponent dim={1} {2}\".format(name, rec_proj_dim, repair_nonlin_str))\n        \n        recurrent_connection = '{0}.s_t'.format(name)\n\n        configs.append(\"# z_t and r_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n        configs.append(\"component-node name={0}.r_t_pre component={0}.W_r.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.r_t component={0}.r input={0}.r_t_pre\".format(name))\n\n        configs.append(\"# hpart_t\")\n        configs.append(\"component-node name={0}.hpart_t component={0}.W_hpart.x input={1}\".format(name, input_descriptor))\n        \n        configs.append(\"# c_t\")\n        configs.append(\"# Note: the output of GruNonlinearityComponent is (h_t, c_t), we use the second half.\")\n        configs.append(\"component name={0}.gru_nonlin type=GruNonlinearityComponent cell-dim={1} recurrent-dim={2} {3}\".format(name, cell_dim, rec_proj_dim, gru_nonlin_str))\n        configs.append(\"component-node name={0}.gru_nonlin_t component={0}.gru_nonlin input=Append({0}.z_t, {0}.r_t, {0}.hpart_t, IfDefined(Offset({0}.c_t, {2})), IfDefined(Offset({1}, {2})))\".format(name, recurrent_connection, delay))\n        configs.append(\"dim-range-node name={0}.c_t input-node={0}.gru_nonlin_t dim-offset={1} 
dim={1}\".format(name, cell_dim))\n\n        configs.append(\"# the projected matrix W_y.c and y_t\")\n        configs.append(\"component name={0}.W_y.c type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component-node name={0}.y_t component={0}.W_y.c input={0}.c_t\".format(name))\n\n        configs.append(\"# s_t : recurrence\")\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n        configs.append(\"dim-range-node name={0}.s_t_pre input-node={0}.y_t dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_pre\".format(name))\n        return configs\n\n\n# This class is for lines like\n#   'fast-norm-pgru-layer name=pgru1 input=[-1] delay=-3'\n\n# Different from the vanilla PGRU, the NormPGRU uses batchnorm in the forward direction\n# and renorm in the recurrence.\n\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection_dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. 
cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self_repair_scale_nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   gru-nonlinearity-options=' max-change=0.75' [options for GruNonlinearityComponent, see below for detail]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\nclass XconfigFastNormPgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"fast-norm-pgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        
'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        # if you want to set 'self-repair-scale', ' self-repair-threshold'\n                        # or 'param-stddev' for GruNonlinearityComponent\n                        # By default they are 1.0e-05, 0.2 and  1.0 / sqrt(d) where d is cell-dim.\n                        # you can add something like 'self-repair-scale=xxx' to gru-nonlinearity-options.\n                        # you can also see src/nnet3/nnet-combined-component.h for detail\n                        'gru-nonlinearity-options' : ' max-change=0.75',\n                        'dropout-proportion' : -1.0,  # If -1.0, no dropout components will be added\n                        'dropout-per-frame' : True  # If False, regular dropout, not per frame\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so the dimension stays an int under Python 3\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                
                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has invalid value {2}.\"\n                                   .format(self.layer_type, key,\n                                           self.config[key]))\n        if ((self.config['dropout-proportion'] > 1.0 or\n             self.config['dropout-proportion'] < 0.0) and\n             self.config['dropout-proportion'] != -1.0 ):\n             raise RuntimeError(\"dropout-proportion has invalid value {0}.\"\n                                .format(self.config['dropout-proportion']))\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'y_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {1}\".format(self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if node_name == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {1}\".format(self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not 
support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the Norm-PGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n        dropout_proportion = self.config['dropout-proportion']\n        dropout_per_frame = 'true' if self.config['dropout-per-frame'] else 'false' \n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 
param-stddev=1.0 \"\n\n        # string for GruNonlinearityComponent\n        gru_nonlin_str = self.config['gru-nonlinearity-options']\n        \n        # formulation like:\n        # z_t = \\sigmoid ( U^z x_t + W^z s_{t-1} )   # update gate\n        # r_t = \\sigmoid ( U^r x_t + W^r s_{t-1} )   # reset gate\n        # h_t = \\tanh ( U^h x_t + W^h ( s_{t-1} \\dot r_t ) )\n        # c_t = ( 1 - z_t ) \\dot h_t  +  z_t \\dot c_{t-1}\n        # y_t_tmp = W^y c_t\n        # s_t = renorm ( y_t_tmp[0:rec_proj_dim-1] ) # dim(s_t) = recurrent_dim.\n        # y_t = batchnorm ( y_t_tmp )  # dim(y_t) = recurrent_dim + non_recurrent_dim.\n                                       # This is the output of the GRU.\n        # Note:\n        # naming convention:\n        # <layer-name>.W_<outputname>.<inputname> e.g. Gru1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n        # notation convention:\n        # In order to be consistent with the notations which are used in\n        # nnet-combined-component.cc, we map \"\\tilde{h_t}\" and \"h_t\" which are\n        # used in paper to \"h_t\" and \"c_t\"\n\n        configs = []\n        configs.append(\"### Begin Gru layer '{0}'\".format(name))\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        configs.append(\"# Reset gate control : W_r* matrices\")\n        configs.append(\"component name={0}.W_r.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, rec_proj_dim, affine_str))\n\n\n        configs.append(\"# hpart_t related matrix : W_hpart matric\")\n        configs.append(\"component name={0}.W_hpart.x type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim, cell_dim , 
affine_str))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.r type=SigmoidComponent dim={1} {2}\".format(name, rec_proj_dim, repair_nonlin_str))\n\n        if dropout_proportion != -1.0:\n            configs.append(\"# Defining the dropout component\")\n            configs.append(\"component name={0}.dropout_z type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, cell_dim, dropout_proportion, dropout_per_frame))\n            configs.append(\"component name={0}.dropout_r type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, rec_proj_dim, dropout_proportion, dropout_per_frame))\n\n\n        recurrent_connection = '{0}.s_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.z_t_predrop component={0}.z input={0}.z_t_pre\".format(name))\n            configs.append(\"component-node name={0}.z_t component={0}.dropout_z input={0}.z_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n\n        configs.append(\"# r_t\")\n        configs.append(\"component-node name={0}.r_t_pre component={0}.W_r.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.r_t_predrop 
component={0}.r input={0}.r_t_pre\".format(name))\n            configs.append(\"component-node name={0}.r_t component={0}.dropout_r input={0}.r_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.r_t component={0}.r input={0}.r_t_pre\".format(name))\n\n        configs.append(\"# hpart_t\")\n        configs.append(\"component-node name={0}.hpart_t component={0}.W_hpart.x input={1}\".format(name, input_descriptor))\n        \n        configs.append(\"# c_t\")\n        configs.append(\"# Note: the output of GruNonlinearityComponent is (h_t, c_t), we use the second half.\")\n        configs.append(\"component name={0}.gru_nonlin type=GruNonlinearityComponent cell-dim={1} recurrent-dim={2} {3}\".format(name, cell_dim, rec_proj_dim, gru_nonlin_str))\n        configs.append(\"component-node name={0}.gru_nonlin_t component={0}.gru_nonlin input=Append({0}.z_t, {0}.r_t, {0}.hpart_t, IfDefined(Offset({0}.c_t, {2})), IfDefined(Offset({1}, {2})))\".format(name, recurrent_connection, delay))\n        configs.append(\"dim-range-node name={0}.c_t input-node={0}.gru_nonlin_t dim-offset={1} dim={1}\".format(name, cell_dim))\n\n        configs.append(\"# the projected matrix W_y.c and y_t_tmp\")\n        configs.append(\"component name={0}.W_y.c type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component-node name={0}.y_t_tmp component={0}.W_y.c input={0}.c_t\".format(name))\n\n        configs.append(\"# s_t : recurrence\")\n        configs.append(\"component name={0}.renorm type=NormalizeComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim))\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n        configs.append(\"dim-range-node name={0}.s_t_pre input-node={0}.y_t_tmp dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        
configs.append(\"component-node name={0}.s_t_renorm component={0}.renorm input={0}.s_t_pre\".format(name))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_renorm\".format(name))\n\n        configs.append(\"# y_t : output\")\n        configs.append(\"component name={0}.batchnorm type=BatchNormComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim + nonrec_proj_dim))\n        configs.append(\"component-node name={0}.y_t component={0}.batchnorm input={0}.y_t_tmp\".format(name))\n        return configs\n\n\n# This class is for lines like\n#   'fast-opgru-layer name=opgru1 input=[-1] delay=-3'\n# It generates an PGRU sub-graph with output projections. It can also generate\n# outputs without projection, but you could use the XconfigGruLayer for this\n# simple RNN.\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection_dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. 
cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self_repair_scale_nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   gru-nonlinearity-options=' max-change=0.75' [options for GruNonlinearityComponent, see below for detail]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\nclass XconfigFastOpgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"fast-opgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        
'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        # if you want to set 'self-repair-scale', ' self-repair-threshold'\n                        # or 'param-stddev' for GruNonlinearityComponent\n                        # By default they are 1.0e-05, 0.2 and  1.0 / sqrt(d) where d is cell-dim.\n                        # you can add something like 'self-repair-scale=xxx' to gru-nonlinearity-options.\n                        # you can also see src/nnet3/nnet-combined-component.h for detail\n                        'gru-nonlinearity-options' : ' max-change=0.75'\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so the dimension stays an int under Python 3\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has 
invalid value {1}.\"\n                                   .format(key,\n                                           self.config[key]))\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'y_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in GRU initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the OPGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        
input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n\n        # string for GruNonlinearityComponent\n        gru_nonlin_str = self.config['gru-nonlinearity-options']\n        \n        # formulation like:\n        # z_t = \\sigmoid ( U^z x_t + W^z s_{t-1} )   # update gate\n        # o_t = \\sigmoid ( U^o x_t + W^o s_{t-1} )   # reset gate\n        # h_t = \\tanh ( U^h x_t + W^h \\dot c_{t-1} )\n        # c_t = ( 1 - z_t ) \\dot h_t  +  z_t \\dot c_{t-1}\n        # y_t = ( c_t \\dot o_t ) W^y  # dim(y_t) = recurrent_dim + non_recurrent_dim.\n                                      #  This is the output of the GRU.\n        # s_t = y_t[0:recurrent_dim-1]  # dimension range of y_t \n                        
                # dim(s_t) = recurrent_dim.\n        # Note:\n        # naming convention:\n        # <layer-name>.W_<outputname>.<inputname> e.g. Gru1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n        # notation convention:\n        # In order to be consistent with the notations which are used in\n        # nnet-combined-component.cc, we map \"\\tilde{h_t}\" and \"h_t\" which are\n        # used in the paper to \"h_t\" and \"c_t\"\n\n        configs = []\n        configs.append(\"### Begin Gru layer '{0}'\".format(name))\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        configs.append(\"# Reset gate control : W_o* matrices\")\n        configs.append(\"component name={0}.W_o.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n\n\n        configs.append(\"# hpart_t related matrix : W_hpart matrix\")\n        configs.append(\"component name={0}.W_hpart.x type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim, cell_dim , affine_str))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        \n        recurrent_connection = '{0}.s_t'.format(name)\n\n        configs.append(\"# z_t and o_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node 
name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n        configs.append(\"component-node name={0}.o_t_pre component={0}.W_o.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.o_t component={0}.o input={0}.o_t_pre\".format(name))\n\n        configs.append(\"# hpart_t\")\n        configs.append(\"component-node name={0}.hpart_t component={0}.W_hpart.x input={1}\".format(name, input_descriptor))\n        \n        configs.append(\"# c_t\")\n        configs.append(\"# Note: the output of OutputGruNonlinearityComponent is (h_t, c_t), we use the second half.\")\n        configs.append(\"component name={0}.gru_nonlin type=OutputGruNonlinearityComponent cell-dim={1} {2}\".format(name, cell_dim, gru_nonlin_str))\n        configs.append(\"component-node name={0}.gru_nonlin_t component={0}.gru_nonlin input=Append({0}.z_t, {0}.hpart_t, IfDefined(Offset({0}.c_t, {1})))\".format(name, delay))\n        configs.append(\"dim-range-node name={0}.c_t input-node={0}.gru_nonlin_t dim-offset={1} dim={1}\".format(name, cell_dim))\n\n        configs.append(\"# the projected matrix W_y.cdoto and y_t\")\n        configs.append(\"component name={0}.cdoto type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component-node name={0}.cdoto component={0}.cdoto input=Append({0}.c_t, {0}.o_t)\".format(name))\n        configs.append(\"component name={0}.W_y.cdoto type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component-node name={0}.y_t component={0}.W_y.cdoto input={0}.cdoto\".format(name))\n\n        configs.append(\"# s_t recurrence\")\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n        
configs.append(\"dim-range-node name={0}.s_t_preclip input-node={0}.y_t dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_preclip\".format(name))\n\n        return configs\n\n\n# This class is for lines like\n#   'fast-norm-opgru-layer name=opgru1 input=[-1] delay=-3'\n\n# Different from the vanilla OPGRU, the NormOPGRU uses batchnorm in the forward direction\n# and renorm in the recurrence.\n\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1            [Dimension of the cell]\n#   recurrent-projection_dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. 
cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the GRU ]\n#   clipping-threshold=30    [nnet3 GRU use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self_repair_scale_nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''   [Additional options used for the diagonal matrices in the GRU ]\n#   gru-nonlinearity-options=' max-change=0.75' [options for GruNonlinearityComponent, see below for detail]\n#   ng-affine-options=''              [Additional options used for the full matrices in the GRU, can be used to do things like set biases to initialize to 1]\nclass XconfigFastNormOpgruLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"fast-norm-opgru-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                      
  'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        # if you want to set 'self-repair-scale', 'self-repair-threshold'\n                        # or 'param-stddev' for GruNonlinearityComponent\n                        # By default, they are 1.0e-05, 0.2 and 1.0 / sqrt(d), where d is cell-dim.\n                        # you can add something like 'self-repair-scale=xxx' to gru-nonlinearity-options.\n                        # you can also see src/nnet3/nnet-combined-component.h for detail\n                        'gru-nonlinearity-options' : ' max-change=0.75',\n                        'dropout-proportion' : -1.0,  # If -1.0, no dropout components will be added\n                        'dropout-per-frame' : True  # If False, regular dropout, not per frame\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so that the dimension stays an int under Python 3\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n              
                  \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0}: {1} has invalid value {2}.\"\n                                   .format(self.layer_type, key,\n                                           self.config[key]))\n        if ((self.config['dropout-proportion'] > 1.0 or\n             self.config['dropout-proportion'] < 0.0) and\n             self.config['dropout-proportion'] != -1.0 ):\n             raise RuntimeError(\"dropout-proportion has invalid value {0}.\"\n                                .format(self.config['dropout-proportion']))\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'y_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self.generate_pgru_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do 
not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the Norm-OPGRU config\n    def generate_pgru_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay)))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n        dropout_proportion = self.config['dropout-proportion']\n        dropout_per_frame = 'true' if self.config['dropout-per-frame'] else 'false' \n\n        # Natural gradient per element scale parameters\n        # TODO: decide if we want to keep exposing these options\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" 
 param-mean=0.0 param-stddev=1.0 \"\n\n        # string for GruNonlinearityComponent\n        gru_nonlin_str = self.config['gru-nonlinearity-options']\n        \n        # formulation like:\n        # z_t = \\sigmoid ( U^z x_t + W^z s_{t-1} )   # update gate\n        # o_t = \\sigmoid ( U^o x_t + W^o s_{t-1} )   # output gate\n        # h_t = \\tanh ( U^h x_t + W^h \\dot c_{t-1} )\n        # c_t = ( 1 - z_t ) \\dot h_t  +  z_t \\dot c_{t-1}\n        # y_t_tmp = ( c_t \\dot o_t ) W^y\n        # s_t = renorm ( y_t_tmp[0:rec_proj_dim-1] ) # dim(s_t) = recurrent_dim.\n        # y_t = batchnorm ( y_t_tmp )  # dim(y_t) = recurrent_dim + non_recurrent_dim.\n                                       # This is the output of the GRU.\n        # Note:\n        # naming convention:\n        # <layer-name>.W_<outputname>.<inputname> e.g. Gru1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n        # notation convention:\n        # In order to be consistent with the notations which are used in\n        # nnet-combined-component.cc, we map \"\\tilde{h_t}\" and \"h_t\" which are\n        # used in the paper to \"h_t\" and \"c_t\"\n\n        configs = []\n        configs.append(\"### Begin Gru layer '{0}'\".format(name))\n        configs.append(\"# Update gate control : W_z* matrices\")\n        configs.append(\"component name={0}.W_z.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n        configs.append(\"# Output gate control : W_o* matrices\")\n        configs.append(\"component name={0}.W_o.xs type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim + rec_proj_dim, cell_dim, affine_str))\n\n\n        configs.append(\"# hpart_t related matrix : W_hpart matrix\")\n        configs.append(\"component name={0}.W_hpart.x type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input_dim, 
cell_dim , affine_str))\n        \n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.z type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n        if dropout_proportion != -1.0:\n            configs.append(\"# Defining the dropout component\")\n            configs.append(\"component name={0}.dropout type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, cell_dim, dropout_proportion, dropout_per_frame))\n\n        recurrent_connection = '{0}.s_t'.format(name)\n\n        configs.append(\"# z_t\")\n        configs.append(\"component-node name={0}.z_t_pre component={0}.W_z.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.z_t_predrop component={0}.z input={0}.z_t_pre\".format(name))\n            configs.append(\"component-node name={0}.z_t component={0}.dropout input={0}.z_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.z_t component={0}.z input={0}.z_t_pre\".format(name))\n\n        configs.append(\"# o_t\")\n        configs.append(\"component-node name={0}.o_t_pre component={0}.W_o.xs input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.o_t_predrop component={0}.o input={0}.o_t_pre\".format(name))\n            configs.append(\"component-node name={0}.o_t component={0}.dropout input={0}.o_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.o_t component={0}.o 
input={0}.o_t_pre\".format(name))\n\n        configs.append(\"# hpart_t\")\n        configs.append(\"component-node name={0}.hpart_t component={0}.W_hpart.x input={1}\".format(name, input_descriptor))\n        \n        configs.append(\"# c_t\")\n        configs.append(\"# Note: the output of OutputGruNonlinearityComponent is (h_t, c_t), we use the second half.\")\n        configs.append(\"component name={0}.gru_nonlin type=OutputGruNonlinearityComponent cell-dim={1} {2}\".format(name, cell_dim, gru_nonlin_str))\n        configs.append(\"component-node name={0}.gru_nonlin_t component={0}.gru_nonlin input=Append({0}.z_t, {0}.hpart_t, IfDefined(Offset({0}.c_t, {1})))\".format(name, delay))\n        configs.append(\"dim-range-node name={0}.c_t input-node={0}.gru_nonlin_t dim-offset={1} dim={1}\".format(name, cell_dim))\n\n        configs.append(\"# the projected matrix W_y.cdoto and y_t_tmp\")\n        configs.append(\"component name={0}.cdoto type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component-node name={0}.cdoto component={0}.cdoto input=Append({0}.c_t, {0}.o_t)\".format(name))\n        configs.append(\"component name={0}.W_y.cdoto type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim, affine_str))\n        configs.append(\"component-node name={0}.y_t_tmp component={0}.W_y.cdoto input={0}.cdoto\".format(name))\n\n        configs.append(\"# s_t : recurrence\")\n        configs.append(\"component name={0}.renorm type=NormalizeComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim))\n        configs.append(\"component name={0}.s_r type=BackpropTruncationComponent dim={1} {2}\".format(name, rec_proj_dim, bptrunc_str))\n        configs.append(\"dim-range-node name={0}.s_t_pre input-node={0}.y_t_tmp dim-offset=0 dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.s_t_renorm 
component={0}.renorm input={0}.s_t_pre\".format(name))\n        configs.append(\"component-node name={0}.s_t component={0}.s_r input={0}.s_t_renorm\".format(name))\n\n        configs.append(\"# y_t : output\")\n        configs.append(\"component name={0}.batchnorm type=BatchNormComponent dim={1} target-rms=1.0\".format(name, rec_proj_dim + nonrec_proj_dim))\n        configs.append(\"component-node name={0}.y_t component={0}.batchnorm input={0}.y_t_tmp\".format(name))\n        \n        return configs\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/layers.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2016    Yiming Wang\n# Apache 2.0.\n\nfrom .basic_layers import *\nfrom .convolution import *\nfrom .attention import *\nfrom .lstm import *\nfrom .gru import *\nfrom .stats_layer import *\nfrom .trivial_layers import *\nfrom .composite_layers import *\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/lstm.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2016    Yiming Wang\n# Apache 2.0.\n\n\n\"\"\" This module has the implementations of different LSTM layers.\n\"\"\"\nfrom __future__ import print_function\nimport math\nimport re\nimport sys\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n\n# This class is for lines like\n#   'lstm-layer name=lstm1 input=[-1] delay=-3'\n# It generates an LSTM sub-graph without output projections.\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   delay=-1                 [Delay in the recurrent connections of the LSTM ]\n#   clipping-threshold=30    [nnet3 LSTMs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self_repair_scale_nonlinearity=1e-5      [It is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent]\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=''     [Additional options used for the diagonal matrices in the LSTM ]\n#   ng-affine-options=''                [Additional options used for the full matrices in the LSTM, can be used to do things like set biases to initialize to 1]\n#   decay-time=-1            [If >0, an approximate maximum on how many 
frames\n#                            can be remembered via summation into the cell\n#                            contents c_t; enforced by putting a scaling factor\n#                            of recurrence_scale = 1 - abs(delay)/decay_time on\n#                            the recurrence, i.e. the term c_{t-1} in the LSTM\n#                            equations.  E.g. setting this to 20 means no more\n#                            than about 20 frames' worth of history,\n#                            i.e. history since about t = t-20, can be\n#                            accumulated in c_t.]\n#  l2-regularize=0.0         Constant controlling l2 regularization for this layer\nclass XconfigLstmLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == \"lstm-layer\"\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                       'l2-regularize': 0.0,\n                        'decay-time':  -1.0\n                        }\n\n    def set_derived_configs(self):\n        if self.config['cell-dim'] <= 0:\n            self.config['cell-dim'] = self.descriptors['input']['dim']\n\n    def check_configs(self):\n        key = 'cell-dim'\n        if self.config['cell-dim'] <= 0:\n            raise RuntimeError(\"cell-dim has invalid value {0}.\".format(self.config[key]))\n\n        if self.config['delay'] == 0:\n         
   raise RuntimeError(\"delay cannot be zero\")\n\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(key, self.config[key]))\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = 'm_t'\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise RuntimeError(\"Unknown auxiliary output name {0}\".format(auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise RuntimeError(\"Unknown auxiliary output name {0}\".format(auxiliary_output))\n\n        return self.config['cell-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_lstm_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the LSTM config\n    def _generate_lstm_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        
input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        delay = self.config['delay']\n        decay_time = self.config['decay-time']\n        # we expect decay_time to be either -1, or large, like 10 or 50.\n        recurrence_scale = (1.0 if decay_time < 0 else\n                            1.0 - (abs(delay) / decay_time))\n        assert recurrence_scale > 0   # or user may have set decay-time much\n                                      # too small.\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \" scale={4}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay), recurrence_scale))\n        repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        affine_str = self.config['ng-affine-options']\n        # Natural gradient per element scale parameters\n        ng_per_element_scale_options = self.config['ng-per-element-scale-options']\n        if re.search('param-mean', ng_per_element_scale_options) is None and \\\n           re.search('param-stddev', ng_per_element_scale_options) is None:\n           ng_per_element_scale_options += \" param-mean=0.0 param-stddev=1.0 \"\n        pes_str = ng_per_element_scale_options\n        l2_regularize = self.config['l2-regularize']\n        l2_regularize_option = ('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n\n\n        configs = []\n\n        # To see the equations implemented here, see\n        # eqs (1)-(6) in 
 https://arxiv.org/abs/1402.1128\n        # naming convention:\n        # <layer-name>.W_<outputname>.<input_name> e.g. Lstm1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n\n        configs.append(\"### Begin LSTM layer '{0}'\".format(name))\n        configs.append(\"# Input gate control : W_i* matrices\")\n        configs.append(\"component name={0}.W_i.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + cell_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n        configs.append(\"# note : the cell outputs pass through a diagonal matrix\")\n        configs.append(\"component name={0}.w_i.c type=NaturalGradientPerElementScaleComponent \"\n                       \"dim={1} {2} {3} \".format(name, cell_dim, pes_str,\n                                                 l2_regularize_option))\n        configs.append(\"# Forget gate control : W_f* matrices\")\n        configs.append(\"component name={0}.W_f.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + cell_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n        configs.append(\"# note : the cell outputs pass through a diagonal matrix\")\n        configs.append(\"component name={0}.w_f.c type=NaturalGradientPerElementScaleComponent \"\n                       \"dim={1} {2} {3}\".format(name, cell_dim, pes_str, l2_regularize_option))\n\n        configs.append(\"#  Output gate control : W_o* matrices\")\n        configs.append(\"component name={0}.W_o.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + cell_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n       
 configs.append(\"# note : the cell outputs pass through a diagonal matrix\")\n        configs.append(\"component name={0}.w_o.c type=NaturalGradientPerElementScaleComponent \"\n                       \" dim={1} {2} {3}\".format(name, cell_dim, pes_str,\n                                                 l2_regularize_option))\n\n        configs.append(\"# Cell input matrices : W_c* matrices\")\n        configs.append(\"component name={0}.W_c.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + cell_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n\n\n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.i type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.f type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.g type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n\n        configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.c1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\"\n                       \"\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.c2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\"\n                       \"\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.m type=ElementwiseProductComponent input-dim={1} output-dim={2}\"\n                       \"\".format(name, 2 * cell_dim, cell_dim))\n  
      configs.append(\"component name={0}.c type=BackpropTruncationComponent dim={1} {2}\"\n                       \"\".format(name, cell_dim, bptrunc_str))\n\n        # c1_t and c2_t defined below\n        configs.append(\"component-node name={0}.c_t component={0}.c input=Sum({0}.c1_t, {0}.c2_t)\".format(name))\n        delayed_c_t_descriptor = \"IfDefined(Offset({0}.c_t, {1}))\".format(name, delay)\n\n        configs.append(\"# i_t\")\n        configs.append(\"component-node name={0}.i1_t component={0}.W_i.xr input=Append({1}, IfDefined(Offset({0}.r_t, {2})))\"\n                       \"\".format(name, input_descriptor, delay))\n        configs.append(\"component-node name={0}.i2_t component={0}.w_i.c  input={1}\".format(name, delayed_c_t_descriptor))\n        configs.append(\"component-node name={0}.i_t component={0}.i input=Sum({0}.i1_t, {0}.i2_t)\".format(name))\n\n        configs.append(\"# f_t\")\n        configs.append(\"component-node name={0}.f1_t component={0}.W_f.xr input=Append({1}, IfDefined(Offset({0}.r_t, {2})))\"\n                       \"\".format(name, input_descriptor, delay))\n        configs.append(\"component-node name={0}.f2_t component={0}.w_f.c  input={1}\".format(name, delayed_c_t_descriptor))\n        configs.append(\"component-node name={0}.f_t component={0}.f input=Sum({0}.f1_t, {0}.f2_t)\".format(name))\n\n        configs.append(\"# o_t\")\n        configs.append(\"component-node name={0}.o1_t component={0}.W_o.xr input=Append({1}, IfDefined(Offset({0}.r_t, {2})))\"\n                       \"\".format(name, input_descriptor, delay))\n        configs.append(\"component-node name={0}.o2_t component={0}.w_o.c input={0}.c_t\".format(name))\n        configs.append(\"component-node name={0}.o_t component={0}.o input=Sum({0}.o1_t, {0}.o2_t)\".format(name))\n\n        configs.append(\"# h_t\")\n        configs.append(\"component-node name={0}.h_t component={0}.h input={0}.c_t\".format(name))\n\n        configs.append(\"# g_t\")\n        
configs.append(\"component-node name={0}.g1_t component={0}.W_c.xr input=Append({1}, IfDefined(Offset({0}.r_t, {2})))\"\n                       \"\".format(name, input_descriptor, delay))\n        configs.append(\"component-node name={0}.g_t component={0}.g input={0}.g1_t\".format(name))\n\n        configs.append(\"# parts of c_t\")\n        configs.append(\"component-node name={0}.c1_t component={0}.c1  input=Append({0}.f_t, {1})\"\n                       \"\".format(name, delayed_c_t_descriptor))\n        configs.append(\"component-node name={0}.c2_t component={0}.c2 input=Append({0}.i_t, {0}.g_t)\"\n                       \"\".format(name))\n\n        configs.append(\"# m_t\")\n        configs.append(\"component-node name={0}.m_t component={0}.m input=Append({0}.o_t, {0}.h_t)\"\n                       \"\".format(name))\n\n        # add the recurrent connections\n        configs.append(\"component name={0}.r type=BackpropTruncationComponent dim={1} {2}\"\n                       \"\".format(name, cell_dim, bptrunc_str))\n        configs.append(\"component-node name={0}.r_t component={0}.r input={0}.m_t\".format(name))\n        configs.append(\"### End LSTM layer '{0}'\".format(name))\n        return configs\n\n\n# This class is for lines like\n#   'lstmp-layer name=lstm1 input=[-1] delay=-3'\n# (you can also use the name 'lstmp-batchnorm-layer' if you want it to be followed\n# by batchnorm).\n# It generates an LSTM sub-graph with output projections. 
It can also generate\n# outputs without projection, but you could use the XconfigLstmLayer for this\n# simple LSTM.\n# The cell dimension must be specified via 'cell-dim=xxx' (it is compulsory for this layer);\n# the output dimension is recurrent-projection-dim + non-recurrent-projection-dim.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell; must be set to a positive value]\n#   recurrent-projection-dim=-1 [Dimension of the projection used in recurrent connections; defaults to cell-dim/4]\n#   non-recurrent-projection-dim=-1   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim; defaults to recurrent-projection-dim]\n#   delay=-1                 [Delay in the recurrent connections of the LSTM ]\n#   clipping-threshold=30    [nnet3 LSTMs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   self-repair-scale-nonlinearity=1e-5      [A constant scaling the self-repair vector computed in derived classes of NonlinearComponent,\n#                                       i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent ]\n#   ng-per-element-scale-options=' max-change=0.75 '   [Additional options used for the diagonal matrices in the LSTM ]\n#   ng-affine-options=' max-change=0.75 '              [Additional options used for the full matrices in the LSTM, can be used to do things like set biases to initialize to 1]\n#   decay-time=-1            [If >0, an approximate maximum on how many frames\n#                            can be remembered via summation into the cell\n#                            contents 
c_t; enforced by putting a scaling factor\n#                            of recurrence_scale = 1 - abs(delay)/decay_time on\n#                            the recurrence, i.e. the term c_{t-1} in the LSTM\n#                            equations.  E.g. setting this to 20 means no more\n#                            than about 20 frames' worth of history,\n#                            i.e. history since about t = t-20, can be\n#                            accumulated in c_t.]\n#  l2-regularize=0.0         Constant controlling l2 regularization for this layer\nclass XconfigLstmpLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        # lstmp-batchnorm-layer is like lstmp-layer but followed by a batchnorm\n        # component.\n        assert first_token in [\"lstmp-layer\", \"lstmp-batchnorm-layer\"]\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input' : '[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,  # defaults to cell-dim / 4\n                        'non-recurrent-projection-dim' : -1, # defaults to\n                                                             # recurrent-projection-dim\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        'ng-per-element-scale-options' : ' max-change=0.75 ',\n                        'ng-affine-options' : ' max-change=0.75 ',\n                        'self-repair-scale-nonlinearity' : 0.00001,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        'dropout-proportion' : -1.0, # If -1.0, no dropout components will be added\n                        'dropout-per-frame' : False,  # If false, regular dropout, not per frame.\n                        'decay-time':  -1.0,\n  
                     'l2-regularize': 0.0,\n                       }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so that the dimension stays an int under Python 3\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim.\")\n        for key in ['self-repair-scale-nonlinearity']:\n            if self.config[key] < 0.0 or self.config[key] > 1.0:\n                raise RuntimeError(\"{0} has invalid value {1}.\"\n                                   .format(key, self.config[key]))\n\n        if ((self.config['dropout-proportion'] > 1.0 or\n             self.config['dropout-proportion'] < 0.0) and\n             self.config['dropout-proportion'] != -1.0 ):\n             raise RuntimeError(\"dropout-proportion has invalid value {0}.\"\n                                .format(self.config['dropout-proportion']))\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = ( 'rp_t_batchnorm' if self.layer_type == 'lstmp-batchnorm-layer'\n                      
else 'rp_t' )\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                if auxiliary_output == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise Exception(\"In {0} of type {1}, unknown auxiliary output name {2}\".format(self.name, self.layer_type, auxiliary_output))\n\n        return self.config['recurrent-projection-dim'] + self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_lstm_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the LSTM config\n    def _generate_lstm_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        delay = self.config['delay']\n     
   repair_nonlin = self.config['self-repair-scale-nonlinearity']\n        repair_nonlin_str = \"self-repair-scale={0:.10f}\".format(repair_nonlin) if repair_nonlin is not None else ''\n        decay_time = self.config['decay-time']\n        # we expect decay_time to be either -1, or large, like 10 or 50.\n        recurrence_scale = (1.0 if decay_time < 0 else\n                            1.0 - (abs(delay) / decay_time))\n        assert recurrence_scale > 0   # or user may have set decay-time much\n                                      # too small.\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \" scale={4}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay), recurrence_scale))\n        affine_str = self.config['ng-affine-options']\n        pes_str = self.config['ng-per-element-scale-options']\n        dropout_proportion = self.config['dropout-proportion']\n        dropout_per_frame = 'true' if self.config['dropout-per-frame'] else 'false'\n\n        # Natural gradient per element scale parameters\n        if re.search('param-mean', pes_str) is None and \\\n           re.search('param-stddev', pes_str) is None:\n           pes_str += \" param-mean=0.0 param-stddev=1.0 \"\n        l2_regularize = self.config['l2-regularize']\n        l2_regularize_option = ('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n\n        configs = []\n\n        # the equations implemented here are from Sak et al., 
\"Long Short-Term\n        # Memory Recurrent Neural Network Architectures for Large Scale Acoustic\n        # Modeling\"\n        # https://arxiv.org/pdf/1402.1128.pdf\n        # See equations (7) to (14).\n        # naming convention <layer-name>.W_<outputname>.<input_name>\n        # e.g. Lstm1.W_i.xr for matrix providing output to gate i and operating\n        # on an appended vector [x,r]\n        configs.append(\"# Input gate control : W_i* matrices\")\n        configs.append(\"component name={0}.W_i.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + rec_proj_dim,\n                                                       cell_dim, affine_str, l2_regularize_option))\n        configs.append(\"# note : the cell outputs pass through a diagonal matrix\")\n        configs.append(\"component name={0}.w_i.c type=NaturalGradientPerElementScaleComponent \"\n                       \"dim={1} {2} {3}\".format(name, cell_dim, pes_str,\n                                                l2_regularize_option))\n        configs.append(\"# Forget gate control : W_f* matrices\")\n        configs.append(\"component name={0}.W_f.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + rec_proj_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n        configs.append(\"# note : the cell outputs pass through a diagonal matrix\")\n        configs.append(\"component name={0}.w_f.c type=NaturalGradientPerElementScaleComponent  \"\n                       \"dim={1} {2} {3}\".format(name, cell_dim, pes_str, l2_regularize_option))\n\n        configs.append(\"#  Output gate control : W_o* matrices\")\n        configs.append(\"component name={0}.W_o.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + 
rec_proj_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n        configs.append(\"# note : the cell outputs pass through a diagonal matrix\")\n        configs.append(\"component name={0}.w_o.c type=NaturalGradientPerElementScaleComponent \"\n                       \"dim={1} {2} {3}\".format(name, cell_dim, pes_str, l2_regularize_option))\n\n        configs.append(\"# Cell input matrices : W_c* matrices\")\n        configs.append(\"component name={0}.W_c.xr type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + rec_proj_dim, cell_dim,\n                                                       affine_str, l2_regularize_option))\n\n        configs.append(\"# Defining the non-linearities\")\n        configs.append(\"component name={0}.i type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.f type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.g type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        configs.append(\"component name={0}.h type=TanhComponent dim={1} {2}\".format(name, cell_dim, repair_nonlin_str))\n        if dropout_proportion != -1.0:\n            configs.append(\"component name={0}.dropout type=DropoutComponent dim={1} \"\n                           \"dropout-proportion={2} dropout-per-frame={3}\"\n                           .format(name, cell_dim, dropout_proportion, dropout_per_frame))\n        configs.append(\"# Defining the components for other cell computations\")\n        configs.append(\"component name={0}.c1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\"\n                       \"\".format(name, 2 * cell_dim, 
cell_dim))\n        configs.append(\"component name={0}.c2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\"\n                       \"\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.m type=ElementwiseProductComponent input-dim={1} output-dim={2}\"\n                       \"\".format(name, 2 * cell_dim, cell_dim))\n        configs.append(\"component name={0}.c type=BackpropTruncationComponent dim={1} {2}\"\n                       \"\".format(name, cell_dim, bptrunc_str))\n\n        # c1_t and c2_t defined below\n        configs.append(\"component-node name={0}.c_t component={0}.c input=Sum({0}.c1_t, {0}.c2_t)\".format(name))\n        delayed_c_t_descriptor = \"IfDefined(Offset({0}.c_t, {1}))\".format(name, delay)\n\n        recurrent_connection = '{0}.r_t'.format(name)\n        configs.append(\"# i_t\")\n        configs.append(\"component-node name={0}.i1_t component={0}.W_i.xr input=Append({1}, IfDefined(Offset({2}, {3})))\"\n                       \"\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.i2_t component={0}.w_i.c  input={1}\".format(name, delayed_c_t_descriptor))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.i_t_predrop component={0}.i input=Sum({0}.i1_t, {0}.i2_t)\".format(name))\n            configs.append(\"component-node name={0}.i_t component={0}.dropout input={0}.i_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.i_t component={0}.i input=Sum({0}.i1_t, {0}.i2_t)\".format(name))\n\n        configs.append(\"# f_t\")\n        configs.append(\"component-node name={0}.f1_t component={0}.W_f.xr input=Append({1}, IfDefined(Offset({2}, {3})))\"\n                       \"\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.f2_t component={0}.w_f.c  input={1}\".format(name, 
delayed_c_t_descriptor))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.f_t_predrop component={0}.f input=Sum({0}.f1_t, {0}.f2_t)\".format(name))\n            configs.append(\"component-node name={0}.f_t component={0}.dropout input={0}.f_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.f_t component={0}.f input=Sum({0}.f1_t, {0}.f2_t)\".format(name))\n\n        configs.append(\"# o_t\")\n        configs.append(\"component-node name={0}.o1_t component={0}.W_o.xr input=Append({1}, IfDefined(Offset({2}, {3})))\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.o2_t component={0}.w_o.c input={0}.c_t\".format(name))\n        if dropout_proportion != -1.0:\n            configs.append(\"component-node name={0}.o_t_predrop component={0}.o input=Sum({0}.o1_t, {0}.o2_t)\".format(name))\n            configs.append(\"component-node name={0}.o_t component={0}.dropout input={0}.o_t_predrop\".format(name))\n        else:\n            configs.append(\"component-node name={0}.o_t component={0}.o input=Sum({0}.o1_t, {0}.o2_t)\".format(name))\n\n        configs.append(\"# h_t\")\n        configs.append(\"component-node name={0}.h_t component={0}.h input={0}.c_t\".format(name))\n\n        configs.append(\"# g_t\")\n        configs.append(\"component-node name={0}.g1_t component={0}.W_c.xr input=Append({1}, IfDefined(Offset({2}, {3})))\"\n                       \"\".format(name, input_descriptor, recurrent_connection, delay))\n        configs.append(\"component-node name={0}.g_t component={0}.g input={0}.g1_t\".format(name))\n\n        configs.append(\"# parts of c_t\")\n        configs.append(\"component-node name={0}.c1_t component={0}.c1  input=Append({0}.f_t, {1})\".format(name, delayed_c_t_descriptor))\n        configs.append(\"component-node name={0}.c2_t component={0}.c2 input=Append({0}.i_t, {0}.g_t)\".format(name))\n\n       
 configs.append(\"# m_t\")\n        configs.append(\"component-node name={0}.m_t component={0}.m input=Append({0}.o_t, {0}.h_t)\".format(name))\n\n        # add the recurrent connections\n        configs.append(\"# projection matrices : Wrm and Wpm\")\n        configs.append(\"component name={0}.W_rp.m type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, cell_dim, rec_proj_dim + nonrec_proj_dim,\n                                                       affine_str, l2_regularize_option))\n        configs.append(\"component name={0}.r type=BackpropTruncationComponent dim={1} {2}\"\n                       \"\".format(name, rec_proj_dim, bptrunc_str))\n\n        configs.append(\"# r_t and p_t : rp_t will be the output (if we're not doing batchnorm)\")\n        configs.append(\"component-node name={0}.rp_t component={0}.W_rp.m input={0}.m_t\"\n                       \"\".format(name))\n        configs.append(\"dim-range-node name={0}.r_t_preclip input-node={0}.rp_t dim-offset=0 \"\n                       \"dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"component-node name={0}.r_t component={0}.r input={0}.r_t_preclip\".format(name))\n\n        if self.layer_type == \"lstmp-batchnorm-layer\":\n            # Add the batchnorm component, if requested to include batchnorm.\n            configs.append(\"component name={0}.rp_t_batchnorm type=BatchNormComponent dim={1} \".format(\n                name, rec_proj_dim + nonrec_proj_dim))\n            configs.append(\"component-node name={0}.rp_t_batchnorm component={0}.rp_t_batchnorm \"\n                           \"input={0}.rp_t\".format(name))\n\n        return configs\n\n\n# This class is for lines like\n#   'fast-lstm-layer name=lstm1 input=[-1] delay=-3'\n# (you can also use the name 'fast-lstm-batchnorm-layer' if you want it to be followed\n# by batchnorm).\n# It generates an LSTM sub-graph without output projections.\n# Unlike 'lstm-layer', 
the core nonlinearities of the LSTM are done in a special-purpose\n# component (LstmNonlinearityComponent), and most of the affine parts of the LSTM are combined\n# into one.\n#\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   delay=-1                 [Delay in the recurrent connections of the LSTM ]\n#   clipping-threshold=30    [nnet3 LSTMs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   lstm-nonlinearity-options=' max-change=0.75 '  [Options string to pass into the LSTM nonlinearity component.]\n#   ng-affine-options=' max-change=1.5 '           [Additional options used for the full matrices in the LSTM, can be used to\n#                                      do things like set biases to initialize to 1]\n#   decay-time=-1            [If >0, an approximate maximum on how many frames\n#                            can be remembered via summation into the cell\n#                            contents c_t; enforced by putting a scaling factor\n#                            of recurrence_scale = 1 - abs(delay)/decay_time on\n#                            the recurrence, i.e. the term c_{t-1} in the LSTM\n#                            equations.  E.g. setting this to 20 means no more\n#                            than about 20 frames' worth of history,\n#                            i.e. 
history since about t = t-20, can be\n#                            accumulated in c_t.]\n#  l2-regularize=0.0         Constant controlling l2 regularization for this layer\nclass XconfigFastLstmLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token in [\"fast-lstm-layer\", \"fast-lstm-batchnorm-layer\"]\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'clipping-threshold' : 30.0,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        'delay' : -1,\n                        # if you want to set 'self-repair-scale' (c.f. the\n                        # self-repair-scale-nonlinearity config value in older LSTM layers), you can\n                        # add 'self-repair-scale=xxx' to\n                        # lstm-nonlinearity-options.\n                        'lstm-nonlinearity-options' : ' max-change=0.75',\n                        # the affine layer contains 4 of our old layers -> use a\n                        # larger max-change than the normal value of 0.75.\n                        'ng-affine-options' : ' max-change=1.5',\n                        'l2-regularize': 0.0,\n                        'decay-time':  -1.0\n                        }\n        self.c_needed = False  # keep track of whether the 'c' output is needed.\n\n    def set_derived_configs(self):\n        if self.config['cell-dim'] <= 0:\n            self.config['cell-dim'] = self.descriptors['input']['dim']\n\n    def check_configs(self):\n        key = 'cell-dim'\n        if self.config['cell-dim'] <= 0:\n            raise RuntimeError(\"cell-dim has invalid value {0}.\".format(self.config[key]))\n        if self.config['delay'] == 0:\n            raise 
RuntimeError(\"delay cannot be zero\")\n\n\n\n    def auxiliary_outputs(self):\n        return ['c']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = ('m_batchnorm' if self.layer_type == 'fast-lstm-batchnorm-layer'\n                      else 'm')\n        if auxiliary_output is not None:\n            if auxiliary_output == 'c':\n                node_name = 'c'\n                self.c_needed = True\n            else:\n                raise RuntimeError(\"Unknown auxiliary output name {0}\".format(auxiliary_output))\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output == 'c':\n                self.c_needed = True\n                return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise RuntimeError(\"Unknown auxiliary output name {0}\".format(auxiliary_output))\n        return self.config['cell-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_lstm_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the LSTM config\n    def _generate_lstm_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        delay = self.config['delay']\n        affine_str = 
self.config['ng-affine-options']\n        l2_regularize = self.config['l2-regularize']\n        l2_regularize_option = ('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n        decay_time = self.config['decay-time']\n        # we expect decay_time to be either -1, or large, like 10 or 50.\n        recurrence_scale = (1.0 if decay_time < 0 else\n                            1.0 - (abs(delay) / decay_time))\n        assert recurrence_scale > 0   # or user may have set decay-time much\n                                      # too small.\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \" scale={4}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay), recurrence_scale))\n        lstm_str = self.config['lstm-nonlinearity-options']\n\n\n        configs = []\n\n        # the equations implemented here are equations (1) through (6) of\n        # https://arxiv.org/pdf/1402.1128.pdf.\n        # naming convention\n        # <layer-name>.W_<outputname>.<input_name> e.g. 
Lstm1.W_i.xr for matrix\n        # providing output to gate i and operating on an appended vector [x,r]\n        configs.append(\"### Begin LSTM layer '{0}'\".format(name))\n        configs.append(\"# Gate control: contains W_i, W_f, W_c and W_o matrices as blocks.\")\n\n        configs.append(\"component name={0}.W_all type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, input_dim + cell_dim, cell_dim * 4,\n                                                       affine_str, l2_regularize_option))\n\n        configs.append(\"# The core LSTM nonlinearity, implemented as a single component.\")\n        configs.append(\"# Input = (i_part, f_part, c_part, o_part, c_{t-1}), output = (c_t, m_t)\")\n        configs.append(\"# See cu-math.h:ComputeLstmNonlinearity() for details.\")\n        configs.append(\"component name={0}.lstm_nonlin type=LstmNonlinearityComponent \"\n                       \"cell-dim={1} {2} {3}\".format(name, cell_dim, lstm_str,\n                                                     l2_regularize_option))\n\n        configs.append(\"# Component for backprop truncation, to avoid gradient blowup in long training examples.\")\n        configs.append(\"component name={0}.cm_trunc type=BackpropTruncationComponent dim={1} \"\n                       \"{2}\".format(name, 2 * cell_dim, bptrunc_str))\n\n        configs.append(\"###  Nodes for the components above.\")\n        configs.append(\"component-node name={0}.W_all component={0}.W_all input=Append({1}, \"\n                       \"IfDefined(Offset({0}.m_trunc, {2})))\".format(\n                           name, input_descriptor, delay))\n\n        configs.append(\"component-node name={0}.lstm_nonlin component={0}.lstm_nonlin \"\n                       \"input=Append({0}.W_all, IfDefined(Offset({0}.c_trunc, {1})))\".format(\n                           name, delay))\n        # we can print .c later if needed, but it generates a warning since 
it's not used.  could use c_trunc instead\n        #configs.append(\"dim-range-node name={0}.c input-node={0}.lstm_nonlin dim-offset=0 dim={1}\".format(name, cell_dim))\n        configs.append(\"dim-range-node name={0}.m input-node={0}.lstm_nonlin dim-offset={1} dim={1}\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.cm_trunc component={0}.cm_trunc input={0}.lstm_nonlin\".format(name))\n        configs.append(\"dim-range-node name={0}.c_trunc input-node={0}.cm_trunc dim-offset=0 dim={1}\".format(name, cell_dim))\n        configs.append(\"dim-range-node name={0}.m_trunc input-node={0}.cm_trunc dim-offset={1} dim={1}\".format(name, cell_dim))\n\n        if self.layer_type == \"fast-lstm-batchnorm-layer\":\n            # Add the batchnorm component, if requested to include batchnorm.\n            configs.append(\"component name={0}.m_batchnorm type=BatchNormComponent dim={1} \".format(\n                name, cell_dim))\n            configs.append(\"component-node name={0}.m_batchnorm component={0}.m_batchnorm \"\n                           \"input={0}.m\".format(name))\n        configs.append(\"### End LTSM layer '{0}'\".format(name))\n        return configs\n\n\n\n# This class is for lines like\n#   'lstmb-layer name=lstm1 input=[-1] delay=-3'\n#\n# LSTMB is not something we've published; it's LSTM with a bottleneck in the\n# middle of the W_all matrix (where W_all is a matrix that combines the 8 full\n# matrices of standard LSTM).  
W_all is factored into W_all_a and W_all_b, where\n# W_all_a is constrained to have orthonormal rows (this keeps it training stably).\n#\n# It also contains a couple of other improvements: W_all_b is followed by\n# a trainable ScaleAndOffsetComponent (this is a bit like the idea from the\n# publication \"Self-stabilized deep neural network\" by Ghahremani and Droppo).\n# And the LSTM is followed by a batchnorm component (this is by default; it's not\n# part of the layer name, like lstmb-batchnorm-layer).\n\n#\n# The output dimension of the layer may be specified via 'cell-dim=xxx', but if not specified,\n# the dimension defaults to the same as the input.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   bottleneck-dim=-1        [Bottleneck dim, should be less than cell-dim plus the input dim.]\n#   delay=-1                 [Delay in the recurrent connections of the LSTM ]\n#   clipping-threshold=30    [nnet3 LSTMs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   lstm-nonlinearity-options=' max-change=0.75 '  [Options string to pass into the LSTM nonlinearity component.]\n#   ng-affine-options=' max-change=1.5 '           [Additional options used for the full matrices in the LSTM, can be used to\n#                                      do things like set biases to initialize to 1]\n#   decay-time=-1            [If >0, an approximate maximum on how many frames\n#                            can be remembered via summation into the cell\n#                            contents 
c_t; enforced by putting a scaling factor\n#                            of recurrence_scale = 1 - abs(delay)/decay_time on\n#                            the recurrence, i.e. the term c_{t-1} in the LSTM\n#                            equations.  E.g. setting this to 20 means no more\n#                            than about 20 frames' worth of history,\n#                            i.e. history since about t = t-20, can be\n#                            accumulated in c_t.]\n#  l2-regularize=0.0         Constant controlling l2 regularization for this layer\nclass XconfigLstmbLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token == 'lstmb-layer'\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = { 'input':'[-1]',\n                        'cell-dim' : -1, # this is a required argument\n                        'bottleneck-dim': -1, # this is a required argument\n                        'clipping-threshold': 30.0,\n                        'zeroing-interval': 20,\n                        'zeroing-threshold': 15.0,\n                        'orthonormal-constraint': 1.0,\n                        'delay' : -1,\n                        'lstm-nonlinearity-options' : ' max-change=0.75',\n                        # the recurrence scale is the scale on m_trunc, used in the\n                        # recurrence (to balance its size with the input).\n                        'self-scale' : 1.0,\n                        # the affine layer contains 4 of our old layers -> use a\n                        # larger max-change than the normal value of 0.75.\n                        'ng-affine-options' : ' max-change=1.5',\n                        'l2-regularize': 0.0,\n                        'decay-time':  -1.0\n                        }\n\n    def set_derived_configs(self):\n        if self.config['cell-dim'] <= 0:\n            
self.config['cell-dim'] = self.descriptors['input']['dim']\n\n    def check_configs(self):\n        if self.config['cell-dim'] <= 0:\n            raise RuntimeError(\"cell-dim has invalid value {0}.\".format(\n                self.config['cell-dim']))\n        if self.config['bottleneck-dim'] <= 0:\n            raise RuntimeError(\"bottleneck-dim has invalid value {0}.\".format(\n                self.config['bottleneck-dim']))\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n\n    def auxiliary_outputs(self):\n        return []\n\n    def output_name(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        return '{0}.m_batchnorm'.format(self.name)\n\n    def output_dim(self, auxiliary_output = None):\n        assert auxiliary_output is None\n        return self.config['cell-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_lstm_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the LSTM config\n    def _generate_lstm_config(self):\n\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        bottleneck_dim = self.config['bottleneck-dim']\n        self_scale = self.config['self-scale']\n        delay = self.config['delay']\n        affine_str = self.config['ng-affine-options']\n        l2_regularize = self.config['l2-regularize']\n        l2_regularize_option = 
('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n        decay_time = self.config['decay-time']\n        # we expect decay_time to be either -1, or large, like 10 or 50.\n        recurrence_scale = (1.0 if decay_time < 0 else\n                            1.0 - (abs(delay) / decay_time))\n        assert recurrence_scale > 0   # or user may have set decay-time much\n                                      # too small.\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \" scale={4}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay), recurrence_scale))\n        lstm_str = self.config['lstm-nonlinearity-options']\n\n\n        configs = []\n\n        # See XconfigFastLstmLayer to understand what's going on here.  This\n        # differs from that code by a factorization of the W_all matrix into two\n        # pieces with a smaller dimension in between (with the first of the two\n        # pieces constrained to have orthonormal rows).  
Note: we don't apply l2\n        # regularization to this layer, since, with the orthonormality\n        # constraint, it's meaningless.\n        configs.append(\"### Begin LTSM layer '{0}'\".format(name))\n        configs.append(\"component name={0}.W_all_a type=LinearComponent input-dim={1} \"\n                       \"orthonormal-constraint={2} output-dim={3} {4}\".format(\n                           name, input_dim + cell_dim,\n                           self.config['orthonormal-constraint'],\n                           bottleneck_dim, affine_str))\n\n        configs.append(\"component name={0}.W_all_b type=LinearComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(name, bottleneck_dim, cell_dim * 4,\n                                                       affine_str, l2_regularize_option))\n        configs.append(\"component name={0}.W_all_b_so type=ScaleAndOffsetComponent dim={1} \"\n                       \"max-change=0.75\".format(name, cell_dim * 4))\n\n\n        configs.append(\"# The core LSTM nonlinearity, implemented as a single component.\")\n        configs.append(\"# Input = (i_part, f_part, c_part, o_part, c_{t-1}), output = (c_t, m_t)\")\n        configs.append(\"# See cu-math.h:ComputeLstmNonlinearity() for details.\")\n        configs.append(\"component name={0}.lstm_nonlin type=LstmNonlinearityComponent \"\n                       \"cell-dim={1} {2} {3}\".format(name, cell_dim, lstm_str,\n                                                     l2_regularize_option))\n        configs.append(\"# Component for backprop truncation, to avoid gradient blowup in long training examples.\")\n\n        configs.append(\"component name={0}.cm_trunc type=BackpropTruncationComponent dim={1} {2}\".format(\n            name, 2 * cell_dim, bptrunc_str))\n        configs.append(\"component name={0}.m_batchnorm type=BatchNormComponent dim={1} \".format(\n            name, cell_dim))\n\n        configs.append(\"###  Nodes for the 
components above.\")\n        configs.append(\"component-node name={0}.W_all_a component={0}.W_all_a input=Append({1}, \"\n                       \"IfDefined(Offset(Scale({2}, {0}.m_trunc), {3})))\".format(\n                           name, input_descriptor, self_scale, delay))\n        configs.append(\"component-node name={0}.W_all_b component={0}.W_all_b \"\n                       \"input={0}.W_all_a\".format(name))\n        configs.append(\"component-node name={0}.W_all_b_so component={0}.W_all_b_so \"\n                       \"input={0}.W_all_b\".format(name))\n\n        configs.append(\"component-node name={0}.lstm_nonlin component={0}.lstm_nonlin \"\n                       \"input=Append({0}.W_all_b_so, IfDefined(Offset({0}.c_trunc, {1})))\".format(\n                           name, delay))\n        configs.append(\"dim-range-node name={0}.m input-node={0}.lstm_nonlin dim-offset={1} \"\n                       \"dim={1}\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.cm_trunc component={0}.cm_trunc input={0}.lstm_nonlin\".format(name))\n        configs.append(\"dim-range-node name={0}.c_trunc input-node={0}.cm_trunc dim-offset=0 \"\n                       \"dim={1}\".format(name, cell_dim))\n        configs.append(\"dim-range-node name={0}.m_trunc input-node={0}.cm_trunc dim-offset={1} \"\n                       \"dim={1}\".format(name, cell_dim))\n        configs.append(\"component-node name={0}.m_batchnorm component={0}.m_batchnorm \"\n                       \"input={0}.m\".format(name))\n        configs.append(\"### End LTSM layer '{0}'\".format(name))\n        return configs\n\n\n\n\n# This class is for lines like\n#   'fast-lstmp-layer name=lstm1 input=[-1] delay=-3'\n# or:\n#   'fast-lstmp-layer name=lstm1 input=[-1] delay=-3 cell-dim=1024 recurrent-projection-dim=512 non-recurrent-projection-dim=512'\n# (you can also use the name 'fast-lstmp-batchnorm-layer' if you want it to be followed\n# by batchnorm).\n# It generates an 
LSTM sub-graph with output projections (i.e. a projected LSTM, AKA LSTMP).\n# Unlike 'lstmp-layer', the core nonlinearities of the LSTM are done in a special-purpose\n# component (LstmNonlinearityComponent), and most of the affine parts of the LSTM are combined\n# into one.\n#\n# The cell dimension must be specified via 'cell-dim=xxx'; the output dimension of the layer\n# is recurrent-projection-dim + non-recurrent-projection-dim.\n# See other configuration values below.\n#\n# Parameters of the class, and their defaults:\n#   input='[-1]'             [Descriptor giving the input of the layer.]\n#   cell-dim=-1              [Dimension of the cell]\n#   recurrent-projection-dim [Dimension of the projection used in recurrent connections, e.g. cell-dim/4]\n#   non-recurrent-projection-dim   [Dimension of the projection in non-recurrent connections,\n#                                   in addition to recurrent-projection-dim, e.g. cell-dim/4]\n#   delay=-1                 [Delay in the recurrent connections of the LSTM ]\n#   clipping-threshold=30    [nnet3 LSTMs use a gradient clipping component at the recurrent connections.\n#                             This is the threshold used to decide if clipping has to be activated ]\n#   zeroing-interval=20      [interval at which we (possibly) zero out the recurrent derivatives.]\n#   zeroing-threshold=15     [We only zero out the derivs every zeroing-interval, if derivs exceed this value.]\n#   lstm-nonlinearity-options=' max-change=0.75 '  [Options string to pass into the LSTM nonlinearity component.]\n#   ng-affine-options=' max-change=1.5 '           [Additional options used for the full matrices in the LSTM, can be used to\n#                                      do things like set biases to initialize to 1]\n#   decay-time=-1            [If >0, an approximate maximum on how many frames\n#                            can be remembered via summation into the cell\n#                            contents c_t; enforced by putting a 
scaling factor\n#                            of recurrence_scale = 1 - abs(delay)/decay_time on\n#                            the recurrence, i.e. the term c_{t-1} in the LSTM\n#                            equations.  E.g. setting this to 20 means no more\n#                            than about 20 frames' worth of history,\n#                            i.e. history since about t = t-20, can be\n#                            accumulated in c_t.]\n#  l2-regularize=0.0         Constant controlling l2 regularization for this layer\nclass XconfigFastLstmpLayer(XconfigLayerBase):\n    def __init__(self, first_token, key_to_value, prev_names = None):\n        assert first_token in ['fast-lstmp-layer', 'fast-lstmp-batchnorm-layer']\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input':'[-1]',\n                        'cell-dim' : -1, # this is a compulsory argument\n                        'recurrent-projection-dim' : -1,\n                        'non-recurrent-projection-dim' : -1,\n                        'clipping-threshold' : 30.0,\n                        'delay' : -1,\n                        # if you want to set 'self-repair-scale' (c.f. 
the\n                        # self-repair-scale-nonlinearity config value in older LSTM layers), you can\n                        # add 'self-repair-scale=xxx' to\n                        # lstm-nonlinearity-options.\n                        'lstm-nonlinearity-options' : ' max-change=0.75',\n                        # the affine layer contains 4 of our old layers -> use a\n                        # larger max-change than the normal value of 0.75.\n                        'ng-affine-options' : ' max-change=1.5',\n                        'l2-regularize': 0.0,\n                        'decay-time':  -1.0,\n                        'zeroing-interval' : 20,\n                        'zeroing-threshold' : 15.0,\n                        'dropout-proportion' : -1.0, # If -1.0, no dropout will\n                                                     # be used.\n                         }\n\n    def set_derived_configs(self):\n        if self.config['recurrent-projection-dim'] <= 0:\n            # integer division, so that the dimension stays an int under Python 3.\n            self.config['recurrent-projection-dim'] = self.config['cell-dim'] // 4\n\n        if self.config['non-recurrent-projection-dim'] <= 0:\n            self.config['non-recurrent-projection-dim'] = \\\n               self.config['recurrent-projection-dim']\n\n\n    def check_configs(self):\n        for key in ['cell-dim', 'recurrent-projection-dim',\n                    'non-recurrent-projection-dim']:\n            if self.config[key] <= 0:\n                raise RuntimeError(\"{0} has invalid value {1}.\".format(\n                    key, self.config[key]))\n        if self.config['delay'] == 0:\n            raise RuntimeError(\"delay cannot be zero\")\n        if (self.config['recurrent-projection-dim'] +\n            self.config['non-recurrent-projection-dim'] >\n            self.config['cell-dim']):\n            raise RuntimeError(\"recurrent+non-recurrent projection dim exceeds \"\n                                \"cell dim\")\n        if ((self.config['dropout-proportion'] > 1.0 or\n     
        self.config['dropout-proportion'] < 0.0) and\n             self.config['dropout-proportion'] != -1.0 ):\n            raise RuntimeError(\"dropout-proportion has invalid value {0}.\".format(self.config['dropout-proportion']))\n\n\n    def auxiliary_outputs(self):\n        return ['c_t']\n\n    def output_name(self, auxiliary_output = None):\n        node_name = ('rp_batchnorm' if self.layer_type == 'fast-lstmp-batchnorm-layer'\n                     else 'rp')\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                node_name = auxiliary_output\n            else:\n                raise RuntimeError(\"Unknown auxiliary output name {0}\".format(auxiliary_output))\n\n        return '{0}.{1}'.format(self.name, node_name)\n\n    def output_dim(self, auxiliary_output = None):\n        if auxiliary_output is not None:\n            if auxiliary_output in self.auxiliary_outputs():\n                # 'c_t' is the only auxiliary output currently exposed; it has the cell dimension.\n                if auxiliary_output == 'c_t':\n                    return self.config['cell-dim']\n                # add code for other auxiliary_outputs here when we decide to expose them\n            else:\n                raise RuntimeError(\"Unknown auxiliary output name {0}\".format(auxiliary_output))\n        return self.config['recurrent-projection-dim'] + \\\n               self.config['non-recurrent-projection-dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_lstm_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in LSTM initialization\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    # convenience function to generate the LSTM config\n    def _generate_lstm_config(self):\n        # assign some variables to reduce verbosity\n        name = self.name\n        # in the below code we will just 
call descriptor_strings as descriptors for conciseness\n        input_dim = self.descriptors['input']['dim']\n        input_descriptor = self.descriptors['input']['final-string']\n        cell_dim = self.config['cell-dim']\n        delay = self.config['delay']\n        rec_proj_dim = self.config['recurrent-projection-dim']\n        nonrec_proj_dim = self.config['non-recurrent-projection-dim']\n        affine_str = self.config['ng-affine-options']\n        decay_time = self.config['decay-time']\n        # we expect decay_time to be either -1, or large, like 10 or 50.\n        recurrence_scale = (1.0 if decay_time < 0 else\n                            1.0 - (abs(delay) / decay_time))\n        assert recurrence_scale > 0   # or user may have set decay-time much\n                                      # too small.\n        bptrunc_str = (\"clipping-threshold={0}\"\n                      \" zeroing-threshold={1}\"\n                      \" zeroing-interval={2}\"\n                      \" recurrence-interval={3}\"\n                      \" scale={4}\"\n                      \"\".format(self.config['clipping-threshold'],\n                                self.config['zeroing-threshold'],\n                                self.config['zeroing-interval'],\n                                abs(delay), recurrence_scale))\n\n        lstm_str = self.config['lstm-nonlinearity-options']\n        dropout_proportion = self.config['dropout-proportion']\n        l2_regularize = self.config['l2-regularize']\n        l2_regularize_option = ('l2-regularize={0} '.format(l2_regularize)\n                                if l2_regularize != 0.0 else '')\n\n        configs = []\n\n        # the equations implemented here are from Sak et. al. 
\"Long Short-Term\n        # Memory Recurrent Neural Network Architectures for Large Scale Acoustic\n        # Modeling\"\n        # https://arxiv.org/pdf/1402.1128.pdf\n        # See equations (7) to (14).\n        # naming convention\n        # <layer-name>.W_<outputname>.<input_name> e.g. Lstm1.W_i.xr for matrix providing output to gate i and operating on an appended vector [x,r]\n        configs.append(\"##  Begin LTSM layer '{0}'\".format(name))\n        configs.append(\"# Gate control: contains W_i, W_f, W_c and W_o matrices as blocks.\")\n        configs.append(\"component name={0}.W_all type=NaturalGradientAffineComponent input-dim={1} \"\n                       \"output-dim={2} {3} {4}\".format(\n                           name, input_dim + rec_proj_dim, cell_dim * 4,\n                           affine_str, l2_regularize_option))\n        configs.append(\"# The core LSTM nonlinearity, implemented as a single component.\")\n        configs.append(\"# Input = (i_part, f_part, c_part, o_part, c_{t-1}), output = (c_t, m_t)\")\n        configs.append(\"# See cu-math.h:ComputeLstmNonlinearity() for details.\")\n        configs.append(\"component name={0}.lstm_nonlin type=LstmNonlinearityComponent cell-dim={1} \"\n                       \"use-dropout={2} {3} {4}\"\n                       .format(name, cell_dim,\n                               \"true\" if dropout_proportion != -1.0 else \"false\",\n                               lstm_str, l2_regularize_option))\n        configs.append(\"# Component for backprop truncation, to avoid gradient blowup in long training examples.\")\n        configs.append(\"component name={0}.cr_trunc type=BackpropTruncationComponent \"\n                       \"dim={1} {2}\".format(name, cell_dim + rec_proj_dim, bptrunc_str))\n        if dropout_proportion != -1.0:\n            configs.append(\"component name={0}.dropout_mask type=DropoutMaskComponent output-dim=3 \"\n                           \"dropout-proportion={1} \"\n            
               .format(name, dropout_proportion))\n        configs.append(\"# Component specific to 'projected' LSTM (LSTMP), contains both recurrent\");\n        configs.append(\"# and non-recurrent projections\")\n        configs.append(\"component name={0}.W_rp type=NaturalGradientAffineComponent \"\n                       \"input-dim={1} output-dim={2} {3} {4}\".format(\n                           name, cell_dim, rec_proj_dim + nonrec_proj_dim,\n                           affine_str, l2_regularize_option))\n        configs.append(\"###  Nodes for the components above.\")\n        configs.append(\"component-node name={0}.W_all component={0}.W_all input=Append({1}, \"\n                       \"IfDefined(Offset({0}.r_trunc, {2})))\".format(name, input_descriptor, delay))\n\n        if dropout_proportion != -1.0:\n            # note: the 'input' is a don't-care as the component never uses it; it's required\n            # in component-node lines.\n            configs.append(\"component-node name={0}.dropout_mask component={0}.dropout_mask \"\n                           \"input={0}.dropout_mask\".format(name))\n            configs.append(\"component-node name={0}.lstm_nonlin component={0}.lstm_nonlin \"\n                           \"input=Append({0}.W_all, IfDefined(Offset({0}.c_trunc, {1})), \"\n                           \"{0}.dropout_mask)\".format(name, delay))\n        else:\n            configs.append(\"component-node name={0}.lstm_nonlin component={0}.lstm_nonlin \"\n                           \"input=Append({0}.W_all, IfDefined(Offset({0}.c_trunc, {1})))\".format(\n                               name, delay))\n        configs.append(\"dim-range-node name={0}.c input-node={0}.lstm_nonlin \"\n                       \"dim-offset=0 dim={1}\".format(name, cell_dim))\n        configs.append(\"dim-range-node name={0}.m input-node={0}.lstm_nonlin \"\n                       \"dim-offset={1} dim={1}\".format(name, cell_dim))\n        configs.append(\"# {0}.rp is the 
output node of this layer (if we're not \"\n                       \"including batchnorm)\".format(name))\n        configs.append(\"component-node name={0}.rp component={0}.W_rp input={0}.m\".format(name))\n        configs.append(\"dim-range-node name={0}.r input-node={0}.rp dim-offset=0 \"\n                       \"dim={1}\".format(name, rec_proj_dim))\n        configs.append(\"# Note: it's not 100% efficient that we have to stitch the c\")\n        configs.append(\"# and r back together to truncate them but it probably\");\n        configs.append(\"# makes the deriv truncation more accurate .\")\n        configs.append(\"component-node name={0}.cr_trunc component={0}.cr_trunc \"\n                       \"input=Append({0}.c, {0}.r)\".format(name))\n        configs.append(\"dim-range-node name={0}.c_trunc input-node={0}.cr_trunc \"\n                       \"dim-offset=0 dim={1}\".format(name, cell_dim))\n        configs.append(\"dim-range-node name={0}.r_trunc input-node={0}.cr_trunc \"\n                       \"dim-offset={1} dim={2}\".format(name, cell_dim, rec_proj_dim))\n        if self.layer_type == \"fast-lstmp-batchnorm-layer\":\n            # Add the batchnorm component, if requested to include batchnorm.\n            configs.append(\"component name={0}.rp_batchnorm type=BatchNormComponent dim={1} \".format(\n                name, rec_proj_dim + nonrec_proj_dim))\n            configs.append(\"component-node name={0}.rp_batchnorm component={0}.rp_batchnorm \"\n                           \"input={0}.rp\".format(name))\n        configs.append(\"### End LSTM Layer '{0}'\".format(name))\n\n        return configs\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/parser.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n# Apache 2.0.\n\n\"\"\" This module contains the top level xconfig parsing functions.\n\"\"\"\n\nfrom __future__ import print_function\n\nimport logging\nimport sys\nimport libs.nnet3.xconfig.layers as xlayers\nimport libs.nnet3.xconfig.utils as xutils\n\nimport libs.common as common_lib\n\n\n# We have to modify this dictionary when adding new layers\nconfig_to_layer = {\n        'input' : xlayers.XconfigInputLayer,\n        'output' : xlayers.XconfigTrivialOutputLayer,\n        'output-layer' : xlayers.XconfigOutputLayer,\n        'relu-layer' : xlayers.XconfigBasicLayer,\n        'relu-renorm-layer' : xlayers.XconfigBasicLayer,\n        'relu-batchnorm-dropout-layer' : xlayers.XconfigBasicLayer,\n        'relu-dropout-layer': xlayers.XconfigBasicLayer,\n        'relu-batchnorm-layer' : xlayers.XconfigBasicLayer,\n        'relu-batchnorm-so-layer' : xlayers.XconfigBasicLayer,\n        'batchnorm-so-relu-layer' : xlayers.XconfigBasicLayer,\n        'batchnorm-layer' : xlayers.XconfigBasicLayer,\n        'sigmoid-layer' : xlayers.XconfigBasicLayer,\n        'tanh-layer' : xlayers.XconfigBasicLayer,\n        'fixed-affine-layer' : xlayers.XconfigFixedAffineLayer,\n        'idct-layer' : xlayers.XconfigIdctLayer,\n        'affine-layer' : xlayers.XconfigAffineLayer,\n        'lstm-layer' : xlayers.XconfigLstmLayer,\n        'lstmp-layer' : xlayers.XconfigLstmpLayer,\n        'lstmp-batchnorm-layer' : xlayers.XconfigLstmpLayer,\n        'fast-lstm-layer' : xlayers.XconfigFastLstmLayer,\n        'fast-lstm-batchnorm-layer' : xlayers.XconfigFastLstmLayer,\n        'fast-lstmp-layer' : xlayers.XconfigFastLstmpLayer,\n        'fast-lstmp-batchnorm-layer' : xlayers.XconfigFastLstmpLayer,\n        'lstmb-layer' : xlayers.XconfigLstmbLayer,\n        'stats-layer': xlayers.XconfigStatsLayer,\n        'relu-conv-layer': xlayers.XconfigConvLayer,\n        
'conv-layer': xlayers.XconfigConvLayer,\n        'conv-relu-layer': xlayers.XconfigConvLayer,\n        'conv-renorm-layer': xlayers.XconfigConvLayer,\n        'relu-conv-renorm-layer': xlayers.XconfigConvLayer,\n        'batchnorm-conv-layer': xlayers.XconfigConvLayer,\n        'conv-relu-renorm-layer': xlayers.XconfigConvLayer,\n        'batchnorm-conv-relu-layer': xlayers.XconfigConvLayer,\n        'relu-batchnorm-conv-layer': xlayers.XconfigConvLayer,\n        'relu-batchnorm-noconv-layer': xlayers.XconfigConvLayer,\n        'relu-noconv-layer': xlayers.XconfigConvLayer,\n        'conv-relu-batchnorm-layer': xlayers.XconfigConvLayer,\n        'conv-relu-batchnorm-so-layer': xlayers.XconfigConvLayer,\n        'conv-relu-batchnorm-dropout-layer': xlayers.XconfigConvLayer,\n        'conv-relu-dropout-layer': xlayers.XconfigConvLayer,\n        'res-block': xlayers.XconfigResBlock,\n        'res2-block': xlayers.XconfigRes2Block,\n        'channel-average-layer': xlayers.ChannelAverageLayer,\n        'attention-renorm-layer': xlayers.XconfigAttentionLayer,\n        'attention-relu-renorm-layer': xlayers.XconfigAttentionLayer,\n        'attention-relu-batchnorm-layer': xlayers.XconfigAttentionLayer,\n        'relu-renorm-attention-layer': xlayers.XconfigAttentionLayer,\n        'gru-layer' : xlayers.XconfigGruLayer,\n        'pgru-layer' : xlayers.XconfigPgruLayer,\n        'opgru-layer' : xlayers.XconfigOpgruLayer,\n        'norm-pgru-layer' : xlayers.XconfigNormPgruLayer,\n        'norm-opgru-layer' : xlayers.XconfigNormOpgruLayer,\n        'fast-gru-layer' : xlayers.XconfigFastGruLayer,\n        'fast-pgru-layer' : xlayers.XconfigFastPgruLayer,\n        'fast-norm-pgru-layer' : xlayers.XconfigFastNormPgruLayer,\n        'fast-opgru-layer' : xlayers.XconfigFastOpgruLayer,\n        'fast-norm-opgru-layer' : xlayers.XconfigFastNormOpgruLayer,\n        'tdnnf-layer': xlayers.XconfigTdnnfLayer,\n        'prefinal-layer': xlayers.XconfigPrefinalLayer,\n        
'spec-augment-layer': xlayers.XconfigSpecAugmentLayer,\n        'renorm-component': xlayers.XconfigRenormComponent,\n        'batchnorm-component': xlayers.XconfigBatchnormComponent,\n        'no-op-component': xlayers.XconfigNoOpComponent,\n        'linear-component': xlayers.XconfigLinearComponent,\n        'affine-component': xlayers.XconfigAffineComponent,\n        'scale-component':  xlayers.XconfigPerElementScaleComponent,\n        'dim-range-component': xlayers.XconfigDimRangeComponent,\n        'offset-component':  xlayers.XconfigPerElementOffsetComponent,\n        'combine-feature-maps-layer': xlayers.XconfigCombineFeatureMapsLayer,\n        'delta-layer': xlayers.XconfigDeltaLayer\n}\n\n# Turn a config line and a list of previous layers into\n# either an object representing that line of the config file; or None\n# if the line was empty after removing comments.\n# 'prev_layers' is a list of objects corresponding to preceding layers of the\n# config file.\ndef xconfig_line_to_object(config_line, prev_layers = None):\n    try:\n        x  = xutils.parse_config_line(config_line)\n        if x is None:\n            return None\n        (first_token, key_to_value) = x\n        if not first_token in config_to_layer:\n            raise RuntimeError(\"No such layer type '{0}'\".format(first_token))\n        return config_to_layer[first_token](first_token, key_to_value, prev_layers)\n    except Exception:\n        logging.error(\n            \"***Exception caught while parsing the following xconfig line:\\n\"\n            \"*** {0}\".format(config_line))\n        raise\n\n\ndef get_model_component_info(model_filename):\n    \"\"\"\n    This function reads existing model (*.raw or *.mdl) and returns array\n    of XconfigExistingLayer one per {input,output}-node or component-node\n    with same 'name' used in the raw model and 'dim' equal to 'output-dim'\n    for component-node and 'dim' for {input,output}-node.\n\n    e.g. 
layer in *.mdl -> corresponding 'XconfigExistingLayer' layer\n         'input-node name=ivector dim=100' ->\n         'existing name=ivector dim=100'\n         'component-node name=tdnn1.affine ... input-dim=1000 '\n         'output-dim=500' ->\n         'existing name=tdnn1.affine dim=500'\n    \"\"\"\n\n    all_layers = []\n    try:\n        f = open(model_filename, 'r')\n    except Exception as e:\n        sys.exit(\"{0}: error reading model file '{1}'; error was {2}\".format(\n            sys.argv[0], model_filename, repr(e)))\n\n    # use nnet3-info to get component names in the model.\n    out = common_lib.get_command_stdout(\"\"\"nnet3-info {0} | grep '\\-node' \"\"\"\n                                        \"\"\" \"\"\".format(model_filename))\n\n    # out contains all {output, input, component}-nodes used in model_filename\n    # The loop below parses lines of out such as:\n    # e.g. input-node name=input dim=40\n    #   component-node name=tdnn1.affine component=tdnn1.affine input=lda\n    #   input-dim=300 output-dim=512\n    layer_names = []\n    key_to_value = dict()\n    for line in out.split(\"\\n\"):\n        parts = line.split(\" \")\n        dim = -1\n        layer_name = None  # reset so that a line without 'name=' is skipped\n        for  field in parts:\n            key_value = field.split(\"=\")\n            if len(key_value) == 2:\n                key = key_value[0]\n                value = key_value[1]\n                if key == \"name\":           # name=**\n                    layer_name = value\n                elif key == \"dim\":          # for input-node\n                    dim = int(value)\n                elif key == \"output-dim\":   # for component-node\n                    dim = int(value)\n\n        if layer_name is not None and layer_name not in layer_names:\n            layer_names.append(layer_name)\n            key_to_value['name'] = layer_name\n            assert(dim != -1)\n            key_to_value['dim'] = dim\n
            all_layers.append(xlayers.XconfigExistingLayer('existing', key_to_value, all_layers))\n    if len(all_layers) == 0:\n        raise RuntimeError(\"{0}: model filename '{1}' is empty.\".format(\n            sys.argv[0], model_filename))\n    f.close()\n    return all_layers\n\n\n# This function reads xconfig file and returns it as a list of layers\n# (usually we use the variable name 'all_layers' elsewhere for this).\n# It will die if the xconfig file is empty or if there was\n# some error parsing it.\n# 'existing_layers' contains some layers of type 'existing' (layers which are not really\n# layers but are actual component node names from an existing neural net model\n# and created using get_model_component_info function).\n# 'existing' layers can be used as input to component-nodes in layers of xconfig file.\ndef read_xconfig_file(xconfig_filename, existing_layers=None):\n    if existing_layers is None:\n        existing_layers = []\n    try:\n        f = open(xconfig_filename, 'r')\n    except Exception as e:\n        sys.exit(\"{0}: error reading xconfig file '{1}'; error was {2}\".format(\n            sys.argv[0], xconfig_filename, repr(e)))\n    all_layers = []\n    while True:\n        line = f.readline()\n        if line == '':\n            break\n        # the next call will raise an easy-to-understand exception if\n        # it fails.\n        this_layer = xconfig_line_to_object(line, existing_layers)\n        if this_layer is None:\n            continue  # line was blank after removing comments.\n        all_layers.append(this_layer)\n        existing_layers.append(this_layer)\n    if len(all_layers) == 0:\n        raise RuntimeError(\"{0}: xconfig file '{1}' is empty\".format(\n            sys.argv[0], xconfig_filename))\n    f.close()\n    return all_layers\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/stats_layer.py",
    "content": "# Copyright 2016    Johns Hopkins University (Author: Daniel Povey)\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This module contains the statistics extraction and pooling layer.\n\"\"\"\n\nfrom __future__ import print_function\nimport re\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n\nclass XconfigStatsLayer(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n    stats-layer name=tdnn1-stats config=mean+stddev(-99:3:9:99) input=tdnn1\n\n    This adds statistics-pooling and statistics-extraction components.  An\n    example string is 'mean(-99:3:9::99)', which means, compute the mean of\n    data within a window of -99 to +99, with distinct means computed every 9\n    frames (we round to get the appropriate one), and with the input extracted\n    on multiples of 3 frames (so this will force the input to this layer to be\n    evaluated every 3 frames).  Another example string is\n    'mean+stddev(-99:3:9:99)', which will also cause the standard deviation to\n    be computed.\n\n    The dimension is worked out from the input. mean and stddev add a\n    dimension of input_dim each to the output dimension. If counts is\n    specified, an additional dimension is added to the output to store log\n    counts.\n\n    Parameters of the class, and their defaults:\n        input='[-1]'    [Descriptor giving the input of the layer.]\n        dim=-1      [Output dimension of layer. If provided, must match the\n                     dimension computed from input]\n        config=''   [Required. 
Defines what stats must be computed.]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        assert first_token in ['stats-layer']\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'config': ''}\n\n    def set_derived_configs(self):\n        config_string = self.config['config']\n        if config_string == '':\n            raise RuntimeError(\"config has to be non-empty\",\n                                self.str())\n        m = re.search(\"(mean|mean\\+stddev|mean\\+count|mean\\+stddev\\+count)\"\n                      \"\\((-?\\d+):(-?\\d+):(-?\\d+):(-?\\d+)\\)\",\n                      config_string)\n        if m is None:\n            raise RuntimeError(\"Invalid statistic-config string: {0}\".format(\n                config_string), self)\n\n        self._output_stddev = (m.group(1) in ['mean+stddev',\n                                              'mean+stddev+count'])\n        self._output_log_counts = (m.group(1) in ['mean+count',\n                                                  'mean+stddev+count'])\n        self._left_context = -int(m.group(2))\n        self._input_period = int(m.group(3))\n        self._stats_period = int(m.group(4))\n        self._right_context = int(m.group(5))\n\n        if self._output_stddev:\n          output_dim = 2 * self.descriptors['input']['dim']\n        else:\n          output_dim = self.descriptors['input']['dim']\n        if self._output_log_counts:\n          output_dim = output_dim + 1\n\n        if self.config['dim'] > 0 and self.config['dim'] != output_dim:\n            raise RuntimeError(\n                \"Invalid dim supplied {0:d} != \"\n                \"actual output dim {1:d}\".format(\n                    self.config['dim'], output_dim))\n        self.config['dim'] = output_dim\n\n    def 
check_configs(self):\n        if not (self._left_context >= 0 and self._right_context >= 0\n                and self._input_period > 0 and self._stats_period > 0\n                and self._left_context % self._stats_period == 0\n                and self._right_context % self._stats_period == 0\n                and self._stats_period % self._input_period == 0):\n            raise RuntimeError(\n                \"Invalid configuration of statistics-extraction: {0}\".format(\n                    self.config['config']), self)\n        super(XconfigStatsLayer, self).check_configs()\n\n    def _generate_config(self):\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n\n        configs = []\n        configs.append(\n            'component name={name}-extraction-{lc}-{rc} '\n            'type=StatisticsExtractionComponent input-dim={dim} '\n            'input-period={input_period} output-period={output_period} '\n            'include-variance={var} '.format(\n                name=self.name, lc=self._left_context, rc=self._right_context,\n                dim=input_dim, input_period=self._input_period,\n                output_period=self._stats_period,\n                var='true' if self._output_stddev else 'false'))\n        configs.append(\n            'component-node name={name}-extraction-{lc}-{rc} '\n            'component={name}-extraction-{lc}-{rc} input={input} '.format(\n                name=self.name, lc=self._left_context, rc=self._right_context,\n                input=input_desc))\n\n        stats_dim = 1 + input_dim * (2 if self._output_stddev else 1)\n        configs.append(\n            'component name={name}-pooling-{lc}-{rc} '\n            'type=StatisticsPoolingComponent input-dim={dim} '\n            'input-period={input_period} left-context={lc} right-context={rc} '\n            'num-log-count-features={count} output-stddevs={var} '.format(\n                name=self.name, 
lc=self._left_context, rc=self._right_context,\n                dim=stats_dim, input_period=self._stats_period,\n                count=1 if self._output_log_counts else 0,\n                var='true' if self._output_stddev else 'false'))\n        configs.append(\n            'component-node name={name}-pooling-{lc}-{rc} '\n            'component={name}-pooling-{lc}-{rc} '\n            'input={name}-extraction-{lc}-{rc} '.format(\n                name=self.name, lc=self._left_context, rc=self._right_context))\n        return configs\n\n    def output_name(self, auxiliary_output=None):\n        return 'Round({name}-pooling-{lc}-{rc}, {period})'.format(\n            name=self.name, lc=self._left_context,\n            rc=self._right_context, period=self._stats_period)\n\n    def output_dim(self, auxiliary_outputs=None):\n        return self.config['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                ans.append((config_name, line))\n\n        return ans\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/trivial_layers.py",
    "content": "# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2017    Google Inc. (vpeddinti@google.com)\n#           2017    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This module contains layers that just map to a single component.\n\"\"\"\n\nfrom __future__ import print_function\nimport math\nimport re\nimport sys\nfrom libs.nnet3.xconfig.basic_layers import XconfigLayerBase\n\n\nclass XconfigRenormComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'renorm-component name=renorm1 input=Append(-3,0,3)'\n    which will produce just a single component, of type NormalizeComponent.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      target-rms=1.0           [The target RMS of the NormalizeComponent]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'target-rms': 1.0 }\n\n    def check_configs(self):\n        assert self.config['target-rms'] > 0.0\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n        return input_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 
'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        target_rms = self.config['target-rms']\n\n        configs = []\n        line = ('component name={0} type=NormalizeComponent dim={1} target-rms={2}'.format(\n            self.name, input_dim, target_rms))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\n\nclass XconfigBatchnormComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'batchnorm-component name=batchnorm input=Append(-3,0,3)'\n    which will produce just a single component, of type BatchNormComponent.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      target-rms=1.0           [The target RMS of the BatchNormComponent]\n      include-in-init=false     [You should set this to true if this precedes a\n                                `fixed-affine-layer` that is to be initialized\n                                 via LDA]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'target-rms': 1.0,\n                       'include-in-init': False}\n\n    def check_configs(self):\n        assert self.config['target-rms'] > 0.0\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n      
  return input_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n            if self.config['include-in-init']:\n                ans.append(('init', line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        target_rms = self.config['target-rms']\n\n        configs = []\n        line = ('component name={0} type=BatchNormComponent dim={1} target-rms={2}'.format(\n            self.name, input_dim, target_rms))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\nclass XconfigNoOpComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'no-op-component name=noop1 input=Append(-3,0,3)'\n    which will produce just a single component, of type NoOpComponent.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]' }\n\n    def check_configs(self):\n        pass\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def 
output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n        return input_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n\n        configs = []\n        line = ('component name={0} type=NoOpComponent dim={1}'.format(\n            self.name, input_dim))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\nclass XconfigDeltaLayer(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'delta-layer name=delta input=idct'\n    which appends the central frame with the delta features\n    (i.e. -1,0,1 since scale equals 1) and delta-delta features \n    (i.e. 
1,0,-2,0,1), and then applies batchnorm to it.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]'}\n\n    def check_configs(self):\n        pass\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n        return (3*input_dim)\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. 
it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.output_dim()\n\n        configs = []\n        line = ('dim-range-node name={0}_copy1 input-node={0} dim={1} dim-offset=0'.format(\n            input_desc, input_dim))\n        configs.append(line)\n        line = ('dim-range-node name={0}_copy2 input-node={0} dim={1} dim-offset=0'.format(\n            input_desc, input_dim))\n        configs.append(line)\n\n        line = ('component name={0}_2 type=NoOpComponent dim={1}'.format(\n            input_desc, output_dim))\n        configs.append(line)\n        line = ('component-node name={0}_2 component={0}_2 input=Append(Offset({0},0),'\n            ' Sum(Offset(Scale(-1.0,{0}_copy1),-1), Offset({0},1)), Sum(Offset({0},-2), Offset({0},2),' \n            ' Offset(Scale(-2.0,{0}_copy2),0)))'.format(input_desc))\n        configs.append(line)\n        \n        line = ('component name={0} type=BatchNormComponent dim={1}'.format(\n            self.name, output_dim))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}_2'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\nclass XconfigLinearComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'linear-component name=linear1 dim=1024 input=Append(-3,0,3)'\n    which will produce just a single component, of type LinearComponent, with\n    output-dim 1024 in this case, and input-dim determined by the dimension\n    of the input .\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=-1                   [Dimension of the output]\n\n    The following (shown with their effective defaults) are just passed through\n    to the component's config line.\n\n      orthonormal-constraint=0.0\n      
max-change=0.75\n      l2-regularize=0.0\n\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'orthonormal-constraint': '',\n                       'max-change': 0.75,\n                       'l2-regularize': '',\n                       'param-stddev': '',\n                       'learning-rate-factor': '' }\n\n    def check_configs(self):\n        if self.config['dim'] <= 0:\n            raise RuntimeError(\"'dim' must be specified and > 0.\")\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        assert self.config['dim'] > 0\n        return self.config['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. 
it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.config['dim']\n\n        opts = ''\n        for opt_name in ['orthonormal-constraint', 'max-change', 'l2-regularize',\n                         'param-stddev', 'learning-rate-factor' ]:\n            value = self.config[opt_name]\n            if value != '':\n                opts += ' {0}={1}'.format(opt_name, value)\n\n        configs = []\n        line = ('component name={0} type=LinearComponent input-dim={1} output-dim={2} '\n                '{3}'.format(self.name, input_dim, output_dim, opts))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\nclass XconfigCombineFeatureMapsLayer(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n      'combine-feature-maps-layer name=combine_features1 height=40 num-filters1=1 num-filters2=4'\n      or\n      'combine-feature-maps-layer name=combine_features1 height=40 num-filters1=1 num-filters2=4 num-filters3=2'\n\n      It produces a PermuteComponent.  
It expects its input to be two or three things\n      appended together, where the first is of dimension height * num-filters1 and\n      the second is of dimension height * num-filters2 (and the third, if present, is\n      of dimension height * num-filters3); it interleaves the filters\n      so the output can be interpreted as a single feature map with the same height\n      as the input and the sum of the num-filters.\n\n      This is to be used in convolutional setups as part of how we combine the\n      filterbank inputs with ivectors.\n    \"\"\"\n\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = { 'input': '[-1]',\n                        'num-filters1': -1,\n                        'num-filters2': -1,\n                        'num-filters3': 0,\n                        'height': -1 }\n\n    def check_configs(self):\n        input_dim = self.descriptors['input']['dim']\n        if (self.config['num-filters1'] <= 0 or\n            self.config['num-filters2'] <= 0 or\n            self.config['num-filters3'] < 0 or\n            self.config['height'] <= 0):\n            raise RuntimeError(\"invalid values of num-filters1, num-filters2, num-filters3 and/or height\")\n        f1 = self.config['num-filters1']\n        f2 = self.config['num-filters2']\n        f3 = self.config['num-filters3']\n        h = self.config['height']\n        if input_dim != (f1 + f2 + f3) * h:\n            raise RuntimeError(\"Expected input-dim={0} based on num-filters1={1}, num-filters2={2}, \"\n                               \"num-filters3={3} and height={4}, but got input-dim={5}\".format(\n                                   (f1 + f2 + f3) * h, f1, f2, f3, h, input_dim))\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, 
auxiliary_output=None):\n        assert auxiliary_output is None\n        input_dim = self.descriptors['input']['dim']\n        return input_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        dim = self.descriptors['input']['dim']\n        num_filters1 = self.config['num-filters1']\n        num_filters2 = self.config['num-filters2']\n        num_filters3 = self.config['num-filters3']  # normally 0.\n        height = self.config['height']\n        assert dim == (num_filters1 + num_filters2 + num_filters3) * height\n\n        column_map = []\n        for h in range(height):\n            for f in range(num_filters1):\n                column_map.append(h * num_filters1 + f)\n            for f in range(num_filters2):\n                column_map.append(height * num_filters1 + h * num_filters2 + f)\n            for f in range(num_filters3):\n                column_map.append(height * (num_filters1 + num_filters2) + h * num_filters3 + f)\n\n        configs = []\n        line = ('component name={0} type=PermuteComponent column-map={1} '.format(\n            self.name, ','.join([str(x) for x in column_map])))\n        configs.append(line)\n\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\n\n\nclass XconfigAffineComponent(XconfigLayerBase):\n    \"\"\"This class is 
for parsing lines like\n     'affine-component name=linear1 dim=1024 input=Append(-3,0,3)'\n    which will produce just a single component, of type NaturalGradientAffineComponent,\n    with output-dim 1024 in this case, and input-dim determined by the dimension\n    of the input .\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=-1                   [Dimension of the output]\n\n    The following (shown with their effective defaults) are just passed through\n    to the component's config line.\n\n      orthonormal-constraint=0.0\n      max-change=0.75\n      l2-regularize=0.0\n\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'orthonormal-constraint': '',\n                       'max-change': 0.75,\n                       'param-stddev': '',\n                       'bias-stddev': '',\n                       'l2-regularize': '' }\n\n    def check_configs(self):\n        if self.config['dim'] <= 0:\n            raise RuntimeError(\"'dim' must be specified and > 0.\")\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        assert self.config['dim'] > 0\n        return self.config['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return 
ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        input_dim = self.descriptors['input']['dim']\n        output_dim = self.config['dim']\n\n        opts = ''\n        for opt_name in ['orthonormal-constraint', 'max-change', 'l2-regularize',\n                         'param-stddev', 'bias-stddev']:\n            value = self.config[opt_name]\n            if value != '':\n                opts += ' {0}={1}'.format(opt_name, value)\n\n        configs = []\n        line = ('component name={0} type=NaturalGradientAffineComponent input-dim={1} output-dim={2} '\n                '{3}'.format(self.name, input_dim, output_dim, opts))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\nclass XconfigPerElementScaleComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'scale-component name=scale1 input=Append(-3,0,3)'\n    which will produce just a single component, of type NaturalGradientPerElementScaleComponent,\n    whose output dimension equals the dimension of the input.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n\n    The following (shown with their effective defaults) are just passed through\n    to the component's config line.  
(These defaults are mostly set in the\n    code).\n\n      max-change=0.75\n      l2-regularize=0.0\n      param-mean=1.0   # affects initialization\n      param-stddev=0.0  # affects initialization\n      learning-rate-factor=1.0\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'l2-regularize': '',\n                       'max-change': 0.75,\n                       'param-mean': '',\n                       'param-stddev': '',\n                       'learning-rate-factor': '' }\n\n    def check_configs(self):\n        pass\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.descriptors['input']['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. 
it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        dim = self.descriptors['input']['dim']\n\n        opts = ''\n        for opt_name in ['learning-rate-factor', 'max-change', 'l2-regularize', 'param-mean',\n                         'param-stddev' ]:\n            value = self.config[opt_name]\n            if value != '':\n                opts += ' {0}={1}'.format(opt_name, value)\n\n        configs = []\n        line = ('component name={0} type=NaturalGradientPerElementScaleComponent dim={1} {2} '\n                ''.format(self.name, dim, opts))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\nclass XconfigPerElementOffsetComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'offset-component name=offset1 input=Append(-3,0,3)'\n    which will produce just a single component, of type PerElementOffsetComponent,\n    whose dimension (input and output) is determined by the dimension of the input.\n\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n\n    The following (shown with their effective defaults) are just passed through\n    to the component's config line.  
(These defaults are mostly set in the\n    code).\n\n      max-change=0.75\n      l2-regularize=0.0\n      param-mean=0.0   # affects initialization\n      param-stddev=0.0  # affects initialization\n      learning-rate-factor=1.0\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'l2-regularize': '',\n                       'max-change': 0.75,\n                       'param-mean': '',\n                       'param-stddev': '',\n                       'learning-rate-factor': '' }\n\n    def check_configs(self):\n        pass\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.descriptors['input']['dim']\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. 
it contains the 'final' names of nodes.\n        input_desc = self.descriptors['input']['final-string']\n        dim = self.descriptors['input']['dim']\n\n        opts = ''\n        for opt_name in ['learning-rate-factor', 'max-change', 'l2-regularize', 'param-mean',\n                         'param-stddev' ]:\n            value = self.config[opt_name]\n            if value != '':\n                opts += ' {0}={1}'.format(opt_name, value)\n\n        configs = []\n        line = ('component name={0} type=PerElementOffsetComponent dim={1} {2} '\n                ''.format(self.name, dim, opts))\n        configs.append(line)\n        line = ('component-node name={0} component={0} input={1}'.format(\n            self.name, input_desc))\n        configs.append(line)\n        return configs\n\n\nclass XconfigDimRangeComponent(XconfigLayerBase):\n    \"\"\"This class is for parsing lines like\n     'dim-range-component name=feature1 input=Append(-3,0,3) dim=40 dim-offset=0'\n    which will produce just a single component, of part of the input.\n    Parameters of the class, and their defaults:\n      input='[-1]'             [Descriptor giving the input of the layer.]\n      dim=-1                   [Dimension of the output.]\n      dim-offset=0             [Dimension offset of the input.]\n    \"\"\"\n    def __init__(self, first_token, key_to_value, prev_names=None):\n        XconfigLayerBase.__init__(self, first_token, key_to_value, prev_names)\n\n    def set_default_configs(self):\n        self.config = {'input': '[-1]',\n                       'dim': -1,\n                       'dim-offset': 0 }\n\n    def check_configs(self):\n        input_dim = self.descriptors['input']['dim']\n        if self.config['dim'] <= 0:\n            raise RuntimeError(\"'dim' must be specified and > 0.\")\n        elif self.config['dim'] > input_dim:\n            raise RuntimeError(\"'dim' must be specified and lower than the input dim.\")\n        if self.config['dim-offset'] < 0 :\n     
       raise RuntimeError(\"'dim-offset' must be specified and >= 0.\")\n        elif self.config['dim-offset'] + self.config['dim'] > input_dim:\n            raise RuntimeError(\"'dim-offset' plus output dim must be lower than the input dim.\")\n\n    def output_name(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        return self.name\n\n    def output_dim(self, auxiliary_output=None):\n        assert auxiliary_output is None\n        output_dim = self.config['dim']\n        if output_dim <= 0:\n            self.config['dim'] = self.descriptors['input']['dim']\n        return output_dim\n\n    def get_full_config(self):\n        ans = []\n        config_lines = self._generate_config()\n\n        for line in config_lines:\n            for config_name in ['ref', 'final']:\n                # we do not support user specified matrices in this layer\n                # so 'ref' and 'final' configs are the same.\n                ans.append((config_name, line))\n        return ans\n\n    def _generate_config(self):\n        # by 'descriptor_final_string' we mean a string that can appear in\n        # config-files, i.e. it contains the 'final' names of nodes.\n        input_node = self.descriptors['input']['final-string']\n        output_dim = self.config['dim']\n        dim_offset = self.config['dim-offset']\n\n        configs = []\n        line = ('dim-range-node name={0} input-node={1} dim={2} dim-offset={3}'.format(\n            self.name, input_node, output_dim, dim_offset))\n        configs.append(line)\n        return configs\n"
  },
  {
    "path": "egs/steps/libs/nnet3/xconfig/utils.py",
    "content": "# Copyright  2016  Johns Hopkins University (Author: Daniel Povey).\n# License: Apache 2.0.\n\n# This library contains various utilities that are involved in processing\n# of xconfig -> config conversion.  It contains \"generic\" lower-level code\n# while xconfig_layers.py contains the code specific to layer types.\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport re\nimport sys\n\n\n# [utility function used in xconfig_layers.py]\n# Given a list of objects of type XconfigLayerBase ('all_layers'),\n# including at least the layers preceding 'current_layer' (and maybe\n# more layers), return the names of layers preceding 'current_layer'\n# other than layers of type 'existing', which corresponds to component-node\n# names from an existing model that we are adding layers to them.\n# This will be used in parsing expressions like [-1] in descriptors\n# (which is an alias for the previous layer).\ndef get_prev_names(all_layers, current_layer):\n    prev_names = []\n    for layer in all_layers:\n        if layer is current_layer:\n            break\n\n        # The following if-statement is needed to handle the case where the\n        # the layer is an 'existing' layer, derived from an existing trained\n        # neural network supplied via the existing-model option, that we are\n        # adding layers to. 
In this case, these layers are not considered as\n        # layers preceding 'current_layer'.\n        if layer.layer_type != 'existing':\n            prev_names.append(layer.get_name())\n    prev_names_set = set()\n    for name in prev_names:\n        if name in prev_names_set:\n            raise RuntimeError(\"{0}: Layer name {1} is used more than once.\".format(\n                    sys.argv[0], name))\n        prev_names_set.add(name)\n    return prev_names\n\n\n# This is a convenience function to parse the auxiliary output name from the\n# full layer name.\ndef split_layer_name(full_layer_name):\n    assert isinstance(full_layer_name, str)\n    split_name = full_layer_name.split('.')\n    if len(split_name) == 0:\n        raise RuntimeError(\"Bad layer name: \" + full_layer_name)\n    layer_name = split_name[0]\n    if len(split_name) == 1:\n        auxiliary_output = None\n    else:\n        # we probably expect len(split_name) == 2 in this case,\n        # but no harm in allowing dots in the auxiliary_output.\n        auxiliary_output = '.'.join(split_name[1:])\n\n    return [layer_name, auxiliary_output]\n\n# [utility function used in xconfig_layers.py]\n# this converts a layer-name like 'ivector' or 'input', or a sub-layer name like\n# 'lstm2.memory_cell', into a dimension.  'all_layers' is a vector of objects\n# inheriting from XconfigLayerBase.  'current_layer' is provided so that the\n# function can make sure not to look in layers that appear *after* this layer\n# (because that's not allowed).\ndef get_dim_from_layer_name(all_layers, current_layer, full_layer_name):\n    layer_name, auxiliary_output = split_layer_name(full_layer_name)\n    for layer in all_layers:\n        if layer is current_layer:\n            break\n\n        # If 'all_layers' contains some 'existing' layers, i.e. 
layers which\n        # are not really layers but are actual component names from an existing\n        # neural net that we are adding components to, they may already be\n        # of the form 'xxx.yyy', e.g. 'tdnn1.affine'.  In this case the name of\n        # the layer in 'all_layers' won't be just the 'xxx' part (e.g. 'tdnn1'),\n        # it will be the full thing, like 'tdnn1.affine'.\n        # We will also use the if-statement immediately below this comment for\n        # regular layers, e.g. where full_layer_name is something like 'tdnn2'.\n        # The second if-statement below, which uses\n        # auxiliary_output, will only be used in the (rare) case when we are\n        # using auxiliary outputs, e.g. 'lstm1.c'.\n        if layer.get_name() == full_layer_name:\n            return layer.output_dim()\n\n        if layer.get_name() == layer_name:\n            if (auxiliary_output not in layer.auxiliary_outputs()\n                and auxiliary_output is not None):\n                raise RuntimeError(\"Layer '{0}' has no such auxiliary output: \"\n                                   \"'{1}' ({0}.{1})\".format(layer_name,\n                                                            auxiliary_output))\n            return layer.output_dim(auxiliary_output)\n    # No such layer was found.\n    if layer_name in [ layer.get_name() for layer in all_layers ]:\n        raise RuntimeError(\"Layer '{0}' was requested before it appeared in \"\n                        \"the xconfig file (circular dependencies or out-of-order \"\n                        \"layers)\".format(layer_name))\n    else:\n        raise RuntimeError(\"No such layer: '{0}'\".format(layer_name))\n\n\n# [utility function used in xconfig_layers.py]\n# this converts a layer-name like 'ivector' or 'input', or a sub-layer name like\n# 'lstm2.memory_cell', into a descriptor (usually, but not required to be a simple\n# component-node name) that can appear in the generated config file.  
'all_layers' is a vector of objects\n# inheriting from XconfigLayerBase.  'current_layer' is provided so that the\n# function can make sure not to look in layers that appear *after* this layer\n# (because that's not allowed).\ndef get_string_from_layer_name(all_layers, current_layer, full_layer_name):\n    layer_name, auxiliary_output = split_layer_name(full_layer_name)\n    for layer in all_layers:\n        if layer is current_layer:\n            break\n\n        # The following if-statement is needed to handle the case where the\n        # layer is an 'existing' layer, derived from an existing trained\n        # neural network supplied via the --existing-model option, that we are\n        # adding layers to.  In this case the name of the layer will actually\n        # be of the form xxx.yyy, e.g. 'tdnn1.affine'.\n        # The code path will also be taken for regular (non-'existing') layer\n        # names where the 'auxiliary_output' field is not used, which is actually\n        # the normal case (e.g. 
when 'full_layer_name' is 'lstm1',\n        # as opposed to, say, 'lstm1.c'\n        if layer.get_name() == full_layer_name:\n            return layer.output_name()\n\n        if layer.get_name() == layer_name:\n            if (not auxiliary_output in layer.auxiliary_outputs() and\n                auxiliary_output is not None):\n                raise RuntimeError(\"Layer '{0}' has no such auxiliary output: \"\n                                   \"'{1}' ({0}.{1})\".format(\n                    layer_name, auxiliary_output))\n            return layer.output_name(auxiliary_output)\n    # No such layer was found.\n    if layer_name in [ layer.get_name() for layer in all_layers ]:\n        raise RuntimeError(\"Layer '{0}' was requested before it appeared in \"\n                        \"the xconfig file (circular dependencies or out-of-order \"\n                        \"layers\".format(layer_name))\n    else:\n        raise RuntimeError(\"No such layer: '{0}'\".format(layer_name))\n\n\n# This function, used in converting string values in config lines to\n# configuration values in self.config in layers, attempts to\n# convert 'string_value' to an instance dest_type (which is of type Type)\n# 'key' is only needed for printing errors.\ndef convert_value_to_type(key, dest_type, string_value):\n    if dest_type == type(bool()):\n        if string_value == \"True\" or string_value == \"true\":\n            return True\n        elif string_value == \"False\" or string_value == \"false\":\n            return False\n        else:\n            raise RuntimeError(\"Invalid configuration value {0}={1} (expected bool)\".format(\n                key, string_value))\n    elif dest_type == type(int()):\n        try:\n            return int(string_value)\n        except:\n            raise RuntimeError(\"Invalid configuration value {0}={1} (expected int)\".format(\n                key, string_value))\n    elif dest_type == type(float()):\n        try:\n            return 
float(string_value)\n        except ValueError:\n            raise RuntimeError(\"Invalid configuration value {0}={1} (expected float)\".format(\n                key, string_value))\n    elif dest_type == type(str()):\n        return string_value\n\n\n\n# This class parses and stores a Descriptor: an expression\n# like Append(Offset(input, -3), input) and so on.\n# For the full range of possible expressions, see the comment at the\n# top of src/nnet3/nnet-descriptor.h.\n# Note: as an extension to the descriptor format used in the C++\n# code, we can have e.g. input@-3 meaning Offset(input, -3);\n# and if a bare integer such as -3 appears where a descriptor was expected,\n# it is interpreted as Offset(prev_layer, -3) where 'prev_layer'\n# is the previous layer in the config file.\n\n# Also, in any place a raw input/layer/output name can appear, we accept things\n# like [-1] meaning the previous input/layer/output's name, or [-2] meaning the\n# last-but-one input/layer/output, and so on.\nclass Descriptor(object):\n    def __init__(self,\n                 descriptor_string = None,\n                 prev_names = None):\n        # self.operator is a string that may be 'Offset', 'Append',\n        # 'Sum', 'Failover', 'IfDefined', 'Switch', 'Round',\n        # 'ReplaceIndex', 'Scale', 'Const'; it also may be None,\n        # representing the base-case (where it's just a layer name)\n\n        # self.items will be whatever items are\n        # inside the parentheses, e.g. if this is Sum(foo, bar),\n        # then items will be [d1, d2], where d1 is a Descriptor for\n        # 'foo' and d2 is a Descriptor for 'bar'.  However, there are\n        # cases where elements of self.items are strings or integers,\n        # for instance in an expression 'ReplaceIndex(ivector, x, 0)',\n        # self.items would be [d, 'x', 0], where d is a Descriptor\n        # for 'ivector'.  
In the case where self.operator is None (where\n        # this Descriptor represents just a bare layer name), self.items\n        # is a one-element list containing the name of the input layer.\n        self.operator = None\n        self.items = None\n\n        if descriptor_string is not None:\n            try:\n                tokens = tokenize_descriptor(descriptor_string, prev_names)\n                pos = 0\n                (d, pos) = parse_new_descriptor(tokens, pos, prev_names)\n                # note: 'pos' should point to the 'end of string' marker\n                # that terminates 'tokens'.\n                if pos != len(tokens) - 1:\n                    raise RuntimeError(\"Parsing Descriptor, saw junk at end: \" +\n                                    ' '.join(tokens[pos:-1]))\n                # copy members from d.\n                self.operator = d.operator\n                self.items = d.items\n            except RuntimeError as e:\n                # 'traceback' is not imported at module level, so import it here.\n                import traceback\n                traceback.print_tb(sys.exc_info()[2])\n                raise RuntimeError(\"Error parsing Descriptor '{0}', specific error was: {1}\".format(\n                    descriptor_string, repr(e)))\n\n    # This is like the str() function, but it uses the layer_to_string function\n    # (which is a function from strings to strings) to convert layer names (or\n    # in general sub-layer names of the form 'foo.bar') to the component-node\n    # (or, in general, descriptor) names that appear in the final config file.\n    # This mechanism gives those designing layer types the freedom to name their\n    # nodes as they want.\n    def config_string(self, layer_to_string):\n        if self.operator is None:\n            assert len(self.items) == 1 and isinstance(self.items[0], str)\n            return layer_to_string(self.items[0])\n        else:\n            assert isinstance(self.operator, str)\n            return self.operator + '(' + ', '.join(\n                    [ item.config_string(layer_to_string) if isinstance(item, 
Descriptor) else str(item)\n                      for item in self.items]) + ')'\n\n    def str(self):\n        if self.operator is None:\n            assert len(self.items) == 1 and isinstance(self.items[0], str)\n            return self.items[0]\n        else:\n            assert isinstance(self.operator, str)\n            return self.operator + '(' + ', '.join([str(item) for item in self.items]) + ')'\n\n    def __str__(self):\n        return self.str()\n\n    # This function returns the dimension (i.e. the feature dimension) of the\n    # descriptor.  It takes 'layer_to_dim' which is a function from\n    # layer-names (including sub-layer names, like lstm1.memory_cell) to\n    # dimensions, e.g. you might have layer_to_dim('ivector') = 100, or\n    # layer_to_dim('affine1') = 1024.\n    # note: layer_to_dim will raise an exception if a nonexistent layer or\n    # sub-layer is requested.\n    def dim(self, layer_to_dim):\n        if self.operator is None:\n            # base-case: self.items = [ layer_name ] (or sub-layer name, like\n            # 'lstm.memory_cell').\n            return layer_to_dim(self.items[0])\n        elif self.operator in [ 'Sum', 'Failover', 'IfDefined', 'Switch' ]:\n            # these are all operators for which all args are descriptors\n            # and must have the same dim.\n            dim = self.items[0].dim(layer_to_dim)\n            for desc in self.items[1:]:\n                next_dim = desc.dim(layer_to_dim)\n                if next_dim != dim:\n                    raise RuntimeError(\"In descriptor {0}, different fields have different \"\n                                       \"dimensions: {1} != {2}\".format(self.str(), dim, next_dim))\n            return dim\n        elif self.operator in [  'Offset', 'Round', 'ReplaceIndex' ]:\n            # for these operators, only the 1st arg is relevant.\n            return self.items[0].dim(layer_to_dim)\n        elif self.operator == 'Append':\n            return sum([ 
x.dim(layer_to_dim) for x in self.items])\n        elif self.operator == 'Scale':\n            # e.g. Scale(2.0, lstm1).  Return dim of 2nd arg.\n            return self.items[1].dim(layer_to_dim)\n        elif self.operator == 'Const':\n            # e.g. Const(0.5, 512).  Return 2nd arg, which is an int.\n            return self.items[1]\n        else:\n            raise RuntimeError(\"Unknown operator {0}\".format(self.operator))\n\n\n\n# This just checks that seen_item == expected_item, and raises an\n# exception if not.\ndef expect_token(expected_item, seen_item, what_parsing):\n    if seen_item != expected_item:\n        raise RuntimeError(\"parsing {0}, expected '{1}' but got '{2}'\".format(\n            what_parsing, expected_item, seen_item))\n\n# returns true if 'name' is valid as the name of a line (input, layer or output);\n# this is the same as IsValidname() in the nnet3 code.\ndef is_valid_line_name(name):\n    return isinstance(name, str) and re.match(r'^[a-zA-Z_][-a-zA-Z_0-9.]*', name) != None\n\n# This function for parsing Descriptors takes an array of tokens as produced\n# by tokenize_descriptor.  
It parses a descriptor\n# starting from position pos >= 0 of the array 'tokens', and\n# returns a new position in the array that reflects any tokens consumed while\n# parsing the descriptor.\n# It returns a pair (d, pos) where d is the newly parsed Descriptor,\n# and 'pos' is the new position after consuming the relevant input.\n# 'prev_names' is so that we can find the most recent layer name for\n# expressions like Append(-3, 0, 3) which is shorthand for the most recent\n# layer spliced at those time offsets.\ndef parse_new_descriptor(tokens, pos, prev_names):\n    size = len(tokens)\n    first_token = tokens[pos]\n    pos += 1\n    d = Descriptor()\n\n    # when reading this function, be careful to note the indent level,\n    # there is an if-statement within an if-statement.\n    if first_token in [ 'Offset', 'Round', 'ReplaceIndex', 'Append', 'Sum',\n                        'Switch', 'Failover', 'IfDefined' ]:\n        expect_token('(', tokens[pos], first_token + '()')\n        pos += 1\n        d.operator = first_token\n        # the 1st argument of all these operators is a Descriptor.\n        (desc, pos) = parse_new_descriptor(tokens, pos, prev_names)\n        d.items = [desc]\n\n        if first_token == 'Offset':\n            expect_token(',', tokens[pos], 'Offset()')\n            pos += 1\n            try:\n                t_offset = int(tokens[pos])\n                pos += 1\n                d.items.append(t_offset)\n            except:\n                raise RuntimeError(\"Parsing Offset(), expected integer, got \" + tokens[pos])\n            if tokens[pos] == ')':\n                return (d, pos + 1)\n            elif tokens[pos] != ',':\n                raise RuntimeError(\"Parsing Offset(), expected ')' or ',', got \" + tokens[pos])\n            pos += 1\n            try:\n                x_offset = int(tokens[pos])\n                pos += 1\n                d.items.append(x_offset)\n            except:\n                raise RuntimeError(\"Parsing 
Offset(), expected integer, got \" + tokens[pos])\n            expect_token(')', tokens[pos], 'Offset()')\n            pos += 1\n        elif first_token in [ 'Append', 'Sum', 'Switch', 'Failover', 'IfDefined' ]:\n            while True:\n                if tokens[pos] == ')':\n                    # check num-items is correct for some special cases.\n                    if first_token == 'Failover' and len(d.items) != 2:\n                        raise RuntimeError(\"Parsing Failover(), expected 2 items but got {0}\".format(len(d.items)))\n                    if first_token == 'IfDefined' and len(d.items) != 1:\n                        raise RuntimeError(\"Parsing IfDefined(), expected 1 item but got {0}\".format(len(d.items)))\n                    pos += 1\n                    break\n                elif tokens[pos] == ',':\n                    pos += 1  # consume the comma.\n                else:\n                    raise RuntimeError(\"Parsing {0}(), expected ')' or ',', got {1}\".format(\n                        first_token, tokens[pos]))\n\n                (desc, pos) = parse_new_descriptor(tokens, pos, prev_names)\n                d.items.append(desc)\n        elif first_token == 'Round':\n            expect_token(',', tokens[pos], 'Round()')\n            pos += 1\n            try:\n                t_modulus = int(tokens[pos])\n                assert t_modulus > 0\n                pos += 1\n                d.items.append(t_modulus)\n            except:\n                raise RuntimeError(\"Parsing Round(), expected positive integer, got \" + tokens[pos])\n            expect_token(')', tokens[pos], 'Round()')\n            pos += 1\n        elif first_token == 'ReplaceIndex':\n            expect_token(',', tokens[pos], 'ReplaceIndex()')\n            pos += 1\n            if tokens[pos] in [ 'x', 't' ]:\n                d.items.append(tokens[pos])\n                pos += 1\n            else:\n                raise RuntimeError(\"Parsing ReplaceIndex(), expected 'x' or 't', got \" +\n                                
tokens[pos])\n            expect_token(',', tokens[pos], 'ReplaceIndex()')\n            pos += 1\n            try:\n                new_value = int(tokens[pos])\n                pos += 1\n                d.items.append(new_value)\n            except:\n                raise RuntimeError(\"Parsing ReplaceIndex(), expected integer, got \" + tokens[pos])\n            expect_token(')', tokens[pos], 'ReplaceIndex()')\n            pos += 1\n        else:\n            raise RuntimeError(\"code error\")\n    elif first_token in ['Scale', 'Const' ]:\n        # Parsing something like 'Scale(2.0, lstm1)' or 'Const(1.0, 512)'\n        expect_token('(', tokens[pos], first_token + '()')\n        pos += 1\n        d.operator = first_token\n        # First arg of Scale() and Const() is a float: the scale or value,\n        # respectively.\n        try:\n            value = float(tokens[pos])\n            pos += 1\n            d.items = [value]\n        except:\n            raise RuntimeError(\"Parsing {0}(), expected float, got {1}\".format(\n                first_token, tokens[pos]))\n        # Consume the comma.\n        expect_token(',', tokens[pos], first_token + '()')\n        pos += 1\n        if first_token == 'Scale':\n            # Second arg of Scale() is a Descriptor.\n            (desc, pos) = parse_new_descriptor(tokens, pos, prev_names)\n            d.items.append(desc)\n        else:\n            assert first_token == 'Const'\n            try:\n                dim = int(tokens[pos])\n                pos += 1\n                d.items.append(dim)\n            except:\n                raise RuntimeError(\"Parsing Const() expression, expected int, got {0}\".format(\n                    tokens[pos]))\n        expect_token(')', tokens[pos], first_token + '()')\n        pos += 1\n    elif first_token in [ 'end of string', '(', ')', ',', '@' ]:\n        raise RuntimeError(\"Expected descriptor, got \" + first_token)\n    elif is_valid_line_name(first_token) or first_token == '[':\n        
# This section parses a raw input/layer/output name, e.g. \"affine2\"\n        # (which must start with an alphabetic character or underscore),\n        # optionally followed by an offset like '@-3'.\n\n        d.operator = None\n        d.items = [first_token]\n\n        # If the layer name is followed by '@', then\n        # we're parsing something like 'affine1@-3' which\n        # is syntactic sugar for 'Offset(affine1, -3)'.\n        if tokens[pos] == '@':\n            pos += 1\n            try:\n                offset_t = int(tokens[pos])\n                pos += 1\n            except:\n                raise RuntimeError(\"Parse error parsing {0}@{1}\".format(\n                    first_token, tokens[pos]))\n            if offset_t != 0:\n                inner_d = d\n                d = Descriptor()\n                # e.g. foo@3 is equivalent to 'Offset(foo, 3)'.\n                d.operator = 'Offset'\n                d.items = [ inner_d, offset_t ]\n    else:\n        # the last possible case is that 'first_token' is just an integer i,\n        # which can appear in things like Append(-3, 0, 3).\n        # See if the token is an integer.\n        # In this case, it's interpreted as the name of the previous layer\n        # (with that time offset applied).\n        try:\n            offset_t = int(first_token)\n        except:\n            raise RuntimeError(\"Parsing descriptor, expected descriptor but got \" +\n                            first_token)\n        assert isinstance(prev_names, list)\n        if len(prev_names) < 1:\n            raise RuntimeError(\"Parsing descriptor, could not interpret '{0}' because \"\n                            \"there is no previous layer\".format(first_token))\n        d.operator = None\n        # the layer name is the name of the most recent layer.\n        d.items = [prev_names[-1]]\n        if offset_t != 0:\n            inner_d = d\n            d = Descriptor()\n            d.operator = 'Offset'\n            d.items = [ 
inner_d, offset_t ]\n    return (d, pos)\n\n\n# This function takes a string 'descriptor_string' which might\n# look like 'Append([-1], [-2], input)', and a list of previous layer\n# names like prev_names = ['foo', 'bar', 'baz'], and replaces\n# the integers in brackets with the previous layers.  -1 means\n# the most recent previous layer ('baz' in this case), -2\n# means the last layer but one ('bar' in this case), and so on.\n# It will throw an exception if the number is out of range.\n# If there are no such expressions in the string, it's OK if\n# prev_names == None (this is useful for testing).\ndef replace_bracket_expressions_in_descriptor(descriptor_string,\n                                              prev_names = None):\n    fields = re.split(r'(\\[|\\])\\s*', descriptor_string)\n    out_fields = []\n    i = 0\n    while i < len(fields):\n        f = fields[i]\n        i += 1\n        if f == ']':\n            raise RuntimeError(\"Unmatched ']' in descriptor\")\n        elif f == '[':\n            if i + 2 >= len(fields):\n                raise RuntimeError(\"Error tokenizing string '{0}': '[' found too close \"\n                                \"to the end of the descriptor.\".format(descriptor_string))\n            assert isinstance(prev_names, list)\n            try:\n                offset = int(fields[i])\n                assert offset < 0 and -offset <= len(prev_names)\n                i += 2  # consume the int and the ']'.\n            except:\n                raise RuntimeError(\"Error tokenizing string '{0}': expression [{1}] has an \"\n                                \"invalid or out of range offset.\".format(descriptor_string, fields[i]))\n            this_field = prev_names[offset]\n            out_fields.append(this_field)\n        else:\n            out_fields.append(f)\n    return ''.join(out_fields)\n\n# tokenizes 'descriptor_string' into the tokens that may be part of Descriptors.\n# Note: for convenience in parsing, we add the token 
'end of string' to this\n# list.\n# The argument 'prev_names' (for the names of previous layers and input and\n# output nodes) is needed to process expressions like [-1] meaning the most\n# recent layer, or [-2] meaning the last layer but one.\n# The default None for prev_names is only supplied for testing purposes.\n# Called with 'Append(-1, 0, 1)' this would return\n# [ 'Append', '(', '-1', ',', '0', ',', '1', ')', 'end of string' ].\n# For a more complicated example: if you call\n#   tokenize_descriptor('Append(-1, 0, 1, [-2]@0)', prev_names = ['a', 'b', 'c', 'd'])\n# the [-2] would get replaced with prev_names[-2] = 'c', returning:\n#  [ 'Append', '(', '-1', ',', '0', ',', '1', ',', 'c', '@', '0', ')', 'end of string' ]\ndef tokenize_descriptor(descriptor_string,\n                       prev_names = None):\n    # split on '(', ')', ',', '@', and space.  Note: the parenthesis () in the\n    # regexp causes it to output the stuff inside the () as if it were a field,\n    # which is how the call to re.split() keeps characters like '(' and ')' as\n    # tokens.\n    fields = re.split(r'(\\(|\\)|@|,|\\s)\\s*',\n                      replace_bracket_expressions_in_descriptor(descriptor_string,\n                                                                prev_names))\n    ans = []\n    for f in fields:\n        # don't include fields that are space, or are empty.\n        if re.match(r'^\\s*$', f) is None:\n            ans.append(f)\n\n    ans.append('end of string')\n    return ans\n\n\n# This function parses a line in a config file, something like\n# affine-layer name=affine1 input=Append(-3, 0, 3)\n# and returns a pair,\n# (first_token, fields), as (string, dict) e.g. 
in this case\n# ('affine-layer', {'name':'affine1', 'input':'Append(-3, 0, 3)'})\n# Note: spaces are allowed in the field values but = signs are\n# disallowed, except when quoted with double quotes,\n# which is why it's possible to parse them.\n# This function also removes comments (anything after '#').\n# As a special case, this function will return None if the line\n# is empty after removing spaces.\ndef parse_config_line(orig_config_line):\n    # Remove comments.\n    # note: splitting on '#' will always give at least one field...  Python\n    # treats splitting on space as a special case that may give zero fields.\n    config_line = orig_config_line.split('#')[0]\n    # Note: this set of allowed characters may have to be expanded in future.\n    x = re.search(r'[^a-zA-Z0-9\\.\\-\\(\\)@_=,/+:\\s\"]', config_line)\n    if x is not None:\n        bad_char = x.group(0)\n        if bad_char == \"'\":\n            raise RuntimeError(\"Xconfig line has disallowed character ' (use \"\n                               \"double quotes for strings containing = signs)\")\n        else:\n            raise RuntimeError(\"Xconfig line has disallowed character: {0}\"\n                               .format(bad_char))\n\n    # Now split on space; later we may splice things back together.\n    fields = config_line.split()\n    if len(fields) == 0:\n        return None   # Line was only whitespace after removing comments.\n    first_token = fields[0]\n    # if first_token does not look like 'foo-bar' or 'foo-bar2', then die.\n    if re.match('^[a-z][-a-z0-9]+$', first_token) is None:\n        raise RuntimeError(\"Error parsing config line (first field doesn't look right).\")\n\n    # get rid of the first field which we put in 'first_token'.\n    fields = fields[1:]\n\n    rest_of_line = ' '.join(fields)\n    # rest of the line can be of the form 'a=1 b=\" x=1 y=2 \" c=Append( i1, i2)'\n    positions = [x.start() for x in re.finditer('\"', rest_of_line)]\n    if not len(positions) % 2 == 
0:\n        raise RuntimeError(\"Double-quotes should occur in pairs\")\n\n    # Replace all the equals signs inside the \"-enclosed strings\n    # with question marks ('?') [this is just an arbitrary character\n    # that won't otherwise be present, search above for 'banned'],\n    # and replace the quotation marks themselves with spaces.\n    # Then later on we'll convert all the question marks to\n    # equals signs in the values in the dicts.\n    num_strings = len(positions) // 2\n    fields = []\n    for i in range(num_strings):\n        start = positions[i * 2]\n        end = positions[i * 2 + 1]\n\n        line_before_start = rest_of_line[:start]\n        inside_quotes=rest_of_line[start+1:end].replace('=', '?')\n        line_after_end = rest_of_line[end + 1:]\n        # the reason why we include the spaces here, is to keep the length of\n        # rest_of_line the same, and the positions in 'positions' valid.\n        new_rest_of_line = line_before_start + ' ' + inside_quotes + ' ' + line_after_end\n        assert len(new_rest_of_line) == len(rest_of_line)\n        rest_of_line = new_rest_of_line\n\n    # suppose rest_of_line is: 'input=Append(foo, bar) foo=bar'\n    # then after the below we'll get\n    # fields = ['', 'input', 'Append(foo, bar)', 'foo', 'bar']\n    ans_dict = dict()\n    other_fields = re.split(r'\\s*([-a-zA-Z0-9_]*)=', rest_of_line)\n    if not (other_fields[0] == '' and len(other_fields) % 2 ==  1):\n        raise RuntimeError(\"Could not parse config line.\");\n    fields += other_fields[1:]\n    num_variables = len(fields) // 2\n    for i in range(num_variables):\n        var_name = fields[i * 2]\n        var_value = fields[i * 2 + 1]\n        if re.match(r'[a-zA-Z_]', var_name) is None:\n            raise RuntimeError(\"Expected variable name '{0}' to start with alphabetic character or _, \"\n                            \"in config line {1}\".format(var_name, orig_config_line))\n        if var_name in ans_dict:\n            raise 
RuntimeError(\"Config line has multiply defined variable {0}: {1}\".format(\n                var_name, orig_config_line))\n        # Replace any '?' characters that we inserted above, with the original\n        # '=' characters.\n        # The 'strip()' is to remove initial and final spaces that we might\n        # have inserted while processing double-quotes above (search above\n        # for the string 'inside_quotes' to see what is meant by this).\n        ans_dict[var_name] = var_value.replace('?', '=').strip()\n    return (first_token, ans_dict)\n\n\ndef test_library():\n    tokenize_test = lambda x: tokenize_descriptor(x)[:-1]  # remove 'end of string'\n    assert tokenize_test(\"hi\") == ['hi']\n    assert tokenize_test(\"hi there\") == ['hi', 'there']\n    assert tokenize_test(\"hi,there\") == ['hi', ',', 'there']\n    assert tokenize_test(\"hi@-1,there\") == ['hi', '@', '-1', ',', 'there']\n    assert tokenize_test(\"hi(there)\") == ['hi', '(', 'there', ')']\n    assert tokenize_descriptor(\"[-1]@2\", ['foo', 'bar'])[:-1] == ['bar', '@', '2' ]\n    assert tokenize_descriptor(\"[-2].special@2\", ['foo', 'bar'])[:-1] == ['foo.special', '@', '2' ]\n\n    assert Descriptor('foo').str() == 'foo'\n    assert Descriptor('Sum(foo,bar)').str() == 'Sum(foo, bar)'\n    assert Descriptor('Sum(Offset(foo,1),Offset(foo,0))').str() == 'Sum(Offset(foo, 1), Offset(foo, 0))'\n    for x in [ 'Append(foo, Sum(bar, Offset(baz, 1)))', 'Failover(foo, Offset(bar, -1))',\n               'IfDefined(Round(baz, 3))', 'Switch(foo1, Offset(foo2, 2), Offset(foo3, 3))',\n               'IfDefined(ReplaceIndex(ivector, t, 0))', 'ReplaceIndex(foo, x, 0)' ]:\n        if not Descriptor(x).str() == x:\n            print(\"Error: '{0}' != '{1}'\".format(Descriptor(x).str(), x))\n\n    prev_names = ['last_but_one_layer', 'prev_layer']\n    for x, y in [ ('Sum(foo,bar)', 'Sum(foo, bar)'),\n                  ('Sum(foo1,bar-3_4)', 'Sum(foo1, bar-3_4)'),\n                  ('Append(input@-3, 
input@0, input@3)',\n                   'Append(Offset(input, -3), input, Offset(input, 3))'),\n                  ('Append(-3,0,3)',\n                   'Append(Offset(prev_layer, -3), prev_layer, Offset(prev_layer, 3))'),\n                  ('[-1]', 'prev_layer'),\n                  ('Scale(2.0,foo)', 'Scale(2.0, foo)'),\n                  ('Const(0.5,500)', 'Const(0.5, 500)'),\n                  ('[-2]', 'last_but_one_layer'),\n                  ('[-2]@3',\n                   'Offset(last_but_one_layer, 3)') ]:\n        if not Descriptor(x, prev_names).str() == y:\n            # Pass prev_names here too, so that reporting a mismatch does not\n            # itself fail on expressions like '[-1]'.\n            print(\"Error: '{0}' != '{1}'\".format(Descriptor(x, prev_names).str(), y))\n\n\n    print(parse_config_line('affine-layer input=Append(foo, bar) foo=bar'))\n    print(parse_config_line('affine-layer x=\"y z\" input=Append(foo, bar) foo=bar opt2=\"a=1 b=2\"'))\n    print(parse_config_line('affine-layer1 input=Append(foo, bar) foo=bar'))\n    print(parse_config_line('affine-layer'))\n\nif __name__ == \"__main__\":\n    test_library()\n"
  },
  {
    "path": "egs/steps/lmrescore.sh",
    "content": "#!/usr/bin/env bash\n\nset -e -o pipefail\n\n# Begin configuration section.\nmode=4  # mode can be 1 through 5.  They should all give roughly similar results.\n        # See the comments in the case statement for more details.\ncmd=run.pl\nskip_scoring=false\nself_loop_scale=0.1  # only matters for mode 4.\nacoustic_scale=0.1   # only matters for mode 5.\nscoring_opts=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# != 5 ]; then\n   echo \"Do language model rescoring of lattices (remove old LM, add new LM)\"\n   echo \"Usage: steps/lmrescore.sh [options] <old-lang-dir> <new-lang-dir> <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \"Options:\"\n   echo \" --cmd   <cmd-string>       # How to run commands (e.g. run.pl, queue.pl)\"\n   echo \" --mode  (1|2|3|4|5)        # Mode of LM rescoring to use (default: 4).\"\n   echo \"                            # These should give very similar results.\"\n   echo \" --self-loop-scale  <scale> # Self-loop-scale, only relevant in mode 4.\"\n   echo \"                            # Default: 0.1.\"\n   echo \" --acoustic-scale  <scale>  # Acoustic scale, only relevant in mode 5.\"\n   echo \"                            # Default: 0.1.\"\n   exit 1;\nfi\n\n[ -f path.sh ] && . ./path.sh;\n\noldlang=$1\nnewlang=$2\ndata=$3\nindir=$4\noutdir=$5\n\noldlm=$oldlang/G.fst\nnewlm=$newlang/G.fst\n! cmp $oldlang/words.txt $newlang/words.txt && echo \"Warning: vocabularies may be incompatible.\"\n[ ! -f $oldlm ] && echo Missing file $oldlm && exit 1;\n[ ! -f $newlm ] && echo Missing file $newlm && exit 1;\n! ls $indir/lat.*.gz >/dev/null && echo \"No lattices in input directory $indir\" && exit 1;\n\nif ! 
cmp -s $oldlang/words.txt $newlang/words.txt; then\n  echo \"$0: $oldlang/words.txt and $newlang/words.txt differ: make sure you know what you are doing.\";\nfi\n\noldlmcommand=\"fstproject --project_output=true $oldlm |\"\nnewlmcommand=\"fstproject --project_output=true $newlm |\"\n\nmkdir -p $outdir/log\n\nphi=`grep -w '#0' $newlang/words.txt | awk '{print $2}'`\n\nif [ \"$mode\" == 4 ]; then\n  # we have to prepare $outdir/Ldet.fst in this case: determinized\n  # lexicon (determinized on phones), with disambig syms removed.\n  # take L_disambig.fst; get rid of transition with \"#0 #0\" on it; determinize\n  # with epsilon removal; remove disambiguation symbols.\n  fstprint $newlang/L_disambig.fst | awk '{if($4 != '$phi'){print;}}' | fstcompile | \\\n    fstdeterminizestar | fstrmsymbols $newlang/phones/disambig.int >$outdir/Ldet.fst || exit 1;\nfi\n\nnj=`cat $indir/num_jobs` || exit 1;\ncp $indir/num_jobs $outdir\n\n\n#for lat in $indir/lat.*.gz; do\n#  number=`basename $lat | cut -d. 
-f2`;\n#  newlat=$outdir/`basename $lat`\n\ncase \"$mode\" in\n  1) # 1 is inexact, it's the original way of doing it.\n    $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n      lattice-lmrescore --lm-scale=-1.0 \"ark:gunzip -c $indir/lat.JOB.gz|\" \"$oldlmcommand\" ark:-  \\| \\\n      lattice-lmrescore --lm-scale=1.0 ark:- \"$newlmcommand\" \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" \\\n      || exit 1;\n    ;;\n  2)  # 2 is equivalent to 1, but using more basic operations, combined.\n    $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n      gunzip -c $indir/lat.JOB.gz \\| \\\n      lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- ark:- \\| \\\n      lattice-compose ark:- \"fstproject --project_output=true $oldlm |\" ark:- \\| \\\n      lattice-determinize ark:- ark:- \\| \\\n      lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- ark:- \\| \\\n      lattice-compose ark:- \"fstproject --project_output=true $newlm |\" ark:- \\| \\\n      lattice-determinize ark:- ark:- \\| \\\n      gzip -c \\>$outdir/lat.JOB.gz || exit 1;\n    ;;\n  3) # 3 is \"exact\" in that we remove the old LM scores accepting any path\n     # through G.fst (which is what we want as that happened in lattice\n     # generation), but we add the new one with \"phi matcher\", only taking\n     # backoff arcs if an explicit arc did not exist.\n    $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n      gunzip -c $indir/lat.JOB.gz \\| \\\n      lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- ark:- \\| \\\n      lattice-compose ark:- \"fstproject --project_output=true $oldlm |\" ark:- \\| \\\n      lattice-determinize ark:- ark:- \\| \\\n      lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- ark:- \\| \\\n      lattice-compose --phi-label=$phi ark:- $newlm ark:- \\| \\\n      lattice-determinize ark:- ark:- \\| \\\n      gzip -c \\>$outdir/lat.JOB.gz || exit 1;\n    ;;\n  4) # 4 is also exact (like 3), but instead of subtracting the old LM-scores,\n     # it removes the old graph 
scores entirely and adds in the lexicon,\n     # grammar and transition weights.\n    mdl=`dirname $indir`/final.mdl\n    [ ! -f $mdl ] && echo No such model $mdl && exit 1;\n    [[ -f `dirname $indir`/frame_subsampling_factor && \"$self_loop_scale\" == 0.1 ]] && \\\n      echo \"$0: WARNING: chain models need '--self-loop-scale 1.0'\";\n    $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n      gunzip -c $indir/lat.JOB.gz \\| \\\n      lattice-scale --lm-scale=0.0 ark:- ark:- \\| \\\n      lattice-to-phone-lattice $mdl ark:- ark:- \\| \\\n      lattice-compose ark:- $outdir/Ldet.fst ark:- \\| \\\n      lattice-determinize ark:- ark:- \\| \\\n      lattice-compose --phi-label=$phi ark:- $newlm ark:- \\| \\\n      lattice-add-trans-probs --transition-scale=1.0 --self-loop-scale=$self_loop_scale \\\n      $mdl ark:- ark:- \\| \\\n      gzip -c \\>$outdir/lat.JOB.gz  || exit 1;\n    ;;\n  5) # Mode 5 uses the binary lattice-lmrescore-pruned to do the LM rescoring\n    # within a single program.  There are options for pruning, but these won't\n    # normally need to be modified; the pruned aspect is more necessary for\n    # RNNLM rescoring or when the lattices are extremely deep.\n\n    [[ -f `dirname $indir`/frame_subsampling_factor && \"$acoustic_scale\" == 0.1 ]] && \\\n      echo \"$0: WARNING: chain models need '--acoustic-scale 1.0'\";\n\n    $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n      lattice-lmrescore-pruned --acoustic-scale=$acoustic_scale \"$oldlm\" \"$newlm\" \\\n      \"ark:gunzip -c $indir/lat.JOB.gz|\" \"ark:|gzip -c >$outdir/lat.JOB.gz\" || exit 1;\n    ;;\nesac\n\nrm $outdir/Ldet.fst 2>/dev/null || true\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $newlang $outdir\nelse\n  echo \"Not scoring because requested so...\"\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/lmrescore_const_arpa.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n# Apache 2.0\n\n# This script rescores lattices with the ConstArpaLm format language model.\n\n# Begin configuration section.\ncmd=run.pl\nskip_scoring=false\nstage=1\nscoring_opts=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# != 5 ]; then\n   echo \"Does language model rescoring of lattices (remove old LM, add new LM)\"\n   echo \"Usage: $0 [options] <old-lang-dir> <new-lang-dir> \\\\\"\n   echo \"                   <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \"options: [--cmd (run.pl|queue.pl [queue opts])]\"\n   exit 1;\nfi\n\n[ -f path.sh ] && . ./path.sh;\n\noldlang=$1\nnewlang=$2\ndata=$3\nindir=$4\noutdir=$5\n\noldlm=$oldlang/G.fst\nnewlm=$newlang/G.carpa\n! cmp $oldlang/words.txt $newlang/words.txt &&\\\n  echo \"$0: Warning: vocabularies may be incompatible.\"\n[ ! -f $oldlm ] && echo \"$0: Missing file $oldlm\" && exit 1;\n[ ! -f $newlm ] && echo \"$0: Missing file $newlm\" && exit 1;\n! ls $indir/lat.*.gz >/dev/null &&\\\n  echo \"$0: No lattices input directory $indir\" && exit 1;\n\nif ! cmp -s $oldlang/words.txt $newlang/words.txt; then\n  echo \"$0: $oldlang/words.txt and $newlang/words.txt differ: make sure you know what you are doing.\";\nfi\n\noldlmcommand=\"fstproject --project_output=true $oldlm |\"\n\nmkdir -p $outdir/log\nnj=`cat $indir/num_jobs` || exit 1;\ncp $indir/num_jobs $outdir\n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n    lattice-lmrescore --lm-scale=-1.0 \\\n    \"ark:gunzip -c $indir/lat.JOB.gz|\" \"$oldlmcommand\" ark:-  \\| \\\n    lattice-lmrescore-const-arpa --lm-scale=1.0 \\\n    ark:- \"$newlm\" \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" || exit 1;\nfi\n\nif ! $skip_scoring && [ $stage -le 2 ]; then\n  err_msg=\"Not scoring because local/score.sh does not exist or not executable.\"\n  [ ! 
-x local/score.sh ] && echo $err_msg && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $newlang $outdir\nelse\n  echo \"Not scoring because requested so...\"\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/lmrescore_const_arpa_undeterminized.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n#           2017  Vimal Manohar\n# Apache 2.0\n\n# This script rescores non-compact, (possibly) undeterminized lattices with the \n# ConstArpaLm format language model.\n# This is similar to steps/lmrescore_const_arpa.sh, but expects \n# non-compact lattices as input.\n# This works by first determinizing the lattice and rescoring it with \n# const ARPA LM, followed by composing it with the original lattice to add the \n# new LM scores.\n\n# If you use the option \"--write compact false\" it outputs non-compact lattices;\n# the purpose is to add in LM scores while leaving the frame-by-frame acoustic\n# scores in the same position that they were in in the input, undeterminized\n# lattices. This is important in our 'chain' semi-supervised training recipes,\n# where it helps us to split lattices while keeping the scores at the edges of\n# the split points correct.\n\n# Begin configuration section.\ncmd=run.pl\nskip_scoring=false\nstage=1\nscoring_opts=\nwrite_compact=true   # If set to false, writes lattice in non-compact format.\n                     # This retains the acoustic scores on the arcs of the lattice.\n                     # Useful for another stage of LM rescoring.\nacwt=0.1  # used for pruning and determinization\nbeam=8.0  # beam used in determinization\n\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# != 5 ]; then\n  cat <<EOF\n   Does language model rescoring of non-compact undeterminized lattices \n   (remove old LM, add new LM). This script expects the input lattices \n   to be in non-compact format.\n   Usage: $0 [options] <old-lang-dir> <new-lang-dir> \\\\\n                      <data-dir> <input-decode-dir> <output-decode-dir>\n   options: [--cmd (run.pl|queue.pl [queue opts])]\n   See also: steps/lmrescore_const_arpa.sh \nEOF\n   exit 1;\nfi\n\n[ -f path.sh ] && . 
./path.sh;\n\noldlang=$1\nnewlang=$2\ndata=$3\nindir=$4\noutdir=$5\n\noldlm=$oldlang/G.fst\nnewlm=$newlang/G.carpa\n! cmp $oldlang/words.txt $newlang/words.txt &&\\\n  echo \"$0: Warning: vocabularies may be incompatible.\"\n[ ! -f $oldlm ] && echo \"$0: Missing file $oldlm\" && exit 1;\n[ ! -f $newlm ] && echo \"$0: Missing file $newlm\" && exit 1;\n! ls $indir/lat.*.gz >/dev/null &&\\\n  echo \"$0: No lattices input directory $indir\" && exit 1;\n\nif ! cmp -s $oldlang/words.txt $newlang/words.txt; then\n  echo \"$0: $oldlang/words.txt and $newlang/words.txt differ: make sure you know what you are doing.\";\nfi\n\noldlmcommand=\"fstproject --project_output=true $oldlm |\"\n\nmkdir -p $outdir/log\nnj=`cat $indir/num_jobs` || exit 1;\ncp $indir/num_jobs $outdir\n\nlats_rspecifier=\"ark:gunzip -c $indir/lat.JOB.gz |\"\n  \nlats_wspecifier=\"ark:| gzip -c > $outdir/lat.JOB.gz\" \n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$beam \\\n      \"ark:gunzip -c $indir/lat.JOB.gz |\" ark:- \\| \\\n    lattice-scale --lm-scale=0.0 --acoustic-scale=0.0 ark:- ark:- \\| \\\n    lattice-lmrescore --lm-scale=-1.0 ark:- \"$oldlmcommand\" ark:- \\| \\\n    lattice-lmrescore-const-arpa --lm-scale=1.0 \\\n      ark:- \"$newlm\" ark:- \\| \\\n    lattice-project ark:- ark:- \\| \\\n    lattice-compose --write-compact=$write_compact \\\n      \"$lats_rspecifier\" \\\n      ark,s,cs:- \"$lats_wspecifier\" || exit 1\nfi\n\nif ! $skip_scoring && [ $stage -le 2 ]; then\n  err_msg=\"Not scoring because local/score.sh does not exist or not executable.\"\n  [ ! -x local/score.sh ] && echo $err_msg && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $newlang $outdir\nelse\n  echo \"Not scoring because requested so...\"\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/lmrescore_rnnlm_lat.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2015  Guoguo Chen\n#           2017  Hainan Xu\n# Apache 2.0\n\n# This script rescores lattices with RNNLM.  See also rnnlmrescore.sh which is\n# an older script using n-best lists.\n\n# Begin configuration section.\ncmd=run.pl\nskip_scoring=false\nmax_ngram_order=4\nacwt=0.1\nweight=0.5  # Interpolation weight for RNNLM.\nrnnlm_ver=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# != 5 ]; then\n   echo \"Does language model rescoring of lattices (remove old LM, add new LM)\"\n   echo \"with RNNLM.\"\n   echo \"\"\n   echo \"Usage: $0 [options] <old-lang-dir> <rnnlm-dir> \\\\\"\n   echo \"                   <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \" e.g.: $0 ./rnnlm data/lang_tg data/test \\\\\"\n   echo \"                   exp/tri3/test_tg exp/tri3/test_rnnlm\"\n   echo \"options: [--cmd (run.pl|queue.pl [queue opts])]\"\n   exit 1;\nfi\n\n[ -f path.sh ] && . ./path.sh;\n\noldlang=$1\nrnnlm_dir=$2\ndata=$3\nindir=$4\noutdir=$5\n\nrescoring_binary=lattice-lmrescore-rnnlm\n\nfirst_arg=ark:$rnnlm_dir/unk.probs # this is for mikolov's rnnlm\nextra_arg=\n\nif [ \"$rnnlm_ver\" == \"cuedrnnlm\" ]; then\n  layer_string=`cat $rnnlm_dir/layer_string | sed \"s=:= =g\"`\n  total_size=`wc -l $rnnlm_dir/unigram.counts | awk '{print $1}'`\n  rescoring_binary=\"lattice-lmrescore-cuedrnnlm\"\n  cat $rnnlm_dir/rnnlm.input.wlist.index | tail -n +2 | awk '{print $1-1,$2}' > $rnnlm_dir/rnn.wlist\n  extra_arg=\"--full-voc-size=$total_size --layer-sizes=\\\"$layer_string\\\"\"\n  first_arg=$rnnlm_dir/rnn.wlist\nfi\n\noldlm=$oldlang/G.fst\nif [ -f $oldlang/G.carpa ]; then\n  oldlm=$oldlang/G.carpa\nfi\n\n[ ! -f $oldlm ] && echo \"$0: expecting either $oldlang/G.fst or $oldlang/G.carpa to exist\" && exit 1;\n[ ! -f $rnnlm_dir/rnnlm ] && echo \"$0: Missing file $rnnlm_dir/rnnlm\" && exit 1;\n[ ! 
-f $rnnlm_dir/unk.probs ] &&\\\n  echo \"$0: Missing file $rnnlm_dir/unk.probs\" && exit 1;\n[ ! -f $oldlang/words.txt ] &&\\\n  echo \"$0: Missing file $oldlang/words.txt\" && exit 1;\n! ls $indir/lat.*.gz >/dev/null &&\\\n  echo \"$0: No lattices input directory $indir\" && exit 1;\nawk -v n=$0 -v w=$weight 'BEGIN {if (w < 0 || w > 1) {\n  print n\": Interpolation weight should be in the range of [0, 1]\"; exit 1;}}' \\\n  || exit 1;\n\noldlm_command=\"fstproject --project_output=true $oldlm |\"\n\nmkdir -p $outdir/log\nnj=`cat $indir/num_jobs` || exit 1;\ncp $indir/num_jobs $outdir\n\noldlm_weight=`perl -e \"print -1.0 * $weight;\"`\nif [ \"$oldlm\" == \"$oldlang/G.fst\" ]; then\n  $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n    lattice-lmrescore --lm-scale=$oldlm_weight \\\n    \"ark:gunzip -c $indir/lat.JOB.gz|\" \"$oldlm_command\" ark:-  \\| \\\n    $rescoring_binary $extra_arg --lm-scale=$weight \\\n    --max-ngram-order=$max_ngram_order \\\n    $first_arg $oldlang/words.txt ark:- \"$rnnlm_dir/rnnlm\" \\\n    \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" || exit 1;\nelse\n  $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n    lattice-lmrescore-const-arpa --lm-scale=$oldlm_weight \\\n    \"ark:gunzip -c $indir/lat.JOB.gz|\" \"$oldlm\" ark:-  \\| \\\n    $rescoring_binary $extra_arg --lm-scale=$weight \\\n    --max-ngram-order=$max_ngram_order \\\n    $first_arg $oldlang/words.txt ark:- \"$rnnlm_dir/rnnlm\" \\\n    \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" || exit 1;\nfi\nif ! $skip_scoring ; then\n  err_msg=\"Not scoring because local/score.sh does not exist or not executable.\"\n  [ ! -x local/score.sh ] && echo $err_msg && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $oldlang $outdir\nelse\n  echo \"$0: Not scoring because --skip-scoring was specified.\"\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/make_denlats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# Create denominator lattices for MMI/MPE training.\n# Creates its output in $dir/lat.*.gz\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\ntransform_dir=\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\nnum_threads=1\nparallel_opts= # ignored now\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/make_denlats.sh [options] <data-dir> <lang-dir> <src-dir> <exp-dir>\"\n   echo \"  e.g.: steps/make_denlats.sh data/train data/lang exp/tri1 exp/tri1_denlats\"\n   echo \"Works for (delta|lda) features, and (with --transform-dir option) such features\"\n   echo \" plus transforms.\"\n   echo \"\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n   echo \"                           # large databases so your jobs will be smaller and\"\n   echo \"                           # will (individually) finish reasonably soon.\"\n   echo \"  --transform-dir <transform-dir>   # directory to find fMLLR transforms.\"\n   echo \"  --num-threads  <n>                # number of threads per decoding job\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nsdata=$data/split$nj\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $srcdir/delta_opts 2>/dev/null`\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\noov=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir\n\ncp -RH $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\nnew_lang=\"$dir/\"$(basename \"$lang\")\necho \"Making unigram grammar FST in $new_lang\"\ncat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n  awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n  utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > $new_lang/G.fst \\\n   || exit 1;\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\necho \"Compiling decoding graph in $dir/dengraph\"\nif [ -s $dir/dengraph/HCLG.fst ] && [ $dir/dengraph/HCLG.fst -nt $srcdir/final.mdl ]; then\n   echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  utils/mkgraph.sh $new_lang $srcdir $dir/dengraph || exit 1;\nfi\n\nif 
[ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"$0: using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\" && exit 1;\n  [ \"`cat $transform_dir/num_jobs`\" -ne \"$nj\" ] \\\n    && echo \"$0: mismatch in number of jobs with $transform_dir\" && exit 1;\n  [ -f $srcdir/final.mat ] && ! 
cmp $transform_dir/final.mat $srcdir/final.mat && \\\n     echo \"$0: LDA transforms differ between $srcdir and $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nelse\n  if [ -f $srcdir/final.alimdl ]; then\n    echo \"$0: you seem to have a SAT system but you did not supply the --transform-dir option.\";\n    exit 1;\n  fi\nfi\n\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\n\nif [ $sub_split -eq 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode_den.JOB.log \\\n   gmm-latgen-faster$thread_string --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n    --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n     $dir/dengraph/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nelse\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have stragglers\n  # from one job, we can be processing another one at the same time.\n  rm $dir/.error 2>/dev/null\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      sdata2=$data/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed \"s/trans.JOB/trans.$n/g\" | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n\n      $cmd --num-threads $num_threads JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        gmm-latgen-faster$thread_string --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n        --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n          $dir/dengraph/HCLG.fst \"$feats_subset\" \"ark:|gzip -c >$dir/lat.$n.JOB.gz\" || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! -z \"$prev_pid\" ]; then  # Wait for the previous job; merge the previous set of lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && echo \"$0: error generating denominator lattices\" && exit 1;\n      rm $dir/.merge_error 2>/dev/null\n      echo Merging archives for data subset $prev_n\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$prev_n.$k.gz || touch $dir/.merge_error;\n      done | gzip -c > $dir/lat.$prev_n.gz || touch $dir/.merge_error;\n      [ -f $dir/.merge_error ] && echo \"$0: Merging lattices for subset $prev_n failed (or maybe some other error)\" && exit 1;\n      rm $dir/lat.$prev_n.*.gz\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n\necho \"$0: done generating denominator lattices.\"\n"
  },
  {
    "path": "egs/steps/make_denlats_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#           2014  Guoguo Chen\n\n# Create denominator lattices for MMI/MPE training, with SGMM models.  If the\n# features have fMLLR transforms you have to supply the --transform-dir option.\n# It gets any speaker vectors from the \"alignment dir\" ($alidir).  Note: this is\n# possibly a slight mismatch because the speaker vectors come from supervised\n# adaptation.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\ntransform_dir=\nmax_mem=20000000 # This will stop the processes getting too large.\nnum_threads=1\nparallel_opts=  # ignored now.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/make_denlats_sgmm2.sh [options] <data-dir> <lang-dir> <src-dir|alidir> <exp-dir>\"\n   echo \"  e.g.: steps/make_denlats_sgmm2.sh data/train data/lang exp/sgmm4a_ali exp/sgmm4a_denlats\"\n   echo \"Works for (delta|lda) features, and (with --transform-dir option) such features\"\n   echo \" plus transforms.\"\n   echo \"\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n   echo \"                           # large databases so your jobs will be smaller and\"\n   echo \"                           # will (individually) finish reasonably soon.\"\n   echo \"  --transform-dir <transform-dir>   # directory to find fMLLR transforms.\"\n   echo \"  --num-threads  <n>                # number of threads per decoding job\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3 # could also be $srcdir, but only if no vectors supplied.\ndir=$4\n\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\nif [ $num_threads -gt 1 ]; then\n  # the -parallel becomes part of the binary name we decode with.\n  thread_string=\"-parallel --num-threads=$num_threads\"\nfi\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\noov=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\n\ncp -RH $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\nnew_lang=\"$dir/\"$(basename \"$lang\")\necho \"$0: Making unigram grammar FST in $new_lang\"\ncat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n  awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n  utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > $new_lang/G.fst \\\n   || exit 1;\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $alidir; the output HCLG.fst goes in $dir/dengraph.\n\necho \"$0: Compiling decoding graph in $dir/dengraph\"\nif [ -s $dir/dengraph/HCLG.fst ] && [ $dir/dengraph/HCLG.fst -nt $alidir/final.mdl ]; then\n   echo \"$0: Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  utils/mkgraph.sh 
$new_lang $alidir $dir/dengraph || exit 1;\nfi\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n   ;;\n  *) echo \"$0: Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"$0: using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\" && exit 1\n  [ ! -f $transform_dir/num_jobs ] && echo \"Expected $transform_dir/num_jobs to exist.\" && exit 1\n  [ \"`cat $transform_dir/num_jobs`\" -ne \"$nj\" ] \\\n    && echo \"$0: mismatch in number of jobs with $transform_dir\" && exit 1;\n  [ -f $alidir/final.mat ] && ! 
cmp $transform_dir/final.mat $alidir/final.mat && \\\n     echo \"$0: LDA transforms differ between $alidir and $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nelse\n  echo \"$0: Assuming you don't have a SAT system, since no --transform-dir option supplied \"\nfi\n\nif [ -f $alidir/gselect.1.gz ]; then\n  gselect_opt=\"--gselect=ark,s,cs:gunzip -c $alidir/gselect.JOB.gz|\"\nelse\n  echo \"$0: no such file $alidir/gselect.1.gz\" && exit 1;\nfi\n\nif [ -f $alidir/vecs.1 ]; then\n  spkvecs_opt=\"--spk-vecs=ark:$alidir/vecs.JOB --utt2spk=ark:$sdata/JOB/utt2spk\"\n  [ \"`cat $alidir/num_jobs`\" -ne \"$nj\" ] \\\n    && echo \"$0: mismatch in number of jobs with $alidir\" && exit 1;\nelse\n  if [ -f $alidir/final.alimdl ]; then\n    echo \"$0: You seem to have an SGMM system with speaker vectors,\"\n    echo \"yet we can't find speaker vectors.  Perhaps you supplied\"\n    echo \"the model directory instead of the alignment directory?\"\n    exit 1;\n  fi\nfi\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\nif [ $sub_split -eq 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode_den.JOB.log \\\n    sgmm2-latgen-faster$thread_string $spkvecs_opt \"$gselect_opt\" --beam=$beam \\\n    --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n    --max-mem=$max_mem --max-active=$max_active \\\n    --word-symbol-table=$lang/words.txt $alidir/final.mdl  \\\n    $dir/dengraph/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nelse\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have\n  # stragglers from one job, we can be processing another one at the same time.\n  rm $dir/.error 2>/dev/null\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $alidir/final.mdl ]; then\n      echo \"$0: Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      sdata2=$data/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed \"s/trans.JOB/trans.$n/g\" | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n      spkvecs_opt_subset=`echo $spkvecs_opt | sed \"s/JOB/$n/g\"`\n      gselect_opt_subset=`echo $gselect_opt | sed \"s/JOB/$n/g\"`\n      $cmd --num-threads $num_threads JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        sgmm2-latgen-faster$thread_string \\\n        $spkvecs_opt_subset \"$gselect_opt_subset\" \\\n        --beam=$beam --lattice-beam=$lattice_beam \\\n        --acoustic-scale=$acwt --max-mem=$max_mem --max-active=$max_active \\\n        --word-symbol-table=$lang/words.txt $alidir/final.mdl  \\\n        $dir/dengraph/HCLG.fst \"$feats_subset\" \\\n        \"ark:|gzip -c >$dir/lat.$n.JOB.gz\" || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! 
-z \"$prev_pid\" ]; then # Wait for the previous job to merge lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && \\\n        echo \"$0: error generating denominator lattices\" && exit 1;\n      rm $dir/.merge_error 2>/dev/null\n      echo \"$0: Merging archives for data subset $prev_n\"\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$prev_n.$k.gz || touch $dir/.merge_error;\n      done | gzip -c > $dir/lat.$prev_n.gz || touch $dir/.merge_error;\n      [ -f $dir/.merge_error ] && \\\n        echo \"$0: Merging lattices for subset $prev_n failed\" && exit 1;\n      rm $dir/lat.$prev_n.*.gz\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n\necho \"$0: done generating denominator lattices with SGMMs.\"\n"
  },
  {
    "path": "egs/steps/make_fbank.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016  Karel Vesely\n# Copyright 2012-2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nfbank_config=conf/fbank.conf\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<fbank-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <fbank-dir> defaults to <data-dir>/data\nOptions:\n  --fbank-config <config-file>         # config passed to compute-fbank-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  fbankdir=$3\nelse\n  fbankdir=$data/data\nfi\n\n\n# make $fbankdir an absolute pathname.\nfbankdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $fbankdir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $fbankdir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $fbank_config\"\n\nfor f in $required; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nutils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $fbankdir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $fbankdir/raw_fbank_$name.$n.ark\ndone\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  $cmd JOB=1:$nj $logdir/make_fbank_${name}.JOB.log \\\n    extract-segments scp,p:$scp $logdir/segments.JOB ark:- \\| \\\n    compute-fbank-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$fbank_config ark:- ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n     ark,scp:$fbankdir/raw_fbank_$name.JOB.ark,$fbankdir/raw_fbank_$name.JOB.scp \\\n     || exit 1;\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\"\"\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n  $cmd JOB=1:$nj $logdir/make_fbank_${name}.JOB.log \\\n    
compute-fbank-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n     --config=$fbank_config scp,p:$logdir/wav.JOB.scp ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n     ark,scp:$fbankdir/raw_fbank_$name.JOB.ark,$fbankdir/raw_fbank_$name.JOB.scp \\\n     || exit 1;\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing filterbank features for $name:\"\n  tail $logdir/make_fbank_${name}.1.log\n  exit 1;\nfi\n\n# concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $fbankdir/raw_fbank_$name.$n.scp || exit 1\ndone > $data/feats.scp || exit 1\n\nif $write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift and fbank_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $fbank_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf && cp $fbank_config $data/conf/fbank.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully processed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% of the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\necho \"$0: Succeeded creating filterbank features for $name\"\n"
  },
  {
    "path": "egs/steps/make_fbank_pitch.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  The Shenzhen Key Laboratory of Intelligent Media and Speech,\n#                 PKU-HKUST Shenzhen Hong Kong Institution (Author: Wei Shi)\n#           2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# Combine filterbank and pitch features together\n# Note: This file is based on make_fbank.sh and make_pitch_kaldi.sh\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nfbank_config=conf/fbank.conf\npitch_config=conf/pitch.conf\npitch_postprocess_config=\npaste_length_tolerance=2\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<fbank-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <fbank-dir> defaults to <data-dir>/data\nOptions:\n  --fbank-config <fbank-config-file>   # config passed to compute-fbank-feats.\n  --pitch-config <pitch-config-file>   # config passed to compute-kaldi-pitch-feats.\n  --pitch-postprocess-config <postprocess-config-file> # config passed to process-kaldi-pitch-feats.\n  --paste-length-tolerance <tolerance> # length tolerance passed to paste-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  fbank_pitch_dir=$3\nelse\n  fbank_pitch_dir=$data/data\nfi\n\n\n# make $fbank_pitch_dir an absolute pathname.\nfbank_pitch_dir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = 
\"$pwd/$dir\"; } print $dir; ' $fbank_pitch_dir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $fbank_pitch_dir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $fbank_config $pitch_config\"\n\nfor f in $required; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\n# utils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ ! -z \"$pitch_postprocess_config\" ]; then\n  postprocess_config_opt=\"--config=$pitch_postprocess_config\";\nelse\n  postprocess_config_opt=\nfi\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $fbank_pitch_dir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $fbank_pitch_dir/raw_fbank_pitch_$name.$n.ark\ndone\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  fbank_feats=\"ark:extract-segments scp,p:$scp $logdir/segments.JOB ark:- |\\\n    
compute-fbank-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$fbank_config ark:- ark:- |\"\n  pitch_feats=\"ark,s,cs:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-kaldi-pitch-feats --verbose=2 --config=$pitch_config ark:- ark:- | \\\n    process-kaldi-pitch-feats $postprocess_config_opt ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_fbank_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$fbank_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$fbank_pitch_dir/raw_fbank_pitch_$name.JOB.ark,$fbank_pitch_dir/raw_fbank_pitch_$name.JOB.scp \\\n     || exit 1;\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav_${name}.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n  fbank_feats=\"ark:compute-fbank-feats $vtln_opts $write_utt2dur_opt \\\n   --verbose=2 --config=$fbank_config scp,p:$logdir/wav_${name}.JOB.scp ark:- |\"\n  pitch_feats=\"ark,s,cs:compute-kaldi-pitch-feats --verbose=2 \\\n      --config=$pitch_config scp,p:$logdir/wav_${name}.JOB.scp ark:- | \\\n    process-kaldi-pitch-feats $postprocess_config_opt ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_fbank_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$fbank_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$fbank_pitch_dir/raw_fbank_pitch_$name.JOB.ark,$fbank_pitch_dir/raw_fbank_pitch_$name.JOB.scp \\\n      || exit 1;\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing filterbank and pitch features for $name:\"\n  tail $logdir/make_fbank_pitch_${name}.1.log\n  exit 1;\nfi\n\n# Concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat 
$fbank_pitch_dir/raw_fbank_pitch_$name.$n.scp || exit 1\ndone > $data/feats.scp || exit 1\n\nif $write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift, fbank_config and pitch_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $fbank_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf &&\n  cp $fbank_config $data/conf/fbank.conf &&\n  cp $pitch_config $data/conf/pitch.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully processed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% of the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\necho \"$0: Succeeded creating filterbank and pitch features for $name\"\n"
  },
  {
    "path": "egs/steps/make_index.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Guoguo Chen)\n# Apache 2.0\n\n# Begin configuration section.\nmodel= # You can specify the model to use\ncmd=run.pl\nacwt=0.083333\nlmwt=1.0\nmax_silence_frames=50\nmax_states=1000000\nmax_states_scale=4\nmax_expand=180 # limit memory blowup in lattice-align-words\nstrict=true\nword_ins_penalty=0\nsilence_word=  # Specify this only if you did so in kws_setup\nskip_optimization=false     # If you only search for few thousands of keywords, you probablly\n                            # can skip the optimization; but if you're going to search for\n                            # millions of keywords, you'd better do set this optimization to\n                            # false and do the optimization on the final index.\nframe_subsampling_factor=   # We will try to autodetect this. You should specify\n                            # the right value if your directory structure is\n                            # non-standard\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/make_index.sh [options] <kws-data-dir> <lang-dir> <decode-dir> <kws-dir>\"\n   echo \"... 
where <decode-dir> is where you have the lattices, and is assumed to be\"\n   echo \" a sub-directory of the directory where the model is.\"\n   echo \"e.g.: steps/make_index.sh data/kws data/lang exp/sgmm2_5a_mmi/decode/ exp/sgmm2_5a_mmi/decode/kws/\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --lmwt <float>                                   # lm scale used for lattice\"\n   echo \"  --model <model>                                  # which model to use\"\n   echo \"                                                   # speaker-adapted decoding\"\n   echo \"  --max-silence-frames <int>                       # maximum #frames for silence\"\n   exit 1;\nfi\n\n\nkwsdatadir=$1;\nlangdir=$2;\ndecodedir=$3;\nkwsdir=$4;\nsrcdir=`dirname $decodedir`; # The model directory is one level up from decoding directory.\n\nmkdir -p $kwsdir/log;\nnj=`cat $decodedir/num_jobs` || exit 1;\necho $nj > $kwsdir/num_jobs;\n\nutter_id=$kwsdatadir/utter_id\nif [ ! -f $utter_id ] ; then\n  utter_id=$kwsdatadir/utt.map\nfi\n\n\nif [ -z \"$model\" ]; then # if --model <mdl> was not specified on the command line...\n  model=$srcdir/final.mdl;\nfi\n\nfor f in $model $decodedir/lat.1.gz $utter_id; do\n  [ ! -f $f ] && echo \"$0: Error: no such file $f\" && exit 1;\ndone\n\necho \"$0: Using model: $model\"\n\nif [ ! 
-z $silence_word ]; then\n  silence_int=`grep -w $silence_word $langdir/words.txt | awk '{print $2}'`\n  [ -z $silence_int ] && \\\n    echo \"$0: Error: could not find integer representation of silence word $silence_word\" && exit 1;\n  silence_opt=\"--silence-label=$silence_int\"\nfi\n\nif [ -z \"$frame_subsampling_factor\" ]; then\n  if [ -f $decodedir/../frame_subsampling_factor ] ; then\n    frame_subsampling_factor=$(cat $decodedir/../frame_subsampling_factor)\n  else \n    frame_subsampling_factor=1\n  fi\n  echo \"$0: Frame subsampling factor autodetected: $frame_subsampling_factor\"\nfi\n\nword_boundary=$langdir/phones/word_boundary.int\nalign_lexicon=$langdir/phones/align_lexicon.int\nif [ -f $word_boundary ] ; then\n  $cmd JOB=1:$nj $kwsdir/log/index.JOB.log \\\n    lattice-add-penalty --word-ins-penalty=$word_ins_penalty \"ark:gzip -cdf $decodedir/lat.JOB.gz|\" ark:- \\| \\\n      lattice-align-words $silence_opt --max-expand=$max_expand $word_boundary $model  ark:- ark:- \\| \\\n      lattice-scale --acoustic-scale=$acwt --lm-scale=$lmwt ark:- ark:- \\| \\\n      lattice-to-kws-index --max-states-scale=$max_states_scale --allow-partial=true \\\n      --frame-subsampling-factor=$frame_subsampling_factor \\\n      --max-silence-frames=$max_silence_frames --strict=$strict ark:$utter_id ark:- ark:- \\| \\\n      kws-index-union --skip-optimization=$skip_optimization --strict=$strict --max-states=$max_states \\\n      ark:- \"ark:|gzip -c > $kwsdir/index.JOB.gz\" || exit 1\nelif [ -f $align_lexicon ]; then\n  $cmd JOB=1:$nj $kwsdir/log/index.JOB.log \\\n    lattice-add-penalty --word-ins-penalty=$word_ins_penalty \"ark:gzip -cdf $decodedir/lat.JOB.gz|\" ark:- \\| \\\n      lattice-align-words-lexicon $silence_opt --max-expand=$max_expand $align_lexicon $model  ark:- ark:- \\| \\\n      lattice-scale --acoustic-scale=$acwt --lm-scale=$lmwt ark:- ark:- \\| \\\n      lattice-to-kws-index --max-states-scale=$max_states_scale --allow-partial=true \\\n      
--frame-subsampling-factor=$frame_subsampling_factor \\\n      --max-silence-frames=$max_silence_frames --strict=$strict ark:$utter_id ark:- ark:- \\| \\\n      kws-index-union --skip-optimization=$skip_optimization --strict=$strict --max-states=$max_states \\\n      ark:- \"ark:|gzip -c > $kwsdir/index.JOB.gz\" || exit 1\nelse\n  echo \"$0: Error: cannot find either word-boundary file $word_boundary or alignment lexicon $align_lexicon\"\n  exit 1\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/make_mfcc.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nmfcc_config=conf/mfcc.conf\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<mfcc-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <mfcc-dir> defaults to <data-dir>/data.\nOptions:\n  --mfcc-config <config-file>          # config passed to compute-mfcc-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  mfccdir=$3\nelse\n  mfccdir=$data/data\nfi\n\n# make $mfccdir an absolute pathname.\nmfccdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $mfccdir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $mfccdir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $mfcc_config\"\n\nfor f in $required; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nutils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nelse\n  vtln_opts=\"\"\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $mfccdir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $mfccdir/raw_mfcc_$name.$n.ark\ndone\n\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  $cmd JOB=1:$nj $logdir/make_mfcc_${name}.JOB.log \\\n    extract-segments scp,p:$scp $logdir/segments.JOB ark:- \\| \\\n    compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$mfcc_config ark:- ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$mfccdir/raw_mfcc_$name.JOB.ark,$mfccdir/raw_mfcc_$name.JOB.scp \\\n     || exit 1;\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav_${name}.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n\n  # add ,p to the input rspecifier so that 
we can just skip over\n  # utterances that have bad wave data.\n\n  $cmd JOB=1:$nj $logdir/make_mfcc_${name}.JOB.log \\\n    compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$mfcc_config scp,p:$logdir/wav_${name}.JOB.scp ark:- \\| \\\n    copy-feats $write_num_frames_opt --compress=$compress ark:- \\\n      ark,scp:$mfccdir/raw_mfcc_$name.JOB.ark,$mfccdir/raw_mfcc_$name.JOB.scp \\\n      || exit 1;\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing MFCC features for $name:\"\n  tail $logdir/make_mfcc_${name}.1.log\n  exit 1;\nfi\n\n# concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $mfccdir/raw_mfcc_$name.$n.scp || exit 1\ndone > $data/feats.scp || exit 1\n\nif $write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift and mfcc_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $mfcc_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf && cp $mfcc_config $data/conf/mfcc.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully processed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% of the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\n\necho \"$0: Succeeded creating MFCC features for $name\"\n"
  },
  {
    "path": "egs/steps/make_mfcc_pitch.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  The Shenzhen Key Laboratory of Intelligent Media and Speech,\n#                 PKU-HKUST Shenzhen Hong Kong Institution (Author: Wei Shi)\n#           2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# Combine MFCC and pitch features together\n# Note: This file is based on make_mfcc.sh and make_pitch_kaldi.sh\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nmfcc_config=conf/mfcc.conf\npitch_config=conf/pitch.conf\npitch_postprocess_config=\npaste_length_tolerance=2\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<mfcc-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <mfcc-dir> defaults to <data-dir>/data\nOptions:\n  --mfcc-config <mfcc-config-file>     # config passed to compute-mfcc-feats.\n  --pitch-config <pitch-config-file>   # config passed to compute-kaldi-pitch-feats.\n  --pitch-postprocess-config <postprocess-config-file> # config passed to process-kaldi-pitch-feats.\n  --paste-length-tolerance <tolerance> # length tolerance passed to paste-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  mfcc_pitch_dir=$3\nelse\n  mfcc_pitch_dir=$data/data\nfi\n\n\n# make $mfcc_pitch_dir an absolute pathname.\nmfcc_pitch_dir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } 
print $dir; ' $mfcc_pitch_dir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $mfcc_pitch_dir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $mfcc_config $pitch_config\"\n\nfor f in $required; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nutils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ ! -z \"$pitch_postprocess_config\" ]; then\n  postprocess_config_opt=\"--config=$pitch_postprocess_config\";\nelse\n  postprocess_config_opt=\nfi\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $mfcc_pitch_dir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $mfcc_pitch_dir/raw_mfcc_pitch_$name.$n.ark\ndone\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  mfcc_feats=\"ark:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-mfcc-feats $vtln_opts 
$write_utt2dur_opt --verbose=2 \\\n      --config=$mfcc_config ark:- ark:- |\"\n  pitch_feats=\"ark,s,cs:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-kaldi-pitch-feats --verbose=2 --config=$pitch_config ark:- ark:- | \\\n    process-kaldi-pitch-feats $postprocess_config_opt ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_mfcc_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$mfcc_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$mfcc_pitch_dir/raw_mfcc_pitch_$name.JOB.ark,$mfcc_pitch_dir/raw_mfcc_pitch_$name.JOB.scp \\\n     || exit 1;\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav_${name}.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n  mfcc_feats=\"ark:compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n    --config=$mfcc_config scp,p:$logdir/wav_${name}.JOB.scp ark:- |\"\n  pitch_feats=\"ark,s,cs:compute-kaldi-pitch-feats --verbose=2 \\\n      --config=$pitch_config scp,p:$logdir/wav_${name}.JOB.scp ark:- | \\\n    process-kaldi-pitch-feats $postprocess_config_opt ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_mfcc_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$mfcc_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$mfcc_pitch_dir/raw_mfcc_pitch_$name.JOB.ark,$mfcc_pitch_dir/raw_mfcc_pitch_$name.JOB.scp \\\n      || exit 1;\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing MFCC and pitch features for $name:\"\n  tail $logdir/make_mfcc_pitch_${name}.1.log\n  exit 1;\nfi\n\n# Concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $mfcc_pitch_dir/raw_mfcc_pitch_$name.$n.scp || exit 1;\ndone 
> $data/feats.scp || exit 1\n\nif $write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift, mfcc_config and pitch_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $mfcc_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf &&\n  cp $mfcc_config $data/conf/mfcc.conf &&\n  cp $pitch_config $data/conf/pitch.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully procesed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\necho \"$0: Succeeded creating MFCC and pitch features for $name\"\n"
  },
  {
    "path": "egs/steps/make_mfcc_pitch_online.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  The Shenzhen Key Laboratory of Intelligent Media and Speech,\n#                 PKU-HKUST Shenzhen Hong Kong Institution (Author: Wei Shi)\n#           2014-2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# Combine MFCC and online-pitch features together\n# Note: This file is based on make_mfcc_pitch.sh\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nmfcc_config=conf/mfcc.conf\nonline_pitch_config=conf/online_pitch.conf\npaste_length_tolerance=2\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<mfcc-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <mfcc-dir> defaults to <data-dir>/data\nOptions:\n  --mfcc-config <mfcc-config-file>     # config passed to compute-mfcc-feats [conf/mfcc.conf]\n  --online-pitch-config <online-pitch-config-file> # config passed to compute-and-process-kaldi-pitch-feats [conf/online_pitch.conf]\n  --paste-length-tolerance <tolerance> # length tolerance passed to paste-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  mfcc_pitch_dir=$3\nelse\n  mfcc_pitch_dir=$data/data\nfi\n\n\n# make $mfcc_pitch_dir an absolute pathname.\nmfcc_pitch_dir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $mfcc_pitch_dir ${PWD}`\n\n# use \"name\" as 
part of name of the archive.\nname=`basename $data`\n\nmkdir -p $mfcc_pitch_dir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $mfcc_config $online_pitch_config\"\n\nfor f in $required; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\nutils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $mfcc_pitch_dir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $mfcc_pitch_dir/raw_mfcc_online_pitch_$name.$n.ark\ndone\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  mfcc_feats=\"ark:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$mfcc_config ark:- ark:- |\"\n  pitch_feats=\"ark,s,cs:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-and-process-kaldi-pitch-feats 
--verbose=2 \\\n      --config=$online_pitch_config ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_mfcc_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$mfcc_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$mfcc_pitch_dir/raw_mfcc_online_pitch_$name.JOB.ark,$mfcc_pitch_dir/raw_mfcc_online_pitch_$name.JOB.scp \\\n     || exit 1;\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav_${name}.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n  mfcc_feats=\"ark:compute-mfcc-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n    --config=$mfcc_config scp,p:$logdir/wav_${name}.JOB.scp ark:- |\"\n  pitch_feats=\"ark,s,cs:compute-and-process-kaldi-pitch-feats --verbose=2 \\\n    --config=$online_pitch_config scp,p:$logdir/wav_${name}.JOB.scp ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_mfcc_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$mfcc_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$mfcc_pitch_dir/raw_mfcc_online_pitch_$name.JOB.ark,$mfcc_pitch_dir/raw_mfcc_online_pitch_$name.JOB.scp \\\n      || exit 1;\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing MFCC and online-pitch features for $name:\"\n  tail $logdir/make_mfcc_pitch_${name}.1.log\n  exit 1;\nfi\n\n# Concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $mfcc_pitch_dir/raw_mfcc_online_pitch_$name.$n.scp || exit 1\ndone > $data/feats.scp || exit 1\n\nif $write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || 
exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift, mfcc_config and online_pitch_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $mfcc_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf &&\n  cp $mfcc_config $data/conf/mfcc.conf &&\n  cp $online_pitch_config $data/conf/online_pitch.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully processed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% of the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\necho \"$0: Succeeded creating MFCC and online-pitch features for $name\"\n"
  },
  {
    "path": "egs/steps/make_phone_graph.sh",
    "content": "#!/usr/bin/env bash\n\n# steps/make_phone_graph.sh data/train_100k_nodup/ data/lang exp/tri2_ali_100k_nodup/ exp/tri2\n\n# Copyright 2013  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script makes a phone-based LM, without smoothing to unigram, that\n# is to be used for segmentation, and uses that together with a model to\n# make a decoding graph.\n# Uses SRILM.\n# See also utils/lang/make_phone_bigram_lm.sh.\n\n# Begin configuration section.\nstage=0\ncmd=run.pl\nN=3  # change N and P for non-trigram systems.\nP=1\ntscale=1.0 # transition scale.\nloopscale=0.1 # scale for self-loops.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0  [options] <lang-dir> <alignment-dir> <model-dir>\"\n  echo \" e.g.: $0 data/lang exp/tri3b_ali exp/tri4b_seg\"\n  echo \"Makes the graph in $dir/phone_graph, corresponding to the model in $dir\"\n  echo \"The alignments from $ali_dir are used to train the phone LM.\"\n  exit 1;\nfi\n\nlang=$1\nalidir=$2\ndir=$3\n\n\nfor f in $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $dir/final.mdl; do\n  if [ ! -f $f ]; then\n    echo \"$0: expected $f to exist\"\n    exit 1;\n  fi\ndone\n\nloc=`which ngram-count`;\nif [ -z $loc ]; then\n  if uname -a | grep 64 >/dev/null; then # some kind of 64 bit...\n    sdir=$KALDI_ROOT/tools/srilm/bin/i686-m64\n  else\n    sdir=$KALDI_ROOT/tools/srilm/bin/i686\n  fi\n  if [ -f $sdir/ngram-count ]; then\n    echo Using SRILM tools from $sdir\n    export PATH=$PATH:$sdir\n  else\n    echo You appear to not have SRILM tools installed, either on your path,\n    echo or installed in $sdir.  
See tools/install_srilm.sh for installation\n    echo instructions.\n    exit 1\n  fi\nfi\n\nset -e # exit on error status\n\nmkdir -p $dir/phone_graph\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt\n\nif [ $stage -le 0 ]; then\n  echo \"$0: creating phone LM-training data\"\n  gunzip -c $alidir/ali.*gz | ali-to-phones $alidir/final.mdl ark:- ark,t:- | \\\n    awk '{for (x=2; x <= NF; x++) printf(\"%s \", $x); printf(\"\\n\"); }' | \\\n    utils/int2sym.pl $lang/phones.txt > $dir/phone_graph/train_phones.txt\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: building ARPA LM\"\n  ngram-count -text $dir/phone_graph/train_phones.txt -order 3  \\\n    -addsmooth1 1 -kndiscount2 -kndiscount3 -interpolate -lm $dir/phone_graph/arpa.gz\nfi\n\n# Set the unigram and unigram-backoff log-probs to -99.  we'll later remove the\n# arcs from the FST.  This is to avoid CLG blowup, and to increase speed.\n\nif [ $stage -le 2 ]; then\n  echo \"$0: removing unigrams from ARPA LM\"\n\n  gunzip -c $dir/phone_graph/arpa.gz | \\\n    awk '/\\\\1-grams/{state=1;} /\\\\2-grams:/{ state=2; }\n       {if(state == 1 && NF == 3) { printf(\"-99\\t%s\\t-99\\n\", $2); } else {print;}}' | \\\n         gzip -c >$dir/phone_graph/arpa_noug.gz\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: creating G_phones.fst from ARPA\"\n  gunzip -c $dir/phone_graph/arpa_noug.gz | \\\n    arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/phones.txt - - | \\\n    fstprint | awk '{if (NF < 5 || $5 < 100.0) { print; }}' | fstcompile | \\\n    fstconnect > $dir/phone_graph/G_phones.fst\n  fstisstochastic $dir/phone_graph/G_phones.fst || echo \"[info]: G_phones not stochastic.\"\nfi\n\n\nif [ $stage -le 4 ]; then\n  echo \"$0: creating CLG.\"\n\n  fstcomposecontext --context-size=$N --central-position=$P \\\n   --read-disambig-syms=$lang/phones/disambig.int \\\n   --write-disambig-syms=$dir/phone_graph/disambig_ilabels_${N}_${P}.int \\\n    $dir/phone_graph/ilabels_${N}_${P} < 
$dir/phone_graph/G_phones.fst | \\\n      fstdeterminize >$dir/phone_graph/CLG.fst\n  fstisstochastic $dir/phone_graph/CLG.fst  || echo \"[info]: CLG not stochastic.\"\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: creating Ha.fst\"\n  make-h-transducer --disambig-syms-out=$dir/phone_graph/disambig_tid.int \\\n    --transition-scale=$tscale $dir/phone_graph/ilabels_${N}_${P} $dir/tree $dir/final.mdl \\\n       > $dir/phone_graph/Ha.fst\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: creating HCLGa.fst\"\n  fsttablecompose $dir/phone_graph/Ha.fst $dir/phone_graph/CLG.fst | \\\n      fstdeterminizestar --use-log=true | \\\n      fstrmsymbols $dir/phone_graph/disambig_tid.int | fstrmepslocal | \\\n      fstminimizeencoded > $dir/phone_graph/HCLGa.fst || exit 1;\n  fstisstochastic $dir/phone_graph/HCLGa.fst || echo \"HCLGa is not stochastic\"\nfi\n\nif [ $stage -le 7 ]; then\n  add-self-loops --self-loop-scale=$loopscale --reorder=true \\\n    $dir/final.mdl < $dir/phone_graph/HCLGa.fst > $dir/phone_graph/HCLG.fst || exit 1;\n\n  if [ $tscale == 1.0 -a $loopscale == 1.0 ]; then\n    # No point doing this test if transition-scale not 1, as it is bound to fail.\n    fstisstochastic $dir/phone_graph/HCLG.fst || echo \"[info]: final HCLG is not stochastic.\"\n  fi\n\n  # $lang/phones.txt is the symbol table that corresponds to the output\n  # symbols on the graph; decoding scripts expect it as words.txt.\n  cp $lang/phones.txt $dir/phone_graph/words.txt\n  cp -r $lang/phones $dir/phone_graph/\nfi\n"
  },
  {
    "path": "egs/steps/make_plp.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nplp_config=conf/plp.conf\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<plp-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <plp-dir> defaults to <data-dir>/data\nOptions:\n  --plp-config <config-file>           # config passed to compute-plp-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  plpdir=$3\nelse\n  plpdir=$data/data\nfi\n\n# make $plpdir an absolute pathname.\nplpdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $plpdir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $plpdir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $plp_config\"\n\nfor f in $required; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\nutils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nelse\n  vtln_opts=\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $plpdir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $plpdir/raw_plp_$name.$n.ark\ndone\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  $cmd JOB=1:$nj $logdir/make_plp_${name}.JOB.log \\\n    extract-segments scp,p:$scp $logdir/segments.JOB ark:- \\| \\\n    compute-plp-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$plp_config ark:- ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$plpdir/raw_plp_$name.JOB.ark,$plpdir/raw_plp_$name.JOB.scp \\\n     || exit 1;\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav_${name}.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n  $cmd JOB=1:$nj $logdir/make_plp_${name}.JOB.log \\\n    
compute-plp-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n      --config=$plp_config scp,p:$logdir/wav_${name}.JOB.scp ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$plpdir/raw_plp_$name.JOB.ark,$plpdir/raw_plp_$name.JOB.scp \\\n      || exit 1;\n\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing PLP features for $name:\"\n  tail $logdir/make_plp_${name}.1.log\n  exit 1;\nfi\n\n# Concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $plpdir/raw_plp_$name.$n.scp || exit 1\ndone > $data/feats.scp\n\nif $write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift and plp_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $plp_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf && cp $plp_config $data/conf/plp.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully processed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% of the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\necho \"$0: Succeeded creating PLP features for $name\"\n"
  },
  {
    "path": "egs/steps/make_plp_pitch.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  The Shenzhen Key Laboratory of Intelligent Media and Speech,\n#                 PKU-HKUST Shenzhen Hong Kong Institution (Author: Wei Shi)\n#           2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# Combine PLP and pitch features together\n# Note: This file is based on make_plp.sh and make_pitch_kaldi.sh\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nplp_config=conf/plp.conf\npitch_config=conf/pitch.conf\npitch_postprocess_config=\npaste_length_tolerance=2\ncompress=true\nwrite_utt2num_frames=true  # If true writes utt2num_frames.\nwrite_utt2dur=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging.\n\nif [ -f ./path.sh ]; then . ./path.sh;  fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ] || [ $# -gt 3 ]; then\n  cat >&2 <<EOF\nUsage: $0 [options] <data-dir> [<log-dir> [<plp-dir>] ]\n e.g.: $0 data/train\nNote: <log-dir> defaults to <data-dir>/log, and\n      <plp-dir> defaults to <data-dir>/data\nOptions:\n  --plp-config <plp-config-file>       # config passed to compute-plp-feats.\n  --pitch-config <pitch-config-file>   # config passed to compute-kaldi-pitch-feats.\n  --pitch-postprocess-config <postprocess-config-file> # config passed to process-kaldi-pitch-feats.\n  --paste-length-tolerance <tolerance> # length tolerance passed to paste-feats.\n  --nj <nj>                            # number of parallel jobs.\n  --cmd <run.pl|queue.pl <queue opts>> # how to run jobs.\n  --write-utt2num-frames <true|false>  # If true, write utt2num_frames file.\n  --write-utt2dur <true|false>         # If true, write utt2dur file.\nEOF\n   exit 1;\nfi\n\ndata=$1\nif [ $# -ge 2 ]; then\n  logdir=$2\nelse\n  logdir=$data/log\nfi\nif [ $# -ge 3 ]; then\n  plp_pitch_dir=$3\nelse\n  plp_pitch_dir=$data/data\nfi\n\n# make $plp_pitch_dir an absolute pathname.\nplp_pitch_dir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' 
$plp_pitch_dir ${PWD}`\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nmkdir -p $plp_pitch_dir || exit 1;\nmkdir -p $logdir || exit 1;\n\nif [ -f $data/feats.scp ]; then\n  mkdir -p $data/.backup\n  echo \"$0: moving $data/feats.scp to $data/.backup\"\n  mv $data/feats.scp $data/.backup\nfi\n\nscp=$data/wav.scp\n\nrequired=\"$scp $plp_config $pitch_config\"\n\nfor f in $required; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\nutils/validate_data_dir.sh --no-text --no-feats $data || exit 1;\n\nif [ ! -z \"$pitch_postprocess_config\" ]; then\n  postprocess_config_opt=\"--config=$pitch_postprocess_config\";\nelse\n  postprocess_config_opt=\nfi\n\nif [ -f $data/spk2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/spk2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk\"\nelif [ -f $data/utt2warp ]; then\n  echo \"$0 [info]: using VTLN warp factors from $data/utt2warp\"\n  vtln_opts=\"--vtln-map=ark:$data/utt2warp\"\nfi\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $plp_pitch_dir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $plp_pitch_dir/raw_plp_pitch_$name.$n.ark\ndone\n\nif $write_utt2num_frames; then\n  write_num_frames_opt=\"--write-num-frames=ark,t:$logdir/utt2num_frames.JOB\"\nelse\n  write_num_frames_opt=\nfi\n\nif $write_utt2dur; then\n  write_utt2dur_opt=\"--write-utt2dur=ark,t:$logdir/utt2dur.JOB\"\nelse\n  write_utt2dur_opt=\nfi\n\nif [ -f $data/segments ]; then\n  echo \"$0 [info]: segments file exists: using that.\"\n  split_segments=\n  for n in $(seq $nj); do\n    split_segments=\"$split_segments $logdir/segments.$n\"\n  done\n\n  utils/split_scp.pl $data/segments $split_segments || exit 1;\n  rm $logdir/.error 2>/dev/null\n\n  plp_feats=\"ark:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-plp-feats $vtln_opts $write_utt2dur_opt --verbose=2 
\\\n      --config=$plp_config ark:- ark:- |\"\n  pitch_feats=\"ark,s,cs:extract-segments scp,p:$scp $logdir/segments.JOB ark:- | \\\n    compute-kaldi-pitch-feats --verbose=2 --config=$pitch_config ark:- ark:- | \\\n    process-kaldi-pitch-feats $postprocess_config_opt ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_plp_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$plp_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$plp_pitch_dir/raw_plp_pitch_$name.JOB.ark,$plp_pitch_dir/raw_plp_pitch_$name.JOB.scp \\\n     || exit 1;\n\nelse\n  echo \"$0: [info]: no segments file exists: assuming wav.scp indexed by utterance.\"\n  split_scps=\n  for n in $(seq $nj); do\n    split_scps=\"$split_scps $logdir/wav_${name}.$n.scp\"\n  done\n\n  utils/split_scp.pl $scp $split_scps || exit 1;\n\n  plp_feats=\"ark:compute-plp-feats $vtln_opts $write_utt2dur_opt --verbose=2 \\\n    --config=$plp_config scp,p:$logdir/wav_${name}.JOB.scp ark:- |\"\n  pitch_feats=\"ark,s,cs:compute-kaldi-pitch-feats --verbose=2 \\\n      --config=$pitch_config scp,p:$logdir/wav_${name}.JOB.scp ark:- | \\\n    process-kaldi-pitch-feats $postprocess_config_opt ark:- ark:- |\"\n\n  $cmd JOB=1:$nj $logdir/make_plp_pitch_${name}.JOB.log \\\n    paste-feats --length-tolerance=$paste_length_tolerance \\\n      \"$plp_feats\" \"$pitch_feats\" ark:- \\| \\\n    copy-feats --compress=$compress $write_num_frames_opt ark:- \\\n      ark,scp:$plp_pitch_dir/raw_plp_pitch_$name.JOB.ark,$plp_pitch_dir/raw_plp_pitch_$name.JOB.scp \\\n      || exit 1;\nfi\n\n\nif [ -f $logdir/.error.$name ]; then\n  echo \"$0: Error producing PLP and pitch features for $name:\"\n  tail $logdir/make_plp_pitch_${name}.1.log\n  exit 1;\nfi\n\n# Concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $plp_pitch_dir/raw_plp_pitch_$name.$n.scp || exit 1\ndone > $data/feats.scp || exit 1\n\nif 
$write_utt2num_frames; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2num_frames.$n || exit 1\n  done > $data/utt2num_frames || exit 1\nfi\n\nif $write_utt2dur; then\n  for n in $(seq $nj); do\n    cat $logdir/utt2dur.$n || exit 1\n  done > $data/utt2dur || exit 1\nfi\n\n# Store frame_shift, plp_config and pitch_config along with features.\nframe_shift=$(perl -ne 'if (/^--frame-shift=(\\d+)/) {\n                          printf \"%.3f\", 0.001 * $1; exit; }' $plp_config)\necho ${frame_shift:-'0.01'} > $data/frame_shift\nmkdir -p $data/conf &&\n  cp $plp_config $data/conf/plp.conf &&\n  cp $pitch_config $data/conf/pitch.conf || exit 1\n\nrm $logdir/wav_${name}.*.scp  $logdir/segments.* \\\n   $logdir/utt2num_frames.* $logdir/utt2dur.* 2>/dev/null\n\nnf=$(wc -l < $data/feats.scp)\nnu=$(wc -l < $data/utt2spk)\nif [ $nf -ne $nu ]; then\n  echo \"$0: It seems not all of the feature files were successfully processed\" \\\n       \"($nf != $nu); consider using utils/fix_data_dir.sh $data\"\nfi\n\nif (( nf < nu - nu/20 )); then\n  echo \"$0: Less than 95% of the features were successfully generated.\"\\\n       \"Probably a serious error.\"\n  exit 1\nfi\n\necho \"$0: Succeeded creating PLP and pitch features for $name\"\n"
  },
  {
    "path": "egs/steps/nnet/align.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2015 Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\n# Aligns 'data' to sequences of transition-ids using Neural Network based acoustic model.\n# Optionally produces alignment in lattice format, this is handy to get word alignment.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nstage=0\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nnnet_forward_opts=\"--no-softmax=true --prior-scale=1.0\"\nivector=            # rx-specifier with i-vectors (ark-with-vectors),\ntext= # (optional) transcipts we align to,\n\nalign_to_lats=false # optionally produce alignment in lattice format\n lats_decode_opts=\"--acoustic-scale=0.1 --beam=20 --lattice_beam=10\"\n lats_graph_scales=\"--transition-scale=1.0 --self-loop-scale=0.1\"\n\nuse_gpu=\"no\" # yes|no|optionaly\n# End configuration options.\n\n[ $# -gt 0 ] && echo \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 4 ]; then\n   echo \"usage: $0 <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  $0 data/train data/lang exp/tri1 exp/tri1_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\ncp $lang/phones.txt $dir\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\n\n# Select default locations of model files\nnnet=$srcdir/final.nnet;\nclass_frame_counts=$srcdir/ali_train_pdf.counts\nfeature_transform=$srcdir/final.feature_transform\nmodel=$dir/final.mdl\n\n# Check that files exist\nfor f in $sdata/1/feats.scp $lang/L.fst $nnet $model $feature_transform $class_frame_counts; do\n  [ ! -f $f ] && echo \"$0: missing file $f\" && exit 1;\ndone\n# When --text is not given, the per-job transcripts in the split data dir are required.\n[ -z \"$text\" -a ! -f $sdata/1/text ] && echo \"$0: missing file $sdata/1/text\" && exit 1\n\n\n# PREPARE FEATURE EXTRACTION PIPELINE\n# import config,\nonline_cmvn_opts=\ncmvn_opts=\ndelta_opts=\nD=$srcdir\n[ -e $D/online_cmvn_opts ] && online_cmvn_opts=$(cat $D/online_cmvn_opts)\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- |\"\n# apply-cmvn-online (optional),\n[ -n \"$online_cmvn_opts\" -a ! 
-f $D/global_cmvn_stats.mat ] && echo \"$0: Missing $D/global_cmvn_stats.mat\" && exit 1\n[ -n \"$online_cmvn_opts\" ] && feats=\"$feats apply-cmvn-online $online_cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt $D/global_cmvn_stats.mat ark:- ark:- |\"\n# apply-cmvn (optional),\n[ -n \"$cmvn_opts\" -a ! -f $sdata/1/cmvn.scp ] && echo \"$0: Missing $sdata/1/cmvn.scp\" && exit 1\n[ -n \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ -n \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z \"$ivector\" ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool,\n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  feats_job_1=$(sed 's:JOB:1:g' <(echo $feats))\n  dim_raw=$(feat-to-dim \"$feats_job_1\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_job_1 $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. 
mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n# nnet-forward,\nfeats=\"$feats nnet-forward $nnet_forward_opts --feature-transform=$feature_transform --class-frame-counts=$class_frame_counts --use-gpu=$use_gpu $nnet ark:- ark:- |\"\n#\n\necho \"$0: aligning data '$data' using nnet/model '$srcdir', putting alignments in '$dir'\"\n\n# Map oovs in reference transcription,\noov=`cat $lang/oov.int` || exit 1;\n[ -z \"$text\" ] && text=$sdata/JOB/text\ntra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $text |\";\n# We could just use align-mapped in the next line, but it's less efficient as it compiles the\n# training graphs one by one.\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/align.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl $lang/L.fst \"$tra\" ark:- \\| \\\n    align-compiled-mapped $scale_opts --beam=$beam --retry-beam=$retry_beam $dir/final.mdl ark:- \\\n      \"$feats\" \"ark,t:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\n# Optionally align to lattice format (handy to get word alignment)\nif [ \"$align_to_lats\" == \"true\" ]; then\n  echo \"$0: aligning also to lattices '$dir/lat.*.gz'\"\n  $cmd JOB=1:$nj $dir/log/align_lat.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $lats_graph_scales $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" ark:- \\| \\\n    latgen-faster-mapped $lats_decode_opts --word-symbol-table=$lang/words.txt $dir/final.mdl ark:- \\\n      \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\necho \"$0: done aligning data.\"\n"
  },
  {
    "path": "egs/steps/nnet/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015 Brno University of Technology (author: Karel Vesely), Daniel Povey\n# Apache 2.0\n\n# Begin configuration section.\nnnet=               # non-default location of DNN (optional)\nfeature_transform=  # non-default location of feature_transform (optional)\nmodel=              # non-default location of transition model (optional)\nclass_frame_counts= # non-default location of PDF counts (optional)\nsrcdir=             # non-default location of DNN-dir (decouples model dir from decode dir)\nivector=            # rx-specifier with i-vectors (ark-with-vectors),\n\nblocksoftmax_dims=   # 'csl' with block-softmax dimensions: dim1,dim2,dim3,...\nblocksoftmax_active= # '1' for the 1st block,\n\nstage=0 # stage=1 skips lattice generation\nnj=4\ncmd=run.pl\n\nacwt=0.10 # note: only really affects pruning (scoring is on lattices).\nbeam=13.0\nlattice_beam=8.0\nmin_active=200\nmax_active=7000 # limit of active tokens\nmax_mem=50000000 # approx. limit to memory consumption during minimization in bytes\nnnet_forward_opts=\"--no-softmax=true --prior-scale=1.0\"\n\nskip_scoring=false\nscoring_opts=\"--min-lmwt 4 --max-lmwt 15\"\n\nnum_threads=1 # if >1, will use latgen-faster-parallel\nparallel_opts=   # Ignored now.\nuse_gpu=\"no\" # yes|no|optionaly\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the DNN and transition model is.\"\n   echo \"e.g.: $0 exp/dnn1/graph_tgpr data/test exp/dnn1/decode_tgpr\"\n   echo \"\"\n   echo \"This script works on plain or modified features (CMN,delta+delta-delta),\"\n   echo \"which are then sent through feature-transform. 
It works out what type\"\n   echo \"of features you used from content of srcdir.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"\"\n   echo \"  --nnet <nnet>                                    # non-default location of DNN (opt.)\"\n   echo \"  --srcdir <dir>                                   # non-default dir with DNN/models, can be different\"\n   echo \"                                                   # from parent dir of <decode-dir>' (opt.)\"\n   echo \"\"\n   echo \"  --acwt <float>                                   # select acoustic scale for decoding\"\n   echo \"  --scoring-opts <opts>                            # options forwarded to local/score.sh\"\n   echo \"  --num-threads <N>                                # N>1: run multi-threaded decoder\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\n[ -z $srcdir ] && srcdir=`dirname $dir`; # Default model directory one level up from decoding directory.\nsdata=$data/split$nj;\n\nmkdir -p $dir/log\n\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n# Select default locations to model files (if not already set externally)\n[ -z \"$nnet\" ] && nnet=$srcdir/final.nnet\n[ -z \"$model\" ] && model=$srcdir/final.mdl\n[ -z \"$feature_transform\" -a -e $srcdir/final.feature_transform ] && feature_transform=$srcdir/final.feature_transform\n#\n[ -z \"$class_frame_counts\" -a -f $srcdir/prior_counts ] && class_frame_counts=$srcdir/prior_counts # priority,\n[ -z \"$class_frame_counts\" ] && class_frame_counts=$srcdir/ali_train_pdf.counts\n\n# Check that files exist,\nfor f in $sdata/1/feats.scp $nnet $model $feature_transform $class_frame_counts 
$graphdir/HCLG.fst; do\n  [ ! -f $f ] && echo \"$0: missing file $f\" && exit 1;\ndone\n\n# Possibly use multi-threaded decoder\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\n\n# PREPARE FEATURE EXTRACTION PIPELINE\n# import config,\nonline_cmvn_opts=\ncmvn_opts=\ndelta_opts=\nD=$srcdir\n[ -e $D/online_cmvn_opts ] && online_cmvn_opts=$(cat $D/online_cmvn_opts)\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- |\"\n# apply-cmvn-online (optional),\n[ -n \"$online_cmvn_opts\" -a ! -f $D/global_cmvn_stats.mat ] && echo \"$0: Missing $D/global_cmvn_stats.mat\" && exit 1\n[ -n \"$online_cmvn_opts\" ] && feats=\"$feats apply-cmvn-online $online_cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt $D/global_cmvn_stats.mat ark:- ark:- |\"\n# apply-cmvn (optional),\n[ -n \"$cmvn_opts\" -a ! -f $sdata/1/cmvn.scp ] && echo \"$0: Missing $sdata/1/cmvn.scp\" && exit 1\n[ -n \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ -n \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z \"$ivector\" ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool,\n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  feats_job_1=$(sed 's:JOB:1:g' <(echo $feats))\n  dim_raw=$(feat-to-dim \"$feats_job_1\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_job_1 $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. 
mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n# select a block from blocksoftmax,\nif [ ! -z \"$blocksoftmax_dims\" ]; then\n  # blocksoftmax_active is a csl! dim1,dim2,dim3,...\n  [ -z \"$blocksoftmax_active\" ] && echo \"$0 Missing option --blocksoftmax-active N\" && exit 1\n  # getting dims,\n  dim_total=$(awk -F'[:,]' '{ for(i=1;i<=NF;i++) { sum += $i }; print sum; }' <(echo $blocksoftmax_dims))\n  dim_block=$(awk -F'[:,]' -v active=$blocksoftmax_active '{ print $active; }' <(echo $blocksoftmax_dims))\n  offset=$(awk -F'[:,]' -v active=$blocksoftmax_active '{ sum=0; for(i=1;i<active;i++) { sum += $i }; print sum; }' <(echo $blocksoftmax_dims))\n  # create components which select a block,\n  nnet-initialize <(echo \"<Copy> <InputDim> $dim_total <OutputDim> $dim_block <BuildVector> $((1+offset)):$((offset+dim_block)) </BuildVector>\";\n                    echo \"<Softmax> <InputDim> $dim_block <OutputDim> $dim_block\") $dir/copy_and_softmax.nnet\n  # nnet is assembled on-the fly, <BlockSoftmax> is removed, while <Copy> + <Softmax> is added,\n  nnet=\"nnet-concat 'nnet-copy --remove-last-components=1 $nnet - |' $dir/copy_and_softmax.nnet - |\"\nfi\n\n# Run the decoding in the queue,\nif [ $stage -le 0 ]; then\n  $cmd --num-threads $((num_threads+1)) JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet-forward $nnet_forward_opts --feature-transform=$feature_transform --class-frame-counts=$class_frame_counts --use-gpu=$use_gpu \"$nnet\" \"$feats\" ark:- \\| \\\n    latgen-faster-mapped$thread_string --min-active=$min_active --max-active=$max_active --max-mem=$max_mem --beam=$beam \\\n    --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $model $graphdir/HCLG.fst ark:- \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\n# Run the scoring\nif ! 
$skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir || exit 1;\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet/ivector/extract_ivectors.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright     2013  Daniel Povey\n#               2016  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0.\n\n\n# This script computes iVectors in the same format as extract_ivectors_online.sh,\n# except that they are actually not really computed online, they are first computed\n# per speaker and just duplicated many times.\n# This is mainly intended for use in decoding, where you want the best possible\n# quality of iVectors.\n#\n# This setup also makes it possible to use a previous decoding or alignment, to\n# down-weight silence in the stats (default is --silence-weight 0.0).\n#\n# This is for when you use the \"online-decoding\" setup in an offline task, and\n# you want the best possible results.\n\n\n# Begin configuration section.\nnj=30\ncmd=\"run.pl\"\nstage=0\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\n\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.  Making this small during iVector\n                    # extraction is equivalent to scaling up the prior, and will\n                    # will tend to produce smaller iVectors where data-counts are\n                    # small.  It's not so important that this match the value\n                    # used when training the iVector extractor, but more important\n                    # that this match the value used when you do real online decoding\n                    # with the neural nets trained with these iVectors.\n\nmax_count=100       # Interpret this as a number of frames times posterior scale...\n                    # this config ensures that once the count exceeds this (i.e.\n                    # 1000 frames, or 10 seconds, by default), we start to scale\n                    # down the stats, accentuating the prior term.   
This seems quite\n                    # important for some reason.\n\nsilence_weight=0.0\nacwt=0.1  # used if input is a decode dir, to get best path from lattices.\nmdl=final  # change this if decode directory did not have ../final.mdl present.\n\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 4 ] && [ $# != 5 ]; then\n  echo \"Usage: $0 [options] <data> <lang> <extractor-dir> [<alignment-dir>|<decode-dir>|<weights-archive>] <ivector-dir>\"\n  echo \" e.g.: $0 data/test data/lang exp/nnet2_online/extractor exp/tri3/decode_test exp/nnet2_online/ivectors_test\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <n|30>                                      # Number of jobs (also see num-processes and num-threads)\"\n  echo \"                                                   # Ignored if <alignment-dir> or <decode-dir> supplied.\"\n  echo \"  --stage <stage|0>                                # To control partial reruns\"\n  echo \"  --num-gselect <n|5>                              # Number of Gaussians to select using\"\n  echo \"                                                   # diagonal model.\"\n  echo \"  --min-post <float;default=0.025>                 # Pruning threshold for posteriors\"\n  echo \"  --posterior-scale <float;default=0.1>            # Scale on posteriors in iVector extraction; \"\n  echo \"                                                   # affects strength of prior term.\"\n\n  exit 1;\nfi\n\nset -euxo pipefail\n\nif [ $# -eq 4 ]; then\n  data=$1\n  lang=$2\n  srcdir=$3\n  ali_or_decode_dir= # not supplied; defined so that 'set -u' does not abort below\n  dir=$4\nelse # 5 arguments\n  data=$1\n  lang=$2\n  
srcdir=$3\n  ali_or_decode_dir=$4\n  dir=$5\nfi\n\nfor f in $data/feats.scp $srcdir/final.ie $srcdir/final.dubm $lang/phones.txt; do\n  [ ! -f $f ] && echo \"$0: No such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log\nsilphonelist=$(cat $lang/phones/silence.csl) || exit 1;\n\nif [ ! -z \"$ali_or_decode_dir\" ]; then\n\n  if [ -f $ali_or_decode_dir/ali.1.gz ]; then\n    if [ ! -f $ali_or_decode_dir/${mdl}.mdl ]; then\n      echo \"$0: expected $ali_or_decode_dir/${mdl}.mdl to exist.\"\n      exit 1;\n    fi\n    nj_orig=$(cat $ali_or_decode_dir/num_jobs) || exit 1;\n\n    if [ $stage -le 0 ]; then\n      rm $dir/weights.*.gz 2>/dev/null || true\n\n      $cmd JOB=1:$nj_orig  $dir/log/ali_to_post.JOB.log \\\n        gunzip -c $ali_or_decode_dir/ali.JOB.gz \\| \\\n        ali-to-post ark:- ark:- \\| \\\n        weight-silence-post $silence_weight $silphonelist $ali_or_decode_dir/final.mdl ark:- ark:- \\| \\\n        post-to-weights ark:- \"ark:|gzip -c >$dir/weights.JOB.gz\" || exit 1;\n\n      # put all the weights in one archive.\n      for j in $(seq $nj_orig); do gunzip -c $dir/weights.$j.gz; done | gzip -c >$dir/weights.gz || exit 1;\n      rm $dir/weights.*.gz || exit 1;\n    fi\n\n  elif [ -f $ali_or_decode_dir/lat.1.gz ]; then\n    nj_orig=$(cat $ali_or_decode_dir/num_jobs) || exit 1;\n    if [ ! 
-f $ali_or_decode_dir/../${mdl}.mdl ]; then\n      echo \"$0: expected $ali_or_decode_dir/../${mdl}.mdl to exist.\"\n      exit 1;\n    fi\n\n\n    if [ $stage -le 0 ]; then\n      rm $dir/weights.*.gz 2>/dev/null || true\n\n      $cmd JOB=1:$nj_orig  $dir/log/lat_to_post.JOB.log \\\n        lattice-best-path --acoustic-scale=$acwt \"ark:gunzip -c $ali_or_decode_dir/lat.JOB.gz|\" ark:/dev/null ark:- \\| \\\n        ali-to-post ark:- ark:- \\| \\\n        weight-silence-post $silence_weight $silphonelist $ali_or_decode_dir/../${mdl}.mdl ark:- ark:- \\| \\\n        post-to-weights ark:- \"ark:|gzip -c >$dir/weights.JOB.gz\" || exit 1;\n\n      # put all the weights in one archive.\n      for j in $(seq $nj_orig); do gunzip -c $dir/weights.$j.gz; done | gzip -c >$dir/weights.gz || exit 1;\n      rm $dir/weights.*.gz || exit 1;\n    fi\n\n  elif [ -f $ali_or_decode_dir ] && gunzip -c $ali_or_decode_dir >/dev/null; then\n    cp $ali_or_decode_dir $dir/weights.gz || exit 1;\n\n  else\n    echo \"$0: expected ali.1.gz or lat.1.gz to exist in $ali_or_decode_dir\";\n    exit 1;\n  fi\nfi\n\nsdata=$data/split$nj;\nutils/split_data.sh $data $nj || exit 1;\n\ngmm_feats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- |\"\nfeats=\"$gmm_feats\"\n\n# (here originally was the sub-speaker hack),\nthis_sdata=$sdata\n\n# Per-speaker i-vectors,\nif [ $stage -le 2 ]; then\n  if [ ! 
-z \"$ali_or_decode_dir\" ]; then\n    $cmd JOB=1:$nj $dir/log/extract_ivectors.JOB.log \\\n      gmm-global-get-post --n=$num_gselect --min-post=$min_post $srcdir/final.dubm \"$gmm_feats\" ark:- \\| \\\n      weight-post ark:- \"ark,s,cs:gunzip -c $dir/weights.gz|\" ark:- \\| \\\n      ivector-extract --acoustic-weight=$posterior_scale --compute-objf-change=true \\\n        --max-count=$max_count --spk2utt=ark:$this_sdata/JOB/spk2utt \\\n      $srcdir/final.ie \"$feats\" ark,s,cs:- ark:$dir/ivectors_spk.JOB.ark\n  else\n    $cmd JOB=1:$nj $dir/log/extract_ivectors.JOB.log \\\n      gmm-global-get-post --n=$num_gselect --min-post=$min_post $srcdir/final.dubm \"$gmm_feats\" ark:- \\| \\\n      ivector-extract --acoustic-weight=$posterior_scale --compute-objf-change=true \\\n        --max-count=$max_count --spk2utt=ark:$this_sdata/JOB/spk2utt \\\n      $srcdir/final.ie \"$feats\" ark,s,cs:- ark:$dir/ivectors_spk.JOB.ark\n  fi\nfi\n\n# Per-utterance i-vectors,\nif [ $stage -le 3 ]; then\n  if [ ! 
-z \"$ali_or_decode_dir\" ]; then\n    $cmd JOB=1:$nj $dir/log/extract_ivectors_utt.JOB.log \\\n      gmm-global-get-post --n=$num_gselect --min-post=$min_post $srcdir/final.dubm \"$gmm_feats\" ark:- \\| \\\n      weight-post ark:- \"ark,s,cs:gunzip -c $dir/weights.gz|\" ark:- \\| \\\n      ivector-extract --acoustic-weight=$posterior_scale --compute-objf-change=true --max-count=$max_count \\\n      $srcdir/final.ie \"$feats\" ark,s,cs:- ark:$dir/ivectors_utt.JOB.ark\n  else\n    $cmd JOB=1:$nj $dir/log/extract_ivectors_utt.JOB.log \\\n      gmm-global-get-post --n=$num_gselect --min-post=$min_post $srcdir/final.dubm \"$gmm_feats\" ark:- \\| \\\n      ivector-extract --acoustic-weight=$posterior_scale --compute-objf-change=true --max-count=$max_count \\\n      $srcdir/final.ie \"$feats\" ark,s,cs:- ark:$dir/ivectors_utt.JOB.ark\n  fi\nfi\n\nabsdir=$(utils/make_absolute.sh $dir)\nif [ $stage -le 4 ]; then\n  echo \"$0: merging iVectors across jobs\"\n  copy-vector \"ark:cat $dir/ivectors_spk.*.ark |\" ark,scp:$absdir/ivectors_spk.ark,$dir/ivectors_spk.scp\n  rm $dir/ivectors_spk.*.ark\n  copy-vector \"ark:cat $dir/ivectors_utt.*.ark |\" ark,scp:$absdir/ivectors_utt.ark,$dir/ivectors_utt.scp\n  rm $dir/ivectors_utt.*.ark\nfi\n\n# duplicate the `speaker' i-vector to all `utterances' of that speaker,\nif [ $stage -le 5 ]; then\n  # filter utt2spk (remove speakers with no iVector),\n  awk -v ivec_spk=$dir/ivectors_spk.scp \\\n    'BEGIN{ while(getline < ivec_spk) { spk_has_ivec[$1] = 1; }} { spk=$2; if(spk_has_ivec[spk]) { print $0 }}' \\\n    $data/utt2spk >$dir/utt2spk.filt\n  # expand the list of i-vectors,\n  utils/apply_map.pl -f 2 $dir/ivectors_spk.scp <$dir/utt2spk.filt >$dir/ivectors_spk-as-utt.scp\nfi\n\necho \"$0: done extracting iVectors (per-speaker, per-sentence) into '$dir'\"\n\n"
  },
  {
    "path": "egs/steps/nnet/ivector/train_diag_ubm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2012  Johns Hopkins University (Author: Daniel Povey)\n#             2013  Daniel Povey\n#             2016  Brno University of Technology (Author: Karel Vesely)\n# Apache 2.0.\n\n# This script trains a diagonal UBM that we'll use in online iVector estimation,\n# where the online-estimated iVector will be used as a secondary input to a deep\n# neural net for single-pass DNN-based decoding.\n\n# This script was modified from ../../sre08/v1/sid/train_diag_ubm.sh.\n# It trains a diagonal UBM on top of input features. We use the original features,\n# assuming they are already normalized (or transformed).\n\n# This script does not use the trained model from the source directory to\n# initialize the diagonal GMM; instead, we initialize the GMM using\n# gmm-global-init-from-feats, which sets the means to random data points and\n# then does some iterations of E-M in memory.  After the in-memory\n# initialization we train for a few iterations in parallel.\n# Note that there is a slight mismatch in that the source LDA+MLLT matrix\n# (final.mat) will have been estimated using standard CMVN, and we're using\n# online CMVN.  We don't think this will have much effect.\n\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nnum_iters=4\nstage=-2\nnum_gselect=30 # Number of Gaussian-selection indices to use while training\n               # the model.\nnum_frames=500000 # number of frames to keep in memory for initialization\nnum_iters_init=20\ninitial_gauss_proportion=0.5 # Start with half the target number of Gaussians\nsubsample=2 # subsample all features with this periodicity, in the main E-M phase.\ncleanup=true\nmin_gaussian_weight=0.0001\nremove_low_count_gaussians=true # set this to false if you need #gauss to stay fixed.\nnum_threads=8\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0  <data> <num-gauss> <output-dir>\"\n  echo \" e.g.: $0 data/train 1024 exp/diag_ubm\"\n  echo \"Options: \"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <num-jobs|4>                                # number of parallel jobs to run.\"\n  echo \"  --num-iters <niter|4>                            # number of iterations of parallel \"\n  echo \"                                                   # training (default: $num_iters)\"\n  echo \"  --stage <stage|-2>                               # stage to do partial re-run from.\"\n  echo \"  --num-gselect <n|30>                             # Number of Gaussians per frame to\"\n  echo \"                                                   # limit computation to, for speed\"\n  echo \" --subsample <n|2>                                 # In main E-M phase, use every n\"\n  echo \"                                                   # frames (a speedup)\"\n  echo \"  --num-frames <n|500000>                          # Maximum num-frames to keep in memory\"\n  echo \"                                                   # for model initialization\"\n  echo \"  --num-iters-init <n|20>                          # Number of E-M iterations for model\"\n  echo \"                                                   # initialization\"\n  echo \" --initial-gauss-proportion <proportion|0.5>       # Proportion of Gaussians to start with\"\n  echo \"                                                   # in initialization phase (then split)\"\n  echo \" --num-threads <n|8>                               # number of threads to use in initialization\"\n  echo \"                                                   # phase (must match with parallel-opts option)\"\n  echo \" --min-gaussian-weight <weight|0.0001>             # min Gaussian weight allowed in GMM\"\n  echo \"                                                   # initialization 
(this relatively high\"\n  echo \"                                                   # value keeps counts fairly even)\"\n  exit 1;\nfi\n\nset -euo pipefail\n\ndata=$1\nnum_gauss=$2\ndir=$3\n\n! [ $num_gauss -gt 0 ] && echo \"Bad num-gauss $num_gauss\" && exit 1;\n\nsdata=$data/split$nj\nmkdir -p $dir/log\nutils/split_data.sh $data $nj || exit 1;\n\nfor f in $data/feats.scp; do\n   [ ! -f \"$f\" ] && echo \"$0: expecting file $f to exist\" && exit 1\ndone\n\n# Note: there is no point subsampling all_feats, because gmm-global-init-from-feats\n# effectively does subsampling itself (it keeps a random subset of the features).\nall_feats=\"ark,s,cs:copy-feats scp:$data/feats.scp ark:- |\"\nfeats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- | subsample-feats --n=$subsample ark:- ark:- |\"\n\nnum_gauss_init=$(perl -e \"print int($initial_gauss_proportion * $num_gauss); \");\n! [ $num_gauss_init -gt 0 ] && echo \"Invalid num-gauss-init $num_gauss_init\" && exit 1;\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing model from E-M in memory, \"\n  echo \"$0: starting from $num_gauss_init Gaussians, reaching $num_gauss;\"\n  echo \"$0: for $num_iters_init iterations, using at most $num_frames frames of data\"\n\n  $cmd --num-threads $num_threads $dir/log/gmm_init.log \\\n    gmm-global-init-from-feats --num-threads=$num_threads --num-frames=$num_frames \\\n     --min-gaussian-weight=$min_gaussian_weight \\\n     --num-gauss=$num_gauss --num-gauss-init=$num_gauss_init --num-iters=$num_iters_init \\\n    \"$all_feats\" $dir/0.dubm\nfi\n\n# Store Gaussian selection indices on disk-- this speeds up the training passes.\nif [ $stage -le -1 ]; then\n  echo \"Getting Gaussian-selection info\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$num_gselect $dir/0.dubm \"$feats\" \\\n      \"ark:|gzip -c >$dir/gselect.JOB.gz\"\nfi\n\necho \"$0: will train for $num_iters iterations, in parallel over\"\necho \"$0: $nj machines, parallelized with 
'$cmd'\"\n\nfor x in $(seq 0 $[$num_iters-1]); do\n  echo \"$0: Training pass $x\"\n  if [ $stage -le $x ]; then\n  # Accumulate stats.\n    $cmd JOB=1:$nj $dir/log/acc.${x}.JOB.log \\\n      gmm-global-acc-stats \"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" \\\n      $dir/$x.dubm \"$feats\" $dir/$x.JOB.acc\n    if [ $x -lt $[$num_iters-1] ]; then # Don't remove low-count Gaussians till last iter,\n      opt=\"--remove-low-count-gaussians=false\" # or gselect info won't be valid any more.\n    else\n      opt=\"--remove-low-count-gaussians=$remove_low_count_gaussians\"\n    fi\n    $cmd $dir/log/update.${x}.log \\\n      gmm-global-est $opt --min-gaussian-weight=$min_gaussian_weight $dir/${x}.dubm \"gmm-global-sum-accs - $dir/${x}.*.acc|\" \\\n      $dir/$[$x+1].dubm\n    rm $dir/$x.*.acc $dir/$x.dubm\n  fi\ndone\n\nrm $dir/gselect.*.gz\nmv $dir/$num_iters.dubm $dir/final.dubm\n\nexit 0 # Done!\n\n"
  },
  {
    "path": "egs/steps/nnet/ivector/train_ivector_extractor.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2013  Daniel Povey\n#             2016  Brno University of Technology (Author: Karel Vesely)\n# Apache 2.0.\n\n# This script is modified from ^/egs/sre08/v1/sid/train_ivector_extractor.sh.\n# It trains an iVector extractor for use in DNN training.\n\n# This script trains the i-vector extractor.  Note: there are 3 separate levels\n# of parallelization: num_threads, num_processes, and num_jobs.  This may seem a\n# bit excessive.  It has to do with minimizing memory usage and disk I/O,\n# subject to various constraints.  The \"num_threads\" is how many threads a\n# program uses; the \"num_processes\" is the number of separate processes a single\n# job spawns, and then sums the accumulators in memory.  Our recommendation:\n#  - Set num_threads to the minimum of (4, or how many virtual cores your machine has).\n#    (because of needing to lock various global quantities, the program can't\n#    use many more than 4 threads with good CPU utilization).\n#  - Set num_processes to the number of virtual cores on each machine you have, divided by\n#    num_threads.  E.g. 4, if you have 16 virtual cores.   If you're on a shared queue\n#    that's busy with other people's jobs, it may be wise to set it to rather less\n#    than this maximum though, or your jobs won't get scheduled.  And if memory is\n#    tight you need to be careful; in our normal setup, each process uses about 5G.\n#  - Set num_jobs to as many of the jobs (each using $num_threads * $num_processes CPUs)\n#    your queue will let you run at one time, but don't go much more than 10 or 20, or\n#    summing the accumulators will possibly get slow.  If you have a lot of data, you\n#    may want more jobs, though.\n\n# Begin configuration section.\nnj=10   # this is the number of separate queue jobs we run, but each one\n        # contains num_processes sub-jobs.. 
the real number of threads we\n        # run is nj * num_processes * num_threads, and the number of\n        # separate pieces of data is nj * num_processes.\nnum_threads=4\nnum_processes=2 # each job runs this many processes, each with --num-threads threads\ncmd=\"run.pl\"\nstage=-4\nivector_dim=100 # dimension of the extracted i-vector\nnum_iters=10\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\n               # caution: you should use the same value in the online-estimation\n               # code.\nsubsample=2  # This speeds up the training: training on every 2nd feature\n             # (configurable) Since the features are highly correlated across\n             # frames, we don't expect to lose too much from this.\nparallel_opts=  # ignored now.\ncleanup=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 <data> <diagonal-ubm-dir> <extractor-dir>\"\n  echo \" e.g.: $0 data/train exp/nnet2_online/diag_ubm/ exp/nnet2_online/extractor\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-iters <#iters|10>                          # Number of iterations of E-M\"\n  echo \"  --nj <n|10>                                      # Number of jobs (also see num-processes and num-threads)\"\n  echo \"  --num-processes <n|2>                            # Number of processes for each queue job (relates\"\n  echo \"                                                   # to summing accs in memory)\"\n  echo \"  --num-threads <n|4>                              # Number of threads for each process (can't be usefully\"\n  echo \"                                                   # increased much above 4)\"\n  echo \"  --stage <stage|-4>                               # To control partial reruns\"\n  echo \"  --num-gselect <n|5>                              # Number of Gaussians to select using\"\n  echo \"                                                   # diagonal model.\"\n  exit 1;\nfi\n\nset -euxo pipefail\n\ndata=$1\nsrcdir=$2\ndir=$3\n\nfor f in $srcdir/final.dubm $data/feats.scp; do\n  [ ! -f $f ] && echo \"No such file $f\" && exit 1;\ndone\n\n# Set various variables.\nmkdir -p $dir/log\nnj_full=$[$nj*$num_processes]\nsdata=$data/split$nj_full;\nutils/split_data.sh $data $nj_full\n\ncp $srcdir/final.dubm $dir\n\n## Set up features.\ngmm_feats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- | subsample-feats --n=$subsample ark:- ark:- |\"\nfeats=\"$gmm_feats\"\n\n# Initialize the i-vector extractor using the input GMM, which is converted to\n# full because that's what the i-vector extractor expects.  
Note: we have to do\n# --use-weights=false to disable regression of the log weights on the ivector,\n# because that would make the online estimation of the ivector difficult (since\n# the online/real-time ivector estimation is the whole point of this script).\nif [ $stage -le -2 ]; then\n  $cmd $dir/log/init.log \\\n    ivector-extractor-init --ivector-dim=$ivector_dim --use-weights=false \\\n     \"gmm-global-to-fgmm $dir/final.dubm -|\" $dir/0.ie\nfi\n\n# Do Gaussian selection and posterior extraction\n\n# if we subsample frames, modify the posterior-scale; this is likely\n# to make the original posterior-scale (before subsampling) suitable.\nmodified_posterior_scale=$(perl -e \"print $posterior_scale * $subsample;\");\n\nif [ $stage -le -1 ]; then\n  echo $nj_full > $dir/num_jobs\n  echo \"$0: doing Gaussian selection and posterior computation\"\n  $cmd JOB=1:$nj_full $dir/log/post.JOB.log \\\n    gmm-global-get-post --n=$num_gselect --min-post=$min_post $dir/final.dubm \"$gmm_feats\" ark:- \\| \\\n    scale-post ark:- $modified_posterior_scale \"ark:|gzip -c >$dir/post.JOB.gz\"\nelse\n  # make sure we at least have the right number of post.*.gz files.\n  if ! 
[ $nj_full -eq $(cat $dir/num_jobs) ]; then\n    echo \"Num-jobs mismatch $nj_full versus $(cat $dir/num_jobs)\"\n    exit 1\n  fi\nfi\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $stage -le $x ]; then\n    rm $dir/.error 2>/dev/null || true\n\n    Args=() # bash array of training commands for 1:nj, that put accs to stdout.\n    for j in $(seq $nj_full); do\n      Args[$j]=`echo \"ivector-extractor-acc-stats --num-threads=$num_threads $dir/$x.ie '$feats' 'ark,s,cs:gunzip -c $dir/post.JOB.gz|' -|\" | sed s/JOB/$j/g`\n    done\n\n    echo \"Accumulating stats (pass $x)\"\n    for g in $(seq $nj); do\n      start=$[$num_processes*($g-1)+1]\n      $cmd --num-threads $[$num_threads*$num_processes] $dir/log/acc.$x.$g.log \\\n        ivector-extractor-sum-accs --parallel=true \"${Args[@]:$start:$num_processes}\" \\\n          $dir/acc.$x.$g || touch $dir/.error &\n    done\n    wait\n    [ -f $dir/.error ] && echo \"Error accumulating stats on iteration $x\" && exit 1;\n\n    accs=\"\"\n    for j in $(seq $nj); do\n      accs+=\"$dir/acc.$x.$j \"\n    done\n    echo \"Summing accs (pass $x)\"\n    $cmd $dir/log/sum_acc.$x.log \\\n      ivector-extractor-sum-accs $accs $dir/acc.$x\n\n    echo \"Updating model (pass $x)\"\n    nt=$[$num_threads*$num_processes] # use the same number of threads that\n                                      # each accumulation process uses, since we\n                                      # can be sure the queue will support this many.\n                                      #\n                                      # The parallel-opts was either specified by\n                                      # the user or we computed it correctly in\n                                      # the previous stages\n    $cmd --num-threads $[$num_threads*$num_processes] $dir/log/update.$x.log \\\n      ivector-extractor-est --num-threads=$nt $dir/$x.ie $dir/acc.$x $dir/$[$x+1].ie\n    rm $dir/acc.$x.*\n\n    if $cleanup; then\n      rm $dir/acc.$x\n      # rm 
$dir/$x.ie\n    fi\n  fi\n  x=$[$x+1]\ndone\n\nrm $dir/final.ie 2>/dev/null || true\nln -s $x.ie $dir/final.ie\n"
  },
  {
    "path": "egs/steps/nnet/make_bn_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015 Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nremove_last_components=4 # remove N last components from the nnet\nnnet_forward_opts=\nuse_gpu=no\nhtk_save=false\nivector=            # rx-specifier with i-vectors (ark-with-vectors),\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 5 ]; then\n   echo \"usage: $0 [options] <tgt-data-dir> <src-data-dir> <nnet-dir> <log-dir> <abs-path-to-bn-feat-dir>\";\n   echo \"options: \"\n   echo \"  --cmd 'queue.pl <queue opts>'   # how to run jobs.\"\n   echo \"  --nj <nj>                       # number of parallel jobs\"\n   echo \"  --remove-last-components <N>    # number of NNet Components to remove from the end\"\n   echo \"  --use-gpu (no|yes|optional)     # forwarding on GPU\"\n   exit 1;\nfi\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\ndata=$1\nsrcdata=$2\nnndir=$3\nlogdir=$4\nbnfeadir=$5\n\n######## CONFIGURATION\n\n# copy the dataset metadata from srcdata.\nmkdir -p $data $logdir $bnfeadir || exit 1;\nutils/copy_data_dir.sh $srcdata $data; rm -f $data/{feats,cmvn}.scp 2>/dev/null\n\n# make $bnfeadir an absolute pathname.\n[ '/' != ${bnfeadir:0:1} ] && bnfeadir=$PWD/$bnfeadir\n\nrequired=\"$srcdata/feats.scp $nndir/final.nnet $nndir/final.feature_transform\"\nfor f in $required; do\n  [ ! 
-f $f ] && echo \"$0: Missing $f\" && exit 1;\ndone\n\nname=$(basename $srcdata)\nsdata=$srcdata/split$nj\n[[ -d $sdata && $srcdata/feats.scp -ot $sdata ]] || split_data.sh $srcdata $nj || exit 1;\n\n# Concat feature transform with trimmed MLP:\nnnet=$bnfeadir/feature_extractor.nnet\nnnet-concat $nndir/final.feature_transform \"nnet-copy --remove-last-components=$remove_last_components $nndir/final.nnet - |\" $nnet 2>$logdir/feature_extractor.log || exit 1\nnnet-info $nnet >$data/feature_extractor.nnet-info\n\necho \"Creating bn-feats into $data\"\n\n# PREPARE FEATURE EXTRACTION PIPELINE\n# import config,\nonline_cmvn_opts=\ncmvn_opts=\ndelta_opts=\nD=$nndir\n[ -e $D/online_cmvn_opts ] && online_cmvn_opts=$(cat $D/online_cmvn_opts)\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- |\"\n# apply-cmvn-online (optional),\n[ -n \"$online_cmvn_opts\" -a ! -f $nndir/global_cmvn_stats.mat ] && echo \"$0: Missing $nndir/global_cmvn_stats.mat\" && exit 1\n[ -n \"$online_cmvn_opts\" ] && feats=\"$feats apply-cmvn-online $online_cmvn_opts --spk2utt=ark:$srcdata/spk2utt $nndir/global_cmvn_stats.mat ark:- ark:- |\"\n# apply-cmvn (optional),\n[ -n \"$cmvn_opts\" -a ! 
-f $sdata/1/cmvn.scp ] && echo \"$0: Missing $sdata/1/cmvn.scp\" && exit 1\n[ -n \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$srcdata/utt2spk scp:$srcdata/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ -n \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z $ivector ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool,\n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  feats_job_1=$(sed 's:JOB:1:g' <(echo $feats))\n  dim_raw=$(feat-to-dim \"$feats_job_1\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_job_1 $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\nif [ $htk_save == false ]; then\n  # Run the forward pass,\n  $cmd JOB=1:$nj $logdir/make_bnfeats.JOB.log \\\n    nnet-forward $nnet_forward_opts --use-gpu=$use_gpu $nnet \"$feats\" \\\n    ark,scp:$bnfeadir/raw_bnfea_$name.JOB.ark,$bnfeadir/raw_bnfea_$name.JOB.scp \\\n    || exit 1;\n  # concatenate the .scp files\n  for ((n=1; n<=nj; n++)); do\n    cat $bnfeadir/raw_bnfea_$name.$n.scp >> $data/feats.scp\n  done\n\n  # check sentence counts,\n  N0=$(cat $srcdata/feats.scp | wc -l)\n  N1=$(cat $data/feats.scp | wc -l)\n  [[ \"$N0\" != \"$N1\" ]] && echo \"$0: sentence-count mismatch, $srcdata $N0, $data $N1\" && exit 1\n  echo \"Succeeded creating MLP-BN features '$data'\"\n\nelse # htk_save == true\n  # Run the forward pass saving HTK features,\n  $cmd JOB=1:$nj $logdir/make_bnfeats_htk.JOB.log \\\n    mkdir -p $data/htkfeats/JOB \\; 
\\\n    nnet-forward $nnet_forward_opts --use-gpu=$use_gpu $nnet \"$feats\" ark:- \\| \\\n    copy-feats-to-htk --output-dir=$data/htkfeats/JOB ark:- || exit 1\n  # Make list of htk features (quote the pattern so the shell does not expand it),\n  find $data/htkfeats -name '*.fea' >$data/htkfeats.scp\n\n  # Check sentence counts (count the entries listed in htkfeats.scp),\n  N0=$(cat $srcdata/feats.scp | wc -l)\n  N1=$(cat $data/htkfeats.scp | wc -l)\n  [[ \"$N0\" != \"$N1\" ]] && echo \"$0: sentence-count mismatch, $srcdata $N0, $data/htk* $N1\" && exit 1\n  echo \"Succeeded creating MLP-BN features '$data/htkfeats.scp'\"\nfi\n"
  },
  {
    "path": "egs/steps/nnet/make_denlats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013 Brno University of Technology (author: Karel Vesely), Daniel Povey\n# Apache 2.0.\n\n# Create denominator lattices for MMI/MPE/sMBR training.\n# Creates its output in $dir/lat.*.ark,$dir/lat.scp\n# The lattices are uncompressed, we need random access for DNN training.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\nnnet=\nnnet_forward_opts=\"--no-softmax=true --prior-scale=1.0\"\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\n# End configuration section.\nuse_gpu=no # yes|no|optional\nparallel_opts=\"--num-threads 2\"\nivector=         # rx-specifier with i-vectors (ark-with-vectors),\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/$0 [options] <data-dir> <lang-dir> <src-dir> <exp-dir>\"\n   echo \"  e.g.: steps/$0 data/train data/lang exp/tri1 exp/tri1_denlats\"\n   echo \"Works for plain features (or CMN, delta), forwarded through feature-transform.\"\n   echo \"\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n   echo \"                           # large databases so your jobs will be smaller and\"\n   echo \"                           # will (individually) finish reasonably soon.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nsdata=$data/split$nj\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\noov=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\n\ncp -r $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\nnew_lang=\"$dir/\"$(basename \"$lang\")\necho \"Making unigram grammar FST in $new_lang\"\ncat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n  awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n  utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > $new_lang/G.fst \\\n   || exit 1;\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\necho \"Compiling decoding graph in $dir/dengraph\"\nif [ -s $dir/dengraph/HCLG.fst ] && [ $dir/dengraph/HCLG.fst -nt $srcdir/final.mdl ]; then\n   echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  utils/mkgraph.sh $new_lang $srcdir $dir/dengraph || exit 1;\nfi\n\n\ncp $srcdir/{tree,final.mdl} $dir\n\n# Select default locations to model files\n[ -z \"$nnet\" ] && nnet=$srcdir/final.nnet;\nclass_frame_counts=$srcdir/ali_train_pdf.counts\nfeature_transform=$srcdir/final.feature_transform\nmodel=$dir/final.mdl\n\n# Check that files exist\nfor f in $sdata/1/feats.scp $nnet $model $feature_transform $class_frame_counts; do\n  [ ! 
-f $f ] && echo \"$0: missing file $f\" && exit 1;\ndone\n\n\n# PREPARE FEATURE EXTRACTION PIPELINE\n# import config,\ncmvn_opts=\ndelta_opts=\nD=$srcdir\n[ -e $D/norm_vars ] && cmvn_opts=\"--norm-means=true --norm-vars=$(cat $D/norm_vars)\" # Bwd-compatibility,\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_order ] && delta_opts=\"--delta-order=$(cat $D/delta_order)\" # Bwd-compatibility,\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark,s,cs:copy-feats scp:$sdata/JOB/feats.scp ark:- |\"\n# apply-cmvn (optional),\n[ ! -z \"$cmvn_opts\" -a ! -f $sdata/1/cmvn.scp ] && echo \"$0: Missing $sdata/1/cmvn.scp\" && exit 1\n[ ! -z \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ ! -z \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n# add-pytel transform (optional),\n[ -e $D/pytel_transform.py ] && feats=\"$feats /bin/env python $D/pytel_transform.py |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z $ivector ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool,\n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  feats_job_1=$(sed 's:JOB:1:g' <(echo $feats))\n  dim_raw=$(feat-to-dim \"$feats_job_1\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_job_1 $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. 
mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n# nnet-forward,\nfeats=\"$feats nnet-forward $nnet_forward_opts --feature-transform=$feature_transform --class-frame-counts=$class_frame_counts --use-gpu=$use_gpu $nnet ark:- ark:- |\"\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids || true\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\n\necho \"$0: generating denlats from data '$data', putting lattices in '$dir'\"\n#1) Generate the denominator lattices\nif [ $sub_split -eq 1 ]; then\n  # Prepare 'scp' for storing lattices separately and gzipped\n  for n in `seq $nj`; do\n    [ ! -d $dir/lat$n ] && mkdir $dir/lat$n;\n    cat $sdata/$n/feats.scp | \\\n    awk -v dir=$dir -v n=$n '{ utt=$1; utt_noslash=utt; gsub(\"/\",\"_\",utt_noslash);\n                               printf(\"%s | gzip -c >%s/lat%d/%s.gz\\n\", utt, dir, n, utt_noslash); }'\n  done >$dir/lat.store_separately_as_gz.scp\n  # Generate the lattices\n  $cmd $parallel_opts JOB=1:$nj $dir/log/decode_den.JOB.log \\\n    latgen-faster-mapped --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n      --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n      $dir/dengraph/HCLG.fst \"$feats\" \"scp:$dir/lat.store_separately_as_gz.scp\" || exit 1;\nelse\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have stragglers\n  # from one job, we can be processing another one at the same time.\n  rm -f $dir/.error\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      sdata2=$data/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=$(echo $feats | sed s:JOB/:$n/split${sub_split}utt/JOB/:g)\n      # Prepare 'scp' for storing lattices separately and gzipped\n      for k in `seq $sub_split`; do\n        [ ! -d $dir/lat$n/$k ] && mkdir -p $dir/lat$n/$k;\n        cat $sdata2/$k/feats.scp | \\\n        awk -v dir=$dir -v n=$n -v k=$k '{ utt=$1; utt_noslash=utt; gsub(\"/\",\"_\",utt_noslash);\n                                           printf(\"%s | gzip -c >%s/lat%d/%d/%s.gz\\n\", utt, dir, n, k, utt_noslash); }'\n      done >$dir/lat.${n}.store_separately_as_gz.scp\n      # Generate lattices; on failure, flag it in $dir/.error, which is checked below\n      $cmd $parallel_opts JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        latgen-faster-mapped --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n          --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n          $dir/dengraph/HCLG.fst \"$feats_subset\" scp:$dir/lat.$n.store_separately_as_gz.scp || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! 
-z \"$prev_pid\" ]; then  # Wait for the previous job; merge the previous set of lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && echo \"$0: error generating denominator lattices\" && exit 1;\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n#2) Generate 'scp' for reading the lattices\n# make $dir an absolute pathname.\n[ '/' != ${dir:0:1} ] && dir=$PWD/$dir\nfor n in `seq $nj`; do\n  find $dir/lat${n} -name \"*.gz\" | perl -ape 's:.*/([^/]+)\\.gz$:$1 gunzip -c $& |:; '\ndone | sort >$dir/lat.scp\n[ -s $dir/lat.scp ] || exit 1\n\necho \"$0: done generating denominator lattices.\"\n"
  },
  {
    "path": "egs/steps/nnet/make_fmllr_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Brno University of Technology (author: Karel Vesely),\n#                 \n# Apache 2.0.\n#\n# This script dumps fMLLR features in a new data directory, \n# which is later used for neural network training/testing.\n\n# Begin configuration section.  \nnj=4\ncmd=run.pl\ntransform_dir=\nraw_transform_dir=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 5 ]; then\n   echo \"Usage: $0 [options] <tgt-data-dir> <src-data-dir> <gmm-dir> <log-dir> <fea-dir>\"\n   echo \"e.g.: $0 data-fmllr/train data/train exp/tri5a exp/make_fmllr_feats/log plp/processed/\"\n   echo \"\"\n   echo \"This script dumps fMLLR features to disk, so it can be used for NN training.\"\n   echo \"It automoatically figures out the 'feature-type' of the source GMM systems.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --transform-dir <transform-dir>                  # dir with fMLLR transforms\"\n   echo \"  --raw-transform-dir <transform-dir>              # dir with raw-fMLLR transforms\"\n   exit 1;\nfi\n\ndata=$1\nsrcdata=$2\ngmmdir=$3\nlogdir=$4\nfeadir=$5\n\nsdata=$srcdata/split$nj;\n\n# Get the config,\nD=$gmmdir\n[ -f $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts) || cmvn_opts=\n[ -f $D/delta_opts ] && delta_opts=$(cat $D/delta_opts) || delta_opts=\n[ -f $D/splice_opts ] && splice_opts=$(cat $D/splice_opts) || splice_opts=\n\nmkdir -p $data $logdir $feadir\n[[ -d $sdata && $srcdata/feats.scp -ot $sdata ]] || split_data.sh $srcdata $nj || exit 1;\n\n# Check files exist,\nfor 
f in $sdata/1/feats.scp $sdata/1/cmvn.scp; do\n  [ ! -f $f ] && echo \"$0: Missing $f\" && exit 1;\ndone\n[ ! -z \"$transform_dir\" -a ! -f $transform_dir/trans.1 ] && \\\n  echo \"$0: Missing $transform_dir/trans.1\" && exit 1;\n[ ! -z \"$raw_transform_dir\" -a ! -f $raw_transform_dir/raw_trans.1 ] && \\\n  echo \"$0: Missing $raw_transform_dir/raw_trans.1\" && exit 1;\n\n# Figure-out the feature-type,\nfeat_type=\"[UNKNOWN]\"\n[ -z \"$raw_transform_dir\" -a ! -f $gmmdir/final.mat -a ! -z \"$transform_dir\" ] && feat_type=delta_fmllr\n[ -z \"$raw_transform_dir\" -a -f $gmmdir/final.mat -a ! -z \"$transform_dir\" ] && feat_type=lda_fmllr\n[ ! -z \"$raw_transform_dir\" ] && feat_type=raw_fmllr\necho \"$0: feature type is $feat_type\";\n\n# Hand-code the feature pipeline,\ncase $feat_type in\n  delta_fmllr) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk \\\"ark:cat $transform_dir/trans.* |\\\" ark:- ark:- |\";;\n  lda_fmllr) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $gmmdir/final.mat ark:- ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk \\\"ark:cat $transform_dir/trans.* |\\\" ark:- ark:- |\";;\n  raw_fmllr) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$raw_transform_dir/raw_trans.JOB ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n# Prepare the output dir,\nutils/copy_data_dir.sh $srcdata $data; rm $data/{feats,cmvn}.scp 2>/dev/null\n# Make $feadir an absolute pathname,\n[ '/' != ${feadir:0:1} ] && feadir=$PWD/$feadir\n\n# Store the output-features,\nname=`basename 
$data`\n$cmd JOB=1:$nj $logdir/make_fmllr_feats.JOB.log \\\n  copy-feats \"$feats\" \\\n  ark,scp:$feadir/feats_fmllr_$name.JOB.ark,$feadir/feats_fmllr_$name.JOB.scp || exit 1;\n\n# Merge the scp,\nfor n in $(seq 1 $nj); do\n  cat $feadir/feats_fmllr_$name.$n.scp \ndone > $data/feats.scp\n\necho \"$0: Done!, type $feat_type, $srcdata --> $data, using : raw-trans ${raw_transform_dir:-None}, gmm $gmmdir, trans ${transform_dir:-None}\"\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet/make_fmmi_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Brno University of Technology (author: Karel Vesely),\n#\n# Apache 2.0\n#\n# This script dumps fMMI features in a new data directory, \n# which is later used for neural network training/testing.\n\n# Begin configuration section.  \niter=final\nnj=4\ncmd=run.pl\nngselect=2; # Just use the 2 top Gaussians for fMMI/fMPE.  Should match train.\ntransform_dir=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 5 ]; then\n   echo \"Usage: $0 [options] <tgt-data-dir> <src-data-dir> <gmm-dir> <log-dir> <fea-dir>\"\n   echo \"e.g.: $0 data-fmmi/train data/train exp/tri5a_fmmi_b0.1 data-fmmi/train/_log data-fmmi/train/_data \"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"You can also use fMLLR features-- you have to supply --transform-dir option.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --transform-dir <transform-dir>                  # where to find fMLLR transforms.\"\n   exit 1;\nfi\n\ndata=$1\nsrcdata=$2\ngmmdir=$3\nlogdir=$4\nfeadir=$5\n\nsdata=$srcdata/split$nj;\n\n# Get the config,\nD=$gmmdir\n[ -f $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts) || cmvn_opts=\n[ -f $D/splice_opts ] && splice_opts=$(cat $D/splice_opts) || splice_opts=\n\nmkdir -p $data $logdir $feadir\n[[ -d $sdata && $srcdata/feats.scp 
-ot $sdata ]] || split_data.sh $srcdata $nj || exit 1;\n\nfor f in $sdata/1/feats.scp $sdata/1/cmvn.scp $gmmdir/$iter.fmpe; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nif [ -f $gmmdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\";\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $gmmdir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\" && exit 1\n  [ \"`cat $transform_dir/num_jobs`\" -ne $nj ] && \\\n     echo \"Mismatch in number of jobs with $transform_dir\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nfi\n\n# Get Gaussian selection info.\n$cmd JOB=1:$nj $logdir/gselect.JOB.log \\\n  gmm-gselect --n=$ngselect $gmmdir/$iter.fmpe \"$feats\" \\\n  \"ark:|gzip -c >$feadir/gselect.JOB.gz\" || exit 1;\n\n# prepare the dir (cp skips the split* sub-directories and returns non-zero, so do not let 'set -e' abort here)\ncp $srcdata/* $data 2>/dev/null || true; rm -f $data/{feats,cmvn}.scp;\n\n# make $feadir an absolute pathname.\nfeadir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $feadir ${PWD}`\n\n# forward the feats\n$cmd JOB=1:$nj $logdir/make_fmmi_feats.JOB.log \\\n  fmpe-apply-transform $gmmdir/$iter.fmpe \"$feats\" \"ark,s,cs:gunzip -c $feadir/gselect.JOB.gz|\"  \\\n  ark,scp:$feadir/feats_fmmi.JOB.ark,$feadir/feats_fmmi.JOB.scp || exit 1;\n   \n# merge the feats to single SCP\nfor n in $(seq 1 $nj); do\n  cat 
$feadir/feats_fmmi.$n.scp \ndone > $data/feats.scp\n\necho \"$0 finished... $srcdata -> $data ($gmmdir)\"\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet/make_priors.sh",
    "content": "#!/bin/bash \n\n# Copyright 2012-2015 Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nuse_gpu=no\nivector=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 2 ]; then\n   echo \"usage: $0 [options] <data-dir> <nnet-dir>\";\n   echo \"options: \"\n   echo \"  --cmd 'queue.pl <queue opts>'   # how to run jobs.\"\n   echo \"  --nj <nj>                       # number of parallel jobs\"\n   echo \"  --remove-last-components <N>    # number of NNet Components to remove from the end\"\n   echo \"  --use-gpu (no|yes|optional)     # forwarding on GPU\"\n   exit 1;\nfi\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\ndata=$1\nnndir=$2\n\n######## CONFIGURATION\n\nrequired=\"$data/feats.scp $nndir/final.nnet $nndir/final.feature_transform\"\nfor f in $required; do\n  [ ! 
-f $f ] && echo \"$0: Missing $f\" && exit 1;\ndone\n\nsdata=$data/split$nj\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\necho \"Accumulating prior stats by forwarding '$data' with '$nndir'\"\n\n# We estimate priors on 10k utterances, selected randomly from the split data,\nN=$((10000/nj))\n\n# PREPARE FEATURE EXTRACTION PIPELINE\n# import config,\ncmvn_opts=\ndelta_opts=\nD=$nndir\n[ -e $D/norm_vars ] && cmvn_opts=\"--norm-means=true --norm-vars=$(cat $D/norm_vars)\" # Bwd-compatibility,\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_order ] && delta_opts=\"--delta-order=$(cat $D/delta_order)\" # Bwd-compatibility,\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark:cat $sdata/JOB/feats.scp | utils/shuffle_list.pl --srand 777 | head -n$N | copy-feats scp:- ark:- |\"\n# apply-cmvn (optional),\n[ ! -z \"$cmvn_opts\" -a ! -f $sdata/1/cmvn.scp ] && echo \"$0: Missing $sdata/1/cmvn.scp\" && exit 1\n[ ! -z \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ ! 
-z \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n# add-pytel transform (optional),\n[ -e $D/pytel_transform.py ] && feats=\"$feats /bin/env python $D/pytel_transform.py |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z $ivector ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool, \n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  feats_job_1=$(sed 's:JOB:1:g' <(echo $feats))\n  dim_raw=$(feat-to-dim \"$feats_job_1\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_job_1 $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n# Run the forward pass,\n$cmd JOB=1:$nj $nndir/log/prior_stats.JOB.log \\\n  nnet-forward --use-gpu=$use_gpu --feature-transform=$nndir/final.feature_transform $nndir/final.nnet \"$feats\" ark:- \\| \\\n  compute-cmvn-stats --binary=false ark:- $nndir/JOB.prior_cmvn_stats || exit 1\n\nsum-matrices --binary=false $nndir/prior_cmvn_stats $nndir/*.prior_cmvn_stats 2>$nndir/log/prior_sum_matrices.log || exit 1\nrm $nndir/*.prior_cmvn_stats\n\nawk 'NR==2{ $NF=\"\"; print \"[\",$0,\"]\"; }' $nndir/prior_cmvn_stats >$nndir/prior_counts || exit 1\n    \necho \"Succeeded creating prior counts '$nndir/prior_counts' from '$data'\" \n\n"
  },
  {
    "path": "egs/steps/nnet/pretrain_dbn.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2013-2015 Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# To be run from ../../\n#\n# Restricted Boltzman Machine (RBM) pre-training by Contrastive Divergence\n# algorithm (CD-1). A stack of RBMs forms a Deep Belief Neetwork (DBN).\n#\n# This script by default pre-trains on plain features (ie. saved fMLLR features),\n# building a 'feature_transform' containing +/-5 frame splice and global CMVN.\n#\n# There is also a support for adding speaker-based CMVN, deltas, i-vectors,\n# or passing custom 'feature_transform' or its prototype.\n#\n\n# Begin configuration.\n\n# topology, initialization,\nnn_depth=6             # number of hidden layers,\nhid_dim=2048           # number of neurons per layer,\nparam_stddev_first=0.1 # init parameters in 1st RBM\nparam_stddev=0.1 # init parameters in other RBMs\ninput_vis_type=gauss # type of visible nodes on DBN input\n\n# number of iterations,\nrbm_iter=1            # number of pre-training epochs (Gaussian-Bernoulli RBM has 2x more)\n\n# pre-training opts,\nrbm_lrate=0.4         # RBM learning rate\nrbm_lrate_low=0.01    # lower RBM learning rate (for Gaussian units)\nrbm_l2penalty=0.0002  # L2 penalty (increases RBM-mixing rate)\nrbm_extra_opts=\n\n# data processing,\ncopy_feats=true     # resave the features to tmpdir,\ncopy_feats_tmproot=/tmp/kaldi.XXXX # sets 
tmproot for 'copy-feats',\ncopy_feats_compress=true # compress feats while resaving\n\n# feature processing,\nsplice=5            # (default) splice features both-ways along time axis,\ncmvn_opts=          # (optional) adds 'apply-cmvn' to input feature pipeline, see opts,\ndelta_opts=         # (optional) adds 'add-deltas' to input feature pipeline, see opts,\nivector=            # (optional) adds 'append-vector-to-feats', the option is rx-filename for the 2nd stream,\nivector_append_tool=append-vector-to-feats # (optional) the tool for appending ivectors,\n\nfeature_transform_proto= # (optional) use this prototype for 'feature_transform',\nfeature_transform=  # (optional) directly use this 'feature_transform',\n\n# misc.\nverbose=1 # enable per-cache reports\nskip_cuda_check=false\n\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 2 ]; then\n   echo \"Usage: $0 <data> <exp-dir>\"\n   echo \" e.g.: $0 data/train exp/rbm_pretrain\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>           # config containing options\"\n   echo \"\"\n   echo \"  --nn-depth <N>                   # number of RBM layers\"\n   echo \"  --hid-dim <N>                    # number of hidden units per layer\"\n   echo \"  --rbm-iter <N>                   # number of CD-1 iterations per layer\"\n   echo \"                                   # can be used to subsample large datasets\"\n   echo \"  --rbm-lrate <float>              # learning-rate for Bernoulli-Bernoulli RBMs\"\n   echo \"  --rbm-lrate-low <float>          # learning-rate for Gaussian-Bernoulli RBM\"\n   echo \"\"\n   echo \"  --cmvn-opts  <string>            # add 'apply-cmvn' to input feature pipeline\"\n   echo \"  --delta-opts <string>            # add 'add-deltas' to input feature pipeline\"\n   echo \"  --splice <N>                     # splice 
+/-N frames of input features\"\n   echo \"  --copy-feats <bool>              # copy features to /tmp, lowers storage stress\"\n   echo \"\"\n   echo \"  --feature_transform_proto <file> # use this prototype for 'feature_transform'\"\n   echo \"  --feature-transform <file>       # directly use this 'feature_transform'\"\n   exit 1;\nfi\n\ndata=$1\ndir=$2\n\nfor f in $data/feats.scp; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\necho \"# INFO\"\necho \"$0 : Pre-training Deep Belief Network as a stack of RBMs\"\nprintf \"\\t dir       : $dir \\n\"\nprintf \"\\t Train-set : $data '$(cat $data/feats.scp | wc -l)'\\n\"\necho\n\n[ -e $dir/${nn_depth}.dbn ] && echo \"$0 Skipping, already have $dir/${nn_depth}.dbn\" && exit 0\n\n# check if CUDA compiled in and GPU is available,\nif ! $skip_cuda_check; then cuda-gpu-available || exit 1; fi\n\nmkdir -p $dir/log\n\n###### PREPARE FEATURES ######\necho\necho \"# PREPARING FEATURES\"\nif [ \"$copy_feats\" == \"true\" ]; then\n  # re-save the features to local disk into /tmp/,\n  tmpdir=$(mktemp -d $copy_feats_tmproot)\n  trap \"echo \\\"# Removing features tmpdir $tmpdir @ $(hostname)\\\"; ls $tmpdir; rm -r $tmpdir\" INT QUIT TERM EXIT\n  copy-feats --compress=$copy_feats_compress scp:$data/feats.scp ark,scp:$tmpdir/train.ark,$dir/train_sorted.scp || exit 1\nelse\n  # or copy the list,\n  cp $data/feats.scp $dir/train_sorted.scp\nfi\n# shuffle the list,\nutils/shuffle_list.pl --srand 777 <$dir/train_sorted.scp >$dir/train.scp\n\n# create a 10k utt subset for global cmvn estimates,\nhead -n 10000 $dir/train.scp > $dir/train.scp.10k\n\n# for debugging, add list with non-local features,\nutils/shuffle_list.pl --srand 777 <$data/feats.scp >$dir/train.scp_non_local\n\n###### OPTIONALLY IMPORT FEATURE SETTINGS ######\nivector_dim= # no ivectors,\nif [ ! 
-z $feature_transform ]; then\n  D=$(dirname $feature_transform)\n  echo \"# importing feature settings from dir '$D'\"\n  [ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n  [ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n  [ -e $D/ivector_dim ] && ivector_dim=$(cat $D/ivector_dim)\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  echo \"# cmvn_opts='$cmvn_opts' delta_opts='$delta_opts' ivector_dim='$ivector_dim'\"\nfi\n\n###### PREPARE FEATURE PIPELINE ######\n# read the features\nfeats_tr=\"ark:copy-feats scp:$dir/train.scp ark:- |\"\n\n# optionally add per-speaker CMVN\nif [ ! -z \"$cmvn_opts\" ]; then\n  echo \"+ 'apply-cmvn' with '$cmvn_opts' using statistics : $data/cmvn.scp\"\n  [ ! -r $data/cmvn.scp ] && echo \"Missing $data/cmvn.scp\" && exit 1;\n  [ ! -r $data/utt2spk ] && echo \"Missing $data/utt2spk\" && exit 1;\n  feats_tr=\"$feats_tr apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp ark:- ark:- |\"\nelse\n  echo \"# 'apply-cmvn' not used,\"\nfi\n\n# optionally add deltas\nif [ ! -z \"$delta_opts\" ]; then\n  feats_tr=\"$feats_tr add-deltas $delta_opts ark:- ark:- |\"\n  echo \"# + 'add-deltas' with '$delta_opts'\"\nfi\n\n# keep track of the config,\n[ ! -z \"$cmvn_opts\" ] && echo \"$cmvn_opts\" >$dir/cmvn_opts\n[ ! -z \"$delta_opts\" ] && echo \"$delta_opts\" >$dir/delta_opts\n#\n\n# get feature dim,\nfeat_dim=$(feat-to-dim \"$feats_tr\" -)\necho \"# feature dim : $feat_dim (input of 'feature_transform')\"\n\n# Now we start building 'feature_transform' which goes right in front of a NN.\n# The forwarding is computed on a GPU before the frame shuffling is applied.\n#\n# Same GPU is used both for 'feature_transform' and the NN training.\n# So it has to be done by a single process (we are using exclusive mode).\n# This also reduces the CPU-GPU uploads/downloads to minimum.\n\nif [ ! 
-z \"$feature_transform\" ]; then\n  echo \"# importing 'feature_transform' from '$feature_transform'\"\n  tmp=$dir/imported_$(basename $feature_transform)\n  cp $feature_transform $tmp; feature_transform=$tmp\nelse\n  # Make default proto with splice,\n  if [ ! -z $feature_transform_proto ]; then\n    echo \"# importing custom 'feature_transform_proto' from : $feature_transform_proto\"\n  else\n    echo \"+ default 'feature_transform_proto' with splice +/-$splice frames\"\n    feature_transform_proto=$dir/splice${splice}.proto\n    echo \"<Splice> <InputDim> $feat_dim <OutputDim> $(((2*splice+1)*feat_dim)) <BuildVector> -$splice:$splice </BuildVector>\" >$feature_transform_proto\n  fi\n\n  # Initialize 'feature-transform' from a prototype,\n  feature_transform=$dir/tr_$(basename $feature_transform_proto .proto).nnet\n  nnet-initialize --binary=false $feature_transform_proto $feature_transform\n\n  # Renormalize the MLP input to zero mean and unit variance,\n  feature_transform_old=$feature_transform\n  feature_transform=${feature_transform%.nnet}_cmvn-g.nnet\n  echo \"# compute normalization stats from 10k sentences\"\n  nnet-forward --print-args=true --use-gpu=yes $feature_transform_old \\\n    \"$(echo $feats_tr | sed 's|train.scp|train.scp.10k|')\" ark:- |\\\n    compute-cmvn-stats ark:- $dir/cmvn-g.stats\n  echo \"# + normalization of NN-input at '$feature_transform'\"\n  nnet-concat --print-args=false --binary=false $feature_transform_old \\\n    \"cmvn-to-nnet $dir/cmvn-g.stats -|\" $feature_transform\nfi\n\nif [ ! 
-z $ivector ]; then\n  echo\n  echo \"# ADDING IVECTOR FEATURES\"\n  # The iVectors are concatenated 'as they are' directly to the input of the neural network,\n  # To do this, we paste the features, and use <ParallelComponent> where the 1st component\n  # contains the transform and 2nd network contains <Copy> component.\n\n  echo \"# getting dims,\"\n  dim_raw=$(feat-to-dim \"$feats_tr\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_tr $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  echo \"# dims, feats-raw $dim_raw, ivectors $dim_ivec,\"\n\n  # Should we do something with 'feature_transform'?\n  if [ ! -z $ivector_dim ]; then\n    # No, the 'ivector_dim' comes from dir with 'feature_transform' with iVec forwarding,\n    echo \"# assuming we got '$feature_transform' with ivector forwarding,\"\n    [ $ivector_dim != $dim_ivec ] && \\\n    echo -n \"Error, i-vector dimensionality mismatch!\" && \\\n    echo \" (expected $ivector_dim, got $dim_ivec in $ivector)\" && exit 1\n  else\n    # Yes, adjust the transform to do ``iVec forwarding'',\n    feature_transform_old=$feature_transform\n    feature_transform=${feature_transform%.nnet}_ivec_copy.nnet\n    echo \"# setting up ivector forwarding into '$feature_transform',\"\n    dim_transformed=$(feat-to-dim \"$feats_tr nnet-forward $feature_transform_old ark:- ark:- |\" -)\n    nnet-initialize --print-args=false <(echo \"<Copy> <InputDim> $dim_ivec <OutputDim> $dim_ivec <BuildVector> 1:$dim_ivec </BuildVector>\") $dir/tr_ivec_copy.nnet\n    nnet-initialize --print-args=false <(echo \"<ParallelComponent> <InputDim> $((dim_raw+dim_ivec)) <OutputDim> $((dim_transformed+dim_ivec)) <NestedNnetFilename> $feature_transform_old $dir/tr_ivec_copy.nnet </NestedNnetFilename>\") $feature_transform\n  fi\n  echo $dim_ivec >$dir/ivector_dim # mark down the iVec dim!\n  echo $ivector_append_tool >$dir/ivector_append_tool\n\n  # pasting the iVecs to the feaures,\n  echo \"# + ivector 
input '$ivector'\"\n  feats_tr=\"$feats_tr $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n###### Show the final 'feature_transform' in the log,\necho\necho \"### Showing the final 'feature_transform':\"\nnnet-info $feature_transform\necho \"###\"\n\n###### MAKE LINK TO THE FINAL feature_transform, so the other scripts will find it ######\n[ -f $dir/final.feature_transform ] && unlink $dir/final.feature_transform\n(cd $dir; ln -s $(basename $feature_transform) final.feature_transform )\nfeature_transform=$dir/final.feature_transform\n\n\n###### GET THE DIMENSIONS ######\nnum_fea=$(feat-to-dim --print-args=false \"$feats_tr nnet-forward --use-gpu=no $feature_transform ark:- ark:- |\" - 2>/dev/null)\nnum_hid=$hid_dim\n\n\n###### PERFORM THE PRE-TRAINING ######\nfor depth in $(seq 1 $nn_depth); do\n  echo\n  echo \"# PRE-TRAINING RBM LAYER $depth\"\n  RBM=$dir/$depth.rbm\n  [ -f $RBM ] && echo \"RBM '$RBM' already trained, skipping.\" && continue\n\n  # The first RBM needs special treatment, because of Gaussian input nodes,\n  if [ \"$depth\" == \"1\" ]; then\n    # This is usually Gaussian-Bernoulli RBM (not if CNN layers are part of input transform)\n    # initialize,\n    echo \"# initializing '$RBM.init'\"\n    echo \"<Rbm> <InputDim> $num_fea <OutputDim> $num_hid <VisibleType> $input_vis_type <HiddenType> bern <ParamStddev> $param_stddev_first\" > $RBM.proto\n    nnet-initialize $RBM.proto $RBM.init 2>$dir/log/nnet-initialize.$depth.log || exit 1\n    # pre-train,\n    num_iter=$rbm_iter; [ $input_vis_type == \"gauss\" ] && num_iter=$((2*rbm_iter)) # 2x more epochs for Gaussian input\n    [ $input_vis_type == \"bern\" ] && rbm_lrate_low=$rbm_lrate # original lrate for Bernoulli input\n    echo \"# pretraining '$RBM' (input $input_vis_type, lrate $rbm_lrate_low, iters $num_iter)\"\n    rbm-train-cd1-frmshuff --learn-rate=$rbm_lrate_low --l2-penalty=$rbm_l2penalty \\\n      --num-iters=$num_iter --verbose=$verbose \\\n      
--feature-transform=$feature_transform \\\n      $rbm_extra_opts \\\n      $RBM.init \"$feats_tr\" $RBM 2>$dir/log/rbm.$depth.log || exit 1\n  else\n    # This is Bernoulli-Bernoulli RBM,\n    # cmvn stats for init,\n    echo \"# computing cmvn stats '$dir/$depth.cmvn' for RBM initialization\"\n    if [ ! -f $dir/$depth.cmvn ]; then\n      nnet-forward --print-args=false --use-gpu=yes \\\n        \"nnet-concat $feature_transform $dir/$((depth-1)).dbn - |\" \\\n        \"$(echo $feats_tr | sed 's|train.scp|train.scp.10k|')\" ark:- | \\\n      compute-cmvn-stats --print-args=false ark:- - | \\\n      cmvn-to-nnet --print-args=false - $dir/$depth.cmvn || exit 1\n    else\n      echo \"# compute-cmvn-stats already done, skipping.\"\n    fi\n    # initialize,\n    echo \"initializing '$RBM.init'\"\n    echo \"<Rbm> <InputDim> $num_hid <OutputDim> $num_hid <VisibleType> bern <HiddenType> bern <ParamStddev> $param_stddev <VisibleBiasCmvnFilename> $dir/$depth.cmvn\" > $RBM.proto\n    nnet-initialize $RBM.proto $RBM.init 2>$dir/log/nnet-initialize.$depth.log || exit 1\n    # pre-train,\n    echo \"pretraining '$RBM' (lrate $rbm_lrate, iters $rbm_iter)\"\n    rbm-train-cd1-frmshuff --learn-rate=$rbm_lrate --l2-penalty=$rbm_l2penalty \\\n      --num-iters=$rbm_iter --verbose=$verbose \\\n      --feature-transform=\"nnet-concat $feature_transform $dir/$((depth-1)).dbn - |\" \\\n      $rbm_extra_opts \\\n      $RBM.init \"$feats_tr\" $RBM 2>$dir/log/rbm.$depth.log || exit 1\n  fi\n\n  # Create DBN stack,\n  if [ \"$depth\" == \"1\" ]; then\n    echo \"# converting RBM to $dir/$depth.dbn\"\n    rbm-convert-to-nnet $RBM $dir/$depth.dbn\n  else\n    echo \"# appending RBM to $dir/$depth.dbn\"\n    nnet-concat $dir/$((depth-1)).dbn \"rbm-convert-to-nnet $RBM - |\"  $dir/$depth.dbn\n  fi\n\ndone\n\necho\necho \"# REPORT\"\necho \"# RBM pre-training progress (line per-layer)\"\ngrep progress $dir/log/rbm.*.log\necho\n\necho \"Pre-training finished.\"\n\nsleep 3\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet/train.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2017  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\n# Begin configuration.\n\nconfig=             # config, also forwarded to 'train_scheduler.sh',\n\n# topology, initialization,\nnetwork_type=dnn    # select type of neural network (dnn,cnn1d,cnn2d,lstm),\nhid_layers=4        # nr. of hidden layers (before sotfmax or bottleneck),\nhid_dim=1024        # number of neurons per layer,\nbn_dim=             # (optional) adds bottleneck and one more hidden layer to the NN,\ndbn=                # (optional) prepend layers to the initialized NN,\n\nproto_opts=         # adds options to 'make_nnet_proto.py',\ncnn_proto_opts=     # adds options to 'make_cnn_proto.py',\n\nnnet_init=          # (optional) use this pre-initialized NN,\nnnet_proto=         # (optional) use this NN prototype for initialization,\n\n# feature processing,\nsplice=5            # (default) splice features both-ways along time axis,\nonline_cmvn_opts=   # (optional) adds 'apply-cmvn-online' to input feature pipeline, see opts,\ncmvn_opts=          # (optional) adds 'apply-cmvn' to input feature pipeline, see opts,\ndelta_opts=         # (optional) adds 'add-deltas' to input feature pipeline, see opts,\nivector=            # (optional) adds 'append-vector-to-feats', the option is rx-filename for the 2nd stream,\nivector_append_tool=append-vector-to-feats # (optional) the tool for appending ivectors,\n\nfeat_type=plain\ntraps_dct_basis=11    # (feat_type=traps) nr. 
of DCT basis, 11 is good with splice=10,\ntransf=               # (feat_type=transf) import this linear tranform,\nsplice_after_transf=5 # (feat_type=transf) splice after the linear transform,\n\nfeature_transform_proto= # (optional) use this prototype for 'feature_transform',\nfeature_transform=  # (optional) directly use this 'feature_transform',\n\n# labels,\nlabels=            # (optional) specify non-default training targets,\n                   # (targets need to be in posterior format, see 'ali-to-post', 'feat-to-post'),\nnum_tgt=           # (optional) specifiy number of NN outputs, to be used with 'labels=',\n\n# training scheduler,\nlearn_rate=0.008   # initial learning rate,\nscheduler_opts=    # options, passed to the training scheduler,\ntrain_tool=        # optionally change the training tool,\ntrain_tool_opts=   # options for the training tool,\nframe_weights=     # per-frame weights for gradient weighting,\nutt_weights=       # per-utterance weights (scalar for --frame-weights),\n\n# data processing, misc.\ncopy_feats=true     # resave the train/cv features into /tmp (disabled by default),\ncopy_feats_tmproot=/tmp/kaldi.XXXX # sets tmproot for 'copy-feats',\ncopy_feats_compress=true # compress feats while resaving\nfeats_std=1.0\n\nsplit_feats=        # split the training data into N portions, one portion will be one 'epoch',\n                    # (empty = no splitting)\n\nseed=777            # seed value used for data-shuffling, nn-initialization, and training,\nskip_cuda_check=false\nskip_phoneset_check=false\n\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. 
parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 6 ]; then\n   echo \"Usage: $0 <data-train> <data-dev> <lang-dir> <ali-train> <ali-dev> <exp-dir>\"\n   echo \" e.g.: $0 data/train data/cv data/lang exp/mono_ali_train exp/mono_ali_cv exp/mono_nnet\"\n   echo \"\"\n   echo \" Training data : <data-train>,<ali-train> (for optimizing cross-entropy)\"\n   echo \" Held-out data : <data-dev>,<ali-dev> (for learn-rate scheduling, model selection)\"\n   echo \" note.: <ali-train>,<ali-dev> can point to same directory, or 2 separate directories.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>   # config containing options\"\n   echo \"\"\n   echo \"  --network-type (dnn,cnn1d,cnn2d,lstm)  # type of neural network\"\n   echo \"  --nnet-proto <file>      # use this NN prototype\"\n   echo \"  --feature-transform <file> # re-use this input feature transform\"\n   echo \"\"\n   echo \"  --feat-type (plain|traps|transf) # type of input features\"\n   echo \"  --cmvn-opts  <string>            # add 'apply-cmvn' to input feature pipeline\"\n   echo \"  --delta-opts <string>            # add 'add-deltas' to input feature pipeline\"\n   echo \"  --splice <N>                     # splice +/-N frames of input features\"\n   echo\n   echo \"  --learn-rate <float>     # initial leaning-rate\"\n   echo \"  --copy-feats <bool>      # copy features to /tmp, lowers storage stress\"\n   echo \"\"\n   exit 1;\nfi\n\ndata=$1\ndata_cv=$2\nlang=$3\nalidir=$4\nalidir_cv=$5\ndir=$6\n\n# Using alidir for supervision (default)\nif [ -z \"$labels\" ]; then\n  silphonelist=`cat $lang/phones/silence.csl`\n  for f in $alidir/final.mdl $alidir/ali.1.gz $alidir_cv/ali.1.gz; do\n    [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\n  done\nfi\n\nfor f in $data/feats.scp $data_cv/feats.scp; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\necho\necho \"# INFO\"\necho \"$0 : Training Neural Network\"\nprintf \"\\t dir       : $dir \\n\"\nprintf \"\\t Train-set : $data $(cat $data/feats.scp | wc -l), $alidir \\n\"\nprintf \"\\t CV-set    : $data_cv $(cat $data_cv/feats.scp | wc -l) $alidir_cv \\n\"\necho\n\nmkdir -p $dir/{log,nnet}\n\nif ! $skip_phoneset_check; then\n  utils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt\n  utils/lang/check_phones_compatible.sh $lang/phones.txt $alidir_cv/phones.txt\n  cp $lang/phones.txt $dir\nfi\n\n# skip when already trained,\nif [ -e $dir/final.nnet ]; then\n  echo \"SKIPPING TRAINING... ($0)\"\n  echo \"nnet already trained : $dir/final.nnet ($(readlink $dir/final.nnet))\"\n  exit 0\nfi\n\n# check if CUDA compiled in and GPU is available,\nif ! $skip_cuda_check; then cuda-gpu-available || exit 1; fi\n\n###### PREPARE ALIGNMENTS ######\necho\necho \"# PREPARING ALIGNMENTS\"\nif [ ! -z \"$labels\" ]; then\n  echo \"Using targets '$labels' (by force)\"\n  labels_tr=\"$labels\"\n  labels_cv=\"$labels\"\nelse\n  echo \"Using PDF targets from dirs '$alidir' '$alidir_cv'\"\n  # training targets in posterior format,\n  labels_tr=\"ark:ali-to-pdf $alidir/final.mdl \\\"ark:gunzip -c $alidir/ali.*.gz |\\\" ark:- | ali-to-post ark:- ark:- |\"\n  labels_cv=\"ark:ali-to-pdf $alidir/final.mdl \\\"ark:gunzip -c $alidir_cv/ali.*.gz |\\\" ark:- | ali-to-post ark:- ark:- |\"\n  # training targets for analyze-counts,\n  labels_tr_pdf=\"ark:ali-to-pdf $alidir/final.mdl \\\"ark:gunzip -c $alidir/ali.*.gz |\\\" ark:- |\"\n  labels_tr_phn=\"ark:ali-to-phones --per-frame=true $alidir/final.mdl \\\"ark:gunzip -c $alidir/ali.*.gz |\\\" ark:- |\"\n\n  # get pdf-counts, used later for decoding/aligning,\n  num_pdf=$(hmm-info $alidir/final.mdl | awk '/pdfs/{print $4}')\n  analyze-counts --verbose=1 --binary=false --counts-dim=$num_pdf \\\n    ${frame_weights:+ \"--frame-weights=$frame_weights\"} \\\n    
${utt_weights:+ \"--utt-weights=$utt_weights\"} \\\n    \"$labels_tr_pdf\" $dir/ali_train_pdf.counts 2>$dir/log/analyze_counts_pdf.log\n  # copy the old transition model, will be needed by decoder,\n  copy-transition-model --binary=false $alidir/final.mdl $dir/final.mdl\n  # copy the tree\n  cp $alidir/tree $dir/tree\n\n  # make phone counts for analysis,\n  [ -e $lang/phones.txt ] && analyze-counts --verbose=1 --symbol-table=$lang/phones.txt --counts-dim=$num_pdf \\\n    ${frame_weights:+ \"--frame-weights=$frame_weights\"} \\\n    ${utt_weights:+ \"--utt-weights=$utt_weights\"} \\\n    \"$labels_tr_phn\" /dev/null 2>$dir/log/analyze_counts_phones.log\nfi\n\n###### PREPARE FEATURES ######\necho\necho \"# PREPARING FEATURES\"\nif [ \"$copy_feats\" == \"true\" ]; then\n  echo \"# re-saving features to local disk,\"\n  tmpdir=$(mktemp -d $copy_feats_tmproot)\n  copy-feats --compress=$copy_feats_compress scp:$data/feats.scp ark,scp:$tmpdir/train.ark,$dir/train_sorted.scp\n  copy-feats --compress=$copy_feats_compress scp:$data_cv/feats.scp ark,scp:$tmpdir/cv.ark,$dir/cv.scp\n  trap \"echo '# Removing features tmpdir $tmpdir @ $(hostname)'; ls $tmpdir; rm -r $tmpdir\" EXIT\nelse\n  # or copy the list,\n  cp $data/feats.scp $dir/train_sorted.scp\n  cp $data_cv/feats.scp $dir/cv.scp\nfi\n# shuffle the list,\nutils/shuffle_list.pl --srand ${seed:-777} <$dir/train_sorted.scp >$dir/train.scp\n\n# create a 10k utt subset for global cmvn estimates,\nhead -n 10000 $dir/train.scp > $dir/train.scp.10k\n\n# split the list,\nif [ -n \"$split_feats\" ]; then\n  scps= # 1..split_feats,\n  for (( ii=1; ii<=$split_feats; ii++ )); do scps=\"$scps $dir/train.${ii}.scp\"; done\n  utils/split_scp.pl $dir/train.scp $scps\nfi\n\n# for debugging, add lists with non-local features,\nutils/shuffle_list.pl --srand ${seed:-777} <$data/feats.scp >$dir/train.scp_non_local\ncp $data_cv/feats.scp $dir/cv.scp_non_local\n\n###### OPTIONALLY IMPORT FEATURE SETTINGS (from pre-training) 
######\nivector_dim= # no ivectors,\nif [ -n \"$feature_transform\" ]; then\n  D=$(dirname $feature_transform)\n  echo \"# importing feature settings from dir '$D'\"\n  [ -e $D/online_cmvn_opts ] && online_cmvn_opts=$(cat $D/online_cmvn_opts)\n  [ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n  [ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n  [ -e $D/ivector_dim ] && ivector_dim=$(cat $D/ivector_dim)\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  echo \"# cmvn_opts='$cmvn_opts' delta_opts='$delta_opts' ivector_dim='$ivector_dim'\"\nfi\n\n###### PREPARE FEATURE PIPELINE ######\n# read the features,\nfeats_tr=\"ark:copy-feats scp:$dir/train.scp ark:- |\"\nfeats_cv=\"ark:copy-feats scp:$dir/cv.scp ark:- |\"\n\n# optionally add per-speaker CMVN,\n[ -n \"$online_cmvn_opts\" -a -n \"$cmvn_opts\" ] && echo \"Error: use \\$online_cmvn_opts or \\$cmvn_opts, not both!\" && exit 1\nif [ -n \"$online_cmvn_opts\" ]; then\n  echo \"# + 'apply-cmvn-online' with '$online_cmvn_opts' is used,\"\n  global_cmvn_stats=$dir/global_cmvn_stats.mat\n  matrix-sum --binary=false scp:$data/cmvn.scp $global_cmvn_stats\n  feats_tr=\"$feats_tr apply-cmvn-online $online_cmvn_opts $global_cmvn_stats ark:- ark:- |\"\n  feats_cv=\"$feats_cv apply-cmvn-online $online_cmvn_opts $global_cmvn_stats ark:- ark:- |\"\nelif [ -n \"$cmvn_opts\" ]; then\n  echo \"# + 'apply-cmvn' with '$cmvn_opts' using statistics : $data/cmvn.scp, $data_cv/cmvn.scp\"\n  [ ! -r $data/cmvn.scp ] && echo \"Missing $data/cmvn.scp\" && exit 1;\n  [ ! -r $data_cv/cmvn.scp ] && echo \"Missing $data_cv/cmvn.scp\" && exit 1;\n  feats_tr=\"$feats_tr apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp ark:- ark:- |\"\n  feats_cv=\"$feats_cv apply-cmvn $cmvn_opts --utt2spk=ark:$data_cv/utt2spk scp:$data_cv/cmvn.scp ark:- ark:- |\"\nelse\n  echo \"# 'apply-cmvn' is not used,\"\nfi\n\n# optionally add deltas,\nif [ ! 
-z \"$delta_opts\" ]; then\n  feats_tr=\"$feats_tr add-deltas $delta_opts ark:- ark:- |\"\n  feats_cv=\"$feats_cv add-deltas $delta_opts ark:- ark:- |\"\n  echo \"# + 'add-deltas' with '$delta_opts'\"\nfi\n\n# keep track of the config,\n[ -n \"$online_cmvn_opts\" ] && echo \"$online_cmvn_opts\" >$dir/online_cmvn_opts\n[ -n \"$cmvn_opts\" ] && echo \"$cmvn_opts\" >$dir/cmvn_opts\n[ -n \"$delta_opts\" ] && echo \"$delta_opts\" >$dir/delta_opts\n#\n\n# temoprary pipeline with first 10k,\nfeats_tr_10k=\"${feats_tr/train.scp/train.scp.10k}\"\n\n# get feature dim,\nfeat_dim=$(feat-to-dim \"$feats_tr_10k\" -)\necho \"# feature dim : $feat_dim (input of 'feature_transform')\"\n\n# Now we start building 'feature_transform' which goes right in front of a NN.\n# The forwarding is computed on a GPU before the frame shuffling is applied.\n#\n# Same GPU is used both for 'feature_transform' and the NN training.\n# So it has to be done by a single process (we are using exclusive mode).\n# This also reduces the CPU-GPU uploads/downloads to minimum.\n\nif [ -n \"$feature_transform\" ]; then\n  echo \"# importing 'feature_transform' from '$feature_transform'\"\n  tmp=$dir/imported_$(basename $feature_transform)\n  cp $feature_transform $tmp; feature_transform=$tmp\nelse\n  # Make default proto with splice,\n  if [ -n \"$feature_transform_proto\" ]; then\n    echo \"# importing custom 'feature_transform_proto' from '$feature_transform_proto'\"\n  else\n    echo \"# + default 'feature_transform_proto' with splice +/-$splice frames,\"\n    feature_transform_proto=$dir/splice${splice}.proto\n    echo \"<Splice> <InputDim> $feat_dim <OutputDim> $(((2*splice+1)*feat_dim)) <BuildVector> -$splice:$splice </BuildVector>\" >$feature_transform_proto\n  fi\n\n  # Initialize 'feature-transform' from a prototype,\n  feature_transform=$dir/tr_$(basename $feature_transform_proto .proto).nnet\n  nnet-initialize --binary=false $feature_transform_proto $feature_transform\n\n  # Choose further 
processing of spliced features\n  echo \"# feature type : $feat_type\"\n  case $feat_type in\n    plain)\n    ;;\n    traps)\n      #generate hamming+dct transform\n      feature_transform_old=$feature_transform\n      feature_transform=${feature_transform%.nnet}_hamm_dct${traps_dct_basis}.nnet\n      echo \"# + Hamming DCT transform (t$((splice*2+1)),dct${traps_dct_basis}) into '$feature_transform'\"\n      #prepare matrices with time-transposed hamming and dct\n      utils/nnet/gen_hamm_mat.py --fea-dim=$feat_dim --splice=$splice > $dir/hamm.mat\n      utils/nnet/gen_dct_mat.py --fea-dim=$feat_dim --splice=$splice --dct-basis=$traps_dct_basis > $dir/dct.mat\n      #put everything together\n      compose-transforms --binary=false $dir/dct.mat $dir/hamm.mat - | \\\n        transf-to-nnet - - | \\\n        nnet-concat --binary=false $feature_transform_old - $feature_transform\n    ;;\n    transf)\n      feature_transform_old=$feature_transform\n      feature_transform=${feature_transform%.nnet}_transf_splice${splice_after_transf}.nnet\n      [ -z $transf ] && transf=$alidir/final.mat\n      [ ! 
-f $transf ] && echo \"Missing transf $transf\" && exit 1\n      feat_dim=$(feat-to-dim \"$feats_tr_10k nnet-forward 'nnet-concat $feature_transform_old \\\"transf-to-nnet $transf - |\\\" - |' ark:- ark:- |\" -)\n      nnet-concat --binary=false $feature_transform_old \\\n        \"transf-to-nnet $transf - |\" \\\n        \"utils/nnet/gen_splice.py --fea-dim=$feat_dim --splice=$splice_after_transf |\" \\\n        $feature_transform\n    ;;\n    *)\n      echo \"Unknown feature type $feat_type\"\n      exit 1;\n    ;;\n  esac\n\n  # keep track of feat_type,\n  echo $feat_type > $dir/feat_type\n\n  # Renormalize the MLP input to zero mean and unit variance,\n  feature_transform_old=$feature_transform\n  feature_transform=${feature_transform%.nnet}_cmvn-g.nnet\n  echo \"# compute normalization stats from 10k sentences\"\n  nnet-forward --print-args=true --use-gpu=yes $feature_transform_old \\\n    \"$feats_tr_10k\" ark:- |\\\n    compute-cmvn-stats ark:- $dir/cmvn-g.stats\n  echo \"# + normalization of NN-input at '$feature_transform'\"\n  nnet-concat --binary=false $feature_transform_old \\\n    \"cmvn-to-nnet --std-dev=$feats_std $dir/cmvn-g.stats -|\" $feature_transform\nfi\n\nif [ ! -z $ivector ]; then\n  echo\n  echo \"# ADDING IVECTOR FEATURES\"\n  # The iVectors are concatenated 'as they are' directly to the input of the neural network,\n  # To do this, we paste the features, and use <ParallelComponent> where the 1st component\n  # contains the transform and 2nd network contains <Copy> component.\n\n  echo \"# getting dims,\"\n  dim_raw=$(feat-to-dim \"$feats_tr_10k\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats_tr_10k $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  echo \"# dims, feats-raw $dim_raw, ivectors $dim_ivec,\"\n\n  # Should we do something with 'feature_transform'?\n  if [ ! 
-z $ivector_dim ]; then\n    # No, the 'ivector_dim' comes from dir with 'feature_transform' with iVec forwarding,\n    echo \"# assuming we got '$feature_transform' with ivector forwarding,\"\n    [ $ivector_dim != $dim_ivec ] && \\\n    echo -n \"Error, i-vector dimensionality mismatch!\" && \\\n    echo \" (expected $ivector_dim, got $dim_ivec in $ivector)\" && exit 1\n  else\n    # Yes, adjust the transform to do ``iVec forwarding'',\n    feature_transform_old=$feature_transform\n    feature_transform=${feature_transform%.nnet}_ivec_copy.nnet\n    echo \"# setting up ivector forwarding into '$feature_transform',\"\n    dim_transformed=$(feat-to-dim \"$feats_tr_10k nnet-forward $feature_transform_old ark:- ark:- |\" -)\n    nnet-initialize --print-args=false <(echo \"<Copy> <InputDim> $dim_ivec <OutputDim> $dim_ivec <BuildVector> 1:$dim_ivec </BuildVector>\") $dir/tr_ivec_copy.nnet\n    nnet-initialize --print-args=false <(echo \"<ParallelComponent> <InputDim> $((dim_raw+dim_ivec)) <OutputDim> $((dim_transformed+dim_ivec)) \\\n                                               <NestedNnetFilename> $feature_transform_old $dir/tr_ivec_copy.nnet </NestedNnetFilename>\") $feature_transform\n  fi\n  echo $dim_ivec >$dir/ivector_dim # mark down the iVec dim!\n  echo $ivector_append_tool >$dir/ivector_append_tool\n\n  # pasting the iVecs to the features,\n  echo \"# + ivector input '$ivector'\"\n  feats_tr=\"$feats_tr $ivector_append_tool ark:- '$ivector' ark:- |\"\n  feats_cv=\"$feats_cv $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n###### Show the final 'feature_transform' in the log,\necho\necho \"### Showing the final 'feature_transform':\"\nnnet-info $feature_transform\necho \"###\"\n\n###### MAKE LINK TO THE FINAL feature_transform, so the other scripts will find it ######\n[ -f $dir/final.feature_transform ] && unlink $dir/final.feature_transform\n(cd $dir; ln -s $(basename $feature_transform) final.feature_transform 
)\nfeature_transform=$dir/final.feature_transform\n\n\n###### INITIALIZE THE NNET ######\necho\necho \"# NN-INITIALIZATION\"\nif [ ! -z $nnet_init ]; then\n  echo \"# using pre-initialized network '$nnet_init'\"\nelif [ ! -z $nnet_proto ]; then\n  echo \"# initializing NN from prototype '$nnet_proto'\";\n  nnet_init=$dir/nnet.init; log=$dir/log/nnet_initialize.log\n  nnet-initialize --seed=$seed $nnet_proto $nnet_init\nelse\n  echo \"# getting input/output dims :\"\n  # input-dim,\n  get_dim_from=$feature_transform\n  [ ! -z \"$dbn\" ] && get_dim_from=\"nnet-concat $feature_transform '$dbn' -|\"\n  num_fea=$(feat-to-dim \"$feats_tr_10k nnet-forward \\\"$get_dim_from\\\" ark:- ark:- |\" -)\n\n  # output-dim,\n  [ -z $num_tgt ] && \\\n    num_tgt=$(hmm-info --print-args=false $alidir/final.mdl | grep pdfs | awk '{ print $NF }')\n\n  # make network prototype,\n  nnet_proto=$dir/nnet.proto\n  echo \"# genrating network prototype $nnet_proto\"\n  case \"$network_type\" in\n    dnn)\n      utils/nnet/make_nnet_proto.py $proto_opts \\\n        ${bn_dim:+ --bottleneck-dim=$bn_dim} \\\n        $num_fea $num_tgt $hid_layers $hid_dim >$nnet_proto\n      ;;\n    cnn1d)\n      delta_order=$([ -z $delta_opts ] && echo \"0\" || { echo $delta_opts | tr ' ' '\\n' | grep \"delta[-_]order\" | sed 's:^.*=::'; })\n      echo \"Debug : $delta_opts, delta_order $delta_order\"\n      utils/nnet/make_cnn_proto.py $cnn_proto_opts \\\n        --splice=$splice --delta-order=$delta_order --dir=$dir \\\n        $num_fea >$nnet_proto\n      cnn_fea=$(cat $nnet_proto | grep -v '^$' | tail -n1 | awk '{ print $5; }')\n      utils/nnet/make_nnet_proto.py $proto_opts \\\n        --no-smaller-input-weights \\\n        ${bn_dim:+ --bottleneck-dim=$bn_dim} \\\n        \"$cnn_fea\" $num_tgt $hid_layers $hid_dim >>$nnet_proto\n      ;;\n    lstm)\n      utils/nnet/make_lstm_proto.py $proto_opts \\\n        $num_fea $num_tgt >$nnet_proto\n      ;;\n    blstm)\n      utils/nnet/make_blstm_proto.py 
$proto_opts \\\n        $num_fea $num_tgt >$nnet_proto\n      ;;\n    *) echo \"Unknown : --network-type $network_type\" && exit 1;\n  esac\n\n  # initialize,\n  nnet_init=$dir/nnet.init\n  echo \"# initializing the NN '$nnet_proto' -> '$nnet_init'\"\n  nnet-initialize --seed=$seed $nnet_proto $nnet_init\n\n  # optionally prepend dbn to the initialization,\n  if [ ! -z \"$dbn\" ]; then\n    nnet_init_old=$nnet_init; nnet_init=$dir/nnet_dbn_dnn.init\n    nnet-concat \"$dbn\" $nnet_init_old $nnet_init\n  fi\nfi\n\n\n###### TRAIN ######\necho\necho \"# RUNNING THE NN-TRAINING SCHEDULER\"\nsteps/nnet/train_scheduler.sh \\\n  ${scheduler_opts} \\\n  ${train_tool:+ --train-tool \"$train_tool\"} \\\n  ${train_tool_opts:+ --train-tool-opts \"$train_tool_opts\"} \\\n  ${feature_transform:+ --feature-transform $feature_transform} \\\n  ${split_feats:+ --split-feats $split_feats} \\\n  --learn-rate $learn_rate \\\n  ${frame_weights:+ --frame-weights \"$frame_weights\"} \\\n  ${utt_weights:+ --utt-weights \"$utt_weights\"} \\\n  ${config:+ --config $config} \\\n  $nnet_init \"$feats_tr\" \"$feats_cv\" \"$labels_tr\" \"$labels_cv\" $dir\n\necho \"$0: Successfully finished. '$dir'\"\n\nsleep 3\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet/train_mmi.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2013-2015  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0.\n\n# Sequence-discriminative MMI/BMMI training of DNN.\n# 4 iterations (by default) of Stochastic Gradient Descent with per-utterance updates.\n# Boosting of paths with more errors (BMMI) gets activated by '--boost <float>' option.\n\n# For the numerator we have a fixed alignment rather than a lattice--\n# this actually follows from the way lattices are defined in Kaldi, which\n# is to have a single path for each word (output-symbol) sequence.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nboost=0.0 #ie. disable boosting\nacwt=0.1\nlmwt=1.0\nlearn_rate=0.00001\nhalving_factor=1.0 #ie. disable halving\ndrop_frames=true\nverbose=0 # 0 No GPU time-stats, 1 with GPU time-stats (slower),\nivector=\n\nseed=777    # seed value used for training data shuffling\nskip_cuda_check=false\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# -ne 6 ]; then\n  echo \"Usage: $0 <data> <lang> <srcdir> <ali> <denlats> <exp>\"\n  echo \" e.g.: $0 data/train_all data/lang exp/tri3b_dnn exp/tri3b_dnn_ali exp/tri3b_dnn_denlats exp/tri3b_dnn_mmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --num-iters <N>                                  # number of iterations to run\"\n  echo \"  --acwt <float>                                   # acoustic score scaling\"\n  echo \"  --lmwt <float>                                   # linguistic score scaling\"\n  echo \"  --learn-rate <float>                             # learning rate for NN training\"\n  echo \"  --drop-frames <bool>                             # drop frames num/den completely disagree\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1), for boosted MMI.  (default 0)\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\nalidir=$4\ndenlatdir=$5\ndir=$6\n\nfor f in $data/feats.scp $denlatdir/lat.scp \\\n         $alidir/{tree,final.mdl,ali.1.gz} \\\n         $srcdir/{final.nnet,final.feature_transform}; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# check if CUDA compiled in,\nif ! 
$skip_cuda_check; then cuda-compiled || { echo \"Error, CUDA not compiled-in!\"; exit 1; } fi\n\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt\ncp $lang/phones.txt $dir\n\ncp $alidir/{final.mdl,tree} $dir\n\nsilphonelist=`cat $lang/phones/silence.csl`\n\n\n#Get the files we will need\nnnet=$srcdir/$(readlink $srcdir/final.nnet || echo final.nnet);\n[ -z \"$nnet\" ] && echo \"Error nnet '$nnet' does not exist!\" && exit 1;\ncp $nnet $dir/0.nnet; nnet=$dir/0.nnet\n\nclass_frame_counts=$srcdir/ali_train_pdf.counts\n[ -z \"$class_frame_counts\" ] && echo \"Error class_frame_counts '$class_frame_counts' does not exist!\" && exit 1;\ncp $srcdir/ali_train_pdf.counts $dir\n\nfeature_transform=$srcdir/final.feature_transform\nif [ ! -f $feature_transform ]; then\n  echo \"Missing feature_transform '$feature_transform'\"\n  exit 1\nfi\ncp $feature_transform $dir/final.feature_transform\n\nmodel=$dir/final.mdl\n[ -z \"$model\" ] && echo \"Error transition model '$model' does not exist!\" && exit 1;\n\n\n\n# Shuffle the feature list to make the GD stochastic!\n# By shuffling features, we have to use lattices with random access (indexed by .scp file).\ncat $data/feats.scp | utils/shuffle_list.pl --srand $seed >$dir/train.scp\n\n###\n### PREPARE FEATURE EXTRACTION PIPELINE\n###\n# import config,\ncmvn_opts=\ndelta_opts=\nD=$srcdir\n[ -e $D/norm_vars ] && cmvn_opts=\"--norm-means=true --norm-vars=$(cat $D/norm_vars)\" # Bwd-compatibility,\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_order ] && delta_opts=\"--delta-order=$(cat $D/delta_order)\" # Bwd-compatibility,\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark,o:copy-feats scp:$dir/train.scp ark:- |\"\n# apply-cmvn (optional),\n[ ! -z \"$cmvn_opts\" -a ! -f $data/cmvn.scp ] && echo \"$0: Missing $data/cmvn.scp\" && exit 1\n[ ! 
-z \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ ! -z \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n# add-pytel transform (optional),\n[ -e $D/pytel_transform.py ] && feats=\"$feats /bin/env python $D/pytel_transform.py |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z $ivector ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool,\n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  dim_raw=$(feat-to-dim \"$feats\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n### Record the setup,\n[ ! -z \"$cmvn_opts\" ] && echo $cmvn_opts >$dir/cmvn_opts\n[ ! 
-z \"$delta_opts\" ] && echo $delta_opts >$dir/delta_opts\n[ -e $D/pytel_transform.py ] && cp $D/pytel_transform.py $dir/pytel_transform.py\n[ -e $D/ivector_dim ] && cp $D/ivector_dim $dir/ivector_dim\n[ -e $D/ivector_append_tool ] && cp $D/ivector_append_tool $dir/ivector_append_tool\n###\n\n###\n### Prepare the alignments\n###\n# Assuming all alignments will fit into memory\nali=\"ark:gunzip -c $alidir/ali.*.gz |\"\n\n\n###\n### Prepare the lattices\n###\n# The lattices are indexed by SCP (they are not gziped because of the random access in SGD)\nlats=\"scp:$denlatdir/lat.scp\"\n\n# Optionally apply boosting\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  # make lattice scp with same order as the shuffled feature scp,\n  awk '{ if(r==0) { utt_id=$1; latH[$1]=$0; } # lat.scp\n         if(r==1) { if(latH[$1] != \"\") { print latH[$1]; } } # train.scp\n  }' r=0 $denlatdir/lat.scp r=1 $dir/train.scp > $dir/lat.scp\n  # get the list of alignments,\n  ali-to-phones $alidir/final.mdl \"$ali\" ark,t:- | awk '{print $1;}' > $dir/ali.lst\n  # remove from features sentences which have no lattice or no alignment,\n  # (so that the mmi training tool does not blow-up due to lattice caching),\n  mv $dir/train.scp $dir/train.scp_unfilt\n  awk '{ if(r==0) { latH[$1]=\"1\"; } # lat.scp\n         if(r==1) { aliH[$1]=\"1\"; } # ali.lst\n         if(r==2) { if((latH[$1] != \"\") && (aliH[$1] != \"\")) { print $0; } } # train.scp_\n  }' r=0 $dir/lat.scp r=1 $dir/ali.lst r=2 $dir/train.scp_unfilt > $dir/train.scp\n  # create the lat pipeline,\n  lats=\"ark,o:lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl scp:$dir/lat.scp '$ali' ark:- |\"\nfi\n###\n###\n###\n\n# Run several iterations of the MMI/BMMI training\ncur_mdl=$nnet\nx=1\nwhile [ $x -le $num_iters ]; do\n  echo \"Pass $x (learnrate $learn_rate)\"\n  if [ -f $dir/$x.nnet ]; then\n    echo \"Skipped, file $dir/$x.nnet exists\"\n  else\n    $cmd $dir/log/mmi.$x.log \\\n     
nnet-train-mmi-sequential \\\n       --feature-transform=$feature_transform \\\n       --class-frame-counts=$class_frame_counts \\\n       --acoustic-scale=$acwt \\\n       --lm-scale=$lmwt \\\n       --learn-rate=$learn_rate \\\n       --drop-frames=$drop_frames \\\n       --verbose=$verbose \\\n       $cur_mdl $alidir/final.mdl \"$feats\" \"$lats\" \"$ali\" $dir/$x.nnet\n  fi\n  cur_mdl=$dir/$x.nnet\n\n  #report the progress\n  grep -B 2 MMI-objective $dir/log/mmi.$x.log | sed -e 's|^[^)]*)[^)]*)||'\n\n  x=$((x+1))\n  learn_rate=$(awk \"BEGIN{print($learn_rate*$halving_factor)}\")\n\ndone\n\n(cd $dir; [ -e final.nnet ] && unlink final.nnet; ln -s $((x-1)).nnet final.nnet)\n\necho \"MMI/BMMI training finished\"\n\nif [ -e $dir/prior_counts ]; then\n  echo \"Priors are already re-estimated, skipping... ($dir/prior_counts)\"\nelse\n  echo \"Re-estimating priors by forwarding 10k utterances from training set.\"\n  . ./cmd.sh\n  nj=$(cat $alidir/num_jobs)\n  steps/nnet/make_priors.sh --cmd \"$train_cmd\" --nj $nj \\\n    ${ivector:+ --ivector \"$ivector\"} $data $dir\nfi\n\necho \"$0: Done. '$dir'\"\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet/train_mpe.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2013-2017  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0.\n\n# Sequence-discriminative MPE/sMBR training of DNN.\n# 4 iterations (by default) of Stochastic Gradient Descent with per-utterance updates.\n# We select between MPE/sMBR optimization by '--do-smbr <bool>' option.\n\n# For the numerator we have a fixed alignment rather than a lattice--\n# this actually follows from the way lattices are defined in Kaldi, which\n# is to have a single path for each word (output-symbol) sequence.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nacwt=0.1\nlmwt=1.0\nlearn_rate=0.00001\nmomentum=0.0\nhalving_factor=1.0 #ie. disable halving\ndo_smbr=true\none_silence_class=true # if true : all the `silphones' are mapped to a single class in the Forward-backward of sMBR/MPE,\n                       # (this prevents the sMBR from WER explosion, which was happenning with some data).\n                       # if false : the silphone-frames are always counted as 'wrong' in the calculation of the approximate accuracies,\nsilphonelist=          # this overrides default silphone-list (for selecting a subset of sil-phones)\n\nunkphonelist=          # dummy deprecated option, for backward compatibility,\nexclude_silphones=     # dummy deprecated option, for backward compatibility,\n\nverbose=0 # 0 No GPU time-stats, 1 with GPU time-stats (slower),\nivector=\nnnet=  # For non-default location of nnet,\n\nseed=777    # seed value used for training data shuffling\nskip_cuda_check=false\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# -ne 6 ]; then\n  echo \"Usage: $0 <data> <lang> <srcdir> <ali> <denlats> <exp>\"\n  echo \" e.g.: $0 data/train_all data/lang exp/tri3b_dnn exp/tri3b_dnn_ali exp/tri3b_dnn_denlats exp/tri3b_dnn_smbr\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --num-iters <N>                                  # number of iterations to run\"\n  echo \"  --acwt <float>                                   # acoustic score scaling\"\n  echo \"  --lmwt <float>                                   # linguistic score scaling\"\n  echo \"  --learn-rate <float>                             # learning rate for NN training\"\n  echo \"  --do-smbr <bool>                                 # do sMBR training, otherwise MPE\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\nalidir=$4\ndenlatdir=$5\ndir=$6\n\nfor f in $data/feats.scp $denlatdir/lat.scp \\\n         $alidir/{tree,final.mdl,ali.1.gz} \\\n         $srcdir/{final.nnet,final.feature_transform}; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# check if CUDA compiled in,\nif ! 
$skip_cuda_check; then cuda-compiled || { echo \"Error, CUDA not compiled-in!\"; exit 1; } fi\n\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt\ncp $lang/phones.txt $dir\n\ncp $alidir/{final.mdl,tree} $dir\n\n[ -z $silphonelist ] && silphonelist=`cat $lang/phones/silence.csl` # Default 'silphonelist',\n\n#Get the files we will need\n[ -z \"$nnet\" ] && nnet=$srcdir/$(readlink $srcdir/final.nnet || echo final.nnet);\n[ -z \"$nnet\" ] && echo \"Error nnet '$nnet' does not exist!\" && exit 1;\ncp $nnet $dir/0.nnet; nnet=$dir/0.nnet\n\nclass_frame_counts=$srcdir/ali_train_pdf.counts\n[ -z \"$class_frame_counts\" ] && echo \"Error class_frame_counts '$class_frame_counts' does not exist!\" && exit 1;\ncp $srcdir/ali_train_pdf.counts $dir\n\nfeature_transform=$srcdir/final.feature_transform\nif [ ! -f $feature_transform ]; then\n  echo \"Missing feature_transform '$feature_transform'\"\n  exit 1\nfi\ncp $feature_transform $dir/final.feature_transform\n\nmodel=$dir/final.mdl\n[ -z \"$model\" ] && echo \"Error transition model '$model' does not exist!\" && exit 1;\n\n# Shuffle the feature list to make the GD stochastic!\n# By shuffling features, we have to use lattices with random access (indexed by .scp file).\ncat $data/feats.scp | utils/shuffle_list.pl --srand $seed > $dir/train.scp\n\n[ -n \"$unkphonelist\" ] && echo \"WARNING: The option '--unkphonelist' is now deprecated. Please remove it from your recipe...\"\n[ -n \"$exclude_silphones\" ] && echo \"WARNING: The option '--exclude-silphones' is now deprecated. 
Please remove it from your recipe...\"\n\n###\n### PREPARE FEATURE EXTRACTION PIPELINE\n###\n# import config,\ncmvn_opts=\ndelta_opts=\nD=$srcdir\n[ -e $D/norm_vars ] && cmvn_opts=\"--norm-means=true --norm-vars=$(cat $D/norm_vars)\" # Bwd-compatibility,\n[ -e $D/cmvn_opts ] && cmvn_opts=$(cat $D/cmvn_opts)\n[ -e $D/delta_order ] && delta_opts=\"--delta-order=$(cat $D/delta_order)\" # Bwd-compatibility,\n[ -e $D/delta_opts ] && delta_opts=$(cat $D/delta_opts)\n#\n# Create the feature stream,\nfeats=\"ark,o:copy-feats scp:$dir/train.scp ark:- |\"\n# apply-cmvn (optional),\n[ ! -z \"$cmvn_opts\" -a ! -f $data/cmvn.scp ] && echo \"$0: Missing $data/cmvn.scp\" && exit 1\n[ ! -z \"$cmvn_opts\" ] && feats=\"$feats apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp ark:- ark:- |\"\n# add-deltas (optional),\n[ ! -z \"$delta_opts\" ] && feats=\"$feats add-deltas $delta_opts ark:- ark:- |\"\n# add-pytel transform (optional),\n[ -e $D/pytel_transform.py ] && feats=\"$feats /bin/env python $D/pytel_transform.py |\"\n\n# add-ivector (optional),\nif [ -e $D/ivector_dim ]; then\n  [ -z $ivector ] && echo \"Missing --ivector, they were used in training!\" && exit 1\n  # Get the tool,\n  ivector_append_tool=append-vector-to-feats # default,\n  [ -e $D/ivector_append_tool ] && ivector_append_tool=$(cat $D/ivector_append_tool)\n  # Check dims,\n  dim_raw=$(feat-to-dim \"$feats\" -)\n  dim_raw_and_ivec=$(feat-to-dim \"$feats $ivector_append_tool ark:- '$ivector' ark:- |\" -)\n  dim_ivec=$((dim_raw_and_ivec - dim_raw))\n  [ $dim_ivec != \"$(cat $D/ivector_dim)\" ] && \\\n    echo \"Error, i-vector dim. mismatch (expected $(cat $D/ivector_dim), got $dim_ivec in '$ivector')\" && \\\n    exit 1\n  # Append to feats,\n  feats=\"$feats $ivector_append_tool ark:- '$ivector' ark:- |\"\nfi\n\n### Record the setup,\n[ ! -z \"$cmvn_opts\" ] && echo $cmvn_opts >$dir/cmvn_opts\n[ ! 
-z \"$delta_opts\" ] && echo $delta_opts >$dir/delta_opts\n[ -e $D/pytel_transform.py ] && cp {$D,$dir}/pytel_transform.py\n[ -e $D/ivector_dim ] && cp {$D,$dir}/ivector_dim\n[ -e $D/ivector_append_tool ] && cp $D/ivector_append_tool $dir/ivector_append_tool\n###\n\n###\n### Prepare the alignments\n###\n# Assuming all alignments will fit into memory\nali=\"ark:gunzip -c $alidir/ali.*.gz |\"\n\n\n###\n### Prepare the lattices\n###\n# The lattices are indexed by SCP (they are not gziped because of the random access in SGD)\nlats=\"scp:$denlatdir/lat.scp\"\n\n\n# Run several iterations of the MPE/sMBR training\ncur_mdl=$nnet\nx=1\nwhile [ $x -le $num_iters ]; do\n  echo \"Pass $x (learnrate $learn_rate)\"\n  if [ -f $dir/$x.nnet ]; then\n    echo \"Skipped, file $dir/$x.nnet exists\"\n  else\n    #train\n    $cmd $dir/log/mpe.$x.log \\\n     nnet-train-mpe-sequential \\\n       --feature-transform=$feature_transform \\\n       --class-frame-counts=$class_frame_counts \\\n       --acoustic-scale=$acwt \\\n       --lm-scale=$lmwt \\\n       --learn-rate=$learn_rate \\\n       --momentum=$momentum \\\n       --do-smbr=$do_smbr \\\n       --verbose=$verbose \\\n       --one-silence-class=$one_silence_class \\\n       ${silphonelist:+ --silence-phones=$silphonelist} \\\n       $cur_mdl $alidir/final.mdl \"$feats\" \"$lats\" \"$ali\" $dir/$x.nnet\n  fi\n  cur_mdl=$dir/$x.nnet\n\n  #report the progress\n  grep -B 2 \"Overall average frame-accuracy\" $dir/log/mpe.$x.log | sed -e 's|.*)||'\n\n  x=$((x+1))\n  learn_rate=$(awk \"BEGIN{print($learn_rate*$halving_factor)}\")\n\ndone\n\n(cd $dir; [ -e final.nnet ] && unlink final.nnet; ln -s $((x-1)).nnet final.nnet)\n\n\necho \"MPE/sMBR training finished\"\n\nif [ -e $dir/prior_counts ]; then\n  echo \"Priors are already re-estimated, skipping... ($dir/prior_counts)\"\nelse\n  echo \"Re-estimating priors by forwarding 10k utterances from training set.\"\n  . 
./cmd.sh\n  nj=$(cat $alidir/num_jobs)\n  steps/nnet/make_priors.sh --cmd \"$train_cmd\" --nj $nj \\\n    ${ivector:+ --ivector \"$ivector\"} $data $dir\nfi\n\necho \"$0: Done. '$dir'\"\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet/train_scheduler.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2017  Brno University of Technology (author: Karel Vesely)\n# Apache 2.0\n\n# Schedules epochs and controls learning rate during the neural network training\n\n# Begin configuration.\n\n# training options,\nlearn_rate=0.008\nmomentum=0\nl1_penalty=0\nl2_penalty=0\n\n# data processing,\ntrain_tool=\"nnet-train-frmshuff\"\ntrain_tool_opts=\"--minibatch-size=256 --randomizer-size=32768 --randomizer-seed=777\"\nfeature_transform=\n\nsplit_feats= # int -> number of splits 'feats.scp -> feats.${i}.scp', starting from feats.1.scp,\n             # (data are alredy shuffled and split to N parts),\n             # empty -> no splitting,\n\n# learn rate scheduling,\nmax_iters=20\nmin_iters=0 # keep training, disable weight rejection, start learn-rate halving as usual,\nkeep_lr_iters=0 # fix learning rate for N initial epochs, disable weight rejection,\ndropout_schedule= # dropout-rates for N initial epochs, for example: 0.1,0.1,0.1,0.1,0.1,0.0\nstart_halving_impr=0.01\nend_halving_impr=0.001\nhalving_factor=0.5\n\n# misc,\nverbose=0 # 0 No GPU time-stats, 1 with GPU time-stats (slower),\nframe_weights=\nutt_weights=\n\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n[ -f path.sh ] && . ./path.sh;\n\n. parse_options.sh || exit 1;\n\nset -euo pipefail\n\nif [ $# != 6 ]; then\n   echo \"Usage: $0 <mlp-init> <feats-tr> <feats-cv> <labels-tr> <labels-cv> <exp-dir>\"\n   echo \" e.g.: $0 0.nnet scp:train.scp scp:cv.scp ark:labels_tr.ark ark:labels_cv.ark exp/dnn1\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>  # config containing options\"\n   exit 1;\nfi\n\nmlp_init=$1\nfeats_tr=$2\nfeats_cv=$3\nlabels_tr=$4\nlabels_cv=$5\ndir=$6\n\n[ ! -d $dir ] && mkdir $dir\n[ ! -d $dir/log ] && mkdir $dir/log\n[ ! 
-d $dir/nnet ] && mkdir $dir/nnet\n\ndropout_array=($(echo ${dropout_schedule} | tr ',' ' '))\n\n# Skip training\n[ -e $dir/final.nnet ] && echo \"'$dir/final.nnet' exists, skipping training\" && exit 0\n\n##############################\n# start training\n\n# choose mlp to start with,\nmlp_best=$mlp_init\nmlp_base=${mlp_init##*/}; mlp_base=${mlp_base%.*}\n\n# optionally resume training from the best epoch, using saved learning-rate,\n[ -e $dir/.mlp_best ] && mlp_best=$(cat $dir/.mlp_best)\n[ -e $dir/.learn_rate ] && learn_rate=$(cat $dir/.learn_rate)\n\n# cross-validation on original network,\nlog=$dir/log/iter00.initial.log; hostname>$log\n$train_tool --cross-validate=true --randomize=false --verbose=$verbose $train_tool_opts \\\n  ${feature_transform:+ --feature-transform=$feature_transform} \\\n  ${frame_weights:+ \"--frame-weights=$frame_weights\"} \\\n  ${utt_weights:+ \"--utt-weights=$utt_weights\"} \\\n  \"$feats_cv\" \"$labels_cv\" $mlp_best \\\n  2>> $log\n\nloss=$(cat $dir/log/iter00.initial.log | grep \"AvgLoss:\" | tail -n 1 | awk '{ print $4; }')\nloss_type=$(cat $dir/log/iter00.initial.log | grep \"AvgLoss:\" | tail -n 1 | awk '{ print $5; }')\necho \"CROSSVAL PRERUN AVG.LOSS $(printf \"%.4f\" $loss) $loss_type\"\n\n# resume lr-halving,\nhalving=0\n[ -e $dir/.halving ] && halving=$(cat $dir/.halving)\n\n# training,\nfor iter in $(seq -w $max_iters); do\n  echo -n \"ITERATION $iter: \"\n  mlp_next=$dir/nnet/${mlp_base}_iter${iter}\n\n  # skip iteration (epoch) if already done,\n  [ -e $dir/.done_iter$iter ] && echo -n \"skipping... 
\" && ls $mlp_next* && continue\n\n  # set dropout-rate from the schedule,\n  if [ -n ${dropout_array[$((${iter#0}-1))]-''} ]; then\n    dropout_rate=${dropout_array[$((${iter#0}-1))]}\n    nnet-copy --dropout-rate=$dropout_rate $mlp_best ${mlp_best}.dropout_rate${dropout_rate}\n    mlp_best=${mlp_best}.dropout_rate${dropout_rate}\n  fi\n\n  # select the split,\n  feats_tr_portion=\"$feats_tr\" # no split?\n  if [ -n \"$split_feats\" ]; then\n    portion=$((1 + iter % split_feats))\n    feats_tr_portion=\"${feats_tr/train.scp/train.${portion}.scp}\"\n  fi\n\n  # training,\n  log=$dir/log/iter${iter}.tr.log; hostname>$log\n  $train_tool --cross-validate=false --randomize=true --verbose=$verbose $train_tool_opts \\\n    --learn-rate=$learn_rate --momentum=$momentum \\\n    --l1-penalty=$l1_penalty --l2-penalty=$l2_penalty \\\n    ${feature_transform:+ --feature-transform=$feature_transform} \\\n    ${frame_weights:+ \"--frame-weights=$frame_weights\"} \\\n    ${utt_weights:+ \"--utt-weights=$utt_weights\"} \\\n    \"$feats_tr_portion\" \"$labels_tr\" $mlp_best $mlp_next \\\n    2>> $log || exit 1;\n\n  tr_loss=$(cat $dir/log/iter${iter}.tr.log | grep \"AvgLoss:\" | tail -n 1 | awk '{ print $4; }')\n  echo -n \"TRAIN AVG.LOSS $(printf \"%.4f\" $tr_loss), (lrate$(printf \"%.6g\" $learn_rate)), \"\n\n  # cross-validation,\n  log=$dir/log/iter${iter}.cv.log; hostname>$log\n  $train_tool --cross-validate=true --randomize=false --verbose=$verbose $train_tool_opts \\\n    ${feature_transform:+ --feature-transform=$feature_transform} \\\n    ${frame_weights:+ \"--frame-weights=$frame_weights\"} \\\n    ${utt_weights:+ \"--utt-weights=$utt_weights\"} \\\n    \"$feats_cv\" \"$labels_cv\" $mlp_next \\\n    2>>$log || exit 1;\n\n  loss_new=$(cat $dir/log/iter${iter}.cv.log | grep \"AvgLoss:\" | tail -n 1 | awk '{ print $4; }')\n  echo -n \"CROSSVAL AVG.LOSS $(printf \"%.4f\" $loss_new), \"\n\n  # accept or reject?\n  loss_prev=$loss\n  if [ 1 == $(awk \"BEGIN{print($loss_new < 
$loss ? 1:0);}\") -o $iter -le $keep_lr_iters -o $iter -le $min_iters ]; then\n    # accepting: the loss was better, or we had fixed learn-rate, or we had fixed epoch-number,\n    loss=$loss_new\n    mlp_best=$dir/nnet/${mlp_base}_iter${iter}_learnrate${learn_rate}_tr$(printf \"%.4f\" $tr_loss)_cv$(printf \"%.4f\" $loss_new)\n    [ $iter -le $min_iters ] && mlp_best=${mlp_best}_min-iters-$min_iters\n    [ $iter -le $keep_lr_iters ] && mlp_best=${mlp_best}_keep-lr-iters-$keep_lr_iters\n    mv $mlp_next $mlp_best\n    echo \"nnet accepted ($(basename $mlp_best))\"\n    echo $mlp_best > $dir/.mlp_best\n  else\n    # rejecting,\n    mlp_reject=$dir/nnet/${mlp_base}_iter${iter}_learnrate${learn_rate}_tr$(printf \"%.4f\" $tr_loss)_cv$(printf \"%.4f\" $loss_new)_rejected\n    mv $mlp_next $mlp_reject\n    echo \"nnet rejected ($(basename $mlp_reject))\"\n  fi\n\n  # create .done file, the iteration (epoch) is completed,\n  touch $dir/.done_iter$iter\n\n  # continue with original learn-rate,\n  [ $iter -le $keep_lr_iters ] && continue\n\n  # stopping criterion,\n  rel_impr=$(awk \"BEGIN{print(($loss_prev-$loss)/$loss_prev);}\")\n  if [ 1 == $halving -a 1 == $(awk \"BEGIN{print($rel_impr < $end_halving_impr ? 1:0);}\") ]; then\n    if [ $iter -le $min_iters ]; then\n      echo we were supposed to finish, but we continue as min_iters : $min_iters\n      continue\n    fi\n    echo finished, too small rel. improvement $rel_impr\n    break\n  fi\n\n  # start learning-rate fade-out when improvement is low,\n  if [ 1 == $(awk \"BEGIN{print($rel_impr < $start_halving_impr ? 
1:0);}\") ]; then\n    halving=1\n    echo $halving >$dir/.halving\n  fi\n\n  # reduce the learning-rate,\n  if [ 1 == $halving ]; then\n    learn_rate=$(awk \"BEGIN{print($learn_rate*$halving_factor)}\")\n    echo $learn_rate >$dir/.learn_rate\n  fi\ndone\n\n# select the best network,\nif [ $mlp_best != $mlp_init ]; then\n  mlp_final=${mlp_best}_final_\n  ( cd $dir/nnet; ln -s $(basename $mlp_best) $(basename $mlp_final); )\n  ( cd $dir; ln -s nnet/$(basename $mlp_final) final.nnet; )\n  echo \"$0: Succeeded training the Neural Network : '$dir/final.nnet'\"\nelse\n  echo \"$0: Error training neural network...\"\n  exit 1\nfi\n\n"
  },
  {
    "path": "egs/steps/nnet2/adjust_priors.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright (c) 2015, Johns Hopkins University (Yenda Trmal <jtrmal@gmail.com>)\n# License: Apache 2.0\n\n# Begin configuration section.\ncmd=run.pl\niter=final\n# End configuration section\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [opts] <degs-dir> <nnet-dir>\"\n  echo \" e.g.: $0 exp/tri4_mpe_degs exp/tri4_mpe\"\n  echo \"\"\n  echo \"Performs priors adjustment either on the final iteration\"\n  echo \"or iteration of choice of the training. The adjusted model\"\n  echo \"filename will be suffixed by \\\"adj\\\", i.e. for the final\"\n  echo \"iteration final.mdl will become final.adj.mdl\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --iter <iteration|final>                         # which iteration to be adjusted\"\n  exit 1;\nfi\n\ndegs_dir=$1\ndir=$2\n\nsrc_model=$dir/${iter}.mdl\n\nif [ ! -f $src_model ]; then\n  echo \"$0: Expecting $src_model to exist.\"\n  exit 1\nfi\n\nif [ ! 
-f $degs_dir/priors_egs.1.ark ]; then\n  echo \"$0: Expecting $degs_dir/priors_egs.1.ark to exist.\"\n  exit 1\nfi\n\nnum_archives_priors=`cat $degs_dir/info/num_archives_priors` || {\n  echo \"Could not find $degs_dir/info/num_archives_priors.\";\n  exit 1;\n}\n\n$cmd JOB=1:$num_archives_priors $dir/log/get_post.${iter}.JOB.log \\\n  nnet-compute-from-egs \"nnet-to-raw-nnet $src_model -|\" \\\n  ark:$degs_dir/priors_egs.JOB.ark ark:- \\| \\\n  matrix-sum-rows ark:- ark:- \\| \\\n  vector-sum ark:- $dir/post.${iter}.JOB.vec || {\n    echo \"Error in getting posteriors for adjusting priors.\"\n    echo \"See $dir/log/get_post.${iter}.*.log\";\n    exit 1;\n  }\n\n\n$cmd $dir/log/sum_post.${iter}.log \\\n  vector-sum $dir/post.${iter}.*.vec $dir/post.${iter}.vec || {\n    echo \"Error in summing posteriors. See $dir/log/sum_post.${iter}.log\";\n    exit 1;\n  }\n\nrm -f $dir/post.${iter}.*.vec\n\necho \"Re-adjusting priors based on computed posteriors for iter $iter\"\n$cmd $dir/log/adjust_priors.${iter}.log \\\n  nnet-adjust-priors $src_model $dir/post.${iter}.vec $dir/${iter}.adj.mdl || {\n    echo \"Error in adjusting priors. See $dir/log/adjust_priors.${iter}.log\";\n    exit 1;\n  }\n\necho \"Done adjusting priors (on $src_model)\"\n"
  },
  {
    "path": "egs/steps/nnet2/align.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Brno University of Technology (Author: Karel Vesely)\n#           2013  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments using DNN\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\ntransform_dir=\niter=final\nuse_gpu=no\nonline_ivector_dir=\nfeat_type=  # you can set this to force it to use delta features.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: $0 [--transform-dir <transform-dir>] <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/nnet4 exp/nnet4_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n\nextra_files=\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfor f in $srcdir/tree $srcdir/${iter}.mdl $data/feats.scp $lang/L.fst $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/{tree,${iter}.mdl} $dir || exit 1;\n\n\n## Set up features.  
Note: these are different from the normal features\n## because we have one rspecifier that has the features for the entire\n## training set, not separate ones for each batch.\nif [ -z \"$feat_type\" ]; then\n  if [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $srcdir/splice_opts $dir 2>/dev/null\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n   ;;\n  lda)\n    splice_opts=`cat $srcdir/splice_opts 2>/dev/null`\n    cp $srcdir/splice_opts $dir 2>/dev/null\n    cp $srcdir/final.mat $dir || exit 1;\n    feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -s $transform_dir/num_jobs ] && \\\n    echo \"$0: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n\n  if [ $feat_type == \"raw\" ]; then trans=raw_trans;\n  else trans=trans; fi\n  if [ $feat_type == \"lda\" ] && ! cmp $transform_dir/final.mat $srcdir/final.mat; then\n    echo \"$0: LDA transforms differ between $srcdir and $transform_dir\"\n    exit 1;\n  fi\n  if [ ! 
-f $transform_dir/$trans.1 ]; then\n    echo \"$0: expected $transform_dir/$trans.1 to exist (--transform-dir option)\"\n    exit 1;\n  fi\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    for n in $(seq $nj_orig); do cat $transform_dir/$trans.$n; done | \\\n       copy-feats ark:- ark,scp:$dir/$trans.ark,$dir/$trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/$trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/$trans.JOB ark:- ark:- |\"\n  fi\nfi\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  # note: subsample-feats, with negative n, will repeat each feature -n times.\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\nfi\n\necho \"$0: aligning data in $data using model from $srcdir, putting alignments in $dir\"\n\ntra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n\n$cmd JOB=1:$nj $dir/log/align.JOB.log \\\n  compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $srcdir/${iter}.mdl  $lang/L.fst \"$tra\" ark:- \\| \\\n  nnet-align-compiled $scale_opts --use-gpu=$use_gpu --beam=$beam --retry-beam=$retry_beam \\\n    $srcdir/${iter}.mdl ark:- \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\necho \"$0: done aligning data.\"\n"
  },
  {
    "path": "egs/steps/nnet2/check_ivectors_compatible.sh",
    "content": "#!/usr/bin/env bash\n# Copyright (c) 2016, Johns Hopkins University (Yenda Trmal <jtrmal@gmail.com>)\n# License: Apache 2.0\n\n# Begin configuration section.\n# End configuration section\n\n#echo >&2 \"$0 $@\"  # Print the command line for logging\nif [ $# != 2 ] ; then\n  echo >&2 \"Usage: $0  <first-dir> <second-dir>\"\n  echo >&2 \" e.g.: $0 exp/nnet3/extractor exp/nnet3/ivectors_dev10h.pem\"\n  exit 1\nfi\n\ndir_a=$1\ndir_b=$2\n\nid_a=$(steps/nnet2/get_ivector_id.sh $dir_a)\nret_a=$?\nid_b=$(steps/nnet2/get_ivector_id.sh $dir_b)\nret_b=$?\n\nif [ ! -z \"$id_a\" ] && [ ! -z \"${id_b}\" ] ; then\n  if [ \"${id_a}\" == \"${id_b}\" ]; then\n    exit 0\n  else\n    echo >&2 \"$0: ERROR: iVector id ${id_a} in $dir_a and the iVector id ${id_b} in $dir_b do not match\"\n    echo >&2 \"$0: ERROR: that means that the systems are not compatible.\"\n    exit 1\n  fi\nelif [ -z \"$id_a\" ] && [ -z \"${id_b}\" ] ; then\n    echo >&2 \"$0: WARNING: The directories do not contain iVector ID.\"\n    echo >&2 \"$0: WARNING: That means it's you who's responsible for keeping \"\n    echo >&2 \"$0: WARNING: the directories compatible\"\n    exit 0\nelse\n    echo >&2 \"$0: WARNING: One of the directories does not contain iVector ID.\"\n    echo >&2 \"$0: WARNING: That means it's you who's responsible for keeping \"\n    echo >&2 \"$0: WARNING: the directories compatible\"\n    exit 0\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/convert_lda_to_raw.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014    Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This script converts nnet2 models which expect splice+LDA as the input, into\n# models which expect raw features (e.g. MFCC) as the input.  If you include\n# the option --global-cmvn-stats <matrix>, it will also remove CMVN from the model\n# by including it as part of the neural net.\n\n\n# Begin configuration section\ncleanup=true\nglobal_cmvn_stats=\ncmd=run.pl\n# learning_rate and max_change will only make a difference if we train this model, which is unlikely.\nlearning_rate=0.00001 # give it a tiny learning rate by default; the user\n                      # should probably tune this or set it if they want to train.\nmax_change=5.0\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [options] <src-nnet-dir> <dest-nnet-dir>\"\n  echo \"e.g.: $0 --global-cmvn-stats global_cmvn.mat exp/dnn4b_nnet2 exp/dnn4b_nnet2_raw\"\n  echo \"Options include\"\n  echo \"   --global-cmvn-stats <stats-file>         # Filename of globally summed CMVN stats, if\"\n  echo \"                                            # you want to push the CMVN inside the nnet\"\n  echo \"                                            # (it won't any longer be speaker specific)\"\n  exit 1;\nfi\n\nsrc=$1\ndir=$2\n\nmkdir -p $dir/log || exit 1;\n\nfor f in $src/final.mdl $src/final.mat $src/splice_opts $src/cmvn_opts; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1\ndone\n\ncp $src/phones.txt $dir 2>/dev/null\n\nmkdir -p $dir/log\n\n# nnet.config will be a config for a few trivial neural-network layers\n# that come before the main network, and which do things like\necho -n >$dir/nnet.config\n\nif [ ! -z \"$global_cmvn_stats\" ]; then\n  [ ! 
-f $src/cmvn_opts ] && \\\n    echo \"$0: expected $src/cmvn_opts to exist\" && exit 1;\n  norm_vars=false\n  if grep 'norm-means=false' $src/cmvn_opts; then\n    echo \"$0: if --norm-means=false, don't supply the --global-cmvn-stats option to this script\"\n    exit 1;\n  elif grep 'norm-vars=true' $src/cmvn_opts; then\n    echo \"$0: warning: this script has not been tested with --norm-vars=true in CMVN options\"\n    norm_vars=true\n  fi\n\n\n  # First add to the config, layers that will do the same transform as cepstral\n  # mean and variance normalization using these global stats.  We do this as\n  # first an added offset (FixedBiasComponent), then, only if norm-vars=true\n  # in the CMVN options, a scaling (FixedScaleComponent).\n  \n  $cmd $dir/log/copy_feats.log \\\n    copy-feats --binary=false \"$global_cmvn_stats\" $dir/global_cmvn_stats.txt || exit 1;\n  cat $dir/global_cmvn_stats.txt | \\\n    perl -e ' $line0 = <STDIN>; $line0 == \"[\\n\" || die \"expected first line to be [, got $line0\";\n    $line1 = <STDIN>; $line2 = <STDIN>; @L1 = split(\" \",$line1); @L2 = split(\" \",$line2);\n    ($bias_out, $scale_out) = @ARGV;\n    open(B, \">$bias_out\") || die \"opening bias-out file $bias_out\";\n    open(S, \">$scale_out\") || die \"opening scale-out file $scale_out\";\n    pop @L2; pop @L2; # remove the \" 0 ]\"\n    $count = pop @L1;  # last element of line 1 is total count.\n    ($count > 0.0) || die \"Bad count $count\";\n    $dim = @L1;\n    $dim == scalar @L2 || die \"Bad dimension of second line of CMVN stats @L2\";\n    print B \"[ \";  print S \"[ \";\n    for ($x = 0; $x < $dim; $x++) {\n      $mean = $L1[$x] / $count;  $var = ($L2[$x] / $count) - ($mean * $mean);\n      $bias = -$mean;  print B \"$bias \";\n      $scale = 1.0 / sqrt($var); $scale > 0 || die \"Bad scale $scale\";  print S \"$scale \";\n    }\n    print B \"]\\n\";  print S \"]\\n\"; ' $dir/bias.txt $dir/scales.txt || exit 1;\n  echo \"FixedBiasComponent bias=$dir/bias.txt\" >> 
$dir/nnet.config  \n  if $norm_vars; then\n    echo \"FixedScaleComponent scales=$dir/scales.txt\" >> $dir/nnet.config  \n  fi\n  echo \"--norm-means=false --norm-vars=false\" >$dir/cmvn_opts || exit 1;\nelse\n  cp $src/cmvn_opts $dir/ || exit 1;\nfi\n\n# We need the dimension of the raw features.  We work it out from the LDA matrix dimension.\n# get a word-count of the second row of the LDA matrix...  this will be either the\n# spliced dim or the spliced dim plus one.\nspliced_dim=$(copy-matrix --binary=false $src/final.mat - | head -n 2 | tail -n 1 | wc -w) || exit 1;\n\n\nsplice_opts=$(cat $src/splice_opts) || exit 1;\n# Work out how many frames are spliced together by splicing a matrix with one element\n# and testing the resulting number of columns.\nnum_splice=$(echo \"foo [ 1.0 ]\" | splice-feats $splice_opts ark:- ark:- | feat-to-dim ark:- -)\n\n# We'll separately need the left-context and right-context.\n# defaults in the splice-feats code are 4 and 4.\nleft_context=4\nright_context=4\nfor opt in $(cat $src/splice_opts); do\n  if echo $opt | grep left-context  >/dev/null; then\n    left_context=$(echo $opt | cut -d= -f2) || exit 1;\n  fi\n  if echo $opt | grep right-context  >/dev/null; then\n    right_context=$(echo $opt | cut -d= -f2) || exit 1;\n  fi\ndone\nif ! 
[ $num_splice -eq $[$left_context+1+$right_context] ]; then\n  echo \"$0: num-splice worked out from the binaries differs from our interpretation of the options:\"\n  echo \"$num_splice != $left_context + 1 + $right_context\"\n  exit 1;\nfi\n\nmodulo=$[$spliced_dim%$num_splice]\nif [ $modulo -eq 1 ]; then\n  # matrix includes offset term.\n  spliced_dim=$[$spliced_dim-1];\n  cp $src/final.mat $dir/\nelif [ $modulo -eq 0 ]; then\n  # We need to add a zero bias term to the matrix, because the AffineComponent\n  # expects that.\n  copy-matrix --binary=false $src/final.mat - | \\\n    awk '{if ($NF == \"]\") { $NF = \"0\"; print $0, \"]\"; } else { if (NF > 1) { print $0, \"0\"; } else {print;}}}' >$dir/final.mat\nelse\n  echo \"$0: Cannot make sense of spliced dimension $spliced_dim and num-splice=$num_splice\"\n  exit 1;\nfi\nfeat_dim=$[$spliced_dim/$num_splice];\necho \"SpliceComponent input-dim=$feat_dim left-context=$left_context right-context=$right_context\" >>$dir/nnet.config\n\n# use AffineComponentPreconditioned as it's easier to configure than AffineComponentPreconditionedOnline.\necho \"AffineComponentPreconditioned alpha=4.0 learning-rate=$learning_rate max-change=$max_change matrix=$dir/final.mat\" >>$dir/nnet.config\n\n\n$cmd $dir/log/nnet_init.log \\\n  nnet-init $dir/nnet.config $dir/lda.nnet || exit 1;\n\n$cmd $dir/log/nnet_insert.log \\\n  nnet-insert --insert-at=0 --randomize-next-component=false \\\n   $src/final.mdl $dir/lda.nnet $dir/final.mdl || exit 1;\n\nif $cleanup; then\n  rm $dir/final.mat $dir/lda.nnet\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/convert_nnet1_to_nnet2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014    Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This script converts nnet1 into nnet2 models.\n# Note, it doesn't support all possible types of nnet1 models.\n\n# Begin configuration section\ncleanup=true\ncmd=run.pl\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [options] <src-nnet1-dir> <dest-nnet2-dir>\"\n  echo \"e.g.: $0 exp/dnn4b_pretrain-dbn_dnn exp/dnn4b_nnet2\"\n  exit 1;\nfi\n\nsrc=$1\ndir=$2\n\nmkdir -p $dir/log || exit 1;\n\nfor f in $src/final.mdl $src/final.feature_transform $src/ali_train_pdf.counts; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1\ndone\n\ncp $src/phones.txt $dir 2>/dev/null\n\n$cmd $dir/log/convert_feature_transform.log \\\n  nnet1-to-raw-nnet $src/final.feature_transform $dir/0.raw || exit 1;\n\n\nif [ -f $src/final.nnet ]; then\n  echo \"$0: $src/final.nnet exists, using it as input.\"\n  $cmd $dir/log/convert_model.log \\\n    nnet1-to-raw-nnet $src/final.nnet $dir/1.raw || exit 1;\nelif [ -f $src/final.dbn ]; then\n  echo \"$0: $src/final.dbn exists, using it as input.\"\n  num_leaves=$(am-info $src/final.mdl | grep -w pdfs | awk '{print $NF}') || exit 1;\n  # Read the output dimension from the DBN we are about to convert, not from a hard-coded path.\n  dbn_output_dim=$(nnet-info $src/final.dbn | grep component | tail -n 1 | sed s:,::g | awk '{print $NF}') || exit 1;\n  [ -z \"$dbn_output_dim\" ] && exit 1;\n  \n  cat > $dir/final_layer.conf <<EOF\nAffineComponent input-dim=$dbn_output_dim output-dim=$num_leaves learning-rate=0.001\nSoftmaxComponent dim=$num_leaves\nEOF\n  $cmd $dir/log/convert_model.log \\\n    nnet1-to-raw-nnet $src/final.dbn - \\| \\\n    raw-nnet-concat - \"raw-nnet-init $dir/final_layer.conf -|\" $dir/1.raw || exit 1;\nelse\n  echo \"$0: expected either $src/final.nnet or $src/final.dbn to exist\"\n  exit 1;\nfi\n\n$cmd 
$dir/log/append_model.log \\\n  raw-nnet-concat $dir/0.raw $dir/1.raw $dir/concat.raw || exit 1;\n\n$cmd $dir/log/init_model.log \\\n  nnet-am-init $src/final.mdl $dir/concat.raw $dir/final_noprior.mdl || exit 1;\n\n$cmd $dir/log/set_priors.log \\\n  nnet-adjust-priors $dir/final_noprior.mdl $src/ali_train_pdf.counts $dir/final.mdl || exit 1;\n\nif $cleanup; then\n  rm $dir/0.raw $dir/1.raw $dir/concat.raw $dir/final_noprior.mdl\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/create_appended_model.sh",
    "content": "#!/usr/bin/env bash\n\n#  Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n#  Apache 2.0.\n\n# This script is for use with \"retrain_fast.sh\"; it combines the original model\n# that you trained on top of, with the single layer model you trained, so that\n# you can do joint backpropagation.\n\n# Begin configuration options.\ncmd=run.pl\ncleanup=true  # if true, remove the intermediate raw nnet at the end.\n# End configuration options.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 <original-nnet-dir> <new-nnet-dir> <combined-nnet-dir>\"\n  echo \"where <original-nnet-dir> will typically be a normal neural net from another corpus,\"\n  echo \"and <new-nnet-dir> will usually be a single-layer neural net trained on top of it by\"\n  echo \"dumping the activations (e.g. using steps/online/nnet2/dump_nnet_activations.sh, I\"\n  echo \"think no such script exists for non-online), and then training using\"\n  echo \"steps/nnet2/retrain_fast.sh.\"\n  echo \"e.g.: $0 ../../swbd/s5b/exp/nnet2_online/nnet_gpu_online exp/nnet2_swbd_online/nnet_gpu_online exp/nnet2_swbd_online/nnet_gpu_online_combined\"\n  exit 1;\nfi\n\n\nsrc1=$1\nsrc2=$2\ndir=$3\n\nfor f in $src1/final.mdl $src2/tree $src2/final.mdl; do\n   [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\n\nmkdir -p $dir/log\n\ninfo=$dir/nnet_info\nnnet-am-info $src1/final.mdl >$info\nnc=$(grep num-components $info | awk '{print $2}');\nif grep SumGroupComponent $info >/dev/null; then \n  nc_truncate=$[$nc-3]  # we did mix-up: remove AffineComponent,\n                        # SumGroupComponent, SoftmaxComponent\nelse\n                        # we didn't mix-up:\n  nc_truncate=$[$nc-2]  # remove AffineComponent, SoftmaxComponent\nfi\n\n$cmd $dir/log/get_raw_nnet.log \\\n nnet-to-raw-nnet --truncate=$nc_truncate $src1/final.mdl $dir/first_nnet.raw || exit 1;\n\n$cmd $dir/log/append_nnet.log \\\n  nnet-insert --randomize-next-component=false --insert-at=0 \\\n  $src2/final.mdl $dir/first_nnet.raw $dir/final.mdl || exit 1;\n\n$cleanup && rm $dir/first_nnet.raw\n\n# Copy the tree etc., \n\ncp $src2/tree $dir || exit 1;\n\n# Copy feature-related things from src1 where we built the initial model.\n# Note: if you've done anything like mess with the feature-extraction configs,\n# or changed the feature type, you have to keep track of that yourself.\nfor f in final.mat cmvn_opts splice_opts; do\n  if [ -f $src1/$f ]; then\n    cp $src1/$f $dir || exit 1;\n  fi\ndone\n\necho \"$0: created appended model in $dir\"\n"
  },
  {
    "path": "egs/steps/nnet2/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This script does decoding with a neural-net.  If the neural net was built on\n# top of fMLLR transforms from a conventional system, you should provide the\n# --transform-dir option.\n\n# Begin configuration section.\nstage=1\ntransform_dir=    # dir to find fMLLR transforms.\nnj=4 # number of decoding jobs.  If --transform-dir set, must match that number!\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning.\ncmd=run.pl\nbeam=15.0\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0 # Beam we use in lattice generation.\niter=final\nnum_threads=1 # if >1, will use nnet-latgen-faster-parallel\nparallel_opts=  # ignored now.\nscoring_opts=\nskip_scoring=false\nfeat_type=\nonline_ivector_dir=\nminimize=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \" e.g.: $0 --transform-dir exp/tri3b/decode_dev93_tgpr \\\\\"\n  echo \"      exp/tri3b/graph_tgpr data/test_dev93 exp/tri4a_nnet/decode_dev93_tgpr\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 15.0\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n  echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n  echo \"  --parallel-opts <opts>                   # e.g. '--num-threads 4' if you supply --num-threads 4\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nmodel=$srcdir/$iter.mdl\n\n\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $graphdir/HCLG.fst $data/feats.scp $model $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n## Set up features.\nif [ -z \"$feat_type\" ]; then\n  if [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=raw; fi\n  echo \"$0: feature type is $feat_type\"\nfi\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n  if [ -f $srcdir/delta_order ]; then\n    delta_order=`cat $srcdir/delta_order 2>/dev/null`\n    feats=\"$feats add-deltas --delta-order=$delta_order ark:- ark:- |\"\n  fi\n    ;;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -s $transform_dir/num_jobs ] && \\\n    echo \"$0: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n\n  if [ $feat_type == \"raw\" ]; then trans=raw_trans;\n  else trans=trans; fi\n  if [ $feat_type == \"lda\" ] && \\\n    ! cmp $transform_dir/../final.mat $srcdir/final.mat && \\\n    ! cmp $transform_dir/final.mat $srcdir/final.mat; then\n    echo \"$0: LDA transforms differ between $srcdir and $transform_dir\"\n    exit 1;\n  fi\n  if [ ! 
-f $transform_dir/$trans.1 ]; then\n    echo \"$0: expected $transform_dir/$trans.1 to exist (--transform-dir option)\"\n    exit 1;\n  fi\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    for n in $(seq $nj_orig); do cat $transform_dir/$trans.$n; done | \\\n       copy-feats ark:- ark,scp:$dir/$trans.ark,$dir/$trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/$trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/$trans.JOB ark:- ark:- |\"\n  fi\nelif grep 'transform-feats --utt2spk' $srcdir/log/train.1.log >&/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using a neural net system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option in test time.\"\nfi\n##\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  # note: subsample-feats, with negative n, will repeat each feature -n times.\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- | copy-matrix --scale=$ivector_scale ark:- ark:-|' ark:- |\"\nfi\n\nif [ $stage -le 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet-latgen-faster$thread_string \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --word-symbol-table=$graphdir/words.txt \"$model\" \\\n     $graphdir/HCLG.fst \"$feats\" \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  [ ! 
-z $iter ] && iter_opt=\"--iter $iter\"\n  steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\nfi\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\n\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $iter_opt $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet2/dump_bottleneck_features.sh",
    "content": "#!/usr/bin/env bash\n\n#           2014  Pegah Ghahremani\n# Apache 2.0\n\n\n# Begin configuration section.\nfeat_type=\nstage=1\nnj=4\ncmd=run.pl\n\n# Begin configuration.\ntransform_dir=\n\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"usage: steps/nnet2/dump_bottleneck_features.sh <input-data-dir> <output-data-dir> <bnf-nnet-dir> <archive-dir> <log-dir>\"\n   echo \"e.g.:  steps/nnet2/dump_bottleneck_features.sh data/train data/train_bnf exp_bnf/bnf_net mfcc exp_bnf/dump_bnf\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nbnf_data=$2\nnnetdir=$3\narchivedir=$4\ndir=$5\n\n# Assume that final.raw is in nnetdir\nbnf_nnet=$nnetdir/final.raw\nif [ ! -f $bnf_nnet ] ; then\n  echo \"No such file $bnf_nnet\";\n  exit 1;\nfi\n\n## Set up input features of nnet\nif [ -z \"$feat_type\" ]; then\n  if [ -f $nnetdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\nfi\necho \"$0: feature type is $feat_type\"\n\nif [ \"$feat_type\" == \"lda\" ] && [ ! 
-f $nnetdir/final.mat ]; then\n  echo \"$0: no such file $nnetdir/final.mat\"\n  exit 1\nfi\n\nname=`basename $data`\nsdata=$data/split$nj\n\nmkdir -p $dir/log\nmkdir -p $bnf_data\necho $nj > $nnetdir/num_jobs\nsplice_opts=`cat $nnetdir/splice_opts 2>/dev/null`\ndelta_opts=`cat $nnetdir/delta_opts 2>/dev/null`\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\";;\n  delta) feats=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $nnetdir/final.mat ark:- ark:- |\"\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"Using transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"No such file $transform_dir/trans.1\" && exit 1;\n  transform_nj=`cat $transform_dir/num_jobs` || exit 1;\n  if [ \"$nj\" != \"$transform_nj\" ]; then\n    for n in $(seq $transform_nj); do cat $transform_dir/trans.$n; done >$dir/trans.ark\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.ark ark:- ark:- |\"\n  else\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\n  fi\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"Making BNF scp and ark.\"\n  $cmd JOB=1:$nj $dir/log/make_bnf_$name.JOB.log \\\n    nnet-compute $bnf_nnet \"$feats\" ark:- \\| \\\n    copy-feats --compress=true ark:- ark,scp:$archivedir/raw_bnfeat_$name.JOB.ark,$archivedir/raw_bnfeat_$name.JOB.scp || exit 1;\nfi\n\nrm $dir/trans.ark 2>/dev/null\n\nN0=$(cat $data/feats.scp | wc -l)\nN1=$(cat $archivedir/raw_bnfeat_$name.*.scp | wc -l)\nif [[ \"$N0\" != \"$N1\" ]]; then\n  echo \"Error happens when generating BNF for $name (Original:$N0  BNF:$N1)\"\n  exit 1;\nfi\n\n# Concatenate feats.scp into bnf_data\nfor n in $(seq $nj); do  cat $archivedir/raw_bnfeat_$name.$n.scp; done > $bnf_data/feats.scp\n\nfor f in segments spk2utt text utt2spk wav.scp char.stm glm kws reco2file_and_channel stm; do\n  [ -e $data/$f ] && cp -r $data/$f $bnf_data/$f\ndone\n\necho \"$0: computing CMVN stats.\"\nsteps/compute_cmvn_stats.sh $bnf_data $dir $archivedir\n\necho \"$0: done making BNF feats.scp.\"\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet2/get_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the neural net (and also\n# the validation examples used for diagnostics), and puts them in separate archives.\n\n# Begin configuration section.\ncmd=run.pl\nfeat_type=\nnum_utts_subset=300    # number of utterances in validation and training\n                       # subsets used for shrinkage and diagnostics\nnum_valid_frames_combine=0 # #valid frames for combination weights at the very end.\nnum_train_frames_combine=10000 # # train frames for the above.\nnum_frames_diagnostic=4000 # number of frames for \"compute_prob\" jobs\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This is just a guideline; it will pick a number\n                        # that divides the number of samples in the entire data.\ntransform_dir=     # If supplied, overrides alidir\nnum_jobs_nnet=16    # Number of neural net jobs to run in parallel\nstage=0\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nleft_context=\nright_context=\nrandom_copy=false\nonline_ivector_dir=\nivector_randomize_prob=0.0 # if >0.0, randomizes iVectors during training with\n                           # this prob per iVector.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/nnet2/get_egs.sh [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/nnet2/get_egs.sh data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-jobs-nnet <num-jobs;16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --samples-per-iter <#samples;200000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --feat-type <lda|raw>                            # (by default it tries to guess).  
The feature type you want\"\n  echo \"                                                   # to use as input to the neural net.\"\n  echo \"  --splice-width <width;4>                         # Number of frames on each side to append for feature input\"\n  echo \"  --left-context <width;4>                         # Number of frames on left side to append for feature input, overrides splice-width\"\n  echo \"  --right-context <width;4>                        # Number of frames on right side to append for feature input, overrides splice-width\"\n  echo \"  --num-frames-diagnostic <#frames;4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames;0>           # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2  # only used for sanity checks below (L.fst, phones.txt).\nalidir=$3\ndir=$4\n\n[ -z \"$left_context\" ] && left_context=$splice_width\n[ -z \"$right_context\" ] && right_context=$splice_width\n\n\n# Check some files.\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/valid_uttlist || exit 1;\n\nif [ -f $data/utt2uniq ]; then\n  echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n  rm $dir/uniq2utt $dir/valid_uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl | head -$num_utts_subset > $dir/train_subset_uttlist || exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\n\n## Set up features.\nif [ -z $feat_type ]; then\n  if [ -f $alidir/final.mat ] && [ ! 
-f $transform_dir/raw_trans.1 ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\n    valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n    train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n    echo $cmvn_opts >$dir/cmvn_opts\n   ;;\n  lda)\n    splice_opts=`cat $alidir/splice_opts 2>/dev/null`\n    cp $alidir/{splice_opts,cmvn_opts,final.mat} $dir || exit 1;\n    [ ! -z \"$cmvn_opts\" ] && \\\n       echo \"You cannot supply --cmvn-opts option if feature type is LDA.\" && exit 1;\n    cmvn_opts=$(cat $dir/cmvn_opts)\n    feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ -f $transform_dir/trans.1 ] && [ $feat_type != \"raw\" ]; then\n  echo \"$0: using transforms 
from $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\n  valid_feats=\"$valid_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/trans.*|' ark:- ark:- |\"\n  train_subset_feats=\"$train_subset_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/trans.*|' ark:- ark:- |\"\nfi\nif [ -f $transform_dir/raw_trans.1 ] && [ $feat_type == \"raw\" ]; then\n  echo \"$0: using raw-fMLLR transforms from $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/raw_trans.JOB ark:- ark:- |\"\n  valid_feats=\"$valid_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/raw_trans.*|' ark:- ark:- |\"\n  train_subset_feats=\"$train_subset_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/raw_trans.*|' ark:- ark:- |\"\nfi\nif [ ! -z \"$online_ivector_dir\" ]; then\n  feats_one=\"$(echo \"$feats\" | sed s:JOB:1:g)\"\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  ivectors_opt=\"--const-feat-dim=$ivector_dim\"\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- | ivector-randomize --randomize-prob=$ivector_randomize_prob ark:- ark:- |' ark:- |\"\n  valid_feats=\"$valid_feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- | ivector-randomize --randomize-prob=$ivector_randomize_prob ark:- ark:- |' ark:- |\"\n  train_subset_feats=\"$train_subset_feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist 
$online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- | ivector-randomize --randomize-prob=$ivector_randomize_prob ark:- ark:- |' ark:- |\"\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/num_frames\nelse\n  num_frames=`cat $dir/num_frames` || exit 1;\nfi\n\n# Working out number of iterations per epoch.\niters_per_epoch=`perl -e \"print int($num_frames/($samples_per_iter * $num_jobs_nnet) + 0.5);\"` || exit 1;\n[ $iters_per_epoch -eq 0 ] && iters_per_epoch=1\nsamples_per_iter_real=$[$num_frames/($num_jobs_nnet*$iters_per_epoch)]\necho \"$0: Every epoch, splitting the data up into $iters_per_epoch iterations,\"\necho \"$0: giving samples-per-iteration of $samples_per_iter_real (you requested $samples_per_iter).\"\n\n# Making soft links to storage directories.  This is a no-op unless\n# the subdirectory $dir/egs/storage/ exists.  
See utils/create_split_dir.pl\nfor x in `seq 1 $num_jobs_nnet`; do\n  for y in `seq 0 $[$iters_per_epoch-1]`; do\n    utils/create_data_link.pl $dir/egs/egs.$x.$y.ark\n    utils/create_data_link.pl $dir/egs/egs_tmp.$x.$y.ark\n  done\n  for y in `seq 1 $nj`; do\n    utils/create_data_link.pl $dir/egs/egs_orig.$x.$y.ark\n  done\ndone\n\nremove () { for x in $*; do [ -L $x ] && rm $(utils/make_absolute.sh $x); rm $x; done }\n\nnnet_context_opts=\"--left-context=$left_context --right-context=$right_context\"\nmkdir -p $dir/egs\n\nif [ $stage -le 2 ]; then\n  echo \"Getting validation and training subset examples.\"\n  rm $dir/.error 2>/dev/null\n  echo \"$0: extracting validation and training-subset alignments.\"\n  set -o pipefail;\n  for id in $(seq $nj); do gunzip -c $alidir/ali.$id.gz; done | \\\n    copy-int-vector ark:- ark,t:- | \\\n    utils/filter_scp.pl <(cat $dir/valid_uttlist $dir/train_subset_uttlist) | \\\n    gzip -c >$dir/ali_special.gz || exit 1;\n  set +o pipefail; # unset the pipefail option.\n\n  $cmd $dir/log/create_valid_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$valid_feats\" \\\n    \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/egs/valid_all.egs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$train_subset_feats\" \\\n     \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/egs/train_subset_all.egs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && echo \"Error detected while creating train/valid egs\" && exit 1\n  echo \"Getting subsets of validation examples for diagnostics and combination.\"\n  $cmd $dir/log/create_valid_subset_combine.log \\\n    nnet-subset-egs --n=$num_valid_frames_combine ark:$dir/egs/valid_all.egs \\\n        ark:$dir/egs/valid_combine.egs || touch 
$dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/egs/valid_all.egs \\\n    ark:$dir/egs/valid_diagnostic.egs || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet-subset-egs --n=$num_train_frames_combine ark:$dir/egs/train_subset_all.egs \\\n    ark:$dir/egs/train_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/egs/train_subset_all.egs \\\n    ark:$dir/egs/train_diagnostic.egs || touch $dir/.error &\n  wait\n  cat $dir/egs/valid_combine.egs $dir/egs/train_combine.egs > $dir/egs/combine.egs\n\n  for f in $dir/egs/{combine,train_diagnostic,valid_diagnostic}.egs; do\n    [ ! -s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n  rm $dir/egs/valid_all.egs $dir/egs/train_subset_all.egs $dir/egs/{train,valid}_combine.egs $dir/ali_special.gz\nfi\n\nif [ $stage -le 3 ]; then\n  # Other scripts might need to know the following info:\n  echo $num_jobs_nnet >$dir/egs/num_jobs_nnet\n  echo $iters_per_epoch >$dir/egs/iters_per_epoch\n  echo $samples_per_iter_real >$dir/egs/samples_per_iter\n\n  echo \"Creating training examples\";\n  # in $dir/egs, create $num_jobs_nnet separate files with training examples.\n  # The order is not randomized at this point.\n\n  egs_list=\n  for n in `seq 1 $num_jobs_nnet`; do\n    egs_list=\"$egs_list ark:$dir/egs/egs_orig.$n.JOB.ark\"\n  done\n  echo \"Generating training examples on disk\"\n  # The examples will go round-robin to egs_list.\n  $cmd $io_opts JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$feats\" \\\n    \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" ark:- \\| \\\n    nnet-copy-egs ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: rearranging examples into parts for different 
parallel jobs\"\n  # combine all the \"egs_orig.JOB.*.scp\" (over the $nj splits of the data) and\n  # then split into multiple parts egs.JOB.*.scp for different parts of the\n  # data, 0 .. $iters_per_epoch-1.\n\n  if [ $iters_per_epoch -eq 1 ]; then\n    echo \"$0: Since iters-per-epoch == 1, just concatenating the data.\"\n    for n in `seq 1 $num_jobs_nnet`; do\n      cat $dir/egs/egs_orig.$n.*.ark > $dir/egs/egs_tmp.$n.0.ark || exit 1;\n      remove $dir/egs/egs_orig.$n.*.ark\n    done\n  else # We'll have to split it up using nnet-copy-egs.\n    egs_list=\n    for n in `seq 0 $[$iters_per_epoch-1]`; do\n      egs_list=\"$egs_list ark:$dir/egs/egs_tmp.JOB.$n.ark\"\n    done\n    # note, the \"|| true\" below is a workaround for NFS bugs\n    # we encountered running this script with Debian-7, NFS-v4.\n    $cmd $io_opts JOB=1:$num_jobs_nnet $dir/log/split_egs.JOB.log \\\n      nnet-copy-egs --random=$random_copy --srand=JOB \\\n        \"ark:cat $dir/egs/egs_orig.JOB.*.ark|\" $egs_list || exit 1;\n    remove $dir/egs/egs_orig.*.*.ark  2>/dev/null\n  fi\nfi\n\nif [ $stage -le 5 ]; then\n  # Next, shuffle the order of the examples in each of those files.\n  # Each one should not be too large, so we can do this in memory.\n  echo \"Shuffling the order of training examples\"\n  echo \"(in order to avoid stressing the disk, these won't all run at once).\"\n\n  for n in `seq 0 $[$iters_per_epoch-1]`; do\n    $cmd $io_opts JOB=1:$num_jobs_nnet $dir/log/shuffle.$n.JOB.log \\\n      nnet-shuffle-egs \"--srand=\\$[JOB+($num_jobs_nnet*$n)]\" \\\n      ark:$dir/egs/egs_tmp.JOB.$n.ark ark:$dir/egs/egs.JOB.$n.ark\n    remove $dir/egs/egs_tmp.*.$n.ark\n  done\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet2/get_egs2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the neural net (and also\n# the validation examples used for diagnostics), and puts them in separate archives.\n#\n# This script differs from get_egs.sh in that it dumps egs with several frames\n# of labels, controlled by the frames_per_eg config variable (default: 8).  This\n# takes many times less disk space because typically we have 4 to 7 frames of\n# context on the left and right, and this ends up getting shared.  This is at\n# the expense of slightly higher disk I/O during training time.\n#\n# We also have a simpler way of dividing the egs up into pieces, with one level\n# of index, so we have $dir/egs.{0,1,2,...}.ark instead of having two levels of\n# indexes.  The extra files we write to $dir that explain the structure are\n# $dir/info/num_archives, which contains the number of files egs.*.ark, and\n# $dir/info/frames_per_eg, which contains the number of frames of labels per eg\n# (e.g. 7), and $dir/samples_per_archive.  These replace the files\n# iters_per_epoch and num_jobs_nnet and egs_per_iter that the previous script\n# wrote to.  This script takes the directory where the \"egs\" are located as the\n# argument, not the directory one level up.\n\n# Begin configuration section.\ncmd=run.pl\nfeat_type=          # e.g. set it to \"raw\" to use raw MFCC\nframes_per_eg=8   # number of frames of labels per example.  
more->less disk space and\n                  # less time preparing egs, but more I/O during training.\n                  # note: the script may reduce this if reduce_frames_per_eg is true.\nleft_context=4    # amount of left-context per eg\nright_context=4   # amount of right-context per eg\ndelta_order=      # delta feature order\n\nreduce_frames_per_eg=true  # If true, this script may reduce the frames_per_eg\n                           # if there is only one archive and even with the\n                           # reduced frames_per_eg, the number of\n                           # samples_per_iter that would result is less than or\n                           # equal to the user-specified value.\nnum_utts_subset=300     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\nnum_valid_frames_combine=0 # #valid frames for combination weights at the very end.\nnum_train_frames_combine=10000 # # train frames for the above.\nnum_frames_diagnostic=4000 # number of frames for \"compute_prob\" jobs\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This is just a guideline; it will pick a number\n                        # that divides the number of samples in the entire data.\n\ntransform_dir=     # If supplied, overrides alidir as the place to find fMLLR transforms\npostdir=        # If supplied, we will use posteriors in it as soft training targets.\n\nstage=0\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\nrandom_copy=false\nonline_ivector_dir=  # can be used if we are including speaker information as iVectors.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  
This is used to turn off CMVN in the online-nnet experiments.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <data> <ali-dir> <egs-dir>\"\n  echo \" e.g.: $0 data/train exp/tri3_ali exp/tri4_nnet/egs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --samples-per-iter <#samples;400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --feat-type <lda|raw>                            # (by default it tries to guess).  The feature type you want\"\n  echo \"                                                   # to use as input to the neural net.\"\n  echo \"  --frames-per-eg <frames;8>                       # number of frames per eg on disk\"\n  echo \"  --left-context <width;4>                         # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <width;4>                        # Number of frames on right side to append for feature input\"\n  echo \"  --num-frames-diagnostic <#frames;4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames;0>           # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nalidir=$2\ndir=$3\n\n\n# Check some files.\n[ ! 
-z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $alidir/ali.1.gz $alidir/final.mdl $alidir/tree $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log $dir/info\ncp $alidir/tree $dir\n\nnum_utts=$(cat $data/utt2spk | wc -l)\nif ! [ $num_utts -gt $[$num_utts_subset*4] ]; then\n  echo \"$0: number of utterances $num_utts in your training data is too small versus --num-utts-subset=$num_utts_subset\"\n  echo \"... you probably have so little data that it doesn't make sense to train a neural net.\"\n  exit 1\nfi\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/valid_uttlist || exit 1;\n\nif [ -f $data/utt2uniq ]; then\n  echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n  rm $dir/uniq2utt $dir/valid_uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl | head -$num_utts_subset > $dir/train_subset_uttlist || exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\n\n## Set up features.\nif [ -z $feat_type ]; then\n  if [ -f $alidir/final.mat ] && [ ! 
-f $transform_dir/raw_trans.1 ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\n    valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n    train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n    echo $cmvn_opts >$dir/cmvn_opts # caution: the top-level nnet training script should copy this to its own dir now.\n    if [ ! -z \"$delta_order\" ]; then\n      feats=\"$feats add-deltas --delta-order=$delta_order ark:- ark:- |\"\n      valid_feats=\"$valid_feats add-deltas --delta-order=$delta_order ark:- ark:- |\"\n      train_subset_feats=\"$train_subset_feats add-deltas --delta-order=$delta_order ark:- ark:- |\"\n      echo $delta_order >$dir/delta_order\n    fi\n   ;;\n  lda)\n    splice_opts=`cat $alidir/splice_opts 2>/dev/null`\n    # caution: the top-level nnet training script should copy these to its own dir now.\n    cp $alidir/{splice_opts,cmvn_opts,final.mat} $dir || exit 1;\n    [ ! 
-z \"$cmvn_opts\" ] && \\\n       echo \"You cannot supply --cmvn-opts option if feature type is LDA.\" && exit 1;\n    cmvn_opts=$(cat $dir/cmvn_opts)\n    feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ -f $transform_dir/trans.1 ] && [ $feat_type != \"raw\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\n  valid_feats=\"$valid_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/trans.*|' ark:- ark:- |\"\n  train_subset_feats=\"$train_subset_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/trans.*|' ark:- ark:- |\"\nfi\nif [ -f $transform_dir/raw_trans.1 ] && [ $feat_type == \"raw\" ]; then\n  echo \"$0: using raw-fMLLR transforms from $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/raw_trans.JOB ark:- ark:- |\"\n  valid_feats=\"$valid_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $transform_dir/raw_trans.*|' ark:- ark:- |\"\n  train_subset_feats=\"$train_subset_feats transform-feats --utt2spk=ark:$data/utt2spk 
'ark:cat $transform_dir/raw_trans.*|' ark:- ark:- |\"\nfi\nif [ ! -z \"$online_ivector_dir\" ]; then\n  feats_one=\"$(echo \"$feats\" | sed s:JOB:1:g)\"\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  echo $ivector_dim > $dir/info/ivector_dim\n  ivectors_opt=\"--const-feat-dim=$ivector_dim\"\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\n  valid_feats=\"$valid_feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\n  train_subset_feats=\"$train_subset_feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\nelse\n  echo 0 >$dir/info/ivector_dim\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\nelse\n  num_frames=`cat $dir/info/num_frames` || exit 1;\nfi\n\n# the + 1 is to round up, not down... 
we assume it doesn't divide exactly.\nnum_archives=$[$num_frames/($frames_per_eg*$samples_per_iter)+1]\n# (for small data)- while reduce_frames_per_eg == true and the number of\n# archives is 1 and would still be 1 if we reduced frames_per_eg by 1, reduce it\n# by 1.\nreduced=false\nwhile $reduce_frames_per_eg && [ $frames_per_eg -gt 1 ] && \\\n  [ $[$num_frames/(($frames_per_eg-1)*$samples_per_iter)] -eq 0 ]; do\n  frames_per_eg=$[$frames_per_eg-1]\n  num_archives=1\n  reduced=true\ndone\n$reduced && echo \"$0: reduced frames_per_eg to $frames_per_eg because amount of data is small.\"\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n\n# Working out number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg*$num_archives)]\n! [ $egs_per_archive -le $samples_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= samples_per_iter=$samples_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\n\n# Making soft links to storage directories.  This is a no-op unless\n# the subdirectory $dir/storage/ exists.  See utils/create_split_dir.pl\nfor x in `seq $num_archives`; do\n  utils/create_data_link.pl $dir/egs.$x.ark\n  for y in `seq $nj`; do\n    utils/create_data_link.pl $dir/egs_orig.$x.$y.ark\n  done\ndone\n\nnnet_context_opts=\"--left-context=$left_context --right-context=$right_context\"\n\necho $left_context > $dir/info/left_context\necho $right_context > $dir/info/right_context\nif [ $stage -le 2 ]; then\n  echo \"$0: Getting validation and training subset examples.\"\n  rm $dir/.error 2>/dev/null\n  echo \"$0: ... 
extracting validation and training-subset alignments.\"\n  set -o pipefail;\n  for id in $(seq $nj); do gunzip -c $alidir/ali.$id.gz; done | \\\n    copy-int-vector ark:- ark,t:- | \\\n    utils/filter_scp.pl <(cat $dir/valid_uttlist $dir/train_subset_uttlist) | \\\n    gzip -c >$dir/ali_special.gz || exit 1;\n  set +o pipefail; # unset the pipefail option.\n\n  $cmd $dir/log/create_valid_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$valid_feats\" \\\n    \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/valid_all.egs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$train_subset_feats\" \\\n     \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/train_subset_all.egs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && echo \"Error detected while creating train/valid egs\" && exit 1\n  echo \"... 
Getting subsets of validation examples for diagnostics and combination.\"\n  $cmd $dir/log/create_valid_subset_combine.log \\\n    nnet-subset-egs --n=$num_valid_frames_combine ark:$dir/valid_all.egs \\\n        ark:$dir/valid_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/valid_all.egs \\\n    ark:$dir/valid_diagnostic.egs || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet-subset-egs --n=$num_train_frames_combine ark:$dir/train_subset_all.egs \\\n    ark:$dir/train_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/train_subset_all.egs \\\n    ark:$dir/train_diagnostic.egs || touch $dir/.error &\n  wait\n  sleep 5  # wait for file system to sync.\n  cat $dir/valid_combine.egs $dir/train_combine.egs > $dir/combine.egs\n\n  for f in $dir/{combine,train_diagnostic,valid_diagnostic}.egs; do\n    [ ! -s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n  rm $dir/valid_all.egs $dir/train_subset_all.egs $dir/{train,valid}_combine.egs $dir/ali_special.gz\nfi\n\nif [ $stage -le 3 ]; then\n  # create egs_orig.*.*.ark; the first index goes to $num_archives,\n  # the second to $nj (which is the number of jobs in the original alignment\n  # dir)\n\n  egs_list=\n  for n in $(seq $num_archives); do\n    egs_list=\"$egs_list ark:$dir/egs_orig.$n.JOB.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n  # The examples will go round-robin to egs_list.\n  if [ ! 
-z $postdir ]; then\n    $cmd $io_opts JOB=1:$nj $dir/log/get_egs.JOB.log \\\n      nnet-get-egs $ivectors_opt $nnet_context_opts --num-frames=$frames_per_eg \"$feats\" \\\n      scp:$postdir/post.JOB.scp ark:- \\| \\\n      nnet-copy-egs ark:- $egs_list || exit 1;\n  else\n    $cmd $io_opts JOB=1:$nj $dir/log/get_egs.JOB.log \\\n      nnet-get-egs $ivectors_opt $nnet_context_opts --num-frames=$frames_per_eg \"$feats\" \\\n      \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" ark:- \\| \\\n      nnet-copy-egs ark:- $egs_list || exit 1;\n  fi\nfi\nif [ $stage -le 4 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"egs_orig.JOB.*.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the egs.JOB.ark\n\n  egs_list=\n  for n in $(seq $nj); do\n    egs_list=\"$egs_list $dir/egs_orig.JOB.$n.ark\"\n  done\n\n  $cmd $io_opts $extra_opts JOB=1:$num_archives $dir/log/shuffle.JOB.log \\\n    nnet-shuffle-egs --srand=JOB \"ark:cat $egs_list|\" ark:$dir/egs.JOB.ark  || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: removing temporary archives\"\n  for x in `seq $num_archives`; do\n    for y in `seq $nj`; do\n      file=$dir/egs_orig.$x.$y.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file)\n      rm $file\n    done\n  done\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet2/get_egs_discriminative2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script dumps examples MPE or MMI or state-level minimum bayes risk (sMBR)\n# training of neural nets.  Note: for \"criterion\", smbr > mpe > mmi in terms of\n# compatibility of the dumped egs, meaning you can use the egs dumped with\n# --criterion smbr for MPE or MMI, and egs dumped with --criterion mpe for MMI\n# training.  The discriminative training program itself doesn't enforce this and\n# it would let you mix and match them arbitrarily; we area speaking in terms of\n# the correctness of the algorithm that splits the lattices into pieces.\n\n# Begin configuration section.\ncmd=run.pl\ncriterion=smbr\ndrop_frames=false #  option relevant for MMI, affects how we dump examples.\nsamples_per_iter=400000 # measured in frames, not in \"examples\"\nmax_temp_archives=128 # maximum number of temp archives per input job, only\n                      # affects the process of generating archives, not the\n                      # final result.\n\nstage=0\n\ncleanup=true\ntransform_dir= # If this is a SAT system, directory for transforms\nonline_ivector_dir=\n\nnum_utts_subset=3000\nnum_archives_priors=10\n\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <denlat-dir> <src-model-file> <degs-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet_denlats exp/tri4/final.mdl exp/tri4_mpe/degs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs (probably would be good to add --max-jobs-run 5 or so if using\"\n  echo \"                                                   # GridEngine (to avoid excessive NFS traffic).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --criterion <criterion|smbr>                     # Training criterion: may be smbr, mmi or mpfe\"\n  echo \"  --online-ivector-dir <dir|\"\">                    # Directory for online-estimated iVectors, used in the\"\n  echo \"                                                   # online-neural-net setup.  (but you may want to use\"\n  echo \"                                                   # steps/online/nnet2/get_egs_discriminative2.sh instead)\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\nsrc_model=$5\ndir=$6\n\n\nextra_files=\n[ ! -z $online_ivector_dir ] && \\\n  extra_files=\"$online_ivector_dir/ivector_period $online_ivector_dir/ivector_online.scp\"\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/num_jobs $alidir/tree \\\n         $denlatdir/lat.1.gz $denlatdir/num_jobs $src_model $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log $dir/info || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\n\nnj=$(cat $denlatdir/num_jobs) || exit 1; # $nj is the number of\n                                         # splits of the denlats and alignments.\n\nnj_ali=$(cat $alidir/num_jobs) || exit 1;\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nif [ $nj_ali -eq $nj ]; then\n  ali_rspecifier=\"ark,s,cs:gunzip -c $alidir/ali.JOB.gz |\"\n  alis=$(for n in $(seq $nj); do echo -n \"$alidir/ali.$n.gz \"; done)\n  prior_ali_rspecifier=\"ark,s,cs:gunzip -c $alis | copy-int-vector ark:- ark,t:- | utils/filter_scp.pl $dir/priors_uttlist | ali-to-pdf $alidir/final.mdl ark,t:- ark:- |\"\nelse\n  ali_rspecifier=\"scp:$dir/ali.scp\"\n  prior_ali_rspecifier=\"ark,s,cs:utils/filter_scp.pl $dir/priors_uttlist $dir/ali.scp | ali-to-pdf $alidir/final.mdl scp:- ark:- |\"\n  if [ $stage -le 1 ]; then\n    echo \"$0: number of jobs in den-lats versus alignments differ: dumping them as single archive and index.\"\n    alis=$(for n in $(seq $nj_ali); do echo -n \"$alidir/ali.$n.gz \"; done)\n    $cmd $dir/log/copy_alignments.log \\\n      copy-int-vector \"ark:gunzip -c $alis|\" \\\n      ark,scp:$dir/ali.ark,$dir/ali.scp || exit 1;\n  fi\nfi\n\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null\ncp $alidir/tree $dir\ncp $lang/phones/silence.csl $dir/info/\ncp $src_model $dir/final.mdl || exit 1\n\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period)\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  echo $ivector_dim >$dir/info/ivector_dim\n  # the 'const_dim_opt' allows it to write only one iVector per example,\n  # rather than one per time-index... it has to average over\n  const_dim_opt=\"--const-feat-dim=$ivector_dim\"\nelse\n  echo 0 > $dir/info/ivector_dim\nfi\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/priors_uttlist || exit 1;\n\n## We don't support deltas here, only LDA or raw (mainly because deltas are less\n## frequently used).\nif [ -z $feat_type ]; then\n  if [ -f $alidir/final.mat ] && [ ! -f $transform_dir/raw_trans.1 ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n    priors_feats=\"ark,s,cs:utils/filter_scp.pl $dir/priors_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n   ;;\n  lda)\n    splice_opts=`cat $alidir/splice_opts 2>/dev/null`\n    cp $alidir/final.mat $dir\n    feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    priors_feats=\"ark,s,cs:utils/filter_scp.pl $dir/priors_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ -z \"$transform_dir\" ]; then\n  if [ -f $transform_dir/trans.1 ] || [ -f 
$transform_dir/raw_trans.1 ]; then\n    transform_dir=$alidir\n  fi\nfi\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -s $transform_dir/num_jobs ] && \\\n    echo \"$0: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n\n  if [ $feat_type == \"raw\" ]; then trans=raw_trans;\n  else trans=trans; fi\n  if [ $feat_type == \"lda\" ] && ! cmp $transform_dir/final.mat $alidir/final.mat; then\n    echo \"$0: LDA transforms differ between $alidir and $transform_dir\"\n    exit 1;\n  fi\n  if [ ! -f $transform_dir/$trans.1 ]; then\n    echo \"$0: expected $transform_dir/$trans.1 to exist (--transform-dir option)\"\n    exit 1;\n  fi\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    for n in $(seq $nj_orig); do cat $transform_dir/$trans.$n; done | \\\n      copy-feats ark:- ark,scp:$dir/$trans.ark,$dir/$trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/$trans.scp ark:- ark:- |\"\n    priors_feats=\"$priors_feats transform-feats --utt2spk=ark:$data/utt2spk scp:$dir/$trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/$trans.JOB ark:- ark:- |\"\n    tras=$(for n in $(seq $nj); do echo -n \"$transform_dir/$trans.$n \"; done)\n    priors_feats=\"$priors_feats transform-feats --utt2spk=ark:$data/utt2spk 'ark:cat $tras |' ark:- ark:- |\"\n  fi\nfi\nif [ ! 
-z $online_ivector_dir ]; then\n  # add iVectors to the features.\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\n  priors_feats=\"$priors_feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $dir/priors_uttlist $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\nfi\n\n\nif [ $stage -le 2 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n\n  echo $num_frames > $dir/info/num_frames\n\n  # Working out total number of archives. Add one on the assumption the\n  # num-frames won't divide exactly, and we want to round up.\n  num_archives=$[$num_frames/$samples_per_iter + 1]\n\n  # the next few lines relate to how we may temporarily split each input job\n  # into fewer than $num_archives pieces, to avoid using an excessive\n  # number of filehandles.\n  archive_ratio=$[$num_archives/$max_temp_archives+1]\n  num_archives_temp=$[$num_archives/$archive_ratio]\n  # change $num_archives slightly to make it an exact multiple\n  # of $archive_ratio.\n  num_archives=$[$num_archives_temp*$archive_ratio]\n\n  echo $num_archives >$dir/info/num_archives || exit 1\n  echo $num_archives_temp >$dir/info/num_archives_temp || exit 1\n\n  frames_per_archive=$[$num_frames/$num_archives]\n\n  # note, this is the number of frames per archive prior to discarding frames.\n  echo $frames_per_archive > $dir/info/frames_per_archive\nelse\n  num_archives=$(cat $dir/info/num_archives) || exit 1;\n  num_archives_temp=$(cat $dir/info/num_archives_temp) || exit 1;\n  frames_per_archive=$(cat $dir/info/frames_per_archive) || exit 1;\nfi\n\necho \"$0: Splitting the data up into $num_archives archives (using $num_archives_temp temporary pieces per input job)\"\necho \"$0: 
giving samples-per-iteration of $frames_per_archive (you requested $samples_per_iter).\"\n\n# we create these data links regardless of the stage, as there are situations\n# where we would want to recreate a data link that had previously been deleted.\n\nif [ -d $dir/storage ]; then\n  echo \"$0: creating data links for distributed storage of degs\"\n  # See utils/create_split_dir.pl for how this 'storage' directory is created.\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_temp); do\n      utils/create_data_link.pl $dir/degs_orig.$x.$y.ark\n    done\n  done\n  for z in $(seq $num_archives); do\n    utils/create_data_link.pl $dir/degs.$z.ark\n  done\n  if [ $num_archives_temp -ne $num_archives ]; then\n    for z in $(seq $num_archives); do\n      utils/create_data_link.pl $dir/degs_temp.$z.ark\n    done\n  fi\nfi\n\nrm $dir/.error 2>/dev/null\nleft_context=$(nnet-am-info $dir/final.mdl | grep '^left-context' | awk '{print $2}') || exit 1\nright_context=$(nnet-am-info $dir/final.mdl | grep '^right-context' | awk '{print $2}') || exit 1\n\n(\n\nif [ $stage -le 10 ]; then\n\npriors_egs_list=\nfor y in `seq $num_archives_priors`; do\n  utils/create_data_link.pl $dir/priors_egs.$y.ark\n  priors_egs_list=\"$priors_egs_list ark:$dir/priors_egs.$y.ark\"\ndone\n\nnnet_context_opts=\"--left-context=$left_context --right-context=$right_context\"\n\necho \"$0: dumping egs for prior adjustment in the background.\"\n\n$cmd $dir/log/create_priors_subset.log \\\n  nnet-get-egs $ivectors_opt $nnet_context_opts \"$priors_feats\" \\\n  \"$prior_ali_rspecifier ali-to-post ark:- ark:- |\" \\\n  ark:- \\| nnet-copy-egs ark:- $priors_egs_list || \\\n  { touch $dir/.error; echo \"Error in creating priors subset. 
See $dir/log/create_priors_subset.log\"; exit 1; }\n\nsleep 3;\n\necho $num_archives_priors >$dir/info/num_archives_priors\n\nfi\n\n) &\n\nif [ $stage -le 3 ]; then\n  echo \"$0: getting initial training examples by splitting lattices\"\n\n  degs_list=$(for n in $(seq $num_archives_temp); do echo -n \"ark:$dir/degs_orig.JOB.$n.ark \"; done)\n\n  $cmd JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet-get-egs-discriminative --criterion=$criterion --drop-frames=$drop_frames \\\n      \"$src_model\" \"$feats\" \"$ali_rspecifier\" \"ark,s,cs:gunzip -c $denlatdir/lat.JOB.gz|\" ark:- \\| \\\n    nnet-copy-egs-discriminative $const_dim_opt ark:- $degs_list || exit 1;\n  sleep 5;  # wait a bit so NFS has time to write files.\nfi\n\nif [ $stage -le 4 ]; then\n\n  degs_list=$(for n in $(seq $nj); do echo -n \"$dir/degs_orig.$n.JOB.ark \"; done)\n\n  if [ $num_archives -eq $num_archives_temp ]; then\n    echo \"$0: combining data into final archives and shuffling it\"\n\n    $cmd JOB=1:$num_archives $dir/log/shuffle.JOB.log \\\n      cat $degs_list \\| nnet-shuffle-egs-discriminative --srand=JOB ark:- \\\n       ark:$dir/degs.JOB.ark || exit 1;\n  else\n    echo \"$0: combining and re-splitting data into un-shuffled versions of final archives.\"\n\n    archive_ratio=$[$num_archives/$num_archives_temp]\n    ! [ $archive_ratio -gt 1 ] && echo \"$0: Bad archive_ratio $archive_ratio\" && exit 1;\n\n    # note: the \\$[ .. ] won't be evaluated until the job gets executed.  The\n    # aim is to write to the archives with the final numbering, 1\n    # ... num_archives, which is more than num_archives_temp.  The list with\n    # \\$[... ] expressions in it computes the set of final indexes for each\n    # temporary index.\n    degs_list_out=$(for n in $(seq $archive_ratio); do echo -n \"ark:$dir/degs_temp.\\$[((JOB-1)*$archive_ratio)+$n].ark \"; done)\n    # e.g. 
if dir=foo and archive_ratio=2, we'd have\n    # degs_list_out='foo/degs_temp.$[((JOB-1)*2)+1].ark foo/degs_temp.$[((JOB-1)*2)+2].ark'\n\n    $cmd JOB=1:$num_archives_temp $dir/log/resplit.JOB.log \\\n      cat $degs_list \\| nnet-copy-egs-discriminative --srand=JOB ark:- \\\n      $degs_list_out || exit 1;\n  fi\nfi\n\nif [ $stage -le 5 ] && [ $num_archives -ne $num_archives_temp ]; then\n  echo \"$0: shuffling final archives.\"\n\n  $cmd JOB=1:$num_archives $dir/log/shuffle.JOB.log \\\n    nnet-shuffle-egs-discriminative --srand=JOB ark:$dir/degs_temp.JOB.ark \\\n      ark:$dir/degs.JOB.ark || exit 1\nfi\n\nwait;\n[ -f $dir/.error ] && echo \"Error detected while creating priors adjustment egs\" && exit 1\n\nif $cleanup; then\n  echo \"$0: removing temporary archives.\"\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_temp); do\n      file=$dir/degs_orig.$x.$y.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file); rm $file\n    done\n  done\n  if [ $num_archives_temp -ne $num_archives ]; then\n    for z in $(seq $num_archives); do\n      file=$dir/degs_temp.$z.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file); rm $file\n    done\n  fi\nfi\n\necho \"$0: Done.\"\n"
  },
  {
    "path": "egs/steps/nnet2/get_ivector_id.sh",
    "content": "#!/usr/bin/env bash\n# Copyright (c) 2016, Johns Hopkins University (Yenda Trmal <jtrmal@gmail.com>)\n# License: Apache 2.0\n\n# Begin configuration section.\n# End configuration section\nset -e -o pipefail\nset -o nounset                              # Treat unset variables as an error\n\n# End configuration section.\n\n#echo >&2 \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 1 ]; then\n  echo >&2 \"Usage: $0 <directory>\"\n  echo >&2 \" e.g.: $0 exp/nnet3/extractor\"\n  exit 1\nfi\n\nivecdir=$1\n\nif [ -f $ivecdir/final.ie.id ] ; then\n  cat $ivecdir/final.ie.id\nelif [ -f $ivecdir/final.ie ] ; then\n  # note the creation can fail in case the extractor directory\n  # is not read-only media or the user des not have access rights\n  # in that case we will just behave as if the id is not available\n  id=$(md5sum $ivecdir/final.ie | awk '{print $1}')\n  echo \"$id\" > $ivecdir/final.ie.id || true\n  echo \"$id\"\nelse\n  exit 0\nfi\n\nexit 0\n\n\n\n"
  },
  {
    "path": "egs/steps/nnet2/get_lda.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the neural net (and also\n# the validation examples used for diagnostics), and puts them in separate archives.\n\n# Begin configuration section.\ncmd=run.pl\n\nfeat_type=\nstage=0\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nleft_context= # left context for second LDA\nright_context= # right context for second LDA\nrand_prune=4.0 # Relates to a speedup we do for LDA.\nwithin_class_factor=0.0001 # This affects the scaling of the transform rows...\n                           # sorry for no explanation, you'll have to see the code.\ntransform_dir=     # If supplied, overrides alidir\nnum_feats=10000 # maximum number of feature files to use.  Beyond a certain point it just\n                # gets silly to use more data.\nlda_dim=  # This defaults to no dimension reduction.\nonline_ivector_dir=\nivector_randomize_prob=0.0 # if >0.0, randomizes iVectors during training with\n                           # this prob per iVector.\nivector_dir=\ncmvn_opts=  # allows you to specify options for CMVN, if feature type is not lda.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/nnet2/get_lda.sh [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/nnet2/get_lda.sh data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \" As well as extracting the examples, this script will also do the LDA computation,\"\n  echo \" if --est-lda=true (default:true)\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --left-context <width|4>                         # Number of frames on left side to append for feature input, overrides splice-width\"\n  echo \"  --right-context <width|4>                        # Number of frames on right side to append for feature input, overrides splice-width\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --online-ivector-dir <dir|none>                  # Directory produced by\"\n  echo \"                                                   # steps/online/nnet2/extract_ivectors_online.sh\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n[ -z \"$left_context\" ] && left_context=$splice_width\n[ -z \"$right_context\" ] && right_context=$splice_width\n\n[ ! 
-z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\noov=`cat $lang/oov.int`\nnum_leaves=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nif [ -z \"$cmvn_opts\" ]; then\n  cmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\nfi\necho $cmvn_opts >$dir/cmvn_opts 2>/dev/null\n\n## Set up features.  Note: these are different from the normal features\n## because we have one rspecifier that has the features for the entire\n## training set, not separate ones for each batch.\nif [ -z $feat_type ]; then\n  if [ -f $alidir/final.mat ] && ! [ -f $alidir/raw_trans.1 ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\n\n# If we have more than $num_feats feature files (default: 10k),\n# we use a random subset.  This won't affect the transform much, and will\n# spare us an unnecessary pass over the data.  
Probably 10k is\n# way too much, but for small datasets this phase is quite fast.\nN=$[$num_feats/$nj]\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:utils/subset_scp.pl --quiet $N $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\n    echo $cmvn_opts >$dir/cmvn_opts\n   ;;\n  lda) \n    splice_opts=`cat $alidir/splice_opts 2>/dev/null`\n    cp $alidir/{splice_opts,cmvn_opts,final.mat} $dir || exit 1;\n    [ ! -z \"$cmvn_opts\" ] && \\\n       echo \"You cannot supply --cmvn-opts option if feature type is LDA.\" && exit 1;\n    cmvn_opts=$(cat $dir/cmvn_opts)\n     feats=\"ark,s,cs:utils/subset_scp.pl --quiet $N $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ -f $transform_dir/trans.1 ] && [ $feat_type != \"raw\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nfi\nif [ -f $transform_dir/raw_trans.1 ] && [ $feat_type == \"raw\" ]; then\n  echo \"$0: using raw-fMLLR transforms from $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/raw_trans.JOB ark:- ark:- |\"\nfi\n\n\nfeats_one=\"$(echo \"$feats\" | sed s:JOB:1:g)\"\n# note: feat_dim is the raw, un-spliced feature dim without the iVectors.\nfeat_dim=$(feat-to-dim \"$feats_one\" -) || exit 1;\n# by default: no dim reduction.\n\nspliced_feats=\"$feats splice-feats --left-context=$left_context --right-context=$right_context ark:- ark:- |\"\n\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  # note: subsample-feats, with negative value of n, repeats each feature n times.\n  spliced_feats=\"$spliced_feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- | ivector-randomize --randomize-prob=$ivector_randomize_prob ark:- ark:- |' ark:- |\"\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\nelse\n  ivector_dim=0\nfi\necho $ivector_dim >$dir/ivector_dim\n\nif [ -z \"$lda_dim\" ]; then\n  spliced_feats_one=\"$(echo \"$spliced_feats\" | sed s:JOB:1:g)\"  \n  lda_dim=$(feat-to-dim \"$spliced_feats_one\" -) || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: Accumulating LDA statistics.\"\n  rm $dir/lda.*.acc 2>/dev/null # in case any left over from before.\n  $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      acc-lda --rand-prune=$rand_prune $alidir/final.mdl \"$spliced_feats\" ark,s,cs:- \\\n       $dir/lda.JOB.acc || exit 1;\nfi\n\necho $feat_dim > $dir/feat_dim\necho $lda_dim > $dir/lda_dim\necho $ivector_dim > $dir/ivector_dim\n\nif [ $stage -le 1 ]; then\n  sum-lda-accs $dir/lda.acc $dir/lda.*.acc 2>$dir/log/lda_sum.log || exit 1;\n  rm $dir/lda.*.acc\nfi\n\nif [ $stage -le 2 ]; then\n  # There are various things that we sometimes (but not always) need\n  # the within-class covariance and its Cholesky factor for, and we\n  # write these to disk just in case.\n  nnet-get-feature-transform --write-cholesky=$dir/cholesky.tpmat \\\n     --write-within-covar=$dir/within_covar.spmat \\\n     --within-class-factor=$within_class_factor --dim=$lda_dim \\\n      $dir/lda.mat $dir/lda.acc \\\n      2>$dir/log/lda_est.log || exit 
1;\nfi\n\necho \"$0: Finished estimating LDA\"\n"
  },
  {
    "path": "egs/steps/nnet2/get_lda_block.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the neural net (and also\n# the validation examples used for diagnostics), and puts them in separate archives.\n\n# Begin configuration section.\ncmd=run.pl\n\nstage=0\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrand_prune=4.0 # Relates to a speedup we do for LDA.\nwithin_class_factor=0.0001 # This affects the scaling of the transform rows...\n                           # sorry for no explanation, you'll have to see the code.\nblock_size=10\nblock_shift=5\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/nnet2/get_lda_block.sh [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/nnet2/get_lda.sh data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \" As well as extracting the examples, this script will also do the LDA computation,\"\n  echo \" if --est-lda=true (default:true)\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some 
files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\noov=`cat $lang/oov.int`\nnum_leaves=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n## Set up features.  Note: these are different from the normal features\n## because we have one rspecifier that has the features for the entire\n## training set, not separate ones for each batch.\n\n\nfeats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\ntrain_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n\nfeat_dim=`feat-to-dim \"$train_subset_feats\" -` || exit 1;\n\nif [ $stage -le 0 ]; then\n  echo \"$0: Accumulating LDA statistics.\"\n  $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    set -o pipefail '&&' \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      acc-lda --rand-prune=$rand_prune $alidir/final.mdl \"$feats splice-feats --left-context=$splice_width --right-context=$splice_width ark:- ark:- |\" ark,s,cs:- \\\n       $dir/lda.JOB.acc || exit 1;\nfi\n\necho $feat_dim > $dir/feat_dim\n\necho -n > $dir/indexes\n# Get list of 
indexes, e.g. a file like:\n# 0 1 2 3 4 5 6 7 8 9\n# 5 6 7 8 9 10 11 12 13 14\n# 10 ...\n\ncur_index=0\nnum_blocks=0\ncontext_length=$[1+2*($splice_width)]\n\nwhile true; do\n  for n in `seq $cur_index $[cur_index+$block_size-1]`; do\n    echo -n `seq $n $feat_dim $[$n+($feat_dim*($context_length-1))]` '' >> $dir/indexes\n  done\n  echo >> $dir/indexes\n  num_blocks=$[$num_blocks+1]\n  next_index=$[$cur_index+$block_shift]\n  if [ $[$next_index+$block_size] -gt $feat_dim ]; then\n    next_index=$[$feat_dim-$block_size];\n  fi\n  if [ $next_index -le $cur_index ]; then break; fi\n  cur_index=$next_index\ndone\necho $num_blocks >$dir/num_blocks\n\nlda_dim=`cat $dir/indexes | wc -w`\necho $lda_dim > $dir/lda_dim\n\nif [ $stage -le 1 ]; then\n  nnet-get-feature-transform-multi --within-class-factor=$within_class_factor $dir/indexes $dir/lda.*.acc $dir/lda.mat \\\n      2>$dir/log/lda_est.log || exit 1;\n  rm $dir/lda.*.acc\nfi\n\necho \"$0: Finished estimating LDA\"\n"
  },
  {
    "path": "egs/steps/nnet2/get_perturbed_feats.sh",
    "content": "#!/usr/bin/env bash\n\n\n# begin configuration section\n\ncmd=\"run.pl\"\nnum_copies=5  # support 3, 4 or 5 perturbed copies of the data.\nstage=0\nnj=8\ncleanup=true\nfeature_type=fbank\n# end configuration section\n\nset -e\n. utils/parse_options.sh \n\nif [ $# -ne 5 ]; then\n  echo \"Usage: $0 [options] <baseline-feature-config> <feature-storage-dir> <log-location> <input-data-dir> <output-data-dir> \"\n  echo \"e.g.: $0 conf/fbank_40.conf mfcc exp/perturbed_fbank_train data/train data/train_perturbed_fbank\"\n  echo \"Supported options: \"\n  echo \"--feature-type (fbank|mfcc|plp)  # Type of features we are making, default fbank\"\n  echo \"--cmd 'command-program'      # Mechanism to run jobs, e.g. run.pl\"\n  echo \"--num-copies <n>             # Number of perturbed copies of the data (support 3, 4 or 5), default 5\"\n  echo \"--stage <stage>              # Use for partial re-run\"\n  echo \"--cleanup (true|false)       # If false, do not clean up temp files (default: true)\"\n  echo \"--nj <num-jobs>              # How many jobs to use for feature extraction (default: 8)\"\n  exit 1;\nfi\n\nbase_config=$1\nfeatdir=$2\ndir=$3 # dir/log* will contain log-files\ninputdata=$4\ndata=$5\n\n# Set pairs of (VTLN warp factor, time-warp factor)\n# Aim to put these roughly in a circle centered at 1.0-1.0; the\n# dynamic range of the VTLN warp factor will be 0.9 to 1.1 and\n# of the time-warping factor will be 0.8 to 1.2.\nif [ $num_copies -eq 5 ]; then\n  pairs=\"1.1-1.0 1.05-1.2 1.0-0.8 0.95-1.1 0.9-0.9\" \nelif [ $num_copies -eq 4 ]; then\n  pairs=\"1.1-1.0 1.0-0.8 1.0-1.2 0.9-1.0\"\nelif [ $num_copies -eq 3 ]; then\n  pairs=\"1.1-1.1 1.0-0.8 0.9-1.1\"\nelse\n  echo \"$0: unsupported --num-copies value: $num_copies (support 3, 4 or 5)\"\nfi\n\nfor f in $base_config $inputdata/wav.scp; do \n  if [ ! 
-f $f ]; then\n    echo \"Expected file $f to exist\"\n    exit 1;\n  fi\ndone\n\nif [ \"$feature_type\" != \"fbank\" ] && [ \"$feature_type\" != \"mfcc\" ] && \\\n   [ \"$feature_type\" != \"plp\" ]; then \n  echo \"$0: Invalid option --feature-type=$feature_type\"\n  exit 1;\nfi\n\nmkdir -p $featdir\nmkdir -p $dir/conf $dir/log\n\nall_feature_dirs=\"\"\n\nfor pair in $pairs; do\n  vtln_warp=`echo $pair | cut -d- -f1`\n  time_warp=`echo $pair | cut -d- -f2`\n  fs=`perl -e \"print ($time_warp*10);\"`\n  conf=$dir/conf/$pair.conf\n  this_dir=$dir/$pair\n  \n  ( cat $base_config; echo; echo \"--frame-shift=$fs\"; echo \"--vtln-warp=$vtln_warp\" ) > $conf\n  \n  echo \"Making ${feature_type} features for VTLN-warp $vtln_warp and time-warp $time_warp\"\n\n  feature_data=${data}-$pair\n  all_feature_dirs=\"$all_feature_dirs $feature_data\"\n\n  utils/copy_data_dir.sh --spk-prefix ${pair}- --utt-prefix ${pair}- $inputdata $feature_data\n  steps/make_${feature_type}.sh --${feature_type}-config $conf --nj \"$nj\" --cmd \"$cmd\" $feature_data $this_dir $featdir\n\n  steps/compute_cmvn_stats.sh $feature_data $this_dir $featdir\ndone\n\nutils/combine_data.sh $data $all_feature_dirs\n\n\n# In the combined feature directory, create a file utt2uniq which maps\n# our extended utterance-ids to \"unique utterances\".  This enables the\n# script steps/nnet2/get_egs.sh to hold out data in a more proper way.\ncat $data/utt2spk | \\\n   perl -e ' while(<STDIN>){ @A=split; $x=shift @A; $y=$x; \n     foreach $pair (@ARGV) { $y =~ s/^${pair}-// && last; } print \"$x $y\\n\"; } ' $pairs \\\n  > $data/utt2uniq\n\nif $cleanup; then\n  echo \"$0: Cleaning up temporary directories for ${feature_type} features.\"\n  # Note, this just removes the .scp files and so on, not the data which is located in\n  # $featdir and which is still needed.\n  rm -r $all_feature_dirs\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/make_denlats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# Create denominator lattices for MMI/MPE training.\n# This version uses the neural-net models (version 2, i.e. the nnet2 code).\n# Creates its output in $dir/lat.*.gz\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\ntransform_dir=\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\nnum_threads=1\nonline_ivector_dir=\nparallel_opts= # ignored now\nfeat_type=  # you can set this in order to run on top of delta features, although we don't\n            # normally want to do this.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/make_denlats.sh [options] <data-dir> <lang-dir> <src-dir> <exp-dir>\"\n  echo \"  e.g.: steps/make_denlats.sh data/train data/lang exp/nnet4 exp/nnet4_denlats\"\n  echo \"Works for (delta|lda) features, and (with --transform-dir option) such features\"\n  echo \" plus transforms.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n  echo \"                           # large databases so your jobs will be smaller and\"\n  echo \"                           # will (individually) finish reasonably soon.\"\n  echo \"  --transform-dir <transform-dir>   # directory to find fMLLR transforms.\"\n  echo \"  --num-threads  <n>                # number of threads per decoding job\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\n\nextra_files=\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfor f in $data/feats.scp $lang/L.fst $srcdir/final.mdl $extra_files; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nsdata=$data/split$nj\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\n\noov=`cat $lang/oov.int` || exit 1;\n\ncp -rH $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\nnew_lang=\"$dir/\"$(basename \"$lang\")\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\necho \"Compiling decoding graph in $dir/dengraph\"\nif [ -s $dir/dengraph/HCLG.fst ] && [ $dir/dengraph/HCLG.fst -nt $srcdir/final.mdl ]; then\n  echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  echo \"Making unigram grammar FST in $new_lang\"\n  cat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n   awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n    utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > 
$new_lang/G.fst \\\n    || exit 1;\n  utils/mkgraph.sh $new_lang $srcdir $dir/dengraph || exit 1;\nfi\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null\n\nif [ -z \"$feat_type\" ]; then\n  if [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  raw) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n   ;;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n    cp $srcdir/final.mat $dir\n   ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -s $transform_dir/num_jobs ] && \\\n    echo \"$0: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n\n  if [ $feat_type == \"raw\" ]; then trans=raw_trans;\n  else trans=trans; fi\n  if [ $feat_type == \"lda\" ] && ! cmp $transform_dir/final.mat $srcdir/final.mat; then\n    echo \"$0: LDA transforms differ between $srcdir and $transform_dir\"\n    exit 1;\n  fi\n  if [ ! 
-f $transform_dir/$trans.1 ]; then\n    echo \"$0: expected $transform_dir/$trans.1 to exist (--transform-dir option)\"\n    exit 1;\n  fi\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    for n in $(seq $nj_orig); do cat $transform_dir/$trans.$n; done | \\\n       copy-feats ark:- ark,scp:$dir/$trans.ark,$dir/$trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/$trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/$trans.JOB ark:- ark:- |\"\n  fi\nfi\n\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  # note: subsample-feats, with negative n, will repeat each feature -n times.\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\nfi\n\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\n\nif [ $sub_split -eq 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode_den.JOB.log \\\n   nnet-latgen-faster$thread_string --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n    --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n     $dir/dengraph/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nelse\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have stragglers\n  # from one job, we can be processing another one at the same time.\n  rm $dir/.error 2>/dev/null\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      sdata2=$data/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed \"s/trans.JOB/trans.$n/g\" | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n\n      $cmd --num-threads $num_threads JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        nnet-latgen-faster$thread_string --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n        --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n          $dir/dengraph/HCLG.fst \"$feats_subset\" \"ark:|gzip -c >$dir/lat.$n.JOB.gz\" || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! -z \"$prev_pid\" ]; then  # Wait for the previous job; merge the previous set of lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && echo \"$0: error generating denominator lattices\" && exit 1;\n      rm $dir/.merge_error 2>/dev/null\n      echo Merging archives for data subset $prev_n\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$prev_n.$k.gz || touch $dir/.merge_error;\n      done | gzip -c > $dir/lat.$prev_n.gz || touch $dir/.merge_error;\n      [ -f $dir/.merge_error ] && echo \"$0: Merging lattices for subset $prev_n failed (or maybe some other error)\" && exit 1;\n      rm $dir/lat.$prev_n.*.gz\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n\necho \"$0: done generating denominator lattices.\"\n"
  },
  {
    "path": "egs/steps/nnet2/make_multisplice_configs.py",
    "content": "#!/usr/bin/env python\n# Copyright 2014  Johns Hopkins University (Authors: Daniel Povey and Vijayaditya Peddinti).  Apache 2.0.\n\n# Creates the nnet.config and hidde_*.config scripts used in train_pnorm_multisplice.sh\n# Parses the splice string to generate relevant variables for get_egs.sh, get_lda.sh and nnet/hidden.config files\n\nfrom __future__ import division\nfrom __future__ import print_function\nimport re, argparse, sys, math, warnings\n\n# returns the set of frame indices required to perform the convolution\n# between sequences with frame indices in x and y\ndef get_convolution_index_set(x, y):\n  z = []\n  for i in range(len(x)):\n    for j in range(len(y)):\n      z.append(x[i]+y[j])\n  z = list(set(z))\n  z.sort()\n  return z\n\ndef parse_splice_string(splice_string):\n  layerwise_splice_indexes = splice_string.split('layer')[1:]\n  print(splice_string.split('layer'))\n  contexts={}\n  first_right_context = 0 # default value\n  first_left_context = 0 # default value\n  nnet_frame_indexes = [0] # frame indexes required by the network\n                           # at the initial layer (will be used in \n                           # determining the context for get_egs.sh)\n  try:\n    for cur_splice_indexes in layerwise_splice_indexes:\n      layer_index, frame_indexes  = cur_splice_indexes.split(\"/\")\n      frame_indexes = [int(x) for x in frame_indexes.split(':')]\n      layer_index = int(layer_index)\n      assert(layer_index >= 0)\n      if layer_index == 0:\n        first_left_context = min(frame_indexes)\n        first_right_context = max(frame_indexes)\n        try:\n          assert(frame_indexes == list(range(first_left_context, first_right_context+1)))\n        except AssertionError:\n          raise Exception('Currently the first splice component just accepts contiguous context.')\n        try:\n          assert((first_left_context <=0) and (first_right_context >=0))\n        except AssertionError:\n          raise 
Exception(\"\"\"get_lda.sh script does not support positive left-context or negative right context.\n          left context provided is %d and right context provided is %d.\"\"\" % (first_left_context, first_right_context))\n        # convolve the current splice indices with the splice indices until last layer\n      nnet_frame_indexes = get_convolution_index_set(frame_indexes, nnet_frame_indexes)\n      cur_context = \":\".join([str(x) for x in frame_indexes])\n      contexts[layer_index] = cur_context\n  except ValueError:\n    raise Exception('Unknown format in splice_indexes variable: {0}'.format(splice_string))\n  print(nnet_frame_indexes)\n  max_left_context = min(nnet_frame_indexes)\n  max_right_context = max(nnet_frame_indexes)\n  return [contexts, ' nnet_left_context={0};\\n nnet_right_context={1}\\n first_left_context={2};\\n first_right_context={3}\\n'.format(abs(max_left_context), abs(max_right_context), abs(first_left_context), abs(first_right_context) )]\n\ndef create_config_files(output_dir, params):\n  pnorm_p = 2\n  pnorm_input_dim = params.pnorm_input_dim\n  pnorm_output_dim = params.pnorm_output_dim\n  contexts, context_variables = parse_splice_string(params.splice_indexes)\n  var_file = open(\"{0}/vars\".format(output_dir), \"w\")\n  var_file.write(context_variables)\n  var_file.close()\n\n  try:\n    assert(max(contexts.keys()) < params.num_hidden_layers)\n  except AssertionError:\n    raise Exception(\"\"\"Splice string provided is {2}.\n    Number of hidden layers {0}, is less than the number of context specifications provided.\n    Splicing is supported only until layer {1}.\"\"\".format(params.num_hidden_layers, params.num_hidden_layers - 1, params.splice_indexes))\n\n  stddev=1.0/math.sqrt(pnorm_input_dim)\n  try :\n    nnet_config = [\"SpliceComponent input-dim={0} context={1} const-component-dim={2}\".format(params.total_input_dim, contexts[0], params.ivector_dim),\n    \"FixedAffineComponent matrix={0}\".format(params.lda_mat),\n 
   \"AffineComponentPreconditionedOnline input-dim={0} output-dim={1} {2} learning-rate={3} param-stddev={4} bias-stddev={5}\".format(params.lda_dim, pnorm_input_dim, params.online_preconditioning_opts, params.initial_learning_rate, stddev, params.bias_stddev),\n    (\"PnormComponent input-dim={0} output-dim={1} p={2}\".format(pnorm_input_dim, pnorm_output_dim, pnorm_p) if pnorm_input_dim != pnorm_output_dim else \"RectifiedLinearComponent dim={0}\".format(pnorm_input_dim)),\n    \"NormalizeComponent dim={0}\".format(pnorm_output_dim),\n    \"AffineComponentPreconditionedOnline input-dim={0} output-dim={1} {2} learning-rate={3} param-stddev=0 bias-stddev=0\".format(pnorm_output_dim, params.num_targets, params.online_preconditioning_opts, params.initial_learning_rate),\n    \"SoftmaxComponent dim={0}\".format(params.num_targets)]\n\n    nnet_config_file = open((\"{0}/nnet.config\").format(output_dir), \"w\")\n    nnet_config_file.write(\"\\n\".join(nnet_config))\n    nnet_config_file.close()\n  except KeyError:\n    raise Exception('A splice layer is expected to be the first layer. 
Provide a context for the first layer.')\n\n  for i in range(1, params.num_hidden_layers): #just run till num_hidden_layers-1 since we do not add splice before the final affine transform\n    lines=[]\n    context_len = 1\n    if i in contexts:\n        # Adding the splice component as a context is provided\n        lines.append(\"SpliceComponent input-dim=%d context=%s \" % (pnorm_output_dim, contexts[i]))\n        context_len = len(contexts[i].split(\":\"))\n    # Add the hidden layer, which is a composition of an affine component, pnorm component and normalization component\n    lines.append(\"AffineComponentPreconditionedOnline input-dim=%d output-dim=%d %s learning-rate=%f param-stddev=%f bias-stddev=%f\" \n        % ( pnorm_output_dim*context_len, pnorm_input_dim, params.online_preconditioning_opts, params.initial_learning_rate, stddev, params.bias_stddev))\n    if pnorm_input_dim != pnorm_output_dim:\n      lines.append(\"PnormComponent input-dim=%d output-dim=%d p=%d\" % (pnorm_input_dim, pnorm_output_dim, pnorm_p))\n    else:\n      lines.append(\"RectifiedLinearComponent dim=%d\" % (pnorm_input_dim)) \n      warnings.warn(\"Using the RectifiedLinearComponent, in place of the PnormComponent as pnorm_input_dim == pnorm_output_dim\")\n    lines.append(\"NormalizeComponent dim={0}\".format(pnorm_output_dim))\n    out_file = open(\"{0}/hidden_{1}.config\".format(output_dir, i), 'w')\n    out_file.write(\"\\n\".join(lines))\n    out_file.close()\n\n\nif __name__ == \"__main__\":\n  print(\" \".join(sys.argv))\n  parser = argparse.ArgumentParser()\n  parser.add_argument('--splice-indexes', type=str, help='string specifying the indexes for the splice layers throughout the network')\n  parser.add_argument('--total-input-dim', type=int, help='dimension of the input to the network')\n  parser.add_argument('--ivector-dim', type=int, help='dimension of the ivector portion of the neural network input')\n  parser.add_argument('--lda-mat', type=str, help='lda-matrix used 
after the first splice component')\n  parser.add_argument('--lda-dim', type=str, help='dimension of the lda output')\n  parser.add_argument('--pnorm-input-dim', type=int, help='dimension of input to pnorm layer')\n  parser.add_argument('--pnorm-output-dim', type=int, help='dimension of output of pnorm layer')\n  parser.add_argument('--online-preconditioning-opts', type=str, help='extra options for the AffineComponentPreconditionedOnline component')\n  parser.add_argument('--initial-learning-rate', type=float, help='')\n  parser.add_argument('--num-targets', type=int, help='#targets for the neural network ')\n  parser.add_argument('--num-hidden-layers', type=int, help='#hidden layers in the neural network ')\n  parser.add_argument('--bias-stddev', type=float, help='standard deviation of r.v. used for bias component initialization')\n  parser.add_argument(\"mode\", type=str, help=\"contexts|configs\")\n  parser.add_argument(\"output_dir\", type=str, help=\"output directory to store the files\")\n  params = parser.parse_args() \n  \n  print(params)\n  if params.mode == \"contexts\":\n    [context, context_variables] = parse_splice_string(params.splice_indexes)\n    var_file = open(\"{0}/vars\".format(params.output_dir), \"w\")\n    var_file.write(context_variables)\n    var_file.close()\n  elif params.mode == \"configs\":\n    create_config_files(params.output_dir, params)\n  else:\n    raise Exception(\"mode has to be in the set {contexts, configs}\")\n"
  },
  {
    "path": "egs/steps/nnet2/relabel_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Vimal Manohar. Apache 2.0.\n# This script, which will generally be called during the neural-net training\n# relabels existing examples with better labels obtained by realigning the data\n# with the current nnet model\n\n# Begin configuration section\ncmd=run.pl\nstage=0\nextra_egs=        # Names of additional egs files that need to relabelled\n                  # other than egs.*.*.ark, combine.egs, train_diagnostic.egs,\n                  # valid_diagnostic.egs\niter=final\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: steps/nnet2/relabel_egs.sh [opts] <ali-dir> <egs-in-dir> <egs-out-dir>\"\n  echo \"  e.g: steps/nnet2/relabel_egs.sh exp/tri6_nnet/ali_1.5 exp/tri6_nnet/egs exp/tri6_nnet/egs_1.5\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n\n  exit 1;\nfi\n\nalidir=$1\negs_in_dir=$2\ndir=$3\n\nmodel=$alidir/$iter.mdl\n\n# Check some files.\n\nfor f in $alidir/ali.1.gz $model $egs_in_dir/egs.1.0.ark $egs_in_dir/combine.egs \\\n  $egs_in_dir/valid_diagnostic.egs $egs_in_dir/train_diagnostic.egs \\\n  $egs_in_dir/num_jobs_nnet $egs_in_dir/iters_per_epoch $egs_in_dir/samples_per_iter; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnum_jobs_nnet=`cat $egs_in_dir/num_jobs_nnet`\niters_per_epoch=`cat $egs_in_dir/iters_per_epoch`\nsamples_per_iter_real=`cat $egs_in_dir/samples_per_iter`\nnum_jobs_align=`cat $alidir/num_jobs`\n\nmkdir -p $dir/log\n\necho $num_jobs_nnet > $dir/num_jobs_nnet\necho $iters_per_epoch > $dir/iters_per_epoch\necho $samples_per_iter_real > $dir/samples_per_iter\n\nalignments=$(for n in $(seq $num_jobs_align); do echo -n \"$alidir/ali.$n.gz \"; done)\n\nif [ $stage -le 0 ]; then\n  egs_in=\n  egs_out=\n  for x in `seq 1 $num_jobs_nnet`; do\n    for y in `seq 0 $[$iters_per_epoch-1]`; do\n      utils/create_data_link.pl $dir/egs.$x.$y.ark\n      if [ $x -eq 1 ]; then\n        egs_in=\"$egs_in ark:$egs_in_dir/egs.JOB.$y.ark \"\n        egs_out=\"$egs_out ark:$dir/egs.JOB.$y.ark \"\n      fi\n    done\n  done\n\n  $cmd JOB=1:$num_jobs_nnet $dir/log/relabel_egs.JOB.log \\\n    nnet-relabel-egs \"ark:gunzip -c $alignments | ali-to-pdf $model ark:- ark:- |\" \\\n    $egs_in $egs_out || exit 1\nfi\n\nif [ $stage -le 1 ]; then\n  egs_in=\n  egs_out=\n  for x in combine.egs valid_diagnostic.egs train_diagnostic.egs $extra_egs; do\n    utils/create_data_link.pl $dir/$x\n    egs_in=\"$egs_in ark:$egs_in_dir/$x\"\n    egs_out=\"$egs_out ark:$dir/$x\"\n  done\n\n  $cmd $dir/log/relabel_egs_extra.log \\\n    nnet-relabel-egs \"ark:gunzip -c $alignments | ali-to-pdf $model ark:- ark:- |\" \\\n    $egs_in $egs_out || exit 1\nfi\n\necho \"$0: Finished relabeling training examples\"\n"
  },
  {
    "path": "egs/steps/nnet2/relabel_egs2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Vimal Manohar.\n#           2014  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n#\n# This script, which will generally be called during the neural-net training \n# relabels existing examples with better labels obtained by realigning the data\n# with the current nnet model.\n# This script is as relabel_egs.sh, but is adapted to work with the newer\n# egs format that is written by get_egs2.sh\n\n# Begin configuration section\ncmd=run.pl\nstage=0\nextra_egs=        # Names of additional egs files that need to relabelled \n                  # other than egs.*.*.ark, combine.egs, train_diagnostic.egs,\n                  # valid_diagnostic.egs\niter=final\nparallel_opts=\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: steps/nnet2/relabel_egs.sh [opts] <ali-dir> <egs-in-dir> <egs-out-dir>\"\n  echo \"  e.g: steps/nnet2/relabel_egs.sh exp/tri6_nnet/ali_1.5 exp/tri6_nnet/egs exp/tri6_nnet/egs_1.5\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n\n  exit 1;\nfi\n\nalidir=$1\negs_in_dir=$2\ndir=$3\n\nmodel=$alidir/$iter.mdl\n\n# Check some files.\n\n[ -f $egs_in_dir/iters_per_epoch ] && \\\n  echo \"$0: this script does not work with the old egs directory format\" && exit 1;\n\nfor f in $alidir/ali.1.gz $model $egs_in_dir/egs.1.ark $egs_in_dir/combine.egs \\\n  $egs_in_dir/valid_diagnostic.egs $egs_in_dir/train_diagnostic.egs \\\n  $egs_in_dir/info/num_archives; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnum_archives=$(cat $egs_in_dir/info/num_archives) || exit 1;\nnum_jobs_align=$(cat $alidir/num_jobs) || exit 1;\n\nmkdir -p $dir/log\n\nmkdir -p $dir/info\ncp -r $egs_in_dir/info/*  $dir/info\n\nalignments=$(for n in $(seq $num_jobs_align); do echo $alidir/ali.$n.gz; done)\n\nif [ $stage -le 0 ]; then\n  for x in $(seq $num_archives); do\n    # if $dir/storage exists, make the soft links that we'll\n    # use to distribute the data across machines\n    utils/create_data_link.pl $dir/egs.$x.ark\n  done\n\n  $cmd $parallel_opts JOB=1:$num_archives $dir/log/relabel_egs.JOB.log \\\n    nnet-relabel-egs \"ark:gunzip -c $alignments | ali-to-pdf $model ark:- ark:- |\" \\\n     ark:$egs_in_dir/egs.JOB.ark ark:$dir/egs.JOB.ark || exit 1\nfi\n\nif [ $stage -le 1 ]; then\n  egs_in=\n  egs_out=\n  for x in combine.egs valid_diagnostic.egs train_diagnostic.egs $extra_egs; do\n    utils/create_data_link.pl $dir/$x\n    egs_in=\"$egs_in ark:$egs_in_dir/$x\"\n    egs_out=\"$egs_out ark:$dir/$x\"\n  done\n\n  $cmd $dir/log/relabel_egs_extra.log \\\n    nnet-relabel-egs \"ark:gunzip -c $alignments | ali-to-pdf $model ark:- ark:- |\" \\\n    $egs_in $egs_out || exit 1\nfi\n\necho \"$0: Finished relabeling training examples\"\n"
  },
  {
    "path": "egs/steps/nnet2/remove_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey).  \n# Apache 2.0.\n\n# This script removes the examples in an egs/ directory, e.g.\n# steps/nnet2/remove_egs.sh exp/nnet4b/egs/\n# We give it its own script because we need to be careful about\n# things that are soft links to something in storage/ (i.e. remove the\n# data that's linked to as well as the soft link), and we want to not\n# delete the examples if someone has done \"touch $dir/egs/.nodelete\".\n\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0 <egs-dir>\"\n  echo \"e.g.: $0 data/nnet4b/egs/\"\n  echo \"e.g.: $0 data/nnet4b_mpe/degs/\"\n  echo \"This script is usually equivalent to 'rm <egs-dir>/egs.* <egs-dir>/degs.*' but it follows\"\n  echo \"soft links to <egs-dir>/storage/; and it avoids deleting anything in the directory if\"\n  echo \"someone did 'touch <egs-dir>/.nodelete\"\n  exit 1;\nfi\n\negs=$1\n\nif [ ! -d $egs ]; then\n  echo \"$0: expected directory $egs to exist\"\n  exit 1;\nfi\n\nif [ -f $egs/.nodelete ]; then\n  echo \"$0: not deleting egs in $egs since $egs/.nodelete exists\"\n  exit 0;\nfi\n\n\n\nfor f in $egs/egs.*.ark $egs/degs.*.ark $egs/cegs.*.ark; do\n  if [ -L $f ]; then\n    rm $(dirname $f)/$(readlink $f)  # this will print a warning if it fails.\n  fi\n  rm $f 2>/dev/null\ndone\n\n\necho \"$0: Finished deleting examples in $egs\"\n"
  },
  {
    "path": "egs/steps/nnet2/retrain_fast.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# retrain_fast.sh is a neural net training script that's intended to train\n# a system on top of an already-trained neural network, whose activations have\n# been dumped to disk.  All it really is is training a neural network with\n# no hidden layers, so it's a simplified version of some of the other scripts.\n# There is no get_lda stage, as we don't support any pre-scaling of the inputs.\n# It uses the AffineComponentPreconditionedOnline components, which is why\n# we name it _fast.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=4       # Number of epochs during which we reduce\n                   # the learning rate; number of iterations is worked out from this.\nnum_epochs_extra=1 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=10 # Maximum number of final iterations to give to the\n                   # optimization over the validation set (maximum)\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each other's gradients).\n\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\n\nalpha=4.0   # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\n# this relates to perturbed training.\nmin_target_objf_change=0.1\ntarget_multiplier=0 #  Set this to e.g. 1.0 to enable perturbed training.\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\negs_opts=\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|4>                         # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|1>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --num-iters-final <#iters|10>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --num-utts-subset <#utts|300>                    # Number of utterances in subsets used for validation and diagnostics\"\n  echo \"                                                   # (the validation subset is held out from training)\"\n  echo \"  --num-frames-diagnostic <#frames|4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --stage <stage|-9>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh --feat-type raw --cmvn-opts \"--norm-means=false --norm-vars=false\" \\\n      --samples-per-iter $samples_per_iter --left-context 0 --right-context 0 \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! 
[ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  feat_dim=$(feat-to-dim scp:$data/feats.scp -) || exit 1;\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  cat >$dir/nnet.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$feat_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n\nfunction set_target_objf_change {\n  # nothing to do if $target_multiplier not set.\n  [ \"$target_multiplier\" == \"0\" -o \"$target_multiplier\" == \"0.0\" ] && return;\n  [ $x -le $finish_add_layers_iter ] && return;\n  wait=2  # the compute_prob_{train,valid} from 2 iterations ago should\n          # most likely be done even though we backgrounded them.\n  [ $[$x-$wait] -le 0 ] && return;\n  
while true; do\n    # Note: awk 'some-expression' is the same as: awk '{if(some-expression) print;}'\n    train_prob=$(awk '(NF == 1)' < $dir/log/compute_prob_train.$[$x-$wait].log)\n    valid_prob=$(awk '(NF == 1)' < $dir/log/compute_prob_valid.$[$x-$wait].log)\n    if [ -z \"$train_prob\" ] || [ -z \"$valid_prob\" ]; then\n      echo \"$0: waiting until $dir/log/compute_prob_{train,valid}.$[$x-$wait].log are done\"\n      sleep 60\n    else\n      target_objf_change=$(perl -e '($train,$valid,$min_change,$multiplier)=@ARGV; if (!($train < 0.0) || !($valid < 0.0)) { print \"0\\n\"; print STDERR \"Error: invalid train or valid prob: $train, $valid\\n\"; exit(0); } else { print STDERR \"train,valid=$train,$valid\\n\"; $proposed_target = $multiplier * ($train-$valid); if ($proposed_target < $min_change) { print \"0\"; } else { print $proposed_target; }}' -- \"$train_prob\" \"$valid_prob\" \"$min_target_objf_change\" \"$target_multiplier\")\n      echo \"On iter $x, (train,valid) probs from iter $[$x-$wait] were ($train_prob,$valid_prob), and setting target-objf-change to $target_objf_change.\"\n      return;\n    fi\n  done\n}\n\nmix_up_iter=$[$num_iters/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  
If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\nx=0\ntarget_objf_change=0 # relates to perturbed training.\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n          ark:$egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -eq 0 ]; then\n      # on iteration zero, use a smaller minibatch size and just one job: the\n      # model-averaging doesn't seem to be helpful when the model is changing\n      # too fast (i.e. it worsens the objective function), and the smaller\n      # minibatch size will help to keep the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    set_target_objf_change;  # only has effect if target_multiplier != 0\n    if [ \"$target_objf_change\" != \"0\" ]; then\n      [ ! 
-f $dir/within_covar.spmat ] && \\\n        echo \"$0: expected $dir/within_covar.spmat to exist.\" && exit 1;\n      perturb_suffix=\"-perturbed\"\n      perturb_opts=\"--target-objf-change=$target_objf_change --within-covar=$dir/within_covar.spmat\"\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n       nnet-train$parallel_suffix$perturb_suffix $parallel_train_opts $perturb_opts \\\n        --minibatch-size=$this_minibatch_size --srand=$x $dir/$x.mdl \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo 
Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -le $[$num_iters-$num_iters_final] ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/retrain_simple2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#                2013  Xiaohui Zhang\n#                2013  Guoguo Chen\n#                2014  Vimal Manohar\n# Apache 2.0.\n\n\n# retrain_simple2.sh is a script for training a single-layer (softmax-only)\n# neural network on top of activations dumped from an existing network; we'll\n# later combine the networks to a single network.\n\n# It differs from train_pnorm_simple2.sh in the same way that retrain_fast.sh\n# differs from train_pnorm_fast.sh.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=5      # Number of epochs of training;\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=4    # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\n\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random. 
 It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nstage=-4\n\n\nalpha=4.0 # relates to online preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\nmax_change_per_sample=0.075\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\negs_opts=\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_epochs=         # List of epochs, the beginning of which realignment is done\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|5>                         # Number of epochs of training\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|4>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --realign-epochs <list-of-epochs|\\\"\\\">           # A list of space-separated epoch indices the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify is gpu is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_epochs\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_epochs specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_epochs specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs2.sh\"\n  # set --frames-per-eg to 1 because there is no context, so there\n  # is no advantage in having multiple frames per eg.\n  steps/nnet2/get_egs2.sh --feat-type raw \\\n    --frames-per-eg 1 --left-context 0 --right-context 0 \\\n    --cmvn-opts \"--norm-means=false --norm-vars=false\" \\\n    --io-opts \"$io_opts\" \\\n    --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n    --cmd \"$cmd\" $egs_opts $data $alidir $dir/egs || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\nif [ $num_jobs_nnet -gt $num_archives_expanded ]; then\n  echo \"$0: --num-jobs-nnet cannot exceed num-archives*frames-per-eg which is $num_archives_expanded\"\n  echo \"$0: setting --num-jobs-nnet to $num_archives_expanded\"\n  num_jobs_nnet=$num_archives_expanded\nfi\n\nif [ $stage -le -2 ]; 
then\n  echo \"$0: initializing neural net\";\n  feat_dim=$(feat-to-dim scp:$data/feats.scp -) || exit 1;\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  cat >$dir/nnet.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$feat_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$num_jobs_nnet == $num_epochs*$num_archives_expanded\nnum_iters=$[($num_epochs*$num_archives_expanded)/$num_jobs_nnet]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nmix_up_iter=$[$num_iters/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  
If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch=$[$num_iters/$num_epochs]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch ]; then\n  num_models_combine=$approx_iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\nfor realign_epoch in $realign_epochs; do\n  # compare the equation below with the equation we use to set num_iters above.\n  # note, realign_epochs may be floating-point, which is why we don't use $[] to\n  # do the math.\n  realign_iter=$(perl -e 'print int(($ARGV[0]*$ARGV[1])/$ARGV[2]);' $realign_epoch $num_archives_expanded $num_jobs_nnet)\n  realign_this_iter[$realign_iter]=$realign_epoch\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      epoch=${realign_this_iter[$x]}\n\n\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$epoch || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$epoch \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl 
ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    mdl=$dir/$x.mdl\n\n    if [ $x -eq 0 ]; then\n      # on iteration zero, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $num_jobs_nnet); do\n        k=$[$x*$num_jobs_nnet + $n - 1]; # k is a zero-based index that we'll derive\n                                         # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so 
we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$this_minibatch_size --srand=$x $dir/$x.mdl \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ 
\"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm 
$dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/retrain_tanh.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script is for training networks with tanh nonlinearities; it starts with\n# a given model and supports increasing the hidden-layer dimension.  It is\n# otherwise similar to train_tanh.sh\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nsoftmax_learning_rate_factor=0.5 # Train this layer half as fast as the other layers.\n\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\n\nstage=-5\n\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n         # specified.)  
Will do this at the start.\nwiden=0 # If specified, it will increase the hidden-layer dimension \n                            # to this value.  Will do this at the start.\nbias_stddev=0.5 # will be used for widen\n\nnum_threads=16\nparallel_opts=\"--num-threads $num_threads\"  # using a smallish #threads by default, out of stability concerns.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=true\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <egs-dir> <old-nnet-dir> <exp-dir>\"\n  echo \" e.g.: $0 --widen 1024 exp/tri4_nnet/egs exp/tri4_nnet exp/tri5_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.02> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 
0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  Try a number several times #states.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16\\\">            # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --num-iters-final <#iters|10>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  exit 1;\nfi\n\negs_dir=$1\nnnet_dir=$2\ndir=$3\n\n# Check some files.\nfor f in $egs_dir/egs.1.0.ark $nnet_dir/final.mdl; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n\nmkdir -p $dir/log\n\ncp $nnet_dir/phones.txt $dir 2>/dev/null\n\ncp $nnet_dir/splice_opts $dir 2>/dev/null\ncp $nnet_dir/final.mat $dir 2>/dev/null # any LDA matrix...\ncp $nnet_dir/tree $dir\n\n\nif [ $stage -le -2 ] && [ $mix_up -gt 0 ]; then\n  echo Mixing up to $mix_up components\n  $cmd $dir/log/mix_up.log \\\n    nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n      $nnet_dir/final.mdl $dir/0.mdl || exit 1;\nelse \n  cp $nnet_dir/final.mdl $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ] && [ $widen -gt 0 ]; then\n  echo \"$0: Widening nnet to hidden-layer-dim=$widen\"\n  $cmd $dir/log/widen.log \\\n    nnet-am-widen --hidden-layer-dim=$widen $dir/0.mdl $dir/0.mdl || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n\n    echo \"Training neural net (pass $x)\"\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train-parallel --num-threads=$num_threads \\\n     
    --minibatch-size=$minibatch_size --srand=$x $dir/$x.mdl \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    softmax_learning_rate=`perl -e \"print $learning_rate * $softmax_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep AffineComponent | wc -l` # number of last AffineComponent layer [one-based]\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do \n      if [ $n -eq $na ]; then lr=$softmax_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n    \n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].mdl || exit 1;\n\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 
10 models.\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\n  num_iters_final=$num_iters_extra\nfi\nstart=$[$num_iters-$num_iters_final+1]\nnnets_list=\nfor x in `seq $start $num_iters`; do\n  nnets_list=\"$nnets_list $dir/$x.mdl\"\ndone\n\nif [ $stage -le $num_iters ]; then\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$num_threads-1)/$num_threads]\n  $cmd $parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --use-gpu=no --num-threads=$num_threads --verbose=3 --minibatch-size=$mb \\\n    $nnets_list ark:$egs_dir/combine.egs $dir/final.mdl || exit 1;\nfi\n\nsleep 2; # make sure final.mdl exists.\n\n# Compute the probability of the final, combined model with\n# the same subset we used for the previous compute_probs, as the\n# different subsets will lead to different probs.\n$cmd $dir/log/compute_prob_valid.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n$cmd $dir/log/compute_prob_train.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\n\necho Done\n\nif $cleanup; then\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%10] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then \n       # delete all but every 10th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_block.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# this is as train_tanh.sh but for on top of fbank feats-- we have block-diagonal\n# transforms for the first few layers, on separate frequency bands.\n# Otherwise it's tanh.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.0\nshrink_interval=5 # shrink every $shrink_interval iters except while we are\n                  # still adding layers, when we do it every iter.\nshrink=true\nnum_frames_shrink=2000 # note: must be <= --num-frames-diagnostic option to get_egs.sh, if\n                       # given.\nsoftmax_learning_rate_factor=0.5 # Train this layer half as fast as the other layers.\n\nhidden_layer_dim=300 #  You may want this larger, e.g. 1024 or 2048.\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\n\nnum_block_layers=2\nnum_normal_layers=2\nblock_size=10\nblock_shift=5\n\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\nsplice_width=7 # meaning +- 7 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0\nmax_change=10.0\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=true\negs_dir=\nlda_opts=\negs_opts=\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.02> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --initial-num-hidden-layers <#hidden-layers|1>   # Number of hidden layers to start with.\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 10\\\">                      # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|10>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --num-utts-subset <#utts|300>                    # Number of utterances in subsets used for validation and diagnostics\"\n  echo \"                                                   # (the validation subset is held out from training)\"\n  echo \"  --num-frames-diagnostic <#frames|4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames|10000>       # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|-9>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/valid_uttlist || exit 1;\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n     head -$num_utts_subset > $dir/train_subset_uttlist || exit 1;\n\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda_block.sh --block-size $block_size --block-shift $block_shift \\\n    $lda_opts --splice-width $splice_width --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda_block.sh\nfeat_dim=`cat $dir/feat_dim` || exit 1;\nlda_dim=`cat $dir/lda_dim` || exit 1;\nnum_blocks=`cat $dir/num_blocks` || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh --io-opts \"$io_opts\" --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet \\\n      --splice-width $splice_width --stage $get_egs_stage --cmd \"$cmd\" $egs_opts --feat-type raw \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! 
[ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet`\n\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  hidden_block_size=`perl -e \"print int(sqrt(($hidden_layer_dim*$hidden_layer_dim)/$num_blocks));\"`\n  echo \"Hidden block size is $hidden_block_size\"\n  hidden_block_dim=$[$hidden_block_size*$num_blocks]\n  block_stddev=`perl -e \"print 1.0/sqrt($block_size);\"`\n  hidden_block_stddev=`perl -e \"print 1.0/sqrt($hidden_block_size);\"`\n  first_hidden_layer_stddev=`perl -e \"print 1.0/sqrt($hidden_block_dim);\"`\n  stddev=`perl -e \"print 1.0/sqrt($hidden_layer_dim);\"`\n\n\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$feat_dim left-context=$splice_width right-context=$splice_width\nFixedAffineComponent matrix=$dir/lda.mat\nBlockAffineComponentPreconditioned input-dim=$lda_dim output-dim=$hidden_block_dim alpha=$alpha learning-rate=$initial_learning_rate num-blocks=$num_blocks param-stddev=$block_stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_block_dim\nEOF\n  for n in `seq 2 $num_block_layers`; do\n    cat >>$dir/nnet.config <<EOF\nBlockAffineComponentPreconditioned input-dim=$hidden_block_dim output-dim=$hidden_block_dim alpha=$alpha num-blocks=$num_blocks learning-rate=$initial_learning_rate param-stddev=$hidden_block_stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_block_dim\nEOF\n  done\n  cat >>$dir/nnet.config <<EOF\nAffineComponentPreconditioned input-dim=$hidden_block_dim output-dim=$hidden_layer_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$first_hidden_layer_stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nEOF\n  for n in `seq 2 $num_normal_layers`; do\n  cat >>$dir/nnet.config <<EOF\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$hidden_layer_dim alpha=$alpha 
max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nEOF\n  done\n\n  cat >>$dir/nnet.config <<EOF\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$num_leaves alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nmix_up_iter=$[$num_iters/2]\n\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n    mdl=$dir/$x.mdl\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix \\\n         --minibatch-size=$minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    softmax_learning_rate=`perl -e \"print $learning_rate * $softmax_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep -v Fixed | grep AffineComponent | wc -l`\n    # na is number of last updatable AffineComponent layer [one-based, counting only\n    # updatable components.]\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do\n      if [ $n -eq $na ] || [ $n -eq $[$na-1] ]; then lr=$softmax_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].mdl || exit 1;\n\n    if $shrink && [ $[$x % $shrink_interval] -eq 0 ]; then\n      mb=$[($num_frames_shrink+$num_threads-1)/$num_threads]\n      $cmd $parallel_opts $dir/log/shrink.$x.log \\\n        nnet-subset-egs --n=$num_frames_shrink --randomize-order=true 
--srand=$x \\\n          ark:$egs_dir/train_diagnostic.egs ark:-  \\| \\\n        nnet-combine-fast --num-threads=$num_threads --verbose=3 --minibatch-size=$mb \\\n          $dir/$[$x+1].mdl ark:- $dir/$[$x+1].mdl || exit 1;\n    else\n      # On other iters, do nnet-am-fix which is much faster and has roughly\n      # the same effect.\n      nnet-am-fix $dir/$[$x+1].mdl $dir/$[$x+1].mdl 2>$dir/log/fix.$x.log\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  this_num_threads=$num_threads\n  [ $this_num_threads -lt 8 ] && this_num_threads=8\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  $cmd $parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --use-gpu=no --num-threads=$this_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 
1;\nfi\n\n# Compute the probability of the final, combined model with\n# the same subset we used for the previous compute_probs, as the\n# different subsets will lead to different probs.\n$cmd $dir/log/compute_prob_valid.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n$cmd $dir/log/compute_prob_train.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%10] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 10th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_convnet_accel2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#                2013  Xiaohui Zhang\n#                2013  Guoguo Chen\n#                2014  Vimal Manohar\n#                2015  Xingyu Na\n# Apache 2.0.\n\n# train_convnet_accel2.sh is modified from train_pnorm_accel2.sh. It propotypes\n# the training of a ConvNet. The ConvNet is composed of 4 hidden layers. The first layer\n# is a Convolutional1d component plus a Maxpooling component. The second layer\n# is a single Convolutional1d component. The third and fourth layers are affine\n# components with ReLU nonlinearities. Due to non-squashing output, normalize\n# component is applied to all four layers. The number of hidden layers is hard\n# coded now.\n\n# train_pnorm_accel2.sh is a modified form of train_pnorm_simple2.sh (the \"2\"\n# suffix is because they both use the the \"new\" egs format, created by\n# get_egs2.sh).  The \"accel\" part of the name refers to the fact that this\n# script uses a number of jobs that can increase during training.  You can\n# specify --initial-num-jobs and --final-num-jobs to control these separately.\n# Also, in this script, the learning rates specified by --initial-learning-rate\n# and --final-learning-rate are the \"effective learning rates\" (defined as the\n# learning rate divided by the number of jobs), and the actual learning rates\n# used will be the specified learning rates multiplied by the current number\n# of jobs.  
You'll want to set these lower than you normally would previously\n# have set the learning rates, by a factor equal to the (previous) number of\n# jobs.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\nbias_stddev=0.5\nhidden_dim=3000\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_initial=1    # Number of neural net jobs to run in parallel at the start of training.\nnum_jobs_final=8      # Number of jobs to run in parallel at the end of training.\n\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\nonline_ivector_dir=\n\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\nnum_hidden_layers=4\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nstage=-3\n\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nleft_context= # if set, overrides splice-width\nright_context= # if set, overrides splice-width.\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nnum_filters1=128      # number of filters in the first convolutional layer\npatch_step1=1         # patch step of the first convolutional layer\npatch_dim1=7          # dim of convolutional kernel in the first layer\npool_size=3           # size of pooling after the first convolutional layer\nnum_filters2=256      # number of filters in the second convolutional layer\npatch_dim2=4          # dim of convolutional kernel in the second layer\npatch_step2=1         # patch step of the second convolutional layer\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" 
stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ndelta_order=\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\ntransform_dir=     # If supplied, overrides alidir\npostdir=\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # List of times on which we realign.  Each time is\n                        # floating point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\nsrand=0 # random seed used to initialize the nnet\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.01> # effective learning rate at start of training,\"\n  echo \"                                         # actual learning-rate is this times num-jobs.\"\n  echo \"  --final-effective-lrate <lrate|0.001>   # effective learning rate at end of training.\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  Try a number several times #states.\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. 
queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... note, you might have to reduce --mem\"\n  echo \"                                                   # versus your defaults, because it gets multiplied by the --num-threads argument.\"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --realign-times <list-of-times|\\\"\\\">             # A list of space-separated times, each strictly between 0 and 1\"\n  echo \"                                                   # (as a fraction of the iterations), at which realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether the GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-3>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"ConvNet configurations\"\n  echo \"  --num-filters1 <num-filters1|128>                # number of filters in the first convolutional 
layer.\"\n  echo \"  --patch-step1 <patch-step1|1>                    # patch step of the first convolutional layer.\"\n  echo \"  --patch-dim1 <patch-dim1|7>                      # dim of convolutional kernel in the first layer.\"\n  echo \"                                                   # (note: (feat-dim - patch-dim1) % patch-step1 should be 0.)\"\n  echo \"  --pool-size <pool-size|3>                        # size of pooling after the first convolutional layer.\"\n  echo \"                                                   # (note: (feat-dim - patch-dim1 + 1) % pool-size should be 0.)\"\n  echo \"  --num-filters2 <num-filters2|256>                # number of filters in the second convolutional layer.\"\n  echo \"  --patch-dim2 <patch-dim2|4>                      # dim of convolutional kernel in the second layer.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n[ ! -f $postdir/post.1.scp ] && [ ! 
-f $alidir/ali.1.gz ] && echo \"$0: no (soft) alignments provided\" && exit 1;\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! -z \"$delta_order\" ] && extra_opts+=(--delta-order $delta_order)\n[ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n[ -z \"$left_context\" ] && left_context=$splice_width\n[ -z \"$right_context\" ] && right_context=$splice_width\nextra_opts+=(--left-context $left_context --right-context $right_context)\n\nfeat-to-dim scp:$sdata/1/feats.scp - > $dir/feat_dim\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs2.sh\"\n  steps/nnet2/get_egs2.sh $egs_opts \"${extra_opts[@]}\"  --io-opts \"$io_opts\" \\\n    --postdir \"$postdir\" --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n    --cmd \"$cmd\" --feat-type \"raw\" $data $alidir $dir/egs || exit 1;\nfi\n\nif [ -f $dir/egs/cmvn_opts ]; then\n  cp $dir/egs/cmvn_opts $dir\nfi\n\nif [ -f $dir/egs/delta_order ]; then\n  cp $dir/egs/delta_order $dir\nfi\n\nif [ -z $egs_dir ]; then\n  
egs_dir=$dir/egs\nfi\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --num-jobs-initial cannot exceed --num-jobs-final\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --num-jobs-final cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\nif ! [ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  tot_splice=$[($delta_order+1)*($left_context+1+$right_context)]\n  delta_feat_dim=$[($delta_order+1)*$feat_dim]\n  tot_input_dim=$[$feat_dim*$tot_splice]\n  num_patch1=$[1+($feat_dim-$patch_dim1)/$patch_step1]\n  num_pool=$[$num_patch1/$pool_size]\n  patch_stride2=$num_pool\n  num_patch2=$[1+($patch_stride2-$patch_dim2)/$patch_step2]\n  conv_out_dim1=$[$num_filters1*$num_patch1] # 128 x (36 - 7 + 1)\n  pool_out_dim=$[$num_filters1*$num_pool]\n  conv_out_dim2=$[$num_filters2*$num_patch2]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  initial_lrate=$(perl -e \"print ($initial_effective_lrate*$num_jobs_initial);\")\n  stddev=`perl -e \"print 1.0/sqrt($hidden_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$delta_feat_dim left-context=$left_context right-context=$right_context\nConvolutional1dComponent input-dim=$tot_input_dim output-dim=$conv_out_dim1 learning-rate=$initial_lrate 
param-stddev=$stddev bias-stddev=$bias_stddev patch-dim=$patch_dim1 patch-step=$patch_step1 patch-stride=$feat_dim\nMaxpoolingComponent input-dim=$conv_out_dim1 output-dim=$pool_out_dim pool-size=$pool_size pool-stride=$num_filters1\nNormalizeComponent dim=$pool_out_dim\nAffineComponentPreconditionedOnline input-dim=$pool_out_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  cat >$dir/replace.1.config <<EOF\nConvolutional1dComponent input-dim=$pool_out_dim output-dim=$conv_out_dim2 learning-rate=$initial_lrate param-stddev=$stddev bias-stddev=$bias_stddev patch-dim=$patch_dim2 patch-step=$patch_step2 patch-stride=$patch_stride2 appended-conv=true\nNormalizeComponent dim=$conv_out_dim2\nAffineComponentPreconditionedOnline input-dim=$conv_out_dim2 output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  cat >$dir/replace.2.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$conv_out_dim2 output-dim=$hidden_dim $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=$stddev bias-stddev=$bias_stddev\nRectifiedLinearComponent dim=$hidden_dim\nNormalizeComponent dim=$hidden_dim\nAffineComponentPreconditionedOnline input-dim=$hidden_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/replace.3.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$hidden_dim output-dim=$hidden_dim $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=$stddev bias-stddev=$bias_stddev\nRectifiedLinearComponent dim=$hidden_dim\nNormalizeComponent 
dim=$hidden_dim\nAffineComponentPreconditionedOnline input-dim=$hidden_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init --srand=$srand $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\n! [ $num_iters -gt $[$finish_add_layers_iter+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\n# mix up at the iteration where we've processed about half the data; this keeps\n# the overall training procedure fairly invariant to the number of initial and\n# final jobs.\n# j = initial, k = final, n = num-iters, x = half-of-data epoch,\n# p is proportion of data we want to process (e.g. p=0.5 here).\n# solve for x if the amount of data processed by epoch x is p\n# times the amount by iteration n.\n# put this in wolfram alpha:\n# solve { x*j + (k-j)*x*x/(2*n) = p * (j*n + (k-j)*n/2), {x} }\n# got: x = (j n-sqrt(-n^2 (j^2 (p-1)-k^2 p)))/(j-k) and j!=k and n!=0\n# simplified manually to: n * (sqrt(((1-p)j^2 + p k^2)/2) - j)/(j-k)\nmix_up_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? 
$n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters 0.5)\n! [ $mix_up_iter -gt $finish_add_layers_iter ] && \\\n  echo \"Mix-up-iter is $mix_up_iter, should be greater than $finish_add_layers_iter -> add more epochs?\" \\\n  && exit 1;\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  This equals\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch_final ]; then\n  num_models_combine=$approx_iters_per_epoch_final\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 
0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\";\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\nnum_hid_added=1\nwhile [ $x -lt $num_iters ]; do\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e  \"print (($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  # TODO: remove this line.\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl 
ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      [ ! -f $dir/$x.mdl ] && sleep 10;\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just added a layer, don't do averaging; take the best.\n      mdl=\"nnet-init --srand=$x $dir/replace.$num_hid_added.config - | nnet-replace-last-layers $dir/$x.mdl - - | nnet-am-copy --learning-rate=$this_learning_rate - -|\"\n      num_hid_added=$[$num_hid_added+1]\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      mdl=\"nnet-am-copy --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and keep only the best job's model: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                         # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    if 
$do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list $dir/$[$x+1].mdl ||  exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      cp $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  
In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # ReLU layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors 
$dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_discriminative.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does MPE or MMI or state-level minimum bayes risk (sMBR) training\n# of neural nets. \n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=4       # Number of epochs of training\nlearning_rate=0.00002\neffective_lrate=    # If supplied, overrides the learning rate, which gets set to effective_lrate * num_jobs_nnet.\nacoustic_scale=0.1  # acoustic scale for MMI/MPFE/SMBR training.\ncriterion=smbr\nboost=0.0       # option relevant for MMI\ndrop_frames=false #  option relevant for MMI\none_silence_class=true # Option relevant for MPE/SMBR\nnum_jobs_nnet=4    # Number of neural net jobs to run in parallel.  Note: this\n                   # will interact with the learning rates (if you decrease\n                   # this, you'll have to decrease the learning rate, and vice\n                   # versa).\nsamples_per_iter=400000 # measured in frames, not in \"examples\"\n\nmodify_learning_rates=true\nlast_layer_factor=1.0  # relates to modify-learning-rates\nfirst_layer_factor=1.0 # relates to modify-learning-rates\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\n\nstage=-8\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\n\nnum_threads=16  # this is the default but you may want to change it, e.g. 
to 1 if\n                # using GPUs.\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ntransform_dir= # If this is a SAT system, directory for transforms\ncleanup=true\ndegs_dir=\nretroactive=false\nonline_ivector_dir=\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <denlat-dir> <src-model-file> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet_denlats exp/tri4/final.mdl exp/tri4_mpe\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|4>                        # Number of epochs of training\"\n  echo \"  --learning-rate <learning-rate|0.00002>          # Learning rate to use\"\n  echo \"  --effective-lrate <effective-learning-rate>      # If supplied, learning rate will be set to\"\n  echo \"                                                   # this value times num-jobs-nnet.\"\n  echo \"  --num-jobs-nnet <num-jobs|4>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                  
                                 # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --stage <stage|-8>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --criterion <criterion|smbr>                     # Training criterion: may be smbr, mmi or mpfe\"\n  echo \"  --boost <boost|0.0>                              # Boosting factor for MMI (e.g., 0.1)\"\n  echo \"  --modify-learning-rates <true,false|true>        # If true, modify learning rates to try to equalize relative\"\n  echo \"                                                   # changes across layers.\"\n  echo \"  --degs-dir <dir|\"\">                              # Directory for discriminative examples, e.g. 
exp/foo/degs\"\n  echo \"  --drop-frames <true,false|false>                 # Option that affects MMI training: if true, we exclude gradients from frames\"\n  echo \"                                                   # where the numerator transition-id is not in the denominator lattice.\"\n  echo \"  --one-silence-class <true,false|true>            # Option that affects MPE/SMBR training (will tend to reduce insertions)\"\n  echo \"  --online-ivector-dir <dir|\"\">                    # Directory for online-estimated iVectors, used in the\"\n  echo \"                                                   # online-neural-net setup.\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\nsrc_model=$5\ndir=$6\n\n\nextra_files=\n[ ! -z \"$online_ivector_dir\" ] && \\\n extra_files=\"$online_ivector_dir/ivector_period $online_ivector_dir/ivector_online.scp\"\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/num_jobs $alidir/tree \\\n         $denlatdir/lat.1.gz $denlatdir/num_jobs $src_model $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=$(cat $alidir/num_jobs) || exit 1; # caution: $nj is the number of\n                                      # splits of the denlats and alignments, but\n                                      # num_jobs_nnet is the number of nnet training\n                                      # jobs we run in parallel.\nif ! 
[ $nj == $(cat $denlatdir/num_jobs) ]; then\n  echo \"Number of jobs mismatch: $nj versus $(cat $denlatdir/num_jobs)\"\n  exit 1;\nfi\n\nmkdir -p $dir/log || exit 1;\n[ -z \"$degs_dir\" ] && mkdir -p $dir/degs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\n# function to remove egs that might be soft links.\nremove () { for x in $*; do [ -L $x ] && rm $(utils/make_absolute.sh $x); rm $x; done }\n\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null\ncp $alidir/tree $dir\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period)\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  # the 'const_dim_opt' allows it to write only one iVector per example,\n  # rather than one per time-index... it has to average over\n  const_dim_opt=\"--const-feat-dim=$ivector_dim\"\nfi\n\n## Set up features.\n## Don't support deltas, only LDA or raw (mainly because deltas are less frequently used).\nif [ -z $feat_type ]; then\n  if [ -f $alidir/final.mat ] && [ ! 
-f $transform_dir/raw_trans.1 ]; then feat_type=lda; else feat_type=raw; fi\nfi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  raw) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n   ;;\n  lda) \n    splice_opts=`cat $alidir/splice_opts 2>/dev/null`\n    cp $alidir/final.mat $dir    \n    feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ -z \"$transform_dir\" ]; then\n  # No --transform-dir given: fall back to the alignment dir if it has transforms.\n  if [ -f $alidir/trans.1 ] || [ -f $alidir/raw_trans.1 ]; then\n    transform_dir=$alidir\n  fi\nfi\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -s $transform_dir/num_jobs ] && \\\n    echo \"$0: expected $transform_dir/num_jobs to contain the number of jobs.\" && exit 1;\n  nj_orig=$(cat $transform_dir/num_jobs)\n  \n  if [ $feat_type == \"raw\" ]; then trans=raw_trans;\n  else trans=trans; fi\n  if [ $feat_type == \"lda\" ] && ! cmp $transform_dir/final.mat $alidir/final.mat; then\n    echo \"$0: LDA transforms differ between $alidir and $transform_dir\"\n    exit 1;\n  fi\n  if [ ! 
-f $transform_dir/$trans.1 ]; then\n    echo \"$0: expected $transform_dir/$trans.1 to exist (--transform-dir option)\"\n    exit 1;\n  fi\n  if [ $nj -ne $nj_orig ]; then\n    # Copy the transforms into an archive with an index.\n    for n in $(seq $nj_orig); do cat $transform_dir/$trans.$n; done | \\\n       copy-feats ark:- ark,scp:$dir/$trans.ark,$dir/$trans.scp || exit 1;\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk scp:$dir/$trans.scp ark:- ark:- |\"\n  else\n    # number of jobs matches with alignment dir.\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$transform_dir/$trans.JOB ark:- ark:- |\"\n  fi\nfi\nif [ ! -z $online_ivector_dir ]; then\n  # add iVectors to the features.\n  feats=\"$feats paste-feats --length-tolerance=$ivector_period ark:- 'ark,s,cs:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp | subsample-feats --n=-$ivector_period scp:- ark:- |' ark:- |\"\nfi\n\n\nif [ -z \"$degs_dir\" ]; then\n  if [ $stage -le -8 ]; then\n    echo \"$0: working out number of frames of training data\"\n    num_frames=$(steps/nnet2/get_num_frames.sh $data)\n    echo $num_frames > $dir/num_frames\n    # Working out number of iterations per epoch.\n    iters_per_epoch=`perl -e \"print int($num_frames/($samples_per_iter * $num_jobs_nnet) + 0.5);\"` || exit 1;\n    [ $iters_per_epoch -eq 0 ] && iters_per_epoch=1\n    echo $iters_per_epoch > $dir/degs/iters_per_epoch  || exit 1;\n  else\n    num_frames=$(cat $dir/num_frames) || exit 1;\n    iters_per_epoch=$(cat $dir/degs/iters_per_epoch) || exit 1;\n  fi\n\n  samples_per_iter_real=$[$num_frames/($num_jobs_nnet*$iters_per_epoch)]\n  echo \"$0: Every epoch, splitting the data up into $iters_per_epoch iterations,\"\n  echo \"$0: giving samples-per-iteration of $samples_per_iter_real (you requested $samples_per_iter).\"\nelse\n  iters_per_epoch=$(cat $degs_dir/iters_per_epoch) || exit 1;\n  [ -z \"$iters_per_epoch\" ] && exit 1;\n  echo \"$0: 
Every epoch, splitting the data up into $iters_per_epoch iterations\"\nfi\n\n\n# we create these data links regardless of the stage, as there are situations where we\n# would want to recreate a data link that had previously been deleted.\nif [ -z \"$degs_dir\" ] && [ -d $dir/degs/storage ]; then\n  echo \"$0: creating data links for distributed storage of degs\"\n    # See utils/create_split_dir.pl for how this 'storage' directory\n    # is created.\n  for x in $(seq $num_jobs_nnet); do\n    for y in $(seq $nj); do\n      utils/create_data_link.pl $dir/degs/degs_orig.$x.$y.ark\n    done\n    for z in $(seq 0 $[$iters_per_epoch-1]); do\n      utils/create_data_link.pl $dir/degs/degs_tmp.$x.$z.ark\n      utils/create_data_link.pl $dir/degs/degs.$x.$z.ark\n    done\n  done\nfi\n\n\n\nif [ $stage -le -7 ]; then\n  echo \"$0: Copying initial model and modifying preconditioning setup\"\n\n  # Note, the baseline model probably had preconditioning, and we'll keep it;\n  # but we want online preconditioning with a larger number of samples of\n  # history, since in this setup the frames are only randomized at the segment\n  # level so they are highly correlated.  It might make sense to tune this a\n  # little, later on, although I doubt it matters once the --num-samples-history\n  # is large enough.\n\n  if [ ! 
-z \"$effective_lrate\" ]; then\n    learning_rate=$(perl -e \"print ($num_jobs_nnet*$effective_lrate);\")\n    echo \"$0: setting learning rate to $learning_rate = --num-jobs-nnet * --effective-lrate.\"\n  fi\n  $cmd $dir/log/convert.log \\\n    nnet-am-copy --learning-rate=$learning_rate \"$src_model\" - \\| \\\n    nnet-am-switch-preconditioning  --num-samples-history=50000 - $dir/0.mdl || exit 1;\nfi\n\n\n\n\nif [ $stage -le -6 ] && [ -z \"$degs_dir\" ]; then\n  echo \"$0: getting initial training examples by splitting lattices\"\n\n  egs_list=\n  for n in `seq 1 $num_jobs_nnet`; do\n    egs_list=\"$egs_list ark:$dir/degs/degs_orig.$n.JOB.ark\"\n  done\n\n\n  $cmd $io_opts JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet-get-egs-discriminative --criterion=$criterion --drop-frames=$drop_frames \\\n      $dir/0.mdl \"$feats\" \\\n    \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz |\" \\\n    \"ark,s,cs:gunzip -c $denlatdir/lat.JOB.gz|\" ark:- \\| \\\n    nnet-copy-egs-discriminative $const_dim_opt ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le -5 ] && [ -z \"$degs_dir\" ]; then\n  echo \"$0: rearranging examples into parts for different parallel jobs\"\n\n  # combine all the \"egs_orig.JOB.*.scp\" (over the $nj splits of the data) and\n  # then split into multiple parts egs.JOB.*.scp for different parts of the\n  # data, 0 .. 
$iters_per_epoch-1.\n\n  if [ $iters_per_epoch -eq 1 ]; then\n    echo \"Since iters-per-epoch == 1, just concatenating the data.\"\n    for n in `seq 1 $num_jobs_nnet`; do\n      cat $dir/degs/degs_orig.$n.*.ark > $dir/degs/degs_tmp.$n.0.ark || exit 1;\n      remove $dir/degs/degs_orig.$n.*.ark  # don't \"|| exit 1\", due to NFS bugs...\n    done\n  else # We'll have to split it up using nnet-copy-egs-discriminative.\n    egs_list=\n    for n in `seq 0 $[$iters_per_epoch-1]`; do\n      egs_list=\"$egs_list ark:$dir/degs/degs_tmp.JOB.$n.ark\"\n    done\n    $cmd $io_opts JOB=1:$num_jobs_nnet $dir/log/split_egs.JOB.log \\\n      nnet-copy-egs-discriminative --srand=JOB \\\n        \"ark:cat $dir/degs/degs_orig.JOB.*.ark|\" $egs_list || exit 1;\n    remove $dir/degs/degs_orig.*.*.ark\n  fi\nfi\n\n\nif [ $stage -le -4 ] && [ -z \"$degs_dir\" ]; then\n  # Next, shuffle the order of the examples in each of those files.\n  # Each one should not be too large, so we can do this in memory.\n  # Then combine the examples together to form suitable-size minibatches\n  # (for discriminative examples, it's one example per minibatch, so we\n  # have to combine the lattices).\n  echo \"Shuffling the order of training examples\"\n  echo \"(in order to avoid stressing the disk, these won't all run at once).\"\n\n  # note, the \"|| true\" below is a workaround for NFS bugs\n  # we encountered running this script with Debian-7, NFS-v4.\n  # Also, we should note that we used to do nnet-combine-egs-discriminative\n  # at this stage, but if iVectors are used this would expand the size of\n  # the examples on disk (because they could no longer be stored in the spk_info\n  # variable of the discriminative example, no longer being constant), so\n  # now we do the nnet-combine-egs-discriminative operation on the fly during\n  # training.\n  for n in `seq 0 $[$iters_per_epoch-1]`; do\n    $cmd $io_opts JOB=1:$num_jobs_nnet $dir/log/shuffle.$n.JOB.log \\\n      nnet-shuffle-egs-discriminative 
\"--srand=\\$[JOB+($num_jobs_nnet*$n)]\" \\\n      ark:$dir/degs/degs_tmp.JOB.$n.ark ark:$dir/degs/degs.JOB.$n.ark || exit 1;\n    remove $dir/degs/degs_tmp.*.$n.ark\n  done\nfi\n\nif [ -z \"$degs_dir\" ]; then\n  degs_dir=$dir/degs\nfi\n\nnum_iters=$[$num_epochs * $iters_per_epoch];\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif [ $num_threads -eq 1 ]; then\n train_suffix=\"-simple\" # this enables us to use GPU code if\n                        # we have just one thread.\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\n\nx=0   \nwhile [ $x -lt $num_iters ]; do\n  if [ $stage -le $x ]; then\n    \n    echo \"Training neural net (pass $x)\"\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-train-discriminative$train_suffix --silence-phones=$silphonelist \\\n       --criterion=$criterion --drop-frames=$drop_frames \\\n       --one-silence-class=$one_silence_class --boost=$boost \\\n       --acoustic-scale=$acoustic_scale $dir/$x.mdl \\\n       \"ark,bg:nnet-combine-egs-discriminative ark:$degs_dir/degs.JOB.$[$x%$iters_per_epoch].ark ark:- |\" \\\n        $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=$(for n in $(seq $num_jobs_nnet); do echo $dir/$[$x+1].$n.mdl; done)\n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list $dir/$[$x+1].mdl || exit 1;\n\n    if $modify_learning_rates; then\n      $cmd $dir/log/modify_learning_rates.$x.log \\\n        nnet-modify-learning-rates --retroactive=$retroactive \\\n        --last-layer-factor=$last_layer_factor \\\n        --first-layer-factor=$first_layer_factor \\\n        $dir/$x.mdl $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n\n  x=$[$x+1]\ndone\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n\n  echo Removing training examples\n  if [ -d $dir/degs ] && [ ! 
-L $dir/degs ]; then # only remove if directory is not a soft link.\n    remove $dir/degs/degs.*\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%$iters_per_epoch] -ne 0 ]; then\n      # delete all but the epoch-final models.\n      rm $dir/$x.mdl 2>/dev/null\n    fi\n  done\nfi\n\nfor n in $(seq 0 $num_epochs); do\n  x=$[$n*$iters_per_epoch]\n  rm $dir/epoch$n.mdl 2>/dev/null\n  ln -s $x.mdl $dir/epoch$n.mdl\ndone\n"
  },
  {
    "path": "egs/steps/nnet2/train_discriminative2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does MPE or MMI or state-level minimum bayes risk (sMBR) training.\n# This version (2) of the script uses a newer format for the discriminative-training\n# egs, as obtained by steps/nnet2/get_egs_discriminative2.sh.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=4       # Number of epochs of training\nlearning_rate=0.00002\neffective_lrate=    # If supplied, overrides the learning rate, which gets set to effective_lrate * num_jobs_nnet.\nacoustic_scale=0.1  # acoustic scale for MMI/MPFE/SMBR training.\nboost=0.0       # option relevant for MMI\n\ncriterion=smbr\ndrop_frames=false #  option relevant for MMI\none_silence_class=true # option relevant for MPE/SMBR\nnum_jobs_nnet=4    # Number of neural net jobs to run in parallel.  Note: this\n                   # will interact with the learning rates (if you decrease\n                   # this, you'll have to decrease the learning rate, and vice\n                   # versa).\n\nmodify_learning_rates=true\nlast_layer_factor=1.0  # relates to modify-learning-rates\nfirst_layer_factor=1.0 # relates to modify-learning-rates\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\n\nstage=-3\n\nadjust_priors=false\nnum_threads=16  # this is the default but you may want to change it, e.g. 
to 1 if\n                # using GPUs.\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\n\ncleanup=true\nretroactive=false\nremove_egs=false\nsrc_model=  # will default to $degs_dir/final.mdl\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [opts] <degs-dir> <exp-dir>\"\n  echo \" e.g.: $0 exp/tri4_mpe_degs exp/tri4_mpe\"\n  echo \"\"\n  echo \"You have to first call get_egs_discriminative2.sh to dump the egs.\"\n  echo \"Caution: the options 'drop-frames' and 'criterion' are taken here\"\n  echo \"even though they were required also by get_egs_discriminative2.sh,\"\n  echo \"and they should normally match.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|4>                        # Number of epochs of training\"\n  echo \"  --learning-rate <learning-rate|0.00002>          # Learning rate to use\"\n  echo \"  --effective-lrate <effective-learning-rate>      # If supplied, learning rate will be set to\"\n  echo \"                                                   # this value times num-jobs-nnet.\"\n  echo \"  --num-jobs-nnet <num-jobs|4>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning 
rate.  Also note: if there are fewer archives\"\n  echo \"                                                   # of egs than this, it will get reduced automatically.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.  With GPU, must be 1.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --stage <stage|-3>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --criterion <criterion|smbr>                     # Training criterion: may be smbr, mmi or mpfe\"\n  echo \"  --boost <boost|0.0>                              # Boosting factor for MMI (e.g., 0.1)\"\n  echo \"  --drop-frames <true,false|false>                 # Option that affects MMI training: if true, we exclude gradients from frames\"\n  echo \"                                                   # where the numerator transition-id is not in the denominator lattice.\"\n  echo \"  --one-silence-class <true,false|true>            # Option that affects MPE/SMBR training (will tend to reduce insertions)\"\n  echo \"  --modify-learning-rates <true,false|true>        # If true, modify learning rates to try to equalize relative\"\n  echo \"                                                   # changes across layers.\"\n  exit 1;\nfi\n\ndegs_dir=$1\ndir=$2\n\n[ -z \"$src_model\" ] && src_model=$degs_dir/final.mdl\n\n# Check some files.\nfor f in $degs_dir/degs.1.ark 
$degs_dir/info/{num_archives,silence.csl,frames_per_archive} $src_model; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log || exit 1;\n\ncp $degs_dir/phones.txt $dir 2>/dev/null\n# copy some things\nfor f in splice_opts cmvn_opts tree final.mat; do\n  if [ -f $degs_dir/$f ]; then\n    cp $degs_dir/$f $dir/ || exit 1;\n  fi\ndone\n\nsilphonelist=`cat $degs_dir/info/silence.csl` || exit 1;\n\n\nnum_archives=$(cat $degs_dir/info/num_archives) || exit 1;\n\nif [ $num_jobs_nnet -gt $num_archives ]; then\n  echo \"$0: num-jobs-nnet $num_jobs_nnet exceeds number of archives $num_archives,\"\n  echo \" ... setting it to $num_archives.\"\n  num_jobs_nnet=$num_archives\nfi\n\nnum_iters=$[($num_epochs*$num_archives)/$num_jobs_nnet]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nfor e in $(seq 1 $num_epochs); do\n  x=$[($e*$num_archives)/$num_jobs_nnet] # gives the iteration number.\n  iter_to_epoch[$x]=$e\ndone\n\nif [ $stage -le -1 ]; then\n  echo \"$0: Copying initial model and modifying preconditioning setup\"\n\n  # Note, the baseline model probably had preconditioning, and we'll keep it;\n  # but we want online preconditioning with a larger number of samples of\n  # history, since in this setup the frames are only randomized at the segment\n  # level so they are highly correlated.  It might make sense to tune this a\n  # little, later on, although I doubt it matters once the --num-samples-history\n  # is large enough.\n\n  if [ ! 
-z \"$effective_lrate\" ]; then\n    learning_rate=$(perl -e \"print ($num_jobs_nnet*$effective_lrate);\")\n    echo \"$0: setting learning rate to $learning_rate = --num-jobs-nnet * --effective-lrate.\"\n  fi\n\n  $cmd $dir/log/convert.log \\\n    nnet-am-copy --learning-rate=$learning_rate \"$src_model\" - \\| \\\n    nnet-am-switch-preconditioning  --num-samples-history=50000 - $dir/0.mdl || exit 1;\nfi\n\n\n\nif [ $num_threads -eq 1 ]; then\n train_suffix=\"-simple\" # this enables us to use GPU code if\n                        # we have just one thread.\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\nrm $dir/.error 2>/dev/null  # clear any stale error flag; it is fine if none exists.\nx=0   \nwhile [ $x -lt $num_iters ]; do\n  if [ $stage -le $x ]; then\n    \n    echo \"Training neural net (pass $x)\"\n\n    # The \\$ below delays the evaluation of the expression until the script runs (and JOB\n    # will be replaced by the job-id).  That expression in $[..] is responsible for\n    # choosing the archive indexes to use for each job on each iteration... 
we cycle through\n    # all archives.\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-combine-egs-discriminative \\\n        \"ark:$degs_dir/degs.\\$[((JOB-1+($x*$num_jobs_nnet))%$num_archives)+1].ark\" ark:- \\| \\\n      nnet-train-discriminative$train_suffix --silence-phones=$silphonelist \\\n       --criterion=$criterion --drop-frames=$drop_frames \\\n       --one-silence-class=$one_silence_class \\\n       --boost=$boost --acoustic-scale=$acoustic_scale \\\n       $dir/$x.mdl ark:- $dir/$[$x+1].JOB.mdl || exit 1;\n\n    nnets_list=$(for n in $(seq $num_jobs_nnet); do echo $dir/$[$x+1].$n.mdl; done)\n\n    # below use run.pl instead of a generic $cmd for these very quick stages,\n    # so that we don't run the risk of waiting for a possibly hard-to-get GPU.\n    run.pl $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list $dir/$[$x+1].mdl || exit 1;\n\n    if $modify_learning_rates; then\n      run.pl $dir/log/modify_learning_rates.$x.log \\\n        nnet-modify-learning-rates --retroactive=$retroactive \\\n        --last-layer-factor=$last_layer_factor \\\n        --first-layer-factor=$first_layer_factor \\\n        $dir/$x.mdl $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  if $adjust_priors && [ ! -z \"${iter_to_epoch[$x]}\" ]; then\n    if [ ! -f $degs_dir/priors_egs.1.ark ]; then\n      echo \"$0: Expecting $degs_dir/priors_egs.1.ark to exist since --adjust-priors was true.\"\n      echo \"$0: Run this script with --adjust-priors false to not adjust priors\"\n      exit 1\n    fi\n    (\n    e=${iter_to_epoch[$x]}\n    rm $dir/.error\n    num_archives_priors=`cat $degs_dir/info/num_archives_priors` || { touch $dir/.error; echo \"Could not find $degs_dir/info/num_archives_priors. 
Set --adjust-priors false to not adjust priors\"; exit 1; }\n\n    $cmd JOB=1:$num_archives_priors $dir/log/get_post.epoch$e.JOB.log \\\n      nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" \\\n      ark:$degs_dir/priors_egs.JOB.ark ark:- \\| \\\n      matrix-sum-rows ark:- ark:- \\| \\\n      vector-sum ark:- $dir/post.epoch$e.JOB.vec || \\\n      { touch $dir/.error; echo \"Error in getting posteriors for adjusting priors. See $dir/log/get_post.epoch$e.*.log\"; exit 1; }\n\n    sleep 3;\n\n    $cmd $dir/log/sum_post.epoch$e.log \\\n      vector-sum $dir/post.epoch$e.*.vec $dir/post.epoch$e.vec || \\\n      { touch $dir/.error; echo \"Error in summing posteriors. See $dir/log/sum_post.epoch$e.log\"; exit 1; }\n\n    rm $dir/post.epoch$e.*.vec\n\n    echo \"Re-adjusting priors based on computed posteriors for iter $x\"\n    $cmd $dir/log/adjust_priors.epoch$e.log \\\n      nnet-adjust-priors $dir/$x.mdl $dir/post.epoch$e.vec $dir/$x.mdl \\\n      || { touch $dir/.error; echo \"Error in adjusting priors. See $dir/log/adjust_priors.epoch$e.log\"; exit 1; }\n    ) &\n  fi\n\n  [ -f $dir/.error ] && exit 1\n\n  x=$[$x+1]\ndone\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\necho Done\n\nepoch_final_iters=\nfor e in $(seq 0 $num_epochs); do\n  x=$[($e*$num_archives)/$num_jobs_nnet] # gives the iteration number.\n  ln -sf $x.mdl $dir/epoch$e.mdl\n  epoch_final_iters=\"$epoch_final_iters $x\"\ndone\n\n\n# function to remove egs that might be soft links.\nremove () { for x in $*; do [ -L $x ] && rm $(utils/make_absolute.sh $x); rm $x; done }\n\nif $cleanup && $remove_egs; then  # note: this is false by default.\n  echo Removing training examples\n  for n in $(seq $num_archives); do\n    remove $degs_dir/degs.*\n    remove $degs_dir/priors_egs.*\n  done\nfi\n\n\nif $cleanup; then\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if ! 
echo $epoch_final_iters | grep -w $x >/dev/null; then \n      # if $x is not an epoch-final iteration..\n      rm $dir/$x.mdl 2>/dev/null\n    fi\n  done\nfi\n\n"
  },
  {
    "path": "egs/steps/nnet2/train_discriminative_multilang2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script does MPE or MMI or state-level minimum bayes risk (sMBR) training,\n# in the multi-language or at least multi-model setting where you have multiple \"degs\" directories.\n# The input \"degs\" directories must be dumped by one of the get_egs_discriminative2.sh scripts.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=4       # Number of epochs of training\nlearning_rate=0.00002\nacoustic_scale=0.1  # acoustic scale for MMI/MPFE/SMBR training.\nboost=0.0       # option relevant for MMI\n\ncriterion=smbr\ndrop_frames=false #  option relevant for MMI\none_silence_class=true # option relevant for MPE/SMBR\nnum_jobs_nnet=\"4 4\"    # Number of neural net jobs to run in parallel, one per\n                       # language..  Note: this will interact with the learning\n                       # rates (if you decrease this, you'll have to decrease\n                       # the learning rate, and vice versa).\n\nmodify_learning_rates=true\nlast_layer_factor=1.0  # relates to modify-learning-rates\nfirst_layer_factor=1.0 # relates to modify-learning-rates\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\n\nstage=-3\n\n\nnum_threads=16  # this is the default but you may want to change it, e.g. to 1 if\n                # using GPUs.\ncleanup=true\nretroactive=false\nremove_egs=false\nsrc_models=  # can be used to override the defaults of <degs-dir1>/final.mdl <degs-dir2>/final.mdl .. 
etc.\n             # set this to a space-separated list.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# -lt 3 ]; then\n  echo \"Usage: $0 [opts] <degs-dir1> <degs-dir2> ... <degs-dirN>  <exp-dir>\"\n  echo \" e.g.: $0 exp/tri4_mpe_degs exp_other_lang/tri4_mpe_degs exp/tri4_mpe_multilang\"\n  echo \"\"\n  echo \"You have to first call get_egs_discriminative2.sh to dump the egs.\"\n  echo \"Caution: the options 'drop_frames' and 'criterion' are taken here\"\n  echo \"even though they were required also by get_egs_discriminative2.sh,\"\n  echo \"and they should normally match.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|4>                        # Number of epochs of training (measured on language 0)\"\n  echo \"  --learning-rate <learning-rate|0.00002>          # Learning rate to use\"\n  echo \"  --num-jobs-nnet <num-jobs|4 4>                   # Number of parallel jobs to use for main neural net:\"\n  echo \"                                                   # space separated list of num-jobs per language. Affects\"\n  echo \"                                                   # relative weighting.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.  With GPU, must be 1.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. 
queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --stage <stage|-3>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --criterion <criterion|smbr>                     # Training criterion: may be smbr, mmi or mpfe\"\n  echo \"  --boost <boost|0.0>                              # Boosting factor for MMI (e.g., 0.1)\"\n  echo \"  --drop-frames <true,false|false>                 # Option that affects MMI training: if true, we exclude gradients from frames\"\n  echo \"                                                   # where the numerator transition-id is not in the denominator lattice.\"\n  echo \"  --modify-learning-rates <true,false|true>        # If true, modify learning rates to try to equalize relative\"\n  echo \"                                                   # changes across layers.\"\n  exit 1;\nfi\n\nargv=(\"$@\") \nnum_args=$#\nnum_lang=$[$num_args-1]\n\ndir=${argv[$num_args-1]}\n\nnum_jobs_nnet_array=($num_jobs_nnet)\n! [ \"${#num_jobs_nnet_array[@]}\" -eq \"$num_lang\" ] && \\\n  echo \"$0: --num-jobs-nnet option must have size equal to the number of languages\" && exit 1;\n\nfor lang in $(seq 0 $[$num_lang-1]); do\n  degs_dir[$lang]=${argv[$lang]}\ndone\n\nif [ ! -z \"$src_models\" ]; then\n  src_model_array=($src_models)\n  ! 
[ \"${#src_model_array[@]}\" -eq \"$num_lang\" ] && \\\n    echo \"$0: --src-models option must have size equal to the number of languages\" && exit 1;\nelse\n  for lang in $(seq 0 $[$num_lang-1]); do\n    src_model_array[$lang]=${degs_dir[$lang]}/final.mdl\n  done\nfi\n\nmkdir -p $dir/log || exit 1;\n\nfor lang in $(seq 0 $[$num_lang-1]); do\n  this_degs_dir=${degs_dir[$lang]}\n  mdl=${src_model_array[$lang]}\n  this_num_jobs_nnet=${num_jobs_nnet_array[$lang]}\n  # Check inputs\n  for f in $this_degs_dir/degs.1.ark $this_degs_dir/info/{num_archives,silence.csl,frames_per_archive} $mdl; do\n    [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\n  done\n  mkdir -p $dir/$lang/log || exit 1;\n\n  # check for valid num-jobs-nnet.\n  ! [ $this_num_jobs_nnet -gt 0 ] && echo \"Bad num-jobs-nnet option '$num_jobs_nnet'\" && exit 1;\n  this_num_archives=$(cat $this_degs_dir/info/num_archives) || exit 1;\n  num_archives_array[$lang]=$this_num_archives\n  silphonelist_array[$lang]=$(cat $this_degs_dir/info/silence.csl) || exit 1;\n\n  if [ $this_num_jobs_nnet -gt $this_num_archives ]; then\n    echo \"$0: num-jobs-nnet $this_num_jobs_nnet exceeds number of archives $this_num_archives\"\n    echo \" ... for language $lang; setting it to $this_num_archives.\"\n    num_jobs_nnet_array[$lang]=$this_num_archives\n  fi\n\n  # copy some things from the input directories.\n  for f in splice_opts cmvn_opts tree final.mat; do\n    if [ -f $this_degs_dir/$f ]; then\n      cp $this_degs_dir/$f $dir/$lang/ || exit 1;\n    fi\n  done\n  if [ -f $this_degs_dir/conf ]; then\n    ln -sf $(utils/make_absolute.sh $this_degs_dir/conf) $dir/ || exit 1; \n  fi\ndone\n\n\n# work out number of iterations.\nnum_archives0=$(cat ${degs_dir[0]}/info/num_archives) || exit 1;\nnum_jobs_nnet0=${num_jobs_nnet_array[0]}\n\n! 
[ $num_epochs -gt 0 ] && echo \"Error: num-epochs $num_epochs is not valid\" && exit 1;\n\n\nnum_iters=$[($num_epochs*$num_archives0)/$num_jobs_nnet0]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations (measured on language 0)\"\n# Work out the number of epochs we train for on the other languages... this is\n# just informational.\nfor lang in $(seq 1 $[$num_lang-1]); do\n  this_degs_dir=${degs_dir[$lang]}\n  this_num_archives=${num_archives_array[$lang]}\n  this_num_epochs=$[($num_iters*${num_jobs_nnet_array[$lang]})/$this_num_archives]\n  echo \"$0: $num_iters iterations is approximately $this_num_epochs epochs for language $lang\"\ndone\n\n\n\nif [ $stage -le -1 ]; then\n  echo \"$0: Copying initial models and modifying preconditioning setups\"\n\n  # Note, the baseline model probably had preconditioning, and we'll keep it;\n  # but we want online preconditioning with a larger number of samples of\n  # history, since in this setup the frames are only randomized at the segment\n  # level so they are highly correlated.  
It might make sense to tune this a\n  # little, later on, although I doubt it matters once the --num-samples-history\n  # is large enough.\n\n  for lang in $(seq 0 $[$num_lang-1]); do\n    $cmd $dir/$lang/log/convert.log \\\n      nnet-am-copy --learning-rate=$learning_rate ${src_model_array[$lang]} - \\| \\\n      nnet-am-switch-preconditioning  --num-samples-history=50000 - $dir/$lang/0.mdl || exit 1;\n  done\nfi\n\n\n\nif [ $num_threads -eq 1 ]; then\n train_suffix=\"-simple\" # this enables us to use GPU code if\n                        # we have just one thread.\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\n\nx=0   \nwhile [ $x -lt $num_iters ]; do\n  if [ $stage -le $x ]; then\n    \n    echo \"Training neural net (pass $x)\"\n\n\n    rm $dir/.error 2>/dev/null\n\n    for lang in $(seq 0 $[$num_lang-1]); do\n      this_num_jobs_nnet=${num_jobs_nnet_array[$lang]}\n      this_num_archives=${num_archives_array[$lang]}\n      this_degs_dir=${degs_dir[$lang]}\n      this_silphonelist=${silphonelist_array[$lang]}\n\n      # The \\$ below delays the evaluation of the expression until the script runs (and JOB\n      # will be replaced by the job-id).  That expression in $[..] is responsible for\n      # choosing the archive indexes to use for each job on each iteration... 
we cycle through\n      # all archives.\n\n      (\n        $cmd JOB=1:$this_num_jobs_nnet $dir/$lang/log/train.$x.JOB.log \\\n          nnet-combine-egs-discriminative \\\n          \"ark:$this_degs_dir/degs.\\$[((JOB-1+($x*$this_num_jobs_nnet))%$this_num_archives)+1].ark\" ark:- \\| \\\n          nnet-train-discriminative$train_suffix --silence-phones=$this_silphonelist \\\n           --criterion=$criterion --drop-frames=$drop_frames \\\n           --one-silence-class=$one_silence_class \\\n           --boost=$boost --acoustic-scale=$acoustic_scale \\\n           $dir/$lang/$x.mdl ark:- $dir/$lang/$[$x+1].JOB.mdl || exit 1;\n\n        nnets_list=$(for n in $(seq $this_num_jobs_nnet); do echo $dir/$lang/$[$x+1].$n.mdl; done)\n\n        # produce an average just within this language.\n        $cmd $dir/$lang/log/average.$x.log \\\n          nnet-am-average $nnets_list $dir/$lang/$[$x+1].tmp.mdl || exit 1;\n\n        rm $nnets_list\n      ) || touch $dir/.error &\n    done\n    wait\n    [ -f $dir/.error ] && echo \"$0: error on pass $x\" && exit 1\n\n\n    # apply the modify-learning-rates thing to the model for the zero'th language;\n    # we'll use the resulting learning rates for the other languages.\n    if $modify_learning_rates; then\n      $cmd $dir/log/modify_learning_rates.$x.log \\\n        nnet-modify-learning-rates --retroactive=$retroactive \\\n        --last-layer-factor=$last_layer_factor \\\n        --first-layer-factor=$first_layer_factor \\\n        $dir/0/$x.mdl $dir/0/$[$x+1].tmp.mdl $dir/0/$[$x+1].tmp.mdl || exit 1;\n    fi\n\n    nnets_list=$(for lang in $(seq 0 $[$num_lang-1]); do echo $dir/$lang/$[$x+1].tmp.mdl; done)\n    weights_csl=$(echo $num_jobs_nnet | sed 's/ /:/g') # get as colon separated list.\n\n    # the next command produces the cross-language averaged model containing the\n    # final layer corresponding to language zero.  
Note, if we did modify-learning-rates,\n    # it will also have the modified learning rates.\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average --weights=$weights_csl --skip-last-layer=true \\\n      $nnets_list $dir/0/$[$x+1].mdl || exit 1;\n\n    # we'll transfer these learning rates to the other models.\n    learning_rates=$(nnet-am-info --print-learning-rates=true $dir/0/$[$x+1].mdl 2>/dev/null)        \n\n    for lang in $(seq 1 $[$num_lang-1]); do\n      # the next command takes the averaged hidden parameters from language zero, and\n      # the last layer from language $lang.  It's not really doing averaging.\n      # we use nnet-am-copy to transfer the learning rates from model zero.\n      $cmd $dir/$lang/log/combine_average.$x.log \\\n        nnet-am-average --weights=0.0:1.0 --skip-last-layer=true \\\n          $dir/$lang/$[$x+1].tmp.mdl $dir/0/$[$x+1].mdl - \\| \\\n        nnet-am-copy --learning-rates=$learning_rates - $dir/$lang/$[$x+1].mdl || exit 1;\n    done\n\n    $cleanup && rm $dir/*/$[$x+1].tmp.mdl\n\n  fi\n\n  x=$[$x+1]\ndone\n\n\nfor lang in $(seq 0 $[$num_lang-1]); do\n  rm $dir/$lang/final.mdl 2>/dev/null\n  ln -s $x.mdl $dir/$lang/final.mdl\n\n\n  epoch_final_iters=\n  for e in $(seq 0 $num_epochs); do\n    x=$[($e*$num_archives0)/$num_jobs_nnet0] # gives the iteration number.\n    ln -sf $x.mdl $dir/$lang/epoch$e.mdl\n    epoch_final_iters=\"$epoch_final_iters $x\"\n  done\n\n  if $cleanup; then\n    echo \"Removing most of the models for language $lang\"\n    for x in `seq 0 $num_iters`; do\n      if ! echo $epoch_final_iters | grep -w $x >/dev/null; then \n        # if $x is not an epoch-final iteration..\n        rm $dir/$lang/$x.mdl 2>/dev/null\n      fi\n    done\n  fi\ndone\n\n\necho Done\n"
  },
  {
    "path": "egs/steps/nnet2/train_more.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey). \n# Apache 2.0.\n\n\n# This script further trains an already-existing neural network,\n# given an existing model and an examples (egs/) directory.\n# The number of parallel jobs (--num-jobs-nnet) is determined by the\n# egs directory.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=10      # Number of epochs of training; number of iterations is\n                   # worked out from this.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                  # optimization over the validation set.\nlearning_rate_factor=1.0 # You can use this to gradually decrease the learning\n                         # rate during training (e.g. use 0.2); the initial\n                         # learning rates are as specified in the model, but it\n                         # will decrease slightly on each iteration to achieve\n                         # this ratio.\n\ncombine=true # controls whether or not to do the final model combination.\ncombine_regularizer=1.0e-14 # Small regularizer so that parameters won't go crazy.\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\nmix_up=0\nstage=-5\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n   # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=true\nremove_egs=false\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <input-model> <egs-dir> <exp-dir>\"\n  echo \" e.g.: $0 exp/nnet4c/final.mdl exp/nnet4c/egs exp/nnet5c/\"\n  echo \"see also the older script update_nnet.sh which creates the egs itself\"\n  echo \"You probably now want to use train_more2.sh, which uses the newer,\"\n  echo \"more compact egs format.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|10>                        # Number of epochs of training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --learning-rate-factor <factor|1.0>              # Factor (e.g. 
0.2) by which to change learning rate\"\n  echo \"                                                   # during the course of training\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --mix-up <#mix|0>                                # If specified, add quasi-targets, analogous to a mixture of Gaussians vs.\"\n  echo \"                                                   # single Gaussians.  Only do this if not already mixed-up.\"\n  echo \"  --combine <true or false|true>                   # If true, do the final nnet-combine-fast stage.\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"  \n  exit 1;\nfi\n\ninput_mdl=$1\negs_dir=$2\ndir=$3\n\n# Check some files.\nfor f in $input_mdl $egs_dir/egs.1.0.ark; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist.\" && exit 1;\ndone\n\nmkdir -p $dir/log\n\n# Copy some things from the directory where the input model is located, to the\n# experimental directory, if they exist.  These might be needed for things like\n# decoding.\ninput_dir=$(dirname $input_mdl);\nfor f in tree splice_opts cmvn_opts final.mat; do\n  if [ -f $input_dir/$f ]; then\n    cp $input_dir/$f $dir/\n  fi\ndone\n\niters_per_epoch=$(cat $egs_dir/iters_per_epoch) || exit 1;\nnum_jobs_nnet=$(cat $egs_dir/num_jobs_nnet) || exit 1;\n\nnum_iters=$[$num_epochs * $iters_per_epoch];\nper_iter_learning_rate_factor=$(perl -e \"print ($learning_rate_factor ** (1.0 / $num_iters));\")\n\necho \"$0: Will train for $num_epochs epochs, equalling $num_iters iterations.\"\n\nmix_up_iter=$[$num_iters/2]\n\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\ncp $input_mdl $dir/0.mdl || exit 1;\n\nx=0\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n    \n    echo \"Training neural net (pass $x)\"\n\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix --minibatch-size=$minibatch_size --srand=$x $dir/$x.mdl \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done     \n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rate-factor=$per_iter_learning_rate_factor - $dir/$[$x+1].mdl || exit 1;\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n         $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 
10 models.\nnnets_list=()\n[ $num_iters_final -gt $num_iters ] && num_iters_final=$num_iters\n[ \"$mix_up\" -gt 0 ] && [ $num_iters_final -gt $[$num_iters-$mix_up_iter] ] && \\\n  num_iters_final=$[$num_iters-$mix_up_iter]\n\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  if $combine; then\n    echo \"Doing final combination to produce final.mdl\"\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n    this_num_threads=$num_threads\n    [ $this_num_threads -lt 8 ] && this_num_threads=8\n    num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n    mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n    [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n    $cmd $parallel_opts $dir/log/combine.log \\\n      nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$this_num_threads --regularizer=$combine_regularizer \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n    $cmd $parallel_opts $dir/log/normalize.log \\\n      nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n    $cmd $dir/log/compute_prob_valid.final.log \\\n      nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.final.log \\\n      nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\n  else\n    echo \"$0: --combine=false so just using last model.\"\n    cp $dir/$x.mdl $dir/final.mdl\n  fi\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.log 
\\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\n\nsleep 2\n\necho Done\n\n\n$remove_egs && steps/nnet2/remove_egs.sh $dir/egs\n\nif $cleanup; then\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then \n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_more2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey). \n# Apache 2.0.\n\n# This script further trains an already-existing neural network,\n# given an existing model and an examples (egs/) directory.\n# This version of the script epects an egs/ directory in the newer\n# format, as created by get_egs2.sh.\n#\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=10      # Number of epochs of training; number of iterations is\n                   # worked out from this.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                  # optimization over the validation set.\nlearning_rate_factor=1.0 # You can use this to gradually decrease the learning\n                         # rate during training (e.g. use 0.2); the initial\n                         # learning rates are as specified in the model, but it\n                         # will decrease slightly on each iteration to achieve\n                         # this ratio.\n\ncombine=true # controls whether or not to do the final model combination.\ncombine_regularizer=1.0e-14 # Small regularizer so that parameters won't go crazy.\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  
You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\nnum_jobs_nnet=4\nmix_up=0\nstage=-5\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n   # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncleanup=true\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nremove_egs=false\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <input-model> <egs-dir> <exp-dir>\"\n  echo \" e.g.: $0 exp/nnet4c/final.mdl exp/nnet4c/egs exp/nnet5c/\"\n  echo \"see also the older script update_nnet.sh which creates the egs itself\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|10>                        # Number of epochs of training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-jobs-nnet <#jobs|4>                        # Number of neural-net jobs to run in parallel\"\n  echo \"  --learning-rate-factor <factor|1.0>              # Factor (e.g. 
0.2) by which to change learning rate\"\n  echo \"                                                   # during the course of training\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --mix-up <#mix|0>                                # If specified, add quasi-targets, analogous to a mixture of Gaussians vs.\"\n  echo \"                                                   # single Gaussians.  Only do this if not already mixed-up.\"\n  echo \"  --combine <true or false|true>                   # If true, do the final nnet-combine-fast stage.\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"  \n  exit 1;\nfi\n\ninput_mdl=$1\negs_dir=$2\ndir=$3\n\n# Check some files.\nfor f in $input_mdl $egs_dir/egs.1.ark; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist.\" && exit 1;\ndone\n\nmkdir -p $dir/log\n\n# Copy some things from the directory where the input model is located, to the\n# experimental directory, if they exist.  These might be needed for things like\n# decoding.\ninput_dir=$(dirname $input_mdl);\nfor f in tree splice_opts cmvn_opts final.mat; do\n  if [ -f $input_dir/$f ]; then\n    cp $input_dir/$f $dir/\n  fi\ndone\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\nif [ $num_jobs_nnet -gt $num_archives_expanded ]; then\n  echo \"$0: --num-jobs-nnet cannot exceed num-archives*frames-per-eg which is $num_archives_expanded\"\n  echo \"$0: setting --num-jobs-nnet to $num_archives_expanded\"\n  num_jobs_nnet=$num_archives_expanded\nfi\n\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$num_jobs_nnet == $num_epochs*$num_archives_expanded\nnum_iters=$[($num_epochs*$num_archives_expanded)/$num_jobs_nnet]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nper_iter_learning_rate_factor=$(perl -e \"print ($learning_rate_factor ** (1.0 / $num_iters));\")\n\nmix_up_iter=$[$num_iters/4]  # mix up after only a short way into training, as\n                             # most likely the net is already quite well trained.\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! 
cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch=$[$num_iters/$num_epochs]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch ]; then\n  num_models_combine=$approx_iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\ncp $input_mdl $dir/0.mdl || exit 1;\n\nx=0\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n    \n    echo \"Training neural net (pass $x)\"\n\n    rm $dir/.error 2>/dev/null\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n      \n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $num_jobs_nnet); do\n        k=$[$x*$num_jobs_nnet + $n - 1]; # k is a zero-based index that we'll derive\n                                         # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$minibatch_size --srand=$x $dir/$x.mdl \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      
nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done     \n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rate-factor=$per_iter_learning_rate_factor - $dir/$[$x+1].mdl || exit 1;\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n         $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\" \n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl 
ark:$egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n\n"
  },
  {
    "path": "egs/steps/nnet2/train_multilang2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey). \n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n\n# train_multilang2.sh is for multi-language training of neural nets.  It\n# takes multiple egs directories which must be created by get_egs2.sh, and the\n# corresponding alignment directories (only needed for training the transition\n# models).\n\n# for the n languages, we share all the hidden layers but there are separate\n# final layers.  On each iteration of training we average the hidden layers\n# across all jobs of all languages, but average the parameters of the final,\n# output layer only within each language.  The script starts from a partially\n# trained model from the first language (language 0 in the directory-numbering\n# scheme).  See egs/rm/s5/local/online/run_nnet2_wsj_joint.sh for example.\n#\n# This script requires you to supply a neural net partially trained for the 1st\n# language, by one of the regular training scripts, to be used as the initial\n# neural net (for use by other languages, we'll discard the last layer); it\n# should not have been subject to \"mix-up\" (since this script does mix-up), or\n# combination (since it would increase the parameter range to a too-large value\n# which isn't compatible with our normal learning rate schedules).\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=10      # Number of epochs of training (for first language);\n                   # the number of iterations is worked out from this.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update. 
\n\nnum_jobs_nnet=\"2 2\"    # Number of neural net jobs to run in parallel.  This option\n                       # is passed to get_egs.sh.  Array must be same length\n                       # as number of separate languages.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\n\nstage=-4\n\n\nmix_up=\"0 0\" # Number of components to mix up to (should be > #tree leaves, if\n             # specified.)  
An array, one per language.\n\nnum_threads=16  # default suitable for CPU-based training\nparallel_opts=\"--num-threads 16 --mem 1G\"  # default suitable for CPU-based training.\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=false # while testing, leaving cleanup=false.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 6 -o $[$#%2] -ne 0 ]; then\n  # num-args must be at least 6 and must be even.\n  echo \"Usage: $0 [opts] <ali0> <egs0> <ali1> <egs1> ... <aliN-1> <egsN-1> <input-model> <exp-dir>\"\n  echo \" e.g.: $0 exp/tri6_ali exp/tri6_egs exp_lang2/tri6_ali exp_lang2/tri6_egs exp/dnn6a/10.mdl exp/tri6_multilang\"\n  echo \"\"\n  echo \"Note: <input-model> must correspond to the model/tree for <ali0> and <egs0>, and the\"\n  echo \"num-epochs is computed for the zeroth language.\"\n  echo \"\"\n  echo \"The --num-jobs-nnet should be an array saying how many jobs to allocate to each language,\"\n  echo \"e.g. --num-jobs-nnet '2 4'\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|10>                        # Number of epochs of training (figured from 1st corpus)\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 
0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... 
\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  exit 1;\nfi\n\n\nargv=(\"$@\") \nnum_args=$#\nnum_lang=$[($num_args-2)/2]\n\ndir=${argv[$num_args-1]}\ninput_model=${argv[$num_args-2]}\n\n[ ! -f $input_model ] && echo \"$0: Input model $input_model does not exist\" && exit 1;\n\n\nmkdir -p $dir/log\n\nnum_jobs_nnet_array=($num_jobs_nnet)\n! [ \"${#num_jobs_nnet_array[@]}\" -eq \"$num_lang\" ] && \\\n  echo \"$0: --num-jobs-nnet option must have size equal to the number of languages\" && exit 1;\nmix_up_array=($mix_up)\n! [ \"${#mix_up_array[@]}\" -eq \"$num_lang\" ] && \\\n  echo \"$0: --mix-up option must have size equal to the number of languages\" && exit 1;\n\n\n# Language index starts from 0.\nfor lang in $(seq 0 $[$num_lang-1]); do\n  alidir[$lang]=${argv[$lang*2]}\n  egs_dir[$lang]=${argv[$lang*2+1]}\n  for f in ${egs_dir[$lang]}/info/frames_per_eg ${egs_dir[lang]}/egs.1.ark ${alidir[$lang]}/ali.1.gz ${alidir[$lang]}/tree; do\n    [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\n  done\n  mkdir -p $dir/$lang/log\n  cp ${alidir[$lang]}/tree $dir/$lang/ || exit 1;\n\n  for f in ${egs_dir[$lang]}/{final.mat,cmvn_opts,splice_opts}; do\n    # Copy any of these files that exist.\n    cp $f $dir/$lang/ 2>/dev/null \n  done\ndone\n\n\ninput_model_pdfs=$(nnet-am-info $input_model | grep '^output-dim' | awk '{print $2}')\nalidir0_pdfs=$(tree-info ${alidir[0]}/tree | grep '^num-pdfs' | awk '{print $2}')\nif ! [ $input_model_pdfs -eq $alidir0_pdfs ]; then\n  echo \"$0: expected num-pdfs from the input model $input_model to match\"\n  echo \" .. the one used for the first alignment directory ${alidir[0]}, $input_model_pdfs != $alidir0_pdfs\"\n  exit 1;\nfi\n\n\n\nfor x in final.mat cmvn_opts splice_opts; do\n  if [ -f $dir/0/$x ]; then\n    for lang in $(seq 1 $[$num_lang-1]); do\n      if ! 
cmp $dir/0/$x $dir/$lang/$x; then\n        echo \"$0: warning: files $dir/0/$x and $dir/$lang/$x are not identical.\"\n      fi\n    done\n  fi\ndone\n\n# the input model is supposed to correspond to the first language.\nnnet-am-copy --learning-rate=$initial_learning_rate $input_model $dir/0/0.mdl\n\nif nnet-am-info --print-args=false $dir/0/0.mdl | grep SumGroupComponent 2>/dev/null; then\n  if [ \"${mix_up_array[0]}\" != \"0\" ]; then\n    echo \"$0: Your input model already has mixtures, but you are asking to mix it up.\"\n    echo \" ... best to use a model without mixtures as input.  (e.g., earlier iter).\"\n    exit 1;\n  fi\nfi\n\n\nif [ $stage -le -4 ]; then\n  echo \"$0: initializing models for other languages\"\n  for lang in $(seq 1 $[$num_lang-1]); do\n    # create the initial models for the other languages.\n    $cmd $dir/$lang/log/reinitialize.log \\\n      nnet-am-reinitialize $input_model ${alidir[$lang]}/final.mdl $dir/$lang/0.mdl || exit 1;\n  done\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  for lang in $(seq 0 $[$num_lang-1]); do\n    $cmd $dir/$lang/log/train_trans.log \\\n      nnet-train-transitions $dir/$lang/0.mdl \"ark:gunzip -c ${alidir[$lang]}/ali.*.gz|\" $dir/$lang/0.mdl \\\n      || exit 1;\n  done\nfi\n\n# Work out the number of iterations... 
the number of epochs refers to the\n# first language (language zero) and this, together with the num-jobs-nnet for\n# that language and details of the egs, determine the number of iterations.\n\nframes_per_eg0=$(cat ${egs_dir[0]}/info/frames_per_eg) || exit 1;\nnum_archives0=$(cat ${egs_dir[0]}/info/num_archives) || exit 1;\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded0=$[$num_archives0*$frames_per_eg0]\n\nif [ ${num_jobs_nnet_array[0]} -gt $num_archives_expanded0 ]; then\n  echo \"$0: --num-jobs-nnet[0] cannot exceed num-archives*frames-per-eg which is $num_archives_expanded0\"\n  exit 1;\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$num_jobs_nnet == $num_epochs*$num_archives_expanded\nnum_iters=$[($num_epochs*$num_archives_expanded0)/${num_jobs_nnet_array[0]}]\n\necho \"$0: Will train for $num_epochs epochs (of language 0) = $num_iters iterations\"\n\n! [ $num_iters -gt 0 ] && exit 1;\n\n# Work out the number of epochs we train for on the other languages... this is\n# just informational, so use a separate variable and leave num_epochs (which\n# refers to language 0 and is used again below) untouched.\nfor lang in $(seq 1 $[$num_lang-1]); do\n  frames_per_eg=$(cat ${egs_dir[$lang]}/info/frames_per_eg) || exit 1;\n  num_archives=$(cat ${egs_dir[$lang]}/info/num_archives) || exit 1;\n  num_archives_expanded=$[$num_archives*$frames_per_eg]\n  this_num_epochs=$[($num_iters*${num_jobs_nnet_array[$lang]})/$num_archives_expanded]\n  echo \"$0: $num_iters iterations is approximately $this_num_epochs epochs for language $lang\"\ndone\n\n# do any mixing-up after half the iters.\nmix_up_iter=$[$num_iters/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  
You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch=$[$num_iters/$num_epochs]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup).\n# We use the same numbers of iterations for all languages, even though it's just\n# worked out for the first language.\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch ]; then\n  num_models_combine=$approx_iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\n\nwhile [ $x -lt $num_iters ]; do\n    \n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    for lang in $(seq 0 $[$num_lang-1]); do\n      # Set off jobs doing some diagnostics, in the background.\n      $cmd $dir/$lang/log/compute_prob_valid.$x.log \\\n        nnet-compute-prob $dir/$lang/$x.mdl ark:${egs_dir[$lang]}/valid_diagnostic.egs &\n      $cmd $dir/$lang/log/compute_prob_train.$x.log \\\n        nnet-compute-prob $dir/$lang/$x.mdl ark:${egs_dir[$lang]}/train_diagnostic.egs &\n      if [ $x -gt 0 ] && [ ! 
-f $dir/$lang/log/mix_up.$[$x-1].log ]; then\n        $cmd $dir/$lang/log/progress.$x.log \\\n          nnet-show-progress --use-gpu=no $dir/$lang/$[$x-1].mdl $dir/$lang/$x.mdl \\\n          ark:${egs_dir[$lang]}/train_diagnostic.egs '&&' \\\n           nnet-am-info $dir/$lang/$x.mdl &\n      fi\n    done\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -eq 0 ]; then\n      # on iteration zero, use a smaller minibatch size and only one quarter of the\n      # normal amount of training data: this will help, respectively, to ensure stability\n      # and to stop the models from moving so far that averaging hurts.\n      this_minibatch_size=$[$minibatch_size/2];\n      this_keep_proportion=0.25\n    else\n      this_minibatch_size=$minibatch_size\n      this_keep_proportion=1.0\n      # use half the examples on iteration 1, out of a concern that the model-averaging\n      # might not work if we move too far before getting close to convergence.\n      [ $x -eq 1 ] && this_keep_proportion=0.5 \n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n      \n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      \n      \n      for lang in $(seq 0 $[$num_lang-1]); do\n        this_num_jobs_nnet=${num_jobs_nnet_array[$lang]}\n        this_frames_per_eg=$(cat ${egs_dir[$lang]}/info/frames_per_eg) || exit 1;\n        this_num_archives=$(cat ${egs_dir[$lang]}/info/num_archives) || exit 1;\n\n        ! 
[ $this_num_jobs_nnet -gt 0 -a $this_frames_per_eg -gt 0 -a $this_num_archives -gt 0 ] && exit 1\n\n        for n in $(seq $this_num_jobs_nnet); do\n          k=$[$x*$this_num_jobs_nnet + $n - 1]; # k is a zero-based index that we'll derive\n                                                # the other indexes from.\n          archive=$[($k%$this_num_archives)+1]; # work out the 1-based archive index.\n          frame=$[(($k/$this_num_archives)%$this_frames_per_eg)];\n\n          $cmd $parallel_opts $dir/$lang/log/train.$x.$n.log \\\n            nnet-train$parallel_suffix $parallel_train_opts \\\n            --minibatch-size=$this_minibatch_size --srand=$x $dir/$lang/$x.mdl \\\n            \"ark,bg:nnet-copy-egs --keep-proportion=$this_keep_proportion --frame=$frame ark:${egs_dir[$lang]}/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n            $dir/$lang/$[$x+1].$n.mdl || touch $dir/.error &\n        done\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters $initial_learning_rate $final_learning_rate`;\n\n    (\n      # First average within each language.  
Use a sub-shell so \"wait\" won't\n      # wait for the diagnostic jobs.\n      for lang in $(seq 0 $[$num_lang-1]); do\n        this_num_jobs_nnet=${num_jobs_nnet_array[$lang]}\n        nnets_list=$(for n in `seq 1 $this_num_jobs_nnet`; do echo $dir/$lang/$[$x+1].$n.mdl; done)\n        # average the output of the different jobs.\n        $cmd $dir/$lang/log/average.$x.log \\\n          nnet-am-average $nnets_list - \\| \\\n          nnet-am-copy --learning-rate=$learning_rate - $dir/$lang/$[$x+1].tmp.mdl || touch $dir/.error &\n      done\n      wait\n      [ -f $dir/.error ] && echo \"$0: error averaging models on iteration $x of training\" && exit 1;\n      # Remove the models we just averaged.\n      for lang in $(seq 0 $[$num_lang-1]); do\n        this_num_jobs_nnet=${num_jobs_nnet_array[$lang]}\n        for n in `seq 1 $this_num_jobs_nnet`; do rm $dir/$lang/$[$x+1].$n.mdl; done\n      done\n    )\n\n\n    nnets_list=$(for lang in $(seq 0 $[$num_lang-1]); do echo $dir/$lang/$[$x+1].tmp.mdl; done)\n    weights_csl=$(echo $num_jobs_nnet | sed 's/ /:/g') # get as colon separated list.\n\n    # the next command produces the cross-language averaged model containing the\n    # final layer corresponding to language zero.\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average --weights=$weights_csl --skip-last-layer=true \\\n      $nnets_list $dir/0/$[$x+1].mdl || exit 1;\n\n    for lang in $(seq 1 $[$num_lang-1]); do\n      # the next command takes the averaged hidden parameters from language zero, and\n      # the last layer from language $lang.  
It's not really doing averaging.\n      $cmd $dir/$lang/log/combine_average.$x.log \\\n        nnet-am-average --weights=0.0:1.0 --skip-last-layer=true \\\n          $dir/$lang/$[$x+1].tmp.mdl $dir/0/$[$x+1].mdl $dir/$lang/$[$x+1].mdl || exit 1;\n    done\n\n    $cleanup && rm $dir/*/$[$x+1].tmp.mdl\n\n    if [ $x -eq $mix_up_iter ]; then\n      for lang in $(seq 0 $[$num_lang-1]); do     \n        this_mix_up=${mix_up_array[$lang]}\n        if [ $this_mix_up -gt 0 ]; then\n          echo \"$0: for language $lang, mixing up to $this_mix_up components\"\n          $cmd $dir/$lang/log/mix_up.$x.log \\\n            nnet-am-mixup --min-count=10 --num-mixtures=$this_mix_up \\\n             $dir/$lang/$[$x+1].mdl $dir/$lang/$[$x+1].mdl || exit 1;\n        fi\n      done\n    fi\n\n    # Now average across languages.\n\n    rm $nnets_list\n\n    for lang in $(seq 0 $[$num_lang-1]); do # mix up.\n      [ ! -f $dir/$lang/$[$x+1].mdl ] && echo \"No such file $dir/$lang/$[$x+1].mdl\" && exit 1;\n      if [ -f $dir/$lang/$[$x-1].mdl ] && $cleanup && \\\n        [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n        rm $dir/$lang/$[$x-1].mdl\n      fi\n    done\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"$0: Doing combination to produce final models\"\n\n\n  rm $dir/.error 2>/dev/null\n  for lang in $(seq 0 $[$num_lang-1]); do\n    nnets_list=()\n    # the if..else..fi statement below sets 'nnets_list'.\n    if [ $max_models_combine -lt $num_models_combine ]; then\n      # The number of models to combine is too large, e.g. > 20.  
In this case,\n      # each argument to nnet-combine-fast will be an average of multiple models.\n      cur_offset=0 # current offset from first_model_combine.\n      for n in $(seq $max_models_combine); do\n        next_offset=$[($n*$num_models_combine)/$max_models_combine]\n        sub_list=\"\" \n        for o in $(seq $cur_offset $[$next_offset-1]); do\n          iter=$[$first_model_combine+$o]\n          mdl=$dir/$lang/$iter.mdl\n          [ ! -f $mdl ] && echo \"$0: Expected $mdl to exist\" && exit 1;\n          sub_list=\"$sub_list $mdl\"\n        done\n        nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n        cur_offset=$next_offset\n      done\n    else\n      nnets_list=\n      for n in $(seq 0 $[num_models_combine-1]); do\n        iter=$[$first_model_combine+$n]\n        mdl=$dir/$lang/$iter.mdl\n        [ ! -f $mdl ] && echo \"$0: Expected $mdl to exist\" && exit 1;\n        nnets_list[$n]=$mdl\n      done\n    fi\n\n    # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n    # if there are many models it can give out-of-memory error; set num-threads\n    # to 8 to speed it up (this isn't ideal...)\n    num_egs=`nnet-copy-egs ark:${egs_dir[$lang]}/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n\n    mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n    [ $mb -gt 512 ] && mb=512\n    # Setting --initial-model to a large value makes it initialize the combination\n    # with the average of all the models.  It's important not to start with a\n    # single model, or, due to the invariance to scaling that these nonlinearities\n    # give us, we get zero diagonal entries in the fisher matrix that\n    # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n    # the effect that the initial model chosen gets much higher learning rates\n    # than the others.  
This prevents the optimization from working well.\n    $cmd $combine_parallel_opts $dir/$lang/log/combine.log \\\n      nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n        --num-threads=$combine_num_threads \\\n        --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:${egs_dir[$lang]}/combine.egs \\\n      - \\| nnet-normalize-stddev - $dir/$lang/final.mdl || touch $dir/.error &\n  done\n  wait\n  \n  [ -f $dir/.error ] && echo \"$0: error doing model combination\" && exit 1;\nfi\n\n\nif [ $stage -le $[$num_iters+1] ]; then\n  for lang in $(seq 0 $[$num_lang-1]); do  \n    # Run the diagnostics for the final models.\n    $cmd $dir/$lang/log/compute_prob_valid.final.log \\\n      nnet-compute-prob $dir/$lang/final.mdl ark:${egs_dir[$lang]}/valid_diagnostic.egs &\n    $cmd $dir/$lang/log/compute_prob_train.final.log \\\n      nnet-compute-prob $dir/$lang/final.mdl ark:${egs_dir[$lang]}/train_diagnostic.egs &\n  done\n  wait\nfi\n\nif [ $stage -le $[$num_iters+2] ]; then\n  # Note: this just uses CPUs, using a smallish subset of data.\n\n\n  for lang in $(seq 0 $[$num_lang-1]); do\n    echo \"$0: Getting average posterior for purposes of adjusting the priors (language $lang).\"\n    rm $dir/$lang/.error 2>/dev/null\n    # remove stale per-job posteriors; they are written below as post.JOB.vec.\n    rm $dir/$lang/post.*.vec 2>/dev/null\n    $cmd JOB=1:$num_jobs_compute_prior $dir/$lang/log/get_post.JOB.log \\\n      nnet-copy-egs --frame=random --srand=JOB ark:${egs_dir[$lang]}/egs.1.ark ark:- \\| \\\n      nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n      nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$lang/final.mdl -|\" ark:- ark:- \\| \\\n      matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/$lang/post.JOB.vec || touch $dir/$lang/.error &\n  done\n  echo \"$0: ... 
waiting for jobs for all languages to complete.\"\n  wait\n  sleep 3;  # make sure there is time for $dir/$lang/post.*.vec to appear.\n  for lang in $(seq 0 $[$num_lang-1]); do\n    [ -f $dir/$lang/.error ] && \\\n      echo \"$0: error getting posteriors for adjusting the priors for language $lang\" && exit 1;\n\n    $cmd $dir/$lang/log/vector_sum.log \\\n      vector-sum $dir/$lang/post.*.vec $dir/$lang/post.vec || exit 1;\n\n    rm $dir/$lang/post.*.vec;\n\n    echo \"Re-adjusting priors based on computed posteriors for language $lang\"\n    $cmd $dir/$lang/log/adjust_priors.final.log \\\n      nnet-adjust-priors $dir/$lang/final.mdl $dir/$lang/post.vec $dir/$lang/final.mdl || exit 1;\n  done\nfi\n\n\nfor lang in $(seq 0 $[$num_lang-1]); do\n  if [ ! -f $dir/$lang/final.mdl ]; then\n    echo \"$0: $dir/$lang/final.mdl does not exist.\"\n    # we don't want to clean up if the training didn't succeed.\n    exit 1;\n  fi\ndone\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $egs_dir\n  fi\n\n  echo Removing most of the models\n  # loop over every language; previously only the last value of \$lang was cleaned.\n  for lang in $(seq 0 $[$num_lang-1]); do\n    for x in `seq 0 $num_iters`; do\n      if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$lang/$x.mdl ]; then\n        # delete all but every 100th model; don't delete the ones which combine to form the final model.\n        rm $dir/$lang/$x.mdl\n      fi\n    done\n  done\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet2/train_multisplice_accel2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n# train_multisplice_accel2.sh is a modified version of\n# train_pnorm_multisplice2.sh (still using pnorm).  The \"accel\" refers to the\n# fact that we increase the number of jobs during training (from\n# --num-jobs-initial to --num-jobs-final).  We dropped \"pnorm\" from the name as\n# it was getting too long.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\npresoftmax_prior_scale_power=-0.25 # use the specified power value on the priors (inverse priors)\n                                   # to scale the pre-softmax outputs\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_initial=1  # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nprior_subset_size=10000 # 10k samples per job, for computing priors.  
Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\nfix_nnet=false\nmin_average=0.05\nmax_average=0.95\nonline_ivector_dir=\nremove_egs=true  # set to false to disable removing egs.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\nexit_stage=-100 # you can set this to terminate the training early.  Exits before running this stage\n\nsplice_indexes=\"layer0/-4:-3:-2:-1:0:1:2:3:4 layer2/-5:-1:3\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   
These don't\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # List of times on which we realign.  Each time is\n                        # floating point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\nframes_per_eg=8 # to be passed on to get_egs2.sh\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\necho $@\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.01> # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.001>   # effective learning rate at end of training.\"\n  echo \"                                                   # data, 0.00025 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --presoftmax-prior-scale-power <power|-0.25>     # use the specified power value on the priors (inverse priors)\"\n  echo \"                                                   # to scale the pre-softmax outputs.\"\n  echo \"                                                   # (set to 0.0 to disable the presoftmax element scale)\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # This option now does nothing; please remove it.\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-times <list-of-times|\\\"\\\">             # A list of space-separated floating point numbers between 0.0 and\"\n  echo \"                                                   # 1.0 to specify how far through training realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify is gpu is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\n# process the splice_inds string, to get a layer-wise context string\n# to be processed by the nnet-components\n# this would be mainly used by SpliceComponent|SpliceMaxComponent\npython steps/nnet2/make_multisplice_configs.py contexts --splice-indexes \"$splice_indexes\" $dir || exit -1;\ncontext_string=$(cat $dir/vars) || exit -1\necho $context_string\neval $context_string || exit -1; #\n  # initializes variables used by get_lda.sh and get_egs.sh\n  # get_lda.sh : first_left_context, first_right_context,\n  # get_egs.sh : nnet_left_context & nnet_right_context\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --left-context $first_left_context --right-context $first_right_context --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n\n  extra_opts+=(--left-context $nnet_left_context )\n  extra_opts+=(--right-context $nnet_right_context )\n  echo \"$0: calling get_egs2.sh\"\n  steps/nnet2/get_egs2.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n      --io-opts \"$io_opts\" \\\n      --cmd \"$cmd\" $egs_opts \\\n      --frames-per-eg $frames_per_eg \\\n      $data $alidir $dir/egs || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n# confirm that the provided egs_dir has the necessary context\negs_left_context=$(cat $egs_dir/info/left_context) || exit 1\negs_right_context=$(cat $egs_dir/info/right_context) || exit 1\n([[ $egs_left_context -lt $nnet_left_context ]] || [[ $egs_right_context -lt $nnet_right_context ]]) &&\n  echo \"$0: Provided egs_dir $egs_dir does not have sufficient context to train the neural network.\" && exit 1;\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\n[ $num_jobs_initial -gt 
$num_jobs_final ] && \\\n  echo \"$0: --num-jobs-initial cannot exceed --num-jobs-final\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --num-jobs-final cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\nif ! [ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  initial_lrate=$(perl -e \"print ($initial_effective_lrate*$num_jobs_initial);\")\n\n  # create the config files for nnet initialization\n  python steps/nnet2/make_multisplice_configs.py  \\\n    --splice-indexes \"$splice_indexes\"  \\\n    --total-input-dim $tot_input_dim  \\\n    --ivector-dim $ivector_dim  \\\n    --lda-mat \"$lda_mat\"  \\\n    --lda-dim $lda_dim  \\\n    --pnorm-input-dim $pnorm_input_dim  \\\n    --pnorm-output-dim  $pnorm_output_dim \\\n    --online-preconditioning-opts \"$online_preconditioning_opts\"  \\\n    --initial-learning-rate $initial_lrate \\\n    --bias-stddev  $bias_stddev  \\\n    --num-hidden-layers $num_hidden_layers \\\n    --num-targets  $num_leaves  \\\n    configs  $dir || exit -1;\n\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\nif [ $pnorm_input_dim -eq $pnorm_output_dim ] && [ $fix_nnet ]; then fix_nnet=true;fi\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\n\n  if [ \"$presoftmax_prior_scale_power\" != \"0.0\" ]; then\n  
  echo \"prepare initial vector for FixedScaleComponent before softmax\"\n    echo \"use priors^$presoftmax_prior_scale_power and rescale to average 1\"\n\n    # obtains raw pdf count\n    $cmd JOB=1:$nj $dir/log/acc_pdf.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      post-to-tacc --per-pdf=true --binary=false $alidir/final.mdl ark:- $dir/JOB.pacc || exit 1;\n    cat $dir/*.pacc > $dir/pacc\n    rm $dir/*.pacc\n    awk -v power=$presoftmax_prior_scale_power \\\n      '{ for(i=2; i<=NF-1; i++) {sum[i]+=$i} }\n      END {\n        for (i=2; i<=NF-1; i++) {total+=sum[i]}\n        ave_pdf=int(total/(NF-2)); total+=0.01*ave_pdf*(NF-2)\n        for (i=2; i<=NF-1; i++) {rescale+=((sum[i]+0.01*ave_pdf)/total)^power}\n        rescale/=(NF-2)\n        printf \" [ \"; for (i=2; i<=NF-1; i++) {printf(\"%f \", ((sum[i]+0.01*ave_pdf)/total)^power/rescale)}; print \"]\"\n      }' $dir/pacc > $dir/presoftmax_prior_scale_vecfile\n\n    echo \"FixedScaleComponent scales=$dir/presoftmax_prior_scale_vecfile\" > $dir/per_element.config\n    echo \"insert an additional layer of FixedScaleComponent before softmax\"\n    inp=`nnet-am-info $dir/0.mdl | grep 'Softmax' | awk '{print $2}'`\n    nnet-init $dir/per_element.config - | nnet-insert --insert-at=$inp --randomize-next-component=false $dir/0.mdl - $dir/0.mdl\n  fi\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\n! 
[ $num_iters -gt $[$num_hidden_layers*$add_layers_period+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\n\n# mix up at the iteration where we've processed about half the data; this keeps\n# the overall training procedure fairly invariant to the number of initial and\n# final jobs.\n# j = initial, k = final, n = num-iters, x = half-of-data epoch,\n# p is proportion of data we want to process (e.g. p=0.5 here).\n# solve for x if the amount of data processed by epoch x is p\n# times the amount by iteration n.\n# put this in wolfram alpha:\n# solve { x*j + (k-j)*x*x/(2*n) = p * (j*n + (k-j)*n/2), {x} }\n# got: x = (j n-sqrt(-n^2 (j^2 (p-1)-k^2 p)))/(j-k) and j!=k and n!=0\n# simplified manually to: n * (sqrt((1-p) j^2 + p k^2) - j)/(k-j)\nmix_up_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters 0.5)\n! [ $mix_up_iter -gt $finish_add_layers_iter ] && \\\n  echo \"Mix-up-iter is $mix_up_iter, should be greater than $finish_add_layers_iter -> add more epochs?\" \\\n  && exit 1;\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\necho \"$0: Will not do mix up\"\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  
If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  This equals\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch_final ]; then\n   num_models_combine=$approx_iters_per_epoch_final\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\";\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? 
$n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n  if [ $x -gt $[$num_iters/2] ]; then fix_nnet=false; fi\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e \"print (($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl 
ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging take the best.\n      cur_num_hidden_layers=$[$x/$add_layers_period];\n      inp=`nnet-am-info $dir/$x.mdl | grep 'Softmax' | awk '{print $2}'`\n\n      if [ \"$presoftmax_prior_scale_power\" != \"0.0\" ]; then\n        inp=$[$inp-2]\n      else\n        inp=$[$inp-1]\n      fi\n\n      mdl=\"nnet-init --srand=$x $dir/hidden_${cur_num_hidden_layers}.config - | nnet-insert --insert-at=$inp $dir/$x.mdl - - | nnet-am-copy --learning-rate=$this_learning_rate - -|\"\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      mdl=\"nnet-am-copy --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    if 
$do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      cp $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    if $fix_nnet; then\n      # do nnet-am-fix to fix some pathology in the network\n      nnet-am-fix --max-average-deriv=$max_average --min-average-deriv=$min_average $dir/$[$x+1].mdl $dir/$[$x+1].mdl 2>$dir/log/fix.$x.log || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      echo \"Warning: the mix up operation is disabled!\"\n      echo \"    Ignore mix up leaves number specified\"\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  
In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors 
$dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n\n"
  },
  {
    "path": "egs/steps/nnet2/train_multisplice_ensemble.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n# train_multisplice_accel2.sh is a modified version of\n# train_pnorm_multisplice2.sh (still using pnorm).  The \"accel\" refers to the\n# fact that we increase the number of jobs during training (from\n# --num-jobs-initial to --num-jobs-final).  We dropped \"pnorm\" from the name as\n# it was getting too long.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_initial=1  # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nprior_subset_size=10000 # 10k samples per job, for computing priors.  
Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\nonline_ivector_dir=\nremove_egs=true  # set to false to disable removing egs.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\nexit_stage=-100 # you can set this to terminate the training early.  Exits before running this stage\n\nsplice_indexes=\"layer0/-4:-3:-2:-1:0:1:2:3:4 layer2/-5:-1:3\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   
These don't\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\npostdir=\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # List of times on which we realign.  Each time is\n                        # floating point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\nsrand=0 # random seed used to initialize the nnet\ninitial_beta=0.1\nfinal_beta=3\nensemble_size=4\n# End configuration section.\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.01>           # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.001>            # effective learning rate at end of training.\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-epochs <list-of-epochs|''>             # A list of space-separated epoch indices the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify is gpu is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n[ ! 
-f $postdir/post.1.scp ] && [ ! -f $alidir/ali.1.gz ] && echo \"$0: no (soft) alignments provided\" && exit 1;\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\n# process the splice_inds string, to get a layer-wise context string\n# to be processed by the nnet-components\n# this would be mainly used by SpliceComponent|SpliceMaxComponent\npython steps/nnet2/make_multisplice_configs.py contexts --splice-indexes \"$splice_indexes\" $dir || exit -1;\ncontext_string=$(cat $dir/vars) || exit -1\necho $context_string\neval $context_string || exit -1; #\n  # initializes variables used by get_lda.sh and get_egs.sh\n  # get_lda.sh : first_left_context, first_right_context,\n  # get_egs.sh : nnet_left_context & nnet_right_context\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --left-context $first_left_context --right-context $first_right_context --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n\n  extra_opts+=(--left-context $nnet_left_context )\n  extra_opts+=(--right-context $nnet_right_context )\n  echo \"$0: calling get_egs2.sh\"\n  steps/nnet2/get_egs2.sh $egs_opts \"${extra_opts[@]}\" \\\n      --postdir \"$postdir\" \\\n      --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n      --io-opts \"$io_opts\" \\\n      --cmd \"$cmd\" $egs_opts \\\n      $data $alidir $dir/egs || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --num-jobs-initial cannot exceed --num-jobs-final\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --num-jobs-final cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  initial_lrate=$(perl -e \"print ($initial_effective_lrate*$num_jobs_initial);\")\n\n  # create the config files for nnet initialization\n  python steps/nnet2/make_multisplice_configs.py  \\\n    --splice-indexes \"$splice_indexes\"  \\\n    --total-input-dim $tot_input_dim  \\\n    --ivector-dim $ivector_dim  \\\n    --lda-mat \"$lda_mat\"  \\\n    --lda-dim $lda_dim  \\\n    --pnorm-input-dim $pnorm_input_dim  \\\n    --pnorm-output-dim  $pnorm_output_dim \\\n    --online-preconditioning-opts \"$online_preconditioning_opts\"  \\\n    --initial-learning-rate $initial_lrate \\\n    --bias-stddev  $bias_stddev  \\\n    --num-hidden-layers $num_hidden_layers \\\n    --num-targets  $num_leaves  \\\n    configs  $dir || exit -1;\n\n    $cmd $parallel_opts JOB=1:$ensemble_size $dir/log/nnet_init.JOB.log \\\n      nnet-am-init $alidir/tree $lang/topo \"nnet-init --srand=JOB $dir/nnet.config -|\" \\\n      $dir/0.JOB.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $parallel_opts JOB=1:$ensemble_size $dir/log/train_trans.JOB.log \\\n    nnet-train-transitions $dir/0.JOB.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.JOB.mdl \\\n    || exit 1;\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. 
$num_iters*$avg_num_jobs == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\n# finish_add_layers_iter must be set before the check below uses it.\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\n! [ $num_iters -gt $[$finish_add_layers_iter+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\n\n# mix up at the iteration where we've processed about half the data; this keeps\n# the overall training procedure fairly invariant to the number of initial and\n# final jobs.\n# j = initial, k = final, n = num-iters, x = half-of-data epoch,\n# p is proportion of data we want to process (e.g. p=0.5 here).\n# solve for x if the amount of data processed by epoch x is p\n# times the amount by iteration n.\n# put this in wolfram alpha:\n# solve { x*j + (k-j)*x*x/(2*n) = p * (j*n + (k-j)*n/2), {x} }\n# got: x = (j n-sqrt(-n^2 (j^2 (p-1)-k^2 p)))/(j-k) and j!=k and n!=0\n# simplified manually to: n * (sqrt((1-p)j^2 + p k^2) - j)/(k-j)\nmix_up_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters 0.5)\n! [ $mix_up_iter -gt $finish_add_layers_iter ] && \\\n  echo \"Mix-up-iter is $mix_up_iter, should be greater than $finish_add_layers_iter -> add more epochs?\" \\\n  && exit 1;\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n[ $mix_up -gt 0 ] && echo \"$0: Will mix up on iteration $mix_up_iter\"\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  
If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  This equals\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch_final ]; then\n   num_models_combine=$approx_iters_per_epoch_final\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\";\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? 
$n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e \"print (($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm 
$dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.1.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.1.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].1.mdl $dir/$x.1.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.1.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    declare -A mdl\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging take the best.\n      cur_num_hidden_layers=$[$x/$add_layers_period];\n      for i in `seq 1 $ensemble_size`; do\n        mdl[$i]=\"nnet-init --srand=$[$x+$i] $dir/hidden_${cur_num_hidden_layers}.config - | nnet-insert $dir/$x.$i.mdl - - | nnet-am-copy --learning-rate=$this_learning_rate - -|\"\n      done\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      for i in `seq 1 $ensemble_size`; do\n        mdl[$i]=\"nnet-am-copy --learning-rate=$this_learning_rate $dir/$x.$i.mdl -|\"\n      done\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n      nnets_ensemble_in=\n      nnets_ensemble_out=\n      for i in `seq 1 $ensemble_size`; do\n        nnets_ensemble_in=\"$nnets_ensemble_in '${mdl[$i]}'\"\n        nnets_ensemble_out=\"${nnets_ensemble_out} $dir/$[$x+1].$n.$i.mdl \"\n      done\n\n      beta=`perl -e '($x,$n,$i,$f)=@ARGV; print ($i+$x*($f-$i)/$n);' $[$x+1] $num_iters $initial_beta $final_beta`;\n\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train-ensemble \\\n          --minibatch-size=$this_minibatch_size --srand=$x \\\n          --beta=$beta $nnets_ensemble_in \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          ark:- 
$nnets_ensemble_out || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    for i in `seq 1 $ensemble_size`; do\n      nnets_list=\n      for n in `seq 1 $this_num_jobs`; do\n        nnets_list=\"$nnets_list $dir/$[$x+1].$n.$i.mdl\"\n      done\n\n      if $do_average; then\n        # average the output of the different jobs.\n        $cmd $dir/log/average.$x.log \\\n          nnet-am-average $nnets_list $dir/$[$x+1].$i.mdl ||  exit 1;\n      else\n        # choose the best from the different jobs.\n        n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n            $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n            undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n            close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n            $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n        [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n        cp $dir/$[$x+1].$n.$i.mdl $dir/$[$x+1].$i.mdl || exit 1;\n      fi\n\n      if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n        # mix up.\n        echo Mixing up from $num_leaves to $mix_up components\n        $cmd $dir/log/mix_up.$x.$i.log \\\n          nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n          $dir/$[$x+1].$i.mdl $dir/$[$x+1].$i.mdl || exit 1;\n      fi\n      rm $nnets_list\n      [ ! 
-f $dir/$[$x+1].$i.mdl ] && exit 1;\n      if [ -f $dir/$[$x-1].$i.mdl ] && $cleanup && \\\n         [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n        rm $dir/$[$x-1].$i.mdl\n      fi\n    done\n  fi\n    x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n(\n  # Now do combination.\n  for i in `seq 1 $ensemble_size`; do\n    # Now do combination.\n    nnets_list=()\n    # the if..else..fi statement below sets 'nnets_list'.\n    if [ $max_models_combine -lt $num_models_combine ]; then\n      # The number of models to combine is too large, e.g. > 20.  In this case,\n      # each argument to nnet-combine-fast will be an average of multiple models.\n      cur_offset=0 # current offset from first_model_combine.\n      for n in $(seq $max_models_combine); do\n        next_offset=$[($n*$num_models_combine)/$max_models_combine]\n        sub_list=\"\"\n        for o in $(seq $cur_offset $[$next_offset-1]); do\n          iter=$[$first_model_combine+$o]\n          mdl=$dir/$iter.$i.mdl\n          [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n          sub_list=\"$sub_list $mdl\"\n        done\n        nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n        cur_offset=$next_offset\n      done\n    else\n      for n in $(seq 0 $[num_models_combine-1]); do\n        iter=$[$first_model_combine+$n]\n        mdl=$dir/$iter.$i.mdl\n        [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        nnets_list[$n]=$mdl\n      done\n    fi\n    # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n    # if there are many models it can give out-of-memory error; set num-threads to 8\n    # to speed it up (this isn't ideal...)\n    num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n    mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n    [ $mb -gt 512 ] && mb=512\n    # Setting --initial-model to a large value makes it initialize the combination\n    # with the average of all the models.  It's important not to start with a\n    # single model, or, due to the invariance to scaling that these nonlinearities\n    # give us, we get zero diagonal entries in the fisher matrix that\n    # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n    # the effect that the initial model chosen gets much higher learning rates\n    # than the others.  
This prevents the optimization from working well.\n\n    $cmd $combine_parallel_opts  $dir/log/combine.$i.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.$i.mdl || touch $dir/.error &\n\n    [ -f $dir/.error ] && echo \"$0: error when combining models.\" && exit 1;\n    rm $dir/.error 2>/dev/null\n  done\n  wait\n)\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd JOB=1:$ensemble_size $dir/log/normalize.JOB.log \\\n    nnet-normalize-stddev $dir/final.JOB.mdl $dir/final.JOB.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.1.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.1.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  for i in `seq 1 $ensemble_size`; do\n    echo \"Getting average posterior for purposes of adjusting the priors.\"\n    # Note: this just uses CPUs, using a smallish subset of data.\n    rm $dir/post.$x.*.vec 2>/dev/null\n    $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n      nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n      nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n      nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.$i.mdl -|\" ark:- ark:- \\| \\\n      matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n    sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n    $cmd 
$dir/log/vector_sum.$x.log \\\n     vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n    rm $dir/post.$x.*.vec;\n\n    echo \"Re-adjusting priors based on computed posteriors\"\n    $cmd $dir/log/adjust_priors.final.log \\\n      nnet-adjust-priors $dir/final.$i.mdl $dir/post.$x.vec $dir/final.$i.mdl || exit 1;\n  done\nfi\ncp $dir/final.1.mdl $dir/final.mdl\n\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    for i in `seq 1 $ensemble_size`; do\n      if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.$i.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n        rm $dir/$x.$i.mdl\n      fi\n    done\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n# Apache 2.0.\n\n\n# This script trains neural network with pnorm nonlinearities.\n# The difference with train_tanh.sh is that, instead of setting\n# hidden_layer_size, you should set pnorm_input_dim and pnorm_output_dim.\n# Also the P value (the order of the p-norm) should be set.\n#\n# [Vimal Manohar - Oct 2014]\n# The script now supports realignment during training, which can be done by\n# specifying realign_epochs.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\nsoftmax_learning_rate_factor=1.0 # In the default setting keep the same learning rate.\n\ncombine_regularizer=1.0e-14 # Small regularizer so that parameters won't go crazy.\npnorm_input_dim=3000\npnorm_output_dim=300\np=2\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  
This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0\nmax_change=10.0\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  
Should be\n                        # more than enough.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_epochs=         # List of epochs, the beginning of which realignment is done\nnum_jobs_align=30       # Number of jobs for realignment\n\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 
2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|200000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --egs-opts <opts>                                # Extra options to pass to get_egs.sh\"\n  echo \"  --lda-opts <opts>                                # Extra options to pass to get_lda.sh\"\n  echo \"  --realign-epochs <list-of-epochs|\\\"\\\">           # A list of space-separated epoch indices at the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify if GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! 
-z \"$realign_epochs\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_epochs specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_epochs specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\nextra_opts+=(--splice-width $splice_width)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=`cat $dir/feat_dim` || exit 1;\nlda_dim=`cat $dir/lda_dim` || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  lda_mat=$dir/lda.mat\n  ext_lda_dim=$lda_dim\n  ext_feat_dim=$feat_dim\n\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$ext_feat_dim left-context=$splice_width right-context=$splice_width\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditioned input-dim=$ext_lda_dim output-dim=$pnorm_input_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditioned input-dim=$pnorm_output_dim output-dim=$num_leaves alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditioned input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * 
$iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\nx=0\n\nfor realign_epoch in $realign_epochs; do\n  realign_iter=`perl -e 'print int($ARGV[0] * $ARGV[1]);' $realign_epoch $iters_per_epoch`\n  realign_this_iter[$realign_iter]=$realign_epoch\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      epoch=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      rm $dir/post.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n        nnet-subset-egs --n=$prior_subset_size ark:$prev_egs_dir/egs.JOB.0.ark ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.log \\\n        vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n      rm $dir/post.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$epoch || exit 1\n\n      steps/nnet2/relabel_egs.sh --cmd \"$cmd\" --iter $x $dir/ali_$epoch \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - |\"\n    else\n      mdl=$dir/$x.mdl\n    fi\n\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$cur_egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix \\\n         --minibatch-size=$minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? 
$f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    softmax_learning_rate=`perl -e \"print $learning_rate * $softmax_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep -v Fixed | grep AffineComponent | wc -l`\n    # na is number of last updatable AffineComponent layer [one-based, counting only\n    # updatable components.]\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do\n      if [ $n -eq $na ] || [ $n -eq $[$na-1] ]; then lr=$softmax_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].mdl || exit 1;\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 
10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\n  num_iters_final=$num_iters_extra\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to\n  # at least 8 to speed it up (this isn't ideal...)\n  this_num_threads=$num_threads\n  [ $this_num_threads -lt 8 ] && this_num_threads=8\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$this_num_threads --regularizer=$combine_regularizer \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $parallel_opts $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$cur_egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\n\nsleep 
2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_accel2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#                2013  Xiaohui Zhang\n#                2013  Guoguo Chen\n#                2014  Vimal Manohar\n# Apache 2.0.\n\n# train_pnorm_accel2.sh is a modified form of train_pnorm_simple2.sh (the \"2\"\n# suffix is because they both use the the \"new\" egs format, created by\n# get_egs2.sh).  The \"accel\" part of the name refers to the fact that this\n# script uses a number of jobs that can increase during training.  You can\n# specify --initial-num-jobs and --final-num-jobs to control these separately.\n# Also, in this script, the learning rates specified by --initial-learning-rate\n# and --final-learning-rate are the \"effective learning rates\" (defined as the\n# learning rate divided by the number of jobs), and the actual learning rates\n# used will be the specified learning rates multiplied by the current number\n# of jobs.  You'll want to set these lower than you normally would previously\n# have set the learning rates, by a factor equal to the (previous) number of\n# jobs.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\np=2\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  
This option is passed to get_egs.sh\nnum_jobs_initial=1    # Number of neural net jobs to run in parallel at the start of training.\nnum_jobs_final=8      # Number of jobs to run in parallel at the end of training.\n\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\nonline_ivector_dir=\n\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\n\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nleft_context= # if set, overrides splice-width\nright_context= # if set, overrides splice-width.\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\ntransform_dir=     # If supplied, overrides alidir\npostdir=\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # 
List of times on which we realign.  Each time is a\n                        # floating-point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\nsrand=0 # random seed used to initialize the nnet\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.01> # effective learning rate at start of training,\"\n  echo \"                                         # actual learning-rate is this times num-jobs.\"\n  echo \"  --final-effective-lrate <lrate|0.001>   # effective learning rate at end of training.\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-times <list-of-times|\\\"\\\">             # A list of space-separated floating point numbers between 0.0 and\"\n  echo \"                                                   # 1.0 to specify how far through training realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify if GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n[ ! -f $postdir/post.1.scp ] && [ ! 
-f $alidir/ali.1.gz ] && echo \"$0: no (soft) alignments provided\" && exit 1;\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n[ -z \"$left_context\" ] && left_context=$splice_width\n[ -z \"$right_context\" ] && right_context=$splice_width\nextra_opts+=(--left-context $left_context --right-context $right_context)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs2.sh\"\n  steps/nnet2/get_egs2.sh $egs_opts \"${extra_opts[@]}\"  --io-opts \"$io_opts\" \\\n    --postdir \"$postdir\" --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n    --cmd \"$cmd\" $egs_opts $data $alidir $dir/egs || exit 
1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --num-jobs-initial cannot exceed --num-jobs-final\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --num-jobs-final cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\nif ! [ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  initial_lrate=$(perl -e \"print ($initial_effective_lrate*$num_jobs_initial);\")\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$tot_input_dim left-context=$left_context right-context=$right_context const-component-dim=$ivector_dim\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditionedOnline input-dim=$lda_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$num_leaves 
$online_preconditioning_opts learning-rate=$initial_lrate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_lrate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init --srand=$srand $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\n! [ $num_iters -gt $[$finish_add_layers_iter+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\n# mix up at the iteration where we've processed about half the data; this keeps\n# the overall training procedure fairly invariant to the number of initial and\n# final jobs.\n# j = initial, k = final, n = num-iters, x = half-of-data epoch,\n# p is proportion of data we want to process (e.g. 
p=0.5 here).\n# solve for x if the amount of data processed by iteration x is p\n# times the amount by iteration n.\n# put this in wolfram alpha:\n# solve { x*j + (k-j)*x*x/(2*n) = p * (j*n + (k-j)*n/2), {x} }\n# got: x = (j n-sqrt(-n^2 (j^2 (p-1)-k^2 p)))/(j-k) and j!=k and n!=0\n# simplified manually to: n * (sqrt((1-p)j^2 + p k^2) - j)/(k-j)\nmix_up_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters 0.5)\n! [ $mix_up_iter -gt $finish_add_layers_iter ] && \\\n  echo \"Mix-up-iter is $mix_up_iter, should be greater than $finish_add_layers_iter -> add more epochs?\" \\\n  && exit 1;\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  
This equals\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch_final ]; then\n  num_models_combine=$approx_iters_per_epoch_final\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\";\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e  \"print (($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  # TODO: remove this line.\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl 
ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      [ ! -f $dir/$x.mdl ] && sleep 10;\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just added a layer, don't average; take the best.\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - | nnet-am-copy --learning-rate=$this_learning_rate - -|\"\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      mdl=\"nnet-am-copy --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                         # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    if 
$do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list $dir/$[$x+1].mdl ||  exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      cp $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  
In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors 
$dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_bottleneck_fast.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2014  Pegah Ghahremani\n# Apache 2.0.\n\n\n# train_pnorm_fast.sh is a new, improved version of train_pnorm.sh, which uses\n# the 'online' preconditioning method.  For GPUs it's about two times faster\n# than before (although that's partly due to optimizations that will also help\n# the old recipe), and for CPUs it gives better performance than the old method\n# (I believe); also, the difference in optimization performance between CPU and\n# GPU is almost gone.  The old train_pnorm.sh script is now deprecated.\n# We made this a separate script because not all of the options that the\n# old script accepted, are still accepted.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iterations is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set (maximum)\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\nbottleneck_dim=42  # bottleneck layer dimensio\np=2\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\nonline_ivector_dir=\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\n# this relates to perturbed training.\nmin_target_objf_change=0.1\ntarget_multiplier=0 #  Set this to e.g. 
1.0 to enable perturbed training.\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nbottleneck_layer_num=$num_hidden_layers-2 # bottleneck layer number between hidden layer\n                                          # eg. 2000|2000|420|2000 bottleneck_layer_num = 2\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 10\\\">                      # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --first-component-power <power|1.0>              # Power applied to output of first p-norm layer... setting this to\"\n  echo \"                                                   # 0.5 seems to help under some circumstances.\"\n  echo \"  --stage <stage|-9>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\ntruncate_comp_num=$[3*$num_hidden_layers+1]\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\nextra_opts+=(--splice-width $splice_width)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  [ ! 
-z $spk_vecs_dir ] && egs_opts=\"$egs_opts --spk-vecs-dir $spk_vecs_dir\";\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! [ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$tot_input_dim left-context=$splice_width right-context=$splice_width const-component-dim=$ivector_dim\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditionedOnline input-dim=$lda_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of 
the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n\nbnf_input_dim=$((10 * $bottleneck_dim))\nbnf_output_dim=$bottleneck_dim\necho bnf_input_dim = $bnf_input_dim\n  bottleneck_stddev=`perl -e \"print 1.0/sqrt($bnf_input_dim);\"`\n  # to bnf.config it will write the part of the config corresponding to a\n  # bottleneck layer; we need this to add the bottleneck layer.\n  cat >$dir/bnf.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$bnf_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$bottleneck_stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$bnf_input_dim output-dim=$bnf_output_dim p=$p\nNormalizeComponent dim=$bnf_output_dim\nAffineComponentPreconditionedOnline input-dim=$bnf_output_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim  p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho 
\"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n\nfunction set_target_objf_change {\n  # nothing to do if $target_multiplier not set.\n  [ \"$target_multiplier\" == \"0\" -o \"$target_multiplier\" == \"0.0\" ] && return;\n  [ $x -le $finish_add_layers_iter ] && return;\n  wait=2  # the compute_prob_{train,valid} from 2 iterations ago should\n          # most likey be done even though we backgrounded them.\n  [ $[$x-$wait] -le 0 ] && return;\n  while true; do\n    # Note: awk 'some-expression' is the same as: awk '{if(some-expression) print;}'\n    train_prob=$(awk '(NF == 1)' < $dir/log/compute_prob_train.$[$x-$wait].log)\n    valid_prob=$(awk '(NF == 1)' < $dir/log/compute_prob_valid.$[$x-$wait].log)\n    if [ -z \"$train_prob\" ] || [ -z \"$valid_prob\" ]; then\n      echo \"$0: waiting until $dir/log/compute_prob_{train,valid}.$[$x-$wait].log are done\"\n      sleep 60\n    else\n      target_objf_change=$(perl -e '($train,$valid,$min_change,$multiplier)=@ARGV; if (!($train < 0.0) || !($valid < 0.0)) { print \"0\\n\"; print STDERR \"Error: invalid train or valid prob: $train_prob, $valid_prob\\n\"; exit(0); } else { print STDERR \"train,valid=$train,$valid\\n\"; $proposed_target = $multiplier * ($train-$valid); if ($proposed_target < $min_change) { print \"0\"; } else { print $proposed_target; }}' -- \"$train_prob\" \"$valid_prob\" \"$min_target_objf_change\" \"$target_multiplier\")\n      echo \"On iter $x, (train,valid) probs from iter $[$x-$wait] were ($train_prob,$valid_prob), and setting target-objf-change to $target_objf_change.\"\n      return;\n    fi\n  done\n}\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of 
training.\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\nx=0\ntarget_objf_change=0 # relates to perturbed training.\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n          ark:$egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      if [ $[($x-1) / $add_layers_period] -eq $[($num_hidden_layers-2)] ]; then\n        echo bnf layer with x = $x\n        mdl=\"nnet-init --srand=$x $dir/bnf.config - | nnet-insert $dir/$x.mdl - - |\"\n      else\n        mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - |\"\n      fi\n    else\n      mdl=$dir/$x.mdl\n    fi\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    set_target_objf_change;  # only has effect if target_multiplier != 0\n    if [ \"$target_objf_change\" != \"0\" ]; then\n      [ ! 
-f $dir/within_covar.spmat ] && \\\n        echo \"$0: expected $dir/within_covar.spmat to exist.\" && exit 1;\n      perturb_suffix=\"-perturbed\"\n      perturb_opts=\"--target-objf-change=$target_objf_change --within-covar=$dir/within_covar.spmat\"\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n       nnet-train$parallel_suffix$perturb_suffix $parallel_train_opts $perturb_opts \\\n        --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing 
up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -le $[$num_iters-$num_iters_final] ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\n  num_iters_final=$num_iters_extra  # actually apply the cap announced above.\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if 
[ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\nfi\nname=`basename $data`\nif [ -f $dir/final.mdl ]; then\n  nnet-to-raw-nnet --truncate=$truncate_comp_num $dir/final.mdl $dir/final.raw\nelse\n  echo \"$0: we require final.mdl in source dir $dir\"\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_ensemble.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Guoguo Chen\n#           2014  Xiaohui Zhang\n# Apache 2.0.\n\n\n# This script trains an ensemble of neural networks with pnorm nonlinearities.\n# An ensemble of nets are first differently initialized, and then trained using the\n# same data during each iteration. In each training iteration, one term is added to\n# the objf, which is beta times the cross-entropy between the current net's posterior\n# output and the geometrically averaged posterior outputs of the ensemble of nets.\n# The beta values obey an exponentially increasing schedule (determined by initial_beta\n# and final_beta).\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\nsoftmax_learning_rate_factor=1.0 # In the default setting keep the same learning rate.\n\ncombine_regularizer=1.0e-14 # Small regularizer so that parameters won't go crazy.\npnorm_input_dim=3000\npnorm_output_dim=300\np=2\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  
This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0\nmax_change=10.0\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=true\negs_dir=\nlda_opts=\negs_opts=\ninitial_beta=0.1\nfinal_beta=6\nensemble_size=2\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --initial-num-hidden-layers <#hidden-layers|1>   # Number of hidden layers to start with.\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|10>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --num-utts-subset <#utts|300>                    # Number of utterances in subsets used for validation and diagnostics\"\n  echo \"                                                   # (the validation subset is held out from training)\"\n  echo \"  --num-frames-diagnostic <#frames|4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames|10000>       # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|-9>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts --splice-width $splice_width --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=`cat $dir/feat_dim` || exit 1;\nlda_dim=`cat $dir/lda_dim` || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh --samples-per-iter $samples_per_iter --num-jobs-nnet $num_jobs_nnet \\\n      --splice-width $splice_width --stage $get_egs_stage --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  lda_mat=$dir/lda.mat\n  ext_lda_dim=$lda_dim\n  ext_feat_dim=$feat_dim\n\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$ext_feat_dim left-context=$splice_width right-context=$splice_width\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditioned input-dim=$ext_lda_dim output-dim=$pnorm_input_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditioned input-dim=$pnorm_output_dim output-dim=$num_leaves alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditioned input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  for i in `seq 1 $ensemble_size`; do\n    $cmd $parallel_opts JOB=1:$ensemble_size $dir/log/nnet_init.JOB.log \\\n      nnet-am-init $alidir/tree $lang/topo \"nnet-init --srand=JOB $dir/nnet.config -|\" \\\n      $dir/0.JOB.mdl || exit 1;\n  done\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $parallel_opts JOB=1:$ensemble_size $dir/log/train_trans.JOB.log \\\n      nnet-train-transitions $dir/0.JOB.mdl 
\"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.JOB.mdl \\\n      || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nfinish_add_layers_iter=$[$num_hidden_layers*$add_layers_period]\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit\n  fi\nfi\n\nx=0\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.1.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.1.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].1.mdl $dir/$x.1.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n\n    declare -A mdl\n    echo \"Training neural net (pass $x)\"\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      for i in `seq 1 $ensemble_size`; do\n        mdl[$i]=\"nnet-init --srand=$[$x+$i] $dir/hidden.config - | nnet-insert $dir/$x.$i.mdl - - |\"\n      done\n    else\n      for i in `seq 1 $ensemble_size`; do\n        mdl[$i]=$dir/$x.$i.mdl\n      done\n    fi\n\n    nnets_ensemble_in=\n    nnets_ensemble_out=\n    for i in `seq 1 $ensemble_size`; do\n      nnets_ensemble_in=\"$nnets_ensemble_in '${mdl[$i]}'\"\n      nnets_ensemble_out=\"${nnets_ensemble_out} $dir/$[$x+1].JOB.$i.mdl \"\n    done\n\n    beta=`perl -e '($x,$n,$i,$f)=@ARGV; print ($i+$x*($f-$i)/$n);' $[$x+1] $num_iters $initial_beta $final_beta`;\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train-ensemble \\\n         --minibatch-size=$minibatch_size --srand=$x --beta=$beta $nnets_ensemble_in \\\n        ark:- $nnets_ensemble_out \\\n      || exit 1;\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? 
$f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    softmax_learning_rate=`perl -e \"print $learning_rate * $softmax_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep -v Fixed | grep AffineComponent | wc -l`\n    # na is number of last updatable AffineComponent layer [one-based, counting only\n    # updatable components.]\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do\n      if [ $n -eq $na ] || [ $n -eq $[$na-1] ]; then lr=$softmax_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n\n    for i in `seq 1 $ensemble_size`; do\n      nnets_list=\n      for n in `seq 1 $num_jobs_nnet`; do\n        nnets_list=\"$nnets_list $dir/$[$x+1].$n.$i.mdl\"\n      done\n      $cmd $dir/log/average.$x.$i.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].$i.mdl || exit 1;\n      rm $nnets_list\n      if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n        # mix up.\n        echo Mixing up from $num_leaves to $mix_up components\n        $cmd $dir/log/mix_up.$x.$i.log \\\n          nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n          $dir/$[$x+1].$i.mdl $dir/$[$x+1].$i.mdl || exit 1;\n      fi\n    done\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 
10 models.\n\nfor i in `seq 1 $ensemble_size`; do\n  nnets_list=()\n  if [ $num_iters_final -gt $num_iters_extra ]; then\n    echo \"Setting num_iters_final=$num_iters_extra\"\n    num_iters_final=$num_iters_extra  # actually apply the cap announced above.\n  fi\n  start=$[$num_iters-$num_iters_final+1]\n  for x in `seq $start $num_iters`; do\n    idx=$[$x-$start]\n    if [ $x -gt $mix_up_iter ]; then\n      nnets_list[$idx]=$dir/$x.$i.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n    fi\n  done\n\n  if [ $stage -le $num_iters ]; then\n    # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n    # if there are many models it can give out-of-memory error; set num-threads to 8\n    # to speed it up (this isn't ideal...)\n    this_num_threads=$num_threads\n    [ $this_num_threads -lt 8 ] && this_num_threads=8\n    num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n    mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n    [ $mb -gt 512 ] && mb=512\n    # Setting --initial-model to a large value makes it initialize the combination\n    # with the average of all the models.  It's important not to start with a\n    # single model, or, due to the invariance to scaling that these nonlinearities\n    # give us, we get zero diagonal entries in the fisher matrix that\n    # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n    # the effect that the initial model chosen gets much higher learning rates\n    # than the others.  
This prevents the optimization from working well.\n    $cmd $parallel_opts $dir/log/combine.$i.log \\\n      nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n        --num-threads=$this_num_threads --regularizer=$combine_regularizer \\\n        --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n        $dir/final.$i.mdl || exit 1;\n\n    # Normalize stddev for affine or block affine layers that are followed by a\n    # pnorm layer and then a normalize layer.\n    $cmd $parallel_opts $dir/log/normalize.$i.log \\\n      nnet-normalize-stddev $dir/final.$i.mdl $dir/final.$i.mdl || exit 1;\n  fi\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.$i.log \\\n    nnet-compute-prob $dir/final.$i.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.$i.log \\\n    nnet-compute-prob $dir/final.$i.mdl ark:$egs_dir/train_diagnostic.egs &\ndone\ncp $dir/final.1.mdl $dir/final.mdl\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%10] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 10th model; don't delete the ones which combine to form the final model.\n      for i in `seq 1 $ensemble_size`; do\n        rm $dir/$x.$i.mdl\n      done\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_fast.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n# Apache 2.0.\n\n\n# train_pnorm_fast.sh is a new, improved version of train_pnorm.sh, which uses\n# the 'online' preconditioning method.  For GPUs it's about two times faster\n# than before (although that's partly due to optimizations that will also help\n# the old recipe), and for CPUs it gives better performance than the old method\n# (I believe); also, the difference in optimization performance between CPU and\n# GPU is almost gone.  The old train_pnorm.sh script is now deprecated.\n# We made this a separate script because not all of the options that the\n# old script accepted, are still accepted.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iterations is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set (maximum)\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\np=2\npresoftmax_prior_scale_power=-0.25 # use the specified power value on the priors (inverse priors)\n                                   # to scale the pre-softmax outputs\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\nonline_ivector_dir=\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-5\n\nio_opts=\"--max-jobs-run 15\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\n# this relates to perturbed training.\nmin_target_objf_change=0.1\ntarget_multiplier=0 #  Set this to e.g. 
1.0 to enable perturbed training.\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 
2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # This option now does nothing; please remove it.\"\n  echo \"  --presoftmax-prior-scale-power <power|-0.25>     # use the specified power value on the priors (inverse priors) \"\n  echo \"                                                   # to scale the pre-softmax outputs.\"\n  echo \"                                                   # (set to 0.0 to disable the presoftmax element scale)\"\n  echo \"  --num-jobs-nnet <num-jobs|16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 15\\\">                      # Options given to e.g. 
queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --samples-per-iter <#samples|200000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --first-component-power <power|1.0>              # Power applied to output of first p-norm layer... setting this to\"\n  echo \"                                                   # 0.5 seems to help under some circumstances.\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\nextra_opts+=(--splice-width $splice_width)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! 
[ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! [ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$tot_input_dim left-context=$splice_width right-context=$splice_width const-component-dim=$ivector_dim\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditionedOnline input-dim=$lda_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  $cmd 
$dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\n\n  if [ \"$presoftmax_prior_scale_power\" != \"0.0\" ]; then\n    echo \"prepare vector assignment for FixedScaleComponent before softmax\"\n    echo \"(use priors^$presoftmax_prior_scale_power and rescale to average 1)\"\n\n    # obtains raw pdf count\n    $cmd JOB=1:$nj $dir/log/acc_pdf.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      post-to-tacc --per-pdf=true --binary=false $alidir/final.mdl ark:- $dir/JOB.pacc || exit 1;\n    cat $dir/*.pacc > $dir/pacc\n    rm $dir/*.pacc\n    awk -v power=$presoftmax_prior_scale_power \\\n      '{ for(i=2; i<=NF-1; i++) {sum[i]+=$i} }\n      END {\n        for (i=2; i<=NF-1; i++) {total+=sum[i]}\n        ave_pdf=int(total/(NF-2)); total+=0.01*ave_pdf*(NF-2)\n        for (i=2; i<=NF-1; i++) {rescale+=((sum[i]+0.01*ave_pdf)/total)^power}\n        rescale/=(NF-2)\n        printf \" [ \"; for (i=2; i<=NF-1; i++) {printf(\"%f \", ((sum[i]+0.01*ave_pdf)/total)^power/rescale)}; print \"]\"\n      }' $dir/pacc > $dir/presoftmax_prior_scale_vecfile\n\n    echo \"FixedScaleComponent scales=$dir/presoftmax_prior_scale_vecfile\" > $dir/per_element.config\n    echo \"insert an additional layer of FixedScaleComponent before softmax\"\n    inp=`nnet-am-info $dir/0.mdl | grep 'Softmax' | awk '{print $2}'`\n    nnet-init $dir/per_element.config - | nnet-insert --insert-at=$inp --randomize-next-component=false $dir/0.mdl - $dir/0.mdl\n  fi\n\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for 
$num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\necho \"$0: Will not do mix up\"\n\nfunction set_target_objf_change {\n  # nothing to do if $target_multiplier not set.\n  [ \"$target_multiplier\" == \"0\" -o \"$target_multiplier\" == \"0.0\" ] && return;\n  [ $x -le $finish_add_layers_iter ] && return;\n  wait=2  # the compute_prob_{train,valid} from 2 iterations ago should\n          # most likely be done even though we backgrounded them.\n  [ $[$x-$wait] -le 0 ] && return;\n  while true; do\n    # Note: awk 'some-expression' is the same as: awk '{if(some-expression) print;}'\n    train_prob=$(awk '(NF == 1)' < $dir/log/compute_prob_train.$[$x-$wait].log)\n    valid_prob=$(awk '(NF == 1)' < $dir/log/compute_prob_valid.$[$x-$wait].log)\n    if [ -z \"$train_prob\" ] || [ -z \"$valid_prob\" ]; then\n      echo \"$0: waiting until $dir/log/compute_prob_{train,valid}.$[$x-$wait].log are done\"\n      sleep 60\n    else\n      target_objf_change=$(perl -e '($train,$valid,$min_change,$multiplier)=@ARGV; if (!($train < 0.0) || !($valid < 0.0)) { print \"0\\n\"; print STDERR \"Error: invalid train or valid prob: $train, $valid\\n\"; exit(0); } else { print STDERR \"train,valid=$train,$valid\\n\"; $proposed_target = $multiplier * ($train-$valid); if ($proposed_target < $min_change) { print \"0\"; } else { print $proposed_target; }}' -- \"$train_prob\" \"$valid_prob\" \"$min_target_objf_change\" \"$target_multiplier\")\n      echo \"On iter $x, (train,valid) probs from iter $[$x-$wait] were ($train_prob,$valid_prob), and setting target-objf-change to $target_objf_change.\"\n      return;\n    fi\n  done\n}\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of 
training.\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\nx=0\ntarget_objf_change=0 # relates to perturbed training.\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n          ark:$egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n\n      inp=`nnet-am-info $dir/$x.mdl | grep 'Softmax' | awk '{print $2}'`\n      if [ \"$presoftmax_prior_scale_power\" != \"0.0\" ]; then\n        inp=$[$inp-2]\n      else\n        inp=$[$inp-1]\n      fi\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert --insert-at=$inp $dir/$x.mdl - - |\"\n\n    else\n      mdl=$dir/$x.mdl\n    fi\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    set_target_objf_change;  # only has effect if target_multiplier != 0\n    if [ \"$target_objf_change\" != \"0\" ]; then\n      [ ! 
-f $dir/within_covar.spmat ] && \\\n        echo \"$0: expected $dir/within_covar.spmat to exist.\" && exit 1;\n      perturb_suffix=\"-perturbed\"\n      perturb_opts=\"--target-objf-change=$target_objf_change --within-covar=$dir/within_covar.spmat\"\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n       nnet-train$parallel_suffix$perturb_suffix $parallel_train_opts $perturb_opts \\\n        --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      echo \"Warning: the mix up 
operation is disabled!\"\n      echo \"    Ignore mix up leaves number specified\"\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -le $[$num_iters-$num_iters_final] ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\n  num_iters_final=$num_iters_extra\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if 
[ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_multisplice.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n# train_pnorm_multisplice.sh is a modified version of train_pnorm_simple.sh.\n# Like train_pnorm_fast.sh, it uses the `online' preconditioning,\n# which is faster (especially on GPUs).  The difference is that the\n# learning-rate schedule is simpler, with the learning rate exponentially\n# decreasing during training, and no phase where the learning rate is constant.\n#\n# Also, the final model-combination is done a bit differently: we combine models\n# over typically a whole epoch, and because that would be too many iterations to\n# easily be able to combine over, we arrange the iterations into groups (20\n# groups by default) and average over each group.\n#\n# [Vimal Manohar - Oct 2014]\n# The script now supports realignment during training, which can be done by\n# specifying realign_epochs.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\nonline_ivector_dir=\n\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   
These don't\nsplice_indexes=\"layer0/-4:-3:-2:-1:0:1:2:3:4 layer2/-5:-1:3\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_epochs=         # List of epochs, the beginning of which realignment is done\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 10\\\">                      # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-epochs <list-of-epochs|''>             # A list of space-separated epoch indices the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether the GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_epochs\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_epochs specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_epochs specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n# process the splice_inds string, to get a layer-wise context string\n# to be processed by the nnet-components\n# this would be mainly used by SpliceComponent|SpliceMaxComponent\npython steps/nnet2/make_multisplice_configs.py contexts --splice-indexes \"$splice_indexes\" $dir || exit -1;\ncontext_string=$(cat $dir/vars) || exit -1\necho $context_string\neval $context_string || exit -1; #\n  # initializes variables used by get_lda.sh and get_egs.sh\n  # get_lda.sh : first_left_context, first_right_context,\n  # get_egs.sh : nnet_left_context & nnet_right_context\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --left-context $first_left_context --right-context $first_right_context --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n\n  extra_opts+=(--left-context $nnet_left_context )\n  extra_opts+=(--right-context $nnet_right_context )\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  # create the config files for nnet initialization\n  python steps/nnet2/make_multisplice_configs.py  \\\n    --splice-indexes \"$splice_indexes\"  \\\n    --total-input-dim $tot_input_dim  \\\n    --ivector-dim $ivector_dim  \\\n    --lda-mat \"$lda_mat\"  \\\n    --lda-dim $lda_dim  \\\n    --pnorm-input-dim $pnorm_input_dim  \\\n    --pnorm-output-dim  $pnorm_output_dim \\\n    --online-preconditioning-opts \"$online_preconditioning_opts\"  \\\n    --initial-learning-rate $initial_learning_rate  \\\n    --bias-stddev  $bias_stddev  \\\n    --num-hidden-layers $num_hidden_layers \\\n    --num-targets  $num_leaves  \\\n    configs  $dir || exit -1;\n\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\ncur_num_hidden_layer=1  # counts the number of hidden layers in the network\n                        # this is different from the number of components in\n                        # in the network, each hidden layer is composed of\n                        # affine comp. + pnorm comp. 
+ normalization comp.\n                        # optionally a splice component is also added\n\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters=$[$num_epochs * $iters_per_epoch];\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  
This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $iters_per_epoch ]; then\n  num_models_combine=$iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\nfor realign_epoch in $realign_epochs; do\n  realign_iter=`perl -e 'print int($ARGV[0] * $ARGV[1]);' $realign_epoch $iters_per_epoch`\n  realign_this_iter[$realign_iter]=$realign_epoch\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n      epoch=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.$x.JOB.log \\\n        nnet-subset-egs --n=$prior_subset_size ark:$prev_egs_dir/egs.JOB.0.ark ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align 
--cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$epoch || exit 1\n\n      steps/nnet2/relabel_egs.sh --cmd \"$cmd\" --iter $x $dir/ali_$epoch \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      mdl=\"nnet-init --srand=$x $dir/hidden_${cur_num_hidden_layer}.config - | nnet-insert $dir/$x.mdl - - |\"\n      cur_num_hidden_layer=$((cur_num_hidden_layer + 1))\n    else\n      mdl=$dir/$x.mdl\n    fi\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$cur_egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n       nnet-train$parallel_suffix $parallel_train_opts \\\n        --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ 
\"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl 
ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.$x.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$cur_egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n    vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\n\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_multisplice2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n# train_pnorm_multisplice2.sh is a modified version of\n# train_pnorm_simple2.sh. This script creates neural net architectures with\n# multiple levels of splicing.  You can also compare it with\n# train_pnorm_multisplice.sh; it differs from that script by using the newer,\n# more compact multi-frame egs format that is dumped by get_egs2.sh.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=4    # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\nonline_ivector_dir=\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  
You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\n\nsplice_indexes=\"layer0/-4:-3:-2:-1:0:1:2:3:4 layer2/-5:-1:3\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   
These don't\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_epochs=         # List of epochs, the beginning of which realignment is done\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|4>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-epochs <list-of-epochs|''>             # A list of space-separated epoch indices the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether the GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_epochs\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_epochs specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_epochs specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n# process the splice_inds string, to get a layer-wise context string\n# to be processed by the nnet-components\n# this would be mainly used by SpliceComponent|SpliceMaxComponent\npython steps/nnet2/make_multisplice_configs.py contexts --splice-indexes \"$splice_indexes\" $dir || exit -1;\ncontext_string=$(cat $dir/vars) || exit -1\necho $context_string\neval $context_string || exit -1; #\n  # initializes variables used by get_lda.sh and get_egs.sh\n  # get_lda.sh : first_left_context, first_right_context,\n  # get_egs.sh : nnet_left_context & nnet_right_context\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --left-context $first_left_context --right-context $first_right_context --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n\n  extra_opts+=(--left-context $nnet_left_context )\n  extra_opts+=(--right-context $nnet_right_context )\n  echo \"$0: calling get_egs2.sh\"\n  steps/nnet2/get_egs2.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n      --io-opts \"$io_opts\" \\\n      --cmd \"$cmd\" $egs_opts \\\n      $data $alidir $dir/egs || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\nif [ $num_jobs_nnet -gt $num_archives_expanded ]; then\n  echo \"$0: --num-jobs-nnet cannot exceed num-archives*frames-per-eg which is $num_archives_expanded\"\n  echo \"$0: setting --num-jobs-nnet to $num_archives_expanded\"\n  num_jobs_nnet=$num_archives_expanded\nfi\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  # create the config files for nnet initialization\n  python steps/nnet2/make_multisplice_configs.py  \\\n    --splice-indexes \"$splice_indexes\"  \\\n    --total-input-dim $tot_input_dim  \\\n    --ivector-dim $ivector_dim  \\\n    --lda-mat \"$lda_mat\"  \\\n    --lda-dim $lda_dim  \\\n    --pnorm-input-dim $pnorm_input_dim  \\\n    --pnorm-output-dim  $pnorm_output_dim \\\n    --online-preconditioning-opts \"$online_preconditioning_opts\"  \\\n    --initial-learning-rate $initial_learning_rate  \\\n    --bias-stddev  $bias_stddev  \\\n    --num-hidden-layers $num_hidden_layers \\\n    --num-targets  $num_leaves  \\\n    configs  $dir || exit -1;\n\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. 
$num_iters*$num_jobs_nnet == $num_epochs*$num_archives_expanded\nnum_iters=$[($num_epochs*$num_archives_expanded)/$num_jobs_nnet]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch=$[$num_iters/$num_epochs]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  
This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch ]; then\n  num_models_combine=$approx_iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\nfor realign_epoch in $realign_epochs; do\n  # compare the equation below with the equation we use to set num_iters above.\n  # note, realign_epochs may be floating-point, which is why we don't use $[] to\n  # do the math.\n  realign_iter=$(perl -e 'print int(($ARGV[0]*$ARGV[1])/$ARGV[2]);' $realign_epoch $num_archives_expanded $num_jobs_nnet)\n  realign_this_iter[$realign_iter]=$realign_epoch\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      epoch=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$epoch || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$epoch \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl 
ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      cur_num_hidden_layers=$[$x/$add_layers_period];\n      mdl=\"nnet-init --srand=$x $dir/hidden_${cur_num_hidden_layers}.config - | nnet-insert $dir/$x.mdl - - |\"\n    else\n      mdl=$dir/$x.mdl\n    fi\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $num_jobs_nnet); do\n        k=$[$x*$num_jobs_nnet + $n - 1]; # k is a zero-based index that we'll derive\n                                         # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 
$num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! 
-f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  
It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- 
$dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_simple.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n# Apache 2.0.\n\n\n# train_pnorm_simple.sh is a modified version of train_pnorm_fast.sh.  Like\n# train_pnorm_fast.sh, it uses the `online' preconditioning, which is faster\n# (especially on GPUs).  The difference is that the learning-rate schedule is\n# simpler, with the learning rate exponentially decreasing during training,\n# and no phase where the learning rate is constant.\n#\n# Also, the final model-combination is done a bit differently: we combine models\n# over typically a whole epoch, and because that would be too many iterations to\n# easily be able to combine over, we arrange the iterations into groups (20\n# groups by default) and average over each group.\n#\n# [Vimal Manohar - Oct 2014]\n# The script now supports realignment during training, which can be done by\n# specifying realign_epochs.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\np=2\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\nonline_ivector_dir=\n\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   
These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_epochs=         # List of epochs, the beginning of which realignment is done\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-epochs <list-of-epochs|\\\"\\\">           # A list of space-separated epoch indices the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether gpu is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_epochs\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_epochs specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_epochs specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\nextra_opts+=(--splice-width $splice_width)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! 
[ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! [ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$tot_input_dim left-context=$splice_width right-context=$splice_width const-component-dim=$ivector_dim\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditionedOnline input-dim=$lda_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  $cmd 
$dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters=$[$num_epochs * $iters_per_epoch];\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  
This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $iters_per_epoch ]; then\n  num_models_combine=$iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\nfor realign_epoch in $realign_epochs; do\n  realign_iter=`perl -e 'print int($ARGV[0] * $ARGV[1]);' $realign_epoch $iters_per_epoch`\n  realign_this_iter[$realign_iter]=$realign_epoch\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n      epoch=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.$x.JOB.log \\\n        nnet-subset-egs --n=$prior_subset_size ark:$prev_egs_dir/egs.JOB.0.ark ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd 
\"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$epoch || exit 1\n\n      steps/nnet2/relabel_egs.sh --cmd \"$cmd\" --iter $x $dir/ali_$epoch \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - |\"\n    else\n      mdl=$dir/$x.mdl\n    fi\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$cur_egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n       nnet-train$parallel_suffix $parallel_train_opts \\\n        --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ 
\"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl 
ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.$x.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$cur_egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\n\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_pnorm_simple2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).\n#                2013  Xiaohui Zhang\n#                2013  Guoguo Chen\n#                2014  Vimal Manohar\n# Apache 2.0.\n\n\n# train_pnorm_simple2.sh is as train_pnorm_simple.sh but it uses the \"new\" egs\n# format, created by get_egs2.sh.\n\n# train_pnorm_simple.sh is a modified version of train_pnorm_fast.sh.  Like\n# train_pnorm_fast.sh, it uses the `online' preconditioning, which is faster\n# (especially on GPUs).  The difference is that the learning-rate schedule is\n# simpler, with the learning rate exponentially decreasing during training,\n# and no phase where the learning rate is constant.\n#\n# Also, the final model-combination is done a bit differently: we combine models\n# over typically a whole epoch, and because that would be too many iterations to\n# easily be able to combine over, we arrange the iterations into groups (20\n# groups by default) and average over each group.\n#\n# [Vimal Manohar - Oct 2014]\n# The script now supports realignment during training, which can be done by\n# specifying realign_epochs.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\npnorm_input_dim=3000\npnorm_output_dim=300\np=2\npresoftmax_prior_scale_power=-0.25 # use the specified power value on the priors (inverse priors)\n                                   # to scale the pre-softmax outputs\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  
This option is passed to get_egs.sh\nnum_jobs_nnet=4    # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0\nonline_ivector_dir=\n\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nstage=-4\n\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nleft_context= # if set, overrides splice-width\nright_context= # if set, overrides splice-width.\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\nprecondition_rank_in=20  # relates to online preconditioning\nprecondition_rank_out=80 # relates to online preconditioning\n\nmix_up=0 # Number of components to 
mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\"\n  # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_num_threads=8\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncleanup=true\negs_dir=\nlda_opts=\nlda_dim=\negs_opts=\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\ntransform_dir=     # If supplied, overrides alidir\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_epochs=         # List of epochs, the beginning of which realignment is done\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.02> # Learning rate at start of training, e.g. 
0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # This option now does nothing; please remove it.\"\n  echo \"  --presoftmax-prior-scale-power <power|-0.25>     # use the specified power value on the priors (inverse priors) \"\n  echo \"                                                   # to scale the pre-softmax outputs.\"\n  echo \"                                                   # (set to 0.0 to disable the presoftmax element scale)\"\n  echo \"  --num-jobs-nnet <num-jobs|4>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g.
queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-epochs <list-of-epochs|\\\"\\\">           # A list of space-separated epoch indices at the beginning of which\"\n  echo \"                                                   # realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether the GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ !
-z \"$realign_epochs\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_epochs specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_epochs specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\n[ -z \"$left_context\" ] && left_context=$splice_width\n[ -z \"$right_context\" ] && right_context=$splice_width\nextra_opts+=(--left-context $left_context --right-context $right_context)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs2.sh\"\n  steps/nnet2/get_egs2.sh $egs_opts \"${extra_opts[@]}\"  --io-opts \"$io_opts\" \\\n    --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n    --cmd \"$cmd\" $egs_opts $data $alidir $dir/egs || exit 1;\nfi\n\nif [ -z \"$egs_dir\" ]; then\n  egs_dir=$dir/egs\nfi\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\nif [ $num_jobs_nnet -gt $num_archives_expanded ]; then\n  echo \"$0: --num-jobs-nnet cannot exceed num-archives*frames-per-eg which is $num_archives_expanded\"\n  echo \"$0: setting --num-jobs-nnet to $num_archives_expanded\"\n  num_jobs_nnet=$num_archives_expanded\nfi\n\nif !
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  stddev=`perl -e \"print 1.0/sqrt($pnorm_input_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$tot_input_dim left-context=$left_context right-context=$right_context const-component-dim=$ivector_dim\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditionedOnline input-dim=$lda_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$pnorm_output_dim output-dim=$pnorm_input_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nPnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim p=$p\nNormalizeComponent dim=$pnorm_output_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting 
priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\n\n  if [ \"$presoftmax_prior_scale_power\" != \"0.0\" ]; then\n    echo \"prepare vector assignment for FixedScaleComponent before softmax\"\n    echo \"(use priors^$presoftmax_prior_scale_power and rescale to average 1)\"\n\n    # obtains raw pdf count\n    $cmd JOB=1:$nj $dir/log/acc_pdf.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      post-to-tacc --per-pdf=true --binary=false $alidir/final.mdl ark:- $dir/JOB.pacc || exit 1;\n    cat $dir/*.pacc > $dir/pacc\n    rm $dir/*.pacc\n    awk -v power=$presoftmax_prior_scale_power \\\n      '{ for(i=2; i<=NF-1; i++) {sum[i]+=$i} }\n      END {\n        for (i=2; i<=NF-1; i++) {total+=sum[i]}\n        ave_pdf=int(total/(NF-2)); total+=0.01*ave_pdf*(NF-2)\n        for (i=2; i<=NF-1; i++) {rescale+=((sum[i]+0.01*ave_pdf)/total)^power}\n        rescale/=(NF-2)\n        printf \" [ \"; for (i=2; i<=NF-1; i++) {printf(\"%f \", ((sum[i]+0.01*ave_pdf)/total)^power/rescale)}; print \"]\"\n      }' $dir/pacc > $dir/presoftmax_prior_scale_vecfile\n\n    echo \"FixedScaleComponent scales=$dir/presoftmax_prior_scale_vecfile\" > $dir/per_element.config\n    echo \"insert an additional layer of FixedScaleComponent before softmax\"\n    inp=`nnet-am-info $dir/0.mdl | grep 'Softmax' | awk '{print $2}'`\n    nnet-init $dir/per_element.config - | nnet-insert --insert-at=$inp --randomize-next-component=false $dir/0.mdl - $dir/0.mdl\n  fi\nfi\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. 
$num_iters*$num_jobs_nnet == $num_epochs*$num_archives_expanded\nnum_iters=$[($num_epochs*$num_archives_expanded)/$num_jobs_nnet]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\necho \"$0: Will not do mix up\"\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  parallel_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  parallel_suffix=\"-parallel\"\n  parallel_train_opts=\"--num-threads=$num_threads\"\nfi\n\n\napprox_iters_per_epoch=$[$num_iters/$num_epochs]\n# First work out how many models we want to combine over in the final\n# nnet-combine-fast invocation.  
This equals\n# min(max(max_models_combine, iters_per_epoch),\n#     2/3 * iters_after_mixup)\nnum_models_combine=$max_models_combine\nif [ $num_models_combine -lt $approx_iters_per_epoch ]; then\n  num_models_combine=$approx_iters_per_epoch\nfi\niters_after_mixup_23=$[(($num_iters-$mix_up_iter-1)*2)/3]\nif [ $num_models_combine -gt $iters_after_mixup_23 ]; then\n  num_models_combine=$iters_after_mixup_23\nfi\nfirst_model_combine=$[$num_iters-$num_models_combine+1]\n\nx=0\n\nfor realign_epoch in $realign_epochs; do\n  # compare the equation below with the equation we use to set num_iters above.\n  # note, realign_epochs may be floating-point, which is why we don't use $[] to\n  # do the math.\n  realign_iter=$(perl -e 'print int(($ARGV[0]*$ARGV[1])/$ARGV[2]);' $realign_epoch $num_archives_expanded $num_jobs_nnet)\n  realign_this_iter[$realign_iter]=$realign_epoch\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n\n  if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n      epoch=${realign_this_iter[$x]}\n\n\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet-copy-egs --srand=JOB --frame=random ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet-compute-from-egs \"nnet-to-raw-nnet $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet2/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$epoch || exit 1\n\n      steps/nnet2/relabel_egs2.sh --cmd \"$cmd\" --iter $x $dir/ali_$epoch \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet2/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl 
ark:$cur_egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$cur_egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! -f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n        ark:$cur_egs_dir/train_diagnostic.egs '&&' \\\n        nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n\n      inp=`nnet-am-info $dir/$x.mdl | grep 'Softmax' | awk '{print $2}'`\n      if [ \"$presoftmax_prior_scale_power\" != \"0.0\" ]; then\n        inp=$[$inp-2]\n      else\n        inp=$[$inp-1]\n      fi\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert --insert-at=$inp $dir/$x.mdl - - |\"\n    else\n      mdl=$dir/$x.mdl\n    fi\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. 
it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $num_jobs_nnet); do\n        k=$[$x*$num_jobs_nnet + $n - 1]; # k is a zero-based index that we'll derive\n                                         # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $parallel_opts $dir/log/train.$x.$n.log \\\n          nnet-train$parallel_suffix $parallel_train_opts \\\n          --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n          \"ark,bg:nnet-copy-egs --frame=$frame ark:$cur_egs_dir/egs.$archive.ark ark:-|nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-|\" \\\n          $dir/$[$x+1].$n.mdl || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 
$num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? $f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters $initial_learning_rate $final_learning_rate`;\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rate=$learning_rate - $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rate=$learning_rate $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      echo \"Warning: the mix up operation is disabled!\"\n      echo \"    Ignoring the specified number of mix-up leaves\"\n    fi\n    rm $nnets_list\n    [ ! 
-f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.\n  nnets_list=()\n  # the if..else..fi statement below sets 'nnets_list'.\n  if [ $max_models_combine -lt $num_models_combine ]; then\n    # The number of models to combine is too large, e.g. > 20.  In this case,\n    # each argument to nnet-combine-fast will be an average of multiple models.\n    cur_offset=0 # current offset from first_model_combine.\n    for n in $(seq $max_models_combine); do\n      next_offset=$[($n*$num_models_combine)/$max_models_combine]\n      sub_list=\"\"\n      for o in $(seq $cur_offset $[$next_offset-1]); do\n        iter=$[$first_model_combine+$o]\n        mdl=$dir/$iter.mdl\n        [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n        sub_list=\"$sub_list $mdl\"\n      done\n      nnets_list[$[$n-1]]=\"nnet-am-average $sub_list - |\"\n      cur_offset=$next_offset\n    done\n  else\n    nnets_list=\n    for n in $(seq 0 $[num_models_combine-1]); do\n      iter=$[$first_model_combine+$n]\n      mdl=$dir/$iter.mdl\n      [ ! -f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n      nnets_list[$n]=$mdl\n    done\n  fi\n\n\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$cur_egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  
It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  This prevents the optimization from working well.\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$cur_egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Normalize stddev for affine or block affine layers that are followed by a\n  # pnorm layer and then a normalize layer.\n  $cmd $dir/log/normalize.log \\\n    nnet-normalize-stddev $dir/final.mdl $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$cur_egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n    nnet-copy-egs --frame=random --srand=JOB ark:$cur_egs_dir/egs.1.ark ark:- \\| \\\n    nnet-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- 
$dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_tanh.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script trains a fairly vanilla network with tanh nonlinearities.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\nshrink_interval=5 # shrink every $shrink_interval iters except while we are\n                  # still adding layers, when we do it every iter.\nshrink=true\nnum_frames_shrink=2000 # note: must be <= --num-frames-diagnostic option to get_egs.sh, if\n                       # given.\nfinal_learning_rate_factor=0.5 # Train the two last layers of parameters half as\n                               # fast as the other layers.\n\nhidden_layer_dim=300 #  You may want this larger, e.g. 1024 or 2048.\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nmodify_learning_rates=false\nlast_layer_factor=0.1 # relates to modify_learning_rates.\nfirst_layer_factor=1.0 # relates to modify_learning_rates.\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0\nmax_change=10.0\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n         # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=true\negs_dir=\nlda_opts=\negs_opts=\ntransform_dir=\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # can be used to force \"raw\" feature type.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.02> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --initial-num-hidden-layers <#hidden-layers|1>   # Number of hidden layers to start with.\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 10\\\">                      # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|200000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --stage <stage|-9>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`am-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\nextra_opts+=(--splice-width $splice_width)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=`cat $dir/feat_dim` || exit 1;\nlda_dim=`cat $dir/lda_dim` || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  lda_mat=$dir/lda.mat\n  ext_lda_dim=$lda_dim\n  ext_feat_dim=$feat_dim\n\n  stddev=`perl -e \"print 1.0/sqrt($hidden_layer_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$ext_feat_dim left-context=$splice_width right-context=$splice_width\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditioned input-dim=$ext_lda_dim output-dim=$hidden_layer_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$num_leaves alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$hidden_layer_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho 
\"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\nfirst_modify_iter=$[$finish_add_layers_iter + $add_layers_period]\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - |\"\n    else\n      mdl=$dir/$x.mdl\n    fi\n\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix \\\n         --minibatch-size=$minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? 
$f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    last_layer_learning_rate=`perl -e \"print $learning_rate * $final_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep -v Fixed | grep AffineComponent | wc -l`\n    # na is number of last updatable AffineComponent layer [one-based, counting only\n    # updatable components.]\n    # The last two layers will get this (usually lower) learning rate.\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do\n      if [ $n -eq $na ] || [ $n -eq $[$na-1] ]; then lr=$last_layer_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].mdl || exit 1;\n\n    if $modify_learning_rates && [ $x -ge $first_modify_iter ]; then\n      $cmd $dir/log/modify_learning_rates.$x.log \\\n        nnet-modify-learning-rates --last-layer-factor=$last_layer_factor \\\n          --first-layer-factor=$first_layer_factor --average-learning-rate=$learning_rate \\\n        $dir/$x.mdl $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if $shrink && [ $[$x % $shrink_interval] -eq 0 ]; then\n      mb=$[($num_frames_shrink+$num_threads-1)/$num_threads]\n      $cmd $parallel_opts $dir/log/shrink.$x.log \\\n        nnet-subset-egs --n=$num_frames_shrink --randomize-order=true --srand=$x \\\n          ark:$egs_dir/train_diagnostic.egs ark:-  \\| \\\n        nnet-combine-fast --use-gpu=no --num-threads=$num_threads --verbose=3 --minibatch-size=$mb \\\n          $dir/$[$x+1].mdl ark:- $dir/$[$x+1].mdl || exit 1;\n    else\n      # On other iters, do nnet-am-fix which is much faster and has roughly\n      # the same effect.\n      nnet-am-fix $dir/$[$x+1].mdl 
$dir/$[$x+1].mdl 2>$dir/log/fix.$x.log\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  this_num_threads=$num_threads\n  [ $this_num_threads -lt 8 ] && this_num_threads=8\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  $cmd $parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --use-gpu=no --num-threads=$this_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob 
$dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/train_tanh_bottleneck.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#      2014  Pegah Ghahremani\n# This script trains a fairly vanilla network with tanh nonlinearities to generate bottleneck features\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15    # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\nshrink_interval=5 # shrink every $shrink_interval iters except while we are\n                  # still adding layers, when we do it every iter.\nshrink=true\nnum_frames_shrink=2000 # note: must be <= --num-frames-diagnostic option to get_egs.sh, if\n                       # given.\nfinal_learning_rate_factor=0.5 # Train the two last layers of parameters half as\n                               # fast as the other layers.\n\nhidden_layer_dim=1024 #  You may want this larger, e.g. 1024 or 2048.\n\nbottleneck_dim=42  # bottleneck layer dimension\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  
This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3\nbottleneck_layer_num=$num_hidden_layers-2 # bottleneck layer number between hidden layer\n                                        # eg. 1024|1024|42|1024 bottleneck_layer_num = 2\n\nmodify_learning_rates=false\nlast_layer_factor=0.1 # relates to modify_learning_rates.\nfirst_layer_factor=1.0 # relates to modify_learning_rates.\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0\nmax_change=10.0\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_opts=\"--mem 12G\"\ncleanup=true\negs_dir=\nlda_opts=\negs_opts=\ntransform_dir=\nnj=\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.02> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --initial-num-hidden-layers <#hidden-layers|1>   # Number of hidden layers to start with.\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|200000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|10>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --num-utts-subset <#utts|300>                    # Number of utterances in subsets used for validation and diagnostics\"\n  echo \"                                                   # (the validation subset is held out from training)\"\n  echo \"  --num-frames-diagnostic <#frames|4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames|10000>       # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`am-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncp $alidir/final.mat $dir 2>/dev/null\ncp $alidir/splice_opts $dir 2>/dev/null\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ncp $alidir/cmvn_opts $dir 2>/dev/null\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ntruncate_comp_num=$[2*$num_hidden_layers+1]\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts --splice-width $splice_width --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=`cat $dir/feat_dim` || exit 1;\nlda_dim=`cat $dir/lda_dim` || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  # Only pass --transform-dir to get_egs.sh when a transform dir was supplied.\n  [ ! -z \"$transform_dir\" ] && transform_dir_opt=\"--transform-dir $transform_dir\";\n  steps/nnet2/get_egs.sh $transform_dir_opt --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --splice-width $splice_width --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  lda_mat=$dir/lda.mat\n\n  stddev=`perl -e \"print 1.0/sqrt($hidden_layer_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$feat_dim left-context=$splice_width right-context=$splice_width const-component-dim=0\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditioned input-dim=$lda_dim output-dim=$hidden_layer_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$num_leaves alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$hidden_layer_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nEOF\n  bottleneck_stddev=`perl -e \"print 1.0/sqrt($bottleneck_dim);\"`\n  # bnf.config it will write the part of th config corresponding to a\n  # bottleneck layer; we need this to add bottleneck layer.\n  cat >$dir/bnf.config <<EOF\nAffineComponentPreconditioned input-dim=$hidden_layer_dim output-dim=$bottleneck_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nAffineComponentPreconditioned input-dim=$bottleneck_dim output-dim=$hidden_layer_dim alpha=$alpha max-change=$max_change learning-rate=$initial_learning_rate param-stddev=$bottleneck_stddev bias-stddev=$bias_stddev\nTanhComponent 
dim=$hidden_layer_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\necho num_iters = $num_iters\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nfinish_add_layers_iter=$[($num_hidden_layers-$initial_num_hidden_layers+1)*$add_layers_period]\nfirst_modify_iter=$[$finish_add_layers_iter + $add_layers_period]\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\ntruncate_comp_num=$[2*$num_hidden_layers+1]\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      if [ $[($x-1) / $add_layers_period] -eq $[($num_hidden_layers-2)] ]; then\n        echo bnf layer with x = $x\n        mdl=\"nnet-init --srand=$x $dir/bnf.config - | nnet-insert $dir/$x.mdl - - |\"\n      else\n        mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - |\"\n      fi\n    else\n      mdl=$dir/$x.mdl\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix \\\n         --minibatch-size=$minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? 
$f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    last_layer_learning_rate=`perl -e \"print $learning_rate * $final_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep -v Fixed | grep AffineComponent | wc -l`\n    # na is number of last updatable AffineComponent layer [one-based, counting only\n    # updatable components.]\n    # The last two layers will get this (usually lower) learning rate.\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do\n      if [ $n -eq $na ] || [ $n -eq $[$na-1] ]; then lr=$last_layer_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list - \\| \\\n      nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].mdl || exit 1;\n\n    if $modify_learning_rates && [ $x -ge $first_modify_iter ]; then\n      $cmd $dir/log/modify_learning_rates.$x.log \\\n        nnet-modify-learning-rates --last-layer-factor=$last_layer_factor \\\n          --first-layer-factor=$first_layer_factor --average-learning-rate=$learning_rate \\\n        $dir/$x.mdl $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if $shrink && [ $[$x % $shrink_interval] -eq 0 ]; then\n      mb=$[($num_frames_shrink+$num_threads-1)/$num_threads]\n      $cmd $parallel_opts $dir/log/shrink.$x.log \\\n        nnet-subset-egs --n=$num_frames_shrink --randomize-order=true --srand=$x \\\n          ark:$egs_dir/train_diagnostic.egs ark:-  \\| \\\n        nnet-combine-fast --use-gpu=no --num-threads=$num_threads --verbose=3 --minibatch-size=$mb \\\n          $dir/$[$x+1].mdl ark:- $dir/$[$x+1].mdl || exit 1;\n    else\n      # On other iters, do nnet-am-fix which is much faster and has roughly\n      # the same effect.\n      nnet-am-fix $dir/$[$x+1].mdl 
$dir/$[$x+1].mdl 2>$dir/log/fix.$x.log\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  this_num_threads=$num_threads\n  [ $this_num_threads -lt 8 ] && this_num_threads=8\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  $cmd $parallel_opts $combine_opts $dir/log/combine.log \\\n    nnet-combine-fast --use-gpu=no --num-threads=$this_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\nfi\n\n# Compute the probability of the final, combined model with\n# the same subset we used for the previous compute_probs, as the\n# different subsets will lead to different probs.\n$cmd $dir/log/compute_prob_valid.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n$cmd $dir/log/compute_prob_train.final.log \\\n  
nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%10] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 10th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n\nname=`basename $data`\nif [ -f $dir/final.mdl ]; then\n  nnet-to-raw-nnet --truncate=$truncate_comp_num $dir/final.mdl $dir/final.raw\nelse\n  echo \"$0: we require final.mdl in source dir $dir\"\nfi\n\n"
  },
  {
    "path": "egs/steps/nnet2/train_tanh_fast.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script trains a fairly vanilla network with tanh nonlinearities.\n\n# train_tanh_fast.sh is a new, improved version of train_tanh.sh, which uses\n# the 'online' preconditioning method.  For GPUs it's about two times faster\n# than before (although that's partly due to optimizations that will also help\n# the old recipe), and for CPUs it gives better performance than the old method\n# (I believe); also, the difference in optimization performance between CPU and\n# GPU is almost gone.  The old train_tanh.sh script is now deprecated.\n# We made this a separate script because not all of the options that the\n# old script accepted, are still accepted.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_epochs_extra=5 # Number of epochs after we stop reducing\n                   # the learning rate.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\ninitial_learning_rate=0.04\nfinal_learning_rate=0.004\nbias_stddev=0.5\nshrink_interval=5 # shrink every $shrink_interval iters except while we are\n                  # still adding layers, when we do it every iter.\nshrink=true\nnum_frames_shrink=2000 # note: must be <= --num-frames-diagnostic option to get_egs.sh, if\n                       # given.\nfinal_learning_rate_factor=0.5 # Train the two last layers of parameters half as\n                               # fast as the other layers, by default.\n\nhidden_layer_dim=300 #  You may want this larger, e.g. 
1024 or 2048.\n\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh.\nnum_jobs_nnet=8    # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of\n                # the samples on each iter.  You could set it to 0 or to a large\n                # value for complete randomization, but this would both consume\n                # memory and cause spikes in disk I/O.  Smaller is easier on\n                # disk and memory but less random.  It's not a huge deal though,\n                # as samples are anyway randomized right at the start.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nnum_hidden_layers=3 # This is an important configuration value that you might\n                    # want to tune.\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   
These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0 # relates to preconditioning.\nupdate_period=4 # relates to online preconditioning: says how often we update the subspace.\nnum_samples_history=2000 # relates to online preconditioning\nmax_change_per_sample=0.075\n# we make the [input, output] ranks less different for the tanh setup than for\n# the pnorm setup, as we don't have the difference in dimensions to deal with.\nprecondition_rank_in=30  # relates to online preconditioning\nprecondition_rank_out=60 # relates to online preconditioning\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n         # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncombine_parallel_opts=\"--num-threads 8\"  # queue options for the \"combine\" stage.\ncombine_num_threads=8\ncleanup=true\negs_dir=\nlda_opts=\negs_opts=\ntransform_dir=\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=  # Can be used to force \"raw\" features.\nprior_subset_size=10000 # 10k samples per job, for computing priors.  Should be\n                        # more than enough.\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-epochs-extra <#epochs-extra|5>             # Number of extra epochs of training\"\n  echo \"                                                   # after learning rate fully reduced\"\n  echo \"  --initial-learning-rate <initial-learning-rate|0.04> # Learning rate at start of training, e.g. 0.02 for small\"\n  echo \"                                                       # data, 0.01 for large data\"\n  echo \"  --final-learning-rate  <final-learning-rate|0.004>   # Learning rate at end of training, e.g. 0.004 for small\"\n  echo \"                                                   # data, 0.001 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|3>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --mix-up <#pseudo-gaussians|0>                   # Can be used to have multiple targets in final output layer,\"\n  echo \"                                                   # per context-dependent state.  
Try a number several times #states.\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --samples-per-iter <#samples|200000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|250>                              # Dimension to reduce spliced features to with LDA\"\n  echo \"  --num-iters-final <#iters|20>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --stage <stage|-5>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`am-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nextra_opts=()\n[ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n[ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n[ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\nextra_opts+=(--transform-dir $transform_dir)\nextra_opts+=(--splice-width $splice_width)\n\nif [ $stage -le -4 ]; then\n  echo \"$0: calling get_lda.sh\"\n  steps/nnet2/get_lda.sh $lda_opts \"${extra_opts[@]}\" --cmd \"$cmd\" $data $lang $alidir $dir || exit 1;\nfi\n\n# these files will have been written by get_lda.sh\nfeat_dim=$(cat $dir/feat_dim) || exit 1;\nivector_dim=$(cat $dir/ivector_dim) || exit 1;\nlda_dim=$(cat $dir/lda_dim) || exit 1;\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter \\\n      --num-jobs-nnet $num_jobs_nnet --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" \\\n      $data $lang $alidir $dir || exit 1;\nfi\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! [ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\nif ! 
[ $num_hidden_layers -ge 1 ]; then\n  echo \"Invalid num-hidden-layers $num_hidden_layers\"\n  exit 1\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing neural net\";\n\n  # Get spk-vec dim (in case we're using them).\n  lda_mat=$dir/lda.mat\n  tot_input_dim=$[$feat_dim+$ivector_dim]\n\n  online_preconditioning_opts=\"alpha=$alpha num-samples-history=$num_samples_history update-period=$update_period rank-in=$precondition_rank_in rank-out=$precondition_rank_out max-change-per-sample=$max_change_per_sample\"\n\n  stddev=`perl -e \"print 1.0/sqrt($hidden_layer_dim);\"`\n  cat >$dir/nnet.config <<EOF\nSpliceComponent input-dim=$tot_input_dim left-context=$splice_width right-context=$splice_width const-component-dim=$ivector_dim\nFixedAffineComponent matrix=$lda_mat\nAffineComponentPreconditionedOnline input-dim=$lda_dim output-dim=$hidden_layer_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nAffineComponentPreconditionedOnline input-dim=$hidden_layer_dim output-dim=$num_leaves $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=0 bias-stddev=0\nSoftmaxComponent dim=$num_leaves\nEOF\n\n  # to hidden.config it will write the part of the config corresponding to a\n  # single hidden layer; we need this to add new layers.\n  cat >$dir/hidden.config <<EOF\nAffineComponentPreconditionedOnline input-dim=$hidden_layer_dim output-dim=$hidden_layer_dim $online_preconditioning_opts learning-rate=$initial_learning_rate param-stddev=$stddev bias-stddev=$bias_stddev\nTanhComponent dim=$hidden_layer_dim\nEOF\n  $cmd $dir/log/nnet_init.log \\\n    nnet-am-init $alidir/tree $lang/topo \"nnet-init $dir/nnet.config -|\" \\\n    $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"Training transition probabilities and setting priors\"\n  $cmd $dir/log/train_trans.log \\\n    nnet-train-transitions $dir/0.mdl \"ark:gunzip -c 
$alidir/ali.*.gz|\" $dir/0.mdl \\\n    || exit 1;\nfi\n\nnum_iters_reduce=$[$num_epochs * $iters_per_epoch];\nnum_iters_extra=$[$num_epochs_extra * $iters_per_epoch];\nnum_iters=$[$num_iters_reduce+$num_iters_extra]\n\necho \"$0: Will train for $num_epochs + $num_epochs_extra epochs, equalling \"\necho \"$0: $num_iters_reduce + $num_iters_extra = $num_iters iterations, \"\necho \"$0: (while reducing learning rate) + (with constant learning rate).\"\n\n# This is when we decide to mix up from: halfway between when we've finished\n# adding the hidden layers and the end of training.\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\nmix_up_iter=$[($num_iters + $finish_add_layers_iter)/2]\n\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n  fi\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    if [ $x -gt 0 ] && [ ! 
-f $dir/log/mix_up.$[$x-1].log ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl \\\n         ark:$egs_dir/train_diagnostic.egs '&&' \\\n         nnet-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[($x-1) % $add_layers_period] -eq 0 ]; then\n      mdl=\"nnet-init --srand=$x $dir/hidden.config - | nnet-insert $dir/$x.mdl - - |\"\n    else\n      mdl=$dir/$x.mdl\n    fi\n\n    if [ $x -eq 0 ] || [ \"$mdl\" != \"$dir/$x.mdl\" ]; then\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size and just one job: the model-averaging doesn't seem to be helpful\n      # when the model is changing too fast (i.e. it worsens the objective\n      # function), and the smaller minibatch size will help to keep\n      # the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      do_average=false\n    else\n      this_minibatch_size=$minibatch_size\n      do_average=true\n    fi\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix \\\n         --minibatch-size=$this_minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    learning_rate=`perl -e '($x,$n,$i,$f)=@ARGV; print ($x >= $n ? 
$f : $i*exp($x*log($f/$i)/$n));' $[$x+1] $num_iters_reduce $initial_learning_rate $final_learning_rate`;\n    last_layer_learning_rate=`perl -e \"print $learning_rate * $final_learning_rate_factor;\"`;\n    nnet-am-info $dir/$[$x+1].1.mdl > $dir/foo  2>/dev/null || exit 1\n    nu=`cat $dir/foo | grep num-updatable-components | awk '{print $2}'`\n    na=`cat $dir/foo | grep -v Fixed | grep AffineComponent | wc -l`\n    # na is number of last updatable AffineComponent layer [one-based, counting only\n    # updatable components.]\n    # The last two layers will get this (usually lower) learning rate.\n    lr_string=\"$learning_rate\"\n    for n in `seq 2 $nu`; do\n      if [ $n -eq $na ] || [ $n -eq $[$na-1] ]; then lr=$last_layer_learning_rate;\n      else lr=$learning_rate; fi\n      lr_string=\"$lr_string:$lr\"\n    done\n\n    if $do_average; then\n      $cmd $dir/log/average.$x.log \\\n        nnet-am-average $nnets_list - \\| \\\n        nnet-am-copy --learning-rates=$lr_string - $dir/$[$x+1].mdl || exit 1;\n    else\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $num_jobs_nnet $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet-am-copy --learning-rates=$lr_string $dir/$[$x+1].$n.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    if $shrink && [ $[$x % $shrink_interval] -eq 0 ]; then\n      mb=$[($num_frames_shrink+$num_threads-1)/$num_threads]\n      $cmd $combine_parallel_opts $dir/log/shrink.$x.log \\\n        nnet-subset-egs --n=$num_frames_shrink --randomize-order=true 
--srand=$x \\\n          ark:$egs_dir/train_diagnostic.egs ark:-  \\| \\\n        nnet-combine-fast --use-gpu=no --num-threads=$combine_num_threads \\\n          --verbose=3 --minibatch-size=$mb \\\n          $dir/$[$x+1].mdl ark:- $dir/$[$x+1].mdl || exit 1;\n    else\n      # On other iters, do nnet-am-fix which is much faster and has roughly\n      # the same effect.\n      nnet-am-fix $dir/$[$x+1].mdl $dir/$[$x+1].mdl 2>$dir/log/fix.$x.log\n    fi\n\n    if [ \"$mix_up\" -gt 0 ] && [ $x -eq $mix_up_iter ]; then\n      # mix up.\n      echo Mixing up from $num_leaves to $mix_up components\n      $cmd $dir/log/mix_up.$x.log \\\n        nnet-am-mixup --min-count=10 --num-mixtures=$mix_up \\\n        $dir/$[$x+1].mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n    rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters_extra ]; then\n  echo \"Setting num_iters_final=$num_iters_extra\"\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  if [ $x -gt $mix_up_iter ]; then\n    nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\n  fi\ndone\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as if\n  # there are many models it can give out-of-memory error on the GPU; set\n  # num-threads to 8 to speed it up (this isn't ideal...)\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$combine_num_threads-1)/$combine_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  $cmd $combine_parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --use-gpu=no --num-threads=$combine_num_threads \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n     
 $dir/final.mdl || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_nnet $dir/log/get_post.JOB.log \\\n    nnet-subset-egs --n=$prior_subset_size ark:$egs_dir/egs.JOB.0.ark ark:- \\| \\\n    nnet-compute-from-egs \"nnet-to-raw-nnet $dir/final.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.log \\\n   vector-sum $dir/post.*.vec $dir/post.vec || exit 1;\n\n  rm $dir/post.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.log \\\n    nnet-adjust-priors $dir/final.mdl $dir/post.vec $dir/final.mdl || exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet2/update_nnet.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2013  Johns Hopkins University (Author: Jan Trmal)\n#           2013  Vimal Manohar\n# Apache 2.0.\n\n\n# This script updates an existing neural network model without initializing it.\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=20      # Number of epochs during which we reduce\n                   # the learning rate; number of iteration is worked out from this.\nnum_iters_final=20 # Maximum number of final iterations to give to the\n                   # optimization over the validation set.\nlearning_rates=\"0.0008:0.0008:0.0008:0\"\n\ncombine_regularizer=1.0e-14 # Small regularizer so that parameters won't go crazy.\nminibatch_size=128 # by default use a smallish minibatch size for neural net\n                   # training; this controls instability which would otherwise\n                   # be a problem with multi-threaded update.  Note: it also\n                   # interacts with the \"preconditioned\" update which generally\n                   # works better with larger minibatch size, so it's not\n                   # completely cost free.\n\nsamples_per_iter=200000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_nnet=16   # Number of neural net jobs to run in parallel.  This option\n                   # is passed to get_egs.sh.\nget_egs_stage=0\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\nstage=-5\n\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.   These don't\nsplice_width=4 # meaning +- 4 frames on each side for second LDA\nrandprune=4.0 # speeds up LDA.\nalpha=4.0\nmax_change=10.0\nmix_up=0 # Number of components to mix up to (should be > #tree leaves, if\n        # specified.)\nnum_threads=16\nparallel_opts=\"--num-threads 16 --mem 1G\" # by default we use 16 threads; this lets the queue know.\n  # note: parallel_opts doesn't automatically get adjusted if you adjust num-threads.\ncleanup=false\negs_dir=\negs_opts=\ntransform_dir=     # If supplied, overrides alidir\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <model-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet exp/tri4b_nnet\"\n  echo \"See also the more recent script train_more.sh which requires the egs\"\n  echo \"directory.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of main training\"\n  echo \"                                                   # while reducing learning rate (determines #iterations, together\"\n  echo \"                                                   # with --samples-per-iter and --num-jobs-nnet)\"\n  echo \"  --num-jobs-nnet <num-jobs|8>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as 
well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 5\\\">                       # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g.
>2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-width <width|4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --num-iters-final <#iters|10>                    # Number of final iterations to give to nnet-combine-fast to \"\n  echo \"                                                   # interpolate parameters (the weights are learned with a validation set)\"\n  echo \"  --num-utts-subset <#utts|300>                    # Number of utterances in subsets used for validation and diagnostics\"\n  echo \"                                                   # (the validation subset is held out from training)\"\n  echo \"  --num-frames-diagnostic <#frames|4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames|10000>       # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|-9>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --transform-dir                                  # Directory with fMLLR transforms. Overrides alidir if provided.\"\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\nsdir=$4\ndir=$5\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/number of pdfs/{print $NF}'` || exit 1;\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\nutils/lang/check_phones_compatible.sh $lang/phones.txt $sdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\n\nif [ $stage -le -3 ] && [ -z \"$egs_dir\" ]; then\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet2/get_egs.sh --samples-per-iter $samples_per_iter --num-jobs-nnet $num_jobs_nnet \\\n      --splice-width $splice_width --stage $get_egs_stage --cmd \"$cmd\" $egs_opts --io-opts \"$io_opts\" --transform-dir $transform_dir \\\n      $data $lang $alidir $dir || exit 1;\nfi\n\nif [ -z $egs_dir ]; then\n  egs_dir=$dir/egs\nfi\n\niters_per_epoch=`cat $egs_dir/iters_per_epoch`  || exit 1;\n! 
[ $num_jobs_nnet -eq `cat $egs_dir/num_jobs_nnet` ] && \\\n  echo \"$0: Warning: using --num-jobs-nnet=`cat $egs_dir/num_jobs_nnet` from $egs_dir\"\nnum_jobs_nnet=`cat $egs_dir/num_jobs_nnet` || exit 1;\n\n\n\nif [ $stage -le -2 ]; then\n  echo \"$0: using existing neural net\";\n  source_model=$sdir/final.mdl\n  nnet-am-copy --learning-rates=${learning_rates} $source_model $dir/0.mdl\nfi\n\n\nnum_iters=$[$num_epochs * $iters_per_epoch];\n\necho \"$0: Will train for $num_epochs epochs, equalling $num_iters iterations\"\n\n\nif [ $num_threads -eq 1 ]; then\n  train_suffix=\"-simple\" # this enables us to use GPU code if\n                         # we have just one thread.\nelse\n  train_suffix=\"-parallel --num-threads=$num_threads\"\nfi\n\nx=0\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set off jobs doing some diagnostics, in the background.\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/valid_diagnostic.egs &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet-compute-prob $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n\n    if [ $x -gt 0 ] ; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet-show-progress --use-gpu=no $dir/$[$x-1].mdl $dir/$x.mdl ark:$egs_dir/train_diagnostic.egs &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n    mdl=$dir/$x.mdl\n\n\n    $cmd $parallel_opts JOB=1:$num_jobs_nnet $dir/log/train.$x.JOB.log \\\n      nnet-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x \\\n      ark:$egs_dir/egs.JOB.$[$x%$iters_per_epoch].ark ark:- \\| \\\n      nnet-train$train_suffix \\\n         --minibatch-size=$minibatch_size --srand=$x \"$mdl\" \\\n        ark:- $dir/$[$x+1].JOB.mdl \\\n      || exit 1;\n\n    nnets_list=\n    for n in `seq 1 $num_jobs_nnet`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.mdl\"\n    done\n\n    $cmd $dir/log/average.$x.log \\\n      nnet-am-average $nnets_list $dir/$[$x+1].mdl || exit 1;\n\n   
 rm $nnets_list\n  fi\n  x=$[$x+1]\ndone\n\n# Now do combination.\n# At the end, final.mdl will be a combination of the last e.g. 10 models.\nnnets_list=()\nif [ $num_iters_final -gt $num_iters ]; then\n  echo \"Setting num_iters_final=$num_iters\"\nfi\nstart=$[$num_iters-$num_iters_final+1]\nfor x in `seq $start $num_iters`; do\n  idx=$[$x-$start]\n  nnets_list[$idx]=$dir/$x.mdl # \"nnet-am-copy --remove-dropout=true $dir/$x.mdl - |\"\ndone\n\nif [ $stage -le $num_iters ]; then\n  # Below, use --use-gpu=no to disable nnet-combine-fast from using a GPU, as\n  # if there are many models it can give out-of-memory error; set num-threads to 8\n  # to speed it up (this isn't ideal...)\n  this_num_threads=$num_threads\n  [ $this_num_threads -lt 8 ] && this_num_threads=8\n  num_egs=`nnet-copy-egs ark:$egs_dir/combine.egs ark:/dev/null 2>&1 | tail -n 1 | awk '{print $NF}'`\n  mb=$[($num_egs+$this_num_threads-1)/$this_num_threads]\n  [ $mb -gt 512 ] && mb=512\n  # Setting --initial-model to a large value makes it initialize the combination\n  # with the average of all the models.  It's important not to start with a\n  # single model, or, due to the invariance to scaling that these nonlinearities\n  # give us, we get zero diagonal entries in the fisher matrix that\n  # nnet-combine-fast uses for scaling, which after flooring and inversion, has\n  # the effect that the initial model chosen gets much higher learning rates\n  # than the others.  
This prevents the optimization from working well.\n  $cmd $parallel_opts $dir/log/combine.log \\\n    nnet-combine-fast --initial-model=100000 --num-lbfgs-iters=40 --use-gpu=no \\\n      --num-threads=$this_num_threads --regularizer=$combine_regularizer \\\n      --verbose=3 --minibatch-size=$mb \"${nnets_list[@]}\" ark:$egs_dir/combine.egs \\\n      $dir/final.mdl || exit 1;\nfi\n\n# Compute the probability of the final, combined model with\n# the same subset we used for the previous compute_probs, as the\n# different subsets will lead to different probs.\n$cmd $dir/log/compute_prob_valid.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/valid_diagnostic.egs &\n$cmd $dir/log/compute_prob_train.final.log \\\n  nnet-compute-prob $dir/final.mdl ark:$egs_dir/train_diagnostic.egs &\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if [ $egs_dir == \"$dir/egs\" ]; then\n    echo Removing training examples\n    steps/nnet2/remove_egs.sh $dir/egs\n  fi\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%10] -ne 0 ] && [ $x -lt $[$num_iters-$num_iters_final+1] ]; then\n       # delete all but every 10th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet3/adjust_priors.sh",
    "content": "#!/usr/bin/env bash\n\n. ./path.sh\n\n# This script computes the DNN output averaged over a small subset of\n# training egs and stores it in post.$iter.vec.\n# This is used for the purpose of adjusting the nnet priors.\n# When --use-raw-nnet is false, then the computed priors is added into the\n# nnet model; hence the term adjust priors.\n# When --use-raw-nnet is true, the computed priors is not added into the\n# nnet model and left in the file post.$iter.vec.\n\ncmd=run.pl\nprior_subset_size=20000   # 20k samples per job, for computing priors.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nuse_gpu=false             # if true, we run on GPU.\negs_type=egs              # Compute from $egs_type.*.ark in $egs_dir\n                          # If --egs-type is degs, then the program\n                          # nnet3-discriminative-compute-from-egs is used\n                          # instead of nnet3-compute-from-egs.\nuse_raw_nnet=false        # If raw nnet, the averaged posterior is computed\n                          # and stored in post.$iter.vec; but there is no\n                          # adjusting of priors\nminibatch_size=256\niter=final\n\n. utils/parse_options.sh\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [opts] <exp-dir> <egs-dir>\"\n  echo \" e.g.: $0 exp/nnet3_sad_snr/tdnn_train_100k_whole_1k_splice2_2_relu500\"\n  exit 1\nfi\n\ndir=$1\negs_dir=$2\n\nif $use_gpu; then\n  prior_gpu_opt=\"--use-gpu=yes\"\n  prior_queue_opt=\"--gpu 1\"\nelse\n  prior_gpu_opt=\"--use-gpu=no\"\n  prior_queue_opt=\"\"\nfi\n\nfor f in $egs_dir/$egs_type.1.ark $egs_dir/info/num_archives; do\n  if [ ! 
-f $f ]; then\n    echo \"$f not found\"\n    exit 1\n  fi\ndone\n\nif $use_raw_nnet; then\n  model=$dir/$iter.raw\nelse\n  model=\"nnet3-am-copy --raw=true $dir/$iter.mdl - |\"\nfi\n\nrm -f $dir/post.$iter.*.vec 2>/dev/null\n\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\nif [ $num_jobs_compute_prior -gt $num_archives ]; then\n  num_jobs_compute_prior=$num_archives\nfi\n\n\nif [ $egs_type != \"degs\" ]; then\n  $cmd JOB=1:$num_jobs_compute_prior $prior_queue_opt $dir/log/get_post.$iter.JOB.log \\\n    nnet3-copy-egs ark:$egs_dir/$egs_type.JOB.ark ark:- \\| \\\n    nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet3-merge-egs --minibatch-size=$minibatch_size ark:- ark:- \\| \\\n    nnet3-compute-from-egs $prior_gpu_opt --apply-exp=true \\\n    \"$model\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$iter.JOB.vec || exit 1;\nelse\n  $cmd JOB=1:$num_jobs_compute_prior $prior_queue_opt $dir/log/get_post.$iter.JOB.log \\\n    nnet3-discriminative-copy-egs ark:$egs_dir/degs.JOB.ark ark:- \\| \\\n    nnet3-discriminative-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet3-discriminative-merge-egs --minibatch-size=$minibatch_size ark:- ark:- \\| \\\n    nnet3-discriminative-compute-from-egs $prior_gpu_opt --apply-exp=true \\\n    \"$model\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$iter.JOB.vec || exit 1;\nfi\n\nsleep 3;  # make sure there is time for $dir/post.$iter.*.vec to appear.\n\n$cmd $dir/log/vector_sum.$iter.log \\\n  vector-sum $dir/post.$iter.*.vec $dir/post.$iter.vec || exit 1;\n\nif ! $use_raw_nnet; then\n  run.pl $dir/log/adjust_priors.$iter.log \\\n    nnet3-am-adjust-priors $dir/$iter.mdl $dir/post.$iter.vec $dir/${iter}_adj.mdl\nfi\n\nrm -f $dir/post.$iter.*.vec;\n"
  },
  {
    "path": "egs/steps/nnet3/align.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Brno University of Technology (Author: Karel Vesely)\n#           2013  Johns Hopkins University (Author: Daniel Povey)\n#           2015  Vijayaditya Peddinti\n#           2016  Vimal Manohar\n# Apache 2.0\n\n# Computes training alignments using nnet3 DNN\n# Warning: this script uses GPUs by default, and this is generally not\n# an efficient use of GPUs. Set --use-gpu false to make it run on CPU.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\niter=final\nuse_gpu=true\nframes_per_chunk=50\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\ngraphs_scp=\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: $0 <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/nnet4 exp/nnet4_ali\"\n   echo \"Warning: this script uses GPUs by default, and this is generally not\"\n   echo \"an efficient use of GPUs. 
Set --use-gpu false to make it run on CPU.\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split${nj}\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || \\\n   split_data.sh $data $nj || exit 1;\n\nif $use_gpu; then\n  queue_opt=\"--gpu 1\"\n  gpu_opt=\"--use-gpu=wait\"\nelse\n  queue_opt=\"\"\n  gpu_opt=\"--use-gpu=no\"\nfi\n\nextra_files=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nfor f in $srcdir/tree $srcdir/${iter}.mdl $data/feats.scp $lang/L.fst $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\ncp $srcdir/{tree,${iter}.mdl} $dir || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n## Set up features.  Note: these are different from the normal features\n## because we have one rspecifier that has the features for the entire\n## training set, not separate ones for each batch.\necho \"$0: feature type is raw\"\n\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nivector_opts=\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\necho \"$0: aligning data in $data using model from $srcdir, putting alignments in $dir\"\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_factor=$(cat $srcdir/frame_subsampling_factor)\n  frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\n  cp $srcdir/frame_subsampling_factor $dir\n  if [ \"$frame_subsampling_factor\" -gt 1 ] && \\\n     [ \"$scale_opts\" == \"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\" ]; then\n    echo \"$0: frame-subsampling-factor is not 1 (so likely a chain system),\"\n    echo \"...  but the scale opts are the defaults.  You probably want\"\n    echo \"--scale-opts '--transition-scale=1.0 --acoustic-scale=1.0 --self-loop-scale=1.0'\"\n    sleep 1\n  fi\nfi\n\nif [ ! -z \"$graphs_scp\" ]; then\n  if [ ! 
-f $graphs_scp ]; then\n    echo \"Could not find graphs $graphs_scp\" && exit 1\n  fi\n  tra=\"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $graphs_scp |\"\n  prog=compile-train-graphs-fsts\nelse\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n  prog=compile-train-graphs\nfi\n\n$cmd $queue_opt JOB=1:$nj $dir/log/align.JOB.log \\\n  $prog --read-disambig-syms=$lang/phones/disambig.int $dir/tree \\\n  $srcdir/${iter}.mdl  $lang/L.fst \"$tra\" ark:- \\| \\\n  nnet3-align-compiled $scale_opts $ivector_opts $frame_subsampling_opt \\\n  --frames-per-chunk=$frames_per_chunk \\\n  --extra-left-context=$extra_left_context \\\n  --extra-right-context=$extra_right_context \\\n  --extra-left-context-initial=$extra_left_context_initial \\\n  --extra-right-context-final=$extra_right_context_final \\\n  $gpu_opt --beam=$beam --retry-beam=$retry_beam \\\n  $srcdir/${iter}.mdl ark:- \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\necho \"$0: done aligning data.\"\n"
  },
  {
    "path": "egs/steps/nnet3/align_lats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Brno University of Technology (Author: Karel Vesely)\n#           2013  Johns Hopkins University (Author: Daniel Povey)\n#           2015  Vijayaditya Peddinti\n#           2016  Vimal Manohar\n#           2017  Pegah Ghahremani\n# Apache 2.0\n\n# Computes training alignments using nnet3 DNN, with output to lattices.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nstage=-1\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=0.1\"\nacoustic_scale=0.1\nbeam=20\niter=final\nframes_per_chunk=50\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\ngraphs_scp=\ngenerate_ali_from_lats=false # If true, alingments generated from lattices.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: $0 <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/nnet4 exp/nnet4_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split${nj}\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || \\\n   split_data.sh $data $nj || exit 1;\n\nextra_files=\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nfor f in $srcdir/tree $srcdir/${iter}.mdl $data/feats.scp $lang/L.fst $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\ncp $srcdir/{tree,${iter}.mdl} $dir || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n## Set up features.  Note: these are different from the normal features\n## because we have one rspecifier that has the features for the entire\n## training set, not separate ones for each batch.\necho \"$0: feature type is raw\"\n\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nivector_opts=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\necho \"$0: aligning data in $data using model from $srcdir, putting alignments in $dir\"\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_factor=$(cat $srcdir/frame_subsampling_factor)\n  frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\n  cp $srcdir/frame_subsampling_factor $dir\n  if [[ $frame_subsampling_factor -gt 1 ]]; then\n    # Assume a chain system, check argument sanity.\n    if [[ !
($scale_opts == *--self-loop-scale=1.0* &&\n             $scale_opts == *--transition-scale=1.0* &&\n             $acoustic_scale = '1.0') ]]; then\n      echo \"$0: ERROR: frame-subsampling-factor is not 1, assuming a chain system.\"\n      echo \"... You should pass the following options to this script:\"\n      echo \"  --scale-opts '--transition-scale=1.0 --self-loop-scale=1.0'\" \\\n           \"--acoustic_scale 1.0\"\n    fi\n  fi\nfi\n\nif [ ! -z \"$graphs_scp\" ]; then\n  if [ ! -f $graphs_scp ]; then\n    echo \"Could not find graphs $graphs_scp\" && exit 1\n  fi\n  tra=\"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $graphs_scp |\"\n  prog=compile-train-graphs-fsts\nelse\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n  prog=compile-train-graphs\nfi\n\nif [ $stage -le 0 ]; then\n  ## because nnet3-latgen-faster doesn't support adding the transition-probs to the\n  ## graph itself, we need to bake them into the compiled graphs.  This means we can't reuse previously compiled graphs,\n  ## because the other scripts write them without transition probs.\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    $prog --read-disambig-syms=$lang/phones/disambig.int \\\n    $scale_opts \\\n    $dir/tree $srcdir/${iter}.mdl  $lang/L.fst \"$tra\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1\nfi\n\nif [ $stage -le 1 ]; then\n  # Warning: nnet3-latgen-faster doesn't support a retry-beam so you may get more\n  # alignment errors (however, it does have a default min-active=200 so this\n  # will tend to reduce alignment errors).\n  # --allow_partial=false makes sure we reach the end of the decoding graph.\n  # --word-determinize=false makes sure we retain the alternative pronunciations of\n  #   words (including alternatives regarding optional silences).\n  #  --lattice-beam=$beam keeps all the alternatives that were within the beam,\n  #    it means we do no pruning of the lattice (lattices from a training transcription\n  #  
  will be small anyway).\n  $cmd JOB=1:$nj $dir/log/generate_lattices.JOB.log \\\n    nnet3-latgen-faster --acoustic-scale=$acoustic_scale $ivector_opts $frame_subsampling_opt \\\n    --frames-per-chunk=$frames_per_chunk \\\n    --extra-left-context=$extra_left_context \\\n    --extra-right-context=$extra_right_context \\\n    --extra-left-context-initial=$extra_left_context_initial \\\n    --extra-right-context-final=$extra_right_context_final \\\n    --beam=$beam --lattice-beam=$beam \\\n    --allow-partial=false --word-determinize=false \\\n    $srcdir/${iter}.mdl \"ark:gunzip -c $dir/fsts.JOB.gz |\" \\\n    \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ] && $generate_ali_from_lats; then\n  # If generate_ali_from_lats is true, ali.*.gz is generated in the lats dir\n  $cmd JOB=1:$nj $dir/log/generate_alignments.JOB.log \\\n    lattice-best-path --acoustic-scale=$acoustic_scale \"ark:gunzip -c $dir/lat.JOB.gz |\" \\\n    ark:/dev/null \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\necho \"$0: done generating lattices from training transcripts.\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/align_lats.sh",
    "content": "#!/bin/bash\n# Copyright 2012  Brno University of Technology (Author: Karel Vesely)\n#           2013  Johns Hopkins University (Author: Daniel Povey)\n#           2015  Vijayaditya Peddinti\n#           2016  Vimal Manohar\n#           2017  Pegah Ghahremani\n# Apache 2.0\n\n# Computes training alignments using nnet3 DNN, with output to lattices.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nstage=-1\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --self-loop-scale=1.0\"\nacoustic_scale=1.0\npost_decode_acwt=10.0\nbeam=20\niter=final\nframes_per_chunk=50\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\ngraphs_scp=\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: $0 <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/nnet4 exp/nnet4_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split${nj}\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || \\\n   split_data.sh $data $nj || exit 1;\n\nextra_files=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nfor f in $srcdir/tree $srcdir/${iter}.mdl $data/feats.scp $lang/L.fst $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\ncp $srcdir/{tree,${iter}.mdl} $dir || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n## Set up features.  Note: these are different from the normal features\n## because we have one rspecifier that has the features for the entire\n## training set, not separate ones for each batch.\necho \"$0: feature type is raw\"\n\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nivector_opts=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\necho \"$0: aligning data in $data using model from $srcdir, putting alignments in $dir\"\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_factor=$(cat $srcdir/frame_subsampling_factor)\n  frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\n  cp $srcdir/frame_subsampling_factor $dir\n  if [ \"$frame_subsampling_factor\" -gt 1 ] && \\\n     [ \"$scale_opts\" == \"--transition-scale=1.0 --self-loop-scale=0.1\" ]; then\n    echo \"$0: frame-subsampling-factor is not 1 (so likely a chain system),\"\n    echo \"...  but the scale opts are the defaults.  You probably want\"\n    echo \"--scale-opts '--transition-scale=1.0 --self-loop-scale=1.0'\"\n    sleep 1\n  fi\nfi\n\nif [ ! -z \"$graphs_scp\" ]; then\n  if [ ! 
-f $graphs_scp ]; then\n    echo \"Could not find graphs $graphs_scp\" && exit 1\n  fi\n  tra=\"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $graphs_scp |\"\n  prog=compile-train-graphs-fsts\nelse\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n  prog=compile-train-graphs\nfi\n\nif [ $stage -le 0 ]; then\n  ## because nnet3-latgen-faster doesn't support adding the transition-probs to the\n  ## graph itself, we need to bake them into the compiled graphs.  This means we can't reuse previously compiled graphs,\n  ## because the other scripts write them without transition probs.\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    $prog --read-disambig-syms=$lang/phones/disambig.int \\\n    $scale_opts \\\n    $dir/tree $srcdir/${iter}.mdl  $lang/L.fst \"$tra\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1\nfi\n\nif [ $stage -le 1 ]; then\n  # Warning: nnet3-latgen-faster doesn't support a retry-beam so you may get more\n  # alignment errors (however, it does have a default min-active=200 so this\n  # will tend to reduce alignment errors).\n  # --allow_partial=false makes sure we reach the end of the decoding graph.\n  # --word-determinize=false makes sure we retain the alternative pronunciations of\n  #   words (including alternatives regarding optional silences).\n  #  --lattice-beam=$beam keeps all the alternatives that were within the beam,\n  #    it means we do no pruning of the lattice (lattices from a training transcription\n  #    will be small anyway).\n  $cmd JOB=1:$nj $dir/log/generate_lattices.JOB.log \\\n    nnet3-latgen-faster --acoustic-scale=$acoustic_scale $ivector_opts $frame_subsampling_opt \\\n    --frames-per-chunk=$frames_per_chunk \\\n    --extra-left-context=$extra_left_context \\\n    --extra-right-context=$extra_right_context \\\n    --extra-left-context-initial=$extra_left_context_initial \\\n    --extra-right-context-final=$extra_right_context_final \\\n    --beam=$beam --lattice-beam=$beam 
\\\n    --allow-partial=false --word-determinize=false \\\n    $srcdir/${iter}.mdl \"ark:gunzip -c $dir/fsts.JOB.gz |\" \\\n    \"$feats\" \"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\necho \"$0: done generating lattices from training transcripts.\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/build_tree.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#  Apache 2.0.\n\n\n# This script builds a tree for use in the 'chain' systems (although the script\n# itself is pretty generic and doesn't use any 'chain' binaries).  This is just\n# like the first stages of a standard system, like 'train_sat.sh', except it\n# does 'convert-ali' to convert alignments to a monophone topology just created\n# from the 'lang' directory (in case the topology is different from where you\n# got the system's alignments from), and it stops after the tree-building and\n# model-initialization stage, without re-estimating the Gaussians or training\n# the transitions.\n\n\n# Begin configuration section.\nstage=-5\nexit_stage=-100 # you can use this to require it to exit at the\n                # beginning of a specific stage.  Not all values are\n                # supported.\ncmd=run.pl\ncontext_opts=  # e.g. set this to \"--context-width 5 --central-position 2\" for quinphone.\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nframe_subsampling_factor=1\nalignment_subsampling_factor=\nleftmost_questions_truncate=-1  # note: this option is deprecated and has no effect\ntree_stats_opts=\ncluster_phones_opts=\nrepeat_frames=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n  echo \"Usage: $0 <#leaves> <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 --frame-subsampling-factor 3 \\\\\"\n  echo \"   --context-opts '--context-width=2 --central-position=1'  \\\\\"\n  echo \"    3500 data/train_si84 data/lang_chain exp/tri3b_ali_si284_sp exp/chain/tree_a_sp\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --repeat-frames <true|false>                     # Only affects alignment conversion at\"\n  echo \"                                                   # the end. If true, generate an \"\n  echo \"                                                   # alignment using the frame-subsampled \"\n  echo \"                                                   # topology that is repeated \"\n  echo \"                                                   # --frame-subsampling-factor times \"\n  echo \"                                                   # and interleaved, to be the same \"\n  echo \"                                                   # length as the original alignment \"\n  echo \"                                                   # (useful for cross-entropy training \"\n  echo \"                                                   # of reduced frame rate systems).\"\n  echo \"  --context-opts <option-string>                   # Options controlling phonetic context;\"\n  echo \"                                                   # we suggest '--context-width=2 --central-position=1',\"\n  echo \"                                                   # which is left bigram.\"\n  echo \"  --frame-subsampling-factor <factor>              # Factor (e.g. 
3) controlling frame subsampling\"\n  echo \"                                                   # at the neural net output, so the frame rate at\"\n  echo \"                                                   # the output is less than at the input.\"\n  exit 1;\nfi\n\nnumleaves=$1\ndata=$2\nlang=$3\nalidir=$4\ndir=$5\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/final.mdl $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\noov=`cat $lang/oov.int`\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null # delta option.\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\nif [ -f $alidir/per_utt ]; then\n  sdata=$data/split${nj}utt\n  utils/split_data.sh --per-utt $data $nj\nelse\n  sdata=$data/split$nj\n  utils/split_data.sh $data $nj\nfi\n\n# Set up features.\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\n## Set up speaker-independent features.\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    cp 
$alidir/full.mat $dir 2>/dev/null\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# Add fMLLR transforms if available\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: Using transforms from $alidir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\nfi\n\n# Do subsampling of feats, if needed\nif [ $frame_subsampling_factor -gt 1 ]; then\n  feats=\"$feats subsample-feats --n=$frame_subsampling_factor ark:- ark:- |\"\nfi\n\nif [ -z $alignment_subsampling_factor ]; then\n  alignment_subsampling_factor=$frame_subsampling_factor\nfi\n\nif [ $stage -le -5 ]; then\n  echo \"$0: Initializing monophone model (for alignment conversion, in case topology changed)\"\n\n  [ ! -f $lang/phones/sets.int ] && exit 1;\n  shared_phones_opt=\"--shared-phones=$lang/phones/sets.int\"\n  # get feature dimension\n  example_feats=\"`echo $feats | sed s/JOB/1/g`\";\n  if ! feat_dim=$(feat-to-dim \"$example_feats\" - 2>/dev/null) || [ -z $feat_dim ]; then\n    feat-to-dim \"$example_feats\" - # to see the error message.\n    echo \"error getting feature dimension\"\n    exit 1;\n  fi\n  $cmd JOB=1 $dir/log/init_mono.log \\\n    gmm-init-mono $shared_phones_opt \"--train-feats=$feats subset-feats --n=10 ark:- ark:-|\" $lang/topo $feat_dim \\\n      $dir/mono.mdl $dir/mono.tree || exit 1;\nfi\n\nif [ $stage -le -4 ]; then\n  # Get tree stats.\n  echo \"$0: Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n     convert-ali --frame-subsampling-factor=$alignment_subsampling_factor \\\n         $alidir/final.mdl $dir/mono.mdl $dir/mono.tree \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:-  \\| \\\n      acc-tree-stats $context_opts $tree_stats_opts --ci-phones=$ciphonelist $dir/mono.mdl \\\n         \"$feats\" ark:- $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats 
$dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -3 ] && $train_tree; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  $cmd $dir/log/questions.log \\\n     cluster-phones $cluster_phones_opts $context_opts $dir/treeacc \\\n     $lang/phones/sets.int $dir/questions.int || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  $cmd $dir/log/compile_questions.log \\\n    compile-questions $context_opts $lang/topo \\\n      $dir/questions.int $dir/questions.qst || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Initializing the model\"\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments to the new tree.  Note: we likely will not use these\n  # converted alignments in the chain system directly, but they could be useful\n  # for other purposes.\n  echo \"$0: Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali --repeat-frames=$repeat_frames \\\n      --frame-subsampling-factor=$alignment_subsampling_factor \\\n      $alidir/final.mdl $dir/1.mdl $dir/tree \\\n      \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\ncp $dir/1.mdl $dir/final.mdl\n\necho $0: Done building tree\n"
  },
  {
    "path": "egs/steps/nnet3/chain/build_tree_multiple_sources.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#           2017  Vimal Manohar\n#  Apache 2.0.\n\n# This script is similar to steps/nnet3/chain/build_tree.sh but supports \n# getting statistics from multiple alignment sources.\n\n\n# Begin configuration section.\nstage=-5\nexit_stage=-100 # you can use this to require it to exit at the\n                # beginning of a specific stage.  Not all values are\n                # supported.\ncmd=run.pl\nuse_fmllr=true  # If true, fmllr transforms will be applied from the alignment directories.\n                # Otherwise, no fmllr will be applied even if alignment directory contains trans.*\ncontext_opts=  # e.g. set this to \"--context-width 5 --central-position 2\" for quinphone.\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nframe_subsampling_factor=1  # frame subsampling factor of output w.r.t. to the input features\ntree_stats_opts=\ncluster_phones_opts=\nrepeat_frames=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -lt 5 ]; then\n  echo \"Usage: steps/nnet3/chain/build_tree_multiple_sources.sh <#leaves> <lang> <data1> <ali-dir1> [<data2> <ali-dir2> ... 
<dataN> <ali-dirN>] <exp-dir>\"\n  echo \" e.g.: steps/nnet3/chain/build_tree_multiple_sources.sh 15000 data/lang data/train_sup exp/tri3_ali data/train_unsup exp/tri3/best_path_train_unsup exp/tree_semi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --repeat-frames <true|false>                     # Only affects alignment conversion at\"\n  echo \"                                                   # the end. If true, generate an \"\n  echo \"                                                   # alignment using the frame-subsampled \"\n  echo \"                                                   # topology that is repeated \"\n  echo \"                                                   # --frame-subsampling-factor times \"\n  echo \"                                                   # and interleaved, to be the same \"\n  echo \"                                                   # length as the original alignment \"\n  echo \"                                                   # (useful for cross-entropy training \"\n  echo \"                                                   # of reduced frame rate systems).\"\n  exit 1;\nfi\n\nnumleaves=$1\nlang=$2\ndir=${@: -1}  # last argument to the script\nshift 2;\ndata_and_alidirs=( $@ )  # read the remaining arguments into an array\nunset data_and_alidirs[${#data_and_alidirs[@]}-1]  # 'pop' the last argument which is odir\nnum_sys=$[${#data_and_alidirs[@]}]  # number of systems to combine\n\nif (( $num_sys % 2 != 0 )); then\n  echo \"$0: The data and alignment arguments must be an even number of arguments.\"\n  exit 1\nfi\n\nnum_sys=$((num_sys / 2))\n\ndata=$dir/data_tmp\nmkdir -p $data\n\nmkdir -p $dir\nalidir=`echo 
${data_and_alidirs[1]}`\n\ndatadirs=()\nalidirs=()\nfor n in `seq 0 $[num_sys-1]`; do\n  datadirs[$n]=${data_and_alidirs[$[2*n]]}\n  alidirs[$n]=${data_and_alidirs[$[2*n+1]]}\ndone\n\nutils/combine_data.sh $data ${datadirs[@]} || exit 1\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/final.mdl $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\noov=`cat $lang/oov.int`\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nsdata=$data/split$nj;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null` || exit 1\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null # delta option.\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n# Set up features.\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\necho \"$0: feature type is $feat_type\"\n\nfeats=()\nfeats_one=()\nfor n in `seq 0 $[num_sys-1]`; do\n  this_nj=$(cat ${alidirs[$n]}/num_jobs) || exit 1\n  this_sdata=${datadirs[$n]}/split$this_nj\n  [[ -d $this_sdata && ${datadirs[$n]}/feats.scp -ot $this_sdata ]] || split_data.sh ${datadirs[$n]} $this_nj || exit 1;\n  ## Set up speaker-independent features.\n  case $feat_type in\n    delta) feats[$n]=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$this_sdata/JOB/utt2spk scp:$this_sdata/JOB/cmvn.scp scp:$this_sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\"\n      feats_one[$n]=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$this_sdata/1/utt2spk 
scp:$this_sdata/1/cmvn.scp scp:$this_sdata/1/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n    lda) feats[$n]=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$this_sdata/JOB/utt2spk scp:$this_sdata/JOB/cmvn.scp scp:$this_sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n      feats_one[$n]=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$this_sdata/1/utt2spk scp:$this_sdata/1/cmvn.scp scp:$this_sdata/1/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n      cp $alidir/final.mat $dir\n      cp $alidir/full.mat $dir 2>/dev/null\n      ;;\n    *) echo \"$0: invalid feature type $feat_type\" && exit 1;\n  esac\n  \n  if $use_fmllr; then\n    if [ ! -f ${alidirs[$n]}/trans.1 ]; then\n      echo \"$0: Could not find fMLLR transforms in ${alidirs[$n]}\"\n      exit 1\n    fi\n\n    echo \"$0: Using transforms from ${alidirs[$n]}\"\n    feats[$n]=\"${feats[$n]} transform-feats --utt2spk=ark:$this_sdata/JOB/utt2spk ark,s,cs:${alidirs[$n]}/trans.JOB ark:- ark:- |\"\n    feats_one[$n]=\"${feats_one[$n]} transform-feats --utt2spk=ark:$this_sdata/1/utt2spk ark,s,cs:${alidirs[$n]}/trans.1 ark:- ark:- |\"\n  fi\n\n  # Do subsampling of feats, if needed\n  if [ $frame_subsampling_factor -gt 1 ]; then\n    feats[$n]=\"${feats[$n]} subsample-feats --n=$frame_subsampling_factor ark:- ark:- |\"\n    feats_one[$n]=\"${feats_one[$n]} subsample-feats --n=$frame_subsampling_factor ark:- ark:- |\"\n  fi\ndone\n\nif [ $stage -le -5 ]; then\n  echo \"$0: Initializing monophone model (for alignment conversion, in case topology changed)\"\n\n  [ ! -f $lang/phones/sets.int ] && exit 1;\n  shared_phones_opt=\"--shared-phones=$lang/phones/sets.int\"\n  # get feature dimension\n  example_feats=\"`echo ${feats[0]} | sed s/JOB/1/g`\";\n  if ! 
feat_dim=$(feat-to-dim \"$example_feats\" - 2>/dev/null) || [ -z $feat_dim ]; then\n    feat-to-dim \"$example_feats\" - # to see the error message.\n    echo \"error getting feature dimension\"\n    exit 1;\n  fi\n\n  for n in `seq 0 $[num_sys-1]`; do\n    copy-feats \"${feats_one[$n]}\" ark:-\n  done | copy-feats ark:- ark:$dir/tmp.ark\n  \n  $cmd $dir/log/init_mono.log \\\n    gmm-init-mono $shared_phones_opt \\\n      \"--train-feats=ark:subset-feats --n=10 ark:$dir/tmp.ark ark:- |\" $lang/topo $feat_dim \\\n    $dir/mono.mdl $dir/mono.tree || exit 1\nfi\n\n\nif [ $stage -le -4 ]; then\n  # Get tree stats.\n\n  for n in `seq 0 $[num_sys-1]`; do\n    echo \"$0: Accumulating tree stats\"\n    this_data=${datadirs[$n]}\n    this_alidir=${alidirs[$n]}\n    this_nj=$(cat $this_alidir/num_jobs) || exit 1\n    this_frame_subsampling_factor=1\n    if [ -f $this_alidir/frame_subsampling_factor ]; then\n      this_frame_subsampling_factor=$(cat $this_alidir/frame_subsampling_factor)\n    fi\n\n    if (( $frame_subsampling_factor % $this_frame_subsampling_factor != 0 )); then\n      echo \"$0: frame-subsampling-factor=$frame_subsampling_factor is not \"\n      echo \"divisible by $this_frame_subsampling_factor (that of $this_alidir)\"\n      exit 1\n    fi\n\n    this_frame_subsampling_factor=$((frame_subsampling_factor / this_frame_subsampling_factor))\n    $cmd JOB=1:$this_nj $dir/log/acc_tree.$n.JOB.log \\\n       convert-ali --frame-subsampling-factor=$this_frame_subsampling_factor \\\n           $this_alidir/final.mdl $dir/mono.mdl $dir/mono.tree \"ark:gunzip -c $this_alidir/ali.JOB.gz|\" ark:-  \\| \\\n        acc-tree-stats $context_opts $tree_stats_opts --ci-phones=$ciphonelist $dir/mono.mdl \\\n           \"${feats[$n]}\" ark:- $dir/$n.JOB.treeacc || exit 1;\n    [ \"`ls $dir/$n.*.treeacc | wc -w`\" -ne \"$this_nj\" ] && echo \"$0: Wrong #tree-accs for data $n $this_data\" && exit 1;\n  done\n\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc 
$dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -3 ] && $train_tree; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  $cmd $dir/log/questions.log \\\n     cluster-phones $cluster_phones_opts $context_opts $dir/treeacc \\\n     $lang/phones/sets.int $dir/questions.int || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  $cmd $dir/log/compile_questions.log \\\n    compile-questions \\\n      $context_opts $lang/topo $dir/questions.int $dir/questions.qst || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Initializing the model\"\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments to the new tree.  
Note: we likely will not use these\n  # converted alignments in the chain system directly, but they could be useful\n  # for other purposes.\n\n  for n in `seq 0 $[num_sys-1]`; do\n    this_alidir=${alidirs[$n]}\n    this_nj=$(cat $this_alidir/num_jobs) || exit 1\n    \n    this_frame_subsampling_factor=1\n    if [ -f $this_alidir/frame_subsampling_factor ]; then\n      this_frame_subsampling_factor=$(cat $this_alidir/frame_subsampling_factor)\n    fi\n\n    if (( $frame_subsampling_factor % $this_frame_subsampling_factor != 0 )); then\n      echo \"$0: frame-subsampling-factor=$frame_subsampling_factor is not \"\n      echo \"divisible by $this_frame_subsampling_factor (that of $this_alidir)\"\n      exit 1\n    fi\n\n    echo \"$0: frame-subsampling-factor for $this_alidir is $this_frame_subsampling_factor\"\n\n    this_frame_subsampling_factor=$((frame_subsampling_factor / this_frame_subsampling_factor))\n    echo \"$0: Converting alignments from $this_alidir to use current tree\"\n    $cmd JOB=1:$this_nj $dir/log/convert.$n.JOB.log \\\n      convert-ali --repeat-frames=$repeat_frames \\\n        --frame-subsampling-factor=$this_frame_subsampling_factor \\\n        $this_alidir/final.mdl $dir/1.mdl $dir/tree \"ark:gunzip -c $this_alidir/ali.JOB.gz |\" \\\n        ark,scp:$dir/ali.$n.JOB.ark,$dir/ali.$n.JOB.scp || exit 1\n\n    for i in `seq $this_nj`; do \n      cat $dir/ali.$n.$i.scp \n    done > $dir/ali.$n.scp || exit 1\n  done\n\n  for n in `seq 0 $[num_sys-1]`; do\n    cat $dir/ali.$n.scp\n  done | sort -k1,1 > $dir/ali.scp || exit 1\n\n  utils/split_data.sh $data $nj\n  $cmd JOB=1:$nj $dir/log/copy_alignments.JOB.log \\\n    copy-int-vector \"scp:utils/filter_scp.pl $data/split$nj/JOB/utt2spk $dir/ali.scp |\" \\\n    \"ark:| gzip -c > $dir/ali.JOB.gz\" || exit 1\nfi\n\ncp $dir/1.mdl $dir/final.mdl\n\necho $0: Done building tree\n"
  },
  {
    "path": "egs/steps/nnet3/chain/e2e/README.txt",
    "content": "The scripts related to end2end chain training are in this directory.\nCurrently it has 3 scripts:\n\n** prepare_e2e.sh: this is almost equivalent\nto regular chain's build_tree.sh (i.e. it creates the tree and\nthe transition-model) except it does not require any previously\ntrained models (in other words, it does what stages -3 and -2\nof steps/train_mono.sh do).\n\n** get_egs_e2e.sh: this is similar to chain/get_egs.sh except it\nuses training FSTs (instead of lattices) to generate end2end egs.\n\n** train_e2e.py: this is very similar to chain/train.py but\nwith fewer stages (e.g. it does not compute the preconditioning matrix).\n\n\nFor details please see the comments at top of local/chain/e2e/run_flatstart_*.sh\nand also src/chain/chain-generic-numerator.h.\n"
  },
  {
    "path": "egs/steps/nnet3/chain/e2e/compute_biphone_stats.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright    2018 Hossein Hadian\n# Apache 2.0\n\nimport argparse\nfrom os.path import join\nimport sys\nimport copy\nimport random\n\nparser = argparse.ArgumentParser(description=\"\"\"This script reads\n    sequences of phone ids from std input and counts mono/biphone stats\n    and writes the results to std out. The output can be used with\n    gmm-init-biphone to create a better tree. The first part of the\n    output is biphone counts with this format for each line:\n    <phone-id> <phone-id> <count>\n    and the second part of the output is monophone counts with the\n    following format:\n    <phone-id> <count>\"\"\")\nparser.add_argument('langdir', type=str)\nparser.add_argument('--shared-phones', type=str, choices=['true','false'],\n                    default='true',\n                    help=\"If true, stats will be collected for shared phones.\")\n\nargs = parser.parse_args()\nargs.shared_phones = args.shared_phones == 'true'\n\n# Read phone sets\nphone_sets = []\nphones = []\nphone_to_shard_phone = {}\nphone_to_shard_phone[0] = 0  # The no-left-context case\nwith open(join(args.langdir, 'phones/sets.int'), 'r', encoding='latin-1') as f:\n    for line in f:\n        phone_set = line.strip().split()\n        phone_sets.append(phone_set)\n        for phone in phone_set:\n            phones.append(phone)\n            phone_to_shard_phone[phone] = phone_set[0]\n\nprint('Loaded {} phone-sets containing {} phones.'.format(len(phone_sets),\n                                                          len(phones)),\n      file=sys.stderr)\n\nbiphone_counts = {}\nmono_counts = {}\nfor line in sys.stdin:\n    line = line.strip().split()\n    key = line[0]\n    line_phones = line[1:]\n    for pair in zip([0] + line_phones, line_phones):  # 0 is for the no left-context case\n        if args.shared_phones:\n            pair = (phone_to_shard_phone[pair[0]], phone_to_shard_phone[pair[1]])\n        if pair 
not in biphone_counts:\n            biphone_counts[pair] = 0\n        biphone_counts[pair] += 1\n        mono_counts[pair[1]] = 1 if pair[1] not in mono_counts else mono_counts[pair[1]] + 1\n\nfor phone1 in [0] + phones:\n    for phone2 in phones:\n        pair = (phone1, phone2)\n        shared_pair = ((phone_to_shard_phone[pair[0]], phone_to_shard_phone[pair[1]])\n                       if args.shared_phones else pair)\n        count = biphone_counts[shared_pair] if shared_pair in biphone_counts else 0\n        if count != 0:\n            print('{} {} {}'.format(pair[0], pair[1], count))\nfor phone in phones:\n    shared = phone_to_shard_phone[phone] if args.shared_phones else phone\n    count = mono_counts[shared] if shared in mono_counts else 0\n    if count != 0:\n        print('{} {}'.format(phone, count))\n"
  },
  {
    "path": "egs/steps/nnet3/chain/e2e/get_egs_e2e.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015 Johns Hopkins University (Author: Daniel Povey)\n# Copyright   2017  Hossein Hadian\n# Apache 2.0.\n#\n\n\n# This is similar to chain/get_egs.sh except it\n# uses training FSTs (instead of lattices) to generate end2end egs.\n# It calls the nnet3-chain-e2e-get-egs binary.\n\n\n# Begin configuration section.\ncmd=run.pl\nnormalize_egs=true\nframe_subsampling_factor=3 # frames-per-second of features we train on divided\n                           # by frames-per-second at output of chain model\nleft_context=4    # amount of left-context per eg (i.e. extra frames of input features\n                  # not present in the output supervision).\nright_context=4   # amount of right-context per eg.\nleft_context_initial=-1    # if >=0, left-context for first chunk of an utterance\nright_context_final=-1     # if >=0, right-context for last chunk of an utterance\ncompress=true   # set this to false to disable compression (e.g. if you want to see whether\n                # results are affected).\n\nnum_utts_subset=1400     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\nnum_valid_egs_combine=0  # #validation examples for combination weights at the very end.\nnum_train_egs_combine=1000 # number of train examples for the above.\nnum_egs_diagnostic=700 # number of examples for \"compute_prob\" jobs\nframes_per_iter=400000 # each iteration of training, see this many frames per\n                       # job, measured at the sampling rate of the features\n                       # used.  
This is just a guideline; it will pick a number\n                       # that divides the number of samples in the entire data.\n\nstage=0\nnj=15         # This should be set to the maximum number of jobs you are\n              # comfortable to run in parallel; you can increase it if your disk\n              # speed is greater and you have more machines.\nmax_shuffle_jobs_run=50  # the shuffle jobs now include the nnet3-chain-normalize-egs command,\n                         # which is fairly CPU intensive, so we can run quite a few at once\n                         # without overloading the disks.\nsrand=0     # rand seed for nnet3-chain-get-egs, nnet3-chain-copy-egs and nnet3-chain-shuffle-egs\nonline_ivector_dir=  # can be used if we are including speaker information as iVectors.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  This is used to turn off CMVN in the online-nnet experiments.\nonline_cmvn=false # Set to 'true' to replace 'apply-cmvn' by 'apply-cmvn-online' in the nnet3 input.\n                  # The configuration is passed externally via '$cmvn_opts' given to train.py,\n                  # typically as: --cmvn-opts=\"--config conf/online_cmvn.conf\".\n                  # The global_cmvn.stats are computed by this script from the features.\n                  # Note: the online cmvn for ivector extractor it is controlled separately in\n                  #       steps/online/nnet2/train_ivector_extractor.sh by --online-cmvn-iextractor\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <chain-dir> <fsts-dir> <egs-dir>\"\n  echo \" e.g.: $0 data/train exp/chain/e2e exp/chain/e2e_tree exp/chain/e2e/egs\"\n  echo \"\"\n  echo \"From <chain-dir>, 0.trans_mdl (the transition-model), tree (the tree)\"\n  echo \"and normalization.fst (the normalization FST, derived from the denominator FST)\"\n  echo \"are read.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --nj <nj>                                        # The maximum number of jobs you want to run in\"\n  echo \"                                                   # parallel (increase this only if you have good disk and\"\n  echo \"                                                   # network speed).  default=15\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --frames-per-iter <#samples;400000>              # Number of frames of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --feat-type <lda|raw>                            # (raw is the default).  
The feature type you want\"\n  echo \"                                                   # to use as input to the neural net.\"\n  echo \"  --frame-subsampling-factor <factor;3>            # factor by which num-frames at nnet output is reduced \"\n  echo \"  --left-context <int;4>                           # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <int;4>                          # Number of frames on right side to append for feature input\"\n  echo \"  --left-context-initial <int;-1>                  # If >= 0, left-context for first chunk of an utterance\"\n  echo \"  --right-context-final <int;-1>                   # If >= 0, right-context for last chunk of an utterance\"\n  echo \"  --num-egs-diagnostic <#egs;700>                  # Number of egs used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-egs-combine <#egs;0>                 # Number of egs used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nchaindir=$2\nfstdir=$3\ndir=$4\n\n# Check some files.\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $data/allowed_lengths.txt \\\n         $chaindir/{0.trans_mdl,tree,normalization.fst} $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log $dir/info\n\n# Get list of validation utterances.\n\nframe_shift=$(utils/data/get_frame_shift.sh $data)\nutils/data/get_utt2dur.sh $data\n\nframes_per_eg=$(cat $data/allowed_lengths.txt | tr '\\n' , | sed 's/,$//')\n\n[ ! 
-f \"$data/utt2len\" ] && feat-to-len scp:$data/feats.scp ark,t:$data/utt2len\n\ncat $data/utt2len | \\\n  awk '{print $1}' | \\\n  utils/shuffle_list.pl 2>/dev/null | head -$num_utts_subset > $dir/valid_uttlist\n\n\nlen_uttlist=`wc -l $dir/valid_uttlist | awk '{print $1}'`\nif [ $len_uttlist -lt $num_utts_subset ]; then\n  echo \"Number of utterances which have length at least $frames_per_eg is really low. Please check your data.\" && exit 1;\nfi\n\nif [ -f $data/utt2uniq ]; then  # this matters if you use data augmentation.\n  # because of this stage we can again have utts with lengths less than\n  # frames_per_eg\n  echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n  rm $dir/uniq2utt $dir/valid_uttlist.tmp\nfi\n\n# awk -v mf_len=222 '{if ($2 == mf_len) print $1}' | \\\ncat $data/utt2len | \\\n  awk '{print $1}' | \\\n   utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl 2>/dev/null | head -$num_utts_subset > $dir/train_subset_uttlist\nlen_uttlist=`wc -l $dir/train_subset_uttlist | awk '{print $1}'`\nif [ $len_uttlist -lt $num_utts_subset ]; then\n  echo \"Number of utterances which have length at least $frames_per_eg is really low. Please check your data.\" && exit 1;\nfi\n\n## Set up features.\n\n# get the global_cmvn stats for online-cmvn,\nif $online_cmvn; then\n  # create global_cmvn.stats,\n  #\n  # caution: the top-level nnet training script should copy\n  # 'global_cmvn.stats' and 'online_cmvn' to its own dir.\n  if ! 
matrix-sum --binary=false scp:$data/cmvn.scp - >$dir/global_cmvn.stats 2>/dev/null; then\n    echo \"$0: Error summing cmvn stats\"\n    exit 1\n  fi\n  touch $dir/online_cmvn\nelse\n  [ -f $dir/online_cmvn ] && rm $dir/online_cmvn\nfi\n\n# create the feature pipelines,\nif ! $online_cmvn; then\n  # the original front-end with 'apply-cmvn',\n  echo \"$0: feature type is raw, with 'apply-cmvn'\"\n  feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\nelse\n  # the alternative front-end with 'apply-cmvn-online',\n  # - the $cmvn_opts can be set to '--config=conf/online_cmvn.conf' which is the setup of ivector-extractor,\n  echo \"$0: feature type is raw, with 'apply-cmvn-online'\"\n  feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt $dir/global_cmvn.stats scp:- ark:- |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=ark:$data/spk2utt $dir/global_cmvn.stats scp:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=ark:$data/spk2utt $dir/global_cmvn.stats scp:- ark:- |\"\nfi\necho $cmvn_opts >$dir/cmvn_opts # caution: the top-level nnet training script should copy this to its own dir now.\n\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  echo $ivector_dim > $dir/info/ivector_dim\n  steps/nnet2/get_ivector_id.sh $online_ivector_dir > $dir/info/final.ie.id || exit 1\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nelse\n  ivector_opts=\"\"\n  echo 0 >$dir/info/ivector_dim\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\n  echo \"$0: working out feature dim\"\n  feats_one=\"$(echo $feats | sed s/JOB/1/g)\"\n  if ! feat_dim=$(feat-to-dim \"$feats_one\" - 2>/dev/null); then\n    echo \"Command failed (getting feature dim): feat-to-dim \\\"$feats_one\\\"\"\n    exit 1\n  fi\n  echo $feat_dim > $dir/info/feat_dim\nelse\n  num_frames=$(cat $dir/info/num_frames) || exit 1;\n  feat_dim=$(cat $dir/info/feat_dim) || exit 1;\nfi\n\n# the + 1 is to round up, not down... 
we assume it doesn't divide exactly.\nnum_archives=$[$num_frames/$frames_per_iter+1]\n\n# We may have to first create a smaller number of larger archives, with number\n# $num_archives_intermediate, if $num_archives is more than the maximum number\n# of open filehandles that the system allows per process (ulimit -n).\nmax_open_filehandles=500 #$(ulimit -n) || exit 1\nnum_archives_intermediate=$num_archives\narchives_multiple=1\nwhile [ $[$num_archives_intermediate+4] -gt $max_open_filehandles ]; do\n  archives_multiple=$[$archives_multiple+1]\n  num_archives_intermediate=$[$num_archives/$archives_multiple] || exit 1;\ndone\n# now make sure num_archives is an exact multiple of archives_multiple.\nnum_archives=$[$archives_multiple*$num_archives_intermediate] || exit 1;\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n# Work out the number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg*$num_archives)] || exit 1;\n! [ $egs_per_archive -le $frames_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= frames_per_iter=$frames_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\nif [ $left_context_initial -ge 0 ] || [ $right_context_final -ge 0 ]; then\n  echo \"$0:   ... and (left-context-initial,right-context-final) = ($left_context_initial,$right_context_final)\"\nfi\n\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  
See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/cegs.$x.ark; done)\n  for x in $(seq $num_archives_intermediate); do\n    utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/cegs_orig.$y.$x.ark; done)\n  done\nfi\n\n\negs_opts=\"--left-context=$left_context --right-context=$right_context --num-frames=$frames_per_eg --frame-subsampling-factor=$frame_subsampling_factor --compress=$compress\"\n[ $left_context_initial -ge 0 ] && egs_opts=\"$egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && egs_opts=\"$egs_opts --right-context-final=$right_context_final\"\n\n\necho $left_context > $dir/info/left_context\necho $right_context > $dir/info/right_context\necho $left_context_initial > $dir/info/left_context_initial\necho $right_context_final > $dir/info/right_context_final\n\nnum_fst_jobs=$(cat $fstdir/num_jobs) || exit 1;\nfor id in $(seq $num_fst_jobs); do cat $fstdir/fst.$id.scp; done > $fstdir/fst.scp\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Getting validation and training subset examples.\"\n  rm $dir/.error 2>/dev/null\n\n  # do the filtering just once, as fst.scp may be long.\n  utils/filter_scp.pl <(cat $dir/valid_uttlist $dir/train_subset_uttlist) \\\n    <$fstdir/fst.scp >$fstdir/fst_special.scp\n  if $normalize_egs; then\n    norm_opt=$chaindir/normalization.fst\n  else\n    norm_opt=\n  fi\n  $cmd $dir/log/create_valid_subset.log \\\n    utils/filter_scp.pl $dir/valid_uttlist $fstdir/fst_special.scp \\| \\\n    fstcopy scp:- ark:- \\| \\\n    nnet3-chain-e2e-get-egs $ivector_opts --srand=$srand \\\n      $egs_opts $norm_opt \\\n      \"$valid_feats\" ark,s,cs:- $chaindir/0.trans_mdl \"ark:$dir/valid_all.cegs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    utils/filter_scp.pl $dir/train_subset_uttlist $fstdir/fst_special.scp \\| \\\n    fstcopy scp:- ark:- \\| \\\n    
nnet3-chain-e2e-get-egs $ivector_opts --srand=$srand \\\n      $egs_opts $norm_opt \\\n      \"$train_subset_feats\" ark,s,cs:- $chaindir/0.trans_mdl \"ark:$dir/train_subset_all.cegs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && echo \"Error detected while creating train/valid egs\" && exit 1\n  echo \"... Getting subsets of validation examples for diagnostics and combination.\"\n  $cmd $dir/log/create_valid_subset_combine.log \\\n    nnet3-chain-subset-egs --n=$num_valid_egs_combine ark:$dir/valid_all.cegs \\\n    ark:$dir/valid_combine.cegs || touch $dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet3-chain-subset-egs --n=$num_egs_diagnostic ark:$dir/valid_all.cegs \\\n    ark:$dir/valid_diagnostic.cegs || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet3-chain-subset-egs --n=$num_train_egs_combine ark:$dir/train_subset_all.cegs \\\n    ark:$dir/train_combine.cegs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet3-chain-subset-egs --n=$num_egs_diagnostic ark:$dir/train_subset_all.cegs \\\n    ark:$dir/train_diagnostic.cegs || touch $dir/.error &\n  wait\n  sleep 5  # wait for file system to sync.\n  cat $dir/valid_combine.cegs $dir/train_combine.cegs > $dir/combine.cegs\n\n  for f in $dir/{combine,train_diagnostic,valid_diagnostic}.cegs; do\n    [ ! 
-s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n\n  #rm $dir/valid_all.cegs $dir/train_subset_all.cegs $dir/{train,valid}_combine.cegs\n  #exit 0\nfi\n\necho \"num_archives_intermediate:\" $num_archives_intermediate\necho \"num_archives: $num_archives\"\necho \"archives_multiple: $archives_multiple\"\n\nif [ $stage -le 4 ]; then\n  # create cegs_orig.*.*.ark; the first index goes to $nj,\n  # the second to $num_archives_intermediate.\n\n  egs_list=\n  for n in $(seq $num_archives_intermediate); do\n    egs_list=\"$egs_list ark:$dir/cegs_orig.JOB.$n.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n\n  # The examples will go round-robin to egs_list.  Note: we omit the\n  # 'normalization.fst' argument while creating temporary egs: the phase of egs\n  # preparation that involves the normalization FST is quite CPU-intensive and\n  # it's more convenient to do it later, in the 'shuffle' stage.  Otherwise to\n  # make it efficient we need to use a large 'nj', like 40, and in that case\n  # there can be too many small files to deal with, because the total number of\n  # files is the product of 'nj' by 'num_archives_intermediate', which might be\n  # quite large.\n  $cmd JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    utils/filter_scp.pl $sdata/JOB/utt2spk $fstdir/fst.scp \\| \\\n    fstcopy scp:- ark:- \\| \\\n    nnet3-chain-e2e-get-egs $ivector_opts --srand=\\$[JOB+$srand] $egs_opts \\\n     \"$feats\" ark,s,cs:- $chaindir/0.trans_mdl ark:- \\| \\\n    nnet3-chain-copy-egs --random=true --srand=\\$[JOB+$srand] ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"egs_orig.*.JOB.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the egs.JOB.ark\n\n  # the input is a concatenation over the input jobs.\n  egs_list=\n  for n in $(seq $nj); do\n    egs_list=\"$egs_list $dir/cegs_orig.$n.JOB.ark\"\n  done\n\n  if [ 
$archives_multiple == 1 ]; then # normal case.\n    if $normalize_egs; then\n      $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n        nnet3-chain-normalize-egs $chaindir/normalization.fst \"ark:cat $egs_list|\" ark:- \\| \\\n        nnet3-chain-shuffle-egs --srand=\\$[JOB+$srand] ark:- ark:$dir/cegs.JOB.ark  || exit 1;\n    else\n      $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n        nnet3-chain-shuffle-egs --srand=\\$[JOB+$srand] \"ark:cat $egs_list|\" ark:$dir/cegs.JOB.ark  || exit 1;\n    fi\n  else\n    # we need to shuffle the 'intermediate archives' and then split into the\n    # final archives.  we create soft links to manage this splitting, because\n    # otherwise managing the output names is quite difficult (and we don't want\n    # to submit separate queue jobs for each intermediate archive, because then\n    # the --max-jobs-run option is hard to enforce).\n    output_archives=\"$(for y in $(seq $archives_multiple); do echo ark:$dir/cegs.JOB.$y.ark; done)\"\n    for x in $(seq $num_archives_intermediate); do\n      for y in $(seq $archives_multiple); do\n        archive_index=$[($x-1)*$archives_multiple+$y]\n        # egs.intermediate_archive.{1,2,...}.ark will point to egs.archive.ark\n        ln -sf cegs.$archive_index.ark $dir/cegs.$x.$y.ark || exit 1\n      done\n    done\n    $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-chain-normalize-egs $chaindir/normalization.fst \"ark:cat $egs_list|\" ark:- \\| \\\n      nnet3-chain-shuffle-egs --srand=\\$[JOB+$srand] ark:- ark:- \\| \\\n      nnet3-chain-copy-egs ark:- $output_archives || exit 1;\n  fi\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: removing temporary archives\"\n  (\n    cd $dir\n    for f in $(ls -l . 
| grep 'cegs_orig' | awk '{ X=NF-1; Y=NF-2; if ($X == \"->\")  print $Y, $NF; }'); do rm $f; done\n    # the next statement removes them if we weren't using the soft links to a\n    # 'storage' directory.\n    rm cegs_orig.*.ark 2>/dev/null\n  )\n  if [ $archives_multiple -gt 1 ]; then\n    # there are some extra soft links that we should delete.\n    for f in $dir/cegs.*.*.ark; do rm $f; done\n  fi\n  echo \"$0: removing temporary alignments\"\n  rm $dir/ali.{ark,scp} 2>/dev/null\n\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/e2e/prepare_e2e.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2017  Hossein Hadian\n# Apache 2.0\n\n# To be run from ..\n# Flat start chain model training.\n\n# This script initializes a trivial tree and transition model\n# for flat-start chain training. It then generates the training\n# graphs for the training data.\n\n# Begin configuration section.\ncmd=run.pl\nnj=4\nstage=0\nshared_phones=true\ntreedir=              # If specified, the tree and model will be copied from there\n                      # note that it may not be flat start anymore.\ntype=mono             # Can be either mono or biphone -- either way\n                      # the resulting tree is full (i.e. it doesn't do any tying)\nci_silence=false      # If true, silence phones will be treated as context independent\n\nscale_opts=\"--transition-scale=0.0 --self-loop-scale=0.0\"\ntie=false             # If true, gmm-init-biphone will do some tying when\n                      # creating the full biphone tree (it won't be full anymore).\n                      # Specifically, it will revert to monophone if the data\n                      # counts for a biphone are smaller than min_biphone_count.\n                      # If the monophone count is also smaller than min_monophone_count,\n                      # it will revert to a shared global phone. Note that this\n                      # only affects biphone models (i.e., type=biphone) which\n                      # use the special chain topology.\nmin_biphone_count=100\nmin_monophone_count=20\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: steps/prepare_e2e.sh [options] <data-dir> <lang-dir> <exp-dir>\"\n  echo \" e.g.: steps/prepare_e2e.sh data/train data/lang_chain exp/chain/e2e_tree\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --type <mono | biphone>                          # context dependency type\"\n  echo \"  --tie <true | false>                             # enable/disable count-based tying\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\ndir=$3\n\nif [[ \"$type\" != \"mono\" && \"$type\" != \"biphone\" ]]; then\n  echo \"'type' should be either mono or biphone.\"\n  exit 1;\nfi\n\noov_sym=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir/log\n\necho $scale_opts > $dir/scale_opts  # just for easier reference (it is in the logs too)\necho $nj > $dir/num_jobs\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\ncp $lang/phones.txt $dir || exit 1;\n\n[ ! 
-f $lang/phones/sets.int ] && echo \"$0: no such file $lang/phones/sets.int\" && exit 1;\n\nif $shared_phones; then\n  shared_phones_opt=\"--shared-phones=$lang/phones/sets.int\"\nfi\n\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nif $ci_silence; then\n  ci_opt=\"--ci-phones=$ciphonelist\"\nfi\n\ntie_opts=\nif $tie && [[ \"$type\" = \"biphone\" ]]; then\n  cat $data/text | steps/chain/e2e/text_to_phones.py --edge-silprob 0 \\\n                                                     --between-silprob 0 \\\n                                                     $lang | \\\n    cut -d' ' -f 2- | utils/sym2int.pl $lang/phones.txt | \\\n    steps/chain/e2e/compute_biphone_stats.py $lang >$dir/phone-stats.txt\n  tie_opts=\"--min-biphone-count=$min_biphone_count \\\n--min-monophone-count=$min_monophone_count --phone-counts=$dir/phone-stats.txt\"\nfi\n\nif [ $stage -le 0 ]; then\n  if [ -z $treedir ]; then\n    echo \"$0: Initializing $type system.\"\n    # feat dim does not matter here. Just set it to 10\n    $cmd $dir/log/init_${type}_mdl_tree.log \\\n         gmm-init-$type $tie_opts $ci_opt $shared_phones_opt $lang/topo 10 \\\n         $dir/0.mdl $dir/tree || exit 1;\n  else\n    echo \"$0: Copied tree/mdl from $treedir.\" >$dir/log/init_mdl_tree.log\n    cp $treedir/final.mdl $dir/0.mdl || exit 1;\n    cp $treedir/tree $dir || exit 1;\n  fi\n  copy-transition-model $dir/0.mdl $dir/0.trans_mdl\n  ln -s 0.mdl $dir/final.mdl  # for consistency with scripts which require a final.mdl\nfi\n\nlex=$lang/L.fst\nif [ $stage -le 1 ]; then\n  echo \"$0: Compiling training graphs\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs $scale_opts --read-disambig-syms=$lang/phones/disambig.int \\\n    $dir/tree $dir/0.mdl $lex \\\n    \"ark:sym2int.pl --map-oov $oov_sym -f 2- $lang/words.txt < $sdata/JOB/text|\" \\\n    \"ark,scp:$dir/fst.JOB.ark,$dir/fst.JOB.scp\" || exit 1;\nfi\n\necho \"$0: Done\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/e2e/text_to_phones.py",
    "content": "#!/usr/bin/env python\n\n# Copyright    2017 Hossein Hadian\n# Apache 2.0\n\n\n\"\"\" This reads data/train/text from standard input, converts the word transcriptions\n    to phone transcriptions using the provided lexicon,\n    and writes them to standard output.\n\"\"\"\nfrom __future__ import print_function\n\nimport argparse\nfrom os.path import join\nimport sys\nimport copy\nimport random\n\nparser = argparse.ArgumentParser(description=\"\"\"This script reads\n    data/train/text from std input and converts the word transcriptions\n    to phone transcriptions using the provided lexicon\"\"\")\nparser.add_argument('langdir', type=str)\nparser.add_argument('--edge-silprob', type=float, default=0.8,\n                    help=\"\"\"Probability of optional silence at the beginning\n                    and end.\"\"\")\nparser.add_argument('--between-silprob', type=float, default=0.2,\n                    help=\"Probability of optional silence between the words.\")\n\n\nargs = parser.parse_args()\n\n# optional silence\nsil = open(join(args.langdir,\n                \"phones/optional_silence.txt\")).readline().strip()\n\noov_word = open(join(args.langdir, \"oov.txt\")).readline().strip()\n\n\n# load the lexicon\nlexicon = {}\nwith open(join(args.langdir, \"phones/align_lexicon.txt\")) as f:\n    for line in f:\n        line = line.strip();\n        parts = line.split()\n        lexicon[parts[0]] = parts[2:]  # ignore parts[1]\n\nn_tot = 0\nn_fail = 0\nfor line in sys.stdin:\n    line = line.strip().split()\n    key = line[0]\n    word_trans = line[1:]   # word-level transcription\n    phone_trans = []        # phone-level transcription\n    if random.random() < args.edge_silprob:\n        phone_trans += [sil]\n    for i in range(len(word_trans)):\n        n_tot += 1\n        word = word_trans[i]\n        if word not in lexicon:\n            n_fail += 1\n            if n_fail < 20:\n                sys.stderr.write(\"{} not found in lexicon, replacing 
with {}\\n\".format(word, oov_word))\n            elif n_fail == 20:\n                sys.stderr.write(\"Not warning about OOVs any more.\\n\")\n            pronunciation = lexicon[oov_word]\n        else:\n            pronunciation = copy.deepcopy(lexicon[word])\n        phone_trans += pronunciation\n        prob = args.between_silprob if i < len(word_trans) - 1 else args.edge_silprob\n        if random.random() < prob:\n            phone_trans += [sil]\n    print(key + \" \" + \" \".join(phone_trans))\n\nsys.stderr.write(\"Done. {} out of {} were OOVs.\\n\".format(n_fail, n_tot))\n"
  },
  {
    "path": "egs/steps/nnet3/chain/e2e/train_e2e.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n#           2017    Hossein Hadian\n# Apache 2.0.\n\n\"\"\" This script does flat-start chain training and is based on\n    steps/nnet3/chain/train.py.\n\"\"\"\n\nimport argparse\nimport logging\nimport os\nimport pprint\nimport shutil\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\nimport libs.nnet3.train.chain_objf.acoustic_model as chain_lib\nimport libs.nnet3.report.log_parse as nnet3_log_parse\n\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Starting chain model trainer (train.py)')\n\n\ndef get_args():\n    \"\"\" Get args from stdin.\n\n    We add compulsary arguments as named arguments for readability\n\n    The common options are defined in the object\n    libs.nnet3.train.common.CommonParser.parser.\n    See steps/libs/nnet3/train/common.py\n    \"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Trains RNN and DNN acoustic models using the 'chain'\n        objective function.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        conflict_handler='resolve',\n        parents=[common_train_lib.CommonParser().parser])\n\n    # egs extraction options\n    parser.add_argument(\"--egs.chunk-width\", type=str, dest='chunk_width',\n                        default=\"20\",\n                        help=\"\"\"Number of frames per chunk in the examples\n                        used to train the RNN.   
Caution: if you double this you\n                        should halve --trainer.samples-per-iter.  May be\n                        a comma-separated list of alternatives: first width\n                        is the 'principal' chunk-width, used preferentially\"\"\")\n\n    # chain options\n    parser.add_argument(\"--chain.lm-opts\", type=str, dest='lm_opts',\n                        default=None, action=common_lib.NullstrToNoneAction,\n                        help=\"options to be passed to chain-est-phone-lm\")\n    parser.add_argument(\"--chain.l2-regularize\", type=float,\n                        dest='l2_regularize', default=0.0,\n                        help=\"\"\"Weight of regularization function which is the\n                        l2-norm of the output of the network. It should be used\n                        without the log-softmax layer for the outputs, as the\n                        l2-norm of the log-softmax outputs can dominate the\n                        objective function.\"\"\")\n    parser.add_argument(\"--chain.xent-regularize\", type=float,\n                        dest='xent_regularize', default=0.0,\n                        help=\"Weight of regularization function which is the \"\n                        \"cross-entropy cost of the outputs.\")\n    parser.add_argument(\"--chain.leaky-hmm-coefficient\", type=float,\n                        dest='leaky_hmm_coefficient', default=0.00001,\n                        help=\"\")\n    parser.add_argument(\"--chain.apply-deriv-weights\", type=str,\n                        dest='apply_deriv_weights', default=True,\n                        action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"],\n                        help=\"\")\n    parser.add_argument(\"--chain.frame-subsampling-factor\", type=int,\n                        dest='frame_subsampling_factor', default=3,\n                        help=\"ratio of frames-per-second of features we \"\n                        
\"train on, to chain model's output\")\n    parser.add_argument(\"--chain.alignment-subsampling-factor\", type=int,\n                        dest='alignment_subsampling_factor',\n                        default=3,\n                        help=\"ratio of frames-per-second of input \"\n                        \"alignments to chain model's output\")\n    parser.add_argument(\"--chain.left-deriv-truncate\", type=int,\n                        dest='left_deriv_truncate',\n                        default=None,\n                        help=\"Deprecated. Kept for back compatibility\")\n\n\n    # trainer options\n    parser.add_argument(\"--trainer.num-epochs\", type=float, dest='num_epochs',\n                        default=10.0,\n                        help=\"Number of epochs to train the model\")\n    parser.add_argument(\"--trainer.frames-per-iter\", type=int,\n                        dest='frames_per_iter', default=800000,\n                        help=\"\"\"Each iteration of training, see this many\n                        [input] frames per job.  This option is passed to\n                        get_egs.sh.  Aim for about a minute of training\n                        time\"\"\")\n    parser.add_argument(\"--trainer.num-chunk-per-minibatch\", type=str,\n                        dest='num_chunk_per_minibatch', default='128',\n                        help=\"\"\"Number of sequences to be processed in\n                        parallel every minibatch.  
May be a more general\n                        rule as accepted by the --minibatch-size option of\n                        nnet3-merge-egs; run that program without args to see\n                        the format.\"\"\")\n\n    # Parameters for the optimization\n    parser.add_argument(\"--trainer.optimization.initial-effective-lrate\",\n                        type=float, dest='initial_effective_lrate',\n                        default=0.0002,\n                        help=\"Learning rate used during the initial iteration\")\n    parser.add_argument(\"--trainer.optimization.final-effective-lrate\",\n                        type=float, dest='final_effective_lrate',\n                        default=0.00002,\n                        help=\"Learning rate used during the final iteration\")\n    parser.add_argument(\"--trainer.optimization.shrink-value\", type=float,\n                        dest='shrink_value', default=1.0,\n                        help=\"\"\"Scaling factor used for scaling the parameter\n                        matrices when the derivative averages are below the\n                        shrink-threshold at the non-linearities.  E.g. 0.99.\n                        Only applicable when the neural net contains sigmoid or\n                        tanh units.\"\"\")\n    parser.add_argument(\"--trainer.optimization.shrink-saturation-threshold\",\n                        type=float,\n                        dest='shrink_saturation_threshold', default=0.40,\n                        help=\"\"\"Threshold that controls when we apply the\n                        'shrinkage' (i.e. scaling by shrink-value).  
If the\n                        saturation of the sigmoid and tanh nonlinearities in\n                        the neural net (as measured by\n                        steps/nnet3/get_saturation.pl) exceeds this threshold\n                        we scale the parameter matrices with the\n                        shrink-value.\"\"\")\n    # RNN-specific training options\n    parser.add_argument(\"--trainer.deriv-truncate-margin\", type=int,\n                        dest='deriv_truncate_margin', default=None,\n                        help=\"\"\"(Relevant only for recurrent models). If\n                        specified, gives the margin (in input frames) around\n                        the 'required' part of each chunk that the derivatives\n                        are backpropagated to. If unset, the derivatives are\n                        backpropagated all the way to the boundaries of the\n                        input data. E.g. 8 is a reasonable setting. Note: the\n                        'required' part of the chunk is defined by the model's\n                        {left,right}-context.\"\"\")\n\n    # General options\n    parser.add_argument(\"--feat-dir\", type=str, required=True,\n                        help=\"Directory with features used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--tree-dir\", type=str, required=True,\n                        help=\"\"\"Directory containing the tree to use for this\n                        model (we also expect 0.mdl and fsts.* in that\n                        directory\"\"\")\n    parser.add_argument(\"--dir\", type=str, required=True,\n                        help=\"Directory to store the models and \"\n                        \"all other files.\")\n\n    print(' '.join(sys.argv))\n    print(sys.argv)\n\n    args = parser.parse_args()\n\n    [args, run_opts] = process_args(args)\n\n    return [args, run_opts]\n\n\ndef process_args(args):\n    \"\"\" Process the options got 
from get_args()\n    \"\"\"\n\n    if not common_train_lib.validate_chunk_width(args.chunk_width):\n        raise Exception(\"--egs.chunk-width has an invalid value\")\n\n    if not common_train_lib.validate_minibatch_size_str(args.num_chunk_per_minibatch):\n        raise Exception(\"--trainer.num-chunk-per-minibatch has an invalid value\")\n\n    if args.chunk_left_context < 0:\n        raise Exception(\"--egs.chunk-left-context should be non-negative\")\n\n    if args.chunk_right_context < 0:\n        raise Exception(\"--egs.chunk-right-context should be non-negative\")\n\n    if args.left_deriv_truncate is not None:\n        args.deriv_truncate_margin = -args.left_deriv_truncate\n        logger.warning(\n            \"--chain.left-deriv-truncate (deprecated) is set by user, and \"\n            \"--trainer.deriv-truncate-margin is set to negative of that \"\n            \"value={0}. We recommend using the option \"\n            \"--trainer.deriv-truncate-margin.\".format(\n                args.deriv_truncate_margin))\n\n    if (not os.path.exists(args.dir + \"/configs\")):\n        raise Exception(\"This script expects the directory specified with \"\n                        \"--dir={0} to exist and have a configs/ directory which \"\n                        \"is the output of the make_configs.py script\".format(args.dir))\n\n    # set the options corresponding to args.use_gpu\n    run_opts = common_train_lib.RunOpts()\n    if args.use_gpu in [\"true\", \"false\"]:\n        args.use_gpu = (\"yes\" if args.use_gpu == \"true\" else \"no\")\n    if args.use_gpu in [\"yes\", \"wait\"]:\n        if not common_lib.check_if_cuda_compiled():\n            logger.warning(\n                \"\"\"You are running with one thread but you have not compiled\n                   for CUDA.  
You may be running a setup optimized for GPUs.\n                   If you have GPUs and have nvcc installed, go to src/ and do\n                   ./configure; make\"\"\")\n\n        run_opts.train_queue_opt = \"--gpu 1\"\n        run_opts.parallel_train_opts = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_queue_opt = \"--gpu 1\"\n        run_opts.combine_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n\n    else:\n        logger.warning(\"Without using a GPU this will be very slow. \"\n                       \"nnet3 does not yet support multiple threads.\")\n\n        run_opts.train_queue_opt = \"\"\n        run_opts.parallel_train_opts = \"--use-gpu=no\"\n        run_opts.combine_queue_opt = \"\"\n        run_opts.combine_gpu_opt = \"--use-gpu=no\"\n\n    run_opts.command = args.command\n    run_opts.egs_command = (args.egs_command\n                            if args.egs_command is not None else\n                            args.command)\n\n    return [args, run_opts]\n\n\ndef train(args, run_opts):\n    \"\"\" The main function for training.\n\n    Args:\n        args: a Namespace object with the required parameters\n            obtained from the function process_args()\n        run_opts: RunOpts object obtained from the process_args()\n    \"\"\"\n\n    arg_string = pprint.pformat(vars(args))\n    logger.info(\"Arguments for the experiment\\n{0}\".format(arg_string))\n\n    # Check files\n    files = ['{0}/feats.scp'.format(args.feat_dir), '{0}/fst.1.scp'.format(args.tree_dir),\n             '{0}/final.mdl'.format(args.tree_dir), '{0}/tree'.format(args.tree_dir),\n             '{0}/phone_lm.fst'.format(args.tree_dir),\n             '{0}/num_jobs'.format(args.tree_dir)]\n    for file in files:\n        if not os.path.isfile(file):\n            raise Exception('Expected {0} to exist.'.format(file))\n\n    # Set some variables.\n    num_jobs = common_lib.get_number_of_jobs(args.tree_dir)\n    feat_dim = common_lib.get_feat_dim(args.feat_dir)\n  
  ivector_dim = common_lib.get_ivector_dim(args.online_ivector_dir)\n    ivector_id = common_lib.get_ivector_extractor_id(args.online_ivector_dir)\n    logger.info(\"feat-dim: {}, ivector-dim: {}\".format(feat_dim, ivector_dim))\n\n    # split the training data into parts for individual jobs\n    # we will use the same number of jobs as that used for compiling FSTs\n    common_lib.execute_command(\"utils/split_data.sh {0} {1}\".format(\n            args.feat_dir, num_jobs))\n    shutil.copy('{0}/tree'.format(args.tree_dir), args.dir)\n    shutil.copy('{0}/phones.txt'.format(args.tree_dir), args.dir)\n    shutil.copy('{0}/phone_lm.fst'.format(args.tree_dir), args.dir)\n    shutil.copy('{0}/0.trans_mdl'.format(args.tree_dir), args.dir)\n    with open('{0}/num_jobs'.format(args.dir), 'w') as f:\n        f.write(str(num_jobs))\n\n    config_dir = '{0}/configs'.format(args.dir)\n    var_file = '{0}/vars'.format(config_dir)\n\n    variables = common_train_lib.parse_generic_config_vars_file(var_file)\n\n    # Set some variables.\n    try:\n        model_left_context = variables['model_left_context']\n        model_right_context = variables['model_right_context']\n    except KeyError as e:\n        raise Exception(\"KeyError {0}: Variables need to be defined in \"\n                        \"{1}\".format(str(e), '{0}/configs'.format(args.dir)))\n\n    left_context = args.chunk_left_context + model_left_context\n    right_context = args.chunk_right_context + model_right_context\n    left_context_initial = (args.chunk_left_context_initial + model_left_context if\n                            args.chunk_left_context_initial >= 0 else -1)\n    right_context_final = (args.chunk_right_context_final + model_right_context if\n                           args.chunk_right_context_final >= 0 else -1)\n\n    # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n    # matrix.  
This first config just does any initial splicing that we do;\n    # we do this as it's a convenient way to get the stats for the 'lda-like'\n    # transform.\n\n    if (args.stage <= -5):\n        logger.info(\"Creating denominator FST\")\n        chain_lib.create_denominator_fst(args.dir, args.tree_dir, run_opts)\n\n    if (args.stage <= -4):\n        logger.info(\"Initializing a basic network...\")\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/nnet_init.log \\\n                    nnet3-init --srand=-2 {dir}/configs/final.config \\\n                    {dir}/init.raw\"\"\".format(command=run_opts.command,\n                                             dir=args.dir))\n\n    # Use integer (floor) division so the contexts stay integers under\n    # both Python 2 and Python 3.\n    egs_left_context = left_context + args.frame_subsampling_factor // 2\n    egs_right_context = right_context + args.frame_subsampling_factor // 2\n    egs_left_context_initial = (left_context_initial + args.frame_subsampling_factor // 2 if\n                                left_context_initial >= 0 else -1)\n    egs_right_context_final = (right_context_final + args.frame_subsampling_factor // 2 if\n                               right_context_final >= 0 else -1)\n\n    default_egs_dir = '{0}/egs'.format(args.dir)\n    if (args.stage <= -3) and args.egs_dir is None:\n        logger.info(\"Generating end-to-end egs...\")\n        common_lib.execute_command(\n            \"\"\"steps/nnet3/chain/e2e/get_egs_e2e.sh {egs_opts} \\\n                    --cmd \"{command}\" \\\n                    --cmvn-opts \"{cmvn_opts}\" \\\n                    --online-ivector-dir \"{ivector_dir}\" \\\n                    --left-context {left_context} \\\n                    --right-context {right_context} \\\n                    --left-context-initial {left_context_initial} \\\n                    --right-context-final {right_context_final} \\\n                    --frame-subsampling-factor {frame_subsampling_factor} \\\n                    --stage {stage} \\\n                    
--frames-per-iter {frames_per_iter} \\\n                    --srand {srand} \\\n                    {data} {dir} {fst_dir} {egs_dir}\"\"\".format(\n                        command=run_opts.egs_command,\n                        cmvn_opts=args.cmvn_opts if args.cmvn_opts is not None else '',\n                        ivector_dir=(args.online_ivector_dir\n                                     if args.online_ivector_dir is not None\n                                     else ''),\n                        left_context=egs_left_context,\n                        right_context=egs_right_context,\n                        left_context_initial=egs_left_context_initial,\n                        right_context_final=egs_right_context_final,\n                        frame_subsampling_factor=args.frame_subsampling_factor,\n                        stage=args.egs_stage, frames_per_iter=args.frames_per_iter,\n                        srand=args.srand,\n                        data=args.feat_dir, dir=args.dir, fst_dir=args.tree_dir,\n                        egs_dir=default_egs_dir,\n                        egs_opts=args.egs_opts if args.egs_opts is not None else ''))\n\n    if args.egs_dir is None:\n        egs_dir = default_egs_dir\n    else:\n        egs_dir = args.egs_dir\n\n    [egs_left_context, egs_right_context,\n     frames_per_eg_str, num_archives] = (\n        common_train_lib.verify_egs_dir(egs_dir, feat_dim,\n                                        ivector_dim, ivector_id,\n                                        egs_left_context, egs_right_context,\n                                        egs_left_context_initial,\n                                        egs_right_context_final))\n\n    num_archives_expanded = num_archives * args.frame_subsampling_factor\n\n    if (args.num_jobs_final > num_archives_expanded):\n        raise Exception('num_jobs_final cannot exceed the '\n                        'expanded number of archives')\n\n    # copy the properties of the egs to dir 
for\n    # use during decoding\n    logger.info(\"Copying the properties from {0} to {1}\".format(egs_dir, args.dir))\n    common_train_lib.copy_egs_properties_to_exp_dir(egs_dir, args.dir)\n\n\n    if (args.stage <= -1):\n        logger.info(\"Preparing the initial acoustic model.\")\n        chain_lib.prepare_initial_acoustic_model(args.dir, run_opts)\n\n    with open(\"{0}/frame_subsampling_factor\".format(args.dir), \"w\") as f:\n        f.write(str(args.frame_subsampling_factor))\n\n    # set num_iters so that, as closely as possible, we process the data\n    # $num_epochs times, i.e. $num_iters*$avg_num_jobs ==\n    # $num_epochs*$num_archives, where\n    # avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n    # Integer (floor) division keeps num_iters an int, which range() below\n    # requires under Python 3.\n    num_archives_to_process = int(args.num_epochs * num_archives_expanded)\n    num_archives_processed = 0\n    num_iters = ((num_archives_to_process * 2)\n                 // (args.num_jobs_initial + args.num_jobs_final))\n\n    models_to_combine = common_train_lib.get_model_combine_iters(\n        num_iters, args.num_epochs,\n        num_archives_expanded, args.max_models_combine,\n        args.num_jobs_final)\n\n    min_deriv_time = None\n    max_deriv_time_relative = None\n    if args.deriv_truncate_margin is not None:\n        min_deriv_time = -args.deriv_truncate_margin - model_left_context\n        max_deriv_time_relative = \\\n           args.deriv_truncate_margin + model_right_context\n\n    logger.info(\"Training will run for {0} epochs = \"\n                \"{1} iterations\".format(args.num_epochs, num_iters))\n\n    for iter in range(num_iters):\n\n        percent = num_archives_processed * 100.0 / num_archives_to_process\n        epoch = (num_archives_processed * args.num_epochs\n                 / num_archives_to_process)\n\n        if (args.exit_stage is not None) and (iter == args.exit_stage):\n            logger.info(\"Exiting early due to --exit-stage {0}\".format(iter))\n            return\n\n        current_num_jobs = 
common_train_lib.get_current_num_jobs(\n            iter, num_iters,\n            args.num_jobs_initial, args.num_jobs_step, args.num_jobs_final)\n\n        if args.stage <= iter:\n            model_file = \"{dir}/{iter}.mdl\".format(dir=args.dir, iter=iter)\n\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n                                                       num_iters,\n                                                       num_archives_processed,\n                                                       num_archives_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n            shrinkage_value = 1.0 - (args.proportional_shrink * lrate)\n            if shrinkage_value <= 0.5:\n                raise Exception(\"proportional-shrink={0} is too large, it gives \"\n                                \"shrink-value={1}\".format(args.proportional_shrink,\n                                                          shrinkage_value))\n            if args.shrink_value < shrinkage_value:\n                shrinkage_value = (args.shrink_value\n                                   if common_train_lib.should_do_shrinkage(\n                                        iter, model_file,\n                                        args.shrink_saturation_threshold)\n                                   else shrinkage_value)\n\n            shrink_info_str = ''\n            if shrinkage_value != 1.0:\n                shrink_info_str = 'shrink: {0:0.5f}'.format(shrinkage_value)\n            logger.info(\"Iter: {0}/{1}   Jobs: {2}   \"\n                        \"Epoch: {3:0.2f}/{4:0.1f} ({5:0.1f}% complete)   \"\n                        \"lr: {6:0.6f}   {7}\".format(iter, num_iters - 1,\n                                                    current_num_jobs,\n                                                    epoch, args.num_epochs,\n         
                                           percent,\n                                                    lrate, shrink_info_str))\n\n            chain_lib.train_one_iteration(\n                dir=args.dir,\n                iter=iter,\n                srand=args.srand,\n                egs_dir=egs_dir,\n                num_jobs=current_num_jobs,\n                num_archives_processed=num_archives_processed,\n                num_archives=num_archives,\n                learning_rate=lrate,\n                dropout_edit_string=common_train_lib.get_dropout_edit_string(\n                    args.dropout_schedule,\n                    float(num_archives_processed) / num_archives_to_process,\n                    iter),\n                shrinkage_value=shrinkage_value,\n                num_chunk_per_minibatch_str=args.num_chunk_per_minibatch,\n                apply_deriv_weights=args.apply_deriv_weights,\n                min_deriv_time=min_deriv_time,\n                max_deriv_time_relative=max_deriv_time_relative,\n                l2_regularize=args.l2_regularize,\n                xent_regularize=args.xent_regularize,\n                leaky_hmm_coefficient=args.leaky_hmm_coefficient,\n                momentum=args.momentum,\n                max_param_change=args.max_param_change,\n                shuffle_buffer_size=args.shuffle_buffer_size,\n                frame_subsampling_factor=args.frame_subsampling_factor,\n                run_opts=run_opts)\n\n\n            if args.cleanup:\n                # do a clean up everything but the last 2 models, under certain\n                # conditions\n                common_train_lib.remove_model(\n                    args.dir, iter-2, num_iters, models_to_combine,\n                    args.preserve_model_interval)\n\n            if args.email is not None:\n                reporting_iter_interval = num_iters * args.reporting_interval\n                if iter % reporting_iter_interval == 0:\n                    # lets do some 
reporting\n                    [report, times, data] = (\n                        nnet3_log_parse.generate_acc_logprob_report(\n                            args.dir, \"log-probability\"))\n                    message = report\n                    subject = (\"Update : Expt {dir} : \"\n                               \"Iter {iter}\".format(dir=args.dir, iter=iter))\n                    common_lib.send_mail(message, subject, args.email)\n\n        num_archives_processed = num_archives_processed + current_num_jobs\n\n\n    if args.stage <= num_iters:\n        logger.info(\"Doing final combination to produce final.mdl\")\n        chain_lib.combine_models(\n            dir=args.dir, num_iters=num_iters,\n            models_to_combine=models_to_combine,\n            num_chunk_per_minibatch_str=args.num_chunk_per_minibatch,\n            egs_dir=egs_dir,\n            leaky_hmm_coefficient=args.leaky_hmm_coefficient,\n            l2_regularize=args.l2_regularize,\n            xent_regularize=args.xent_regularize,\n            run_opts=run_opts)\n\n\n    if args.cleanup:\n        logger.info(\"Cleaning up the experiment directory \"\n                    \"{0}\".format(args.dir))\n        remove_egs = args.remove_egs\n        if args.egs_dir is not None:\n            # this egs_dir was not created by this experiment so we will not\n            # delete it\n            remove_egs = False\n\n        common_train_lib.clean_nnet_dir(\n            args.dir, num_iters, egs_dir,\n            preserve_model_interval=args.preserve_model_interval,\n            remove_egs=remove_egs)\n\n    # do some reporting\n    [report, times, data] = nnet3_log_parse.generate_acc_logprob_report(\n        args.dir, \"log-probability\")\n    if args.email is not None:\n        common_lib.send_mail(report, \"Update : Expt {0} : \"\n                                     \"complete\".format(args.dir), args.email)\n\n    with open(\"{dir}/accuracy.report\".format(dir=args.dir), \"w\") as f:\n        
f.write(report)\n\n    common_lib.execute_command(\"steps/info/chain_dir_info.pl \"\n                                 \"{0}\".format(args.dir))\n\n\ndef main():\n    [args, run_opts] = get_args()\n    try:\n        train(args, run_opts)\n        common_lib.wait_for_background_commands()\n    except BaseException as e:\n        # look for BaseException so we catch KeyboardInterrupt, which is\n        # what we get when a background thread dies.\n        if args.email is not None:\n            message = (\"Training session for experiment {dir} \"\n                       \"died due to an error.\".format(dir=args.dir))\n            common_lib.send_mail(message, message, args.email)\n        if not isinstance(e, KeyboardInterrupt):\n            traceback.print_exc()\n        sys.exit(1)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nif (@ARGV != 2) {\n  print STDERR \"Usage: utils/gen_topo.pl <colon-separated-nonsilence-phones> <colon-separated-silence-phones>\\n\";\n  print STDERR \"e.g.:  utils/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\";\n  exit (1);\n}\n\n($nonsil_phones, $sil_phones) = @ARGV;\n\n$nonsil_phones =~ s/:/ /g;\n$sil_phones =~ s/:/ /g;\n$nonsil_phones =~ m/^\\d[ \\d]+$/ || die \"$0: bad arguments @ARGV\\n\";\n$sil_phones =~ m/^\\d[ \\d]*$/ || die \"$0: bad arguments @ARGV\\n\";\n\nprint \"<Topology>\\n\";\nprint \"<TopologyEntry>\\n\";\nprint \"<ForPhones>\\n\";\nprint \"$nonsil_phones $sil_phones\\n\";\nprint \"</ForPhones>\\n\";\n# The next two lines may look like a bug, but they are as intended.  State 0 has\n# no self-loop, it happens exactly once.  And it can go either to state 1 (with\n# a self-loop) or to state 2, so we can have zero or more instances of state 1\n# following state 0.\n# We make the transition-probs 0.5 so they normalize, to keep the code happy.\n# In fact, we always set the transition probability scale to 0.0 in the 'chain'\n# code, so they are never used.\nprint \"<State> 0 <PdfClass> 0 <Transition> 1 0.5 <Transition> 2 0.5 </State>\\n\";\nprint \"<State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>\\n\";\nprint \"<State> 2 </State>\\n\";\nprint \"</TopologyEntry>\\n\";\nprint \"</Topology>\\n\";\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# This script was modified around 11.11.2016, when the code was extended to\n# support having a different pdf-class on the self loop.\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nfrom __future__ import print_function\nimport argparse\n\n\nparser = argparse.ArgumentParser(description=\"Usage: steps/nnet3/chain/gen_topo.py \"\n                                             \"<colon-separated-nonsilence-phones> <colon-separated-silence-phones>\"\n                                             \"e.g.:  steps/nnet3/chain/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\",\n                                 epilog=\"See egs/swbd/s5c/local/chain/train_tdnn_a.sh for example of usage.\");\nparser.add_argument(\"nonsilence_phones\", type=str,\n                    help=\"List of non-silence phones as integers, separated by colons, e.g. 4:5:6:7:8:9\");\nparser.add_argument(\"silence_phones\", type=str,\n                    help=\"List of silence phones as integers, separated by colons, e.g. 
1:2:3\");\n\nargs = parser.parse_args()\n\nsilence_phones = [ int(x) for x in args.silence_phones.split(\":\") ]\nnonsilence_phones = [ int(x) for x in args.nonsilence_phones.split(\":\") ]\nall_phones = silence_phones +  nonsilence_phones\n\nprint(\"<Topology>\")\nprint(\"<TopologyEntry>\")\nprint(\"<ForPhones>\")\nprint(\" \".join([str(x) for x in all_phones]))\nprint(\"</ForPhones>\")\n# We make the transition-probs 0.5 so they normalize, to keep the code happy.\n# In fact, we always set the transition probability scale to 0.0 in the 'chain'\n# code, so they are never used.\n# Note: the <ForwardPdfClass> will actually happen on the incoming arc because\n# we always build the graph with \"reorder=true\".\nprint(\"<State> 0 <ForwardPdfClass> 0 <SelfLoopPdfClass> 1 <Transition> 0 0.5 <Transition> 1 0.5 </State>\")\nprint(\"<State> 1 </State>\")\nprint(\"</TopologyEntry>\")\nprint(\"</Topology>\")\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo2.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nfrom __future__ import print_function\nimport argparse\n\n\nparser = argparse.ArgumentParser(description=\"Usage: steps/nnet3/chain/gen_topo.py \"\n                                             \"<colon-separated-nonsilence-phones> <colon-separated-silence-phones>\"\n                                             \"e.g.:  steps/nnet3/chain/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\",\n                                 epilog=\"See egs/swbd/s5c/local/chain/train_tdnn_a.sh for example of usage.\");\nparser.add_argument(\"nonsilence_phones\", type=str,\n                    help=\"List of non-silence phones as integers, separated by colons, e.g. 4:5:6:7:8:9\");\nparser.add_argument(\"silence_phones\", type=str,\n                    help=\"List of silence phones as integers, separated by colons, e.g. 1:2:3\");\n\nargs = parser.parse_args()\n\nsilence_phones = [ int(x) for x in args.silence_phones.split(\":\") ]\nnonsilence_phones = [ int(x) for x in args.nonsilence_phones.split(\":\") ]\nall_phones = silence_phones +  nonsilence_phones\n\nprint(\"<Topology>\")\nprint(\"<TopologyEntry>\")\nprint(\"<ForPhones>\")\nprint(\" \".join([str(x) for x in all_phones]))\nprint(\"</ForPhones>\")\n\n# the pdf-classes are as follows:\n#  pdf-class 0 is in a 1-frame sequence, the initial and final state.\n#  pdf-class 1 is in a sequence with >=3 frames, the 'middle' states.  
(important that\n#   it be numbered 1, which is the default list of pdf-classes used in 'cluster-phones').\n#  pdf-class 2 is the initial-state in a sequence with >= 2 frames.\n#  pdf-class 3 is the final-state in a sequence with >= 2 frames.\n# state 0 is nonemitting in this topology.\n\nprint(\"<State> 0 <Transition> 1 0.5 <Transition> 2 0.5 </State>\")  # initial nonemitting state.\nprint(\"<State> 1 <PdfClass> 0 <Transition> 5 1.0 </State>\")  # 1-frame sequence.\nprint(\"<State> 2 <PdfClass> 2 <Transition> 3 0.5 <Transition> 4 0.5 </State>\")  # 2 or more frames\nprint(\"<State> 3 <PdfClass> 1 <Transition> 3 0.5 <Transition> 4 0.5 </State>\")  # 3 or more frames\nprint(\"<State> 4 <PdfClass> 3 <Transition> 5 1.0 </State>\") # 2 or more frames.\nprint(\"<State> 5 </State>\")  # final nonemitting state\n\nprint(\"</TopologyEntry>\")\nprint(\"</Topology>\")\n\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo3.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nfrom __future__ import print_function\nimport argparse\n\n\nparser = argparse.ArgumentParser(description=\"Usage: steps/nnet3/chain/gen_topo.py \"\n                                             \"<colon-separated-nonsilence-phones> <colon-separated-silence-phones>\"\n                                             \"e.g.:  steps/nnet3/chain/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\",\n                                 epilog=\"See egs/swbd/s5c/local/chain/train_tdnn_a.sh for example of usage.\");\nparser.add_argument(\"nonsilence_phones\", type=str,\n                    help=\"List of non-silence phones as integers, separated by colons, e.g. 4:5:6:7:8:9\");\nparser.add_argument(\"silence_phones\", type=str,\n                    help=\"List of silence phones as integers, separated by colons, e.g. 1:2:3\");\n\nargs = parser.parse_args()\n\nsilence_phones = [ int(x) for x in args.silence_phones.split(\":\") ]\nnonsilence_phones = [ int(x) for x in args.nonsilence_phones.split(\":\") ]\nall_phones = silence_phones +  nonsilence_phones\n\nprint(\"<Topology>\")\nprint(\"<TopologyEntry>\")\nprint(\"<ForPhones>\")\nprint(\" \".join([str(x) for x in all_phones]))\nprint(\"</ForPhones>\")\nprint(\"<State> 0 <PdfClass> 0 <Transition> 0 0.5 <Transition> 1 0.5 </State>\")\nprint(\"<State> 1 </State>\")\nprint(\"</TopologyEntry>\")\nprint(\"</Topology>\")\n\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo4.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nfrom __future__ import print_function\nimport argparse\n\n\nparser = argparse.ArgumentParser(description=\"Usage: steps/nnet3/chain/gen_topo.py \"\n                                             \"<colon-separated-nonsilence-phones> <colon-separated-silence-phones>\"\n                                             \"e.g.:  steps/nnet3/chain/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\",\n                                 epilog=\"See egs/swbd/s5c/local/chain/train_tdnn_a.sh for example of usage.\");\nparser.add_argument(\"nonsilence_phones\", type=str,\n                    help=\"List of non-silence phones as integers, separated by colons, e.g. 4:5:6:7:8:9\");\nparser.add_argument(\"silence_phones\", type=str,\n                    help=\"List of silence phones as integers, separated by colons, e.g. 
1:2:3\");\n\nargs = parser.parse_args()\n\nsilence_phones = [ int(x) for x in args.silence_phones.split(\":\") ]\nnonsilence_phones = [ int(x) for x in args.nonsilence_phones.split(\":\") ]\nall_phones = silence_phones +  nonsilence_phones\n\nprint(\"<Topology>\")\nprint(\"<TopologyEntry>\")\nprint(\"<ForPhones>\")\nprint(\" \".join([str(x) for x in all_phones]))\nprint(\"</ForPhones>\")\n# state 0 is obligatory (occurs once)\nprint(\"<State> 0 <PdfClass> 0 <Transition> 1 0.3333 <Transition> 2 0.3333 <Transition> 3 0.3333 </State> \")\n# state 1 is used only when >2 frames\nprint(\"<State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>\")\n# state 2 is used only when >=2 frames (and occurs once)\nprint(\"<State> 2 <PdfClass> 2 <Transition> 3 1.0 </State>\")\nprint(\"<State> 3 </State>\")  # final nonemitting state\nprint(\"</TopologyEntry>\")\nprint(\"</Topology>\")\n\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo5.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nfrom __future__ import print_function\nimport argparse\n\n\nparser = argparse.ArgumentParser(description=\"Usage: steps/nnet3/chain/gen_topo.py \"\n                                             \"<colon-separated-nonsilence-phones> <colon-separated-silence-phones>\"\n                                             \"e.g.:  steps/nnet3/chain/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\",\n                                 epilog=\"See egs/swbd/s5c/local/chain/train_tdnn_a.sh for example of usage.\");\nparser.add_argument(\"nonsilence_phones\", type=str,\n                    help=\"List of non-silence phones as integers, separated by colons, e.g. 4:5:6:7:8:9\");\nparser.add_argument(\"silence_phones\", type=str,\n                    help=\"List of silence phones as integers, separated by colons, e.g. 
1:2:3\");\n\nargs = parser.parse_args()\n\nsilence_phones = [ int(x) for x in args.silence_phones.split(\":\") ]\nnonsilence_phones = [ int(x) for x in args.nonsilence_phones.split(\":\") ]\nall_phones = silence_phones +  nonsilence_phones\n\nprint(\"<Topology>\")\nprint(\"<TopologyEntry>\")\nprint(\"<ForPhones>\")\nprint(\" \".join([str(x) for x in all_phones]))\nprint(\"</ForPhones>\")\n# state 0 is nonemitting\nprint(\"<State> 0 <Transition> 1 0.5 <Transition> 2 0.5 </State>\")\n# state 1 is for when we traverse it in 1 state\nprint(\"<State> 1 <PdfClass> 0 <Transition> 4 1.0 </State>\")\n# state 2 is for when we traverse it in >1 state, for the first state.\nprint(\"<State> 2 <PdfClass> 2 <Transition> 3 1.0 </State>\")\n# state 3 is for the self-loop.  Use pdf-class 1 here so that the default\n# phone-class clustering (which uses only pdf-class 1 by default) gets only\n# stats from longer phones.\nprint(\"<State> 3 <PdfClass> 1 <Transition> 3 0.5 <Transition> 4 0.5 </State>\")\nprint(\"<State> 4 </State>\")\nprint(\"</TopologyEntry>\")\nprint(\"</Topology>\")\n\n"
  },
  {
    "path": "egs/steps/nnet3/chain/gen_topo_orig.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# This file is as ./gen_topo.py used to be (before we extended the transition-model\n# code to support having a different self-loop pdf-class).  It is included\n# here for baseline and testing purposes.\n\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.  This is a modified version of\n# 'utils/gen_topo.pl' that generates a different type of topology, one that we\n# believe should be useful in the 'chain' model.  Note: right now it doesn't\n# have any real options, and it treats silence and nonsilence the same.  The\n# intention is that you write different versions of this script, or add options,\n# if you experiment with it.\n\nfrom __future__ import print_function\nimport argparse\n\n\nparser = argparse.ArgumentParser(description=\"Usage: steps/nnet3/chain/gen_topo.py \"\n                                             \"<colon-separated-nonsilence-phones> <colon-separated-silence-phones>\"\n                                             \"e.g.:  steps/nnet3/chain/gen_topo.pl 4:5:6:7:8:9:10 1:2:3\\n\",\n                                 epilog=\"See egs/swbd/s5c/local/chain/train_tdnn_a.sh for example of usage.\");\nparser.add_argument(\"nonsilence_phones\", type=str,\n                    help=\"List of non-silence phones as integers, separated by colons, e.g. 4:5:6:7:8:9\");\nparser.add_argument(\"silence_phones\", type=str,\n                    help=\"List of silence phones as integers, separated by colons, e.g. 
1:2:3\");\n\nargs = parser.parse_args()\n\nsilence_phones = [ int(x) for x in args.silence_phones.split(\":\") ]\nnonsilence_phones = [ int(x) for x in args.nonsilence_phones.split(\":\") ]\nall_phones = silence_phones +  nonsilence_phones\n\nprint(\"<Topology>\")\nprint(\"<TopologyEntry>\")\nprint(\"<ForPhones>\")\nprint(\" \".join([str(x) for x in all_phones]))\nprint(\"</ForPhones>\")\n# The next two lines may look like a bug, but they are as intended.  State 0 has\n# no self-loop, it happens exactly once.  And it can go either to state 1 (with\n# a self-loop) or to state 2, so we can have zero or more instances of state 1\n# following state 0.\n# We make the transition-probs 0.5 so they normalize, to keep the code happy.\n# In fact, we always set the transition probability scale to 0.0 in the 'chain'\n# code, so they are never used.\nprint(\"<State> 0 <PdfClass> 0 <Transition> 1 0.5 <Transition> 2 0.5 </State>\")\nprint(\"<State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>\")\nprint(\"<State> 2 </State>\")\nprint(\"</TopologyEntry>\")\nprint(\"</Topology>\")\n"
  },
  {
    "path": "egs/steps/nnet3/chain/get_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the 'chain' system\n# (and also the validation examples used for diagnostics), and puts them in\n# separate archives.\n#\n# This script dumps egs with many frames of labels, controlled by the\n# frames_per_eg config variable (default: 25), plus left and right context.\n# Because CTC training involves alignment of data, we can't meaningfully train\n# frame by frame.   The supervision approach involves the time alignment, though--\n# it is just applied in a loose way, where each symbol can appear in the\n# frame-range that it was in in the alignment, extended by a certain margin.\n#\n\n\n# Begin configuration section.\ncmd=run.pl\nframes_per_eg=25   # number of feature frames example (not counting added context).\n                   # more->less disk space and less time preparing egs, but more\n                   # I/O during training.\nframes_overlap_per_eg=0  # number of supervised frames of overlap that we aim for per eg.\n                  # can be useful to avoid wasted data if you're using --left-deriv-truncate\n                  # and --right-deriv-truncate.\nframe_subsampling_factor=3 # frames-per-second of features we train on divided\n                           # by frames-per-second at output of chain model\nalignment_subsampling_factor=3 # frames-per-second of input alignments divided\n                               # by frames-per-second at output of chain model\nleft_context=4    # amount of left-context per eg (i.e. 
extra frames of input features\n                  # not present in the output supervision).\nright_context=4   # amount of right-context per eg.\nconstrained=true  # 'constrained=true' is the traditional setup; 'constrained=false'\n                  # gives you the 'unconstrained' egs creation in which the time\n                  # boundaries are not enforced inside chunks.\n\nleft_context_initial=-1    # if >=0, left-context for first chunk of an utterance\nright_context_final=-1     # if >=0, right-context for last chunk of an utterance\ncompress=true   # set this to false to disable compression (e.g. if you want to see whether\n                # results are affected).\n\nnum_utts_subset=300     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\nnum_valid_egs_combine=0  # number of validation examples for combination weights at the very end.\nnum_train_egs_combine=1000 # number of train examples for the above.\nnum_egs_diagnostic=400 # number of egs for \"compute_prob\" jobs\nframes_per_iter=400000 # each iteration of training, see this many frames per\n                       # job, measured at the sampling rate of the features\n                       # used.  
This is just a guideline; it will pick a number\n                       # that divides the number of samples in the entire data.\n\nright_tolerance=  # chain right tolerance == max label delay.\nleft_tolerance=\n\nstage=0\nmax_jobs_run=15         # This should be set to the maximum number of nnet3-chain-get-egs jobs you are\n                        # comfortable to run in parallel; you can increase it if your disk\n                        # speed is greater and you have more machines.\nmax_shuffle_jobs_run=50  # the shuffle jobs now include the nnet3-chain-normalize-egs command,\n                         # which is fairly CPU intensive, so we can run quite a few at once\n                         # without overloading the disks.\nsrand=0     # rand seed for nnet3-chain-get-egs, nnet3-chain-copy-egs and nnet3-chain-shuffle-egs\nonline_ivector_dir=  # can be used if we are including speaker information as iVectors.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  
This is used to turn off CMVN in the online-nnet experiments.\nonline_cmvn=false # Set to 'true' to replace 'apply-cmvn' by 'apply-cmvn-online' in the nnet3 input.\n                  # The configuration is passed externally via '$cmvn_opts' given to train.py,\n                  # typically as: --cmvn-opts=\"--config conf/online_cmvn.conf\".\n                  # The global_cmvn.stats are computed by this script from the features.\n                  # Note: the online CMVN for the ivector extractor is controlled separately in\n                  #       steps/online/nnet2/train_ivector_extractor.sh by --online-cmvn-iextractor\nlattice_lm_scale=     # If supplied, the graph/lm weight of the lattices will be\n                      # used (with this scale) in generating supervisions.\n                      # This is 0 by default for conventional supervised training,\n                      # but may be close to 1 for the unsupervised part of the data\n                      # in semi-supervised training. The optimum is usually\n                      # 0.5 for unsupervised data.\nlattice_prune_beam=         # If supplied, the lattices will be pruned to this beam,\n                            # before being used to get supervisions.\nacwt=0.1   # For pruning\nderiv_weights_scp=\ngenerate_egs_scp=false\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <chain-dir> <lattice-dir> <egs-dir>\"\n  echo \" e.g.: $0 data/train exp/tri4_nnet exp/tri3_lats exp/tri4_nnet/egs\"\n  echo \"\"\n  echo \"From <chain-dir>, 0.trans_mdl (the transition-model), tree (the tree)\"\n  echo \"and normalization.fst (the normalization FST, derived from the denominator FST)\"\n  echo \"are read.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --max-jobs-run <max-jobs-run>                    # The maximum number of jobs you want to run in\"\n  echo \"                                                   # parallel (increase this only if you have good disk and\"\n  echo \"                                                   # network speed).  default=15\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --frames-per-iter <#samples;400000>              # Number of frames of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --frame-subsampling-factor <factor;3>            # factor by which num-frames at nnet output is reduced \"\n  echo \"  --frames-per-eg <frames;25>                      # number of supervised frames per eg on disk\"\n  echo \"  --frames-overlap-per-eg <frames;0>               # number of supervised frames of overlap between egs\"\n  echo \"  --left-context <int;4>                           # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <int;4>                          # Number of frames on right side to append for feature input\"\n  echo \"  --left-context-initial <int;-1>                  # If >= 0, left-context for first chunk of an utterance\"\n  echo \"  --right-context-final <int;-1>                   # If >= 0, right-context for last chunk of an 
utterance\"\n  echo \"  --num-egs-diagnostic <#egs;400>                  # Number of egs used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-egs-combine <#egs;0>                 # Number of egs used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --lattice-lm-scale <float>                       # If supplied, the graph/lm weight of the lattices will be \"\n  echo \"                                                   # used (with this scale) in generating supervisions\"\n  echo \"  --lattice-prune-beam <float>                     # If supplied, the lattices will be pruned to this beam, \"\n  echo \"                                                   # before being used to get supervisions.\"\n  echo \"  --acwt <float;0.1>                               # Acoustic scale -- affects pruning\"\n  echo \"  --deriv-weights-scp <str>                        # If supplied, adds per-frame weights to the supervision.\"\n  echo \"  --generate-egs-scp <bool;false>                  # Generates scp files -- Required if the egs will be \"\n  echo \"                                                   # used for multilingual/multitask training.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nchaindir=$2\nlatdir=$3\ndir=$4\n\n# Check some files.\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $latdir/lat.1.gz $latdir/final.mdl \\\n         $chaindir/{0.trans_mdl,tree,normalization.fst} $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=$(cat $latdir/num_jobs) || exit 1\nif [ -f $latdir/per_utt ]; then\n  sdata=$data/split${nj}utt\n  utils/split_data.sh --per-utt $data $nj\nelse\n  sdata=$data/split$nj\n  utils/split_data.sh $data $nj\nfi\n\nmkdir -p $dir/log $dir/info\n\n# Get list of validation utterances.\nframe_shift=$(utils/data/get_frame_shift.sh $data) || exit 1\n\nif [ -f $data/utt2uniq ]; then\n  # Must hold out all augmented versions of the same utterance.\n  echo \"$0: File $data/utt2uniq exists, so ensuring the hold-out set\" \\\n       \"includes all perturbed versions of the same source utterance.\"\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq 2>/dev/null | \\\n      utils/shuffle_list.pl 2>/dev/null | \\\n    awk -v max_utt=$num_utts_subset '{\n        for (n=2;n<=NF;n++) print $n;\n        printed += NF-1;\n        if (printed >= max_utt) exit(0); }' |\n    sort > $dir/valid_uttlist\nelse\n  awk '{print $1}' $data/utt2spk | \\\n    utils/shuffle_list.pl 2>/dev/null | \\\n    head -$num_utts_subset > $dir/valid_uttlist\nfi\nlen_valid_uttlist=$(wc -l < $dir/valid_uttlist)\n\nawk '{print $1}' $data/utt2spk | \\\n   utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl 2>/dev/null | \\\n   head -$num_utts_subset > $dir/train_subset_uttlist\nlen_trainsub_uttlist=$(wc -l <$dir/train_subset_uttlist)\n\nif [[ $len_valid_uttlist -lt $num_utts_subset ||\n      $len_trainsub_uttlist -lt $num_utts_subset ]]; then\n  echo \"$0: Number of utterances is very small. Please check your data.\" && exit 1;\nfi\n\necho \"$0: Holding out $len_valid_uttlist utterances in validation set and\" \\\n     \"$len_trainsub_uttlist in training diagnostic set, out of total\" \\\n     \"$(wc -l < $data/utt2spk).\"\n\n\necho \"$0: creating egs.  
To ensure they are not deleted later you can do:  touch $dir/.nodelete\"\n\n## Set up features.\n\n# get the global_cmvn stats for online-cmvn,\nif $online_cmvn; then\n  # create global_cmvn.stats,\n  #\n  # caution: the top-level nnet training script should copy\n  # 'global_cmvn.stats' and 'online_cmvn' to its own dir.\n  if ! matrix-sum --binary=false scp:$data/cmvn.scp - >$dir/global_cmvn.stats 2>/dev/null; then\n    echo \"$0: Error summing cmvn stats\"\n    exit 1\n  fi\n  touch $dir/online_cmvn\nelse\n  [ -f $dir/online_cmvn ] && rm $dir/online_cmvn\nfi\n\n# create the feature pipelines,\nif ! $online_cmvn; then\n  # the original front-end with 'apply-cmvn',\n  echo \"$0: feature type is raw, with 'apply-cmvn'\"\n  feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\nelse\n  # the alternative front-end with 'apply-cmvn-online',\n  # - the $cmvn_opts can be set to '--config=conf/online_cmvn.conf' which is the setup of ivector-extractor,\n  echo \"$0: feature type is raw, with 'apply-cmvn-online'\"\n  feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt $dir/global_cmvn.stats scp:- ark:- |\"\n  valid_spk2utt=\"ark:utils/filter_scp.pl $dir/valid_uttlist $data/utt2spk | utils/utt2spk_to_spk2utt.pl |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=\\\"$valid_spk2utt\\\" $dir/global_cmvn.stats scp:- ark:- |\"\n  
train_subset_spk2utt=\"ark:utils/filter_scp.pl $dir/train_subset_uttlist $data/utt2spk | utils/utt2spk_to_spk2utt.pl |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=\\\"$train_subset_spk2utt\\\" $dir/global_cmvn.stats scp:- ark:- |\"\nfi\necho $cmvn_opts >$dir/cmvn_opts # caution: the top-level nnet training script should copy this to its own dir now.\n\n\ntree-info $chaindir/tree | grep num-pdfs | awk '{print $2}' > $dir/info/num_pdfs || exit 1\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  echo $ivector_dim > $dir/info/ivector_dim\n  steps/nnet2/get_ivector_id.sh $online_ivector_dir > $dir/info/final.ie.id || exit 1\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nelse\n  ivector_opts=\"\"\n  echo 0 >$dir/info/ivector_dim\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\n  echo \"$0: working out feature dim\"\n  feats_one=\"$(echo $feats | sed s/JOB/1/g)\"\n  if ! feat_dim=$(feat-to-dim \"$feats_one\" - 2>/dev/null); then\n    echo \"Command failed (getting feature dim): feat-to-dim \\\"$feats_one\\\"\"\n    exit 1\n  fi\n  echo $feat_dim > $dir/info/feat_dim\nelse\n  num_frames=$(cat $dir/info/num_frames) || exit 1;\n  feat_dim=$(cat $dir/info/feat_dim) || exit 1;\nfi\n\n# the + 1 is to round up, not down... 
we assume it doesn't divide exactly.\nnum_archives=$[$num_frames/$frames_per_iter+1]\n\n# We may have to first create a smaller number of larger archives, with number\n# $num_archives_intermediate, if $num_archives is more than the maximum number\n# of open filehandles that the system allows per process (ulimit -n).\n# This sometimes gives a misleading answer as GridEngine sometimes changes the\n# limit, so we limit it to 512.\nmax_open_filehandles=$(ulimit -n) || exit 1\n[ $max_open_filehandles -gt 512 ] && max_open_filehandles=512\nnum_archives_intermediate=$num_archives\narchives_multiple=1\nwhile [ $[$num_archives_intermediate+4] -gt $max_open_filehandles ]; do\n  archives_multiple=$[$archives_multiple+1]\n  num_archives_intermediate=$[$num_archives/$archives_multiple] || exit 1;\ndone\n# now make sure num_archives is an exact multiple of archives_multiple.\nnum_archives=$[$archives_multiple*$num_archives_intermediate] || exit 1;\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n# Work out the number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg*$num_archives)] || exit 1;\n! [ $egs_per_archive -le $frames_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= frames_per_iter=$frames_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\nif [ $left_context_initial -ge 0 ] || [ $right_context_final -ge 0 ]; then\n  echo \"$0:   ... and (left-context-initial,right-context-final) = ($left_context_initial,$right_context_final)\"\nfi\n\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  
See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/cegs.$x.ark; done)\n  for x in $(seq $num_archives_intermediate); do\n    utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/cegs_orig.$y.$x.ark; done)\n  done\nfi\n\negs_opts=\"--left-context=$left_context --right-context=$right_context --num-frames=$frames_per_eg --frame-subsampling-factor=$frame_subsampling_factor --compress=$compress\"\n[ $left_context_initial -ge 0 ] && egs_opts=\"$egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && egs_opts=\"$egs_opts --right-context-final=$right_context_final\"\n\n[ ! -z \"$deriv_weights_scp\" ] && egs_opts=\"$egs_opts --deriv-weights-rspecifier=scp:$deriv_weights_scp\"\n\nchain_supervision_all_opts=\"--lattice-input=true --frame-subsampling-factor=$alignment_subsampling_factor\"\n[ ! -z $right_tolerance ] && \\\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --right-tolerance=$right_tolerance\"\n\n[ ! -z $left_tolerance ] && \\\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --left-tolerance=$left_tolerance\"\n\nif ! $constrained; then\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --convert-to-pdfs=false\"\n  trans_mdl_opt=--transition-model=$chaindir/0.trans_mdl\nelse\n  trans_mdl_opt=\nfi\n\n\nlats_rspecifier=\"ark:gunzip -c $latdir/lat.JOB.gz |\"\nif [ ! -z $lattice_prune_beam ]; then\n  if [ \"$lattice_prune_beam\" == \"0\" ] || [ \"$lattice_prune_beam\" == \"0.0\" ]; then\n    lats_rspecifier=\"$lats_rspecifier lattice-1best --acoustic-scale=$acwt ark:- ark:- |\"\n  else\n    lats_rspecifier=\"$lats_rspecifier lattice-prune --acoustic-scale=$acwt --beam=$lattice_prune_beam ark:- ark:- |\"\n  fi\nfi\n\nnormalization_fst_scale=1.0\n\nif [ ! 
-z \"$lattice_lm_scale\" ]; then\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --lm-scale=$lattice_lm_scale\"\n\n  normalization_fst_scale=$(perl -e \"\n  if ($lattice_lm_scale >= 1.0 || $lattice_lm_scale < 0) {\n    print STDERR \\\"Invalid --lattice-lm-scale $lattice_lm_scale\\\";\n    exit(1);\n  }\n  print (1.0 - $lattice_lm_scale);\") || exit 1\nfi\n\necho $left_context > $dir/info/left_context\necho $right_context > $dir/info/right_context\necho $left_context_initial > $dir/info/left_context_initial\necho $right_context_final > $dir/info/right_context_final\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Getting validation and training subset examples in background.\"\n  rm $dir/.error 2>/dev/null\n\n  (\n    $cmd --max-jobs-run 6 JOB=1:$nj $dir/log/lattice_copy.JOB.log \\\n      lattice-copy --include=\"cat $dir/valid_uttlist $dir/train_subset_uttlist |\" --ignore-missing \\\n      \"$lats_rspecifier\" \\\n      ark,scp:$dir/lat_special.JOB.ark,$dir/lat_special.JOB.scp || exit 1\n\n    for id in $(seq $nj); do cat $dir/lat_special.$id.scp; done > $dir/lat_special.scp\n\n    $cmd $dir/log/create_valid_subset.log \\\n      utils/filter_scp.pl $dir/valid_uttlist $dir/lat_special.scp \\| \\\n      lattice-align-phones --replace-output-symbols=true $latdir/final.mdl scp:- ark:- \\| \\\n      chain-get-supervision $chain_supervision_all_opts $chaindir/tree $chaindir/0.trans_mdl \\\n        ark:- ark:- \\| \\\n      nnet3-chain-get-egs $ivector_opts --srand=$srand \\\n         $egs_opts --normalization-fst-scale=$normalization_fst_scale \\\n         $trans_mdl_opt $chaindir/normalization.fst \\\n        \"$valid_feats\" ark,s,cs:- \"ark:$dir/valid_all.cegs\" || exit 1\n    $cmd $dir/log/create_train_subset.log \\\n      utils/filter_scp.pl $dir/train_subset_uttlist $dir/lat_special.scp \\| \\\n      lattice-align-phones --replace-output-symbols=true $latdir/final.mdl scp:- ark:- \\| \\\n      chain-get-supervision $chain_supervision_all_opts \\\n        
$chaindir/tree $chaindir/0.trans_mdl ark:- ark:- \\| \\\n      nnet3-chain-get-egs $ivector_opts --srand=$srand \\\n        $egs_opts --normalization-fst-scale=$normalization_fst_scale \\\n        $trans_mdl_opt $chaindir/normalization.fst \\\n        \"$train_subset_feats\" ark,s,cs:- \"ark:$dir/train_subset_all.cegs\" || exit 1\n    sleep 5  # wait for file system to sync.\n    echo \"$0: Getting subsets of validation examples for diagnostics and combination.\"\n    if $generate_egs_scp; then\n      valid_diagnostic_output=\"ark,scp:$dir/valid_diagnostic.cegs,$dir/valid_diagnostic.scp\"\n      train_diagnostic_output=\"ark,scp:$dir/train_diagnostic.cegs,$dir/train_diagnostic.scp\"\n    else\n      valid_diagnostic_output=\"ark:$dir/valid_diagnostic.cegs\"\n      train_diagnostic_output=\"ark:$dir/train_diagnostic.cegs\"\n    fi\n    $cmd $dir/log/create_valid_subset_combine.log \\\n      nnet3-chain-subset-egs --n=$num_valid_egs_combine ark:$dir/valid_all.cegs \\\n      ark:$dir/valid_combine.cegs || exit 1\n    $cmd $dir/log/create_valid_subset_diagnostic.log \\\n      nnet3-chain-subset-egs --n=$num_egs_diagnostic ark:$dir/valid_all.cegs \\\n      $valid_diagnostic_output || exit 1\n\n    $cmd $dir/log/create_train_subset_combine.log \\\n      nnet3-chain-subset-egs --n=$num_train_egs_combine ark:$dir/train_subset_all.cegs \\\n      ark:$dir/train_combine.cegs || exit 1\n    $cmd $dir/log/create_train_subset_diagnostic.log \\\n      nnet3-chain-subset-egs --n=$num_egs_diagnostic ark:$dir/train_subset_all.cegs \\\n      $train_diagnostic_output || exit 1\n    sleep 5  # wait for file system to sync.\n    if $generate_egs_scp; then\n      cat $dir/valid_combine.cegs $dir/train_combine.cegs | \\\n        nnet3-chain-copy-egs ark:- ark,scp:$dir/combine.cegs,$dir/combine.scp\n    else\n      cat $dir/valid_combine.cegs $dir/train_combine.cegs > $dir/combine.cegs\n    fi\n\n    for f in $dir/{combine,train_diagnostic,valid_diagnostic}.cegs; do\n      [ ! 
-s $f ] && echo \"$0: No examples in file $f\" && exit 1;\n    done\n    rm $dir/valid_all.cegs $dir/train_subset_all.cegs $dir/{train,valid}_combine.cegs\n  ) || touch $dir/.error &\nfi\n\nif [ $stage -le 4 ]; then\n  # create cegs_orig.*.*.ark; the first index goes to $nj,\n  # the second to $num_archives_intermediate.\n\n  egs_list=\n  for n in $(seq $num_archives_intermediate); do\n    egs_list=\"$egs_list ark:$dir/cegs_orig.JOB.$n.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n\n  # The examples will go round-robin to egs_list.  Note: we omit the\n  # 'normalization.fst' argument while creating temporary egs: the phase of egs\n  # preparation that involves the normalization FST is quite CPU-intensive and\n  # it's more convenient to do it later, in the 'shuffle' stage.  Otherwise to\n  # make it efficient we need to use a large 'nj', like 40, and in that case\n  # there can be too many small files to deal with, because the total number of\n  # files is the product of 'nj' by 'num_archives_intermediate', which might be\n  # quite large.\n\n  $cmd --max-jobs-run $max_jobs_run JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    lattice-align-phones --replace-output-symbols=true $latdir/final.mdl \\\n      \"$lats_rspecifier\" ark:- \\| \\\n    chain-get-supervision $chain_supervision_all_opts \\\n      $chaindir/tree $chaindir/0.trans_mdl ark:- ark:- \\| \\\n    nnet3-chain-get-egs $ivector_opts --srand=\\$[JOB+$srand] $egs_opts \\\n      --num-frames-overlap=$frames_overlap_per_eg $trans_mdl_opt \\\n     \"$feats\" ark,s,cs:- ark:- \\| \\\n    nnet3-chain-copy-egs --random=true --srand=\\$[JOB+$srand] ark:- $egs_list || exit 1;\nfi\n\nif [ -f $dir/.error ]; then\n  echo \"$0: Error detected while creating train/valid egs\" && exit 1\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"egs_orig.*.JOB.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the 
egs.JOB.ark\n\n  # the input is a concatenation over the input jobs.\n  egs_list=\n  for n in $(seq $nj); do\n    egs_list=\"$egs_list $dir/cegs_orig.$n.JOB.ark\"\n  done\n\n  if [ $archives_multiple == 1 ]; then # normal case.\n    if $generate_egs_scp; then\n      output_archive=\"ark,scp:$dir/cegs.JOB.ark,$dir/cegs.JOB.scp\"\n    else\n      output_archive=\"ark:$dir/cegs.JOB.ark\"\n    fi\n    $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G \\\n      JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-chain-normalize-egs --normalization-fst-scale=$normalization_fst_scale \\\n        $chaindir/normalization.fst \"ark:cat $egs_list|\" ark:- \\| \\\n      nnet3-chain-shuffle-egs --srand=\\$[JOB+$srand] ark:- $output_archive || exit 1;\n\n    if $generate_egs_scp; then\n      #concatenate cegs.JOB.scp in single cegs.scp\n      for j in $(seq $num_archives_intermediate); do\n        cat $dir/cegs.$j.scp || exit 1;\n      done > $dir/cegs.scp || exit 1;\n      for f in $dir/cegs.*.scp; do rm $f; done\n    fi\n  else\n    # we need to shuffle the 'intermediate archives' and then split into the\n    # final archives.  
we create soft links to manage this splitting, because\n    # otherwise managing the output names is quite difficult (and we don't want\n    # to submit separate queue jobs for each intermediate archive, because then\n    # the --max-jobs-run option is hard to enforce).\n    if $generate_egs_scp; then\n      output_archives=\"$(for y in $(seq $archives_multiple); do echo ark,scp:$dir/cegs.JOB.$y.ark,$dir/cegs.JOB.$y.scp; done)\"\n    else\n      output_archives=\"$(for y in $(seq $archives_multiple); do echo ark:$dir/cegs.JOB.$y.ark; done)\"\n    fi\n    for x in $(seq $num_archives_intermediate); do\n      for y in $(seq $archives_multiple); do\n        archive_index=$[($x-1)*$archives_multiple+$y]\n        # egs.intermediate_archive.{1,2,...}.ark will point to egs.archive.ark\n        ln -sf cegs.$archive_index.ark $dir/cegs.$x.$y.ark || exit 1\n      done\n    done\n    $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G \\\n      JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-chain-normalize-egs --normalization-fst-scale=$normalization_fst_scale \\\n        $chaindir/normalization.fst \"ark:cat $egs_list|\" ark:- \\| \\\n      nnet3-chain-shuffle-egs --srand=\\$[JOB+$srand] ark:- ark:- \\| \\\n      nnet3-chain-copy-egs ark:- $output_archives || exit 1;\n\n    if $generate_egs_scp; then\n      #concatenate cegs.JOB.scp in single cegs.scp\n      rm -rf $dir/cegs.scp\n      for j in $(seq $num_archives_intermediate); do\n        for y in $(seq $archives_multiple); do\n          cat $dir/cegs.$j.$y.scp || exit 1;\n        done\n      done > $dir/cegs.scp || exit 1;\n      for f in $dir/cegs.*.*.scp; do rm $f; done\n    fi\n  fi\nfi\n\nwait\nif [ -f $dir/.error ]; then\n  echo \"$0: Error detected while creating train/valid egs\" && exit 1\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: Removing temporary archives, alignments and lattices\"\n  (\n    cd $dir\n    for f in $(ls -l . 
| grep 'cegs_orig' | awk '{ X=NF-1; Y=NF-2; if ($X == \"->\")  print $Y, $NF; }'); do rm $f; done\n    # the next statement removes them if we weren't using the soft links to a\n    # 'storage' directory.\n    rm cegs_orig.*.ark 2>/dev/null\n  )\n  if ! $generate_egs_scp && [ $archives_multiple -gt 1 ]; then\n    # there are some extra soft links that we should delete.\n    for f in $dir/cegs.*.*.ark; do rm $f; done\n  fi\n  rm $dir/ali.{ark,scp} 2>/dev/null\n  rm $dir/lat_special.*.{ark,scp} 2>/dev/null\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/get_model_context.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#             2019  Idiap Research Institute (Author: Srikanth Madikeri)\n#\n# This script computes the total left and right context needed for example (eg)\n# creation from a set of 'chain' models.\n# See the usage message for more information about input and output formats.\n\n# Begin configuration section.\nframe_subsampling_factor=1   # The total frame subsampling factor of the bottom\n                             # + top model, i.e. the relative difference in\n                             # frame rate between the input of the bottom model\n                             # and the output of the top model.  Would normally\n                             # be 3.\n\nlangs=default                # the list of languages.  This script checks that\n                             # in the dir (first arg to the script), each\n                             # language exists as $lang.mdl, and it warns if\n                             # any model files appear (which might indicate a\n                             # script bug).\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 2 ]; then\n  cat 1>&2 <<EOF\nUsage: $0 [opts] <model-dir> <output-info-file>\nThis script works out some acoustic-context-related information,\nand writes it, long with  the options provided to the script,\nto the <output-info-file> provided.  
An example of what\n<output-info-file> might contain after this script is called, is:\nlangs default\nframe_subsampling_factor 3\nbottom_subsampling_factor 3\nmodel_left_context 22\nmodel_right_context 22\n  e.g.: $0 --frame-subsampling-factor 3 \n          --langs 'default' exp/chaina/tdnn1a_sp/0 exp/chaina/tdnn1a_sp/0/info.txt\n Options:\n     --frame-subsampling-factor    # (default: 1)  Total frame subsampling factor of\n                                   # both models combined, i.e. ratio of\n                                   # frame rate of input features vs.\n                                   # alignments and decoding (e.g. 3).\n     --bottom-subsampling-factor   # (default: 1) Controls the frequency at which\n                                   # the output of the bottom model is\n                                   # evaluated, and the interpretation of frame\n                                   # offsets in the top config file.  Must be a\n                                   # divisor of --frame-subsampling-factor\n     --langs                       # The list of languages (must be in quotes,\n                                   # to be parsed as a single arg).  May be\n                                   # 'default' or e.g. 'english french'\nEOF\n  exit 1;\nfi\n\n\ndir=$1\ninfo_file=$2\n\n# die on error or undefined variable.\nset -e -u\n\nif [ ! -d $dir ]; then\n  echo 1>&2 \"$0: expected directory $dir to exist\"\n  exit 1\nfi\n\n# quote $langs so that a multi-word list (e.g. 'english french') is tested as one string.\nif [ -z \"$langs\" ]; then\n  echo 1>&2 \"$0: list of languages (--langs option) is empty\"\n  exit 1\nfi\n\nif  ! [ $frame_subsampling_factor -ge 1 ]; then\n  echo 1>&2 \"$0: there was a problem with the options --frame-subsampling-factor=$frame_subsampling_factor\"\n  exit 1\nfi\n\nmkdir -p $dir/temp\n\nfor lang in $langs; do\n  if [ ! 
-s $dir/$lang.mdl ]; then\n    echo 1>&2 \"$0: expected file $dir/$lang.mdl to exist and be nonempty (check --langs option)\"\n    exit 1\n  fi\n  nnet3-am-info $dir/$lang.mdl > $dir/temp/$lang.info\n  this_left_context=$(grep '^left-context:' $dir/temp/$lang.info | awk '{print $2}')\n  this_right_context=$(grep '^right-context:' $dir/temp/$lang.info | awk '{print $2}')\ndone\n\nleft_context=$this_left_context\nright_context=$this_right_context\n\n\ncat >$info_file <<EOF\nframe_subsampling_factor $frame_subsampling_factor\nlangs $langs\nmodel_left_context $left_context\nmodel_right_context $right_context\nEOF\n\n\necho \"$0: Finished getting model context\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/get_phone_post.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#  Apache 2.0.\n\n\n\n# This script obtains phone posteriors from a trained chain model, using either\n# the xent output or the forward-backward posteriors from the denominator fst.\n# The phone posteriors will be in matrices where the column index can be\n# interpreted as phone-index - 1.\n\n# You may want to mess with the compression options.  Be careful: with the current\n# settings, you might sometimes get exact zeros as the posterior values.\n\n# CAUTION!  This script isn't very suitable for dumping features from recurrent\n# architectures such as LSTMs, because it doesn't support setting the chunk size\n# and left and right context.  (Those would have to be passed into nnet3-compute\n# or nnet3-chain-compute-post).\n\n# Begin configuration section.\nstage=0\n\nnj=1  # Number of jobs to run.\ncmd=run.pl\nremove_word_position_dependency=false\nuse_xent_output=false\nonline_ivector_dir=\nuse_gpu=false\ncount_smoothing=1.0  # this should be some small number, I don't think it's critical;\n                     # it will mainly affect the probability we assign to phones that\n                     # were never seen in training.  note: this is added to the raw\n                     # transition-id occupation counts, so 1.0 means, add a single\n                     # frame's count to each transition-id's counts.\n\n# End configuration section.\n\nset -e -u\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n  echo \"Usage: $0 <chain-tree-dir> <chain-model-dir> <lang-dir> <data-dir> <phone-post-dir>\"\n  echo \" e.g.: $0 --remove-word-position-dependency true --online-ivector-dir exp/nnet3/ivectors_test_eval92_hires \\\\\"\n  echo \"       exp/chain/tree_a_sp exp/chain/tdnn1a_sp data/lang data/test_eval92_hires exp/chain/tdnn1a_sp_post_eval92\"\n  echo \" ... 
you'll normally want to set the --nj and --cmd options as well.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (run.pl|queue.pl|... <queue opts>)    # how to run jobs.\"\n  echo \"  --config <config-file>                      # config containing options\"\n  echo \"  --stage <stage>                             # stage to do partial re-run from.\"\n  echo \"  --nj <N>                                    # Number of parallel jobs to run, default:1\"\n  echo \"  --remove-word-position-dependency <bool>    # If true, remove word-position-dependency\"\n  echo \"                                              # info when dumping posteriors (default: false)\"\n  echo \"  --use-xent-output <bool>                    # If true, use the cross-entropy output of the\"\n  echo \"                                              # neural network when dumping posteriors\"\n  echo \"                                              # (default: false, will use chain denominator FST)\"\n  echo \"  --online-ivector-dir <dir>                  # Directory where we dumped online-computed\"\n  echo \"                                              # ivectors corresponding to the data in <data>\"\n  echo \"  --use-gpu <bool>                            # Set to true to use GPUs (not recommended as the\"\n  echo \"                                              # binary is very poorly optimized for GPU use).\"\n  exit 1;\nfi\n\n\ntree_dir=$1\nmodel_dir=$2\nlang=$3\ndata=$4\ndir=$5\n\n\nfor f in $tree_dir/tree $tree_dir/final.mdl $tree_dir/ali.1.gz $tree_dir/num_jobs \\\n         $model_dir/final.mdl $model_dir/frame_subsampling_factor $model_dir/den.fst \\\n         $data/feats.scp $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split${nj}utt\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh --per-utt $data $nj || exit 1;\n\nuse_ivector=false\n\ncmvn_opts=$(cat $model_dir/cmvn_opts)\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! -z \"$online_ivector_dir\" ];then\n  steps/nnet2/check_ivectors_compatible.sh $model_dir $online_ivector_dir || exit 1;\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_feats=\"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $online_ivector_dir/ivector_online.scp |\"\n  ivector_opts=\"--online-ivector-period=$ivector_period --online-ivectors='$ivector_feats'\"\nelse\n  ivector_opts=\nfi\n\nif $use_gpu; then\n  # note: the variable must be named gpu_queue_opts, as that is what is used below\n  # (with 'set -u', an unset variable would make the script exit).\n  gpu_queue_opts=\"--gpu 1\"\n  gpu_opt=\"--use-gpu=yes\"\n  if ! cuda-compiled; then\n    echo \"$0: ERROR: --use-gpu is true but you have not compiled\"\n    echo \"   for CUDA.  If you have GPUs and have nvcc installed,\"\n    echo \"   go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  gpu_queue_opts=\n  gpu_opt=\"--use-gpu=no\"\nfi\nframe_subsampling_factor=$(cat $model_dir/frame_subsampling_factor)\n\nmkdir -p $dir/log\ncp $model_dir/frame_subsampling_factor $dir/\n\nif [ $stage -le 0 ]; then\n  if [ ! 
-f $dir/tacc ] || [ $dir/tacc -ot $tree_dir/ali.1.gz ]; then\n    echo \"$0: obtaining transition-id counts in $dir/tacc\"\n    # Obtain counts for each transition-id, from the alignments.\n    this_nj=$(cat $tree_dir/num_jobs)\n\n\n    $cmd JOB=1:$this_nj $dir/log/acc_taccs.JOB.log \\\n       ali-to-post \"ark:gunzip -c $tree_dir/ali.JOB.gz|\" ark:- \\| \\\n       post-to-tacc $tree_dir/final.mdl ark:- $dir/tacc.JOB\n\n    input_taccs=$(for n in $(seq $this_nj); do echo $dir/tacc.$n; done)\n\n    $cmd $dir/log/sum_taccs.log \\\n         vector-sum --binary=false $input_taccs $dir/tacc\n\n    rm $dir/tacc.*\n  else\n    echo \"$0: skipping creation of $dir/tacc since it already exists.\"\n  fi\nfi\n\n\nif [ $stage -le 1 ] && $remove_word_position_dependency; then\n  echo \"$0: creating $dir/phone_map.int\"\n  utils/lang/get_word_position_phone_map.pl $lang $dir\nelse\n  # Either way, $dir/phones.txt will be a symbol table for the phones that\n  # we are dumping (although the matrices we dump won't contain anything\n  # for symbol 0 which is <eps>).\n  grep -v '^#' $lang/phones.txt > $dir/phones.txt\nfi\n\nif [ $stage -le 1 ]; then\n  # we want the phones in integer form as it's safer for processing by script.\n  # $data/fake_phones.txt will just contain e.g. 
\"0 0\\n1 1\\n....\", it's used\n  # to force show-transitions to print the phones as integers.\n  awk '{print $2,$2}' <$lang/phones.txt >$dir/fake_phones.txt\n\n\n  # The format of the 'show-transitions' command below is like the following:\n  #show-transitions tempdir/phone_map.int exp/chain/tree_a_sp/final.mdl\n  #Transition-state 1: phone = 1 hmm-state = 0 forward-pdf = 0 self-loop-pdf = 51\n  # Transition-id = 1 p = 0.5 [self-loop]\n  # Transition-id = 2 p = 0.5 [0 -> 1]\n  #Transition-state 2: phone = 10 hmm-state = 0 forward-pdf = 0 self-loop-pdf = 51\n  # Transition-id = 3 p = 0.5 [self-loop]\n  # Transition-id = 4 p = 0.5 [0 -> 1]\n\n  # The following inline script processes that info about the transition model\n  # into the file $dir/phones_and_pdfs.txt, which has a line for each transition-id\n  # (starting from number 1), and the format of each line is\n  # <phone-id> <pdf-id>\n  show-transitions $dir/fake_phones.txt $tree_dir/final.mdl | \\\n    perl -ane ' if(m/Transition-state.* phone = (\\d+) pdf = (\\d+)/) { $phone = $1; $forward_pdf = $2; $self_loop_pdf = $2; }\n        if(m/Transition-state.* phone = (\\d+) .* forward-pdf = (\\d+) self-loop-pdf = (\\d+)/) {\n          $phone = $1; $forward_pdf = $2; $self_loop_pdf = $3; }\n        if(m/Transition-id/) {  if (m/self-loop/) { print \"$phone $self_loop_pdf\\n\"; }\n            else { print \"$phone $forward_pdf\\n\" } } ' > $dir/phones_and_pdfs.txt\n\n\n  # The following command just separates the 'tacc' file into a similar format\n  # to $dir/phones_and_pdfs.txt, with one count per line, and a line per transition-id\n  # starting from number 1.  
We skip the first two fields which are \"[ 0\" (the 0 is\n  # for transition-id=0, since transition-ids are 1-based), and the last field which is \"]\".\n  awk '{ for (n=3;n<NF;n++) print $n; }' <$dir/tacc  >$dir/transition_counts.txt\n\n  num_lines1=$(wc -l <$dir/phones_and_pdfs.txt)\n  num_lines2=$(wc -l <$dir/transition_counts.txt)\n  if [ $num_lines1 -ne $num_lines2 ]; then\n    echo \"$0: mismatch in num-lines between phones_and_pdfs.txt and transition_counts.txt: $num_lines1 vs $num_lines2\"\n    exit 1\n  fi\n\n  # after 'paste', the format of the data will be\n  # <phone-id> <pdf-id> <data-count>\n  # we add the count smoothing at this point.\n  paste $dir/phones_and_pdfs.txt $dir/transition_counts.txt | \\\n     awk -v s=$count_smoothing '{print $1, $2, (s+$3);}' > $dir/combined_info.txt\n\n  if $remove_word_position_dependency; then\n    # map the phones to word-position-independent phones; you can see $dir/phones.txt\n    # to interpret the final output.\n    utils/apply_map.pl -f 1 $dir/phone_map.int <$dir/combined_info.txt > $dir/temp.txt\n    mv $dir/temp.txt $dir/combined_info.txt\n  fi\n\n  awk 'BEGIN{num_phones=1;num_pdfs=1;} { phone=$1; pdf=$2; count=$3; pdf_count[pdf] += count; counts[pdf,phone] += count;\n       if (phone>num_phones) num_phones=phone; if (pdf>=num_pdfs) num_pdfs = pdf + 1; }\n       END{ print \"[ \"; for(phone=1;phone<=num_phones;phone++) {\n          for (pdf=0;pdf<num_pdfs;pdf++) printf(\"%.3f \", counts[pdf,phone]/pdf_count[pdf]);\n           print \"\"; } print \"]\"; }' <$dir/combined_info.txt >$dir/transform.mat\n\nfi\n\n\nif [ $stage -le 2 ]; then\n\n  # note: --compression-method=3 is kTwoByteAuto: Each element is stored in two\n  # bytes as a uint16, with the representable range of values chosen\n  # automatically with the minimum and maximum elements of the matrix as its\n  # edges.\n  compress_opts=\"--compress=true --compression-method=3\"\n\n  if $use_xent_output; then\n    # This block uses the 'output-xent' 
output of the nnet.\n\n    model=\"nnet3-copy '--edits-config=echo remove-output-nodes name=output; echo rename-node old-name=output-xent new-name=output|' $model_dir/final.mdl -|\"\n\n    $cmd $gpu_queue_opts JOB=1:$nj $dir/log/get_phone_post.JOB.log \\\n       nnet3-compute $gpu_opt $ivector_opts \\\n       --frame-subsampling-factor=$frame_subsampling_factor --apply-exp=true \\\n       \"$model\" \"$feats\" ark:- \\| \\\n       transform-feats $dir/transform.mat ark:- ark:- \\| \\\n       copy-feats $compress_opts ark:- ark,scp:$dir/phone_post.JOB.ark,$dir/phone_post.JOB.scp\n  else\n    # This block is when we are using the 'chain' output (recommended as the posteriors\n    # will be much more accurate).\n    $cmd $gpu_queue_opts JOB=1:$nj $dir/log/get_phone_post.JOB.log \\\n       nnet3-chain-compute-post $gpu_opt $ivector_opts --transform-mat=$dir/transform.mat \\\n          --frame-subsampling-factor=$frame_subsampling_factor \\\n        $model_dir/final.mdl $model_dir/den.fst \"$feats\" ark:- \\| \\\n       copy-feats $compress_opts ark:- ark,scp:$dir/phone_post.JOB.ark,$dir/phone_post.JOB.scp\n  fi\n\n  sleep 5\n  # Make a single .scp file, for convenience.\n  for n in $(seq $nj); do cat $dir/phone_post.$n.scp; done > $dir/phone_post.scp\n\nfi\n"
  },
  {
    "path": "egs/steps/nnet3/chain/make_weighted_den_fst.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Vimal Manohar\n#           2017 Pegah Ghahremani\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script creates denominator FST (den.fst) and normalization.fst for\n# chain training. It additionally copies the transition model and tree from the\n# first alignment directory to the chain directory.\n# Alternatively, if the --am-dir option is used, the transition model and tree\n# are taken from there instead of the first alignment directory.\n# This script can accept multiple sources of alignments with same phone sets\n# that can be weighted to estimate phone LM.\n# You can use the --num-repeats option to repeat some source data more than\n# once when training the LM for the denominator FST.\n\nset -o pipefail\n\n# begin configuration section.\ncmd=run.pl\nstage=0\nnum_repeats= # Comma-separated list of positive integer multiplicities, one\n             # for each input alignment directory.  The alignments from\n             # each source will be scaled by the corresponding value when\n             # training the LM.\n             # If not specified, weight '1' is used for all data sources.\n\nam_dir=\nlm_opts='--num-extra-lm-states=2000'\n#end configuration section.\n\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -lt 2 ]; then\n  echo \"Usage: $0 [options] <ali-dir1> [<ali-dir2> ...] 
<out-dir>\";\n  echo \"e.g.: $0 exp/tri1_ali exp/tri2_ali exp/chain/tdnn_1a_sp\";\n  echo \"Options: \"\n  echo \" --cmd (run.pl|queue.pl...)      # Specify how to run jobs.\";\n  echo \"--lm-opts                        # Options for phone LM generation\";\n  echo \"--num-repeats                    # Comma-separated list of postive integer\"\n  echo \"                                 # multiplicities, one for each input\"\n  echo \"                                 # alignment directory.  The alignments\"\n  echo \"                                 # from each source will be scaled by\"\n  echo \"                                 # the corresponding value when training\"\n  echo \"                                 # the LM.  If not specified, weight '1'\"\n  echo \"                                 # is used for all data sources.\"\n  echo \"--am-dir                         # Path to the base AM directory. Set this\"\n  echo \"                                 # when the AM you will be training from\"\n  echo \"                                 # isn't necessarily the one which created\"\n  echo \"                                 # the alignments. If this is not set, the\"\n  echo \"                                 # tree and transition model from the first\"\n  echo \"                                 # ali-dir will be copied to out-dir.\"\n  exit 1;\nfi\n\ndir=${@: -1}   # the working directory: last argument to the script\nali_dirs=( $@ )  # read the remaining arguments into an array\nunset ali_dirs[${#ali_dirs[@]}-1]  # 'pop' the last argument which is $dir\nnum_alignments=${#ali_dirs[@]}    # number of alignment dirs to combine\n\nif [ -z \"$am_dir\" ]; then\n  am_dir=${ali_dirs[0]}\nfi\n\nmkdir -p $dir/log\n\n# Go through each alignment directory and make sure the phones match.\nfor n in `seq 0 $[$num_alignments-1]`;do\n  ali_dir=${ali_dirs[$n]}\n  for f in $ali_dir/ali.1.gz $ali_dir/final.mdl $ali_dir/tree; do\n    [ ! 
-f $f ] && echo \"$0: Expected file $f to exist\" && exit 1;\n  done\n  utils/lang/check_phones_compatible.sh ${am_dir}/phones.txt \\\n    ${ali_dirs[$n]}/phones.txt || exit 1;\ndone\n\n# Make sure we have the AM and tree in the am_dir.\nfor f in $am_dir/final.mdl $am_dir/tree; do\n  [ ! -f $f ] && echo \"$0: Expected file $f to exist\" && exit 1;\ndone\n\ncp $am_dir/tree $dir || exit 1\n\nif [ -z \"$num_repeats\" ]; then\n  # If 'num_repeats' is not specified, set num_repeats_array to e.g. (1 1 1).\n  num_repeats_array=( $(for n in $(seq $num_alignments); do echo 1; done) )\nelse\n  num_repeats_array=(${num_repeats//,/ })\n  num_repeats=${#num_repeats_array[@]}\n  if [ $num_repeats -ne $num_alignments ]; then\n    echo \"$0: too many or too few elements in --num-repeats option: '$num_repeats'\"\n    exit 1\n  fi\nfi\n\nall_phones=\"\"  # will contain the names of the .gz files containing phones,\n               # with some members possibly repeated per the --num-repeats\n               # option\nfor n in `seq 0 $[num_alignments-1]`; do\n  this_num_repeats=${num_repeats_array[$n]}\n  this_alignment_dir=${ali_dirs[$n]}\n  num_jobs=$(cat $this_alignment_dir/num_jobs)\n  if ! [ \"$this_num_repeats\" -ge 0 ]; then\n    echo \"Expected comma-separated list of integers for --num-repeats option, got '$num_repeats'\"\n    exit 1\n  fi\n\n\n  if [ $stage -le 1 ]; then\n    for j in $(seq $num_jobs); do gunzip -c $this_alignment_dir/ali.$j.gz; done | \\\n      ali-to-phones $this_alignment_dir/final.mdl ark:- \"ark:|gzip -c >$dir/phones.$n.gz\" || exit 1;\n  fi\n\n  if [ ! 
-s $dir/phones.$n.gz ]; then\n    echo \"$dir/phones.$n.gz is empty or does not exist\"\n    exit 1\n  fi\n\n  all_phones=\"$all_phones $(for r in $(seq $this_num_repeats); do echo $dir/phones.$n.gz; done)\"\ndone\n\nif [ $stage -le 2 ]; then\n  $cmd $dir/log/make_phone_lm_fst.log \\\n    gunzip -c $all_phones \\| \\\n    chain-est-phone-lm $lm_opts ark:- $dir/phone_lm.fst || exit 1;\n  rm $dir/phones.*.gz\nfi\n\nif [ $stage -le 3 ]; then\n  copy-transition-model $am_dir/final.mdl $dir/0.trans_mdl || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  $cmd $dir/log/make_den_fst.log \\\n    chain-make-den-fst $dir/tree $dir/0.trans_mdl \\\n    $dir/phone_lm.fst \\\n    $dir/den.fst $dir/normalization.fst || exit 1\nfi\n\necho \"Successfully created {den,normalization}.fst\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/chain/multilingual/combine_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017     Pegah Ghahremani\n#           2017-18  Vimal Manohar\n# Apache 2.0\n\n# This script generates examples for multilingual training of 'chain' \n# models using separate input egs dir per language as input.\n# This script is similar to steps/nnet3/multilingual/combine_egs.sh, but \n# works on 'chain' egs. This is also useful for semi-supervised training,\n# where supervised and unsupervised datasets are treated as different \n# languages.\n\n# This scripts produces 3 sets of files --\n# cegs.*.scp, cegs.output.*.ark, cegs.weight.*.ark\n#\n# cegs.*.scp are the SCP files of the training examples.\n# cegs.weight.*.ark map from the key of the example to the language-specific\n# weight of that example.\n# cegs.output.*.ark map from the key of the example to the name of\n# the output-node in the neural net for that specific language, e.g.\n# 'output-2'.\n#\n# Begin configuration section.\ncmd=run.pl\nblock_size=256          # This is the number of consecutive egs that we take from\n                        # each source, and it only affects the locality of disk\n                        # access.\nlang2weight=            # array of weights one per input languge to scale example's output\n                        # w.r.t its input language during training.\nstage=0\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# -lt 3 ]; then\n  cat <<EOF\n  This script generates examples for multilingual training of neural networks\n  using separate input egs dir per language as input.\n  See top of the script for details.\n\n  Usage: $0 [opts] <num-input-langs,N> <lang1-egs-dir> ...<langN-egs-dir> <multilingual-egs-dir>\n   e.g.: $0 [opts] 2 exp/lang1/egs exp/lang2/egs exp/multi/egs\n\n  Options:\n      --cmd (utils/run.pl|utils/queue.pl <queue opts>)  # how to run jobs.\n      --block-size <int|256>      # it is the number of consecutive egs that we take from \n                                  # each source, and it only affects the locality of disk \n                                  # access. This does not have to be the actual minibatch size\nEOF\n  exit 1;\nfi\n\nnum_langs=$1\n\nshift 1\nargs=(\"$@\")\nmegs_dir=${args[-1]} # multilingual directory\nmkdir -p $megs_dir\nmkdir -p $megs_dir/info\nif [ ${#args[@]} != $[$num_langs+1] ]; then\n  echo \"$0: num of input example dirs provided is not compatible with num_langs $num_langs.\"\n  echo \"Usage: $0 [opts] <num-input-langs,N> <lang1-egs-dir> ...<langN-egs-dir> <multilingual-egs-dir>\"\n  echo \" e.g.: $0 [opts] 2 exp/lang1/egs exp/lang2/egs exp/multi/egs\"\n  exit 1;\nfi\n\nrequired=\"cegs.scp combine.scp train_diagnostic.scp valid_diagnostic.scp\"\ntrain_scp_list=\ntrain_diagnostic_scp_list=\nvalid_diagnostic_scp_list=\ncombine_scp_list=\n\n# read parameters from $egs_dir[0]/info and cmvn_opts\n# to write in multilingual egs_dir.\ncheck_params=\"info/feat_dim info/ivector_dim info/left_context info/right_context info/left_context_initial info/right_context_final cmvn_opts\"\nivec_dim=`cat ${args[0]}/info/ivector_dim`\nif [ $ivec_dim -ne 0 ];then check_params=\"$check_params info/final.ie.id\"; fi\n\nfor param in $check_params info/frames_per_eg; do\n  cat ${args[0]}/$param > $megs_dir/$param || exit 1;\ndone\n\ntot_num_archives=0\nfor lang in $(seq 0 $[$num_langs-1]);do\n  
multi_egs_dir[$lang]=${args[$lang]}\n  for f in $required; do\n    if [ ! -f ${multi_egs_dir[$lang]}/$f ]; then\n      echo \"$0: no such file ${multi_egs_dir[$lang]}/$f.\" && exit 1;\n    fi\n  done\n  num_archives=$(cat ${multi_egs_dir[$lang]}/info/num_archives)\n  tot_num_archives=$[tot_num_archives+num_archives]\n  train_scp_list=\"$train_scp_list ${args[$lang]}/cegs.scp\"\n  train_diagnostic_scp_list=\"$train_diagnostic_scp_list ${args[$lang]}/train_diagnostic.scp\"\n  valid_diagnostic_scp_list=\"$valid_diagnostic_scp_list ${args[$lang]}/valid_diagnostic.scp\"\n  combine_scp_list=\"$combine_scp_list ${args[$lang]}/combine.scp\"\n\n  # check that the parameters are the same in all egs dirs\n  for f in $check_params; do\n    if [ -f $megs_dir/$f ] && [ -f ${multi_egs_dir[$lang]}/$f ]; then\n      f1=$(cat $megs_dir/$f)\n      f2=$(cat ${multi_egs_dir[$lang]}/$f)\n      if [ \"$f1\" != \"$f2\" ]  ; then\n        echo \"$0: mismatch for $f in $megs_dir vs. ${multi_egs_dir[$lang]} ($f1 vs. $f2).\"\n        exit 1;\n      fi\n    else\n      echo \"$0: file $f does not exist in $megs_dir or ${multi_egs_dir[$lang]}.\"\n    fi\n  done\ndone\n\nif [ ! 
-z \"$lang2weight\" ]; then\n  egs_opt=\"--lang2weight '$lang2weight'\"\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: allocating multilingual examples for training.\"\n  # Generate cegs.*.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_train.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives $tot_num_archives \\\n      --block-size $block_size \\\n      --egs-prefix \"cegs.\" \\\n      $train_scp_list $megs_dir || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: combine combine.scp examples from all langs in $megs_dir/combine.scp.\"\n  # Generate combine.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_combine.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives 1 \\\n      --block-size $block_size \\\n      --egs-prefix \"combine.\" \\\n      $combine_scp_list $megs_dir || exit 1;\n\n  echo \"$0: combine train_diagnostic.scp examples from all langs in $megs_dir/train_diagnostic.scp.\"\n  # Generate train_diagnostic.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_train_diagnostic.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives 1 \\\n      --block-size $block_size \\\n      --egs-prefix \"train_diagnostic.\" \\\n      $train_diagnostic_scp_list $megs_dir || exit 1;\n\n\n  echo \"$0: combine valid_diagnostic.scp examples from all langs in $megs_dir/valid_diagnostic.scp.\"\n  # Generate valid_diagnostic.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_valid_diagnostic.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives 1 \\\n      --block-size $block_size \\\n      --egs-prefix \"valid_diagnostic.\" \\\n      $valid_diagnostic_scp_list $megs_dir || exit 1;\n\nfi\nfor egs_type in combine train_diagnostic valid_diagnostic; 
do\n  mv $megs_dir/${egs_type}.output.1.ark $megs_dir/${egs_type}.output.ark || exit 1;\n  mv $megs_dir/${egs_type}.weight.1.ark $megs_dir/${egs_type}.weight.ark || exit 1;\n  mv $megs_dir/${egs_type}.1.scp $megs_dir/${egs_type}.scp || exit 1;\ndone\nmv $megs_dir/info/cegs.num_archives $megs_dir/info/num_archives || exit 1;\nmv $megs_dir/info/cegs.num_tasks $megs_dir/info/num_tasks || exit 1;\necho \"$0: Finished preparing multilingual training example.\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain/train.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This script is based on steps/nnet3/chain/train.sh\n\"\"\"\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport argparse\nimport logging\nimport os\nimport pprint\nimport shutil\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\nimport libs.nnet3.train.chain_objf.acoustic_model as chain_lib\nimport libs.nnet3.report.log_parse as nnet3_log_parse\n\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Starting chain model trainer (train.py)')\n\n\ndef get_args():\n    \"\"\" Get args from stdin.\n\n    We add compulsary arguments as named arguments for readability\n\n    The common options are defined in the object\n    libs.nnet3.train.common.CommonParser.parser.\n    See steps/libs/nnet3/train/common.py\n    \"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Trains RNN and DNN acoustic models using the 'chain'\n        objective function.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        conflict_handler='resolve',\n        parents=[common_train_lib.CommonParser().parser])\n\n    # egs extraction options\n    parser.add_argument(\"--egs.chunk-width\", type=str, dest='chunk_width',\n                        default=\"20\",\n                        help=\"\"\"Number of frames per chunk in the examples\n                        used to train the RNN.   
Caution: if you double this you\n                        should halve --trainer.samples-per-iter.  May be\n                        a comma-separated list of alternatives: first width\n                        is the 'principal' chunk-width, used preferentially\"\"\")\n    parser.add_argument(\"--egs.nj\", type=int, required=False,\n                        default=0, dest=\"egs_nj\",\n                        help=\"\"\"Number of jobs to use when generating egs.\n                        Default: the same number as used for tree generation.\n                        You probably do not need to tweak this, unless you\n                        want to adapt a neural network on some different,\n                        smaller-size data.\"\"\")\n\n    # chain options\n    parser.add_argument(\"--chain.lm-opts\", type=str, dest='lm_opts',\n                        default=None, action=common_lib.NullstrToNoneAction,\n                        help=\"options to be passed to chain-est-phone-lm\")\n    parser.add_argument(\"--chain.l2-regularize\", type=float,\n                        dest='l2_regularize', default=0.0,\n                        help=\"\"\"Weight of regularization function which is the\n                        l2-norm of the output of the network. It should be used\n                        without the log-softmax layer for the outputs.  
Otherwise the\n                        l2-norm of the log-softmax outputs can dominate the\n                        objective function.\"\"\")\n    parser.add_argument(\"--chain.xent-regularize\", type=float,\n                        dest='xent_regularize', default=0.0,\n                        help=\"Weight of regularization function which is the \"\n                        \"cross-entropy cost of the outputs.\")\n    parser.add_argument(\"--chain.right-tolerance\", type=int,\n                        dest='right_tolerance', default=5, help=\"\")\n    parser.add_argument(\"--chain.left-tolerance\", type=int,\n                        dest='left_tolerance', default=5, help=\"\")\n    parser.add_argument(\"--chain.leaky-hmm-coefficient\", type=float,\n                        dest='leaky_hmm_coefficient', default=0.00001,\n                        help=\"\")\n    parser.add_argument(\"--chain.apply-deriv-weights\", type=str,\n                        dest='apply_deriv_weights', default=True,\n                        action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"],\n                        help=\"\")\n    parser.add_argument(\"--chain.frame-subsampling-factor\", type=int,\n                        dest='frame_subsampling_factor', default=3,\n                        help=\"ratio of frames-per-second of features we \"\n                        \"train on, to chain model's output\")\n    parser.add_argument(\"--chain.alignment-subsampling-factor\", type=int,\n                        dest='alignment_subsampling_factor',\n                        default=3,\n                        help=\"ratio of frames-per-second of input \"\n                        \"alignments to chain model's output\")\n    parser.add_argument(\"--chain.left-deriv-truncate\", type=int,\n                        dest='left_deriv_truncate',\n                        default=None,\n                        help=\"Deprecated. 
Kept for back compatibility\")\n\n    # trainer options\n    parser.add_argument(\"--trainer.input-model\", type=str,\n                        dest='input_model', default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"If specified, this model is used as initial \"\n                             \"'raw' model (0.raw in the script) instead of \"\n                             \"initializing the model from the xconfig. \"\n                             \"Also configs dir is not expected to exist \"\n                             \"and left/right context is computed from this \"\n                             \"model.\")\n    parser.add_argument(\"--trainer.num-epochs\", type=float, dest='num_epochs',\n                        default=10.0,\n                        help=\"Number of epochs to train the model\")\n    parser.add_argument(\"--trainer.frames-per-iter\", type=int,\n                        dest='frames_per_iter', default=800000,\n                        help=\"\"\"Each iteration of training, see this many\n                        [input] frames per job.  This option is passed to\n                        get_egs.sh.  Aim for about a minute of training\n                        time\"\"\")\n\n    parser.add_argument(\"--trainer.num-chunk-per-minibatch\", type=str,\n                        dest='num_chunk_per_minibatch', default='128',\n                        help=\"\"\"Number of sequences to be processed in\n                        parallel every minibatch.  
May be a more general\n                        rule as accepted by the --minibatch-size option of\n                        nnet3-merge-egs; run that program without args to see\n                        the format.\"\"\")\n\n    # Parameters for the optimization\n    parser.add_argument(\"--trainer.optimization.initial-effective-lrate\",\n                        type=float, dest='initial_effective_lrate',\n                        default=0.0002,\n                        help=\"Learning rate used during the initial iteration\")\n    parser.add_argument(\"--trainer.optimization.final-effective-lrate\",\n                        type=float, dest='final_effective_lrate',\n                        default=0.00002,\n                        help=\"Learning rate used during the final iteration\")\n    parser.add_argument(\"--trainer.optimization.shrink-value\", type=float,\n                        dest='shrink_value', default=1.0,\n                        help=\"\"\"Scaling factor used for scaling the parameter\n                        matrices when the derivative averages are below the\n                        shrink-threshold at the non-linearities.  E.g. 0.99.\n                        Only applicable when the neural net contains sigmoid or\n                        tanh units.\"\"\")\n    parser.add_argument(\"--trainer.optimization.shrink-saturation-threshold\",\n                        type=float,\n                        dest='shrink_saturation_threshold', default=0.40,\n                        help=\"\"\"Threshold that controls when we apply the\n                        'shrinkage' (i.e. scaling by shrink-value).  
If the\n                        saturation of the sigmoid and tanh nonlinearities in\n                        the neural net (as measured by\n                        steps/nnet3/get_saturation.pl) exceeds this threshold\n                        we scale the parameter matrices with the\n                        shrink-value.\"\"\")\n    # RNN-specific training options\n    parser.add_argument(\"--trainer.deriv-truncate-margin\", type=int,\n                        dest='deriv_truncate_margin', default=None,\n                        help=\"\"\"(Relevant only for recurrent models). If\n                        specified, gives the margin (in input frames) around\n                        the 'required' part of each chunk that the derivatives\n                        are backpropagated to. If unset, the derivatives are\n                        backpropagated all the way to the boundaries of the\n                        input data. E.g. 8 is a reasonable setting. Note: the\n                        'required' part of the chunk is defined by the model's\n                        {left,right}-context.\"\"\")\n\n    # General options\n    parser.add_argument(\"--feat-dir\", type=str, required=True,\n                        help=\"Directory with features used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--tree-dir\", type=str, required=True,\n                        help=\"\"\"Directory containing the tree to use for this\n                        model (we also expect final.mdl and ali.*.gz in that\n                        directory\"\"\")\n    parser.add_argument(\"--lat-dir\", type=str, required=True,\n                        help=\"Directory with numerator lattices \"\n                        \"used for training the neural network.\")\n    parser.add_argument(\"--dir\", type=str, required=True,\n                        help=\"Directory to store the models and \"\n                        \"all other files.\")\n\n    print(' 
'.join(sys.argv))\n    print(sys.argv)\n\n    args = parser.parse_args()\n\n    [args, run_opts] = process_args(args)\n\n    return [args, run_opts]\n\n\ndef process_args(args):\n    \"\"\" Process the options got from get_args()\n    \"\"\"\n\n    if not common_train_lib.validate_chunk_width(args.chunk_width):\n        raise Exception(\"--egs.chunk-width has an invalid value\")\n\n    if not common_train_lib.validate_minibatch_size_str(args.num_chunk_per_minibatch):\n        raise Exception(\"--trainer.num-chunk-per-minibatch has an invalid value\")\n\n    if args.chunk_left_context < 0:\n        raise Exception(\"--egs.chunk-left-context should be non-negative\")\n\n    if args.chunk_right_context < 0:\n        raise Exception(\"--egs.chunk-right-context should be non-negative\")\n\n    if args.left_deriv_truncate is not None:\n        args.deriv_truncate_margin = -args.left_deriv_truncate\n        logger.warning(\n            \"--chain.left-deriv-truncate (deprecated) is set by user, and \"\n            \"--trainer.deriv-truncate-margin is set to negative of that \"\n            \"value={0}. 
We recommend using the option \"\n            \"--trainer.deriv-truncate-margin.\".format(\n                args.deriv_truncate_margin))\n\n    if (not os.path.exists(args.dir)):\n        raise Exception(\"Directory specified with --dir={0} \"\n                        \"does not exist.\".format(args.dir))\n    if (not os.path.exists(args.dir + \"/configs\") and\n        (args.input_model is None or not os.path.exists(args.input_model))):\n        raise Exception(\"Either --trainer.input-model option should be supplied, \"\n                        \"and exist; or the {0}/configs directory should exist.\"\n                        \"\".format(args.dir))\n\n    # set the options corresponding to args.use_gpu\n    run_opts = common_train_lib.RunOpts()\n    if args.use_gpu in [\"true\", \"false\"]:\n        args.use_gpu = (\"yes\" if args.use_gpu == \"true\" else \"no\")\n    if args.use_gpu in [\"yes\", \"wait\"]:\n        if not common_lib.check_if_cuda_compiled():\n            logger.warning(\n                \"\"\"You are running with one thread but you have not compiled\n                   for CUDA.  You may be running a setup optimized for GPUs.\n                   If you have GPUs and have nvcc installed, go to src/ and do\n                   ./configure; make\"\"\")\n\n        run_opts.train_queue_opt = \"--gpu 1\"\n        run_opts.parallel_train_opts = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_queue_opt = \"--gpu 1\"\n        run_opts.combine_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n\n    else:\n        logger.warning(\"Without using a GPU this will be very slow. 
\"\n                       \"nnet3 does not yet support multiple threads.\")\n\n        run_opts.train_queue_opt = \"\"\n        run_opts.parallel_train_opts = \"--use-gpu=no\"\n        run_opts.combine_queue_opt = \"\"\n        run_opts.combine_gpu_opt = \"--use-gpu=no\"\n\n    run_opts.command = args.command\n    run_opts.egs_command = (args.egs_command\n                            if args.egs_command is not None else\n                            args.command)\n\n    return [args, run_opts]\n\n\ndef train(args, run_opts):\n    \"\"\" The main function for training.\n\n    Args:\n        args: a Namespace object with the required parameters\n            obtained from the function process_args()\n        run_opts: RunOpts object obtained from the process_args()\n    \"\"\"\n\n    arg_string = pprint.pformat(vars(args))\n    logger.info(\"Arguments for the experiment\\n{0}\".format(arg_string))\n\n    # Check files\n    chain_lib.check_for_required_files(args.feat_dir, args.tree_dir,\n                                       args.lat_dir if args.egs_dir is None\n                                       else None)\n\n    # Copy phones.txt from tree-dir to dir. 
Later, steps/nnet3/decode.sh will\n    # use it to check compatibility between training and decoding phone-sets.\n    shutil.copy('{0}/phones.txt'.format(args.tree_dir), args.dir)\n\n    # Set some variables.\n    if args.egs_nj <= 0:\n        num_jobs = common_lib.get_number_of_jobs(args.tree_dir)\n    else:\n        num_jobs = args.egs_nj\n    feat_dim = common_lib.get_feat_dim(args.feat_dir)\n    ivector_dim = common_lib.get_ivector_dim(args.online_ivector_dir)\n    ivector_id = common_lib.get_ivector_extractor_id(args.online_ivector_dir)\n\n    # split the training data into parts for individual jobs\n    # we will use the same number of jobs as that used for alignment\n    common_lib.execute_command(\"utils/split_data.sh {0} {1}\"\n                               \"\".format(args.feat_dir, num_jobs))\n    with open('{0}/num_jobs'.format(args.dir), 'w') as f:\n        f.write(str(num_jobs))\n\n    if args.input_model is None:\n        config_dir = '{0}/configs'.format(args.dir)\n        var_file = '{0}/vars'.format(config_dir)\n\n        variables = common_train_lib.parse_generic_config_vars_file(var_file)\n    else:\n        # If args.input_model is specified, the model left and right contexts\n        # are computed using input_model.\n        variables = common_train_lib.get_input_model_info(args.input_model)\n\n    # Set some variables.\n    try:\n        model_left_context = variables['model_left_context']\n        model_right_context = variables['model_right_context']\n    except KeyError as e:\n        raise Exception(\"KeyError {0}: Variables need to be defined in \"\n                        \"{1}\".format(str(e), '{0}/configs'.format(args.dir)))\n\n    left_context = args.chunk_left_context + model_left_context\n    right_context = args.chunk_right_context + model_right_context\n    left_context_initial = (args.chunk_left_context_initial + model_left_context if\n                            args.chunk_left_context_initial >= 0 else -1)\n    
right_context_final = (args.chunk_right_context_final + model_right_context if\n                           args.chunk_right_context_final >= 0 else -1)\n\n    # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n    # matrix.  This first config just does any initial splicing that we do;\n    # we do this as it's a convenient way to get the stats for the 'lda-like'\n    # transform.\n    if (args.stage <= -6):\n        logger.info(\"Creating phone language-model\")\n        chain_lib.create_phone_lm(args.dir, args.tree_dir, run_opts,\n                                  lm_opts=args.lm_opts)\n\n    if (args.stage <= -5):\n        logger.info(\"Creating denominator FST\")\n        shutil.copy('{0}/tree'.format(args.tree_dir), args.dir)\n        chain_lib.create_denominator_fst(args.dir, args.tree_dir, run_opts)\n\n    if ((args.stage <= -4) and\n            os.path.exists(\"{0}/configs/init.config\".format(args.dir))\n            and (args.input_model is None)):\n        logger.info(\"Initializing a basic network for estimating \"\n                    \"preconditioning matrix\")\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/nnet_init.log \\\n            nnet3-init --srand=-2 {dir}/configs/init.config \\\n            {dir}/init.raw\"\"\".format(command=run_opts.command,\n                                     dir=args.dir))\n\n    egs_left_context = left_context + args.frame_subsampling_factor // 2\n    egs_right_context = right_context + args.frame_subsampling_factor // 2\n    # note: the '+ args.frame_subsampling_factor / 2' is to allow for the\n    # fact that we'll be shifting the data slightly during training to give\n    # variety to the training data.\n    egs_left_context_initial = (left_context_initial +\n                                args.frame_subsampling_factor // 2 if\n                                left_context_initial >= 0 else -1)\n    egs_right_context_final = (right_context_final +\n               
                args.frame_subsampling_factor // 2 if\n                               right_context_final >= 0 else -1)\n\n    default_egs_dir = '{0}/egs'.format(args.dir)\n\n    if (args.egs_dir is not None) and (args.cmvn_opts != \"--norm-means=false --norm-vars=false\"):\n        logger.warning(\"the --feat.cmvn-opts option has no effect because we are not dumping egs\")\n\n    if (args.egs_dir is not None) and (args.frames_per_iter != 800000):\n        logger.warning(\"the --trainer.frames-per-iter option has no effect because we are not dumping egs\")\n\n    if ((args.stage <= -3) and args.egs_dir is None):\n        logger.info(\"Generating egs\")\n        if (not os.path.exists(\"{0}/den.fst\".format(args.dir)) or\n                not os.path.exists(\"{0}/normalization.fst\".format(args.dir)) or\n                not os.path.exists(\"{0}/tree\".format(args.dir))):\n            raise Exception(\"Chain egs generation expects {0}/den.fst, \"\n                            \"{0}/normalization.fst and {0}/tree \"\n                            \"to exist.\".format(args.dir))\n        # this is where get_egs.sh is called.\n        chain_lib.generate_chain_egs(\n            dir=args.dir, data=args.feat_dir,\n            lat_dir=args.lat_dir, egs_dir=default_egs_dir,\n            left_context=egs_left_context,\n            right_context=egs_right_context,\n            left_context_initial=egs_left_context_initial,\n            right_context_final=egs_right_context_final,\n            run_opts=run_opts,\n            left_tolerance=args.left_tolerance,\n            right_tolerance=args.right_tolerance,\n            frame_subsampling_factor=args.frame_subsampling_factor,\n            alignment_subsampling_factor=args.alignment_subsampling_factor,\n            frames_per_eg_str=args.chunk_width,\n            srand=args.srand,\n            egs_opts=args.egs_opts,\n            cmvn_opts=args.cmvn_opts,\n            online_ivector_dir=args.online_ivector_dir,\n            
frames_per_iter=args.frames_per_iter,\n            stage=args.egs_stage)\n\n    if args.egs_dir is None:\n        egs_dir = default_egs_dir\n    else:\n        egs_dir = args.egs_dir\n\n    [egs_left_context, egs_right_context,\n     frames_per_eg_str, num_archives] = (\n         common_train_lib.verify_egs_dir(egs_dir, feat_dim,\n                                         ivector_dim, ivector_id,\n                                         egs_left_context, egs_right_context,\n                                         egs_left_context_initial,\n                                         egs_right_context_final))\n    assert(args.chunk_width == frames_per_eg_str)\n    num_archives_expanded = num_archives * args.frame_subsampling_factor\n\n    if (args.num_jobs_final > num_archives_expanded):\n        raise Exception('num_jobs_final cannot exceed the '\n                        'expanded number of archives')\n\n    # copy the properties of the egs to dir for\n    # use during decoding\n    logger.info(\"Copying the properties from {0} to {1}\".format(egs_dir, args.dir))\n    common_train_lib.copy_egs_properties_to_exp_dir(egs_dir, args.dir)\n\n    if not os.path.exists('{0}/valid_diagnostic.cegs'.format(egs_dir)):\n        if (not os.path.exists('{0}/valid_diagnostic.scp'.format(egs_dir))):\n            raise Exception('Neither {0}/valid_diagnostic.cegs nor '\n                            '{0}/valid_diagnostic.scp exist.'\n                            'This script expects one of them.'.format(egs_dir))\n        use_multitask_egs = True\n    else:\n        use_multitask_egs = False\n\n    if ((args.stage <= -2) and (os.path.exists(args.dir+\"/configs/init.config\"))\n            and (args.input_model is None)):\n        logger.info('Computing the preconditioning matrix for input features')\n\n        chain_lib.compute_preconditioning_matrix(\n            args.dir, egs_dir, num_archives, run_opts,\n            max_lda_jobs=args.max_lda_jobs,\n            
rand_prune=args.rand_prune,\n            use_multitask_egs=use_multitask_egs)\n\n    if (args.stage <= -1):\n        logger.info(\"Preparing the initial acoustic model.\")\n        chain_lib.prepare_initial_acoustic_model(args.dir, run_opts,\n                                                 input_model=args.input_model)\n\n    with open(\"{0}/frame_subsampling_factor\".format(args.dir), \"w\") as f:\n        f.write(str(args.frame_subsampling_factor))\n\n    # set num_iters so that as close as possible, we process the data\n    # $num_epochs times, i.e. $num_iters*$avg_num_jobs) ==\n    # $num_epochs*$num_archives, where\n    # avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n    num_archives_to_process = int(args.num_epochs * num_archives_expanded)\n    num_archives_processed = 0\n    num_iters = ((num_archives_to_process * 2)\n                 // (args.num_jobs_initial + args.num_jobs_final))\n\n    # If do_final_combination is True, compute the set of models_to_combine.\n    # Otherwise, models_to_combine will be none.\n    if args.do_final_combination:\n        models_to_combine = common_train_lib.get_model_combine_iters(\n            num_iters, args.num_epochs,\n            num_archives_expanded, args.max_models_combine,\n            args.num_jobs_final)\n    else:\n        models_to_combine = None\n\n    min_deriv_time = None\n    max_deriv_time_relative = None\n    if args.deriv_truncate_margin is not None:\n        min_deriv_time = -args.deriv_truncate_margin - model_left_context\n        max_deriv_time_relative = \\\n           args.deriv_truncate_margin + model_right_context\n\n    logger.info(\"Training will run for {0} epochs = \"\n                \"{1} iterations\".format(args.num_epochs, num_iters))\n\n    for iter in range(num_iters):\n        if (args.exit_stage is not None) and (iter == args.exit_stage):\n            logger.info(\"Exiting early due to --exit-stage {0}\".format(iter))\n            return\n\n        current_num_jobs = 
common_train_lib.get_current_num_jobs(\n            iter, num_iters,\n            args.num_jobs_initial, args.num_jobs_step, args.num_jobs_final)\n\n        if args.stage <= iter:\n            model_file = \"{dir}/{iter}.mdl\".format(dir=args.dir, iter=iter)\n\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n                                                       num_iters,\n                                                       num_archives_processed,\n                                                       num_archives_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n            shrinkage_value = 1.0 - (args.proportional_shrink * lrate)\n            if shrinkage_value <= 0.5:\n                raise Exception(\"proportional-shrink={0} is too large, it gives \"\n                                \"shrink-value={1}\".format(args.proportional_shrink,\n                                                          shrinkage_value))\n            if args.shrink_value < shrinkage_value:\n                shrinkage_value = (args.shrink_value\n                                   if common_train_lib.should_do_shrinkage(\n                                       iter, model_file,\n                                       args.shrink_saturation_threshold)\n                                   else shrinkage_value)\n\n            percent = num_archives_processed * 100.0 / num_archives_to_process\n            epoch = (num_archives_processed * args.num_epochs\n                     / num_archives_to_process)\n            shrink_info_str = ''\n            if shrinkage_value != 1.0:\n                shrink_info_str = 'shrink: {0:0.5f}'.format(shrinkage_value)\n            logger.info(\"Iter: {0}/{1}   Jobs: {2}   \"\n                        \"Epoch: {3:0.2f}/{4:0.1f} ({5:0.1f}% complete)   \"\n                        \"lr: {6:0.6f}   
{7}\".format(iter, num_iters - 1,\n                                                    current_num_jobs,\n                                                    epoch, args.num_epochs,\n                                                    percent,\n                                                    lrate, shrink_info_str))\n\n            chain_lib.train_one_iteration(\n                dir=args.dir,\n                iter=iter,\n                srand=args.srand,\n                egs_dir=egs_dir,\n                num_jobs=current_num_jobs,\n                num_archives_processed=num_archives_processed,\n                num_archives=num_archives,\n                learning_rate=lrate,\n                dropout_edit_string=common_train_lib.get_dropout_edit_string(\n                    args.dropout_schedule,\n                    float(num_archives_processed) / num_archives_to_process,\n                    iter),\n                train_opts=' '.join(args.train_opts),\n                shrinkage_value=shrinkage_value,\n                num_chunk_per_minibatch_str=args.num_chunk_per_minibatch,\n                apply_deriv_weights=args.apply_deriv_weights,\n                min_deriv_time=min_deriv_time,\n                max_deriv_time_relative=max_deriv_time_relative,\n                l2_regularize=args.l2_regularize,\n                xent_regularize=args.xent_regularize,\n                leaky_hmm_coefficient=args.leaky_hmm_coefficient,\n                momentum=args.momentum,\n                max_param_change=args.max_param_change,\n                shuffle_buffer_size=args.shuffle_buffer_size,\n                frame_subsampling_factor=args.frame_subsampling_factor,\n                run_opts=run_opts,\n                backstitch_training_scale=args.backstitch_training_scale,\n                backstitch_training_interval=args.backstitch_training_interval,\n                use_multitask_egs=use_multitask_egs)\n\n            if args.cleanup:\n                # do a clean up 
everything but the last 2 models, under certain\n                # conditions\n                common_train_lib.remove_model(\n                    args.dir, iter-2, num_iters, models_to_combine,\n                    args.preserve_model_interval)\n\n            if args.email is not None:\n                reporting_iter_interval = num_iters * args.reporting_interval\n                if iter % reporting_iter_interval == 0:\n                    # lets do some reporting\n                    [report, times, data] = (\n                        nnet3_log_parse.generate_acc_logprob_report(\n                            args.dir, \"log-probability\"))\n                    message = report\n                    subject = (\"Update : Expt {dir} : \"\n                               \"Iter {iter}\".format(dir=args.dir, iter=iter))\n                    common_lib.send_mail(message, subject, args.email)\n\n        num_archives_processed = num_archives_processed + current_num_jobs\n\n    if args.stage <= num_iters:\n        if args.do_final_combination:\n            logger.info(\"Doing final combination to produce final.mdl\")\n            chain_lib.combine_models(\n                dir=args.dir, num_iters=num_iters,\n                models_to_combine=models_to_combine,\n                num_chunk_per_minibatch_str=args.num_chunk_per_minibatch,\n                egs_dir=egs_dir,\n                leaky_hmm_coefficient=args.leaky_hmm_coefficient,\n                l2_regularize=args.l2_regularize,\n                xent_regularize=args.xent_regularize,\n                run_opts=run_opts,\n                max_objective_evaluations=args.max_objective_evaluations,\n                use_multitask_egs=use_multitask_egs)\n        else:\n            logger.info(\"Copying the last-numbered model to final.mdl\")\n            common_lib.force_symlink(\"{0}.mdl\".format(num_iters),\n                                     \"{0}/final.mdl\".format(args.dir))\n            
chain_lib.compute_train_cv_probabilities(\n                dir=args.dir, iter=num_iters, egs_dir=egs_dir,\n                l2_regularize=args.l2_regularize, xent_regularize=args.xent_regularize,\n                leaky_hmm_coefficient=args.leaky_hmm_coefficient,\n                run_opts=run_opts,\n                use_multitask_egs=use_multitask_egs)\n            common_lib.force_symlink(\"compute_prob_valid.{iter}.log\"\n                                     \"\".format(iter=num_iters),\n                                     \"{dir}/log/compute_prob_valid.final.log\".format(\n                                         dir=args.dir))\n\n    if args.cleanup:\n        logger.info(\"Cleaning up the experiment directory \"\n                    \"{0}\".format(args.dir))\n        remove_egs = args.remove_egs\n        if args.egs_dir is not None:\n            # this egs_dir was not created by this experiment so we will not\n            # delete it\n            remove_egs = False\n\n        # leave the last-two-numbered models, for diagnostic reasons.\n        common_train_lib.clean_nnet_dir(\n            args.dir, num_iters - 1, egs_dir,\n            preserve_model_interval=args.preserve_model_interval,\n            remove_egs=remove_egs)\n\n    # do some reporting\n    [report, times, data] = nnet3_log_parse.generate_acc_logprob_report(\n        args.dir, \"log-probability\")\n    if args.email is not None:\n        common_lib.send_mail(report, \"Update : Expt {0} : \"\n                                     \"complete\".format(args.dir), args.email)\n\n    with open(\"{dir}/accuracy.report\".format(dir=args.dir), \"w\") as f:\n        f.write(report)\n\n    common_lib.execute_command(\"steps/info/chain_dir_info.pl \"\n                               \"{0}\".format(args.dir))\n\n\ndef main():\n    [args, run_opts] = get_args()\n    try:\n        train(args, run_opts)\n        common_lib.wait_for_background_commands()\n    except BaseException as e:\n        # look for 
BaseException so we catch KeyboardInterrupt, which is\n        # what we get when a background thread dies.\n        if args.email is not None:\n            message = (\"Training session for experiment {dir} \"\n                       \"died due to an error.\".format(dir=args.dir))\n            common_lib.send_mail(message, message, args.email)\n        if not isinstance(e, KeyboardInterrupt):\n            traceback.print_exc()\n        sys.exit(1)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/chain/train_tdnn.sh",
    "content": "#!/usr/bin/env bash\n\n# THIS SCRIPT IS DEPRECATED, see ./train.py\n\n# note, TDNN is the same as what we used to call multisplice.\n# This version of the script, nnet3/chain/train_tdnn.sh, is for 'chain' systems.\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=10      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\n                   # Be careful with this: we actually go over the data\n                   # num-epochs * frame-subsampling-factor times, due to\n                   # using different data-shifts.\napply_deriv_weights=true\ninitial_effective_lrate=0.0002\nfinal_effective_lrate=0.00002\nextra_left_context=0  # actually for recurrent setups.\npnorm_input_dim=3000\npnorm_output_dim=300\nrelu_dim=  # you can use this to make it use ReLU's instead of p-norms.\n\njesus_opts=  # opts to steps/nnet3/make_jesus_configs.py.\n             # If nonempty, assumes you want to use the jesus nonlinearity,\n             # and you should supply various options to that script in\n             # this string.\nrand_prune=4.0 # Relates to a speedup we do for LDA.\nminibatch_size=512  # This default is suitable for GPU-based training.\n                    # Set it to 128 for multi-threaded CPU-based training.\nlm_opts=   # options to chain-est-phone-lm\nl2_regularize=0.0\nleaky_hmm_coefficient=0.00001\nxent_regularize=0.0\nframes_per_iter=800000  # each iteration of training, see this many [input]\n                        # frames per job.  
This option is passed to get_egs.sh.\n                        # Aim for about a minute of training time\nright_tolerance=5  # tolerance at the same frame-rate as the alignment directory.\nleft_tolerance=5    # tolerance at the same frame-rate as the alignment directory.\nnum_jobs_initial=1  # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nframe_subsampling_factor=3  # ratio of frames-per-second of features we train\n                            # on, to chain model's output\nalignment_subsampling_factor=3  # ratio of frames-per-second of input alignments\n                                # to chain model's output\nget_egs_stage=0    # can be used for rerunning after partial\nonline_ivector_dir=\nmax_param_change=2.0\nremove_egs=true  # set to false to disable removing egs after training is done.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\nngram_order=3\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\nfinal_layer_normalize_target=1.0  # you can set this to less than one if you\n                                  # think the final layer is learning too fast\n                                  # compared with the other layers.\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nstage=-7\nexit_stage=-100 # you can set this to terminate the training early.  Exits before running this stage\n\n\n# count space-separated fields in splice_indexes to get num-hidden-layers.\nsplice_indexes=\"-4,-3,-2,-1,0,1,2,3,4  0  -2,2  0  -4,4 0\"\n\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\nrandprune=4.0 # speeds up LDA.\nuse_gpu=true    # if true, we run on GPU.\ncleanup=true\negs_dir=\nmax_lda_jobs=20  # use no more than 20 jobs for the LDA accumulation.\nlda_opts=\negs_opts=\ntransform_dir=     # If supplied, this dir used instead of latdir to find transforms.\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nframes_per_eg=25   # number of frames of output per chunk.  
To be passed on to get_egs.sh.\nleft_deriv_truncate=   # number of time-steps to avoid using the deriv of, on the left.\nright_deriv_truncate=  # number of time-steps to avoid using the deriv of, on the right.\n\n# End configuration section.\n\ntrap 'for pid in $(jobs -pr); do kill -TERM $pid; done' INT QUIT TERM\n\n\necho \"$0: THIS SCRIPT IS DEPRECATED\"\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <tree-dir> <phone-lattice-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train exp/chain/tri3b_tree exp/tri3_latali exp/chain/tdnn_a\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|10>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.0002>         # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.00002>          # effective learning rate at end of training.\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 
2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job, for CPU-based training (will affect\"\n  echo \"                                                   # results as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... note, you might have to reduce --mem\"\n  echo \"                                                   # versus your defaults, because it gets multiplied by the --num-threads argument.\"\n  echo \"  --io-opts <opts|\\\"--max-jobs-run 10\\\">                      # Options given to e.g. queue.pl for jobs that do a lot of I/O.\"\n  echo \"  --minibatch-size <minibatch-size|512>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. 
>2k).\"\n  echo \"  --frames-per-iter <#frames|800000>               # Number of frames of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --stage <stage|-7>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\ntreedir=$2\nlatdir=$3\ndir=$4\n\n\n# Check some files.\nfor f in $data/feats.scp $treedir/ali.1.gz $treedir/final.mdl $treedir/tree \\\n    $latdir/lat.1.gz $latdir/final.mdl $latdir/num_jobs $latdir/splice_opts; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# Make sure the experiment dir exists before anything is copied into it.\nmkdir -p $dir\n\n# Copy phones.txt from tree-dir to dir. 
Later, steps/nnet3/decode.sh will\n# use it to check compatibility between training and decoding phone-sets.\ncp $treedir/phones.txt $dir\n\n# Set some variables.\nnj=`cat $treedir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $treedir/tree $dir\n\n\n# First work out the feature and iVector dimension, needed for tdnn config creation.\nfeat_dim=$(feat-to-dim --print-args=false scp:$data/feats.scp -) || \\\n  { echo \"$0: Error getting feature dim\"; exit 1; }\n\nif [ -z \"$online_ivector_dir\" ]; then\n  ivector_dim=0\nelse\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\nfi\n\nif  [ $stage -le -7 ]; then\n  echo \"$0: creating phone language-model\"\n\n  $cmd $dir/log/make_phone_lm.log \\\n    chain-est-phone-lm $lm_opts \\\n     \"ark:gunzip -c $treedir/ali.*.gz | ali-to-phones $treedir/final.mdl ark:- ark:- |\" \\\n     $dir/phone_lm.fst || exit 1\nfi\n\nif [ $stage -le -6 ]; then\n  echo \"$0: creating denominator FST\"\n  copy-transition-model $treedir/final.mdl $dir/0.trans_mdl\n  $cmd $dir/log/make_den_fst.log \\\n    chain-make-den-fst $dir/tree $dir/0.trans_mdl $dir/phone_lm.fst \\\n       $dir/den.fst $dir/normalization.fst || exit 1;\nfi\n\n# work out num-leaves\nnum_leaves=$(am-info $dir/0.trans_mdl | grep -w pdfs | awk '{print $NF}') || exit 1;\n[ $num_leaves -gt 0 ] || exit 1;\n\nif [ $stage -le -5 ]; then\n  echo \"$0: creating neural net configs\";\n\n  if [ ! 
-z \"$jesus_opts\" ]; then\n    $cmd $dir/log/make_configs.log \\\n       python steps/nnet3/make_jesus_configs.py \\\n      --xent-regularize=$xent_regularize \\\n      --include-log-softmax=false \\\n      --splice-indexes \"$splice_indexes\"  \\\n      --feat-dim $feat_dim \\\n      --ivector-dim $ivector_dim  \\\n       $jesus_opts \\\n      --num-targets $num_leaves \\\n      $dir/configs || exit 1;\n  else\n    [ $xent_regularize != \"0.0\" ] && \\\n      echo \"$0: --xent-regularize option not supported by tdnn/make_configs.py.\" && exit 1;\n    if [ ! -z \"$relu_dim\" ]; then\n      dim_opts=\"--relu-dim $relu_dim\"\n    else\n      dim_opts=\"--pnorm-input-dim $pnorm_input_dim --pnorm-output-dim  $pnorm_output_dim\"\n    fi\n\n    python steps/nnet3/tdnn/make_configs.py $pool_opts \\\n      --include-log-softmax=false \\\n      --final-layer-normalize-target $final_layer_normalize_target \\\n      --splice-indexes \"$splice_indexes\"  \\\n      --feat-dim $feat_dim \\\n      --ivector-dim $ivector_dim  \\\n      $dim_opts \\\n      --num-targets $num_leaves \\\n      --use-presoftmax-prior-scale false \\\n      $dir/configs || exit 1;\n  fi\n\n  # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n  # matrix.  This first config just does any initial splicing that we do;\n  # we do this as it's a convenient way to get the stats for the 'lda-like'\n  # transform.\n  $cmd $dir/log/nnet_init.log \\\n    nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;\nfi\n\n# sourcing the \"vars\" below sets\n# left_context=(something)\n# right_context=(something)\n# num_hidden_layers=(something)\n. 
$dir/configs/vars || exit 1;\n\n# the next 2 lines are in case the configs were created by an older\n# config-generating script, which writes to left_context and right_context\n# instead of model_left_context and model_right_context.\n[ -z $model_left_context ] && model_left_context=$left_context\n[ -z $model_right_context ] && model_right_context=$right_context\n\n! [ \"$num_hidden_layers\" -gt 0 ] && echo \\\n \"$0: Expected num_hidden_layers to be defined\" && exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$latdir\n\nif [ $stage -le -4 ] && [ -z \"$egs_dir\" ]; then\n  extra_opts=()\n  [ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n  [ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n  extra_opts+=(--transform-dir $transform_dir)\n  # we need a bit of extra left-context and right-context to allow for frame\n  # shifts (we use shifted version of the data for more variety).\n  extra_opts+=(--left-context $[$model_left_context+$frame_subsampling_factor/2+$extra_left_context])\n  extra_opts+=(--right-context $[$model_right_context+$frame_subsampling_factor/2])\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet3/chain/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --frames-per-iter $frames_per_iter --stage $get_egs_stage \\\n      --cmd \"$cmd\" \\\n      --right-tolerance \"$right_tolerance\" \\\n      --left-tolerance \"$left_tolerance\" \\\n      --frames-per-eg $frames_per_eg \\\n      --frame-subsampling-factor $frame_subsampling_factor \\\n      --alignment-subsampling-factor $alignment_subsampling_factor \\\n      $data $dir $latdir $dir/egs || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n\nif [ \"$feat_dim\" != \"$(cat $egs_dir/info/feat_dim)\" ]; then\n  echo \"$0: feature dimension mismatch with egs in $egs_dir: $feat_dim vs $(cat $egs_dir/info/feat_dim)\";\n  exit 1;\nfi\nif [ \"$ivector_dim\" != \"$(cat $egs_dir/info/ivector_dim)\" ]; then\n  echo \"$0: ivector dimension 
mismatch with egs in $egs_dir: $ivector_dim vs $(cat $egs_dir/info/ivector_dim)\";\n  exit 1;\nfi\n\n# copy any of the following that exist, to $dir.\ncp $egs_dir/{cmvn_opts,splice_opts,final.mat} $dir 2>/dev/null\n\n# confirm that the egs_dir has the necessary context (especially important if\n# the --egs-dir option was used on the command line).\negs_left_context=$(cat $egs_dir/info/left_context) || exit 1\negs_right_context=$(cat $egs_dir/info/right_context) || exit 1\n( [ $egs_left_context -lt $model_left_context ] || \\\n  [ $egs_right_context -lt $model_right_context ] ) && \\\n   echo \"$0: egs in $egs_dir have too little context\" && exit 1;\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\nnum_archives_expanded=$[$num_archives*$frame_subsampling_factor]\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --num-jobs-initial cannot exceed --num-jobs-final\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --num-jobs-final cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\nif [ $stage -le -3 ]; then\n  echo \"$0: getting preconditioning matrix for input features.\"\n  num_lda_jobs=$num_archives\n  [ $num_lda_jobs -gt $max_lda_jobs ] && num_lda_jobs=$max_lda_jobs\n\n  # Write stats with the same format as stats for LDA.\n  $cmd JOB=1:$num_lda_jobs $dir/log/get_lda_stats.JOB.log \\\n      nnet3-chain-acc-lda-stats --rand-prune=$rand_prune \\\n         $dir/init.raw \"ark:$egs_dir/cegs.JOB.ark\" $dir/JOB.lda_stats || exit 1;\n\n  all_lda_accs=$(for n in $(seq $num_lda_jobs); do echo $dir/$n.lda_stats; done)\n  $cmd $dir/log/sum_transform_stats.log \\\n    sum-lda-accs $dir/lda_stats $all_lda_accs || exit 1;\n\n  rm $all_lda_accs || exit 1;\n\n  # this computes a fixed affine transform computed in the way we 
described in\n  # Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled variant\n  # of an LDA transform but without dimensionality reduction.\n  $cmd $dir/log/get_transform.log \\\n     nnet-get-feature-transform $lda_opts $dir/lda.mat $dir/lda_stats || exit 1;\n\n  ln -sf ../lda.mat $dir/configs/lda.mat\nfi\n\nif [ $stage -le -1 ]; then\n  # Add the first layer; this will add in the lda.mat and\n  # presoftmax_prior_scale.vec.\n\n  echo \"$0: creating initial raw model\"\n  $cmd $dir/log/add_first_layer.log \\\n       nnet3-init --srand=-1 $dir/init.raw $dir/configs/layer1.config $dir/0.raw || exit 1;\n\n\n  # The model-format for a 'chain' acoustic model is just the transition\n  # model and then the raw nnet, so we can use 'cat' to create this, as\n  # long as they have the same mode (binary or not binary).\n  # We ensure that they have the same mode (even if someone changed the\n  # script to make one or both of them text mode) by copying them both\n  # before concatenating them.\n\n  echo \"$0: creating initial model\"\n  $cmd $dir/log/init_model.log \\\n    nnet3-am-init $dir/0.trans_mdl $dir/0.raw $dir/0.mdl || exit 1;\nfi\n\necho $frame_subsampling_factor >$dir/frame_subsampling_factor || exit 1;\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\n! 
[ $num_iters -gt $[$num_hidden_layers*$add_layers_period+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif $use_gpu; then\n  parallel_suffix=\"\"\n  train_queue_opt=\"--gpu 1\"\n  combine_queue_opt=\"--gpu 1\"\n  prior_gpu_opt=\"--use-gpu=yes\"\n  prior_queue_opt=\"--gpu 1\"\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  parallel_train_opts=\"--use-gpu=no\"\n  train_queue_opt=\"--num-threads $num_threads\"\n  combine_queue_opt=\"\"  # the combine stage will be quite slow if not using\n                        # GPU, as we didn't enable that program to use\n                        # multiple threads.\n  prior_gpu_opt=\"--use-gpu=no\"\n  prior_queue_opt=\"\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n# First work out how many iterations we want to combine over in the final\n# nnet3-combine-fast invocation.  (We may end up subsampling from these if the\n# number exceeds max_models_combine).  
The number we use is:\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     1/2 * iters_after_last_layer_added)\nnum_iters_combine=$max_models_combine\nif [ $num_iters_combine -lt $approx_iters_per_epoch_final ]; then\n   num_iters_combine=$approx_iters_per_epoch_final\nfi\nhalf_iters_after_add_layers=$[($num_iters-$finish_add_layers_iter)/2]\nif [ $num_iters_combine -gt $half_iters_after_add_layers ]; then\n  num_iters_combine=$half_iters_after_add_layers\nfi\nfirst_model_combine=$[$num_iters-$num_iters_combine+1]\n\nx=0\n\nderiv_time_opts=\n[ ! -z \"$left_deriv_truncate\" ] && deriv_time_opts=\"--optimization.min-deriv-time=$left_deriv_truncate\"\n[ ! -z \"$right_deriv_truncate\" ] && \\\n  deriv_time_opts=\"$deriv_time_opts --optimization.max-deriv-time=$((frames_per_eg - right_deriv_truncate))\"\n\n\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e \"print (($x + 1 >= $num_iters ? 
$flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet3-chain-compute-prob --l2-regularize=$l2_regularize --leaky-hmm-coefficient=$leaky_hmm_coefficient --xent-regularize=$xent_regularize \\\n          \"nnet3-am-copy --raw=true $dir/$x.mdl -|\" $dir/den.fst \\\n          \"ark,bg:nnet3-chain-merge-egs ark:$egs_dir/valid_diagnostic.cegs ark:- |\" &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet3-chain-compute-prob --l2-regularize=$l2_regularize --leaky-hmm-coefficient=$leaky_hmm_coefficient  --xent-regularize=$xent_regularize \\\n          \"nnet3-am-copy --raw=true $dir/$x.mdl -|\" $dir/den.fst \\\n          \"ark,bg:nnet3-chain-merge-egs ark:$egs_dir/train_diagnostic.cegs ark:- |\" &\n\n    if [ $x -gt 0 ]; then\n      # This doesn't use the egs, it only shows the relative change in model parameters.\n      $cmd $dir/log/progress.$x.log \\\n        nnet3-show-progress --use-gpu=no \"nnet3-am-copy --raw=true $dir/$[$x-1].mdl - |\" \\\n                  \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" '&&' \\\n        nnet3-am-info $dir/$x.mdl &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging but take the\n                       # best.\n      cur_num_hidden_layers=$[1+$x/$add_layers_period]\n      config=$dir/configs/layer$cur_num_hidden_layers.config\n      mdl=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl - | nnet3-init --srand=$x - $config - |\"\n      cache_io_opts=\"\"\n    else\n      
do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      mdl=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n      cache_io_opts=\"--read-cache=$dir/cache.$x\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n      this_max_param_change=$max_param_change\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size (and we will later choose the output of just one of the jobs): the\n      # model-averaging isn't always helpful when the model is changing too fast\n      # (i.e. it can worsen the objective function), and the smaller minibatch\n      # size will help to keep the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n      this_max_param_change=$(perl -e \"print ($max_param_change/sqrt(2));\")\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    (\n      trap 'for pid in $(jobs -pr); do kill -TERM $pid; done' INT QUIT TERM\n      # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame_shift=$[($k/$num_archives)%$frame_subsampling_factor];\n        if [ $n -eq 1 ]; then\n          # opts for computation cache (storing compiled computation).\n          this_cache_io_opts=\"$cache_io_opts --write-cache=$dir/cache.$[$x+1]\"\n        
else\n          this_cache_io_opts=\"$cache_io_opts\"\n        fi\n        $cmd $train_queue_opt $dir/log/train.$x.$n.log \\\n          nnet3-chain-train --apply-deriv-weights=$apply_deriv_weights \\\n             --l2-regularize=$l2_regularize --leaky-hmm-coefficient=$leaky_hmm_coefficient --xent-regularize=$xent_regularize \\\n              $this_cache_io_opts $parallel_train_opts $deriv_time_opts \\\n             --max-param-change=$this_max_param_change \\\n            --print-interval=10 \"$mdl\" $dir/den.fst \\\n          \"ark,bg:nnet3-chain-copy-egs --frame-shift=$frame_shift ark:$egs_dir/cegs.$archive.ark ark:- | nnet3-chain-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-chain-merge-egs --minibatch-size=$this_minibatch_size ark:- ark:- |\" \\\n          $dir/$[$x+1].$n.raw || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    models_to_average=$(steps/nnet3/get_successful_models.py --difference-threshold 0.1 $this_num_jobs $dir/log/train.$x.%.log)\n    nnets_list=\n    for n in $models_to_average; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.raw\"\n    done\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet3-average $nnets_list - \\| \\\n        nnet3-am-copy --set-raw-nnet=- $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { 
$best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet3-am-copy --set-raw-nnet=$dir/$[$x+1].$n.raw  $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%10] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  rm $dir/cache.$x 2>/dev/null\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.  In the nnet3 setup, the logic\n  # for doing averaging of subsets of the models in the case where\n  # there are too many models to reliably estimate interpolation\n  # factors (max_models_combine) is moved into the nnet3-combine\n  nnets_list=()\n  for n in $(seq 0 $[num_iters_combine-1]); do\n    iter=$[$first_model_combine+$n]\n    [ ! 
-f $dir/$iter.mdl ] && echo \"$0: expected $dir/$iter.mdl to exist\" && exit 1;\n    mdl=\"nnet3-am-copy --raw=true $dir/$iter.mdl - |\"\n    nnets_list[$n]=\"$mdl\";\n  done\n\n  # Below, we use --use-gpu=no to disable nnet3-combine-fast from using a GPU,\n  # as if there are many models it can give out-of-memory error; and we set\n  # num-threads to 8 to speed it up (this isn't ideal...)\n\n  $cmd $combine_queue_opt $dir/log/combine.log \\\n    nnet3-chain-combine --num-iters=40  --l2-regularize=$l2_regularize --leaky-hmm-coefficient=$leaky_hmm_coefficient \\\n       --enforce-sum-to-one=true --enforce-positive-weights=true \\\n       --verbose=3 $dir/den.fst \"${nnets_list[@]}\" \"ark,bg:nnet3-chain-merge-egs --minibatch-size=$minibatch_size ark:$egs_dir/combine.cegs ark:-|\" \\\n       \"|nnet3-am-copy --set-raw-nnet=- $dir/$first_model_combine.mdl $dir/final.mdl\" || exit 1;\n\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet3-chain-compute-prob --l2-regularize=$l2_regularize --leaky-hmm-coefficient=$leaky_hmm_coefficient --xent-regularize=$xent_regularize \\\n           \"nnet3-am-copy --raw=true $dir/final.mdl - |\" $dir/den.fst \\\n    \"ark,bg:nnet3-chain-merge-egs ark:$egs_dir/valid_diagnostic.cegs ark:- |\" &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet3-chain-compute-prob --l2-regularize=$l2_regularize --leaky-hmm-coefficient=$leaky_hmm_coefficient --xent-regularize=$xent_regularize \\\n      \"nnet3-am-copy --raw=true $dir/final.mdl - |\" $dir/den.fst \\\n    \"ark,bg:nnet3-chain-merge-egs ark:$egs_dir/train_diagnostic.cegs ark:- |\" &\nfi\n\nif [ ! 
-f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n\nsteps/info/chain_dir_info.pl $dir\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/combine_egs.sh",
    "content": "#!/bin/bash\n#\n# Copyright 2020 Srikanth Madikeri (Idiap Research Institute)\n# Apache 2.0\n#\n# This script combines egs folders generated with chain2 recipes to prepare a single egs folder\n# for multilingual training\n\n. ./cmd.sh\nset -e\n\n# Begin configuration section\ncmd=\nblock_size=256\nstage=0\nframes_per_job=1500000\nleft_context=13\nright_context=9\n# TODO: add lang2weight support\nlang2weight=            # array of weights one per input language to scale example's output\n                        # w.r.t its input language during training.\nlang_list=\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n[[ -f local.conf ]] && . local.conf\n\nif [ $# -lt 3 ]; then\n  cat <<EOF\n  This script generates examples for multilingual LF-MMI training.\n  The input egs directories are generated with chain2 get_egs scripts.\n\n  Usage: $0 [opts] <num-input-langs,N> <lang1-egs-dir> ...<langN-egs-dir> <multilingual-egs-dir>\n   e.g.: $0 [opts] 2 exp/lang1/egs exp/lang2/egs exp/multi/egs\n\n  Options:\n      --cmd (utils/run.pl|utils/queue.pl <queue opts>)  # how to run jobs.\nEOF\n  exit 1;\nfi\n\nnum_langs=$1\nif [ $# != $[$num_langs+2] ]; then\n  echo \"$0: num of input example dirs provided is not compatible with num_langs $num_langs.\"\n  echo \"Usage: $0 [opts] <num-input-langs,N> <lang1-egs-dir> ...<langN-egs-dir> <multilingual-egs-dir>\"\n  echo \" e.g.: $0 [opts] 2 exp/lang1/egs exp/lang2/egs exp/multi/egs\"\n  exit 1;\nfi\nmegs_dir=${@: -1} # multilingual directory\nmkdir -p $megs_dir\nshift 1\nargs=(\"$@\")\n\nrequired=\"info.txt train.scp train_subset.scp heldout_subset.scp\"\ntrain_scp_list=\ntrain_diagnostic_scp_list=\nvalid_diagnostic_scp_list=\ncombine_scp_list=\n\n# we don't copy lang because there won't be a single lang\ncheck_params=\"feat_dim left_context right_context left_context_initial 
right_context_final\" \nivec_dim=`fgrep ivector_dim ${args[0]}/info.txt | awk '{print $2}'`\nif [ $ivec_dim -ne 0 ];then check_params=\"$check_params ivector_dim final.ie.id\"; fi\n\necho \"dir_type randomized_chain_egs\" > $megs_dir/info.txt\n# frames_per_chunk is not included in check_params because we allow different\n# values for different languages\nfor param in $check_params frames_per_chunk; do\n    awk \"/^$param/\" ${args[0]}/info.txt\n    \ndone >> $megs_dir/info.txt\necho \"langs ${lang_list[@]}\" >> $megs_dir/info.txt\n\ntot_num_archives=0\ntot_num_scps=0\nfor lang in $(seq 0 $[$num_langs-1]);do\n  multi_egs_dir[$lang]=${args[$lang]}\n  for f in $required; do\n    if [ ! -f ${multi_egs_dir[$lang]}/$f ]; then\n      echo \"$0: no such file ${multi_egs_dir[$lang]}/$f\" && exit 1;\n    fi\n  done\n  num_chunks=$(fgrep num_chunks ${multi_egs_dir[$lang]}/info.txt | awk '{print $2}')\n  curr_frames_per_chunk_avg=`awk '/^frames_per_chunk_avg/  {print $2;}' ${multi_egs_dir[$lang]}/info.txt`\n  tot_num_archives=$[tot_num_archives+((num_chunks*curr_frames_per_chunk_avg)/frames_per_job+1)]\n  tot_num_scps=$[tot_num_scps+num_scps]\n  train_diagnostic_scp_list=\"$train_diagnostic_scp_list ${args[$lang]}/train_subset.scp\"\n  valid_diagnostic_scp_list=\"$valid_diagnostic_scp_list ${args[$lang]}/heldout_subset.scp\"\n  for f in $check_params; do\n    if [ `grep -c \"^$f\" ${multi_egs_dir[$lang]}/info.txt` -ge 1 ]; then\n      f1=$(fgrep -m 1 $f $megs_dir/info.txt | awk '{print $2}')\n      f2=$(fgrep -m 1 $f ${multi_egs_dir[$lang]}/info.txt | awk '{print $2}')\n      if [ \"$f1\" != \"$f2\" ]  ; then\n        echo \"$0: mismatch for $f in $megs_dir vs. ${multi_egs_dir[$lang]}($f1 vs. 
$f2).\"\n        exit 1;\n      fi\n    else\n      echo \"$0: parameter $f does not exist in $megs_dir or ${multi_egs_dir[$lang]}/$f .\"\n    fi\n  done\ndone\nnum_scp_files=$tot_num_archives\necho \"num_scp_files $num_scp_files\" >> $megs_dir/info.txt\nsed_cmd=\nfor lang in $(seq 0 $[$num_langs-1]);do\n    lang_name=${lang_list[$lang]}\n    weight=`echo $lang2weight | tr ',' ' ' | cut -d ' ' -f$[$lang+1]`\n    sed_cmd=\"$sed_cmd s/.*lang=${lang_name}.*/$weight/;\"\ndone\n\ndir=$megs_dir/\nif [ $stage -le 0 ]; then\n    echo \"$0: Creating $num_scp_files scp files.\"\n    for lang in $(seq 0 $[$num_langs-1]);do\n        lang_name=${lang_list[$lang]}\n        [ ! -d $dir/temp_${lang_name}/ ] && mkdir $dir/temp_${lang_name}/\n        # randomize, append language name as a query and split input scp into $num_blocks blocks\n        utils/shuffle_list.pl ${args[$lang]}/train.scp | \\\n            awk -v lang_name=\"$lang_name\" \\\n                '{if ($1 !~ /?/){$1=$1\"?lang=\" lang_name; print;} else {$1=$1\"&lang=\" lang_name; print;}}' > $dir/temp_${lang_name}/train.shuffled.scp \n            utils/split_scp.pl $dir/temp_${lang_name}/train.shuffled.scp \\\n                $(for i in $(seq $num_scp_files); do echo $dir/temp_${lang_name}/train.$i.scp; done) || exit 1\n        # split each block into sub-blocks\n        for i in `seq $num_scp_files`; do\n            utils/split_scp.pl <(utils/shuffle_list.pl $dir/temp_${lang_name}/train.$i.scp) \\\n                $(for j in $(seq $num_scp_files); do echo $dir/temp_${lang_name}/train.$i.$j.scp; done)\n        done\n    done\n\n    for j in `seq $num_scp_files`; do\n        input_list=$(for lang in $(seq 0 $[$num_langs-1]);do lang_name=${lang_list[$lang]}; echo $dir/temp_${lang_name}/train.*.$j.scp; done)\n        # the shuffling is probably not required because we will do it once again before\n        # merging examples\n        cat $input_list | utils/shuffle_list.pl > $dir/train.$j.scp\n        sed \"$sed_cmd\" < 
<(awk '{print $1}' $dir/train.$j.scp) > $dir/train.weight.$j.ark.col2\n        paste -d ' ' <(awk '{print $1}' $dir/train.$j.scp) $dir/train.weight.$j.ark.col2 > $dir/train.weight.$j.ark\n        rm $dir/train.weight.$j.ark.col2\n    done\nfi\n\nif [ $stage -le 1 ]; then\n    for subset_file  in train_subset heldout_subset; do\n        for lang in $(seq 0 $[$num_langs-1]);do\n            lang_name=${lang_list[$lang]}\n            cat ${args[$lang]}/${subset_file}.scp  | \\\n            awk -v lang_name=\"$lang_name\" \\\n                '{if ($1 !~ /?/){$1=$1\"?lang=\" lang_name; print;} else {$1=$1\"&lang=\" lang_name; print;}}' \n        done > $dir/${subset_file}.scp\n        sed \"$sed_cmd\" < <(awk '{print $1}' $dir/${subset_file}.scp) > $dir/${subset_file}.weight.ark.col2\n        paste -d ' ' <(awk '{print $1}' $dir/${subset_file}.scp) $dir/${subset_file}.weight.ark.col2 > $dir/${subset_file}.weight.ark\n        rm $dir/${subset_file}.weight.ark.col2\n    done\nfi\n\nif [ $stage -le 2 ]; then\n    echo \"$0: Clean up\"\n    for lang in $(seq 0 $[$num_langs-1]);do\n        lang_name=${lang_list[$lang]}\n        rm -r $dir/temp_${lang_name}/\n    done\nfi\n\necho \"$0: Finished preparing multilingual training example.\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/compute_preconditioning_matrix.sh",
    "content": "#!/bin/bash\n\n# Copyright 2019 Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n\nrand_prune=4.0\nnj=8\ncmd=run.pl\nlda_acc_opts=\nlda_transform_opts=\nlda_sum_opts=\negs_opts=\nstage=0\nuse_scp=true\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n    echo \"Usage: $0 [opts] <model> <egs-folder> <lda-output-folder>\" \n    echo \"e.g. $0 exp/chain/tdnn1a_sp/configs/init.raw exp/chain/tdnn1a_sp/egs/ exp/chain/tdnn1a_sp\"\n    echo \"\"\n    echo \"This script computes pre-conditioning matrix given the model (usually init.raw file from the config folder),\"\n    echo \"egs-folder which has train.*.scp files to be used to train LDA, and\"\n    echo \"lda-output-folder that will contain lda.mat file.\"\n    echo \"\"\n    echo \"Main options (for others, see top of script file)\"\n    echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n    echo \"  --nj <int;8> # number of jobs. this is also the number of train.*.scp files in egs/\"\n    echo \"  --lda-acc-opts # options to be passed to nnet3-chain-acc-lda-stats\"\n    echo \"  --lda-sum-opts # options to be passed to sum-lda-accs\"\n    echo \"  --lda-transform-opts # options to be passed to nnet-get-feature-transform\"\n    exit 1;\nfi\n\nmodel=$1\negs=$2\nldafolder=$3\n\nif [ ! 
-d $ldafolder ]; then\n    echo \"Creating $ldafolder\"\n    mkdir -p $ldafolder || exit 1\nfi\n\n\nif [ $stage -le 0 ]; then\n        if $use_scp; then\n            egs_rspecifier=\"ark:nnet3-chain-copy-egs $egs_opts scp:$egs/train.JOB.scp ark:- |\"\n        else\n            egs_rspecifier=\"ark:nnet3-chain-copy-egs $egs_opts ark:$egs/train.JOB.ark ark:- |\"\n        fi\n        echo \"$0: Accumulating LDA stats\"\n        $cmd JOB=1:$nj $ldafolder/log/acc.JOB.log \\\n                nnet3-chain-acc-lda-stats $lda_acc_opts --rand-prune=${rand_prune} \\\n                $model \"${egs_rspecifier}\" \\\n                $ldafolder/JOB.lda_stats || exit 1\nfi\n\nif [ $stage -le 1 ]; then\n    echo \"$0: Summing LDA stats\"\n    lda_stats_files=\n    for i in `seq 1 $nj`; do\n        lda_stats_files=\"$lda_stats_files $ldafolder/$i.lda_stats\"\n    done\n\n    $cmd $ldafolder/log/sum_transform_stats.log \\\n        sum-lda-accs $lda_sum_opts $ldafolder/lda_stats $lda_stats_files || exit 1\n    rm $lda_stats_files\nfi\n\nif [ $stage -le 2 ]; then\n    echo \"$0: Computing LDA transform\"\n    $cmd $ldafolder/log/get_transform.log \\\n        nnet-get-feature-transform $lda_transform_opts \\\n        $ldafolder/lda.mat $ldafolder/lda_stats || exit 1\n\n    rm $ldafolder/lda_stats\n    # lda.mat is in $ldafolder, i.e. one level up from $ldafolder/configs.\n    ln -sf ../lda.mat $ldafolder/configs/lda.mat\nfi\n\necho \"$0: Finished computing LDA transform\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/get_raw_egs.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n#\n# This script dumps 'raw' egs for 'chain' training.  What 'raw' means in this\n# context is that they need to be further processed to merge egs of the same\n# speaker, etc.  So they won't be directly consumed by training, but by\n# by the script process_egs.sh.\n\n\n\n# Begin configuration section.\ncmd=run.pl\nframes_per_chunk=150  # Number of frames (at feature frame rate) per example.  You\n                      # are allowed to make this a comma-separated list,\n                      # e.g. 150,110,100, meaning that a range of eg widths are\n                      # allowed (but this may not be as helpful when using our\n                      # adaptation framework, since it will tend to split up\n                      # utterances into separate minibatches.\n\nframe_subsampling_factor=3 # frames-per-second of features we train on divided\n                           # by frames-per-second at output of chain model\nalignment_subsampling_factor=3 # frames-per-second of input alignments divided\n                               # by frames-per-second at output of chain model\nconstrained=true  # 'constrained=true' is the traditional setup; 'constrained=false'\n                  # gives you the 'unconstrained' egs creation in which the time\n                  # boundaries are not enforced inside chunks.\nleft_context=0    # amount of left-context per eg (i.e. extra frames of input\n                  # features not present in the output supervision).  Would\n                  # normally depend on the model context, plus desired 'extra'\n                  # context (e.g. 
for LSTM).\nright_context=0   # amount of right-context per eg.\n\nleft_context_initial=-1   # if >=0, left-context for first chunk of an utterance.\nright_context_final=-1     # if >=0, right-context for last chunk of an utterance.\n\ncompress=true   # set this to false to disable compression (e.g. if you want to\n                # see whether results are affected).  Note: if the features on\n                # disk were originally compressed, nnet3-chain-get-egs will dump\n                # compressed features regardless (since there is no further loss\n                # in that case).\n\nlang=default   # the language name.  will usually be 'default' in single-language\n               # setups.  Required because it's part of the name of some of\n               # the input files.\n\nright_tolerance=  # chain right tolerance == max label delay.  Only relevant if\n                  # constrained=true.  At frame rate of alignments.  Code\n                  # default is 5.\nleft_tolerance=   # chain left tolerance (versus alignments from lattices).\n                  # Only relevant if constrained=true.  At frame rate of\n                  # alignments.  Code default is 5.\n\nstage=0\nmax_jobs_run=40         # This should be set to the maximum number of\n                        # nnet3-chain-get-egs jobs you are comfortable to run in\n                        # parallel; you can increase it if your disk speed is\n                        # greater and you have more machines.\n\n\nsrand=0         # rand seed for nnet3-chain-get-egs, nnet3-chain-copy-egs and nnet3-chain-shuffle-egs\nonline_ivector_dir=  # can be used if we are including speaker information as iVectors.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  
This is used to turn off CMVN in the online-nnet experiments.\n\nlattice_lm_scale=     # If supplied, the graph/lm weight of the lattices will be\n                      # used (with this scale) in generating supervisions\n                      # This is 0 by default for conventional supervised training,\n                      # but may be close to 1 for the unsupervised part of the data\n                      # in semi-supervised training. The optimum is usually\n                      # 0.5 for unsupervised data.\nlattice_prune_beam=        # If supplied, the lattices will be pruned to this beam,\n                           # before being used to get supervisions.\n\nacwt=0.1   # For pruning.  Should be, for instance, 1.0 for chain lattices.\nderiv_weights_scp=\n\n# end configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <chain-dir> <lattice-dir> <raw-egs-dir>\"\n  echo \" e.g.: $0 data/train exp/chain/tdnn1a_sp exp/tri3_lats exp/chain/tdnn1a_sp/raw_egs\"\n  echo \"\"\n  echo \"From <chain-dir>, init/<lang>_trans.mdl (for the transition-model), <lang>.tree (the tree), \"\n  echo \"   den_fsts/<lang>.den.fst, and den_fsts/<lang>.normalization.fst (the normalization \"\n  echo \"   FST, derived from the denominator FST) are read (where <lang> is specified\"\n  echo \"   by the --lang option; its default value is 'default')\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options (alternative to this\"\n  echo \"                                                   # command line)\"\n  echo \"  --max-jobs-run <max-jobs-run>                    # The maximum number of jobs you want to run in\"\n  echo \"                                                   # parallel (increase this only if you have good disk 
and\"\n  echo \"                                                   # network speed).  default=40\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --frame-subsampling-factor <factor;3>            # factor by which num-frames at nnet output is reduced \"\n  echo \"  --lang       <language-name;'default'>           # Name of the language, determines names of some inputs.\"\n  echo \"  --frames-per-chunk <frames;150>                  # number of supervised frames per chunk on disk\"\n  echo \"                                                   # ... may be a comma separated list, but we advise a single\"\n  echo \"                                                   #  number in most cases, due to interaction with the need \"\n  echo \"                                                   # to group egs from the same speaker into groups.\"\n  echo \"  --left-context <int;0>                           # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <int;0>                          # Number of frames on right side to append for feature input\"\n  echo \"  --left-context-initial <int;-1>                  # Left-context for first chunk of an utterance\"\n  echo \"  --right-context-final <int;-1>                   # Right-context for last chunk of an utterance\"\n  echo \"  --lattice-lm-scale <float>                       # If supplied, the graph/lm weight of the lattices will be \"\n  echo \"                                                   # used (with this scale) in generating supervisions\"\n  echo \"  --lattice-prune-beam <float>                     # If supplied, the lattices will be pruned to this beam, \"\n  echo \"                                                   # before being used to get supervisions.\"\n  echo \"  --acwt <float;0.1>                               # Acoustic scale -- should be acoustic scale at which the \"\n  echo \"                                                  
 # supervision lattices are to be interpreted.  Affects pruning\"\n  echo \"  --deriv-weights-scp <str>                        # If supplied, adds per-frame weights to the supervision.\"\n  echo \"                                                   # (e.g., might be relevant for unsupervised training).\"\n  echo \"  --stage <stage|0>                                # Used to run this script from somewhere in\"\n  echo \"                                                   # the middle.\"\n  exit 1;\nfi\n\ndata=$1\nchaindir=$2\nlatdir=$3\ndir=$4\n\ntree=$chaindir/${lang}.tree\ntrans_mdl=$chaindir/init/${lang}_trans.mdl\nnormalization_fst=$chaindir/den_fsts/${lang}.normalization.fst\nden_fst=$chaindir/den_fsts/${lang}.den.fst\n\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $latdir/lat.1.gz $latdir/final.mdl \\\n         $tree $normalization_fst $den_fst $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=$(cat $latdir/num_jobs) || exit 1\nif [ -f $latdir/per_utt ]; then\n  sdata=$data/split${nj}utt\n  utils/split_data.sh --per-utt $data $nj\nelse\n  sdata=$data/split$nj\n  utils/split_data.sh $data $nj\nfi\n\nmkdir -p $dir/log  $dir/misc\n\ncp $tree $dir/misc/\ncopy-transition-model $trans_mdl $dir/misc/${lang}.trans_mdl\ncp $normalization_fst $den_fst $dir/misc/\ncp $data/utt2spk $dir/misc/\nif [ -f $data/utt2uniq ]; then\n  cp $data/utt2uniq $dir/misc/\nelif [ -f $dir/misc/utt2uniq ]; then\n  rm $dir/misc/utt2uniq\nfi\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $nj); do echo $dir/cegs.$x.ark; done)\nfi\n\n\nlats_rspecifier=\"ark:gunzip -c $latdir/lat.JOB.gz |\"\nif [ ! 
-z $lattice_prune_beam ]; then\n  if [ \"$lattice_prune_beam\" == \"0\" ] || [ \"$lattice_prune_beam\" == \"0.0\" ]; then\n    lats_rspecifier=\"$lats_rspecifier lattice-1best --acoustic-scale=$acwt ark:- ark:- |\"\n  else\n    lats_rspecifier=\"$lats_rspecifier lattice-prune --acoustic-scale=$acwt --beam=$lattice_prune_beam ark:- ark:- |\"\n  fi\nfi\n\negs_opts=\"--long-key=true --left-context=$left_context --right-context=$right_context --num-frames=$frames_per_chunk --frame-subsampling-factor=$frame_subsampling_factor --compress=$compress\"\n[ $left_context_initial -ge 0 ] && egs_opts=\"$egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && egs_opts=\"$egs_opts --right-context-final=$right_context_final\"\n\n[ ! -z \"$deriv_weights_scp\" ] && egs_opts=\"$egs_opts --deriv-weights-rspecifier=scp:$deriv_weights_scp\"\n\n\nchain_supervision_all_opts=\"--lattice-input=true --frame-subsampling-factor=$alignment_subsampling_factor\"\n[ ! -z $right_tolerance ] && \\\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --right-tolerance=$right_tolerance\"\n\n[ ! -z $left_tolerance ] && \\\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --left-tolerance=$left_tolerance\"\n\nif ! $constrained; then\n  # e2e supervision\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --convert-to-pdfs=false\"\n  egs_opts=\"$egs_opts --transition-model=$chaindir/0.trans_mdl\"\nfi\n\nif [ ! -z \"$lattice_lm_scale\" ]; then\n  chain_supervision_all_opts=\"$chain_supervision_all_opts --lm-scale=$lattice_lm_scale\"\n\n  normalization_fst_scale=$(perl -e \"\n  if ($lattice_lm_scale >= 1.0 || $lattice_lm_scale < 0) {\n    print STDERR \\\"Invalid --lattice-lm-scale $lattice_lm_scale\\\"; exit(1);\n  }\n  print (1.0 - $lattice_lm_scale);\") || exit 1\n  egs_opts=\"$egs_opts --normalization-fst-scale=$normalization_fst_scale\"\nfi\n\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nelse\n  ivector_opts=\"\"\nfi\n\nfeats=\"scp:$sdata/JOB/feats.scp\"\nif [ ! -z \"$cmvn_opts\" ]; then\n    if [ ! -f $data/cmvn.scp ]; then\n        echo \"Cannot find $data/cmvn.scp. But cmvn_opts=$cmvn_opts\"\n        exit 1\n    fi\n    if [ `echo $cmvn_opts | fgrep -c true` -eq 1 ]; then\n        feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n    fi\nfi\n\nif [ $stage -le 0 ]; then\n  $cmd --max-jobs-run $max_jobs_run JOB=1:$nj $dir/log/get_egs.JOB.log \\\n       lattice-align-phones --replace-output-symbols=true $latdir/final.mdl \\\n       \"$lats_rspecifier\" ark:- \\| \\\n       chain-get-supervision $chain_supervision_all_opts \\\n       $dir/misc/${lang}.tree $dir/misc/${lang}.trans_mdl ark:- ark:- \\| \\\n       nnet3-chain-get-egs $ivector_opts --srand=\\$[JOB+$srand] $egs_opts \\\n       \"$normalization_fst\" \"$feats\" ark,s,cs:- \\\n       ark,scp:$dir/cegs.JOB.ark,$dir/cegs.JOB.scp || exit 1;\nfi\n\n\nif [ $stage -le 1 ]; then\n  num_input_frames=$(steps/nnet2/get_num_frames.sh $data)\n  frames_and_chunks=$(for n in $(seq $nj); do cat $dir/log/get_egs.$n.log; done | \\\n           perl -e '$nc=0; $nf=0; while(<STDIN>) {\n     if (m/Split .+ into (\\d+) chunks/) { $this_nc = $1;  }\n     if (m/Average chunk length was (\\d+.\\d+) frames/) { $nf += $1 * $this_nc;  $nc += $this_nc; }\n    } print \"$nf $nc\"; ')\n    echo $frames_and_chunks\n  num_chunks=$(echo $frames_and_chunks | awk '{print $2}')\n  frames_per_chunk_avg=$[num_input_frames/num_chunks]\n  feat_dim=$(feat-to-dim scp:$sdata/1/feats.scp -)\n  num_leaves=$(tree-info $tree | awk '/^num-pdfs/ {print $2}')\n  if [ $left_context_initial -lt 0 ]; then\n    
left_context_initial=$left_context\n  fi\n  if [ $right_context_final -lt 0 ]; then\n    right_context_final=$right_context\n  fi\n\n  cat >$dir/info.txt <<EOF\ndir_type raw_chain_egs\nnum_input_frames $num_input_frames\nnum_chunks $num_chunks\nlang $lang\nfeat_dim $feat_dim\nnum_leaves $num_leaves\nframes_per_chunk $frames_per_chunk\nframes_per_chunk_avg $frames_per_chunk_avg\nleft_context $left_context\nleft_context_initial $left_context_initial\nright_context $right_context\nright_context_final $right_context_final\nEOF\n\n  if [ ! -z \"$online_ivector_dir\" ]; then\n      ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n      echo ivector_dim $ivector_dim >> $dir/info.txt\n      steps/nnet2/get_ivector_id.sh $online_ivector_dir || exit 1\n      echo final.ie.id `cat $online_ivector_dir/final.ie.id` >> $dir/info.txt\n      if [ ! -f $online_ivector_dir/ivector_period ]; then\n        echo \"$0: $online_ivector_dir/ivector_period does not exist\"\n        exit 1\n      fi\n      ivector_period=$(cat $online_ivector_dir/ivector_period)\n      ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\n  else\n      ivector_opts=\"\"\n  fi\n\n  if ! cat $dir/info.txt | awk '{if (NF == 1) exit(1);}'; then\n    echo \"$0: we failed to obtain at least one of the fields in $dir/info.txt\"\n    exit 1\n  fi\nfi\n\n\nif [ $stage -le 2 ]; then\n  for n in $(seq $nj); do cat $dir/cegs.$n.scp; done > $dir/all.scp\nfi\n\necho \"$0: Finished preparing raw egs\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/internal/get_best_model.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n# This script is the equivalent of get_successful_models function in the python library.\n# It takes a list of models and returns either the best model (the deafult) or a list of\n# models to average.\n\nmodels_to_average=false\ndifference_threshold=1.0\noutput=output\n\n\n# echo \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 1 ]; then\n    echo \"Usage: $0: [options] <model-1-log> <model-2-log> .... <model-N-log>\"\n    echo \"where <model-n> is one of the n models to choose from.\"\n    echo \"\"\n    echo \"--models-to-average: when true, returns the models to be averaged rather than the single best model\"\n    echo \"--difference-threshold: used to reject models. models with objf < max-value - difference_threshold are rejected\"\n    echo \"--output: the objf of the this output layer is used for model selection\"\n    echo \"\"\n    exit 1;\nfi\n\nif ! $models_to_average; then\n    if [ $# -eq 1 ]; then\n        basename $1 | tr '.' ' ' | awk '{ print $(NF-1) }'\n        exit 0;\n    fi\n    model_log_list=$(for arg in $*; do echo $arg; done)\n    first_log=$1\n    log_line=`fgrep -m 1 \"Overall average objective function for '$output' is\" $first_log`\n    colno=`echo $log_line | cut -d '=' -f1 | wc -w`\n    ((colno+=2))\n    filename=$(fgrep -m 1 \"Overall average objective function for '$output' is\" $model_log_list | \\\n        cut -d ' ' -f1,$colno | tr ':' ' ' | \\\n        awk '{print $1,$3}' | \\\n        sort -k2,2 -g | tail -1 | cut -d ' ' -f1)\n    basename $filename | tr '.' ' ' | awk '{ print $(NF-1) }'\nfi\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/internal/get_train_schedule.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2019    Johns Hopkins University (author: Daniel Povey)\n# Copyright         Hossein Hadian\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  \n\n\n# Apache 2.0.\n\n\"\"\" This script outputs information about a neural net training schedule,\n    to be used by ../train.sh, in the form of lines that can be selected\n    and sourced by the shell.\n\"\"\"\n\nimport argparse\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Output training schedule information to be consumed by ../train.sh\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\n    parser.add_argument(\"--frame-subsampling-factor\", type=int, default=3,\n                        help=\"\"\"Frame subsampling factor for the combined model\n                        (bottom+top), will normally be 3.  Required here in order\n                        to deal with frame-shifted versions of the input.\"\"\")\n    parser.add_argument(\"--initial-effective-lrate\",\n                        type=float,\n                        dest='initial_effective_lrate', default=0.001,\n                        help=\"\"\"Effective learning rate used on the first iteration,\n                        determines schedule via geometric interpolation with\n                        --final-effective-lrate.   
Actual learning rate is\n                        this times the num-jobs on that iteration.\"\"\")\n    parser.add_argument(\"--final-effective-lrate\", type=float,\n                        dest='final_effective_lrate', default=0.0001,\n                        help=\"\"\"Learning rate used on the final iteration, see\n                        --initial-effective-lrate for more documentation.\"\"\")\n    parser.add_argument(\"--num-jobs-initial\", type=int, default=1,\n                        help=\"\"\"Number of parallel neural net jobs to use at\n                        the start of training\"\"\")\n    parser.add_argument(\"--num-jobs-final\", type=int, default=1,\n                        help=\"\"\"Number of parallel neural net jobs to use at\n                        the end of training.  Would normally\n                        be >= --num-jobs-initial\"\"\")\n    parser.add_argument(\"--num-epochs\", type=float, default=4.0,\n                        help=\"\"\"The number of epochs to train for.\n                        Note: the 'real' number of times we see each\n                        utterance is this number times --frame-subsampling-factor\n                        (to cover frame-shifted copies of the data), times\n                        the value of --num-repeats given to process_egs.sh,\n                        times any factor arising from data augmentation.\"\"\")\n    parser.add_argument(\"--dropout-schedule\", type=str,\n                        help=\"\"\"Use this to specify the dropout schedule (how the dropout probability varies\n                        with time, 0 == no dropout).  
You specify a piecewise\n                        linear function on the domain [0,1], where 0 is the\n                        start and 1 is the end of training; the\n                        function-argument (x) rises linearly with the amount of\n                        data you have seen, not iteration number (this improves\n                        invariance to num-jobs-{initial-final}).  E.g. '0,0.2,0'\n                        means 0 at the start; 0.2 after seeing half the data;\n                        and 0 at the end.  You may specify the x-value of\n                        selected points, e.g.  '0,0.2@0.25,0' means that the 0.2\n                        dropout-proportion is reached a quarter of the way\n                        through the data.  The start/end x-values are at\n                        x=0/x=1, and other unspecified x-values are interpolated\n                        between known x-values.  You may specify different rules\n                        for different component-name patterns using\n                        'pattern1=func1 pattern2=func2', e.g. 'relu*=0,0.1,0\n                        lstm*=0,0.2,0'.  More general should precede less\n                        general patterns, as they are applied sequentially.\"\"\")\n\n    parser.add_argument(\"--num-scp-files\", type=int, default=0, required=True,\n                        help=\"\"\"The number of .scp files in the egs dir.\"\"\")\n    parser.add_argument(\"--schedule-out\", type=str, required=True,\n                        help=\"\"\"Output file containing the training schedule.  The output\n                        is lines, one per training iteration.\n                        Each line (one per iteration) is a list of ;-separated commands setting shell\n                        variables.  
Currently the following variables are set:\n                        iter, num_jobs, inv_num_jobs, scp_indexes, frame_shifts, dropout_opt, lrate.\n                        \"\"\")\n\n    print(sys.argv, file=sys.stderr)\n    args = parser.parse_args()\n\n    return args\n\ndef get_schedules(args):\n    num_scp_files_expanded = args.num_scp_files * args.frame_subsampling_factor\n    num_scp_files_to_process = int(args.num_epochs * num_scp_files_expanded)\n    num_scp_files_processed = 0\n    num_iters = ((num_scp_files_to_process * 2)\n                 // (args.num_jobs_initial + args.num_jobs_final))\n\n    with open(args.schedule_out, 'w', encoding='latin-1') as ostream:\n        for iter in range(num_iters):\n            current_num_jobs = int(0.5 + args.num_jobs_initial\n                                   + (args.num_jobs_final - args.num_jobs_initial)\n                                   * float(iter) / num_iters)\n            # as a special case, for iteration zero we use just one job\n            # regardless of the --num-jobs-initial and --num-jobs-final.  
This\n            # is because the model averaging does not work reliably for a\n            # freshly initialized model.\n            # if iter == 0:\n            #     current_num_jobs = 1\n\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n                                                       num_iters,\n                                                       num_scp_files_processed,\n                                                       num_scp_files_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n\n            if args.dropout_schedule == \"\":\n                args.dropout_schedule = None\n            dropout_edit_option = common_train_lib.get_dropout_edit_option(\n                args.dropout_schedule,\n                float(num_scp_files_processed) / max(1, (num_scp_files_to_process - args.num_jobs_final)),\n                iter)\n\n            frame_shifts = []\n            egs = []\n            for job in range(1, current_num_jobs + 1):\n                # k is a zero-based index that we will derive the other indexes from.\n                k = num_scp_files_processed + job - 1\n                # work out the 1-based scp index.\n                scp_index = (k % args.num_scp_files) + 1\n                # previous : frame_shift = (k/num_scp_files) % frame_subsampling_factor\n                frame_shift = ((scp_index + k // args.num_scp_files)\n                               % args.frame_subsampling_factor)\n\n                # Instead of frame shifts like [0, 1, 2], we make them more like\n                # [0, 1, -1].  
This is clearer in intent, and keeps the\n                # supervision starting at frame zero, which IIRC is a\n                # requirement somewhere in the 'chaina' code.\n#               TODO: delete this section if no longer useful\n                # if frame_shift > (args.frame_subsampling_factor // 2):\n                #     frame_shift = frame_shift - args.frame_subsampling_factor\n\n                frame_shifts.append(str(frame_shift))\n                egs.append(str(scp_index))\n\n\n            print(\"\"\"iter={iter}; num_jobs={nj}; inv_num_jobs={nj_inv}; scp_indexes=(pad {indexes}); frame_shifts=(pad {shifts}); dropout_opt=\"{opt}\"; lrate={lrate}\"\"\".format(\n                iter=iter, nj=current_num_jobs, nj_inv=(1.0 / current_num_jobs),\n                indexes = ' '.join(egs), shifts=' '.join(frame_shifts),\n                opt=dropout_edit_option, lrate=lrate), file=ostream)\n            num_scp_files_processed = num_scp_files_processed + current_num_jobs\n\n\ndef main():\n    args = get_args()\n    get_schedules(args)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/process_egs.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n#\n# This script takes nnet examples dumped by steps/chain/get_raw_egs.sh and\n# combines the chunks into groups by speaker (to the extent possible; it may\n# need to combine speakers in some cases), locally randomizes the result, and\n# dumps the resulting egs to disk.  Chunks of these will later be globally\n# randomized (at the scp level) by steps/chaina/randomize_egs.sh\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_repeats=1  # number of times we repeat the same chunks with different\n               # grouping.  \ncompress=true   # set this to false to disable compression (e.g. if you want to see whether\n                # results are affected).\n\nnum_utts_subset=300     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\n\n\nshuffle_buffer_size=5000   # Size of buffer (containing grouped egs) to use\n                           # for random shuffle.\n\nstage=0\nnj=5             # the number of parallel jobs to run.\nsrand=0\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [opts] <raw-egs-dir> <processed-egs-dir>\"\n  echo \" e.g.: $0 exp/chaina/tdnn1a_sp/raw_egs exp/chaina/tdnn1a_sp/processed_egs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options (alternative to this\"\n  echo \"                                                   # command line)\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-repeats <n;2>                              # Number of times we group the same chunks into different\"\n  echo \"                                                   # groups.  For now only the values 1 and 2 are\"\n  echo \"                                                   # recommended, due to the very simple way we choose\"\n  echo \"                                                   # the groups (it's consecutive).\"\n  echo \"  --nj       <num-jobs;5>                          # Number of jobs to run in parallel.  Usually quite a\"\n  echo \"                                                   # small number, as we'll be limited by disk access\"\n  echo \"                                                   # speed.\"\n  echo \"  --compress <bool;true>                           # True if you want the egs to be compressed\"\n  echo \"                                                   # (e.g. you may set to false for debugging purposes, to\"\n  echo \"                                                   # check that the compression is not hurting).\"\n  echo \"  --num-heldout-egs <n;200>                        # Number of egs to put in train_subset.scp and heldout_subset.scp.\"\n  echo \"                                                   # These will be used for diagnostics.  
Note: this number is\"\n  echo \"                                                   # the number of  grouped egs, after merging --chunks-per-group\"\n  echo \"                                                   # chunks into a single eg.\"\n  echo \"                                                   # ... may be a comma separated list, but we advise a single\"\n  echo \"                                                   #  number in most cases, due to interaction with the need \"\n  echo \"                                                   # to group egs from the same speaker into groups.\"\n  echo \"  --stage <stage|0>                                # Used to run this script from somewhere in\"\n  echo \"                                                   # the middle.\"\n  exit 1;\nfi\n\nraw_egs_dir=$1\ndir=$2\n\n# die on error or undefined variable.\nset -e -u\n\nif ! steps/chain2/validate_raw_egs.sh $raw_egs_dir; then\n  echo \"$0: failed to validate input directory $raw_egs_dir\"\n  exit 1\nfi\n\n\nmkdir -p $dir/temp $dir/log\n\n\nif [ $stage -le 0 ]; then\n  echo \"$0: choosing heldout_subset and train_subset\"\n\n  utt2uniq_opt=\n  if [ -f $raw_egs_dir/misc/utt2uniq ]; then\n      utt2uniq_opt=\"--utt2uniq=$raw_egs_dir/misc/utt2uniq\"\n      echo \"$0: File $raw_egs_dir/misc/utt2uniq exists, so ensuring the hold-out set\" \\\n           \"includes all perturbed versions of the same source utterance.\"\n      utils/utt2spk_to_spk2utt.pl $raw_egs_dir/misc/utt2uniq 2>/dev/null | \\\n          utils/shuffle_list.pl 2>/dev/null | \\\n            awk -v max_utt=$num_utts_subset '{\n                for (n=2;n<=NF;n++) print $n;\n                printed += NF-1;\n                if (printed >= max_utt) nextfile; }' \\\n          | fgrep -f - $raw_egs_dir/all.scp | sort -k1,1 > $dir/temp/heldout_subset.list\n  else\n      awk '{print $1}' $raw_egs_dir/misc/utt2spk | \\\n        utils/shuffle_list.pl 2>/dev/null | \\\n        head -$num_utts_subset |  fgrep -f - 
$raw_egs_dir/all.scp | sort -k1,1 > $dir/temp/heldout_subset.list\n  fi\n\n  awk '{print $1}' $raw_egs_dir/misc/utt2spk | \\\n     utils/filter_scp.pl --exclude $dir/temp/heldout_subset.list | \\\n     utils/shuffle_list.pl 2>/dev/null | \\\n     head -$num_utts_subset | fgrep -f - $raw_egs_dir/all.scp | sort -k1,1 > $dir/temp/train_subset.list\n\n  awk '{print $1}' $raw_egs_dir/misc/utt2spk | \\\n     utils/filter_scp.pl --exclude $dir/temp/heldout_subset.list | fgrep -f - $raw_egs_dir/all.scp > $dir/temp/train.list\nfi\nlen_valid_uttlist=$(wc -l < $dir/temp/heldout_subset.list)\nlen_trainsub_uttlist=$(wc -l < $dir/temp/train_subset.list)\n\nif [ $stage -le 1 ]; then\n\n  for name in heldout_subset train_subset; do\n    echo \"$0: merging and shuffling $name egs\"\n\n    cp $dir/temp/${name}.list $dir/temp/${name}.scp\n\n    $cmd $dir/log/shuffle_${name}_egs.log \\\n      nnet3-chain-shuffle-egs --srand=$srand scp:$dir/temp/${name}.scp ark,scp:$dir/${name}.ark,$dir/${name}.scp\n  done\n\n  # Split up the training list into multiple smaller lists, as it could be long.\n  utils/split_scp.pl $dir/temp/train.list  $(for j in $(seq $nj); do echo $dir/temp/train.$j.scp; done)\n\n  if [ -e $dir/storage ]; then\n    # Make soft links to storage directories, if distributing this way..  
See\n    # utils/create_split_dir.pl.\n    echo \"$0: creating data links\"\n    utils/create_data_link.pl $(for j in $(seq $nj); do echo $dir/train.$j.ark; done) || true\n  fi\n\n  $cmd JOB=1:$nj $dir/log/shuffle_train_egs.JOB.log \\\n     nnet3-chain-shuffle-egs --buffer-size=$shuffle_buffer_size \\\n         --srand=\\$[JOB+$srand] scp:$dir/temp/train.JOB.scp ark,scp:$dir/train.JOB.ark,$dir/train.JOB.scp || exit 1;\n  cat $(for j in $(seq $nj); do echo $dir/train.$j.scp; done) > $dir/train.scp\nfi\n\ncat $raw_egs_dir/info.txt  | awk  -v num_repeats=$num_repeats \\\n   '\n  /^dir_type / { print \"dir_type processed_chain_egs\"; next; }\n  /^num_input_frames / { print \"num_input_frames \"$2 * num_repeats; next; } # approximate; ignores held-out egs.\n  /^num_chunks / { print \"num_chunks \" $2 * num_repeats; next; }\n   {print;}\n  END{print \"num_repeats \" num_repeats;}' >$dir/info.txt\n\n\n\nif ! cat $dir/info.txt | awk '{if (NF == 1) exit(1);}'; then\n  echo \"$0: we failed to obtain at least one of the fields in $dir/info.txt\"\n  exit 1\nfi\n\ncp -r $raw_egs_dir/misc/ $dir/\n\n\necho \"$0: Finished processing egs\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/randomize_egs.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n#\n# This script takes nnet examples dumped by steps/chain/process_egs.sh,\n# globally randomizes the egs, and divides into multiple .scp files.  This is\n# the form of egs which is consumed by the training script.  All this is done\n# only by manipulating the contents of .scp files.  To keep locality of disk\n# access, we only randomize blocks of egs (e.g.  blocks containing 128 groups of\n# sequences).  This doesn't defeat randomization, because both process_egs.sh\n# and the training script use nnet3-shuffle-egs to do more local randomization.\n\n# Later on, we'll have a multilingual/multi-input-dir version fo this script\n# that combines egs from various data sources and possibly multiple languages.\n# This version assumes there is just one language.\n\n# Begin configuration section.\ncmd=run.pl\n\ngroups_per_block=128     # The 'groups' are the egs in the scp file from\n                         # process_egs.sh, containing '--chunks-per-group' sequences\n                         # each.\nnum_blocks=256\n\nframes_per_job=3000000   # The number of frames of data we want to process per\n                         # training job (will determine how long each job takes,\n                         # and the frequency of model averaging.  This was\n                         # previously called --frames-per-iter, but\n                         # --frames-per-job is clearer as each job does this\n                         # many.\n\nnum_groups_combine=1000  # the number of groups from the training set that we\n                         # randomly choose as input to nnet3-chain-combine;\n                         # these will go to combine.scp.  
train_subset.scp and\n                         # heldout_subset.scp are, for now, just copied over\n                         # from the input.\n\n# Later we may provide a mechanism to change the language name; for now we\n# just copy it from the input.\n\n\nsrand=0\nstage=0\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [opts] <processed-egs-dir> <randomized-egs-dir>\"\n  echo \" e.g.: $0 --frames-per-job 2000000 exp/chain/tdnn1a_sp/processed_egs exp/chain/tdnn1a_sp/egs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options (alternative to this\"\n  echo \"                                                   # command line)\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --groups-per-block <n;128>                       # The number of groups (i.e. 
previously merged egs\"\n  echo \"                                                   # containing --chunks-per-group chunks) to consider \"\n  echo \"                                                   # as one block, where whole blocks are randomized;\"\n  echo \"                                                   # smaller means more complete randomization but less\"\n  echo \"                                                   # local disk access.\"\n  echo \"  --frames-per-job <n;3000000>                     # The number of input frames (not counting context)\"\n  echo \"                                                   # that we aim to have in each scp file after\"\n  echo \"                                                   # randomization and splitting.\"\n  echo \"  --num-groups-combine <n;1000>                    # The number of randomly chosen groups to\"\n  echo \"                                                   # put in the subset in 'combine.scp' which will\"\n  echo \"                                                   # be used in nnet3-chain-combine to decide which\"\n  echo \"                                                   # models to average over.\"\n  echo \"  --stage <stage|0>                                # Used to run this script from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --srand <srand|0>                                # Random seed, affects randomization.\"\n  exit 1;\nfi\n\nprocessed_egs_dir=$1\ndir=$2\n\n# die on error or undefined variable.\nset -e -u\n\nif ! 
steps/chain2/validate_processed_egs.sh $processed_egs_dir; then\n  echo \"$0: could not validate input directory $processed_egs_dir\"\n  exit 1\nfi\n\n# Work out how many groups per job and how many frames per job we'll have\n\ninfo_in=$processed_egs_dir/info.txt\n\n# num_scp_files is the number of archives\nnum_input_frames=$(awk '/^num_input_frames/ { nif=$2; print nif}' $info_in)\nframes_per_chunk_avg=$(awk '/^frames_per_chunk_avg/ { fpc=$2; print fpc}' $info_in)\nnum_chunks=$(awk '/^num_chunks/ { nc=$2; print nc}' $info_in)\nnum_scp_files=$[(num_chunks * frames_per_chunk_avg)/frames_per_job +1]\n[ $num_scp_files -eq 0 ] && num_scp_files=1\n\nframes_per_scp_file=$[(num_chunks*frames_per_chunk_avg)/num_scp_files] # because it may be slightly different from frames_per_job\n\n\nmkdir -p $dir/temp\n\nif [ -d $dir/misc ]; then\n  rm -r $dir/misc\nfi\n\nmkdir -p $dir/misc\ncp $processed_egs_dir/misc/* $dir/misc\n\nutils/shuffle_list.pl  $processed_egs_dir/train.scp > $dir/temp/train.scp\nutils/split_scp.pl $dir/temp/train.scp $(for i in $(seq $num_blocks); do echo $dir/temp/train.$i.scp; done)\nfor i in `seq $num_blocks`; do\n    utils/split_scp.pl <(utils/shuffle_list.pl $dir/temp/train.$i.scp) $(for j in $(seq $num_scp_files); do echo $dir/temp/train.$i.$j.scp; done)\ndone\nfor j in `seq $num_scp_files`; do\n    cat $dir/temp/train.*.$j.scp | utils/shuffle_list.pl > $dir/train.$j.scp\ndone\nrm -rf $dir/temp &\n\ncp $processed_egs_dir/heldout_subset.scp $processed_egs_dir/train_subset.scp $dir/\n\n\n# note: there is only one language in $processed_egs_dir (any\n# merging would be done at the randomization stage but that is not supported yet).\n\nlang=$(awk '/^lang / { print $2; }' <$processed_egs_dir/info.txt)\n\n# We'll store info files per language, containing the part of the information\n# that is language-specific, plus a single global info.txt containing stuff that\n# is not language specific.\n# This will get more complicated once we actually support multiple 
languages,\n# and when we allow multiple input processed egs dirs for the same language.\n\ngrep -v -E '^dir_type|^lang|^feat_dim' <$processed_egs_dir/info.txt | \\\n  cat <(echo \"dir_type randomized_chain_egs\") - > $dir/info_$lang.txt\n\n\ncat <<EOF >$dir/info.txt\ndir_type randomized_chain_egs\nnum_scp_files $num_scp_files\nlangs $lang\nframes_per_scp_file $frames_per_scp_file\nEOF\n# frames_per_job, after rounding, becomes frames_per_scp_file.\n\n# note: frames_per_chunk_avg will be present in the info.txt file as well as\n# the per-language files.\ngrep -E '^feat_dim|^frames_per_chunk_avg' <$processed_egs_dir/info.txt >>$dir/info.txt\n\n\n\nif ! cat $dir/info.txt | awk '{if (NF == 1) exit(1);}'; then\n  echo \"$0: we failed to obtain at least one of the fields in $dir/info.txt\"\n  exit 1\nfi\n\n\nwait;\necho \"$0: Finished randomizing egs\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/train.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n\n\n# Begin configuration section\nstage=-2\ncmd=run.pl\ngpu_cmd_opt=\nleaky_hmm_coefficient=0.1\nxent_regularize=0.1\napply_deriv_weights=false   # you might want to set this to true in unsupervised training\n                            # scenarios.\nmemory_compression_level=2  # Enables us to use larger minibatch size than we\n                            # otherwise could, but may not be optimal for speed\n                            # (--> set to 0 if you have plenty of memory.\ndropout_schedule=\nsrand=0\nmax_param_change=2.0    # we use a smaller than normal default (it's normally\n                        # 2.0), because there are two models (bottom and top).\nuse_gpu=yes   # can be \"yes\", \"no\", \"optional\", \"wait\"\nprint_interval=10\nmomentum=0.0\nparallel_train_opts=\nverbose_opt=\n\ncommon_opts=           # Options passed through to nnet3-chain-train and nnet3-chain-combine\n\nnum_epochs=4.0   #  Note: each epoch may actually contain multiple repetitions of\n                 #  the data, for various reasons:\n                 #    using the --num-repeats option in process_egs.sh\n                 #    data augmentation\n                 #    different data shifts (this includes 3 different shifts\n                 #    of the data if frame_subsampling_factor=3 (see $dir/init/info.txt)\n\nnum_jobs_initial=1\nnum_jobs_final=1\ninitial_effective_lrate=0.001\nfinal_effective_lrate=0.0001\nminibatch_size=32  # This is how you set the minibatch size. 
\n\nmax_iters_combine=80\nmax_models_combine=20\ndiagnostic_period=5    # Get diagnostics every this-many iterations\n\nshuffle_buffer_size=1000  # This \"buffer_size\" variable controls randomization of the groups\n                          # on each iter.\n\n\nl2_regularize=\nout_of_range_regularize=0.01\nmultilingual_eg=false\n\n# End configuration section\n\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0  [options] <egs-dir>  <model-dir>\"\n  echo \" e.g.: $0 exp/chain/tdnn1a_sp/egs  exp/chain/tdnn1a_sp\"\n  echo \"\"\n  echo \"This is the default script to train acoustic models for chain2 recipes.\"\n  echo \"The script requires two arguments:\"\n  echo \"<egs-dir>: directory where egs files are stored\"\n  echo \"<model-dir>: directory where the final model will be stored\"\n  echo \"\"\n  echo \"See the top of the script to check possible options to pass to it.\"\n  exit 1\nfi\n\negs_dir=$1\ndir=$2\n\nset -e -u  # die on failed command or undefined variable\n\nsteps/chain2/validate_randomized_egs.sh $egs_dir\n\nfor f in $dir/init/info.txt; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: expected file $f to exist\"\n    exit 1\n  fi\ndone\ncat $egs_dir/info.txt >> $dir/init/info.txt\n\n\nframe_subsampling_factor=$(awk '/^frame_subsampling_factor/ {print $2}' <$dir/init/info.txt)\nnum_scp_files=$(awk '/^num_scp_files/ {print $2}' <$egs_dir/info.txt)\n\nif [ $stage -le -2 ]; then\n    echo \"$0: Generating training schedule\"\n    steps/chain2/internal/get_train_schedule.py \\\n      --frame-subsampling-factor=$frame_subsampling_factor \\\n      --num-jobs-initial=$num_jobs_initial \\\n      --num-jobs-final=$num_jobs_final \\\n      --num-epochs=$num_epochs \\\n      --dropout-schedule=\"$dropout_schedule\" \\\n      --num-scp-files=$num_scp_files \\\n      --frame-subsampling-factor=$frame_subsampling_factor \\\n      --initial-effective-lrate=$initial_effective_lrate \\\n      --final-effective-lrate=$final_effective_lrate \\\n      --schedule-out=$dir/schedule.txt\nfi\n\n\nif [ \"$use_gpu\" != \"no\" ]; then gpu_cmd_opt=\"--gpu 1\"; else gpu_cmd_opt=\"\"; fi\n\nnum_iters=$(wc -l <$dir/schedule.txt)\n\necho \"$0: will train for $num_epochs epochs = $num_iters iterations\"\n\n# source the 1st line of schedule.txt in the shell; this sets\n# lrate and dropout_opt, among other variables.\n. <(head -n 1 $dir/schedule.txt)\nlangs=$(awk '/^langs/ { $1=\"\"; print; }' <$dir/init/info.txt | tail -1)\nnum_langs=$(echo $langs | wc -w)\n\nmkdir -p $dir/log\n\n# Copy models with initial learning rate and dropout options from $dir/init to $dir/0\nif [ $stage -le -1 ]; then\n  echo \"$0: Copying transition model\"\n  if [ $num_langs -eq 1 ]; then\n      echo \"$0: Num langs is 1\"\n      cp $dir/init/default.raw $dir/0.raw\n      if [ -f $dir/init/default_trans.mdl ]; then\n          cp $dir/init/default_trans.mdl $dir/0_trans.mdl \n      fi\n  else\n      echo \"$0: Num langs is $num_langs\"\n      cp $dir/init/multi.raw $dir/0.raw\n  fi\nfi\n\n\nl2_regularize_opt=\"\"\nif [ ! 
-z $l2_regularize ]; then\n    l2_regularize_opt=\"--l2-regularize=$l2_regularize\"\nfi\n\nx=0\nif [ $stage -gt $x ]; then x=$stage; fi\n\n[ $max_models_combine -gt $[num_iters/2] ] && max_models_combine=$[num_iters/2];\ncombine_start_iter=$[num_iters+1-max_models_combine]\n\nwhile [ $x -lt $num_iters ]; do\n  # Source some variables from schedule.txt.  The effect will be something\n  # like the following:\n  # iter=0; num_jobs=2; inv_num_jobs=0.5; scp_indexes=(pad 1 2); frame_shifts=(pad 1 2); dropout_opt=\"--edits='set-dropout-proportion name=* proportion=0.0'\"; lrate=0.002\n  . <(grep \"^iter=$x;\" $dir/schedule.txt)\n\n  echo \"$0: training, iteration $x of $num_iters, num-jobs is $num_jobs\"\n\n  next_x=$[$x+1]\n  den_fst_dir=$egs_dir/misc\n  model_out_prefix=$dir/${next_x}\n  model_out=${model_out_prefix}.mdl\n  multilingual_eg_opts=\n  if $multilingual_eg; then\n       multilingual_eg_opts=\"--multilingual-eg=true\"\n  fi\n\n  # for the first 4 iterations, plus every $diagnostic_period iterations, launch\n  # some diagnostic processes.  
We don't do this on iteration 0, because\n  # the batchnorm stats wouldn't be ready\n  if [ $x -gt 0 ] && [ $[x%diagnostic_period] -eq 0 -o $x -lt 5 ]; then\n\n    [ -f $dir/.error_diagnostic ] && rm $dir/.error_diagnostic\n    for name in train heldout; do\n      egs_opts=\n      if $multilingual_eg; then\n          weight_rspecifier=$egs_dir/diagnostic_${name}.weight.ark\n          [[ -f $weight_rspecifier ]] && egs_opts=\"--weights=ark:$weight_rspecifier\"\n      fi\n      $cmd $gpu_cmd_opt $dir/log/diagnostic_${name}.$x.log \\\n         nnet3-chain-train2 --use-gpu=$use_gpu \\\n            --leaky-hmm-coefficient=$leaky_hmm_coefficient \\\n            --xent-regularize=$xent_regularize \\\n            --out-of-range-regularize=$out_of_range_regularize \\\n            $l2_regularize_opt \\\n            --print-interval=10  \\\n           \"nnet3-copy --learning-rate=$lrate $dir/${x}.raw - |\" $den_fst_dir \\\n           \"ark:nnet3-chain-copy-egs $egs_opts scp:$egs_dir/${name}_subset.scp ark:- | nnet3-chain-merge-egs $multilingual_eg_opts --minibatch-size=1:64 ark:- ark:-|\" \\\n           $dir/${next_x}_${name}.mdl || touch $dir/.error_diagnostic &\n\n       # Make sure we do not run more than $num_jobs_final at once\n       [ $num_jobs_final -eq 1 ] && wait\n\n    done\n    wait\n  fi\n\n  if [ $x -gt 0 ]; then\n    # This doesn't use the egs, it only shows the relative change in model parameters.\n    $cmd $dir/log/progress.$x.log \\\n      nnet3-show-progress --use-gpu=no $dir/$(($x-1)).raw $dir/${x}.raw '&&' \\\n        nnet3-info $dir/${x}.raw &\n  fi\n\n  cache_io_opt=\"--write-cache=$dir/cache.$next_x\"\n  if [ $x -gt 0 -a -f $dir/cache.$x ]; then\n      cache_io_opt=\"$cache_io_opt --read-cache=$dir/cache.$x\"\n  fi\n  for j in $(seq $num_jobs); do\n    scp_index=${scp_indexes[$j]}\n    frame_shift=${frame_shifts[$j]}\n\n    egs_opts=\n    if $multilingual_eg; then\n        weight_rspecifier=$egs_dir/train.weight.$scp_index.ark\n        [[ -f 
$weight_rspecifier ]] && egs_opts=\"--weights=ark:$weight_rspecifier\"\n    fi\n    $cmd $gpu_cmd_opt $dir/log/train.$x.$j.log \\\n         nnet3-chain-train2  \\\n             $parallel_train_opts $verbose_opt \\\n             --out-of-range-regularize=$out_of_range_regularize \\\n             $cache_io_opt \\\n             --use-gpu=$use_gpu --apply-deriv-weights=$apply_deriv_weights \\\n             --leaky-hmm-coefficient=$leaky_hmm_coefficient --xent-regularize=$xent_regularize \\\n             --print-interval=$print_interval --max-param-change=$max_param_change \\\n             --momentum=$momentum \\\n             --l2-regularize-factor=$inv_num_jobs \\\n             $l2_regularize_opt \\\n             --srand=$srand \\\n             \"nnet3-copy --learning-rate=$lrate $dir/${x}.raw - |\" $den_fst_dir \\\n             \"ark:nnet3-chain-copy-egs $egs_opts --frame-shift=$frame_shift scp:$egs_dir/train.$scp_index.scp ark:- | nnet3-chain-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:- | nnet3-chain-merge-egs $multilingual_eg_opts --minibatch-size=$minibatch_size ark:- ark:-|\" \\\n             ${model_out_prefix}.$j.raw || touch $dir/.error &\n  done\n  wait\n  if [ -f $dir/.error ]; then\n    echo \"$0: error detected training on iteration $x\"\n    exit 1\n  fi\n  if [ $x -ge 1 ]; then\n      models_to_average=$(for j in `seq $num_jobs`; do echo ${model_out_prefix}.$j.raw; done)\n      $cmd $dir/log/average.$x.log \\\n          nnet3-average $models_to_average $dir/$next_x.raw  || exit 1;\n      rm $models_to_average\n  else\n      lang=$(echo $langs | awk '{print $1}')\n      model_index=`steps/nnet3/chain2/internal/get_best_model.sh --output output-${lang} $dir/log/train.$x.*.log`\n      cp ${model_out_prefix}.$model_index.raw $dir/$next_x.raw\n      rm ${model_out_prefix}.*.raw\n  fi\n  # The diagnostic jobs above touch $dir/.error_diagnostic on failure.\n  [ -f $dir/.error_diagnostic ] && echo \"$0: error getting diagnostics on iter $x\" && exit 1;\n\n  if [ -f $dir/cache.$x ]; then\n      rm 
$dir/cache.$x\n  fi\n  delete_iter=$[x-2]\n  if [ $delete_iter -lt $combine_start_iter ]; then\n      if [ -f $dir/$delete_iter.raw ]; then\n          rm $dir/$delete_iter.raw\n      fi\n  fi\n  if [ -f $dir/${next_x}_train.mdl ]; then\n      rm $dir/${next_x}_{train,heldout}.mdl\n  fi\n  x=$[x+1]\ndone\n\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"$0: doing model combination\"\n  den_fst_dir=$egs_dir/misc\n  input_models=$(for x in $(seq $combine_start_iter $num_iters); do echo $dir/${x}.raw; done)\n  output_model_dir=$dir/final\n\n   $cmd $gpu_cmd_opt $dir/log/combine.log \\\n      nnet3-chain-combine2 --use-gpu=$use_gpu \\\n        --leaky-hmm-coefficient=$leaky_hmm_coefficient \\\n        --print-interval=10  \\\n        $den_fst_dir $input_models \\\n        \"ark:nnet3-chain-merge-egs $multilingual_eg_opts  scp:$egs_dir/train_subset.scp ark:-|\" \\\n        $dir/final.raw || exit 1;\n   if ! $multilingual_eg; then\n       nnet3-copy  --edits=\"rename-node old-name=output new-name=output-dummy; rename-node old-name=output-default new-name=output\" \\\n          $dir/final.raw - | \\\n          nnet3-am-init $dir/0_trans.mdl - $dir/final.mdl\n   fi\n\n   # Compute the probability of the final, combined model with\n   # the same subset we used for the previous diagnostic processes, as the\n   # different subsets will lead to different probs.\n   [ -f $dir/.error_diagnostic ] && rm $dir/.error_diagnostic\n   for name in train heldout; do\n     egs_opts=\n     if $multilingual_eg; then\n       weight_rspecifier=$egs_dir/diagnostic_${name}.weight.ark\n       [[ -f $weight_rspecifier ]] && egs_opts=\"--weights=ark:$weight_rspecifier\"\n     fi\n     $cmd $gpu_cmd_opt $dir/log/diagnostic_${name}.final.log \\\n       nnet3-chain-train2 --use-gpu=$use_gpu \\\n         --leaky-hmm-coefficient=$leaky_hmm_coefficient \\\n         --xent-regularize=$xent_regularize \\\n         --out-of-range-regularize=$out_of_range_regularize \\\n         $l2_regularize_opt \\\n    
     --print-interval=10  \\\n         $dir/final.raw  $den_fst_dir \\\n         \"ark:nnet3-chain-copy-egs $egs_opts scp:$egs_dir/${name}_subset.scp ark:- | nnet3-chain-merge-egs $multilingual_eg_opts --minibatch-size=1:64 ark:- ark:-|\" \\\n         $dir/final_${name}.mdl || touch $dir/.error_diagnostic &\n   done\n   # Wait for the background diagnostic jobs before removing their outputs.\n   wait\n\n   if [ -f $dir/final_train.mdl ]; then\n     rm $dir/final_{train,heldout}.mdl\n   fi\nfi\n\n# Note: [[ ! $multilingual_eg ]] would only test for an empty string, so run\n# the variable as a command (true/false), as is done elsewhere in this script.\nif ! $multilingual_eg && [[ ! -f $dir/final.mdl ]]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho \"$0: done\"\n\nsteps/info/chain_dir_info.pl $dir\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/validate_processed_egs.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n#\n# This script validates a directory containing 'processed' egs for 'chain'\n# training, i.e. the output of process_egs.sh.  It also helps to document the\n# expectations on such a directory.\n\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0  <processed-egs-dir>\"\n  echo \" e.g.: $0 exp/chain/tdnn1a_sp/processed_egs\"\n  echo \"\"\n  echo \"Validates that the processed-egs dir has the expected format\"\nfi\n\ndir=$1\n\n# Note: the .ark files are not actually consumed directly downstream (only via\n# the top-level .scp files), but we check them anyway for now.\nfor f in $dir/train.scp $dir/info.txt \\\n         $dir/heldout_subset.{ark,scp} $dir/train_subset.{ark,scp} \\\n         $dir/train.1.scp $dir/train.1.ark; do\n  if ! [ -f $f -a -s $f ]; then\n    echo \"$0: expected file $f to exist and be nonempty.\"\n    exit 1\n  fi\ndone\n\n\nif [ $(awk '/^dir_type/ { print $2; }' <$dir/info.txt) != \"processed_chain_egs\" ]; then\n  grep dir_type $dir/info.txt\n  echo \"$0: dir_type should be processed_chain_egs in $dir/info.txt\"\n  exit 1\nfi\n\nlang=$(awk '/^lang / {print $2; }' <$dir/info.txt)\n\nfor f in $dir/misc/$lang.{trans_mdl,normalization.fst,den.fst}; do\n  if ! [ -f $f -a -s $f ]; then\n    echo \"$0: expected file $f to exist and be nonempty.\"\n    exit 1\n  fi\ndone\n\necho \"$0: sucessfully validated processed egs in $dir\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/validate_randomized_egs.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n#\n# This script validates a directory containing 'randomized' egs for 'chain'\n# training, i.e. the output of randomize_egs.sh (this is the final form of the\n# egs which is consumed by the training script).  It also helps to document the\n# expectations on such a directory.\n\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0  <randomized-egs-dir>\"\n  echo \" e.g.: $0 exp/chain/tdnn1a_sp/egs\"\n  echo \"\"\n  echo \"Validates that the final (randomized) egs dir has the expected format\"\nfi\n\ndir=$1\n\n# Note: the .ark files are not actually consumed directly downstream (only via\n# the top-level .scp files), but we check them anyway for now.\nfor f in $dir/train.1.scp $dir/info.txt \\\n         $dir/heldout_subset.scp $dir/train_subset.scp; do\n  if ! [ -f $f -a -s $f ]; then\n    echo \"$0: expected file $f to exist and be nonempty.\"\n    exit 1\n  fi\ndone\n\n\nif [ $(awk '/^dir_type/ { print $2; }' <$dir/info.txt) != \"randomized_chain_egs\" ]; then\n  grep dir_type $dir/info.txt\n  echo \"$0: dir_type should be randomized_chain_egs in $dir/info.txt\"\n  exit 1\nfi\n\nlangs=$(awk '/^langs / {$1 = \"\"; print; }' <$dir/info.txt)\nnum_scp_files=$(awk '/^num_scp_files / { print $2; }' <$dir/info.txt)\n\nif [ -z \"$langs\" ]; then\n  echo \"$0: expecting the list of languages to be nonempty in $dir/info.txt\"\n  exit 1\nfi\n\nfor lang in $langs; do\n  for f in $dir/misc/$lang.{trans_mdl,normalization.fst,den.fst} $dir/info_${lang}.txt; do\n    if ! [ -f $f -a -s $f ]; then\n      echo \"$0: expected file $f to exist and be nonempty.\"\n      exit 1\n    fi\n  done\ndone\n\nfor i in $(seq $num_scp_files); do\n  if ! 
[ -s $dir/train.$i.scp ]; then\n    echo \"$0: expected file $dir/train.$i.scp to exist and be nonempty.\"\n    exit 1\n  fi\ndone\n\n\necho \"$0: successfully validated randomized egs in $dir\"\n"
  },
  {
    "path": "egs/steps/nnet3/chain2/validate_raw_egs.sh",
    "content": "#!/bin/bash\n\n# Copyright   2019  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright   2019  Idiap Research Institute (Author: Srikanth Madikeri).  Apache 2.0.\n#\n# This script validates a directory containing 'raw' egs for 'chain' training.\n# It also helps to document the expectations on such a directory.\n\n\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0  <raw-egs-dir>\"\n  echo \" e.g.: $0 exp/chaina/tdnn1a_sp/raw_egs\"\n  echo \"\"\n  echo \"Validates that the raw-egs dir has the expected format\"\nfi\n\ndir=$1\n\nfor f in $dir/all.scp $dir/cegs.1.ark $dir/info.txt \\\n         $dir/misc/utt2spk; do\n  if ! [ -s $f ]; then\n    echo \"$0: expected file $f to exist and be nonempty.\"\n    exit 1\n  fi\ndone\n\n\nif [ $(awk '/^dir_type/ { print $2; }' <$dir/info.txt) != \"raw_chain_egs\" ]; then\n  grep dir_type $dir/info.txt\n  echo \"$0: dir_type should be raw_chain_egs in $dir/info.txt\"\n  exit 1\nfi\n\nlang=$(awk '/^lang / {print $2; }' <$dir/info.txt)\n\nfor f in $dir/misc/$lang.{trans_mdl,normalization.fst,den.fst}; do\n  if ! [ -s $f ]; then\n    echo \"$0: expected file $f to exist and be nonempty.\"\n    exit 1\n  fi\ndone\n\necho \"$0: sucessfully validated raw egs in $dir\"\n"
  },
  {
    "path": "egs/steps/nnet3/components.py",
    "content": "#!/usr/bin/env python\n# Note: this file is part of some nnet3 config-creation tools that are now deprecated.\n\nfrom __future__ import print_function\nimport os\nimport argparse\nimport sys\nimport warnings\nimport copy\nfrom operator import itemgetter\n\ndef GetSumDescriptor(inputs):\n    sum_descriptors = inputs\n    while len(sum_descriptors) != 1:\n        cur_sum_descriptors = []\n        pair = []\n        while len(sum_descriptors) > 0:\n            value = sum_descriptors.pop()\n            if value.strip() != '':\n                pair.append(value)\n            if len(pair) == 2:\n                cur_sum_descriptors.append(\"Sum({0}, {1})\".format(pair[0], pair[1]))\n                pair = []\n        if pair:\n            cur_sum_descriptors.append(pair[0])\n        sum_descriptors = cur_sum_descriptors\n    return sum_descriptors\n\n# adds the input nodes and returns the descriptor\ndef AddInputLayer(config_lines, feat_dim, splice_indexes=[0], ivector_dim=0):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n    output_dim = 0\n    components.append('input-node name=input dim=' + str(feat_dim))\n    list = [('Offset(input, {0})'.format(n) if n != 0 else 'input') for n in splice_indexes]\n    output_dim += len(splice_indexes) * feat_dim\n    if ivector_dim > 0:\n        components.append('input-node name=ivector dim=' + str(ivector_dim))\n        list.append('ReplaceIndex(ivector, t, 0)')\n        output_dim += ivector_dim\n    if len(list) > 1:\n        splice_descriptor = \"Append({0})\".format(\", \".join(list))\n    else:\n        splice_descriptor = list[0]\n    print(splice_descriptor)\n    return {'descriptor': splice_descriptor,\n            'dimension': output_dim}\n\ndef AddNoOpLayer(config_lines, name, input):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    components.append('component name={0}_noop type=NoOpComponent 
dim={1}'.format(name, input['dimension']))\n    component_nodes.append('component-node name={0}_noop component={0}_noop input={1}'.format(name, input['descriptor']))\n\n    return {'descriptor':  '{0}_noop'.format(name),\n            'dimension': input['dimension']}\n\ndef AddLdaLayer(config_lines, name, input, lda_file):\n    return AddFixedAffineLayer(config_lines, name, input, lda_file)\n\ndef AddFixedAffineLayer(config_lines, name, input, matrix_file):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    components.append('component name={0}_fixaffine type=FixedAffineComponent matrix={1}'.format(name, matrix_file))\n    component_nodes.append('component-node name={0}_fixaffine component={0}_fixaffine input={1}'.format(name, input['descriptor']))\n\n    return {'descriptor':  '{0}_fixaffine'.format(name),\n            'dimension': input['dimension']}\n\n\ndef AddBlockAffineLayer(config_lines, name, input, output_dim, num_blocks):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n    assert((input['dimension'] % num_blocks == 0) and\n            (output_dim % num_blocks == 0))\n    components.append('component name={0}_block_affine type=BlockAffineComponent input-dim={1} output-dim={2} num-blocks={3}'.format(name, input['dimension'], output_dim, num_blocks))\n    component_nodes.append('component-node name={0}_block_affine component={0}_block_affine input={1}'.format(name, input['descriptor']))\n\n    return {'descriptor' : '{0}_block_affine'.format(name),\n                           'dimension' : output_dim}\n\ndef AddPermuteLayer(config_lines, name, input, column_map):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n    permute_indexes = \",\".join([str(x) for x in column_map])\n    components.append('component name={0}_permute type=PermuteComponent column-map={1}'.format(name, permute_indexes))\n    
component_nodes.append('component-node name={0}_permute component={0}_permute input={1}'.format(name, input['descriptor']))\n\n    return {'descriptor': '{0}_permute'.format(name),\n            'dimension': input['dimension']}\n\ndef AddAffineLayer(config_lines, name, input, output_dim, ng_affine_options = \"\", max_change_per_component = 0.75):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    # Per-component max-change option\n    max_change_options = \"max-change={0:.2f}\".format(max_change_per_component) if max_change_per_component is not None else ''\n\n    components.append(\"component name={0}_affine type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input['dimension'], output_dim, ng_affine_options, max_change_options))\n    component_nodes.append(\"component-node name={0}_affine component={0}_affine input={1}\".format(name, input['descriptor']))\n\n    return {'descriptor':  '{0}_affine'.format(name),\n            'dimension': output_dim}\n\ndef AddAffRelNormLayer(config_lines, name, input, output_dim, ng_affine_options = \" bias-stddev=0 \", norm_target_rms = 1.0, self_repair_scale = None, max_change_per_component = 0.75):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    # self_repair_scale is a constant scaling the self-repair vector computed in RectifiedLinearComponent\n    self_repair_string = \"self-repair-scale={0:.10f}\".format(self_repair_scale) if self_repair_scale is not None else ''\n    # Per-component max-change option\n    max_change_options = \"max-change={0:.2f}\".format(max_change_per_component) if max_change_per_component is not None else ''\n\n    components.append(\"component name={0}_affine type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input['dimension'], output_dim, ng_affine_options, max_change_options))\n    components.append(\"component 
name={0}_relu type=RectifiedLinearComponent dim={1} {2}\".format(name, output_dim, self_repair_string))\n    components.append(\"component name={0}_renorm type=NormalizeComponent dim={1} target-rms={2}\".format(name, output_dim, norm_target_rms))\n\n    component_nodes.append(\"component-node name={0}_affine component={0}_affine input={1}\".format(name, input['descriptor']))\n    component_nodes.append(\"component-node name={0}_relu component={0}_relu input={0}_affine\".format(name))\n    component_nodes.append(\"component-node name={0}_renorm component={0}_renorm input={0}_relu\".format(name))\n\n    return {'descriptor':  '{0}_renorm'.format(name),\n            'dimension': output_dim}\n\ndef AddAffPnormLayer(config_lines, name, input, pnorm_input_dim, pnorm_output_dim, ng_affine_options = \" bias-stddev=0 \", norm_target_rms = 1.0):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    components.append(\"component name={0}_affine type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3}\".format(name, input['dimension'], pnorm_input_dim, ng_affine_options))\n    components.append(\"component name={0}_pnorm type=PnormComponent input-dim={1} output-dim={2}\".format(name, pnorm_input_dim, pnorm_output_dim))\n    components.append(\"component name={0}_renorm type=NormalizeComponent dim={1} target-rms={2}\".format(name, pnorm_output_dim, norm_target_rms))\n\n    component_nodes.append(\"component-node name={0}_affine component={0}_affine input={1}\".format(name, input['descriptor']))\n    component_nodes.append(\"component-node name={0}_pnorm component={0}_pnorm input={0}_affine\".format(name))\n    component_nodes.append(\"component-node name={0}_renorm component={0}_renorm input={0}_pnorm\".format(name))\n\n    return {'descriptor':  '{0}_renorm'.format(name),\n            'dimension': pnorm_output_dim}\n\ndef AddConvolutionLayer(config_lines, name, input,\n                       input_x_dim, 
input_y_dim, input_z_dim,\n                       filt_x_dim, filt_y_dim,\n                       filt_x_step, filt_y_step,\n                       num_filters, input_vectorization,\n                       param_stddev = None, bias_stddev = None,\n                       filter_bias_file = None,\n                       is_updatable = True):\n    assert(input['dimension'] == input_x_dim * input_y_dim * input_z_dim)\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    conv_init_string = (\"component name={name}_conv type=ConvolutionComponent \"\n                       \"input-x-dim={input_x_dim} input-y-dim={input_y_dim} input-z-dim={input_z_dim} \"\n                       \"filt-x-dim={filt_x_dim} filt-y-dim={filt_y_dim} \"\n                       \"filt-x-step={filt_x_step} filt-y-step={filt_y_step} \"\n                       \"input-vectorization-order={vector_order}\".format(name = name,\n                       input_x_dim = input_x_dim, input_y_dim = input_y_dim, input_z_dim = input_z_dim,\n                       filt_x_dim = filt_x_dim, filt_y_dim = filt_y_dim,\n                       filt_x_step = filt_x_step, filt_y_step = filt_y_step,\n                       vector_order = input_vectorization))\n    if filter_bias_file is not None:\n        conv_init_string += \" matrix={0}\".format(filter_bias_file)\n    else:\n        conv_init_string += \" num-filters={0}\".format(num_filters)\n\n    components.append(conv_init_string)\n    component_nodes.append(\"component-node name={0}_conv_t component={0}_conv input={1}\".format(name, input['descriptor']))\n\n    num_x_steps = (1 + (input_x_dim - filt_x_dim) // filt_x_step)\n    num_y_steps = (1 + (input_y_dim - filt_y_dim) // filt_y_step)\n    output_dim = num_x_steps * num_y_steps * num_filters;\n    return {'descriptor':  '{0}_conv_t'.format(name),\n            'dimension': output_dim,\n            '3d-dim': [num_x_steps, num_y_steps, num_filters],\n            
'vectorization': 'zyx'}\n\n# The Maxpooling component assumes input vectorizations of type zyx\ndef AddMaxpoolingLayer(config_lines, name, input,\n                      input_x_dim, input_y_dim, input_z_dim,\n                      pool_x_size, pool_y_size, pool_z_size,\n                      pool_x_step, pool_y_step, pool_z_step):\n    if input_x_dim < 1 or input_y_dim < 1 or input_z_dim < 1:\n        raise Exception(\"non-positive maxpooling input size ({0}, {1}, {2})\".\n                 format(input_x_dim, input_y_dim, input_z_dim))\n    if pool_x_size > input_x_dim or pool_y_size > input_y_dim or pool_z_size > input_z_dim:\n        raise Exception(\"invalid maxpooling pool size vs. input size\")\n    if pool_x_step > pool_x_size or pool_y_step > pool_y_size or pool_z_step > pool_z_size:\n        raise Exception(\"invalid maxpooling pool step vs. pool size\")\n\n    assert(input['dimension'] == input_x_dim * input_y_dim * input_z_dim)\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    components.append('component name={name}_maxp type=MaxpoolingComponent '\n                      'input-x-dim={input_x_dim} input-y-dim={input_y_dim} input-z-dim={input_z_dim} '\n                      'pool-x-size={pool_x_size} pool-y-size={pool_y_size} pool-z-size={pool_z_size} '\n                      'pool-x-step={pool_x_step} pool-y-step={pool_y_step} pool-z-step={pool_z_step} '.\n                      format(name = name,\n                      input_x_dim = input_x_dim, input_y_dim = input_y_dim, input_z_dim = input_z_dim,\n                      pool_x_size = pool_x_size, pool_y_size = pool_y_size, pool_z_size = pool_z_size,\n                      pool_x_step = pool_x_step, pool_y_step = pool_y_step, pool_z_step = pool_z_step))\n\n    component_nodes.append('component-node name={0}_maxp_t component={0}_maxp input={1}'.format(name, input['descriptor']))\n\n    num_pools_x = 1 + (input_x_dim - pool_x_size) // pool_x_step;\n  
  num_pools_y = 1 + (input_y_dim - pool_y_size) // pool_y_step;\n    num_pools_z = 1 + (input_z_dim - pool_z_size) // pool_z_step;\n    output_dim = num_pools_x * num_pools_y * num_pools_z;\n\n    return {'descriptor':  '{0}_maxp_t'.format(name),\n            'dimension': output_dim,\n            '3d-dim': [num_pools_x, num_pools_y, num_pools_z],\n            'vectorization': 'zyx'}\n\n\ndef AddSoftmaxLayer(config_lines, name, input):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    components.append(\"component name={0}_log_softmax type=LogSoftmaxComponent dim={1}\".format(name, input['dimension']))\n    component_nodes.append(\"component-node name={0}_log_softmax component={0}_log_softmax input={1}\".format(name, input['descriptor']))\n\n    return {'descriptor':  '{0}_log_softmax'.format(name),\n            'dimension': input['dimension']}\n\n\ndef AddSigmoidLayer(config_lines, name, input, self_repair_scale = None):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    # self_repair_scale is a constant scaling the self-repair vector computed in SigmoidComponent\n    self_repair_string = \"self-repair-scale={0:.10f}\".format(self_repair_scale) if self_repair_scale is not None else ''\n    components.append(\"component name={0}_sigmoid type=SigmoidComponent dim={1}\".format(name, input['dimension'], self_repair_string))\n    component_nodes.append(\"component-node name={0}_sigmoid component={0}_sigmoid input={1}\".format(name, input['descriptor']))\n    return {'descriptor':  '{0}_sigmoid'.format(name),\n            'dimension': input['dimension']}\n\ndef AddOutputLayer(config_lines, input, label_delay = None, suffix=None, objective_type = \"linear\"):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n    name = 'output'\n    if suffix is not None:\n        name = '{0}-{1}'.format(name, suffix)\n\n    if 
label_delay is None:\n        component_nodes.append('output-node name={0} input={1} objective={2}'.format(name, input['descriptor'], objective_type))\n    else:\n        component_nodes.append('output-node name={0} input=Offset({1},{2}) objective={3}'.format(name, input['descriptor'], label_delay, objective_type))\n\ndef AddFinalLayer(config_lines, input, output_dim,\n        ng_affine_options = \" param-stddev=0 bias-stddev=0 \",\n        max_change_per_component = 1.5,\n        label_delay=None,\n        use_presoftmax_prior_scale = False,\n        prior_scale_file = None,\n        include_log_softmax = True,\n        add_final_sigmoid = False,\n        name_affix = None,\n        objective_type = \"linear\"):\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    if name_affix is not None:\n        final_node_prefix = 'Final-' + str(name_affix)\n    else:\n        final_node_prefix = 'Final'\n\n    prev_layer_output = AddAffineLayer(config_lines,\n            final_node_prefix , input, output_dim,\n            ng_affine_options, max_change_per_component)\n    if include_log_softmax:\n        if use_presoftmax_prior_scale :\n            components.append('component name={0}-fixed-scale type=FixedScaleComponent scales={1}'.format(final_node_prefix, prior_scale_file))\n            component_nodes.append('component-node name={0}-fixed-scale component={0}-fixed-scale input={1}'.format(final_node_prefix,\n                prev_layer_output['descriptor']))\n            prev_layer_output['descriptor'] = \"{0}-fixed-scale\".format(final_node_prefix)\n        prev_layer_output = AddSoftmaxLayer(config_lines, final_node_prefix, prev_layer_output)\n    elif add_final_sigmoid:\n        # Useful when you need the final outputs to be probabilities\n        # between 0 and 1.\n        # Usually used with an objective-type such as \"quadratic\"\n        prev_layer_output = AddSigmoidLayer(config_lines, final_node_prefix, 
prev_layer_output)\n    # we use the same name_affix as a prefix in for affine/scale nodes but as a\n    # suffix for output node\n    AddOutputLayer(config_lines, prev_layer_output, label_delay, suffix = name_affix, objective_type = objective_type)\n\ndef AddLstmLayer(config_lines,\n                 name, input, cell_dim,\n                 recurrent_projection_dim = 0,\n                 non_recurrent_projection_dim = 0,\n                 clipping_threshold = 30.0,\n                 zeroing_threshold = 15.0,\n                 zeroing_interval = 20,\n                 ng_per_element_scale_options = \"\",\n                 ng_affine_options = \"\",\n                 lstm_delay = -1,\n                 self_repair_scale_nonlinearity = None,\n                 max_change_per_component = 0.75):\n    assert(recurrent_projection_dim >= 0 and non_recurrent_projection_dim >= 0)\n    components = config_lines['components']\n    component_nodes = config_lines['component-nodes']\n\n    input_descriptor = input['descriptor']\n    input_dim = input['dimension']\n    name = name.strip()\n\n    if (recurrent_projection_dim == 0):\n        add_recurrent_projection = False\n        recurrent_projection_dim = cell_dim\n        recurrent_connection = \"m_t\"\n    else:\n        add_recurrent_projection = True\n        recurrent_connection = \"r_t\"\n    if (non_recurrent_projection_dim == 0):\n        add_non_recurrent_projection = False\n    else:\n        add_non_recurrent_projection = True\n\n    # self_repair_scale_nonlinearity is a constant scaling the self-repair vector computed in derived classes of NonlinearComponent,\n    # i.e.,  SigmoidComponent, TanhComponent and RectifiedLinearComponent\n    self_repair_nonlinearity_string = \"self-repair-scale={0:.10f}\".format(self_repair_scale_nonlinearity) if self_repair_scale_nonlinearity is not None else ''\n    # Natural gradient per element scale parameters\n    ng_per_element_scale_options += \" param-mean=0.0 param-stddev=1.0 \"\n  
  # Per-component max-change option\n    max_change_options = \"max-change={0:.2f}\".format(max_change_per_component) if max_change_per_component is not None else ''\n    # Parameter Definitions W*(* replaced by - to have valid names)\n    components.append(\"# Input gate control : W_i* matrices\")\n    components.append(\"component name={0}_W_i-xr type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim + recurrent_projection_dim, cell_dim, ng_affine_options, max_change_options))\n    components.append(\"# note : the cell outputs pass through a diagonal matrix\")\n    components.append(\"component name={0}_w_ic type=NaturalGradientPerElementScaleComponent  dim={1} {2} {3}\".format(name, cell_dim, ng_per_element_scale_options, max_change_options))\n\n    components.append(\"# Forget gate control : W_f* matrices\")\n    components.append(\"component name={0}_W_f-xr type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim + recurrent_projection_dim, cell_dim, ng_affine_options, max_change_options))\n    components.append(\"# note : the cell outputs pass through a diagonal matrix\")\n    components.append(\"component name={0}_w_fc type=NaturalGradientPerElementScaleComponent  dim={1} {2} {3}\".format(name, cell_dim, ng_per_element_scale_options, max_change_options))\n\n    components.append(\"#  Output gate control : W_o* matrices\")\n    components.append(\"component name={0}_W_o-xr type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim + recurrent_projection_dim, cell_dim, ng_affine_options, max_change_options))\n    components.append(\"# note : the cell outputs pass through a diagonal matrix\")\n    components.append(\"component name={0}_w_oc type=NaturalGradientPerElementScaleComponent  dim={1} {2} {3}\".format(name, cell_dim, ng_per_element_scale_options, max_change_options))\n\n    components.append(\"# Cell input matrices : W_c* 
matrices\")\n    components.append(\"component name={0}_W_c-xr type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, input_dim + recurrent_projection_dim, cell_dim, ng_affine_options, max_change_options))\n\n\n    components.append(\"# Defining the non-linearities\")\n    components.append(\"component name={0}_i type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, self_repair_nonlinearity_string))\n    components.append(\"component name={0}_f type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, self_repair_nonlinearity_string))\n    components.append(\"component name={0}_o type=SigmoidComponent dim={1} {2}\".format(name, cell_dim, self_repair_nonlinearity_string))\n    components.append(\"component name={0}_g type=TanhComponent dim={1} {2}\".format(name, cell_dim, self_repair_nonlinearity_string))\n    components.append(\"component name={0}_h type=TanhComponent dim={1} {2}\".format(name, cell_dim, self_repair_nonlinearity_string))\n\n    components.append(\"# Defining the cell computations\")\n    components.append(\"component name={0}_c1 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n    components.append(\"component name={0}_c2 type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n    components.append(\"component name={0}_m type=ElementwiseProductComponent input-dim={1} output-dim={2}\".format(name, 2 * cell_dim, cell_dim))\n    components.append(\"component name={0}_c type=BackpropTruncationComponent dim={1} \"\n        \"clipping-threshold={2} zeroing-threshold={3} zeroing-interval={4} \"\n        \"recurrence-interval={5}\".format(name, cell_dim, clipping_threshold, zeroing_threshold,\n        zeroing_interval, abs(lstm_delay)))\n\n    # c1_t and c2_t defined below\n    component_nodes.append(\"component-node name={0}_c_t component={0}_c input=Sum({0}_c1_t, {0}_c2_t)\".format(name))\n    c_tminus1_descriptor = 
\"IfDefined(Offset({0}_c_t, {1}))\".format(name, lstm_delay)\n\n    component_nodes.append(\"# i_t\")\n    component_nodes.append(\"component-node name={0}_i1 component={0}_W_i-xr input=Append({1}, IfDefined(Offset({0}_{2}, {3})))\".format(name, input_descriptor, recurrent_connection, lstm_delay))\n    component_nodes.append(\"component-node name={0}_i2 component={0}_w_ic  input={1}\".format(name, c_tminus1_descriptor))\n    component_nodes.append(\"component-node name={0}_i_t component={0}_i input=Sum({0}_i1, {0}_i2)\".format(name))\n\n    component_nodes.append(\"# f_t\")\n    component_nodes.append(\"component-node name={0}_f1 component={0}_W_f-xr input=Append({1}, IfDefined(Offset({0}_{2}, {3})))\".format(name, input_descriptor, recurrent_connection, lstm_delay))\n    component_nodes.append(\"component-node name={0}_f2 component={0}_w_fc  input={1}\".format(name, c_tminus1_descriptor))\n    component_nodes.append(\"component-node name={0}_f_t component={0}_f input=Sum({0}_f1,{0}_f2)\".format(name))\n\n    component_nodes.append(\"# o_t\")\n    component_nodes.append(\"component-node name={0}_o1 component={0}_W_o-xr input=Append({1}, IfDefined(Offset({0}_{2}, {3})))\".format(name, input_descriptor, recurrent_connection, lstm_delay))\n    component_nodes.append(\"component-node name={0}_o2 component={0}_w_oc input={0}_c_t\".format(name))\n    component_nodes.append(\"component-node name={0}_o_t component={0}_o input=Sum({0}_o1, {0}_o2)\".format(name))\n\n    component_nodes.append(\"# h_t\")\n    component_nodes.append(\"component-node name={0}_h_t component={0}_h input={0}_c_t\".format(name))\n\n    component_nodes.append(\"# g_t\")\n    component_nodes.append(\"component-node name={0}_g1 component={0}_W_c-xr input=Append({1}, IfDefined(Offset({0}_{2}, {3})))\".format(name, input_descriptor, recurrent_connection, lstm_delay))\n    component_nodes.append(\"component-node name={0}_g_t component={0}_g input={0}_g1\".format(name))\n\n    component_nodes.append(\"# 
parts of c_t\")\n    component_nodes.append(\"component-node name={0}_c1_t component={0}_c1  input=Append({0}_f_t, {1})\".format(name, c_tminus1_descriptor))\n    component_nodes.append(\"component-node name={0}_c2_t component={0}_c2 input=Append({0}_i_t, {0}_g_t)\".format(name))\n\n    component_nodes.append(\"# m_t\")\n    component_nodes.append(\"component-node name={0}_m_t component={0}_m input=Append({0}_o_t, {0}_h_t)\".format(name))\n\n    # add the recurrent connections\n    if (add_recurrent_projection and add_non_recurrent_projection):\n        components.append(\"# projection matrices : Wrm and Wpm\")\n        components.append(\"component name={0}_W-m type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} {4}\".format(name, cell_dim, recurrent_projection_dim + non_recurrent_projection_dim, ng_affine_options, max_change_options))\n        components.append(\"component name={0}_r type=BackpropTruncationComponent dim={1} \"\n            \"clipping-threshold={2} zeroing-threshold={3} zeroing-interval={4} \"\n            \"recurrence-interval={5}\".format(name, recurrent_projection_dim, clipping_threshold,\n            zeroing_threshold, zeroing_interval, abs(lstm_delay)))\n        component_nodes.append(\"# r_t and p_t\")\n        component_nodes.append(\"component-node name={0}_rp_t component={0}_W-m input={0}_m_t\".format(name))\n        component_nodes.append(\"dim-range-node name={0}_r_t_preclip input-node={0}_rp_t dim-offset=0 dim={1}\".format(name, recurrent_projection_dim))\n        component_nodes.append(\"component-node name={0}_r_t component={0}_r input={0}_r_t_preclip\".format(name))\n        output_descriptor = '{0}_rp_t'.format(name)\n        output_dim = recurrent_projection_dim + non_recurrent_projection_dim\n\n    elif add_recurrent_projection:\n        components.append(\"# projection matrices : Wrm\")\n        components.append(\"component name={0}_Wrm type=NaturalGradientAffineComponent input-dim={1} output-dim={2} {3} 
{4}\".format(\n            name, cell_dim, recurrent_projection_dim, ng_affine_options, max_change_options))\n        components.append(\"component name={0}_r type=BackpropTruncationComponent dim={1} \"\n            \"clipping-threshold={2} zeroing-threshold={3} zeroing-interval={4} \"\n            \"recurrence-interval={5}\".format(name, recurrent_projection_dim, clipping_threshold,\n            zeroing_threshold, zeroing_interval, abs(lstm_delay)))\n        component_nodes.append(\"# r_t\")\n        component_nodes.append(\"component-node name={0}_r_t_preclip component={0}_Wrm input={0}_m_t\".format(name))\n        component_nodes.append(\"component-node name={0}_r_t component={0}_r input={0}_r_t_preclip\".format(name))\n        output_descriptor = '{0}_r_t'.format(name)\n        output_dim = recurrent_projection_dim\n\n    else:\n        components.append(\"component name={0}_r type=BackpropTruncationComponent dim={1} \"\n            \"clipping-threshold={2} zeroing-threshold={3} zeroing-interval={4} \"\n            \"recurrence-interval={5}\".format(name, cell_dim, clipping_threshold,\n            zeroing_threshold, zeroing_interval, abs(lstm_delay)))\n        component_nodes.append(\"component-node name={0}_r_t component={0}_r input={0}_m_t\".format(name))\n        output_descriptor = '{0}_r_t'.format(name)\n        output_dim = cell_dim\n\n    return {\n            'descriptor': output_descriptor,\n            'dimension':output_dim\n            }\n\ndef AddBLstmLayer(config_lines,\n                  name, input, cell_dim,\n                  recurrent_projection_dim = 0,\n                  non_recurrent_projection_dim = 0,\n                  clipping_threshold = 1.0,\n                  zeroing_threshold = 3.0,\n                  zeroing_interval = 20,\n                  ng_per_element_scale_options = \"\",\n                  ng_affine_options = \"\",\n                  lstm_delay = [-1,1],\n                  self_repair_scale_nonlinearity = None,\n            
      max_change_per_component = 0.75):\n    assert(len(lstm_delay) == 2 and lstm_delay[0] < 0 and lstm_delay[1] > 0)\n    output_forward = AddLstmLayer(config_lines = config_lines,\n                                  name = \"{0}_forward\".format(name),\n                                  input = input,\n                                  cell_dim = cell_dim,\n                                  recurrent_projection_dim = recurrent_projection_dim,\n                                  non_recurrent_projection_dim = non_recurrent_projection_dim,\n                                  clipping_threshold = clipping_threshold,\n                                  zeroing_threshold = zeroing_threshold,\n                                  zeroing_interval = zeroing_interval,\n                                  ng_per_element_scale_options = ng_per_element_scale_options,\n                                  ng_affine_options = ng_affine_options,\n                                  lstm_delay = lstm_delay[0],\n                                  self_repair_scale_nonlinearity = self_repair_scale_nonlinearity,\n                                  max_change_per_component = max_change_per_component)\n    output_backward = AddLstmLayer(config_lines = config_lines,\n                                   name = \"{0}_backward\".format(name),\n                                   input = input,\n                                   cell_dim = cell_dim,\n                                   recurrent_projection_dim = recurrent_projection_dim,\n                                   non_recurrent_projection_dim = non_recurrent_projection_dim,\n                                   clipping_threshold = clipping_threshold,\n                                   zeroing_threshold = zeroing_threshold,\n                                   zeroing_interval = zeroing_interval,\n                                   ng_per_element_scale_options = ng_per_element_scale_options,\n                                   ng_affine_options = 
ng_affine_options,\n                                   lstm_delay = lstm_delay[1],\n                                   self_repair_scale_nonlinearity = self_repair_scale_nonlinearity,\n                                   max_change_per_component = max_change_per_component)\n    output_descriptor = 'Append({0}, {1})'.format(output_forward['descriptor'], output_backward['descriptor'])\n    output_dim = output_forward['dimension'] + output_backward['dimension']\n\n    return {\n            'descriptor': output_descriptor,\n            'dimension':output_dim\n            }\n\n"
  },
  {
    "path": "egs/steps/nnet3/compute_output.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#                2016  Vimal Manohar\n# Apache 2.0.\n\n# This script does forward propagation through a neural network.\n\n# Begin configuration section.\nstage=1\nnj=4 # number of jobs.\ncmd=run.pl\nuse_gpu=false\nframes_per_chunk=50\niter=final\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nframe_subsampling_factor=1\ncompress=false    # Specifies whether the output should be compressed before\n                  # dumping to disk\nonline_ivector_dir=\noutput_name=      # Dump outputs for this output-node\napply_exp=false  # Apply exp i.e. write likelihoods instead of log-likelihoods\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <data-dir> <nnet-dir> <output-dir>\"\n  echo \"e.g.:   steps/nnet3/compute_output.sh --nj 8 \\\\\"\n  echo \"--online-ivector-dir exp/nnet3/ivectors_test_eval92 \\\\\"\n  echo \"    data/test_eval92_hires exp/nnet3/tdnn exp/nnet3/tdnn/output\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  exit 1;\nfi\n\ndata=$1\nsrcdir=$2\ndir=$3\n\nmkdir -p $dir/log\n\n# convert $dir to absolute pathname\nfdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nmodel=$srcdir/$iter.raw\nif [ ! -f $srcdir/$iter.raw ]; then\n  echo \"$0: WARNING: no such file $srcdir/$iter.raw. 
Trying $srcdir/$iter.mdl instead.\"\n  model=$srcdir/$iter.mdl\nfi\n\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $model $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nif [ ! -z \"$output_name\" ] && [ \"$output_name\" != \"output\" ]; then\n  echo \"$0: Using output-name $output_name\"\n  model=\"nnet3-copy --edits='remove-output-nodes name=output;rename-node old-name=$output_name new-name=output' $model - |\"\nfi\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\n\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n## Set up features.\nif [ -f $srcdir/final.mat ]; then\n  echo \"$0: ERROR: lda feature type is no longer supported.\" && exit 1\nfi\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nframe_subsampling_opt=\nif [ $frame_subsampling_factor -ne 1 ]; then\n  # e.g. 
for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\nfi\n\nif $apply_exp; then\n  output_wspecifier=\"ark:| copy-matrix --apply-exp ark:- ark,scp:$dir/output.JOB.ark,$dir/output.JOB.scp\"\nelse\n  output_wspecifier=\"ark:| copy-feats --compress=$compress ark:- ark,scp:$dir/output.JOB.ark,$dir/output.JOB.scp\"\nfi\n\ngpu_opt=\"--use-gpu=no\"\ngpu_queue_opt=\n\nif $use_gpu; then\n  gpu_queue_opt=\"--gpu 1\"\n  suffix=\"-batch\"\n  gpu_opt=\"--use-gpu=yes\"\nelse\n  gpu_opt=\"--use-gpu=no\"\nfi\n\nif [ $stage -le 2 ]; then\n  $cmd $gpu_queue_opt JOB=1:$nj $dir/log/compute_output.JOB.log \\\n    nnet3-compute$suffix $gpu_opt $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     \"$model\" \"$feats\" \"$output_wspecifier\" || exit 1;\nfi\n\nfor n in $(seq $nj); do\n  cat $dir/output.$n.scp\ndone > $dir/output.scp\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/convert_nnet2_to_nnet3.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017    Joachim Fainberg.\n\n# This script converts nnet2 models into nnet3 models.\n# It requires knowledge of valid components which\n# can be modified in the configuration section below.\n\nfrom __future__ import print_function\nimport argparse, os, tempfile, logging, sys, shutil, fileinput, re\nfrom collections import defaultdict, namedtuple\nimport numpy as np\nsys.path.insert(0, 'steps/')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\n\n# Begin configuration section\n# Components and their corresponding node names\n\nNODE_NAMES = {\n    \"<AffineComponent>\":\"affine\",\n    \"<AffineComponentPreconditioned>\":\"affine\",\n    \"<AffineComponentPreconditionedOnline>\":\"affine\",\n    \"<BlockAffineComponent>\":\"affine\",\n    \"<BlockAffineComponentPreconditioned>\":\"affine\",\n    \"<SigmoidComponent>\":\"nonlin\",\n    \"<TanhComponent>\":\"nonlin\",\n    \"<PowerComponent>\":\"nonlin\",\n    \"<RectifiedLinearComponent>\":\"nonlin\",\n    \"<SoftHingeComponent>\":\"nonlin\",\n    \"<PnormComponent>\":\"nonlin\",\n    \"<NormalizeComponent>\":\"renorm\",\n    \"<MaxoutComponent>\":\"maxout\",\n    \"<MaxpoolingComponent>\":\"maxpool\",\n    \"<ScaleComponent>\":\"rescale\",\n    \"<DropoutComponent>\":\"dropout\",\n    \"<SoftmaxComponent>\":\"softmax\",\n    \"<LogSoftmaxComponent>\":\"log-softmax\",\n    \"<FixedScaleComponent>\":\"fixed-scale\",\n    \"<FixedAffineComponent>\":\"fixed-affine\",\n    \"<FixedLinearComponent>\":\"fixed-linear\",\n    \"<FixedBiasComponent>\":\"fixed-bias\",\n    \"<PermuteComponent>\":\"permute\",\n    \"<AdditiveNoiseComponent>\":\"noise\",\n    \"<Convolutional1dComponent>\":\"conv\",\n    \"<SumGroupComponent>\":\"sum-group\",\n    \"<DctComponent>\":\"dct\",\n    \"<SpliceComponent>\":\"splice\",\n    \"<SpliceMaxComponent>\":\"splice\"\n}\n\nSPLICE_COMPONENTS = [c for c in NODE_NAMES if \"Splice\" in c]\nAFFINE_COMPONENTS = 
[c for c in NODE_NAMES if \"Affine\" in c]\n\nKNOWN_COMPONENTS = list(NODE_NAMES.keys())\n# End configuration section\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(filename)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(\n        description=\"Converts nnet2 into nnet3 models.\",\n        epilog=\"\"\"e.g. steps/nnet3/convert_nnet2_to_nnet3.py \n                  exp/tri4_nnet2 exp/tri4_nnet3\"\"\")\n    parser.add_argument(\"--tmpdir\", type=str, default=\"./\",\n                        help=\"Custom location for the temporary directory.\")\n    parser.add_argument(\"--skip-cleanup\", action='store_true',\n                        help=\"Will not remove the temporary directory.\")\n    parser.add_argument(\"--model\", type=str, default='final.mdl',\n                        help=\"Choose a specific model to convert.\")\n    parser.add_argument(\"--binary\", type=str, default=\"true\", \n                        choices=[\"true\",\"false\"], \n                        help=\"Whether to write the model in binary or not.\")\n    parser.add_argument(\"nnet2_dir\", metavar=\"src-nnet2-dir\", type=str,\n                        help=\"\")\n    parser.add_argument(\"nnet3_dir\", metavar=\"src-nnet3-dir\", type=str,\n                        help=\"\")\n\n    print(' '.join(sys.argv))\n\n    args = parser.parse_args()\n\n    if not os.path.exists(args.nnet3_dir):\n        os.makedirs(args.nnet3_dir)\n    if args.tmpdir and not os.path.exists(args.tmpdir):\n        os.makedirs(args.tmpdir)\n\n    return args\n\nclass Nnet3Model(object):\n    \"\"\"Holds configuration for an Nnet3 model.\"\"\"\n    \n    def __init__(self):\n        self.input_dim = -1\n        self.output_dim = 
-1\n        self.ivector_dim = 0 \n        self.counts = defaultdict(int)\n        self.num_components = 0\n        self.components_read = 0\n        self.config = \"\"\n        self.transition_model = \"\"\n        self.priors = \"\"\n        self.components = []\n\n    def add_component(self, component, pairs):\n        \"\"\"Adds components to the model. \n        \n        Takes a dictionary of key-value pairs.\n        \"\"\"\n        self.components_read += 1\n\n        Component = namedtuple(\"Component\", \"ident component pairs\")\n\n        if \"<InputDim>\" in pairs and self.input_dim == -1:\n            self.input_dim = int(pairs[\"<InputDim>\"])\n\n        if \"<ConstComponentDim>\" in pairs and self.ivector_dim == 0:\n            self.ivector_dim = int(pairs[\"<ConstComponentDim>\"])\n\n        # remove nnet2 specific tokens and catch descriptors\n        if component == \"<PnormComponent>\" and \"<P>\" in pairs:\n            pairs.pop(\"<P>\")\n        elif component in SPLICE_COMPONENTS:\n            self.components.append(Component(\"splice\", component, pairs))\n            return\n\n        # format pairs: {'<InputDim>':43} -> {'input-dim':43}\n        pairs = [\"{0}={1}\".format(token_to_string(key), pairs[key]) for key in pairs]\n        \n        # keep track of layer type number (e.g. affine3)\n        node_name = NODE_NAMES[component]\n        self.counts[node_name] += 1\n\n        # e.g. 
affine3\n        ident = node_name + str(self.counts[node_name])\n\n        # <PnormComponent> -> PnormComponent\n        component = component[1:-1]\n\n        self.components.append(Component(ident, component, pairs))\n\n    def write_config(self, filename):\n        \"\"\"Write config to filename.\"\"\"\n        logger.info(\"Writing config to {0}\".format(filename))\n\n        self.config = filename\n        with open(filename, 'w') as f:\n            for component in self.components:\n                if component.ident == \"splice\":\n                    continue\n                config_string = ' '.join(component.pairs)\n\n                f.write(\"component name={name} type={comp_type} {config_string}\"\n                        \"\\n\".format(name=component.ident, \n                                    comp_type=component.component, \n                                    config_string=config_string))\n\n            f.write(\"\\n# Component nodes\\n\")\n            if self.ivector_dim != 0:\n                f.write(\"input-node name=input dim={0}\\n\".format(self.input_dim-self.ivector_dim))\n                f.write(\"input-node name=ivector dim={0}\\n\".format(self.ivector_dim))\n            else:\n                f.write(\"input-node name=input dim={0}\\n\".format(self.input_dim))\n            previous_component = \"input\"\n            for component in self.components:\n                if component.ident == \"splice\":\n                    # Create splice string for the next node\n                    previous_component = make_splice_string(previous_component, \n                                                   component.pairs[\"<Context>\"],\n                                                   component.pairs[\"<ConstComponentDim>\"])\n                    continue\n                f.write(\"component-node name={name} component={name} \"\n                        \"input={inp}\\n\".format(name=component.ident, \n                                               
inp=previous_component))\n                previous_component = component.ident\n            logger.warning(\"Assuming linear objective.\")\n            f.write(\"output-node name=output input={inp} objective={obj}\"\n                    \"\\n\".format(inp=previous_component, obj='linear'))\n\n    def write_model(self, model, binary=\"true\"):\n        if not os.path.exists(self.config):\n            raise IOError(\"Config file {0} does not exist.\".format(self.config))\n\n        # write raw model\n        common_lib.execute_command(\"nnet3-init --binary=true {0} {1}\"\n            .format(self.config, os.path.join(tmpdir, \"nnet3.raw\")))\n\n        # add transition model\n        common_lib.execute_command(\"nnet3-am-init --binary=true {0} {1} {2}\"\n            .format(self.transition_model, os.path.join(tmpdir, \"nnet3.raw\"),\n                    os.path.join(tmpdir, \"nnet3_no_prior.mdl\")))\n\n        # add priors\n        common_lib.execute_command(\"nnet3-am-adjust-priors \"\n                                     \"--binary={0} {1} {2} {3}\"\n            .format(binary, os.path.join(tmpdir, \"nnet3_no_prior.mdl\"), \n                    self.priors, model))\n\ndef parse_nnet2_to_nnet3(line_buffer):\n    \"\"\"Reads an Nnet2 model into an Nnet3 object.\n\n    Parses by passing line_buffer objects depending upon the\n    current place or component being read.\n\n    Returns Nnet3 object.\n    \"\"\"\n    model = Nnet3Model()\n\n    # <TransitionModel> ...\n    model.transition_model = parse_transition_model(line_buffer)\n    \n    # <Nnet> <NumComponents> ...\n    line, model.num_components = parse_nnet2_header(line_buffer)\n\n    # Parse remaining components\n    while True:\n        if line.startswith(\"</Components>\"):\n            break\n        component, pairs = parse_component(line, line_buffer)\n        model.add_component(component, pairs)\n        line = next(line_buffer)\n\n    model.priors = parse_priors(line, line_buffer)\n    \n    if 
model.components_read != model.num_components:\n        logger.error(\"Did not read all components successfully: {0}/{1}\"\n                     .format(model.components_read, model.num_components))\n\n    return model\n\ndef parse_transition_model(line_buffer):\n    \"\"\"Writes transition model to text file.\n    \n    Returns filename.\n    \"\"\"\n    line = next(line_buffer)\n    assert line.startswith(\"<TransitionModel>\")\n\n    transition_model = os.path.join(tmpdir, \"transition_model\")\n\n    with open(transition_model, 'w') as fc:\n        fc.write(line)\n        \n        while True:\n            line = next(line_buffer)\n            fc.write(line)\n            if line.startswith(\"</TransitionModel>\"):\n                break\n\n        return transition_model\n\ndef parse_nnet2_header(line_buffer):\n    \"\"\"Returns number of components in Nnet2 header.\"\"\"\n    line = next(line_buffer)\n    assert line.startswith(\"<Nnet>\")\n\n    line = consume_token(\"<Nnet>\", line)\n    num_components = int(line.split()[1])\n    line = line.partition(str(num_components))[2]\n    line = consume_token(\"<Components>\", line)\n\n    return line, num_components \n                \ndef parse_component(line, line_buffer):\n    component = line.split()[0]\n    pairs = {}\n\n    if component in SPLICE_COMPONENTS:\n        line, pairs = parse_splice_component(component, line, line_buffer)\n    elif component in AFFINE_COMPONENTS:\n        pairs = parse_affine_component(component, line, line_buffer)\n    elif component == \"<FixedScaleComponent>\":\n        pairs = parse_fixed_scale_component(component, line, line_buffer)\n    elif component == \"<FixedBiasComponent>\":\n        pairs = parse_fixed_bias_component(component, line, line_buffer)\n    elif component == \"<SumGroupComponent>\":\n        pairs = parse_sum_group_component(component, line, line_buffer)\n    elif component in KNOWN_COMPONENTS:\n        pairs = parse_standard_component(component, line, 
line_buffer)\n    else:\n        raise LookupError(\"Unrecognised component, {0}.\".format(component))\n\n    parse_end_of_component(component, line, line_buffer)\n\n    return component, pairs\n\ndef parse_standard_component(component, line, line_buffer):\n    # Ignores stats such as ValueSum and DerivSum\n    line = consume_token(component, line)\n    pairs = re.findall(r\"(<\\w+>) ([\\w.-]+)\", line)\n\n    return dict(pairs)\n\ndef parse_fixed_scale_component(component, line, line_buffer):\n    line = consume_token(component, line)\n    line = consume_token(\"<Scales>\", line)\n\n    scales = np.array([parse_vector(line)])\n\n    _, filename = tempfile.mkstemp(dir=tmpdir)\n    with open(filename, 'w') as f:\n        f.write(\"[ \")\n        np.savetxt(f, scales, newline='')\n        f.write(\" ]\")\n\n    return {\"<Scales>\" : filename}\n\ndef parse_sum_group_component(component, line, line_buffer):\n    line = consume_token(component, line)\n    line = consume_token(\"<Sizes>\", line)\n\n    sizes = line.strip().strip(\"[]\").strip().replace(' ', ',')\n\n    return {\"<Sizes>\" : sizes}\n\ndef parse_fixed_bias_component(component, line, line_buffer):\n    line = consume_token(component, line)\n    line = consume_token(\"<Bias>\", line)\n\n    scales = np.array([parse_vector(line)])\n\n    _, filename = tempfile.mkstemp(dir=tmpdir)\n    with open(filename, 'w') as f:\n        f.write(\"[ \")\n        np.savetxt(f, scales, newline='')\n        f.write(\" ]\")\n\n    return {\"<Bias>\" : filename}\n\ndef parse_splice_component(component, line, line_buffer):\n    if component == \"<SpliceMaxComponent>\":\n        raise NotImplementedError(\"Script doesn't support SpliceMaxComponent.\")\n\n    line = consume_token(component, line)\n    line = consume_token(\"<InputDim>\", line)\n    [input_dim, _, line] = line.strip().partition(' ')\n    line = consume_token(\"<Context>\", line)\n    context = line.strip()[1:-1].split()\n\n    const_component_dim = 0\n    line = 
next(line_buffer) # Context vector adds newline\n    line = consume_token(\"<ConstComponentDim>\", line)\n    const_component_dim = int(line.strip().split()[0])\n\n    return line, {\"<InputDim>\" : input_dim, \"<Context>\" : context, \n            \"<ConstComponentDim>\" : const_component_dim}\n\ndef parse_end_of_component(component, line, line_buffer):\n    # Keeps reading until it hits the end tag for component\n    end_component = \"</\" + component[1:]\n\n    while end_component not in line:\n        line = next(line_buffer)\n\n    return\n\ndef parse_affine_component(component, line, line_buffer):\n    assert (\"<LinearParams>\" in line)\n\n    pairs = dict(re.findall(\"(<\\w+>) ([\\w.-]+)\", line))\n\n    # read the linear params and bias and convert it to a matrix\n    weights = parse_weights(line_buffer)\n    bias = parse_bias(next(line_buffer))\n\n    matrix = np.concatenate([weights, bias.T], axis=1)\n\n    # write matrix and return pairs with filename\n    _, filename = tempfile.mkstemp(dir=tmpdir)\n    with open(filename, 'w') as f:\n        f.write(\"[ \")\n        np.savetxt(f, matrix)\n        f.write(\" ]\")\n\n    pairs[\"<Matrix>\"] = filename\n\n    return pairs\n\ndef parse_weights(line_buffer):\n    weights = []\n\n    while True:\n        line = next(line_buffer)\n\n        if line.strip().endswith(\"[\"):\n            continue\n        elif line.strip().endswith(\"]\"):\n            weights.append(parse_vector(line))\n            break\n        else:\n            weights.append(parse_vector(line))\n\n    return np.array(weights)\n\ndef parse_bias(line):\n    if \"<BiasParams>\" in line:\n        line = consume_token(\"<BiasParams>\", line)\n\n    return np.array([parse_vector(line)])\n\ndef parse_vector(line):\n    vector = line.strip().strip(\"[]\")\n    return np.array([float(x) for x in vector.split()], dtype=\"float32\")\n\ndef parse_priors(line, line_buffer):\n    vector = parse_vector(line.partition('[')[2])\n    priors = 
os.path.join(tmpdir, \"priors\")\n\n    with open(priors, 'w') as f:\n        f.write(\"[ \")\n        np.savetxt(f, vector, newline=' ')\n        f.write(\" ]\")\n\n    return priors\n\ndef token_to_string(token):\n    \"\"\"Converts tokens to lowercase, hyphen-bounded strings.\n\n    E.g. <InputDim> -> input-dim\n    \"\"\"\n    string = token[1:-1]\n    string = re.sub(r\"((?<=[a-z])[A-Z]|(?<!\\A)[A-Z](?=[a-z]))\", r'-\\1', string).lower()\n    return string\n\ndef consume_token(token, line):\n    \"\"\"Returns line without token\"\"\"\n    if token != line.split(None, 1)[0]:\n        logger.error(\"Unexpected token, expected '{0}', got '{1}'.\"\n              .format(token, line.split(None, 1)[0]))\n\n    return line.partition(token)[2]\n\ndef make_splice_string(nodename, context, const_component_dim=0):\n    \"\"\"Generates splice string from a list of context.\n\n    E.g. make_splice_string(\"renorm4\", [-4, 4])\n    returns \"Append(Offset(renorm4, -4), Offset(renorm4, 4))\"\n    \"\"\"\n    assert type(context) == list, \"context argument must be a list\"\n    string = [\"Offset({0}, {1})\".format(nodename, i) for i in context]\n    if const_component_dim > 0:\n        string.append(\"ReplaceIndex(ivector, t, 0)\")\n    string = \"Append(\" + \", \".join(string) + \")\"\n    return string\n\ntmpdir = \"\"\n\ndef Main():\n    args = GetArgs()\n    logger.info(\"Converting nnet2 model {0} to nnet3 model {1}\"\n                .format(os.path.join(args.nnet2_dir, args.model), \n                        os.path.join(args.nnet3_dir, args.model)))\n    global tmpdir\n    tmpdir = tempfile.mkdtemp(dir=args.tmpdir) \n\n    # Convert nnet2 model to text and remove preconditioning\n    common_lib.execute_command(\"nnet-am-copy \"\n            \"--remove-preconditioning=true --binary=false {0}/{1} {2}/{1}\"\n            .format(args.nnet2_dir, args.model, tmpdir))\n\n    # Parse nnet2 and return nnet3 object\n    with open(os.path.join(tmpdir, args.model)) as f:\n      
  nnet3 = parse_nnet2_to_nnet3(f)\n\n    # Write model\n    nnet3.write_config(os.path.join(tmpdir, \"config\"))\n    nnet3.write_model(os.path.join(args.nnet3_dir, args.model), \n                      binary=args.binary)\n        \n    if not args.skip_cleanup:\n        shutil.rmtree(tmpdir)\n    else:\n        logger.info(\"Not removing temporary directory {0}\".format(tmpdir))\n     \n    logger.info(\"Wrote nnet3 model to {0}\".format(os.path.join(args.nnet3_dir, \n                                                  args.model)))\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/nnet3/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This script does decoding with a neural-net.\n\n# Begin configuration section.\nstage=1\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\ncmd=run.pl\nbeam=15.0\nframes_per_chunk=50\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0 # Beam we use in lattice generation.\niter=final\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nuse_gpu=false # If true, will use a GPU, with nnet3-latgen-faster-batch.\n              # In that case it is recommended to set num-threads to a large\n              # number, e.g. 20 if you have that many free CPU slots on a GPU\n              # node, and to use a small number of jobs.\nscoring_opts=\nskip_diagnostics=false\nskip_scoring=false\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\nminimize=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
utils/parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \"e.g.:   steps/nnet3/decode.sh --nj 8 \\\\\"\n  echo \"--online-ivector-dir exp/nnet2_online/ivectors_test_eval92 \\\\\"\n  echo \"    exp/tri4b/graph_bg data/test_eval92_hires $dir/decode_bg_eval92\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 15.0\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n  echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n  echo \"  --use-gpu <true|false>                   # default: false.  If true, we recommend\"\n  echo \"                                           # to use large --num-threads as the graph\"\n  echo \"                                           # search becomes the limiting factor.\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nmodel=$srcdir/$iter.mdl\n\n\nextra_files=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nutils/lang/check_phones_compatible.sh {$srcdir,$graphdir}/phones.txt || exit 1\n\nfor f in $graphdir/HCLG.fst $data/feats.scp $model $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\nif [ -f $srcdir/cmvn_opts ]; then\n    cmvn_opts=`cat $srcdir/cmvn_opts`\nelse\n    cmvn_opts=\"--norm-means=false --norm-vars=false\"\nfi\nthread_string=\nif $use_gpu; then\n  if [ $num_threads -eq 1 ]; then\n    echo \"$0: **Warning: we recommend to use --num-threads > 1 for GPU-based decoding.\"\n  fi\n  thread_string=\"-batch --num-threads=$num_threads\"\n  queue_opt=\"--num-threads $num_threads --gpu 1\"\nelif [ $num_threads -gt 1 ]; then\n  thread_string=\"-parallel --num-threads=$num_threads\"\n  queue_opt=\"--num-threads $num_threads\"\nfi\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n## Set up features.\nif [ -f $srcdir/online_cmvn ]; then online_cmvn=true\nelse online_cmvn=false; fi\n\nif ! $online_cmvn; then\n  echo \"$0: feature type is raw\"\n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\nelse\n  echo \"$0: feature type is raw (apply-cmvn-online)\"\n  feats=\"ark,s,cs:apply-cmvn-online $cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt $srcdir/global_cmvn.stats scp:$sdata/JOB/feats.scp ark:- |\"\nfi\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"ark:|gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. 
for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nelif [ -f $srcdir/init/info.txt ]; then\n    frame_subsampling_factor=$(awk '/^frame_subsampling_factor/ {print $2}' <$srcdir/init/info.txt)\n    if [ ! -z $frame_subsampling_factor ]; then\n        frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\n    fi\nfi\n\nif [ $stage -le 1 ]; then\n  $cmd $queue_opt JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet3-latgen-faster$thread_string $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --word-symbol-table=$graphdir/words.txt \"$model\" \\\n     $graphdir/HCLG.fst \"$feats\" \"$lat_wspecifier\" || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if ! $skip_diagnostics ; then\n    [ ! -z $iter ] && iter_opt=\"--iter $iter\"\n    steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\n  fi\nfi\n\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/decode_grammar.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This is a version of ./decode.sh that allows you to decode with a GrammarFst.\n# See kaldi-asr.org/doc/grammar.html for an overview of what this is about.\n\n# Begin configuration section.\nstage=1\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\ncmd=run.pl\nbeam=15.0\nframes_per_chunk=50\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0 # Beam we use in lattice generation.\niter=final\nscoring_opts=\nskip_diagnostics=false\nskip_scoring=false\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\nminimize=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
utils/parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \"e.g.:   steps/nnet3/decode_grammar.sh --nj 8 \\\\\"\n  echo \"--online-ivector-dir exp/nnet2_online/ivectors_test_eval92 \\\\\"\n  echo \"    exp/tri4b/graph_bg data/test_eval92_hires $dir/decode_bg_eval92\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 15.0\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nmodel=$srcdir/$iter.mdl\n\n\nextra_files=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nutils/lang/check_phones_compatible.sh {$srcdir,$graphdir}/phones.txt || exit 1\n\nfor f in $graphdir/HCLG.gra $data/feats.scp $model $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n## Set up features.\necho \"$0: feature type is raw\"\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"ark:|gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet3-latgen-grammar $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --word-symbol-table=$graphdir/words.txt \"$model\" \\\n     $graphdir/HCLG.gra \"$feats\" \"$lat_wspecifier\" || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if ! $skip_diagnostics ; then\n    [ ! -z $iter ] && iter_opt=\"--iter $iter\"\n    steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\n  fi\nfi\n\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! 
-x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/decode_lookahead.sh",
    "content": "#!/bin/bash\n\n# Copyright 2019       Alpha Cephei Inc (Author: Nickolay Shmmyrev).\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This script does decoding with a neural-net with lookahead composition of HCL and G graphs.\n\n# Begin configuration section.\nstage=1\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\ncmd=run.pl\nbeam=15.0\nframes_per_chunk=50\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0 # Beam we use in lattice generation.\niter=final\nuse_gpu=false # If true, will use a GPU, with nnet3-latgen-faster-batch.\n              # In that case it is recommended to set num-threads to a large\n              # number, e.g. 20 if you have that many free CPU slots on a GPU\n              # node, and to use a small number of jobs.\nscoring_opts=\nskip_diagnostics=false\nskip_scoring=false\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\nminimize=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
utils/parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \"e.g.:   $0 --nj 8 \\\\\"\n  echo \"--online-ivector-dir exp/nnet2_online/ivectors_test_eval92 \\\\\"\n  echo \"    exp/tri4b/graph_bg data/test_eval92_hires $dir/decode_bg_eval92\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 15.0\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n  echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n  echo \"  --use-gpu <true|false>                   # default: false.  If true, we recommend\"\n  echo \"                                           # to use large --num-threads as the graph\"\n  echo \"                                           # search becomes the limiting factor.\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nmodel=$srcdir/$iter.mdl\n\n\nextra_files=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\n#utils/lang/check_phones_compatible.sh {$srcdir,$graphdir}/phones.txt || exit 1\n\nfor f in $graphdir/HCLr.fst $graphdir/Gr.fst $graphdir/disambig_tid.int $data/feats.scp $model $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\nthread_string=\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n## Set up features.\necho \"$0: feature type is raw\"\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"ark:|gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 1 ]; then\n  $cmd $queue_opt JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet3-latgen-faster-lookahead $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --word-symbol-table=$graphdir/words.txt \"$model\" \\\n     $graphdir/HCLr.fst $graphdir/Gr.fst $graphdir/disambig_tid.int \"$feats\" \"$lat_wspecifier\" || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if ! $skip_diagnostics ; then\n    [ ! 
-z $iter ] && iter_opt=\"--iter $iter\"\n    steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\n  fi\nfi\n\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/decode_looped.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n\n# This is like decode.sh except it uses \"looped\" decoding.  This is an nnet3\n# mechanism for reusing previously computed activations when we evaluate the\n# neural net for successive chunks of data.  It is applicable to TDNNs and LSTMs\n# and similar forward-recurrent topologies, but not to backward-recurrent\n# topologies like BLSTMs.  Be careful because the script itself does not have a\n# way to figure out what kind of topology you are using.\n#\n# Also be aware that this decoding mechanism means that you have effectively\n# unlimited context within the utterance.  Unless your models were trained (at\n# least partly) on quite large chunk-sizes, e.g. 100 or more (although the\n# longer the BLSTM recurrence the larger chunk-size you'd need in training),\n# there is a possibility that this effectively infinite left-context will cause\n# a mismatch with the training condition.  Also, for recurrent topologies, you may want to make sure\n# that the --extra-left-context-initial matches the --egs.chunk-left-context-initial\n# that you trained with, .  
[note: if not specified during training, it defaults to\n# the same as the regular --extra-left-context\n\n# This script does decoding with a neural-net.\n\n# Begin configuration section.\nstage=1\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\ncmd=run.pl\nbeam=15.0\nframes_per_chunk=50\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0 # Beam we use in lattice generation.\niter=final\nscoring_opts=\nskip_diagnostics=false\nskip_scoring=false\nextra_left_context_initial=0\nonline_ivector_dir=\nminimize=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \"e.g.:   steps/nnet3/decode.sh --nj 8 \\\\\"\n  echo \"--online-ivector-dir exp/nnet2_online/ivectors_test_eval92 \\\\\"\n  echo \"    exp/tri4b/graph_bg data/test_eval92_hires $dir/decode_bg_eval92\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 15.0\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nmodel=$srcdir/$iter.mdl\n\n\n[ ! 
-z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $graphdir/HCLG.fst $data/feats.scp $model $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n## Set up features.\necho \"$0: feature type is raw\"\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"ark:|gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet3-latgen-faster-looped $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --word-symbol-table=$graphdir/words.txt \"$model\" \\\n     $graphdir/HCLG.fst \"$feats\" \"$lat_wspecifier\" || exit 1;\nfi\n\n\nif [ $stage -le 2 ]; then\n  if ! 
$skip_diagnostics ; then\n    [ ! -z $iter ] && iter_opt=\"--iter $iter\"\n    steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\n  fi\nfi\n\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/decode_score_fusion.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2018        Tien-Hong Lo\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Script for system combination using output of the neural networks.\n# This calls nnet3-compute, matrix-sum and latgen-faster-mapped to create a system combination.\nset -euo pipefail\n# begin configuration section.\ncmd=run.pl\n\n# Neural Network\nstage=0\niter=final\nnj=30\noutput_name=\"output\"\nivector_scale=1.0\napply_exp=false  # Apply exp i.e. 
write likelihoods instead of log-likelihoods\ncompress=false    # Specifies whether the output should be compressed before\n                  # dumping to disk\nuse_gpu=false\nskip_diagnostics=false\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\nframe_subsampling_factor=\nframes_per_chunk=150\naverage=true\n\n# Decode\nbeam=15.0 # prune the lattices prior to MBR decoding, for speed.\nmax_active=7000\nmin_active=200\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\nlattice_beam=8.0 # Beam we use in lattice generation.\nnum_threads=1 # if >1, will use latgen-faster--map-parallel\nmin_lmwt=5\nmax_lmwt=15\nparallel_opts=\"--num-threads 3\"\nscoring_opts=\nminimize=false\nskip_scoring=false\n\nword_determinize=false  # If set to true, then output lattice does not retain\n                        # alternate paths a sequence of words (with alternate pronunciations).\n                        # Setting to true is the default in steps/nnet3/decode.sh.\n                        # However, setting this to false\n                        # is useful for generation w of semi-supervised training\n                        # supervision and frame-level confidences.\nwrite_compact=true   # If set to false, then writes the lattice in non-compact format,\n                     # retaining the acoustic scores on each arc. This is\n                     # required to be false for LM rescoring undeterminized\n                     # lattices (when --word-determinize is false)\n#end configuration section.\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\n\nif [ $# -lt 5 ]; then\n  echo \"Usage: $0 [options] <data-dir> <graph-dir> <nnet3-dir> <nnet3-dir2> [<nnet3-dir3> ... 
] <output-dir>\"\n  echo \"e.g.:   steps/nnet3/decode_score_fusion.sh --nj 8 \\\\\"\n  echo \"    --online-ivector-dir exp/nnet3/ivectors_test \\\\\"\n  echo \"    data/test_hires exp/nnet3/tdnn/graph exp/nnet3/tdnn/output exp/nnet3/tdnn1/output .. \\\\\"\n  echo \"    exp/nnet3/tdnn_comb/decode_test\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  exit 1;\nfi\n\necho \"$0 $@\"\n\ndata=$1\ngraphdir=$2\ndir=${@: -1}  # last argument to the script\nshift 2;\nmodel_dirs=( $@ )  # read the remaining arguments into an array\nunset model_dirs[${#model_dirs[@]}-1]  # 'pop' the last argument which is odir\nnum_sys=${#model_dirs[@]}  # number of systems to combine\n\nfor f in $graphdir/words.txt $graphdir/phones/word_boundary.int ; do\n  [ ! -f $f ] && echo \"$0: file $f does not exist\" && exit 1;\ndone\n\n[ ! -z \"$online_ivector_dir\" ] && \\\n   extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n   \nif [ ! 
-z \"$online_ivector_dir\" ]; then\n    ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n    ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\n# assign frame_subsampling_factor automatically if empty\nif [ -z $frame_subsampling_factor ]; then\n   frame_subsampling_factor=`cat ${model_dirs[0]}/frame_subsampling_factor` || exit 1;\nfi\n\n# check if standard chain system or not.\nif [ $frame_subsampling_factor -eq 3 ]; then\n   if [ $acwt != 1.0 ] || [ $post_decode_acwt != 10.0 ]; then\n     echo -e '\\n\\n'\n     echo \"$0 WARNING: In standard chain system, acwt = 1.0, post_decode_acwt = 10.0\"\n     echo \"$0 WARNING: Your acwt = $acwt, post_decode_acwt = $post_decode_acwt\"\n     echo \"$0 WARNING: This is OK if you know what you are doing.\"\n     echo -e '\\n\\n'\n   fi\nfi\n\nframe_subsampling_opt=\nif [ $frame_subsampling_factor -ne 1 ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\nfi\n\n# Possibly use multi-threaded decoder\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/temp\n\nfor i in `seq 0 $[num_sys-1]`; do\n  srcdir=${model_dirs[$i]}\n  \n  model=$srcdir/$iter.mdl\n  if [ ! -f $srcdir/$iter.mdl ]; then\n    echo \"$0: Error: no such file $srcdir/$iter.mdl\" && exit 1;\n  fi\n  \n  # check that they have the same tree\n  show-transitions $graphdir/phones.txt $model > $dir/temp/transition.${i}.txt\n  cmp_tree=`diff -q $dir/temp/transition.0.txt $dir/temp/transition.${i}.txt | awk '{print $5}'`\n  if [ ! 
-z $cmp_tree ]; then\n    echo \"$0 tree must be the same.\"\n    exit 1;\n  fi\n  \n  # check that they have the same frame-subsampling-factor\n  if [ $frame_subsampling_factor -ne `cat $srcdir/frame_subsampling_factor` ]; then\n    echo \"$0 frame_subsampling_factor must be the same.\\\\\"\n    echo \"Default:$frame_subsampling_factor \\\\\"\n    echo \"In $srcdir:`cat $srcdir/frame_subsampling_factor`\"\n    exit 1;\n  fi\n  \n  for f in $data/feats.scp $model $extra_files; do\n    [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\n  done\n\n  if [ ! -z \"$output_name\" ] && [ \"$output_name\" != \"output\" ]; then\n    echo \"$0: Using output-name $output_name\"\n    model=\"nnet3-copy --edits='remove-output-nodes name=output;rename-node old-name=$output_name new-name=output' $model - |\"\n  fi\n\n  ## Set up features.\n  if [ -f $srcdir/final.mat ]; then\n    echo \"$0: Error: lda feature type is no longer supported.\" && exit 1\n  fi\n  \n  sdata=$data/split$nj;\n  cmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\n  \n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\n  if $apply_exp; then\n    output_wspecifier=\"ark:| copy-matrix --apply-exp ark:- ark:-\"\n  else\n    output_wspecifier=\"ark:| copy-feats --compress=$compress ark:- ark:-\"\n  fi\n\n  gpu_opt=\"--use-gpu=no\"\n  gpu_queue_opt=\n\n  if $use_gpu; then\n    gpu_queue_opt=\"--gpu 1\"\n    gpu_opt=\"--use-gpu=yes\"\n  fi\n\n  echo \"$i $model\";\n  models[$i]=\"ark,s,cs:nnet3-compute $gpu_opt $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     '$model' '$feats' '$output_wspecifier' |\"\ndone\n\n# remove tempdir\nrm -rf $dir/temp\n\n# split data 
to nj\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n# Assume the nnet trained by \n# the same tree and frame subsampling factor.\nmkdir -p $dir/log\n\nif [ -f $model ]; then\n  echo \"$0: $model exists, copy model to $dir/../\"\n  cp $model $dir/../\nfi\n\nif [ -f $srcdir/frame_shift ]; then\n  cp $srcdir/frame_shift $dir/../\n  echo \"$0: $srcdir/frame_shift exists, copy $srcdir/frame_shift to $dir/../\"\nelif [ -f $srcdir/frame_subsampling_factor ]; then\n  cp $srcdir/frame_subsampling_factor $dir/../\n  echo \"$0: $srcdir/frame_subsampling_factor exists, copy $srcdir/frame_subsampling_factor to $dir/../\"\nfi\n\nlat_wspecifier=\"ark:|\"\nextra_opts=\nif ! $write_compact; then\n  extra_opts=\"--determinize-lattice=false\"\n  lat_wspecifier=\"ark:| lattice-determinize-phone-pruned --beam=$lattice_beam --acoustic-scale=$acwt --minimize=$minimize --word-determinize=$word_determinize --write-compact=false $model ark:- ark:- |\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"$lat_wspecifier gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"$lat_wspecifier lattice-scale --acoustic-scale=$post_decode_acwt --write-compact=$write_compact ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\n\nif [ $stage -le 0 ]; then  \n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n     matrix-sum --average=$average \"${models[@]}\" ark:- \\| \\\n     latgen-faster-mapped$thread_string --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --minimize=$minimize --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --word-symbol-table=$graphdir/words.txt ${extra_opts} \"$model\" \\\n     $graphdir/HCLG.fst ark:- \"$lat_wspecifier\"\nfi\n\nif [ $stage -le 1 ]; then\n  if ! $skip_diagnostics ; then\n    [ ! 
-z $iter ] && iter_opt=\"--iter $iter\"\n    steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\n  fi\nfi\n\nif ! $skip_scoring ; then\n  if [ $stage -le 2 ]; then\n    [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    scoring_opts=\"--min_lmwt $min_lmwt\"\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\n\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/decode_semisup.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\n# This script does decoding with a neural-net.\n\n# Begin configuration section.\nstage=1\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\ncmd=run.pl\nbeam=15.0\nframes_per_chunk=50\nmax_active=7000\nmin_active=200\nivector_scale=1.0\nlattice_beam=8.0 # Beam we use in lattice generation.\niter=final\nnum_threads=1 # if >1, will use gmm-latgen-faster-parallel\nscoring_opts=\nskip_diagnostics=false\nskip_scoring=false\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\nonline_ivector_dir=\nminimize=false\nword_determinize=false  # If set to true, then output lattice does not retain\n                        # alternate paths a sequence of words (with alternate pronunciations).\n                        # Setting to true is the default in steps/nnet3/decode.sh.\n                        # However, setting this to false\n                        # is useful for generation w of semi-supervised training\n                        # supervision and frame-level confidences.\nwrite_compact=true   # If set to false, then writes the lattice in non-compact format,\n                     # retaining the acoustic scores on each arc. This is\n                     # required to be false for LM rescoring undeterminized\n                     # lattices (when --word-determinize is false)\n                     # Useful for semi-supervised training with rescored lattices.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
utils/parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n  echo \"e.g.:   steps/nnet3/decode.sh --nj 8 \\\\\"\n  echo \"--online-ivector-dir exp/nnet2_online/ivectors_test_eval92 \\\\\"\n  echo \"    exp/tri4b/graph_bg data/test_eval92_hires $dir/decode_bg_eval92\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 15.0\"\n  echo \"  --iter <iter>                            # Iteration of model to decode; default is final.\"\n  echo \"  --scoring-opts <string>                  # options to local/score.sh\"\n  echo \"  --num-threads <n>                        # number of threads to use, default 1.\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\nmodel=$srcdir/$iter.mdl\n\n\nextra_files=\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/check_ivectors_compatible.sh $srcdir $online_ivector_dir || exit 1\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfi\n\nutils/lang/check_phones_compatible.sh {$srcdir,$graphdir}/phones.txt || exit 1\n\nfor f in $graphdir/HCLG.fst $data/feats.scp $model $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj;\ncmvn_opts=`cat $srcdir/cmvn_opts` || exit 1;\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\n\n## Set up features.\necho \"$0: feature type is raw\"\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nextra_opts=\nlat_wspecifier=\"ark:|\"\nif ! $write_compact; then\n  extra_opts=\"--determinize-lattice=false\"\n  lat_wspecifier=\"ark:| lattice-determinize-phone-pruned --beam=$lattice_beam --acoustic-scale=$acwt --minimize=$minimize --word-determinize=$word_determinize --write-compact=false $model ark:- ark:- |\"\nfi\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"$lat_wspecifier gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"$lat_wspecifier lattice-scale --acoustic-scale=$post_decode_acwt --write-compact=$write_compact ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\nframe_subsampling_opt=\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. 
for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\n# Copy the model as it is required when generating egs\ncp $model $dir/  || exit 1\n\nif [ $stage -le 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode.JOB.log \\\n    nnet3-latgen-faster$thread_string $ivector_opts $frame_subsampling_opt \\\n     --frames-per-chunk=$frames_per_chunk \\\n     --extra-left-context=$extra_left_context \\\n     --extra-right-context=$extra_right_context \\\n     --extra-left-context-initial=$extra_left_context_initial \\\n     --extra-right-context-final=$extra_right_context_final \\\n     --minimize=$minimize --word-determinize=$word_determinize \\\n     --max-active=$max_active --min-active=$min_active --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=true \\\n     --word-symbol-table=$graphdir/words.txt ${extra_opts} \"$model\" \\\n     $graphdir/HCLG.fst \"$feats\" \"$lat_wspecifier\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  if ! $skip_diagnostics ; then\n    [ ! -z $iter ] && iter_opt=\"--iter $iter\"\n    steps/diagnostic/analyze_lats.sh --cmd \"$cmd\" $iter_opt $graphdir $dir\n  fi\nfi\n\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at\n# different acoustic scales to get the final output.\nif [ $stage -le 3 ]; then\n  if ! $skip_scoring ; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    echo \"score best paths\"\n    [ \"$iter\" != \"final\" ] && iter_opt=\"--iter $iter\"\n    local/score.sh $scoring_opts --cmd \"$cmd\" $data $graphdir $dir\n    echo \"score confidence and timing with sclite\"\n  fi\nfi\necho \"Decoding done.\"\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/dot/descriptor_parser.py",
    "content": "#!/usr/bin/env python\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nimport pprint\nimport re\nimport sys\n\nstart_identifier = \"(\"\nend_identifier = \")\"\n\ndef ParseSubsegmentsAndArguments(segment_endpoints, sub_segments, arguments, input_string):\n    # name the sub_segments, and other arguments\n    arg_name_start_index = segment_endpoints[0]\n    args = ''\n    for sub_segment in sub_segments:\n        endpoints = sub_segment['endpoints']\n        args += input_string[arg_name_start_index:endpoints[0]+1]\n        arg_name_start_index=endpoints[1]+1\n    args += input_string[arg_name_start_index:segment_endpoints[1]+1]\n\n    args = args.split(',')\n    if len(sub_segments) > 0:\n        sub_segment_index = 0\n        for sub_segment_name in args:\n            sub_segment_name = sub_segment_name.strip()\n            if sub_segment_name[-1] == \"(\":\n                # this subsegment is a function\n                sub_segment_name = sub_segment_name[:-1]\n                sub_segments[sub_segment_index]['name'] = sub_segment_name\n                sub_segment_index += 1\n\n            else:\n                arguments.append(sub_segment_name)\n    else:\n        arguments = [re.sub(',','', x.strip()) for x in input_string[segment_endpoints[0]:segment_endpoints[1]+1].split()]\n        sub_segments = []\n    return sub_segments, arguments\n\ndef IdentifyNestedSegments(input_string):\n    indices = []\n    segments = []\n    for i in range(len(input_string)):\n        if input_string[i] == start_identifier:\n            indices.append(i)\n        if input_string[i] == end_identifier:\n            # new segment has been found\n            current_segment_endpoints = [indices.pop(), i]\n            sub_segments = []\n            arguments = []\n            # identify the sub-segments\n            # the sub-segments would be on the top of the stack\n            # with start index 
greater than current segment\n            # and end index less than current segment\n            # these sub-segments are listed in reverse order on the stack,\n            # the final segment is on the top\n            while len(segments) > 0:\n                if ((segments[-1]['endpoints'][0] > current_segment_endpoints[0]) and\n                    (segments[-1]['endpoints'][1] < current_segment_endpoints[1])):\n                    sub_segments.insert(0, segments.pop())\n                else:\n                    break\n\n            sub_segments, arguments = ParseSubsegmentsAndArguments([current_segment_endpoints[0]+1, current_segment_endpoints[1]-1], sub_segments, arguments, input_string)\n            segments.append({\n                             'name':'',\n                             'endpoints':current_segment_endpoints,\n                             'sub_segments':sub_segments,\n                             'arguments':arguments\n                             })\n    arguments = []\n    segments, arguments = ParseSubsegmentsAndArguments([0, len(input_string)], segments, arguments, input_string)\n    if arguments:\n        if segments:\n            raise Exception('Arguments not expected outside top level braces : {0}'.format(input_string))\n    if len(segments) > 1:\n        raise Exception('only one parent segment expected : {0}'.format(input_string))\n\n    return [segments, arguments]\n\nif __name__ == \"__main__\":\n    strings= [\n        \"Append(Offset-2(input, -2), Offset-1(input, -1), input, Offset+1(input, 1), Offset+2(input, 2), ReplaceIndex(ivector, t, 0))\",\n        \"Wx\"]\n    for string in strings:\n        segments = IdentifyNestedSegments(string)\n        pprint.pprint(segments)\n\n"
  },
  {
    "path": "egs/steps/nnet3/dot/nnet3_to_dot.py",
    "content": "#!/usr/bin/env python\n\n# Copyright      2015  Johns Hopkins University (Author: Vijayaditya Peddinti)\n# Apache 2.0\n\n# script to convert nnet3-am-info output to a dot graph\n\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nimport re\nimport os\nimport argparse\nimport sys\nimport math\nimport warnings\nimport descriptor_parser\nimport pprint\n\nnode_attributes = {\n    'input-node':{\n        'shape':'oval'\n    },\n    'output-node':{\n        'shape':'oval'\n    },\n    'NaturalGradientAffineComponent':{\n        'color':'lightgrey',\n        'shape':'box',\n        'style':'filled'\n    },\n    'NaturalGradientPerElementScaleComponent':{\n        'color':'lightpink',\n        'shape':'box',\n        'style':'filled'\n    },\n    'ConvolutionComponent':{\n        'color':'lightpink',\n        'shape':'box',\n        'style':'filled'\n    },\n    'FixedScaleComponent':{\n        'color':'blueviolet',\n        'shape':'box',\n        'style':'filled'\n    },\n    'FixedAffineComponent':{\n        'color':'darkolivegreen1',\n        'shape':'box',\n        'style':'filled'\n    },\n    'SigmoidComponent':{\n        'color':'bisque',\n        'shape':'rectangle',\n        'style':'filled'\n    },\n    'TanhComponent':{\n        'color':'bisque',\n        'shape':'rectangle',\n        'style':'filled'\n    },\n    'NormalizeComponent':{\n        'color':'aquamarine',\n        'shape':'rectangle',\n        'style':'filled'\n    },\n    'RectifiedLinearComponent':{\n        'color':'bisque',\n        'shape':'rectangle',\n        'style':'filled'\n    },\n    'ClipGradientComponent':{\n        'color':'bisque',\n        'shape':'rectangle',\n        'style':'filled'\n    },\n    'ElementwiseProductComponent':{\n        'color':'green',\n        'shape':'rectangle',\n        'style':'filled'\n    },\n    'LogSoftmaxComponent':{\n        'color':'cyan',\n        'shape':'rectangle',\n   
     'style':'filled'\n    }\n}\n\ndef GetDotNodeName(name_string, is_component = False):\n    # this function is required as dot does not allow all the component names\n    # allowed by nnet3.\n    # Identified incompatibilities :\n    #   1. dot does not allow hyphen(-) and dot(.) in names\n    #   2. Nnet3 names can be shared among components and component nodes\n    #      dot does not allow common names\n    #\n    node_name_string = re.sub(\"-\", \"hyphen\", name_string)\n    node_name_string = re.sub(r\"\\.\", \"_dot_\", node_name_string)\n    if is_component:\n        node_name_string += node_name_string.strip() + \"_component\"\n    return {\"label\":name_string, \"node\":node_name_string}\n\ndef ProcessAppendDescriptor(segment, parent_node_name, affix, edge_attributes = None):\n    dot_graph = []\n    names = []\n    desc_name = 'Append_{0}'.format(affix)\n    for i in range(len(segment['sub_segments'])):\n        sub_segment = segment['sub_segments'][i]\n        part_name = \"{0}{1}{2}\".format(desc_name, sub_segment['name'], i)\n        names.append(\"<{0}> part {1}\".format(GetDotNodeName(part_name)['node'], i))\n        dot_graph += DescriptorSegmentToDot(sub_segment, \"{0}:{1}\".format(desc_name, part_name), desc_name)\n\n    part_index = len(segment['sub_segments'])\n    for i in range(len(segment['arguments'])):\n        part_name = \"{0}{1}{2}\".format(desc_name, segment['arguments'][i], part_index + i)\n        names.append(\"<{0}> part {1}\".format(GetDotNodeName(part_name)['node'], part_index + i))\n        dot_graph.append(\"{0} -> {1}:{2}\".format(GetDotNodeName(segment['arguments'][i])['node'], GetDotNodeName(desc_name)['node'], GetDotNodeName(part_name)['node']))\n\n    label = \"|\".join(names)\n    label = \"{{\"+label+\"}|Append}\"\n    dot_graph.append('{0} [shape=Mrecord, label=\"{1}\"];'.format(GetDotNodeName(desc_name)['node'], label))\n\n    attr_string = ''\n    if edge_attributes is not None:\n        if 'label' in 
edge_attributes:\n            attr_string += \" label={0} \".format(edge_attributes['label'])\n        if 'style' in edge_attributes:\n            attr_string += ' style={0} '.format(edge_attributes['style'])\n\n    dot_string = '{0} -> {1} [tailport=s]'.format(GetDotNodeName(desc_name)['node'], GetDotNodeName(parent_node_name)['node'])\n\n    if attr_string != '':\n        dot_string += ' [{0}] '.format(attr_string)\n    dot_graph.append(dot_string)\n\n\n    return dot_graph\n\ndef ProcessRoundDescriptor(segment, parent_node_name, affix, edge_attributes = None):\n    dot_graph = []\n\n    label = 'Round ({0})'.format(segment['arguments'][1])\n    style = None\n    if edge_attributes is not None:\n        if 'label' in edge_attributes:\n            label = \"{0} {1}\".format(edge_attributes['label'], label)\n        if 'style' in edge_attributes:\n            style  = 'style={0}'.format(edge_attributes['style'])\n\n    attr_string = 'label=\"{0}\"'.format(label)\n    if style is not None:\n        attr_string += ' {0}'.format(style)\n    dot_graph.append('{0}->{1} [ {2} ]'.format(GetDotNodeName(segment['arguments'][0])['node'],\n                                                                    GetDotNodeName(parent_node_name)['node'],\n                                                                    attr_string))\n    if segment['sub_segments']:\n        raise Exception(\"Round can just deal with forwarding descriptor, no sub-segments allowed\")\n    return dot_graph\n\n\ndef ProcessOffsetDescriptor(segment, parent_node_name, affix, edge_attributes = None):\n    dot_graph = []\n\n    label = 'Offset ({0})'.format(segment['arguments'][1])\n    style = None\n    if edge_attributes is not None:\n        if 'label' in edge_attributes:\n            label = \"{0} {1}\".format(edge_attributes['label'], label)\n        if 'style' in edge_attributes:\n            style  = 'style={0}'.format(edge_attributes['style'])\n\n    attr_string = 'label=\"{0}\"'.format(label)\n  
  if style is not None:\n        attr_string += ' {0}'.format(style)\n\n    dot_graph.append('{0}->{1} [ {2} ]'.format(GetDotNodeName(segment['arguments'][0])['node'],\n                                                                    GetDotNodeName(parent_node_name)['node'],\n                                                                    attr_string))\n    if segment['sub_segments']:\n        raise Exception(\"Offset can just deal with forwarding descriptor, no sub-segments allowed\")\n    return dot_graph\n\ndef ProcessSumDescriptor(segment, parent_node_name, affix, edge_attributes = None):\n    dot_graph = []\n    names = []\n    desc_name = 'Sum_{0}'.format(affix)\n    # create the sum node\n    for i in range(len(segment['sub_segments'])):\n        sub_segment = segment['sub_segments'][i]\n        part_name = \"{0}{1}{2}\".format(desc_name, sub_segment['name'], i)\n        names.append(\"<{0}> part {1}\".format(GetDotNodeName(part_name)['node'], i))\n        dot_graph += DescriptorSegmentToDot(sub_segment, \"{0}:{1}\".format(desc_name, part_name), \"{0}_{1}\".format(desc_name, i))\n\n    # link the sum node parts to corresponding segments\n    part_index = len(segment['sub_segments'])\n    for i in range(len(segment['arguments'])):\n        part_name = \"{0}{1}{2}\".format(desc_name, segment['arguments'][i], part_index + i)\n        names.append(\"<{0}> part {1}\".format(GetDotNodeName(part_name)['node'], part_index + i))\n        dot_graph.append(\"{0} -> {1}:{2}\".format(GetDotNodeName(segment['arguments'][i])['node'], GetDotNodeName(desc_name)['node'], GetDotNodeName(part_name)['node']))\n\n    label = \"|\".join(names)\n    label = '{{'+label+'}|Sum}'\n    dot_graph.append('{0} [shape=Mrecord, label=\"{1}\", color=red];'.format(GetDotNodeName(desc_name)['node'], label))\n\n    attr_string = ''\n    if edge_attributes is not None:\n        if 'label' in edge_attributes:\n            attr_string += \" label={0} \".format(edge_attributes['label'])\n    
    if 'style' in edge_attributes:\n            attr_string += ' style={0} '.format(edge_attributes['style'])\n\n    dot_string = '{0} -> {1}'.format(GetDotNodeName(desc_name)['node'], GetDotNodeName(parent_node_name)['node'])\n\n    dot_string += ' [{0} tailport=s ] '.format(attr_string)\n    dot_graph.append(dot_string)\n    return dot_graph\n\ndef ProcessReplaceIndexDescriptor(segment, parent_node_name, affix, edge_attributes = None):\n    dot_graph = []\n\n    label = 'ReplaceIndex({0}, {1})'.format(segment['arguments'][1], segment['arguments'][2])\n    style = None\n    if edge_attributes is not None:\n        if 'label' in edge_attributes:\n            label = \"{0} {1}\".format(edge_attributes['label'], label)\n        if 'style' in edge_attributes:\n            style  = 'style={0}'.format(edge_attributes['style'])\n\n    attr_string = 'label=\"{0}\"'.format(label)\n    if style is not None:\n        attr_string += ' {0}'.format(style)\n\n    dot_graph.append('{0}->{1} [{2}]'.format(GetDotNodeName(segment['arguments'][0])['node'],\n                                                                    GetDotNodeName(parent_node_name)['node'],\n                                                                    attr_string))\n    if segment['sub_segments']:\n        raise Exception(\"ReplaceIndex can just deal with forwarding descriptor, no sub-segments allowed\")\n    return dot_graph\n\ndef ProcessIfDefinedDescriptor(segment, parent_node_name, affix, edge_attributes = None):\n    # IfDefined adds attributes to the edges\n    if edge_attributes is not None:\n        raise Exception(\"edge_attributes was not None, this means an IfDefined descriptor was calling the current IfDefined descriptor. 
This is not allowed\")\n    dot_graph = []\n    dot_graph.append('#ProcessIfDefinedDescriptor')\n    names = []\n\n    if segment['sub_segments']:\n        sub_segment = segment['sub_segments'][0]\n        dot_graph += DescriptorSegmentToDot(sub_segment, parent_node_name, parent_node_name, edge_attributes={'style':'dotted', 'label':'IfDefined'})\n\n    if segment['arguments']:\n        dot_graph.append('{0} -> {1} [style=dotted, label=\"IfDefined\"]'.format(GetDotNodeName(segment['arguments'][0])['node'], GetDotNodeName(parent_node_name)['node']))\n\n    return dot_graph\n\ndef DescriptorSegmentToDot(segment, parent_node_name, affix, edge_attributes = None):\n    # segment is a dictionary which corresponds to a descriptor\n    dot_graph = []\n    if segment['name'] == \"Append\":\n        dot_graph += ProcessAppendDescriptor(segment, parent_node_name, affix, edge_attributes)\n    elif segment['name'] == \"Offset\":\n        dot_graph += ProcessOffsetDescriptor(segment, parent_node_name, affix, edge_attributes)\n    elif segment['name'] == \"Sum\":\n        dot_graph += ProcessSumDescriptor(segment, parent_node_name, affix, edge_attributes)\n    elif segment['name'] == \"IfDefined\":\n        dot_graph += ProcessIfDefinedDescriptor(segment, parent_node_name, affix, edge_attributes)\n    elif segment['name'] == \"ReplaceIndex\":\n        dot_graph += ProcessReplaceIndexDescriptor(segment, parent_node_name, affix, edge_attributes)\n    elif segment['name'] == \"Round\":\n        dot_graph += ProcessRoundDescriptor(segment, parent_node_name, affix, edge_attributes)\n    elif segment['name'] == \"Scale\":\n        pass\n    else:\n        raise Exception('Descriptor {0}, is not recognized by this script. 
Please add Process{0}Descriptor method'.format(segment['name']))\n    return dot_graph\n\ndef Nnet3DescriptorToDot(descriptor, parent_node_name):\n    dot_lines = []\n    [segments, arguments] = descriptor_parser.IdentifyNestedSegments(descriptor)\n    if segments:\n        for segment in segments:\n            dot_lines += DescriptorSegmentToDot(segment, parent_node_name, parent_node_name)\n    elif arguments:\n        assert(len(arguments) == 1)\n        dot_lines.append(\"{0} -> {1}\".format(GetDotNodeName(arguments[0])['node'], GetDotNodeName(parent_node_name)['node']))\n    return dot_lines\n\ndef ParseNnet3String(string):\n    if re.search('^input-node|^component|^output-node|^component-node|^dim-range-node', string.strip()) is None:\n        return [None, None]\n\n    parts = string.split()\n    config_type = parts[0]\n    fields = []\n    prev_field = ''\n    for i in range(1, len(parts)):\n        if re.search('=', parts[i]) is None:\n            prev_field += ' '+parts[i]\n        else:\n            if not (prev_field.strip() == ''):\n                fields.append(prev_field)\n            sub_parts = parts[i].split('=')\n            if (len(sub_parts) != 2):\n                raise Exception('Malformed config line {0}'.format(string))\n            fields.append(sub_parts[0])\n            prev_field = sub_parts[1]\n    fields.append(prev_field)\n\n    parsed_string = {}\n    try:\n        while len(fields) > 0:\n            value = re.sub(',$', '', fields.pop().strip())\n            key = fields.pop()\n            parsed_string[key.strip()] = value.strip()\n    except IndexError:\n        raise Exception('Malformed config line {0}'.format(string))\n    return [config_type, parsed_string]\n\n# sample component config line\n# component name=L0_lda type=FixedAffineComponent, input-dim=300, output-dim=300, linear-params-stddev=0.00992724, bias-params-stddev=0.573973\ndef Nnet3ComponentToDot(component_config, component_attributes = None):\n    label = ''\n    if 
component_attributes is None:\n        component_attributes = component_config.keys()\n    attributes_to_print = set(component_attributes).intersection(list(component_config.keys()))\n    # process the known fields\n    for key in attributes_to_print:\n        if key in component_config:\n            label += '{0} = {1}\\\\n'.format(key, component_config[key])\n\n    attr_string = ''\n    try:\n        attributes = node_attributes[component_config['type']]\n        for key in attributes.keys():\n            attr_string += ' {0}={1} '.format(key, attributes[key])\n    except KeyError:\n        pass\n\n    return ['{0} [label=\"{1}\" {2}]'.format(GetDotNodeName(component_config['name'], is_component = True)['node'], label, attr_string)]\n\n\n# input-node name=input dim=40\ndef Nnet3InputToDot(parsed_config):\n    return ['{0} [ label=\"{1}\\\\ndim={2}\"]'.format(GetDotNodeName(parsed_config['name'])['node'], parsed_config['name'], parsed_config['dim'] )]\n\n# output-node name=output input=Final_log_softmax dim=3940 objective=linear\n#output-node name=output input=Offset(Final_log_softmax, 5) dim=3940 objective=linear\ndef Nnet3OutputToDot(parsed_config):\n    dot_graph = []\n    dot_graph += Nnet3DescriptorToDot(parsed_config['input'], parsed_config['name'])\n    dot_graph.append('{0} [ label=\"{1}\\\\nobjective={2}\"]'.format(GetDotNodeName(parsed_config['name'])['node'], parsed_config['name'], parsed_config['objective']))\n    return dot_graph\n\n# dim-range-node name=Lstm1_r_t input-node=Lstm1_rp_t dim-offset=0 dim=256\ndef Nnet3DimrangeToDot(parsed_config):\n    dot_graph = []\n    dot_node = GetDotNodeName(parsed_config['name'])\n    dot_graph.append('{0} [shape=rectangle, label=\"{1}\"]'.format(dot_node['node'], dot_node['label']))\n    dot_graph.append('{0} -> {1} [taillabel=\"dimrange({2}, {3})\"]'.format(GetDotNodeName(parsed_config['input-node'])['node'],\n                                                           
GetDotNodeName(parsed_config['name'])['node'],\n                                                           parsed_config['dim-offset'],\n                                                           parsed_config['dim']))\n    return dot_graph\n\ndef Nnet3ComponentNodeToDot(parsed_config):\n    dot_graph = []\n    dot_graph += Nnet3DescriptorToDot(parsed_config['input'], parsed_config['name'])\n    dot_node = GetDotNodeName(parsed_config['name'])\n    dot_graph.append('{0} [ label=\"{1}\", shape=box ]'.format(dot_node['node'], dot_node['label']))\n    dot_graph.append('{0} -> {1} [ weight=10 ]'.format(GetDotNodeName(parsed_config['component'], is_component = True)['node'],\n                                                       GetDotNodeName(parsed_config['name'])['node']))\n    return dot_graph\n\ndef GroupConfigs(configs, node_prefixes = None):\n    if node_prefixes is None:\n        node_prefixes = []\n    # we make the assumption that nodes belonging to the same sub-graph have a\n    # common prefix.\n    grouped_configs = {}\n    for node_prefix in node_prefixes:\n        group = []\n        rest = []\n        for config in configs:\n            if re.search('^{0}'.format(node_prefix), config[1]['name']) is not None:\n                group.append(config)\n            else:\n                rest.append(config)\n        configs = rest\n        grouped_configs[node_prefix] = group\n    grouped_configs[None] = configs\n\n    return grouped_configs\n\ndef ParseConfigLines(lines, node_prefixes = None, component_attributes = None ):\n    if node_prefixes is None:\n        node_prefixes = []\n    config_lines = []\n    dot_graph=[]\n    configs = []\n    for line in lines:\n        config_type, parsed_config = ParseNnet3String(line)\n        if config_type is not None:\n            configs.append([config_type, parsed_config])\n\n    # process the config lines\n    grouped_configs = GroupConfigs(configs, node_prefixes)\n    for group in grouped_configs.keys():\n        
configs = grouped_configs[group]\n        if not configs:\n            continue\n        if group is not None:\n            # subgraphs prefixed with cluster will be treated differently by\n            # dot\n            dot_graph.append('subgraph cluster_{0} '.format(group) + \"{\")\n            dot_graph.append('color=blue')\n\n        for config in configs:\n            config_type = config[0]\n            parsed_config = config[1]\n            if config_type is None:\n                continue\n            if config_type == 'input-node':\n                dot_graph += Nnet3InputToDot(parsed_config)\n            elif config_type == 'output-node':\n                dot_graph += Nnet3OutputToDot(parsed_config)\n            elif config_type == 'component-node':\n                dot_graph += Nnet3ComponentNodeToDot(parsed_config)\n            elif config_type == 'dim-range-node':\n                dot_graph += Nnet3DimrangeToDot(parsed_config)\n            elif config_type == 'component':\n                dot_graph += Nnet3ComponentToDot(parsed_config, component_attributes)\n\n        if group is not None:\n            dot_graph.append('label = \"{0}\"'.format(group))\n            dot_graph.append('}')\n\n    dot_graph.insert(0, 'digraph nnet3graph {')\n    dot_graph.append('}')\n\n    return dot_graph\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description=\"Converts the output of nnet3-am-info \"\n                                                 \"to dot graph. The output has to be compiled\"\n                                                 \" with dot to generate a displayable graph\",\n                                    epilog=\"See steps/nnet3/nnet3_to_dot.sh for example.\")\n    parser.add_argument(\"--component-attributes\", type=str,\n                        help=\"Attributes of the components which should be displayed in the dot-graph \"\n                             \"e.g. 
--component-attributes name,type,input-dim,output-dim\", default=None)\n    parser.add_argument(\"--node-prefixes\", type=str,\n                        help=\"List of prefixes. Nnet3 components/component-nodes with the same prefix\"\n                        \" will be clustered together in the dot-graph,\"\n                        \" e.g. --node-prefixes Lstm1,Lstm2,Layer1\", default=None)\n\n    parser.add_argument(\"dotfile\", help=\"name of the dot output file\")\n\n    print(' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    component_attributes = None\n    if args.component_attributes is not None:\n        component_attributes = args.component_attributes.split(',')\n    node_prefixes = []\n    if args.node_prefixes is not None:\n        node_prefixes = args.node_prefixes.split(',')\n\n    lines = sys.stdin.readlines()\n    dot_graph = ParseConfigLines(lines, component_attributes = component_attributes, node_prefixes = node_prefixes)\n\n    # the context manager closes the file even if the write fails\n    with open(args.dotfile, \"w\") as dotfile_handle:\n        dotfile_handle.write(\"\\n\".join(dot_graph))\n"
  },
  {
    "path": "egs/steps/nnet3/get_degs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016   Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright 2014-2015   Vimal Manohar\n\n# Decodes denlats and dumps egs for discriminative training, in one script\n# (avoids writing the non-compact lattices to disk, which can use a lot of disk\n# space).\n\n\n# Begin configuration section.\ncmd=run.pl\nmax_copy_jobs=5  # Limit disk I/O\n\n# feature options\nonline_ivector_dir=\n\n# example splitting and context options\nframes_per_eg=150 # number of frames of labels per example.\n                  # Note: may in general be a comma-separated string of alternative\n                  # durations; the first one (the principal num-frames) is preferred.\nframes_overlap_per_eg=30 # number of supervised frames of overlap that we aim for per eg.\n                  # can be useful to avoid wasted data if you're using --left-deriv-truncate\n                  # and --right-deriv-truncate.\nlooped=false       # Set to true to enable looped decoding [can\n                   # be a bit faster, for forward-recurrent models like LSTMs.]\n\n# .. these context options also affect decoding.\nextra_left_context=0    # amount of left-context per eg, past what is required by the model\n                        # (only useful for recurrent networks like LSTMs/BLSTMs)\nextra_right_context=0   # amount of right-context per eg, past what is required by the model\n                        # (only useful for backwards-recurrent networks like BLSTMs)\nextra_left_context_initial=-1    # if >= 0, the --extra-left-context to use at\n                                 # the start of utterances.  Recommend 0 if you\n                                 # used 0 for the baseline DNN training; if <0,\n                                 # defaults to same as extra_left_context\nextra_right_context_final=-1     # if >= 0, the --extra-right-context to use at\n                                 # the end of utterances.  
Recommend 0 if you\n                                 # used 0 for the baseline DNN training; if <0,\n                                 # defaults to same as extra_right_context\n\ncompress=true   # set this to false to disable lossy compression of features\n                # dumped with egs (e.g. if you want to see whether results are\n                # affected).\n\nnum_utts_subset=80     # number of utterances in validation and training\n                       # subsets used for diagnostics.\nnum_egs_subset=800     # number of egs (maximum) for the validation and training\n                       # subsets used for diagnostics.\nframes_per_iter=1000000 # each iteration of training, see this many frames\n                        # per job.  This is just a guideline; it will pick a number\n                        # that divides the number of samples in the entire data.\ncleanup=true\n\nstage=0\nnj=200\n\n# By default this script uses final.mdl in <srcdir>, this configures it.\niter=final\n\n\n# decoding-graph option\nself_loop_scale=0.1  # for decoding graph.. should be 1.0 for chain models.\n\n# options relating to decoding.\nframes_per_chunk_decoding=150\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\nmin_active=200\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\nnum_threads=1\n\n# affects whether we invoke lattice-determinize-non-compact after decoding\n# discriminative-get-supervision.\ndeterminize_before_split=true\n\n\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 5 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <src-dir> <ali-dir> <degs-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/nnet3/tdnn_a exp/nnet3/tdnn_a_ali exp/nnet3/tdnn_a_degs\"\n  echo \"\"\n  echo \"For options, see top of script file.  Standard options:\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs (probably would be good to add --max-jobs-run 5 or so if using\"\n  echo \"                                                   # GridEngine (to avoid excessive NFS traffic).\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --online-ivector-dir <dir|\"\">                    # Directory for online-estimated iVectors, used in the\"\n  echo \"                                                   # online-neural-net setup.\"\n  echo \"  --nj <nj|200>                                    # number of jobs to submit to the queue.\"\n  echo \"  --num-threads <n|1>                              # number of threads per decoding job\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\nalidir=$4\ndir=$5\n\n\nextra_files=\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$extra_files $online_ivector_dir/ivector_period $online_ivector_dir/ivector_online.scp\"\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $lang/phones/silence.csl $srcdir/${iter}.mdl $srcdir/tree \\\n      $srcdir/cmvn_opts $alidir/ali.1.gz $alidir/num_jobs $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log $dir/info || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n\n\nutils/split_data.sh --per-utt $data $nj\nsdata=$data/split${nj}utt\n\n\n## Set up features.\necho \"$0: feature type is raw\"\n\n\ncmvn_opts=$(cat $srcdir/cmvn_opts) || exit 1\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\ncp $srcdir/{splice_opts,cmvn_opts} $dir 2>/dev/null || true\n\n## set iVector options\nif [ ! -z \"$online_ivector_dir\" ]; then\n  online_ivector_period=$(cat $online_ivector_dir/ivector_period)\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$online_ivector_period\"\nfi\n\n## set frame-subsampling-factor option and copy file\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $srcdir/frame_subsampling_factor) || exit 1\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$frame_subsampling_factor\"\n  cp $srcdir/frame_subsampling_factor $dir\n  if [ $frame_subsampling_factor -ne 1 ] && [ \"$self_loop_scale\" == \"0.1\" ]; then\n    echo \"$0: warning: frame_subsampling_factor is not 1 (so likely a chain system),\"\n    echo \"...  but self-loop-scale is 0.1.  Make sure this is not a mistake.\"\n    sleep 1\n  fi\nelse\n  frame_subsampling_factor=1\nfi\n\nif [ \"$self_loop_scale\" == \"1.0\" ] && [ \"$acwt\" == 0.1 ]; then\n  echo \"$0: warning: you set --self-loop-scale=1.0 (so likely a chain system)\",\n  echo \" ... 
but the acwt is still 0.1 (you probably want --acwt 1.0)\"\n  sleep 1\nfi\n\n## Make the decoding graph.\nif [ $stage -le 0 ]; then\n  new_lang=\"$dir/\"$(basename \"$lang\")\n  rm -r $new_lang 2>/dev/null\n  cp -rH $lang $dir\n  echo \"$0: Making unigram grammar FST in $new_lang\"\n  oov=$(cat $lang/oov.txt)\n  cat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n   awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n    utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > $new_lang/G.fst \\\n    || exit 1;\n\n  utils/mkgraph.sh --self-loop-scale $self_loop_scale $new_lang $srcdir $dir/dengraph || exit 1;\nfi\n\n# copy alignments into ark,scp format which allows us to use different num-jobs\n# from the alignment, and is also convenient for getting priors.\nif [ $stage -le 1 ]; then\n  echo \"$0: Copying input alignments\"\n  nj_ali=$(cat $alidir/num_jobs)\n  alis=$(for n in $(seq $nj_ali); do echo -n \"$alidir/ali.$n.gz \"; done)\n  $cmd $dir/log/copy_alignments.log \\\n     copy-int-vector \"ark:gunzip -c $alis|\" \\\n     ark,scp:$dir/ali.ark,$dir/ali.scp || exit 1;\nfi\n\n[ -f $dir/ali.scp ] || { echo \"$0: expected $dir/ali.scp to exist\"; exit 1; }\n\nif [ $stage -le 2 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\n  echo \"$0: working out feature dim\"\n  feats_one=\"$(echo $feats | sed s:JOB:1:g)\"\n  if feat_dim=$(feat-to-dim \"$feats_one\" - 2>/dev/null); then\n    echo $feat_dim > $dir/info/feat_dim\n  else # run without stderr redirection to show the error.\n    feat-to-dim \"$feats_one\" -; exit 1\n  fi\nelse\n  num_frames=$(cat $dir/info/num_frames)\nfi\nif ! 
[ \"$num_frames\" -gt 0 ]; then\n  echo \"$0: bad num-frames=$num_frames\"; exit 1\nfi\n\n# copy the model to the degs directory.\ncp $srcdir/${iter}.mdl $dir/final.mdl || exit 1\n\n# Create some info in $dir/info\n\n# Work out total number of archives. Add one on the assumption the\n# num-frames won't divide exactly, and we want to round up.\nnum_archives=$[num_frames/frames_per_iter+1]\n\necho $num_archives >$dir/info/num_archives\necho $frame_subsampling_factor >$dir/info/frame_subsampling_factor\ncp $lang/phones/silence.csl $dir/info/\n\n# the first field in frames_per_eg (which is a comma-separated list of numbers)\n# is the 'principal' frames-per-eg, and for purposes of working out the number\n# of archives we assume that this will be the average number of frames per eg.\nframes_per_eg_principal=$(echo $frames_per_eg | cut -d, -f1)\n\n\n# read 'mof' as max_open_filehandles.\n# When splitting up the scp files, we don't want to have to hold too many\n# files open at once.  If the number of archives we have to write exceeds\n# 256 (or less if ulimit -n is smaller), we split in two stages.\nmof=$(ulimit -n) || exit 1\n# the next step helps work around inconsistency between different machines on a\n# cluster.  It's unlikely that the allowed number of open filehandles would ever\n# be less than 256.\nif [ $mof -gt 256 ]; then mof=256; fi\n# allocate mof minus 3 for the max allowed outputs, because of\n# stdin,stderr,stdout.  this will normally come to 253.  
We'll do a two-stage\n# splitting if the needed number of scp files is larger than this.\nnum_groups=$[(num_archives+(mof-3)-1)/(mof-3)]\ngroup_size=$[(num_archives+num_groups-1)/num_groups]\nif [ $num_groups -gt 1 ]; then\n  new_num_archives=$[group_size*num_groups]\n  [ $new_num_archives -ne $num_archives ] && \\\n    echo \"$0: rounding up num-archives from $num_archives to $new_num_archives for easier splitting\"\n  num_archives=$new_num_archives\n  echo $new_num_archives >$dir/info/num_archives\nfi\n\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/degs.$x.ark; done)\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/degs.$x.scp; done)\n  utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/degs_orig.$y.ark; done)\n  utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/degs_orig.$y.scp; done)\n  utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/degs_orig_filtered.$y.scp; done)\nfi\n\n\nextra_context_opts=\"--extra-left-context=$extra_left_context --extra-right-context=$extra_right_context --extra-left-context-initial=$extra_left_context_initial --extra-right-context-final=$extra_right_context_final\"\n\n# work out absolute context opts, --left-context and so on [need model context]\nmodel_left_context=$(nnet3-am-info $srcdir/${iter}.mdl | grep \"^left-context:\" | awk '{print $2}')\nmodel_right_context=$(nnet3-am-info $srcdir/${iter}.mdl | grep \"^right-context:\" | awk '{print $2}')\nleft_context=$[model_left_context+extra_left_context+frame_subsampling_factor/2]\nright_context=$[model_right_context+extra_right_context+frame_subsampling_factor/2]\ncontext_opts=\"--left-context=$left_context --right-context=$right_context\"\nif [ $extra_left_context_initial -ge 0 ]; then\n  
left_context_initial=$[model_left_context+extra_left_context_initial+frame_subsampling_factor/2]\n  context_opts=\"$context_opts --left-context-initial=$left_context_initial\"\nfi\nif [ $extra_right_context_final -ge 0 ]; then\n  right_context_final=$[model_right_context+extra_right_context_final+frame_subsampling_factor/2]\n  context_opts=\"$context_opts --right-context-final=$right_context_final\"\nfi\n\n##\nif [ $num_threads -eq 1 ]; then\n  if $looped; then\n    decoder=\"nnet3-latgen-faster-looped\"\n    [ $extra_left_context_initial -ge 0 ] && \\\n      decoder=\"$decoder --extra-left-context-initial=$extra_left_context_initial\"\n  else\n    decoder=\"nnet3-latgen-faster $extra_context_opts\"\n  fi\n  threads_cmd_opt=\nelse\n  $looped && { echo \"$0: --num-threads must be one if you use looped decoding\"; exit 1; }\n  threads_cmd_opt=\"--num-threads $num_threads\"\n  decoder=\"nnet3-latgen-faster-parallel --num-threads=$num_threads $extra_context_opts\"\n  true\nfi\n\n# set the command to determinize lattices, if specified.\nif $determinize_before_split; then\n  lattice_determinize_cmd=\"lattice-determinize-non-compact --acoustic-scale=$acwt --max-mem=$max_mem --minimize=true --prune=true --beam=$lattice_beam ark:- ark:-\"\nelse\n  lattice_determinize_cmd=\"cat\"\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: decoding and dumping egs\"\n  $cmd $threads_cmd_opt JOB=1:$nj $dir/log/decode_and_get_egs.JOB.log \\\n     $decoder \\\n     $ivector_opts $frame_subsampling_opt \\\n    --frames-per-chunk=$frames_per_chunk_decoding \\\n    --determinize-lattice=false \\\n    --max-active=$max_active --min-active=$min_active --beam=$beam \\\n    --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=false \\\n    --word-symbol-table=$lang/words.txt $dir/final.mdl  \\\n    $dir/dengraph/HCLG.fst \"$feats\" ark:- \\| \\\n    $lattice_determinize_cmd  \\| \\\n    nnet3-discriminative-get-egs --acoustic-scale=$acwt --compress=$compress \\\n      
$frame_subsampling_opt --num-frames=$frames_per_eg \\\n      --num-frames-overlap=$frames_overlap_per_eg \\\n      $ivector_opts $context_opts \\\n      $dir/final.mdl \"$feats\"  \"ark,s,cs:-\" \\\n      \"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $dir/ali.scp |\" \\\n      ark,scp:$dir/degs_orig.JOB.ark,$dir/degs_orig.JOB.scp || exit 1\nfi\n\n\nif [ $stage -le 4 ]; then\n  echo \"$0: getting validation utterances.\"\n\n  ## Get list of validation utterances.\n  awk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n   > $dir/valid_uttlist || exit 1;\n\n  if [ -f $data/utt2uniq ]; then  # this matters if you use data augmentation.\n    echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n    echo \"include all perturbed versions of the same 'real' utterances.\"\n    mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n    utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n    cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n      sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n      awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n    rm $dir/uniq2utt $dir/valid_uttlist.tmp\n  fi\n\n  # the following awk statement turns 'foo123' into something like\n  # '^foo123-[0-9]\\+ ' which is a grep expression that matches the lines in the\n  # .scp file that correspond to an utterance in valid_uttlist.\n  cat $dir/valid_uttlist | awk '{printf(\"^%s-[0-9]\\\\+ \\n\", $1);}' \\\n     >$dir/valid_uttlist.regexps || exit 1\n\n  # remove the validation utterances from deg_orig.*.scp to produce\n  # degs_orig_filtered.*.scp.\n  # note: the '||' true is in case the grep returns nonzero status for\n  # some splits, because they were all validation utterances.\n  $cmd JOB=1:$nj $dir/log/filter_and_shuffle.JOB.log \\\n     grep -v -f $dir/valid_uttlist.regexps $dir/degs_orig.JOB.scp '>' \\\n     $dir/degs_orig_filtered.JOB.scp '||' true || exit 1\n\n  # extract just the validation utterances 
from degs_orig.*.scp to produce\n  # degs_valid.*.scp.\n  $cmd JOB=1:$nj $dir/log/extract_validation_egs.JOB.log \\\n    grep -f $dir/valid_uttlist.regexps $dir/degs_orig.JOB.scp '>' \\\n    $dir/degs_valid.JOB.scp '||' true || exit 1\n\n  for j in $(seq $nj); do\n    cat $dir/degs_valid.$j.scp; rm $dir/degs_valid.$j.scp;\n  done | utils/shuffle_list.pl | head -n$num_utts_subset >$dir/valid_diagnostic.scp || exit 1\n\n  [ -s $dir/valid_diagnostic.scp ] || { echo \"$0: error getting validation egs\"; exit 1; }\nfi\n\n\n\n# function/pseudo-command to randomly shuffle input lines using a small buffer size\nfunction shuffle {\n    perl -e ' use List::Util qw(shuffle); srand(0);\n       $bufsz=1000; @A = (); while(<STDIN>) { push @A, $_; if (@A == $bufsz) {\n       $n=int(rand()*$bufsz); print $A[$n]; $A[$n] = $A[$bufsz-1]; pop @A; }}\n       @A = shuffle(@A); print @A; '\n}\n# function/pseudo-command to put input lines round robin to command line args.\nfunction round_robin {\n  perl -e '@F=(); foreach $a (@ARGV) { my $f; open($f, \">$a\") || die \"opening file $a\"; push @F, $f; }\n         $N=@F; $N>0||die \"No output files\"; $n=0;\n         while (<STDIN>) { $fh=$F[$n%$N]; $n++; print $fh $_ || die \"error printing\"; } ' $*\n}\n\n\nif [ $stage -le 5 ]; then\n  echo \"$0: rearranging scp files\"\n\n  if [ $num_groups -eq 1 ]; then\n    # output directly to the archive files.\n    outputs=$(for n in $(seq $num_archives); do echo $dir/degs.$n.scp; done)\n  else\n    # output to intermediate 'group' files.\n    outputs=$(for g in $(seq $num_groups); do echo $dir/degs_group.$g.scp; done)\n  fi\n\n  # We can't use UNIX's split command because of compatibility issues (BSD\n  # version very different from GNU version), so we use 'round_robin' which is\n  # a bash function that calls an inline perl script.\n  for j in $(seq $nj); do cat $dir/degs_orig_filtered.$j.scp; done | \\\n    shuffle | round_robin $outputs || exit 1\n\n  if [ $num_groups -gt 1 ]; then\n    for g in 
$(seq $num_groups); do\n      first=$[1+group_size*(g-1)]\n      last=$[group_size*g]\n      outputs=$(for n in $(seq $first $last); do echo $dir/degs.$n.scp; done)\n      cat $dir/degs_group.$g.scp | shuffle | round_robin $outputs\n    done\n  fi\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: getting train-subset scp\"\n  # get degs_train_subset.scp by taking the top and tail of the degs files [quicker\n  # than cat'ing all the files, random shuffling and head]\n\n  nl=$[$num_egs_subset/$num_archives + 1]\n\n  # use utils/shuffle_list.pl because it provides a complete shuffle (ok since\n  # the amount of data is small).  note: shuf is not available on mac by\n  # default.\n  for n in $(seq $num_archives); do\n    head -n$nl $dir/degs.$n.scp;  tail -n$nl $dir/degs.$n.scp\n  done  | utils/shuffle_list.pl | head -n$num_utts_subset >$dir/train_diagnostic.scp\n  [ -s $dir/train_diagnostic.scp ] || { echo \"$0: error getting train_diagnostic.scp\"; exit 1; }\nfi\n\nif [ $stage -le 7 ]; then\n  echo \"$0: creating final archives\"\n  $cmd --max-jobs-run \"$max_copy_jobs\" \\\n     JOB=1:$num_archives $dir/log/copy_archives.JOB.log \\\n     nnet3-discriminative-copy-egs scp:$dir/degs.JOB.scp ark:$dir/degs.JOB.ark || exit 1\n\n  run.pl $dir/log/copy_train_subset.log \\\n      nnet3-discriminative-copy-egs scp:$dir/train_diagnostic.scp \\\n         ark:$dir/train_diagnostic.degs  || exit 1\n\n  run.pl $dir/log/copy_valid_subset.log \\\n      nnet3-discriminative-copy-egs scp:$dir/valid_diagnostic.scp \\\n         ark:$dir/valid_diagnostic.degs  || exit 1\nfi\n\nif [ $stage -le 10 ] && $cleanup; then\n  echo \"$0: cleaning up temporary files.\"\n  for j in $(seq $nj); do\n    for f in $dir/degs_orig.$j.{ark,scp} $dir/degs_orig_filtered.$j.scp; do\n      [ -L $f ] && rm $(utils/make_absolute.sh $f); rm $f\n    done\n  done\n  rm $dir/degs_group.*.scp $dir/valid_diagnostic.scp $dir/train_diagnostic.scp 2>/dev/null\n  rm $dir/ali.ark $dir/ali.scp 2>/dev/null\n  for n in $(seq 
$num_archives); do\n    for f in $dir/degs.$n.scp; do\n      [ -L $f ] && rm $(utils/make_absolute.sh $f); rm $f\n    done\n  done\nfi\n\n\necho \"$0: Finished decoding and preparing training examples\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/get_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the neural net (and also\n# the validation examples used for diagnostics), and puts them in separate archives.\n#\n# This script dumps egs with several frames of labels, controlled by the\n# frames_per_eg config variable (default: 8).  This takes many times less disk\n# space because typically we have 4 to 7 frames of context on the left and\n# right, and this ends up getting shared.  This is at the expense of slightly\n# higher disk I/O while training.\n\nset -o pipefail\ntrap \"\" PIPE\n\n# Begin configuration section.\ncmd=run.pl\nframe_subsampling_factor=1\nframes_per_eg=8   # number of frames of labels per example.  more->less disk space and\n                  # less time preparing egs, but more I/O during training.\n                  # Note: may in general be a comma-separated string of alternative\n                  # durations (more useful when using large chunks, e.g. for BLSTMs);\n                  # the first one (the principal num-frames) is preferred.\nleft_context=4    # amount of left-context per eg (i.e. extra frames of input features\n                  # not present in the output supervision).\nright_context=4   # amount of right-context per eg.\nleft_context_initial=-1    # if >=0, left-context for first chunk of an utterance\nright_context_final=-1     # if >=0, right-context for last chunk of an utterance\ncompress=true   # set this to false to disable compression (e.g. 
if you want to see whether\n                # results are affected).\n\nnum_utts_subset=300     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\nnum_valid_frames_combine=0 # #valid frames for combination weights at the very end.\nnum_train_frames_combine=60000 # # train frames for the above.\nnum_frames_diagnostic=10000 # number of frames for \"compute_prob\" jobs\nsamples_per_iter=400000 # this is the target number of egs in each archive of egs\n                        # (prior to merging egs).  We probably should have called\n                        # it egs_per_iter. This is just a guideline; it will pick\n                        # a number that divides the number of samples in the\n                        # entire data.\n\nstage=0\nnj=6         # This should be set to the maximum number of jobs you are\n             # comfortable to run in parallel; you can increase it if your disk\n             # speed is greater and you have more machines.\nsrand=0     # rand seed for nnet3-copy-egs and nnet3-shuffle-egs\nonline_ivector_dir=  # can be used if we are including speaker information as iVectors.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  
This is used to turn off CMVN in the online-nnet experiments.\nonline_cmvn=false # Set to 'true' to replace 'apply-cmvn' by 'apply-cmvn-online' in the nnet3 input.\n                  # The configuration is passed externally via '$cmvn_opts' given to train.py,\n                  # typically as: --cmvn-opts=\"--config conf/online_cmvn.conf\".\n                  # The global_cmvn.stats are computed by this script from the features.\n                  # Note: the online cmvn for ivector extractor it is controlled separately in\n                  #       steps/online/nnet2/train_ivector_extractor.sh by --online-cmvn-iextractor\n\ngenerate_egs_scp=false # If true, it will generate egs.JOB.*.scp per egs archive\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <data> <ali-dir> <egs-dir>\"\n  echo \" e.g.: $0 data/train exp/tri3_ali exp/tri4_nnet/egs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --nj <nj>                                        # The maximum number of jobs you want to run in\"\n  echo \"                                                   # parallel (increase this only if you have good disk and\"\n  echo \"                                                   # network speed).  
default=6\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --samples-per-iter <#samples;400000>             # Target number of egs per archive (option is badly named)\"\n  echo \"  --frames-per-eg <frames;8>                       # number of frames per eg on disk\"\n  echo \"                                                   # May be either a single number or a comma-separated list\"\n  echo \"                                                   # of alternatives (useful when training LSTMs, where the\"\n  echo \"                                                   # frames-per-eg is the chunk size, to get variety of chunk\"\n  echo \"                                                   # sizes).  The first in the list is preferred and is used\"\n  echo \"                                                   # when working out the number of archives etc.\"\n  echo \"  --left-context <int;4>                           # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <int;4>                          # Number of frames on right side to append for feature input\"\n  echo \"  --left-context-initial <int;-1>                  # If >= 0, left-context for first chunk of an utterance\"\n  echo \"  --right-context-final <int;-1>                   # If >= 0, right-context for last chunk of an utterance\"\n  echo \"  --num-frames-diagnostic <#frames;10000>          # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames;0>           # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nalidir=$2\ndir=$3\n\n# Check some 
files.\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $alidir/ali.1.gz $alidir/final.mdl $alidir/tree $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log $dir/info\ncp $alidir/tree $dir\n\nnum_ali_jobs=$(cat $alidir/num_jobs) || exit 1;\n\n\nnum_utts=$(cat $data/utt2spk | wc -l)\nif ! [ $num_utts -gt $[$num_utts_subset*4] ]; then\n  echo \"$0: number of utterances $num_utts in your training data is too small versus --num-utts-subset=$num_utts_subset\"\n  echo \"... you probably have so little data that it doesn't make sense to train a neural net.\"\n  exit 1\nfi\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl 2>/dev/null | head -$num_utts_subset \\\n    > $dir/valid_uttlist\n\nif [ -f $data/utt2uniq ]; then  # this matters if you use data augmentation.\n  echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n  rm $dir/uniq2utt $dir/valid_uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl 2>/dev/null | head -$num_utts_subset > $dir/train_subset_uttlist\n\necho \"$0: creating egs.  
To ensure they are not deleted later you can do:  touch $dir/.nodelete\"\n\n## Set up features.\n\n# get the global_cmvn stats for online-cmvn,\nif $online_cmvn; then\n  # create global_cmvn.stats\n  #\n  # caution: the top-level nnet training script should copy\n  # 'global_cmvn.stats' and 'online_cmvn' to its own dir.\n  if ! matrix-sum --binary=false scp:$data/cmvn.scp - >$dir/global_cmvn.stats 2>/dev/null; then\n    echo \"$0: Error summing cmvn stats\"\n    exit 1\n  fi\n  touch $dir/online_cmvn\nelse\n  [ -f $dir/online_cmvn ] && rm $dir/online_cmvn\nfi\n\n# create the feature pipelines,\nif ! $online_cmvn; then\n  # the original front-end with 'apply-cmvn',\n  echo \"$0: feature type is raw, with 'apply-cmvn'\"\n  feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\nelse\n  # the alternative front-end with 'apply-cmvn-online',\n  # - the $cmvn_opts can be set to '--config=conf/online_cmvn.conf' which is the setup of ivector-extractor,\n  echo \"$0: feature type is raw, with 'apply-cmvn-online'\"\n  feats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt  $dir/global_cmvn.stats scp:- ark:- |\"\n  valid_spk2utt=\"ark:utils/filter_scp.pl $dir/valid_uttlist $data/utt2spk | utils/utt2spk_to_spk2utt.pl |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=\\\"$valid_spk2utt\\\" $dir/global_cmvn.stats scp:- ark:- |\"\n  
train_subset_spk2utt=\"ark:utils/filter_scp.pl $dir/train_subset_uttlist $data/utt2spk | utils/utt2spk_to_spk2utt.pl |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn-online $cmvn_opts --spk2utt=\\\"$train_subset_spk2utt\\\" $dir/global_cmvn.stats scp:- ark:- |\"\nfi\necho $cmvn_opts >$dir/cmvn_opts # caution: the top-level nnet training script should copy this to its own dir now.\n\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  echo $ivector_dim > $dir/info/ivector_dim\n  steps/nnet2/get_ivector_id.sh $online_ivector_dir > $dir/info/final.ie.id || exit 1\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nelse\n  ivector_opts=\"\"\n  echo 0 >$dir/info/ivector_dim\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\n  echo \"$0: working out feature dim\"\n  feats_one=\"$(echo $feats | sed s/JOB/1/g)\"\n  if feat_dim=$(feat-to-dim \"$feats_one\" - 2>/dev/null); then\n    echo $feat_dim > $dir/info/feat_dim\n  else # run without redirection to show the error.\n    feat-to-dim \"$feats_one\" -; exit 1\n  fi\nelse\n  num_frames=$(cat $dir/info/num_frames) || exit 1;\n  feat_dim=$(cat $dir/info/feat_dim) || exit 1;\nfi\n\n\n# the first field in frames_per_eg (which is a comma-separated list of numbers)\n# is the 'principal' frames-per-eg, and for purposes of working out the number\n# of archives we assume that this will be the average number of frames per eg.\nframes_per_eg_principal=$(echo $frames_per_eg | cut -d, -f1)\n\n# the + 1 is to round up, not down... 
we assume it doesn't divide exactly.\nnum_archives=$[$num_frames/($frames_per_eg_principal*$samples_per_iter)+1]\nif [ $num_archives -eq 1 ]; then\n  echo \"*** $0: warning: the --frames-per-eg is too large to generate one archive with\"\n  echo \"*** as many as --samples-per-iter egs in it.  Consider reducing --frames-per-eg.\"\n  sleep 4\nfi\n\n# We may have to first create a smaller number of larger archives, with number\n# $num_archives_intermediate, if $num_archives is more than the maximum number\n# of open filehandles that the system allows per process (ulimit -n).\n# This sometimes gives a misleading answer as GridEngine sometimes changes that\n# somehow, so we limit it to 512.\nmax_open_filehandles=$(ulimit -n) || exit 1\n[ $max_open_filehandles -gt 512 ] && max_open_filehandles=512\nnum_archives_intermediate=$num_archives\narchives_multiple=1\nwhile [ $[$num_archives_intermediate+4] -gt $max_open_filehandles ]; do\n  archives_multiple=$[$archives_multiple+1]\n  num_archives_intermediate=$[$num_archives/$archives_multiple+1];\ndone\n# now make sure num_archives is an exact multiple of archives_multiple.\nnum_archives=$[$archives_multiple*$num_archives_intermediate]\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n# Work out the number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg_principal*$num_archives)]\n! [ $egs_per_archive -le $samples_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= samples_per_iter=$samples_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\nif [ $left_context_initial -ge 0 ] || [ $right_context_final -ge 0 ]; then\n  echo \"$0:   ... 
and (left-context-initial,right-context-final) = ($left_context_initial,$right_context_final)\"\nfi\n\n\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/egs.$x.ark; done)\n  for x in $(seq $num_archives_intermediate); do\n    utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/egs_orig.$y.$x.ark; done)\n  done\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: copying data alignments\"\n  for id in $(seq $num_ali_jobs); do gunzip -c $alidir/ali.$id.gz; done | \\\n    copy-int-vector ark:- ark,scp:$dir/ali.ark,$dir/ali.scp || exit 1;\nfi\n\negs_opts=\"--left-context=$left_context --right-context=$right_context --compress=$compress --num-frames=$frames_per_eg\"\n[ $left_context_initial -ge 0 ] && egs_opts=\"$egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && egs_opts=\"$egs_opts --right-context-final=$right_context_final\"\n\necho $left_context > $dir/info/left_context\necho $right_context > $dir/info/right_context\necho $left_context_initial > $dir/info/left_context_initial\necho $right_context_final > $dir/info/right_context_final\n\n\nnum_pdfs=$(tree-info --print-args=false $alidir/tree | grep num-pdfs | awk '{print $2}')\nif [ $stage -le 3 ]; then\n  echo \"$0: Getting validation and training subset examples.\"\n  rm $dir/.error 2>/dev/null\n  echo \"$0: ... 
extracting validation and training-subset alignments.\"\n\n\n  # do the filtering just once, as ali.scp may be long.\n  utils/filter_scp.pl <(cat $dir/valid_uttlist $dir/train_subset_uttlist) \\\n    <$dir/ali.scp >$dir/ali_special.scp\n\n  $cmd $dir/log/create_valid_subset.log \\\n    utils/filter_scp.pl $dir/valid_uttlist $dir/ali_special.scp \\| \\\n    ali-to-pdf $alidir/final.mdl scp:- ark:- \\| \\\n    ali-to-post ark:- ark:- \\| \\\n    nnet3-get-egs --num-pdfs=$num_pdfs --frame-subsampling-factor=$frame_subsampling_factor \\\n      $ivector_opts $egs_opts \"$valid_feats\" \\\n      ark,s,cs:- \"ark:$dir/valid_all.egs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    utils/filter_scp.pl $dir/train_subset_uttlist $dir/ali_special.scp \\| \\\n    ali-to-pdf $alidir/final.mdl scp:- ark:- \\| \\\n    ali-to-post ark:- ark:- \\| \\\n    nnet3-get-egs --num-pdfs=$num_pdfs --frame-subsampling-factor=$frame_subsampling_factor \\\n      $ivector_opts $egs_opts \"$train_subset_feats\" \\\n      ark,s,cs:- \"ark:$dir/train_subset_all.egs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && echo \"Error detected while creating train/valid egs\" && exit 1\n  echo \"... 
Getting subsets of validation examples for diagnostics and combination.\"\n  if $generate_egs_scp; then\n    valid_diagnostic_output=\"ark,scp:$dir/valid_diagnostic.egs,$dir/valid_diagnostic.scp\"\n    train_diagnostic_output=\"ark,scp:$dir/train_diagnostic.egs,$dir/train_diagnostic.scp\"\n  else\n    valid_diagnostic_output=\"ark:$dir/valid_diagnostic.egs\"\n    train_diagnostic_output=\"ark:$dir/train_diagnostic.egs\"\n  fi\n  $cmd $dir/log/create_valid_subset_combine.log \\\n    nnet3-subset-egs --n=$[$num_valid_frames_combine/$frames_per_eg_principal] ark:$dir/valid_all.egs \\\n      ark:$dir/valid_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet3-subset-egs --n=$[$num_frames_diagnostic/$frames_per_eg_principal] ark:$dir/valid_all.egs \\\n    $valid_diagnostic_output || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet3-subset-egs --n=$[$num_train_frames_combine/$frames_per_eg_principal] ark:$dir/train_subset_all.egs \\\n      ark:$dir/train_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet3-subset-egs --n=$[$num_frames_diagnostic/$frames_per_eg_principal] ark:$dir/train_subset_all.egs \\\n    $train_diagnostic_output || touch $dir/.error &\n  wait\n  sleep 5  # wait for file system to sync.\n  cat $dir/valid_combine.egs $dir/train_combine.egs > $dir/combine.egs\n  if $generate_egs_scp; then\n    cat $dir/valid_combine.egs $dir/train_combine.egs  | \\\n    nnet3-copy-egs ark:- ark,scp:$dir/combine.egs,$dir/combine.scp\n    rm $dir/{train,valid}_combine.scp\n  else\n    cat $dir/valid_combine.egs $dir/train_combine.egs > $dir/combine.egs\n  fi\n  for f in $dir/{combine,train_diagnostic,valid_diagnostic}.egs; do\n    [ ! 
-s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n  rm $dir/valid_all.egs $dir/train_subset_all.egs $dir/{train,valid}_combine.egs\nfi\n\nif [ $stage -le 4 ]; then\n  # create egs_orig.*.*.ark; the first index goes to $nj,\n  # the second to $num_archives_intermediate.\n\n  egs_list=\n  for n in $(seq $num_archives_intermediate); do\n    egs_list=\"$egs_list ark:$dir/egs_orig.JOB.$n.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n  # The examples will go round-robin to egs_list.\n  $cmd JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet3-get-egs --num-pdfs=$num_pdfs --frame-subsampling-factor=$frame_subsampling_factor \\\n    $ivector_opts $egs_opts \"$feats\" \\\n    \"ark,s,cs:filter_scp.pl $sdata/JOB/utt2spk $dir/ali.scp | ali-to-pdf $alidir/final.mdl scp:- ark:- | ali-to-post ark:- ark:- |\" ark:- \\| \\\n    nnet3-copy-egs --random=true --srand=\\$[JOB+$srand] ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"egs_orig.*.JOB.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the egs.JOB.ark\n\n  # the input is a concatenation over the input jobs.\n  egs_list=\n  for n in $(seq $nj); do\n    egs_list=\"$egs_list $dir/egs_orig.$n.JOB.ark\"\n  done\n\n  if [ $archives_multiple == 1 ]; then # normal case.\n    if $generate_egs_scp; then\n      output_archive=\"ark,scp:$dir/egs.JOB.ark,$dir/egs.JOB.scp\"\n    else\n      output_archive=\"ark:$dir/egs.JOB.ark\"\n    fi\n    $cmd --max-jobs-run $nj JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-shuffle-egs --srand=\\$[JOB+$srand] \"ark:cat $egs_list|\" $output_archive  || exit 1;\n\n    if $generate_egs_scp; then\n      #concatenate egs.JOB.scp in single egs.scp\n      rm $dir/egs.scp 2> /dev/null || true\n      for j in $(seq $num_archives_intermediate); do\n        cat $dir/egs.$j.scp || exit 1;\n      done > $dir/egs.scp || exit 
1;\n      for f in $dir/egs.*.scp; do rm $f; done\n    fi\n  else\n    # we need to shuffle the 'intermediate archives' and then split into the\n    # final archives.  we create soft links to manage this splitting, because\n    # otherwise managing the output names is quite difficult (and we don't want\n    # to submit separate queue jobs for each intermediate archive, because then\n    # the --max-jobs-run option is hard to enforce).\n    if $generate_egs_scp; then\n      output_archives=\"$(for y in $(seq $archives_multiple); do echo ark,scp:$dir/egs.JOB.$y.ark,$dir/egs.JOB.$y.scp; done)\"\n    else\n      output_archives=\"$(for y in $(seq $archives_multiple); do echo ark:$dir/egs.JOB.$y.ark; done)\"\n    fi\n    for x in $(seq $num_archives_intermediate); do\n      for y in $(seq $archives_multiple); do\n        archive_index=$[($x-1)*$archives_multiple+$y]\n        # egs.intermediate_archive.{1,2,...}.ark will point to egs.archive.ark\n        ln -sf egs.$archive_index.ark $dir/egs.$x.$y.ark || exit 1\n      done\n    done\n    $cmd --max-jobs-run $nj JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-shuffle-egs --srand=\\$[JOB+$srand] \"ark:cat $egs_list|\" ark:- \\| \\\n      nnet3-copy-egs ark:- $output_archives || exit 1;\n\n    if $generate_egs_scp; then\n      #concatenate egs.JOB.*.scp in single egs.scp\n      rm $dir/egs.scp 2> /dev/null || true\n      for j in $(seq $num_archives_intermediate); do\n        # the second index runs over the $archives_multiple outputs of each job.\n        for y in $(seq $archives_multiple); do\n          cat $dir/egs.$j.$y.scp || exit 1;\n        done\n      done > $dir/egs.scp || exit 1;\n      for f in $dir/egs.*.*.scp; do rm $f; done\n    fi\n  fi\nfi\n\nif [ $frame_subsampling_factor -ne 1 ]; then\n  echo $frame_subsampling_factor > $dir/info/frame_subsampling_factor\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: removing temporary archives\"\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_intermediate); do\n      file=$dir/egs_orig.$x.$y.ark\n      [ -L 
$file ] && rm $(utils/make_absolute.sh $file)\n      rm $file\n    done\n  done\n  if [ $archives_multiple -gt 1 ]; then\n    # there are some extra soft links that we should delete.\n    for f in $dir/egs.*.*.ark; do rm $f; done\n  fi\n  echo \"$0: removing temporary alignments\"\n  # Ignore errors below because trans.* might not exist.\n  rm $dir/ali.{ark,scp} 2>/dev/null\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet3/get_egs_discriminative.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2016   Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright 2014-2015   Vimal Manohar\n\n# Note: you may find it more convenient to use the newer script get_degs.sh, which\n# combines decoding and example-creation in one step without writing lattices.\n\n# This script dumps examples MPE or MMI or state-level minimum bayes risk (sMBR)\n# training of neural nets.\n# Criterion supported are mpe, smbr and mmi\n\n# Begin configuration section.\ncmd=run.pl\nframes_per_eg=150 # number of frames of labels per example.  more->less disk space and\n                  # less time preparing egs, but more I/O during training.\n                  # Note: may in general be a comma-separated string of alternative\n                  # durations; the first one (the principal num-frames) is preferred.\nframes_overlap_per_eg=30 # number of supervised frames of overlap that we aim for per eg.\n                  # can be useful to avoid wasted data if you're using --left-deriv-truncate\n                  # and --right-deriv-truncate.\nframe_subsampling_factor=1 # ratio between input and output frame-rate of nnet.\n                           # this should be read from the nnet. For now, it is taken as an option\nleft_context=4    # amount of left-context per eg (i.e. extra frames of input features\n                  # not present in the output supervision).\nright_context=4   # amount of right-context per eg.\nleft_context_initial=-1    # if >=0, left-context for first chunk of an utterance\nright_context_final=-1     # if >=0, right-context for last chunk of an utterance\nadjust_priors=true\ncompress=true   # set this to false to disable compression (e.g. 
if you want to see whether\n                # results are affected).\nnum_utts_subset=80     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\n\nframes_per_iter=400000 # each iteration of training, see this many frames\n                       # per job.  This is just a guideline; it will pick a number\n                       # that divides the number of samples in the entire data.\n\nacwt=0.1\n\nstage=0\nmax_jobs_run=15\nmax_shuffle_jobs_run=15\n\nonline_ivector_dir=\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  This is used to turn off CMVN in the online-nnet experiments.\n\nnum_priors_subset=1000  #  number of utterances used to calibrate the per-state\n                        #  priors.  Note: these don't have to be held out from\n                        #  the training data.\nnum_archives_priors=10\n\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <denlat-dir> <src-model-file> <degs-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet_denlats exp/tri4/final.mdl exp/tri4_mpe/degs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs (probably would be good to add --max-jobs-run 5 or so if using\"\n  echo \"                                                   # GridEngine (to avoid excessive NFS traffic).\"\n  echo \"  --frames-per-iter <#frames|400000>               # Number of frames of data to process per iteration, per\"\n  echo \"                                                   # job.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --online-ivector-dir <dir|\"\">                    # Directory for online-estimated iVectors, used in the\"\n  echo \"                                                   # online-neural-net setup.\"\n  echo \"  --left-context <int;4>                           # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <int;4>                          # Number of frames on right side to append for feature input\"\n  echo \"  --left-context-initial <int;-1>                  # If >= 0, left-context for first chunk of an utterance\"\n  echo \"  --right-context-final <int;-1>                   # If >= 0, right-context for last chunk of an utterance\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\nsrc_model=$5\ndir=$6\n\nextra_files=\n[ ! 
-z $online_ivector_dir ] && \\\n  extra_files=\"$online_ivector_dir/ivector_period $online_ivector_dir/ivector_online.scp\"\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/num_jobs $alidir/tree \\\n         $denlatdir/lat.1.gz $denlatdir/num_jobs $src_model $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log $dir/info || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nnj=$(cat $denlatdir/num_jobs) || exit 1;\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/valid_uttlist || exit 1;\n\nif [ -f $data/utt2uniq ]; then  # this matters if you use data augmentation.\n  echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n  rm $dir/uniq2utt $dir/valid_uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl | head -$num_utts_subset > $dir/train_subset_uttlist || exit 1;\n\nif [ $stage -le 1 ]; then\n  nj_ali=$(cat $alidir/num_jobs)\n  alis=$(for n in $(seq $nj_ali); do echo -n \"$alidir/ali.$n.gz \"; done)\n  $cmd $dir/log/copy_alignments.log \\\n    copy-int-vector \"ark:gunzip -c $alis|\" \\\n    ark,scp:$dir/ali.ark,$dir/ali.scp || exit 1;\nfi\n\nprior_ali_rspecifier=\"ark,s,cs:utils/filter_scp.pl $dir/priors_uttlist $dir/ali.scp | ali-to-pdf $alidir/final.mdl scp:- ark:- 
|\"\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\ncp $alidir/tree $dir\ncp $lang/phones/silence.csl $dir/info/\ncp $src_model $dir/final.mdl || exit 1\n\n# Get list of utterances for prior computation.\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n  utils/shuffle_list.pl | head -$num_priors_subset \\\n  > $dir/priors_uttlist || exit 1;\n\nfeats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\nvalid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\ntrain_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\npriors_feats=\"ark,s,cs:utils/filter_scp.pl $dir/priors_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\necho $cmvn_opts > $dir/cmvn_opts\n\nif [ ! 
-z $online_ivector_dir ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period)\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  echo $ivector_dim >$dir/info/ivector_dim\n  steps/nnet2/get_ivector_id.sh $online_ivector_dir > $dir/info/final.ie.id || exit 1\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nelse\n  ivector_opts=\"\"\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\n  echo \"$0: working out feature dim\"\n  feats_one=\"$(echo $feats | sed s:JOB:1:g)\"\n  if feat_dim=$(feat-to-dim \"$feats_one\" - 2>/dev/null); then\n    echo $feat_dim > $dir/info/feat_dim\n  else # run without stderr redirection to show the error.\n    feat-to-dim \"$feats_one\" -; exit 1\n  fi\nfi\n\n# Work out total number of archives. Add one on the assumption the\n# num-frames won't divide exactly, and we want to round up.\nnum_archives=$[$num_frames/$frames_per_iter+1]\n\n# We may have to first create a smaller number of larger archives, with number\n# $num_archives_intermediate, if $num_archives is more than the maximum number\n# of open filehandles that the system allows per process (ulimit -n).\nmax_open_filehandles=$(ulimit -n) || exit 1\nnum_archives_intermediate=$num_archives\narchives_multiple=1\nwhile [ $[$num_archives_intermediate+4] -gt $max_open_filehandles ]; do\n  archives_multiple=$[$archives_multiple+1]\n  num_archives_intermediate=$[$num_archives/$archives_multiple] || exit 1;\ndone\n# now make sure num_archives is an exact multiple of archives_multiple.\nnum_archives=$[$archives_multiple*$num_archives_intermediate] || exit 1;\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n\n# the first field in frames_per_eg (which is a comma-separated list of numbers)\n# is 
the 'principal' frames-per-eg, and for purposes of working out the number\n# of archives we assume that this will be the average number of frames per eg.\nframes_per_eg_principal=$(echo $frames_per_eg | cut -d, -f1)\n\n# Work out the number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg_principal*$num_archives)] || exit 1;\n! [ $egs_per_archive -le $frames_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= frames_per_iter=$frames_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\nif [ $left_context_initial -ge 0 ] || [ $right_context_final -ge 0 ]; then\n  echo \"$0:   ... and (left-context-initial,right-context-final) = ($left_context_initial,$right_context_final)\"\nfi\n\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  
See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/degs.$x.ark; done)\n  for x in $(seq $num_archives_intermediate); do\n    utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/degs_orig.$y.$x.ark; done)\n  done\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: copying training lattices\"\n\n  $cmd --max-jobs-run 6 JOB=1:$nj $dir/log/lattice_copy.JOB.log \\\n    lattice-copy --write-compact=false --include=\"cat $dir/valid_uttlist $dir/train_subset_uttlist |\" --ignore-missing \\\n    \"ark:gunzip -c $denlatdir/lat.JOB.gz|\" ark,scp:$dir/lat_special.JOB.ark,$dir/lat_special.JOB.scp || exit 1;\n\n  for id in $(seq $nj); do cat $dir/lat_special.$id.scp; done > $dir/lat_special.scp\nfi\n\n\n\n# If frame_subsampling_factor > 1, we will later be shifting the egs slightly to\n# the left or right as part of training, so we see (e.g.) all shifts of the data\n# modulo 3... we need to extend the l/r context slightly to account for this, to\n# ensure we see the entire context that the model requires.\nleft_context=$[left_context+frame_subsampling_factor/2]\nright_context=$[right_context+frame_subsampling_factor/2]\n[ $left_context_initial -ge 0 ] && left_context_initial=$[left_context_initial+frame_subsampling_factor/2]\n[ $right_context_final -ge 0 ] && right_context_final=$[right_context_final+frame_subsampling_factor/2]\n\negs_opts=\"--left-context=$left_context --right-context=$right_context --num-frames=$frames_per_eg --compress=$compress --frame-subsampling-factor=$frame_subsampling_factor --acoustic-scale=$acwt\"\n[ $left_context_initial -ge 0 ] && egs_opts=\"$egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && egs_opts=\"$egs_opts --right-context-final=$right_context_final\"\n\n\n# don't do the overlap thing for the priors computation data-- but do use the\n# same num-frames for the eg, which would be much more efficient in case 
it's a\n# recurrent model and has a lot of frames of context.  In any case we're not\n# doing SGD so there is no benefit in having short chunks.\npriors_egs_opts=\"--left-context=$left_context --right-context=$right_context --num-frames=$frames_per_eg --compress=$compress\"\n[ $left_context_initial -ge 0 ] && priors_egs_opts=\"$priors_egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && priors_egs_opts=\"$priors_egs_opts --right-context-final=$right_context_final\"\n\n\necho $left_context > $dir/info/left_context\necho $right_context > $dir/info/right_context\necho $left_context_initial > $dir/info/left_context_initial\necho $right_context_final > $dir/info/right_context_final\n\necho $frame_subsampling_factor > $dir/info/frame_subsampling_factor\n\n\nif [ \"$frame_subsampling_factor\" != 1 ]; then\n  if $adjust_priors; then\n    echo \"$0: setting --adjust-priors false since adjusting priors is not supported (and does not make sense) for chain models\"\n    adjust_priors=false\n  fi\nfi\n\n(\n  if $adjust_priors && [ $stage -le 10 ]; then\n    if [ ! 
-f $dir/ali.scp ]; then\n      nj_ali=$(cat $alidir/num_jobs)\n      alis=$(for n in $(seq $nj_ali); do echo -n \"$alidir/ali.$n.gz \"; done)\n      $cmd $dir/log/copy_alignments.log \\\n        copy-int-vector \"ark:gunzip -c $alis|\" \\\n        ark,scp:$dir/ali.ark,$dir/ali.scp || exit 1;\n    fi\n\n    priors_egs_list=\n    for y in `seq $num_archives_priors`; do\n      utils/create_data_link.pl $dir/priors_egs.$y.ark\n      priors_egs_list=\"$priors_egs_list ark:$dir/priors_egs.$y.ark\"\n    done\n\n    echo \"$0: dumping egs for prior adjustment in the background.\"\n\n    num_pdfs=`am-info $alidir/final.mdl | grep pdfs | awk '{print $NF}' 2>/dev/null` || exit 1\n\n    $cmd $dir/log/create_priors_subset.log \\\n      nnet3-get-egs --num-pdfs=$num_pdfs $ivector_opts $priors_egs_opts \"$priors_feats\" \\\n      \"$prior_ali_rspecifier ali-to-post ark:- ark:- |\" \\\n      ark:- \\| nnet3-copy-egs ark:- $priors_egs_list || \\\n      { touch $dir/.error; echo \"Error in creating priors subset. See $dir/log/create_priors_subset.log\"; exit 1; }\n\n    sleep 3;\n\n    echo $num_archives_priors >$dir/info/num_archives_priors\n  else\n    echo 0 > $dir/info/num_archives_priors\n  fi\n) &\n\nif [ $stage -le 4 ]; then\n  echo \"$0: Getting validation and training subset examples.\"\n  rm $dir/.error 2>/dev/null\n  echo \"$0: ... 
extracting validation and training-subset alignments.\"\n\n  #utils/filter_scp.pl <(cat $dir/valid_uttlist $dir/train_subset_uttlist) \\\n  #  <$dir/lat.scp >$dir/lat_special.scp\n\n  utils/filter_scp.pl <(cat $dir/valid_uttlist $dir/train_subset_uttlist) \\\n    <$dir/ali.scp >$dir/ali_special.scp\n\n  $cmd $dir/log/create_valid_subset.log \\\n    nnet3-discriminative-get-egs $ivector_opts $egs_opts \\\n    $dir/final.mdl \"$valid_feats\" scp:$dir/lat_special.scp \\\n    scp:$dir/ali_special.scp \"ark:$dir/valid_diagnostic.degs\" || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset.log \\\n    nnet3-discriminative-get-egs $ivector_opts $egs_opts \\\n    $dir/final.mdl \"$train_subset_feats\" scp:$dir/lat_special.scp \\\n    scp:$dir/ali_special.scp  \"ark:$dir/train_diagnostic.degs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && echo \"Error detected while creating train/valid egs\" && exit 1\n  echo \"... Getting subsets of validation examples for diagnostics and combination.\"\n\n  for f in $dir/{train_diagnostic,valid_diagnostic}.degs; do\n    [ ! 
-s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\nfi\n\nif [ $stage -le 5 ]; then\n  # create degs_orig.*.*.ark; the first index goes to $nj,\n  # the second to $num_archives_intermediate.\n\n  degs_list=\n  for n in $(seq $num_archives_intermediate); do\n    degs_list=\"$degs_list ark:$dir/degs_orig.JOB.$n.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n\n  # The examples will go round-robin to degs_list.\n  # To make it efficient we need to use a large 'nj', like 40, and in that case\n  # there can be too many small files to deal with, because the total number of\n  # files is the product of 'nj' by 'num_archives_intermediate', which might be\n  # quite large.\n  $cmd --max-jobs-run $max_jobs_run JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet3-discriminative-get-egs $ivector_opts $egs_opts \\\n      --num-frames-overlap=$frames_overlap_per_eg \\\n      $dir/final.mdl \"$feats\" \"ark,s,cs:gunzip -c $denlatdir/lat.JOB.gz |\" \\\n      \"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $dir/ali.scp |\" ark:- \\| \\\n    nnet3-discriminative-copy-egs --random=true --srand=JOB ark:- $degs_list || exit 1;\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"degs_orig.*.JOB.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the degs.JOB.ark\n\n  # the input is a concatenation over the input jobs.\n  degs_list=\n  for n in $(seq $nj); do\n    degs_list=\"$degs_list $dir/degs_orig.$n.JOB.ark\"\n  done\n\n  if [ $archives_multiple == 1 ]; then # normal case.\n    $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-discriminative-shuffle-egs --srand=JOB \"ark:cat $degs_list|\" ark:$dir/degs.JOB.ark  || exit 1;\n  else\n    # we need to shuffle the 'intermediate archives' and then split into the\n    # final archives.  
we create soft links to manage this splitting, because\n    # otherwise managing the output names is quite difficult (and we don't want\n    # to submit separate queue jobs for each intermediate archive, because then\n    # the --max-jobs-run option is hard to enforce).\n    output_archives=$(for y in $(seq $archives_multiple); do echo -n \"ark:$dir/degs.JOB.$y.ark \"; done)\n    for x in $(seq $num_archives_intermediate); do\n      for y in $(seq $archives_multiple); do\n        archive_index=$[($x-1)*$archives_multiple+$y]\n        # degs.intermediate_archive.{1,2,...}.ark will point to degs.archive.ark\n        ln -sf degs.$archive_index.ark $dir/degs.$x.$y.ark || exit 1\n      done\n    done\n    $cmd --max-jobs-run $max_shuffle_jobs_run --mem 8G JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-discriminative-shuffle-egs --srand=JOB \"ark:cat $degs_list|\" ark:- \\| \\\n      nnet3-discriminative-copy-egs ark:- $output_archives || exit 1;\n  fi\nfi\n\nif [ $stage -le 7 ]; then\n  echo \"$0: removing temporary archives\"\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_intermediate); do\n      file=$dir/degs_orig.$x.$y.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file)\n      rm $file\n    done\n  done\n  if [ $archives_multiple -gt 1 ]; then\n    # there are some extra soft links that we should delete.\n    for f in $dir/degs.*.*.ark; do rm $f; done\n  fi\n  echo \"$0: removing temporary lattices\"\n  rm $dir/lat.*\n  echo \"$0: removing temporary alignments\"\n  rm $dir/ali.{ark,scp} 2>/dev/null\nfi\n\nwait\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet3/get_egs_targets.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2015 Johns Hopkins University (Author: Daniel Povey).\n#           2015-2016 Vimal Manohar\n# Apache 2.0.\n\n# This script is similar to steps/nnet3/get_egs.sh but used\n# when getting general targets (not from alignment directory) for raw nnet\n#\n# This script, which will generally be called from other neural-net training\n# scripts, extracts the training examples used to train the neural net (and also\n# the validation examples used for diagnostics), and puts them in separate archives.\n#\n# This script dumps egs with several frames of labels, controlled by the\n# frames_per_eg config variable (default: 8).  This takes many times less disk\n# space because typically we have 4 to 7 frames of context on the left and\n# right, and this ends up getting shared.  This is at the expense of slightly\n# higher disk I/O while training.\n\nset -o pipefail\ntrap \"\" PIPE\n\n# Begin configuration section.\ncmd=run.pl\ntarget_type=sparse  # dense to have dense targets,\n                    # sparse to have posteriors targets\nnum_targets=        # required for target-type=sparse with raw nnet\nframe_subsampling_factor=1\nlength_tolerance=2\nframes_per_eg=8   # number of frames of labels per example.  more->less disk space and\n                  # less time preparing egs, but more I/O during training.\n                  # Note: may in general be a comma-separated string of alternative\n                  # durations (more useful when using large chunks, e.g. for BLSTMs);\n                  # the first one (the principal num-frames) is preferred.\nleft_context=4    # amount of left-context per eg (i.e. 
extra frames of input features\n                  # not present in the output supervision).\nright_context=4   # amount of right-context per eg.\nleft_context_initial=-1    # if >=0, left-context for first chunk of an utterance\nright_context_final=-1     # if >=0, right-context for last chunk of an utterance\ncompress=true   # set this to false to disable compression (e.g. if you want to see whether\n                # results are affected).\nnum_utts_subset=300     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\nnum_utts_subset_valid=  # number of utterances in validation\n                        # subsets used for shrinkage and diagnostics\n                        # if provided, overrides num-utts-subset\nnum_utts_subset_train=  # number of utterances in training\n                        # subsets used for shrinkage and diagnostics.\n                        # if provided, overrides num-utts-subset\nnum_valid_frames_combine=0 # #valid frames for combination weights at the very end.\nnum_train_frames_combine=60000 # # train frames for the above.\nnum_frames_diagnostic=10000 # number of frames for \"compute_prob\" jobs\nsamples_per_iter=400000 # this is the target number of egs in each archive of egs\n                        # (prior to merging egs).  We probably should have called\n                        # it egs_per_iter. 
This is just a guideline; it will pick\n                        # a number that divides the number of samples in the\n                        # entire data.\n\nstage=0\nnj=6         # This should be set to the maximum number of jobs you are\n             # comfortable to run in parallel; you can increase it if your disk\n             # speed is greater and you have more machines.\nsrand=0\nonline_ivector_dir=  # can be used if we are including speaker information as iVectors.\ncmvn_opts=  # can be used for specifying CMVN options, if feature type is not lda (if lda,\n            # it doesn't make sense to use different options than were used as input to the\n            # LDA transform).  This is used to turn off CMVN in the online-nnet experiments.\ngenerate_egs_scp=false # If true, it will generate egs.JOB.*.scp per egs archive\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <data> <targets-scp> <egs-dir>\"\n  echo \" e.g.: $0 data/train data/train/snr_targets.scp exp/tri4_nnet/egs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --nj <nj>                                        # The maximum number of jobs you want to run in\"\n  echo \"                                                   # parallel (increase this only if you have good disk and\"\n  echo \"                                                   # network speed).  
default=6\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --samples-per-iter <#samples;400000>             # Target number of egs per archive (option is badly named)\"\n  echo \"  --frames-per-eg <frames;8>                       # number of frames per eg on disk\"\n  echo \"                                                   # May be either a single number or a comma-separated list\"\n  echo \"                                                   # of alternatives (useful when training LSTMs, where the\"\n  echo \"                                                   # frames-per-eg is the chunk size, to get variety of chunk\"\n  echo \"                                                   # sizes).  The first in the list is preferred and is used\"\n  echo \"                                                   # when working out the number of archives etc.\"\n  echo \"  --left-context <int;4>                           # Number of frames on left side to append for feature input\"\n  echo \"  --right-context <int;4>                          # Number of frames on right side to append for feature input\"\n  echo \"  --left-context-initial <int;-1>                  # If >= 0, left-context for first chunk of an utterance\"\n  echo \"  --right-context-final <int;-1>                   # If >= 0, right-context for last chunk of an utterance\"\n  echo \"  --num-frames-diagnostic <#frames;10000>          # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames;0>           # Number of validation frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\ntargets_scp=$2\ndir=$3\n\n# Check some 
files.\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\n\nfor f in $data/feats.scp $targets_scp $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log $dir/info\n\n[ -z \"$num_utts_subset_valid\" ] && num_utts_subset_valid=$num_utts_subset\n[ -z \"$num_utts_subset_train\" ] && num_utts_subset_train=$num_utts_subset\n\nnum_utts=$(cat $data/utt2spk | wc -l)\nif ! [ $num_utts -gt $[$num_utts_subset_valid*4] ]; then\n  echo \"$0: number of utterances $num_utts in your training data is too small versus --num-utts-subset=$num_utts_subset\"\n  echo \"... you probably have so little data that it doesn't make sense to train a neural net.\"\n  exit 1\nfi\n\n# Get list of validation utterances.\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl 2>/dev/null | head -$num_utts_subset_valid | sort \\\n    > $dir/valid_uttlist\n\nif [ -f $data/utt2uniq ]; then  # this matters if you use data augmentation.\n  echo \"File $data/utt2uniq exists, so augmenting valid_uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid_uttlist $dir/valid_uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid_uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid_uttlist\n  rm $dir/uniq2utt $dir/valid_uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid_uttlist | \\\n   utils/shuffle_list.pl 2>/dev/null | head -$num_utts_subset_train | sort > $dir/train_subset_uttlist\n\n## Set up features.\necho \"$0: feature type is raw\"\n\nfeats=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $sdata/JOB/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk 
scp:$sdata/JOB/cmvn.scp scp:- ark:- |\"\nvalid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\ntrain_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $data/feats.scp | apply-cmvn $cmvn_opts --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp scp:- ark:- |\"\necho $cmvn_opts >$dir/cmvn_opts # caution: the top-level nnet training script should copy this to its own dir now.\n\nif [ ! -z \"$online_ivector_dir\" ]; then\n  steps/nnet2/get_ivector_id.sh $online_ivector_dir > $dir/info/final.ie.id || exit 1\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1\n  echo $ivector_dim > $dir/info/ivector_dim\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nelse\n  ivector_opts=\"\"\n  echo 0 >$dir/info/ivector_dim\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\n  echo \"$0: working out feature dim\"\n  feats_one=\"$(echo $feats | sed s:JOB:1:g)\"\n  if feat_dim=$(feat-to-dim \"$feats_one\" - 2>/dev/null); then\n    echo $feat_dim > $dir/info/feat_dim\n  else # run without stderr redirection to show the error.\n    feat-to-dim \"$feats_one\" -; exit 1\n  fi\nelse\n  num_frames=$(cat $dir/info/num_frames) || exit 1;\n  feat_dim=$(cat $dir/info/feat_dim) || exit 1;\nfi\n\n\n# the first field in frames_per_eg (which is a comma-separated list of numbers)\n# is the 'principal' frames-per-eg, and for purposes of working out the number\n# of archives we assume that this will be the average number of frames per eg.\nframes_per_eg_principal=$(echo $frames_per_eg | cut -d, -f1)\n\n# the + 1 is to round up, not down... 
we assume it doesn't divide exactly.\nnum_archives=$[$num_frames/($frames_per_eg_principal*$samples_per_iter)+1]\nif [ $num_archives -eq 1 ]; then\n  echo \"*** $0: warning: the --frames-per-eg is too large to generate one archive with\"\n  echo \"*** as many as --samples-per-iter egs in it.  Consider reducing --frames-per-eg.\"\n  sleep 4\nfi\n\n# We may have to first create a smaller number of larger archives, with number\n# $num_archives_intermediate, if $num_archives is more than the maximum number\n# of open filehandles that the system allows per process (ulimit -n).\n# This sometimes gives a misleading answer as GridEngine sometimes changes the\n# limit, so we limit it to 512.\nmax_open_filehandles=$(ulimit -n) || exit 1\n[ $max_open_filehandles -gt 512 ] && max_open_filehandles=512\nnum_archives_intermediate=$num_archives\narchives_multiple=1\nwhile [ $[$num_archives_intermediate+4] -gt $max_open_filehandles ]; do\n  archives_multiple=$[$archives_multiple+1]\n  num_archives_intermediate=$[$num_archives/$archives_multiple+1];\ndone\n# now make sure num_archives is an exact multiple of archives_multiple.\nnum_archives=$[$archives_multiple*$num_archives_intermediate]\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n# Work out the number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg_principal*$num_archives)]\n! [ $egs_per_archive -le $samples_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= samples_per_iter=$samples_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\nif [ $left_context_initial -ge 0 ] || [ $right_context_final -ge 0 ]; then\n  echo \"$0:   ... 
and (left-context-initial,right-context-final) = ($left_context_initial,$right_context_final)\"\nfi\n\n\n\nif [ -e $dir/storage ]; then\n  # Make soft links to storage directories, if distributing this way..  See\n  # utils/create_split_dir.pl.\n  echo \"$0: creating data links\"\n  utils/create_data_link.pl $(for x in $(seq $num_archives); do echo $dir/egs.$x.ark; done)\n  for x in $(seq $num_archives_intermediate); do\n    utils/create_data_link.pl $(for y in $(seq $nj); do echo $dir/egs_orig.$y.$x.ark; done)\n  done\nfi\n\negs_opts=\"--left-context=$left_context --right-context=$right_context --compress=$compress --num-frames=$frames_per_eg\"\n[ $left_context_initial -ge 0 ] && egs_opts=\"$egs_opts --left-context-initial=$left_context_initial\"\n[ $right_context_final -ge 0 ] && egs_opts=\"$egs_opts --right-context-final=$right_context_final\"\n\necho $left_context > $dir/info/left_context\necho $right_context > $dir/info/right_context\necho $left_context_initial > $dir/info/left_context_initial\necho $right_context_final > $dir/info/right_context_final\n\nfor n in `seq $nj`; do\n  utils/filter_scp.pl $sdata/$n/utt2spk $targets_scp > $dir/targets.$n.scp\ndone\n\ntargets_scp_split=$dir/targets.JOB.scp\n\nif [ $target_type == \"dense\" ]; then\n  num_targets=$(feat-to-dim \"scp:$targets_scp\" - 2>/dev/null) || exit 1\nfi\n\nif [ -z \"$num_targets\" ]; then\n  echo \"$0: num-targets is not set\"\n  exit 1\nfi\n\ncase $target_type in\n  \"dense\")\n    get_egs_program=\"nnet3-get-egs-dense-targets --num-targets=$num_targets\"\n    targets=\"scp,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist $targets_scp_split |\"\n    valid_targets=\"scp,s,cs:utils/filter_scp.pl $dir/valid_uttlist $targets_scp |\"\n    train_subset_targets=\"scp,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $targets_scp |\"\n    ;;\n  \"sparse\")\n    get_egs_program=\"nnet3-get-egs --num-pdfs=$num_targets\"\n    targets=\"ark,s,cs:utils/filter_scp.pl --exclude $dir/valid_uttlist 
$targets_scp_split | ali-to-post scp:- ark:- |\"\n    valid_targets=\"ark,s,cs:utils/filter_scp.pl $dir/valid_uttlist $targets_scp | ali-to-post scp:- ark:- |\"\n    train_subset_targets=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset_uttlist $targets_scp | ali-to-post scp:- ark:- |\"\n    ;;\n  *)\n    echo \"$0: Unknown --target-type $target_type. Choices are dense and sparse\"\n    exit 1\nesac\n\nif [ $stage -le 3 ]; then\n  echo \"$0: Getting validation and training subset examples.\"\n  rm -f $dir/.error 2>/dev/null\n  $cmd $dir/log/create_valid_subset.log \\\n    $get_egs_program --frame-subsampling-factor=$frame_subsampling_factor \\\n    --length-tolerance=$length_tolerance \\\n    $ivector_opts $egs_opts \"$valid_feats\" \\\n    \"$valid_targets\" \\\n    \"ark:$dir/valid_all.egs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    $get_egs_program --frame-subsampling-factor=$frame_subsampling_factor \\\n    --length-tolerance=$length_tolerance \\\n    $ivector_opts $egs_opts \"$train_subset_feats\" \\\n    \"$train_subset_targets\" \\\n    \"ark:$dir/train_subset_all.egs\" || touch $dir/.error &\n  wait;\n\n  [ -f $dir/.error ] && echo \"Error detected while creating train/valid egs\" && exit 1\n  echo \"... 
Getting subsets of validation examples for diagnostics and combination.\"\n  if $generate_egs_scp; then\n    valid_diagnostic_output=\"ark,scp:$dir/valid_diagnostic.egs,$dir/valid_diagnostic.scp\"\n    train_diagnostic_output=\"ark,scp:$dir/train_diagnostic.egs,$dir/train_diagnostic.scp\"\n  else\n    valid_diagnostic_output=\"ark:$dir/valid_diagnostic.egs\"\n    train_diagnostic_output=\"ark:$dir/train_diagnostic.egs\"\n  fi\n  $cmd $dir/log/create_valid_subset_combine.log \\\n    nnet3-subset-egs --n=$[$num_valid_frames_combine/$frames_per_eg_principal] ark:$dir/valid_all.egs \\\n    ark:$dir/valid_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet3-subset-egs --n=$[$num_frames_diagnostic/$frames_per_eg_principal] ark:$dir/valid_all.egs \\\n    $valid_diagnostic_output || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet3-subset-egs --n=$[$num_train_frames_combine/$frames_per_eg_principal] ark:$dir/train_subset_all.egs \\\n    ark:$dir/train_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet3-subset-egs --n=$[$num_frames_diagnostic/$frames_per_eg_principal] ark:$dir/train_subset_all.egs \\\n    $train_diagnostic_output || touch $dir/.error &\n  wait\n  sleep 5  # wait for file system to sync.\n  cat $dir/valid_combine.egs $dir/train_combine.egs > $dir/combine.egs\n  if $generate_egs_scp; then\n    cat $dir/valid_combine.egs $dir/train_combine.egs  | \\\n    nnet3-copy-egs ark:- ark,scp:$dir/combine.egs,$dir/combine.scp\n    rm $dir/{train,valid}_combine.scp\n  else\n    cat $dir/valid_combine.egs $dir/train_combine.egs > $dir/combine.egs\n  fi\n  for f in $dir/{combine,train_diagnostic,valid_diagnostic}.egs; do\n    [ ! 
-s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n  rm $dir/valid_all.egs $dir/train_subset_all.egs $dir/{train,valid}_combine.egs\nfi\n\nif [ $stage -le 4 ]; then\n  # create egs_orig.*.*.ark; the first index goes to $nj,\n  # the second to $num_archives_intermediate.\n\n  egs_list=\n  for n in $(seq $num_archives_intermediate); do\n    egs_list=\"$egs_list ark:$dir/egs_orig.JOB.$n.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n  # The examples will go round-robin to egs_list.\n  $cmd JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    $get_egs_program --frame-subsampling-factor=$frame_subsampling_factor \\\n    --length-tolerance=$length_tolerance \\\n    $ivector_opts $egs_opts \"$feats\" \"$targets\" \\\n    ark:- \\| \\\n    nnet3-copy-egs --random=true --srand=\\$[JOB+$srand] ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"egs_orig.*.JOB.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the egs.JOB.ark\n\n  # the input is a concatenation over the input jobs.\n  egs_list=\n  for n in $(seq $nj); do\n    egs_list=\"$egs_list $dir/egs_orig.$n.JOB.ark\"\n  done\n\n  if [ $archives_multiple == 1 ]; then # normal case.\n    if $generate_egs_scp; then\n      output_archive=\"ark,scp:$dir/egs.JOB.ark,$dir/egs.JOB.scp\"\n    else\n      output_archive=\"ark:$dir/egs.JOB.ark\"\n    fi\n    $cmd --max-jobs-run $nj JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-shuffle-egs --srand=\\$[JOB+$srand] \"ark:cat $egs_list|\" $output_archive  || exit 1;\n\n    if $generate_egs_scp; then\n      #concatenate egs.JOB.scp in single egs.scp\n      rm $dir/egs.scp 2> /dev/null || true\n      for j in $(seq $num_archives_intermediate); do\n        cat $dir/egs.$j.scp || exit 1;\n      done > $dir/egs.scp || exit 1;\n      for f in $dir/egs.*.scp; do rm $f; done\n    fi\n  else\n    # we need to shuffle 
the 'intermediate archives' and then split into the\n    # final archives.  we create soft links to manage this splitting, because\n    # otherwise managing the output names is quite difficult (and we don't want\n    # to submit separate queue jobs for each intermediate archive, because then\n    # the --max-jobs-run option is hard to enforce).\n    if $generate_egs_scp; then\n      output_archives=\"$(for y in $(seq $archives_multiple); do echo ark,scp:$dir/egs.JOB.$y.ark,$dir/egs.JOB.$y.scp; done)\"\n    else\n      output_archives=\"$(for y in $(seq $archives_multiple); do echo ark:$dir/egs.JOB.$y.ark; done)\"\n    fi\n    for x in $(seq $num_archives_intermediate); do\n      for y in $(seq $archives_multiple); do\n        archive_index=$[($x-1)*$archives_multiple+$y]\n        # egs.intermediate_archive.{1,2,...}.ark will point to egs.archive.ark\n        ln -sf egs.$archive_index.ark $dir/egs.$x.$y.ark || exit 1\n      done\n    done\n    $cmd --max-jobs-run $nj JOB=1:$num_archives_intermediate $dir/log/shuffle.JOB.log \\\n      nnet3-shuffle-egs --srand=\\$[JOB+$srand] \"ark:cat $egs_list|\" ark:- \\| \\\n      nnet3-copy-egs ark:- $output_archives || exit 1;\n\n    if $generate_egs_scp; then\n      # concatenate the egs.JOB.*.scp files into a single egs.scp; the second\n      # index runs over $archives_multiple, matching $output_archives above.\n      rm $dir/egs.scp 2> /dev/null || true\n      for j in $(seq $num_archives_intermediate); do\n        for y in $(seq $archives_multiple); do\n          cat $dir/egs.$j.$y.scp || exit 1;\n        done\n      done > $dir/egs.scp || exit 1;\n      for f in $dir/egs.*.*.scp; do rm $f; done\n    fi\n  fi\nfi\n\nif [ $frame_subsampling_factor -ne 1 ]; then\n  echo $frame_subsampling_factor > $dir/info/frame_subsampling_factor\nfi\n\nwait\n\nif [ $stage -le 6 ]; then\n  echo \"$0: removing temporary archives\"\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_intermediate); do\n      file=$dir/egs_orig.$x.$y.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file)\n      rm $file\n    done\n  done\n  
if [ $archives_multiple -gt 1 ]; then\n    # there are some extra soft links that we should delete.\n    for f in $dir/egs.*.*.ark; do rm $f; done\n  fi\n  echo \"$0: removing temporary stuff\"\n  rm -f $dir/targets.*.scp 2>/dev/null\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/nnet3/get_saturation.pl",
    "content": "#!/usr/bin/env perl\n\n# This program parses the output of nnet3-am-info or nnet3-info,\n# and prints out a number between zero and one that reflects\n# how saturated the (sigmoid and tanh) nonlinearities are, on average\n# over the model.\n#\n# This is based on the 'avg-deriv' (average-derivative) values printed\n# out for the sigmoid and tanh components.  The 'saturation' of such a component\n# is defined as (1.0 - its avg-deriv / the maximum possible derivative of that nonlinearity),\n# where the denominator is 1.0 for tanh and 0.25 for sigmoid.\n# This component averages the saturation over all the sigmoid/tanh units in\n# the network.\n#\n# It parses the Info() output of components of type SigmoidComponent,\n# TanhComponent, and LstmNonlinearityComponent.  It prints an error message to\n# stderr and returns with status 1 if it could not find the info for any such components\n# in the input stream.\n\n# Usage: nnet3-am-info 10.mdl | steps/nnet3/get_saturation.pl\n# or: nnet3-info 10.raw | steps/nnet3/get_saturation.pl\n\nuse warnings;\n\nmy $num_nonlinearities = 0;\nmy $total_saturation = 0.0;\n\nwhile (<STDIN>) {\n  if (m/type=SigmoidComponent/) {\n    # a line like:\n    # component name=Lstm1_f type=SigmoidComponent, dim=1280, count=5.02e+05,\n    # value-avg=[percentiles(0,1,2,5 10,20,50,80,90\n    # 95,98,99,100)=(0.06,0.17,0.19,0.24 0.28,0.33,0.44,0.62,0.79\n    # 0.96,0.99,1.0,1.0), mean=0.482, stddev=0.198],\n    # deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90\n    # 95,98,99,100)=(0.0001,0.003,0.004,0.03 0.12,0.18,0.22,0.24,0.25\n    # 0.25,0.25,0.25,0.25), mean=0.198, stddev=0.0591]\n    if (m/deriv-avg=[^m]+mean=([^,]+),/) {\n      $num_nonlinearities += 1;\n      my $this_saturation = 1.0 - ($1 / 0.25);\n      $total_saturation += $this_saturation;\n    } else {\n      print STDERR \"$0: could not make sense of line (no deriv-avg?): $_\";\n    }\n  } elsif (m/type=TanhComponent/) {\n    if (m/deriv-avg=[^m]+mean=([^,]+),/) {\n      
$num_nonlinearities += 1;\n      my $this_saturation = 1.0 - ($1 / 1.0);\n      $total_saturation += $this_saturation;\n    } else {\n      print STDERR \"$0: could not make sense of line (no deriv-avg?): $_\";\n    }\n  } elsif (m/type=LstmNonlinearityComponent/) {\n    # An example of a line like this is right at the bottom of this program, it's extremely long.\n    my $ok = 1;\n    foreach my $sigmoid_name ( (\"i_t\", \"f_t\", \"o_t\") ) {\n      if (m/${sigmoid_name}_sigmoid=[{][^}]+deriv-avg=[^}]+mean=([^,]+),/) {\n        $num_nonlinearities += 1;\n        my $this_saturation = 1.0 - ($1 / 0.25);\n        $total_saturation += $this_saturation;\n      } else {\n        $ok = 0;\n      }\n    }\n    foreach my $tanh_name ( (\"c_t\", \"m_t\") ) {\n      if (m/${tanh_name}_tanh=[{][^}]+deriv-avg=[^}]+mean=([^,]+),/) {\n        $num_nonlinearities += 1;\n        my $this_saturation = 1.0 - ($1 / 1.0);\n        $total_saturation += $this_saturation;\n      } else {\n        $ok = 0;\n      }\n    }\n    if (! 
$ok) {\n      print STDERR \"Could not parse at least one of the avg-deriv values in the following info line: $_\";\n    }\n  } elsif (m/type=.*GruNonlinearityComponent/) {\n    if (m/deriv-avg=[^m]+mean=([^,]+),/) {\n      $num_nonlinearities += 1;\n      my $this_saturation = 1.0 - ($1 / 1.0);\n      $total_saturation += $this_saturation;\n    } else {\n      print STDERR \"$0: could not make sense of line (no deriv-avg?): $_\";\n    }\n  }\n}\n\n\nif ($num_nonlinearities == 0) {\n  print \"0.0\\n\";\n  exit(0);\n} else {\n  my $saturation = $total_saturation / $num_nonlinearities;\n  if ($saturation < 0.0 || $saturation > 1.0) {\n    print STDERR \"Bad saturation value: $saturation\\n\";\n    exit(1);\n  } else {\n    print \"$saturation\\n\";\n  }\n}\n\n\n\n\n# example line with LstmNonlinearityComponent that we parse:\n# component name=lstm2.lstm_nonlin type=LstmNonlinearityComponent, input-dim=2560, output-dim=1024, learning-rate=0.002, max-change=0.75, cell-dim=512, w_ic-rms=0.9941, w_fc-rms=0.8901, w_oc-rms=0.9794, count=3.53e+05, i_t_sigmoid={ self-repair-lower-threshold=0.05, self-repair-scale=1e-05, self-repaired-proportion=0.0722299, value-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.04,0.08,0.09,0.12 0.17,0.25,0.46,0.76,0.87 0.91,0.96,0.96,1.0), mean=0.494, stddev=0.253], deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.0007,0.03,0.04,0.06 0.09,0.12,0.19,0.23,0.24 0.25,0.25,0.25,0.25), mean=0.179, stddev=0.0595] }, f_t_sigmoid={ self-repair-lower-threshold=0.05, self-repair-scale=1e-05, self-repaired-proportion=0.0688061, value-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.06,0.11,0.13,0.17 0.22,0.30,0.51,0.70,0.82 0.90,0.96,0.98,1.0), mean=0.509, stddev=0.219], deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.001,0.01,0.03,0.07 0.11,0.15,0.21,0.24,0.25 0.25,0.25,0.25,0.25), mean=0.194, stddev=0.0561] }, c_t_tanh={ self-repair-lower-threshold=0.2, self-repair-scale=1e-05, 
self-repaired-proportion=0.178459, value-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(-1.0,-0.98,-0.97,-0.92 -0.82,-0.65,-0.01,0.66,0.87 0.94,0.95,0.97,0.99), mean=0.00447, stddev=0.612], deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.003,0.02,0.04,0.10 0.14,0.25,0.65,0.84,0.90 0.94,0.97,0.97,0.98), mean=0.58, stddev=0.281] }, o_t_sigmoid={ self-repair-lower-threshold=0.05, self-repair-scale=1e-05, self-repaired-proportion=0.0608838, value-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.02,0.07,0.09,0.12 0.17,0.25,0.52,0.77,0.86 0.90,0.94,0.96,0.99), mean=0.514, stddev=0.256], deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.007,0.04,0.04,0.07 0.09,0.12,0.19,0.23,0.24 0.25,0.25,0.25,0.25), mean=0.175, stddev=0.0579] }, m_t_tanh={ self-repair-lower-threshold=0.2, self-repair-scale=1e-05, self-repaired-proportion=0.134653, value-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(-0.99,-0.95,-0.92,-0.85 -0.73,-0.51,0.02,0.48,0.73 0.86,0.96,0.98,1.0), mean=0.00581, stddev=0.522], deriv-avg=[percentiles(0,1,2,5 10,20,50,80,90 95,98,99,100)=(0.002,0.03,0.04,0.13 0.26,0.41,0.75,0.93,0.97 0.99,1.0,1.0,1.0), mean=0.672, stddev=0.272] }\n"
  },
  {
    "path": "egs/steps/nnet3/get_successful_models.py",
    "content": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport re\nimport os\nimport argparse\nimport sys\nimport warnings\nimport copy\nimport glob\n\n\nif __name__ == \"__main__\":\n    # we add compulsory arguments as named arguments for readability\n    parser = argparse.ArgumentParser(description=\"Create a list of models suitable for averaging \"\n                                                 \"based on their train objf values.\",\n                                     epilog=\"See steps/nnet3/lstm/train.sh for example.\")\n\n    parser.add_argument(\"--difference-threshold\", type=float,\n                        help=\"The threshold for discarding models, \"\n                        \"when objf of the model differs more than this value from the best model \"\n                        \"it is discarded.\",\n                        default=1.0)\n\n    parser.add_argument(\"num_models\", type=int,\n                        help=\"Number of models.\")\n\n    parser.add_argument(\"logfile_pattern\", type=str,\n                        help=\"Pattern for identifying the log-file names. \"\n                        \"It specifies the entire log file name, except for the job number, \"\n                        \"which is replaced with '%'. e.g. exp/nneet3/tdnn_sp/log/train.4.%.log\")\n\n\n    args = parser.parse_args()\n\n    assert(args.num_models > 0)\n\n    parse_regex = re.compile(\"LOG .* Overall average objective function for 'output' is ([0-9e.\\-+]+) over ([0-9e.\\-+]+) frames\")\n    loss = []\n    for i in range(args.num_models):\n        model_num = i + 1\n        logfile = re.sub('%', str(model_num), args.logfile_pattern)\n        lines = open(logfile, 'r').readlines()\n        this_loss = -100000\n        for line_num in range(1, len(lines) + 1):\n            # we search from the end as this would result in\n            # lesser number of regex searches. 
Python regex is slow!\n            mat_obj = parse_regex.search(lines[-line_num])\n            if mat_obj is not None:\n                this_loss = float(mat_obj.groups()[0])\n                break\n        loss.append(this_loss)\n    max_index = loss.index(max(loss))\n    accepted_models = []\n    for i in range(args.num_models):\n        if (loss[max_index] - loss[i]) <= args.difference_threshold:\n            accepted_models.append(i+1)\n\n    model_list = \" \".join([str(x) for x in accepted_models])\n    print(model_list)\n\n    if len(accepted_models) != args.num_models:\n        print(\"WARNING: Only {0}/{1} of the models have been accepted for averaging, based on log files {2}.\".format(len(accepted_models), args.num_models, args.logfile_pattern), file=sys.stderr)\n        print(\"         Using models {0}\".format(model_list), file=sys.stderr)\n"
  },
  {
    "path": "egs/steps/nnet3/lstm/make_configs.py",
    "content": "#!/usr/bin/env python\n\n# This script is deprecated, please use ../xconfig_to_configs.py\n\nfrom __future__ import print_function\nimport os\nimport argparse\nimport sys\nimport warnings\nimport copy\nimport imp\n\nnodes = imp.load_source('nodes', 'steps/nnet3/components.py')\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\ndef GetArgs():\n    # we add compulsary arguments as named arguments for readability\n    parser = argparse.ArgumentParser(description=\"Writes config files and variables \"\n                                                 \"for LSTMs creation and training\",\n                                     epilog=\"See steps/nnet3/lstm/train.sh for example.\")\n\n    # Only one of these arguments can be specified, and one of them has to\n    # be compulsarily specified\n    feat_group = parser.add_mutually_exclusive_group(required = True)\n    feat_group.add_argument(\"--feat-dim\", type=int,\n                            help=\"Raw feature dimension, e.g. 13\")\n    feat_group.add_argument(\"--feat-dir\", type=str,\n                            help=\"Feature directory, from which we derive the feat-dim\")\n\n    # only one of these arguments can be specified\n    ivector_group = parser.add_mutually_exclusive_group(required = False)\n    ivector_group.add_argument(\"--ivector-dim\", type=int,\n                                help=\"iVector dimension, e.g. 100\", default=0)\n    ivector_group.add_argument(\"--ivector-dir\", type=str,\n                                help=\"iVector dir, which will be used to derive the ivector-dim  \", default=None)\n\n    num_target_group = parser.add_mutually_exclusive_group(required = True)\n    num_target_group.add_argument(\"--num-targets\", type=int,\n                                  help=\"number of network targets (e.g. 
num-pdf-ids/num-leaves)\")\n    num_target_group.add_argument(\"--ali-dir\", type=str,\n                                  help=\"alignment directory, from which we derive the num-targets\")\n    num_target_group.add_argument(\"--tree-dir\", type=str,\n                                  help=\"directory with final.mdl, from which we derive the num-targets\")\n\n    # General neural network options\n    parser.add_argument(\"--splice-indexes\", type=str,\n                        help=\"Splice indexes at input layer, e.g. '-3,-2,-1,0,1,2,3'\", required = True, default=\"0\")\n    parser.add_argument(\"--xent-regularize\", type=float,\n                        help=\"For chain models, if nonzero, add a separate output for cross-entropy \"\n                        \"regularization (with learning-rate-factor equal to the inverse of this)\",\n                        default=0.0)\n    parser.add_argument(\"--include-log-softmax\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"add the final softmax layer \", default=True, choices = [\"false\", \"true\"])\n    parser.add_argument(\"--max-change-per-component\", type=float,\n                        help=\"Enforces per-component max change (except for the final affine layer). \"\n                        \"if 0 it would not be enforced.\", default=0.75)\n    parser.add_argument(\"--max-change-per-component-final\", type=float,\n                        help=\"Enforces per-component max change for the final affine layer. 
\"\n                        \"if 0 it would not be enforced.\", default=1.5)\n\n    # LSTM options\n    parser.add_argument(\"--num-lstm-layers\", type=int,\n                        help=\"Number of LSTM layers to be stacked\", default=1)\n    parser.add_argument(\"--cell-dim\", type=int,\n                        help=\"dimension of lstm-cell\")\n    parser.add_argument(\"--recurrent-projection-dim\", type=int,\n                        help=\"dimension of recurrent projection\")\n    parser.add_argument(\"--non-recurrent-projection-dim\", type=int,\n                        help=\"dimension of non-recurrent projection\")\n    parser.add_argument(\"--hidden-dim\", type=int,\n                        help=\"dimension of fully-connected layers\")\n\n    # Natural gradient options\n    parser.add_argument(\"--ng-per-element-scale-options\", type=str,\n                        help=\"options to be supplied to NaturalGradientPerElementScaleComponent\", default=\"\")\n    parser.add_argument(\"--ng-affine-options\", type=str,\n                        help=\"options to be supplied to NaturalGradientAffineComponent\", default=\"\")\n\n    # Gradient clipper options\n    parser.add_argument(\"--norm-based-clipping\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"Outdated option retained for back compatibility, has no effect.\",\n                        default=True, choices = [\"false\", \"true\"])\n    parser.add_argument(\"--clipping-threshold\", type=float,\n                        help=\"clipping threshold used in BackpropTruncation components, \"\n                        \"if clipping-threshold=0 no clipping is done\", default=30)\n    parser.add_argument(\"--zeroing-threshold\", type=float,\n                        help=\"zeroing threshold used in BackpropTruncation components, \"\n                        \"if zeroing-threshold=0 no periodic zeroing is done\", default=15.0)\n    parser.add_argument(\"--zeroing-interval\", type=int,\n         
               help=\"zeroing interval used in BackpropTruncation components\", default=20)\n    parser.add_argument(\"--self-repair-scale-nonlinearity\", type=float,\n                        help=\"A non-zero value activates the self-repair mechanism in the sigmoid and tanh non-linearities of the LSTM\", default=0.00001)\n    parser.add_argument(\"--self-repair-scale-clipgradient\", type=float,\n                        help=\"Outdated option retained for back compatibility, has no effect.\",\n                        default=1.0)\n\n    # Delay options\n    parser.add_argument(\"--label-delay\", type=int, default=None,\n                        help=\"option to delay the labels to make the lstm robust\")\n\n    parser.add_argument(\"--lstm-delay\", type=str, default=None,\n                        help=\"option to have different delays in recurrence for each lstm\")\n\n    parser.add_argument(\"config_dir\",\n                        help=\"Directory to write config files and variables\")\n\n    print(' '.join(sys.argv))\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if not os.path.exists(args.config_dir):\n        os.makedirs(args.config_dir)\n\n    ## Check arguments.\n    if args.feat_dir is not None:\n        args.feat_dim = common_lib.get_feat_dim(args.feat_dir)\n\n    if args.ali_dir is not None:\n        args.num_targets = common_lib.get_number_of_leaves_from_tree(args.ali_dir)\n    elif args.tree_dir is not None:\n        args.num_targets = common_lib.get_number_of_leaves_from_tree(args.tree_dir)\n\n    if args.ivector_dir is not None:\n        args.ivector_dim = common_lib.get_ivector_dim(args.ivector_dir)\n\n    if not args.feat_dim > 0:\n        raise Exception(\"feat-dim has to be positive\")\n\n    if not args.num_targets > 0:\n        print(args.num_targets)\n        raise Exception(\"num-targets has to be positive\")\n\n    if not args.ivector_dim >= 0:\n        raise Exception(\"ivector-dim 
has to be non-negative\")\n\n    if not args.max_change_per_component >= 0 or not args.max_change_per_component_final >= 0:\n        raise Exception(\"max-change-per-component and max-change-per-component-final have to be non-negative\")\n\n    if (args.num_lstm_layers < 1):\n        sys.exit(\"--num-lstm-layers has to be a positive integer\")\n    if (args.clipping_threshold < 0 or args.zeroing_threshold < 0):\n        sys.exit(\"--clipping-threshold and --zeroing-threshold have to be non-negative\")\n    if not args.zeroing_interval > 0:\n        raise Exception(\"--zeroing-interval has to be positive\")\n    if args.lstm_delay is None:\n        args.lstm_delay = [[-1]] * args.num_lstm_layers\n    else:\n        try:\n            args.lstm_delay = ParseLstmDelayString(args.lstm_delay.strip())\n        except ValueError:\n            sys.exit(\"--lstm-delay has an incorrectly formatted value. Provided value is '{0}'\".format(args.lstm_delay))\n        if len(args.lstm_delay) != args.num_lstm_layers:\n            sys.exit(\"--lstm-delay: Number of delays provided has to match --num-lstm-layers\")\n\n    return args\n\ndef PrintConfig(file_name, config_lines):\n    f = open(file_name, 'w')\n    f.write(\"\\n\".join(config_lines['components'])+\"\\n\")\n    f.write(\"\\n#Component nodes\\n\")\n    f.write(\"\\n\".join(config_lines['component-nodes'])+\"\\n\")\n    f.close()\n\ndef ParseSpliceString(splice_indexes, label_delay=None):\n    ## Work out splice_array e.g. splice_array = [ [ -3,-2,...3 ], [0], [-2,2], .. 
[ -8,8 ] ]\n    split1 = splice_indexes.split(\" \");  # we already checked the string is nonempty.\n    if len(split1) < 1:\n        splice_indexes = \"0\"\n\n    left_context=0\n    right_context=0\n    if label_delay is not None:\n        left_context = -label_delay\n        right_context = label_delay\n\n    splice_array = []\n    try:\n        for i in range(len(split1)):\n            indexes = [int(x) for x in split1[i].strip().split(\",\")]\n            print(indexes)\n            if len(indexes) < 1:\n                raise ValueError(\"invalid --splice-indexes argument, too-short element: \"\n                                + splice_indexes)\n\n            if (i > 0)  and ((len(indexes) != 1) or (indexes[0] != 0)):\n                raise ValueError(\"in --splice-indexes, splicing is only allowed at the initial layer.\")\n\n            if not indexes == sorted(indexes):\n                raise ValueError(\"elements of --splice-indexes must be sorted: \"\n                                + splice_indexes)\n            left_context += -indexes[0]\n            right_context += indexes[-1]\n            splice_array.append(indexes)\n    except ValueError as e:\n        raise ValueError(\"invalid --splice-indexes argument \" + splice_indexes + \": \" + str(e))\n\n    left_context = max(0, left_context)\n    right_context = max(0, right_context)\n\n    return {'left_context':left_context,\n            'right_context':right_context,\n            'splice_indexes':splice_array,\n            'num_hidden_layers':len(splice_array)\n            }\n\ndef ParseLstmDelayString(lstm_delay):\n    ## Work out lstm_delay e.g. 
\"-1 [-1,1] -2\" -> list([ [-1], [-1, 1], [-2] ])\n    split1 = lstm_delay.split(\" \");\n    lstm_delay_array = []\n    try:\n        for i in range(len(split1)):\n            indexes = [int(x) for x in split1[i].strip().lstrip('[').rstrip(']').strip().split(\",\")]\n            if len(indexes) < 1:\n                raise ValueError(\"invalid --lstm-delay argument, too-short element: \"\n                                + lstm_delay)\n            elif len(indexes) == 2 and indexes[0] * indexes[1] >= 0:\n                raise ValueError('Warning: {} is not a standard BLSTM mode. There should be a negative delay for the forward, and a postive delay for the backward.'.format(indexes))\n            if len(indexes) == 2 and indexes[0] > 0: # always a negative delay followed by a postive delay\n                indexes[0], indexes[1] = indexes[1], indexes[0]\n            lstm_delay_array.append(indexes)\n    except ValueError as e:\n        raise ValueError(\"invalid --lstm-delay argument \" + lstm_delay + str(e))\n\n    return lstm_delay_array\n\n\ndef MakeConfigs(config_dir, feat_dim, ivector_dim, num_targets,\n                splice_indexes, lstm_delay, cell_dim, hidden_dim,\n                recurrent_projection_dim, non_recurrent_projection_dim,\n                num_lstm_layers, num_hidden_layers,\n                norm_based_clipping, clipping_threshold, zeroing_threshold, zeroing_interval,\n                ng_per_element_scale_options, ng_affine_options,\n                label_delay, include_log_softmax, xent_regularize,\n                self_repair_scale_nonlinearity, self_repair_scale_clipgradient,\n                max_change_per_component, max_change_per_component_final):\n\n    config_lines = {'components':[], 'component-nodes':[]}\n\n    config_files={}\n    prev_layer_output = nodes.AddInputLayer(config_lines, feat_dim, splice_indexes[0], ivector_dim)\n\n    # Add the init config lines for estimating the preconditioning matrices\n    init_config_lines = 
copy.deepcopy(config_lines)\n    init_config_lines['components'].insert(0, '# Config file for initializing neural network prior to')\n    init_config_lines['components'].insert(0, '# preconditioning matrix computation')\n    nodes.AddOutputLayer(init_config_lines, prev_layer_output)\n    config_files[config_dir + '/init.config'] = init_config_lines\n\n    prev_layer_output = nodes.AddLdaLayer(config_lines, \"L0\", prev_layer_output, config_dir + '/lda.mat')\n\n    for i in range(num_lstm_layers):\n        if len(lstm_delay[i]) == 2: # add a bi-directional LSTM layer\n            prev_layer_output = nodes.AddBLstmLayer(config_lines = config_lines,\n                                                    name = \"BLstm{0}\".format(i+1),\n                                                    input = prev_layer_output,\n                                                    cell_dim = cell_dim,\n                                                    recurrent_projection_dim = recurrent_projection_dim,\n                                                    non_recurrent_projection_dim = non_recurrent_projection_dim,\n                                                    clipping_threshold = clipping_threshold,\n                                                    zeroing_threshold = zeroing_threshold,\n                                                    zeroing_interval = zeroing_interval,\n                                                    ng_per_element_scale_options = ng_per_element_scale_options,\n                                                    ng_affine_options = ng_affine_options,\n                                                    lstm_delay = lstm_delay[i],\n                                                    self_repair_scale_nonlinearity = self_repair_scale_nonlinearity,\n                                                    max_change_per_component = max_change_per_component)\n        else: # add a uni-directional LSTM layer\n            prev_layer_output = 
nodes.AddLstmLayer(config_lines = config_lines,\n                                                   name = \"Lstm{0}\".format(i+1),\n                                                   input = prev_layer_output,\n                                                   cell_dim = cell_dim,\n                                                   recurrent_projection_dim = recurrent_projection_dim,\n                                                   non_recurrent_projection_dim = non_recurrent_projection_dim,\n                                                   clipping_threshold = clipping_threshold,\n                                                   zeroing_threshold = zeroing_threshold,\n                                                   zeroing_interval = zeroing_interval,\n                                                   ng_per_element_scale_options = ng_per_element_scale_options,\n                                                   ng_affine_options = ng_affine_options,\n                                                   lstm_delay = lstm_delay[i][0],\n                                                   self_repair_scale_nonlinearity = self_repair_scale_nonlinearity,\n                                                   max_change_per_component = max_change_per_component)\n        # make the intermediate config file for layerwise discriminative\n        # training\n        nodes.AddFinalLayer(config_lines, prev_layer_output, num_targets, ng_affine_options, max_change_per_component = max_change_per_component_final, label_delay = label_delay, include_log_softmax = include_log_softmax)\n\n\n        if xent_regularize != 0.0:\n            nodes.AddFinalLayer(config_lines, prev_layer_output, num_targets,\n                                include_log_softmax = True, label_delay = label_delay,\n                                max_change_per_component = max_change_per_component_final,\n                                name_affix = 'xent')\n\n        
config_files['{0}/layer{1}.config'.format(config_dir, i+1)] = config_lines\n        config_lines = {'components':[], 'component-nodes':[]}\n\n    for i in range(num_lstm_layers, num_hidden_layers):\n        prev_layer_output = nodes.AddAffRelNormLayer(config_lines, \"L{0}\".format(i+1),\n                                               prev_layer_output, hidden_dim,\n                                               ng_affine_options, self_repair_scale = self_repair_scale_nonlinearity, max_change_per_component = max_change_per_component)\n        # make the intermediate config file for layerwise discriminative\n        # training\n        nodes.AddFinalLayer(config_lines, prev_layer_output, num_targets, ng_affine_options, max_change_per_component = max_change_per_component_final, label_delay = label_delay, include_log_softmax = include_log_softmax)\n\n        if xent_regularize != 0.0:\n            nodes.AddFinalLayer(config_lines, prev_layer_output, num_targets,\n                                include_log_softmax = True, label_delay = label_delay,\n                                max_change_per_component = max_change_per_component_final,\n                                name_affix = 'xent')\n\n        config_files['{0}/layer{1}.config'.format(config_dir, i+1)] = config_lines\n        config_lines = {'components':[], 'component-nodes':[]}\n\n    # printing out the configs\n    # init.config used to train lda-mllt train\n    for key in config_files.keys():\n        PrintConfig(key, config_files[key])\n\n\n\n\ndef ProcessSpliceIndexes(config_dir, splice_indexes, label_delay, num_lstm_layers):\n    parsed_splice_output = ParseSpliceString(splice_indexes.strip(), label_delay)\n    left_context = parsed_splice_output['left_context']\n    right_context = parsed_splice_output['right_context']\n    num_hidden_layers = parsed_splice_output['num_hidden_layers']\n    splice_indexes = parsed_splice_output['splice_indexes']\n\n    if (num_hidden_layers < num_lstm_layers):\n        
raise Exception(\"num-lstm-layers : number of lstm layers cannot exceed the number of hidden layers, decided based on splice-indexes\")\n\n    # write the files used by other scripts like steps/nnet3/get_egs.sh\n    f = open(config_dir + \"/vars\", \"w\")\n    print('model_left_context={}'.format(left_context), file=f)\n    print('model_right_context={}'.format(right_context), file=f)\n    print('num_hidden_layers={}'.format(num_hidden_layers), file=f)\n    # print('initial_right_context=' + str(splice_array[0][-1]), file=f)\n    f.close()\n\n    return [left_context, right_context, num_hidden_layers, splice_indexes]\n\n\ndef Main():\n    args = GetArgs()\n    [left_context, right_context, num_hidden_layers, splice_indexes] = ProcessSpliceIndexes(args.config_dir, args.splice_indexes, args.label_delay, args.num_lstm_layers)\n\n    MakeConfigs(config_dir = args.config_dir,\n                feat_dim = args.feat_dim, ivector_dim = args.ivector_dim,\n                num_targets = args.num_targets,\n                splice_indexes = splice_indexes, lstm_delay = args.lstm_delay,\n                cell_dim = args.cell_dim,\n                hidden_dim = args.hidden_dim,\n                recurrent_projection_dim = args.recurrent_projection_dim,\n                non_recurrent_projection_dim = args.non_recurrent_projection_dim,\n                num_lstm_layers = args.num_lstm_layers,\n                num_hidden_layers = num_hidden_layers,\n                norm_based_clipping = args.norm_based_clipping,\n                clipping_threshold = args.clipping_threshold,\n                zeroing_threshold = args.zeroing_threshold,\n                zeroing_interval = args.zeroing_interval,\n                ng_per_element_scale_options = args.ng_per_element_scale_options,\n                ng_affine_options = args.ng_affine_options,\n                label_delay = args.label_delay,\n                include_log_softmax = args.include_log_softmax,\n                xent_regularize = 
args.xent_regularize,\n                self_repair_scale_nonlinearity = args.self_repair_scale_nonlinearity,\n                self_repair_scale_clipgradient = args.self_repair_scale_clipgradient,\n                max_change_per_component = args.max_change_per_component,\n                max_change_per_component_final = args.max_change_per_component_final)\n\nif __name__ == \"__main__\":\n    Main()\n"
  },
  {
    "path": "egs/steps/nnet3/lstm/train.sh",
    "content": "#!/usr/bin/env bash\n\n# THIS SCRIPT IS DEPRECATED, see ../train_rnn.py\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014-2015  Vijayaditya Peddinti\n# Apache 2.0.\n\n# Terminology:\n# sample - one input-output tuple, which is an input sequence and output sequence for LSTM\n# frame  - one output label and the input context used to compute it\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=10      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.0003\nfinal_effective_lrate=0.00003\nnum_jobs_initial=1 # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nprior_subset_size=20000 # 20k samples per job, for computing priors.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0    # can be used for rerunning after partial\nonline_ivector_dir=\npresoftmax_prior_scale_power=-0.25  # we haven't yet used pre-softmax prior scaling in the LSTM model\nremove_egs=true  # set to false to disable removing egs after training is done.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  
It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nstage=-6\nexit_stage=-100 # you can set this to terminate the training early.  Exits before running this stage\n\n# count space-separated fields in splice_indexes to get num-hidden-layers.\nsplice_indexes=\"-2,-1,0,1,2 0 0\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\n# LSTM parameters\nnum_lstm_layers=3\ncell_dim=1024  # dimension of the LSTM cell\nhidden_dim=1024  # the dimension of the fully connected hidden layer outputs\nrecurrent_projection_dim=256\nnon_recurrent_projection_dim=256\nnorm_based_clipping=true  # if true norm_based_clipping is used.\n                          # In norm-based clipping the activation Jacobian matrix\n                          # for the recurrent connections in the network is clipped\n                          # to ensure that the individual row-norm (l2) does not increase\n                          # beyond the clipping_threshold.\n                          # If false, element-wise clipping is used.\nclipping_threshold=30     # if norm_based_clipping is true this would be the maximum value of the row l2-norm,\n                          # else this is the max-absolute value of each element in Jacobian.\nchunk_width=20  # number of output labels in the sequence used to train an LSTM\n                # Caution: if you double this you should halve --samples-per-iter.\nchunk_left_context=40  # number of steps used in the estimation of LSTM state 
before prediction of the first label\nchunk_right_context=0  # number of additional right-context steps used in the estimation of LSTM state (usually used in bi-directional LSTM case)\nlabel_delay=5  # the lstm output is used to predict the label with the specified delay\nlstm_delay=\" -1 -2 -3 \"  # the delay to be used in the recurrence of lstms\n                         # \"-1 -2 -3\" means a three-layer stacked LSTM would use recurrence connections with\n                         # delays -1, -2 and -3 at layer1 lstm, layer2 lstm and layer3 lstm respectively\n                         # \"[-1,1] [-2,2] [-3,3]\" means a three-layer stacked bi-directional LSTM would use recurrence\n                         # connections with delay -1 for the forward, 1 for the backward at layer1,\n                         # -2 for the forward, 2 for the backward at layer2, and so on at layer3\nnum_bptt_steps=    # this variable counts the number of time steps to back-propagate from the last label in the chunk\n                   # it defaults to chunk_width\n\n\n# nnet3-train options\nshrink=0.99  # this parameter would be used to scale the parameter matrices\nshrink_threshold=0.15  # a value less than 0.25 that we compare the mean of\n                       # 'deriv-avg' for sigmoid components with, and if it's\n                       # less, we shrink.\nmax_param_change=2.0  # max param change per minibatch\nnum_chunk_per_minibatch=100  # number of sequences to be processed in parallel every mini-batch\n\nsamples_per_iter=20000 # this is really the number of egs in each archive.  Each eg has\n                       # 'chunk_width' frames in it-- for chunk_width=20, this value (20k)\n                       # is equivalent to the 400k number that we use as a default in\n                       # regular DNN training.\nmomentum=0.5    # e.g. 0.5.  
Note: we implemented it in such a way that\n                # it doesn't increase the effective learning rate.\nuse_gpu=true    # if true, we run on GPU.\ncleanup=true\negs_dir=\nmax_lda_jobs=10  # use no more than 10 jobs for the LDA accumulation.\nlda_opts=\negs_opts=\ntransform_dir=     # If supplied, this dir used instead of alidir to find transforms.\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=raw  # or set to 'lda' to use LDA features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # List of times on which we realign.  Each time is\n                        # floating point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\n\nrand_prune=4.0 # speeds up LDA.\n\n# End configuration section.\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\necho \"$0: THIS SCRIPT IS DEPRECATED\"\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|10>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.0003>         # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.00003>          # effective learning rate at end of training.\"\n  echo \"  --momentum <momentum|0.5>                        # Momentum constant: note, this is \"\n  echo \"                                                   # implemented in such a way that it doesn't\"\n  echo \"                                                   # increase the effective learning rate.\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job, for CPU-based training (will affect\"\n  echo \"                                                   # results as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size).\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. 
queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... note, you might have to reduce --mem\"\n  echo \"                                                   # versus your defaults, because it gets multiplied by the --num-threads argument.\"\n  echo \"  --splice-indexes <string|\\\"-2,-1,0,1,2 0 0\\\"> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : <frame_indices> .... <frame_indices> \"\n  echo \"                                                   # the number of fields determines the number of LSTM and non-recurrent layers\"\n  echo \"                                                   # also see the --num-lstm-layers option\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-times <list-of-times|''>               # A list of space-separated times, each strictly between 0 and 1,\"\n  echo \"                                                   # at which realignment is to be done (scaled by the number of iterations)\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether the GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage <stage|-6>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  echo \" ################### LSTM options ###################### \"\n  echo \"  --num-lstm-layers <int|3>                        # number of LSTM layers\"\n  echo \"  --cell-dim   <int|1024>     
                     # dimension of the LSTM cell\"\n  echo \"  --hidden-dim      <int|1024>                     # the dimension of the fully connected hidden layer outputs\"\n  echo \"  --recurrent-projection-dim  <int|256>            # the output dimension of the recurrent-projection-matrix\"\n  echo \"  --non-recurrent-projection-dim  <int|256>        # the output dimension of the non-recurrent-projection-matrix\"\n  echo \"  --chunk-left-context <int|40>                    # number of time-steps used in the estimation of the first LSTM state\"\n  echo \"  --chunk-width <int|20>                           # number of output labels in the sequence used to train an LSTM\"\n  echo \"                                                   # Caution: if you double this you should halve --samples-per-iter.\"\n  echo \"  --norm-based-clipping <bool|true>                # if true norm_based_clipping is used.\"\n  echo \"                                                   # In norm-based clipping the activation Jacobian matrix\"\n  echo \"                                                   # for the recurrent connections in the network is clipped\"\n  echo \"                                                   # to ensure that the individual row-norm (l2) does not increase\"\n  echo \"                                                   # beyond the clipping_threshold.\"\n  echo \"                                                   # If false, element-wise clipping is used.\"\n  echo \"  --num-bptt-steps <int|>                          # this variable counts the number of time steps to back-propagate from the last label in the chunk\"\n  echo \"                                                   # it defaults to chunk_width\"\n  echo \"  --label-delay <int|5>                            # the lstm output is used to predict the label with the specified delay\"\n\n  echo \"  --lstm-delay <str|\\\" -1 -2 -3 \\\">                # the delay to be used in the recurrence of lstms\"\n  echo 
\"                                                   # \\\"-1 -2 -3\\\" means the a three layer stacked LSTM would use recurrence connections with \"\n  echo \"                                                   # delays -1, -2 and -3 at layer1 lstm, layer2 lstm and layer3 lstm respectively\"\n  echo \"  --clipping-threshold <int|30>                    # if norm_based_clipping is true this would be the maximum value of the row l2-norm,\"\n  echo \"                                                   # else this is the max-absolute value of each element in Jacobian.\"\n\n  echo \" ################### LSTM specific training options ###################### \"\n  echo \"  --num-chunks-per-minibatch <minibatch-size|100>  # Number of sequences to be processed in parallel in a minibatch\"\n  echo \"  --samples-per-iter <#samples|20000>              # Number of egs in each archive of data.  This times --chunk-width is\"\n  echo \"                                                   # the number of frames processed per iteration\"\n  echo \"  --shrink <shrink|0.99>                           # if non-zero this parameter will be used to scale the parameter matrices\"\n  echo \"  --shrink-threshold <threshold|0.15>              # a threshold (should be between 0.0 and 0.25) that controls when to\"\n  echo \"                                                   # do parameter shrinking.\"\n  echo \" for more options see the script\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n# First work out the feature and iVector dimension, needed for tdnn config creation.\ncase $feat_type in\n  raw) feat_dim=$(feat-to-dim --print-args=false scp:$data/feats.scp -) || \\\n      { echo \"$0: Error getting feature dim\"; exit 1; }\n    ;;\n  lda)  [ ! -f $alidir/final.mat ] && echo \"$0: With --feat-type lda option, expect $alidir/final.mat to exist.\"\n   # get num-rows in lda matrix, which is the lda feature dim.\n   feat_dim=$(matrix-dim --print-args=false $alidir/final.mat | cut -f 1)\n    ;;\n  *)\n   echo \"$0: Bad --feat-type '$feat_type';\"; exit 1;\nesac\nif [ -z \"$online_ivector_dir\" ]; then\n  ivector_dim=0\nelse\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\nfi\n\n\nif [ $stage -le -5 ]; then\n  echo \"$0: creating neural net configs\";\n\n  # create the config files for nnet initialization\n  # note an additional space is added to splice_indexes to\n  # avoid issues with the python ArgParser which can have\n  # issues with negative arguments (due to minus sign)\n  config_extra_opts=()\n  [ ! 
-z \"$lstm_delay\" ] && config_extra_opts+=(--lstm-delay \"$lstm_delay\")\n\n  steps/nnet3/lstm/make_configs.py  \"${config_extra_opts[@]}\" \\\n    --splice-indexes \"$splice_indexes \" \\\n    --num-lstm-layers $num_lstm_layers \\\n    --feat-dim $feat_dim \\\n    --ivector-dim $ivector_dim \\\n    --cell-dim $cell_dim \\\n    --hidden-dim $hidden_dim \\\n    --recurrent-projection-dim $recurrent_projection_dim \\\n    --non-recurrent-projection-dim $non_recurrent_projection_dim \\\n    --norm-based-clipping $norm_based_clipping \\\n    --clipping-threshold $clipping_threshold \\\n    --num-targets $num_leaves \\\n    --label-delay $label_delay \\\n   $dir/configs || exit 1;\n  # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n  # matrix.  This first config just does any initial splicing that we do;\n  # we do this as it's a convenient way to get the stats for the 'lda-like'\n  # transform.\n  $cmd $dir/log/nnet_init.log \\\n    nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;\nfi\n# sourcing the \"vars\" below sets\n# model_left_context=(something)\n# model_right_context=(something)\n# num_hidden_layers=(something)\n. $dir/configs/vars || exit 1;\nleft_context=$((chunk_left_context + model_left_context))\nright_context=$((chunk_right_context + model_right_context))\ncontext_opts=\"--left-context=$left_context --right-context=$right_context\"\n\n! [ \"$num_hidden_layers\" -gt 0 ] && echo \\\n \"$0: Expected num_hidden_layers to be defined\" && exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\n\nif [ $stage -le -4 ] && [ -z \"$egs_dir\" ]; then\n  extra_opts=()\n  [ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n  [ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n  [ ! 
-z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n  extra_opts+=(--transform-dir $transform_dir)\n  extra_opts+=(--left-context $left_context)\n  extra_opts+=(--right-context $right_context)\n\n  # Note: in RNNs we process sequences of labels rather than single label per sample\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet3/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --cmd \"$cmd\" $egs_opts \\\n      --stage $get_egs_stage \\\n      --samples-per-iter $samples_per_iter \\\n      --frames-per-eg $chunk_width \\\n      $data $alidir $dir/egs || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n\nif [ \"$feat_dim\" != \"$(cat $egs_dir/info/feat_dim)\" ]; then\n  echo \"$0: feature dimension mismatch with egs, $feat_dim vs $(cat $egs_dir/info/feat_dim)\";\n  exit 1;\nfi\nif [ \"$ivector_dim\" != \"$(cat $egs_dir/info/ivector_dim)\" ]; then\n  echo \"$0: ivector dimension mismatch with egs, $ivector_dim vs $(cat $egs_dir/info/ivector_dim)\";\n  exit 1;\nfi\n\n# copy any of the following that exist, to $dir.\ncp $egs_dir/{cmvn_opts,splice_opts,final.mat} $dir 2>/dev/null\n\n# confirm that the egs_dir has the necessary context (especially important if\n# the --egs-dir option was used on the command line).\negs_left_context=$(cat $egs_dir/info/left_context) || exit -1\negs_right_context=$(cat $egs_dir/info/right_context) || exit -1\n ( [ $egs_left_context -lt $left_context ] || \\\n   [ $egs_right_context -lt $right_context ] ) && \\\n   echo \"$0: egs in $egs_dir have too little context\" && exit -1;\n\nchunk_width=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --initial-num-jobs cannot exceed --final-num-jobs\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives ] && \\\n  echo 
\"$0: --final-num-jobs cannot exceed #archives $num_archives.\" && exit 1;\n\n\nif [ $stage -le -3 ]; then\n  echo \"$0: getting preconditioning matrix for input features.\"\n  num_lda_jobs=$num_archives\n  [ $num_lda_jobs -gt $max_lda_jobs ] && num_lda_jobs=$max_lda_jobs\n\n  # Write stats with the same format as stats for LDA.\n  $cmd JOB=1:$num_lda_jobs $dir/log/get_lda_stats.JOB.log \\\n      nnet3-acc-lda-stats --rand-prune=$rand_prune \\\n        $dir/init.raw \"ark:$egs_dir/egs.JOB.ark\" $dir/JOB.lda_stats || exit 1;\n\n  all_lda_accs=$(for n in $(seq $num_lda_jobs); do echo $dir/$n.lda_stats; done)\n  $cmd $dir/log/sum_transform_stats.log \\\n    sum-lda-accs $dir/lda_stats $all_lda_accs || exit 1;\n\n  rm $all_lda_accs || exit 1;\n\n  # this computes a fixed affine transform computed in the way we described in\n  # Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled variant\n  # of an LDA transform but without dimensionality reduction.\n  $cmd $dir/log/get_transform.log \\\n     nnet-get-feature-transform $lda_opts $dir/lda.mat $dir/lda_stats || exit 1;\n\n  ln -sf ../lda.mat $dir/configs/lda.mat\nfi\n\n\nif [ $stage -le -2 ]; then\n  echo \"$0: preparing initial vector for FixedScaleComponent before softmax\"\n  echo \"  ... 
using priors^$presoftmax_prior_scale_power and rescaling to average 1\"\n\n  # obtains raw pdf count\n  $cmd JOB=1:$nj $dir/log/acc_pdf.JOB.log \\\n     ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n     post-to-tacc --per-pdf=true  $alidir/final.mdl ark:- $dir/pdf_counts.JOB || exit 1;\n  $cmd $dir/log/sum_pdf_counts.log \\\n       vector-sum --binary=false $dir/pdf_counts.* $dir/pdf_counts || exit 1;\n  rm $dir/pdf_counts.*\n\n  awk -v power=$presoftmax_prior_scale_power -v smooth=0.01 \\\n     '{ for(i=2; i<=NF-1; i++) { count[i-2] = $i;  total += $i; }\n        num_pdfs=NF-2;  average_count = total/num_pdfs;\n        for (i=0; i<num_pdfs; i++) stot += (scale[i] = (count[i] + smooth * average_count)^power)\n        printf \" [ \"; for (i=0; i<num_pdfs; i++) printf(\"%f \", scale[i]*num_pdfs/stot); print \"]\" }' \\\n     $dir/pdf_counts > $dir/presoftmax_prior_scale.vec\n  ln -sf ../presoftmax_prior_scale.vec $dir/configs/presoftmax_prior_scale.vec\nfi\n\nif [ $stage -le -1 ]; then\n  # Add the first layer; this will add in the lda.mat and\n  # presoftmax_prior_scale.vec.\n  $cmd $dir/log/add_first_layer.log \\\n       nnet3-init --srand=-3 $dir/init.raw $dir/configs/layer1.config $dir/0.raw || exit 1;\n\n  # Convert to .mdl, train the transitions, set the priors.\n  $cmd $dir/log/init_mdl.log \\\n    nnet3-am-init $alidir/final.mdl $dir/0.raw - \\| \\\n    nnet3-am-train-transitions - \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl || exit 1;\nfi\n\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\n! 
[ $num_iters -gt $[$num_hidden_layers*$add_layers_period+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif $use_gpu; then\n  parallel_suffix=\"\"\n  train_queue_opt=\"--gpu 1\"\n  combine_queue_opt=\"--gpu 1\"\n  prior_gpu_opt=\"--use-gpu=yes\"\n  prior_queue_opt=\"--gpu 1\"\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  parallel_train_opts=\"--use-gpu=no\"\n  combine_queue_opt=\"\"  # the combine stage will be quite slow if not using\n                        # GPU, as we didn't enable that program to use\n                        # multiple threads.\n  prior_gpu_opt=\"--use-gpu=no\"\n  prior_queue_opt=\"\"\nfi\n\napprox_iters_per_epoch_final=$[$num_archives/$num_jobs_final]\n# First work out how many iterations we want to combine over in the final\n# nnet3-combine-fast invocation.  (We may end up subsampling from these if the\n# number exceeds max_models_combine).  
The number we use is:\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     1/2 * iters_after_last_layer_added)\nnum_iters_combine=$max_models_combine\nif [ $num_iters_combine -lt $approx_iters_per_epoch_final ]; then\n   num_iters_combine=$approx_iters_per_epoch_final\nfi\nhalf_iters_after_add_layers=$[($num_iters-$finish_add_layers_iter)/2]\nif [ $num_iters_combine -gt $half_iters_after_add_layers ]; then\n  num_iters_combine=$half_iters_after_add_layers\nfi\nfirst_model_combine=$[$num_iters-$num_iters_combine+1]\n\nx=0\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\";\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\n[ -z $num_bptt_steps ] && num_bptt_steps=$chunk_width;\nmin_deriv_time=$((chunk_width - num_bptt_steps))\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_effective_learning_rate=$(perl -e \"print ($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt));\");\n  this_learning_rate=$(perl -e \"print ($this_effective_learning_rate*$this_num_jobs);\");\n\n  if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    # Set this_shrink value.\n    if [ $x -eq 0 ] || nnet3-am-info --print-args=false $dir/$x.mdl | \\\n      perl -e \"while(<>){ if (m/type=Sigmoid.+deriv-avg=.+mean=(\\S+)/) { \\$n++; \\$tot+=\\$1; } } exit(\\$tot/\\$n > $shrink_threshold);\"; then\n      this_shrink=$shrink; # e.g. avg-deriv of sigmoids was <= 0.125, so shrink.\n    else\n      this_shrink=1.0  # don't shrink: sigmoids are not over-saturated.\n    fi\n    echo \"On iteration $x, learning rate is $this_learning_rate and shrink value is $this_shrink.\"\n\n    if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet3-copy-egs --srand=JOB --frame=random $context_opts ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet3-merge-egs ark:- ark:- \\| \\\n        nnet3-compute-from-egs --apply-exp=true \"nnet3-am-copy --raw=true $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet3-am-adjust-priors $dir/$x.mdl 
$dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet3/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet3/relabel_egs.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet3/remove_egs.sh $prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n            \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/valid_diagnostic.egs ark:- |\" &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n           \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:- |\" &\n\n    if [ $x -gt 0 ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet3-info \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" '&&' \\\n        nnet3-show-progress --use-gpu=no \"nnet3-am-copy --raw=true $dir/$[$x-1].mdl - |\" \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n        \"ark,bg:nnet3-merge-egs --minibatch-size=256 ark:$cur_egs_dir/train_diagnostic.egs ark:-|\" &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging but take the\n                       # best.\n      cur_num_hidden_layers=$[1+$x/$add_layers_period]\n      config=$dir/configs/layer$cur_num_hidden_layers.config\n      raw=\"nnet3-am-copy --raw=true 
--learning-rate=$this_learning_rate $dir/$x.mdl - | nnet3-init --srand=$x - $config - |\"\n      cache_read_opt=\"\" # an option for reading cache (stored pairs of nnet-computations\n                        # and computation-requests) during training; left empty\n                        # here because a layer has just been added.\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      raw=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n      cache_read_opt=\"--read-cache=$dir/cache.$x\"\n    fi\n    if $do_average; then\n      this_num_chunk_per_minibatch=$num_chunk_per_minibatch\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size (and we will later choose the output of just one of the jobs): the\n      # model-averaging isn't always helpful when the model is changing too fast\n      # (i.e. it can worsen the objective function), and the smaller minibatch\n      # size will help to keep the update stable.\n      this_num_chunk_per_minibatch=$[$num_chunk_per_minibatch/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We cannot easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      # This is no longer true for RNNs, as we do not use the --frame option,\n      # but we use the same script for consistency with the FF-DNN code.\n\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we will derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        if [ 
$n -eq 1 ]; then\n          # an option for writing cache (storing pairs of nnet-computations and\n          # computation-requests) during training.\n          cache_write_opt=\" --write-cache=$dir/cache.$[$x+1]\"\n        else\n          cache_write_opt=\"\"\n        fi\n        $cmd $train_queue_opt $dir/log/train.$x.$n.log \\\n          nnet3-train $parallel_train_opts $cache_read_opt $cache_write_opt --print-interval=10 --momentum=$momentum \\\n          --max-param-change=$max_param_change \\\n          --optimization.min-deriv-time=$min_deriv_time \"$raw\" \\\n          \"ark,bg:nnet3-copy-egs $context_opts ark:$cur_egs_dir/egs.$archive.ark ark:- | nnet3-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-merge-egs --minibatch-size=$this_num_chunk_per_minibatch --measure-output-frames=false --discard-partial-minibatches=true ark:- ark:- |\" \\\n          $dir/$[$x+1].$n.raw || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x of training\" && exit 1;\n\n    models_to_average=$(steps/nnet3/get_successful_models.py $this_num_jobs $dir/log/train.$x.%.log)\n    nnets_list=\n    for n in $models_to_average; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.raw\"\n    done\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet3-average $nnets_list - \\| \\\n        nnet3-am-copy --scale=$this_shrink --set-raw-nnet=- $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } 
}\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet3-am-copy --scale=$this_shrink --set-raw-nnet=$dir/$[$x+1].$n.raw  $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.raw\"\n    done\n\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  rm $dir/cache.$x 2>/dev/null\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.  In the nnet3 setup, the logic\n  # for doing averaging of subsets of the models in the case where\n  # there are too many models to reliably estimate interpolation\n  # factors (max_models_combine) is moved into the nnet3-combine program.\n  nnets_list=()\n  for n in $(seq 0 $[num_iters_combine-1]); do\n    iter=$[$first_model_combine+$n]\n    mdl=$dir/$iter.mdl\n    [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n    nnets_list[$n]=\"nnet3-am-copy --raw=true $mdl -|\";\n  done\n\n  # print(...) with parentheses works under both Python 2 and Python 3.\n  combine_num_chunk_per_minibatch=$(python -c \"print(int(1024.0/($chunk_width)))\")\n  $cmd $combine_queue_opt $dir/log/combine.log \\\n    nnet3-combine --num-iters=40 \\\n       --enforce-sum-to-one=true --enforce-positive-weights=true \\\n       --verbose=3 \"${nnets_list[@]}\" \"ark,bg:nnet3-merge-egs --measure-output-frames=false --minibatch-size=$combine_num_chunk_per_minibatch ark:$cur_egs_dir/combine.egs ark:-|\" \\\n    \"|nnet3-am-copy --set-raw-nnet=- $dir/$num_iters.mdl $dir/combined.mdl\" || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" \\\n    \"ark,bg:nnet3-merge-egs --minibatch-size=256 ark:$cur_egs_dir/valid_diagnostic.egs ark:- |\" &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet3-compute-prob  \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" \\\n    \"ark,bg:nnet3-merge-egs --minibatch-size=256 ark:$cur_egs_dir/train_diagnostic.egs ark:- |\" &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  rm $dir/post.$x.*.vec 2>/dev/null\n  if [ $num_jobs_compute_prior -gt $num_archives ]; then egs_part=1;\n  else egs_part=JOB; fi\n  $cmd JOB=1:$num_jobs_compute_prior $prior_queue_opt $dir/log/get_post.$x.JOB.log \\\n    nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:$cur_egs_dir/egs.$egs_part.ark ark:- \\| \\\n    nnet3-merge-egs --measure-output-frames=true --minibatch-size=128 ark:- ark:- \\| \\\n    nnet3-compute-from-egs $prior_gpu_opt --apply-exp=true \\\n      \"nnet3-am-copy --raw=true 
$dir/combined.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet3-am-adjust-priors $dir/combined.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n       rm $dir/$x.mdl\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet3/make_bottleneck_features.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016 Pegah Ghahremani\n\n# This script dumps bottleneck feature for model trained using nnet3.\n# CAUTION!  This script isn't very suitable for dumping features from recurrent\n# architectures such as LSTMs, because it doesn't support setting the chunk size\n# and left and right context.  (Those would have to be passed into nnet3-compute).\n# See also chain/get_phone_post.sh.\n\n# Begin configuration section.\nstage=1\nnj=4\ncmd=queue.pl\nuse_gpu=false\nivector_dir=\ncompress=true\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [[ ( $# -lt 4 ) || ( $# -gt 6 ) ]]; then\n   echo \"usage: steps/nnet3/make_bottleneck_features.sh <bnf-node-name> <input-data-dir> <bnf-data-dir> <nnet-dir> [<log-dir> [<bnfdir>] ]\"\n   echo \"e.g.:  steps/nnet3/make_bottleneck_features.sh tdnn_bn.renorm data/train data/train_bnf exp/nnet3/tdnn_bnf exp_bnf/dump_bnf bnf\"\n   echo \"Note: <log-dir> defaults to <bnf-data-dir>/log and <bnfdir> defaults to\"\n   echo \" <bnf-data-dir>/data\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --ivector-dir                                    # directory for ivectors\"\n   exit 1;\nfi\nbnf_name=$1 # the component-node name in nnet3 model used for bottleneck feature extraction\ndata=$2\nbnf_data=$3\nnnetdir=$4\nif [ $# -gt 4 ]; then\n  logdir=$5\nelse\n  logdir=$bnf_data/log\nfi\nif [ $# -gt 5 ]; then\n  bnfdir=$6\nelse\n  bnfdir=$bnf_data/data\nfi\n\n# Assume that final.nnet is in nnetdir\ncmvn_opts=`cat $nnetdir/cmvn_opts`;\nbnf_nnet=$nnetdir/final.raw\nif [ ! 
-f $bnf_nnet ] ; then\n  if [ ! -f $nnetdir/final.mdl ]; then\n    echo \"$0: No such file $bnf_nnet or $nnetdir/final.mdl\";\n    exit 1;\n  else\n    bnf_nnet=$nnetdir/final.mdl\n  fi\nfi\n\nif $use_gpu; then\n  compute_queue_opt=\"--gpu 1\"\n  compute_gpu_opt=\"--use-gpu=yes\"\n  if ! cuda-compiled; then\n    echo \"$0: ERROR: --use-gpu is true but Kaldi has not been compiled\"\n    echo \"   for CUDA.  If you have GPUs and have nvcc installed, go to src/ and do\"\n    echo \"   ./configure; make.  Otherwise rerun with --use-gpu false.\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  compute_gpu_opt=\"--use-gpu=no\"\nfi\n\n\n## Set up input features of nnet\nname=`basename $data`\nsdata=$data/split$nj\n\nmkdir -p $logdir\nmkdir -p $bnf_data\nmkdir -p $bnfdir\necho $nj > $bnfdir/num_jobs\n\n[ ! -f $data/feats.scp ] && echo >&2 \"The file $data/feats.scp does not exist!\" && exit 1;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nuse_ivector=false\nif [ ! -z \"$ivector_dir\" ];then\n  use_ivector=true\n  steps/nnet2/check_ivectors_compatible.sh $nnetdir $ivector_dir || exit 1;\nfi\n\n## Set up features.\nif [ -f $nnetdir/online_cmvn ]; then online_cmvn=true\nelse online_cmvn=false; fi\n\nif ! 
$online_cmvn; then\n  echo \"$0: feature type is raw\"\n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\nelse\n  echo \"$0: feature type is raw (apply-cmvn-online)\"\n  feats=\"ark,s,cs:apply-cmvn-online $cmvn_opts --spk2utt=ark:$sdata/JOB/spk2utt $nnetdir/global_cmvn.stats scp:$sdata/JOB/feats.scp ark:- |\"\nfi\nivector_feats=\"scp:utils/filter_scp.pl $sdata/JOB/utt2spk $ivector_dir/ivector_online.scp |\"\n\nif [ $stage -le 1 ]; then\n  echo \"$0: Generating bottleneck (BNF) features using $bnf_nnet model as output of \"\n  echo \"    component-node with name $bnf_name.\"\n  echo \"output-node name=output input=$bnf_name\" > $bnf_data/output.config\n  modified_bnf_nnet=\"nnet3-copy --nnet-config=$bnf_data/output.config $bnf_nnet - |\"\n  ivector_opts=\n  if $use_ivector; then\n    ivector_period=$(cat $ivector_dir/ivector_period) || exit 1;\n    ivector_opts=\"--online-ivector-period=$ivector_period --online-ivectors='$ivector_feats'\"\n  fi\n  $cmd $compute_queue_opt JOB=1:$nj $logdir/make_bnf_$name.JOB.log \\\n    nnet3-compute $compute_gpu_opt $ivector_opts \"$modified_bnf_nnet\" \"$feats\" ark:- \\| \\\n    copy-feats --compress=$compress ark:- ark,scp:$bnfdir/raw_bnfeat_$name.JOB.ark,$bnfdir/raw_bnfeat_$name.JOB.scp || exit 1;\nfi\n\n\nN0=$(cat $data/feats.scp | wc -l)\nN1=$(cat $bnfdir/raw_bnfeat_$name.*.scp | wc -l)\nif [[ \"$N0\" != \"$N1\" ]]; then\n  echo \"$0: Error generating BNF features for $name (original:$N0 utterances, BNF:$N1 utterances)\"\n  exit 1;\nfi\n\n# Concatenate feats.scp into bnf_data\nfor n in $(seq $nj); do  cat $bnfdir/raw_bnfeat_$name.$n.scp; done > $bnf_data/feats.scp\n\nfor f in segments spk2utt text utt2spk wav.scp char.stm glm kws reco2file_and_channel stm; do\n  [ -e $data/$f ] && cp -r $data/$f $bnf_data/$f\ndone\n\necho \"$0: computing CMVN stats.\"\nsteps/compute_cmvn_stats.sh $bnf_data\n\necho \"$0: done making BNF features.\"\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/nnet3/make_denlats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012        Johns Hopkins University (Author: Daniel Povey)\n#           2014-2015   Vimal Manohar\n# Apache 2.0.\n\n# Create denominator lattices for MMI/MPE training [deprecated].\n# This version uses the neural-net models (version 3, i.e. the nnet3 code).\n# Creates its output in $dir/lat.*.gz\n# Note: the more recent discriminative training scripts will not use this\n# script at all, they'll use get_degs.sh which combines the decoding\n# and egs-dumping into one script (to save disk space and disk I/O).\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nframes_per_chunk=50\nlattice_beam=7.0\nself_loop_scale=0.1\nacwt=0.1\nmax_active=5000\nmin_active=200\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\nnum_threads=1 # number of threads of decoder [only applicable if not looped, for now]\nonline_ivector_dir=\ndeterminize=true\nminimize=false\nivector_scale=1.0\nextra_left_context=0\nextra_right_context=0\nextra_left_context_initial=-1\nextra_right_context_final=-1\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nnum_threads=1 # Fixed to 1 for now\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/nnet3/make_denlats.sh [options] <data-dir> <lang-dir> <src-dir> <exp-dir>\"\n  echo \"  e.g.: steps/nnet3/make_denlats.sh data/train data/lang exp/nnet4 exp/nnet4_denlats\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --sub-split <n-split>                            # e.g. 40; use this for \"\n  echo \"                           # large databases so your jobs will be smaller and\"\n  echo \"                           # will (individually) finish reasonably soon.\"\n  echo \"  --num-threads  <n>                # number of threads per decoding job\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\n\nextra_files=\n[ ! -z \"$online_ivector_dir\" ] && \\\n  extra_files=\"$online_ivector_dir/ivector_online.scp $online_ivector_dir/ivector_period\"\nfor f in $data/feats.scp $lang/L.fst $srcdir/final.mdl $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nsdata=$data/split$nj\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\n\noov=`cat $lang/oov.int` || exit 1;\n\ncp -rH $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\nnew_lang=\"$dir/\"$(basename \"$lang\")\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\necho \"Compiling decoding graph in $dir/dengraph\"\nif [ -s $dir/dengraph/HCLG.fst ] && [ $dir/dengraph/HCLG.fst -nt $srcdir/final.mdl ]; then\n  echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  echo \"Making unigram grammar FST in $new_lang\"\n  cat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n   awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n    utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > $new_lang/G.fst \\\n    || exit 1;\n  utils/mkgraph.sh --self-loop-scale $self_loop_scale $new_lang $srcdir $dir/dengraph || exit 1;\nfi\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\ncp $srcdir/cmvn_opts $dir 2>/dev/null\n\necho \"$0: feature type is raw\"\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- |\"\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\nif [ ! 
-z \"$online_ivector_dir\" ]; then\n  ivector_period=$(cat $online_ivector_dir/ivector_period) || exit 1;\n  ivector_opts=\"--online-ivectors=scp:$online_ivector_dir/ivector_online.scp --online-ivector-period=$ivector_period\"\nfi\n\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\n  cp $srcdir/frame_subsampling_factor $dir\nfi\n\nlattice_determinize_cmd=\nif $determinize; then\n  lattice_determinize_cmd=\"lattice-determinize-non-compact --acoustic-scale=$acwt --max-mem=$max_mem --minimize=$minimize --prune=true --beam=$lattice_beam ark:- ark:- |\"\nfi\n\nif [ $sub_split -eq 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode_den.JOB.log \\\n    nnet3-latgen-faster$thread_string $ivector_opts $frame_subsampling_opt \\\n    --frames-per-chunk=$frames_per_chunk \\\n    --extra-left-context=$extra_left_context \\\n    --extra-right-context=$extra_right_context \\\n    --extra-left-context-initial=$extra_left_context_initial \\\n    --extra-right-context-final=$extra_right_context_final \\\n    --minimize=false --determinize-lattice=false \\\n    --word-determinize=false --phone-determinize=false \\\n    --max-active=$max_active --min-active=$min_active --beam=$beam \\\n    --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=false \\\n    --max-mem=$max_mem --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n    $dir/dengraph/HCLG.fst \"$feats\" \\\n    \"ark:|$lattice_determinize_cmd gzip -c >$dir/lat.JOB.gz\" || exit 1\nelse\n\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have stragglers\n  # from one job, we can be processing another one at the same time.\n  rm $dir/.error 2>/dev/null\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      sdata2=$data/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n\n      $cmd --num-threads $num_threads JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        nnet3-latgen-faster$thread_string $ivector_opts $frame_subsampling_opt \\\n        --frames-per-chunk=$frames_per_chunk \\\n        --extra-left-context=$extra_left_context \\\n        --extra-right-context=$extra_right_context \\\n        --extra-left-context-initial=$extra_left_context_initial \\\n        --extra-right-context-final=$extra_right_context_final \\\n        --minimize=false --determinize-lattice=false \\\n        --word-determinize=false --phone-determinize=false \\\n        --max-active=$max_active --min-active=$min_active --beam=$beam \\\n        --lattice-beam=$lattice_beam --acoustic-scale=$acwt --allow-partial=false \\\n        --max-mem=$max_mem --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n        $dir/dengraph/HCLG.fst \"$feats_subset\" \\\n        \"ark:|$lattice_determinize_cmd gzip -c >$dir/lat.$n.JOB.gz\" || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! 
-z \"$prev_pid\" ]; then  # Wait for the previous job; merge the previous set of lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && echo \"$0: error generating denominator lattices\" && exit 1;\n      rm $dir/.merge_error 2>/dev/null\n      echo Merging archives for data subset $prev_n\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$prev_n.$k.gz || touch $dir/.merge_error;\n      done | gzip -c > $dir/lat.$prev_n.gz || touch $dir/.merge_error;\n      [ -f $dir/.merge_error ] && echo \"$0: Merging lattices for subset $prev_n failed (or maybe some other error)\" && exit 1;\n      rm $dir/lat.$prev_n.*.gz\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n\necho \"$0: done generating denominator lattices.\"\n"
  },
  {
    "path": "egs/steps/nnet3/make_tdnn_configs.py",
    "content": "#!/usr/bin/env python\n\n# This script is deprecated, please use ../xconfig_to_configs.py\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nimport re, os, argparse, sys, math, warnings\n\n\nparser = argparse.ArgumentParser(description=\"Writes config files and variables \"\n                                 \"for TDNNs creation and training\",\n                                 epilog=\"See steps/nnet3/train_tdnn.sh for example.\");\nparser.add_argument(\"--splice-indexes\", type=str,\n                    help=\"Splice indexes at each hidden layer, e.g. '-3,-2,-1,0,1,2,3 0 -2,2 0 -4,4 0 -8,8'\")\nparser.add_argument(\"--feat-dim\", type=int,\n                    help=\"Raw feature dimension, e.g. 13\")\nparser.add_argument(\"--ivector-dim\", type=int,\n                    help=\"iVector dimension, e.g. 100\", default=0)\nparser.add_argument(\"--include-log-softmax\", type=str,\n                    help=\"add the final softmax layer \", default=\"true\", choices = [\"false\", \"true\"])\nparser.add_argument(\"--final-layer-normalize-target\", type=float,\n                    help=\"RMS target for final layer (set to <1 if final layer learns too fast\",\n                    default=1.0)\nparser.add_argument(\"--pnorm-input-dim\", type=int,\n                    help=\"input dimension to p-norm nonlinearities\")\nparser.add_argument(\"--pnorm-output-dim\", type=int,\n                    help=\"output dimension of p-norm nonlinearities\")\nparser.add_argument(\"--relu-dim\", type=int,\n                    help=\"dimension of ReLU nonlinearities\")\nparser.add_argument(\"--use-presoftmax-prior-scale\", type=str,\n                    help=\"if true, a presoftmax-prior-scale is added\",\n                    choices=['true', 'false'], default = \"true\")\nparser.add_argument(\"--num-targets\", type=int,\n                    help=\"number of network targets (e.g. 
num-pdf-ids/num-leaves)\")\nparser.add_argument(\"config_dir\",\n                    help=\"Directory to write config files and variables\");\n\nprint(' '.join(sys.argv))\n\nargs = parser.parse_args()\n\nif not os.path.exists(args.config_dir):\n    os.makedirs(args.config_dir)\n\n## Check arguments.\nif args.splice_indexes is None:\n    sys.exit(\"--splice-indexes argument is required\");\nif args.feat_dim is None or not (args.feat_dim > 0):\n    sys.exit(\"--feat-dim argument is required\");\nif args.num_targets is None or not (args.num_targets > 0):\n    sys.exit(\"--num-targets argument is required\");\nif not args.relu_dim is None:\n    if not args.pnorm_input_dim is None or not args.pnorm_output_dim is None:\n        sys.exit(\"--relu-dim argument not compatible with \"\n                 \"--pnorm-input-dim or --pnorm-output-dim options\");\n    nonlin_input_dim = args.relu_dim\n    nonlin_output_dim = args.relu_dim\nelse:\n    # Check for None first: in Python 3, 'None > 0' raises TypeError.\n    if (args.pnorm_input_dim is None or args.pnorm_output_dim is None or\n        not args.pnorm_input_dim > 0 or not args.pnorm_output_dim > 0):\n        sys.exit(\"--relu-dim not set, so expected --pnorm-input-dim and \"\n                 \"--pnorm-output-dim to be provided.\");\n    nonlin_input_dim = args.pnorm_input_dim\n    nonlin_output_dim = args.pnorm_output_dim\n\nif args.use_presoftmax_prior_scale == \"true\":\n    use_presoftmax_prior_scale = True\nelse:\n    use_presoftmax_prior_scale = False\n\n## Work out splice_array e.g. splice_array = [ [ -3,-2,...3 ], [0], [-2,2], .. 
[ -8,8 ] ]\nsplice_array = []\nleft_context = 0\nright_context = 0\nsplit1 = args.splice_indexes.split();  # we already checked the string is nonempty.\nif len(split1) < 1:\n    sys.exit(\"invalid --splice-indexes argument, too short: \"\n             + args.splice_indexes)\ntry:\n    for string in split1:\n        split2 = string.split(\",\")\n        if len(split2) < 1:\n            sys.exit(\"invalid --splice-indexes argument, too-short element: \"\n                     + args.splice_indexes)\n        int_list = []\n        for int_str in split2:\n            int_list.append(int(int_str))\n        if not int_list == sorted(int_list):\n            sys.exit(\"elements of --splice-indexes must be sorted: \"\n                     + args.splice_indexes)\n        left_context += -int_list[0]\n        right_context += int_list[-1]\n        splice_array.append(int_list)\nexcept ValueError as e:\n    sys.exit(\"invalid --splice-indexes argument \" + args.splice_indexes + str(e))\nleft_context = max(0, left_context)\nright_context = max(0, right_context)\nnum_hidden_layers = len(splice_array)\ninput_dim = len(splice_array[0]) * args.feat_dim  +  args.ivector_dim\n\nf = open(args.config_dir + \"/vars\", \"w\")\nprint('left_context={}'.format(left_context), file=f)\nprint('right_context={}'.format(right_context), file=f)\n# the initial l/r contexts are actually not needed.\n# print('initial_left_context=' + str(splice_array[0][0]), file=f)\n# print('initial_right_context=' + str(splice_array[0][-1]), file=f)\nprint('num_hidden_layers={}'.format(num_hidden_layers), file=f)\nf.close()\n\nf = open(args.config_dir + \"/init.config\", \"w\")\nprint('# Config file for initializing neural network prior to', file=f)\nprint('# preconditioning matrix computation', file=f)\nprint('input-node name=input dim={}'.format(args.feat_dim), file=f)\nlist=[ ('Offset(input, {0})'.format(n) if n != 0 else 'input' ) for n in splice_array[0] ]\nif args.ivector_dim > 0:\n    print('input-node 
name=ivector dim={}'.format(args.ivector_dim), file=f)\n    list.append('ReplaceIndex(ivector, t, 0)')\n# example of next line:\n# output-node name=output input=\"Append(Offset(input, -3), Offset(input, -2), Offset(input, -1), ... , Offset(input, 3), ReplaceIndex(ivector, t, 0))\"\nprint('output-node name=output input=Append({0})'.format(\", \".join(list)), file=f)\nf.close()\n\nfor l in range(1, num_hidden_layers + 1):\n    f = open(args.config_dir + \"/layer{0}.config\".format(l), \"w\")\n    print('# Config file for layer {0} of the network'.format(l), file=f)\n    if l == 1:\n        print('component name=lda type=FixedAffineComponent matrix={0}/lda.mat'.\n              format(args.config_dir), file=f)\n    cur_dim = (nonlin_output_dim * len(splice_array[l-1]) if l > 1 else input_dim)\n\n    print('# Note: param-stddev in next component defaults to 1/sqrt(input-dim).', file=f)\n    print('component name=affine{0} type=NaturalGradientAffineComponent '\n          'input-dim={1} output-dim={2} bias-stddev=0'.\n        format(l, cur_dim, nonlin_input_dim), file=f)\n    if args.relu_dim is not None:\n        print('component name=nonlin{0} type=RectifiedLinearComponent dim={1}'.\n              format(l, args.relu_dim), file=f)\n    else:\n        print('# In nnet3 framework, p in P-norm is always 2.', file=f)\n        print('component name=nonlin{0} type=PnormComponent input-dim={1} output-dim={2}'.\n              format(l, args.pnorm_input_dim, args.pnorm_output_dim), file=f)\n    print('component name=renorm{0} type=NormalizeComponent dim={1} target-rms={2}'.format(\n        l, nonlin_output_dim,\n        (1.0 if l < num_hidden_layers else args.final_layer_normalize_target)), file=f)\n    print('component name=final-affine type=NaturalGradientAffineComponent '\n          'input-dim={0} output-dim={1} param-stddev=0 bias-stddev=0'.format(\n          nonlin_output_dim, args.num_targets), file=f)\n    # printing out the next two, and their component-nodes, for l > 1 
is not\n    # really necessary as they will already exist, but it doesn't hurt and makes\n    # the structure clearer.\n    if args.include_log_softmax == \"true\":\n        if use_presoftmax_prior_scale :\n            print('component name=final-fixed-scale type=FixedScaleComponent '\n                  'scales={0}/presoftmax_prior_scale.vec'.format(\n                    args.config_dir), file=f)\n        print('component name=final-log-softmax type=LogSoftmaxComponent dim={0}'.format(\n                args.num_targets), file=f)\n    print('# Now for the network structure', file=f)\n    if l == 1:\n        splices = [ ('Offset(input, {0})'.format(n) if n != 0 else 'input') for n in splice_array[l-1] ]\n        if args.ivector_dim > 0: splices.append('ReplaceIndex(ivector, t, 0)')\n        orig_input='Append({0})'.format(', '.join(splices))\n        # e.g. orig_input = 'Append(Offset(input, -2), ... Offset(input, 2), ivector)'\n        print('component-node name=lda component=lda input={0}'.format(orig_input),\n              file=f)\n        cur_input='lda'\n    else:\n        # e.g. 
cur_input = 'Append(Offset(renorm1, -2), renorm1, Offset(renorm1, 2))'\n        splices = [ ('Offset(renorm{0}, {1})'.format(l-1, n) if n !=0 else 'renorm{0}'.format(l-1))\n                    for n in splice_array[l-1] ]\n        cur_input='Append({0})'.format(', '.join(splices))\n    print('component-node name=affine{0} component=affine{0} input={1} '.\n          format(l, cur_input), file=f)\n    print('component-node name=nonlin{0} component=nonlin{0} input=affine{0}'.\n          format(l), file=f)\n    print('component-node name=renorm{0} component=renorm{0} input=nonlin{0}'.\n          format(l), file=f)\n\n    print('component-node name=final-affine component=final-affine input=renorm{0}'.\n          format(l), file=f)\n\n    if args.include_log_softmax == \"true\":\n        if use_presoftmax_prior_scale:\n            print('component-node name=final-fixed-scale component=final-fixed-scale input=final-affine',\n                  file=f)\n            print('component-node name=final-log-softmax component=final-log-softmax '\n                  'input=final-fixed-scale', file=f)\n        else:\n            print('component-node name=final-log-softmax component=final-log-softmax '\n                  'input=final-affine', file=f)\n        print('output-node name=output input=final-log-softmax', file=f)\n    else:\n        print('output-node name=output input=final-affine', file=f)\n    f.close()\n\n# component name=nonlin1 type=PnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim\n# component name=renorm1 type=NormalizeComponent dim=$pnorm_output_dim\n# component name=final-affine type=NaturalGradientAffineComponent input-dim=$pnorm_output_dim output-dim=$num_leaves param-stddev=0 bias-stddev=0\n# component name=final-log-softmax type=LogSoftmaxComponent dim=$num_leaves\n\n\n# ## Write file $config_dir/init.config to initialize the network, prior to computing the LDA matrix.\n# ##will look like this, if we have iVectors:\n# input-node name=input 
dim=13\n# input-node name=ivector dim=100\n# output-node name=output input=\"Append(Offset(input, -3), Offset(input, -2), Offset(input, -1), ... , Offset(input, 3), ReplaceIndex(ivector, t, 0))\"\n\n# ## Write file $config_dir/layer1.config that adds the LDA matrix, assumed to be in the config directory as\n# ## lda.mat, the first hidden layer, and the output layer.\n# component name=lda type=FixedAffineComponent matrix=$config_dir/lda.mat\n# component name=affine1 type=NaturalGradientAffineComponent input-dim=$lda_input_dim output-dim=$pnorm_input_dim bias-stddev=0\n# component name=nonlin1 type=PnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim\n# component name=renorm1 type=NormalizeComponent dim=$pnorm_output_dim\n# component name=final-affine type=NaturalGradientAffineComponent input-dim=$pnorm_output_dim output-dim=$num_leaves param-stddev=0 bias-stddev=0\n# component name=final-log-softmax type=LogSoftmax dim=$num_leaves\n# # InputOf(output) says use the same Descriptor of the current \"output\" node.\n# component-node name=lda component=lda input=InputOf(output)\n# component-node name=affine1 component=affine1 input=lda\n# component-node name=nonlin1 component=nonlin1 input=affine1\n# component-node name=renorm1 component=renorm1 input=nonlin1\n# component-node name=final-affine component=final-affine input=renorm1\n# component-node name=final-log-softmax component=final-log-softmax input=final-affine\n# output-node name=output input=final-log-softmax\n\n\n# ## Write file $config_dir/layer2.config that adds the second hidden layer.\n# component name=affine2 type=NaturalGradientAffineComponent input-dim=$lda_input_dim output-dim=$pnorm_input_dim bias-stddev=0\n# component name=nonlin2 type=PnormComponent input-dim=$pnorm_input_dim output-dim=$pnorm_output_dim\n# component name=renorm2 type=NormalizeComponent dim=$pnorm_output_dim\n# component name=final-affine type=NaturalGradientAffineComponent input-dim=$pnorm_output_dim 
output-dim=$num_leaves param-stddev=0 bias-stddev=0\n# component-node name=affine2 component=affine2 input=Append(Offset(renorm1, -2), Offset(renorm1, 2))\n# component-node name=nonlin2 component=nonlin2 input=affine2\n# component-node name=renorm2 component=renorm2 input=nonlin2\n# component-node name=final-affine component=final-affine input=renorm2\n# component-node name=final-log-softmax component=final-log-softmax input=final-affine\n# output-node name=output input=final-log-softmax\n\n\n# ## ... etc.  In this example it would go up to $config_dir/layer5.config.\n\n"
  },
  {
    "path": "egs/steps/nnet3/multilingual/allocate_multilingual_examples.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright      2017 Pegah Ghahremani\n#                2018 Hossein Hadian\n#\n# Apache 2.0.\n\n\"\"\" This script generates examples for multilingual training of neural network.\n    This scripts produces 3 sets of files --\n    egs.*.scp, egs.output.*.ark, egs.weight.*.ark\n\n    egs.*.scp are the SCP files of the training examples.\n    egs.weight.*.ark map from the key of the example to the language-specific\n    weight of that example.\n    egs.output.*.ark map from the key of the example to the name of\n    the output-node in the neural net for that specific language, e.g.\n    'output-2'.\n\n    --egs-prefix option can be used to generate train and diagnostics egs files.\n    If --egs-prefix=train_diagnostics. is passed, then the files produced by the\n    script will be named with the prefix as \"train_diagnostics.\"\n    instead of \"egs.\"\n    i.e. the files produced are -- train_diagnostics.*.scp,\n    train_diagnostics.output.*.ark, train_diagnostics.weight.*.ark and\n    train_diagnostics.ranges.*.txt.\n    The other egs-prefix options used in the recipes are \"valid_diagnositics.\"\n    for validation examples and \"combine.\" for examples used for model\n    combination.\n\n    For chain training egs, the --egs-prefix option should be \"cegs.\"\n\n    You can call this script as (e.g.):\n\n    allocate_multilingual_examples.py [opts] example-scp-lists\n        multilingual-egs-dir\n\n    allocate_multilingual_examples.py --block-size 512\n        --lang2weight  \"0.2,0.8\" exp/lang1/egs.scp exp/lang2/egs.scp\n        exp/multi/egs\n\n\"\"\"\n\nimport os, argparse, sys, random\nimport logging\nimport traceback\n\nsys.path.insert(0, 'steps')\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - 
%(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Start generating multilingual examples')\n\n\ndef get_args():\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\" This script generates examples for multilingual training\n        of neural network by producing 3 sets of primary files\n        as egs.*.scp, egs.output.*.ark, egs.weight.*.ark.\n        egs.*.scp are the SCP files of the training examples.\n        egs.weight.*.ark map from the key of the example to the language-specific\n        weight of that example.\n        egs.output.*.ark map from the key of the example to the name of\n        the output-node in the neural net for that specific language, e.g.\n        'output-2'.\"\"\",\n        epilog=\"Called by steps/nnet3/multilingual/combine_egs.sh\")\n\n    parser.add_argument(\"--num-archives\", type=int, default=None,\n                        help=\"Number of archives to split the data into. (Note: in reality they are not \"\n                        \"archives, only scp files, but we use this notation by analogy with the \"\n                        \"conventional egs-creating script).\")\n    parser.add_argument(\"--block-size\", type=int, default=512,\n                        help=\"This relates to locality of disk access. 'block-size' is\"\n                        \"the average number of examples that are read consecutively\"\n                        \"from each input scp file (and are written in the same order to the output scp files)\"\n                        \"Smaller values lead to more random disk access (during \"\n                        \"the nnet3 training process).\")\n    parser.add_argument(\"--egs-prefix\", type=str, default=\"egs.\",\n                        help=\"This option can be used to add a prefix to the filenames \"\n                        \"of the output files. For e.g. \"\n                        \"if --egs-prefix=combine. 
, then the files produced \"\n                        \"by this script will be \"\n                        \"combine.output.*.ark, combine.weight.*.ark, and combine.*.scp\")\n    parser.add_argument(\"--lang2weight\", type=str,\n                        help=\"Comma-separated list of weights, one per language. \"\n                        \"The language order is the same as in egs_scp_lists.\")\n    # now the positional arguments\n    parser.add_argument(\"egs_scp_lists\", nargs='+',\n                        help=\"List of egs.scp files, one per input language, \"\n                           \"e.g. exp/lang1/egs/egs.scp exp/lang2/egs/egs.scp\")\n    parser.add_argument(\"egs_dir\",\n                        help=\"Name of output egs directory e.g. exp/tdnn_multilingual_sp/egs\")\n\n\n    print(sys.argv, file=sys.stderr)\n    args = parser.parse_args()\n\n    return args\n\n\ndef read_lines(file_handle, num_lines):\n    n_read = 0\n    lines = []\n    while n_read < num_lines:\n        line = file_handle.readline()\n        if not line:\n            break\n        lines.append(line.strip())\n        n_read += 1\n    return lines\n\n\ndef process_multilingual_egs(args):\n    # 'args' is the namespace already parsed by get_args() in main().\n    scp_lists = args.egs_scp_lists\n    num_langs = len(scp_lists)\n\n    lang_to_num_examples = [0] * num_langs\n    for lang in range(num_langs):\n        with open(scp_lists[lang]) as fh:\n            lang_to_num_examples[lang] = sum([1 for line in fh])\n        logger.info(\"Number of examples for language {0} \"\n                    \"is {1}.\".format(lang, lang_to_num_examples[lang]))\n\n    # If weights are not provided, the weights are 1.0.\n    if args.lang2weight is None:\n        lang2weight = [1.0] * num_langs\n    else:\n        lang2weight = args.lang2weight.split(\",\")\n        assert(len(lang2weight) == num_langs)\n\n    if not os.path.exists(os.path.join(args.egs_dir, 'info')):\n        os.makedirs(os.path.join(args.egs_dir, 'info'))\n\n    with 
open(\"{0}/info/{1}num_tasks\".format(args.egs_dir, args.egs_prefix), \"w\") as fh:\n        print(\"{0}\".format(num_langs), file=fh)\n\n    # Total number of egs in all languages\n    tot_num_egs = sum(lang_to_num_examples[i] for i in range(num_langs))\n    num_archives = args.num_archives\n\n    with open(\"{0}/info/{1}num_archives\".format(args.egs_dir, args.egs_prefix), \"w\") as fh:\n        print(\"{0}\".format(num_archives), file=fh)\n\n    logger.info(\"There are a total of {} examples in the input scp \"\n                \"files.\".format(tot_num_egs))\n    logger.info(\"Number of blocks in each output archive will be approximately \"\n                \"{}, and block-size is {}.\".format(int(round(tot_num_egs / num_archives / args.block_size)),\n                                                   args.block_size))\n    for lang in range(num_langs):\n        blocks_per_archive_this_lang = lang_to_num_examples[lang] / num_archives / args.block_size\n        warning = \"\"\n        if blocks_per_archive_this_lang < 1.0:\n            warning = (\"Warning: This means some of the output archives might \"\n                       \"not include any examples from this lang.\")\n        logger.info(\"The proportion of egs from lang {} is {:.2f}. The number of blocks \"\n                    \"per archive for this lang is approximately {:.2f}. 
\"\n                    \"{}\".format(lang, float(lang_to_num_examples[lang]) / tot_num_egs,\n                                blocks_per_archive_this_lang,\n                                warning))\n\n    in_scp_file_handles = [open(scp_lists[lang], 'r') for lang in range(num_langs)]\n\n    num_remaining_egs = tot_num_egs\n    lang_to_num_remaining_egs = [n for n in lang_to_num_examples]\n    for archive_index in range(num_archives + 1):  #  +1 is because we write to the last archive in two rounds\n        num_remaining_archives = num_archives - archive_index\n        num_remaining_blocks = float(num_remaining_egs) / args.block_size\n\n        last_round = (archive_index == num_archives)\n        if not last_round:\n            num_blocks_this_archive = int(round(float(num_remaining_blocks) / num_remaining_archives))\n            logger.info(\"Generating archive {} containing {} blocks...\".format(archive_index, num_blocks_this_archive))\n        else:  # This is the second round for the last archive. 
Flush all the remaining egs...\n            archive_index = num_archives - 1\n            num_blocks_this_archive = num_langs\n            logger.info(\"Writing all the {} remaining egs to the last archive...\".format(num_remaining_egs))\n\n        out_scp_file_handle = open('{0}/{1}{2}.scp'.format(args.egs_dir, args.egs_prefix, archive_index + 1),\n                                   'a' if last_round else 'w')\n        eg_to_output_file_handle = open(\"{0}/{1}output.{2}.ark\".format(args.egs_dir, args.egs_prefix, archive_index + 1),\n                                        'a' if last_round else 'w')\n        eg_to_weight_file_handle = open(\"{0}/{1}weight.{2}.ark\".format(args.egs_dir, args.egs_prefix, archive_index + 1),\n                                        'a' if last_round else 'w')\n\n\n        for block_index in range(num_blocks_this_archive):\n            # Find the lang with the highest proportion of remaining examples\n            remaining_proportions = [float(remain) / tot for remain, tot in zip(lang_to_num_remaining_egs, lang_to_num_examples)]\n            lang_index, max_proportion = max(enumerate(remaining_proportions), key=lambda a: a[1])\n\n            # Read 'block_size' examples from the selected lang and write them to the current output scp file:\n            example_lines  = read_lines(in_scp_file_handles[lang_index], args.block_size)\n            for eg_line in example_lines:\n                eg_id = eg_line.split()[0]\n                print(eg_line, file=out_scp_file_handle)\n                print(\"{0} output-{1}\".format(eg_id, lang_index), file=eg_to_output_file_handle)\n                print(\"{0} {1}\".format(eg_id, lang2weight[lang_index]), file=eg_to_weight_file_handle)\n\n            num_remaining_egs -= len(example_lines)\n            lang_to_num_remaining_egs[lang_index] -= len(example_lines)\n\n        out_scp_file_handle.close()\n        eg_to_output_file_handle.close()\n        eg_to_weight_file_handle.close()\n\n    for 
handle in in_scp_file_handles:\n        handle.close()\n    logger.info(\"Finished generating {0}*.scp, {0}output.*.ark \"\n                \"and {0}weight.*.ark files. Wrote a total of {1} examples \"\n                \"to {2} archives.\".format(args.egs_prefix,\n                                          tot_num_egs - num_remaining_egs, num_archives))\n\n\ndef main():\n    try:\n        args = get_args()\n        process_multilingual_egs(args)\n    except Exception as e:\n        traceback.print_exc()\n        sys.exit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/multilingual/combine_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017     Pegah Ghahremani\n#           2017-18  Vimal Manohar\n#           2018     Hossein Hadian\n# Apache 2.0\n\n# This script generates examples for multilingual training of neural network\n# using separate input egs dir per language as input.\n# This scripts produces 3 sets of files --\n# egs.*.scp, egs.output.*.ark, egs.weight.*.ark\n#\n# egs.*.scp are the SCP files of the training examples.\n# egs.weight.*.ark map from the key of the example to the language-specific\n# weight of that example.\n# egs.output.*.ark map from the key of the example to the name of\n# the output-node in the neural net for that specific language, e.g.\n# 'output-2'.\n#\n# Begin configuration section.\ncmd=run.pl\nblock_size=256          # This is the number of consecutive egs that we take from\n                        # each source, and it only affects the locality of disk\n                        # access.\nlang2weight=            # array of weights one per input languge to scale example's output\n                        # w.r.t its input language during training.\nstage=0\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 3 ]; then\n  cat <<EOF\n  This script generates examples for multilingual training of neural network\n  using separate input egs dir per language as input.\n  See top of the script for details.\n\n  Usage: $0 [opts] <num-input-langs,N> <lang1-egs-dir> ...<langN-egs-dir> <multilingual-egs-dir>\n   e.g.: $0 [opts] 2 exp/lang1/egs exp/lang2/egs exp/multi/egs\n\n  Options:\n      --cmd (utils/run.pl|utils/queue.pl <queue opts>)  # how to run jobs.\n      --block-size <int|512>      # it is the number of consecutive egs that we take from \n                                  # each source, and it only affects the locality of disk \n                                  # access. 
This does not have to be the actual minibatch size\nEOF\n  exit 1;\nfi\n\nnum_langs=$1\n\nshift 1\nargs=(\"$@\")\nmegs_dir=${args[-1]} # multilingual directory\nmkdir -p $megs_dir\nmkdir -p $megs_dir/info\nif [ ${#args[@]} != $[$num_langs+1] ]; then\n  echo \"$0: num of input example dirs provided is not compatible with num_langs $num_langs.\"\n  echo \"Usage: $0 [opts] <num-input-langs,N> <lang1-egs-dir> ...<langN-egs-dir> <multilingual-egs-dir>\"\n  echo \" e.g.: $0 [opts] 2 exp/lang1/egs exp/lang2/egs exp/multi/egs\"\n  exit 1;\nfi\n\nrequired=\"egs.scp combine.scp train_diagnostic.scp valid_diagnostic.scp\"\ntrain_scp_list=\ntrain_diagnostic_scp_list=\nvalid_diagnostic_scp_list=\ncombine_scp_list=\n\n# Read the parameters from the first input egs dir (its info/ files and cmvn_opts)\n# and write them to the multilingual egs dir.\ncheck_params=\"info/feat_dim info/ivector_dim info/left_context info/right_context info/left_context_initial info/right_context_final cmvn_opts\"\nivec_dim=`cat ${args[0]}/info/ivector_dim`\nif [ $ivec_dim -ne 0 ];then check_params=\"$check_params info/final.ie.id\"; fi\n\nfor param in $check_params info/frames_per_eg; do\n  cat ${args[0]}/$param > $megs_dir/$param || exit 1;\ndone\n\ntot_num_archives=0\nfor lang in $(seq 0 $[$num_langs-1]);do\n  multi_egs_dir[$lang]=${args[$lang]}\n  for f in $required; do\n    if [ ! 
-f ${multi_egs_dir[$lang]}/$f ]; then\n      echo \"$0: no such file ${multi_egs_dir[$lang]}/$f.\" && exit 1;\n    fi\n  done\n  num_archives=$(cat ${multi_egs_dir[$lang]}/info/num_archives)\n  tot_num_archives=$[tot_num_archives+num_archives]\n  train_scp_list=\"$train_scp_list ${args[$lang]}/egs.scp\"\n  train_diagnostic_scp_list=\"$train_diagnostic_scp_list ${args[$lang]}/train_diagnostic.scp\"\n  valid_diagnostic_scp_list=\"$valid_diagnostic_scp_list ${args[$lang]}/valid_diagnostic.scp\"\n  combine_scp_list=\"$combine_scp_list ${args[$lang]}/combine.scp\"\n\n  # check that the parameters are the same in all egs dirs\n  for f in $check_params; do\n    if [ -f $megs_dir/$f ] && [ -f ${multi_egs_dir[$lang]}/$f ]; then\n      f1=$(cat $megs_dir/$f)\n      f2=$(cat ${multi_egs_dir[$lang]}/$f)\n      if [ \"$f1\" != \"$f2\" ]  ; then\n        echo \"$0: mismatch for $f in $megs_dir vs. ${multi_egs_dir[$lang]} ($f1 vs. $f2).\"\n        exit 1;\n      fi\n    else\n      echo \"$0: file $f does not exist in $megs_dir or in ${multi_egs_dir[$lang]}.\"\n    fi\n  done\ndone\n\nif [ ! 
-z \"$lang2weight\" ]; then\n  egs_opt=\"--lang2weight '$lang2weight'\"\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: allocating multilingual examples for training.\"\n  # Generate egs.*.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_train.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives $tot_num_archives \\\n      --block-size $block_size \\\n      $train_scp_list $megs_dir || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: combine combine.scp examples from all langs in $megs_dir/combine.scp.\"\n  # Generate combine.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_combine.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives 1 \\\n      --block-size $block_size \\\n      --egs-prefix \"combine.\" \\\n      $combine_scp_list $megs_dir || exit 1;\n\n  echo \"$0: combine train_diagnostic.scp examples from all langs in $megs_dir/train_diagnostic.scp.\"\n  # Generate train_diagnostic.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_train_diagnostic.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives 1 \\\n      --block-size $block_size \\\n      --egs-prefix \"train_diagnostic.\" \\\n      $train_diagnostic_scp_list $megs_dir || exit 1;\n\n\n  echo \"$0: combine valid_diagnostic.scp examples from all langs in $megs_dir/valid_diagnostic.scp.\"\n  # Generate valid_diagnostic.scp for multilingual setup.\n  $cmd $megs_dir/log/allocate_multilingual_examples_valid_diagnostic.log \\\n    steps/nnet3/multilingual/allocate_multilingual_examples.py $egs_opt \\\n      --num-archives 1 \\\n      --block-size $block_size \\\n      --egs-prefix \"valid_diagnostic.\" \\\n      $valid_diagnostic_scp_list $megs_dir || exit 1;\n\nfi\nfor egs_type in combine train_diagnostic valid_diagnostic; do\n  mv 
$megs_dir/${egs_type}.output.1.ark $megs_dir/${egs_type}.output.ark || exit 1;\n  mv $megs_dir/${egs_type}.weight.1.ark $megs_dir/${egs_type}.weight.ark || exit 1;\n  mv $megs_dir/${egs_type}.1.scp $megs_dir/${egs_type}.scp || exit 1;\ndone\nmv $megs_dir/info/egs.num_archives $megs_dir/info/num_archives || exit 1;\nmv $megs_dir/info/egs.num_tasks $megs_dir/info/num_tasks || exit 1;\necho \"$0: Finished preparing multilingual training example.\"\n"
  },
  {
    "path": "egs/steps/nnet3/nnet3_to_dot.sh",
    "content": "#!/usr/bin/env bash\n\n# script showing use of nnet3_to_dot.py\n# Copyright 2015  Johns Hopkins University (Author: Vijayaditya Peddinti).\n\n# Begin configuration section.\ncomponent_attributes=\"name,type\"\nnode_prefixes=\"\"\ninfo_bin=nnet3-am-info\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <nnet3-mdl-file> <output-dot-file> <output-png-file>\"\n  echo \" e.g.: $0 exp/sdm1/nnet3/lstm_sp/0.mdl lstm.dot lstm.png\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --info-bin <nnet3-am-info|nnet3-info>        # Name of the binary to generate the nnet3 file\"\n  echo \"  --component-attributes <string|name,type>     # attributes to be printed in nnet3 components\"\n  echo \"  --node-prefixes <string|Lstm1,Lstm2>          # list of prefixes. Nnet3 components/component-nodes with the same prefix\"\n  echo \"                                                # will be clustered together in the dot-graph\"\n\n\n  exit 1;\nfi\n\nmodel=$1\ndot_file=$2\noutput_file=$3\n\nattr=${node_prefixes:+ --node-prefixes \"$node_prefixes\"}\n$info_bin $model | \\\n  steps/nnet3/dot/nnet3_to_dot.py \\\n    --component-attributes \"$component_attributes\" \\\n    $attr $dot_file\necho \"Generated the dot file $dot_file\"\n\ncommand -v dot >/dev/null 2>&1 || { echo >&2 \"This script requires dot but it's not installed. Please compile $dot_file with dot\"; exit 1; }\ndot -Tpdf $dot_file -o $output_file\n"
  },
  {
    "path": "egs/steps/nnet3/report/convert_model.py",
    "content": "#!/usr/bin/env python3\n\n# This script dumps the parameters of (most components of) an nnet3 model as a\n# pickled python dict.  (see documentation for the function 'read_model' below\n# for more details).\n#\n# It also contains some utility function that you can get access by importing this\n# file.\n#\n# In egs/mini_librispeech/s5/local/chain/diagnostic/report_example.py, you can\n# find an example of the use of this script.\n#\n# Copyright 2017-2018    Daniel Povey\n# Apache 2.0.\n\n\n# This requires python 3.\n\nimport sys\nimport subprocess\nimport numpy as np\nimport pickle\n\n\ndef read_next_token(s, pos):\n   \"\"\"This function, given a string s (probably a long string, like a line or a file)\n      and a position 'pos', finds the next token in the string (defined as a nonempty\n      sequence of whitespace characters delimited by whitespace), and advances the\n      position to one character after the end of this token.\n\n      's' is expected to be of type 'str' and 'pos' of type 'int'.\n      This function returns a tuple\n         (token, new_pos).\n      If we're at the end of the string (there is only whitespace between 'pos' and\n      the end), then 'token' will be None and 'pos' will be len(s).\n   \"\"\"\n   assert isinstance(s, str) and isinstance(pos, int)\n   assert pos >= 0\n   # Skip over any initial whitespace.\n   while pos < len(s) and s[pos].isspace():\n      pos += 1\n   if pos >= len(s):\n      # We reached the end of the string s without finding any non-whitespace.\n      return (None, pos)\n   initial_pos = pos\n   while pos < len(s) and not s[pos].isspace():\n      pos += 1\n   token = s[initial_pos:pos]\n   return (token, pos)\n\ndef check_for_newline(s, pos):\n   \"\"\"This function, given a string s (probably a long string, like a line or a file)\n      and a position 'pos', in the string, eats up all the whitespace it can\n      and records whether a newline was among that whitespace.\n      It returns a 
tuple\n         (saw_newline, new_pos)\n      where saw_newline will be true if a newline was seen, and new_pos is\n      the new position after eating up whitespace-- so either new_pos == len(s)\n      or s[new_pos] is non-whitespace.\n   \"\"\"\n   assert isinstance(s, str) and isinstance(pos, int)\n   assert pos >= 0\n   saw_newline = False\n   while pos < len(s) and s[pos].isspace():\n      if s[pos] == \"\\n\":\n         saw_newline = True\n      pos += 1\n   return (saw_newline, pos)\n\ndef read_float(s, pos):\n   \"\"\"This function, given a string s (probably a long string, like a line or a file)\n      and a position 'pos', tries to read a text-format floating point or integer,\n      starting from this position, and returns the\n      pair (float, new_position).\n      If something goes wrong it will print a warning to stderr and return (None, pos)\n   \"\"\"\n   orig_pos = pos\n   (tok, pos) = read_next_token(s, pos)\n   f = None\n   try:\n      f = float(tok)\n   except (TypeError, ValueError):\n      # TypeError covers tok being None (EOF); ValueError covers a non-numeric token.\n      print(\"{0}: at file position {1}, expected float but got {2}\".format(\n         sys.argv[0], orig_pos, tok), file=sys.stderr)\n      return (None, pos)\n   return (f, pos)\n\ndef read_int(s, pos):\n   \"\"\"This function, given a string s (probably a long string, like a line or a\n      file) and a position 'pos', tries to read a text-format integer, starting\n      from this position, and returns the\n      pair (int, new_position).\n      If something goes wrong it will print a warning to stderr and return (None, pos)\n   \"\"\"\n   orig_pos = pos\n   (tok, pos) = read_next_token(s, pos)\n   i = None\n   try:\n      i = int(tok)\n   except (TypeError, ValueError):\n      # TypeError covers tok being None (EOF); ValueError covers a non-integer token.\n      print(\"{0}: at file position {1}, expected int but got {2}\".format(\n         sys.argv[0], orig_pos, tok), file=sys.stderr)\n      return (None, pos)\n   return (i, pos)\n\ndef read_vector(s, pos):\n   \"\"\"This function, given a string s (probably a long string, like a line or a file)\n      and a position 'pos', 
tries to read a text-format vector (something like \"[ 1.0 2.0 3.0 ]\")\n      starting from this position, reads it as a 1-dimensional numpy array, and returns\n      the pair (vector, new_position).\n      If something goes wrong it will print a warning to stderr and return (None, pos)\n   \"\"\"\n   orig_pos = pos\n   (tok, pos) = read_next_token(s, pos)\n   if tok != '[':\n      print(\"{0}: at file position {1}, expected vector but got {2}\".format(\n         sys.argv[0], orig_pos, tok), file=sys.stderr)\n      return (None, pos)\n   v = []\n   while True:\n      (tok, pos) = read_next_token(s, pos)\n      if tok is None or tok == ']':\n         break\n      try:\n         f = float(tok)\n         v.append(f)\n      except ValueError:\n         print(\"{0}: at file position {1}, reading vector, expected float but got {2}\".\n            format(sys.argv[0], pos, tok), file=sys.stderr)\n         return (None, pos)\n   if tok is None:\n      print(\"{0}: encountered EOF while reading vector.\".format(\n         sys.argv[0]), file=sys.stderr)\n      return (None, pos)\n   return (np.array(v, dtype=np.float32), pos)\n\n\ndef read_matrix(s, pos):\n   \"\"\"This function, given a string s (probably a long string, like a line or a file)\n      and a position 'pos', tries to read a text-format matrix\n      (something like \"[\\n 1.0 2.0\\n 3.0 4.0 ]\")\n      starting from this position, reads it as a 2-dimensional numpy array, and returns\n      the pair (matrix, new_position).\n      If something goes wrong it will print a warning to stderr and return (None, pos)\n   \"\"\"\n   orig_pos = pos\n   (tok, pos) = read_next_token(s, pos)\n   if tok != '[':\n      print(\"{0}: at file position {1}, expected matrix but got {2}\".format(\n         sys.argv[0], orig_pos, tok), file=sys.stderr)\n      return (None, pos)\n   # m will be an array of arrays (python arrays, not numpy arrays).\n   m = []\n   while True:\n      # At this point, assume we're ready to read 
a new vector\n      # (terminated by newline or by \"]\").\n      v = []\n      while True:\n         (tok, pos) = read_next_token(s, pos)\n         if tok == ']' or tok is None:\n            break\n         else:\n            try:\n               f = float(tok)\n               v.append(f)\n            except ValueError:\n               print(\"{0}: at file position {1}, reading matrix, expected float but got {2}\".format(\n                  sys.argv[0], pos, tok), file=sys.stderr)\n               return (None, pos)\n\n         (saw_newline, pos) = check_for_newline(s, pos)\n         if saw_newline:  # Newline terminates each row of the matrix.\n            break\n      if len(v) > 0:\n         m.append(v)\n      if tok is None:\n         # EOF before the closing \"]\": stop here rather than looping forever.\n         print(\"{0}: matrix starting at position {1} was unexpectedly terminated by EOF.\".format(\n            sys.argv[0], orig_pos), file=sys.stderr)\n         break\n      if tok == ']':\n         break\n   ans_mat = None\n   try:\n      ans_mat = np.array(m, dtype=np.float32)\n   except ValueError:\n      # Typically caused by rows of unequal length.\n      print(\"{0}: error converting matrix starting at position {1} into numpy array.\".format(\n         sys.argv[0], orig_pos), file=sys.stderr)\n   return (ans_mat, pos)\n\n\n\ndef is_component_type(component_type):\n   \"\"\"Returns True if 'component_type' is a plausible component type, e.g.\n   something of the form \"<xxxComponent>\", otherwise False\"\"\"\n   return (isinstance(component_type, str) and len(component_type) >= 13 and\n           component_type[0] == \"<\" and component_type[-10:] == \"Component>\")\n\n\ndef read_generic(s, pos, terminating_token, action_dict):\n   \"\"\"This function is a generic mechanism for parsing things from text files\n     (after reading the text file into a string).  
It will return a pair\n      (d, new_pos)\n     where new_pos is the position in the string after reading the object,\n     and d is a dict representing what we read in.\n\n     'terminating_token' is either a token (a whitespace-delimited string)\n         that terminates the object (something like \"</RectifiedLinearComponent>\"),\n         or a set containing possible terminating tokens.\n     'action_dict' is a dict from token to a pair (function, dict_key)\n         where 'function' is the function we should use to read in data,\n         and 'dict_key' is the key in the returned dictionary that we should\n         use to store the result.  For instance, we might have:\n             action_dict['<ParameterMatrix>'] = (read_matrix, 'params')\n     It is OK if not everything in the object is covered in 'action_dict'.\n     This function will simply skip over anything that it doesn't understand.\n   \"\"\"\n\n   if isinstance(terminating_token, str):\n      terminating_tokens = set([terminating_token])\n   else:\n      terminating_tokens = terminating_token\n      assert isinstance(terminating_tokens, set)\n   assert isinstance(action_dict, dict)\n\n   # d will contain the fields of the object.\n   d = dict()\n   orig_pos = pos\n   while True:\n      (tok, pos) = read_next_token(s, pos)\n      if tok in terminating_tokens:\n         break\n      if tok is None:\n         print(\"{0}: error reading object starting at position {1}, got EOF \"\n               \"while expecting one of: {2}\".format(\n                  sys.argv[0], orig_pos, terminating_tokens), file=sys.stderr)\n         break\n      if tok in action_dict:\n         p = action_dict[tok]\n         assert isinstance(p, tuple) and len(p) == 2\n         assert callable(p[0]) and isinstance(p[1], str)\n         (func, name) = p\n         (obj, pos) = func(s, pos)\n         d[name] = obj\n   return (d, pos)\n\n\ndef get_action_dict(component_type):\n   \"\"\"Given a component-type (i.e. 
a string, like <SigmoidComponent>), returns an\n      'action_dict' suitable for reading that component type (specifically, one\n       that can be given as the 'action_dict' argument of 'read_generic').  To\n      repeat the documentation there:\n\n     'action_dict' is a dict from token to a pair (function, dict_key)\n         where 'function' is the function we should use to read in data,\n         and 'dict_key' is the key in the returned dictionary that we should\n         use to store the result.  For instance, we might have:\n             action_dict['<ParameterMatrix>'] = (read_matrix, 'params')\n   \"\"\"\n   assert is_component_type(component_type)\n\n   # e.g. if component_type is '<SigmoidComponent>', raw_component_type would be\n   # 'Sigmoid'\n   raw_component_type = component_type[1:-10]\n   if raw_component_type in { 'Sigmoid', 'Tanh', 'RectifiedLinear',\n                              'Softmax', 'LogSoftmax', 'NoOp' }:\n      return { '<Dim>': (read_int, 'dim'),\n               '<BlockDim>': (read_int, 'block-dim'),\n               '<ValueAvg>': (read_vector, 'value-avg'),\n               '<DerivAvg>': (read_vector, 'deriv-avg'),\n               '<OderivRms>': (read_vector, 'oderiv-rms'),\n               '<Count>': (read_float, 'count'),\n               '<OderivCount>': (read_float, 'oderiv-count') }\n   if raw_component_type in {'Affine',\n                             'NaturalGradientAffine'}:\n      # We map '<LinearParams>' to just 'params' for compatibility with\n      # LinearComponent.\n      return { '<LinearParams>': (read_matrix, 'params'),\n               '<BiasParams>': (read_vector, 'bias') }\n   if raw_component_type == 'Linear':\n      return { '<Params>': (read_matrix, 'params') }\n   if raw_component_type == 'BatchNorm':\n      return { '<Dim>': (read_int, 'dim'),\n               '<Count>': (read_float, 'count'),\n               '<StatsMean>':  (read_vector, 'stats-mean'),\n               '<StatsVar>':  (read_vector, 'stats-var') 
}\n   # By default (if we don't know anything about the component type) we just\n   # don't read anything.\n   return { }\n\n\n\ndef get_stdout_from_command(command):\n   \"\"\" Executes a command and returns its stdout output as a string.  The\n       command is executed with shell=True, so it may contain pipes and\n       other shell constructs.  Raises an exception if the command exits\n       with nonzero status.\n    \"\"\"\n   p = subprocess.Popen(command, shell=True,\n                        stdout=subprocess.PIPE)\n\n   stdout = p.communicate()[0]\n   if p.returncode != 0:\n      raise Exception(\"Command exited with status {0}: {1}\".format(\n         p.returncode, command))\n   return stdout.decode()\n\n\ndef read_component(s, pos):\n   \"\"\"Reads a component starting at position 'pos' in the string 's'.  At this position,\n      there is expected to be a component type, e.g. <RectifiedLinearComponent>, and this\n      function will read until after the end-marker, e.g. </RectifiedLinearComponent>,\n      or if this fails for some reason, until the next instance of <ComponentName>.\n\n      This function returns the pair (d, new_pos) where d is a dict from\n      element-name to object (e.g. 
d['params'] might contain a matrix), and\n      new_pos is the position in the string after reading this component in.\n      Returns (None, new_pos) if something went wrong.\n   \"\"\"\n   (component_type, pos) = read_next_token(s, pos)\n   if not is_component_type(component_type):\n      print(\"{0}: error reading Component: at position {1}, expected <xxxxComponent>,\"\n            \" got: {2}\".format(sys.argv[0], pos, component_type), file=sys.stderr)\n      while True:\n         (tok, pos) = read_next_token(s, pos)\n         if tok is None or tok == '<ComponentName>':\n            return (None, pos)\n   terminating_token = \"</\" + component_type[1:]\n   terminating_tokens = { terminating_token, '<ComponentName>' }\n\n   action_dict = get_action_dict(component_type)\n   (d, pos) = read_generic(s, pos, terminating_tokens, action_dict)\n   if d is not None:\n      d['type'] = component_type             # e.g. '<LinearComponent>'\n      d['raw-type'] = component_type[1:-10]  # e.g. 'Linear'\n   return (d, pos)\n\n\ndef read_model(filename):\n   \"\"\"Reads an nnet3 model from the provided filename, and returns a dict\n      from the component-name to a dict containing things we have read\n      in for that component.\"\"\"\n   command = \"nnet3-copy --binary=false {0} -\".format(filename)\n   s = get_stdout_from_command(command)\n   # The model starts with some structural stuff (component-nodes, etc.) that we\n   # won't be attempting to parse.  We start parsing when we reach\n   # <NumComponents>.\n   pos = 0\n   while True:\n      (tok, pos) = read_next_token(s, pos)\n      if tok is None:\n         print(\"{0}: unexpected EOF on output of command {1}\".format(\n            sys.argv[0], command))\n         return None\n      if tok == \"<NumComponents>\":\n         break\n   # we just read <NumComponents>\n   (tok, pos) = read_next_token(s, pos)\n   # 'd', which we return, will be a dict from component-name\n   # (e.g. 
'tdnn1.affine'), to a dict containing elements of the component.\n   d = dict()\n   num_components = int(tok)  # shouldn't fail.\n   for c in range(num_components):\n      # read the components one by one...\n      (tok, pos) = read_next_token(s, pos)\n      if tok is None:\n         print(\"{0}: unexpected EOF on output of command {1}\".format(\n            sys.argv[0], command))\n         return None\n      # We normally expect that tok will be '<ComponentName>', but if we read in\n      # '<ComponentName>' while parsing the previous component (e.g. if its text form was\n      # not terminated in the way we expected), then we accept that '<ComponentName>'\n      # might not be available to parse.\n      if tok == '<ComponentName>':\n         component_pos = pos\n         (component_name, pos) = read_next_token(s, pos)\n      # At this point the type of the component will be printed: something like\n      # <NaturalGradientAffineComponent>.  We let 'read_component' take it from\n      # here, and it will read until the terminating </NaturalGradientAffineComponent>,\n      # or, in the case of error, to EOF or the next <ComponentName> string.\n      (component, pos) = read_component(s, pos)\n      if component != None:\n         d[component_name] = component\n      else:\n         print(\"{0}: error reading component with name {1} at position {2}\".format(\n            sys.argv[0], component_name, component_pos), file=sys.stderr)\n\n   return d\n\ndef compute_derived_quantities(model):\n   \"\"\"This function, given a model as returned by 'read_model', computes certain\n       potentially-useful derived quantities inside components: things like row\n       and column norms of parameter matrices, standard deviations of\n       accumulated stats.\n   \"\"\"\n   assert isinstance(model, dict)\n   for c in model.values():\n      # 'c' represents the component; it's a dict.\n      raw_component_type = c['raw-type']\n      if raw_component_type in {'Linear', 'Affine', 
'NaturalGradientAffine'}:\n         params = c['params'] # this is the parameter matrix.\n         # compute the row and column norms of the parameter matrix.\n         c['row-norms'] = np.sqrt(np.sum(params * params, axis=1))\n         c['col-norms'] = np.sqrt(np.sum(params * params, axis=0))\n         size = c['col-norms'].size\n         if size % 3 == 0:\n            # if the input-dim of this layer is divisible by 3, then compute the\n            # column-norms after reshaping... this is a kind of pooled column-norm\n            # that makes sense for TDNNs or wherever we have used Append().\n            # Note: integer division is required here; reshape() rejects floats.\n            c['col-norms-3'] = np.sqrt(np.sum(np.power(c['col-norms'], 2).reshape(3, size // 3), axis=0))\n            assert c['col-norms-3'].shape == (size // 3,)\n\n      if raw_component_type == 'BatchNorm':\n         stats_var = c['stats-var']\n         c['stats-stddev'] = np.sqrt(stats_var)\n\ndef compute_progress(model1, model2):\n   \"\"\"This function, given two models assumed to come from two successive\n      iterations of training, computes certain component-level quantities\n      that relate to the rate of change of parameters, and stores them in\n      'model1'.\n   \"\"\"\n   for component_name in model1:\n      if component_name not in model2:\n         continue\n      c1 = model1[component_name]\n      c2 = model2[component_name]\n      raw_component_type = c1['raw-type']\n      if raw_component_type in {'Linear', 'Affine', 'NaturalGradientAffine'}:\n         params1 = c1['params']\n         params2 = c2['params']\n         if params1.size != params2.size:\n            continue  # can't compare them if sizes differ.\n         params_diff = params1 - params2\n         c1['row-change'] = np.sqrt(np.sum(params_diff * params_diff, axis=1))\n         c1['col-change'] = np.sqrt(np.sum(params_diff * params_diff, axis=0))\n         # compute relative change in rows and columns.\n         epsilon = 1.0e-20\n         if 'row-norms' in c1:\n 
           c1['rel-row-change'] = c1['row-change'] / (c1['row-norms'] + epsilon)\n         if 'col-norms' in c1:\n            c1['rel-col-change'] = c1['col-change'] / (c1['col-norms'] + epsilon)\n\n\n         size = c1['col-norms'].size\n         if size % 3 == 0:\n            # if the input-dim of this layer is divisible by 3, then sum the\n            # column changes over 3 blocks... this makes sense for TDNNs or\n            # wherever we have used Append().\n            c1['col-change-3'] = np.sum(c1['col-change'].reshape(3, size // 3), axis=0)\n            c1['rel-col-change-3'] = c1['col-change-3'] / (c1['col-norms-3'] + epsilon)\n\n\ndef test():\n   assert sys.version_info.major >= 3\n   assert read_next_token(\"\", 0) == (None, 0)\n   assert read_next_token(\"hello\", 0) == (\"hello\", 5)\n   assert read_next_token(\"hello there\", 0) == (\"hello\", 5)\n   assert read_next_token(\"hello there\", 5) == (\"there\", 11)\n   assert read_next_token(\"hello there\", 6) == (\"there\", 11)\n   (a, pos) = read_vector(\" [ 1 2 3 ] \", 0)\n   assert pos == 10 and np.array_equal(np.array([1,2,3], dtype=np.float32), a)\n   assert check_for_newline(\"hello \", 4) == (False, 4)\n   assert check_for_newline(\"hello \", 5) == (False, 6)\n   assert check_for_newline(\"hello \\n\", 5) == (True, 7)\n   assert check_for_newline(\"hello \\nthere\", 5) == (True, 7)\n   (m, pos) = read_matrix(\" [\\n 1 2 3\\n 4 5 6 ] \", 0)\n   assert pos == 18 and np.array_equal(np.array([[1,2,3],[4,5,6]], dtype=np.float32), m)\n\n   s = \"  <ignore_this> 1 <some_vec> [ 1 2 3 ] <end>\"\n   (obj, pos) = read_generic(s, 0, \"<end>\", { '<some_vec>': (read_vector, 'some_vec') })\n   assert pos == len(s)\n   assert np.array_equal(obj['some_vec'], np.array([1, 2, 3], dtype=np.float32))\n\n   m = read_model('exp/chain_cleaned/tdnn1c_sp_bi/final.mdl')\n   compute_derived_quantities(m)\n   print(\"model is: {0}\".format(m))\n   print(\"tested\")\n\n\n\nif __name__ == '__main__':\n   if len(sys.argv) 
== 1:\n      test()\n      sys.exit(0)\n\n   if len(sys.argv) != 3:\n      print(\"Usage: {0} <nnet3-model-in> <pickled-model-out>\".format(\n         sys.argv[0]), file=sys.stderr)\n      sys.exit(1)\n\n   m = read_model(sys.argv[1])\n   if m is not None:\n      try:\n         with open(sys.argv[2], \"wb\") as f:\n            pickle.dump(m, f)\n      except Exception:\n         print(\"{0}: error writing to {1}\".format(\n            sys.argv[0], sys.argv[2]), file=sys.stderr)\n"
  },
  {
    "path": "egs/steps/nnet3/report/generate_plots.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Vijayaditya Peddinti\n#           2016    Vimal Manohar\n# Apache 2.0.\n\nfrom __future__ import division\nimport argparse\nimport errno\nimport logging\nimport os\nimport re\nimport sys\nimport warnings\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.report.log_parse as log_parse\nimport libs.common as common_lib\n\ntry:\n    import matplotlib as mpl\n    mpl.use('Agg')\n    import matplotlib.pyplot as plt\n    import numpy as np\n    from matplotlib.patches import Rectangle\n    # matplotlib issue https://github.com/matplotlib/matplotlib/issues/12513\n    # plt.subplot() generates a false-positive warninig, suppress it for now.\n    from matplotlib.cbook import MatplotlibDeprecationWarning\n    warnings.filterwarnings('ignore', category=MatplotlibDeprecationWarning,\n                            message='Adding an axes using the same arguments')\n    g_plot = True\nexcept ImportError:\n    g_plot = False\n\n\nlogging.basicConfig(format=\"%(filename)s:%(lineno)s:%(levelname)s:%(message)s\",\n                    level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        prog=sys.argv[0],  # By default, prog is set this to filename only.\n        formatter_class=type('', (argparse.RawDescriptionHelpFormatter,\n                                  argparse.ArgumentDefaultsHelpFormatter), {}),\n        description=\"Parses the training logs and generates a variety of plots.\\n\"\n        \"e.g.: %(prog)s \\\\\\n\"\n        \"  exp/nnet3/tdnn exp/nnet3/tdnn1 exp/nnet3/tdnn2 exp/nnet3/tdnn/report.\\n\"\n        \"The report file 'report.pdf' will be generated in the <output_dir> directory.\")\n\n    parser.add_argument(\"--start-iter\", type=int, metavar='N', default=1,\n                        help=\"Iteration from which plotting will start.\")\n    parser.add_argument(\"--is-chain\", type=common_lib.str_to_bool, default='false', 
metavar='BOOL',\n                        help=\"Set to 'true' if <exp_dir>s contain chain models.\")\n    parser.add_argument(\"--is-rnnlm\", type=common_lib.str_to_bool, default='false', metavar='BOOL',\n                        help=\"Set to 'true' if <exp_dir>s contain RNNLM.\")\n    parser.add_argument(\"--output-nodes\", type=str, metavar='NODES',\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"List of space separated <output-node>:<objective-type> entries, \"\n                        \"one for each output node\")\n    parser.add_argument(\"--comparison-dir\", type=str, metavar='DIR', action='append',\n                        help=\"[DEPRECATED] Experiment directories for comparison. \"\n                        \"These will only be used for plots, not tables.\")\n    parser.add_argument(\"exp_dir\", nargs='+',\n                        help=\"The first <exp_dir> is the current experiment directory, e.g. \"\n                        \"'exp/nnet3/tdnn'; the rest are up to 6 optional directories of other \"\n                        \"experiments to be graphed on same plots for comparison.\")\n    parser.add_argument(\"output_dir\",\n                        help=\"output directory for reports, e.g. 'exp/nnet3/tdnn/report'\")\n\n    args = parser.parse_args()\n    if ((args.comparison_dir is not None and len(args.comparison_dir) > 6) or\n        (args.exp_dir is not None and len(args.exp_dir) > 7)):\n        raise Exception(\n            \"Up to 6 comparison directories may be specified. 
\"\n            \"If you want to compare with more experiments, you would have to carefully tune \"\n            \"the plot_colors variable which specified colors used for plotting.\")\n    assert args.start_iter >= 1\n    if args.is_chain and args.is_rnnlm:\n        raise Exception(\"Options --is-chain and --is-rnnlm cannot be both true.\")\n    return args\n\n\ng_plot_colors = ['red', 'blue', 'green', 'black', 'magenta', 'yellow', 'cyan']\n\nclass LatexReport(object):\n    \"\"\"Class for writing a Latex report\"\"\"\n\n    def __init__(self, pdf_file):\n        self.pdf_file = pdf_file\n        self.document = []\n        self.document.append(r\"\"\"\n\\documentclass[prl,10pt,twocolumn]{revtex4}\n\\usepackage{graphicx}    % Used to import the graphics\n\\begin{document}\n\"\"\")\n\n    def add_figure(self, figure_pdf, title):\n        \"\"\"we will have keep extending this replacement list based on errors\n        during compilation escaping underscores in the title\"\"\"\n\n        title = r\"\\texttt{\"+re.sub(\"_\", \"\\_\", title)+\"}\"\n        fig_latex = r\"\"\"\n%...\n\\newpage\n\\begin{figure}[h]\n  \\begin{center}\n    \\caption{\"\"\" + title + r\"\"\"}\n    \\includegraphics[width=\\textwidth]{\"\"\" + figure_pdf + r\"\"\"}\n  \\end{center}\n\\end{figure}\n\\clearpage\n%...\n\"\"\"\n        self.document.append(fig_latex)\n\n    def close(self):\n        self.document.append(r\"\\end{document}\")\n        return self.compile()\n\n    def compile(self):\n        root, ext = os.path.splitext(self.pdf_file)\n        dir_name = os.path.dirname(self.pdf_file)\n        latex_file = root + \".tex\"\n        lat_file = open(latex_file, \"w\")\n        lat_file.write(\"\\n\".join(self.document))\n        lat_file.close()\n        logger.info(\"Compiling the LaTeX report.\")\n        try:\n            common_lib.execute_command(\n                \"pdflatex -interaction=batchmode \"\n                \"-output-directory={0} {1}\".format(dir_name, latex_file))\n  
      except Exception as e:\n            logger.warning(\"There was an error compiling LaTeX file %s. \"\n                           \"Check report.log generated by pdflatex in the same directory. %s\",\n                           latex_file, e)\n            return False\n        return True\n\n\ndef latex_compliant_name(name_string):\n    \"\"\"This function is required because LaTeX does not allow all the component\n    names allowed by nnet3.\n    Identified incompatibilities:\n        1. LaTeX does not allow a dot (.) in file names\n    \"\"\"\n    node_name_string = re.sub(r\"\\.\", \"_dot_\", name_string)\n\n    return node_name_string\n\n\ndef generate_acc_logprob_plots(exp_dir, output_dir, plot, key='accuracy',\n        file_basename='accuracy', comparison_dir=None,\n        start_iter=1, latex_report=None, output_name='output'):\n\n    assert start_iter >= 1\n\n    if plot:\n        fig = plt.figure()\n        plots = []\n\n    comparison_dir = [] if comparison_dir is None else comparison_dir\n    dirs = [exp_dir] + comparison_dir\n    index = 0\n    for dir in dirs:\n        [report, times, data] = log_parse.generate_acc_logprob_report(dir, key,\n                output_name)\n        if index == 0:\n            # this is the main experiment directory\n            with open(\"{0}/{1}.log\".format(output_dir,\n                                           file_basename), \"w\") as f:\n                f.write(report)\n\n        if plot:\n            color_val = g_plot_colors[index]\n            data = np.array(data)\n            if data.shape[0] == 0:\n                logger.warning(\"Couldn't find any rows for the \"\n                               \"accuracy/log-probability plot, not generating it\")\n                return\n            data = data[data[:, 0] >= start_iter, :]\n            plot_handle, = plt.plot(data[:, 0], data[:, 1], color=color_val,\n                                    linestyle=\"--\",\n                                    label=\"train 
{0}\".format(dir))\n            plots.append(plot_handle)\n            plot_handle, = plt.plot(data[:, 0], data[:, 2], color=color_val,\n                                    label=\"valid {0}\".format(dir))\n            plots.append(plot_handle)\n        index += 1\n    if plot:\n        plt.xlabel('Iteration')\n        plt.ylabel(key)\n        lgd = plt.legend(handles=plots, loc='lower center',\n                         bbox_to_anchor=(0.5, -0.2 + len(dirs) * -0.1),\n                         ncol=1, borderaxespad=0.)\n        plt.grid(True)\n        fig.suptitle(\"{0} plot for {1}\".format(key, output_name))\n        figfile_name = '{0}/{1}_{2}.pdf'.format(\n            output_dir, file_basename,\n            latex_compliant_name(output_name))\n        plt.savefig(figfile_name, bbox_extra_artists=(lgd,),\n                    bbox_inches='tight')\n        if latex_report is not None:\n            latex_report.add_figure(\n                figfile_name,\n                \"Plot of {0} vs iterations for {1}\".format(key, output_name))\n\n\n# The name of five gates of lstmp\ng_lstm_gate = ['i_t_sigmoid', 'f_t_sigmoid', 'c_t_tanh', 'o_t_sigmoid', 'm_t_tanh']\n\n# The \"extra\" item is a placeholder. 
As each unit in python plot is\n# composed by a legend_handle(linestyle) and a legend_label(description).\n# For the unit which doesn't have linestyle, we use the \"extra\" placeholder.\nif g_plot:\n    extra = Rectangle((0, 0), 1, 1, facecolor=\"w\", fill=False, edgecolor='none', linewidth=0)\n\n# This function is used to insert a column to the legend, the column_index is 1-based\ndef insert_a_column_legend(legend_handle, legend_label, lp, mp, hp,\n        dir, prefix_length, column_index):\n    handle = [extra, lp, mp, hp]\n    label = [\"[1]{0}\".format(dir[prefix_length:]), \"\", \"\", \"\"]\n    for row in range(1,5):\n        legend_handle.insert(column_index*row-1, handle[row-1])\n        legend_label.insert(column_index*row-1, label[row-1])\n\n\n# This function is used to plot a normal nonlinearity component or a gate of lstmp\ndef plot_a_nonlin_component(fig, dirs, stat_tables_per_component_per_dir,\n        component_name, common_prefix, prefix_length, component_type,\n        start_iter, gate_index=0, with_oderiv=0):\n    fig.clf()\n    index = 0\n    legend_handle = [extra, extra, extra, extra]\n    legend_label = [\"\", '5th percentile', '50th percentile', '95th percentile']\n\n    if not with_oderiv:\n        for dir in dirs:\n            color_val = g_plot_colors[index]\n            index += 1\n            try:\n                iter_stats = (stat_tables_per_component_per_dir[dir][component_name])\n            except KeyError:\n                # this component is not available in this network so lets\n                # not just plot it\n                insert_a_column_legend(legend_handle, legend_label, lp, mp, hp,\n                        dir, prefix_length, index+1)\n                continue\n\n            data = np.array(iter_stats)\n            data = data[data[:, 0] >= start_iter, :]\n\n            ax = plt.subplot(211)\n            lp, = ax.plot(data[:, 0], data[:, gate_index*10+5], color=color_val,\n                    linestyle='--')\n         
   mp, = ax.plot(data[:, 0], data[:, gate_index*10+6], color=color_val,\n                    linestyle='-')\n            hp, = ax.plot(data[:, 0], data[:, gate_index*10+7], color=color_val,\n                    linestyle='--')\n            insert_a_column_legend(legend_handle, legend_label, lp, mp, hp,\n                    dir, prefix_length, index+1)\n\n            ax.set_ylabel('Value-{0}'.format(component_type))\n            ax.grid(True)\n\n            ax = plt.subplot(212)\n            lp, = ax.plot(data[:, 0], data[:, gate_index*10+8], color=color_val,\n                    linestyle='--')\n            mp, = ax.plot(data[:, 0], data[:, gate_index*10+9], color=color_val,\n                    linestyle='-')\n            hp, = ax.plot(data[:, 0], data[:, gate_index*10+10], color=color_val,\n                    linestyle='--')\n            ax.set_xlabel('Iteration')\n            ax.set_ylabel('Derivative-{0}'.format(component_type))\n            ax.grid(True)\n\n        lgd = plt.legend(legend_handle, legend_label, loc='lower center',\n                bbox_to_anchor=(0.5 , -0.5 + len(dirs) * -0.2),\n                ncol=4, handletextpad = -2, title=\"[1]:{0}\".format(common_prefix),\n                borderaxespad=0.)\n        plt.grid(True)\n\n    else:\n        for dir in dirs:\n            color_val = g_plot_colors[index]\n            index += 1\n            try:\n                iter_stats = (stat_tables_per_component_per_dir[dir][component_name])\n            except KeyError:\n                # this component is not available in this network so lets\n                # not just plot it\n                insert_a_column_legend(legend_handle, legend_label, lp, mp, hp,\n                        dir, prefix_length, index+1)\n                continue\n\n            data = np.array(iter_stats)\n            data = data[data[:, 0] >= start_iter, :]\n            ax = plt.subplot(311)\n            lp, = ax.plot(data[:, 0], data[:, gate_index*10+7], color=color_val,\n       
             linestyle='--')\n            mp, = ax.plot(data[:, 0], data[:, gate_index*10+8], color=color_val,\n                    linestyle='-')\n            hp, = ax.plot(data[:, 0], data[:, gate_index*10+9], color=color_val,\n                    linestyle='--')\n            insert_a_column_legend(legend_handle, legend_label, lp, mp, hp,\n                    dir, prefix_length, index+1)\n\n            ax.set_ylabel('Value-{0}'.format(component_type))\n            ax.grid(True)\n\n            ax = plt.subplot(312)\n            lp, = ax.plot(data[:, 0], data[:, gate_index*10+10], color=color_val,\n                    linestyle='--')\n            mp, = ax.plot(data[:, 0], data[:, gate_index*10+11], color=color_val,\n                    linestyle='-')\n            hp, = ax.plot(data[:, 0], data[:, gate_index*10+12], color=color_val,\n                    linestyle='--')\n            ax.set_ylabel('Derivative-{0}'.format(component_type))\n            ax.grid(True)\n\n            ax = plt.subplot(313)\n            lp, = ax.plot(data[:, 0], data[:, gate_index*10+13], color=color_val,\n                    linestyle='--')\n            mp, = ax.plot(data[:, 0], data[:, gate_index*10+14], color=color_val,\n                    linestyle='-')\n            hp, = ax.plot(data[:, 0], data[:, gate_index*10+15], color=color_val,\n                    linestyle='--')\n            ax.set_xlabel('Iteration')\n            ax.set_ylabel('Oderivative-{0}'.format(component_type))\n            ax.grid(True)\n\n            plt.subplots_adjust(top=0.8, hspace = 1.0, bottom = -0.2)\n        lgd = plt.legend(legend_handle, legend_label, loc='lower center',\n                bbox_to_anchor=(0.5 , -1.5 + len(dirs) * -0.2),\n                ncol=4, handletextpad = -2, title=\"[1]:{0}\".format(common_prefix),\n                borderaxespad=0.)\n        plt.grid(True)\n\n    return lgd\n\n\n# This function is used to generate the statistic plots of nonlinearity component\n# Mainly divided into the 
following steps:\n# 1) Using the log_parse functions, get the statistics from each directory.\n# 2) Convert the collected nonlinearity statistics into tables. Each table\n#    contains all the statistics for each component of each directory.\n# 3) Store the statistics of each component in the corresponding log file.\n#    Each line of the log file contains the statistics of one iteration.\n# 4) Plot the \"Per-dimension average-(value, derivative) percentiles\" figure\n#    for each nonlinearity component.\ndef generate_nonlin_stats_plots(exp_dir, output_dir, plot, comparison_dir=None,\n                                start_iter=1, latex_report=None):\n    assert start_iter >= 1\n\n    comparison_dir = [] if comparison_dir is None else comparison_dir\n    dirs = [exp_dir] + comparison_dir\n    index = 0\n    stats_per_dir = {}\n    with_oderiv = 0\n\n    for dir in dirs:\n        stats_per_component_per_iter = (\n            log_parse.parse_progress_logs_for_nonlinearity_stats(dir))\n        for key in stats_per_component_per_iter:\n            if len(stats_per_component_per_iter[key]['stats']) == 0:\n                logger.warning(\"Couldn't find any rows for the \"\n                               \"nonlin stats plot, not generating it\")\n\n        stats_per_dir[dir] = stats_per_component_per_iter\n    # convert the nonlin stats into tables\n    stat_tables_per_component_per_dir = {}\n\n    for dir in dirs:\n        stats_per_component_per_iter = stats_per_dir[dir]\n        component_names = stats_per_component_per_iter.keys()\n        stat_tables_per_component = {}\n        for component_name in component_names:\n            comp_data = stats_per_component_per_iter[component_name]\n            comp_type = comp_data['type']\n            comp_stats = comp_data['stats']\n            iters = sorted(comp_stats)\n            iter_stats = []\n            for iter in iters:\n                iter_stats.append([iter] + comp_stats[iter])\n            
stat_tables_per_component[component_name] = iter_stats\n        stat_tables_per_component_per_dir[dir] = stat_tables_per_component\n    if len(comp_stats[iter]) == 15:\n        with_oderiv = 1\n    main_stat_tables = stat_tables_per_component_per_dir[exp_dir]\n\n    for component_name in main_stat_tables.keys():\n        # this is the main experiment directory\n        with open(\"{dir}/nonlinstats_{comp_name}.log\".format(\n                    dir=output_dir, comp_name=component_name), \"w\") as f:\n            if with_oderiv:\n                # with oderiv-rms\n                f.write(\"Iteration\\tValueMean\\tValueStddev\\tDerivMean\\tDerivStddev\\t\"\n                        \"OderivMean\\tOderivStddev\\t\"\n                        \"Value_5th\\tValue_50th\\tValue_95th\\t\"\n                        \"Deriv_5th\\tDeriv_50th\\tDeriv_95th\\t\"\n                        \"Oderiv_5th\\tOderiv_50th\\tOderiv_95th\\n\")\n            else:\n                # without oderiv-rms\n                f.write(\"Iteration\\tValueMean\\tValueStddev\\tDerivMean\\tDerivStddev\\t\"\n                        \"Value_5th\\tValue_50th\\tValue_95th\\t\"\n                        \"Deriv_5th\\tDeriv_50th\\tDeriv_95th\\n\")\n            iter_stat_report = []\n            iter_stats = main_stat_tables[component_name]\n            for row in iter_stats:\n                iter_stat_report.append(\"\\t\".join([str(x) for x in row]))\n            f.write(\"\\n\".join(iter_stat_report))\n            f.close()\n    if plot:\n        main_component_names = sorted(main_stat_tables)\n        plot_component_names = set(main_component_names)\n        for dir in dirs:\n            component_names = set(stats_per_dir[dir].keys())\n            plot_component_names = plot_component_names.intersection(\n                component_names)\n        plot_component_names = sorted(plot_component_names)\n        if plot_component_names != main_component_names:\n            logger.warning(\"The components in all the 
neural networks in the \"\n                           \"given experiment dirs are not the same, so comparison plots are \"\n                           \"provided only for common component names. Make sure that these are \"\n                           \"comparable experiments before analyzing these plots.\")\n\n        fig = plt.figure()\n\n        common_prefix = os.path.commonprefix(dirs)\n        prefix_length = common_prefix.rfind('/')\n        common_prefix = common_prefix[0:prefix_length]\n\n        for component_name in main_component_names:\n            if stats_per_dir[exp_dir][component_name]['type'] == 'LstmNonlinearity':\n                for i in range(0,5):\n                    component_type = 'Lstm-' + g_lstm_gate[i]\n                    lgd = plot_a_nonlin_component(fig, dirs,\n                            stat_tables_per_component_per_dir, component_name,\n                            common_prefix, prefix_length, component_type, start_iter, i, with_oderiv)\n                    fig.suptitle(\"Per-dimension average-(value, derivative) percentiles for \"\n                         \"{component_name}-{gate}\".format(component_name=component_name, gate=g_lstm_gate[i]))\n                    comp_name = latex_compliant_name(component_name)\n                    figfile_name = '{dir}/nonlinstats_{comp_name}_{gate}.pdf'.format(\n                        dir=output_dir, comp_name=comp_name, gate=g_lstm_gate[i])\n                    fig.savefig(figfile_name, bbox_extra_artists=(lgd,),\n                        bbox_inches='tight')\n                    if latex_report is not None:\n                        latex_report.add_figure(\n                        figfile_name,\n                        \"Per-dimension average-(value, derivative) percentiles for \"\n                        \"{0}-{1}\".format(component_name, g_lstm_gate[i]))\n            else:\n                component_type = stats_per_dir[exp_dir][component_name]['type']\n                lgd = 
plot_a_nonlin_component(fig, dirs,\n                        stat_tables_per_component_per_dir,component_name,\n                        common_prefix, prefix_length, component_type, start_iter, 0, with_oderiv)\n                if with_oderiv:\n                    fig.suptitle(\"Per-dimension average-(value, derivative) and rms-oderivative percentiles for \"\n                         \"{component_name}\".format(component_name=component_name))\n                else:\n                    fig.suptitle(\"Per-dimension average-(value, derivative) percentiles for \"\n                         \"{component_name}\".format(component_name=component_name))\n                comp_name = latex_compliant_name(component_name)\n                figfile_name = '{dir}/nonlinstats_{comp_name}.pdf'.format(\n                    dir=output_dir, comp_name=comp_name)\n                fig.savefig(figfile_name, bbox_extra_artists=(lgd,),\n                        bbox_inches='tight')\n                if latex_report is not None:\n                    if with_oderiv:\n                        latex_report.add_figure(\n                        figfile_name,\n                        \"Per-dimension average-(value, derivative) and rms-oderivative percentiles for \"\n                        \"{0}\".format(component_name))\n                    else:\n                        latex_report.add_figure(\n                        figfile_name,\n                        \"Per-dimension average-(value, derivative) percentiles for \"\n                        \"{0}\".format(component_name))\n\n\n\ndef generate_clipped_proportion_plots(exp_dir, output_dir, plot,\n                                      comparison_dir=None, start_iter=1,\n                                      latex_report=None):\n    assert(start_iter >= 1)\n\n    comparison_dir = [] if comparison_dir is None else comparison_dir\n    dirs = [exp_dir] + comparison_dir\n    index = 0\n    stats_per_dir = {}\n    for dir in dirs:\n        try:\n            
stats_per_dir[dir] = (\n                log_parse.parse_progress_logs_for_clipped_proportion(dir))\n        except log_parse.MalformedClippedProportionLineException as e:\n            raise e\n        except common_lib.KaldiCommandException as e:\n            logger.warning(\"Could not extract the clipped proportions for %s, \"\n                           \"this might be because there are no ClipGradientComponents.\", dir)\n            continue\n        if len(stats_per_dir[dir]) == 0:\n            logger.warning(\"Couldn't find any rows for the \"\n                           \"clipped proportion plot, not generating it\")\n    try:\n        main_cp_stats = stats_per_dir[exp_dir]['table']\n    except KeyError:\n        logger.warning(\"The main experiment directory %s does not have clipped proportions. \"\n                       \"Not generating clipped proportion plots.\", exp_dir)\n        return\n\n    # write the stats of the main experiment directory\n    iter_stat_report = \"\"\n    for row in main_cp_stats:\n        iter_stat_report += \"\\t\".join([str(x) for x in row]) + \"\\n\"\n    with open(\"{dir}/clipped_proportion.log\".format(dir=output_dir), \"w\") as f:\n        f.write(iter_stat_report)\n\n    if plot:\n        main_component_names = sorted(stats_per_dir[exp_dir]['cp_per_iter_per_component'])\n        plot_component_names = set(main_component_names)\n        for dir in dirs:\n            try:\n                component_names = set(stats_per_dir[dir]['cp_per_iter_per_component'])\n                plot_component_names = (\n                    plot_component_names.intersection(component_names))\n            except KeyError:\n                continue\n        plot_component_names = sorted(plot_component_names)\n        if plot_component_names != main_component_names:\n            logger.warning(\n                \"The components in all the neural networks in the given \"\n                \"experiment dirs are not the same, so comparison plots are 
\"\n                \"provided only for common component names. Make sure that these \"\n                \"are comparable experiments before analyzing these plots.\")\n\n        fig = plt.figure()\n        for component_name in main_component_names:\n            fig.clf()\n            index = 0\n            plots = []\n            for dir in dirs:\n                color_val = g_plot_colors[index]\n                index += 1\n                try:\n                    iter_stats = stats_per_dir[dir][\n                        'cp_per_iter_per_component'][component_name]\n                except KeyError:\n                    # this component is not available in this network so lets\n                    # not just plot it\n                    continue\n\n                data = np.array(iter_stats)\n                data = data[data[:, 0] >= start_iter, :]\n                ax = plt.subplot(111)\n                mp, = ax.plot(data[:, 0], data[:, 1], color=color_val,\n                              label=\"Clipped Proportion {0}\".format(dir))\n                plots.append(mp)\n                ax.set_ylabel('Clipped Proportion')\n                ax.set_ylim([0, 1.2])\n                ax.grid(True)\n            lgd = plt.legend(handles=plots, loc='lower center',\n                             bbox_to_anchor=(0.5, -0.5 + len(dirs) * -0.2),\n                             ncol=1, borderaxespad=0.)\n            plt.grid(True)\n            fig.suptitle(\"Clipped-proportion value at {comp_name}\".format(\n                            comp_name=component_name))\n            comp_name = latex_compliant_name(component_name)\n            figfile_name = '{dir}/clipped_proportion_{comp_name}.pdf'.format(\n                dir=output_dir, comp_name=comp_name)\n            fig.savefig(figfile_name, bbox_extra_artists=(lgd,),\n                        bbox_inches='tight')\n            if latex_report is not None:\n                latex_report.add_figure(\n                    figfile_name,\n      
              \"Clipped proportion at {0}\".format(component_name))\n\n\ndef generate_parameter_diff_plots(exp_dir, output_dir, plot,\n                                  comparison_dir=None, start_iter=1,\n                                  latex_report=None):\n    # Parameter changes\n    assert start_iter >= 1\n\n    comparison_dir = [] if comparison_dir is None else comparison_dir\n    dirs = [exp_dir] + comparison_dir\n    index = 0\n    stats_per_dir = {}\n    key_file = {\"Parameter differences\": \"parameter.diff\",\n                \"Relative parameter differences\": \"relative_parameter.diff\"}\n    stats_per_dir = {}\n    for dir in dirs:\n        stats_per_dir[dir] = {}\n        for key in key_file:\n            stats_per_dir[dir][key] = (\n                log_parse.parse_progress_logs_for_param_diff(dir, key))\n\n    # write down the stats for the main experiment directory\n    for diff_type in key_file:\n        with open(\"{0}/{1}\".format(output_dir, key_file[diff_type]), \"w\") as f:\n            diff_per_component_per_iter = (\n                stats_per_dir[exp_dir][diff_type]['progress_per_component'])\n            component_names = (\n                stats_per_dir[exp_dir][diff_type]['component_names'])\n            max_iter = stats_per_dir[exp_dir][diff_type]['max_iter']\n            f.write(\" \".join([\"Iteration\"] + component_names)+\"\\n\")\n            total_missing_iterations = 0\n            gave_user_warning = False\n            for iter in range(max_iter + 1):\n                iter_data = [str(iter)]\n                for c in component_names:\n                    try:\n                        iter_data.append(\n                            str(diff_per_component_per_iter[c][iter]))\n                    except KeyError:\n                        total_missing_iterations += 1\n                        iter_data.append(\"NA\")\n                if (float(total_missing_iterations)/len(component_names) > 20\n                        and not 
gave_user_warning):\n                    logger.warning(\"There are more than %.0f missing iterations per component. \"\n                                   \"Something might be wrong.\",\n                                   float(total_missing_iterations)/ len(component_names))\n                    gave_user_warning = True\n\n                f.write(\" \".join(iter_data) + \"\\n\")\n\n    if plot:\n        # get the component names\n        diff_type = list(key_file.keys())[0]\n        main_component_names = sorted(stats_per_dir[exp_dir][diff_type]['progress_per_component'])\n        plot_component_names = set(main_component_names)\n        for dir in dirs:\n            try:\n                component_names = set(stats_per_dir[dir][diff_type]['progress_per_component'])\n                plot_component_names = plot_component_names.intersection(component_names)\n            except KeyError:\n                continue\n        plot_component_names = sorted(plot_component_names)\n        if plot_component_names != main_component_names:\n            logger.warning(\"The components in all the neural networks in the \"\n                           \"given experiment dirs are not the same, \"\n                           \"so comparison plots are provided only for common \"\n                           \"component names. 
\"\n                           \"Make sure that these are comparable experiments \"\n                           \"before analyzing these plots.\")\n\n        assert main_component_names\n\n        fig = plt.figure()\n        logger.info(\"Plotting parameter differences for components: \" +\n                    \", \".join(main_component_names))\n\n        for component_name in main_component_names:\n            fig.clf()\n            index = 0\n            plots = []\n            for dir in dirs:\n                color_val = g_plot_colors[index]\n                index += 1\n                iter_stats = []\n                try:\n                    for diff_type in ['Parameter differences',\n                                      'Relative parameter differences']:\n                        iter_stats.append(np.array(\n                            sorted(stats_per_dir[dir][diff_type][\n                                'progress_per_component'][\n                                    component_name].items())))\n                except KeyError as e:\n                    # this component is not available in this network so lets\n                    # not just plot it\n                    if dir == exp_dir:\n                        raise Exception(\"No parameter differences were available even in the main \"\n                                        \"experiment dir for the component {0}. 
Something went \"\n                                        \"wrong: {1}.\".format(component_name, e))\n                    continue\n                ax = plt.subplot(211)\n                mp, = ax.plot(iter_stats[0][:, 0], iter_stats[0][:, 1],\n                              color=color_val,\n                              label=\"Parameter Differences {0}\".format(dir))\n                plots.append(mp)\n                ax.set_ylabel('Parameter Differences')\n                ax.grid(True)\n\n                ax = plt.subplot(212)\n                mp, = ax.plot(iter_stats[1][:, 0], iter_stats[1][:, 1],\n                              color=color_val,\n                              label=\"Relative Parameter \"\n                                    \"Differences {0}\".format(dir))\n                ax.set_xlabel('Iteration')\n                ax.set_ylabel('Relative Parameter Differences')\n                ax.grid(True)\n\n            lgd = plt.legend(handles=plots, loc='lower center',\n                             bbox_to_anchor=(0.5, -0.5 + len(dirs) * -0.2),\n                             ncol=1, borderaxespad=0.)\n            plt.grid(True)\n            fig.suptitle(\"Parameter differences at {comp_name}\".format(\n                comp_name=component_name))\n            comp_name = latex_compliant_name(component_name)\n            figfile_name = '{dir}/param_diff_{comp_name}.pdf'.format(\n                dir=output_dir, comp_name=comp_name)\n            fig.savefig(figfile_name, bbox_extra_artists=(lgd,),\n                        bbox_inches='tight')\n            if latex_report is not None:\n                latex_report.add_figure(\n                    figfile_name,\n                    \"Parameter differences at {0}\".format(component_name))\n\n\ndef generate_plots(exp_dir, output_dir, output_names, comparison_dir=None,\n                   start_iter=1):\n    try:\n        os.makedirs(output_dir)\n    except OSError as e:\n        if e.errno == errno.EEXIST and 
os.path.isdir(output_dir):\n            pass\n        else:\n            raise e\n    if g_plot:\n        latex_report = LatexReport(\"{0}/report.pdf\".format(output_dir))\n    else:\n        latex_report = None\n\n    for (output_name, objective_type) in output_names:\n        if objective_type == \"linear\":\n            logger.info(\"Generating accuracy plots for '%s'\", output_name)\n            generate_acc_logprob_plots(\n                exp_dir, output_dir, g_plot, key='accuracy',\n                file_basename='accuracy', comparison_dir=comparison_dir,\n                start_iter=start_iter,\n                latex_report=latex_report, output_name=output_name)\n\n            logger.info(\"Generating log-likelihood plots for '%s'\", output_name)\n            generate_acc_logprob_plots(\n                exp_dir, output_dir, g_plot, key='log-likelihood',\n                file_basename='loglikelihood', comparison_dir=comparison_dir,\n                start_iter=start_iter,\n                latex_report=latex_report, output_name=output_name)\n        elif objective_type == \"chain\":\n            logger.info(\"Generating log-probability plots for '%s'\", output_name)\n            generate_acc_logprob_plots(\n                exp_dir, output_dir, g_plot,\n                key='log-probability', file_basename='log_probability',\n                comparison_dir=comparison_dir, start_iter=start_iter,\n                latex_report=latex_report, output_name=output_name)\n        elif objective_type == \"rnnlm_objective\":\n            logger.info(\"Generating RNNLM objective plots for '%s'\", output_name)\n            generate_acc_logprob_plots(\n                exp_dir, output_dir, g_plot, key='rnnlm_objective',\n                file_basename='objective', comparison_dir=comparison_dir,\n                start_iter=start_iter,\n                latex_report=latex_report, output_name=output_name)\n        else:\n            logger.info(\"Generating %s objective plots for 
'%s'\", objective_type, output_name)\n            generate_acc_logprob_plots(\n                exp_dir, output_dir, g_plot, key='objective',\n                file_basename='objective', comparison_dir=comparison_dir,\n                start_iter=start_iter,\n                latex_report=latex_report, output_name=output_name)\n\n    logger.info(\"Generating non-linearity stats plots\")\n    generate_nonlin_stats_plots(\n        exp_dir, output_dir, g_plot, comparison_dir=comparison_dir,\n        start_iter=start_iter, latex_report=latex_report)\n\n    logger.info(\"Generating clipped-proportion plots\")\n    generate_clipped_proportion_plots(\n        exp_dir, output_dir, g_plot, comparison_dir=comparison_dir,\n        start_iter=start_iter, latex_report=latex_report)\n\n    logger.info(\"Generating parameter difference plots\")\n    generate_parameter_diff_plots(\n        exp_dir, output_dir, g_plot, comparison_dir=comparison_dir,\n        start_iter=start_iter, latex_report=latex_report)\n\n    if g_plot and latex_report is not None:\n        has_compiled = latex_report.close()\n        if has_compiled:\n            logger.info(\"Report file %s/report.pdf has been generated successfully.\", output_dir)\n\n\ndef main():\n    args = get_args()\n\n    if not g_plot:\n        logger.warning(\n            \"This script requires matplotlib and numpy.\\n\"\n            \"... Install these packages to generate plots.\\n\"\n            \"... If you are on a cluster where you do not have admin rights, use venv.\\n\"\n            \"... 
Generating text data table files only.\")\n\n    output_nodes = []\n\n    if args.output_nodes is not None:\n        nodes = args.output_nodes.split(' ')\n        for n in nodes:\n            parts = n.split(':')\n            assert len(parts) == 2\n            output_nodes.append(tuple(parts))\n    elif args.is_chain:\n        output_nodes.append(('output', 'chain'))\n        output_nodes.append(('output-xent', 'chain'))\n    elif args.is_rnnlm:\n        output_nodes.append(('output', 'rnnlm_objective'))\n    else:\n        output_nodes.append(('output', 'linear'))\n\n    if args.comparison_dir is not None:\n        generate_plots(args.exp_dir[0], args.output_dir, output_nodes,\n                       comparison_dir=args.comparison_dir,\n                       start_iter=args.start_iter)\n    else:\n        if len(args.exp_dir) == 1:\n            generate_plots(args.exp_dir[0], args.output_dir, output_nodes,\n                           start_iter=args.start_iter)\n        if len(args.exp_dir) > 1:\n            generate_plots(args.exp_dir[0], args.output_dir, output_nodes,\n                           comparison_dir=args.exp_dir[1:],\n                           start_iter=args.start_iter)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/report/summarize_compute_debug_timing.py",
    "content": "#!/usr/bin/env python\n\n\n# Copyright 2016 Vijayaditya Peddinti.\n# Apache 2.0.\n\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nfrom __future__ import division\nimport sys\nimport re\nimport argparse\n\n# expects the output of nnet3*train with --computation-debug=true\n# will run faster if just the lines with \"DebugAfterExecute\" are provided\n# <train-command> |grep DebugAfterExecute | steps/nnet3/report/summarize_compute_debug_timing.py\n\ndef GetArgs():\n    parser = argparse.ArgumentParser(description=\"Summarizes the timing info from nnet3-*-train --computation.debug=true commands \")\n    parser.add_argument(\"--node-prefixes\", type=str,\n                        help=\"list of prefixes. Execution times from nnet3 components with the same prefix\"\n                        \" will be accumulated. Still distinguishes Propagate and BackPropagate commands\"\n                        \" --node-prefixes Lstm1,Lstm2,Layer1\", default=None)\n\n    print(' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    if args.node_prefixes is not None:\n        raise NotImplementedError\n        # this will be implemented after https://github.com/kaldi-asr/kaldi/issues/944\n        args.node_prefixes = args.node_prefixes.split(',')\n    else:\n        args.node_prefixes = []\n\n    return args\n# get opening bracket position corresponding to the last closing bracket\ndef FindOpenParanthesisPosition(string):\n    string = string.strip()\n    if string[-1] != \")\":\n        # we don't know how to deal with these strings\n        return None\n\n    string_index = len(string) - 1\n    closing_parans = []\n    closing_parans.append(string_index)\n    string_index -= 1\n    while string_index >= 0:\n        if string[string_index] == \"(\":\n            if len(closing_parans) == 1:\n                # this opening bracket corresponds to the last closing bracket\n                
return string_index\n            else:\n                closing_parans.pop()\n        elif string[string_index] == \")\":\n            closing_parans.append(string_index)\n        string_index -= 1\n\n    raise Exception(\"Malformed string: Could not find opening parenthesis\\n\\t{0}\".format(string))\n\n# input : LOG (nnet3-chain-train:DebugAfterExecute():nnet-compute.cc:144) c68: BLstm1_backward_W_i-xr.Propagate(NULL, m6212(3136:3199, 0:555), &m31(0:63, 0:1023))\n# output : BLstm1_backward_W_i-xr.Propagate\ndef ExtractCommandName(command_string):\n    # create a concise representation for the command\n    # strip off : LOG (nnet3-chain-train:DebugAfterExecute():nnet-compute.cc:144)\n    command = \" \".join(command_string.split()[2:])\n    # command = c68: BLstm1_backward_W_i-xr.Propagate(NULL, m6212(3136:3199, 0:555), &m31(0:63, 0:1023))\n    end_position = FindOpenParanthesisPosition(command)\n    if end_position is not None:\n        command = command[:end_position]\n    # command = c68: BLstm1_backward_W_i-xr.Propagate\n    command = \":\".join(command.split(\":\")[1:]).strip()\n    # command = BLstm1_backward_W_i-xr.Propagate\n    return command\n\ndef Main():\n    # Sample Line\n    # LOG (nnet3-chain-train:DebugAfterExecute():nnet-compute.cc:144) c128: m19 = []  |               |        time: 0.0007689 secs\n\n    debug_regex = re.compile(\"DebugAfterExecute\")\n    command_times = {}\n    for line in sys.stdin:\n        parts = line.split(\"|\")\n        if len(parts) != 3:\n            # we don't know how to deal with these lines\n            continue\n        if debug_regex.search(parts[0]) is not None:\n            # this is a line printed in the DebugAfterExecute method\n\n            # get the timing info\n            time_parts = parts[-1].split()\n            assert(len(time_parts) == 3 and time_parts[-1] == \"secs\" and time_parts[0] == \"time:\" )\n            time = float(time_parts[1])\n\n            command = ExtractCommandName(parts[0])\n 
          # store the time\n            try:\n                command_times[command] += time\n            except KeyError:\n                command_times[command] = time\n\n    total_time = sum(command_times.values())\n    sorted_commands = sorted(command_times.items(), key = lambda x: x[1], reverse = True)\n    for item in sorted_commands:\n        print(\"{c} : time {t} : fraction {f}\".format(c=item[0], t=item[1], f=float(item[1]) / total_time))\n\n\nif __name__ == \"__main__\":\n    args = GetArgs()\n    Main()\n\n\n"
  },
  {
    "path": "egs/steps/nnet3/tdnn/make_configs.py",
    "content": "#!/usr/bin/env python\n\n# This script is deprecated, please use ../xconfig_to_configs.py\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nfrom __future__ import division\nimport os\nimport argparse\nimport shlex\nimport sys\nimport warnings\nimport copy\nimport imp\nimport ast\n\nnodes = imp.load_source('', 'steps/nnet3/components.py')\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\ndef GetArgs():\n    # we add compulsary arguments as named arguments for readability\n    parser = argparse.ArgumentParser(description=\"Writes config files and variables \"\n                                                 \"for TDNNs creation and training\",\n                                     epilog=\"See steps/nnet3/tdnn/train.sh for example.\")\n\n    # Only one of these arguments can be specified, and one of them has to\n    # be compulsarily specified\n    feat_group = parser.add_mutually_exclusive_group(required = True)\n    feat_group.add_argument(\"--feat-dim\", type=int,\n                            help=\"Raw feature dimension, e.g. 13\")\n    feat_group.add_argument(\"--feat-dir\", type=str,\n                            help=\"Feature directory, from which we derive the feat-dim\")\n\n    # only one of these arguments can be specified\n    ivector_group = parser.add_mutually_exclusive_group(required = False)\n    ivector_group.add_argument(\"--ivector-dim\", type=int,\n                                help=\"iVector dimension, e.g. 100\", default=0)\n    ivector_group.add_argument(\"--ivector-dir\", type=str,\n                                help=\"iVector dir, which will be used to derive the ivector-dim  \", default=None)\n\n    num_target_group = parser.add_mutually_exclusive_group(required = True)\n    num_target_group.add_argument(\"--num-targets\", type=int,\n                                  help=\"number of network targets (e.g. 
num-pdf-ids/num-leaves)\")\n    num_target_group.add_argument(\"--ali-dir\", type=str,\n                                  help=\"alignment directory, from which we derive the num-targets\")\n    num_target_group.add_argument(\"--tree-dir\", type=str,\n                                  help=\"directory with final.mdl, from which we derive the num-targets\")\n\n    # CNN options\n    parser.add_argument('--cnn.layer', type=str, action='append', dest = \"cnn_layer\",\n                        help=\"CNN parameters at each CNN layer, e.g. --filt-x-dim=3 --filt-y-dim=8 \"\n                        \"--filt-x-step=1 --filt-y-step=1 --num-filters=256 --pool-x-size=1 --pool-y-size=3 \"\n                        \"--pool-z-size=1 --pool-x-step=1 --pool-y-step=3 --pool-z-step=1, \"\n                        \"when CNN layers are used, no LDA will be added\", default = None)\n    parser.add_argument(\"--cnn.bottleneck-dim\", type=int, dest = \"cnn_bottleneck_dim\",\n                        help=\"Output dimension of the linear layer at the CNN output \"\n                        \"for dimension reduction, e.g. 256.\"\n                        \"The default zero means this layer is not needed.\", default=0)\n    parser.add_argument(\"--cnn.cepstral-lifter\", type=float, dest = \"cepstral_lifter\",\n                        help=\"The factor used for determining the liftering vector in the production of MFCC. \"\n                        \"User has to ensure that it matches the lifter used in MFCC generation, \"\n                        \"e.g. 22.0\", default=22.0)\n\n    # General neural network options\n    parser.add_argument(\"--splice-indexes\", type=str, required = True,\n                        help=\"Splice indexes at each layer, e.g. 
'-3,-2,-1,0,1,2,3' \"\n                        \"If CNN layers are used the first set of splice indexes will be used as input \"\n                        \"to the first CNN layer and later splice indexes will be interpreted as indexes \"\n                        \"for the TDNNs.\")\n    parser.add_argument(\"--add-lda\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"If \\\"true\\\" an LDA matrix computed from the input features \"\n                        \"(spliced according to the first set of splice-indexes) will be used as \"\n                        \"the first Affine layer. This affine layer's parameters are fixed during training. \"\n                        \"If --cnn.layer is specified this option will be forced to \\\"false\\\".\",\n                        default=True, choices = [\"false\", \"true\"])\n\n    parser.add_argument(\"--include-log-softmax\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"add the final softmax layer \", default=True, choices = [\"false\", \"true\"])\n    parser.add_argument(\"--add-final-sigmoid\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"add a final sigmoid layer as alternate to log-softmax-layer. \"\n                        \"Can only be used if include-log-softmax is false. \"\n                        \"This is useful in cases where you want the output to be \"\n                        \"like probabilities between 0 and 1. Typically the nnet \"\n                        \"is trained with an objective such as quadratic\",\n                        default=False, choices = [\"false\", \"true\"])\n\n    parser.add_argument(\"--objective-type\", type=str,\n                        help = \"the type of objective; i.e. 
quadratic or linear\",\n                        default=\"linear\", choices = [\"linear\", \"quadratic\"])\n    parser.add_argument(\"--xent-regularize\", type=float,\n                        help=\"For chain models, if nonzero, add a separate output for cross-entropy \"\n                        \"regularization (with learning-rate-factor equal to the inverse of this)\",\n                        default=0.0)\n    parser.add_argument(\"--xent-separate-forward-affine\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"if using --xent-regularize, gives it separate last-but-one weight matrix\",\n                        default=False, choices = [\"false\", \"true\"])\n    parser.add_argument(\"--final-layer-normalize-target\", type=float,\n                        help=\"RMS target for final layer (set to <1 if final layer learns too fast)\",\n                        default=1.0)\n    parser.add_argument(\"--max-change-per-component\", type=float,\n                        help=\"Enforces per-component max change (except for the final affine layer). \"\n                        \"if 0 it would not be enforced.\", default=0.75)\n    parser.add_argument(\"--max-change-per-component-final\", type=float,\n                        help=\"Enforces per-component max change for the final affine layer. 
\"\n                        \"if 0 it would not be enforced.\", default=1.5)\n    parser.add_argument(\"--subset-dim\", type=int, default=0,\n                        help=\"dimension of the subset of units to be sent to the central frame\")\n    parser.add_argument(\"--pnorm-input-dim\", type=int,\n                        help=\"input dimension to p-norm nonlinearities\")\n    parser.add_argument(\"--pnorm-output-dim\", type=int,\n                        help=\"output dimension of p-norm nonlinearities\")\n    relu_dim_group = parser.add_mutually_exclusive_group(required = False)\n    relu_dim_group.add_argument(\"--relu-dim\", type=int,\n                        help=\"dimension of all ReLU nonlinearity layers\")\n    relu_dim_group.add_argument(\"--relu-dim-final\", type=int,\n                        help=\"dimension of the last ReLU nonlinearity layer. Dimensions increase geometrically from the first through the last ReLU layer.\", default=None)\n    parser.add_argument(\"--relu-dim-init\", type=int,\n                        help=\"dimension of the first ReLU nonlinearity layer. 
Dimensions increase geometrically from the first through the last ReLU layer.\", default=None)\n\n    parser.add_argument(\"--self-repair-scale-nonlinearity\", type=float,\n                        help=\"A non-zero value activates the self-repair mechanism in the sigmoid and tanh non-linearities of the LSTM\", default=None)\n\n\n    parser.add_argument(\"--use-presoftmax-prior-scale\", type=str, action=common_lib.StrToBoolAction,\n                        help=\"if true, a presoftmax-prior-scale is added\",\n                        choices=['true', 'false'], default = True)\n    parser.add_argument(\"config_dir\",\n                        help=\"Directory to write config files and variables\")\n\n    print(' '.join(sys.argv))\n\n    args = parser.parse_args()\n    args = CheckArgs(args)\n\n    return args\n\ndef CheckArgs(args):\n    if not os.path.exists(args.config_dir):\n        os.makedirs(args.config_dir)\n\n    ## Check arguments.\n    if args.feat_dir is not None:\n        args.feat_dim = common_lib.get_feat_dim(args.feat_dir)\n\n    if args.ali_dir is not None:\n        args.num_targets = common_lib.get_number_of_leaves_from_tree(args.ali_dir)\n    elif args.tree_dir is not None:\n        args.num_targets = common_lib.get_number_of_leaves_from_tree(args.tree_dir)\n\n    if args.ivector_dir is not None:\n        args.ivector_dim = common_lib.get_ivector_dim(args.ivector_dir)\n\n    if not args.feat_dim > 0:\n        raise Exception(\"feat-dim has to be positive\")\n\n    if not args.num_targets > 0:\n        print(args.num_targets)\n        raise Exception(\"num_targets has to be positive\")\n\n    if not args.ivector_dim >= 0:\n        raise Exception(\"ivector-dim has to be non-negative\")\n\n    if (args.subset_dim < 0):\n        raise Exception(\"--subset-dim has to be non-negative\")\n\n    if not args.relu_dim is None:\n        if not args.pnorm_input_dim is None or not args.pnorm_output_dim is None or not args.relu_dim_init is None:\n            raise 
Exception(\"--relu-dim argument not compatible with \"\n                            \"--pnorm-input-dim or --pnorm-output-dim or --relu-dim-init options\")\n        args.nonlin_input_dim = args.relu_dim\n        args.nonlin_output_dim = args.relu_dim\n        args.nonlin_output_dim_final = None\n        args.nonlin_output_dim_init = None\n        args.nonlin_type = 'relu'\n\n    elif not args.relu_dim_final is None:\n        if not args.pnorm_input_dim is None or not args.pnorm_output_dim is None:\n            raise Exception(\"--relu-dim-final argument not compatible with \"\n                            \"--pnorm-input-dim or --pnorm-output-dim options\")\n        if args.relu_dim_init is None:\n            raise Exception(\"--relu-dim-init argument should also be provided with --relu-dim-final\")\n        if args.relu_dim_init > args.relu_dim_final:\n            raise Exception(\"--relu-dim-init has to be no larger than --relu-dim-final\")\n        args.nonlin_input_dim = None\n        args.nonlin_output_dim = None\n        args.nonlin_output_dim_final = args.relu_dim_final\n        args.nonlin_output_dim_init = args.relu_dim_init\n        args.nonlin_type = 'relu'\n\n    else:\n        if not args.relu_dim_init is None:\n            raise Exception(\"--relu-dim-init argument requires --relu-dim-final and is not compatible with \"\n                            \"--pnorm-input-dim or --pnorm-output-dim options\")\n        # Test for None explicitly: comparing None with an int raises TypeError in Python 3.\n        if args.pnorm_input_dim is None or args.pnorm_output_dim is None or not args.pnorm_input_dim > 0 or not args.pnorm_output_dim > 0:\n            raise Exception(\"--relu-dim not set, so expected --pnorm-input-dim and \"\n                            \"--pnorm-output-dim to be provided.\")\n        args.nonlin_input_dim = args.pnorm_input_dim\n        args.nonlin_output_dim = args.pnorm_output_dim\n        if (args.nonlin_input_dim < args.nonlin_output_dim) or (args.nonlin_input_dim % args.nonlin_output_dim != 0):\n            raise Exception(\"Invalid --pnorm-input-dim {0} and --pnorm-output-dim {1}\".format(args.nonlin_input_dim, 
args.nonlin_output_dim))\n        args.nonlin_output_dim_final = None\n        args.nonlin_output_dim_init = None\n        args.nonlin_type = 'pnorm'\n\n    if args.add_final_sigmoid and args.include_log_softmax:\n        raise Exception(\"--include-log-softmax and --add-final-sigmoid cannot both be true.\")\n\n    if args.xent_separate_forward_affine and args.add_final_sigmoid:\n        raise Exception(\"It does not make sense to have --add-final-sigmoid=true when xent-separate-forward-affine is true\")\n\n    if args.add_lda and args.cnn_layer is not None:\n        args.add_lda = False\n        warnings.warn(\"--add-lda is set to false as CNN layers are used.\")\n\n    if not args.max_change_per_component >= 0 or not args.max_change_per_component_final >= 0:\n        raise Exception(\"max-change-per-component and max_change-per-component-final should be non-negative\")\n\n    return args\n\ndef AddConvMaxpLayer(config_lines, name, input, args):\n    if '3d-dim' not in input:\n        raise Exception(\"The input to AddConvMaxpLayer() needs '3d-dim' parameters.\")\n\n    input = nodes.AddConvolutionLayer(config_lines, name, input,\n                              input['3d-dim'][0], input['3d-dim'][1], input['3d-dim'][2],\n                              args.filt_x_dim, args.filt_y_dim,\n                              args.filt_x_step, args.filt_y_step,\n                              args.num_filters, input['vectorization'])\n\n    if args.pool_x_size > 1 or args.pool_y_size > 1 or args.pool_z_size > 1:\n      input = nodes.AddMaxpoolingLayer(config_lines, name, input,\n                                input['3d-dim'][0], input['3d-dim'][1], input['3d-dim'][2],\n                                args.pool_x_size, args.pool_y_size, args.pool_z_size,\n                                args.pool_x_step, args.pool_y_step, args.pool_z_step)\n\n    return input\n\n# The ivectors are processed through an affine layer parallel to the CNN layers,\n# then concatenated with the CNN 
output and passed to the deeper part of the network.\ndef AddCnnLayers(config_lines, cnn_layer, cnn_bottleneck_dim, cepstral_lifter, config_dir, feat_dim, splice_indexes=[0], ivector_dim=0):\n    cnn_args = ParseCnnString(cnn_layer)\n    num_cnn_layers = len(cnn_args)\n    # We use an Idct layer here to convert MFCC to FBANK features\n    common_lib.write_idct_matrix(feat_dim, cepstral_lifter, config_dir.strip() + \"/idct.mat\")\n    prev_layer_output = {'descriptor':  \"input\",\n                         'dimension': feat_dim}\n    prev_layer_output = nodes.AddFixedAffineLayer(config_lines, \"Idct\", prev_layer_output, config_dir.strip() + '/idct.mat')\n\n    list = [('Offset({0}, {1})'.format(prev_layer_output['descriptor'],n) if n != 0 else prev_layer_output['descriptor']) for n in splice_indexes]\n    splice_descriptor = \"Append({0})\".format(\", \".join(list))\n    cnn_input_dim = len(splice_indexes) * feat_dim\n    prev_layer_output = {'descriptor':  splice_descriptor,\n                         'dimension': cnn_input_dim,\n                         '3d-dim': [len(splice_indexes), feat_dim, 1],\n                         'vectorization': 'yzx'}\n\n    for cl in range(0, num_cnn_layers):\n        prev_layer_output = AddConvMaxpLayer(config_lines, \"L{0}\".format(cl), prev_layer_output, cnn_args[cl])\n\n    if cnn_bottleneck_dim > 0:\n        prev_layer_output = nodes.AddAffineLayer(config_lines, \"cnn-bottleneck\", prev_layer_output, cnn_bottleneck_dim, \"\")\n\n    if ivector_dim > 0:\n        iv_layer_output = {'descriptor':  'ReplaceIndex(ivector, t, 0)',\n                           'dimension': ivector_dim}\n        iv_layer_output = nodes.AddAffineLayer(config_lines, \"ivector\", iv_layer_output, ivector_dim, \"\")\n        prev_layer_output['descriptor'] = 'Append({0}, {1})'.format(prev_layer_output['descriptor'], iv_layer_output['descriptor'])\n        prev_layer_output['dimension'] = prev_layer_output['dimension'] + iv_layer_output['dimension']\n\n    
return prev_layer_output\n\ndef PrintConfig(file_name, config_lines):\n    f = open(file_name, 'w')\n    f.write(\"\\n\".join(config_lines['components'])+\"\\n\")\n    f.write(\"\\n#Component nodes\\n\")\n    f.write(\"\\n\".join(config_lines['component-nodes'])+\"\\n\")\n    f.close()\n\ndef ParseCnnString(cnn_param_string_list):\n    cnn_parser = argparse.ArgumentParser(description=\"cnn argument parser\")\n\n    cnn_parser.add_argument(\"--filt-x-dim\", required=True, type=int)\n    cnn_parser.add_argument(\"--filt-y-dim\", required=True, type=int)\n    cnn_parser.add_argument(\"--filt-x-step\", type=int, default = 1)\n    cnn_parser.add_argument(\"--filt-y-step\", type=int, default = 1)\n    cnn_parser.add_argument(\"--num-filters\", required=True, type=int)\n    cnn_parser.add_argument(\"--pool-x-size\", type=int, default = 1)\n    cnn_parser.add_argument(\"--pool-y-size\", type=int, default = 1)\n    cnn_parser.add_argument(\"--pool-z-size\", type=int, default = 1)\n    cnn_parser.add_argument(\"--pool-x-step\", type=int, default = 1)\n    cnn_parser.add_argument(\"--pool-y-step\", type=int, default = 1)\n    cnn_parser.add_argument(\"--pool-z-step\", type=int, default = 1)\n\n    cnn_args = []\n    for cl in range(0, len(cnn_param_string_list)):\n         cnn_args.append(cnn_parser.parse_args(shlex.split(cnn_param_string_list[cl])))\n\n    return cnn_args\n\ndef ParseSpliceString(splice_indexes):\n    splice_array = []\n    left_context = 0\n    right_context = 0\n    split1 = splice_indexes.split();  # we already checked the string is nonempty.\n    if len(split1) < 1:\n        raise Exception(\"invalid splice-indexes argument, too short: \"\n                 + splice_indexes)\n    try:\n        for string in split1:\n            split2 = string.split(\",\")\n            if len(split2) < 1:\n                raise Exception(\"invalid splice-indexes argument, too-short element: \"\n                         + splice_indexes)\n            int_list = []\n        
    for int_str in split2:\n                int_list.append(int(int_str))\n            if not int_list == sorted(int_list):\n                raise Exception(\"elements of splice-indexes must be sorted: \"\n                         + splice_indexes)\n            left_context += -int_list[0]\n            right_context += int_list[-1]\n            splice_array.append(int_list)\n    except ValueError as e:\n        raise Exception(\"invalid splice-indexes argument \" + splice_indexes + str(e))\n    left_context = max(0, left_context)\n    right_context = max(0, right_context)\n\n    return {'left_context':left_context,\n            'right_context':right_context,\n            'splice_indexes':splice_array,\n            'num_hidden_layers':len(splice_array)\n            }\n\n# The function signature of MakeConfigs is changed frequently as it is intended for local use in this script.\ndef MakeConfigs(config_dir, splice_indexes_string,\n                cnn_layer, cnn_bottleneck_dim, cepstral_lifter,\n                feat_dim, ivector_dim, num_targets, add_lda,\n                nonlin_type, nonlin_input_dim, nonlin_output_dim, subset_dim,\n                nonlin_output_dim_init, nonlin_output_dim_final,\n                use_presoftmax_prior_scale,\n                final_layer_normalize_target,\n                include_log_softmax,\n                add_final_sigmoid,\n                xent_regularize,\n                xent_separate_forward_affine,\n                self_repair_scale,\n                max_change_per_component, max_change_per_component_final,\n                objective_type):\n\n    parsed_splice_output = ParseSpliceString(splice_indexes_string.strip())\n\n    left_context = parsed_splice_output['left_context']\n    right_context = parsed_splice_output['right_context']\n    num_hidden_layers = parsed_splice_output['num_hidden_layers']\n    splice_indexes = parsed_splice_output['splice_indexes']\n    input_dim = len(parsed_splice_output['splice_indexes'][0]) + 
feat_dim + ivector_dim\n\n    if xent_separate_forward_affine:\n        if splice_indexes[-1] != [0]:\n            raise Exception(\"--xent-separate-forward-affine option is supported only if the last-hidden layer has no splicing before it. Please use a splice-indexes with just 0 as the final splicing config.\")\n\n    prior_scale_file = '{0}/presoftmax_prior_scale.vec'.format(config_dir)\n\n    config_lines = {'components':[], 'component-nodes':[]}\n\n    config_files={}\n    prev_layer_output = nodes.AddInputLayer(config_lines, feat_dim, splice_indexes[0], ivector_dim)\n\n    # Add the init config lines for estimating the preconditioning matrices\n    init_config_lines = copy.deepcopy(config_lines)\n    init_config_lines['components'].insert(0, '# Config file for initializing neural network prior to')\n    init_config_lines['components'].insert(0, '# preconditioning matrix computation')\n    nodes.AddOutputLayer(init_config_lines, prev_layer_output)\n    config_files[config_dir + '/init.config'] = init_config_lines\n\n    if cnn_layer is not None:\n        prev_layer_output = AddCnnLayers(config_lines, cnn_layer, cnn_bottleneck_dim, cepstral_lifter, config_dir,\n                                         feat_dim, splice_indexes[0], ivector_dim)\n\n    if add_lda:\n        prev_layer_output = nodes.AddLdaLayer(config_lines, \"L0\", prev_layer_output, config_dir + '/lda.mat')\n\n    left_context = 0\n    right_context = 0\n    # we moved the first splice layer to before the LDA..\n    # so the input to the first affine layer is going to [0] index\n    splice_indexes[0] = [0]\n\n    if not nonlin_output_dim is None:\n        nonlin_output_dims = [nonlin_output_dim] * num_hidden_layers\n    elif nonlin_output_dim_init < nonlin_output_dim_final and num_hidden_layers == 1:\n        raise Exception(\"num-hidden-layers has to be greater than 1 if relu-dim-init and relu-dim-final is different.\")\n    else:\n        # computes relu-dim for each hidden layer. 
They increase geometrically across layers\n        factor = pow(float(nonlin_output_dim_final) / nonlin_output_dim_init, 1.0 / (num_hidden_layers - 1)) if num_hidden_layers > 1 else 1\n        nonlin_output_dims = [int(round(nonlin_output_dim_init * pow(factor, i))) for i in range(0, num_hidden_layers)]\n        assert(nonlin_output_dims[-1] >= nonlin_output_dim_final - 1 and nonlin_output_dims[-1] <= nonlin_output_dim_final + 1) # due to rounding error\n        nonlin_output_dims[-1] = nonlin_output_dim_final # It ensures that the dim of the last hidden layer is exactly the same as what is specified\n\n    for i in range(0, num_hidden_layers):\n        # make the intermediate config file for layerwise discriminative training\n\n        # prepare the spliced input\n        if not (len(splice_indexes[i]) == 1 and splice_indexes[i][0] == 0):\n            try:\n                zero_index = splice_indexes[i].index(0)\n            except ValueError:\n                zero_index = None\n            # I just assume the prev_layer_output_descriptor is a simple forwarding descriptor\n            prev_layer_output_descriptor = prev_layer_output['descriptor']\n            subset_output = prev_layer_output\n            if subset_dim > 0:\n                # if subset_dim is specified the script expects a zero in the splice indexes\n                assert(zero_index is not None)\n                subset_node_config = \"dim-range-node name=Tdnn_input_{0} input-node={1} dim-offset={2} dim={3}\".format(i, prev_layer_output_descriptor, 0, subset_dim)\n                subset_output = {'descriptor' : 'Tdnn_input_{0}'.format(i),\n                                 'dimension' : subset_dim}\n                config_lines['component-nodes'].append(subset_node_config)\n            appended_descriptors = []\n            appended_dimension = 0\n            for j in range(len(splice_indexes[i])):\n                if j == zero_index:\n                    
appended_descriptors.append(prev_layer_output['descriptor'])\n                    appended_dimension += prev_layer_output['dimension']\n                    continue\n                appended_descriptors.append('Offset({0}, {1})'.format(subset_output['descriptor'], splice_indexes[i][j]))\n                appended_dimension += subset_output['dimension']\n            prev_layer_output = {'descriptor' : \"Append({0})\".format(\" , \".join(appended_descriptors)),\n                                 'dimension'  : appended_dimension}\n        else:\n            # this is a normal affine node\n            pass\n\n        if xent_separate_forward_affine and i == num_hidden_layers - 1:\n            if xent_regularize == 0.0:\n                raise Exception(\"xent-separate-forward-affine=True is valid only if xent-regularize is non-zero\")\n\n            if nonlin_type == \"relu\" :\n                prev_layer_output_chain = nodes.AddAffRelNormLayer(config_lines, \"Tdnn_pre_final_chain\",\n                                                                   prev_layer_output, nonlin_output_dim,\n                                                                   norm_target_rms = final_layer_normalize_target,\n                                                                   self_repair_scale = self_repair_scale,\n                                                                   max_change_per_component = max_change_per_component)\n\n                prev_layer_output_xent = nodes.AddAffRelNormLayer(config_lines, \"Tdnn_pre_final_xent\",\n                                                                  prev_layer_output, nonlin_output_dim,\n                                                                  norm_target_rms = final_layer_normalize_target,\n                                                                  self_repair_scale = self_repair_scale,\n                                                                  max_change_per_component = max_change_per_component)\n  
          elif nonlin_type == \"pnorm\" :\n                prev_layer_output_chain = nodes.AddAffPnormLayer(config_lines, \"Tdnn_pre_final_chain\",\n                                                                 prev_layer_output, nonlin_input_dim, nonlin_output_dim,\n                                                                 norm_target_rms = final_layer_normalize_target)\n\n                prev_layer_output_xent = nodes.AddAffPnormLayer(config_lines, \"Tdnn_pre_final_xent\",\n                                                                prev_layer_output, nonlin_input_dim, nonlin_output_dim,\n                                                                norm_target_rms = final_layer_normalize_target)\n            else:\n                raise Exception(\"Unknown nonlinearity type\")\n\n            nodes.AddFinalLayer(config_lines, prev_layer_output_chain, num_targets,\n                               max_change_per_component = max_change_per_component_final,\n                               use_presoftmax_prior_scale = use_presoftmax_prior_scale,\n                               prior_scale_file = prior_scale_file,\n                               include_log_softmax = include_log_softmax)\n\n            nodes.AddFinalLayer(config_lines, prev_layer_output_xent, num_targets,\n                                ng_affine_options = \" param-stddev=0 bias-stddev=0 learning-rate-factor={0} \".format(\n                                    0.5 / xent_regularize),\n                                max_change_per_component = max_change_per_component_final,\n                                use_presoftmax_prior_scale = use_presoftmax_prior_scale,\n                                prior_scale_file = prior_scale_file,\n                                include_log_softmax = True,\n                                name_affix = 'xent')\n        else:\n            if nonlin_type == \"relu\":\n                prev_layer_output = nodes.AddAffRelNormLayer(config_lines, 
\"Tdnn_{0}\".format(i),\n                                                            prev_layer_output, nonlin_output_dims[i],\n                                                            norm_target_rms = 1.0 if i < num_hidden_layers -1 else final_layer_normalize_target,\n                                                            self_repair_scale = self_repair_scale,\n                                                            max_change_per_component = max_change_per_component)\n            elif nonlin_type == \"pnorm\":\n                prev_layer_output = nodes.AddAffPnormLayer(config_lines, \"Tdnn_{0}\".format(i),\n                                                           prev_layer_output, nonlin_input_dim, nonlin_output_dim,\n                                                           norm_target_rms = 1.0 if i < num_hidden_layers -1 else final_layer_normalize_target)\n            else:\n                raise Exception(\"Unknown nonlinearity type\")\n            # a final layer is added after each new layer as we are generating\n            # configs for layer-wise discriminative training\n\n            # add_final_sigmoid adds a sigmoid as a final layer as alternative\n            # to log-softmax layer.\n            # http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression#Softmax_Regression_vs._k_Binary_Classifiers\n            # This is useful when you need the final outputs to be probabilities between 0 and 1.\n            # Usually used with an objective-type such as \"quadratic\".\n            # Applications are k-binary classification such Ideal Ratio Mask prediction.\n            nodes.AddFinalLayer(config_lines, prev_layer_output, num_targets,\n                               max_change_per_component = max_change_per_component_final,\n                               use_presoftmax_prior_scale = use_presoftmax_prior_scale,\n                               prior_scale_file = prior_scale_file,\n                               include_log_softmax = 
include_log_softmax,\n                               add_final_sigmoid = add_final_sigmoid,\n                               objective_type = objective_type)\n            if xent_regularize != 0.0:\n                nodes.AddFinalLayer(config_lines, prev_layer_output, num_targets,\n                                    ng_affine_options = \" param-stddev=0 bias-stddev=0 learning-rate-factor={0} \".format(\n                                          0.5 / xent_regularize),\n                                    max_change_per_component = max_change_per_component_final,\n                                    use_presoftmax_prior_scale = use_presoftmax_prior_scale,\n                                    prior_scale_file = prior_scale_file,\n                                    include_log_softmax = True,\n                                    name_affix = 'xent')\n\n        config_files['{0}/layer{1}.config'.format(config_dir, i+1)] = config_lines\n        config_lines = {'components':[], 'component-nodes':[]}\n\n    left_context += int(parsed_splice_output['left_context'])\n    right_context += int(parsed_splice_output['right_context'])\n\n    # write the files used by other scripts like steps/nnet3/get_egs.sh\n    f = open(config_dir + \"/vars\", \"w\")\n    print('model_left_context={}'.format(left_context), file=f)\n    print('model_right_context={}'.format(right_context), file=f)\n    print('num_hidden_layers={}'.format(num_hidden_layers), file=f)\n    print('num_targets={}'.format(num_targets), file=f)\n    print('add_lda=' + ('true' if add_lda else 'false'), file=f)\n    print('include_log_softmax=' + ('true' if include_log_softmax else 'false'), file=f)\n    print('objective_type=' + objective_type, file=f)\n    f.close()\n\n    # printing out the configs\n    # init.config used to train lda-mllt train\n    for key in config_files.keys():\n        PrintConfig(key, config_files[key])\n\ndef Main():\n    args = GetArgs()\n\n    MakeConfigs(config_dir = args.config_dir,\n      
          splice_indexes_string = args.splice_indexes,\n                feat_dim = args.feat_dim, ivector_dim = args.ivector_dim,\n                num_targets = args.num_targets,\n                add_lda = args.add_lda,\n                cnn_layer = args.cnn_layer,\n                cnn_bottleneck_dim = args.cnn_bottleneck_dim,\n                cepstral_lifter = args.cepstral_lifter,\n                nonlin_type = args.nonlin_type,\n                nonlin_input_dim = args.nonlin_input_dim,\n                nonlin_output_dim = args.nonlin_output_dim,\n                subset_dim = args.subset_dim,\n                nonlin_output_dim_init = args.nonlin_output_dim_init,\n                nonlin_output_dim_final = args.nonlin_output_dim_final,\n                use_presoftmax_prior_scale = args.use_presoftmax_prior_scale,\n                final_layer_normalize_target = args.final_layer_normalize_target,\n                include_log_softmax = args.include_log_softmax,\n                add_final_sigmoid = args.add_final_sigmoid,\n                xent_regularize = args.xent_regularize,\n                xent_separate_forward_affine = args.xent_separate_forward_affine,\n                self_repair_scale = args.self_repair_scale_nonlinearity,\n                max_change_per_component = args.max_change_per_component,\n                max_change_per_component_final = args.max_change_per_component_final,\n                objective_type = args.objective_type)\n\nif __name__ == \"__main__\":\n    Main()\n\n"
  },
  {
    "path": "egs/steps/nnet3/tdnn/train.sh",
    "content": "#!/usr/bin/env bash\n\n# THIS SCRIPT IS DEPRECATED, see ../train_dnn.py\n\n# note, TDNN is the same as what we used to call multisplice.\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\npnorm_input_dim=3000\npnorm_output_dim=300\nrelu_dim=  # you can use this to make it use ReLU's instead of p-norms.\nrand_prune=4.0 # Relates to a speedup we do for LDA.\nminibatch_size=512  # This default is suitable for GPU-based training.\n                    # Set it to 128 for multi-threaded CPU-based training.\nmax_param_change=2.0  # max param change per minibatch\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  
This option is passed to get_egs.sh\nnum_jobs_initial=1  # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nprior_subset_size=20000 # 20k samples per job, for computing priors.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0    # can be used for rerunning after partial\nonline_ivector_dir=\npresoftmax_prior_scale_power=-0.25\nuse_presoftmax_prior_scale=true\nremove_egs=true  # set to false to disable removing egs after training is done.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nstage=-6\nexit_stage=-100 # you can set this to terminate the training early.  
Exits before running this stage\n\n# count space-separated fields in splice_indexes to get num-hidden-layers.\nsplice_indexes=\"-4,-3,-2,-1,0,1,2,3,4  0  -2,2  0  -4,4 0\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\nchunk_training=false  # if true training is done with chunk randomization, rather than frame randomization\n\nrandprune=4.0 # speeds up LDA.\nuse_gpu=true    # if true, we run on GPU.\ncleanup=true\negs_dir=\nmax_lda_jobs=10  # use no more than 10 jobs for the LDA accumulation.\nlda_opts=\negs_opts=\ntransform_dir=     # If supplied, this dir used instead of alidir to find transforms.\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=raw  # or set to 'lda' to use LDA features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # List of times on which we realign.  Each time is\n                        # floating point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\nframes_per_eg=8 # to be passed on to get_egs.sh\nsubset_dim=0\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\necho \"$0: THIS SCRIPT IS DEPRECATED\"\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.01> # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.001>   # effective learning rate at end of training.\"\n  echo \"                                                   # data, 0.00025 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --presoftmax-prior-scale-power <power|-0.25>     # use the specified power value on the priors (inverse priors) to scale\"\n  echo \"                                                   # the pre-softmax outputs (set to 0.0 to disable the presoftmax element scale)\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job, for CPU-based training (will affect\"\n  echo \"                                                   # results as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch 
size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... note, you might have to reduce --mem\"\n  echo \"                                                   # versus your defaults, because it gets multiplied by the --num-threads argument.\"\n  echo \"  --minibatch-size <minibatch-size|512>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames)\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-times <list-of-times|\\\"\\\">             # A list of space-separated floating point numbers between 0.0 and\"\n  echo \"                                                   # 1.0 to specify how far through training realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify whether the GPU is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage 
<stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n# First work out the feature and iVector dimension, needed for tdnn config creation.\ncase $feat_type in\n  raw) feat_dim=$(feat-to-dim --print-args=false scp:$data/feats.scp -) || \\\n      { echo \"$0: Error getting feature dim\"; exit 1; }\n    ;;\n  lda)  [ ! 
-f $alidir/final.mat ] && echo \"$0: With --feat-type lda option, expect $alidir/final.mat to exist.\"\n   # get num-rows in lda matrix, which is the lda feature dim.\n   feat_dim=$(matrix-dim --print-args=false $alidir/final.mat | cut -f 1)\n    ;;\n  *)\n   echo \"$0: Bad --feat-type '$feat_type';\"; exit 1;\nesac\nif [ -z \"$online_ivector_dir\" ]; then\n  ivector_dim=0\nelse\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\nfi\n\n\nif [ $stage -le -5 ]; then\n  echo \"$0: creating neural net configs\";\n\n  if [ ! -z \"$relu_dim\" ]; then\n    dim_opts=\"--relu-dim $relu_dim\"\n  else\n    dim_opts=\"--pnorm-input-dim $pnorm_input_dim --pnorm-output-dim  $pnorm_output_dim\"\n  fi\n\n  # create the config files for nnet initialization\n  python steps/nnet3/tdnn/make_configs.py  \\\n    --splice-indexes \"$splice_indexes\"  \\\n    --subset-dim \"$subset_dim\" \\\n    --feat-dim $feat_dim \\\n    --ivector-dim $ivector_dim  \\\n     $dim_opts \\\n    --use-presoftmax-prior-scale $use_presoftmax_prior_scale \\\n    --num-targets  $num_leaves  \\\n   $dir/configs || exit 1;\n\n  # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n  # matrix.  This first config just does any initial splicing that we do;\n  # we do this as it's a convenient way to get the stats for the 'lda-like'\n  # transform.\n  $cmd $dir/log/nnet_init.log \\\n    nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;\nfi\n\n# sourcing the \"vars\" below sets\n# left_context=(something)\n# right_context=(something)\n# num_hidden_layers=(something)\n. $dir/configs/vars || exit 1;\n\nleft_context=$model_left_context\nright_context=$model_right_context\n\ncontext_opts=\"--left-context=$left_context --right-context=$right_context\"\n\n! 
[ \"$num_hidden_layers\" -gt 0 ] && echo \\\n \"$0: Expected num_hidden_layers to be defined\" && exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\n\n\nif [ $stage -le -4 ] && [ -z \"$egs_dir\" ]; then\n  extra_opts=()\n  [ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n  [ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n  [ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n  extra_opts+=(--transform-dir $transform_dir)\n  extra_opts+=(--left-context $left_context)\n  extra_opts+=(--right-context $right_context)\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet3/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts \\\n      --frames-per-eg $frames_per_eg \\\n      $data $alidir $dir/egs || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n\nif [ \"$feat_dim\" != \"$(cat $egs_dir/info/feat_dim)\" ]; then\n  echo \"$0: feature dimension mismatch with egs, $feat_dim vs $(cat $egs_dir/info/feat_dim)\";\n  exit 1;\nfi\nif [ \"$ivector_dim\" != \"$(cat $egs_dir/info/ivector_dim)\" ]; then\n  echo \"$0: ivector dimension mismatch with egs, $ivector_dim vs $(cat $egs_dir/info/ivector_dim)\";\n  exit 1;\nfi\n\n# copy any of the following that exist, to $dir.\ncp $egs_dir/{cmvn_opts,splice_opts,final.mat} $dir 2>/dev/null\n\n# confirm that the egs_dir has the necessary context (especially important if\n# the --egs-dir option was used on the command line).\negs_left_context=$(cat $egs_dir/info/left_context) || exit -1\negs_right_context=$(cat $egs_dir/info/right_context) || exit -1\n ( [ $egs_left_context -lt $left_context ] || \\\n   [ $egs_right_context -lt $right_context ] ) && \\\n   echo \"$0: egs in $egs_dir have too little context\" && exit -1;\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat 
$egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nif [ \"$chunk_training\" == \"true\" ]; then\n  num_archives_expanded=$num_archives\nelse\n  num_archives_expanded=$[$num_archives*$frames_per_eg]\nfi\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --num-jobs-initial cannot exceed --num-jobs-final\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --num-jobs-final cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\n\nif [ $stage -le -3 ]; then\n  echo \"$0: getting preconditioning matrix for input features.\"\n  num_lda_jobs=$num_archives\n  [ $num_lda_jobs -gt $max_lda_jobs ] && num_lda_jobs=$max_lda_jobs\n\n  # Write stats with the same format as stats for LDA.\n  $cmd JOB=1:$num_lda_jobs $dir/log/get_lda_stats.JOB.log \\\n      nnet3-acc-lda-stats --rand-prune=$rand_prune \\\n        $dir/init.raw \"ark:$egs_dir/egs.JOB.ark\" $dir/JOB.lda_stats || exit 1;\n\n  all_lda_accs=$(for n in $(seq $num_lda_jobs); do echo $dir/$n.lda_stats; done)\n  $cmd $dir/log/sum_transform_stats.log \\\n    sum-lda-accs $dir/lda_stats $all_lda_accs || exit 1;\n\n  rm $all_lda_accs || exit 1;\n\n  # this computes a fixed affine transform computed in the way we described in\n  # Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled variant\n  # of an LDA transform but without dimensionality reduction.\n  $cmd $dir/log/get_transform.log \\\n     nnet-get-feature-transform $lda_opts $dir/lda.mat $dir/lda_stats || exit 1;\n\n  ln -sf ../lda.mat $dir/configs/lda.mat\nfi\n\n\nif [ $stage -le -2 ]; then\n  echo \"$0: preparing initial vector for FixedScaleComponent before softmax\"\n  echo \"  ... 
using priors^$presoftmax_prior_scale_power and rescaling to average 1\"\n\n  # obtains raw pdf count\n  $cmd JOB=1:$nj $dir/log/acc_pdf.JOB.log \\\n     ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n     post-to-tacc --per-pdf=true  $alidir/final.mdl ark:- $dir/pdf_counts.JOB || exit 1;\n  $cmd $dir/log/sum_pdf_counts.log \\\n       vector-sum --binary=false $dir/pdf_counts.* $dir/pdf_counts || exit 1;\n  rm $dir/pdf_counts.*\n\n  awk -v power=$presoftmax_prior_scale_power -v smooth=0.01 \\\n     '{ for(i=2; i<=NF-1; i++) { count[i-2] = $i;  total += $i; }\n        num_pdfs=NF-2;  average_count = total/num_pdfs;\n        for (i=0; i<num_pdfs; i++) stot += (scale[i] = (count[i] + smooth * average_count)^power)\n        printf \" [ \"; for (i=0; i<num_pdfs; i++) printf(\"%f \", scale[i]*num_pdfs/stot); print \"]\" }' \\\n     $dir/pdf_counts > $dir/presoftmax_prior_scale.vec\n  ln -sf ../presoftmax_prior_scale.vec $dir/configs/presoftmax_prior_scale.vec\nfi\n\nif [ $stage -le -1 ]; then\n  # Add the first layer; this will add in the lda.mat and\n  # presoftmax_prior_scale.vec.\n  $cmd $dir/log/add_first_layer.log \\\n       nnet3-init --srand=-3 $dir/init.raw $dir/configs/layer1.config $dir/0.raw || exit 1;\n\n  # Convert to .mdl, train the transitions, set the priors.\n  $cmd $dir/log/init_mdl.log \\\n    nnet3-am-init $alidir/final.mdl $dir/0.raw - \\| \\\n    nnet3-am-train-transitions - \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl || exit 1;\nfi\n\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\n! 
[ $num_iters -gt $[$num_hidden_layers*$add_layers_period+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif $use_gpu; then\n  parallel_suffix=\"\"\n  train_queue_opt=\"--gpu 1\"\n  combine_queue_opt=\"--gpu 1\"\n  prior_gpu_opt=\"--use-gpu=yes\"\n  prior_queue_opt=\"--gpu 1\"\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  parallel_train_opts=\"--use-gpu=no\"\n  combine_queue_opt=\"\"  # the combine stage will be quite slow if not using\n                        # GPU, as we didn't enable that program to use\n                        # multiple threads.\n  prior_gpu_opt=\"--use-gpu=no\"\n  prior_queue_opt=\"\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n# First work out how many iterations we want to combine over in the final\n# nnet3-combine-fast invocation.  (We may end up subsampling from these if the\n# number exceeds max_models_combine).  
The number we use is:\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     1/2 * iters_after_last_layer_added)\nnum_iters_combine=$max_models_combine\nif [ $num_iters_combine -lt $approx_iters_per_epoch_final ]; then\n   num_iters_combine=$approx_iters_per_epoch_final\nfi\nhalf_iters_after_add_layers=$[($num_iters-$finish_add_layers_iter)/2]\nif [ $num_iters_combine -gt $half_iters_after_add_layers ]; then\n  num_iters_combine=$half_iters_after_add_layers\nfi\nfirst_model_combine=$[$num_iters-$num_iters_combine+1]\n\nx=0\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\";\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e \"print (($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet3-copy-egs --srand=JOB --frame=random $context_opts ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet3-merge-egs ark:- ark:- \\| \\\n        nnet3-compute-from-egs --apply-exp=true \"nnet3-am-copy --raw=true $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet3-am-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet3/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet3/relabel_egs.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet3/remove_egs.sh 
$prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n            \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/valid_diagnostic.egs ark:- |\" &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n           \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:- |\" &\n\n    if [ $x -gt 0 ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet3-show-progress --use-gpu=no \"nnet3-am-copy --raw=true $dir/$[$x-1].mdl - |\" \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n        \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:-|\" '&&' \\\n        nnet3-info \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging but take the\n                       # best.\n      cur_num_hidden_layers=$[1+$x/$add_layers_period]\n      config=$dir/configs/layer$cur_num_hidden_layers.config\n      raw=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl - | nnet3-init --srand=$x - $config - |\"\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      raw=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size (and we will later choose the output of just one of the jobs): the\n      # 
model-averaging isn't always helpful when the model is changing too fast\n      # (i.e. it can worsen the objective function), and the smaller minibatch\n      # size will help to keep the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $train_queue_opt $dir/log/train.$x.$n.log \\\n          nnet3-train $parallel_train_opts \\\n          --max-param-change=$max_param_change \"$raw\" \\\n          \"ark,bg:nnet3-copy-egs --frame=$frame $context_opts ark:$cur_egs_dir/egs.$archive.ark ark:- | nnet3-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-merge-egs --minibatch-size=$this_minibatch_size --discard-partial-minibatches=true ark:- ark:- |\" \\\n          $dir/$[$x+1].$n.raw || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: 
error on iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.raw\"\n    done\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet3-average $nnets_list - \\| \\\n        nnet3-am-copy --set-raw-nnet=- $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet3-am-copy --set-raw-nnet=$dir/$[$x+1].$n.raw  $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.  In the nnet3 setup, the logic\n  # for doing averaging of subsets of the models in the case where\n  # there are too many models to reliably estimate interpolation\n  # factors (max_models_combine) is moved into the nnet3-combine program.\n  nnets_list=()\n  for n in $(seq 0 $[num_iters_combine-1]); do\n    iter=$[$first_model_combine+$n]\n    mdl=$dir/$iter.mdl\n    [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n    nnets_list[$n]=\"nnet3-am-copy --raw=true $mdl -|\";\n  done\n\n  # Below, we use --use-gpu=no to disable nnet3-combine-fast from using a GPU,\n  # as if there are many models it can give out-of-memory error; and we set\n  # num-threads to 8 to speed it up (this isn't ideal...)\n\n  $cmd $combine_queue_opt $dir/log/combine.log \\\n    nnet3-combine --num-iters=40 \\\n       --enforce-sum-to-one=true --enforce-positive-weights=true \\\n       --verbose=3 \"${nnets_list[@]}\" \"ark,bg:nnet3-merge-egs --minibatch-size=1024 ark:$cur_egs_dir/combine.egs ark:-|\" \\\n    \"|nnet3-am-copy --set-raw-nnet=- $dir/$num_iters.mdl $dir/combined.mdl\" || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" \\\n    \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/valid_diagnostic.egs ark:- |\" &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet3-compute-prob  \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" \\\n    \"ark,bg:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:- |\" &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  if [ $num_jobs_compute_prior -gt $num_archives ]; then egs_part=1;\n  else egs_part=JOB; fi\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $prior_queue_opt $dir/log/get_post.$x.JOB.log \\\n    nnet3-copy-egs --frame=random $context_opts --srand=JOB ark:$cur_egs_dir/egs.$egs_part.ark ark:- \\| \\\n    nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet3-merge-egs ark:- ark:- \\| \\\n    nnet3-compute-from-egs $prior_gpu_opt 
--apply-exp=true \\\n      \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet3-am-adjust-priors $dir/combined.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n\nsteps/info/nnet3_dir_info.pl $dir\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/tdnn/train_raw_nnet.sh",
    "content": "#!/usr/bin/env bash\n\n# THIS SCRIPT IS DEPRECATED, see ../train_raw_dnn.py\n\n# note, TDNN is the same as what we used to call multisplice.\n# THIS SCRIPT IS DEPRECATED, see ../train_raw_dnn.py\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014-2016  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\nrand_prune=4.0 # Relates to a speedup we do for LDA.\nminibatch_size=512  # This default is suitable for GPU-based training.\n                    # Set it to 128 for multi-threaded CPU-based training.\nmax_param_change=2.0  # max param change per minibatch\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This option is passed to get_egs.sh\nnum_jobs_initial=1  # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nprior_subset_size=20000 # 20k samples per job, for computing priors.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0    # can be used for rerunning after partial\nonline_ivector_dir=\nremove_egs=true  # set to false to disable removing egs after training is done.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  
You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nstage=-6\nexit_stage=-100 # you can set this to terminate the training early.  Exits before running this stage\n\nchunk_training=false  # if true training is done with chunk randomization, rather than frame randomization\n\nrandprune=4.0 # speeds up LDA.\nuse_gpu=true    # if true, we run on GPU.\ncleanup=true\negs_dir=\nconfigs_dir=\nmax_lda_jobs=10  # use no more than 10 jobs for the LDA accumulation.\nlda_opts=\negs_opts=\ntransform_dir=     # If supplied, this dir used instead of alidir to find transforms.\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\nframes_per_eg=8 # to be passed on to get_egs.sh\n\n# Raw nnet training options i.e. without transition model\nnj=4\ndense_targets=true        # Use dense targets instead of sparse targets\n\n# End configuration section.\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\necho \"$0: THIS SCRIPT IS DEPRECATED\"\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"$0: THIS SCRIPT IS DEPRECATED, see ../train_raw_dnn.py\"\n  echo \"Usage: $0 [opts] <data> <targets-scp> <exp-dir>\"\n  echo \" e.g.: $0 data/train scp:snr_targets/targets.scp exp/nnet3_snr_predictor\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.01> # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.001>   # effective learning rate at end of training.\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job, for CPU-based training (will affect\"\n  echo \"                                                   # results as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. 
queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... note, you might have to reduce --mem\"\n  echo \"                                                   # versus your defaults, because it gets multiplied by the --num-threads argument.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --stage <stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\ntargets_scp=$2\ndir=$3\n\n# Check some files.\nfor f in $data/feats.scp $targets_scp; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\n\n# First work out the feature and iVector dimension, needed for tdnn config creation.\nfeat_dim=$(feat-to-dim --print-args=false scp:$data/feats.scp -) || \\\n      { echo \"$0: Error getting feature dim\"; exit 1; }\n\nif [ -z \"$online_ivector_dir\" ]; then\n  ivector_dim=0\nelse\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\n  steps/nnet2/get_ivector_id.sh $online_ivector_dir > $dir/final.ie.id || exit 1\nfi\n\nif [ ! -z \"$configs_dir\" ]; then\n  cp -rT $configs_dir $dir/configs || exit 1\nfi\n\nif [ $stage -le -5 ]; then\n  # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n  # matrix.  This first config just does any initial splicing that we do;\n  # we do this as it's a convenient way to get the stats for the 'lda-like'\n  # transform.\n  $cmd $dir/log/nnet_init.log \\\n    nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;\nfi\n\n# sourcing the \"vars\" below sets\n# model_left_context=(something)\n# model_right_context=(something)\n# num_hidden_layers=(something)\n# num_targets=(something)\n# add_lda=(true|false)\n# include_log_softmax=(true|false)\n# objective_type=(something)\n. $dir/configs/vars || exit 1;\nleft_context=$model_left_context\nright_context=$model_right_context\n\n[ -z \"$num_targets\" ] && echo \"\\$num_targets is not defined. Needs to be defined in $dir/configs/vars.\" && exit 1\n[ -z \"$add_lda\" ] && echo \"\\$add_lda is not defined. Needs to be defined in $dir/configs/vars.\" && exit 1\n[ -z \"$include_log_softmax\" ] && echo \"\\$include_log_softmax is not defined. Needs to be defined in $dir/configs/vars.\" && exit 1\n[ -z \"$objective_type\" ] && echo \"\\$objective_type is not defined. 
Needs to be defined in $dir/configs/vars.\" && exit 1\n\ncontext_opts=\"--left-context=$left_context --right-context=$right_context\"\n\n! [ \"$num_hidden_layers\" -gt 0 ] && echo \\\n \"$0: Expected num_hidden_layers to be defined\" && exit 1;\n\nif $dense_targets; then\n  tmp_num_targets=`feat-to-dim scp:$targets_scp - 2>/dev/null` || exit 1\n\n  if [ $tmp_num_targets -ne $num_targets ]; then\n    echo \"Mismatch between num-targets provided to script vs configs\"\n    exit 1\n  fi\nfi\n\nif [ $stage -le -4 ] && [ -z \"$egs_dir\" ]; then\n  extra_opts=()\n  [ ! -z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n  [ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n  extra_opts+=(--transform-dir \"$transform_dir\")\n  extra_opts+=(--left-context $left_context)\n  extra_opts+=(--right-context $right_context)\n  echo \"$0: calling get_egs.sh\"\n\n  if $dense_targets; then\n    target_type=dense\n  else\n    target_type=sparse\n  fi\n\n  steps/nnet3/get_egs_targets.sh $egs_opts \"${extra_opts[@]}\" \\\n    --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n    --cmd \"$cmd\" --nj $nj \\\n    --frames-per-eg $frames_per_eg \\\n    --target-type $target_type --num-targets $num_targets \\\n    $data $targets_scp $dir/egs || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n\nif [ ! 
-z \"$online_ivector_dir\" ] ; then\n  steps/nnet2/check_ivectors_compatible.sh $online_ivector_dir $egs_dir/info || exit 1\nfi\n\n\nif [ \"$feat_dim\" != \"$(cat $egs_dir/info/feat_dim)\" ]; then\n  echo \"$0: feature dimension mismatch with egs, $feat_dim vs $(cat $egs_dir/info/feat_dim)\";\n  exit 1;\nfi\nif [ \"$ivector_dim\" != \"$(cat $egs_dir/info/ivector_dim)\" ]; then\n  echo \"$0: ivector dimension mismatch with egs, $ivector_dim vs $(cat $egs_dir/info/ivector_dim)\";\n  exit 1;\nfi\n\n# copy any of the following that exist, to $dir.\ncp $egs_dir/{cmvn_opts,splice_opts,final.mat} $dir 2>/dev/null\n\n# confirm that the egs_dir has the necessary context (especially important if\n# the --egs-dir option was used on the command line).\negs_left_context=$(cat $egs_dir/info/left_context) || exit -1\negs_right_context=$(cat $egs_dir/info/right_context) || exit -1\n ( [ $egs_left_context -lt $left_context ] || \\\n   [ $egs_right_context -lt $right_context ] ) && \\\n   echo \"$0: egs in $egs_dir have too little context\" && exit -1;\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate archive.\nif [ \"$chunk_training\" == \"true\" ]; then\n  num_archives_expanded=$num_archives\nelse\n  num_archives_expanded=$[$num_archives*$frames_per_eg]\nfi\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --initial-num-jobs cannot exceed --final-num-jobs\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --final-num-jobs cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\n\nif $add_lda && [ $stage -le -3 ]; then\n  echo \"$0: getting preconditioning matrix for input features.\"\n  num_lda_jobs=$num_archives\n  [ 
$num_lda_jobs -gt $max_lda_jobs ] && num_lda_jobs=$max_lda_jobs\n\n  # Write stats with the same format as stats for LDA.\n  $cmd JOB=1:$num_lda_jobs $dir/log/get_lda_stats.JOB.log \\\n      nnet3-acc-lda-stats --rand-prune=$rand_prune \\\n        $dir/init.raw \"ark:$egs_dir/egs.JOB.ark\" $dir/JOB.lda_stats || exit 1;\n\n  all_lda_accs=$(for n in $(seq $num_lda_jobs); do echo $dir/$n.lda_stats; done)\n  $cmd $dir/log/sum_transform_stats.log \\\n    sum-lda-accs $dir/lda_stats $all_lda_accs || exit 1;\n\n  rm $all_lda_accs || exit 1;\n\n  # this computes a fixed affine transform computed in the way we described in\n  # Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled variant\n  # of an LDA transform but without dimensionality reduction.\n  $cmd $dir/log/get_transform.log \\\n     nnet-get-feature-transform $lda_opts $dir/lda.mat $dir/lda_stats || exit 1;\n\n  ln -sf ../lda.mat $dir/configs/lda.mat\nfi\n\n\nif [ $stage -le -1 ]; then\n  # Add the first layer; this will add in the lda.mat\n  $cmd $dir/log/add_first_layer.log \\\n       nnet3-init --srand=-3 $dir/init.raw $dir/configs/layer1.config $dir/0.raw || exit 1;\n\nfi\n\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\n! 
[ $num_iters -gt $[$finish_add_layers_iter+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif $use_gpu; then\n  parallel_suffix=\"\"\n  train_queue_opt=\"--gpu 1\"\n  combine_queue_opt=\"--gpu 1\"\n  prior_gpu_opt=\"--use-gpu=yes\"\n  prior_queue_opt=\"--gpu 1\"\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  parallel_train_opts=\"--use-gpu=no\"\n  combine_queue_opt=\"\"  # the combine stage will be quite slow if not using\n                        # GPU, as we didn't enable that program to use\n                        # multiple threads.\n  prior_gpu_opt=\"--use-gpu=no\"\n  prior_queue_opt=\"\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n# First work out how many iterations we want to combine over in the final\n# nnet3-combine-fast invocation.  (We may end up subsampling from these if the\n# number exceeds max_model_combine).  
The number we use is:\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     1/2 * iters_after_last_layer_added)\nnum_iters_combine=$max_models_combine\nif [ $num_iters_combine -lt $approx_iters_per_epoch_final ]; then\n   num_iters_combine=$approx_iters_per_epoch_final\nfi\nhalf_iters_after_add_layers=$[($num_iters-$finish_add_layers_iter)/2]\nif [ $num_iters_combine -gt $half_iters_after_add_layers ]; then\n  num_iters_combine=$half_iters_after_add_layers\nfi\nfirst_model_combine=$[$num_iters-$num_iters_combine+1]\n\nx=0\n\n\ncompute_accuracy=false\nif [ \"$objective_type\" == \"linear\" ]; then\n  compute_accuracy=true\nfi\n\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e \"print (($x + 1 >= $num_iters ? 
$flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet3-compute-prob --compute-accuracy=$compute_accuracy $dir/$x.raw \\\n      \"ark,bg:nnet3-merge-egs ark:$egs_dir/valid_diagnostic.egs ark:- |\" &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet3-compute-prob --compute-accuracy=$compute_accuracy $dir/$x.raw \\\n      \"ark,bg:nnet3-merge-egs ark:$egs_dir/train_diagnostic.egs ark:- |\" &\n\n    if [ $x -gt 0 ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet3-show-progress --use-gpu=no $dir/$[x-1].raw $dir/$x.raw \\\n        \"ark,bg:nnet3-merge-egs ark:$egs_dir/train_diagnostic.egs ark:-|\" '&&' \\\n        nnet3-info $dir/$x.raw &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging but take the\n                       # best.\n      cur_num_hidden_layers=$[1+$x/$add_layers_period]\n      config=$dir/configs/layer$cur_num_hidden_layers.config\n      raw=\"nnet3-copy --learning-rate=$this_learning_rate $dir/$x.raw - | nnet3-init --srand=$x - $config - |\"\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      raw=\"nnet3-copy --learning-rate=$this_learning_rate $dir/$x.raw -|\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size (and we will later choose the output of just one of the jobs): the\n      # 
model-averaging isn't always helpful when the model is changing too fast\n      # (i.e. it can worsen the objective function), and the smaller minibatch\n      # size will help to keep the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $train_queue_opt $dir/log/train.$x.$n.log \\\n          nnet3-train $parallel_train_opts \\\n          --max-param-change=$max_param_change \"$raw\" \\\n          \"ark,bg:nnet3-copy-egs --frame=$frame $context_opts ark:$egs_dir/egs.$archive.ark ark:- | nnet3-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-merge-egs --minibatch-size=$this_minibatch_size --discard-partial-minibatches=true ark:- ark:- |\" \\\n          $dir/$[$x+1].$n.raw || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on 
iteration $x of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.raw\"\n    done\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet3-average $nnets_list $dir/$[x+1].raw || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet3-copy $dir/$[$x+1].$n.raw $dir/$[$x+1].raw || exit 1;\n    fi\n\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].raw ] && exit 1;\n    if [ -f $dir/$[$x-1].raw ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].raw\n    fi\n  fi\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.raw\"\n\n  # Now do combination.  In the nnet3 setup, the logic\n  # for doing averaging of subsets of the models in the case where\n  # there are too many models to reliably estimate interpolation\n  # factors (max_models_combine) is moved into nnet3-combine.\n  nnets_list=()\n  for n in $(seq 0 $[num_iters_combine-1]); do\n    iter=$[$first_model_combine+$n]\n    nnet=$dir/$iter.raw\n    [ ! 
-f $nnet ] && echo \"Expected $nnet to exist\" && exit 1;\n    nnets_list[$n]=$nnet\n  done\n\n  # Below, we use --use-gpu=no to disable nnet3-combine-fast from using a GPU,\n  # as if there are many models it can give out-of-memory error; and we set\n  # num-threads to 8 to speed it up (this isn't ideal...)\n\n  $cmd $combine_queue_opt $dir/log/combine.log \\\n    nnet3-combine --num-iters=40 \\\n    --enforce-sum-to-one=true --enforce-positive-weights=true \\\n    --verbose=3 \"${nnets_list[@]}\" \"ark,bg:nnet3-merge-egs --minibatch-size=1024 ark:$egs_dir/combine.egs ark:-|\" \\\n    $dir/final.raw || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet3-compute-prob --compute-accuracy=$compute_accuracy $dir/final.raw \\\n    \"ark,bg:nnet3-merge-egs ark:$egs_dir/valid_diagnostic.egs ark:- |\" &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet3-compute-prob --compute-accuracy=$compute_accuracy $dir/final.raw \\\n    \"ark,bg:nnet3-merge-egs ark:$egs_dir/train_diagnostic.egs ark:- |\" &\nfi\n\nif $include_log_softmax && [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purpose of using as prior to convert posteriors to likelihoods.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  if [ $num_jobs_compute_prior -gt $num_archives ]; then egs_part=1;\n  else egs_part=JOB; fi\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $prior_queue_opt $dir/log/get_post.$x.JOB.log \\\n    nnet3-copy-egs --frame=random $context_opts --srand=JOB ark:$egs_dir/egs.$egs_part.ark ark:- \\| \\\n    nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet3-merge-egs ark:- ark:- \\| \\\n    nnet3-compute-from-egs $prior_gpu_opt --apply-exp=true \\\n    $dir/final.raw ark:- ark:- \\| \\\n  
  matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm -f $dir/post.$x.*.vec;\n\nfi\n\n\nif [ ! -f $dir/final.raw ]; then\n  echo \"$0: $dir/final.raw does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.raw ]; then\n       # delete all but every 100th model and the last one; final.raw has already been built.\n      rm $dir/$x.raw\n    fi\n  done\nfi\n"
  },
  {
    "path": "egs/steps/nnet3/train_discriminative.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey)\n#           2014-2015  Vimal Manohar\n# Apache 2.0.\n\nset -o pipefail\n\n# This script does MPE or MMI or state-level minimum bayes risk (sMBR) training\n# using egs obtained by steps/nnet3/get_egs_discriminative.sh\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=4       # Number of epochs of training;\n                   # the number of iterations is worked out from this.\n                   # Be careful with this: we actually go over the data\n                   # num-epochs * frame-subsampling-factor times, due to\n                   # using different data-shifts.\nuse_gpu=true\napply_deriv_weights=true\nuse_frame_shift=false\nrun_diagnostics=true\nlearning_rate=0.00002\nmax_param_change=2.0\nscale_max_param_change=false # if this option is used, scale it by num-jobs.\n\neffective_lrate=    # If supplied, overrides the learning rate, which gets set to effective_lrate * num_jobs_nnet.\nacoustic_scale=0.1  # acoustic scale for MMI/MPFE/SMBR training.\nboost=0.0       # option relevant for MMI\n\ncriterion=smbr\ndrop_frames=false #  option relevant for MMI\none_silence_class=true # option relevant for MPE/SMBR\nnum_jobs_nnet=4    # Number of neural net jobs to run in parallel.  Note: this\n                   # will interact with the learning rates (if you decrease\n                   # this, you'll have to decrease the learning rate, and vice\n                   # versa).\nregularization_opts=\nminibatch_size=64  # This is the number of examples rather than the number of output frames.\nlast_layer_factor=1.0  # relates to modify-learning-rates [deprecated]\nshuffle_buffer_size=1000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  
You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n\n\nstage=-3\n\nnum_threads=16  # this is the default but you may want to change it, e.g. to 1 if\n                # using GPUs.\n\ncleanup=true\nkeep_model_iters=100\nremove_egs=false\nsrc_model=  # will default to $degs_dir/final.mdl\n\nnum_jobs_compute_prior=10\n\nmin_deriv_time=0\nmax_deriv_time_relative=0\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [opts] <degs-dir> <exp-dir>\"\n  echo \" e.g.: $0 exp/nnet3/tdnn_sp_degs exp/nnet3/tdnn_sp_smbr\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|4>                         # Number of epochs of training\"\n  echo \"  --learning-rate <learning-rate|0.00002>          # Learning rate to use\"\n  echo \"  --effective-lrate <effective-learning-rate>      # If supplied, learning rate will be set to\"\n  echo \"                                                   # this value times num-jobs-nnet.\"\n  echo \"  --num-jobs-nnet <num-jobs|4>                     # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                          
         # the learning rate.  Also note: if there are fewer archives\"\n  echo \"                                                   # of egs than this, it will get reduced automatically.\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job (will affect results\"\n  echo \"                                                   # as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch size.  With GPU, must be 1.)\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... \"\n  echo \"  --stage <stage|-3>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --criterion <criterion|smbr>                     # Training criterion: may be smbr, mmi or mpfe\"\n  echo \"  --boost <boost|0.0>                              # Boosting factor for MMI (e.g., 0.1)\"\n  echo \"  --drop-frames <true,false|false>                 # Option that affects MMI training: if true, we exclude gradients from frames\"\n  echo \"                                                   # where the numerator transition-id is not in the denominator lattice.\"\n  echo \"  --one-silence-class <true,false|true>            # Option that affects MPE/SMBR training (will tend to reduce insertions)\"\n  echo \"  --modify-learning-rates <true,false|false>       # If true, modify learning rates to try to equalize relative\"\n  echo \"                                                   # changes across layers. 
[deprecated]\"\n  exit 1;\nfi\n\ndegs_dir=$1\ndir=$2\n\n[ -z \"$src_model\" ] && src_model=$degs_dir/final.mdl\n\n# Check some files.\nfor f in $degs_dir/degs.1.ark $degs_dir/info/{num_archives,silence.csl,frame_subsampling_factor} $src_model; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log || exit 1;\n\n\nmodel_left_context=$(nnet3-am-info $src_model | grep \"^left-context:\" | awk '{print $2}')\nmodel_right_context=$(nnet3-am-info $src_model | grep \"^right-context:\" | awk '{print $2}')\n\n# Copy the ivector information\nif [ -f $degs_dir/info/final.ie.id ]; then\n  cp $degs_dir/info/final.ie.id $dir/ 2>/dev/null || true\nfi\n\n# copy some things\nfor f in splice_opts cmvn_opts tree final.mat; do\n  if [ -f $degs_dir/$f ]; then\n    cp $degs_dir/$f $dir/ || exit 1;\n  fi\ndone\n\nsilphonelist=`cat $degs_dir/info/silence.csl` || exit 1;\n\nnum_archives=$(cat $degs_dir/info/num_archives) || exit 1;\nframe_subsampling_factor=$(cat $degs_dir/info/frame_subsampling_factor)\n\necho $frame_subsampling_factor > $dir/frame_subsampling_factor\n\nif $use_frame_shift; then\n  num_archives_expanded=$[$num_archives*$frame_subsampling_factor]\nelse\n  num_archives_expanded=$num_archives\nfi\n\nif [ $num_jobs_nnet -gt $num_archives_expanded ]; then\n  echo \"$0: num-jobs-nnet $num_jobs_nnet exceeds number of archives $num_archives_expanded,\"\n  echo \" ... setting it to $num_archives_expanded.\"\n  num_jobs_nnet=$num_archives_expanded\nfi\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[$num_archives_to_process/$num_jobs_nnet]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif $use_gpu; then\n  parallel_suffix=\"\"\n  train_queue_opt=\"--gpu 1\"\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  
If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  parallel_train_opts=\"--use-gpu=no\"\nfi\n\nif $use_frame_shift; then\n  num_epochs_expanded=$[num_epochs*frame_subsampling_factor]\nelse\n  num_epochs_expanded=$num_epochs\nfi\n\nfor e in $(seq 1 $num_epochs_expanded); do\n  x=$[($e*$num_archives)/$num_jobs_nnet] # gives the iteration number.\n  iter_to_epoch[$x]=$e\ndone\n\nif [ $stage -le -1 ]; then\n  echo \"$0: Copying initial model and modifying preconditioning setup\"\n\n  # Note, the baseline model probably had preconditioning, and we'll keep it;\n  # but we want online preconditioning with a larger number of samples of\n  # history, since in this setup the frames are only randomized at the segment\n  # level so they are highly correlated.  It might make sense to tune this a\n  # little, later on, although I doubt it matters once the --num-samples-history\n  # is large enough.\n\n  if [ ! 
-z \"$effective_lrate\" ]; then\n    learning_rate=$(perl -e \"print ($num_jobs_nnet*$effective_lrate);\")\n    echo \"$0: setting learning rate to $learning_rate = --num-jobs-nnet * --effective-lrate.\"\n  fi\n\n\n  # set the learning rate to $learning_rate, and\n  # set the output-layer's learning rate to\n  # $learning_rate times $last_layer_factor.\n  edits_str=\"set-learning-rate learning-rate=$learning_rate\"\n  if [ \"$last_layer_factor\" != \"1.0\" ]; then\n    last_layer_lrate=$(perl -e \"print ($learning_rate*$last_layer_factor);\") || exit 1\n    edits_str=\"$edits_str; set-learning-rate name=output.affine learning-rate=$last_layer_lrate\"\n  fi\n\n  $cmd $dir/log/convert.log \\\n    nnet3-am-copy --edits=\"$edits_str\" \"$src_model\" $dir/0.mdl || exit 1;\n\n  ln -sf 0.mdl $dir/epoch0.mdl\nfi\n\n\nrm $dir/.error 2>/dev/null\n\nx=0\n\nwhile [ $x -lt $num_iters ]; do\n  if [ $stage -le $x ]; then\n    if $run_diagnostics; then\n      # Set off jobs doing some diagnostics, in the background.  
# Use the egs dir from the previous iteration for the diagnostics\n      $cmd $dir/log/compute_objf_valid.$x.log \\\n        nnet3-discriminative-compute-objf  $regularization_opts \\\n        --silence-phones=$silphonelist \\\n        --criterion=$criterion --drop-frames=$drop_frames \\\n        --one-silence-class=$one_silence_class \\\n        --boost=$boost --acoustic-scale=$acoustic_scale \\\n        $dir/$x.mdl \\\n        \"ark,bg:nnet3-discriminative-copy-egs ark:$degs_dir/valid_diagnostic.degs ark:- | nnet3-discriminative-merge-egs --minibatch-size=1:64 ark:- ark:- |\" &\n      $cmd $dir/log/compute_objf_train.$x.log \\\n        nnet3-discriminative-compute-objf  $regularization_opts \\\n        --silence-phones=$silphonelist \\\n        --criterion=$criterion --drop-frames=$drop_frames \\\n        --one-silence-class=$one_silence_class \\\n        --boost=$boost --acoustic-scale=$acoustic_scale \\\n        $dir/$x.mdl \\\n        \"ark,bg:nnet3-discriminative-copy-egs ark:$degs_dir/train_diagnostic.degs ark:- | nnet3-discriminative-merge-egs --minibatch-size=1:64 ark:- ark:- |\" &\n    fi\n\n    if [ $x -gt 0 ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet3-show-progress --use-gpu=no \"nnet3-am-copy --raw=true $dir/$[$x-1].mdl - |\" \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n        '&&' \\\n        nnet3-info \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" &\n    fi\n\n\n    echo \"Training neural net (pass $x)\"\n\n    cache_read_opt=\"--read-cache=$dir/cache.$x\"\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in `seq $num_jobs_nnet`; do\n        
k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n\n        if [ $n -eq 1 ]; then\n          # an option for writing cache (storing pairs of nnet-computations and\n          # computation-requests) during training.\n          cache_write_opt=\" --write-cache=$dir/cache.$[$x+1]\"\n        else\n          cache_write_opt=\"\"\n        fi\n\n        if $use_frame_shift; then\n          frame_shift=$[(k%num_archives + k/num_archives) % frame_subsampling_factor]\n        else\n          frame_shift=0\n        fi\n\n        #archive=$[(($n+($x*$num_jobs_nnet))%$num_archives)+1]\n        if $scale_max_param_change; then\n          this_max_param_change=$(perl -e \"print ($max_param_change * $num_jobs_nnet);\")\n        else\n          this_max_param_change=$max_param_change\n        fi\n\n        $cmd $train_queue_opt $dir/log/train.$x.$n.log \\\n          nnet3-discriminative-train $cache_read_opt $cache_write_opt \\\n          --apply-deriv-weights=$apply_deriv_weights \\\n          --optimization.min-deriv-time=-$model_left_context \\\n          --optimization.max-deriv-time-relative=$model_right_context \\\n            $parallel_train_opts \\\n          --max-param-change=$this_max_param_change \\\n          --silence-phones=$silphonelist \\\n          --criterion=$criterion --drop-frames=$drop_frames \\\n          --one-silence-class=$one_silence_class \\\n          --boost=$boost --acoustic-scale=$acoustic_scale $regularization_opts \\\n          $dir/$x.mdl \\\n          \"ark,bg:nnet3-discriminative-copy-egs --frame-shift=$frame_shift ark:$degs_dir/degs.$archive.ark ark:- | nnet3-discriminative-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:- | nnet3-discriminative-merge-egs --minibatch-size=$minibatch_size ark:- ark:- |\" \\\n          $dir/$[$x+1].$n.raw || 
touch $dir/.error &\n      done\n      wait\n      [ -f $dir/.error ] && exit 1\n    )\n    [ -f $dir/.error ] && { echo \"Found $dir/.error. See $dir/log/train.$x.*.log\"; exit 1; }\n\n    nnets_list=$(for n in $(seq $num_jobs_nnet); do echo $dir/$[$x+1].$n.raw; done)\n\n    # below use run.pl instead of a generic $cmd for these very quick stages,\n    # so that we don't run the risk of waiting for a possibly hard-to-get GPU.\n    run.pl $dir/log/average.$x.log \\\n      nnet3-average $nnets_list - \\| \\\n      nnet3-am-copy --set-raw-nnet=- $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && echo \"$0: Did not create $dir/$[$x+1].mdl\" && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%$keep_model_iters] -ne 0  ] && \\\n       [ -z \"${iter_to_epoch[$[$x-1]]}\" ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n\n    [ -f $dir/.error ] && { echo \"Found $dir/.error. Error on iteration $x\"; exit 1; }\n  fi\n\n  rm $dir/cache.$x 2>/dev/null || true\n  x=$[$x+1]\n  num_archives_processed=$[num_archives_processed+num_jobs_nnet]\n\n  if [ $stage -le $x ] && [ ! -z \"${iter_to_epoch[$x]}\" ]; then\n    e=${iter_to_epoch[$x]}\n    ln -sf $x.mdl $dir/epoch$e.mdl\n\n    (\n      rm $dir/.error 2> /dev/null\n\n      steps/nnet3/adjust_priors.sh --egs-type degs \\\n        --num-jobs-compute-prior $num_jobs_compute_prior \\\n        --cmd \"$cmd\" --use-gpu false \\\n        --minibatch-size $minibatch_size \\\n        --use-raw-nnet false --iter epoch$e $dir $degs_dir \\\n        || { touch $dir/.error; echo \"Error in adjusting priors. 
See errors above.\"; exit 1; }\n    ) &\n  fi\n\ndone\n\nrm $dir/final.mdl 2>/dev/null\ncp $dir/$x.mdl $dir/final.mdl\n\n# function to remove egs that might be soft links.\nremove () { for x in $*; do [ -L $x ] && rm $(utils/make_absolute.sh $x); rm $x; done }\n\nif $cleanup && $remove_egs; then  # note: this is false by default.\n  echo Removing training examples\n  remove $degs_dir/degs.*\n  remove $degs_dir/priors_egs.*\nfi\n\n\nif $cleanup; then\n  echo Removing most of the models\n  for x in `seq 1 $keep_model_iters $num_iters`; do\n    if [ -z \"${iter_to_epoch[$x]}\" ]; then\n      # if $x is not an epoch-final iteration..\n      rm $dir/$x.mdl 2>/dev/null\n    fi\n  done\nfi\n\nwait\n[ -f $dir/.error ] && { echo \"Found $dir/.error.\"; exit 1; }\n\necho Done && exit 0\n"
  },
  {
    "path": "egs/steps/nnet3/train_dnn.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n#           2017 Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\n\"\"\" This script is based on steps/nnet3/tdnn/train.sh\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport logging\nimport os\nimport pprint\nimport shutil\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\nimport libs.nnet3.train.frame_level_objf as train_lib\nimport libs.nnet3.report.log_parse as nnet3_log_parse\n\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Starting DNN trainer (train_dnn.py)')\n\n\ndef get_args():\n    \"\"\" Get args from stdin.\n\n    We add compulsory arguments as named arguments for readability\n\n    The common options are defined in the object\n    libs.nnet3.train.common.CommonParser.parser.\n    See steps/libs/nnet3/train/common.py\n    \"\"\"\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Trains a feed forward DNN acoustic model using the\n        cross-entropy objective.  
DNNs include simple DNNs, TDNNs and CNNs.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        conflict_handler='resolve',\n        parents=[common_train_lib.CommonParser(include_chunk_context=False).parser])\n\n    # egs extraction options\n    parser.add_argument(\"--egs.frames-per-eg\", type=int, dest='frames_per_eg',\n                        default=8,\n                        help=\"Number of output labels per example\")\n\n    # trainer options\n    parser.add_argument(\"--trainer.input-model\", type=str,\n                        dest='input_model', default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"If specified, this model is used as initial\n                        raw model (0.raw in the script) instead of initializing\n                        the model from xconfig. Configs dir is not expected to\n                        exist and left/right context is computed from this\n                        model.\"\"\")\n    parser.add_argument(\"--trainer.prior-subset-size\", type=int,\n                        dest='prior_subset_size', default=20000,\n                        help=\"Number of samples for computing priors\")\n    parser.add_argument(\"--trainer.num-jobs-compute-prior\", type=int,\n                        dest='num_jobs_compute_prior', default=10,\n                        help=\"The prior computation jobs are single \"\n                        \"threaded and run on the CPU\")\n\n    # Parameters for the optimization\n    parser.add_argument(\"--trainer.optimization.minibatch-size\",\n                        type=str, dest='minibatch_size', default='512',\n                        help=\"\"\"Size of the minibatch used in SGD training\n                        (argument to nnet3-merge-egs); may be a more general\n                        rule as accepted by the --minibatch-size option of\n                        nnet3-merge-egs; run that program without args to see\n 
                       the format.\"\"\")\n\n    # General options\n    parser.add_argument(\"--feat-dir\", type=str, required=False,\n                        help=\"Directory with features used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--lang\", type=str, required=False,\n                        help=\"Language directory\")\n    parser.add_argument(\"--ali-dir\", type=str, required=True,\n                        help=\"Directory with alignments used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--dir\", type=str, required=True,\n                        help=\"Directory to store the models and \"\n                        \"all other files.\")\n\n    print(' '.join(sys.argv), file=sys.stderr)\n    print(sys.argv, file=sys.stderr)\n\n    args = parser.parse_args()\n\n    [args, run_opts] = process_args(args)\n\n    return [args, run_opts]\n\n\ndef process_args(args):\n    \"\"\" Process the options obtained from get_args()\n    \"\"\"\n\n    if args.frames_per_eg < 1:\n        raise Exception(\"--egs.frames-per-eg should have a minimum value of 1\")\n\n    if not common_train_lib.validate_minibatch_size_str(args.minibatch_size):\n        raise Exception(\"--trainer.optimization.minibatch-size has an invalid value\")\n\n    if (not os.path.exists(args.dir)):\n        raise Exception(\"Directory specified with --dir={0} \"\n                        \"does not exist.\".format(args.dir))\n    if (not os.path.exists(args.dir + \"/configs\") and\n        (args.input_model is None or not os.path.exists(args.input_model))):\n        raise Exception(\"Either --trainer.input-model option should be supplied, \"\n                        \"and exist; or the {0}/configs directory should exist. \"\n                        \"{0}/configs is the output of make_configs.py\"\n                        \"\".format(args.dir))\n\n    # set the options corresponding to args.use_gpu\n    run_opts = 
common_train_lib.RunOpts()\n    if args.use_gpu in [\"true\", \"false\"]:\n        args.use_gpu = (\"yes\" if args.use_gpu == \"true\" else \"no\")\n    if args.use_gpu in [\"yes\", \"wait\"]:\n        if not common_lib.check_if_cuda_compiled():\n            logger.warning(\n                \"\"\"You are running with one thread but you have not compiled\n                   for CUDA.  You may be running a setup optimized for GPUs.\n                   If you have GPUs and have nvcc installed, go to src/ and do\n                   ./configure; make\"\"\")\n\n        run_opts.train_queue_opt = \"--gpu 1\"\n        run_opts.parallel_train_opts = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_queue_opt = \"--gpu 1\"\n        run_opts.prior_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.prior_queue_opt = \"--gpu 1\"\n\n    else:\n        logger.warning(\"Without using a GPU this will be very slow. 
\"\n                       \"nnet3 does not yet support multiple threads.\")\n\n        run_opts.train_queue_opt = \"\"\n        run_opts.parallel_train_opts = \"--use-gpu=no\"\n        run_opts.combine_gpu_opt = \"--use-gpu=no\"\n        run_opts.combine_queue_opt = \"\"\n        run_opts.prior_gpu_opt = \"--use-gpu=no\"\n        run_opts.prior_queue_opt = \"\"\n\n    run_opts.command = args.command\n    run_opts.egs_command = (args.egs_command\n                            if args.egs_command is not None else\n                            args.command)\n    run_opts.num_jobs_compute_prior = args.num_jobs_compute_prior\n\n    return [args, run_opts]\n\n\ndef train(args, run_opts):\n    \"\"\" The main function for training.\n\n    Args:\n        args: a Namespace object with the required parameters\n            obtained from the function process_args()\n        run_opts: RunOpts object obtained from the process_args()\n    \"\"\"\n\n    arg_string = pprint.pformat(vars(args))\n    logger.info(\"Arguments for the experiment\\n{0}\".format(arg_string))\n\n    # Copy phones.txt from ali-dir to dir. 
Later, steps/nnet3/decode.sh will\n    # use it to check compatibility between training and decoding phone-sets.\n    shutil.copy('{0}/phones.txt'.format(args.ali_dir), args.dir)\n\n    # Set some variables.\n    # num_leaves = common_lib.get_number_of_leaves_from_tree(args.ali_dir)\n    num_jobs = common_lib.get_number_of_jobs(args.ali_dir)\n    feat_dim = common_lib.get_feat_dim(args.feat_dir)\n    ivector_dim = common_lib.get_ivector_dim(args.online_ivector_dir)\n    ivector_id = common_lib.get_ivector_extractor_id(args.online_ivector_dir)\n\n    # split the training data into parts for individual jobs\n    # we will use the same number of jobs as that used for alignment\n    common_lib.execute_command(\"utils/split_data.sh {0} {1}\".format(\n        args.feat_dir, num_jobs))\n    shutil.copy('{0}/tree'.format(args.ali_dir), args.dir)\n\n    with open('{0}/num_jobs'.format(args.dir), 'w') as f:\n        f.write('{}'.format(num_jobs))\n\n    if args.input_model is None:\n        config_dir = '{0}/configs'.format(args.dir)\n        var_file = '{0}/vars'.format(config_dir)\n\n        variables = common_train_lib.parse_generic_config_vars_file(var_file)\n    else:\n        # If args.input_model is specified, the model left and right contexts\n        # are computed using input_model.\n        variables = common_train_lib.get_input_model_info(args.input_model)\n\n    # Set some variables.\n    try:\n        model_left_context = variables['model_left_context']\n        model_right_context = variables['model_right_context']\n    except KeyError as e:\n        raise Exception(\"KeyError {0}: Variables need to be defined in \"\n                        \"{1}\".format(str(e), '{0}/configs'.format(args.dir)))\n\n    left_context = model_left_context\n    right_context = model_right_context\n\n    # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n    # matrix.  
This first config just does any initial splicing that we do;\n    # we do this as it's a convenient way to get the stats for the 'lda-like'\n    # transform.\n\n    if (args.stage <= -5) and os.path.exists(args.dir+\"/configs/init.config\") and \\\n       (args.input_model is None):\n        logger.info(\"Initializing a basic network for estimating \"\n                    \"preconditioning matrix\")\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/nnet_init.log \\\n                    nnet3-init --srand=-2 {dir}/configs/init.config \\\n                    {dir}/init.raw\"\"\".format(command=run_opts.command,\n                                             dir=args.dir))\n\n    default_egs_dir = '{0}/egs'.format(args.dir)\n    if (args.stage <= -4) and args.egs_dir is None:\n        logger.info(\"Generating egs\")\n\n        if args.feat_dir is None:\n            raise Exception(\"--feat-dir option is required if you don't supply --egs-dir\")\n\n        train_lib.acoustic_model.generate_egs(\n            data=args.feat_dir, alidir=args.ali_dir, egs_dir=default_egs_dir,\n            left_context=left_context, right_context=right_context,\n            run_opts=run_opts,\n            frames_per_eg_str=str(args.frames_per_eg),\n            srand=args.srand,\n            egs_opts=args.egs_opts,\n            cmvn_opts=args.cmvn_opts,\n            online_ivector_dir=args.online_ivector_dir,\n            samples_per_iter=args.samples_per_iter,\n            stage=args.egs_stage)\n\n    if args.egs_dir is None:\n        egs_dir = default_egs_dir\n    else:\n        egs_dir = args.egs_dir\n\n    [egs_left_context, egs_right_context,\n     frames_per_eg_str, num_archives] = (\n         common_train_lib.verify_egs_dir(egs_dir, feat_dim,\n                                         ivector_dim, ivector_id,\n                                         left_context, right_context))\n    assert str(args.frames_per_eg) == frames_per_eg_str\n\n    if 
args.num_jobs_final > num_archives:\n        raise Exception('num_jobs_final cannot exceed the number of archives '\n                        'in the egs directory')\n\n    # copy the properties of the egs to dir for\n    # use during decoding\n    common_train_lib.copy_egs_properties_to_exp_dir(egs_dir, args.dir)\n\n    if args.stage <= -3 and os.path.exists(args.dir+\"/configs/init.config\") and (args.input_model is None):\n        logger.info('Computing the preconditioning matrix for input features')\n\n        train_lib.common.compute_preconditioning_matrix(\n            args.dir, egs_dir, num_archives, run_opts,\n            max_lda_jobs=args.max_lda_jobs,\n            rand_prune=args.rand_prune)\n\n    if args.stage <= -2 and (args.input_model is None):\n        logger.info(\"Computing initial vector for FixedScaleComponent before\"\n                    \" softmax, using priors^{prior_scale} and rescaling to\"\n                    \" average 1\".format(\n                        prior_scale=args.presoftmax_prior_scale_power))\n\n        common_train_lib.compute_presoftmax_prior_scale(\n            args.dir, args.ali_dir, num_jobs, run_opts,\n            presoftmax_prior_scale_power=args.presoftmax_prior_scale_power)\n\n    if args.stage <= -1:\n        logger.info(\"Preparing the initial acoustic model.\")\n        train_lib.acoustic_model.prepare_initial_acoustic_model(\n            args.dir, args.ali_dir, run_opts,\n            input_model=args.input_model)\n\n    # set num_iters so that as close as possible, we process the data\n    # $num_epochs times, i.e. 
$num_iters*$avg_num_jobs) ==\n    # $num_epochs*$num_archives, where\n    # avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n    num_archives_expanded = num_archives * args.frames_per_eg\n    num_archives_to_process = int(args.num_epochs * num_archives_expanded)\n    num_archives_processed = 0\n    num_iters = int(num_archives_to_process * 2 / (args.num_jobs_initial + args.num_jobs_final))\n\n    # If do_final_combination is True, compute the set of models_to_combine.\n    # Otherwise, models_to_combine will be none.\n    if args.do_final_combination:\n        models_to_combine = common_train_lib.get_model_combine_iters(\n            num_iters, args.num_epochs,\n            num_archives_expanded, args.max_models_combine,\n            args.num_jobs_final)\n    else:\n        models_to_combine = None\n\n    logger.info(\"Training will run for {0} epochs = \"\n                \"{1} iterations\".format(args.num_epochs, num_iters))\n\n    for iter in range(num_iters):\n        if (args.exit_stage is not None) and (iter == args.exit_stage):\n            logger.info(\"Exiting early due to --exit-stage {0}\".format(iter))\n            return\n\n        current_num_jobs = common_train_lib.get_current_num_jobs(\n            iter, num_iters,\n            args.num_jobs_initial, args.num_jobs_step, args.num_jobs_final)\n\n        if args.stage <= iter:\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n                                                       num_iters,\n                                                       num_archives_processed,\n                                                       num_archives_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n            shrinkage_value = 1.0 - (args.proportional_shrink * lrate)\n            if shrinkage_value <= 0.5:\n                raise 
Exception(\"proportional-shrink={0} is too large, it gives \"\n                                \"shrink-value={1}\".format(args.proportional_shrink,\n                                                          shrinkage_value))\n\n            percent = num_archives_processed * 100.0 / num_archives_to_process\n            epoch = (num_archives_processed * args.num_epochs\n                     / num_archives_to_process)\n            shrink_info_str = ''\n            if shrinkage_value != 1.0:\n                shrink_info_str = 'shrink: {0:0.5f}'.format(shrinkage_value)\n            logger.info(\"Iter: {0}/{1}   Jobs: {2}   \"\n                        \"Epoch: {3:0.2f}/{4:0.1f} ({5:0.1f}% complete)   \"\n                        \"lr: {6:0.6f}   {7}\".format(iter, num_iters - 1,\n                                                    current_num_jobs,\n                                                    epoch, args.num_epochs,\n                                                    percent,\n                                                    lrate, shrink_info_str))\n\n            train_lib.common.train_one_iteration(\n                dir=args.dir,\n                iter=iter,\n                srand=args.srand,\n                egs_dir=egs_dir,\n                num_jobs=current_num_jobs,\n                num_archives_processed=num_archives_processed,\n                num_archives=num_archives,\n                learning_rate=lrate,\n                dropout_edit_string=common_train_lib.get_dropout_edit_string(\n                    args.dropout_schedule,\n                    float(num_archives_processed) / num_archives_to_process,\n                    iter),\n                train_opts=' '.join(args.train_opts),\n                minibatch_size_str=args.minibatch_size,\n                frames_per_eg=args.frames_per_eg,\n                momentum=args.momentum,\n                max_param_change=args.max_param_change,\n                shrinkage_value=shrinkage_value,\n                
shuffle_buffer_size=args.shuffle_buffer_size,\n                run_opts=run_opts)\n\n            if args.cleanup:\n                # clean up everything but the last 2 models, under certain\n                # conditions\n                common_train_lib.remove_model(\n                    args.dir, iter-2, num_iters, models_to_combine,\n                    args.preserve_model_interval)\n\n            if args.email is not None:\n                reporting_iter_interval = num_iters * args.reporting_interval\n                if iter % reporting_iter_interval == 0:\n                    # let's do some reporting\n                    [report, times, data] = (\n                        nnet3_log_parse.generate_acc_logprob_report(args.dir))\n                    message = report\n                    subject = (\"Update : Expt {dir} : \"\n                               \"Iter {iter}\".format(dir=args.dir, iter=iter))\n                    common_lib.send_mail(message, subject, args.email)\n\n        num_archives_processed = num_archives_processed + current_num_jobs\n\n    if args.stage <= num_iters:\n        if args.do_final_combination:\n            logger.info(\"Doing final combination to produce final.mdl\")\n            train_lib.common.combine_models(\n                dir=args.dir, num_iters=num_iters,\n                models_to_combine=models_to_combine,\n                egs_dir=egs_dir,\n                minibatch_size_str=args.minibatch_size, run_opts=run_opts,\n                max_objective_evaluations=args.max_objective_evaluations)\n\n    if args.stage <= num_iters + 1:\n        logger.info(\"Getting average posterior for purposes of \"\n                    \"adjusting the priors.\")\n\n        # If args.do_final_combination is true, we will use the combined model.\n        # Otherwise, we will use the last_numbered model.\n        real_iter = 'combined' if args.do_final_combination else num_iters\n        avg_post_vec_file = 
train_lib.common.compute_average_posterior(\n            dir=args.dir, iter=real_iter,\n            egs_dir=egs_dir, num_archives=num_archives,\n            prior_subset_size=args.prior_subset_size, run_opts=run_opts)\n\n        logger.info(\"Re-adjusting priors based on computed posteriors\")\n        combined_or_last_numbered_model = \"{dir}/{iter}.mdl\".format(dir=args.dir,\n                iter=real_iter)\n        final_model = \"{dir}/final.mdl\".format(dir=args.dir)\n        train_lib.common.adjust_am_priors(args.dir, combined_or_last_numbered_model,\n                avg_post_vec_file, final_model, run_opts)\n\n\n    if args.cleanup:\n        logger.info(\"Cleaning up the experiment directory \"\n                    \"{0}\".format(args.dir))\n        remove_egs = args.remove_egs\n        if args.egs_dir is not None:\n            # this egs_dir was not created by this experiment so we will not\n            # delete it\n            remove_egs = False\n\n        common_train_lib.clean_nnet_dir(\n            nnet_dir=args.dir, num_iters=num_iters, egs_dir=egs_dir,\n            preserve_model_interval=args.preserve_model_interval,\n            remove_egs=remove_egs)\n\n    # do some reporting\n    [report, times, data] = nnet3_log_parse.generate_acc_logprob_report(args.dir)\n    if args.email is not None:\n        common_lib.send_mail(report, \"Update : Expt {0} : \"\n                                     \"complete\".format(args.dir), args.email)\n\n    with open(\"{dir}/accuracy.report\".format(dir=args.dir), \"w\") as f:\n        f.write(report)\n\n    common_lib.execute_command(\"steps/info/nnet3_dir_info.pl \"\n                               \"{0}\".format(args.dir))\n\n\ndef main():\n    [args, run_opts] = get_args()\n    try:\n        train(args, run_opts)\n        common_lib.wait_for_background_commands()\n    except BaseException as e:\n        # look for BaseException so we catch KeyboardInterrupt, which is\n        # what we get when a background thread 
dies.\n        if args.email is not None:\n            message = (\"Training session for experiment {dir} \"\n                       \"died due to an error.\".format(dir=args.dir))\n            common_lib.send_mail(message, message, args.email)\n        if not isinstance(e, KeyboardInterrupt):\n            traceback.print_exc()\n        sys.exit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/train_raw_dnn.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This script is similar to steps/nnet3/train_dnn.py but trains a\nraw neural network instead of an acoustic model.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport logging\nimport pprint\nimport os\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\nimport libs.nnet3.train.frame_level_objf as train_lib\nimport libs.nnet3.report.log_parse as nnet3_log_parse\n\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Starting raw DNN trainer (train_raw_dnn.py)')\n\n\ndef get_args():\n    \"\"\" Get args from stdin.\n\n    The common options are defined in the object\n    libs.nnet3.train.common.CommonParser.parser.\n    See steps/libs/nnet3/train/common.py\n    \"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Trains a feed forward raw DNN (without transition model)\n        using frame-level objectives like cross-entropy and mean-squared-error.\n        DNNs include simple DNNs, TDNNs and CNNs.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        conflict_handler='resolve',\n        parents=[common_train_lib.CommonParser(include_chunk_context=False).parser])\n\n    # egs extraction options\n    parser.add_argument(\"--egs.frames-per-eg\", type=int, dest='frames_per_eg',\n                        default=8,\n                        help=\"Number of output labels per example\")\n    
parser.add_argument(\"--image.augmentation-opts\", type=str,\n                        dest='image_augmentation_opts',\n                        default=None,\n                        help=\"Image augmentation options\")\n\n    # trainer options\n    parser.add_argument(\"--trainer.input-model\", type=str,\n                        dest='input_model', default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"If specified, this model is used as initial\n                        raw model (0.raw in the script) instead of initializing\n                        the model from xconfig. Configs dir is not expected to\n                        exist and left/right context is computed from this\n                        model.\"\"\")\n    parser.add_argument(\"--trainer.prior-subset-size\", type=int,\n                        dest='prior_subset_size', default=20000,\n                        help=\"Number of samples for computing priors\")\n    parser.add_argument(\"--trainer.num-jobs-compute-prior\", type=int,\n                        dest='num_jobs_compute_prior', default=10,\n                        help=\"The prior computation jobs are single \"\n                        \"threaded and run on the CPU\")\n\n    # Parameters for the optimization\n    parser.add_argument(\"--trainer.optimization.minibatch-size\",\n                        type=str, dest='minibatch_size', default='512',\n                        help=\"\"\"Size of the minibatch used in SGD training\n                        (argument to nnet3-merge-egs); may be a more general\n                        rule as accepted by the --minibatch-size option of\n                        nnet3-merge-egs; run that program without args to see\n                        the format.\"\"\")\n    parser.add_argument(\"--compute-average-posteriors\",\n                        type=str, action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"], 
default=False,\n                        help=\"\"\"If true, then the average output of the\n                        network is computed and dumped as post.final.vec\"\"\")\n\n    # General options\n    parser.add_argument(\"--nj\", type=int, default=4,\n                        help=\"Number of parallel jobs\")\n    parser.add_argument(\"--use-dense-targets\", type=str,\n                        action=common_lib.StrToBoolAction,\n                        default=True, choices=[\"true\", \"false\"],\n                        help=\"Train neural network using dense targets\")\n    parser.add_argument(\"--feat-dir\", type=str, required=False,\n                        help=\"Directory with features used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--targets-scp\", type=str, required=False,\n                        help=\"\"\"Targets for training neural network.\n                        This is a kaldi-format SCP file of target matrices.\n                        <utterance-id> <extended-filename-of-target-matrix>.\n                        The target matrix's column dim must match \n                        the neural network output dim, and the\n                        row dim must match the number of output frames \n                        i.e. 
after subsampling if \"--frame-subsampling-factor\" \n                        option is passed to --egs.opts.\"\"\")\n    parser.add_argument(\"--dir\", type=str, required=True,\n                        help=\"Directory to store the models and \"\n                        \"all other files.\")\n\n    print(' '.join(sys.argv))\n    print(sys.argv)\n\n    args = parser.parse_args()\n\n    [args, run_opts] = process_args(args)\n\n    return [args, run_opts]\n\n\ndef process_args(args):\n    \"\"\" Process the options got from get_args()\n    \"\"\"\n\n    if args.frames_per_eg < 1:\n        raise Exception(\"--egs.frames-per-eg should have a minimum value of 1\")\n\n    if not common_train_lib.validate_minibatch_size_str(args.minibatch_size):\n        raise Exception(\"--trainer.optimization.minibatch-size has an invalid value\")\n\n    if (not os.path.exists(args.dir)):\n        raise Exception(\"Directory specified with --dir={0} \"\n                        \"does not exist.\".format(args.dir))\n    if (not os.path.exists(args.dir + \"/configs\") and\n        (args.input_model is None or not os.path.exists(args.input_model))):\n        raise Exception(\"Either --trainer.input-model option should be supplied, \"\n                        \"and exist; or the {0}/configs directory should exist.\"\n                        \"{0}/configs is the output of make_configs.py\"\n                        \"\".format(args.dir))\n\n    # set the options corresponding to args.use_gpu\n    run_opts = common_train_lib.RunOpts()\n    if args.use_gpu in [\"true\", \"false\"]:\n        args.use_gpu = (\"yes\" if args.use_gpu == \"true\" else \"no\")\n    if args.use_gpu in [\"yes\", \"wait\"]:\n        if not common_lib.check_if_cuda_compiled():\n            logger.warning(\n                \"\"\"You are running with one thread but you have not compiled\n                   for CUDA.  
You may be running a setup optimized for GPUs.\n                   If you have GPUs and have nvcc installed, go to src/ and do\n                   ./configure; make\"\"\")\n\n        run_opts.train_queue_opt = \"--gpu 1\"\n        run_opts.parallel_train_opts = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_queue_opt = \"--gpu 1\"\n        run_opts.prior_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.prior_queue_opt = \"--gpu 1\"\n\n    else:\n        logger.warning(\"Without using a GPU this will be very slow. \"\n                       \"nnet3 does not yet support multiple threads.\")\n\n        run_opts.train_queue_opt = \"\"\n        run_opts.parallel_train_opts = \"--use-gpu=no\"\n        run_opts.combine_gpu_opt = \"--use-gpu=no\"\n        run_opts.combine_queue_opt = \"\"\n        run_opts.prior_gpu_opt = \"--use-gpu=no\"\n        run_opts.prior_queue_opt = \"\"\n\n    run_opts.command = args.command\n    run_opts.egs_command = (args.egs_command\n                            if args.egs_command is not None else\n                            args.command)\n    run_opts.num_jobs_compute_prior = args.num_jobs_compute_prior\n\n    return [args, run_opts]\n\n\ndef train(args, run_opts):\n    \"\"\" The main function for training.\n\n    Args:\n        args: a Namespace object with the required parameters\n            obtained from the function process_args()\n        run_opts: RunOpts object obtained from the process_args()\n    \"\"\"\n\n    arg_string = pprint.pformat(vars(args))\n    logger.info(\"Arguments for the experiment\\n{0}\".format(arg_string))\n\n    # Set some variables.\n\n    # note, feat_dim gets set to 0 if args.feat_dir is unset (None).\n    feat_dim = common_lib.get_feat_dim(args.feat_dir)\n    ivector_dim = common_lib.get_ivector_dim(args.online_ivector_dir)\n    ivector_id = 
common_lib.get_ivector_extractor_id(args.online_ivector_dir)\n\n    config_dir = '{0}/configs'.format(args.dir)\n    var_file = '{0}/vars'.format(config_dir)\n\n    if args.input_model is None:\n        config_dir = '{0}/configs'.format(args.dir)\n        var_file = '{0}/vars'.format(config_dir)\n\n        variables = common_train_lib.parse_generic_config_vars_file(var_file)\n    else:\n        # If args.input_model is specified, the model left and right contexts\n        # are computed using input_model.\n        variables = common_train_lib.get_input_model_info(args.input_model)\n\n    # Set some variables.\n    try:\n        model_left_context = variables['model_left_context']\n        model_right_context = variables['model_right_context']\n\n    except KeyError as e:\n        raise Exception(\"KeyError {0}: Variables need to be defined in \"\n                        \"{1}\".format(str(e), '{0}/configs'.format(args.dir)))\n\n    left_context = model_left_context\n    right_context = model_right_context\n\n\n    # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n    # matrix.  
This first config just does any initial splicing that we do;\n    # we do this as it's a convenient way to get the stats for the 'lda-like'\n    # transform.\n    if (args.stage <= -4) and os.path.exists(args.dir+\"/configs/init.config\") and \\\n       (args.input_model is None):\n        logger.info(\"Initializing the network for computing the LDA stats\")\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/nnet_init.log \\\n                    nnet3-init --srand=-2 {dir}/configs/init.config \\\n                    {dir}/init.raw\"\"\".format(command=run_opts.command,\n                                             dir=args.dir))\n\n    default_egs_dir = '{0}/egs'.format(args.dir)\n    if (args.stage <= -3) and args.egs_dir is None:\n        if args.targets_scp is None or args.feat_dir is None:\n            raise Exception(\"If you don't supply the --egs-dir option, the \"\n                            \"--targets-scp and --feat-dir options are required.\")\n\n        logger.info(\"Generating egs\")\n\n        if args.use_dense_targets:\n            target_type = \"dense\"\n            try:\n                num_targets = int(variables['num_targets'])\n                if (common_lib.get_feat_dim_from_scp(args.targets_scp)\n                        != num_targets):\n                    raise Exception(\"Mismatch between num-targets provided to \"\n                                    \"script vs configs\")\n            except KeyError as e:\n                num_targets = -1\n        else:\n            target_type = \"sparse\"\n            try:\n                num_targets = int(variables['num_targets'])\n            except KeyError as e:\n                raise Exception(\"KeyError {0}: Variables need to be defined \"\n                                \"in {1}\".format(\n                                    str(e), '{0}/configs'.format(args.dir)))\n\n        train_lib.raw_model.generate_egs_using_targets(\n            data=args.feat_dir, 
targets_scp=args.targets_scp,\n            egs_dir=default_egs_dir,\n            left_context=left_context, right_context=right_context,\n            run_opts=run_opts,\n            frames_per_eg_str=str(args.frames_per_eg),\n            srand=args.srand,\n            egs_opts=args.egs_opts,\n            cmvn_opts=args.cmvn_opts,\n            online_ivector_dir=args.online_ivector_dir,\n            samples_per_iter=args.samples_per_iter,\n            stage=args.egs_stage,\n            target_type=target_type,\n            num_targets=num_targets)\n\n    if args.egs_dir is None:\n        egs_dir = default_egs_dir\n    else:\n        egs_dir = args.egs_dir\n\n    [egs_left_context, egs_right_context,\n     frames_per_eg_str, num_archives] = (\n         common_train_lib.verify_egs_dir(egs_dir, feat_dim,\n                                         ivector_dim, ivector_id,\n                                         left_context, right_context))\n    assert str(args.frames_per_eg) == frames_per_eg_str\n\n    if args.num_jobs_final > num_archives:\n        raise Exception('num_jobs_final cannot exceed the number of archives '\n                        'in the egs directory')\n\n    # copy the properties of the egs to dir for\n    # use during decoding\n    common_train_lib.copy_egs_properties_to_exp_dir(egs_dir, args.dir)\n\n    if args.stage <= -2 and os.path.exists(args.dir+\"/configs/init.config\") and \\\n       (args.input_model is None):\n        logger.info('Computing the preconditioning matrix for input features')\n\n        train_lib.common.compute_preconditioning_matrix(\n            args.dir, egs_dir, num_archives, run_opts,\n            max_lda_jobs=args.max_lda_jobs,\n            rand_prune=args.rand_prune)\n\n    if args.stage <= -1:\n        logger.info(\"Preparing the initial network.\")\n        common_train_lib.prepare_initial_network(args.dir, run_opts, args.srand, args.input_model)\n\n    # set num_iters so that as close as possible, we process the data\n  
  # $num_epochs times, i.e. $num_iters*$avg_num_jobs) ==\n    # $num_epochs*$num_archives, where\n    # avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n    num_archives_expanded = num_archives * args.frames_per_eg\n    num_archives_to_process = int(args.num_epochs * num_archives_expanded)\n    num_archives_processed = 0\n    num_iters = int((num_archives_to_process * 2) / (args.num_jobs_initial + args.num_jobs_final))\n\n    # If do_final_combination is True, compute the set of models_to_combine.\n    # Otherwise, models_to_combine will be none.\n    if args.do_final_combination:\n        models_to_combine = common_train_lib.get_model_combine_iters(\n            num_iters, args.num_epochs,\n            num_archives_expanded, args.max_models_combine,\n            args.num_jobs_final)\n    else:\n        models_to_combine = None\n\n    if os.path.exists('{0}/valid_diagnostic.scp'.format(egs_dir)):\n        if os.path.exists('{0}/valid_diagnostic.egs'.format(egs_dir)):\n            raise Exception('both {0}/valid_diagnostic.egs and '\n                            '{0}/valid_diagnostic.scp exist.'\n                            'This script expects only one of them to exist.'\n                            ''.format(egs_dir))\n        use_multitask_egs = True\n    else:\n        if not os.path.exists('{0}/valid_diagnostic.egs'.format(egs_dir)):\n            raise Exception('neither {0}/valid_diagnostic.egs nor '\n                            '{0}/valid_diagnostic.scp exist.'\n                            'This script expects one of them.'\n                            ''.format(egs_dir))\n        use_multitask_egs = False\n\n    logger.info(\"Training will run for {0} epochs = \"\n                \"{1} iterations\".format(args.num_epochs, num_iters))\n\n    for iter in range(num_iters):\n        if (args.exit_stage is not None) and (iter == args.exit_stage):\n            logger.info(\"Exiting early due to --exit-stage {0}\".format(iter))\n            return\n\n        
current_num_jobs = common_train_lib.get_current_num_jobs(\n            iter, num_iters,\n            args.num_jobs_initial, args.num_jobs_step, args.num_jobs_final)\n\n        if args.stage <= iter:\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n                                                       num_iters,\n                                                       num_archives_processed,\n                                                       num_archives_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n\n            shrinkage_value = 1.0 - (args.proportional_shrink * lrate)\n            if shrinkage_value <= 0.5:\n                raise Exception(\"proportional-shrink={0} is too large, it gives \"\n                                \"shrink-value={1}\".format(args.proportional_shrink,\n                                                          shrinkage_value))\n\n            percent = num_archives_processed * 100.0 / num_archives_to_process\n            epoch = (num_archives_processed * args.num_epochs\n                     / num_archives_to_process)\n            shrink_info_str = ''\n            if shrinkage_value != 1.0:\n                shrink_info_str = 'shrink: {0:0.5f}'.format(shrinkage_value)\n            logger.info(\"Iter: {0}/{1}   Jobs: {2}   \"\n                        \"Epoch: {3:0.2f}/{4:0.1f} ({5:0.1f}% complete)   \"\n                        \"lr: {6:0.6f}   {7}\".format(iter, num_iters - 1,\n                                                    current_num_jobs,\n                                                    epoch, args.num_epochs,\n                                                    percent,\n                                                    lrate, shrink_info_str))\n\n            train_lib.common.train_one_iteration(\n                dir=args.dir,\n                iter=iter,\n 
               srand=args.srand,\n                egs_dir=egs_dir,\n                num_jobs=current_num_jobs,\n                num_archives_processed=num_archives_processed,\n                num_archives=num_archives,\n                learning_rate=lrate,\n                dropout_edit_string=common_train_lib.get_dropout_edit_string(\n                    args.dropout_schedule,\n                    float(num_archives_processed) / num_archives_to_process,\n                    iter),\n                train_opts=' '.join(args.train_opts),\n                minibatch_size_str=args.minibatch_size,\n                frames_per_eg=args.frames_per_eg,\n                momentum=args.momentum,\n                max_param_change=args.max_param_change,\n                shrinkage_value=shrinkage_value,\n                shuffle_buffer_size=args.shuffle_buffer_size,\n                run_opts=run_opts,\n                get_raw_nnet_from_am=False,\n                image_augmentation_opts=args.image_augmentation_opts,\n                use_multitask_egs=use_multitask_egs,\n                backstitch_training_scale=args.backstitch_training_scale,\n                backstitch_training_interval=args.backstitch_training_interval)\n\n            if args.cleanup:\n                # do a clean up everything but the last 2 models, under certain\n                # conditions\n                common_train_lib.remove_model(\n                    args.dir, iter-2, num_iters, models_to_combine,\n                    args.preserve_model_interval,\n                    get_raw_nnet_from_am=False)\n\n            if args.email is not None:\n                reporting_iter_interval = num_iters * args.reporting_interval\n                if iter % reporting_iter_interval == 0:\n                    # lets do some reporting\n                    [report, times, data] = (\n                        nnet3_log_parse.generate_acc_logprob_report(args.dir))\n                    message = report\n                    subject 
= (\"Update : Expt {dir} : \"\n                               \"Iter {iter}\".format(dir=args.dir, iter=iter))\n                    common_lib.send_mail(message, subject, args.email)\n\n        num_archives_processed = num_archives_processed + current_num_jobs\n\n    if args.stage <= num_iters:\n        if args.do_final_combination:\n            logger.info(\"Doing final combination to produce final.raw\")\n            train_lib.common.combine_models(\n                dir=args.dir, num_iters=num_iters,\n                models_to_combine=models_to_combine, egs_dir=egs_dir,\n                minibatch_size_str=args.minibatch_size, run_opts=run_opts,\n                get_raw_nnet_from_am=False,\n                max_objective_evaluations=args.max_objective_evaluations,\n                use_multitask_egs=use_multitask_egs)\n        else:\n            common_lib.force_symlink(\"{0}.raw\".format(num_iters),\n                                     \"{0}/final.raw\".format(args.dir))\n\n    if args.compute_average_posteriors and args.stage <= num_iters + 1:\n        logger.info(\"Getting average posterior for output-node 'output'.\")\n        train_lib.common.compute_average_posterior(\n            dir=args.dir, iter='final', egs_dir=egs_dir,\n            num_archives=num_archives,\n            prior_subset_size=args.prior_subset_size, run_opts=run_opts,\n            get_raw_nnet_from_am=False)\n\n    if args.cleanup:\n        logger.info(\"Cleaning up the experiment directory \"\n                    \"{0}\".format(args.dir))\n        remove_egs = args.remove_egs\n        if args.egs_dir is not None:\n            # this egs_dir was not created by this experiment so we will not\n            # delete it\n            remove_egs = False\n\n        common_train_lib.clean_nnet_dir(\n            nnet_dir=args.dir, num_iters=num_iters, egs_dir=egs_dir,\n            preserve_model_interval=args.preserve_model_interval,\n            remove_egs=remove_egs,\n            
get_raw_nnet_from_am=False)\n\n    # do some reporting\n    outputs_list = common_train_lib.get_outputs_list(\"{0}/final.raw\".format(\n        args.dir), get_raw_nnet_from_am=False)\n    if 'output' in outputs_list:\n        [report, times, data] = nnet3_log_parse.generate_acc_logprob_report(\n            args.dir)\n        if args.email is not None:\n            common_lib.send_mail(report, \"Update : Expt {0} : \"\n                                         \"complete\".format(args.dir),\n                                 args.email)\n\n        with open(\"{dir}/accuracy.{output_name}.report\".format(dir=args.dir,\n                                                               output_name=\"output\"),\n                  \"w\") as f:\n            f.write(report)\n\n    common_lib.execute_command(\"steps/info/nnet3_dir_info.pl \"\n                               \"{0}\".format(args.dir))\n\n\ndef main():\n    [args, run_opts] = get_args()\n    try:\n        train(args, run_opts)\n        common_lib.wait_for_background_commands()\n    except BaseException as e:\n        # look for BaseException so we catch KeyboardInterrupt, which is\n        # what we get when a background thread dies.\n        if args.email is not None:\n            message = (\"Training session for experiment {dir} \"\n                       \"died due to an error.\".format(dir=args.dir))\n            common_lib.send_mail(message, message, args.email)\n        if not isinstance(e, KeyboardInterrupt):\n            traceback.print_exc()\n        sys.exit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/train_raw_rnn.py",
    "content": "#!/usr/bin/env python\n\n\n# Copyright 2016 Vijayaditya Peddinti.\n#           2016 Vimal Manohar\n#           2017 Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\n\"\"\" This script is similar to steps/nnet3/train_rnn.py but trains a\nraw neural network instead of an acoustic model.\n\"\"\"\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport logging\nimport pprint\nimport os\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\nimport libs.nnet3.train.frame_level_objf as train_lib\nimport libs.nnet3.report.log_parse as nnet3_log_parse\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Starting RNN trainer (train_raw_rnn.py)')\n\n\ndef get_args():\n    \"\"\" Get args from stdin.\n\n    The common options are defined in the object\n    libs.nnet3.train.common.CommonParser.parser.\n    See steps/libs/nnet3/train/common.py\n    \"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Trains a raw RNN (without transition model) using\n        frame-level objectives like cross-entropy and mean-squared-error.\n        RNNs include LSTMs, BLSTMs and GRUs.\n        RNN acoustic model training differs from feed-forward DNN training in\n        the following ways\n            1. RNN acoustic models train on output chunks rather than\n               individual outputs\n            2. 
The training includes an additional stage of shrinkage, where the\n               parameters of the model are scaled when the derivative averages\n               at the non-linearities are below a threshold.\n            3. RNNs can also be trained with state preservation training\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        conflict_handler='resolve',\n        parents=[common_train_lib.CommonParser(default_chunk_left_context=40).parser])\n\n    # egs extraction options\n    parser.add_argument(\"--egs.chunk-width\", type=str, dest='chunk_width',\n                        default=\"20\",\n                        help=\"\"\"Number of frames per chunk in the examples\n                        used to train the RNN.   Caution: if you double this you\n                        should halve --trainer.samples-per-iter.  May be\n                        a comma-separated list of alternatives: first width\n                        is the 'principal' chunk-width, used preferentially\"\"\")\n\n    # trainer options\n    parser.add_argument(\"--trainer.input-model\", type=str,\n                        dest='input_model', default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"If specified, this model is used as the initial\n                        raw model (0.raw in the script) instead of initializing\n                        the model from xconfig. Configs dir is not expected to\n                        exist and left/right context is computed from this\n                        model.\"\"\")\n    parser.add_argument(\"--trainer.samples-per-iter\", type=int,\n                        dest='samples_per_iter', default=20000,\n                        help=\"\"\"This is really the number of egs in each\n                        archive.  
Each eg has 'chunk_width' frames in it--\n                        for chunk_width=20, this value (20k) is equivalent\n                        to the 400k number that we use as a default in\n                        regular DNN training.\n                        Overrides the default value in CommonParser.\"\"\")\n    parser.add_argument(\"--trainer.prior-subset-size\", type=int,\n                        dest='prior_subset_size', default=20000,\n                        help=\"Number of samples for computing priors\")\n    parser.add_argument(\"--trainer.num-jobs-compute-prior\", type=int,\n                        dest='num_jobs_compute_prior', default=10,\n                        help=\"The prior computation jobs are single \"\n                        \"threaded and run on the CPU\")\n\n    # Parameters for the optimization\n    parser.add_argument(\"--trainer.optimization.momentum\", type=float,\n                        dest='momentum', default=0.5,\n                        help=\"\"\"Momentum used in update computation.\n                        Note: we implemented it in such a way that\n                        it doesn't increase the effective learning rate.\n                        Overrides the default value in CommonParser\"\"\")\n    parser.add_argument(\"--trainer.optimization.shrink-value\", type=float,\n                        dest='shrink_value', default=0.99,\n                        help=\"\"\"Scaling factor used for scaling the parameter\n                        matrices when the derivative averages are below the\n                        shrink-threshold at the non-linearities.  E.g. 
0.99.\n                        Only applicable when the neural net contains sigmoid or\n                        tanh units.\"\"\")\n    parser.add_argument(\"--trainer.optimization.shrink-saturation-threshold\",\n                        type=float,\n                        dest='shrink_saturation_threshold', default=0.40,\n                        help=\"\"\"Threshold that controls when we apply the\n                        'shrinkage' (i.e. scaling by shrink-value).  If the\n                        saturation of the sigmoid and tanh nonlinearities in\n                        the neural net (as measured by\n                        steps/nnet3/get_saturation.pl) exceeds this threshold\n                        we scale the parameter matrices with the\n                        shrink-value.\"\"\")\n    # RNN specific trainer options\n    parser.add_argument(\"--trainer.rnn.num-chunk-per-minibatch\", type=str,\n                        dest='num_chunk_per_minibatch', default='100',\n                        help=\"\"\"Number of sequences to be processed in\n                        parallel every minibatch.  May be a more general\n                        rule as accepted by the --minibatch-size option of\n                        nnet3-merge-egs; run that program without args to see\n                        the format.\"\"\")\n    parser.add_argument(\"--trainer.deriv-truncate-margin\", type=int,\n                        dest='deriv_truncate_margin', default=8,\n                        help=\"\"\"Margin (in input frames) around the 'required'\n                        part of each chunk that the derivatives are\n                        backpropagated to. 
E.g., 8 is a reasonable setting.\n                        Note: the 'required' part of the chunk is defined by\n                        the model's {left,right}-context.\"\"\")\n    parser.add_argument(\"--compute-average-posteriors\",\n                        type=str, action=common_lib.StrToBoolAction,\n                        choices=[\"true\", \"false\"], default=False,\n                        help=\"\"\"If true, then the average output of the\n                        network is computed and dumped as post.final.vec\"\"\")\n\n    # General options\n    parser.add_argument(\"--nj\", type=int, default=4,\n                        help=\"Number of parallel jobs\")\n    parser.add_argument(\"--use-dense-targets\", type=str,\n                        action=common_lib.StrToBoolAction,\n                        default=True, choices=[\"true\", \"false\"],\n                        help=\"Train neural network using dense targets\")\n    parser.add_argument(\"--feat-dir\", type=str, required=True,\n                        help=\"Directory with features used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--targets-scp\", type=str, required=True,\n                        help=\"Targets for training the neural network.\")\n    parser.add_argument(\"--dir\", type=str, required=True,\n                        help=\"Directory to store the models and \"\n                        \"all other files.\")\n\n    print(' '.join(sys.argv))\n    print(sys.argv)\n\n    args = parser.parse_args()\n\n    [args, run_opts] = process_args(args)\n\n    return [args, run_opts]\n\n\ndef process_args(args):\n    \"\"\" Process the options obtained from get_args()\n    \"\"\"\n\n    if not common_train_lib.validate_chunk_width(args.chunk_width):\n        raise Exception(\"--egs.chunk-width has an invalid value\")\n\n    if not common_train_lib.validate_minibatch_size_str(args.num_chunk_per_minibatch):\n        raise 
Exception(\"--trainer.rnn.num-chunk-per-minibatch has an invalid value\")\n\n    if args.chunk_left_context < 0:\n        raise Exception(\"--egs.chunk-left-context should be non-negative\")\n\n    if args.chunk_right_context < 0:\n        raise Exception(\"--egs.chunk-right-context should be non-negative\")\n\n    if (not os.path.exists(args.dir)):\n        raise Exception(\"Directory specified with --dir={0} \"\n                        \"does not exist.\".format(args.dir))\n    if (not os.path.exists(args.dir + \"/configs\") and\n        (args.input_model is None or not os.path.exists(args.input_model))):\n        raise Exception(\"Either --trainer.input-model option should be supplied, \"\n                        \"and exist; or the {0}/configs directory should exist. \"\n                        \"{0}/configs is the output of make_configs.py\"\n                        \"\".format(args.dir))\n\n    # set the options corresponding to args.use_gpu\n    run_opts = common_train_lib.RunOpts()\n    if args.use_gpu in [\"true\", \"false\"]:\n        args.use_gpu = (\"yes\" if args.use_gpu == \"true\" else \"no\")\n    if args.use_gpu in [\"yes\", \"wait\"]:\n        if not common_lib.check_if_cuda_compiled():\n            logger.warning(\n                \"\"\"You are running with one thread but you have not compiled\n                   for CUDA.  You may be running a setup optimized for GPUs.\n                   If you have GPUs and have nvcc installed, go to src/ and do\n                   ./configure; make\"\"\")\n\n        run_opts.train_queue_opt = \"--gpu 1\"\n        run_opts.parallel_train_opts = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_queue_opt = \"--gpu 1\"\n        run_opts.prior_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.prior_queue_opt = \"--gpu 1\"\n\n    else:\n        logger.warning(\"Without using a GPU this will be very slow. 
\"\n                       \"nnet3 does not yet support multiple threads.\")\n\n        run_opts.train_queue_opt = \"\"\n        run_opts.parallel_train_opts = \"--use-gpu=no\"\n        run_opts.combine_gpu_opt = \"--use-gpu=no\"\n        run_opts.combine_queue_opt = \"\"\n        run_opts.prior_gpu_opt = \"--use-gpu=no\"\n        run_opts.prior_queue_opt = \"\"\n\n    run_opts.command = args.command\n    run_opts.egs_command = (args.egs_command\n                            if args.egs_command is not None else\n                            args.command)\n    run_opts.num_jobs_compute_prior = args.num_jobs_compute_prior\n\n    return [args, run_opts]\n\n\ndef train(args, run_opts):\n    \"\"\" The main function for training.\n\n    Args:\n        args: a Namespace object with the required parameters\n            obtained from the function process_args()\n        run_opts: RunOpts object obtained from the process_args()\n    \"\"\"\n\n    arg_string = pprint.pformat(vars(args))\n    logger.info(\"Arguments for the experiment\\n{0}\".format(arg_string))\n\n    # Set some variables.\n    feat_dim = common_lib.get_feat_dim(args.feat_dir)\n    ivector_dim = common_lib.get_ivector_dim(args.online_ivector_dir)\n    ivector_id = common_lib.get_ivector_extractor_id(args.online_ivector_dir)\n\n    if args.input_model is None:\n        config_dir = '{0}/configs'.format(args.dir)\n        var_file = '{0}/vars'.format(config_dir)\n\n        variables = common_train_lib.parse_generic_config_vars_file(var_file)\n    else:\n        # If args.input_model is specified, the model left and right contexts\n        # are computed using input_model.\n        variables = common_train_lib.get_input_model_info(args.input_model)\n\n    # Set some variables.\n    try:\n        model_left_context = variables['model_left_context']\n        model_right_context = variables['model_right_context']\n    except KeyError as e:\n        raise Exception(\"KeyError {0}: Variables need to be defined in \"\n 
                       \"{1}\".format(str(e), '{0}/configs'.format(args.dir)))\n\n    left_context = args.chunk_left_context + model_left_context\n    right_context = args.chunk_right_context + model_right_context\n    left_context_initial = (args.chunk_left_context_initial + model_left_context if\n                            args.chunk_left_context_initial >= 0 else -1)\n    right_context_final = (args.chunk_right_context_final + model_right_context if\n                           args.chunk_right_context_final >= 0 else -1)\n\n    # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n    # matrix.  This first config just does any initial splicing that we do;\n    # we do this as it's a convenient way to get the stats for the 'lda-like'\n    # transform.\n\n    if (args.stage <= -4) and os.path.exists(args.dir+\"/configs/init.config\") and \\\n       (args.input_model is None):\n        logger.info(\"Initializing the network for computing the LDA stats\")\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/nnet_init.log \\\n                    nnet3-init --srand=-2 {dir}/configs/init.config \\\n                    {dir}/init.raw\"\"\".format(command=run_opts.command,\n                                             dir=args.dir))\n\n    default_egs_dir = '{0}/egs'.format(args.dir)\n    if (args.stage <= -3) and args.egs_dir is None:\n        logger.info(\"Generating egs\")\n\n        if args.use_dense_targets:\n            target_type = \"dense\"\n            try:\n                num_targets = int(variables['num_targets'])\n                if (common_lib.get_feat_dim_from_scp(args.targets_scp)\n                        != num_targets):\n                    raise Exception(\"Mismatch between num-targets provided to \"\n                                    \"script vs configs\")\n            except KeyError as e:\n                num_targets = -1\n        else:\n            target_type = \"sparse\"\n            try:\n       
         num_targets = int(variables['num_targets'])\n            except KeyError as e:\n                raise Exception(\"KeyError {0}: Variables need to be defined \"\n                                \"in {1}\".format(\n                                    str(e), '{0}/configs'.format(args.dir)))\n\n        train_lib.raw_model.generate_egs_using_targets(\n            data=args.feat_dir, targets_scp=args.targets_scp,\n            egs_dir=default_egs_dir,\n            left_context=left_context,\n            right_context=right_context,\n            left_context_initial=left_context_initial,\n            right_context_final=right_context_final,\n            run_opts=run_opts,\n            frames_per_eg_str=args.chunk_width,\n            srand=args.srand,\n            egs_opts=args.egs_opts,\n            cmvn_opts=args.cmvn_opts,\n            online_ivector_dir=args.online_ivector_dir,\n            samples_per_iter=args.samples_per_iter,\n            stage=args.egs_stage,\n            target_type=target_type,\n            num_targets=num_targets)\n\n    if args.egs_dir is None:\n        egs_dir = default_egs_dir\n    else:\n        egs_dir = args.egs_dir\n\n    [egs_left_context, egs_right_context,\n     frames_per_eg_str, num_archives] = (\n         common_train_lib.verify_egs_dir(egs_dir, feat_dim,\n                                         ivector_dim, ivector_id,\n                                         left_context, right_context,\n                                         left_context_initial,\n                                         right_context_final))\n    if args.chunk_width != frames_per_eg_str:\n        raise Exception(\"mismatch between --egs.chunk-width and the frames_per_eg \"\n                        \"in the egs dir {0} vs {1}\".format(args.chunk_width,\n                                                           frames_per_eg_str))\n\n    if args.num_jobs_final > num_archives:\n        raise Exception('num_jobs_final cannot exceed the number of 
archives '\n                        'in the egs directory')\n\n    # copy the properties of the egs to dir for\n    # use during decoding\n    common_train_lib.copy_egs_properties_to_exp_dir(egs_dir, args.dir)\n\n    if args.stage <= -2 and os.path.exists(args.dir+\"/configs/init.config\") and \\\n       (args.input_model is None):\n        logger.info('Computing the preconditioning matrix for input features')\n\n        train_lib.common.compute_preconditioning_matrix(\n            args.dir, egs_dir, num_archives, run_opts,\n            max_lda_jobs=args.max_lda_jobs,\n            rand_prune=args.rand_prune)\n\n    if args.stage <= -1:\n        logger.info(\"Preparing the initial network.\")\n        common_train_lib.prepare_initial_network(args.dir, run_opts, args.srand, args.input_model)\n\n    # set num_iters so that, as closely as possible, we process the data\n    # $num_epochs times, i.e. ($num_iters*$avg_num_jobs) ==\n    # $num_epochs*$num_archives, where\n    # avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n    num_archives_to_process = int(args.num_epochs * num_archives)\n    num_archives_processed = 0\n    num_iters = int((num_archives_to_process * 2) / (args.num_jobs_initial + args.num_jobs_final))\n\n    # If do_final_combination is True, compute the set of models_to_combine.\n    # Otherwise, models_to_combine will be None.\n    if args.do_final_combination:\n        models_to_combine = common_train_lib.get_model_combine_iters(\n            num_iters, args.num_epochs,\n            num_archives, args.max_models_combine,\n            args.num_jobs_final)\n    else:\n        models_to_combine = None\n\n    if (os.path.exists('{0}/valid_diagnostic.scp'.format(egs_dir))):\n        if (os.path.exists('{0}/valid_diagnostic.egs'.format(egs_dir))):\n            raise Exception('both {0}/valid_diagnostic.egs and '\n                            '{0}/valid_diagnostic.scp exist. '\n                            'This script expects only one of them to exist.'\n        
                    ''.format(egs_dir))\n        use_multitask_egs = True\n    else:\n        if (not os.path.exists('{0}/valid_diagnostic.egs'\n                               ''.format(egs_dir))):\n            raise Exception('neither {0}/valid_diagnostic.egs nor '\n                            '{0}/valid_diagnostic.scp exist.'\n                            'This script expects one of them.'\n                            ''.format(egs_dir))\n        use_multitask_egs = False\n\n    min_deriv_time = None\n    max_deriv_time_relative = None\n    if args.deriv_truncate_margin is not None:\n        min_deriv_time = -args.deriv_truncate_margin - model_left_context\n        max_deriv_time_relative = \\\n           args.deriv_truncate_margin + model_right_context\n\n    logger.info(\"Training will run for {0} epochs = \"\n                \"{1} iterations\".format(args.num_epochs, num_iters))\n\n    for iter in range(num_iters):\n        if (args.exit_stage is not None) and (iter == args.exit_stage):\n            logger.info(\"Exiting early due to --exit-stage {0}\".format(iter))\n            return\n\n        current_num_jobs = common_train_lib.get_current_num_jobs(\n            iter, num_iters,\n            args.num_jobs_initial, args.num_jobs_step, args.num_jobs_final)\n\n        if args.stage <= iter:\n            model_file = \"{dir}/{iter}.raw\".format(dir=args.dir, iter=iter)\n\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n                                                       num_iters,\n                                                       num_archives_processed,\n                                                       num_archives_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n\n            # shrinkage_value is a scale on the parameters.\n            shrinkage_value = 1.0 - 
(args.proportional_shrink * lrate)\n            if shrinkage_value <= 0.5:\n                raise Exception(\"proportional-shrink={0} is too large, it gives \"\n                                \"shrink-value={1}\".format(args.proportional_shrink,\n                                                          shrinkage_value))\n            if args.shrink_value < shrinkage_value:\n                shrinkage_value = (args.shrink_value\n                                   if common_train_lib.should_do_shrinkage(\n                                           iter, model_file,\n                                           args.shrink_saturation_threshold,\n                                           get_raw_nnet_from_am=False)\n                                   else shrinkage_value)\n\n            percent = num_archives_processed * 100.0 / num_archives_to_process\n            epoch = (num_archives_processed * args.num_epochs\n                     / num_archives_to_process)\n            shrink_info_str = ''\n            if shrinkage_value != 1.0:\n                shrink_info_str = 'shrink: {0:0.5f}'.format(shrinkage_value)\n            logger.info(\"Iter: {0}/{1}   Jobs: {2}   \"\n                        \"Epoch: {3:0.2f}/{4:0.1f} ({5:0.1f}% complete)   \"\n                        \"lr: {6:0.6f}   {7}\".format(iter, num_iters - 1,\n                                                    current_num_jobs,\n                                                    epoch, args.num_epochs,\n                                                    percent,\n                                                    lrate, shrink_info_str))\n\n            train_lib.common.train_one_iteration(\n                dir=args.dir,\n                iter=iter,\n                srand=args.srand,\n                egs_dir=egs_dir,\n                num_jobs=current_num_jobs,\n                num_archives_processed=num_archives_processed,\n                num_archives=num_archives,\n                learning_rate=lrate,\n    
            dropout_edit_string=common_train_lib.get_dropout_edit_string(\n                    args.dropout_schedule,\n                    float(num_archives_processed) / num_archives_to_process,\n                    iter),\n                train_opts=' '.join(args.train_opts),\n                shrinkage_value=shrinkage_value,\n                minibatch_size_str=args.num_chunk_per_minibatch,\n                min_deriv_time=min_deriv_time,\n                max_deriv_time_relative=max_deriv_time_relative,\n                momentum=args.momentum,\n                max_param_change=args.max_param_change,\n                shuffle_buffer_size=args.shuffle_buffer_size,\n                run_opts=run_opts,\n                get_raw_nnet_from_am=False,\n                use_multitask_egs=use_multitask_egs,\n                compute_per_dim_accuracy=args.compute_per_dim_accuracy)\n\n            if args.cleanup:\n                # do a clean up everythin but the last 2 models, under certain\n                # conditions\n                common_train_lib.remove_model(\n                    args.dir, iter-2, num_iters, models_to_combine,\n                    args.preserve_model_interval,\n                    get_raw_nnet_from_am=False)\n\n            if args.email is not None:\n                reporting_iter_interval = num_iters * args.reporting_interval\n                if iter % reporting_iter_interval == 0:\n                    # lets do some reporting\n                    [report, times, data] = (\n                        nnet3_log_parse.generate_acc_logprob_report(args.dir))\n                    message = report\n                    subject = (\"Update : Expt {dir} : \"\n                               \"Iter {iter}\".format(dir=args.dir, iter=iter))\n                    common_lib.send_mail(message, subject, args.email)\n\n        num_archives_processed = num_archives_processed + current_num_jobs\n\n    if args.stage <= num_iters:\n        if args.do_final_combination:\n         
   logger.info(\"Doing final combination to produce final.raw\")\n            train_lib.common.combine_models(\n                dir=args.dir, num_iters=num_iters,\n                models_to_combine=models_to_combine, egs_dir=egs_dir,\n                minibatch_size_str=args.num_chunk_per_minibatch,\n                run_opts=run_opts, chunk_width=args.chunk_width,\n                get_raw_nnet_from_am=False,\n                compute_per_dim_accuracy=args.compute_per_dim_accuracy,\n                max_objective_evaluations=args.max_objective_evaluations,\n                use_multitask_egs=use_multitask_egs)\n        else:\n            common_lib.force_symlink(\"{0}.raw\".format(num_iters),\n                                     \"{0}/final.raw\".format(args.dir))\n\n    if args.compute_average_posteriors and args.stage <= num_iters + 1:\n        logger.info(\"Getting average posterior for purposes of \"\n                    \"adjusting the priors.\")\n        train_lib.common.compute_average_posterior(\n            dir=args.dir, iter='final', egs_dir=egs_dir,\n            num_archives=num_archives,\n            prior_subset_size=args.prior_subset_size, run_opts=run_opts,\n            get_raw_nnet_from_am=False)\n\n    if args.cleanup:\n        logger.info(\"Cleaning up the experiment directory \"\n                    \"{0}\".format(args.dir))\n        remove_egs = args.remove_egs\n        if args.egs_dir is not None:\n            # this egs_dir was not created by this experiment so we will not\n            # delete it\n            remove_egs = False\n\n        common_train_lib.clean_nnet_dir(\n            nnet_dir=args.dir, num_iters=num_iters, egs_dir=egs_dir,\n            preserve_model_interval=args.preserve_model_interval,\n            remove_egs=remove_egs,\n            get_raw_nnet_from_am=False)\n\n    # do some reporting\n    [report, times, data] = nnet3_log_parse.generate_acc_logprob_report(args.dir)\n    if args.email is not None:\n        
common_lib.send_mail(report, \"Update : Expt {0} : \"\n                                     \"complete\".format(args.dir), args.email)\n\n    with open(\"{dir}/accuracy.report\".format(dir=args.dir), \"w\") as f:\n        f.write(report)\n\n    common_lib.execute_command(\"steps/info/nnet3_dir_info.pl \"\n                               \"{0}\".format(args.dir))\n\n\ndef main():\n    [args, run_opts] = get_args()\n    try:\n        train(args, run_opts)\n        common_lib.wait_for_background_commands()\n    except BaseException as e:\n        # look for BaseException so we catch KeyboardInterrupt, which is\n        # what we get when a background thread dies.\n        if args.email is not None:\n            message = (\"Training session for experiment {dir} \"\n                       \"died due to an error.\".format(dir=args.dir))\n            common_lib.send_mail(message, message, args.email)\n        if not isinstance(e, KeyboardInterrupt):\n            traceback.print_exc()\n        sys.exit(1)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/train_rnn.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Vijayaditya Peddinti.\n#           2016    Vimal Manohar\n# Apache 2.0.\n\n\"\"\" This script is based on steps/nnet3/lstm/train.sh\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport logging\nimport os\nimport pprint\nimport shutil\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.nnet3.train.common as common_train_lib\nimport libs.common as common_lib\nimport libs.nnet3.train.frame_level_objf as train_lib\nimport libs.nnet3.report.log_parse as nnet3_log_parse\n\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\nlogger.info('Starting RNN trainer (train_rnn.py)')\n\n\ndef get_args():\n    \"\"\" Get args from stdin.\n\n    We add compulsary arguments as named arguments for readability\n\n    The common options are defined in the object\n    libs.nnet3.train.common.CommonParser.parser.\n    See steps/libs/nnet3/train/common.py\n    \"\"\"\n\n    parser = argparse.ArgumentParser(\n        description=\"\"\"Trains an RNN acoustic model using the cross-entropy\n        objective.  RNNs include LSTMs, BLSTMs and GRUs.\n        RNN acoustic model training differs from feed-forward DNN training in\n        the following ways\n            1. RNN acoustic models train on output chunks rather than\n               individual outputs\n            2. The training includes additional stage of shrinkage, where\n               the parameters of the model are scaled when the derivative\n               averages at the non-linearities are below a threshold.\n            3. 
RNNs can also be trained with state preservation training\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter,\n        conflict_handler='resolve',\n        parents=[common_train_lib.CommonParser(default_chunk_left_context=40).parser])\n\n    # egs extraction options\n    parser.add_argument(\"--egs.chunk-width\", type=str, dest='chunk_width',\n                        default=\"20\",\n                        help=\"\"\"Number of frames per chunk in the examples\n                        used to train the RNN.   Caution: if you double this you\n                        should halve --trainer.samples-per-iter.  May be\n                        a comma-separated list of alternatives: first width\n                        is the 'principal' chunk-width, used preferentially\"\"\")\n    parser.add_argument(\"--trainer.input-model\", type=str,\n                        dest='input_model', default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"If specified, this model is used as initial\n                        raw model (0.raw in the script) instead of initializing\n                        the model from xconfig. Configs dir is not expected to\n                        exist and left/right context is computed from this\n                        model.\"\"\")\n    parser.add_argument(\"--trainer.samples-per-iter\", type=int,\n                        dest='samples_per_iter', default=20000,\n                        help=\"\"\"This is really the number of egs in each\n                        archive.  
Each eg has 'chunk_width' frames in it--\n                        for chunk_width=20, this value (20k) is equivalent\n                        to the 400k number that we use as a default in\n                        regular DNN training.\n                        Overrides the default value in CommonParser.\"\"\")\n    parser.add_argument(\"--trainer.prior-subset-size\", type=int,\n                        dest='prior_subset_size', default=20000,\n                        help=\"Number of samples for computing priors\")\n    parser.add_argument(\"--trainer.num-jobs-compute-prior\", type=int,\n                        dest='num_jobs_compute_prior', default=10,\n                        help=\"The prior computation jobs are single \"\n                        \"threaded and run on the CPU\")\n\n    # Parameters for the optimization\n    parser.add_argument(\"--trainer.optimization.momentum\", type=float,\n                        dest='momentum', default=0.5,\n                        help=\"\"\"Momentum used in update computation.\n                        Note: we implemented it in such a way that\n                        it doesn't increase the effective learning rate.\n                        Overrides the default value in CommonParser\"\"\")\n    parser.add_argument(\"--trainer.optimization.shrink-value\", type=float,\n                        dest='shrink_value', default=0.99,\n                        help=\"\"\"Scaling factor used for scaling the parameter\n                        matrices when the derivative averages are below the\n                        shrink-threshold at the non-linearities.  E.g. 
0.99.\n                        Only applicable when the neural net contains sigmoid or\n                        tanh units.\"\"\")\n    parser.add_argument(\"--trainer.optimization.shrink-saturation-threshold\",\n                        type=float,\n                        dest='shrink_saturation_threshold', default=0.40,\n                        help=\"\"\"Threshold that controls when we apply the\n                        'shrinkage' (i.e. scaling by shrink-value).  If the\n                        saturation of the sigmoid and tanh nonlinearities in\n                        the neural net (as measured by\n                        steps/nnet3/get_saturation.pl) exceeds this threshold\n                        we scale the parameter matrices with the\n                        shrink-value.\"\"\")\n    # RNN specific trainer options\n    parser.add_argument(\"--trainer.rnn.num-chunk-per-minibatch\", type=str,\n                        dest='num_chunk_per_minibatch', default='100',\n                        help=\"\"\"Number of sequences to be processed in\n                        parallel every minibatch.  May be a more general\n                        rule as accepted by the --minibatch-size option of\n                        nnet3-merge-egs; run that program without args to see\n                        the format.\"\"\")\n    parser.add_argument(\"--trainer.deriv-truncate-margin\", type=int,\n                        dest='deriv_truncate_margin', default=8,\n                        help=\"\"\"Margin (in input frames) around the 'required'\n                        part of each chunk that the derivatives are\n                        backpropagated to. 
E.g., 8 is a reasonable setting.\n                        Note: the 'required' part of the chunk is defined by\n                        the model's {left,right}-context.\"\"\")\n\n    # General options\n    parser.add_argument(\"--feat-dir\", type=str, required=False,\n                        help=\"Directory with features used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--lang\", type=str, required=False,\n                        help=\"Language directory\")\n    parser.add_argument(\"--ali-dir\", type=str, required=True,\n                        help=\"Directory with alignments used for training \"\n                        \"the neural network.\")\n    parser.add_argument(\"--dir\", type=str, required=True,\n                        help=\"Directory to store the models and \"\n                        \"all other files.\")\n\n    print(' '.join(sys.argv))\n    print(sys.argv)\n\n    args = parser.parse_args()\n\n    [args, run_opts] = process_args(args)\n\n    return [args, run_opts]\n\n\ndef process_args(args):\n    \"\"\" Process the options got from get_args()\n    \"\"\"\n\n    if not common_train_lib.validate_chunk_width(args.chunk_width):\n        raise Exception(\"--egs.chunk-width has an invalid value\")\n\n    if not common_train_lib.validate_minibatch_size_str(args.num_chunk_per_minibatch):\n        raise Exception(\"--trainer.rnn.num-chunk-per-minibatch has an invalid value\")\n\n    if args.chunk_left_context < 0:\n        raise Exception(\"--egs.chunk-left-context should be non-negative\")\n\n    if args.chunk_right_context < 0:\n        raise Exception(\"--egs.chunk-right-context should be non-negative\")\n\n    if (not os.path.exists(args.dir)):\n        raise Exception(\"Directory specified with --dir={0} \"\n                        \"does not exist.\".format(args.dir))\n    if (not os.path.exists(args.dir + \"/configs\") and\n        (args.input_model is None or not 
os.path.exists(args.input_model))):\n        raise Exception(\"Either --trainer.input-model option should be supplied, \"\n                        \"and exist; or the {0}/configs directory should exist. \"\n                        \"{0}/configs is the output of make_configs.py\"\n                        \"\".format(args.dir))\n\n    # set the options corresponding to args.use_gpu\n    run_opts = common_train_lib.RunOpts()\n    if args.use_gpu in [\"true\", \"false\"]:\n        args.use_gpu = (\"yes\" if args.use_gpu == \"true\" else \"no\")\n    if args.use_gpu in [\"yes\", \"wait\"]:\n        if not common_lib.check_if_cuda_compiled():\n            logger.warning(\n                \"\"\"You are running with one thread but you have not compiled\n                   for CUDA.  You may be running a setup optimized for GPUs.\n                   If you have GPUs and have nvcc installed, go to src/ and do\n                   ./configure; make\"\"\")\n\n        run_opts.train_queue_opt = \"--gpu 1\"\n        run_opts.parallel_train_opts = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.combine_queue_opt = \"--gpu 1\"\n        run_opts.prior_gpu_opt = \"--use-gpu={}\".format(args.use_gpu)\n        run_opts.prior_queue_opt = \"--gpu 1\"\n\n    else:\n        logger.warning(\"Without using a GPU this will be very slow. 
\"\n                       \"nnet3 does not yet support multiple threads.\")\n\n        run_opts.train_queue_opt = \"\"\n        run_opts.parallel_train_opts = \"--use-gpu=no\"\n        run_opts.combine_gpu_opt = \"--use-gpu=no\"\n        run_opts.combine_queue_opt = \"\"\n        run_opts.prior_gpu_opt = \"--use-gpu=no\"\n        run_opts.prior_queue_opt = \"\"\n\n    run_opts.command = args.command\n    run_opts.egs_command = (args.egs_command\n                            if args.egs_command is not None else\n                            args.command)\n    run_opts.num_jobs_compute_prior = args.num_jobs_compute_prior\n\n    return [args, run_opts]\n\n\ndef train(args, run_opts):\n    \"\"\" The main function for training.\n\n    Args:\n        args: a Namespace object with the required parameters\n            obtained from the function process_args()\n        run_opts: RunOpts object obtained from the process_args()\n    \"\"\"\n\n    arg_string = pprint.pformat(vars(args))\n    logger.info(\"Arguments for the experiment\\n{0}\".format(arg_string))\n\n    # Copy phones.txt from ali-dir to dir. 
Later, steps/nnet3/decode.sh will\n    # use it to check compatibility between training and decoding phone-sets.\n    shutil.copy('{0}/phones.txt'.format(args.ali_dir), args.dir)\n\n    # Set some variables.\n    num_jobs = common_lib.get_number_of_jobs(args.ali_dir)\n    feat_dim = common_lib.get_feat_dim(args.feat_dir)\n    ivector_dim = common_lib.get_ivector_dim(args.online_ivector_dir)\n    ivector_id = common_lib.get_ivector_extractor_id(args.online_ivector_dir)\n\n    # split the training data into parts for individual jobs\n    # we will use the same number of jobs as that used for alignment\n    common_lib.execute_command(\"utils/split_data.sh {0} {1}\".format(\n        args.feat_dir, num_jobs))\n    shutil.copy('{0}/tree'.format(args.ali_dir), args.dir)\n\n    with open('{0}/num_jobs'.format(args.dir), 'w') as f:\n        f.write('{}'.format(num_jobs))\n\n    config_dir = '{0}/configs'.format(args.dir)\n    var_file = '{0}/vars'.format(config_dir)\n\n    if args.input_model is None:\n        config_dir = '{0}/configs'.format(args.dir)\n        var_file = '{0}/vars'.format(config_dir)\n\n        variables = common_train_lib.parse_generic_config_vars_file(var_file)\n    else:\n        # If args.input_model is specified, the model left and right contexts\n        # are computed using input_model.\n        variables = common_train_lib.get_input_model_info(args.input_model)\n\n    # Set some variables.\n    try:\n        model_left_context = variables['model_left_context']\n        model_right_context = variables['model_right_context']\n    except KeyError as e:\n        raise Exception(\"KeyError {0}: Variables need to be defined in \"\n                        \"{1}\".format(str(e), '{0}/configs'.format(args.dir)))\n\n    left_context = args.chunk_left_context + model_left_context\n    right_context = args.chunk_right_context + model_right_context\n    left_context_initial = (args.chunk_left_context_initial + model_left_context if\n                            
args.chunk_left_context_initial >= 0 else -1)\n    right_context_final = (args.chunk_right_context_final + model_right_context if\n                           args.chunk_right_context_final >= 0 else -1)\n\n    # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n    # matrix.  This first config just does any initial splicing that we do;\n    # we do this as it's a convenient way to get the stats for the 'lda-like'\n    # transform.\n\n    if (args.stage <= -5) and (args.input_model is None):\n        logger.info(\"Initializing a basic network for estimating \"\n                    \"preconditioning matrix\")\n        common_lib.execute_command(\n            \"\"\"{command} {dir}/log/nnet_init.log \\\n                    nnet3-init --srand=-2 {dir}/configs/init.config \\\n                    {dir}/init.raw\"\"\".format(command=run_opts.command,\n                                             dir=args.dir))\n\n    default_egs_dir = '{0}/egs'.format(args.dir)\n    if args.stage <= -4 and args.egs_dir is None:\n        logger.info(\"Generating egs\")\n\n        if args.feat_dir is None:\n            raise Exception(\"--feat-dir option is required if you don't supply --egs-dir\")\n\n        train_lib.acoustic_model.generate_egs(\n            data=args.feat_dir, alidir=args.ali_dir,\n            egs_dir=default_egs_dir,\n            left_context=left_context,\n            right_context=right_context,\n            left_context_initial=left_context_initial,\n            right_context_final=right_context_final,\n            run_opts=run_opts,\n            frames_per_eg_str=args.chunk_width,\n            srand=args.srand,\n            egs_opts=args.egs_opts,\n            cmvn_opts=args.cmvn_opts,\n            online_ivector_dir=args.online_ivector_dir,\n            samples_per_iter=args.samples_per_iter,\n            stage=args.egs_stage)\n\n    if args.egs_dir is None:\n        egs_dir = default_egs_dir\n    else:\n        egs_dir = args.egs_dir\n\n   
 [egs_left_context, egs_right_context,\n     frames_per_eg_str, num_archives] = (\n         common_train_lib.verify_egs_dir(egs_dir, feat_dim,\n                                         ivector_dim, ivector_id,\n                                         left_context, right_context,\n                                         left_context_initial, right_context_final))\n    if args.chunk_width != frames_per_eg_str:\n        raise Exception(\"mismatch between --egs.chunk-width and the frames_per_eg \"\n                        \"in the egs dir {0} vs {1}\".format(args.chunk_width,\n                                                           frames_per_eg_str))\n\n    if args.num_jobs_final > num_archives:\n        raise Exception('num_jobs_final cannot exceed the number of archives '\n                        'in the egs directory')\n\n    # copy the properties of the egs to dir for\n    # use during decoding\n    common_train_lib.copy_egs_properties_to_exp_dir(egs_dir, args.dir)\n\n    if args.stage <= -3 and (args.input_model is None):\n        logger.info('Computing the preconditioning matrix for input features')\n\n        train_lib.common.compute_preconditioning_matrix(\n            args.dir, egs_dir, num_archives, run_opts,\n            max_lda_jobs=args.max_lda_jobs,\n            rand_prune=args.rand_prune)\n\n    if args.stage <= -2 and (args.input_model is None):\n        logger.info(\"Computing initial vector for FixedScaleComponent before\"\n                    \" softmax, using priors^{prior_scale} and rescaling to\"\n                    \" average 1\".format(\n                        prior_scale=args.presoftmax_prior_scale_power))\n\n        common_train_lib.compute_presoftmax_prior_scale(\n            args.dir, args.ali_dir, num_jobs, run_opts,\n            presoftmax_prior_scale_power=args.presoftmax_prior_scale_power)\n\n    if args.stage <= -1:\n        logger.info(\"Preparing the initial acoustic model.\")\n        
train_lib.acoustic_model.prepare_initial_acoustic_model(\n            args.dir, args.ali_dir, run_opts,\n            input_model=args.input_model)\n\n    # set num_iters so that as close as possible, we process the data\n    # $num_epochs times, i.e. $num_iters*$avg_num_jobs) ==\n    # $num_epochs*$num_archives, where\n    # avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n    num_archives_to_process = int(args.num_epochs * num_archives)\n    num_archives_processed = 0\n    num_iters = int((num_archives_to_process * 2) / (args.num_jobs_initial + args.num_jobs_final))\n\n    # If do_final_combination is True, compute the set of models_to_combine.\n    # Otherwise, models_to_combine will be none.\n    if args.do_final_combination:\n        models_to_combine = common_train_lib.get_model_combine_iters(\n            num_iters, args.num_epochs,\n            num_archives, args.max_models_combine,\n            args.num_jobs_final)\n    else:\n        models_to_combine = None\n\n    min_deriv_time = None\n    max_deriv_time_relative = None\n    if args.deriv_truncate_margin is not None:\n        min_deriv_time = -args.deriv_truncate_margin - model_left_context\n        max_deriv_time_relative = \\\n           args.deriv_truncate_margin + model_right_context\n\n    logger.info(\"Training will run for {0} epochs = \"\n                \"{1} iterations\".format(args.num_epochs, num_iters))\n\n    for iter in range(num_iters):\n        if (args.exit_stage is not None) and (iter == args.exit_stage):\n            logger.info(\"Exiting early due to --exit-stage {0}\".format(iter))\n            return\n\n        current_num_jobs = common_train_lib.get_current_num_jobs(\n            iter, num_iters,\n            args.num_jobs_initial, args.num_jobs_step, args.num_jobs_final)\n\n        if args.stage <= iter:\n            model_file = \"{dir}/{iter}.mdl\".format(dir=args.dir, iter=iter)\n\n\n            lrate = common_train_lib.get_learning_rate(iter, current_num_jobs,\n             
                                          num_iters,\n                                                       num_archives_processed,\n                                                       num_archives_to_process,\n                                                       args.initial_effective_lrate,\n                                                       args.final_effective_lrate)\n\n            shrinkage_value = 1.0 - (args.proportional_shrink * lrate)\n            if shrinkage_value <= 0.5:\n                raise Exception(\"proportional-shrink={0} is too large, it gives \"\n                                \"shrink-value={1}\".format(args.proportional_shrink,\n                                                          shrinkage_value))\n            if args.shrink_value < shrinkage_value:\n                shrinkage_value = (args.shrink_value\n                                   if common_train_lib.should_do_shrinkage(\n                                           iter, model_file,\n                                           args.shrink_saturation_threshold) else 1.0)\n\n            percent = num_archives_processed * 100.0 / num_archives_to_process\n            epoch = (num_archives_processed * args.num_epochs\n                     / num_archives_to_process)\n            shrink_info_str = ''\n            if shrinkage_value != 1.0:\n                shrink_info_str = 'shrink: {0:0.5f}'.format(shrinkage_value)\n            logger.info(\"Iter: {0}/{1}   Jobs: {2}   \"\n                        \"Epoch: {3:0.2f}/{4:0.1f} ({5:0.1f}% complete)   \"\n                        \"lr: {6:0.6f}   {7}\".format(iter, num_iters - 1,\n                                                    current_num_jobs,\n                                                    epoch, args.num_epochs,\n                                                    percent,\n                                                    lrate, shrink_info_str))\n\n            train_lib.common.train_one_iteration(\n                
dir=args.dir,\n                iter=iter,\n                srand=args.srand,\n                egs_dir=egs_dir,\n                num_jobs=current_num_jobs,\n                num_archives_processed=num_archives_processed,\n                num_archives=num_archives,\n                learning_rate=lrate,\n                dropout_edit_string=common_train_lib.get_dropout_edit_string(\n                    args.dropout_schedule,\n                    float(num_archives_processed) / num_archives_to_process,\n                    iter),\n                train_opts=' '.join(args.train_opts),\n                shrinkage_value=shrinkage_value,\n                minibatch_size_str=args.num_chunk_per_minibatch,\n                min_deriv_time=min_deriv_time,\n                max_deriv_time_relative=max_deriv_time_relative,\n                momentum=args.momentum,\n                max_param_change=args.max_param_change,\n                shuffle_buffer_size=args.shuffle_buffer_size,\n                run_opts=run_opts,\n                backstitch_training_scale=args.backstitch_training_scale,\n                backstitch_training_interval=args.backstitch_training_interval,\n                compute_per_dim_accuracy=args.compute_per_dim_accuracy)\n\n            if args.cleanup:\n                # clean up everything but the last 2 models, under certain\n                # conditions\n                common_train_lib.remove_model(\n                    args.dir, iter-2, num_iters, models_to_combine,\n                    args.preserve_model_interval)\n\n            if args.email is not None:\n                reporting_iter_interval = num_iters * args.reporting_interval\n                if iter % reporting_iter_interval == 0:\n                    # let's do some reporting\n                    [report, times, data] = (\n                        nnet3_log_parse.generate_acc_logprob_report(args.dir))\n                    message = report\n                    subject = (\"Update : Expt {dir} : 
\"\n                               \"Iter {iter}\".format(dir=args.dir, iter=iter))\n                    common_lib.send_mail(message, subject, args.email)\n\n        num_archives_processed = num_archives_processed + current_num_jobs\n\n    if args.stage <= num_iters:\n        if args.do_final_combination:\n            logger.info(\"Doing final combination to produce final.mdl\")\n            train_lib.common.combine_models(\n                dir=args.dir, num_iters=num_iters,\n                models_to_combine=models_to_combine, egs_dir=egs_dir,\n                run_opts=run_opts,\n                minibatch_size_str=args.num_chunk_per_minibatch,\n                chunk_width=args.chunk_width,\n                max_objective_evaluations=args.max_objective_evaluations,\n                compute_per_dim_accuracy=args.compute_per_dim_accuracy)\n\n    if args.stage <= num_iters + 1:\n        logger.info(\"Getting average posterior for purposes of \"\n                    \"adjusting the priors.\")\n\n        # If args.do_final_combination is true, we will use the combined model.\n        # Otherwise, we will use the last_numbered model.\n        real_iter = 'combined' if args.do_final_combination else num_iters\n        avg_post_vec_file = train_lib.common.compute_average_posterior(\n            dir=args.dir, iter=real_iter, egs_dir=egs_dir,\n            num_archives=num_archives,\n            prior_subset_size=args.prior_subset_size, run_opts=run_opts)\n\n        logger.info(\"Re-adjusting priors based on computed posteriors\")\n        combined_or_last_numbered_model = \"{dir}/{iter}.mdl\".format(dir=args.dir,\n                iter=real_iter)\n        final_model = \"{dir}/final.mdl\".format(dir=args.dir)\n        train_lib.common.adjust_am_priors(args.dir, combined_or_last_numbered_model,\n                                          avg_post_vec_file, final_model,\n                                          run_opts)\n\n    if args.cleanup:\n        logger.info(\"Cleaning 
up the experiment directory \"\n                    \"{0}\".format(args.dir))\n        remove_egs = args.remove_egs\n        if args.egs_dir is not None:\n            # this egs_dir was not created by this experiment so we will not\n            # delete it\n            remove_egs = False\n\n        common_train_lib.clean_nnet_dir(\n            nnet_dir=args.dir, num_iters=num_iters, egs_dir=egs_dir,\n            preserve_model_interval=args.preserve_model_interval,\n            remove_egs=remove_egs)\n\n    # do some reporting\n    [report, times, data] = nnet3_log_parse.generate_acc_logprob_report(args.dir)\n    if args.email is not None:\n        common_lib.send_mail(report, \"Update : Expt {0} : \"\n                                     \"complete\".format(args.dir), args.email)\n\n    with open(\"{dir}/accuracy.report\".format(dir=args.dir), \"w\") as f:\n        f.write(report)\n\n    common_lib.execute_command(\"steps/info/nnet3_dir_info.pl \"\n                               \"{0}\".format(args.dir))\n\n\ndef main():\n    [args, run_opts] = get_args()\n    try:\n        train(args, run_opts)\n        common_lib.wait_for_background_commands()\n    except BaseException as e:\n        # look for BaseException so we catch KeyboardInterrupt, which is\n        # what we get when a background thread dies.\n        if args.email is not None:\n            message = (\"Training session for experiment {dir} \"\n                       \"died due to an error.\".format(dir=args.dir))\n            common_lib.send_mail(message, message, args.email)\n        if not isinstance(e, KeyboardInterrupt):\n            traceback.print_exc()\n        sys.exit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/nnet3/train_tdnn.sh",
    "content": "#!/usr/bin/env bash\n\n# THIS SCRIPT IS DEPRECATED, see ./train_dnn.py\n\n# note, TDNN is the same as what we used to call multisplice.\n\n# Copyright 2012-2015  Johns Hopkins University (Author: Daniel Povey).\n#           2013  Xiaohui Zhang\n#           2013  Guoguo Chen\n#           2014  Vimal Manohar\n#           2014  Vijayaditya Peddinti\n# Apache 2.0.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_epochs=15      # Number of epochs of training;\n                   # the number of iterations is worked out from this.\ninitial_effective_lrate=0.01\nfinal_effective_lrate=0.001\npnorm_input_dim=3000\npnorm_output_dim=300\nrelu_dim=  # you can use this to make it use ReLU's instead of p-norms.\nrand_prune=4.0 # Relates to a speedup we do for LDA.\nminibatch_size=512  # This default is suitable for GPU-based training.\n                    # Set it to 128 for multi-threaded CPU-based training.\nmax_param_change=2.0  # max param change per minibatch\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  
This option is passed to get_egs.sh\nnum_jobs_initial=1  # Number of neural net jobs to run in parallel at the start of training\nnum_jobs_final=8   # Number of neural net jobs to run in parallel at the end of training\nprior_subset_size=20000 # 20k samples per job, for computing priors.\nnum_jobs_compute_prior=10 # these are single-threaded, run on CPU.\nget_egs_stage=0    # can be used for rerunning after partial\nonline_ivector_dir=\npresoftmax_prior_scale_power=-0.25\nuse_presoftmax_prior_scale=true\nremove_egs=true  # set to false to disable removing egs after training is done.\n\nmax_models_combine=20 # The \"max_models_combine\" is the maximum number of models we give\n  # to the final 'combine' stage, but these models will themselves be averages of\n  # iteration-number ranges.\n\nshuffle_buffer_size=5000 # This \"buffer_size\" variable controls randomization of the samples\n                # on each iter.  You could set it to 0 or to a large value for complete\n                # randomization, but this would both consume memory and cause spikes in\n                # disk I/O.  Smaller is easier on disk and memory but less random.  It's\n                # not a huge deal though, as samples are anyway randomized right at the start.\n                # (the point of this is to get data in different minibatches on different iterations,\n                # since in the preconditioning method, 2 samples in the same minibatch can\n                # affect each others' gradients.\n\nadd_layers_period=2 # by default, add new layers every 2 iterations.\nstage=-6\nexit_stage=-100 # you can set this to terminate the training early.  
Exits before running this stage\n\n# count space-separated fields in splice_indexes to get num-hidden-layers.\nsplice_indexes=\"-4,-3,-2,-1,0,1,2,3,4  0  -2,2  0  -4,4 0\"\n# Format : layer<hidden_layer>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n# note: hidden layers which are composed of one or more components,\n# so hidden layer indexing is different from component count\n\nrandprune=4.0 # speeds up LDA.\nuse_gpu=true    # if true, we run on GPU.\ncleanup=true\negs_dir=\nmax_lda_jobs=10  # use no more than 10 jobs for the LDA accumulation.\nlda_opts=\negs_opts=\ntransform_dir=     # If supplied, this dir used instead of alidir to find transforms.\ncmvn_opts=  # will be passed to get_lda.sh and get_egs.sh, if supplied.\n            # only relevant for \"raw\" features, not lda.\nfeat_type=raw  # or set to 'lda' to use LDA features.\nalign_cmd=              # The cmd that is passed to steps/nnet2/align.sh\nalign_use_gpu=          # Passed to use_gpu in steps/nnet2/align.sh [yes/no]\nrealign_times=          # List of times on which we realign.  Each time is\n                        # floating point number strictly between 0 and 1, which\n                        # will be multiplied by the num-iters to get an iteration\n                        # number.\nnum_jobs_align=30       # Number of jobs for realignment\n# End configuration section.\nframes_per_eg=8 # to be passed on to get_egs.sh\n\ntrap 'for pid in $(jobs -pr); do kill -KILL $pid; done' INT QUIT TERM\n\necho \"$0: THIS SCRIPT IS DEPRECATED\"\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/tri3_ali exp/tri4_nnet\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-epochs <#epochs|15>                        # Number of epochs of training\"\n  echo \"  --initial-effective-lrate <lrate|0.02> # effective learning rate at start of training.\"\n  echo \"  --final-effective-lrate <lrate|0.004>   # effective learning rate at end of training.\"\n  echo \"                                                   # data, 0.00025 for large data\"\n  echo \"  --num-hidden-layers <#hidden-layers|2>           # Number of hidden layers, e.g. 2 for 3 hours of data, 4 for 100hrs\"\n  echo \"  --add-layers-period <#iters|2>                   # Number of iterations between adding hidden layers\"\n  echo \"  --presoftmax-prior-scale-power <power|-0.25>     # use the specified power value on the priors (inverse priors) to scale\"\n  echo \"                                                   # the pre-softmax outputs (set to 0.0 to disable the presoftmax element scale)\"\n  echo \"  --num-jobs-initial <num-jobs|1>                  # Number of parallel jobs to use for neural net training, at the start.\"\n  echo \"  --num-jobs-final <num-jobs|8>                    # Number of parallel jobs to use for neural net training, at the end\"\n  echo \"  --num-threads <num-threads|16>                   # Number of parallel threads per job, for CPU-based training (will affect\"\n  echo \"                                                   # results as well as speed; may interact with batch size; if you increase\"\n  echo \"                                                   # this, you may want to decrease the batch 
size.\"\n  echo \"  --parallel-opts <opts|\\\"--num-threads 16 --mem 1G\\\">      # extra options to pass to e.g. queue.pl for processes that\"\n  echo \"                                                   # use multiple threads... note, you might have to reduce --mem\"\n  echo \"                                                   # versus your defaults, because it gets multiplied by the --num-threads argument.\"\n  echo \"  --minibatch-size <minibatch-size|128>            # Size of minibatch to process (note: product with --num-threads\"\n  echo \"                                                   # should not get too large, e.g. >2k).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --splice-indexes <string|layer0/-4:-3:-2:-1:0:1:2:3:4> \"\n  echo \"                                                   # Frame indices used for each splice layer.\"\n  echo \"                                                   # Format : layer<hidden_layer_index>/<frame_indices>....layer<hidden_layer>/<frame_indices> \"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --lda-dim <dim|''>                               # Dimension to reduce spliced features to with LDA\"\n  echo \"  --realign-times <list-of-times|\\\"\\\">             # A list of space-separated floating point numbers between 0.0 and\"\n  echo \"                                                   # 1.0 to specify how far through training realignment is to be done\"\n  echo \"  --align-cmd (utils/run.pl|utils/queue.pl <queue opts>) # passed to align.sh\"\n  echo \"  --align-use-gpu (yes/no)                         # specify is gpu is to be used for realignment\"\n  echo \"  --num-jobs-align <#njobs|30>                     # Number of jobs to perform realignment\"\n  echo \"  --stage 
<stage|-4>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nif [ ! -z \"$realign_times\" ]; then\n  [ -z \"$align_cmd\" ] && echo \"$0: realign_times specified but align_cmd not specified\" && exit 1\n  [ -z \"$align_use_gpu\" ] && echo \"$0: realign_times specified but align_use_gpu not specified\" && exit 1\nfi\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# Copy phones.txt from ali-dir to dir. Later, steps/nnet3/decode.sh will\n# use it to check compatibility between training and decoding phone-sets.\ncp $alidir/phones.txt $dir\n\n# Set some variables.\nnum_leaves=`tree-info $alidir/tree 2>/dev/null | grep num-pdfs | awk '{print $2}'` || exit 1\n[ -z $num_leaves ] && echo \"\\$num_leaves is unset\" && exit 1\n[ \"$num_leaves\" -eq \"0\" ] && echo \"\\$num_leaves is 0\" && exit 1\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n# in this dir we'll have just one job.\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/tree $dir\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n# First work out the feature and iVector dimension, needed for tdnn config creation.\ncase $feat_type in\n  raw) feat_dim=$(feat-to-dim --print-args=false scp:$data/feats.scp -) || \\\n      { echo \"$0: Error getting feature dim\"; exit 1; }\n    ;;\n  lda)  [ ! 
-f $alidir/final.mat ] && echo \"$0: With --feat-type lda option, expect $alidir/final.mat to exist.\"\n   # get num-rows in lda matrix, which is the lda feature dim.\n   feat_dim=$(matrix-dim --print-args=false $alidir/final.mat | cut -f 1)\n    ;;\n  *)\n   echo \"$0: Bad --feat-type '$feat_type';\"; exit 1;\nesac\nif [ -z \"$online_ivector_dir\" ]; then\n  ivector_dim=0\nelse\n  ivector_dim=$(feat-to-dim scp:$online_ivector_dir/ivector_online.scp -) || exit 1;\nfi\n\n\nif [ $stage -le -5 ]; then\n  echo \"$0: creating neural net configs\";\n\n  if [ ! -z \"$relu_dim\" ]; then\n    dim_opts=\"--relu-dim $relu_dim\"\n  else\n    dim_opts=\"--pnorm-input-dim $pnorm_input_dim --pnorm-output-dim  $pnorm_output_dim\"\n  fi\n\n  # create the config files for nnet initialization\n  python steps/nnet3/make_tdnn_configs.py  \\\n    --splice-indexes \"$splice_indexes\"  \\\n    --feat-dim $feat_dim \\\n    --ivector-dim $ivector_dim  \\\n     $dim_opts \\\n    --use-presoftmax-prior-scale $use_presoftmax_prior_scale \\\n    --num-targets  $num_leaves  \\\n   $dir/configs || exit 1;\n\n  # Initialize as \"raw\" nnet, prior to training the LDA-like preconditioning\n  # matrix.  This first config just does any initial splicing that we do;\n  # we do this as it's a convenient way to get the stats for the 'lda-like'\n  # transform.\n  $cmd $dir/log/nnet_init.log \\\n    nnet3-init --srand=-2 $dir/configs/init.config $dir/init.raw || exit 1;\nfi\n\n# sourcing the \"vars\" below sets\n# left_context=(something)\n# right_context=(something)\n# num_hidden_layers=(something)\n. $dir/configs/vars || exit 1;\n\ncontext_opts=\"--left-context=$left_context --right-context=$right_context\"\n\n! [ \"$num_hidden_layers\" -gt 0 ] && echo \\\n \"$0: Expected num_hidden_layers to be defined\" && exit 1;\n\n[ -z \"$transform_dir\" ] && transform_dir=$alidir\n\n\nif [ $stage -le -4 ] && [ -z \"$egs_dir\" ]; then\n  extra_opts=()\n  [ ! 
-z \"$cmvn_opts\" ] && extra_opts+=(--cmvn-opts \"$cmvn_opts\")\n  [ ! -z \"$feat_type\" ] && extra_opts+=(--feat-type $feat_type)\n  [ ! -z \"$online_ivector_dir\" ] && extra_opts+=(--online-ivector-dir $online_ivector_dir)\n  extra_opts+=(--transform-dir $transform_dir)\n  extra_opts+=(--left-context $left_context)\n  extra_opts+=(--right-context $right_context)\n  echo \"$0: calling get_egs.sh\"\n  steps/nnet3/get_egs.sh $egs_opts \"${extra_opts[@]}\" \\\n      --samples-per-iter $samples_per_iter --stage $get_egs_stage \\\n      --cmd \"$cmd\" $egs_opts \\\n      --frames-per-eg $frames_per_eg \\\n      $data $alidir $dir/egs || exit 1;\nfi\n\n[ -z $egs_dir ] && egs_dir=$dir/egs\n\nif [ \"$feat_dim\" != \"$(cat $egs_dir/info/feat_dim)\" ]; then\n  echo \"$0: feature dimension mismatch with egs, $feat_dim vs $(cat $egs_dir/info/feat_dim)\";\n  exit 1;\nfi\nif [ \"$ivector_dim\" != \"$(cat $egs_dir/info/ivector_dim)\" ]; then\n  echo \"$0: ivector dimension mismatch with egs, $ivector_dim vs $(cat $egs_dir/info/ivector_dim)\";\n  exit 1;\nfi\n\n# copy any of the following that exist, to $dir.\ncp $egs_dir/{cmvn_opts,splice_opts,final.mat} $dir 2>/dev/null\n\n# confirm that the egs_dir has the necessary context (especially important if\n# the --egs-dir option was used on the command line).\negs_left_context=$(cat $egs_dir/info/left_context) || exit -1\negs_right_context=$(cat $egs_dir/info/right_context) || exit -1\n ( [ $egs_left_context -lt $left_context ] || \\\n   [ $egs_right_context -lt $right_context ] ) && \\\n   echo \"$0: egs in $egs_dir have too little context\" && exit -1;\n\nframes_per_eg=$(cat $egs_dir/info/frames_per_eg) || { echo \"error: no such file $egs_dir/info/frames_per_eg\"; exit 1; }\nnum_archives=$(cat $egs_dir/info/num_archives) || { echo \"error: no such file $egs_dir/info/num_archives\"; exit 1; }\n\n# num_archives_expanded considers each separate label-position from\n# 0..frames_per_eg-1 to be a separate 
archive.\nnum_archives_expanded=$[$num_archives*$frames_per_eg]\n\n[ $num_jobs_initial -gt $num_jobs_final ] && \\\n  echo \"$0: --initial-num-jobs cannot exceed --final-num-jobs\" && exit 1;\n\n[ $num_jobs_final -gt $num_archives_expanded ] && \\\n  echo \"$0: --final-num-jobs cannot exceed #archives $num_archives_expanded.\" && exit 1;\n\n\nif [ $stage -le -3 ]; then\n  echo \"$0: getting preconditioning matrix for input features.\"\n  num_lda_jobs=$num_archives\n  [ $num_lda_jobs -gt $max_lda_jobs ] && num_lda_jobs=$max_lda_jobs\n\n  # Write stats with the same format as stats for LDA.\n  $cmd JOB=1:$num_lda_jobs $dir/log/get_lda_stats.JOB.log \\\n      nnet3-acc-lda-stats --rand-prune=$rand_prune \\\n        $dir/init.raw \"ark:$egs_dir/egs.JOB.ark\" $dir/JOB.lda_stats || exit 1;\n\n  all_lda_accs=$(for n in $(seq $num_lda_jobs); do echo $dir/$n.lda_stats; done)\n  $cmd $dir/log/sum_transform_stats.log \\\n    sum-lda-accs $dir/lda_stats $all_lda_accs || exit 1;\n\n  rm $all_lda_accs || exit 1;\n\n  # this computes a fixed affine transform computed in the way we described in\n  # Appendix C.6 of http://arxiv.org/pdf/1410.7455v6.pdf; it's a scaled variant\n  # of an LDA transform but without dimensionality reduction.\n  $cmd $dir/log/get_transform.log \\\n     nnet-get-feature-transform $lda_opts $dir/lda.mat $dir/lda_stats || exit 1;\n\n  ln -sf ../lda.mat $dir/configs/lda.mat\nfi\n\n\nif [ $stage -le -2 ]; then\n  echo \"$0: preparing initial vector for FixedScaleComponent before softmax\"\n  echo \"  ... 
using priors^$presoftmax_prior_scale_power and rescaling to average 1\"\n\n  # obtains raw pdf count\n  $cmd JOB=1:$nj $dir/log/acc_pdf.JOB.log \\\n     ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n     post-to-tacc --per-pdf=true  $alidir/final.mdl ark:- $dir/pdf_counts.JOB || exit 1;\n  $cmd $dir/log/sum_pdf_counts.log \\\n       vector-sum --binary=false $dir/pdf_counts.* $dir/pdf_counts || exit 1;\n  rm $dir/pdf_counts.*\n\n  awk -v power=$presoftmax_prior_scale_power -v smooth=0.01 \\\n     '{ for(i=2; i<=NF-1; i++) { count[i-2] = $i;  total += $i; }\n        num_pdfs=NF-2;  average_count = total/num_pdfs;\n        for (i=0; i<num_pdfs; i++) stot += (scale[i] = (count[i] + smooth * average_count)^power)\n        printf \" [ \"; for (i=0; i<num_pdfs; i++) printf(\"%f \", scale[i]*num_pdfs/stot); print \"]\" }' \\\n     $dir/pdf_counts > $dir/presoftmax_prior_scale.vec\n  ln -sf ../presoftmax_prior_scale.vec $dir/configs/presoftmax_prior_scale.vec\nfi\n\nif [ $stage -le -1 ]; then\n  # Add the first layer; this will add in the lda.mat and\n  # presoftmax_prior_scale.vec.\n  $cmd $dir/log/add_first_layer.log \\\n       nnet3-init --srand=-3 $dir/init.raw $dir/configs/layer1.config $dir/0.raw || exit 1;\n\n  # Convert to .mdl, train the transitions, set the priors.\n  $cmd $dir/log/init_mdl.log \\\n    nnet3-am-init $alidir/final.mdl $dir/0.raw - \\| \\\n    nnet3-am-train-transitions - \"ark:gunzip -c $alidir/ali.*.gz|\" $dir/0.mdl || exit 1;\nfi\n\n\n# set num_iters so that as close as possible, we process the data $num_epochs\n# times, i.e. $num_iters*$avg_num_jobs) == $num_epochs*$num_archives_expanded,\n# where avg_num_jobs=(num_jobs_initial+num_jobs_final)/2.\n\nnum_archives_to_process=$[$num_epochs*$num_archives_expanded]\nnum_archives_processed=0\nnum_iters=$[($num_archives_to_process*2)/($num_jobs_initial+$num_jobs_final)]\n\n! 
[ $num_iters -gt $[$num_hidden_layers*$add_layers_period+2] ] \\\n  && echo \"$0: Insufficient epochs\" && exit 1\n\n# the check above spells out this same product, because the variable is only\n# set here.\nfinish_add_layers_iter=$[$num_hidden_layers * $add_layers_period]\n\necho \"$0: Will train for $num_epochs epochs = $num_iters iterations\"\n\nif $use_gpu; then\n  parallel_suffix=\"\"\n  train_queue_opt=\"--gpu 1\"\n  combine_queue_opt=\"--gpu 1\"\n  prior_gpu_opt=\"--use-gpu=yes\"\n  prior_queue_opt=\"--gpu 1\"\n  parallel_train_opts=\n  if ! cuda-compiled; then\n    echo \"$0: WARNING: you are running with one thread but you have not compiled\"\n    echo \"   for CUDA.  You may be running a setup optimized for GPUs.  If you have\"\n    echo \"   GPUs and have nvcc installed, go to src/ and do ./configure; make\"\n    exit 1\n  fi\nelse\n  echo \"$0: without using a GPU this will be very slow.  nnet3 does not yet support multiple threads.\"\n  parallel_train_opts=\"--use-gpu=no\"\n  combine_queue_opt=\"\"  # the combine stage will be quite slow if not using\n                        # GPU, as we didn't enable that program to use\n                        # multiple threads.\n  prior_gpu_opt=\"--use-gpu=no\"\n  prior_queue_opt=\"\"\nfi\n\n\napprox_iters_per_epoch_final=$[$num_archives_expanded/$num_jobs_final]\n# First work out how many iterations we want to combine over in the final\n# nnet3-combine-fast invocation.  (We may end up subsampling from these if the\n# number exceeds max_models_combine).  
The number we use is:\n# min(max(max_models_combine, approx_iters_per_epoch_final),\n#     1/2 * iters_after_last_layer_added)\nnum_iters_combine=$max_models_combine\nif [ $num_iters_combine -lt $approx_iters_per_epoch_final ]; then\n   num_iters_combine=$approx_iters_per_epoch_final\nfi\nhalf_iters_after_add_layers=$[($num_iters-$finish_add_layers_iter)/2]\nif [ $num_iters_combine -gt $half_iters_after_add_layers ]; then\n  num_iters_combine=$half_iters_after_add_layers\nfi\nfirst_model_combine=$[$num_iters-$num_iters_combine+1]\n\nx=0\n\nfor realign_time in $realign_times; do\n  # Work out the iterations on which we will re-align, if the --realign-times\n  # option was used.  This is slightly approximate.\n  ! perl -e \"exit($realign_time > 0.0 && $realign_time < 1.0 ? 0:1);\" && \\\n    echo \"Invalid --realign-times option $realign_times: elements must be strictly between 0 and 1.\" && exit 1;\n  # the next formula is based on the one for mix_up_iter above.\n  realign_iter=$(perl -e '($j,$k,$n,$p)=@ARGV; print int(0.5 + ($j==$k ? $n*$p : $n*(sqrt((1-$p)*$j*$j+$p*$k*$k)-$j)/($k-$j))); ' $num_jobs_initial $num_jobs_final $num_iters $realign_time) || exit 1;\n  realign_this_iter[$realign_iter]=$realign_time\ndone\n\ncur_egs_dir=$egs_dir\n\nwhile [ $x -lt $num_iters ]; do\n  [ $x -eq $exit_stage ] && echo \"$0: Exiting early due to --exit-stage $exit_stage\" && exit 0;\n\n  this_num_jobs=$(perl -e \"print int(0.5+$num_jobs_initial+($num_jobs_final-$num_jobs_initial)*$x/$num_iters);\")\n\n  ilr=$initial_effective_lrate; flr=$final_effective_lrate; np=$num_archives_processed; nt=$num_archives_to_process;\n  this_learning_rate=$(perl -e \"print (($x + 1 >= $num_iters ? $flr : $ilr*exp($np*log($flr/$ilr)/$nt))*$this_num_jobs);\");\n\n  echo \"On iteration $x, learning rate is $this_learning_rate.\"\n\n  if [ ! 
-z \"${realign_this_iter[$x]}\" ]; then\n    prev_egs_dir=$cur_egs_dir\n    cur_egs_dir=$dir/egs_${realign_this_iter[$x]}\n  fi\n\n  if [ $x -ge 0 ] && [ $stage -le $x ]; then\n    if [ ! -z \"${realign_this_iter[$x]}\" ]; then\n      time=${realign_this_iter[$x]}\n\n      echo \"Getting average posterior for purposes of adjusting the priors.\"\n      # Note: this just uses CPUs, using a smallish subset of data.\n      # always use the first egs archive, which makes the script simpler;\n      # we're using different random subsets of it.\n      rm $dir/post.$x.*.vec 2>/dev/null\n      $cmd JOB=1:$num_jobs_compute_prior $dir/log/get_post.$x.JOB.log \\\n        nnet3-copy-egs --srand=JOB --frame=random $context_opts ark:$prev_egs_dir/egs.1.ark ark:- \\| \\\n        nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n        nnet3-merge-egs ark:- ark:- \\| \\\n        nnet3-compute-from-egs --apply-exp=true \"nnet3-am-copy --raw=true $dir/$x.mdl -|\" ark:- ark:- \\| \\\n        matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n      sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n      $cmd $dir/log/vector_sum.$x.log \\\n        vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n      rm $dir/post.$x.*.vec;\n\n      echo \"Re-adjusting priors based on computed posteriors\"\n      $cmd $dir/log/adjust_priors.$x.log \\\n        nnet3-am-adjust-priors $dir/$x.mdl $dir/post.$x.vec $dir/$x.mdl || exit 1;\n\n      sleep 2\n\n      steps/nnet3/align.sh --nj $num_jobs_align --cmd \"$align_cmd\" --use-gpu $align_use_gpu \\\n        --transform-dir \"$transform_dir\" --online-ivector-dir \"$online_ivector_dir\" \\\n        --iter $x $data $lang $dir $dir/ali_$time || exit 1\n\n      steps/nnet3/relabel_egs.sh --cmd \"$cmd\" --iter $x $dir/ali_$time \\\n        $prev_egs_dir $cur_egs_dir || exit 1\n\n      if $cleanup && [[ $prev_egs_dir =~ $dir/egs* ]]; then\n        steps/nnet3/remove_egs.sh 
$prev_egs_dir\n      fi\n    fi\n\n    # Set off jobs doing some diagnostics, in the background.\n    # Use the egs dir from the previous iteration for the diagnostics\n    $cmd $dir/log/compute_prob_valid.$x.log \\\n      nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n            \"ark:nnet3-merge-egs ark:$cur_egs_dir/valid_diagnostic.egs ark:- |\" &\n    $cmd $dir/log/compute_prob_train.$x.log \\\n      nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n           \"ark:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:- |\" &\n\n    if [ $x -gt 0 ]; then\n      $cmd $dir/log/progress.$x.log \\\n        nnet3-show-progress --use-gpu=no \"nnet3-am-copy --raw=true $dir/$[$x-1].mdl - |\" \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" \\\n        \"ark:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:-|\" '&&' \\\n        nnet3-info \"nnet3-am-copy --raw=true $dir/$x.mdl - |\" &\n    fi\n\n    echo \"Training neural net (pass $x)\"\n\n    if [ $x -gt 0 ] && \\\n      [ $x -le $[($num_hidden_layers-1)*$add_layers_period] ] && \\\n      [ $[$x%$add_layers_period] -eq 0 ]; then\n      do_average=false # if we've just mixed up, don't do averaging but take the\n                       # best.\n      cur_num_hidden_layers=$[1+$x/$add_layers_period]\n      config=$dir/configs/layer$cur_num_hidden_layers.config\n      raw=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl - | nnet3-init --srand=$x - $config - |\"\n    else\n      do_average=true\n      if [ $x -eq 0 ]; then do_average=false; fi # on iteration 0, pick the best, don't average.\n      raw=\"nnet3-am-copy --raw=true --learning-rate=$this_learning_rate $dir/$x.mdl -|\"\n    fi\n    if $do_average; then\n      this_minibatch_size=$minibatch_size\n    else\n      # on iteration zero or when we just added a layer, use a smaller minibatch\n      # size (and we will later choose the output of just one of the jobs): the\n      # model-averaging 
isn't always helpful when the model is changing too fast\n      # (i.e. it can worsen the objective function), and the smaller minibatch\n      # size will help to keep the update stable.\n      this_minibatch_size=$[$minibatch_size/2];\n    fi\n\n    rm $dir/.error 2>/dev/null\n\n\n    ( # this sub-shell is so that when we \"wait\" below,\n      # we only wait for the training jobs that we just spawned,\n      # not the diagnostic jobs that we spawned above.\n\n      # We can't easily use a single parallel SGE job to do the main training,\n      # because the computation of which archive and which --frame option\n      # to use for each job is a little complex, so we spawn each one separately.\n      for n in $(seq $this_num_jobs); do\n        k=$[$num_archives_processed + $n - 1]; # k is a zero-based index that we'll derive\n                                               # the other indexes from.\n        archive=$[($k%$num_archives)+1]; # work out the 1-based archive index.\n        frame=$[(($k/$num_archives)%$frames_per_eg)]; # work out the 0-based frame\n        # index; this increases more slowly than the archive index because the\n        # same archive with different frame indexes will give similar gradients,\n        # so we want to separate them in time.\n\n        $cmd $train_queue_opt $dir/log/train.$x.$n.log \\\n          nnet3-train $parallel_train_opts \\\n          --max-param-change=$max_param_change \"$raw\" \\\n          \"ark:nnet3-copy-egs --frame=$frame $context_opts ark:$cur_egs_dir/egs.$archive.ark ark:- | nnet3-shuffle-egs --buffer-size=$shuffle_buffer_size --srand=$x ark:- ark:-| nnet3-merge-egs --minibatch-size=$this_minibatch_size --discard-partial-minibatches=true ark:- ark:- |\" \\\n          $dir/$[$x+1].$n.raw || touch $dir/.error &\n      done\n      wait\n    )\n    # the error message below is not that informative, but $cmd will\n    # have printed a more specific one.\n    [ -f $dir/.error ] && echo \"$0: error on iteration $x 
of training\" && exit 1;\n\n    nnets_list=\n    for n in `seq 1 $this_num_jobs`; do\n      nnets_list=\"$nnets_list $dir/$[$x+1].$n.raw\"\n    done\n\n    if $do_average; then\n      # average the output of the different jobs.\n      $cmd $dir/log/average.$x.log \\\n        nnet3-average $nnets_list - \\| \\\n        nnet3-am-copy --set-raw-nnet=- $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    else\n      # choose the best from the different jobs.\n      n=$(perl -e '($nj,$pat)=@ARGV; $best_n=1; $best_logprob=-1.0e+10; for ($n=1;$n<=$nj;$n++) {\n          $fn = sprintf($pat,$n); open(F, \"<$fn\") || die \"Error opening log file $fn\";\n          undef $logprob; while (<F>) { if (m/log-prob-per-frame=(\\S+)/) { $logprob=$1; } }\n          close(F); if (defined $logprob && $logprob > $best_logprob) { $best_logprob=$logprob;\n          $best_n=$n; } } print \"$best_n\\n\"; ' $this_num_jobs $dir/log/train.$x.%d.log) || exit 1;\n      [ -z \"$n\" ] && echo \"Error getting best model\" && exit 1;\n      $cmd $dir/log/select.$x.log \\\n        nnet3-am-copy --set-raw-nnet=$dir/$[$x+1].$n.raw  $dir/$x.mdl $dir/$[$x+1].mdl || exit 1;\n    fi\n\n    rm $nnets_list\n    [ ! -f $dir/$[$x+1].mdl ] && exit 1;\n    if [ -f $dir/$[$x-1].mdl ] && $cleanup && \\\n       [ $[($x-1)%100] -ne 0  ] && [ $[$x-1] -lt $first_model_combine ]; then\n      rm $dir/$[$x-1].mdl\n    fi\n  fi\n  x=$[$x+1]\n  num_archives_processed=$[$num_archives_processed+$this_num_jobs]\ndone\n\n\nif [ $stage -le $num_iters ]; then\n  echo \"Doing final combination to produce final.mdl\"\n\n  # Now do combination.  In the nnet3 setup, the logic\n  # for doing averaging of subsets of the models in the case where\n  # there are too many models to reliably estimate interpolation\n  # factors (max_models_combine) has moved into nnet3-combine.\n  nnets_list=()\n  for n in $(seq 0 $[num_iters_combine-1]); do\n    iter=$[$first_model_combine+$n]\n    mdl=$dir/$iter.mdl\n    [ ! 
-f $mdl ] && echo \"Expected $mdl to exist\" && exit 1;\n    nnets_list[$n]=\"nnet3-am-copy --raw=true $mdl -|\";\n  done\n\n  # Below, we use --use-gpu=no to disable nnet3-combine-fast from using a GPU,\n  # as if there are many models it can give out-of-memory error; and we set\n  # num-threads to 8 to speed it up (this isn't ideal...)\n\n  $cmd $combine_queue_opt $dir/log/combine.log \\\n    nnet3-combine --num-iters=40 \\\n       --enforce-sum-to-one=true --enforce-positive-weights=true \\\n       --verbose=3 \"${nnets_list[@]}\" \"ark:nnet3-merge-egs --minibatch-size=1024 ark:$cur_egs_dir/combine.egs ark:-|\" \\\n    \"|nnet3-am-copy --set-raw-nnet=- $dir/$num_iters.mdl $dir/combined.mdl\" || exit 1;\n\n  # Compute the probability of the final, combined model with\n  # the same subset we used for the previous compute_probs, as the\n  # different subsets will lead to different probs.\n  $cmd $dir/log/compute_prob_valid.final.log \\\n    nnet3-compute-prob \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" \\\n    \"ark:nnet3-merge-egs ark:$cur_egs_dir/valid_diagnostic.egs ark:- |\" &\n  $cmd $dir/log/compute_prob_train.final.log \\\n    nnet3-compute-prob  \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" \\\n    \"ark:nnet3-merge-egs ark:$cur_egs_dir/train_diagnostic.egs ark:- |\" &\nfi\n\nif [ $stage -le $[$num_iters+1] ]; then\n  echo \"Getting average posterior for purposes of adjusting the priors.\"\n  # Note: this just uses CPUs, using a smallish subset of data.\n  if [ $num_jobs_compute_prior -gt $num_archives ]; then egs_part=1;\n  else egs_part=JOB; fi\n  rm $dir/post.$x.*.vec 2>/dev/null\n  $cmd JOB=1:$num_jobs_compute_prior $prior_queue_opt $dir/log/get_post.$x.JOB.log \\\n    nnet3-copy-egs --frame=random $context_opts --srand=JOB ark:$cur_egs_dir/egs.$egs_part.ark ark:- \\| \\\n    nnet3-subset-egs --srand=JOB --n=$prior_subset_size ark:- ark:- \\| \\\n    nnet3-merge-egs ark:- ark:- \\| \\\n    nnet3-compute-from-egs $prior_gpu_opt --apply-exp=true 
\\\n      \"nnet3-am-copy --raw=true $dir/combined.mdl -|\" ark:- ark:- \\| \\\n    matrix-sum-rows ark:- ark:- \\| vector-sum ark:- $dir/post.$x.JOB.vec || exit 1;\n\n  sleep 3;  # make sure there is time for $dir/post.$x.*.vec to appear.\n\n  $cmd $dir/log/vector_sum.$x.log \\\n   vector-sum $dir/post.$x.*.vec $dir/post.$x.vec || exit 1;\n\n  rm $dir/post.$x.*.vec;\n\n  echo \"Re-adjusting priors based on computed posteriors\"\n  $cmd $dir/log/adjust_priors.final.log \\\n    nnet3-am-adjust-priors $dir/combined.mdl $dir/post.$x.vec $dir/final.mdl || exit 1;\nfi\n\n\nif [ ! -f $dir/final.mdl ]; then\n  echo \"$0: $dir/final.mdl does not exist.\"\n  # we don't want to clean up if the training didn't succeed.\n  exit 1;\nfi\n\nsleep 2\n\necho Done\n\nif $cleanup; then\n  echo Cleaning up data\n  if $remove_egs && [[ $cur_egs_dir =~ $dir/egs* ]]; then\n    steps/nnet2/remove_egs.sh $cur_egs_dir\n  fi\n\n  echo Removing most of the models\n  for x in `seq 0 $num_iters`; do\n    if [ $[$x%100] -ne 0 ] && [ $x -ne $num_iters ] && [ -f $dir/$x.mdl ]; then\n       # delete all but every 100th model; don't delete the ones which combine to form the final model.\n      rm $dir/$x.mdl\n    fi\n  done\nfi\n\nsteps/info/nnet3_dir_info.pl $dir\n\nexit 0\n"
  },
  {
    "path": "egs/steps/nnet3/xconfig_to_config.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016-2018    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2017    Google Inc. (vpeddinti@google.com)\n# Apache 2.0.\n\n# This is like xconfig_to_configs.py but with a simpler interface; it writes\n# to a single named file.\n\n\nimport argparse\nimport os\nimport sys\nfrom collections import defaultdict\n\nsys.path.insert(0, 'steps/')\n# the following is in case we weren't running this from the normal directory.\nsys.path.insert(0, os.path.realpath(os.path.dirname(sys.argv[0])) + '/')\n\nimport libs.nnet3.xconfig.parser as xparser\nimport libs.common as common_lib\n\n\ndef get_args():\n    # we add compulsory arguments as named arguments for readability\n    parser = argparse.ArgumentParser(\n        description=\"Reads an xconfig file and creates config files \"\n                    \"for neural net creation and training\",\n        epilog='Search egs/*/*/local/{nnet3,chain}/*sh for examples')\n    parser.add_argument('--xconfig-file', required=True,\n                        help='Filename of input xconfig file')\n    parser.add_argument('--existing-model',\n                        help='Filename of previously trained neural net '\n                             '(e.g. final.mdl) which is useful in case of '\n                             'using nodes from list of component-nodes in '\n                             'already trained model '\n                             'to generate new config file for new model.'\n                             'The context info is also generated using '\n                             'a model generated by adding final.config '\n                             'to the existing model.'\n                             'e.g. 
In Transfer learning: generate new model using '\n                             'component nodes in existing model.')\n    parser.add_argument('--config-file-out', required=True,\n                        help='Filename to write nnet config file.')\n    parser.add_argument('--nnet-edits', type=str, default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"This option is useful in case the network you\n                        are creating does not have an output node called\n                        'output' (e.g. for multilingual setups).  You can set\n                        this to an edit-string like: 'rename-node old-name=xxx\n                        new-name=output' if node xxx plays the role of the\n                        output node in this network.  This is only used for\n                        computing the left/right context.\"\"\")\n\n    print(' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n\n    return args\n\n\n\ndef write_config_file(config_file_out, all_layers):\n    # config_basename_to_lines is a map from the basename of the\n    # config, as a string (e.g. 
'ref', 'all', 'init') to a list of\n    # strings representing lines to put in the config file.\n    config_basename_to_lines = defaultdict(list)\n\n    for layer in all_layers:\n        try:\n            pairs = layer.get_full_config()\n            for config_basename, line in pairs:\n                config_basename_to_lines[config_basename].append(line)\n        except Exception as e:\n            print(\"{0}: error producing config lines from xconfig \"\n                  \"line '{1}': error was: {2}\".format(sys.argv[0],\n                                                      str(layer), repr(e)),\n                  file=sys.stderr)\n            # we use raise rather than raise(e) as using a blank raise\n            # preserves the backtrace\n            raise\n\n    with open(config_file_out, 'w') as f:\n        # join sys.argv so the header shows the command line rather than\n        # the repr of a Python list.\n        print('# This file was created by the command:\\n'\n              '# {0} '.format(' '.join(sys.argv)), file=f)\n        # only the lines belonging to the 'final' config are written out.\n        lines = config_basename_to_lines['final']\n        for line in lines:\n            print(line, file=f)\n\n\ndef main():\n    args = get_args()\n    existing_layers = []\n    if args.existing_model is not None:\n        existing_layers = xparser.get_model_component_info(args.existing_model)\n    all_layers = xparser.read_xconfig_file(args.xconfig_file, existing_layers)\n    write_config_file(args.config_file_out, all_layers)\n\n\nif __name__ == '__main__':\n    main()\n\n\n# test:\n# (echo 'input dim=40 name=input'; echo 'output name=output input=Append(-1,0,1)')  >xconfig; steps/nnet3/xconfig_to_config.py --xconfig-file=xconfig --config-file-out=foo\n"
  },
  {
    "path": "egs/steps/nnet3/xconfig_to_configs.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016    Johns Hopkins University (Dan Povey)\n#           2016    Vijayaditya Peddinti\n#           2017    Google Inc. (vpeddinti@google.com)\n# Apache 2.0.\n\n# we're using python 3.x style print but want it to work in python 2.x,\nfrom __future__ import print_function\nimport argparse\nimport os\nimport sys\nfrom collections import defaultdict\n\nsys.path.insert(0, 'steps/')\n# the following is in case we weren't running this from the normal directory.\nsys.path.insert(0, os.path.realpath(os.path.dirname(sys.argv[0])) + '/')\n\nimport libs.nnet3.xconfig.parser as xparser\nimport libs.common as common_lib\n\n\ndef get_args():\n    # we add compulsary arguments as named arguments for readability\n    parser = argparse.ArgumentParser(\n        description=\"Reads an xconfig file and creates config files \"\n                    \"for neural net creation and training\",\n        epilog='Search egs/*/*/local/{nnet3,chain}/*sh for examples')\n    parser.add_argument('--xconfig-file', required=True,\n                        help='Filename of input xconfig file')\n    parser.add_argument('--existing-model',\n                        help='Filename of previously trained neural net '\n                             '(e.g. final.mdl) which is useful in case of '\n                             'using nodes from list of component-nodes in '\n                             'already trained model '\n                             'to generate new config file for new model.'\n                             'The context info is also generated using '\n                             'a model generated by adding final.config '\n                             'to the existing model.'\n                             'e.g. 
In Transfer learning: generate new model using '\n                             'component nodes in existing model.')\n    parser.add_argument('--config-dir', required=True,\n                        help='Directory to write config files and variables')\n    parser.add_argument('--nnet-edits', type=str, default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"This option is useful in case the network you\n                        are creating does not have an output node called\n                        'output' (e.g. for multilingual setups).  You can set\n                        this to an edit-string like: 'rename-node old-name=xxx\n                        new-name=output' if node xxx plays the role of the\n                        output node in this network.  This is only used for\n                        computing the left/right context.\"\"\")\n\n    print(' '.join(sys.argv), file=sys.stderr)\n\n    args = parser.parse_args()\n    args = check_args(args)\n\n    return args\n\n\ndef check_args(args):\n    if not os.path.exists(args.config_dir):\n        os.makedirs(args.config_dir)\n    return args\n\n\ndef backup_xconfig_file(xconfig_file, config_dir):\n    \"\"\"we write a copy of the xconfig file just to have a record of the\n    original input.\n    \"\"\"\n    try:\n        xconfig_file_out = open(config_dir + '/xconfig', 'w')\n    except IOError:\n        raise Exception('{0}: error opening file '\n                        '{1}/xconfig for output'.format(\n                            sys.argv[0], config_dir))\n    try:\n        xconfig_file_in = open(xconfig_file)\n    except IOError:\n        # report the input file that failed to open, not the output directory.\n        raise Exception('{0}: error opening file {1} for input'\n                        ''.format(sys.argv[0], xconfig_file))\n\n    print(\"# This file was created by the command:\\n\"\n          \"# {0}\\n\"\n          \"# It is a copy of the source from which the config files in \"\n          \"this directory were 
generated.\\n\".format(' '.join(sys.argv)),\n          file=xconfig_file_out)\n\n    while True:\n        line = xconfig_file_in.readline()\n        if line == '':\n            break\n        print(line.strip(), file=xconfig_file_out)\n    xconfig_file_out.close()\n    xconfig_file_in.close()\n\n\ndef write_expanded_xconfig_files(config_dir, all_layers):\n    \"\"\" This functions writes config_dir/xconfig.expanded.1 and\n    config_dir/xconfig.expanded.2, showing some of the internal stages of\n    processing the xconfig file before turning it into config files.\n    \"\"\"\n    try:\n        xconfig_file_out = open(config_dir + '/xconfig.expanded.1', 'w')\n    except:\n        raise Exception('{0}: error opening file '\n                        '{1}/xconfig.expanded.1 for output'.format(\n                            sys.argv[0], config_dir))\n\n    print('# This file was created by the command:\\n'\n          '# ' + ' '.join(sys.argv) + '\\n'\n          '#It contains the same content as ./xconfig but it was parsed and\\n'\n          '#default config values were set.\\n'\n          '# See also ./xconfig.expanded.2\\n', file=xconfig_file_out)\n\n    for layer in all_layers:\n        print('{}'.format(layer), file=xconfig_file_out)\n    xconfig_file_out.close()\n\n    try:\n        xconfig_file_out = open(config_dir + '/xconfig.expanded.2', 'w')\n    except:\n        raise Exception('{0}: error opening file '\n                        '{1}/xconfig.expanded.2 for output'.format(\n                            sys.argv[0], config_dir))\n\n    print('# This file was created by the command:\\n'\n          '# ' + ' '.join(sys.argv) + '\\n'\n          '# It contains the same content as ./xconfig but it was parsed,\\n'\n          '# default config values were set, \\n'\n          '# and Descriptors (input=xxx) were normalized.\\n'\n          '# See also ./xconfig.expanded.1\\n',\n          file=xconfig_file_out)\n\n    for layer in all_layers:\n        
layer.normalize_descriptors()\n        print('{}'.format(layer), file=xconfig_file_out)\n    xconfig_file_out.close()\n\n\ndef get_config_headers():\n    \"\"\" This function returns a map from config-file basename\n    e.g. 'init', 'ref', 'layer1' to a documentation string that goes\n    at the top of the file.\n    \"\"\"\n    # resulting dict will default to the empty string for any config files not\n    # explicitly listed here.\n    ans = defaultdict(str)\n\n    ans['init'] = (\n        '# This file was created by the command:\\n'\n        '# ' + ' '.join(sys.argv) + '\\n'\n        '# It contains the input of the network and is used in\\n'\n        '# accumulating stats for an LDA-like transform of the\\n'\n        '# input features.\\n')\n    ans['ref'] = (\n        '# This file was created by the command:\\n'\n        '# ' + ' '.join(sys.argv) + '\\n'\n        '# It contains the entire neural network, but with those\\n'\n        '# components that would normally require fixed vectors/matrices\\n'\n        '# read from disk, replaced with random initialization\\n'\n        '# (this applies to the LDA-like transform and the\\n'\n        '# presoftmax-prior-scale, if applicable).  This file\\n'\n        '# is used only to work out the left-context and right-context\\n'\n        '# of the network.\\n')\n    ans['final'] = (\n        '# This file was created by the command:\\n'\n        '# ' + ' '.join(sys.argv) + '\\n'\n        '# It contains the entire neural network.\\n')\n\n    return ans\n\n\n# This is where most of the work of this program happens.\ndef write_config_files(config_dir, all_layers):\n    # config_basename_to_lines is map from the basename of the\n    # config, as a string (i.e. 
'ref', 'all', 'init') to a list of\n    # strings representing lines to put in the config file.\n    config_basename_to_lines = defaultdict(list)\n\n    config_basename_to_header = get_config_headers()\n\n    for layer in all_layers:\n        try:\n            pairs = layer.get_full_config()\n            for config_basename, line in pairs:\n                config_basename_to_lines[config_basename].append(line)\n        except Exception as e:\n            print(\"{0}: error producing config lines from xconfig \"\n                  \"line '{1}': error was: {2}\".format(sys.argv[0],\n                                                      str(layer), repr(e)),\n                  file=sys.stderr)\n            # we use raise rather than raise(e) as using a blank raise\n            # preserves the backtrace\n            raise\n\n    # remove previous init.config\n    try:\n        os.remove(config_dir + '/init.config')\n    except OSError:\n        pass\n\n    for basename, lines in config_basename_to_lines.items():\n        # count the lines that start with 'output-node':\n        num_output_node_lines = sum( [ 1 if line.startswith('output-node' ) else 0\n                                       for line in lines ] )\n        if num_output_node_lines == 0:\n            if basename == 'init':\n                continue # do not write the init.config\n            else:\n                # a bare 'raise' is invalid here because no exception is\n                # active, so raise an explicit one with the message.\n                raise Exception('{0}: error in xconfig file: no output-node '\n                                'line was generated for {1}.config; the '\n                                'xconfig file may lack an output '\n                                'layer'.format(sys.argv[0], basename))\n\n        header = config_basename_to_header[basename]\n        filename = '{0}/{1}.config'.format(config_dir, basename)\n        try:\n            f = open(filename, 'w')\n            print(header, file=f)\n            for line in lines:\n                print(line, file=f)\n            f.close()\n        except Exception as e:\n            print('{0}: error writing to config file 
{1}: error is {2}'\n                  ''.format(sys.argv[0], filename, repr(e)), file=sys.stderr)\n            # we use raise rather than raise(e) as using a blank raise\n            # preserves the backtrace\n            raise\n\n\ndef add_nnet_context_info(config_dir, nnet_edits=None,\n                          existing_model=None):\n    \"\"\"Create the 'vars' file that specifies model_left_context, etc.\"\"\"\n\n    common_lib.execute_command(\"nnet3-init {0} {1}/ref.config \"\n                               \"{1}/ref.raw\"\n                               \"\".format(existing_model if\n                                         existing_model is not None else \"\",\n                                         config_dir))\n    model = \"{0}/ref.raw\".format(config_dir)\n    if nnet_edits is not None:\n        model = \"nnet3-copy --edits='{0}' {1} - |\".format(nnet_edits,\n                                                          model)\n    out = common_lib.get_command_stdout('nnet3-info \"{0}\"'.format(model))\n    # out looks like this\n    # left-context: 7\n    # right-context: 0\n    # num-parameters: 90543902\n    # modulus: 1\n    # ...\n    info = {}\n    for line in out.split(\"\\n\")[:4]: # take 4 initial lines,\n        parts = line.split(\":\")\n        if len(parts) != 2:\n            continue\n        info[parts[0].strip()] = int(parts[1].strip())\n\n    # Writing the 'vars' file:\n    #   model_left_context=0\n    #   model_right_context=7\n    vf = open('{0}/vars'.format(config_dir), 'w')\n    vf.write('model_left_context={0}\\n'.format(info['left-context']))\n    vf.write('model_right_context={0}\\n'.format(info['right-context']))\n    vf.close()\n\ndef check_model_contexts(config_dir, nnet_edits=None, existing_model=None):\n    contexts = {}\n    for file_name in ['init', 'ref']:\n        if os.path.exists('{0}/{1}.config'.format(config_dir, file_name)):\n            contexts[file_name] = {}\n            common_lib.execute_command(\"nnet3-init {0} 
{1}/{2}.config \"\n                                       \"{1}/{2}.raw\"\n                                       \"\".format(existing_model if\n                                                 existing_model is not\n                                                 None else '',\n                                                 config_dir, file_name))\n            model = \"{0}/{1}.raw\".format(config_dir, file_name)\n            if nnet_edits is not None and file_name != 'init':\n                model = \"nnet3-copy --edits='{0}' {1} - |\".format(nnet_edits,\n                                                                  model)\n            out = common_lib.get_command_stdout('nnet3-info \"{0}\"'.format(model))\n            # out looks like this\n            # left-context: 7\n            # right-context: 0\n            # num-parameters: 90543902\n            # modulus: 1\n            # ...\n            for line in out.split(\"\\n\")[:4]: # take 4 initial lines,\n                parts = line.split(\":\")\n                if len(parts) != 2:\n                    continue\n                key = parts[0].strip()\n                value = int(parts[1].strip())\n                if key in ['left-context', 'right-context']:\n                    contexts[file_name][key] = value\n\n    if 'init' in contexts:\n        assert('ref' in contexts)\n        if ('left-context' in contexts['init'] and\n            'left-context' in contexts['ref']):\n            if ((contexts['init']['left-context']\n                 > contexts['ref']['left-context'])\n                or (contexts['init']['right-context']\n                    > contexts['ref']['right-context'])):\n               raise Exception(\n                    \"Model specified in {0}/init.config requires greater\"\n                    \" context than the model specified in {0}/ref.config.\"\n                    \" This might be due to use of label-delay at the output\"\n                    \" in ref.config. 
Please use delay=$label_delay in the\"\n                    \" initial fixed-affine-layer of the network, to avoid\"\n                    \" this issue.\".format(config_dir))\n\n\n\ndef main():\n    args = get_args()\n    backup_xconfig_file(args.xconfig_file, args.config_dir)\n    existing_layers = []\n    if args.existing_model is not None:\n        existing_layers = xparser.get_model_component_info(args.existing_model)\n    all_layers = xparser.read_xconfig_file(args.xconfig_file, existing_layers)\n    write_expanded_xconfig_files(args.config_dir, all_layers)\n    write_config_files(args.config_dir, all_layers)\n    check_model_contexts(args.config_dir, args.nnet_edits,\n                         existing_model=args.existing_model)\n    add_nnet_context_info(args.config_dir, args.nnet_edits,\n                          existing_model=args.existing_model)\n\n\nif __name__ == '__main__':\n    main()\n\n\n# test (both --xconfig-file and --config-dir are required options):\n# mkdir -p foo; (echo 'input dim=40 name=input'; echo 'output name=output input=Append(-1,0,1)')  >xconfig; ./xconfig_to_configs.py --xconfig-file=xconfig --config-dir=foo\n#  mkdir -p foo; (echo 'input dim=40 name=input'; echo 'output-layer name=output dim=1924 input=Append(-1,0,1)')  >xconfig; ./xconfig_to_configs.py --xconfig-file=xconfig --config-dir=foo\n\n# mkdir -p foo; (echo 'input dim=40 name=input'; echo 'relu-renorm-layer name=affine1 dim=1024'; echo 'output-layer name=output dim=1924 input=Append(-1,0,1)')  >xconfig; ./xconfig_to_configs.py --xconfig-file=xconfig --config-dir=foo\n\n# mkdir -p foo; (echo 'input dim=100 name=ivector'; echo 'input dim=40 name=input'; echo 'fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=foo/bar/lda.mat'; echo 'output-layer name=output dim=1924 input=Append(-1,0,1)')  >xconfig; ./xconfig_to_configs.py --xconfig-file=xconfig --config-dir=foo\n"
  },
  {
    "path": "egs/steps/online/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects adaptation and pruning (scoring is on\n              # lattices).\nper_utt=false\ndo_endpointing=false\ndo_speex_compressing=false\nscoring_opts=\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the models are, as prepared by steps/online/prepare_online_decoding.sh\"\n   echo \"e.g.: $0 exp/tri3b/graph data/test exp/tri3b_online/decode/\"\n   echo \"\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --per-utt <true|false>                           # If true, decode per utterance without\"\n   echo \"                                                   # carrying forward adaptation info from previous\"\n   echo \"                                                   # utterances of each speaker.\"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\n\nmkdir -p 
$dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nfor f in $srcdir/conf/online_decoding.conf $graphdir/HCLG.fst $graphdir/words.txt $data/wav.scp; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nif ! $per_utt; then\n  spk2utt_rspecifier=\"ark:$sdata/JOB/spk2utt\"\nelse\n  mkdir -p $dir/per_utt\n  for j in $(seq $nj); do\n    awk '{print $1, $1}' <$sdata/$j/utt2spk >$dir/per_utt/utt2spk.$j || exit 1;\n  done\n  spk2utt_rspecifier=\"ark:$dir/per_utt/utt2spk.JOB\"\nfi\n\nif $do_endpointing; then\n  if $do_speex_compressing; then\n    wav_rspecifier=\"ark:compress-uncompress-speex scp:$sdata/JOB/wav.scp ark:-|extend-wav-with-silence ark:- ark:-|\"\n  else\n    wav_rspecifier=\"ark:extend-wav-with-silence scp:$sdata/JOB/wav.scp ark:-|\"\n  fi\nelse\n  if $do_speex_compressing; then\n    wav_rspecifier=\"ark:compress-uncompress-speex scp:$sdata/JOB/wav.scp ark:-|\"\n  else\n    wav_rspecifier=scp:$sdata/JOB/wav.scp\n  fi\nfi\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    online2-wav-gmm-latgen-faster --do-endpointing=$do_endpointing \\\n     --config=$srcdir/conf/online_decoding.conf \\\n     --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n     --acoustic-scale=$acwt --word-symbol-table=$graphdir/words.txt \\\n     $graphdir/HCLG.fst $spk2utt_rspecifier \"$wav_rspecifier\" \\\n      \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/online/nnet2/align.sh",
    "content": "#!/usr/bin/env bash\n# Copyright      2012  Brno University of Technology (Author: Karel Vesely)\n#           2013-2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Computes training alignments using DNN.  This takes as input a directory\n# prepared as for online-nnet2 decoding (e.g. by\n# steps/online/nnet2/prepare_online_decoding.sh), and it computes the features\n# directly from the wav.scp instead of relying on features dumped on disk;\n# this avoids the hassle of having to dump suitably matched features.\n\n\n# Begin configuration section.  \nnj=4\ncmd=run.pl\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\niter=final\nuse_gpu=no\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: $0 <data-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/nnet4 exp/nnet4_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n\nfor f in $srcdir/tree $srcdir/${iter}.mdl $data/wav.scp $lang/L.fst \\\n      $srcdir/conf/online_nnet2_decoding.conf; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\ncp $srcdir/{tree,${iter}.mdl} $dir || exit 1;\n\ngrep -v '^--endpoint' $srcdir/conf/online_nnet2_decoding.conf >$dir/feature.conf || exit 1;\n\n\nif [ -f $data/segments ]; then\n  # note: in the feature extraction, because the program online2-wav-dump-features is sensitive to the\n  # previous utterances within a speaker, we do the filtering after extracting the features.\n  echo \"$0 [info]: segments file exists: using that.\"\n  feats=\"ark,s,cs:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt ark,s,cs:- ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists, using wav.scp.\"\n  feats=\"ark,s,cs:online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt scp:$sdata/JOB/wav.scp ark:- |\"\nfi\n\necho \"$0: aligning data in $data using model from $srcdir, putting alignments in $dir\"\n\ntra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata/JOB/text|\";\n\n$cmd JOB=1:$nj $dir/log/align.JOB.log \\\n  compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $srcdir/${iter}.mdl  $lang/L.fst \"$tra\" ark:- \\| \\\n  nnet-align-compiled $scale_opts --use-gpu=$use_gpu --beam=$beam --retry-beam=$retry_beam \\\n    $srcdir/${iter}.mdl ark:- \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n\necho \"$0: done aligning data.\"\n\n"
  },
  {
    "path": "egs/steps/online/nnet2/copy_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013-2014  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# Warning, this script is deprecated, please use utils/data/modify_speaker_info.sh\n\n# This script is as utils/copy_data_dir.sh in that it copies a data-dir,\n# but it supports the --utts-per-spk-max option.  If nonzero, it modifies\n# the utt2spk and spk2utt files by splitting each speaker into multiple\n# versions, so that each speaker has no more than --utts-per-spk-max\n# utterances.\n\n# begin configuration section\nutts_per_spk_max=-1\n# end configuration section\n\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <destdir>\"\n  echo \"e.g.:\"\n  echo \" $0 --utts-per-spk-max 2 data/train data/train-max2\"\n  echo \"Options\"\n  echo \"   --utts-per-spk-max <n>  # number of utterances per speaker maximum,\"\n  echo \"                           # default -1 (meaning no maximum).  E.g. 2.\"\n  exit 1;\nfi\n\n\necho \"$0: this script is deprecated, please use utils/data/modify_speaker_info.sh.\"\n\nexport LC_ALL=C\n\nsrcdir=$1\ndestdir=$2\n\nif [ ! 
-f $srcdir/utt2spk ]; then\n  echo \"$0: no such file $srcdir/utt2spk\"\n  exit 1;\nfi\n\nset -e;\nset -o pipefail\n\nmkdir -p $destdir\n\n\nif [ \"$utts_per_spk_max\" != -1 ]; then\n  # create spk2utt file with reduced number of utterances per speaker.\n  awk -v max=$utts_per_spk_max '{ n=2; count=0;\n    while(n<=NF) {\n      int_max=int(max)+ (rand() < (max-int(max))?1:0);\n      nmax=n+int_max; count++; printf(\"%s-%06x\", $1, count);\n      for (;n<nmax&&n<=NF; n++) printf(\" %s\", $n); print \"\";} }' \\\n   <$srcdir/spk2utt >$destdir/spk2utt\n  utils/spk2utt_to_utt2spk.pl <$destdir/spk2utt >$destdir/utt2spk\n\n  if [ -f $srcdir/cmvn.scp ]; then\n    # below, the first apply_map command outputs a cmvn.scp indexed by utt;\n    # the second one outputs a cmvn.scp indexed by new speaker-id.\n    utils/apply_map.pl -f 2 $srcdir/cmvn.scp <$srcdir/utt2spk | \\\n      utils/apply_map.pl -f 1 $destdir/utt2spk | sort | uniq > $destdir/cmvn.scp\n    echo \"$0: mapping cmvn.scp, but you may want to recompute it if it's needed,\"\n    echo \" as it would probably change.\"\n  fi\n  if [ -f $srcdir/spk2gender ]; then\n    utils/apply_map.pl -f 2 $srcdir/spk2gender <$srcdir/utt2spk | \\\n      utils/apply_map.pl -f 1 $destdir/utt2spk | sort | uniq >$destdir/spk2gender\n  fi\nelse\n  cp $srcdir/spk2utt $srcdir/utt2spk $destdir/\n  [ -f $srcdir/spk2gender ] && cp $srcdir/spk2gender $destdir/\n  [ -f $srcdir/cmvn.scp ] && cp $srcdir/cmvn.scp $destdir/\nfi\n\n\nfor f in feats.scp segments wav.scp reco2file_and_channel text stm glm ctm; do\n  [ -f $srcdir/$f ] && cp $srcdir/$f $destdir/\ndone\n\necho \"$0: copied data from $srcdir to $destdir, with --utts-per-spk-max $utts_per_spk_max\"\nopts=\n[ ! -f $srcdir/feats.scp ] && opts=\"--no-feats\"\n[ ! -f $srcdir/text ] && opts=\"$opts --no-text\"\n[ ! -f $srcdir/wav.scp ] && opts=\"$opts --no-wav\"\n\nutils/validate_data_dir.sh $opts $destdir\n"
  },
  {
    "path": "egs/steps/online/nnet2/copy_ivector_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Johns Hopkins University (author: Hossein Hadian)\n# Apache 2.0\n\n# This script copies the necessary parts of an online ivector directory\n# optionally applying a mapping to the ivector_online.scp file\n\nutt2orig=\n\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <destdir>\"\n  echo \"e.g.:\"\n  echo \" $0 exp/nnet3/online_ivector_train exp/nnet3/online_ivector_train_fs\"\n  echo \"Options\"\n  echo \"   --utt2orig=<file>     # utterance id mapping to use\"\n  exit 1;\nfi\n\n\nsrcdir=$1\ndestdir=$2\n\nif [ ! -f $srcdir/ivector_period ]; then\n  echo \"$0: no such file $srcdir/ivector_period\"\n  exit 1;\nfi\n\nif [ \"$destdir\" == \"$srcdir\" ]; then\n  echo \"$0: this script requires <srcdir> and <destdir> to be different.\"\n  exit 1\nfi\n\nset -e;\n\nmkdir -p $destdir\ncp -r $srcdir/{conf,ivector_period} $destdir\nif [ -z $utt2orig ]; then\n  cp $srcdir/ivector_online.scp $destdir\nelse\n  utils/apply_map.pl -f 2 $srcdir/ivector_online.scp < $utt2orig > $destdir/ivector_online.scp\nfi\ncp $srcdir/final.ie.id $destdir\n\necho \"$0: Copied necessary parts of online ivector directory $srcdir to $destdir\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nmax_active=7000\nthreaded=false\nmodify_ivector_config=false #  only relevant to threaded decoder.\nbeam=15.0\nlattice_beam=6.0\nacwt=0.1   # note: only really affects adaptation and pruning (scoring is on\n           # lattices).\nper_utt=false\nonline=true  # only relevant to non-threaded decoder.\ndo_endpointing=false\ndo_speex_compressing=false\nscoring_opts=\nskip_scoring=false\nsilence_weight=1.0  # set this to a value less than 1 (e.g. 0) to enable silence weighting.\nmax_state_duration=40 # This only has an effect if you are doing silence\n  # weighting.  This default is probably reasonable.  transition-ids repeated\n  # more than this many times in an alignment are treated as silence.\niter=final\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... 
where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the models are, as prepared by steps/online/nnet2/prepare_online_decoding.sh\"\n   echo \"e.g.: $0 exp/tri3b/graph data/test exp/tri3b_online/decode/\"\n   echo \"\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --per-utt <true|false>                           # If true, decode per utterance without\"\n   echo \"                                                   # carrying forward adaptation info from previous\"\n   echo \"                                                   # utterances of each speaker.  Default: false\"\n   echo \"  --online <true|false>                            # Set this to false if you don't really care about\"\n   echo \"                                                   # simulating online decoding and just want the best\"\n   echo \"                                                   # results.  
This will use all the data within each\"\n   echo \"                                                   # utterance (plus any previous utterance, if not in\"\n   echo \"                                                   # per-utterance mode) to estimate the iVectors.\"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"  --iter <iter>                                    # Iteration of model to decode; default is final.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nif $per_utt; then\n  utt_suffix=utt\n  utt_opt=\"--per-utt\"\nelse\n  utt_suffix=\n  utt_opt=\nfi\nsdata=$data/split${nj}${utt_suffix};\n\nmkdir -p $dir/log\nsplit_data.sh $utt_opt $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nfor f in $srcdir/conf/online_nnet2_decoding.conf $srcdir/${iter}.mdl \\\n    $graphdir/HCLG.fst $graphdir/words.txt $data/wav.scp; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nif ! 
$per_utt; then\n  spk2utt_rspecifier=\"ark:$sdata/JOB/spk2utt\"\nelse\n  mkdir -p $dir/per_utt\n  for j in $(seq $nj); do\n    awk '{print $1, $1}' <$sdata/$j/utt2spk >$dir/per_utt/utt2spk.$j || exit 1;\n  done\n  spk2utt_rspecifier=\"ark:$dir/per_utt/utt2spk.JOB\"\nfi\n\nif [ -f $data/segments ]; then\n  wav_rspecifier=\"ark,s,cs:extract-segments scp,p:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- |\"\nelse\n  wav_rspecifier=\"ark,s,cs:wav-copy scp,p:$sdata/JOB/wav.scp ark:- |\"\nfi\nif $do_speex_compressing; then\n  wav_rspecifier=\"$wav_rspecifier compress-uncompress-speex ark:- ark:- |\"\nfi\nif $do_endpointing; then\n  wav_rspecifier=\"$wav_rspecifier extend-wav-with-silence ark:- ark:- |\"\nfi\n\nif [ \"$silence_weight\" != \"1.0\" ]; then\n  silphones=$(cat $graphdir/phones/silence.csl) || exit 1\n  silence_weighting_opts=\"--ivector-silence-weighting.max-state-duration=$max_state_duration --ivector-silence-weighting.silence_phones=$silphones --ivector-silence-weighting.silence-weight=$silence_weight\"\nelse\n  silence_weighting_opts=\nfi\n\n\nif $threaded; then\n  decoder=online2-wav-nnet2-latgen-threaded\n    # note: the decoder actually uses 4 threads, but the average usage will normally\n    # be more like 2.\n  parallel_opts=\"--num-threads 2\"\n  opts=\"--modify-ivector-config=$modify_ivector_config --verbose=1\"\nelse\n  decoder=online2-wav-nnet2-latgen-faster\n  parallel_opts=\n  opts=\"--online=$online\"\nfi\n\nif [ $stage -le 0 ]; then\n  $cmd $parallel_opts JOB=1:$nj $dir/log/decode.JOB.log \\\n    $decoder $opts $silence_weighting_opts --do-endpointing=$do_endpointing \\\n     --config=$srcdir/conf/online_nnet2_decoding.conf \\\n     --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n     --acoustic-scale=$acwt --word-symbol-table=$graphdir/words.txt \\\n     $srcdir/${iter}.mdl $graphdir/HCLG.fst $spk2utt_rspecifier \"$wav_rspecifier\" \\\n      \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\n\nif ! 
$skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/online/nnet2/dump_nnet_activations.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2013  Daniel Povey\n# Apache 2.0.\n\n# This script was modified from ./extract_ivectors_online2.sh.  It is to be used\n# when retraining the top layer of a system that was trained on another,\n# out-of-domain dataset, on some in-domain dataset.  It takes as input a\n# directory such as nnet_gpu_online as prepared by ./prepare_online_decoding.sh,\n# and a data directory, and it processes the wave files to get features and iVectors,\n# then puts it through all but the last layer of the neural net in that directory, and dumps\n# those final activations in a feats.scp file in the output directory.  These files\n# might be quite large.  A typical feature-dimension is 300; it's the p-norm output dim.\n# We compress these files (note: the compression is lossy).\n\n\n# Begin configuration section.\nnj=30\ncmd=\"run.pl\"\nstage=0\nutts_per_spk_max=2 # maximum 2 utterances per \"fake-speaker.\"\n\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [options] <data> <srcdir> <output-dir>\"\n  echo \" e.g.: $0 data/train exp/nnet2_online/nnet_a_online exp/nnet2_online/activations_train\"\n  echo \"Output is in <output-dir>/feats.scp\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue-opts>) # how to run jobs.\"\n  echo \"  --nj <n|30>                                      # Number of jobs (also see num-processes and num-threads)\"\n  echo \"  --stage <stage|0>                                # To control partial reruns\"\n  echo \"  --utts-per-spk-max <int;default=2>    # Controls splitting into 'fake speakers'.\"\n  echo \"                                        # Set to 1 if compatibility with utterance-by-utterance\"\n  echo \"                                        # decoding is the only factor, and to a larger value if you care \"\n  echo \"                                        # also about adaptation over several utterances.\"\n  exit 1;\nfi\n\ndata=$1\nsrcdir=$2\ndir=$3\n\nfor f in $data/wav.scp $srcdir/conf/online_nnet2_decoding.conf $srcdir/final.mdl; do\n  [ ! -f $f ] && echo \"No such file $f\" && exit 1;\ndone\n\n# Set various variables.\nmkdir -p $dir/log\necho $nj >$dir/num_jobs\nsdata=$data/split$nj;\nutils/split_data.sh $data $nj || exit 1;\n\n\nmkdir -p $dir/conf $dir/feats\ngrep -v '^--endpoint' $srcdir/conf/online_nnet2_decoding.conf > $dir/conf/online_feature_pipeline.conf\n\nif [ $stage -le 0 ]; then\n  ns=$(wc -l <$data/spk2utt)\n  if [ \"$ns\" == 1 -a \"$utts_per_spk_max\" != 1 ]; then\n    echo \"$0: you seem to have just one speaker in your database.  
This is probably not a good idea.\"\n    echo \"  see http://kaldi-asr.org/doc/data_prep.html (search for 'bold') for why\"\n    echo \"  Setting --utts-per-spk-max to 1.\"\n    utts_per_spk_max=1\n  fi\n\n  mkdir -p $dir/spk2utt_fake\n  for job in $(seq $nj); do \n   # create fake spk2utt files with reduced number of utterances per speaker,\n   # so the network is well adapted to using iVectors from small amounts of\n   # training data.\n    awk -v max=$utts_per_spk_max '{ n=2; count=0; while(n<=NF) {\n      nmax=n+max; count++; printf(\"%s-%06x\", $1, count); for (;n<nmax&&n<=NF; n++) printf(\" %s\", $n); print \"\";} }' \\\n        <$sdata/$job/spk2utt >$dir/spk2utt_fake/spk2utt.$job\n  done\nfi\n\nif [ $stage -le 1 ]; then\n  info=$dir/nnet_info\n  nnet-am-info $srcdir/final.mdl >$info\n  nc=$(grep num-components $info | awk '{print $2}');\n  if grep SumGroupComponent $info >/dev/null; then \n    nc_truncate=$[$nc-3]  # we did mix-up: remove AffineComponent,\n                          # SumGroupComponent, SoftmaxComponent\n  else\n    nc_truncate=$[$nc-2]  # remove AffineComponent, SoftmaxComponent\n  fi\n  nnet-to-raw-nnet --truncate=$nc_truncate $srcdir/final.mdl $dir/nnet.raw\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: dumping neural net activations\"\n\n  # The next line is a no-op unless $dir/feats/storage/ exists; see utils/create_split_dir.pl.\n  for j in $(seq $nj); do  utils/create_data_link.pl $dir/feats/feats.$j.ark; done\n\n  if [ -f $data/segments ]; then\n    wav_rspecifier=\"ark,s,cs:extract-segments scp,p:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- |\"\n  else\n    wav_rspecifier=\"scp,p:$sdata/JOB/wav.scp\"\n  fi\n  $cmd JOB=1:$nj $dir/log/dump_activations.JOB.log \\\n    online2-wav-dump-features  --config=$dir/conf/online_feature_pipeline.conf \\\n      ark:$dir/spk2utt_fake/spk2utt.JOB \"$wav_rspecifier\" ark:- \\| \\\n    nnet-compute $dir/nnet.raw ark:- ark:- \\| \\\n    copy-feats --compress=true ark:- \\\n      
ark,scp:$dir/feats/feats.JOB.ark,$dir/feats/feats.JOB.scp || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: combining activations across jobs\"\n  mkdir -p $dir/data\n  cp -r $data/* $dir/data\n  for j in $(seq $nj); do cat $dir/feats/feats.$j.scp; done >$dir/data/feats.scp || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: computing [fake] CMVN stats.\"\n  # We shouldn't actually be doing CMVN, but the get_egs.sh script expects it,\n  # so create fake CMVN stats.\n  steps/compute_cmvn_stats.sh --fake $dir/data $dir/log $dir/feats || exit 1\nfi\n\n\necho \"$0: done.  Output is in $dir/data/feats.scp\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/extract_ivectors.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright     2013  Daniel Povey\n# Apache 2.0.\n\n\n# This script computes iVectors in the same format as extract_ivectors_online.sh,\n# except that they are actually not really computed online, they are first computed\n# per speaker and just duplicated many times.\n# This is mainly intended for use in decoding, where you want the best possible\n# quality of iVectors.\n#\n# This setup also makes it possible to use a previous decoding or alignment, to\n# down-weight silence in the stats (default is --silence-weight 0.0).\n#\n# This is for when you use the \"online-decoding\" setup in an offline task, and\n# you want the best possible results.\n\n\n# Begin configuration section.\nnj=30\ncmd=\"run.pl\"\nstage=0\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\nivector_period=10\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.  Making this small during iVector\n                    # extraction is equivalent to scaling up the prior, and will\n                    # will tend to produce smaller iVectors where data-counts are\n                    # small.  It's not so important that this match the value\n                    # used when training the iVector extractor, but more important\n                    # that this match the value used when you do real online decoding\n                    # with the neural nets trained with these iVectors.\nmax_count=100       # Interpret this as a number of frames times posterior scale...\n                    # this config ensures that once the count exceeds this (i.e.\n                    # 1000 frames, or 10 seconds, by default), we start to scale\n                    # down the stats, accentuating the prior term.   
This seems quite\n                    # important for some reason.\nsub_speaker_frames=0  # If >0, during iVector estimation we split each speaker\n                      # into possibly many 'sub-speakers', each with at least\n                      # this many frames of speech (evaluated after applying\n                      # silence_weight, so will typically exclude silence).\n                      # e.g. set this to 1000, and it will require at least 10 seconds\n                      # of speech per sub-speaker.\n\ncompress=true       # If true, compress the iVectors stored on disk (it's lossy\n                    # compression, as used for feature matrices).\nsilence_weight=0.0\nacwt=0.1  # used if input is a decode dir, to get best path from lattices.\nmdl=final  # change this if decode directory did not have ../final.mdl present.\nnum_threads=1 # Number of threads used by ivector-extract.  It is usually not\n              # helpful to set this to > 1.  It is only useful if you have\n              # fewer speakers than the number of jobs you want to run.\n\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 4 ] && [ $# != 5 ]; then\n  echo \"Usage: $0 [options] <data> <lang> <extractor-dir> [<alignment-dir>|<decode-dir>|<weights-archive>] <ivector-dir>\"\n  echo \" e.g.: $0 data/test data/lang exp/nnet2_online/extractor exp/tri3/decode_test exp/nnet2_online/ivectors_test\"\n  echo \"If <alignment-dir|decode-dir> is provided, it is converted to frame-weights \"\n  echo \"giving silence frames a weight of --silence-weight (default: 0.0). 
\"\n  echo \"If <weights-archive> is provided, it must be a single archive file compressed \"\n  echo \"(using gunzip) containing per-frame weights for each utterance.\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <n|10>                                      # Number of jobs (also see num-threads)\"\n  echo \"  --num-threads <n|1>                              # Number of threads for each job\"\n  echo \"                                                   # Ignored if <alignment-dir> or <decode-dir> supplied.\"\n  echo \"  --stage <stage|0>                                # To control partial reruns\"\n  echo \"  --num-gselect <n|5>                              # Number of Gaussians to select using\"\n  echo \"                                                   # diagonal model.\"\n  echo \"  --min-post <float;default=0.025>                 # Pruning threshold for posteriors\"\n  echo \"  --ivector-period <int;default=10>                # How often to extract an iVector (frames)\"\n  echo \"  --posterior-scale <float;default=0.1>            # Scale on posteriors in iVector extraction; \"\n  echo \"                                                   # affects strength of prior term.\"\n\n  exit 1;\nfi\n\nif [ $# -eq 4 ]; then\n  data=$1\n  lang=$2\n  srcdir=$3\n  dir=$4\nelse # 5 arguments\n  data=$1\n  lang=$2\n  srcdir=$3\n  ali_or_decode_dir_or_weights=$4\n  dir=$5\nfi\n\nfor f in $data/feats.scp $srcdir/final.ie $srcdir/final.dubm $srcdir/global_cmvn.stats $srcdir/splice_opts \\\n  $lang/phones.txt $srcdir/online_cmvn.conf $srcdir/final.mat; do\n  [ ! -f $f ] && echo \"$0: No such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log\nsilphonelist=$(cat $lang/phones/silence.csl) || exit 1;\n\nif [ ! 
-z \"$ali_or_decode_dir_or_weights\" ]; then\n\n\n  if [ -f $ali_or_decode_dir_or_weights/ali.1.gz ]; then\n    if [ ! -f $ali_or_decode_dir_or_weights/${mdl}.mdl ]; then\n      echo \"$0: expected $ali_or_decode_dir_or_weights/${mdl}.mdl to exist.\"\n      exit 1;\n    fi\n    nj_orig=$(cat $ali_or_decode_dir_or_weights/num_jobs) || exit 1;\n\n    if [ $stage -le 0 ]; then\n      rm $dir/weights.*.gz 2>/dev/null\n\n      $cmd JOB=1:$nj_orig  $dir/log/ali_to_post.JOB.log \\\n        gunzip -c $ali_or_decode_dir_or_weights/ali.JOB.gz \\| \\\n        ali-to-post ark:- ark:- \\| \\\n        weight-silence-post $silence_weight $silphonelist $ali_or_decode_dir_or_weights/${mdl}.mdl ark:- ark:- \\| \\\n        post-to-weights ark:- \"ark:|gzip -c >$dir/weights.JOB.gz\" || exit 1;\n\n      # put all the weights in one archive.\n      for j in $(seq $nj_orig); do gunzip -c $dir/weights.$j.gz; done | gzip -c >$dir/weights.gz || exit 1;\n      rm $dir/weights.*.gz || exit 1;\n    fi\n\n  elif [ -f $ali_or_decode_dir_or_weights/lat.1.gz ]; then\n    nj_orig=$(cat $ali_or_decode_dir_or_weights/num_jobs) || exit 1;\n    if [ ! 
-f $ali_or_decode_dir_or_weights/../${mdl}.mdl ]; then\n      echo \"$0: expected $ali_or_decode_dir_or_weights/../${mdl}.mdl to exist.\"\n      exit 1;\n    fi\n\n\n    if [ $stage -le 0 ]; then\n      rm $dir/weights.*.gz 2>/dev/null\n\n      $cmd JOB=1:$nj_orig  $dir/log/lat_to_post.JOB.log \\\n        lattice-best-path --acoustic-scale=$acwt \"ark:gunzip -c $ali_or_decode_dir_or_weights/lat.JOB.gz|\" ark:/dev/null ark:- \\| \\\n        ali-to-post ark:- ark:- \\| \\\n        weight-silence-post $silence_weight $silphonelist $ali_or_decode_dir_or_weights/../${mdl}.mdl ark:- ark:- \\| \\\n        post-to-weights ark:- \"ark:|gzip -c >$dir/weights.JOB.gz\" || exit 1;\n\n      # put all the weights in one archive.\n      for j in $(seq $nj_orig); do gunzip -c $dir/weights.$j.gz; done | gzip -c >$dir/weights.gz || exit 1;\n      rm $dir/weights.*.gz || exit 1;\n    fi\n  elif [ -f $ali_or_decode_dir_or_weights ] && gunzip -c $ali_or_decode_dir_or_weights >/dev/null; then\n    cp $ali_or_decode_dir_or_weights $dir/weights.gz || exit 1;\n  else\n    echo \"$0: expected ali.1.gz or lat.1.gz to exist in $ali_or_decode_dir_or_weights\";\n    exit 1;\n  fi\nfi\n\nsdata=$data/split$nj;\nutils/split_data.sh $data $nj || exit 1;\n\necho $ivector_period > $dir/ivector_period || exit 1;\nsplice_opts=$(cat $srcdir/splice_opts)\n\ngmm_feats=\"ark,s,cs:apply-cmvn-online --spk2utt=ark:$sdata/JOB/spk2utt --config=$srcdir/online_cmvn.conf $srcdir/global_cmvn.stats scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\nfeats=\"ark,s,cs:splice-feats $splice_opts scp:$sdata/JOB/feats.scp ark:- | transform-feats $srcdir/final.mat ark:- ark:- |\"\n\n# This adds online-cmvn in $feats, upon request (configuration taken from UBM),\n[ -f $srcdir/online_cmvn_iextractor ] && feats=\"$gmm_feats\"\n\n\nif [ $sub_speaker_frames -gt 0 ]; then\n\n  if [ $stage -le 1 ]; then\n  # We work out 'fake' spk2utt files that possibly split 
each speaker into multiple pieces.\n    if [ ! -z \"$ali_or_decode_dir_or_weights\" ]; then\n      gunzip -c $dir/weights.gz | copy-vector ark:- ark,t:- | \\\n        awk '{ sum=0; for (n=3;n<NF;n++) sum += $n; print $1, sum; }' > $dir/utt_counts || exit 1;\n    else\n      feat-to-len scp:$data/feats.scp ark,t:- > $dir/utt_counts || exit 1;\n    fi\n    if ! [ $(wc -l <$dir/utt_counts) -eq $(wc -l <$data/feats.scp) ]; then\n      echo \"$0: error getting per-utterance counts.\"\n      exit 1;\n    fi\n    cat $data/spk2utt | python -c \"\nimport sys\nutt_counts = {}\ntrash = list(map(lambda x: utt_counts.update({x.split()[0]:float(x.split()[1])}), open('$dir/utt_counts').readlines()))\nsub_speaker_frames = $sub_speaker_frames\nlines = sys.stdin.readlines()\ntotal_counts = {}\nfor line in lines:\n  parts = line.split()\n  spk = parts[0]\n  total_counts[spk] = 0\n  for utt in parts[1:]:\n    total_counts[spk] += utt_counts[utt]\n\nfor line_index in range(len(lines)):\n  line = lines[line_index]\n  parts = line.split()\n  spk = parts[0]\n\n  numeric_id=0\n  current_count = 0\n  covered_count = 0\n  current_utts = []\n  for utt in parts[1:]:\n    try:\n      current_count += utt_counts[utt]\n      covered_count += utt_counts[utt]\n    except KeyError:\n      raise Exception('No count found for the utterance {0}.'.format(utt))\n    current_utts.append(utt)\n    if ((current_count >= $sub_speaker_frames) and ((total_counts[spk] - covered_count) >= $sub_speaker_frames)) or (utt == parts[-1]):\n      spk_partial = '{0}-{1:06x}'.format(spk, numeric_id)\n      numeric_id += 1\n      print ('{0} {1}'.format(spk_partial, ' '.join(current_utts)))\n      current_utts = []\n      current_count = 0\n\"> $dir/spk2utt || exit 1;\n    mkdir -p $dir/split$nj\n    # create split versions of our spk2utt file.\n    for j in $(seq $nj); do\n      mkdir -p $dir/split$nj/$j\n      utils/filter_scp.pl -f 2 $sdata/$j/utt2spk <$dir/spk2utt >$dir/split$nj/$j/spk2utt || exit 1;\n      
utils/spk2utt_to_utt2spk.pl <$dir/split$nj/$j/spk2utt >$dir/split$nj/$j/utt2spk || exit 1;\n    done\n  fi\n  this_sdata=$dir/split$nj\nelse\n  this_sdata=$sdata\nfi\n\nif [ $stage -le 2 ]; then\n  if [ ! -z \"$ali_or_decode_dir_or_weights\" ]; then\n    $cmd --num-threads $num_threads JOB=1:$nj $dir/log/extract_ivectors.JOB.log \\\n      gmm-global-get-post --n=$num_gselect --min-post=$min_post $srcdir/final.dubm \"$gmm_feats\" ark:- \\| \\\n      weight-post ark:- \"ark,s,cs:gunzip -c $dir/weights.gz|\" ark:- \\| \\\n      ivector-extract --num-threads=$num_threads --acoustic-weight=$posterior_scale --compute-objf-change=true \\\n        --max-count=$max_count --spk2utt=ark:$this_sdata/JOB/spk2utt \\\n      $srcdir/final.ie \"$feats\" ark,s,cs:- ark,t:$dir/ivectors_spk.JOB.ark || exit 1;\n  else\n    $cmd --num-threads $num_threads JOB=1:$nj $dir/log/extract_ivectors.JOB.log \\\n      gmm-global-get-post --n=$num_gselect --min-post=$min_post $srcdir/final.dubm \"$gmm_feats\" ark:- \\| \\\n      ivector-extract --num-threads=$num_threads --acoustic-weight=$posterior_scale --compute-objf-change=true \\\n        --max-count=$max_count --spk2utt=ark:$this_sdata/JOB/spk2utt \\\n      $srcdir/final.ie \"$feats\" ark,s,cs:- ark,t:$dir/ivectors_spk.JOB.ark || exit 1;\n  fi\nfi\n\n# get an utterance-level set of iVectors (just duplicate the speaker-level ones).\n# note: if $this_sdata is set $dir/split$nj, then these won't be real speakers, they'll\n# be \"sub-speakers\" (speakers split up into multiple utterances).\nif [ $stage -le 3 ]; then\n  for j in $(seq $nj); do\n    utils/apply_map.pl -f 2 $dir/ivectors_spk.$j.ark <$this_sdata/$j/utt2spk >$dir/ivectors_utt.$j.ark || exit 1;\n  done\nfi\n\nivector_dim=$[$(head -n 1 $dir/ivectors_spk.1.ark | wc -w) - 3] || exit 1;\necho  \"$0: iVector dim is $ivector_dim\"\n\nbase_feat_dim=$(feat-to-dim scp:$data/feats.scp -) || exit 
1;\n\nstart_dim=$base_feat_dim\nend_dim=$[$base_feat_dim+$ivector_dim-1]\nabsdir=$(utils/make_absolute.sh $dir)\n\nif [ $stage -le 4 ]; then\n  # here, we are just using the original features in $sdata/JOB/feats.scp for\n  # their number of rows; we use the select-feats command to remove those\n  # features and retain only the iVector features.\n  $cmd JOB=1:$nj $dir/log/duplicate_feats.JOB.log \\\n    append-vector-to-feats scp:$sdata/JOB/feats.scp ark:$dir/ivectors_utt.JOB.ark ark:- \\| \\\n    select-feats \"$start_dim-$end_dim\" ark:- ark:- \\| \\\n    subsample-feats --n=$ivector_period ark:- ark:- \\| \\\n    copy-feats --compress=$compress ark:- \\\n    ark,scp:$absdir/ivector_online.JOB.ark,$absdir/ivector_online.JOB.scp || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: combining iVectors across jobs\"\n  for j in $(seq $nj); do cat $dir/ivector_online.$j.scp; done >$dir/ivector_online.scp || exit 1;\nfi\n\nsteps/nnet2/get_ivector_id.sh $srcdir > $dir/final.ie.id || exit 1\n\necho \"$0: done extracting (pseudo-online) iVectors to $dir using the extractor in $srcdir.\"\n\n"
  },
  {
    "path": "egs/steps/online/nnet2/extract_ivectors_online.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright     2013  Daniel Povey\n# Apache 2.0.\n\nset -o pipefail\n\n# This script extracts iVectors for a set of utterances, given\n# features and a trained iVector extractor.\n\n# The script is based on ^/egs/sre08/v1/sid/extract_ivectors.sh.  Instead of\n# extracting a single iVector per utterance, it extracts one every few frames\n# (controlled by the --ivector-period option, e.g. 10, which is to save compute).\n# This is used in training (and not-really-online testing) of neural networks\n# for online decoding.\n\n# Rather than treating each utterance separately, it carries forward\n# information from one utterance to the next, within the speaker.\n\n\n# Begin configuration section.\nnj=30\ncmd=\"run.pl\"\nstage=0\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\nivector_period=10\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.  Making this small during iVector\n                    # extraction is equivalent to scaling up the prior, and will\n                    # will tend to produce smaller iVectors where data-counts are\n                    # small.  It's not so important that this match the value\n                    # used when training the iVector extractor, but more important\n                    # that this match the value used when you do real online decoding\n                    # with the neural nets trained with these iVectors.\ncompress=true       # If true, compress the iVectors stored on disk (it's lossy\n                    # compression, as used for feature matrices).\nmax_count=0         # The use of this option (e.g. 
--max-count 100) can make\n                    # iVectors more consistent for different lengths of\n                    # utterance, by scaling up the prior term when the\n                    # data-count exceeds this value.  The data-count is after\n                    # posterior-scaling, so assuming the posterior-scale is 0.1,\n                    # --max-count 100 starts having effect after 1000 frames, or\n                    # 10 seconds of data.\nuse_vad=false\n\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [options] <data> <extractor-dir> <ivector-dir>\"\n  echo \" e.g.: $0 data/train exp/nnet2_online/extractor exp/nnet2_online/ivectors_train\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <n|10>                                      # Number of jobs\"\n  echo \"  --stage <stage|0>                                # To control partial reruns\"\n  echo \"  --num-gselect <n|5>                              # Number of Gaussians to select using\"\n  echo \"                                                   # diagonal model.\"\n  echo \"  --min-post <float;default=0.025>                 # Pruning threshold for posteriors\"\n  echo \"  --ivector-period <int;default=10>                # How often to extract an iVector (frames)\"\n  exit 1;\nfi\n\ndata=$1\nsrcdir=$2\ndir=$3\n\nextra_files=\nif $use_vad; then\n  extra_files=$data/vad.scp\nfi\n\nfor f in $data/feats.scp $srcdir/final.ie $srcdir/final.dubm $srcdir/global_cmvn.stats $srcdir/splice_opts \\\n     $srcdir/online_cmvn.conf $srcdir/final.mat $extra_files; do\n  [ ! 
-f $f ] && echo \"$0: No such file $f\" && exit 1;\ndone\n\n# Set various variables.\nmkdir -p $dir/log $dir/conf\n\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n#utils/split_data.sh $data $nj || exit 1;\n\necho $ivector_period > $dir/ivector_period || exit 1;\nsplice_opts=$(cat $srcdir/splice_opts)\n\n# the program ivector-extract-online2 does a bunch of stuff in memory and is\n# config-driven...  this was easier in this case because the same code is\n# involved in online decoding.  We need to create a config file for iVector\n# extraction.\n\nieconf=$dir/conf/ivector_extractor.conf\necho -n >$ieconf\ncp $srcdir/online_cmvn.conf $dir/conf/ || exit 1;\necho \"--cmvn-config=$dir/conf/online_cmvn.conf\" >>$ieconf\nfor x in $(echo $splice_opts); do echo \"$x\"; done > $dir/conf/splice.conf\necho \"--ivector-period=$ivector_period\" >>$ieconf\necho \"--splice-config=$dir/conf/splice.conf\" >>$ieconf\necho \"--lda-matrix=$srcdir/final.mat\" >>$ieconf\necho \"--global-cmvn-stats=$srcdir/global_cmvn.stats\" >>$ieconf\necho \"--diag-ubm=$srcdir/final.dubm\" >>$ieconf\necho \"--ivector-extractor=$srcdir/final.ie\" >>$ieconf\necho \"--num-gselect=$num_gselect\"  >>$ieconf\necho \"--min-post=$min_post\" >>$ieconf\necho \"--posterior-scale=$posterior_scale\" >>$ieconf\necho \"--max-remembered-frames=1000\" >>$ieconf # the default\necho \"--max-count=$max_count\" >>$ieconf\n[ -f $srcdir/online_cmvn_iextractor ] && echo \"--online-cmvn-iextractor=true\" >>$ieconf\n\n\nabsdir=$(utils/make_absolute.sh $dir)\n\nfor n in $(seq $nj); do\n  # This will do nothing unless the directory $dir/storage exists;\n  # it can be used to distribute the data among multiple machines.\n  utils/create_data_link.pl $dir/ivector_online.$n.ark\ndone\n\nif [ $stage -le 0 ]; then\n  echo \"$0: extracting iVectors\"\n  extra_opts=\n  if $use_vad; then\n    extra_opts=\"--frame-weights-rspecifier=scp:$data/vad.scp\"\n  fi\n\n  $cmd JOB=1:$nj 
$dir/log/extract_ivectors.JOB.log \\\n    ivector-extract-online2 --config=$ieconf $extra_opts \\\n      ark:$sdata/JOB/spk2utt scp:$sdata/JOB/feats.scp ark:- \\| \\\n    copy-feats --compress=$compress ark:- \\\n      ark,scp:$absdir/ivector_online.JOB.ark,$absdir/ivector_online.JOB.scp || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: combining iVectors across jobs\"\n  for j in $(seq $nj); do cat $dir/ivector_online.$j.scp; done >$dir/ivector_online.scp || exit 1;\nfi\n\nsteps/nnet2/get_ivector_id.sh $srcdir > $dir/final.ie.id || exit 1\n\necho \"$0: done extracting (online) iVectors to $dir using the extractor in $srcdir.\"\n\n"
  },
  {
    "path": "egs/steps/online/nnet2/get_egs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This is modified from ../../nnet2/get_egs.sh.\n# This script combines the\n# nnet-example extraction with the feature extraction directly from wave files;\n# it uses the program online2-wav-dump-feature to do all parts of feature\n# extraction: MFCC/PLP/fbank, possibly plus pitch, plus iVectors.  This script\n# is intended mostly for cross-system training for online decoding, where you\n# initialize the nnet from an existing, larger system.\n\n\n# Begin configuration section.\ncmd=run.pl\nnum_utts_subset=300    # number of utterances in validation and training\n                       # subsets used for shrinkage and diagnostics\nnum_valid_frames_combine=0 # #valid frames for combination weights at the very end.\nnum_train_frames_combine=10000 # # train frames for the above.\nnum_frames_diagnostic=4000 # number of frames for \"compute_prob\" jobs\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  This is just a guideline; it will pick a number\n                        # that divides the number of samples in the entire data.\ntransform_dir=     # If supplied, overrides alidir\nnum_jobs_nnet=16    # Number of neural net jobs to run in parallel\nstage=0\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time.\nrandom_copy=false\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/online/nnet2/get_egs.sh [opts] <data> <ali-dir> <online-nnet-dir> <exp-dir>\"\n  echo \" e.g.: steps/online/nnet2/get_egs.sh data/train exp/tri3_ali exp/nnet2_online/nnet_a_gpu_online/ exp/tri4_nnet\"\n  echo \"In <online-nnet-dir>, it looks for final.mdl (need to compute required left and right context),\"\n  echo \"and a configuration file conf/online_nnet2_decoding.conf which describes the features.\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-jobs-nnet <num-jobs;16>                    # Number of parallel jobs to use for main neural net\"\n  echo \"                                                   # training (will affect results as well as speed; try 8, 16)\"\n  echo \"                                                   # Note: if you increase this, you may want to also increase\"\n  echo \"                                                   # the learning rate.\"\n  echo \"  --samples-per-iter <#samples;400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --feat-type <lda|raw>                            # (by default it tries to guess).  
The feature type you want\"\n  echo \"                                                   # to use as input to the neural net.\"\n  echo \"  --splice-width <width;4>                         # Number of frames on each side to append for feature input\"\n  echo \"                                                   # (note: we splice processed, typically 40-dimensional frames\"\n  echo \"  --num-frames-diagnostic <#frames;4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames;10000>       # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n\n  exit 1;\nfi\n\ndata=$1\nalidir=$2\nonline_nnet_dir=$3\ndir=$4\n\n\nmdl=$online_nnet_dir/final.mdl # only needed for left and right context.\nfeature_conf=$online_nnet_dir/conf/online_nnet2_decoding.conf\n\nfor f in $data/wav.scp $alidir/ali.1.gz $alidir/final.mdl $alidir/tree $feature_conf $mdl; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log\ncp $alidir/tree $dir\ngrep -v '^--endpoint' $feature_conf >$dir/feature.conf || exit 1;\n\n# Get list of validation utterances.\nmkdir -p $dir/valid $dir/train_subset\n\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/valid/uttlist || exit 1;\n\nif [ -f $data/utt2uniq ]; then\n  echo \"File $data/utt2uniq exists, so augmenting valid/uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid/uttlist $dir/valid/uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid/uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid/uttlist\n  rm $dir/uniq2utt $dir/valid/uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid/uttlist | \\\n   utils/shuffle_list.pl | head -$num_utts_subset > $dir/train_subset/uttlist || exit 1;\n\n\nfor subdir in valid train_subset; do\n  # In order for the iVector extraction to work right, we need to process all\n  # utterances of the speakers which have utterances in valid/uttlist, and the\n  # same for train_subset/uttlist.  
We produce $dir/valid/uttlist_extended which\n  # will contain all utterances of all speakers which have utterances in\n  # $dir/valid/uttlist, and the same for $dir/train_subset/.\n\n  utils/filter_scp.pl $dir/$subdir/uttlist <$data/utt2spk | awk '{print $2}' > $dir/$subdir/spklist || exit 1;\n  utils/filter_scp.pl -f 2 $dir/$subdir/spklist <$data/utt2spk >$dir/$subdir/utt2spk || exit 1;\n  utils/utt2spk_to_spk2utt.pl <$dir/$subdir/utt2spk >$dir/$subdir/spk2utt || exit 1;\n  awk '{print $1}' <$dir/$subdir/utt2spk >$dir/$subdir/uttlist_extended || exit 1;\n  rm $dir/$subdir/spklist\ndone\n\nif [ -f $data/segments ]; then\n  # note: in the feature extraction, because the program online2-wav-dump-features is sensitive to the\n  # previous utterances within a speaker, we do the filtering after extracting the features.\n  echo \"$0 [info]: segments file exists: using that.\"\n  feats=\"ark,s,cs:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt ark,s,cs:- ark:- | subset-feats --exclude=$dir/valid/uttlist ark:- ark:- |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid/uttlist_extended $data/segments  | extract-segments scp:$data/wav.scp - ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/valid/spk2utt ark,s,cs:- ark:- | subset-feats --include=$dir/valid/uttlist ark:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset/uttlist_extended $data/segments  | extract-segments scp:$data/wav.scp - ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/train_subset/spk2utt ark,s,cs:- ark:- | subset-feats --include=$dir/train_subset/uttlist ark:- ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists, using wav.scp.\"\n  feats=\"ark,s,cs:online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt scp:$sdata/JOB/wav.scp ark:- | subset-feats --exclude=$dir/valid/uttlist ark:- ark:- |\"\n  
valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid/uttlist_extended $data/wav.scp | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/valid/spk2utt scp:- ark:- | subset-feats --include=$dir/valid/uttlist ark:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset/uttlist_extended $data/wav.scp | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/train_subset/spk2utt scp:- ark:- | subset-feats --include=$dir/train_subset/uttlist ark:- ark:- |\"\nfi\n\nivector_dim=$(online2-wav-dump-features --config=$dir/feature.conf --print-ivector-dim=true) || exit 1;\n\n! [ $ivector_dim -ge 0 ] && echo \"$0: error getting iVector dim\" && exit 1;\n\n\nif [ $stage -le 0 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/num_frames\nelse\n  num_frames=`cat $dir/num_frames` || exit 1;\nfi\n\n# Working out number of iterations per epoch.\niters_per_epoch=`perl -e \"print int($num_frames/($samples_per_iter * $num_jobs_nnet) + 0.5);\"` || exit 1;\n[ $iters_per_epoch -eq 0 ] && iters_per_epoch=1\nsamples_per_iter_real=$[$num_frames/($num_jobs_nnet*$iters_per_epoch)]\necho \"$0: Every epoch, splitting the data up into $iters_per_epoch iterations,\"\necho \"$0: giving samples-per-iteration of $samples_per_iter_real (you requested $samples_per_iter).\"\n\n# Making soft links to storage directories.  This is a no-op unless\n# the subdirectory $dir/egs/storage/ exists.  
See utils/create_split_dir.pl\nfor x in `seq 1 $num_jobs_nnet`; do\n  for y in `seq 0 $[$iters_per_epoch-1]`; do\n    utils/create_data_link.pl $dir/egs/egs.$x.$y.ark\n    utils/create_data_link.pl $dir/egs/egs_tmp.$x.$y.ark\n  done\n  for y in `seq 1 $nj`; do\n    utils/create_data_link.pl $dir/egs/egs_orig.$x.$y.ark\n  done\ndone\n\nremove () { for x in $*; do [ -L $x ] && rm $(utils/make_absolute.sh $x); rm $x; done }\n\nset -o pipefail\nleft_context=$(nnet-am-info $mdl | grep '^left-context' | awk '{print $2}') || exit 1;\nright_context=$(nnet-am-info $mdl | grep '^right-context' | awk '{print $2}') || exit 1;\nnnet_context_opts=\"--left-context=$left_context --right-context=$right_context\"\nset +o pipefail\n\nmkdir -p $dir/egs\n\nif [ $stage -le 2 ]; then\n  rm $dir/.error 2>/dev/null\n\n  echo \"$0: extracting validation and training-subset alignments.\"\n  set -o pipefail;\n  for id in $(seq $nj); do gunzip -c $alidir/ali.$id.gz; done | \\\n    copy-int-vector ark:- ark,t:- | \\\n    utils/filter_scp.pl <(cat $dir/valid/uttlist $dir/train_subset/uttlist) | \\\n    gzip -c >$dir/ali_special.gz || exit 1;\n  set +o pipefail; # unset the pipefail option.\n\n  echo \"Getting validation and training subset examples.\"\n  $cmd $dir/log/create_valid_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$valid_feats\" \\\n     \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/egs/valid_all.egs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$train_subset_feats\" \\\n    \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/egs/train_subset_all.egs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && exit 1;\n  echo \"Getting subsets of validation examples for diagnostics and combination.\"\n  $cmd 
$dir/log/create_valid_subset_combine.log \\\n    nnet-subset-egs --n=$num_valid_frames_combine ark:$dir/egs/valid_all.egs \\\n        ark:$dir/egs/valid_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/egs/valid_all.egs \\\n    ark:$dir/egs/valid_diagnostic.egs || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet-subset-egs --n=$num_train_frames_combine ark:$dir/egs/train_subset_all.egs \\\n    ark:$dir/egs/train_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/egs/train_subset_all.egs \\\n    ark:$dir/egs/train_diagnostic.egs || touch $dir/.error &\n  wait\n  [ -f $dir/.error ] && echo \"Error detected while creating egs\" && exit 1;\n  cat $dir/egs/valid_combine.egs $dir/egs/train_combine.egs > $dir/egs/combine.egs\n\n  for f in $dir/egs/{combine,train_diagnostic,valid_diagnostic}.egs; do\n    [ ! 
-s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n  rm $dir/egs/valid_all.egs $dir/egs/train_subset_all.egs $dir/egs/{train,valid}_combine.egs $dir/ali_special.gz\nfi\n\nif [ $stage -le 3 ]; then\n\n  # Other scripts might need to know the following info:\n  echo $num_jobs_nnet >$dir/egs/num_jobs_nnet\n  echo $iters_per_epoch >$dir/egs/iters_per_epoch\n  echo $samples_per_iter_real >$dir/egs/samples_per_iter\n\n  echo \"Creating training examples\";\n  # in $dir/egs, create $num_jobs_nnet separate files with training examples.\n  # The order is not randomized at this point.\n\n  egs_list=\n  for n in `seq 1 $num_jobs_nnet`; do\n    egs_list=\"$egs_list ark:$dir/egs/egs_orig.$n.JOB.ark\"\n  done\n  echo \"Generating training examples on disk\"\n  # The examples will go round-robin to egs_list.\n  $cmd $io_opts JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$feats\" \\\n    \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" ark:- \\| \\\n    nnet-copy-egs ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: rearranging examples into parts for different parallel jobs\"\n  # combine all the \"egs_orig.JOB.*.scp\" (over the $nj splits of the data) and\n  # then split into multiple parts egs.JOB.*.scp for different parts of the\n  # data, 0 .. 
$iters_per_epoch-1.\n\n  if [ $iters_per_epoch -eq 1 ]; then\n    echo \"$0: Since iters-per-epoch == 1, just concatenating the data.\"\n    for n in `seq 1 $num_jobs_nnet`; do\n      cat $dir/egs/egs_orig.$n.*.ark > $dir/egs/egs_tmp.$n.0.ark || exit 1;\n      remove $dir/egs/egs_orig.$n.*.ark\n    done\n  else # We'll have to split it up using nnet-copy-egs.\n    egs_list=\n    for n in `seq 0 $[$iters_per_epoch-1]`; do\n      egs_list=\"$egs_list ark:$dir/egs/egs_tmp.JOB.$n.ark\"\n    done\n    # note, the \"|| true\" below is a workaround for NFS bugs\n    # we encountered running this script with Debian-7, NFS-v4.\n    $cmd $io_opts JOB=1:$num_jobs_nnet $dir/log/split_egs.JOB.log \\\n      nnet-copy-egs --random=$random_copy --srand=JOB \\\n        \"ark:cat $dir/egs/egs_orig.JOB.*.ark|\" $egs_list || exit 1;\n    remove $dir/egs/egs_orig.*.*.ark  2>/dev/null\n  fi\nfi\n\nif [ $stage -le 5 ]; then\n  # Next, shuffle the order of the examples in each of those files.\n  # Each one should not be too large, so we can do this in memory.\n  echo \"Shuffling the order of training examples\"\n  echo \"(in order to avoid stressing the disk, these won't all run at once).\"\n\n  for n in `seq 0 $[$iters_per_epoch-1]`; do\n    $cmd $io_opts JOB=1:$num_jobs_nnet $dir/log/shuffle.$n.JOB.log \\\n      nnet-shuffle-egs \"--srand=\\$[JOB+($num_jobs_nnet*$n)]\" \\\n      ark:$dir/egs/egs_tmp.JOB.$n.ark ark:$dir/egs/egs.JOB.$n.ark\n    remove $dir/egs/egs_tmp.*.$n.ark\n  done\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/get_egs2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#\n# This is modified from ../../nnet2/get_egs2.sh.  [note: get_egs2.sh is as get_egs.sh,\n# but uses the newer, more compact way of writing egs, where we write multiple\n# frames of labels in order to share the context.]\n# This script combines the\n# nnet-example extraction with the feature extraction directly from wave files;\n# it uses the program online2-wav-dump-features to do all parts of feature\n# extraction: MFCC/PLP/fbank, possibly plus pitch, plus iVectors.  This script\n# is intended mostly for cross-system training for online decoding, where you\n# initialize the nnet from an existing, larger system.\n#\n\n# Begin configuration section.\ncmd=run.pl\nframes_per_eg=8   # number of frames of labels per example.  more->less disk space and\n                  # less time preparing egs, but more I/O during training.\n                  # note: the script may reduce this if reduce_frames_per_eg is true.\n\nreduce_frames_per_eg=true  # If true, this script may reduce the frames_per_eg\n                           # if there is only one archive and even with the\n                           # reduced frames_per_eg, the number of\n                           # samples_per_iter that would result is less than or\n                           # equal to the user-specified value.\nnum_utts_subset=300     # number of utterances in validation and training\n                        # subsets used for shrinkage and diagnostics.\nnum_valid_frames_combine=0 # #valid frames for combination weights at the very end.\nnum_train_frames_combine=10000 # # train frames for the above.\nnum_frames_diagnostic=4000 # number of frames for \"compute_prob\" jobs\nsamples_per_iter=400000 # each iteration of training, see this many samples\n                        # per job.  
This is just a guideline; it will pick a number\n                        # that divides the number of samples in the entire data.\n\nstage=0\nio_opts=\"--max-jobs-run 5\" # for jobs with a lot of I/O, limits the number running at one time. \nrandom_copy=false\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [opts] <data> <ali-dir> <online-nnet-dir> <egs-dir>\"\n  echo \" e.g.: $0 data/train exp/tri3_ali exp/nnet2_online/nnet_a_gpu_online/ exp/nnet2_online/nnet_b/egs\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl;utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --samples-per-iter <#samples;400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --feat-type <lda|raw>                            # (by default it tries to guess).  
The feature type you want\"\n  echo \"                                                   # to use as input to the neural net.\"\n  echo \"  --frames-per-eg <frames;8>                       # number of frames per eg on disk\"\n  echo \"  --num-frames-diagnostic <#frames;4000>           # Number of frames used in computing (train,valid) diagnostics\"\n  echo \"  --num-valid-frames-combine <#frames;10000>       # Number of frames used in getting combination weights at the\"\n  echo \"                                                   # very end.\"\n  echo \"  --stage <stage|0>                                # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  \n  exit 1;\nfi\n\ndata=$1\nalidir=$2\nonline_nnet_dir=$3\ndir=$4\n\nmdl=$online_nnet_dir/final.mdl # only needed for left and right context.\nfeature_conf=$online_nnet_dir/conf/online_nnet2_decoding.conf\n\n\nfor f in $data/wav.scp $alidir/ali.1.gz $alidir/final.mdl $alidir/tree $mdl $feature_conf; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\nnj=`cat $alidir/num_jobs` || exit 1;  # number of jobs in alignment dir...\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\nmkdir -p $dir/log $dir/info\n! cmp $alidir/tree $online_nnet_dir/tree && \\\n   echo \"$0: warning, tree from alignment dir does not match tree from online-nnet dir (OK if for multilingual)\"\ncp $alidir/tree $dir\ngrep -v '^--endpoint' $feature_conf >$dir/feature.conf || exit 1;\nmkdir -p $dir/valid $dir/train_subset\n\n# Get list of validation utterances. 
\nawk '{print $1}' $data/utt2spk | utils/shuffle_list.pl | head -$num_utts_subset \\\n    > $dir/valid/uttlist || exit 1;\n\nif [ -f $data/utt2uniq ]; then\n  echo \"File $data/utt2uniq exists, so augmenting valid/uttlist to\"\n  echo \"include all perturbed versions of the same 'real' utterances.\"\n  mv $dir/valid/uttlist $dir/valid/uttlist.tmp\n  utils/utt2spk_to_spk2utt.pl $data/utt2uniq > $dir/uniq2utt\n  cat $dir/valid/uttlist.tmp | utils/apply_map.pl $data/utt2uniq | \\\n    sort | uniq | utils/apply_map.pl $dir/uniq2utt | \\\n    awk '{for(n=1;n<=NF;n++) print $n;}' | sort  > $dir/valid/uttlist\n  rm $dir/uniq2utt $dir/valid/uttlist.tmp\nfi\n\nawk '{print $1}' $data/utt2spk | utils/filter_scp.pl --exclude $dir/valid/uttlist | \\\n  utils/shuffle_list.pl | head -$num_utts_subset > $dir/train_subset/uttlist || exit 1;\n\n\nfor subdir in valid train_subset; do\n  # In order for the iVector extraction to work right, we need to process all\n  # utterances of the speakers which have utterances in valid/uttlist, and the\n  # same for train_subset/uttlist.  
We produce $dir/valid/uttlist_extended which\n  # will contain all utterances of all speakers which have utterances in\n  # $dir/valid/uttlist, and the same for $dir/train_subset/.\n\n  utils/filter_scp.pl $dir/$subdir/uttlist <$data/utt2spk | awk '{print $2}' > $dir/$subdir/spklist || exit 1;\n  utils/filter_scp.pl -f 2 $dir/$subdir/spklist <$data/utt2spk >$dir/$subdir/utt2spk || exit 1;\n  utils/utt2spk_to_spk2utt.pl <$dir/$subdir/utt2spk >$dir/$subdir/spk2utt || exit 1;\n  awk '{print $1}' <$dir/$subdir/utt2spk >$dir/$subdir/uttlist_extended || exit 1;\n  rm $dir/$subdir/spklist\ndone\n\n\nif [ -f $data/segments ]; then\n  # note: in the feature extraction, because the program online2-wav-dump-features is sensitive to the\n  # previous utterances within a speaker, we do the filtering after extracting the features.\n  echo \"$0 [info]: segments file exists: using that.\"\n  feats=\"ark,s,cs:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt ark,s,cs:- ark:- | subset-feats --exclude=$dir/valid/uttlist ark:- ark:- |\"\n  valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid/uttlist_extended $data/segments  | extract-segments scp:$data/wav.scp - ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/valid/spk2utt ark,s,cs:- ark:- | subset-feats --include=$dir/valid/uttlist ark:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset/uttlist_extended $data/segments  | extract-segments scp:$data/wav.scp - ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/train_subset/spk2utt ark,s,cs:- ark:- | subset-feats --include=$dir/train_subset/uttlist ark:- ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists, using wav.scp.\"\n  feats=\"ark,s,cs:online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt scp:$sdata/JOB/wav.scp ark:- | subset-feats --exclude=$dir/valid/uttlist ark:- ark:- |\"\n  
valid_feats=\"ark,s,cs:utils/filter_scp.pl $dir/valid/uttlist_extended $data/wav.scp | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/valid/spk2utt scp:- ark:- | subset-feats --include=$dir/valid/uttlist ark:- ark:- |\"\n  train_subset_feats=\"ark,s,cs:utils/filter_scp.pl $dir/train_subset/uttlist_extended $data/wav.scp | online2-wav-dump-features --config=$dir/feature.conf ark:$dir/train_subset/spk2utt scp:- ark:- | subset-feats --include=$dir/train_subset/uttlist ark:- ark:- |\"\nfi\n\nivector_dim=$(online2-wav-dump-features --config=$dir/feature.conf --print-ivector-dim=true) || exit 1;\n\n! [ $ivector_dim -ge 0 ] && echo \"$0: error getting iVector dim\" && exit 1;\n\n\n\nset -o pipefail\nleft_context=$(nnet-am-info $mdl | grep '^left-context' | awk '{print $2}') || exit 1;\nright_context=$(nnet-am-info $mdl | grep '^right-context' | awk '{print $2}') || exit 1;\nset +o pipefail\n\n\nif [ $stage -le 0 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n  echo $num_frames > $dir/info/num_frames\nelse\n  num_frames=`cat $dir/info/num_frames` || exit 1;\nfi\n\n# the + 1 is to round up, not down... we assume it doesn't divide exactly.\nnum_archives=$[$num_frames/($frames_per_eg*$samples_per_iter)+1]\n# (for small data)- while reduce_frames_per_eg == true and the number of\n# archives is 1 and would still be 1 if we reduced frames_per_eg by 1, reduce it\n# by 1.\nreduced=false\nwhile $reduce_frames_per_eg && [ $frames_per_eg -gt 1 ] && \\\n  [ $[$num_frames/(($frames_per_eg-1)*$samples_per_iter)] -eq 0 ]; do\n  frames_per_eg=$[$frames_per_eg-1]\n  num_archives=1\n  reduced=true\ndone\n$reduced && echo \"$0: reduced frames_per_eg to $frames_per_eg because amount of data is small.\"\n\necho $num_archives >$dir/info/num_archives\necho $frames_per_eg >$dir/info/frames_per_eg\n\n# Working out number of egs per archive\negs_per_archive=$[$num_frames/($frames_per_eg*$num_archives)]\n! 
[ $egs_per_archive -le $samples_per_iter ] && \\\n  echo \"$0: script error: egs_per_archive=$egs_per_archive not <= samples_per_iter=$samples_per_iter\" \\\n  && exit 1;\n\necho $egs_per_archive > $dir/info/egs_per_archive\n\necho \"$0: creating $num_archives archives, each with $egs_per_archive egs, with\"\necho \"$0:   $frames_per_eg labels per example, and (left,right) context = ($left_context,$right_context)\"\n\n# Making soft links to storage directories.  This is a no-op unless\n# the subdirectory $dir/storage/ exists.  See utils/create_split_dir.pl\nfor x in `seq $num_archives`; do\n  utils/create_data_link.pl $dir/egs.$x.ark\n  for y in `seq $nj`; do\n    utils/create_data_link.pl $dir/egs_orig.$x.$y.ark\n  done\ndone\n\nnnet_context_opts=\"--left-context=$left_context --right-context=$right_context\"\n\nif [ $stage -le 2 ]; then\n  echo \"$0: Getting validation and training subset examples.\"\n  rm $dir/.error 2>/dev/null\n  echo \"$0: ... extracting validation and training-subset alignments.\"\n  set -o pipefail;\n  for id in $(seq $nj); do gunzip -c $alidir/ali.$id.gz; done | \\\n    copy-int-vector ark:- ark,t:- | \\\n    utils/filter_scp.pl <(cat $dir/valid/uttlist $dir/train_subset/uttlist) | \\\n    gzip -c >$dir/ali_special.gz || exit 1;\n  set +o pipefail; # unset the pipefail option.\n\n  $cmd $dir/log/create_valid_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$valid_feats\" \\\n    \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/valid_all.egs\" || touch $dir/.error &\n  $cmd $dir/log/create_train_subset.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts \"$train_subset_feats\" \\\n     \"ark,s,cs:gunzip -c $dir/ali_special.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" \\\n     \"ark:$dir/train_subset_all.egs\" || touch $dir/.error &\n  wait;\n  [ -f $dir/.error ] && echo \"Error detected while creating 
train/valid egs\" && exit 1;\n  echo \"... Getting subsets of validation examples for diagnostics and combination.\"\n  $cmd $dir/log/create_valid_subset_combine.log \\\n    nnet-subset-egs --n=$num_valid_frames_combine ark:$dir/valid_all.egs \\\n        ark:$dir/valid_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_valid_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/valid_all.egs \\\n    ark:$dir/valid_diagnostic.egs || touch $dir/.error &\n\n  $cmd $dir/log/create_train_subset_combine.log \\\n    nnet-subset-egs --n=$num_train_frames_combine ark:$dir/train_subset_all.egs \\\n    ark:$dir/train_combine.egs || touch $dir/.error &\n  $cmd $dir/log/create_train_subset_diagnostic.log \\\n    nnet-subset-egs --n=$num_frames_diagnostic ark:$dir/train_subset_all.egs \\\n    ark:$dir/train_diagnostic.egs || touch $dir/.error &\n  wait\n  sleep 5  # wait for file system to sync.\n  cat $dir/valid_combine.egs $dir/train_combine.egs > $dir/combine.egs\n\n  for f in $dir/{combine,train_diagnostic,valid_diagnostic}.egs; do\n    [ ! 
-s $f ] && echo \"No examples in file $f\" && exit 1;\n  done\n  rm $dir/valid_all.egs $dir/train_subset_all.egs $dir/{train,valid}_combine.egs $dir/ali_special.gz\nfi\n\nif [ $stage -le 3 ]; then\n  # create egs_orig.*.*.ark; the first index goes to $num_archives,\n  # the second to $nj (which is the number of jobs in the original alignment\n  # dir)\n\n  egs_list=\n  for n in $(seq $num_archives); do\n    egs_list=\"$egs_list ark:$dir/egs_orig.$n.JOB.ark\"\n  done\n  echo \"$0: Generating training examples on disk\"\n  \n  # The examples will go round-robin to egs_list.\n  $cmd $io_opts JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet-get-egs $ivectors_opt $nnet_context_opts --num-frames=$frames_per_eg \"$feats\" \\\n    \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-pdf $alidir/final.mdl ark:- ark:- | ali-to-post ark:- ark:- |\" ark:- \\| \\\n    nnet-copy-egs ark:- $egs_list || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: recombining and shuffling order of archives on disk\"\n  # combine all the \"egs_orig.JOB.*.scp\" (over the $nj splits of the data) and\n  # shuffle the order, writing to the egs.JOB.ark\n\n  egs_list=\n  for n in $(seq $nj); do \n    egs_list=\"$egs_list $dir/egs_orig.JOB.$n.ark\"\n  done\n\n  $cmd $io_opts $extra_opts JOB=1:$num_archives $dir/log/shuffle.JOB.log \\\n    nnet-shuffle-egs --srand=JOB \"ark:cat $egs_list|\" ark:$dir/egs.JOB.ark  || exit 1;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: removing temporary archives\"\n  for x in `seq $num_archives`; do\n    for y in `seq $nj`; do\n      file=$dir/egs_orig.$x.$y.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file)\n      rm $file\n    done\n  done\nfi\n\necho \"$0: Finished preparing training examples\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/get_egs_discriminative2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This script dumps examples for MPE or MMI or state-level minimum bayes risk (sMBR)\n# training of neural nets.  Note: for \"criterion\", smbr > mpe > mmi in terms of\n# compatibility of the dumped egs, meaning you can use the egs dumped with\n# --criterion smbr for MPE or MMI, and egs dumped with --criterion mpe for MMI\n# training.  The discriminative training program itself doesn't enforce this and\n# it would let you mix and match them arbitrarily; we are speaking in terms of\n# the correctness of the algorithm that splits the lattices into pieces.\n\n# Begin configuration section.\ncmd=run.pl\ncriterion=smbr\ndrop_frames=false #  option relevant for MMI, affects how we dump examples.\nsamples_per_iter=400000 # measured in frames, not in \"examples\"\nmax_temp_archives=128 # maximum number of temp archives per input job, only\n                      # affects the process of generating archives, not the\n                      # final result.\n\nstage=0\niter=final\ncleanup=true\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: $0 [opts] <data> <lang> <ali-dir> <denlat-dir> <src-online-nnet2-dir> <degs-dir>\"\n  echo \" e.g.: $0 data/train data/lang exp/nnet2_online/nnet_a_online{_ali,_denlats,_degs}\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config file containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs (probably would be good to add --max-jobs-run 5 or so if using\"\n  echo \"                                                   # GridEngine (to avoid excessive NFS traffic).\"\n  echo \"  --samples-per-iter <#samples|400000>             # Number of samples of data to process per iteration, per\"\n  echo \"                                                   # process.\"\n  echo \"  --stage <stage|-8>                               # Used to run a partially-completed training process from somewhere in\"\n  echo \"                                                   # the middle.\"\n  echo \"  --criterion <criterion|smbr>                     # Training criterion: may be smbr, mmi or mpfe\"\n  echo \"  --online-ivector-dir <dir|\"\">                    # Directory for online-estimated iVectors, used in the\"\n  echo \"                                                   # online-neural-net setup.  (but you may want to use\"\n  echo \"                                                   # steps/online/nnet2/get_egs_discriminative2.sh instead)\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\nsrcdir=$5\ndir=$6\n\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/num_jobs $alidir/tree \\\n         $denlatdir/lat.1.gz $denlatdir/num_jobs $srcdir/$iter.mdl $srcdir/conf/online_nnet2_decoding.conf; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nmkdir -p $dir/log $dir/info || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nnj=$(cat $denlatdir/num_jobs) || exit 1; # $nj is the number of\n                                         # splits of the denlats and alignments.\n\n\nnj_ali=$(cat $alidir/num_jobs) || exit 1;\n\nsdata=$data/split$nj\nutils/split_data.sh $data $nj\n\n\n\n\nif [ $nj_ali -eq $nj ]; then\n  ali_rspecifier=\"ark,s,cs:gunzip -c $alidir/ali.JOB.gz |\"\nelse\n  ali_rspecifier=\"scp:$dir/ali.scp\"\n  if [ $stage -le 1 ]; then\n    echo \"$0: number of jobs in den-lats versus alignments differ: dumping them as single archive and index.\"\n    alis=$(for n in $(seq $nj_ali); do echo -n \"$alidir/ali.$n.gz \"; done)\n    copy-int-vector --print-args=false \\\n      \"ark:gunzip -c $alis|\" ark,scp:$dir/ali.ark,$dir/ali.scp || exit 1;\n  fi\nfi\n\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\ncp $alidir/tree $dir\ncp $lang/phones/silence.csl $dir/info || exit 1;\ncp $srcdir/$iter.mdl $dir/final.mdl || exit 1;\n\ngrep -v '^--endpoint' $srcdir/conf/online_nnet2_decoding.conf >$dir/feature.conf || exit 1;\n\nivector_dim=$(online2-wav-dump-features --config=$dir/feature.conf --print-ivector-dim=true) || exit 1;\n\necho $ivector_dim > $dir/info/ivector_dim\n\n! 
[ $ivector_dim -ge 0 ] && echo \"$0: error getting iVector dim\" && exit 1;\n\nif [ -f $data/segments ]; then\n  # note: in the feature extraction, because the program online2-wav-dump-features is sensitive to the\n  # previous utterances within a speaker, we do the filtering after extracting the features.\n  echo \"$0 [info]: segments file exists: using that.\"\n  feats=\"ark,s,cs:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt ark,s,cs:- ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists, using wav.scp.\"\n  feats=\"ark,s,cs:online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt scp:$sdata/JOB/wav.scp ark:- |\"\nfi\n\n\nif [ $stage -le 2 ]; then\n  echo \"$0: working out number of frames of training data\"\n  num_frames=$(steps/nnet2/get_num_frames.sh $data)\n\n  echo $num_frames > $dir/info/num_frames\n\n  # Working out total number of archives. Add one on the assumption the\n  # num-frames won't divide exactly, and we want to round up.\n  num_archives=$[$num_frames/$samples_per_iter + 1]\n\n  # the next few lines relate to how we may temporarily split each input job\n  # into fewer than $num_archives pieces, to avoid using an excessive\n  # number of filehandles.\n  archive_ratio=$[$num_archives/$max_temp_archives+1]\n  num_archives_temp=$[$num_archives/$archive_ratio]\n  # change $num_archives slightly to make it an exact multiple\n  # of $archive_ratio.\n  num_archives=$[$num_archives_temp*$archive_ratio]\n\n  echo $num_archives >$dir/info/num_archives || exit 1\n  echo $num_archives_temp >$dir/info/num_archives_temp || exit 1\n\n  frames_per_archive=$[$num_frames/$num_archives]\n\n  # note, this is the number of frames per archive prior to discarding frames.\n  echo $frames_per_archive > $dir/info/frames_per_archive\nelse\n  num_archives=$(cat $dir/info/num_archives) || exit 1;\n  num_archives_temp=$(cat $dir/info/num_archives_temp) 
|| exit 1;\n  frames_per_archive=$(cat $dir/info/frames_per_archive) || exit 1;\nfi\n\necho \"$0: Splitting the data up into $num_archives archives (using $num_archives_temp temporary pieces per input job)\"\necho \"$0: giving samples-per-iteration of $frames_per_archive (you requested $samples_per_iter).\"\n\n# we create these data links regardless of the stage, as there are situations\n# where we would want to recreate a data link that had previously been deleted.\n\nif [ -d $dir/storage ]; then\n  echo \"$0: creating data links for distributed storage of degs\"\n  # See utils/create_split_dir.pl for how this 'storage' directory is created.\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_temp); do\n      utils/create_data_link.pl $dir/degs_orig.$x.$y.ark\n    done\n  done\n  for z in $(seq $num_archives); do\n    utils/create_data_link.pl $dir/degs.$z.ark\n  done\n  if [ $num_archives_temp -ne $num_archives ]; then\n    for z in $(seq $num_archives); do\n      utils/create_data_link.pl $dir/degs_temp.$z.ark\n    done\n  fi\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: getting initial training examples by splitting lattices\"\n\n  degs_list=$(for n in $(seq $num_archives_temp); do echo -n \"ark:$dir/degs_orig.JOB.$n.ark \"; done)\n\n  $cmd JOB=1:$nj $dir/log/get_egs.JOB.log \\\n    nnet-get-egs-discriminative --criterion=$criterion --drop-frames=$drop_frames \\\n      \"$srcdir/$iter.mdl\" \"$feats\" \"$ali_rspecifier\" \"ark,s,cs:gunzip -c $denlatdir/lat.JOB.gz|\" ark:- \\| \\\n    nnet-copy-egs-discriminative $const_dim_opt ark:- $degs_list || exit 1;\n  sleep 5;  # wait a bit so NFS has time to write files.\nfi\n\nif [ $stage -le 4 ]; then\n\n  degs_list=$(for n in $(seq $nj); do echo -n \"$dir/degs_orig.$n.JOB.ark \"; done)\n\n  if [ $num_archives -eq $num_archives_temp ]; then\n    echo \"$0: combining data into final archives and shuffling it\"\n\n    $cmd JOB=1:$num_archives $dir/log/shuffle.JOB.log \\\n      cat $degs_list \\| 
nnet-shuffle-egs-discriminative --srand=JOB ark:- \\\n       ark:$dir/degs.JOB.ark || exit 1;\n  else\n    echo \"$0: combining and re-splitting data into un-shuffled versions of final archives.\"\n\n    archive_ratio=$[$num_archives/$num_archives_temp]\n    ! [ $archive_ratio -gt 1 ] && echo \"$0: Bad archive_ratio $archive_ratio\" && exit 1;\n\n    # note: the \\$[ .. ] won't be evaluated until the job gets executed.  The\n    # aim is to write to the archives with the final numbering, 1\n    # ... num_archives, which is more than num_archives_temp.  The list with\n    # \\$[... ] expressions in it computes the set of final indexes for each\n    # temporary index.\n    degs_list_out=$(for n in $(seq $archive_ratio); do echo -n \"ark:$dir/degs_temp.\\$[((JOB-1)*$archive_ratio)+$n].ark \"; done)\n    # e.g. if dir=foo and archive_ratio=2, we'd have\n    # degs_list_out='foo/degs_temp.$[((JOB-1)*2)+1].ark foo/degs_temp.$[((JOB-1)*2)+2].ark'\n\n    $cmd JOB=1:$num_archives_temp $dir/log/resplit.JOB.log \\\n      cat $degs_list \\| nnet-copy-egs-discriminative --srand=JOB ark:- \\\n      $degs_list_out || exit 1;\n  fi\nfi\n\nif [ $stage -le 5 ] && [ $num_archives -ne $num_archives_temp ]; then\n  echo \"$0: shuffling final archives.\"\n\n  $cmd JOB=1:$num_archives $dir/log/shuffle.JOB.log \\\n    nnet-shuffle-egs-discriminative --srand=JOB ark:$dir/degs_temp.JOB.ark \\\n      ark:$dir/degs.JOB.ark || exit 1\n\nfi\n\nif $cleanup; then\n  echo \"$0: removing temporary archives.\"\n  for x in $(seq $nj); do\n    for y in $(seq $num_archives_temp); do\n      file=$dir/degs_orig.$x.$y.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file); rm $file\n    done\n  done\n  if [ $num_archives_temp -ne $num_archives ]; then\n    for z in $(seq $num_archives); do\n      file=$dir/degs_temp.$z.ark\n      [ -L $file ] && rm $(utils/make_absolute.sh $file); rm $file\n    done\n  fi\nfi\n\necho \"$0: Done.\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/get_pca_transform.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  David Snyder\n#\n# This script computes a PCA transform on top of spliced features processed with\n# apply-cmvn-online.\n#\n#\n# Apache 2.0.\n\n# Begin configuration.\ncmd=run.pl\nconfig=\nstage=0\ndim=40 # The dim after applying PCA\nnormalize_variance=true # If the PCA transform normalizes the variance\nnormalize_mean=true # If the PCA transform centers the data\nsplice_opts=\nonline_cmvn_opts=\nmax_utts=5000 # maximum number of files to use\nsubsample=5 # subsample features with this periodicity\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [options] <data> <dir>\"\n  echo \" e.g.: $0 data/train_si84 exp/tri2b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\ndata=$1\ndir=$2\n\nfor f in $data/feats.scp ; do\n  [ ! -f \"$f\" ] && echo \"$0: expecting file $f to exist\" && exit 1\ndone\n\nmkdir -p $dir/log\n\necho \"$splice_opts\" >$dir/splice_opts # keep track of frame-splicing options\n           # so that later stages of system building can know what they were.\necho $online_cmvn_opts > $dir/online_cmvn.conf # keep track of options to CMVN.\n\n# create global_cmvn.stats\nif ! 
matrix-sum --binary=false scp:$data/cmvn.scp - >$dir/global_cmvn.stats 2>/dev/null; then\n  echo \"$0: Error summing cmvn stats\"\n  exit 1\nfi\n\nfeats=\"ark,s,cs:utils/subset_scp.pl --quiet $max_utts $data/feats.scp | apply-cmvn-online $online_cmvn_opts $dir/global_cmvn.stats scp:- ark:- | splice-feats $splice_opts ark:- ark:- | subsample-feats --n=$subsample ark:- ark:- |\"\n\nif [ $stage -le 0 ]; then\n  $cmd $dir/log/pca_est.log \\\n    est-pca --dim=$dim --normalize-variance=$normalize_variance \\\n    --normalize-mean=$normalize_mean \"$feats\" $dir/final.mat || exit 1;\nfi\n\necho \"Done estimating PCA transform in $dir\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/online/nnet2/make_denlats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# Create denominator lattices for MMI/MPE training.\n# This version uses the online-nnet2 features.\n#\n# Creates its output in $dir/lat.*.gz\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\nnum_threads=1\nparallel_opts=  # ignored now.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0 [options] <data-dir> <lang-dir> <src-dir> <exp-dir>\"\n  echo \"  e.g.: $0 data/train data/lang exp/nnet2_online/nnet_a_online exp/nnet2_online/nnet_a_denlats\"\n  echo \"Works for (delta|lda) features, and (with --transform-dir option) such features\"\n  echo \" plus transforms.\"\n  echo \"\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n  echo \"                           # large databases so your jobs will be smaller and\"\n  echo \"                           # will (individually) finish reasonably soon.\"\n  echo \"  --num-threads  <n>                # number of threads per decoding job\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nsrcdir=$3\ndir=$4\n\nfor f in $data/wav.scp $lang/L.fst $srcdir/final.mdl $srcdir/conf/online_nnet2_decoding.conf; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nsdata=$data/split$nj\n\nthread_string=\n[ $num_threads -gt 1 ] && thread_string=\"-parallel --num-threads=$num_threads\"\n\nmkdir -p $dir/log\nsplit_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\n\noov=`cat $lang/oov.int` || exit 1;\n\n\n# Compute grammar FST which corresponds to unigram decoding graph.\nnew_lang=\"$dir/\"$(basename \"$lang\")\n\n\ngrep -v '^--endpoint' $srcdir/conf/online_nnet2_decoding.conf >$dir/feature.conf || exit 1;\n\nif [ $stage -le 0 ]; then\n  # mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n  # it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n  # final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\n  cp -rH $lang $dir/\n\n  echo \"Compiling decoding graph in $dir/dengraph\"\n  if [ -s $dir/dengraph/HCLG.fst ] && [ $dir/dengraph/HCLG.fst -nt $srcdir/final.mdl ]; then\n    echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\n  else\n    echo \"Making unigram grammar FST in $new_lang\"\n    cat $data/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n      awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n      utils/make_unigram_grammar.pl | fstcompile | fstarcsort --sort_type=ilabel > $new_lang/G.fst \\\n      || exit 1;\n    utils/mkgraph.sh $new_lang $srcdir $dir/dengraph || exit 1;\n  
fi\nfi\n\n\nif [ -f $data/segments ]; then\n  # note: in the feature extraction, because the program online2-wav-dump-features is sensitive to the\n  # previous utterances within a speaker, we do the filtering after extracting the features.\n  echo \"$0 [info]: segments file exists: using that.\"\n  feats=\"ark,s,cs:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- | online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt ark,s,cs:- ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists, using wav.scp.\"\n  feats=\"ark,s,cs:online2-wav-dump-features --config=$dir/feature.conf ark:$sdata/JOB/spk2utt scp:$sdata/JOB/wav.scp ark:- |\"\nfi\n\n\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\n\nif [ $sub_split -eq 1 ]; then\n  $cmd --num-threads $num_threads JOB=1:$nj $dir/log/decode_den.JOB.log \\\n   nnet-latgen-faster$thread_string --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n    --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n     $dir/dengraph/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nelse\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have stragglers\n  # from one job, we can be processing another one at the same time.\n  rm $dir/.error 2>/dev/null\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      sdata2=$data/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed \"s/trans.JOB/trans.$n/g\" | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n\n      $cmd --num-threads $num_threads JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        nnet-latgen-faster$thread_string --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n        --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n          $dir/dengraph/HCLG.fst \"$feats_subset\" \"ark:|gzip -c >$dir/lat.$n.JOB.gz\" || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! -z \"$prev_pid\" ]; then  # Wait for the previous job; merge the previous set of lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && echo \"$0: error generating denominator lattices\" && exit 1;\n      rm $dir/.merge_error 2>/dev/null\n      echo Merging archives for data subset $prev_n\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$prev_n.$k.gz || touch $dir/.merge_error;\n      done | gzip -c > $dir/lat.$prev_n.gz || touch $dir/.merge_error;\n      [ -f $dir/.merge_error ] && echo \"$0: Merging lattices for subset $prev_n failed (or maybe some other error)\" && exit 1;\n      rm $dir/lat.$prev_n.*.gz\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n\necho \"$0: done generating denominator lattices.\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/prepare_online_decoding.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration.\nstage=0 # This allows restarting after partway, when something went wrong.\nfeature_type=mfcc\nadd_pitch=false\nmfcc_config=conf/mfcc.conf # you can override any of these you need to override.\nplp_config=conf/plp.conf\nfbank_config=conf/fbank.conf\n# online_pitch_config is the config file for both pitch extraction and\n# post-processing; we combine them into one because during training this\n# is given to the program compute-and-process-kaldi-pitch-feats.\nonline_pitch_config=conf/online_pitch.conf\n\n# Below are some options that affect the iVectors, and should probably\n# match those used in extract_ivectors_online.sh.\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\n               # caution: you should use the same value in the online-estimation\n               # code.\nmax_count=100   # This max-count of 100 can make iVectors more consistent for\n                # different lengths of utterance, by scaling up the prior term\n                # when the data-count exceeds this value.  The data-count is\n                # after posterior-scaling, so assuming the posterior-scale is\n                # 0.1, --max-count 100 starts having effect after 1000 frames,\n                # or 10 seconds of data.\niter=final\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 4 ] && [ $# -ne 3 ]; then\n   echo \"Usage: $0 [options] <lang-dir> [<ivector-extractor-dir>] <nnet-dir> <output-dir>\"\n   echo \"e.g.: $0 data/lang exp/nnet2_online/extractor exp/nnet2_online/nnet exp/nnet2_online/nnet_online\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --feature-type <mfcc|plp>                        # Type of the base features; \"\n   echo \"                                                   # important to generate the correct\"\n   echo \"                                                   # configs in <output-dir>/conf/\"\n   echo \"  --add-pitch <true|false>                         # Append pitch features to cmvn\"\n   echo \"                                                   # (default: false)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --iter <model-iteration|final>                   # iteration of model to take.\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   exit 1;\nfi\n\n\nif [ $# -eq 4 ]; then\n  lang=$1\n  iedir=$2\n  srcdir=$3\n  dir=$4\nelse\n  [ $# -eq 3 ] || exit 1;\n  lang=$1\n  iedir=\n  srcdir=$2\n  dir=$3\nfi\n\nfor f in $lang/phones.txt $srcdir/${iter}.mdl $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nif [ ! -z \"$iedir\" ]; then\n  for f in final.{mat,ie,dubm} splice_opts global_cmvn.stats online_cmvn.conf; do\n    [ ! 
-f $iedir/$f ] && echo \"$0: no such file $iedir/$f\" && exit 1;\n  done\nfi\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\nmkdir -p $dir\ncp $lang/phones.txt $dir || exit 1;\n\ndir=$(utils/make_absolute.sh $dir) # Convert $dir to an absolute pathname, so that the\n                        # configuration files we write will contain absolute\n                        # pathnames.\nmkdir -p $dir/conf\n\n\ncp $srcdir/${iter}.mdl $dir/final.mdl || exit 1;\ncp $srcdir/tree $dir/ || exit 1;\nif [ ! -z \"$iedir\" ]; then\n  mkdir -p $dir/ivector_extractor/\n  cp $iedir/final.{mat,ie,dubm} $iedir/global_cmvn.stats $dir/ivector_extractor/ || exit 1;\n\n  # The following things won't be needed directly by the online decoding, but\n  # will allow us to run prepare_online_decoding.sh again with\n  # $dir/ivector_extractor/ as the input directory (useful in certain\n  # cross-system training scenarios).\n  cp $iedir/splice_opts $iedir/online_cmvn.conf $dir/ivector_extractor/ || exit 1;\nfi\n\n\nmkdir -p $dir/conf\nrm $dir/{plp,mfcc,fbank}.conf 2>/dev/null\necho \"$0: preparing configuration files in $dir/conf\"\n\nif [ -f $dir/conf/online_nnet2_decoding.conf ]; then\n  echo \"$0: moving $dir/conf/online_nnet2_decoding.conf to $dir/conf/online_nnet2_decoding.conf.bak\"\n  mv $dir/conf/online_nnet2_decoding.conf $dir/conf/online_nnet2_decoding.conf.bak\nfi\n\nconf=$dir/conf/online_nnet2_decoding.conf\necho -n >$conf\n\necho \"--feature-type=$feature_type\" >>$conf\n\ncase \"$feature_type\" in\n  mfcc)\n    echo \"--mfcc-config=$dir/conf/mfcc.conf\" >>$conf\n    cp $mfcc_config $dir/conf/mfcc.conf || exit 1;;\n  plp)\n    echo \"--plp-config=$dir/conf/plp.conf\" >>$conf\n    cp $plp_config $dir/conf/plp.conf || exit 1;;\n  fbank)\n    echo \"--fbank-config=$dir/conf/fbank.conf\" >>$conf\n    cp $fbank_config $dir/conf/fbank.conf || exit 1;;\n  *)\n    # An unknown feature type would leave an unusable config, so stop here.\n    echo \"$0: unknown feature type $feature_type\"\n    exit 1;;\nesac\n\n\n\nif [ ! 
-z \"$iedir\" ]; then\n  ieconf=$dir/conf/ivector_extractor.conf\n  echo -n >$ieconf\n  echo \"--ivector-extraction-config=$ieconf\" >>$conf\n  cp $iedir/online_cmvn.conf $dir/conf/online_cmvn.conf || exit 1;\n  # the next line puts each option from splice_opts on its own line in the config.\n  for x in $(cat $iedir/splice_opts); do echo \"$x\"; done > $dir/conf/splice.conf\n  echo \"--splice-config=$dir/conf/splice.conf\" >>$ieconf\n  echo \"--cmvn-config=$dir/conf/online_cmvn.conf\" >>$ieconf\n  echo \"--lda-matrix=$dir/ivector_extractor/final.mat\" >>$ieconf\n  echo \"--global-cmvn-stats=$dir/ivector_extractor/global_cmvn.stats\" >>$ieconf\n  echo \"--diag-ubm=$dir/ivector_extractor/final.dubm\" >>$ieconf\n  echo \"--ivector-extractor=$dir/ivector_extractor/final.ie\" >>$ieconf\n  echo \"--num-gselect=$num_gselect\"  >>$ieconf\n  echo \"--min-post=$min_post\" >>$ieconf\n  echo \"--posterior-scale=$posterior_scale\" >>$ieconf # this is currently the default in the scripts.\n  echo \"--max-remembered-frames=1000\" >>$ieconf # the default\n  echo \"--max-count=$max_count\" >>$ieconf\nfi\n\nif $add_pitch; then\n  echo \"$0: enabling pitch features\"\n  echo \"--add-pitch=true\" >>$conf\n  echo \"$0: creating $dir/conf/online_pitch.conf\"\n  if [ ! -f $online_pitch_config ]; then\n    echo \"$0: expected file '$online_pitch_config' to exist.\";\n    exit 1;\n  fi\n  cp $online_pitch_config $dir/conf/online_pitch.conf || exit 1;\n  echo \"--online-pitch-config=$dir/conf/online_pitch.conf\" >>$conf\nfi\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\necho \"--endpoint.silence-phones=$silphonelist\" >>$conf\necho \"$0: created config file $conf\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/prepare_online_decoding_retrain.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# This is as prepare_online_decoding.sh, but it's for a special case, where we\n# already have a directory that's been prepared in that way, but for another\n# corpus, and we have used the script\n# steps/online/nnet2/dump_nnet_activations.sh to dump activations of the last\n# hidden layer of that network on our data, and then steps/nnet2/retrain_fast.sh\n# to train a neural net on top of those activations.  The job of this script is\n# to take the original neural net, and the net that was trained on top of\n# its last hidden layer, combine them, and create an online-decoding directory\n# in the same format as is created by prepare_online_decoding.sh.\n# All the options for the feature extraction and the iVector extractor\n# are taken from the original directory from the other corpus.\n\n\n# Begin configuration.\nstage=0 # This allows restarting after partway, when something went wrong.\ncleanup=true\ncmd=run.pl\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 3 ] && [ $# -ne 4 ]; then    \n  echo \"Usage: $0 [options] <orig-nnet-online-dir> [<new-lang-dir>] <new-nnet-dir> <new-nnet-online-dir>\"\n  echo \"e.g.: $0 exp_other/nnet2_online/nnet_a_online data/lang exp/nnet2_online/nnet_a exp/nnet2_online/nnet_a_online\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nif [ $# -eq 3 ]; then\n  echo \"$0: warning: it's better if you add the new <lang> directory as the 2nd argument.\"\n\n  online_src=$1\n  lang=\n  nnet_src=$2\n  dir=$3\nelse\n  online_src=$1\n  lang=$2\n  nnet_src=$3\n  dir=$4\n\n  extra_files=$lang/words.txt\nfi\n\n\nfor f in $online_src/conf/online_nnet2_decoding.conf $nnet_src/final.mdl $nnet_src/tree $extra_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\ndir_as_given=$dir\ndir=$(utils/make_absolute.sh $dir) # Convert $dir to an absolute pathname, so that the\n                        # configuration files we write will contain absolute\n                        # pathnames.\nmkdir -p $dir/conf $dir/log\n\n\ncp $nnet_src/tree $dir/ || exit 1;\n\n\n# There are a bunch of files that we will need to copy from $online_src, because\n# we're aiming to have one self-contained directory that has everything in it.\nmkdir -p $dir/ivector_extractor\ncp -r $online_src/ivector_extractor/* $dir/ivector_extractor\n\n[ ! 
-d $online_src/conf ] && \\\n  echo \"Expected directory $online_src/conf to exist\" && exit 1;\n\nfor x in $online_src/conf/*conf; do\n  # Replace directory name starting $online_src with those starting with $dir.\n  # We actually replace any directory names ending in /ivector_extractor/ or /conf/ \n  # with $dir/ivector_extractor/ or $dir/conf/\n  cat $x | perl -ape \"s:=(.+)/(ivector_extractor|conf)/:=$dir/\\$2/:;\" > $dir/conf/$(basename $x)\ndone\n\ninfo=$dir/nnet_info\nnnet-am-info $online_src/final.mdl >$info\nnc=$(grep num-components $info | awk '{print $2}');\nif grep SumGroupComponent $info >/dev/null; then \n  nc_truncate=$[$nc-3]  # we did mix-up: remove AffineComponent,\n                          # SumGroupComponent, SoftmaxComponent\nelse\n  nc_truncate=$[$nc-2]  # remove AffineComponent, SoftmaxComponent\nfi\n$cmd $dir/log/get_raw_nnet.log \\\n nnet-to-raw-nnet --truncate=$nc_truncate $online_src/final.mdl $dir/first_nnet.raw || exit 1;\n\n# Now create the final.mdl, by inserting $dir/first_nnet.raw at the beginning\n# of the model in $nnet_src/final.mdl\n\n$cmd $dir/log/append_nnet.log \\\n  nnet-insert --randomize-next-component=false --insert-at=0 \\\n  $nnet_src/final.mdl $dir/first_nnet.raw $dir/final.mdl || exit 1;\n\n$cleanup && rm $dir/first_nnet.raw\n\nif [ ! -z \"$lang\" ]; then\n  # if the $lang option was provided, modify the silence-phones in the config;\n  # these are only used for the endpointing code, but we should get this right.\n  cp $dir/conf/online_nnet2_decoding.conf{,.tmp}\n  silphones=$(cat $lang/phones/silence.csl) || exit 1;\n  cat $dir/conf/online_nnet2_decoding.conf.tmp | \\\n    sed s/silence-phones=.\\\\+/silence-phones=$silphones/ > $dir/conf/online_nnet2_decoding.conf\n  rm $dir/conf/online_nnet2_decoding.conf.tmp\nfi\n\necho \"$0: formatted neural net for online decoding in $dir_as_given\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/prepare_online_decoding_transfer.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# This is as prepare_online_decoding.sh, but for transfer learning-- the case where\n# you have an existing online-decoding directory where you have all the feature\n# stuff, that you don't want to change, but \n\n# Begin configuration.\nstage=0 # This allows restarting after partway, when something went wrong.\ncmd=run.pl\niter=final\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then    \n  echo \"Usage: $0 [options] <orig-nnet-online-dir> <new-lang-dir> <new-nnet-dir> <new-nnet-online-dir>\"\n  echo \"e.g.: $0 exp_other/nnet2_online/nnet_a_online data/lang exp/nnet2_online/nnet_a exp/nnet2_online/nnet_a_online\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nonline_src=$1\nlang=$2\nnnet_src=$3\ndir=$4\n\nfor f in $online_src/conf/online_nnet2_decoding.conf $nnet_src/final.mdl $nnet_src/tree $lang/words.txt; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\ndir_as_given=$dir\ndir=$(utils/make_absolute.sh $dir) # Convert $dir to an absolute pathname, so that the\n                        # configuration files we write will contain absolute\n                        # pathnames.\nmkdir -p $dir/conf $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $nnet_src/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $nnet_src/tree $dir/ || exit 1;\n\ncp $nnet_src/$iter.mdl $dir/ || exit 1;\n\n\n# There are a bunch of files that we will need to copy from $online_src, because\n# we're aiming to have one self-contained directory that has everything in it.\nmkdir -p $dir/ivector_extractor\ncp -r $online_src/ivector_extractor/* $dir/ivector_extractor\n\n[ ! -d $online_src/conf ] && \\\n  echo \"Expected directory $online_src/conf to exist\" && exit 1;\n\nfor x in $online_src/conf/*conf; do\n  # Replace directory name starting $online_src with those starting with $dir.\n  # We actually replace any directory names ending in /ivector_extractor/ or /conf/ \n  # with $dir/ivector_extractor/ or $dir/conf/\n  cat $x | perl -ape \"s:=(.+)/(ivector_extractor|conf)/:=$dir/\\$2/:;\" > $dir/conf/$(basename $x)\ndone\n\n\n# modify the silence-phones in the config; these are only used for the\n# endpointing code.\ncp $dir/conf/online_nnet2_decoding.conf{,.tmp}\nsilphones=$(cat $lang/phones/silence.csl) || exit 1;\ncat $dir/conf/online_nnet2_decoding.conf.tmp | \\\n  sed s/silence-phones=.\\\\+/silence-phones=$silphones/ > $dir/conf/online_nnet2_decoding.conf\nrm $dir/conf/online_nnet2_decoding.conf.tmp\n\necho \"$0: formatted neural net for online decoding in $dir_as_given\"\n"
  },
  {
    "path": "egs/steps/online/nnet2/train_diag_ubm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2012  Johns Hopkins University (Author: Daniel Povey)\n#             2013  Daniel Povey\n# Apache 2.0.\n\n# This script trains a diagonal UBM that we'll use in online iVector estimation,\n# where the online-estimated iVector will be used as a secondary input to a deep\n# neural net for single-pass DNN-based decoding.\n\n# This script was modified from ../../sre08/v1/sid/train_diag_ubm.sh.  It trains\n# a diagonal UBM on top of features processed with apply-cmvn-online and then\n# transformed with an LDA+MLLT or PCA matrix (obtained from the source\n# directory).  This script does not use the trained model from the source\n# directory to initialize the diagonal GMM; instead, we initialize the GMM using\n# gmm-global-init-from-feats, which sets the means to random data points and\n# then does some iterations of E-M in memory.  After the in-memory\n# initialization we train for a few iterations in parallel.  Note that if an\n# LDA+MLLT transform matrix is used, there will be a slight mismatch in that the\n# source LDA+MLLT matrix (final.mat) will have been estimated using standard\n# CMVN, and we're using online CMVN.  We don't think this will have much effect.\n\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nnum_iters=4\nstage=-2\nnum_gselect=30 # Number of Gaussian-selection indices to use while training\n               # the model.\nnum_frames=500000 # number of frames to keep in memory for initialization\nnum_iters_init=20\ninitial_gauss_proportion=0.5 # Start with half the target number of Gaussians\nsubsample=2 # subsample all features with this periodicity, in the main E-M phase.\ncleanup=true\nmin_gaussian_weight=0.0001\nremove_low_count_gaussians=true # set this to false if you need #gauss to stay fixed.\nnum_threads=16\nparallel_opts=  # ignored now.\nonline_cmvn_config=conf/online_cmvn.conf\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . 
./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\n\nif [ $# != 4 ]; then\n  echo \"Usage: $0  <data> <num-gauss> <srcdir> <output-dir>\"\n  echo \" e.g.: $0 data/train 1024 exp/tri3b/ exp/diag_ubm\"\n  echo \"(in srcdir we find splice_opts and final.mat)\"\n  echo \"Options: \"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <num-jobs|4>                                # number of parallel jobs to run.\"\n  echo \"  --num-iters <niter|4>                            # number of iterations of parallel \"\n  echo \"                                                   # training (default: $num_iters)\"\n  echo \"  --stage <stage|-2>                               # stage to do partial re-run from.\"\n  echo \"  --num-gselect <n|30>                             # Number of Gaussians per frame to\"\n  echo \"                                                   # limit computation to, for speed\"\n  echo \"  --subsample <n|2>                                # In main E-M phase, use every n\"\n  echo \"                                                   # frames (a speedup)\"\n  echo \"  --num-frames <n|500000>                          # Maximum num-frames to keep in memory\"\n  echo \"                                                   # for model initialization\"\n  echo \"  --num-iters-init <n|20>                          # Number of E-M iterations for model\"\n  echo \"                                                   # initialization\"\n  echo \"  --initial-gauss-proportion <proportion|0.5>      # Proportion of Gaussians to start with\"\n  echo \"                                                   # in initialization phase (then split)\"\n  echo \"  --num-threads <n|16>                             # number of threads to use in initialization\"\n  echo \"                                                   # phase (must match with parallel-opts option)\"\n  echo \"  --min-gaussian-weight <weight|0.0001>            # 
min Gaussian weight allowed in GMM\"\n  echo \"                                                   # initialization (this relatively high\"\n  echo \"                                                   # value keeps counts fairly even)\"\n  exit 1;\nfi\n\ndata=$1\nnum_gauss=$2\nsrcdir=$3\ndir=$4\n\n! [ $num_gauss -gt 0 ] && echo \"Bad num-gauss $num_gauss\" && exit 1;\n\nsdata=$data/split$nj\nmkdir -p $dir/log\nutils/split_data.sh $data $nj || exit 1;\n\nfor f in $data/feats.scp \"$online_cmvn_config\" $srcdir/splice_opts $srcdir/final.mat; do\n   [ ! -f \"$f\" ] && echo \"$0: expecting file $f to exist\" && exit 1\ndone\n\nif [ -d \"$dir\" ]; then\n  bak_dir=$(mktemp -d ${dir}/backup.XXX);\n  echo \"$0: Directory $dir already exists. Backing up diagonal UBM in ${bak_dir}\";\n  for f in $dir/final.mat $dir/final.dubm $dir/online_cmvn.conf $dir/global_cmvn.stats; do\n    [ -f \"$f\" ] && mv $f ${bak_dir}/\n  done\n  [ -d \"$dir/log\" ] && mv $dir/log ${bak_dir}/\nfi\n\nsplice_opts=$(cat $srcdir/splice_opts)\ncp $srcdir/splice_opts $dir/ || exit 1;\ncp $srcdir/final.mat $dir/ || exit 1;\ncp $online_cmvn_config $dir/online_cmvn.conf || exit 1;\n\n# create global_cmvn.stats\nif ! 
matrix-sum --binary=false scp:$data/cmvn.scp - >$dir/global_cmvn.stats 2>/dev/null; then\n  echo \"$0: Error summing cmvn stats\"\n  exit 1\nfi\n\n# Note: there is no point subsampling all_feats, because gmm-global-init-from-feats\n# effectively does subsampling itself (it keeps a random subset of the features).\nall_feats=\"ark,s,cs:apply-cmvn-online --config=$online_cmvn_config --spk2utt=ark:$data/spk2utt $dir/global_cmvn.stats scp:$data/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\nfeats=\"ark,s,cs:apply-cmvn-online --config=$online_cmvn_config --spk2utt=ark:$sdata/JOB/spk2utt $dir/global_cmvn.stats scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- | subsample-feats --n=$subsample ark:- ark:- |\"\n\nnum_gauss_init=$(perl -e \"print int($initial_gauss_proportion * $num_gauss); \");\n! [ $num_gauss_init -gt 0 ] && echo \"Invalid num-gauss-init $num_gauss_init\" && exit 1;\n\nif [ $stage -le -2 ]; then\n  echo \"$0: initializing model from E-M in memory, \"\n  echo \"$0: starting from $num_gauss_init Gaussians, reaching $num_gauss;\"\n  echo \"$0: for $num_iters_init iterations, using at most $num_frames frames of data\"\n\n  $cmd --num-threads $num_threads $dir/log/gmm_init.log \\\n    gmm-global-init-from-feats --num-threads=$num_threads --num-frames=$num_frames \\\n     --min-gaussian-weight=$min_gaussian_weight \\\n     --num-gauss=$num_gauss --num-gauss-init=$num_gauss_init --num-iters=$num_iters_init \\\n    \"$all_feats\" $dir/0.dubm || exit 1;\nfi\n\n# Store Gaussian selection indices on disk-- this speeds up the training passes.\nif [ $stage -le -1 ]; then\n  echo \"Getting Gaussian-selection info\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$num_gselect $dir/0.dubm \"$feats\" \\\n      \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\necho \"$0: will train for $num_iters iterations, in parallel 
over\"\necho \"$0: $nj machines, parallelized with '$cmd'\"\n\nfor x in `seq 0 $[$num_iters-1]`; do\n  echo \"$0: Training pass $x\"\n  if [ $stage -le $x ]; then\n  # Accumulate stats.\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-global-acc-stats \"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" \\\n      $dir/$x.dubm \"$feats\" $dir/$x.JOB.acc || exit 1;\n    if [ $x -lt $[$num_iters-1] ]; then # Don't remove low-count Gaussians till last iter,\n      opt=\"--remove-low-count-gaussians=false\" # or gselect info won't be valid any more.\n    else\n      opt=\"--remove-low-count-gaussians=$remove_low_count_gaussians\"\n    fi\n    $cmd $dir/log/update.$x.log \\\n      gmm-global-est $opt --min-gaussian-weight=$min_gaussian_weight $dir/$x.dubm \"gmm-global-sum-accs - $dir/$x.*.acc|\" \\\n      $dir/$[$x+1].dubm || exit 1;\n\n    if $cleanup; then\n      rm $dir/$x.*.acc $dir/$x.dubm\n    fi\n  fi\ndone\n\nif $cleanup; then\n  rm $dir/gselect.*.gz\nfi\n\nmv $dir/$num_iters.dubm $dir/final.dubm || exit 1;\nexit 0;\n"
  },
  {
    "path": "egs/steps/online/nnet2/train_ivector_extractor.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright   2013  Daniel Povey\n# Apache 2.0.\n\n# This script is modified from ^/egs/sre08/v1/sid/train_ivector_extractor.sh.\n# It trains an iVector extractor for use in DNN training.  In this version, the\n# features used to obtain the Gaussian posteriors are based on sliding-window\n# CMN, but the actual iVector extractor sees the original features without CMN.\n# The idea is that the appropriate offset should be learned by the iVector\n# extractor itself, so the neural net can take as input the non-CMN features\n# together with the iVector.  [note: in future, we may just compute the\n# posteriors on top of non-CMN input, we'll have to see what works better.]\n\n# This script trains the i-vector extractor.  Note: there are 3 separate levels\n# of parallelization: num_threads, num_processes, and num_jobs.  This may seem a\n# bit excessive.  It has to do with minimizing memory usage and disk I/O,\n# subject to various constraints.  The \"num_threads\" is how many threads a\n# program uses; the \"num_processes\" is the number of separate processes a single\n# job spawns, and then sums the accumulators in memory.  Our recommendation:\n#  - Set num_threads to the minimum of (4, or how many virtual cores your machine has).\n#    (because of needing to lock various global quantities, the program can't\n#    use many more than 4 threads with good CPU utilization).\n#  - Set num_processes to the number of virtual cores on each machine you have, divided by\n#    num_threads.  E.g. 4, if you have 16 virtual cores.   If you're on a shared queue\n#    that's busy with other people's jobs, it may be wise to set it to rather less\n#    than this maximum though, or your jobs won't get scheduled.  
And if memory is\n#    tight you need to be careful; in our normal setup, each process uses about 5G.\n#  - Set num_jobs to as many of the jobs (each using $num_threads * $num_processes CPUs)\n#    your queue will let you run at one time, but don't go much more than 10 or 20, or\n#    summing the accumulators will possibly get slow.  If you have a lot of data, you\n#    may want more jobs, though.\n\n# Begin configuration section.\nnj=10   # this is the number of separate queue jobs we run, but each one\n        # contains num_processes sub-jobs.. the real number of threads we\n        # run is nj * num_processes * num_threads, and the number of\n        # separate pieces of data is nj * num_processes.\nnum_threads=4\nnum_processes=4 # each job runs this many processes, each with --num-threads threads\ncmd=\"run.pl\"\nstage=-4\nivector_dim=100 # dimension of the extracted i-vector\nonline_cmvn_iextractor=false # apply online-cmvn on i-vector input features, uses the configuration from UBM,\nnum_iters=10\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\n               # caution: you should use the same value in the online-estimation\n               # code.\nsubsample=2  # This speeds up the training: training on every 2nd feature\n             # (configurable) Since the features are highly correlated across\n             # frames, we don't expect to lose too much from this.\nparallel_opts=  # ignored now.\ncleanup=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 <data> <diagonal-ubm-dir> <extractor-dir>\"\n  echo \" e.g.: $0 data/train exp/nnet2_online/diag_ubm/ exp/nnet2_online/extractor\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --num-iters <#iters|10>                          # Number of iterations of E-M\"\n  echo \"  --nj <n|10>                                      # Number of jobs (also see num-processes and num-threads)\"\n  echo \"  --num-processes <n|4>                            # Number of processes for each queue job (relates\"\n  echo \"                                                   # to summing accs in memory)\"\n  echo \"  --num-threads <n|4>                              # Number of threads for each process (can't be usefully\"\n  echo \"                                                   # increased much above 4)\"\n  echo \"  --stage <stage|-4>                               # To control partial reruns\"\n  echo \"  --num-gselect <n|5>                              # Number of Gaussians to select using\"\n  echo \"                                                   # diagonal model.\"\n  exit 1;\nfi\n\ndata=$1\nsrcdir=$2\ndir=$3\n\nfor f in $srcdir/final.dubm $srcdir/final.mat $srcdir/global_cmvn.stats $srcdir/splice_opts \\\n      $srcdir/online_cmvn.conf  $data/feats.scp; do\n  [ ! -f $f ] && echo \"No such file $f\" && exit 1;\ndone\n\n\nif [ -d \"$dir\" ]; then\n  bak_dir=$(mktemp -d ${dir}/backup.XXX);\n  echo \"$0: Directory $dir already exists. 
Backing up iVector extractor in ${bak_dir}\";\n  for f in $dir/final.ie $dir/*.ie $dir/final.mat $dir/final.dubm \\\n        $dir/online_cmvn.conf $dir/global_cmvn.stats; do\n    [ -f \"$f\" ] &&  mv $f ${bak_dir}/\n  done\n  [ -d \"$dir/log\" ] && mv $dir/log ${bak_dir}/\nfi\n\n# Set various variables.\nmkdir -p $dir/log\nnj_full=$[$nj*$num_processes]\nsdata=$data/split$nj_full;\nutils/split_data.sh $data $nj_full || exit 1;\n\ncp $srcdir/final.dubm $srcdir/final.mat $srcdir/global_cmvn.stats $srcdir/splice_opts \\\n      $srcdir/online_cmvn.conf $dir || exit 1;\n\nsplice_opts=$(cat $srcdir/splice_opts)\n\n## Set up features.  $gmm_feats is the version of the features with online CMVN, that we use\n## to get the Gaussian posteriors, $feats is the version of the features with no CMN.\ngmm_feats=\"ark,s,cs:apply-cmvn-online --config=$dir/online_cmvn.conf --spk2utt=ark:$sdata/JOB/spk2utt $dir/global_cmvn.stats scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- | subsample-feats --n=$subsample ark:- ark:- |\"\nfeats=\"ark,s,cs:splice-feats $splice_opts scp:$sdata/JOB/feats.scp ark:- | transform-feats $dir/final.mat ark:- ark:- | subsample-feats --n=$subsample ark:- ark:- |\"\n\n## This adds online-cmvn in $feats, upon request (configuration taken from UBM),\n## ('online_cmvn_iextractor' marks that we added online_cmvn_iextractor)\nrm $dir/online_cmvn_iextractor 2>/dev/null || true\nif $online_cmvn_iextractor; then\n  feats=\"$gmm_feats\"\n  touch $dir/online_cmvn_iextractor\nfi\n\n# Initialize the i-vector extractor using the input GMM, which is converted to\n# full because that's what the i-vector extractor expects.  
Note: we have to do\n# --use-weights=false to disable regression of the log weights on the ivector,\n# because that would make the online estimation of the ivector difficult (since\n# the online/real-time ivector estimation is the whole point of this script).\nif [ $stage -le -2 ]; then\n  $cmd $dir/log/init.log \\\n    ivector-extractor-init --ivector-dim=$ivector_dim --use-weights=false \\\n     \"gmm-global-to-fgmm $dir/final.dubm -|\" $dir/0.ie || exit 1\nfi\n\n# Do Gaussian selection and posterior extraction\n\n# if we subsample frames, modify the posterior-scale; this is likely\n# to make the original posterior-scale (before subsampling) suitable.\nmodified_posterior_scale=$(perl -e \"print $posterior_scale * $subsample;\");\n\nif [ $stage -le -1 ]; then\n  echo $nj_full > $dir/num_jobs\n  echo \"$0: doing Gaussian selection and posterior computation\"\n  $cmd JOB=1:$nj_full $dir/log/post.JOB.log \\\n    gmm-global-get-post --n=$num_gselect --min-post=$min_post $dir/final.dubm \"$gmm_feats\" ark:- \\| \\\n    scale-post ark:- $modified_posterior_scale \"ark:|gzip -c >$dir/post.JOB.gz\" || exit 1;\nelse\n  # make sure we at least have the right number of post.*.gz files.\n  if ! 
[ $nj_full -eq $(cat $dir/num_jobs) ]; then\n    echo \"Num-jobs mismatch $nj_full versus $(cat $dir/num_jobs)\"\n    exit 1\n  fi\nfi\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  if [ $stage -le $x ]; then\n    rm $dir/.error 2>/dev/null\n\n    Args=() # bash array of training commands for 1:nj, that put accs to stdout.\n    for j in $(seq $nj_full); do\n      Args[$j]=`echo \"ivector-extractor-acc-stats --num-threads=$num_threads $dir/$x.ie '$feats' 'ark,s,cs:gunzip -c $dir/post.JOB.gz|' -|\" | sed s/JOB/$j/g`\n    done\n\n    echo \"Accumulating stats (pass $x)\"\n    for g in $(seq $nj); do\n      start=$[$num_processes*($g-1)+1]\n      $cmd --num-threads $[$num_threads*$num_processes] $dir/log/acc.$x.$g.log \\\n        ivector-extractor-sum-accs --parallel=true \"${Args[@]:$start:$num_processes}\" \\\n          $dir/acc.$x.$g || touch $dir/.error &\n    done\n    wait\n    [ -f $dir/.error ] && echo \"Error accumulating stats on iteration $x\" && exit 1;\n    accs=\"\"\n    for j in $(seq $nj); do\n      accs+=\"$dir/acc.$x.$j \"\n    done\n    echo \"Summing accs (pass $x)\"\n    $cmd $dir/log/sum_acc.$x.log \\\n      ivector-extractor-sum-accs $accs $dir/acc.$x || exit 1;\n    echo \"Updating model (pass $x)\"\n    nt=$[$num_threads*$num_processes] # use the same number of threads that\n                                      # each accumulation process uses, since we\n                                      # can be sure the queue will support this many.\n                                      #\n                                      # The parallel-opts was either specified by\n                                      # the user or we computed it correctly in\n                                      # the previous stages\n    $cmd --num-threads $[$num_threads*$num_processes] $dir/log/update.$x.log \\\n      ivector-extractor-est --num-threads=$nt $dir/$x.ie $dir/acc.$x $dir/$[$x+1].ie || exit 1;\n    rm $dir/acc.$x.*\n    if $cleanup; then\n      rm $dir/acc.$x 
$dir/$x.ie\n    fi\n  fi\n  x=$[$x+1]\ndone\n\nif $cleanup; then\n  rm $dir/post.*.gz\nfi\n\nrm $dir/final.ie 2>/dev/null\nln -s $x.ie $dir/final.ie\n\n# assign a unique id to this extractor\n# we are not interested in the id itself, just pre-caching ...\nsteps/nnet2/get_ivector_id.sh $dir > /dev/null || exit 1\n"
  },
  {
    "path": "egs/steps/online/nnet3/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n#           2016  Api.ai (Author: Ilya Platonov)\n# Apache 2.0\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nframes_per_chunk=20\nextra_left_context_initial=0\nmin_active=200\nmax_active=7000\nbeam=15.0\nlattice_beam=6.0\nacwt=0.1   # note: only really affects adaptation and pruning (scoring is on\n           # lattices).\npost_decode_acwt=1.0  # can be used in 'chain' systems to scale acoustics by 10 so the\n                      # regular scoring script works.\nper_utt=false\nonline=true  # only relevant to non-threaded decoder.\ndo_endpointing=false\ndo_speex_compressing=false\nscoring_opts=\nskip_scoring=false\nsilence_weight=1.0  # set this to a value less than 1 (e.g. 0) to enable silence weighting.\nmax_state_duration=40 # This only has an effect if you are doing silence\n  # weighting.  This default is probably reasonable.  transition-ids repeated\n  # more than this many times in an alignment are treated as silence.\niter=final\nonline_config=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... 
where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the models are, as prepared by steps/online/nnet3/prepare_online_decoding.sh\"\n   echo \"e.g.: $0 exp/chain/tdnn/graph data/test exp/chain/tdnn_online/decode/\"\n   echo \"\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --online-config <config-file>                    # online decoder options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --per-utt <true|false>                           # If true, decode per utterance without\"\n   echo \"                                                   # carrying forward adaptation info from previous\"\n   echo \"                                                   # utterances of each speaker.  Default: false\"\n   echo \"  --online <true|false>                            # Set this to false if you don't really care about\"\n   echo \"                                                   # simulating online decoding and just want the best\"\n   echo \"                                                   # results.  
This will use all the data within each\"\n   echo \"                                                   # utterance (plus any previous utterance, if not in\"\n   echo \"                                                   # per-utterance mode) to estimate the iVectors.\"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"  --iter <iter>                                    # Iteration of model to decode; default is final.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\n\nif [ \"$online_config\" == \"\" ]; then\n  online_config=$srcdir/conf/online.conf;\nfi\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nfor f in $online_config $srcdir/${iter}.mdl \\\n    $graphdir/HCLG.fst $graphdir/words.txt $data/wav.scp; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nif ! 
$per_utt; then\n  spk2utt_rspecifier=\"ark:$sdata/JOB/spk2utt\"\nelse\n  mkdir -p $dir/per_utt\n  for j in $(seq $nj); do\n    awk '{print $1, $1}' <$sdata/$j/utt2spk >$dir/per_utt/utt2spk.$j || exit 1;\n  done\n  spk2utt_rspecifier=\"ark:$dir/per_utt/utt2spk.JOB\"\nfi\n\nif [ -f $data/segments ]; then\n  wav_rspecifier=\"ark,s,cs:extract-segments scp,p:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- |\"\nelse\n  wav_rspecifier=\"ark,s,cs:wav-copy scp,p:$sdata/JOB/wav.scp ark:- |\"\nfi\nif $do_speex_compressing; then\n  wav_rspecifier=\"$wav_rspecifier compress-uncompress-speex ark:- ark:- |\"\nfi\nif $do_endpointing; then\n  wav_rspecifier=\"$wav_rspecifier extend-wav-with-silence ark:- ark:- |\"\nfi\n\nif [ \"$silence_weight\" != \"1.0\" ]; then\n  silphones=$(cat $graphdir/phones/silence.csl) || exit 1\n  silence_weighting_opts=\"--ivector-silence-weighting.max-state-duration=$max_state_duration --ivector-silence-weighting.silence_phones=$silphones --ivector-silence-weighting.silence-weight=$silence_weight\"\nelse\n  silence_weighting_opts=\nfi\n\n\nif [ \"$post_decode_acwt\" == 1.0 ]; then\n  lat_wspecifier=\"ark:|gzip -c >$dir/lat.JOB.gz\"\nelse\n  lat_wspecifier=\"ark:|lattice-scale --acoustic-scale=$post_decode_acwt ark:- ark:- | gzip -c >$dir/lat.JOB.gz\"\nfi\n\n\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. 
for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    online2-wav-nnet3-latgen-faster $silence_weighting_opts --do-endpointing=$do_endpointing \\\n    --frames-per-chunk=$frames_per_chunk \\\n    --extra-left-context-initial=$extra_left_context_initial \\\n    --online=$online \\\n       $frame_subsampling_opt \\\n     --config=$online_config \\\n     --min-active=$min_active --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n     --acoustic-scale=$acwt --word-symbol-table=$graphdir/words.txt \\\n     $srcdir/${iter}.mdl $graphdir/HCLG.fst $spk2utt_rspecifier \"$wav_rspecifier\" \\\n      \"$lat_wspecifier\" || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $scoring_opts $data $graphdir $dir\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/online/nnet3/decode_wake_word.sh",
    "content": "#!/bin/bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n#           2016  Api.ai (Author: Ilya Platonov)\n#      2019-2020  Yiming Wang\n# Apache 2.0\n\n# This script is modified from steps/online/nnet3/decode.sh for wake word detection decoding\n\n# Begin configuration section.\nstage=0\nnj=4\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\ncmd=run.pl\nframes_per_chunk=20\nextra_left_context_initial=0\nmin_active=200\nmax_active=7000\nbeam=15.0\nper_utt=false\nonline=true  # only relevant to non-threaded decoder.\ndo_speex_compressing=false\nscoring_opts=\nskip_scoring=false\niter=final\nonline_config=\nwake_word=\"嗨小问\"\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n   echo \"Usage: $0 [options] <graph-dir> <data-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the models are, as prepared by steps/online/nnet3/prepare_online_decoding.sh\"\n   echo \"e.g.: $0 exp/chain/tdnn/graph data/test exp/chain/tdnn_online/decode/\"\n   echo \"\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --online-config <config-file>                    # online decoder options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --per-utt <true|false>                           # If true, decode per utterance without\"\n   echo \"                                                   # carrying forward adaptation info from previous\"\n   echo \"                                                   # utterances of each speaker.  
Default: false\"\n   echo \"  --online <true|false>                            # Set this to false if you don't really care about\"\n   echo \"                                                   # simulating online decoding and just want the best\"\n   echo \"                                                   # results.  This will use all the data within each\"\n   echo \"                                                   # utterance (plus any previous utterance, if not in\"\n   echo \"                                                   # per-utterance mode) to estimate the iVectors.\"\n   echo \"  --scoring-opts <string>                          # options to local/score.sh\"\n   echo \"  --iter <iter>                                    # Iteration of model to decode; default is final.\"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata=$2\ndir=$3\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\nsdata=$data/split$nj;\n\nif [ \"$online_config\" == \"\" ]; then\n  online_config=$srcdir/conf/online.conf;\nfi\n\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nfor f in $online_config $srcdir/${iter}.mdl \\\n    $graphdir/HCLG.fst $graphdir/words.txt $data/wav.scp; do\n  if [ ! -f $f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nif ! 
$per_utt; then\n  spk2utt_rspecifier=\"ark:$sdata/JOB/spk2utt\"\nelse\n  mkdir -p $dir/per_utt\n  for j in $(seq $nj); do\n    awk '{print $1, $1}' <$sdata/$j/utt2spk >$dir/per_utt/utt2spk.$j || exit 1;\n  done\n  spk2utt_rspecifier=\"ark:$dir/per_utt/utt2spk.JOB\"\nfi\n\nif [ -f $data/segments ]; then\n  wav_rspecifier=\"ark,s,cs:extract-segments scp,p:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- |\"\nelse\n  wav_rspecifier=\"ark,s,cs:wav-copy scp,p:$sdata/JOB/wav.scp ark:- |\"\nfi\nif $do_speex_compressing; then\n  wav_rspecifier=\"$wav_rspecifier compress-uncompress-speex ark:- ark:- |\"\nfi\n\nwake_word_id=$(cat $graphdir/words.txt | grep $wake_word | awk '{print $2}')\n\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  # e.g. for 'chain' systems\n  frame_subsampling_opt=\"--frame-subsampling-factor=$(cat $srcdir/frame_subsampling_factor)\"\nfi\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    online2-wav-nnet3-wake-word-decoder-faster \\\n    --frames-per-chunk=$frames_per_chunk \\\n    --extra-left-context-initial=$extra_left_context_initial \\\n    --online=$online \\\n       $frame_subsampling_opt \\\n     --config=$online_config \\\n     --min-active=$min_active --max-active=$max_active --beam=$beam \\\n     --acoustic-scale=$acwt --wake-word-id=$wake_word_id \\\n     $srcdir/${iter}.mdl $graphdir/HCLG.fst $spk2utt_rspecifier \"$wav_rspecifier\" \\\n     $graphdir/words.txt ark,t:$dir/trans.JOB.txt \\\n     ark,t:$dir/ali.JOB.txt || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  for n in $(seq $nj); do\n    cat $dir/trans.$n.txt\n  done > $dir/trans.txt\n  rm -f $dir/trans.*.txt\n  for n in $(seq $nj); do\n    cat $dir/ali.$n.txt\n  done > $dir/ali.txt\n  rm -f $dir/ali.*.txt\nfi\n\nif [ $stage -le 2 ] && ! $skip_scoring ; then\n  [ ! 
-x local/score_online.sh ] && \\\n    echo \"Not scoring because local/score_online.sh does not exist or is not executable.\" && exit 1;\n  local/score_online.sh $scoring_opts --wake-word $wake_word $data $graphdir $dir\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/online/nnet3/prepare_online_decoding.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration.\nstage=0 # This allows restarting after partway, when something when wrong.\nfeature_type=mfcc\nadd_pitch=false\nmfcc_config=conf/mfcc.conf # you can override any of these you need to override.\nplp_config=conf/plp.conf\nfbank_config=conf/fbank.conf\n\n# online_pitch_config is the config file for both pitch extraction and\n# post-processing; we combine them into one because during training this\n# is given to the program compute-and-process-kaldi-pitch-feats.\nonline_pitch_config=conf/online_pitch.conf\n\n# online_cmvn_config can be used both for nn-features and i-vector features.\n# If the file $dir/online_cmvn exists, it is used for both feature streams.\n# If $dir/online_cmvn does not exist, the config file is used only for normalizing\n# the input of ubm in i-vector extractor, the rest of the system is without online-cmvn.\n# The $dir/online_cmvn 'flag' file is created when training with online-cmvn.\nonline_cmvn_config=conf/online_cmvn.conf\n\n# Below are some options that affect the iVectors, and should probably\n# match those used in extract_ivectors_online.sh.\nnum_gselect=5 # Gaussian-selection using diagonal model: number of Gaussians to select\nposterior_scale=0.1 # Scale on the acoustic posteriors, intended to account for\n                    # inter-frame correlations.\nmin_post=0.025 # Minimum posterior to use (posteriors below this are pruned out)\n               # caution: you should use the same value in the online-estimation\n               # code.\nmax_count=100   # This max-count of 100 can make iVectors more consistent for\n                # different lengths of utterance, by scaling up the prior term\n                # when the data-count exceeds this value.  
The data-count is\n                # after posterior-scaling, so assuming the posterior-scale is\n                # 0.1, --max-count 100 starts having effect after 1000 frames,\n                # or 10 seconds of data.\nivector_period=10 # Number of frames for which the i-vector stays the same\n                  # (use same value as from local/nnet3/run_ivector_common.sh).\n\niter=final\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ] && [ $# -ne 3 ]; then\n   echo \"Usage: $0 [options] <lang-dir> [<ivector-extractor-dir>] <nnet-dir> <output-dir>\"\n   echo \"e.g.: $0 data/lang exp/nnet2_online/extractor exp/nnet2_online/nnet exp/nnet2_online/nnet_online\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --feature-type <mfcc|plp>                        # Type of the base features; \"\n   echo \"                                                   # important to generate the correct\"\n   echo \"                                                   # configs in <output-dir>/conf/\"\n   echo \"  --add-pitch <true|false>                         # Append pitch features to cmvn\"\n   echo \"                                                   # (default: false)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --iter <model-iteration|final>                   # iteration of model to take.\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   exit 1;\nfi\n\n\nif [ $# -eq 4 ]; then\n  lang=$1\n  iedir=$2\n  srcdir=$3\n  dir=$4\nelse\n  [ $# -eq 3 ] || exit 1;\n  lang=$1\n  iedir=\n  srcdir=$2\n  dir=$3\nfi\n\nfor f in $lang/phones/silence.csl $srcdir/${iter}.mdl $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nif [ ! 
-z \"$iedir\" ]; then\n  for f in final.{mat,ie,dubm} splice_opts global_cmvn.stats online_cmvn.conf; do\n    [ ! -f $iedir/$f ] && echo \"$0: no such file $iedir/$f\" && exit 1;\n  done\n  if $add_pitch; then\n    iedim=`matrix-dim $iedir/final.mat | awk '{print $1}'`\n    amdim=`nnet3-am-info $srcdir/${iter}.mdl | grep \"input-dim:\" | awk '{print $2}'`\n    [ $(($amdim-$iedim)) -eq 0 ] && echo \"$0: remove pitch from the input of ivector extractor\" && exit 1;\n  fi\nfi\n\n\ndir=$(utils/make_absolute.sh $dir) # Convert $dir to an absolute pathname, so that the\n                        # configuration files we write will contain absolute\n                        # pathnames.\nmkdir -p $dir/conf\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $srcdir/${iter}.mdl $dir/final.mdl || exit 1;\ncp $srcdir/tree $dir/ || exit 1;\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  cp $srcdir/frame_subsampling_factor $dir/\nfi\n\n\nif [ ! 
-z \"$iedir\" ]; then\n  mkdir -p $dir/ivector_extractor/\n  cp $iedir/final.{mat,ie,dubm} $iedir/global_cmvn.stats $dir/ivector_extractor/ || exit 1;\n\n  # The following things won't be needed directly by the online decoding, but\n  # will allow us to run prepare_online_decoding.sh again with\n  # $dir/ivector_extractor/ as the input directory (useful in certain\n  # cross-system training scenarios).\n  cp $iedir/splice_opts $iedir/online_cmvn.conf $dir/ivector_extractor/ || exit 1;\nfi\n\n\nmkdir -p $dir/conf\nrm $dir/{plp,mfcc,fbank}.conf 2>/dev/null\necho \"$0: preparing configuration files in $dir/conf\"\n\nif [ -f $dir/conf/online.conf ]; then\n  echo \"$0: moving $dir/conf/online.conf to $dir/conf/online.conf.bak\"\n  mv $dir/conf/online.conf $dir/conf/online.conf.bak\nfi\n\nconf=$dir/conf/online.conf\necho -n >$conf\n\necho \"--feature-type=$feature_type\" >>$conf\n\ncase \"$feature_type\" in\n  mfcc)\n    echo \"--mfcc-config=$dir/conf/mfcc.conf\" >>$conf\n    cp $mfcc_config $dir/conf/mfcc.conf || exit 1;;\n  plp)\n    echo \"--plp-config=$dir/conf/plp.conf\" >>$conf\n    cp $plp_config $dir/conf/plp.conf || exit 1;;\n  fbank)\n    echo \"--fbank-config=$dir/conf/fbank.conf\" >>$conf\n    cp $fbank_config $dir/conf/fbank.conf || exit 1;;\n  *)\n    echo \"Unknown feature type $feature_type\"\nesac\n\ncp $online_cmvn_config $dir/conf/online_cmvn.conf || exit 1;\n\nif [ ! -z \"$iedir\" ]; then\n  ieconf=$dir/conf/ivector_extractor.conf\n  echo -n >$ieconf\n  echo \"--ivector-extraction-config=$ieconf\" >>$conf\n\n  # make sure that the online_cmvn config for i-extractor is same\n  # as the one passed in with '--online_cmvn_config'\n  ivec_cmvn_config=$iedir/online_cmvn.conf\n  if ! 
$(cmp --silent $online_cmvn_config $ivec_cmvn_config); then\n    echo \"Error, configs must be the same:\n      \\$online_cmvn_config=$online_cmvn_config\n      \\$ivec_cmvn_config=$ivec_cmvn_config\"\n    exit 1;\n  fi\n\n  # the next line puts each option from splice_opts on its own line in the config.\n  for x in $(cat $iedir/splice_opts); do echo \"$x\"; done > $dir/conf/splice.conf\n  echo \"--splice-config=$dir/conf/splice.conf\" >>$ieconf\n  echo \"--cmvn-config=$dir/conf/online_cmvn.conf\" >>$ieconf\n  echo \"--lda-matrix=$dir/ivector_extractor/final.mat\" >>$ieconf\n  echo \"--global-cmvn-stats=$dir/ivector_extractor/global_cmvn.stats\" >>$ieconf\n  echo \"--diag-ubm=$dir/ivector_extractor/final.dubm\" >>$ieconf\n  echo \"--ivector-extractor=$dir/ivector_extractor/final.ie\" >>$ieconf\n  echo \"--num-gselect=$num_gselect\"  >>$ieconf\n  echo \"--min-post=$min_post\" >>$ieconf\n  echo \"--posterior-scale=$posterior_scale\" >>$ieconf # this is currently the default in the scripts.\n  echo \"--max-remembered-frames=1000\" >>$ieconf # the default\n  echo \"--max-count=$max_count\" >>$ieconf\n  echo \"--ivector-period=$ivector_period\" >>$ieconf\n  # activate online-cmvn for the i-extractor, not only the ubm,\n  if [ -f $srcdir/online_cmvn ]; then\n    cp $iedir/online_cmvn_iextractor $dir/ivector_extractor/ || exit 1\n    echo \"--online-cmvn-iextractor=true\" >>$ieconf\n  fi\nfi\n\nif $add_pitch; then\n  echo \"$0: enabling pitch features\"\n  echo \"--add-pitch=true\" >>$conf\n  echo \"$0: creating $dir/conf/online_pitch.conf\"\n  if [ ! 
-f $online_pitch_config ]; then\n    echo \"$0: expected file '$online_pitch_config' to exist.\";\n    exit 1;\n  fi\n  cp $online_pitch_config $dir/conf/online_pitch.conf || exit 1;\n  echo \"--online-pitch-config=$dir/conf/online_pitch.conf\" >>$conf\nfi\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\necho \"--endpoint.silence-phones=$silphonelist\" >>$conf\n\n# activate the online-cmvn in nnet input features,\nif [ -f $srcdir/online_cmvn ]; then\n  cp $srcdir/online_cmvn $dir/\n  cp $srcdir/global_cmvn.stats $dir/\n  echo \"--cmvn-config=$dir/conf/online_cmvn.conf\" >>$conf\n  echo \"--global-cmvn-stats=$dir/global_cmvn.stats\" >>$conf\nfi\n\necho \"$0: created config file $conf\"\n"
  },
  {
    "path": "egs/steps/online/prepare_online_decoding.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration.\nstage=0 # This allows restarting after partway, when something when wrong.\nfeature_type=mfcc\nonline_cmvn_config=conf/online_cmvn.conf\nadd_pitch=false\npitch_config=conf/pitch.conf\npitch_process_config=conf/pitch_process.conf\nper_utt_basis=true # If true, then treat each utterance as a separate speaker\n                   # for purposes of basis training... this is recommended if\n                   # the number of actual speakers in your training set is less\n                   # than (feature-dim) * (feature-dim+1).\nper_utt_cmvn=false # If true, apply online CMVN normalization per utterance\n                   # rather than per speaker.\nsilence_weight=0.01\ncmd=run.pl\ncleanup=true\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 -a $# -ne 5 ]; then\n   echo \"Usage: $0 [options] <data-dir> <lang-dir> <sat-model-dir> [<MMI-model>] <output-dir>\"\n   echo \"e.g.: $0 data/train data/lang exp/tri3b exp/tri3b_mmi/final.mdl exp/tri3b_online\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --feature-type <mfcc|plp>                        # Type of the base features; \"\n   echo \"                                                   # important to generate the correct\"\n   echo \"                                                   # configs in <output-dir>/conf/\"\n   echo \"  --online-cmvn-config <config>                    # config for online cmvn,\"\n   echo \"                                                   # default conf/online_cmvn.conf\"\n   echo \"  --add-pitch <true|false>                         # Append pitch features to cmvn\"\n   echo \"                                                   # (default: false)\"\n   echo \"  --per-utt-cmvn <true|false>                      # 
Apply online CMVN per utt, not\"\n   echo \"                                                   # per speaker (default: false)\"\n   echo \"  --per-utt-basis <true|false>                     # Do basis computation per utterance\"\n   echo \"                                                   # (default: true)\"\n   echo \"  --silence-weight <weight>                        # Weight on silence for basis fMLLR;\"\n   echo \"                                                   # default 0.01.\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   exit 1;\nfi\n\n\nif [ $# -eq 5 ]; then\n  data=$1\n  lang=$2\n  srcdir=$3\n  mmi_model=$4\n  dir=$5\nelse\n  data=$1\n  lang=$2\n  srcdir=$3\n  mmi_model=$srcdir/final.mdl\n  dir=$4\nfi\n\n\nfor f in $srcdir/final.mdl $srcdir/ali.1.gz $data/feats.scp $lang/phones.txt \\\n    $mmi_model $online_cmvn_config; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnj=`cat $srcdir/num_jobs` || exit 1;\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nmkdir -p $dir/log\necho $nj >$dir/num_jobs || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $srcdir/cmvn_opts 2>/dev/null`\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\ncp $srcdir/splice_opts $srcdir/cmvn_opts $srcdir/final.mat $srcdir/final.mdl $dir/ 2>/dev/null\n\ncp $mmi_model $dir/final.rescore_mdl\n\n# Set up the unadapted features \"$sifeats\".\nif [ -f $dir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\nif ! 
$per_utt_cmvn; then\n  online_cmvn_spk2utt_opt=\nelse\n  online_cmvn_spk2utt_opt=\"--spk2utt=ark:$sdata/JOB/spk2utt\"\nfi\n\n\n# create global_cmvn.stats\nif ! matrix-sum --binary=false scp:$data/cmvn.scp - >$dir/global_cmvn.stats 2>/dev/null; then\n  echo \"$0: Error summing cmvn stats\"\n  exit 1\nfi\n\nif $add_pitch; then\n  skip_opt=\"--skip-dims=13:14:15\" # should make this more general.\nfi\n\necho \"$0: feature type is $feat_type\";\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\"\n        online_sifeats=\"ark,s,cs:apply-cmvn-online $skip_opt --config=$online_cmvn_config $dir/global_cmvn.stats $online_cmvn_spk2utt_opt scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n       online_sifeats=\"ark,s,cs:apply-cmvn-online $skip_opt --config=$online_cmvn_config $online_cmvn_spk2utt_opt $dir/global_cmvn.stats scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\";;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n# Set up the adapted features \"$feats\" for training set.\nif [ -f $srcdir/trans.1 ]; then\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$srcdir/trans.JOB ark:- ark:- |\";\nelse\n  feats=\"$sifeats\";\nfi\n\n\nif $per_utt_basis; then\n  spk2utt_opt=  # treat each utterance as separate speaker when computing basis.\n  echo \"Doing per-utterance adaptation for purposes of computing the basis.\"\nelse\n  echo \"Doing per-speaker adaptation for purposes of computing the basis.\"\n  [ `cat $sdata/spk2utt | wc -l` -lt $[41*40] ] && \\\n    echo \"Warning: number of speakers is 
small, might be better to use --per-utt-basis=true.\"\n  spk2utt_opt=\"--spk2utt=ark:$sdata/JOB/spk2utt\"\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: Accumulating statistics for basis-fMLLR computation\"\n# Note: we get Gaussian level alignments with the \"final.mdl\" and the\n# speaker adapted features.\n  $cmd JOB=1:$nj $dir/log/basis_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c $srcdir/ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $dir/final.mdl ark:- ark:- \\| \\\n    gmm-post-to-gpost $dir/final.mdl \"$feats\" ark:- ark:- \\| \\\n    gmm-basis-fmllr-accs-gpost $spk2utt_opt \\\n    $dir/final.mdl \"$sifeats\" ark,s,cs:- $dir/basis.acc.JOB || exit 1;\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$0: computing the basis matrices.\"\n  $cmd $dir/log/basis_training.log \\\n    gmm-basis-fmllr-training $dir/final.mdl $dir/fmllr.basis $dir/basis.acc.* || exit 1;\n  if $cleanup; then\n    rm $dir/basis.acc.* 2>/dev/null\n  fi\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: accumulating stats for online alignment model.\"\n\n  # Accumulate stats for \"online alignment model\"-- this model is computed with\n  # the speaker-independent features and online CMVN, but matches\n  # Gaussian-for-Gaussian with the final speaker-adapted model.\n\n  $cmd JOB=1:$nj $dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $srcdir/ali.JOB.gz|\" ark:-  \\| \\\n    gmm-acc-stats-twofeats $dir/final.mdl \"$feats\" \"$online_sifeats\" \\\n    ark,s,cs:- $dir/final.JOB.acc || exit 1;\n  [ `ls $dir/final.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n  # Update model.\n  $cmd $dir/log/est_online_alimdl.log \\\n    gmm-est --remove-low-count-gaussians=false $dir/final.mdl \\\n    \"gmm-sum-accs - $dir/final.*.acc|\" $dir/final.oalimdl  || exit 1;\n  if $cleanup; then\n    rm $dir/final.*.acc\n  fi\nfi\n\nif [ $stage -le 3 ]; then\n  mkdir -p $dir/conf\n  rm $dir/{plp,mfcc}.conf 2>/dev/null\n  echo \"$0: preparing configuration 
files in $dir/conf\"\n  if [ -f $dir/conf/online_decoding.conf ]; then\n    echo \"$0: moving $dir/conf/online_decoding.conf to $dir/conf/online_decoding.conf.bak\"\n    mv $dir/conf/online_decoding.conf $dir/conf/online_decoding.conf.bak\n  fi\n  conf=$dir/conf/online_decoding.conf\n  echo -n >$conf\n  case \"$feature_type\" in\n    mfcc)\n      echo \"$0: creating $dir/conf/mfcc.conf\"\n      echo \"--mfcc-config=$dir/conf/mfcc.conf\" >>$conf\n      cp conf/mfcc.conf $dir/conf/ ;;\n    plp)\n      echo \"$0: enabling plp features\"\n      echo \"--feature-type=plp\" >>$conf\n      echo \"$0: creating $dir/conf/plp.conf\"\n      echo \"--plp-config=$dir/conf/plp.conf\" >>$conf\n      cp conf/plp.conf $dir/conf/ ;;\n    *)\n      echo \"Unknown feature type $feature_type\"\n  esac\n  if ! cp $online_cmvn_config $dir/conf/online_cmvn.conf; then\n    echo \"$0: error copying online cmvn config to $dir/conf/\"\n    exit 1;\n  fi\n  echo \"--cmvn-config=$dir/conf/online_cmvn.conf\" >>$conf\n  if [ -f $dir/final.mat ]; then\n    echo \"$0: enabling feature splicing\"\n    echo \"--splice-feats\" >>$conf\n    echo \"$0: creating $dir/conf/splice.conf\"\n    for x in $(cat $dir/splice_opts); do echo $x; done > $dir/conf/splice.conf\n    echo \"--splice-config=$dir/conf/splice.conf\" >>$conf\n    echo \"$0: enabling LDA\"\n    echo \"--lda-matrix=$dir/final.mat\" >>$conf\n  else\n    echo \"$0: enabling deltas\"\n    echo \"--add-deltas\" >>$conf\n  fi\n  if $add_pitch; then\n    echo \"$0: enabling pitch features\"\n    echo \"--add-pitch\" >>$conf\n    echo \"$0: creating $dir/conf/pitch.conf\"\n    echo \"--pitch-config=$dir/conf/pitch.conf\" >>$conf\n    if ! cp $pitch_config $dir/conf/pitch.conf; then\n      echo \"$0: error copying pitch config to $dir/conf/\"\n      exit 1;\n    fi;\n    echo \"$0: creating $dir/conf/pitch_process.conf\"\n    echo \"--pitch-process-config=$dir/conf/pitch_process.conf\" >>$conf\n    if ! 
cp $pitch_process_config $dir/conf/pitch_process.conf; then\n      echo \"$0: error copying pitch process config to $dir/conf/\"\n      exit 1;\n    fi;\n    nfields=$(sed -n '2,2p' $dir/global_cmvn.stats | \\\n      perl -e '$_ = <>; s/^\\s+|\\s+$//g; print scalar(split);');\n    if [ $nfields != 17 ]; then\n      echo \"$0: $dir/global_cmvn.stats has $nfields entries per row (expected 17).\"\n      echo \"$0: Did you append pitch features?\"\n      exit 1;\n    fi\n    #offset=$(sed -n '2,2p' $dir/global_cmvn.stats | \\\n    #  perl -e '$_ = <>; s/^\\s+|\\s+$//g; ($t, $c) = (split)[13, 16]; print -$t/$c;');\n    #echo \"--pov-offset=$offset\" >>$dir/conf/pitch_process.conf\n  fi\n\n  echo \"--fmllr-basis=$dir/fmllr.basis\" >>$conf\n  echo \"--online-alignment-model=$dir/final.oalimdl\" >>$conf\n  echo \"--model=$dir/final.mdl\" >>$conf\n  if ! cmp --quiet $dir/final.mdl $dir/final.rescore_mdl; then\n    echo \"--rescore-model=$dir/final.rescore_mdl\" >>$conf\n  fi\n  echo \"--silence-phones=$silphonelist\" >>$conf\n  echo \"--endpoint.silence-phones=$silphonelist\" >>$conf\n  echo \"--global-cmvn-stats=$dir/global_cmvn.stats\" >>$conf\n  echo \"$0: created config file $conf\"\nfi\n"
  },
  {
    "path": "egs/steps/oracle_wer.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright Johns Hopkins University (Author: Daniel Povey)  2013\n# Apache 2.0.\n\n# Begin configuration section.\nwildcard_symbols=\ncmd=run.pl\nacwt=0.08333\nbeam=\nstage=0\ncleanup=true\n# End configuration section.\n\n. utils/parse_options.sh\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ $# != 3 ]; then\n   echo \"Compute lattice oracle WER and depth, optionally pruning and minimizing the lattice\"\n   echo \"beforehand.  To produce oracle WER, requires there to be a file 'text' in data dir\"\n   echo \"(not usable if only stm is present)\"\n   echo \"\"\n   echo \"Usage: $0 [options] <data-dir> <lang-dir> <decode-dir>\"\n   echo \"e.g.: $0 --wildcard-symbols=1:3:4 data/test data/lang exp/tri5/test_tg\"\n   echo \"Options:\"\n   echo \"  --wildcard-symbols <colon-separated-integer-list>  # Allows you to specify words\"\n   echo \"                                                     # to be removed from both reference\"\n   echo \"                                                     # and hypothesis before computing oracle.\"\n   echo \"  --cmd <cmd>                                        # How to run the jobs (default: run.pl)\"\n   echo \"  --acwt <acwt>                                      # Acoustic scale, default $acwt: only\"\n   echo \"                                                     # has an effect if --prune option used.\"\n   echo \"  --beam <prune-beam, e.g. 6.0>                      # Lattice pruning beam (optional; can\"\n   echo \"                                                     # be used to compute oracle and depth at\"\n   echo \"                                                     # various beams.\"\n   echo \"  --stage <stage>                                    # Used to control partial re-runs\"\n   echo \"  --cleanup <true|false>                             # If true, remove pruned lattices.\"\n   exit 1;\nfi\n\n. 
./path.sh || exit 1;\n\ndata=$1\nlang=$2\ndir=$3\n\n\nfor f in $data/text $lang/words.txt $dir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nnj=`cat $dir/num_jobs` || exit 1;\noov_sym=`cat $lang/oov.int`\nsdata=$data/split$nj;\nsplit_data.sh $data $nj || exit 1;\n\nnl=$(grep -v IGNORE_TIME_SEGMENT_IN_SCORING $data/text | wc -l)\nif [ $nl -eq 0 ]; then\n  echo \"$0: error: $data/text only contains IGNORE_TIME_SEGMENT_IN_SCORING, or is empty.\"\n  exit 1;\nfi\n\nif [ ! -z \"$beam\" ]; then\n  prunedir=${dir}/lats_beam${beam}\n  mkdir -p $prunedir/log\n  \n  if [ $stage -le 0 ]; then\n    echo \"$0: creating pruned lattices\"\n    $cmd JOB=1:$nj $prunedir/log/prune.JOB.log \\\n      lattice-prune --acoustic-scale=$acwt --beam=$beam  \\\n        \"ark:gunzip -c $dir/lat.JOB.gz|\" \"ark:|gzip -c >$prunedir/lat.JOB.gz\" || exit 1;\n  fi\nelse\n  prunedir=$dir\nfi\n\nmkdir -p $prunedir/log\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: measuring lattice depth\"\n  $cmd JOB=1:$nj $prunedir/log/lattice_depth.JOB.log \\\n    lattice-depth \"ark:gunzip -c $prunedir/lat.JOB.gz|\" ark:/dev/null || exit 1;\n\n  # look for lines like: LOG (blah:blah.cc:95) Overall density is 153.3 over 164361 frames\n  grep -w Overall $prunedir/log/lattice_depth.*.log | \\\n    awk -v nj=$nj '{num+=$6*$8; den+=$8; nl++} END{ \n      if (nl != nj) { print \"Error: expected \" nj \" lines, got \" nl | \"cat 1>&2\"; }\n      printf(\"%.2f ( %d / %d )\\n\", num/den, num, den); }' > $prunedir/depth || exit 1;\n  echo -n \"Depth is: \"\n  cat $prunedir/depth\nfi\n\n\nif [ $stage -le 2 ]; then\n  echo \"$0: measuring lattice oracle WER\"\n  $cmd JOB=1:$nj $prunedir/log/lattice_oracle.JOB.log \\\n    lattice-oracle --wildcard-symbols=$wildcard_symbols  \\\n    \"ark:gunzip -c $prunedir/lat.JOB.gz|\" \\\n   \"ark:sym2int.pl --map-oov $oov_sym -f 2- $lang/words.txt $sdata/JOB/text | grep -v IGNORE_TIME_SEGMENT_IN_SCORING |\"  \\\n   ark:/dev/null || exit 1;\n\n  # 
look for lines like: LOG (blah:blah.cc:95) Overall %WER 25.6 [ 1243 / 6331, ... ]  \n  grep -w Overall $prunedir/log/lattice_oracle.*.log | \\\n    awk -v nj=$nj '{num+=$7; den+=$9; ins+=$10; del+=$12; sb+=$14; nl++} END{ \n      if (nl != nj) { print \"Error: expected \" nj \" lines, got \" nl | \"cat 1>&2\"; }\n      printf(\"%.2f%% [ %d / %d, %d insertions, %d deletions, %d substitutions ]\\n\", (100.0 * num/den), num, den, ins, del, sb); }' > \\\n      $prunedir/oracle_wer || exit 1;\n  echo -n \"Oracle WER is: \"\n  cat $prunedir/oracle_wer\nfi\n\nif $cleanup && [ ! -z \"$beam\" ]; then\n  echo \"$0: removing pruned lattices in $prunedir\"\n  rm $prunedir/lat.*.gz\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/overlap/get_overlap_segments.py",
    "content": "#! /usr/bin/env python3\n# Copyright   2020   Desh Raj\n# Apache 2.0.\n\"\"\"This script takes an input RTTM and transforms it in a \nparticular way: all overlapping segments are re-labeled \nas \"overlap\". This is useful for 2 cases: \n1. By retaining just the overlap segments (grep overlap),\nthe resulting RTTM can be used to train an overlap\ndetector.\n2. By retaining just the non-overlap segments (grep -v overlap),\nthe resulting file can be used to obtain (fairly) clean \nspeaker embeddings from the single-speaker regions of the\nrecording.\nThe output is written to stdout.\n\"\"\"\n\nimport argparse, os\nimport itertools\nfrom collections import defaultdict\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script filters an RTTM in several ways.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n    parser.add_argument(\"--label\", type=str, default=\"overlap\",\n                        help=\"Label for the overlap segments\")\n    parser.add_argument(\"input_rttm\", type=str,\n                        help=\"path of input rttm file\")\n    args = parser.parse_args()\n    return args\n\nclass Segment:\n    \"\"\"Stores all information about a segment\"\"\"\n\n    def __init__(self, reco_id, start_time, dur = None, end_time = None, spk_id = None):\n        self.reco_id = reco_id\n        self.start_time = start_time\n        if (dur is None):\n            self.end_time = end_time\n            self.dur = end_time - start_time\n        else:\n            self.dur = dur\n            self.end_time = start_time + dur\n        self.spk_id = spk_id\n\ndef groupby(iterable, keyfunc):\n    \"\"\"Wrapper around ``itertools.groupby`` which sorts data first.\"\"\"\n    iterable = sorted(iterable, key=keyfunc)\n    for key, group in itertools.groupby(iterable, keyfunc):\n        yield key, group\n\ndef find_overlapping_segments(segs, label):\n    reco_id = segs[0].reco_id\n    tokens = []\n    
for seg in segs:\n        tokens.append((\"BEG\", seg.start_time))\n        tokens.append((\"END\", seg.end_time))\n    sorted_tokens = sorted(tokens, key=lambda x: x[1])\n    \n    overlap_segs = []\n    spkr_count = 0\n    ovl_begin = 0\n    ovl_end = 0\n    for token in sorted_tokens:\n        if (token[0] == \"BEG\"):\n            spkr_count +=1\n            if (spkr_count == 2):\n                ovl_begin = token[1]\n        else:\n            spkr_count -= 1\n            if (spkr_count == 1):\n                ovl_end = token[1]\n                overlap_segs.append(Segment(reco_id, ovl_begin, end_time=ovl_end, spk_id=label))\n    \n    return overlap_segs\n\ndef find_single_speaker_segments(segs):\n    reco_id = segs[0].reco_id\n    tokens = []\n    for seg in segs:\n        tokens.append((\"BEG\", seg.start_time, seg.spk_id))\n        tokens.append((\"END\", seg.end_time, seg.spk_id))\n    sorted_tokens = sorted(tokens, key=lambda x: x[1])\n    \n    single_speaker_segs = []\n    running_spkrs = set()\n    for token in sorted_tokens:\n        if (token[0] == \"BEG\"):\n            running_spkrs.add(token[2])\n            if (len(running_spkrs) == 1):\n                seg_begin = token[1]\n                cur_spkr = token[2]\n            elif (len(running_spkrs) == 2):\n                single_speaker_segs.append(Segment(reco_id, seg_begin, end_time=token[1], spk_id=cur_spkr))\n        elif (token[0] == \"END\"):\n            try:\n                running_spkrs.remove(token[2])\n            except KeyError:\n                Warning (\"Speaker not found\")\n            if (len(running_spkrs) == 1):\n                seg_begin = token[1]\n                cur_spkr = list(running_spkrs)[0]\n            elif (len(running_spkrs) == 0):\n                single_speaker_segs.append(Segment(reco_id, seg_begin, end_time=token[1], spk_id=cur_spkr))\n    \n    return single_speaker_segs\n\ndef main():\n    args = get_args()\n\n    # First we read all segments and store as a list of 
objects\n    segments = []\n    with open(args.input_rttm, 'r') as f:\n        for line in f.readlines():\n            parts = line.strip().split()\n            segments.append(Segment(parts[1], float(parts[3]), dur=float(parts[4]), spk_id=parts[7]))\n\n    # We group the segment list into a dictionary indexed by reco_id\n    reco2segs = defaultdict(list,\n        {reco_id : list(g) for reco_id, g in groupby(segments, lambda x: x.reco_id)})\n\n    overlap_segs = []\n    for reco_id in reco2segs.keys():\n        segs = reco2segs[reco_id]\n        overlap_segs.extend(find_overlapping_segments(segs, args.label))\n\n    single_speaker_segs = []\n    for reco_id in reco2segs.keys():\n        segs = reco2segs[reco_id]\n        single_speaker_segs.extend(find_single_speaker_segments(segs))\n    final_segs = sorted(overlap_segs + single_speaker_segs, key = lambda x: (x.reco_id, x.start_time))\n    \n    rttm_str = \"SPEAKER {0} 1 {1:7.3f} {2:7.3f} <NA> <NA> {3} <NA> <NA>\"\n    for seg in final_segs:\n        if (seg.dur > 0):\n            print(rttm_str.format(seg.reco_id, seg.start_time, seg.dur, seg.spk_id))\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/overlap/get_overlap_targets.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2020  Desh Raj (Johns Hopkins University)\n# Apache 2.0\n\n# This script prepares targets for whole recordings for training\n# an overlap detector system. It just takes as input a data dir\n# where it is assumed that MFCC features have already been\n# extracted. It also takes an overlap RTTM file containing\n# \"single\" and \"overlap\" segments, ideally generated using the\n# get_overlap_segments.py script. It uses these segments to \n# obtain per-frame targets for the recordings in the format:\n# [ silence single overlap ]\n\nfrom __future__ import division\n\nimport argparse\nimport logging\nimport numpy as np\nimport subprocess\nimport sys\nimport itertools\nfrom collections import defaultdict\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script prepares targets for whole recordings for training\n            an overlap detector system. It just takes as input a data dir\n            where it is assumed that MFCC features have already been\n            extracted. It also takes an overlap RTTM file containing\n            \"single\" and \"overlap\" segments, ideally generated using the\n            get_overlap_segments.py script. It uses these segments to \n            obtain per-frame targets for the recordings in the format:\n            [ silence single overlap ]\n        \"\"\")\n\n    parser.add_argument(\"--frame-shift\", type=float, default=0.01,\n                        help=\"Frame shift value in seconds\")\n    parser.add_argument(\"--label-smoothing\", type=float, default=0.0,\n                        help=\"Value between 0 and 1. 
Amount of label smoothing to apply\"\n                        \"to get soft labels instead of one-hot labels\")\n    parser.add_argument(\"reco2num_frames\", type=str,\n                        help=\"\"\"The number of frames per reco\n                        is used to determine the num-rows of the output matrix\n                        \"\"\")\n    parser.add_argument(\"overlap_rttm\", type=str,\n                        help=\"Input RTTM file containing single and overlap segments\")\n    parser.add_argument(\"out_targets_ark\", type=str,\n                        help=\"\"\"Output archive to which the\n                        recording-level matrix will be written in text\n                        format\"\"\")\n\n    args = parser.parse_args()\n\n    if args.frame_shift < 0.0001 or args.frame_shift > 1:\n        raise ValueError(\"--frame-shift should be in [0.0001, 1]; got {0}\"\n                         \"\".format(args.frame_shift))\n    return args\n\nclass Segment:\n    \"\"\"Stores all information about a segment\"\"\"\n    reco_id = ''\n    spk_id = ''\n    start_time = 0\n    dur = 0\n    end_time = 0\n\n    def __init__(self, reco_id, start_time, dur = None, end_time = None, label = None):\n        self.reco_id = reco_id\n        self.start_time = start_time\n        if (dur is None):\n            self.end_time = end_time\n            self.dur = end_time - start_time\n        else:\n            self.dur = dur\n            self.end_time = start_time + dur\n        self.label = label\n\ndef groupby(iterable, keyfunc):\n    \"\"\"Wrapper around ``itertools.groupby`` which sorts data first.\"\"\"\n    iterable = sorted(iterable, key=keyfunc)\n    for key, group in itertools.groupby(iterable, keyfunc):\n        yield key, group\n\ndef run(args):\n    # Get all reco to num_frames, which will be used to decide the number of\n    # rows of matrix\n    reco2num_frames = {}\n    with common_lib.smart_open(args.reco2num_frames) as f:\n        for line in f:\n        
    parts = line.strip().split()\n            if len(parts) != 2:\n                raise ValueError(\"Could not parse line {0}\".format(line))\n            reco2num_frames[parts[0]] = int(parts[1])\n\n    # We read all segments and store as a list of objects\n    segments = []\n    with common_lib.smart_open(args.overlap_rttm) as f:\n        for line in f.readlines():\n            parts = line.strip().split()\n            segments.append(Segment(parts[1], float(parts[3]), dur=float(parts[4]), label=parts[7]))\n\n    # We group the segment list into a dictionary indexed by reco_id\n    reco2segs = defaultdict(list,\n        {reco_id : list(g) for reco_id, g in groupby(segments, lambda x: x.reco_id)})\n\n    # Now, for each reco, create a matrix of shape num_frames x 3 and fill in using\n    # the segments information for that reco\n    reco2targets = {}\n    for reco_id in reco2num_frames:\n        segs = sorted(reco2segs[reco_id], key=lambda x: x.start_time)\n\n        target_val = 1 - args.label_smoothing\n        other_val = args.label_smoothing / 2\n        silence_vec = np.array([target_val,other_val,other_val], dtype=float)\n        single_vec = np.array([other_val,target_val,other_val], dtype=float)\n        overlap_vec = np.array([other_val,other_val,target_val], dtype=float)\n        num_targets = [0,0,0]\n\n        # The default target (if not single or overlap) is silence\n        targets_mat = np.tile(silence_vec, (reco2num_frames[reco_id],1))\n\n        # Now iterate over all segments of the recording and assign targets\n        for seg in segs:\n            start_frame = int(seg.start_time / args.frame_shift)\n            end_frame = min(int(seg.end_time / args.frame_shift), reco2num_frames[reco_id])\n            num_frames = end_frame - start_frame\n            if (num_frames <= 0):\n                continue\n            if (seg.label == \"overlap\"):\n                targets_mat[start_frame:end_frame] = np.tile(overlap_vec, (num_frames,1))\n 
               num_targets[2] += end_frame - start_frame\n            else:\n                targets_mat[start_frame:end_frame] = np.tile(single_vec, (num_frames,1))\n                num_targets[1] += end_frame - start_frame\n\n        num_targets[0] = reco2num_frames[reco_id] - sum(num_targets)\n        # print (\"{}: {}\".format(reco_id, num_targets))\n        reco2targets[reco_id] = targets_mat\n\n    with common_lib.smart_open(args.out_targets_ark, 'w') as f:\n        for reco_id in sorted(reco2targets.keys()):\n            common_lib.write_matrix_ascii(f, reco2targets[reco_id].tolist(), key=reco_id)\n\ndef main():\n    args = get_args()\n    try:\n        run(args)\n    except Exception:\n        raise\n\nif __name__ == \"__main__\":\n    main()"
  },
  {
    "path": "egs/steps/overlap/output_to_rttm.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n#           2018  Capital One (Author: Zhiyuan Guan)\n# Apache 2.0\n\n\"\"\"\nThis script converts frame-level overlap detector marks (in kaldi\ninteger vector text archive format) into kaldi segments and utt2spk.\nThe input integer vectors are expected to contain '1' for silence frames,\n'2' for speech frames of single speaker, and '3' for overlap frames.\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\nglobal_verbose = 0\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"\nThis script converts frame-level speech activity detection marks (in kaldi\ninteger vector text archive format) into kaldi segments and utt2spk.\nThe input integer vectors are expected to contain 1 for silence frames\nand 2 for speech frames.\n\"\"\",\n        formatter_class=argparse.RawTextHelpFormatter)\n\n    parser.add_argument(\"--verbose\", type=int, choices=[0, 1, 2, 3],\n                        default=0, help=\"Higher verbosity for more logging\")\n\n    parser.add_argument(\"--utt2dur\", type=str,\n                        help=\"File containing durations of utterances.\")\n\n    parser.add_argument(\"--frame-shift\", type=float, default=0.01,\n                        help=\"Frame shift to convert frame indexes to time\")\n\n    parser.add_argument(\"--segment-padding\", type=float, default=0.2,\n                        help=\"Additional padding on speech segments. 
But we \"\n                             \"ensure that the padding does not go beyond the \"\n                             \"adjacent segment.\")\n\n    parser.add_argument(\"--min-segment-dur\", type=float, default=0,\n                        help=\"Minimum duration (in seconds) required for a segment \"\n                             \"to be included. This is before any padding. Segments \"\n                             \"shorter than this duration will be removed.\")\n\n    parser.add_argument(\"--merge-consecutive-max-dur\", type=float, default=0,\n                        help=\"Merge consecutive segments as long as the merged \"\n                             \"segment is no longer than this many seconds. The segments \"\n                             \"are only merged if their boundaries are touching. \"\n                             \"This is after padding by --segment-padding seconds.\"\n                             \"0 means do not merge. Use 'inf' to not limit the duration.\")\n\n    parser.add_argument(\"--region-type\", type=str, default=\"overlap\",\n                        help=\"Specify if overlap or single-speaker or silence region \"\n                        \"to output in the rttm\")\n\n    parser.add_argument(\"in_ovl\", type=str,\n                        help=\"Input file containing alignments in \"\n                             \"text archive format\")\n\n    parser.add_argument(\"out_rttm\", type=str,\n                        help=\"Output kaldi segments file\")\n\n    args = parser.parse_args()\n\n    global global_verbose\n    global_verbose = args.verbose\n\n    logger.info(\"Setting verbosity to {0}\".format(global_verbose))\n\n    if args.verbose >= 3:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n    return args\n\n\ndef to_str(segment):\n    assert len(segment) == 3\n    return \"[{0:.3f}, {1:.3f}, {2}]\".format(segment[0], segment[1],\n                                            segment[2])\n\n\nclass 
SegmenterStats(object):\n    \"\"\"Stores stats about the post-process stages\"\"\"\n\n    def __init__(self):\n        self.num_segments_initial = 0\n        self.num_short_segments_filtered = 0\n        self.num_merges = 0\n        self.num_segments_final = 0\n        self.initial_duration = 0.0\n        self.padding_duration = 0.0\n        self.filter_short_duration = 0.0\n        self.final_duration = 0.0\n\n    def add(self, other):\n        \"\"\"Adds stats from another object\"\"\"\n        self.num_segments_initial += other.num_segments_initial\n        self.num_short_segments_filtered += other.num_short_segments_filtered\n        self.num_merges += other.num_merges\n        self.num_segments_final += other.num_segments_final\n        self.initial_duration += other.initial_duration\n        self.filter_short_duration += other.filter_short_duration\n        self.padding_duration += other.padding_duration\n        self.final_duration += other.final_duration\n\n    def __str__(self):\n        return (\"num-segments-initial={num_segments_initial}, \"\n                \"num-short-segments-filtered={num_short_segments_filtered}, \"\n                \"num-merges={num_merges}, \"\n                \"num-segments-final={num_segments_final}, \"\n                \"initial-duration={initial_duration}, \"\n                \"filter-short-duration={filter_short_duration}, \"\n                \"padding-duration={padding_duration}, \"\n                \"final-duration={final_duration}\".format(\n            num_segments_initial=self.num_segments_initial,\n            num_short_segments_filtered=self.num_short_segments_filtered,\n            num_merges=self.num_merges,\n            num_segments_final=self.num_segments_final,\n            initial_duration=self.initial_duration,\n            filter_short_duration=self.filter_short_duration,\n            padding_duration=self.padding_duration,\n            final_duration=self.final_duration))\n\n\ndef 
process_label(text_label):\n    \"\"\"Processes an input integer label and returns a 1, 2 or 3,\n    where 1 is for silence, 2 is for single-speaker region, and \n    3 is for overlap.\n\n    Arguments:\n        text_label -- input label (must be integer)\n    \"\"\"\n    prev_label = int(text_label)\n    if prev_label not in [1, 2, 3]:\n        raise ValueError(\"Expecting label to be 1 (silence), 2 (single speaker) or 3 (overlap); \"\n                         \"got {}\".format(prev_label))\n\n    return prev_label\n\n\nclass Segmentation(object):\n    \"\"\"Stores segmentation for an utterance\"\"\"\n\n    region_to_label = {'silence':1, 'single':2, 'overlap':3}\n\n    def __init__(self, region_type):\n        self.segments = None\n        self.label = self.region_to_label[region_type]\n        self.region_type = region_type\n        self.stats = SegmenterStats()\n\n    def initialize_segments(self, alignment, frame_shift=0.01):\n        \"\"\"Initializes segments from input alignment.\n        The alignment is frame-level overlap detection marks,\n        each of which must be 1, 2, or 3.\"\"\"\n        self.segments = []\n\n        assert len(alignment) > 0\n\n        prev_label = None\n        prev_length = 0\n        for i, text_label in enumerate(alignment):\n            if prev_label is not None and int(text_label) != prev_label:\n                if prev_label == self.label:\n                    self.segments.append(\n                        [float(i - prev_length) * frame_shift,\n                         float(i) * frame_shift, prev_label])\n                    self.stats.initial_duration += (prev_length * frame_shift)\n                prev_label = process_label(text_label)\n                prev_length = 0\n            elif prev_label is None:\n                prev_label = process_label(text_label)\n\n            prev_length += 1\n\n        if prev_length > 0 and prev_label == self.label:\n            self.segments.append(\n                
[float(len(alignment) - prev_length) * frame_shift,\n                 float(len(alignment)) * frame_shift, prev_label])\n            self.stats.initial_duration += (prev_length * frame_shift)\n\n        self.stats.num_segments_initial = len(self.segments)\n        self.stats.num_segments_final = len(self.segments)\n        self.stats.final_duration = self.stats.initial_duration\n\n    def filter_short_segments(self, min_dur):\n        \"\"\"Filters out segments with durations shorter than 'min_dur'.\"\"\"\n        if min_dur <= 0:\n            return\n\n        segments_kept = []\n        for segment in self.segments:\n            assert segment[2] == self.label, segment\n            dur = segment[1] - segment[0]\n            if dur < min_dur:\n                self.stats.filter_short_duration += dur\n                self.stats.num_short_segments_filtered += 1\n            else:\n                segments_kept.append(segment)\n        self.segments = segments_kept\n        self.stats.num_segments_final = len(self.segments)\n        self.stats.final_duration -= self.stats.filter_short_duration\n\n    def pad_segments(self, segment_padding, max_duration=float(\"inf\")):\n        \"\"\"Pads segments by duration 'segment_padding' on either sides, but\n        ensures that the segments don't go beyond the neighboring segments\n        or the duration of the utterance 'max_duration'.\"\"\"\n        if max_duration == None:\n            max_duration = float(\"inf\")\n        for i, segment in enumerate(self.segments):\n            assert segment[2] == self.label, segment\n            segment[0] -= segment_padding  # try adding padding on the left side\n            self.stats.padding_duration += segment_padding\n            if segment[0] < 0.0:\n                # Padding takes the segment start to before the beginning of the utterance.\n                # Reduce padding.\n                self.stats.padding_duration += segment[0]\n                segment[0] = 0.0\n            
if i >= 1 and self.segments[i - 1][1] > segment[0]:\n                # Padding takes the segment start to before the end the previous segment.\n                # Reduce padding.\n                self.stats.padding_duration -= (\n                        self.segments[i - 1][1] - segment[0])\n                segment[0] = self.segments[i - 1][1]\n\n            segment[1] += segment_padding\n            self.stats.padding_duration += segment_padding\n            if segment[1] >= max_duration:\n                # Padding takes the segment end beyond the max duration of the utterance.\n                # Reduce padding.\n                self.stats.padding_duration -= (segment[1] - max_duration)\n                segment[1] = max_duration\n            if (i + 1 < len(self.segments)\n                    and segment[1] > self.segments[i + 1][0]):\n                # Padding takes the segment end beyond the start of the next segment.\n                # Reduce padding.\n                self.stats.padding_duration -= (\n                        segment[1] - self.segments[i + 1][0])\n                segment[1] = self.segments[i + 1][0]\n        self.stats.final_duration += self.stats.padding_duration\n\n    def merge_consecutive_segments(self, max_dur):\n        \"\"\"Merge consecutive segments (happens after padding), provided that\n        the merged segment is no longer than 'max_dur'.\"\"\"\n        if max_dur <= 0 or not self.segments:\n            return\n\n        merged_segments = [self.segments[0]]\n        for segment in self.segments[1:]:\n            assert segment[2] == self.label, segment\n            if segment[0] == merged_segments[-1][1] and \\\n                    segment[1] - merged_segments[-1][0] <= max_dur:\n                # The segment starts at the same time the last segment ends,\n                # and the merged segment is shorter than 'max_dur'.\n                # Extend the previous segment.\n                merged_segments[-1][1] = segment[1]\n          
      self.stats.num_merges += 1\n            else:\n                merged_segments.append(segment)\n\n        self.segments = merged_segments\n        self.stats.num_segments_final = len(self.segments)\n\n    def write(self, key, file_handle):\n        \"\"\"Write segments to RTTM file\"\"\"\n        if global_verbose >= 2:\n            logger.info(\"For key {key}, got stats {stats}\".format(\n                key=key, stats=self.stats))\n        rttm_str = \"SPEAKER {0} 1 {1:7.3f} {2:7.3f} <NA> <NA> {3} <NA> <NA>\"\n        for segment in self.segments:\n            print(rttm_str.format(key, segment[0], segment[1] - segment[0], self.region_type),\n                file=file_handle)\n\n\ndef run(args):\n    \"\"\"The main function that does everything.\"\"\"\n    utt2dur = {}\n    if args.utt2dur is not None:\n        with common_lib.smart_open(args.utt2dur) as utt2dur_fh:\n            for line in utt2dur_fh:\n                parts = line.strip().split()\n                if len(parts) != 2:\n                    raise RuntimeError(\"Unable to parse line '{0}' in {1}\"\n                                       \"\".format(line.strip(), args.utt2dur))\n                utt2dur[parts[0]] = float(parts[1])\n\n    global_stats = SegmenterStats()\n    with common_lib.smart_open(args.in_ovl) as in_ovl_fh, \\\n            common_lib.smart_open(args.out_rttm, 'w') as out_rttm_fh:\n        for line in in_ovl_fh:\n            parts = line.strip().split()\n            utt_id = parts[0]\n\n            if len(parts) < 2:\n                raise RuntimeError(\"Unable to parse line '{0}' in {1}\"\n                                   \"\".format(line.strip(),\n                                             in_ovl_fh))\n\n            segmentation = Segmentation(args.region_type)\n            segmentation.initialize_segments(\n                parts[1:], args.frame_shift)\n            segmentation.filter_short_segments(args.min_segment_dur)\n            
segmentation.pad_segments(args.segment_padding,\n                                             None if args.utt2dur is None\n                                             else utt2dur[utt_id])\n            segmentation.merge_consecutive_segments(args.merge_consecutive_max_dur)\n            segmentation.write(utt_id, out_rttm_fh)\n            global_stats.add(segmentation.stats)\n    logger.info(global_stats)\n\n\ndef main():\n    \"\"\"Parses arguments and calls the run method\"\"\"\n    args = get_args()\n    try:\n        run(args)\n    except Exception:\n        raise\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/overlap/post_process_output.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2015-17  Vimal Manohar\n#           2020     Desh Raj\n# Apache 2.0.\n\n# This script post-processes the output of the overlap neural network,\n# which is in the form of frame-level alignments, into an RTTM file.\n# The alignments must be 0/1/2 denoting silence/single/overlap. Based\n# on this, this script can also be used to get single speaker regions.\n\nset -e -o pipefail -u\n. ./path.sh\n\ncmd=run.pl\nstage=-10\nnj=18\n\nregion_type=overlap # change this to \"single\" to get only single-speaker regions\n\n# The values below are in seconds\nframe_shift=0.01\nsegment_padding=0.2\nmin_segment_dur=0\nmerge_consecutive_max_dur=inf\n\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"This script post-processes the output of steps/segmentation/decode_sad.sh, \"\n  echo \"which is in the form of frame-level alignments, into kaldi segments. \"\n  echo \"The alignments must be speech activity detection marks i.e. 1 for silence \"\n  echo \"and 2 for speech.\"\n  echo \"Usage: $0 <data-dir> <output-dir> <rttm-dir>\"\n  echo \" e.g.: $0 data/dev_aspire_whole exp/vad_dev_aspire\"\n  exit 1\nfi\n\ndata_dir=$1\noutput_dir=$2    # Alignment directory containing frame-level SAD labels\ndir=$3\n\nmkdir -p $dir\n\nfor f in $output_dir/ali.1.gz $output_dir/num_jobs; do\n  if [ ! 
-f $f ]; then\n    echo \"$0: Could not find file $f\" && exit 1\n  fi\ndone\n\nnj=`cat $output_dir/num_jobs` || exit 1\nutils/split_data.sh $data_dir $nj\n\nutils/data/get_utt2dur.sh $data_dir\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/segmentation.JOB.log \\\n    copy-int-vector \"ark:gunzip -c $output_dir/ali.JOB.gz |\" ark,t:- \\| \\\n    steps/overlap/output_to_rttm.py \\\n      --region-type=$region_type \\\n      --frame-shift=$frame_shift --segment-padding=$segment_padding \\\n      --min-segment-dur=$min_segment_dur --merge-consecutive-max-dur=$merge_consecutive_max_dur \\\n      --utt2dur=$data_dir/utt2dur - $dir/rttm_${region_type}.JOB\nfi\n\necho $nj > $dir/num_jobs\n\nfor n in $(seq $nj); do \n  cat $dir/rttm_${region_type}.$n\ndone > $dir/rttm_${region_type}\n"
  },
  {
    "path": "egs/steps/overlap/prepare_overlap_graph.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0\n\n\"\"\"Prepares a graph with a simple HMM topology for segmentation\nwith minimum and maximum speech duration constraints and minimum silence\nduration constraint. The graph is written to the 'output_graph', which\ncan be file or \"-\" for stdout.\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport math\nimport os\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script prepares a graph with a simple HMM topology\n        for overlap detection with minimum and maximum speech duration constraints\n        and minimum silence duration constraint. Additionally, we enforce the \n        constraint that there cannot be a direct transition between silence and\n        overlap states. The graph is written to the 'output_graph', which can be \n        file or \"-\" for stdout.\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\n    parser.add_argument(\"--transition-scale\", type=float, default=1.0,\n                        help=\"\"\"Scale on transition probabilities relative to\n                        LM weights\"\"\")\n    parser.add_argument(\"--loopscale\", type=float, default=0.1,\n                        help=\"\"\"Scale on self-loop log-probabilities relative\n                        to LM weights\"\"\")\n\n    parser.add_argument(\"--min-silence-duration\", type=float, default=0.01,\n                        help=\"\"\"Minimum duration for silence\"\"\")\n    parser.add_argument(\"--min-speech-duration\", type=float, default=0.3,\n                        help=\"\"\"Minimum duration for speech\"\"\")\n    parser.add_argument(\"--max-speech-duration\", type=float, default=10.0,\n                        help=\"\"\"Maximum duration for speech\"\"\")\n    
parser.add_argument(\"--min-overlap-duration\", type=float, default=0.1,\n                        help=\"\"\"Minimum duration for overlap\"\"\")\n    parser.add_argument(\"--max-overlap-duration\", type=float, default=5.0,\n                        help=\"\"\"Maximum duration for overlap\"\"\")\n    parser.add_argument(\"--frame-shift\", type=float, default=0.03,\n                        help=\"\"\"Frame shift in seconds\"\"\")\n\n    parser.add_argument(\"--edge-silence-probability\", type=float,\n                        default=0.5,\n                        help=\"Probability of silence at the edges.\")\n    parser.add_argument(\"--transition-probability\", type=float, default=0.1,\n                        help=\"Transition probability for silence to speech \"\n                        \"or vice-versa\")\n\n    parser.add_argument(\"output_graph\", type=str,\n                        help=\"Output graph\")\n    args = parser.parse_args()\n\n    args.min_states_silence = int(args.min_silence_duration / args.frame_shift\n                                  + 0.5)\n    args.min_states_speech = int(args.min_speech_duration / args.frame_shift\n                                 + 0.5)\n    args.max_states_speech = int(args.max_speech_duration / args.frame_shift\n                                 + 0.5)\n    args.min_states_overlap = int(args.min_overlap_duration / args.frame_shift\n                                 + 0.5)\n    args.max_states_overlap = int(args.max_overlap_duration / args.frame_shift\n                                 + 0.5)\n\n    return args\n\n\ndef print_states(args, file_handle):\n    # Initial transition to silence\n    print (\"0 1 silence silence {0}\".format(-math.log(args.edge_silence_probability)),\n           file=file_handle)\n    silence_start_state = 1\n\n    # Silence min duration transitions\n    # 1->2, 2->3 and so on until\n    # (1 + min_states_silence - 2) -> (1 + min_states_silence - 1)  ...\n    for state in range(silence_start_state,\n   
                    silence_start_state + args.min_states_silence - 1):\n        print (\"{state} {next_state} silence silence {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n    silence_last_state = silence_start_state + args.min_states_silence - 1\n\n    # Silence self-loop\n    print (\"{state} {state} silence silence {cost}\".format(\n                state=silence_last_state, cost=0.0),\n           file=file_handle)\n\n    speech_start_state = silence_last_state + 1\n    # Initial transition to speech\n    print (\"0 {state} single single {cost}\".format(\n                state=speech_start_state,\n                cost=-math.log(1.0 - args.edge_silence_probability)),\n           file=file_handle)\n\n    # Silence to speech transition\n    print (\"{sil_state} {speech_state} single single {cost}\".format(\n                sil_state=silence_last_state,\n                speech_state=speech_start_state,\n                cost=-math.log(args.transition_probability)),\n           file=file_handle)\n     \n    # Speech min duration\n    for state in range(speech_start_state,\n                       speech_start_state + args.min_states_speech - 1):\n        print (\"{state} {next_state} single single {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n\n    speech_last_state = speech_start_state + args.max_states_speech - 1\n    overlap_start_state = speech_last_state + 1\n    # Speech max duration\n    for state in range(speech_start_state + args.min_states_speech - 1,\n                       speech_start_state + args.max_states_speech - 1):\n        print (\"{state} {next_state} single single {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n\n        print (\"{state} {sil_state} silence silence {cost}\".format(\n                    state=state, 
sil_state=silence_start_state,\n                    cost=-math.log(args.transition_probability)),\n               file=file_handle)\n\n        print (\"{state} {ovl_state} overlap overlap {cost}\".format(\n                    state=state, ovl_state=overlap_start_state,\n                    cost=-math.log(args.transition_probability)),\n               file=file_handle)\n\n    # Transition to silence after max duration of speech\n    print (\"{state} {sil_state} silence silence {cost}\".format(\n                state=speech_last_state, sil_state=silence_start_state,\n                cost=0.0),\n           file=file_handle)\n\n    \n    # Transition to overlap after max duration of speech\n    print (\"{state} {ovl_state} overlap overlap {cost}\".format(\n                state=speech_last_state, ovl_state=overlap_start_state,\n                cost=0),\n           file=file_handle)\n\n    # Overlap min duration\n    for state in range(overlap_start_state,\n                       overlap_start_state + args.min_states_overlap - 1):\n        print (\"{state} {next_state} overlap overlap {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n\n    # Overlap max duration\n    for state in range(overlap_start_state + args.min_states_overlap - 1,\n                       overlap_start_state + args.max_states_overlap - 1):\n        print (\"{state} {next_state} overlap overlap {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n\n        print (\"{state} {speech_state} single single {cost}\".format(\n                    state=state, speech_state=speech_start_state,\n                    cost=-math.log(args.transition_probability)),\n               file=file_handle)\n    overlap_last_state = overlap_start_state + args.max_states_overlap - 1\n\n    # Transition to speech after max duration of overlap\n    print (\"{state} {speech_state} single single 
{cost}\".format(\n                state=overlap_last_state, speech_state=speech_start_state,\n                cost=0.0),\n           file=file_handle)\n\n    for state in range(1, speech_start_state):\n        print (\"{state} {cost}\".format(\n                    state=state, cost=-math.log(args.edge_silence_probability)),\n               file=file_handle)\n\n    for state in range(speech_start_state, speech_last_state + 1):\n        print (\"{state} {cost}\".format(\n                    state=state,\n                    cost=-math.log(1.0 - args.edge_silence_probability)),\n               file=file_handle)\n\n    for state in range(overlap_start_state, overlap_last_state + 1):\n        print (\"{state} {cost}\".format(\n                    state=state,\n                    cost=0),\n               file=file_handle) \n\n\ndef main():\n    try:\n        args = get_args()\n        with common_lib.smart_open(args.output_graph, 'w') as f:\n            print_states(args, f)\n    except Exception:\n        raise\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/paste_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Brno University of Technology (Author: Karel Vesely)\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n# This script appends the features in two or more data directories.\n\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\ncmd=run.pl\nnj=4\nlength_tolerance=10 # length tolerance in frames (trim to shortest)\ncompress=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 5 ]; then\n   echo \"usage: $0 [options] <src-data-dir1> <src-data-dir2> [<src-data-dirN>] <dest-data-dir> <log-dir> <path-to-storage-dir>\";\n   echo \"e.g.: $0 data/train_mfcc data/train_bottleneck data/train_combined exp/append_mfcc_plp mfcc\"\n   echo \"options: \"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata_src_arr=(${@:1:$(($#-3))}) #array of source data-dirs\ndata=${@: -3: 1}\nlogdir=${@: -2: 1}\nark_dir=${@: -1: 1} #last arg.\n\ndata_src_first=${data_src_arr[0]} # get 1st src dir\n\n# make $ark_dir an absolute pathname.\nark_dir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $ark_dir ${PWD}`\n\nfor data_src in ${data_src_arr[@]}; do\n  utils/split_data.sh $data_src $nj || exit 1;\ndone\n\nmkdir -p $ark_dir $logdir\n\nmkdir -p $data\ncp $data_src_first/* $data/ 2>/dev/null # so we get the other files, such as utt2spk.\nrm $data/cmvn.scp 2>/dev/null\nrm $data/feats.scp 2>/dev/null\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\n# get list of source scp's for pasting\ndata_src_args=\nfor data_src in ${data_src_arr[@]}; do\n  data_src_args=\"$data_src_args scp:$data_src/split$nj/JOB/feats.scp\"\ndone\n\nfor n in $(seq $nj); do\n  # the next command does nothing unless $ark_dir/storage/ exists, 
see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $ark_dir/pasted_$name.$n.ark\ndone\n\n$cmd JOB=1:$nj $logdir/append.JOB.log \\\n   paste-feats --length-tolerance=$length_tolerance $data_src_args ark:- \\| \\\n   copy-feats --compress=$compress ark:- \\\n    ark,scp:$ark_dir/pasted_$name.JOB.ark,$ark_dir/pasted_$name.JOB.scp || exit 1;\n\n# concatenate the .scp files together; the loop's stdout is redirected once.\nfor ((n=1; n<=nj; n++)); do\n  cat $ark_dir/pasted_$name.$n.scp || exit 1;\ndone > $data/feats.scp || exit 1;\n\n\nnf=`cat $data/feats.scp | wc -l`\nnu=`cat $data/utt2spk | wc -l`\nif [ $nf -ne $nu ]; then\n  echo \"It seems not all of the feature files were successfully processed ($nf != $nu);\"\n  echo \"consider using utils/fix_data_dir.sh $data\"\nfi\n\necho \"Succeeded pasting features for $name into $data\"\n"
  },
  {
    "path": "egs/steps/pytorchnn/check_py.py",
    "content": "import numpy as np\nimport torch\n"
  },
  {
    "path": "egs/steps/pytorchnn/compute_sentence_scores.py",
    "content": "# Copyright 2020    Ke Li\n\n\"\"\" This script computes sentence scores with a PyTorch trained neural LM.\n    It is called by steps/pytorchnn/lmrescore_nbest_pytorchnn.sh\n\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nimport argparse\nfrom collections import defaultdict\n\nimport torch\nimport torch.nn as nn\n\n\ndef load_nbest(path):\n    r\"\"\"Read nbest lists.\n\n    Assume the input file format is as following:\n        en_4156-A_030185-030248-1 oh yeah\n        en_4156-A_030470-030672-1 well i'm going to have mine and two more classes\n        en_4156-A_030470-030672-2 well i'm gonna have mine and two more classes\n        ...\n\n    Args:\n        path (str): A file of nbest lists with the above format.\n\n    Returns:\n        The nbest lists represented by a dictionary from string to a list of\n        strings. The key is utterance id and the value is the hypotheses.\n    \"\"\"\n\n    nbest = defaultdict()\n    with open(path, 'r', encoding='utf-8') as f:\n        for line in f:\n            line = line.strip()\n            try:\n                key, hyp = line.split(' ', 1)\n            except ValueError:\n                key = line\n                hyp = ' '\n            key = key.rsplit('-', 1)[0]\n            if key not in nbest:\n                nbest[key] = [hyp]\n            else:\n                nbest[key].append(hyp)\n    return nbest\n\n\ndef read_vocab(path):\n    r\"\"\"Read vocabulary.\n\n    Args:\n        path (str): A file with a word and its integer index per line.\n\n    Returns:\n        A vocabulary represented by a dictionary from string to int (starting\n        from 0).\n    \"\"\"\n\n    word2idx = {}\n    idx2word = []\n    with open(path, 'r', encoding='utf-8') as f:\n        for line in f:\n            word = line.split()\n            assert len(word) == 2\n            word = word[0]\n            if word not in word2idx:\n 
               idx2word.append(word)\n                word2idx[word] = len(idx2word) - 1\n    return word2idx\n\n\ndef get_input_and_target(hyp, vocab):\n    r\"\"\"Convert a sentence to lists of integers, with input and target separately.\n\n    Args:\n        hyp (str):  Sentence, with words separated by spaces, e.g. 'hello there'\n        vocab:      A dictionary from string to int, e.g. {'<s>':0, 'hello':1,\n                    'there':2, 'apple':3, ...}\n\n    Returns:\n        A pair of lists, one with the integerized input sequence, one with the\n        integerized output/target sequence: in this case ([0, 1, 2], [1 2 0]),\n        because the input sequence has '<s>' added at the start and the output\n        sequence has '<s>' added at the end.\n        Words that are not in the vocabulary will be converted to '<unk>', which\n        is expected to be in the vocabulary if there are out-of-vocabulary words.\n    \"\"\"\n\n    input_string = '<s> ' + hyp\n    output_string = hyp + ' <s>'\n    input_ids, output_ids = [], []\n    for word in input_string.split():\n        try:\n            input_ids.append(vocab[word.lower()])\n        except KeyError:\n            input_ids.append(vocab['<unk>'])\n    for word in output_string.split():\n        try:\n            output_ids.append(vocab[word.lower()])\n        except KeyError:\n            output_ids.append(vocab['<unk>'])\n    return input_ids, output_ids\n\n\ndef compute_sentence_score(model, criterion, ntokens, data, target,\n                           model_type='LSTM', hidden=None):\n    r\"\"\"Compute neural language model score of a sentence.\n\n    Args:\n        model:      A neural language model.\n        criterion:  Training criterion of a neural language model, e.g.\n                    cross entropy.\n        ntokens:    Vocabulary size.\n        data:       Integerized input sentence.\n        target:     Integerized target sentence.\n        model_type: Model type, e.g. 
LSTM or Transformer or others.\n        hidden:     Initial hidden state for getting the score of the input\n                    sentence with a recurrent-typed neural language model\n                    (optional).\n\n    Returns:\n        The score (negative log-likelihood) of the input sequence from a neural\n        language model. If the model is recurrent-typed, the function has an\n        extra output: the last hidden state after computing the score of the\n        input sequence.\n    \"\"\"\n\n    length = len(data)\n    data = torch.LongTensor(data).view(-1, 1).contiguous()\n    target = torch.LongTensor(target).view(-1).contiguous()\n    with torch.no_grad():\n        if model_type == 'Transformer':\n            output = model(data)\n        else:\n            output, hidden = model(data, hidden)\n        loss = criterion(output.view(-1, ntokens), target)\n    sent_score = length * loss.item()\n    if model_type == 'Transformer':\n        return sent_score\n    return sent_score, hidden\n\n\ndef compute_scores(nbest, model, criterion, ntokens, vocab, model_type='LSTM'):\n    r\"\"\"Compute sentence scores of nbest lists from a neural language model.\n\n    Args:\n        nbest:      The nbest lists represented by a dictionary from string\n                    to a list of strings.\n        model:      A neural language model.\n        criterion:  Training criterion of a neural language model, e.g.\n                    cross entropy.\n        ntokens:    Vocabulary size.\n        model_type: Model type, e.g. 
LSTM or Transformer or others.\n\n    Returns:\n        The nbest lists and their scores represented by a dictionary from\n        string to a list of (hypothesis, neural language model score) pairs.\n    \"\"\"\n\n    # Turn on evaluation mode which disables dropout.\n    model.eval()\n    nbest_and_scores = defaultdict(float)\n    if model_type != 'Transformer':\n        hidden = model.init_hidden(1)\n    for key in nbest.keys():\n        if model_type != 'Transformer':\n            cached_hiddens = []\n        for hyp in nbest[key]:\n            x, target = get_input_and_target(hyp, vocab)\n            if model_type == 'Transformer':\n                score = compute_sentence_score(model, criterion, ntokens, x,\n                                               target, model_type)\n            else:\n                score, new_hidden = compute_sentence_score(model, criterion,\n                                                           ntokens, x, target,\n                                                           model_type, hidden)\n                cached_hiddens.append(new_hidden)\n            if key in nbest_and_scores:\n                nbest_and_scores[key].append((hyp, score))\n            else:\n                nbest_and_scores[key] = [(hyp, score)]\n        # For RNN based LMs, initialize the current initial hidden states with\n        # those from hypotheses of the preceding utterance.\n        # This achieves modest WER reductions compared with zero initialization\n        # as it provides context from previous utterances. We observe that the\n        # choice of which hypothesis of the previous utterance supplies the\n        # hidden states makes almost no difference. So to keep the code\n        # general, the hidden states from the first hypothesis of the\n        # previous utterance are used for initialization. 
You can also use those\n        # from the one best hypothesis or just average hidden states from all\n        # hypotheses of the previous utterance.\n        if model_type != 'Transformer':\n            hidden = cached_hiddens[0]\n    return nbest_and_scores\n\n\ndef write_scores(nbest_and_scores, path):\n    r\"\"\"Write out sentence scores of nbest lists in the following format:\n        en_4156-A_030185-030248-1 7.98671\n        en_4156-A_030470-030672-1 46.5938\n        en_4156-A_030470-030672-2 46.9522\n        ...\n\n    Args:\n        nbest_and_scores: The nbest lists and their scores represented by a\n                          dictionary from string to a pair of a hypothesis and\n                          its neural language model score.\n        path (str):       An output file of nbest lists' scores in the above format.\n    \"\"\"\n\n    with open(path, 'w', encoding='utf-8') as f:\n        for key in nbest_and_scores.keys():\n            for idx, (_, score) in enumerate(nbest_and_scores[key], 1):\n                current_key = '-'.join([key, str(idx)])\n                f.write('%s %.4f\\n' % (current_key, score))\n    print(\"Write to %s\" % path)\n\n\ndef main():\n    parser = argparse.ArgumentParser(description=\"Compute sentence scores of \"\n                                     \"nbest lists with a PyTorch trained \"\n                                     \"neural language model.\")\n    parser.add_argument('--nbest-list', type=str, required=True,\n                        help=\"N-best hypotheses for rescoring\")\n    parser.add_argument('--outfile', type=str, required=True,\n                        help=\"Output file with language model scores associated \"\n                        \"with each hypothesis\")\n    parser.add_argument('--vocabulary', type=str, required=True,\n                        help=\"Vocabulary used for training\")\n    parser.add_argument('--model-path', type=str, required=True,\n                        help=\"Path to a 
pretrained neural model.\")\n    parser.add_argument('--model', type=str, default='LSTM',\n                        help='Network type. Can be RNN_TANH, RNN_RELU, LSTM, GRU or '\n                        'Transformer.')\n    parser.add_argument('--emsize', type=int, default=200,\n                        help='size of word embeddings')\n    parser.add_argument('--nhid', type=int, default=200,\n                        help='number of hidden units per layer')\n    parser.add_argument('--nlayers', type=int, default=2,\n                        help='number of layers')\n    parser.add_argument('--nhead', type=int, default=2,\n                        help='the number of heads in the encoder/decoder of the '\n                        'transformer model')\n    args = parser.parse_args()\n    assert os.path.exists(args.nbest_list), \"Nbest list path does not exist.\"\n    assert os.path.exists(args.vocabulary), \"Vocabulary path does not exist.\"\n    assert os.path.exists(args.model_path), \"Model path does not exist.\"\n\n    print(\"Load vocabulary\")\n    vocab = read_vocab(args.vocabulary)\n    ntokens = len(vocab)\n    print(\"Load model and criterion\")\n    import model\n    if args.model == 'Transformer':\n        model = model.TransformerModel(ntokens, args.emsize, args.nhead,\n                                       args.nhid, args.nlayers,\n                                       activation=\"gelu\", tie_weights=True)\n    else:\n        model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid,\n                               args.nlayers, tie_weights=True)\n    with open(args.model_path, 'rb') as f:\n        model.load_state_dict(torch.load(f, map_location=lambda storage, loc: storage))\n        if args.model in ['RNN_TANH', 'RNN_RELU', 'LSTM', 'GRU']:\n            model.rnn.flatten_parameters()\n    criterion = nn.CrossEntropyLoss()\n    print(\"Load nbest list\")\n    nbest = load_nbest(args.nbest_list)\n    print(\"Compute sentence scores with a \", args.model, \" model\")\n    nbest_and_scores = 
compute_scores(nbest, model, criterion, ntokens, vocab,\n                                      model_type=args.model)\n    print(\"Write sentence scores out\")\n    write_scores(nbest_and_scores, args.outfile)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/pytorchnn/data.py",
    "content": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport os\nimport torch\n\n\nclass Dictionary(object):\n    def __init__(self):\n        self.word2idx = {}\n        self.idx2word = []\n\n    def read_vocab(self, path):\n        with open(path, 'r', encoding='utf-8') as f:\n            for line in f:\n                word = line.split()\n                assert (len(word) == 2)\n                word = word[0]\n                if word not in self.word2idx:\n                    self.idx2word.append(word)\n                    self.word2idx[word] = len(self.idx2word) - 1\n\n    def __len__(self):\n        return len(self.idx2word)\n\n\nclass Corpus(object):\n    def __init__(self, path):\n        self.dictionary = Dictionary()\n        self.dictionary.read_vocab(os.path.join(path, 'words.txt'))\n        self.train = self.tokenize(os.path.join(path, 'train.txt'))\n        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))\n        self.test = self.tokenize(os.path.join(path, 'test.txt'))\n\n    def tokenize(self, path):\n        \"\"\"Tokenizes a text file.\"\"\"\n        assert os.path.exists(path)\n        with open(path, 'r', encoding='utf-8') as f:\n            all_ids = []\n            for line in f:\n                words = line.split() + ['<s>']\n                ids = []\n                for word in words:\n                    if word in self.dictionary.word2idx:\n                        ids.append(self.dictionary.word2idx[word])\n                    else:\n                        ids.append(self.dictionary.word2idx['<unk>'])\n                all_ids.append(torch.tensor(ids).type(torch.int64))\n            data = torch.cat(all_ids)\n\n        return data\n"
  },
  {
    "path": "egs/steps/pytorchnn/lmrescore_nbest_pytorchnn.sh",
    "content": "#!/usr/bin/env bash\n\n# This script is very similar to rnnlm/lmrescore_nbest.sh, and it performs N-best\n# LM rescoring with Pytorch trained neural LMs.\n\n# Begin configuration section.\nN=10\nmodel_type=LSTM # LSTM, GRU or Transformer\nembedding_dim=650\nhidden_dim=650\nnlayers=2\nnhead=6\ninv_acwt=10\ncmd=run.pl\nuse_phi=false  # This is kind of an obscure option.  If true, we'll remove the old\n  # LM weights (times 1-RNN_scale) using a phi (failure) matcher, which is\n  # appropriate if the old LM weights were added in this way, e.g. by\n  # lmrescore.sh.  Otherwise we'll use normal composition, which is appropriate\n  # if the lattices came directly from decoding.  This won't actually make much\n  # difference (if any) to WER, it's more so we know we are doing the right thing.\ntest=false # Activate a testing option.\nstage=1 # Stage of this script, for partial reruns.\nskip_scoring=false\nkeep_ali=true\n# End configuration section.\n\necho \"$0 $*\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# != 7 ]; then\n   echo \"Do language model rescoring of lattices (partially remove old LM, add new LM)\"\n   echo \"This version applies an neural LM and mixes it with the n-gram LM scores\"\n   echo \"previously in the lattices, controlled by the first parameter (nnlm-weight)\"\n   echo \"\"\n   echo \"Usage: $0 [options] <nn-weight> <old-lang-dir> <nn-model-dir> vocab <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \"Main options:\"\n   echo \"  --inv-acwt <inv-acwt>          # default 12.  e.g. --inv-acwt 17.  Equivalent to LM scale to use.\"\n   echo \"                                 # for N-best list generation... 
note, we'll score at different acwt's\"\n   echo \"  --cmd <run.pl|queue.pl [opts]> # how to run jobs.\"\n   echo \"  --use-phi (true|false)         # Should be set to true if the source lattices were created\"\n   echo \"                                 # by lmrescore.sh, false if they came from decoding.\"\n   echo \"  --N <N>                        # Value of N in N-best rescoring (default: 10)\"\n   exit 1;\nfi\n\nnnweight=$1 # weight of a neural network LM\noldlang=$2\nnn_model=$3\nvocabulary=$4\ndata=$5\nindir=$6\ndir=$7\n\nacwt=$(perl -e \"print (1.0/$inv_acwt);\")\n\n# Figures out if the old LM is G.fst or G.carpa\noldlm=$oldlang/G.fst\nif [ -f $oldlang/G.carpa ]; then\n  oldlm=$oldlang/G.carpa\nelif [ ! -f $oldlm ]; then\n  echo \"$0: expecting either $oldlang/G.fst or $oldlang/G.carpa to exist\" &&\\\n    exit 1;\nfi\n\nfor f in $nn_model $vocabulary $indir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist.\" && exit 1;\ndone\n\nnj=$(cat $indir/num_jobs) || exit 1;\nmkdir -p $dir;\ncp $indir/num_jobs $dir/num_jobs\n\nadir=$dir/archives\n\nphi=$(grep -w '#0' $oldlang/words.txt | awk '{print $2}')\n\nrm $dir/.error 2>/dev/null\nmkdir -p $dir/log\n\n# First convert lattice to N-best.  
Be careful because this\n# will be quite sensitive to the acoustic scale; this should be close\n# to the one we'll finally get the best WERs with.\n# Note: the lattice-rmali part here is just because we don't\n# need the alignments for what we're doing.\nif [ $stage -le 1 ]; then\n  echo \"$0: converting lattices to N-best lists.\"\n  if $keep_ali; then\n    $cmd JOB=1:$nj $dir/log/lat2nbest.JOB.log \\\n      lattice-to-nbest --acoustic-scale=$acwt --n=$N \\\n      \"ark:gunzip -c $indir/lat.JOB.gz|\" \\\n      \"ark:|gzip -c >$dir/nbest1.JOB.gz\" || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/lat2nbest.JOB.log \\\n      lattice-to-nbest --acoustic-scale=$acwt --n=$N \\\n      \"ark:gunzip -c $indir/lat.JOB.gz|\" ark:- \\|  \\\n      lattice-rmali ark:- \"ark:|gzip -c >$dir/nbest1.JOB.gz\" || exit 1;\n  fi\nfi\n\n# next remove part of the old LM probs.\nif [ \"$oldlm\" == \"$oldlang/G.fst\" ]; then\n  if $use_phi; then\n    if [ $stage -le 2 ]; then\n      echo \"$0: removing old LM scores.\"\n      # Use the phi-matcher style of composition.. this is appropriate\n      # if the old LM scores were added e.g. by lmrescore.sh, using\n      # phi-matcher composition.\n      $cmd JOB=1:$nj $dir/log/remove_old.JOB.log \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 \"ark:gunzip -c $dir/nbest1.JOB.gz|\" ark:- \\| \\\n        lattice-compose --phi-label=$phi ark:- $oldlm ark:- \\| \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- \"ark:|gzip -c >$dir/nbest2.JOB.gz\" \\\n        || exit 1;\n    fi\n  else\n    if [ $stage -le 2 ]; then\n      echo \"$0: removing old LM scores.\"\n      # this approach chooses the best path through the old LM FST, while\n      # subtracting the old scores.  If the lattices came straight from decoding,\n      # this is what we want.  Note here: each FST in \"nbest1.JOB.gz\" is a linear FST,\n      # it has no alternatives (the N-best format works by having multiple keys\n      # for each utterance).  
When we do \"lattice-1best\" we are selecting the best\n      # path through the LM, there are no alternatives to consider within the\n      # original lattice.\n      $cmd JOB=1:$nj $dir/log/remove_old.JOB.log \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 \"ark:gunzip -c $dir/nbest1.JOB.gz|\" ark:- \\| \\\n        lattice-compose ark:- \"fstproject --project_output=true $oldlm |\" ark:- \\| \\\n        lattice-1best ark:- ark:- \\| \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- \"ark:|gzip -c >$dir/nbest2.JOB.gz\" \\\n        || exit 1;\n    fi\n  fi\nelse\n  if [ $stage -le 2 ]; then\n    echo \"$0: removing old LM scores.\"\n    $cmd JOB=1:$nj $dir/log/remove_old.JOB.log \\\n      lattice-lmrescore-const-arpa --lm-scale=-1.0 \\\n      \"ark:gunzip -c $dir/nbest1.JOB.gz|\" $oldlm \\\n      \"ark:|gzip -c >$dir/nbest2.JOB.gz\"  || exit 1;\n  fi\nfi\n\nif [ $stage -le 3 ]; then\n# Decompose the n-best lists into 4 archives.\n  echo \"$0: creating separate-archive form of N-best lists.\"\n  $cmd JOB=1:$nj $dir/log/make_new_archives.JOB.log \\\n    mkdir -p $adir.JOB '&&' \\\n    nbest-to-linear \"ark:gunzip -c $dir/nbest2.JOB.gz|\" \\\n    \"ark,t:$adir.JOB/ali\" \"ark,t:$adir.JOB/words\" \\\n    \"ark,t:$adir.JOB/lmwt.nolm\" \"ark,t:$adir.JOB/acwt\" || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: doing the same with old LM scores.\"\n# Create an archive with the LM scores before we\n# removed the LM probs (will help us do interpolation).\n$cmd JOB=1:$nj $dir/log/make_old_archives.JOB.log \\\n  nbest-to-linear \"ark:gunzip -c $dir/nbest1.JOB.gz|\" \"ark:/dev/null\" \\\n  \"ark:/dev/null\" \"ark,t:$adir.JOB/lmwt.withlm\" \"ark:/dev/null\" || exit 1;\nfi\n\nif $test; then # This branch is a sanity check that at the acwt where we generated\n  # the N-best list, we get the same WER.\n  echo \"$0 [testing branch]: generating lattices without changing scores.\"\n  $cmd JOB=1:$nj $dir/log/test.JOB.log \\\n    linear-to-nbest 
\"ark:$adir.JOB/ali\" \"ark:$adir.JOB/words\" \"ark:$adir.JOB/lmwt.withlm\" \\\n     \"ark:$adir.JOB/acwt\" ark:- \\| \\\n    nbest-to-lattice ark:- \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\n  exit 0;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: Creating archives with text-form of words, and LM scores without graph scores.\"\n    # Do some small tasks; for these we don't use the queue, it will only slow us down.\n  for n in $(seq $nj); do\n    utils/int2sym.pl -f 2- $oldlang/words.txt < $adir.$n/words > $adir.$n/words_text || exit 1;\n    mkdir -p $adir.$n/temp\n    paste $adir.$n/lmwt.nolm $adir.$n/lmwt.withlm | awk '{print $1, ($4-$2);}' > \\\n      $adir.$n/lmwt.lmonly || exit 1;\n  done\nfi\n\nif [ $stage -le 6 ]; then\n  echo \"$0: invoking steps/pytorchnn/compute_sentence_scores.py which computes sentence scores with a PyTorch trained neural LM.\"\n  $cmd JOB=1:$nj $dir/log/compute_sentence_scores_pytorchnn.JOB.log \\\n    PYTHONPATH=steps/pytorchnn python steps/pytorchnn/compute_sentence_scores.py \\\n        --nbest-list $adir.JOB/words_text \\\n        --outfile $adir.JOB/lmwt.nn \\\n        --vocabulary $vocabulary \\\n        --model-path $nn_model \\\n        --model $model_type \\\n        --emsize $embedding_dim \\\n        --nhid $hidden_dim \\\n        --nlayers $nlayers \\\n        --nhead $nhead\nfi\n\nif [ $stage -le 7 ]; then\n  echo \"$0: reconstructing total LM+graph scores including interpolation of neural LM and old LM scores.\"\n  for n in $(seq $nj); do\n    paste $adir.$n/lmwt.nolm $adir.$n/lmwt.lmonly $adir.$n/lmwt.nn | awk -v nnweight=$nnweight \\\n      '{ key=$1; graphscore=$2; lmscore=$4; nnscore=$6;\n     score = graphscore+(nnweight*nnscore)+((1-nnweight)*lmscore);\n     print $1,score; } ' > $adir.$n/lmwt.interp.$nnweight || exit 1;\n  done\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: reconstructing archives back into lattices.\"\n  $cmd JOB=1:$nj $dir/log/reconstruct_lattice.JOB.log \\\n    linear-to-nbest 
\"ark:$adir.JOB/ali\" \"ark:$adir.JOB/words\" \\\n    \"ark:$adir.JOB/lmwt.interp.$nnweight\" \"ark:$adir.JOB/acwt\" ark:- \\| \\\n    nbest-to-lattice ark:- \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  echo \"scoring...\"\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $oldlang $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/pytorchnn/model.py",
    "content": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport math\nimport torch\nimport torch.nn as nn\n\n\nclass RNNModel(nn.Module):\n    \"\"\"Container module with an encoder, a recurrent module, and a decoder.\"\"\"\n    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5,\n                 tie_weights=False):\n        super(RNNModel, self).__init__()\n\n        self.rnn_type = rnn_type\n        self.nhid = nhid\n        self.nlayers = nlayers\n        self.drop = nn.Dropout(dropout)\n        self.encoder = nn.Embedding(ntoken, ninp)\n        if rnn_type in ['LSTM', 'GRU']:\n            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)\n        else:\n            try:\n                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]\n            except KeyError:\n                raise ValueError(\"\"\"An invalid option for `--model` was supplied,\n                      options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']\"\"\")\n            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity,\n                              dropout=dropout)\n        self.decoder = nn.Linear(nhid, ntoken)\n\n        if tie_weights:\n            if nhid != ninp:\n                raise ValueError('When using the tied flag, nhid must be equal '\n                                 'to emsize.')\n            self.decoder.weight = self.encoder.weight\n\n        self.init_weights()\n\n    def init_weights(self):\n        initrange = 0.1\n        nn.init.uniform_(self.encoder.weight, -initrange, initrange)\n        nn.init.zeros_(self.decoder.bias)\n        nn.init.uniform_(self.decoder.weight, -initrange, initrange)\n\n    def forward(self, x, hidden):\n        emb = self.drop(self.encoder(x))\n        output, hidden = self.rnn(emb, hidden)\n        output = self.drop(output)\n        decoded = self.decoder(output)\n        return decoded, 
hidden\n\n    def init_hidden(self, bsz):\n        weight = next(self.parameters())\n        if self.rnn_type == 'LSTM':\n            return (weight.new_zeros(self.nlayers, bsz, self.nhid),\n                    weight.new_zeros(self.nlayers, bsz, self.nhid))\n        return weight.new_zeros(self.nlayers, bsz, self.nhid)\n\n\nclass PositionalEncoding(nn.Module):\n    r\"\"\"Inject some information about the relative or absolute position of the\n        tokens in the sequence. The positional encodings have the same dimension\n        as the embeddings, so that the two can be summed. Here, we use sine and\n        cosine functions of different frequencies.\n    .. math::\n        \\text{PosEncoder}(pos, 2i) = sin(pos/10000^(2i/d_model))\n        \\text{PosEncoder}(pos, 2i+1) = cos(pos/10000^(2i/d_model))\n        \\text{where pos is the word position and i is the embed idx}\n    Args:\n        d_model: the embed dim (required).\n        dropout: the dropout value (default=0.1).\n        max_len: the max. 
length of the incoming sequence (default=5000).\n    Examples:\n        >>> pos_encoder = PositionalEncoding(d_model)\n    \"\"\"\n\n    def __init__(self, d_model, dropout=0.1, max_len=5000):\n        super(PositionalEncoding, self).__init__()\n        self.dropout = nn.Dropout(p=dropout)\n\n        pe = torch.zeros(max_len, d_model)\n        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)\n        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))\n        pe[:, 0::2] = torch.sin(position * div_term)\n        pe[:, 1::2] = torch.cos(position * div_term)\n        pe = pe.unsqueeze(0).transpose(0, 1)\n        self.register_buffer('pe', pe)\n\n    def forward(self, x):\n        r\"\"\"Inputs of forward function\n        Args:\n            x: the sequence fed to the positional encoder model (required).\n        Shape:\n            x: [sequence length, batch size, embed dim]\n            output: [sequence length, batch size, embed dim]\n        Examples:\n            >>> output = pos_encoder(x)\n        \"\"\"\n\n        x = x + self.pe[:x.size(0), :]\n        return self.dropout(x)\n\n\nclass TransformerModel(nn.Module):\n    \"\"\"Container module with an encoder, a transformer module, and a decoder.\"\"\"\n\n    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5,\n                 activation=\"relu\", tie_weights=False):\n        super(TransformerModel, self).__init__()\n        try:\n            from torch.nn import TransformerEncoder, TransformerEncoderLayer\n        except ImportError:\n            raise ImportError('TransformerEncoder module does not exist in '\n                              'PyTorch 1.1 or lower.')\n        self.model_type = 'Transformer'\n        self.src_mask = None\n        self.pos_encoder = PositionalEncoding(ninp, dropout)\n        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout,\n                                                 activation)\n  
      self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)\n        self.encoder = nn.Embedding(ntoken, ninp)\n        self.ninp = ninp\n        self.decoder = nn.Linear(ninp, ntoken)\n        if tie_weights:\n            if nhid != ninp:\n                raise ValueError('When using the tied flag, nhid must be equal '\n                                 'to emsize.')\n            self.decoder.weight = self.encoder.weight\n        self.init_weights()\n\n    def _generate_square_subsequent_mask(self, sz):\n        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)\n        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(\n                mask == 1, float(0.0))\n        return mask\n\n    def init_weights(self):\n        initrange = 0.1\n        nn.init.uniform_(self.encoder.weight, -initrange, initrange)\n        nn.init.zeros_(self.decoder.bias)\n        nn.init.uniform_(self.decoder.weight, -initrange, initrange)\n\n    def forward(self, src, has_mask=True):\n        if has_mask:\n            device = src.device\n            if self.src_mask is None or self.src_mask.size(0) != len(src):\n                mask = self._generate_square_subsequent_mask(len(src)).to(device)\n                self.src_mask = mask\n        else:\n            self.src_mask = None\n        src = self.encoder(src) * math.sqrt(self.ninp)\n        src = self.pos_encoder(src)\n        output = self.transformer_encoder(src, self.src_mask)\n        output = self.decoder(output)\n        return output\n"
  },
  {
    "path": "egs/steps/pytorchnn/train.py",
    "content": "\"\"\" This script is modified based on the word language model example in PyTorch:\n    https://github.com/pytorch/examples/tree/master/word_language_model\n    An example of model training and N-best rescoring can be found here:\n    egs/swbd/s5c/local/pytorchnn/run_nnlm.sh\n\"\"\"\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport argparse\nimport time\nimport math\nimport random\nimport torch\nimport torch.nn as nn\nimport torch.optim as optim\n\nimport data\nimport model\n\nparser = argparse.ArgumentParser(description=\"Train and evaluate a neural \"\n                                 \"language model with PyTorch.\")\n# Model options\nparser.add_argument('--data', type=str, default='./data/pytorchnn',\n                    help='location of the data corpus')\nparser.add_argument('--model', type=str, default='LSTM',\n                    help='type of model architecture. can be RNN_TANH, '\n                    'RNN_RELU, LSTM, GRU or Transformer.')\nparser.add_argument('--emsize', type=int, default=200,\n                    help='size of word embeddings')\nparser.add_argument('--nhid', type=int, default=200,\n                    help='number of hidden units per layer')\nparser.add_argument('--nlayers', type=int, default=2,\n                    help='number of layers')\nparser.add_argument('--nhead', type=int, default=2,\n                    help='the number of heads in the encoder/decoder of the '\n                    'transformer model')\n\n# Training options\nparser.add_argument('--lr', type=float, default=0.1,\n                    help='initial learning rate')\nparser.add_argument('--batch-size', type=int, default=20, metavar='N',\n                    help='batch size')\nparser.add_argument('--epochs', type=int, default=20,\n                    help='upper epoch limit')\nparser.add_argument('--seq_len', type=int, default=35,\n                    help='sequence length 
limit')\nparser.add_argument('--clip', type=float, default=0.25,\n                    help='gradient clipping')\nparser.add_argument('--dropout', type=float, default=0.2,\n                    help='dropout applied to layers')\nparser.add_argument('--tied', action='store_true',\n                    help='tie the word embedding and softmax weights')\nparser.add_argument('--optimizer', type=str, default='SGD',\n                    help='type of optimizer (currently unused; SGD is always '\n                    'used)')\nparser.add_argument('--log-interval', type=int, default=200, metavar='N',\n                    help='report interval')\n\n# Device options\nparser.add_argument('--cuda', action='store_true', help='use CUDA')\nparser.add_argument('--save', type=str, default='model.pt',\n                    help='path to save the final model')\nparser.add_argument('--seed', type=int, default=1111,\n                    help='random seed')\n\nargs = parser.parse_args()\nparams = vars(args)\n\n# Set the random seed for reproducibility\nrandom.seed(args.seed)\ntorch.manual_seed(args.seed)\nif torch.cuda.is_available():\n    if not args.cuda:\n        print('WARNING: You have a CUDA device, so you should probably run '\n              'with --cuda')\n    else:\n        torch.cuda.manual_seed_all(args.seed)\n\nprint('Configurations')\nfor arg, p in params.items():\n    print(arg, p)\n\ndevice = torch.device(\"cuda\" if args.cuda else \"cpu\")\n\n#############################\n# Load data\n#############################\ncorpus = data.Corpus(args.data)\n\n\ndef batchify(data, bsz, random_start_idx=False):\n    # Work out how cleanly we can divide the dataset into bsz parts.\n    nbatch = data.size(0) // bsz\n    # Optionally start at a random offset inside the remainder. The upper\n    # bound is clamped at 0 so randint never gets a negative upper bound when\n    # the data divides evenly by bsz.\n    if random_start_idx:\n        start_idx = random.randint(0, max(data.size(0) % bsz - 1, 0))\n    else:\n        start_idx = 0\n    # Trim off any extra elements that wouldn't cleanly fit (remainders).\n    data = data.narrow(0, start_idx, nbatch * bsz)\n    # Evenly divide the data across the bsz batches\n    data = 
data.view(bsz, -1).t().contiguous()\n    return data.to(device)\n\n\neval_batch_size = 20\ntrain_data = batchify(corpus.train, args.batch_size)\nval_data = batchify(corpus.valid, eval_batch_size)\ntest_data = batchify(corpus.test, eval_batch_size)\n\n#############################\n# Build the model\n#############################\nntokens = len(corpus.dictionary)\nif args.model == 'Transformer':\n    # The activation function can be 'relu' (default) or 'gelu'\n    model = model.TransformerModel(ntokens, args.emsize, args.nhead, args.nhid,\n                      args.nlayers, args.dropout, \"gelu\", args.tied).to(device)\nelse:\n    model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid,\n                           args.nlayers, args.dropout, args.tied).to(device)\n\ntotal_params = sum(x.data.nelement() for x in model.parameters())\nprint('Args: {}'.format(args))\nprint('Model total parameters: {}'.format(total_params))\n\ncriterion = nn.CrossEntropyLoss()\n\n#############################\n# Training part\n#############################\n\n\ndef repackage_hidden(h):\n    \"\"\"Wraps hidden states in new Tensors, to detach them from their history.\"\"\"\n    if isinstance(h, torch.Tensor):\n        return h.detach()\n    return tuple(repackage_hidden(v) for v in h)\n\n\n# Divide the source data into chunks of length args.seq_len.\ndef get_batch(source, i):\n    seq_len = min(args.seq_len, len(source) - 1 - i)\n    data = source[i: i + seq_len]\n    target = source[i + 1: i + 1 + seq_len].view(-1)\n    return data, target\n\n\ndef train():\n    # Turn on training model which enables dropout.\n    model.train()\n    total_loss = 0.\n    start_time = time.time()\n    if args.model != 'Transformer':\n        hidden = model.init_hidden(args.batch_size)\n    for batch, i in enumerate(range(0, train_data.size(0) - 1, args.seq_len)):\n        data, targets = get_batch(train_data, i)\n        optimizer.zero_grad()\n        if args.model == 'Transformer':\n           
 output = model(data)\n        else:\n            # Starting each batch, the hidden state is detached from how it was\n            # previously produced. Otherwise, the model would try\n            # backpropagating all the way to start of the dataset.\n            hidden = repackage_hidden(hidden)\n            output, hidden = model(data, hidden)\n\n        loss = criterion(output.view(-1, ntokens), targets)\n        loss.backward()\n\n        # 'clip_grad_norm' helps prevent the exploding gradient problem.\n        torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)\n        optimizer.step()\n\n        total_loss += loss.item()\n        if batch % args.log_interval == 0 and batch > 0:\n            cur_loss = total_loss / args.log_interval\n            elapsed = time.time() - start_time\n            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.3f} | '\n                  'ms/batch {:5.2f} | loss {:5.2f} | ppl {:8.2f}'.format(\n                      epoch, batch, len(train_data) // args.seq_len, lr,\n                      elapsed * 1000 / args.log_interval, cur_loss,\n                      math.exp(cur_loss)))\n            total_loss = 0.\n            start_time = time.time()\n\n\ndef evaluate(source):\n    # Turn on evaluation mode which disables dropout.\n    model.eval()\n    total_loss = 0.\n    if args.model != 'Transformer':\n        hidden = model.init_hidden(eval_batch_size)\n    # Speed up evaluation with torch.no_grad()\n    with torch.no_grad():\n        for i in range(0, source.size(0) - 1, args.seq_len):\n            data, targets = get_batch(source, i)\n            if args.model == 'Transformer':\n                output = model(data)\n            else:\n                output, hidden = model(data, hidden)\n                hidden = repackage_hidden(hidden)\n            loss = criterion(output.view(-1, ntokens), targets)\n            total_loss += len(data) * loss.item()\n    return total_loss / (len(source) - 
1)\n\n\n#############################\n# Train the model\n#############################\nlr = args.lr\nbest_val_loss = None\noptimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0.9,\n                      weight_decay=1e-5)\ncounter = 0\nprint(\"Start training\")\ntry:\n    for epoch in range(1, args.epochs + 1):\n        epoch_start_time = time.time()\n        train()\n        val_loss = evaluate(val_data)\n        print('-' * 89)\n        print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '\n              'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),\n                                         val_loss, math.exp(val_loss)))\n        print('-' * 89)\n\n        # Save the model if validation loss is the best we've seen so far.\n        # Saving state_dict is preferable.\n        if not best_val_loss or val_loss < best_val_loss:\n            with open(args.save, 'wb') as f:\n                torch.save(model.state_dict(), f)\n            best_val_loss = val_loss\n        else:\n            lr /= 2.\n            optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9,\n                                  weight_decay=1e-5)\n            counter += 1\n\n        # Early stopping\n        if counter == 8:\n            break\nexcept KeyboardInterrupt:\n    print('-' * 89)\n    print('Exiting from training early')\n\n# Load the best saved model.\nwith open(args.save, 'rb') as f:\n    model.load_state_dict(torch.load(f, map_location=lambda storage, loc: storage))\n    if args.model in ['RNN_TANH', 'RNN_RELU', 'LSTM', 'GRU']:\n        model.rnn.flatten_parameters()\n\n# Run on test data.\ntest_loss = evaluate(test_data)\nprint('=' * 89)\nprint('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(\n      test_loss, math.exp(test_loss)))\nprint('=' * 89)\n"
  },
  {
    "path": "egs/steps/resegment_data.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2013.  Apache 2.0.\n\n# This script segments speech data based on some kind of decoding of\n# whole recordings (e.g. whole conversation sides.  See \n# egs/swbd/s5b/local/run_resegment.sh for an example of usage.\n# You'll probably want to use the script resegment_text.sh\n\n# begin configuration section.\nstage=0\ncmd=run.pl\ncleanup=true\nsegmentation_opts=  # E.g. set this as --segmentation-opts \"--silence-proportion 0.2 --max-segment-length 10\"\n\n#end configuration section.\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: $0 [options] <in-data-dir> <lang> <decode-dir|ali-dir> <out-data-dir> <temp/log-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --segmentation-opts '--opt1 opt1val --opt2 opt2val' # options for segmentation.pl\"\n  echo \"e.g.:\"\n  echo \"$0 data/train_unseg exp/tri3b/decode_train_unseg data/train_seg exp/tri3b_resegment\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3 # may actually be decode-dir.\ndata_out=$4\ndir=$5\n\nmkdir -p $data_out || exit 1;\nrm $data_out/* 2>/dev/null # Old stuff that's partial can cause problems later if\n                           # we call fix_data_dir.sh; it will cause things to be \n                           # thrown out.\nmkdir -p $dir/log || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $alidir/phones.txt $dir || exit 1;\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/ali.1.gz $alidir/num_jobs; do\n  if [ ! -f $f ]; then \n    echo \"$0: no such file $f\"\n    exit 1;\n    fi\ndone\n\nif [ -f $alidir/final.mdl ]; then\n  model=$alidir/final.mdl\nelse\n  if [ ! 
-f $alidir/../final.mdl ]; then\n    echo \"$0: found no model in $alidir/final.mdl or $alidir/../final.mdl\"\n    exit 1;\n  fi\n  model=$alidir/../final.mdl\nfi\n\n# get lists of sil,noise,nonsil phones\n# convert *.ali.gz to *.ali.gz with 0,1,2.\n# run perl script..\n# output segments?\n\n\nif ! [ `cat $lang/phones/optional_silence.txt | wc -w` -eq 1 ]; then\n  echo \"Error: this script only works if $lang/phones/optional_silence.txt contains exactly one entry.\";\n  echo \"You'd have to modify the script to handle other cases.\"\n  exit 1;\nfi\n\nsilphone=`cat $lang/phones/optional_silence.txt` \n# silphone will typically be \"sil\" or \"SIL\". \n\n# 3 sets of phones: 0 is silence, 1 is noise, 2 is speech.,\n(\n echo \"$silphone 0\"\n grep -v -w $silphone $lang/phones/silence.txt | awk '{print $1, 1;}'\n cat $lang/phones/nonsilence.txt | awk '{print $1, 2;}'\n) > $dir/phone_map.txt\n\n\nnj=`cat $alidir/num_jobs` || exit 1;\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/resegment.JOB.log \\\n    ali-to-phones --per-frame=true \"$model\" \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark,t:- \\| \\\n    utils/int2sym.pl -f 2- $lang/phones.txt \\| \\\n    utils/apply_map.pl -f 2- $dir/phone_map.txt \\| \\\n    utils/segmentation.pl $segmentation_opts \\| \\\n    gzip -c '>' $dir/segments.JOB.gz\nfi\n\nif [ $stage -le 1 ]; then\n  if [ -f $data/reco2file_and_channel ]; then\n    cp $data/reco2file_and_channel $data_out/reco2file_and_channel\n  fi\n  if [ -f $data/wav.scp ]; then\n    cp $data/wav.scp $data_out/wav.scp\n  else\n    echo \"Expected file $data/wav.scp to exist\" # or there is really nothing to copy.\n    exit 1\n  fi\n  for f in glm stm; do \n    if [ -f $data/$f ]; then\n      cp $data/$f $data_out/$f\n    fi\n  done\n\n  for n in `seq $nj`; do gunzip -c $dir/segments.$n.gz; done | \\\n    sort > $data_out/segments || exit 1;\n\n  [ ! 
-s $data_out/segments ] && echo \"No data produced\" && exit 1;\n\n  # We'll make the speaker-ids be the same as the recording-ids (e.g. conversation\n  # sides).  This will normally be OK for telephone data.\n  cat $data_out/segments | awk '{print $1, $2}' > $data_out/utt2spk || exit 1\n  utils/utt2spk_to_spk2utt.pl $data_out/utt2spk > $data_out/spk2utt || exit 1\n\n  if $cleanup; then\n    rm $dir/segments.*.gz\n  fi\nfi\n\ncat $data_out/segments | awk '{num_secs += $4 - $3;} END{print \"Number of hours of data is \" (num_secs/3600);}'\n\n"
  },
  {
    "path": "egs/steps/resegment_text.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright Johns Hopkins University (Author: Daniel Povey) 2013.  Apache 2.0.\n\n# This script takes two data directories that represent different\n# segmentations of the same data (both must have \"segments\" files and\n# the recording-ids must match), and it converts the text in one directory\n# to correspond to the segmentation in the other.  Its output is the\n# \"text\" file in the second directory.  To get the alignments, it\n# must be provided an \"alignment\" directory where the training data\n# from the first directory has been aligned.\n\n# begin configuration section.\nstage=0\ncmd=run.pl\n\n#end configuration section.\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: $0 [options] <in-data-dir> <lang> <ali-dir|model-dir> <out-data-dir> <temp/log-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"e.g.:\"\n  echo \"$0 data/train data/lang exp/tri3b_ali_all data/train_reseg exp/tri3b_resegment\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndata_out=$4\ndir=$5\n\n\nmkdir -p $dir/log || exit 1;\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/ali.1.gz $alidir/num_jobs \\\n   $alidir/final.mdl $data_out/reco2file_and_channel $data_out/segments; do\n  if [ ! -f $f ]; then \n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\n\nif [ $stage -le 0 ]; then\n  echo \"$0: calling get_train_ctm.sh to produce ctms of the alignments.\"\n  # Caution: this will produce logs in $alidir/log/get_ctm.log\n  steps/get_train_ctm.sh --cmd \"$cmd\" $data $lang $alidir || exit 1;  \nfi\n\n\nif [ $stage -le 1 ]; then\n  if [ ! 
-s $alidir/ctm ]; then\n    echo \"$0: file $alidir/ctm does not exist or is empty.\"\n    exit 1;\n  fi\n  echo \"$0: converting ctm to a format where we have the recording-id ...\"\n  echo \"$0: ... in place of the side and channel, e.g. sw02008-B instead of sw02008 B\"\n\n  cat $alidir/ctm | awk -v r=$data_out/reco2file_and_channel  \\\n   'BEGIN{while((getline < r) > 0) { if(NF!=3) {exit(1);} map[ $2 \"&\" $3 ] = $1;}}\n    {if (NF!=5) {print \"bad line \" $0; exit(2);} reco = map[$1 \"&\" $2];\n     if (length(reco) == 0) { print \"Bad key \" $1 \"&\" $2; exit(3); } \n     print reco, $3, $4, $5; } ' > $dir/ctm_per_reco\nfi\n\nif [ $stage -le 2 ]; then\n  cat $data_out/segments | perl -e '\n     @ARGV == 1 || die;\n     $ctm_per_reco = shift @ARGV;\n     $chunk_size = 3;\n     open(C, \"<$ctm_per_reco\") || die \"opening ctm file $ctm_per_reco\";\n     # we build up an associative array indexed by a pair of ids: $reco,$n\n     # where $n is a 3-second chunk of time (see $chunk_size).\n     sub to_chunk { my $t = shift @_; return int($t / $chunk_size); }\n     while (<C>) {\n       @A = split;  @A == 4 || die \"Bad line $_ in $ctm_per_reco\";\n       ($reco, $start, $length, $word) = @A;\n       $chunk = to_chunk($start);\n       if (!
defined $reco2list{$reco,$chunk} ){ $reco2list{$reco,$chunk} = [ ]; } # new anonymous array\n       $arrayref = $reco2list{$reco,$chunk};\n       push @$arrayref, [ $start, $length, $word ]; # another level of anonymous array..\n     }\n     $num_utts = 0; $num_empty = 0;\n     while(<STDIN>) {\n       @A = split;  @A == 4 || die \"Bad line $_ in stdin\";\n       ($utt, $reco, $start, $end) = @A;\n       @text = ();\n       for ($chunk = to_chunk($start); $chunk <= to_chunk($end); $chunk++) {\n         $arrayref = $reco2list{$reco,$chunk};\n         if (defined $arrayref) {\n           foreach $entry ( @$arrayref ) { # note, $entry is itself an arrayref\n                                           # to an array containing $start $length $word.\n             $word_start = $$entry[0];\n             if ($word_start >= $start && $word_start <= $end) {\n               $word_end = $$entry[1] + $word_start;\n               if ($word_end >= $start && $word_end <= $end) {\n                 $word = $$entry[2]; defined $word || die;\n                 push @text, $word;\n               }\n             }\n           }\n         }\n       }\n       $num_utts++;\n       if (@text > 0) { $t = join(\" \", @text); print \"$utt $t\\n\"; }\n       else { $num_empty++; }\n     }\n     print STDERR \"Processed $num_utts utterances, of which $num_empty had no text.\\n\"; ' \\\n       $dir/ctm_per_reco | sort > $data_out/text || exit 1;\n\n  nw_old=`cat $data/text | wc | awk '{print $2 - $1}'`\n  nw_new=`cat $data_out/text | wc | awk '{print $2 - $1}'`\n  echo \"Number of words of training text changed from $nw_old to $nw_new\";\n\n  if [ ! -s $data_out/text ]; then\n    echo \"$0: produced empty output.  Something went wrong.\"\n    exit 1;\n  fi\nfi\n"
  },
  {
    "path": "egs/steps/rnnlmrescore.sh",
    "content": "#!/usr/bin/env bash\n\n# please see lmrescore_rnnlm_lat.sh which is a newer script using lattices.\n\n# Begin configuration section.\nN=10\ninv_acwt=12\ncmd=run.pl\nuse_phi=false  # This is kind of an obscure option.  If true, we'll remove the old\n  # LM weights (times 1-RNN_scale) using a phi (failure) matcher, which is\n  # appropriate if the old LM weights were added in this way, e.g. by\n  # lmrescore.sh.  Otherwise we'll use normal composition, which is appropriate\n  # if the lattices came directly from decoding.  This won't actually make much\n  # difference (if any) to WER, it's more so we know we are doing the right thing.\ntest=false # Activate a testing option.\nstage=1 # Stage of this script, for partial reruns.\nrnnlm_ver=rnnlm-0.3e\nskip_scoring=false\nkeep_ali=true\n# End configuration section.\n\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\n\nif [ $# != 6 ]; then\n   echo \"Do language model rescoring of lattices (partially remove old LM, add new LM)\"\n   echo \"This version applies an RNNLM and mixes it with the LM scores\"\n   echo \"previously in the lattices., controlled by the first parameter (rnnlm-weight)\"\n   echo \"\"\n   echo \"Usage: utils/rnnlmrescore.sh <rnn-weight> <old-lang-dir> <rnn-dir> <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \"Main options:\"\n   echo \"  --inv-acwt <inv-acwt>          # default 12.  e.g. --inv-acwt 17.  Equivalent to LM scale to use.\"\n   echo \"                                 # for N-best list generation... 
note, we'll score at different acwt's\"\n   echo \"  --cmd <run.pl|queue.pl [opts]> # how to run jobs.\"\n   echo \"  --use-phi (true|false)         # Should be set to true if the source lattices were created\"\n   echo \"                                 # by lmrescore.sh, false if they came from decoding.\"\n   echo \"  --N <N>                        # Value of N in N-best rescoring (default: 10)\"\n   exit 1;\nfi\n\n\n\nrnnweight=$1\noldlang=$2\nrnndir=$3\ndata=$4\nindir=$5\ndir=$6\n\n\nacwt=`perl -e \"print (1.0/$inv_acwt);\"` # Note: we'll actually produce lattices\n # that will be scored at a range of acoustic weights.  This acwt should be close\n # to the final one we'll pick, though, for best performance (it controls the\n # N-best list generation).\n\n# Figures out if the old LM is G.fst or G.carpa\noldlm=$oldlang/G.fst\nif [ -f $oldlang/G.carpa ]; then\n  oldlm=$oldlang/G.carpa\nelif [ ! -f $oldlm ]; then\n  echo \"$0: expecting either $oldlang/G.fst or $oldlang/G.carpa to exist\" &&\\\n    exit 1;\nfi\n\nfor f in $rnndir/rnnlm $data/feats.scp $indir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: expected file $f to exist.\" && exit 1;\ndone\n\nnj=`cat $indir/num_jobs` || exit 1;\nmkdir -p $dir;\ncp $indir/num_jobs $dir/num_jobs\n\nadir=$dir/archives\n\nphi=`grep -w '#0' $oldlang/words.txt | awk '{print $2}'`\n\nrm $dir/.error 2>/dev/null\nmkdir -p $dir/log\n\n# First convert lattice to N-best.
Be careful because this\n# will be quite sensitive to the acoustic scale; this should be close\n# to the one we'll finally get the best WERs with.\n# Note: the lattice-rmali part here is just because we don't\n# need the alignments for what we're doing.\nif [ $stage -le 1 ]; then\n  echo \"$0: converting lattices to N-best.\"\n  if $keep_ali; then\n    $cmd JOB=1:$nj $dir/log/lat2nbest.JOB.log \\\n      lattice-to-nbest --acoustic-scale=$acwt --n=$N \\\n      \"ark:gunzip -c $indir/lat.JOB.gz|\" \\\n      \"ark:|gzip -c >$dir/nbest1.JOB.gz\" || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/lat2nbest.JOB.log \\\n      lattice-to-nbest --acoustic-scale=$acwt --n=$N \\\n      \"ark:gunzip -c $indir/lat.JOB.gz|\" ark:- \\|  \\\n      lattice-rmali ark:- \"ark:|gzip -c >$dir/nbest1.JOB.gz\" || exit 1;\n  fi\nfi\n\n# next remove part of the old LM probs.\nif [ \"$oldlm\" == \"$oldlang/G.fst\" ]; then\n  if $use_phi; then\n    if [ $stage -le 2 ]; then\n      echo \"$0: removing old LM scores.\"\n      # Use the phi-matcher style of composition.. this is appropriate\n      # if the old LM scores were added e.g. by lmrescore.sh, using\n      # phi-matcher composition.\n      $cmd JOB=1:$nj $dir/log/remove_old.JOB.log \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 \"ark:gunzip -c $dir/nbest1.JOB.gz|\" ark:- \\| \\\n        lattice-compose --phi-label=$phi ark:- $oldlm ark:- \\| \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- \"ark:|gzip -c >$dir/nbest2.JOB.gz\" \\\n        || exit 1;\n    fi\n  else\n    if [ $stage -le 2 ]; then\n      echo \"$0: removing old LM scores.\"\n      # this approach chooses the best path through the old LM FST, while\n      # subtracting the old scores.  If the lattices came straight from decoding,\n      # this is what we want.  Note here: each FST in \"nbest1.JOB.gz\" is a linear FST,\n      # it has no alternatives (the N-best format works by having multiple keys\n      # for each utterance).  
When we do \"lattice-1best\" we are selecting the best\n      # path through the LM, there are no alternatives to consider within the\n      # original lattice.\n      $cmd JOB=1:$nj $dir/log/remove_old.JOB.log \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 \"ark:gunzip -c $dir/nbest1.JOB.gz|\" ark:- \\| \\\n        lattice-compose ark:- \"fstproject --project_output=true $oldlm |\" ark:- \\| \\\n        lattice-1best ark:- ark:- \\| \\\n        lattice-scale --acoustic-scale=-1 --lm-scale=-1 ark:- \"ark:|gzip -c >$dir/nbest2.JOB.gz\" \\\n        || exit 1;\n    fi\n  fi\nelse\n  if [ $stage -le 2 ]; then\n    echo \"$0: removing old LM scores.\"\n    $cmd JOB=1:$nj $dir/log/remove_old.JOB.log \\\n      lattice-lmrescore-const-arpa --lm-scale=-1.0 \\\n      \"ark:gunzip -c $dir/nbest1.JOB.gz|\" $oldlm \\\n      \"ark:|gzip -c >$dir/nbest2.JOB.gz\"  || exit 1;\n  fi\nfi\n\nif [ $stage -le 3 ]; then\n# Decompose the n-best lists into 4 archives.\n  echo \"$0: creating separate-archive form of N-best lists.\"\n  $cmd JOB=1:$nj $dir/log/make_new_archives.JOB.log \\\n    mkdir -p $adir.JOB '&&' \\\n    nbest-to-linear \"ark:gunzip -c $dir/nbest2.JOB.gz|\" \\\n    \"ark,t:$adir.JOB/ali\" \"ark,t:$adir.JOB/words\" \\\n    \"ark,t:$adir.JOB/lmwt.nolm\" \"ark,t:$adir.JOB/acwt\" || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: doing the same with old LM scores.\"\n# Create an archive with the LM scores before we\n# removed the LM probs (will help us do interpolation).\n$cmd JOB=1:$nj $dir/log/make_old_archives.JOB.log \\\n  nbest-to-linear \"ark:gunzip -c $dir/nbest1.JOB.gz|\" \"ark:/dev/null\" \\\n  \"ark:/dev/null\" \"ark,t:$adir.JOB/lmwt.withlm\" \"ark:/dev/null\" || exit 1;\nfi\n\nif $test; then # This branch is a sanity check that at the acwt where we generated\n  # the N-best list, we get the same WER.\n  echo \"$0 [testing branch]: generating lattices without changing scores.\"\n  $cmd JOB=1:$nj $dir/log/test.JOB.log \\\n    linear-to-nbest 
\"ark:$adir.JOB/ali\" \"ark:$adir.JOB/words\" \"ark:$adir.JOB/lmwt.withlm\" \\\n     \"ark:$adir.JOB/acwt\" ark:- \\| \\\n    nbest-to-lattice ark:- \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\n  exit 0;\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: Creating archives with text-form of words, and LM scores without graph scores.\"\n    # Do some small tasks; for these we don't use the queue, it will only slow us down.\n  for n in `seq $nj`; do\n    utils/int2sym.pl -f 2- $oldlang/words.txt < $adir.$n/words > $adir.$n/words_text || exit 1;\n    mkdir -p $adir.$n/temp\n    paste $adir.$n/lmwt.nolm $adir.$n/lmwt.withlm | awk '{print $1, ($4-$2);}' > \\\n      $adir.$n/lmwt.lmonly || exit 1;\n  done\nfi\nif [ $stage -le 6 ]; then\n  echo \"$0: invoking utils/rnnlm_compute_scores.sh which calls rnnlm, to get RNN LM scores.\"\n  $cmd JOB=1:$nj $dir/log/rnnlm_compute_scores.JOB.log \\\n    utils/rnnlm_compute_scores.sh --rnnlm_ver $rnnlm_ver $rnndir $adir.JOB/temp $adir.JOB/words_text $adir.JOB/lmwt.rnn \\\n    || exit 1;\nfi\nif [ $stage -le 7 ]; then\n  echo \"$0: reconstructing total LM+graph scores including interpolation of RNNLM and old LM scores.\"\n  for n in `seq $nj`; do\n    paste $adir.$n/lmwt.nolm $adir.$n/lmwt.lmonly $adir.$n/lmwt.rnn | awk -v rnnweight=$rnnweight \\\n      '{ key=$1; graphscore=$2; lmscore=$4; rnnscore=$6;\n     score = graphscore+(rnnweight*rnnscore)+((1-rnnweight)*lmscore);\n     print $1,score; } ' > $adir.$n/lmwt.interp.$rnnweight || exit 1;\n  done\nfi\n\nif [ $stage -le 8 ]; then\n  echo \"$0: reconstructing archives back into lattices.\"\n  $cmd JOB=1:$nj $dir/log/reconstruct_lattice.JOB.log \\\n    linear-to-nbest \"ark:$adir.JOB/ali\" \"ark:$adir.JOB/words\" \\\n    \"ark:$adir.JOB/lmwt.interp.$rnnweight\" \"ark:$adir.JOB/acwt\" ark:- \\| \\\n    nbest-to-lattice ark:- \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! 
-x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or is not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $oldlang $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/scoring/score_kaldi_cer.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey, Yenda Trmal)\n# Apache 2.0\n\n# This script computes the CER (Character Error Rate) as opposed to the script\n# local/score_kaldi.sh (which computes WER i.e. Word Error Rate).\n# if you need to compute both the WER and CER, you can use the stage parameters\n# i.e. write your own local/score.sh that will contain\n# \n# steps/scoring/score_kaldi_wer.sh \"$@\"\n# steps/scoring/score_kaldi_cer.sh --stage 2 \"$@\"\n#\n# NOTE it would work without the --stage 2, but this way its more effective\n# as the lattice decoding won't be run twice.\n\n\n[ -f ./path.sh ] && . ./path.sh\n\n# begin configuration section.\ncmd=run.pl\ndecode_mbr=false\nstats=true\nbeam=6\nstage=0\nword_ins_penalty=0.0,0.5,1.0\nmin_lmwt=7\nmax_lmwt=17\niter=final\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [--cmd (run.pl|queue.pl...)] <data-dir> <lang-dir|graph-dir> <decode-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --decode_mbr (true/false)       # maximum bayes risk decoding (confusion network).\"\n  echo \"    --min_lmwt <int>                # minumum LM-weight for lattice rescoring \"\n  echo \"    --max_lmwt <int>                # maximum LM-weight for lattice rescoring \"\n  exit 1;\nfi\n\ndata=$1\nlang_or_graph=$2\ndir=$3\n\nsymtab=$lang_or_graph/words.txt\n\nfor f in $symtab $dir/lat.1.gz $data/text; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\nref_filtering_cmd=\"cat\"\n[ -x local/wer_output_filter ] && ref_filtering_cmd=\"local/wer_output_filter\"\n[ -x local/wer_ref_filter ] && ref_filtering_cmd=\"local/wer_ref_filter\"\nhyp_filtering_cmd=\"cat\"\n[ -x local/wer_output_filter ] && hyp_filtering_cmd=\"local/wer_output_filter\"\n[ -x local/wer_hyp_filter ] && hyp_filtering_cmd=\"local/wer_hyp_filter\"\n\n\nif $decode_mbr ; then\n  echo \"$0: scoring with MBR, word insertion penalty=$word_ins_penalty\"\nelse\n  echo \"$0: scoring with word insertion penalty=$word_ins_penalty\"\nfi\n\n\nmkdir -p $dir/scoring_kaldi\ncat $data/text | $ref_filtering_cmd > $dir/scoring_kaldi/test_filt.txt || exit 1;\nif [ $stage -le 0 ]; then\n\n  for wip in $(echo $word_ins_penalty | sed 's/,/ /g'); do\n    mkdir -p $dir/scoring_kaldi/penalty_$wip/log\n\n    if $decode_mbr ; then\n      $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/best_path.LMWT.log \\\n        acwt=\\`perl -e \\\"print 1.0/LMWT\\\"\\`\\; \\\n        lattice-scale --inv-acoustic-scale=LMWT \"ark:gunzip -c $dir/lat.*.gz|\" ark:- \\| \\\n        lattice-add-penalty --word-ins-penalty=$wip ark:- ark:- \\| \\\n        lattice-prune --beam=$beam ark:- ark:- \\| \\\n        lattice-mbr-decode  --word-symbol-table=$symtab \\\n        ark:- ark,t:- \\| \\\n        utils/int2sym.pl -f 2- $symtab \\| \\\n        $hyp_filtering_cmd '>' $dir/scoring_kaldi/penalty_$wip/LMWT.txt || exit 1;\n\n    else\n      $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/best_path.LMWT.log \\\n        lattice-scale --inv-acoustic-scale=LMWT \"ark:gunzip -c $dir/lat.*.gz|\" ark:- \\| \\\n        lattice-add-penalty --word-ins-penalty=$wip ark:- ark:- \\| \\\n        lattice-best-path --word-symbol-table=$symtab ark:- ark,t:- \\| \\\n        utils/int2sym.pl -f 2- $symtab \\| \\\n        $hyp_filtering_cmd '>' $dir/scoring_kaldi/penalty_$wip/LMWT.txt || exit 1;\n    fi\n\n    $cmd 
LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/score.LMWT.log \\\n      cat $dir/scoring_kaldi/penalty_$wip/LMWT.txt \\| \\\n      compute-wer --text --mode=present \\\n      ark:$dir/scoring_kaldi/test_filt.txt  ark,p:- \">&\" $dir/wer_LMWT_$wip || exit 1;\n\n  done\nfi\n\n\n# the stage 2 is intentional, to allow nice coexistence with score_kaldi.sh\n# in cases user would be combining calls to these two scripts as shown in\n# the example at the top of the file. Otherwise we or he/she would have to\n# filter the script parameters instead of simple forwarding.\nif [ $stage -le 2 ] ; then\n  files=($dir/scoring_kaldi/test_filt.txt)\n  for wip in $(echo $word_ins_penalty | sed 's/,/ /g'); do\n    for lmwt in $(seq $min_lmwt $max_lmwt); do\n      files+=($dir/scoring_kaldi/penalty_${wip}/${lmwt}.txt)\n    done\n  done\n\n  for f in \"${files[@]}\" ; do\n    fout=${f%.txt}.chars.txt\n    if [ -x local/character_tokenizer ]; then\n      cat $f |  local/character_tokenizer > $fout\n    else\n      cat $f |  perl -CSDA -ane '\n        {\n          print $F[0];\n          foreach $s (@F[1..$#F]) {\n            if (($s =~ /\\[.*\\]/) || ($s =~ /\\<.*\\>/) || ($s =~ \"!SIL\")) {\n              print \" $s\";\n            } else {\n              @chars = split \"\", $s;\n              foreach $c (@chars) {\n                print \" $c\";\n              }\n            }\n          }\n          print \"\\n\";\n        }' > $fout\n    fi\n  done\n\n  for wip in $(echo $word_ins_penalty | sed 's/,/ /g'); do\n    $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/score.cer.LMWT.log \\\n      cat $dir/scoring_kaldi/penalty_$wip/LMWT.chars.txt \\| \\\n      compute-wer --text --mode=present \\\n      ark:$dir/scoring_kaldi/test_filt.chars.txt  ark,p:- \">&\" $dir/cer_LMWT_$wip || exit 1;\n  done\nfi\n\nif [ $stage -le 3 ] ; then\n  for wip in $(echo $word_ins_penalty | sed 's/,/ /g'); do\n    for lmwt in $(seq $min_lmwt $max_lmwt); do\n      # adding 
/dev/null to the command list below forces grep to output the filename\n      grep WER $dir/cer_${lmwt}_${wip} /dev/null\n    done\n  done | utils/best_wer.sh  >& $dir/scoring_kaldi/best_cer || exit 1\n\n  best_cer_file=$(awk '{print $NF}' $dir/scoring_kaldi/best_cer)\n  best_wip=$(echo $best_cer_file | awk -F_ '{print $NF}')\n  best_lmwt=$(echo $best_cer_file | awk -F_ '{N=NF-1; print $N}')\n\n  if [ -z \"$best_lmwt\" ]; then\n    echo \"$0: we could not get the details of the best CER from the file $dir/cer_*.  Probably something went wrong.\"\n    exit 1;\n  fi\n\n  if $stats; then\n    mkdir -p $dir/scoring_kaldi/cer_details\n    echo $best_lmwt > $dir/scoring_kaldi/cer_details/lmwt # record best language model weight\n    echo $best_wip > $dir/scoring_kaldi/cer_details/wip # record best word insertion penalty\n\n    $cmd $dir/scoring_kaldi/log/stats1.cer.log \\\n      cat $dir/scoring_kaldi/penalty_$best_wip/${best_lmwt}.chars.txt \\| \\\n      align-text --special-symbol=\"'***'\" ark:$dir/scoring_kaldi/test_filt.chars.txt ark:- ark,t:- \\|  \\\n      utils/scoring/wer_per_utt_details.pl --special-symbol \"'***'\" \\| tee $dir/scoring_kaldi/cer_details/per_utt \\|\\\n       utils/scoring/wer_per_spk_details.pl $data/utt2spk \\> $dir/scoring_kaldi/cer_details/per_spk || exit 1;\n\n    $cmd $dir/scoring_kaldi/log/stats2.cer.log \\\n      cat $dir/scoring_kaldi/cer_details/per_utt \\| \\\n      utils/scoring/wer_ops_details.pl --special-symbol \"'***'\" \\| \\\n      sort -b -i -k 1,1 -k 4,4rn -k 2,2 -k 3,3 \\> $dir/scoring_kaldi/cer_details/ops || exit 1;\n\n    $cmd $dir/scoring_kaldi/log/cer_bootci.cer.log \\\n      compute-wer-bootci --mode=present \\\n        ark:$dir/scoring_kaldi/test_filt.chars.txt ark:$dir/scoring_kaldi/penalty_$best_wip/${best_lmwt}.chars.txt \\\n        '>' $dir/scoring_kaldi/cer_details/cer_bootci || exit 1;\n\n  fi\nfi\n\n# If we got here, the scoring was successful.\n# As a  small aid to prevent confusion, we remove all wer_{?,??} 
files;\n# these originate from the previous version of the scoring scripts.\n# Both rm statements are kept because stale files could lead to confusion\n# about the capabilities of the script.\nrm $dir/wer_{?,??} 2>/dev/null\nrm $dir/cer_{?,??} 2>/dev/null\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/scoring/score_kaldi_compare.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2016 Nicolas Serrano\n# Apache 2.0\n\n[ -f ./path.sh ] && . ./path.sh\n\n# begin configuration section.\ncmd=run.pl\nreplications=10000\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [--cmd (run.pl|queue.pl...)] <score-dir1> <score-dir2> <score-compare-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --replications <int>            # number of bootstrap evaluation to compute confidence.\"\n  exit 1;\nfi\n\ndir1=$1\ndir2=$2\ndir_compare=$3\n\nmkdir -p $dir_compare/log\n\nfor d in $dir1 $dir2; do\n  for f in test_filt.txt best_wer; do\n    [ ! -f $d/$f ] && echo \"$0: no such file $d/$f\" && exit 1;\n  done\ndone\n\n\nbest_wer_file1=$(awk '{print $NF}' $dir1/best_wer)\nbest_transcript_file1=$(echo $best_wer_file1 | sed -e 's=.*/wer_==' | \\\n        awk -v FS='_' -v dir=$dir1 '{print dir\"/penalty_\"$2\"/\"$1\".txt\"}')\n\nbest_wer_file2=$(awk '{print $NF}' $dir2/best_wer)\nbest_transcript_file2=$(echo $best_wer_file2 | sed -e 's=.*/wer_==' | \\\n        awk -v FS='_' -v dir=$dir2 '{print dir\"/penalty_\"$2\"/\"$1\".txt\"}')\n\n$cmd $dir_compare/log/score_compare.log \\\n  compute-wer-bootci --replications=$replications \\\n    ark:$dir1/test_filt.txt ark:$best_transcript_file1 ark:$best_transcript_file2 \\\n    '>' $dir_compare/wer_bootci_comparison || exit 1;\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/scoring/score_kaldi_wer.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey, Yenda Trmal)\n# Apache 2.0\n\n# See the script steps/scoring/score_kaldi_cer.sh in case you need to evalutate CER\n\n[ -f ./path.sh ] && . ./path.sh\n\n# begin configuration section.\ncmd=run.pl\nstage=0\ndecode_mbr=false\nstats=true\nbeam=6\nword_ins_penalty=0.0,0.5,1.0\nmin_lmwt=7\nmax_lmwt=17\niter=final\n#end configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [--cmd (run.pl|queue.pl...)] <data-dir> <lang-dir|graph-dir> <decode-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --stage (0|1|2)                 # start scoring script from part-way through.\"\n  echo \"    --decode_mbr (true/false)       # maximum bayes risk decoding (confusion network).\"\n  echo \"    --min_lmwt <int>                # minumum LM-weight for lattice rescoring \"\n  echo \"    --max_lmwt <int>                # maximum LM-weight for lattice rescoring \"\n  exit 1;\nfi\n\ndata=$1\nlang_or_graph=$2\ndir=$3\n\nsymtab=$lang_or_graph/words.txt\n\nfor f in $symtab $dir/lat.1.gz $data/text; do\n  [ ! 
-f $f ] && echo \"score.sh: no such file $f\" && exit 1;\ndone\n\n\nref_filtering_cmd=\"cat\"\n[ -x local/wer_output_filter ] && ref_filtering_cmd=\"local/wer_output_filter\"\n[ -x local/wer_ref_filter ] && ref_filtering_cmd=\"local/wer_ref_filter\"\nhyp_filtering_cmd=\"cat\"\n[ -x local/wer_output_filter ] && hyp_filtering_cmd=\"local/wer_output_filter\"\n[ -x local/wer_hyp_filter ] && hyp_filtering_cmd=\"local/wer_hyp_filter\"\n\n\nif $decode_mbr ; then\n  echo \"$0: scoring with MBR, word insertion penalty=$word_ins_penalty\"\nelse\n  echo \"$0: scoring with word insertion penalty=$word_ins_penalty\"\nfi\n\n\nmkdir -p $dir/scoring_kaldi\ncat $data/text | $ref_filtering_cmd > $dir/scoring_kaldi/test_filt.txt || exit 1;\nif [ $stage -le 0 ]; then\n\n  for wip in $(echo $word_ins_penalty | sed 's/,/ /g'); do\n    mkdir -p $dir/scoring_kaldi/penalty_$wip/log\n\n    if $decode_mbr ; then\n      $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/best_path.LMWT.log \\\n        acwt=\\`perl -e \\\"print 1.0/LMWT\\\"\\`\\; \\\n        lattice-scale --inv-acoustic-scale=LMWT \"ark:gunzip -c $dir/lat.*.gz|\" ark:- \\| \\\n        lattice-add-penalty --word-ins-penalty=$wip ark:- ark:- \\| \\\n        lattice-prune --beam=$beam ark:- ark:- \\| \\\n        lattice-mbr-decode  --word-symbol-table=$symtab \\\n        ark:- ark,t:- \\| \\\n        utils/int2sym.pl -f 2- $symtab \\| \\\n        $hyp_filtering_cmd '>' $dir/scoring_kaldi/penalty_$wip/LMWT.txt || exit 1;\n\n    else\n      $cmd LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/best_path.LMWT.log \\\n        lattice-scale --inv-acoustic-scale=LMWT \"ark:gunzip -c $dir/lat.*.gz|\" ark:- \\| \\\n        lattice-add-penalty --word-ins-penalty=$wip ark:- ark:- \\| \\\n        lattice-best-path --word-symbol-table=$symtab ark:- ark,t:- \\| \\\n        utils/int2sym.pl -f 2- $symtab \\| \\\n        $hyp_filtering_cmd '>' $dir/scoring_kaldi/penalty_$wip/LMWT.txt || exit 1;\n    fi\n\n    $cmd 
LMWT=$min_lmwt:$max_lmwt $dir/scoring_kaldi/penalty_$wip/log/score.LMWT.log \\\n      cat $dir/scoring_kaldi/penalty_$wip/LMWT.txt \\| \\\n      compute-wer --text --mode=present \\\n      ark:$dir/scoring_kaldi/test_filt.txt  ark,p:- \">&\" $dir/wer_LMWT_$wip || exit 1;\n\n  done\nfi\n\n\n\nif [ $stage -le 1 ]; then\n\n  for wip in $(echo $word_ins_penalty | sed 's/,/ /g'); do\n    for lmwt in $(seq $min_lmwt $max_lmwt); do\n      # adding /dev/null to the command list below forces grep to output the filename\n      grep WER $dir/wer_${lmwt}_${wip} /dev/null\n    done\n  done | utils/best_wer.sh  >& $dir/scoring_kaldi/best_wer || exit 1\n\n  best_wer_file=$(awk '{print $NF}' $dir/scoring_kaldi/best_wer)\n  best_wip=$(echo $best_wer_file | awk -F_ '{print $NF}')\n  best_lmwt=$(echo $best_wer_file | awk -F_ '{N=NF-1; print $N}')\n\n  if [ -z \"$best_lmwt\" ]; then\n    echo \"$0: we could not get the details of the best WER from the file $dir/wer_*.  Probably something went wrong.\"\n    exit 1;\n  fi\n\n  if $stats; then\n    mkdir -p $dir/scoring_kaldi/wer_details\n    echo $best_lmwt > $dir/scoring_kaldi/wer_details/lmwt # record best language model weight\n    echo $best_wip > $dir/scoring_kaldi/wer_details/wip # record best word insertion penalty\n\n    $cmd $dir/scoring_kaldi/log/stats1.log \\\n      cat $dir/scoring_kaldi/penalty_$best_wip/$best_lmwt.txt \\| \\\n      align-text --special-symbol=\"'***'\" ark:$dir/scoring_kaldi/test_filt.txt ark:- ark,t:- \\|  \\\n      utils/scoring/wer_per_utt_details.pl --special-symbol \"'***'\" \\| tee $dir/scoring_kaldi/wer_details/per_utt \\|\\\n       utils/scoring/wer_per_spk_details.pl $data/utt2spk \\> $dir/scoring_kaldi/wer_details/per_spk || exit 1;\n\n    $cmd $dir/scoring_kaldi/log/stats2.log \\\n      cat $dir/scoring_kaldi/wer_details/per_utt \\| \\\n      utils/scoring/wer_ops_details.pl --special-symbol \"'***'\" \\| \\\n      sort -b -i -k 1,1 -k 4,4rn -k 2,2 -k 3,3 \\> $dir/scoring_kaldi/wer_details/ops 
|| exit 1;\n\n    $cmd $dir/scoring_kaldi/log/wer_bootci.log \\\n      compute-wer-bootci --mode=present \\\n        ark:$dir/scoring_kaldi/test_filt.txt ark:$dir/scoring_kaldi/penalty_$best_wip/$best_lmwt.txt \\\n        '>' $dir/scoring_kaldi/wer_details/wer_bootci || exit 1;\n\n  fi\nfi\n\n# If we got here, the scoring was successful.\n# As a  small aid to prevent confusion, we remove all wer_{?,??} files;\n# these originate from the previous version of the scoring files\n# i keep both statement here because it could lead to confusion about\n# the capabilities of the script (we don't do cer in the script)\nrm $dir/wer_{?,??} 2>/dev/null\nrm $dir/cer_{?,??} 2>/dev/null\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/search_index.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Guoguo Chen)\n# Apache 2.0\n\n# Begin configuration section.  \ncmd=run.pl\nnbest=-1\nstrict=true\nindices_dir=\nframe_subsampling_factor=1\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 2 ]; then\n   echo \"Usage: steps/search_index.sh [options] <kws-data-dir> <kws-dir>\"\n   echo \" e.g.: steps/search_index.sh data/kws exp/sgmm2_5a_mmi/decode/kws/\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --nbest <int>                                    # return n best results. (-1 means all)\"\n   echo \"  --indices-dir <path>                             # where the indices should be stored, by default it will be in <kws-dir>\"\n   exit 1;\nfi\n\n\nkwsdatadir=$1;\nkwsdir=$2;\n\nif [ -z $indices_dir ] ; then\n  indices_dir=$kwsdir\nfi\n\nmkdir -p $kwsdir/log;\nnj=`cat $indices_dir/num_jobs` || exit 1;\nif [ -f $kwsdatadir/keywords.fsts.gz ]; then\n  keywords=\"\\\"gunzip -c $kwsdatadir/keywords.fsts.gz|\\\"\"\nelif [ -f $kwsdatadir/keywords.fsts ]; then\n  keywords=$kwsdatadir/keywords.fsts;\nelse\n  echo \"$0: no such file $kwsdatadir/keywords.fsts[.gz]\" && exit 1;\nfi\n\nfor f in $indices_dir/index.1.gz ; do\n  [ ! -f $f ] && echo \"make_index.sh: no such file $f\" && exit 1;\ndone\n\n$cmd JOB=1:$nj $kwsdir/log/search.JOB.log \\\n  kws-search --strict=$strict --negative-tolerance=-1 \\\n  --frame-subsampling-factor=${frame_subsampling_factor} \\\n  \"ark:gzip -cdf $indices_dir/index.JOB.gz|\" ark:$keywords \\\n  \"ark,t:|gzip -c > $kwsdir/result.JOB.gz\" \\\n  \"ark,t:|gzip -c > $kwsdir/stats.JOB.gz\" || exit 1;\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/segmentation/ali_to_targets.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script converts alignments into targets for training neural network\n# for speech activity detection. The mapping from phones to speech / silence / garbage\n# is defined by the options --silence-phones and --garbage-phones.\n# This is similar to the script steps/segmentation/lats_to_targets.sh which \n# converts lattices to targets. See that script for details about the \n# targets matrix.\n\nset -o pipefail\n\nsilence_phones=\ngarbage_phones=\nmax_phone_duration=0.5\n\ncmd=run.pl\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  cat <<EOF\n  This script converts alignments into targets for training neural network\n  for speech activity detection. The mapping from phones to speech / silence / garbage\n  is defined by the options --silence-phones and --garbage-phones.\n\n  This is similar to the script steps/segmentation/lats_to_targets.sh which \n  converts lattices to targets. See that script for details about the \n  targets matrix.\n\n  Usage: steps/segmentation/ali_to_targets.sh <data-dir> <lang> <ali-dir> <targets-dir>\"\n  e.g.: steps/segmentation/ali_to_targets.sh \\\n  --silence-phones data/lang/phones/optional_silence.txt \\\n  --garbage-phones data/lang/phones/silence.txt \\\n  --max-phone-duration 0.5 \\\n  data/train_split10s data/lang \\\n  exp/segmentation1a/tri3b_train_split10s_ali \\\n  exp/segmentation1a/tri3b_train_split10s_targets\nEOF\n  exit 1\nfi\n\ndata=$1\nlang=$2\nali_dir=$3\ndir=$4\n\nif [ -f $ali_dir/final.mdl ]; then\n  srcdir=$ali_dir\nelse\n  srcdir=$ali_dir/..\nfi\n\nfor f in $data/utt2spk $ali_dir/ali.1.gz $srcdir/final.mdl; do \n  if [ ! 
-f $f ]; then\n    echo \"$0: Could not find file $f\"\n    exit 1\n  fi\ndone\n\nmkdir -p $dir\n\nif [ -z \"$garbage_phones\" ]; then\n  oov_phone=$(steps/segmentation/internal/get_oov_phone.py $lang) || exit 1\n  echo $oov_phone | utils/int2sym.pl $lang/phones.txt > $dir/garbage_phones.txt || exit 1\nelse \n  cp $garbage_phones $dir/garbage_phones.txt || exit 1\nfi\n\nif [ -z \"$silence_phones\" ]; then\n  cat $lang/silence_phones.txt | \\\n    utils/filter_scp.pl --exclude $dir/garbage_phones.txt > \\\n    $dir/silence_phones.txt\nelse \n  cp $silence_phones $dir/silence_phones.txt\nfi\n\nnj=$(cat $ali_dir/num_jobs) || exit 1\n\n$cmd JOB=1:$nj $dir/log/get_arc_info.JOB.log \\\n  ali-to-phones --ctm-output --frame-shift=1 \\\n    $srcdir/final.mdl \"ark:gunzip -c $ali_dir/ali.JOB.gz |\" - \\| \\\n  utils/int2sym.pl -f 5 $lang/phones.txt \\| \\\n  awk '{print $1\" \"int($3)\" \"int($4)\" 1.0 \"$5}' \\> \\\n  $dir/arc_info_sym.JOB.txt || exit 1\n\n# make $dir an absolute pathname.\ndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nframe_subsampling_factor=1\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $srcdir/frame_subsampling_factor)\n  echo $frame_subsampling_factor > $dir/frame_subsampling_factor\nfi\n\nframe_shift=$(utils/data/get_frame_shift.sh $data) || exit 1\nmax_phone_len=$(perl -e \"print int($max_phone_duration / $frame_shift)\")\n\n$cmd JOB=1:$nj $dir/log/get_targets.JOB.log \\\n  steps/segmentation/internal/arc_info_to_targets.py \\\n    --silence-phones=$dir/silence_phones.txt \\\n    --garbage-phones=$dir/garbage_phones.txt \\\n    --max-phone-length=$max_phone_len \\\n    $dir/arc_info_sym.JOB.txt - \\| \\\n  copy-feats ark,t:- \\\n    ark,scp:$dir/targets.JOB.ark,$dir/targets.JOB.scp || exit 1\n\nfor n in $(seq $nj); do\n  cat $dir/targets.$n.scp\ndone > $dir/targets.scp\n\nsteps/segmentation/validate_targets_dir.sh $dir $data || exit 1\n\necho \"$0: 
Done creating targets in $dir/targets.scp\"\n\n"
  },
  {
    "path": "egs/steps/segmentation/combine_targets_dirs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017 Nagendra Kumar Goel\n#           2018 Vimal Manohar   \n# Apache 2.0.\n\n# This script combines targets directory into a new targets directory \n# containing targets from all the input targets directories.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 3 ]; then\n  echo \"Usage: $0 [options] <data> <dest-targets-dir> <src-targets-dir1> <src-targets-dir2> ...\"\n  echo \"e.g.: $0 data/train exp/targets_combined exp/targets_1 exp/targets_2\"\n  exit 1;\nfi\n\nexport LC_ALL=C\n\ndata=$1;\nshift;\ndest=$1;\nshift;\nfirst_src=$1;\n\nmkdir -p $dest;\nrm -f $dest/{targets.*.ark,frame_subsampling_factor} 2>/dev/null\n\nframe_subsampling_factor=1\nif [ -f $first_src/frame_subsampling_factor ]; then\n  cp $first_src/frame_subsampling_factor $dest\n  frame_subsampling_factor=$(cat $dest/frame_subsampling_factor)\nfi\n\nfor d in $*; do\n  this_frame_subsampling_factor=1\n  if [ -f $d/frame_subsampling_factor ]; then\n    this_frame_subsampling_factor=$(cat $d/frame_subsampling_factor)\n  fi\n\n  if [ $this_frame_subsampling_factor != $frame_subsampling_factor ]; then\n    echo \"$0: Cannot combine targets directories with different frame-subsampling-factors\" 1>&2\n    exit 1\n  fi\n\n  cat $d/targets.scp\ndone | sort -k1,1 > $dest/targets.scp || exit 1\n\nsteps/segmentation/validate_targets_dir.sh $dest $data || exit 1\n\necho \"Combined targets and stored in $dest\"\nexit 0\n"
  },
  {
    "path": "egs/steps/segmentation/convert_targets_dir_to_whole_recording.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script converts targets corresponding to 'data' at segments level \n# in 'targets_dir' to whole-recording level corresponding to the \n# whole-recording data directory 'whole_data'.\n\n# The targets for the whole-recording are created by simply copying the targets \n# for the in-segment region, while setting the out-of-segment region targets\n# to the target values contained in the file specified \n# (in kaldi vector text format) by --default-targets option.\n# By default, the 'default_targets' would be [ 0 0 0 ].\n# Note that the script steps/segmentation/get_targets_for_out_of_segments.sh \n# can be used to get targets only for the out-of-segment regions. It is \n# better to use that when you need specific target values like all silence \n# ([ 1 0 0 ]) or all garbage ([ 0 0 1 ]) for the out-of-segment regions. \n# That way you can control how the out-of-segment target values are \n# combined using the weights in steps/segmentation/merge_targets_dirs.sh\n\nnj=4\ncmd=run.pl\ndefault_targets=   # vector of default targets in text format\n\nset -o pipefail -u\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  cat <<EOF\n  This script converts targets corresponding to 'data' at segments level \n  in 'targets_dir' to whole-recording level corresponding to the \n  whole-recording data directory 'whole_data'.\n  See top of the script for more details.\n\n  Usage: steps/segmentation/convert_targets_to_whole_recording.sh <data-dir> <whole-data-dir> <targets-dir> <whole-targets-dir>\n   e.g.: steps/segmentation/convert_targets_to_whole_recording.sh \\\n    data/train_split10s data/train_whole \\\n    exp/segmentation1a/tri3b_train_split10s_targets \\\n    exp/segmentation1a/tri3b_train_whole_targets\nEOF\n  exit 1\nfi\n\ndata=$1\nwhole_data=$2\ntargets_dir=$3\ndir=$4\n\nif [ ! 
-f $data/segments ]; then\n  mkdir -p $dir\n  awk '{print $1}' $whole_data/wav.scp > $dir/recos\n  utils/filter_scp.pl $data/utt2spk $dir/recos > $dir/recos.data\n\n  nr=$(cat $dir/recos | wc -l)\n  nu=$(cat $dir/recos.data | wc -l) \n\n  if [ $nu -lt $[$nr - ($nr/20)] ]; then\n    echo \"Found fewer than 95% of the recordings of $whole_data in $data.\"\n    exit 1;\n  fi\n\n  cp $targets_dir/targets.scp $dir\n  cp $targets_dir/frame_subsampling_factor $dir || true\n\n  exit 0\nfi\n\nfor f in $data/segments $targets_dir/targets.scp \\\n  $whole_data/wav.scp; do\n  if [ ! -f $f ]; then \n    echo \"$0: Could not find file $f\" \n    exit 1\n  fi\ndone\n\nframe_shift=$(utils/data/get_frame_shift.sh $data) || exit 1\nframe_subsampling_factor=1\nif [ -f $targets_dir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $targets_dir/frame_subsampling_factor) || exit 1\nfi\nframe_shift=`perl -e \"print ($frame_shift * $frame_subsampling_factor);\"`\n\nmkdir -p $dir/split${nj}reco\nsplit_scps=\nfor n in $(seq $nj); do\n  split_scps=\"$split_scps $dir/split${nj}reco/wav.$n.scp\"\ndone\nutils/split_scp.pl $whole_data/wav.scp $split_scps\n\nutils/data/get_reco2utt_for_data.sh $data > $dir/reco2utt\n\nmkdir -p $dir/split${nj}reco\nutils/filter_scps.pl JOB=1:$nj $dir/split${nj}reco/wav.JOB.scp $dir/reco2utt \\\n  $dir/split${nj}reco/reco2utt.JOB || exit 1\nutils/filter_scps.pl -f 2 JOB=1:$nj $dir/split${nj}reco/wav.JOB.scp $data/segments \\\n    $dir/split${nj}reco/segments.JOB || exit 1\nutils/filter_scps.pl JOB=1:$nj $dir/split${nj}reco/segments.JOB $targets_dir/targets.scp \\\n    $dir/split${nj}reco/targets.JOB.scp || exit 1\n\n# make $dir an absolute pathname.\ndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nutils/data/get_utt2num_frames.sh --cmd \"$cmd\" --nj $nj $whole_data\ncp $whole_data/utt2num_frames $dir/reco2num_frames\n\n$cmd JOB=1:$nj $dir/log/merge_targets_to_reco.JOB.log \\\n  
steps/segmentation/internal/merge_segment_targets_to_recording.py \\\n    --reco2num-frames=$dir/reco2num_frames --frame-shift=$frame_shift \\\n    --default-targets=\"$default_targets\" \\\n    $dir/split${nj}reco/reco2utt.JOB $dir/split${nj}reco/segments.JOB \\\n    $dir/split${nj}reco/targets.JOB.scp - \\| \\\n  copy-feats ark,t:- ark,scp:$dir/targets.JOB.ark,$dir/targets.JOB.scp || exit 1\n\nfor n in $(seq $nj); do\n  cat $dir/targets.$n.scp\ndone | sort -k1,1 > $dir/targets.scp\n\nsteps/segmentation/validate_targets_dir.sh $dir $whole_data || exit 1\n\necho \"$0: Converted targets to whole recordings in $dir\"\nexit 0\n"
  },
  {
    "path": "egs/steps/segmentation/convert_utt2spk_and_segments_to_rttm.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0.\n\n\"\"\"This script converts kaldi-style utt2spk and segments to a NIST RTTM\nfile.\n\nThe RTTM format is\n<type> <file-id> <channel-id> <begin-time> \\\n        <duration> <ortho> <stype> <name> <conf>\n\n<type> = SPEAKER for each segment.\n<file-id> - the File-ID of the recording\n<channel-id> - the Channel-ID, usually 1\n<begin-time> - start time of segment\n<duration> - duration of segment\n<ortho> - <NA> (this is ignored)\n<stype> - <NA> (this is ignored)\n<name> - speaker name or id\n<conf> - <NA> (this is ignored)\n\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script converts kaldi-style utt2spk and\n        segments to a NIST RTTM file\"\"\")\n\n    parser.add_argument(\"--reco2file-and-channel\", type=str,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"Input reco2file_and_channel.\n                        The format is <recording-id> <file-id> <channel-id>.\n                        If not provided, then <recording-id> is taken as the\n                        <file-id> with <channel-id> = 1.\"\"\")\n    parser.add_argument(\"utt2spk\", type=str,\n                        help=\"Input utt2spk file\")\n    parser.add_argument(\"segments\", type=str,\n                        help=\"Input segments file\")\n    parser.add_argument(\"rttm_file\", type=str,\n                        help=\"Output RTTM file\")\n\n    args = parser.parse_args()\n    return args\n\n\ndef main():\n    args = get_args()\n\n    if args.reco2file_and_channel is not None:\n        reco2file_and_channel = {}\n        with common_lib.smart_open(args.reco2file_and_channel) as fh:\n            for line in fh:\n                parts = line.strip().split()\n              
  reco2file_and_channel[parts[0]] = (parts[1], parts[2])\n\n    utt2spk = {}\n    with common_lib.smart_open(args.utt2spk) as fh:\n        for line in fh:\n            parts = line.strip().split()\n            utt2spk[parts[0]] = parts[1]\n\n    with common_lib.smart_open(args.segments) as segments_reader, \\\n            common_lib.smart_open(args.rttm_file, 'w') as rttm_writer:\n        for line in segments_reader:\n            parts = line.strip().split()\n\n            utt = parts[0]\n            spkr = utt2spk[utt]\n\n            reco = parts[1]\n            file_id = reco\n            channel = 1\n\n            if args.reco2file_and_channel is not None:\n                try:\n                    file_id, channel = reco2file_and_channel[reco]\n                except KeyError:\n                    raise RuntimeError(\n                        \"Could not find recording {0} in {1}\".format(\n                            reco, args.reco2file_and_channel))\n\n            start_time = float(parts[2])\n            duration = float(parts[3]) - start_time\n\n            print(\"SPEAKER {0} {1} {2:7.2f} {3:7.2f} \"\n                  \"<NA> <NA> {4} <NA>\".format(\n                      file_id, channel, start_time,\n                      duration, spkr), file=rttm_writer)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/copy_targets_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright    2017  Nagendra Kumar Goel\n#              2014  Johns Hopkins University (author: Nagendra K Goel)\n# Apache 2.0\n\n# This script makes a copy of targets directory (by copying targets.scp),\n# possibly adding a specified prefix or a suffix to the utterance names.\n\n# begin configuration section\nutt_prefix=\nutt_suffix=\n# end configuration section\n\nif [ -f ./path.sh ]; then . ./path.sh; fi\n. ./utils/parse_options.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <destdir>\"\n  echo \"e.g.:\"\n  echo \" $0  --utt-prefix=1- exp/segmentation_1a/train_whole_combined_targets_sub3 exp/segmentation_1a/train_whole_combined_targets_sub3_rev1\"\n  echo \"Options\"\n  echo \"   --utt-prefix=<prefix>     # Prefix for utterance ids, default empty\"\n  echo \"   --utt-suffix=<suffix>     # Suffix for utterance ids, default empty\"\n  exit 1;\nfi\n\nexport LC_ALL=C\n\nsrcdir=$1\ndestdir=$2\n\nmkdir -p $destdir\n\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  cp $srcdir/frame_subsampling_factor $destdir\nfi\n\ncat $srcdir/targets.scp | awk -v p=$utt_prefix -v s=$utt_suffix \\\n  '{printf(\"%s %s%s%s\\n\", $1, p, $1, s);}' > $destdir/utt_map\n\ncat $srcdir/targets.scp | utils/apply_map.pl -f 1 $destdir/utt_map | \\\n  sort -k1,1 > $destdir/targets.scp\n\necho \"$0: copied targets from $srcdir to $destdir\"\n"
  },
  {
    "path": "egs/steps/segmentation/decode_sad.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0.\n\n# This script does Viterbi decoding using a matrix of frame log-likelihoods \n# with the columns corresponding to the pdfs.\n# It is a wrapper around the binary decode-faster.\n\nset -e\nset -o pipefail\n\ncmd=run.pl\nnj=4\nacwt=0.1\nbeam=8\nmax_active=1000\ntransform=   # Transformation matrix to apply on the input archives read from output.scp\n\n. ./path.sh\n\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 <graph-dir> <nnet_output_dir> <decode-dir>\"\n  echo \" e.g.: $0 \"\n  exit 1 \nfi\n\ngraph_dir=$1\nnnet_output_dir=$2\ndir=$3\n\nmkdir -p $dir/log\n\necho $nj > $dir/num_jobs\n\nfor f in $graph_dir/HCLG.fst $nnet_output_dir/output.scp $extra_files; do\n  if [ ! -f $f ]; then\n    echo \"$0: Could not find file $f\"\n    exit 1\n  fi\ndone\n\nrspecifier=\"ark:utils/split_scp.pl -j $nj \\$[JOB-1] $nnet_output_dir/output.scp | copy-feats scp:- ark:- |\"\n\n# Apply a transformation on the input matrix to combine \n# probs from different columns to pseudo-likelihoods\nif [ ! -z \"$transform\" ]; then\n  rspecifier=\"$rspecifier transform-feats $transform ark:- ark:- |\"\nfi\n\n# Convert pseudo-likelihoods to pseudo log-likelihood\nrspecifier=\"$rspecifier copy-matrix --apply-log ark:- ark:- |\"\n\ndecoder_opts+=(--acoustic-scale=$acwt --beam=$beam --max-active=$max_active)\n\n$cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n  decode-faster ${decoder_opts[@]} \\\n  $graph_dir/HCLG.fst \"$rspecifier\" \\\n  ark:/dev/null \"ark:| gzip -c > $dir/ali.JOB.gz\"\n"
  },
  {
    "path": "egs/steps/segmentation/detect_speech_activity.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016-17  Vimal Manohar\n#              2017  Nagendra Kumar Goel\n# Apache 2.0.\n\n# This script does nnet3-based speech activity detection given an input \n# kaldi data directory and outputs a segmented kaldi data directory.\n# This script can also do music detection and other similar segmentation\n# using appropriate options such as --output-name output-music.\n\nset -e \nset -o pipefail\nset -u\n\nif [ -f ./path.sh ]; then . ./path.sh; fi\n\naffix=  # Affix for the segmentation\nnj=32\ncmd=run.pl\nstage=-1\n\n# Feature options (Must match training)\nmfcc_config=conf/mfcc_hires.conf\nfeat_affix=   # Affix for the type of feature used\n\nconvert_data_dir_to_whole=true    # If true, the input data directory is \n                                  # first converted to whole data directory (i.e. whole recordings)\n                                  # and segmentation is done on that.\n                                  # If false, then the original segments are \n                                  # retained and they are split into sub-segments.\n\noutput_name=output   # The output node in the network\nsad_name=sad    # Base name for the directory storing the computed loglikes\n                # Can be music for music detection\nsegmentation_name=segmentation  # Base name for the directory doing segmentation\n                                # Can be segmentation_music for music detection\n\n# SAD network config\niter=final  # Model iteration to use\n\n# Contexts must ideally match training for LSTM models, but\n# may not necessarily for stats components\nextra_left_context=0  # Set to some large value, typically 40 for LSTM (must match training)\nextra_right_context=0  \nextra_left_context_initial=-1\nextra_right_context_final=-1\nframes_per_chunk=150\n\n# Decoding options\ngraph_opts=\"--min-silence-duration=0.03 --min-speech-duration=0.3 --max-speech-duration=10.0\"\nacwt=0.3\n\n# These <from>_in_<to>_weight represent 
the fraction of <from> probability \n# to transfer to <to> class.\n# e.g. --speech-in-sil-weight=0.0 --garbage-in-sil-weight=0.0 --sil-in-speech-weight=0.0 --garbage-in-speech-weight=0.3\ntransform_probs_opts=\"\"\n\n# Postprocessing options\nsegment_padding=0.2   # Duration (in seconds) of padding added to segments \nmin_segment_dur=0   # Minimum duration (in seconds) required for a segment to be included\n                    # This is before any padding. Segments shorter than this duration will be removed.\n                    # This is an alternative to --min-speech-duration above.\nmerge_consecutive_max_dur=0   # Merge consecutive segments as long as the merged segment is no longer than this many\n                              # seconds. The segments are only merged if their boundaries are touching.\n                              # This is after padding by --segment-padding seconds.\n                              # 0 means do not merge. Use 'inf' to not limit the duration.\n\necho $* \n\n. 
utils/parse_options.sh\n\nif [ $# -ne 5 ]; then\n  echo \"This script does nnet3-based speech activity detection given an input kaldi \"\n  echo \"data directory and outputs an output kaldi data directory.\"\n  echo \"See script for details of the options to be supplied.\"\n  echo \"Usage: $0 <src-data-dir> <sad-nnet-dir> <mfcc-dir> <work-dir> <out-data-dir>\"\n  echo \" e.g.: $0 ~/workspace/egs/ami/s5b/data/sdm1/dev exp/nnet3_sad_snr/nnet_tdnn_j_n4 \\\\\"\n  echo \"    mfcc_hires exp/segmentation_sad_snr/nnet_tdnn_j_n4 data/ami_sdm1_dev\"\n  echo \"\"\n  echo \"Options: \"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <num-job>                                 # number of parallel jobs to run.\"\n  echo \"  --stage <stage>                                # stage to do partial re-run from.\"\n  echo \"  --convert-data-dir-to-whole <true|false>    # If true, the input data directory is \"\n  echo \"                                              # first converted to whole data directory (i.e. 
whole recordings) \"\n  echo \"                                              # and segmentation is done on that.\"\n  echo \"                                              # If false, then the original segments are \"\n  echo \"                                              # retained and they are split into sub-segments.\"\n  echo \"  --output-name <name>    # The output node in the network\"\n  echo \"  --extra-left-context  <context|0>   # Set to some large value, typically 40 for LSTM (must match training)\"\n  echo \"  --extra-right-context  <context|0>   # For BLSTM or statistics pooling\"\n  exit 1\nfi\n\nsrc_data_dir=$1   # The input data directory that needs to be segmented.\n                  # If convert_data_dir_to_whole is true, any segments in that will be ignored.\nsad_nnet_dir=$2   # The SAD neural network\nmfcc_dir=$3       # The directory to store the features\ndir=$4            # Work directory\ndata_dir=$5       # The output data directory will be ${data_dir}_seg\n\naffix=${affix:+_$affix}\nfeat_affix=${feat_affix:+_$feat_affix}\n\ndata_id=`basename $data_dir`\nsad_dir=${dir}/${sad_name}${affix}_${data_id}_whole${feat_affix}\nseg_dir=${dir}/${segmentation_name}${affix}_${data_id}_whole${feat_affix}\n\nif $convert_data_dir_to_whole; then\n  test_data_dir=data/${data_id}_whole${feat_affix}_hires\n  if [ $stage -le 0 ]; then\n    rm -r ${test_data_dir} || true\n    utils/data/convert_data_dir_to_whole.sh $src_data_dir ${test_data_dir}\n  fi\nelse\n  test_data_dir=data/${data_id}${feat_affix}_hires\n  if [ $stage -le 0 ]; then\n    rm -r ${test_data_dir} || true\n    utils/copy_data_dir.sh $src_data_dir $test_data_dir\n  fi\nfi\n\n###############################################################################\n## Extract input features \n###############################################################################\n\nif [ $stage -le 1 ]; then\n  utils/fix_data_dir.sh $test_data_dir\n  steps/make_mfcc.sh --mfcc-config $mfcc_config --nj $nj --cmd 
\"$cmd\" --write-utt2num-frames true \\\n    ${test_data_dir} exp/make_hires$feat_affix/${data_id} $mfcc_dir\n  steps/compute_cmvn_stats.sh ${test_data_dir} exp/make_hires$feat_affix/${data_id} $mfcc_dir\n  utils/fix_data_dir.sh ${test_data_dir}\nfi\n\n###############################################################################\n## Forward pass through the network network and dump the log-likelihoods.\n###############################################################################\n\nframe_subsampling_factor=1\nif [ -f $sad_nnet_dir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $sad_nnet_dir/frame_subsampling_factor)\nfi\n\nmkdir -p $dir\nif [ $stage -le 4 ]; then\n  if [ \"$(readlink -f $sad_nnet_dir)\" != \"$(readlink -f $dir)\" ]; then\n    cp $sad_nnet_dir/cmvn_opts $dir || exit 1\n  fi\n\n  ########################################################################\n  ## Initialize neural network for decoding using the output $output_name\n  ########################################################################\n\n  if [ ! -z \"$output_name\" ] && [ \"$output_name\" != output ]; then\n    $cmd $dir/log/get_nnet_${output_name}.log \\\n      nnet3-copy --edits=\"rename-node old-name=$output_name new-name=output\" \\\n      $sad_nnet_dir/$iter.raw $dir/${iter}_${output_name}.raw || exit 1\n    iter=${iter}_${output_name}\n  else \n    if ! 
diff $sad_nnet_dir/$iter.raw $dir/$iter.raw; then\n      cp $sad_nnet_dir/$iter.raw $dir/\n    fi\n  fi\n\n  steps/nnet3/compute_output.sh --nj $nj --cmd \"$cmd\" \\\n    --iter ${iter} \\\n    --extra-left-context $extra_left_context \\\n    --extra-right-context $extra_right_context \\\n    --extra-left-context-initial $extra_left_context_initial \\\n    --extra-right-context-final $extra_right_context_final \\\n    --frames-per-chunk $frames_per_chunk --apply-exp true \\\n    --frame-subsampling-factor $frame_subsampling_factor \\\n    ${test_data_dir} $dir $sad_dir || exit 1\nfi\n\n###############################################################################\n## Prepare FST we search to make speech/silence decisions.\n###############################################################################\n\nutils/data/get_utt2dur.sh --nj $nj --cmd \"$cmd\" $test_data_dir || exit 1\nframe_shift=$(utils/data/get_frame_shift.sh $test_data_dir) || exit 1\n\ngraph_dir=${dir}/graph_${output_name}\nif [ $stage -le 5 ]; then\n  mkdir -p $graph_dir\n\n  # 1 for silence and 2 for speech\n  cat <<EOF > $graph_dir/words.txt\n<eps> 0\nsilence 1\nspeech 2\nEOF\n\n  $cmd $graph_dir/log/make_graph.log \\\n    steps/segmentation/internal/prepare_sad_graph.py $graph_opts \\\n      --frame-shift=$(perl -e \"print $frame_shift * $frame_subsampling_factor\") - \\| \\\n    fstcompile --isymbols=$graph_dir/words.txt --osymbols=$graph_dir/words.txt '>' \\\n      $graph_dir/HCLG.fst\nfi\n\n###############################################################################\n## Do Viterbi decoding to create per-frame alignments.\n###############################################################################\n\npost_vec=$sad_nnet_dir/post_${output_name}.vec\nif [ ! -f $sad_nnet_dir/post_${output_name}.vec ]; then\n  if [ ! -f $sad_nnet_dir/post_${output_name}.txt ]; then\n    echo \"$0: Could not find $sad_nnet_dir/post_${output_name}.vec. 
\"\n    echo \"Re-run the corresponding stage in the training script possibly \"\n    echo \"with --compute-average-posteriors=true or compute the priors \"\n    echo \"from the training labels\"\n    exit 1\n  else\n    post_vec=$sad_nnet_dir/post_${output_name}.txt\n  fi\nfi\n\nmkdir -p $seg_dir\nif [ $stage -le 6 ]; then\n  steps/segmentation/internal/get_transform_probs_mat.py \\\n    --priors=\"$post_vec\" $transform_probs_opts > $seg_dir/transform_probs.mat\n\n  steps/segmentation/decode_sad.sh --acwt $acwt --cmd \"$cmd\" \\\n    --nj $nj \\\n    --transform \"$seg_dir/transform_probs.mat\" \\\n    $graph_dir $sad_dir $seg_dir\nfi\n\n###############################################################################\n## Post-process segmentation to create kaldi data directory.\n###############################################################################\n\nif [ $stage -le 7 ]; then\n  steps/segmentation/post_process_sad_to_segments.sh \\\n    --segment-padding $segment_padding --min-segment-dur $min_segment_dur \\\n    --merge-consecutive-max-dur $merge_consecutive_max_dur \\\n    --cmd \"$cmd\" --frame-shift $(perl -e \"print $frame_subsampling_factor * $frame_shift\") \\\n    ${test_data_dir} ${seg_dir} ${seg_dir}\nfi\n\nif [ $stage -le 8 ]; then\n  utils/data/subsegment_data_dir.sh ${test_data_dir} ${seg_dir}/segments \\\n    ${data_dir}_seg\n  cp $src_data_dir/wav.scp ${data_dir}_seg\n  cp $src_data_dir/{stm,reco2file_and_channel,glm} ${data_dir}_seg/ || true\n  utils/fix_data_dir.sh ${data_dir}_seg\nfi\n\necho \"$0: Created output segmented kaldi data directory in ${data_dir}_seg\"\nexit 0\n"
  },
  {
    "path": "egs/steps/segmentation/evaluate_segmentation.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2014  Johns Hopkins University (Author: Sanjeev Khudanpur), Vimal Manohar \n# Apache 2.0\n\n################################################################################\n#\n# This script was written to check the goodness of automatic segmentation tools\n# It assumes input in the form of two Kaldi segments files, i.e. a file each of\n# whose lines contain four space-separated values:\n#\n#    UtteranceID  FileID  StartTime EndTime\n#\n# It computes # missed frames, # false positives and # overlapping frames.\n#\n################################################################################\n\nif ($#ARGV == 1) {\n    $ReferenceSegmentation = $ARGV[0];\n    $HypothesizedSegmentation = $ARGV[1];\n    printf STDERR (\"Comparing reference segmentation\\n\\t%s\\nwith proposed segmentation\\n\\t%s\\n\",\n\t\t   $ReferenceSegmentation,\n\t\t   $HypothesizedSegmentation);\n} else {\n    printf STDERR \"This program compares the reference segmenation with the proposted segmentation\\n\";\n    printf STDERR \"Usage: $0 reference_segments_filename proposed_segments_filename\\n\";\n    printf STDERR \"e.g. 
$0 data/dev10h/segments data/dev10h.seg/segments\\n\";\n    exit (0);\n}\n\n################################################################################\n# First read the reference segmentation, and\n# store the start- and end-times of all segments in each file.\n################################################################################\n\nopen (SEGMENTS, \"cat $ReferenceSegmentation | sort -k2,2 -k3n,3 -k4n,4 |\")\n    || die \"Unable to open $ReferenceSegmentation\";\n$numLines = 0;\nwhile ($line=<SEGMENTS>) {\n    chomp $line;\n    @field = split(\"[ \\t]+\", $line);\n    unless ($#field == 3) {\n  exit (1);\n\tprintf STDERR \"Skipping unparseable line in file $ReferenceSegmentation\\n\\t$line\\n\";\n\tnext;\n    }\n    $fileID = $field[1];\n    unless (exists $firstSeg{$fileID}) {\n\t$firstSeg{$fileID} = $numLines;\n\t$actualSpeech{$fileID} = 0.0;\n\t$hypothesizedSpeech{$fileID} = 0.0;\n\t$foundSpeech{$fileID} = 0.0;\n\t$falseAlarm{$fileID} = 0.0;\n\t$minStartTime{$fileID} = 0.0;\n\t$maxEndTime{$fileID} = 0.0;\n    }\n    $refSegName[$numLines] = $field[0];\n    $refSegStart[$numLines] = $field[2];\n    $refSegEnd[$numLines] = $field[3];\n    $actualSpeech{$fileID} += ($field[3]-$field[2]);\n    $minStartTime{$fileID} = $field[2] if ($minStartTime{$fileID}>$field[2]);\n    $maxEndTime{$fileID} = $field[3] if ($maxEndTime{$fileID}<$field[3]);\n    $lastSeg{$fileID} = $numLines;\n    ++$numLines;\n}\nclose(SEGMENTS);\nprint STDERR \"Read $numLines segments from $ReferenceSegmentation\\n\";\n\n################################################################################\n# Process hypothesized segments sequentially, and gather speech/nonspeech stats\n################################################################################\n\nopen (SEGMENTS, \"cat $HypothesizedSegmentation | sort -k2,2 -k1,1 |\")\n    # Kaldi segments files are sorted by UtteranceID, but we re-sort them here\n    # so that all segments of a file are read together, sorted by 
start-time.\n    || die \"Unable to open $HypothesizedSegmentation\";\n$numLines = 0;\n$totalHypSpeech = 0.0;\n$totalFoundSpeech = 0.0;\n$totalFalseAlarm = 0.0;\n$numShortSegs = 0;\n$numLongSegs = 0;\nwhile ($line=<SEGMENTS>) {\n    chomp $line;\n    @field = split(\"[ \\t]+\", $line);\n    unless ($#field == 3) {\n  exit (1);\n\tprintf STDERR \"Skipping unparseable line in file $HypothesizedSegmentation\\n\\t$line\\n\";\n\tnext;\n    }\n    $fileID = $field[1];\n    $segStart = $field[2];\n    $segEnd = $field[3];\n    if (exists $firstSeg{$fileID}) {\n\t# This FileID exists in the reference segmentation\n\t# So gather statistics for this UtteranceID\n\t$hypothesizedSpeech{$fileID} += ($segEnd-$segStart);\n\t$totalHypSpeech += ($segEnd-$segStart);\n\tif (($segStart>=$maxEndTime{$fileID}) || ($segEnd<=$minStartTime{$fileID})) {\n\t    # This entire segment is a false alarm\n\t    $falseAlarm{$fileID} += ($segEnd-$segStart);\n\t    $totalFalseAlarm += ($segEnd-$segStart);\n\t} else {\n\t    # This segment may overlap one or more reference segments\n\t    $p = $firstSeg{$fileID};\n\t    while ($refSegEnd[$p]<=$segStart) {\n\t\t++$p;\n\t    }\n\t    # The overlap, if any, begins at the reference segment p\n\t    $q = $lastSeg{$fileID};\n\t    while ($refSegStart[$q]>=$segEnd) {\n\t\t--$q;\n\t    }\n\t    # The overlap, if any, ends at the reference segment q\n\t    if ($q<$p) {\n\t\t# This segment sits entirely in the nonspeech region\n\t\t# between the two reference speech segments q and p\n \t\t$falseAlarm{$fileID} += ($segEnd-$segStart);\n\t\t$totalFalseAlarm += ($segEnd-$segStart);\n\t    } else {\n\t\tif (($segEnd-$segStart)<0.20) {\n\t\t    # For diagnosing Pascal's VAD segmentation\n\t\t    print STDOUT \"Found short speech region $line\\n\";\n\t\t    ++$numShortSegs;\n\t\t} elsif (($segEnd-$segStart)>60.0) {\n\t\t    ++$numLongSegs;\n\t\t    # For diagnosing Pascal's VAD segmentation\n\t\t    print STDOUT \"Found long speech region $line\\n\";\n\t\t}\n\t\t# 
There is some overlap with segments p through q\n\t\tfor ($s=$p; $s<=$q; ++$s) {\n\t\t    if ($segStart<$refSegStart[$s]) {\n\t\t\t# There is a leading false alarm portion before s\n\t\t\t$falseAlarm{$fileID} += ($refSegStart[$s]-$segStart);\n\t\t\t$totalFalseAlarm += ($refSegStart[$s]-$segStart);\n\t\t\t$segStart=$refSegStart[$s];\n\t\t    }\n\t\t    $speechPortion = ($refSegEnd[$s]<$segEnd) ?\n\t\t\t($refSegEnd[$s]-$segStart) : ($segEnd-$segStart);\n\t\t    $foundSpeech{$fileID} += $speechPortion;\n\t\t    $totalFoundSpeech += $speechPortion;\n\t\t    $segStart=$refSegEnd[$s];\n\t\t}\n\t\tif ($segEnd>$segStart) {\n\t\t    # There is a trailing false alarm portion after q\n\t\t    $falseAlarm{$fileID} += ($segEnd-$segStart);\n\t\t    $totalFalseAlarm += ($segEnd-$segStart);\n\t\t}\n\t    }\n\t}\n    } else {\n\t# This FileID does not exist in the reference segmentation\n\t# So all this speech counts as a false alarm\n  exit (1);\n\tprintf STDERR (\"Unexpected fileID in hypothesized segments: %s\", $fileID);\n\t$totalFalseAlarm += ($segEnd-$segStart);\n    }\n    ++$numLines;\n}\nclose(SEGMENTS);\nprint STDERR \"Read $numLines segments from $HypothesizedSegmentation\\n\";\n\n################################################################################\n# Now that all hypothesized segments have been processed, compute needed stats\n################################################################################\n\n$totalActualSpeech = 0.0;\n$totalNonSpeechEst = 0.0; # This is just a crude estimate of total nonspeech.\nforeach $fileID (sort keys %actualSpeech) {\n    $totalActualSpeech += $actualSpeech{$fileID};\n    $totalNonSpeechEst += $maxEndTime{$fileID} - $actualSpeech{$fileID};\n    #######################################################################\n    # Print file-wise statistics to STDOUT; can pipe to /dev/null is needed\n    #######################################################################\n    printf STDOUT (\"%s: %.2f min actual speech, 
%.2f min hypothesized: %.2f min overlap (%d\\%), %.2f min false alarm (~%d\\%)\\n\",\n\t\t   $fileID,\n\t\t   ($actualSpeech{$fileID}/60.0),\n\t\t   ($hypothesizedSpeech{$fileID}/60.0),\n\t\t   ($foundSpeech{$fileID}/60.0),\n\t\t   ($foundSpeech{$fileID}*100/($actualSpeech{$fileID}+0.01)),\n\t\t   ($falseAlarm{$fileID}/60.0),\n\t\t   ($falseAlarm{$fileID}*100/($maxEndTime{$fileID}-$actualSpeech{$fileID}+0.01)));\n}\n\n################################################################################\n# Finally, we have everything needed to report the segmentation statistics.\n################################################################################\n\nprintf STDERR (\"------------------------------------------------------------------------\\n\");\nprintf STDERR (\"TOTAL: %.2f hrs actual speech, %.2f hrs hypothesized: %.2f hrs overlap (%d\\%), %.2f hrs false alarm (~%d\\%)\\n\",\n\t\t   ($totalActualSpeech/3600.0),\n\t\t   ($totalHypSpeech/3600.0),\n\t\t   ($totalFoundSpeech/3600.0),\n\t\t   ($totalFoundSpeech*100/($totalActualSpeech+0.000001)),\n\t\t   ($totalFalseAlarm/3600.0),\n\t\t   ($totalFalseAlarm*100/($totalNonSpeechEst+0.000001)));\nprintf STDERR (\"\\t$numShortSegs segments < 0.2 sec and $numLongSegs segments > 60.0 sec\\n\");\nprintf STDERR (\"------------------------------------------------------------------------\\n\");\n"
  },
  {
    "path": "egs/steps/segmentation/get_targets_for_out_of_segments.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script prepares targets for whole recordings for training \n# speech activity detection system on the out-of-segment regions. \n# See the script steps/segmentation/lats_to_targets.sh for details about the \n# targets matrix.\n# The out-of-segment regions are assigned the target values in the \n# file specified (in kaldi vector text format) by --default-targets option. \n# The in-segment regions are all assigned [ 0 0 0 ], \n# which means they don't contribute to the training. We will later be \n# combining these targets with other targets obtained from \n# supervision-constrained lattices and decoded lattices using the \n# script steps/segmentation/merge_targets.sh.\n# By default, the 'default_targets' would be [ 1 0 0 ], which means all\n# the out-of-segment regions are assumed as silence. But depending, on\n# the application and data, this could be [ 0 0 0 ] or [ 0 0 1 ] or\n# something with fractional weights.\n\nnj=4\ncmd=run.pl\ndefault_targets=   # vector of default targets in text format\nframe_subsampling_factor=1\n\nset -o pipefail -u\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  cat <<EOF\n  This script prepares targets for whole recordings for training \n  speech activity detection system on the out-of-segment regions. \n  See the top of the script for details.\n  Usage: steps/segmentation/get_targets_for_out_of_segments.sh <data-dir> <whole-data-dir> <targets-dir>\n   e.g.: steps/segmentation/get_targets_for_out_of_segments.sh \\\n    data/train_split10s data/train_whole \\\n    exp/segmentation1a/out_of_train_split10s_train_whole_default_targets\nEOF\n  exit 1\nfi\n\ndata=$1\nwhole_data=$2\ndir=$3\n\nfor f in $data/segments $whole_data/wav.scp; do\n  if [ ! 
-f $f ]; then \n    echo \"$0: Could not find file $f\" \n    exit 1\n  fi\ndone\n\nframe_shift=$(utils/data/get_frame_shift.sh $data) || exit 1\n\nmkdir -p $dir/split${nj}reco\nsplit_scps=\nfor n in $(seq $nj); do\n  split_scps=\"$split_scps $dir/split${nj}reco/wav.$n.scp\"\ndone\nutils/split_scp.pl $whole_data/wav.scp $split_scps\n\nutils/data/get_reco2utt_for_data.sh $data > $dir/reco2utt\n\nmkdir -p $dir/split${nj}reco\nutils/filter_scps.pl JOB=1:$nj $dir/split${nj}reco/wav.JOB.scp $dir/reco2utt \\\n  $dir/split${nj}reco/reco2utt.JOB || exit 1\nutils/filter_scps.pl -f 2 JOB=1:$nj $dir/split${nj}reco/wav.JOB.scp $data/segments \\\n    $dir/split${nj}reco/segments.JOB || exit 1\n\n# make $dir an absolute pathname.\ndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nutils/data/get_utt2num_frames.sh $whole_data\ncp $whole_data/utt2num_frames $dir/reco2num_frames\n\n$cmd JOB=1:$nj $dir/log/get_default_targets.JOB.log \\\n  steps/segmentation/internal/get_default_targets_for_out_of_segments.py \\\n    --reco2num-frames=$dir/reco2num_frames \\\n    --default-targets=\"$default_targets\" \\\n    $dir/split${nj}reco/reco2utt.JOB $dir/split${nj}reco/segments.JOB - \\| \\\n  subsample-feats --n=$frame_subsampling_factor ark,t:- ark:- \\| \\\n  copy-feats ark:- ark,scp:$dir/targets.JOB.ark,$dir/targets.JOB.scp || exit 1\n\nif [ $frame_subsampling_factor -ne 1 ]; then\n  echo $frame_subsampling_factor > $dir/frame_subsampling_factor\nfi\n\nfor n in $(seq $nj); do\n  cat $dir/targets.$n.scp\ndone | sort -k1,1 > $dir/targets.scp\n\nsteps/segmentation/validate_targets_dir.sh $dir $whole_data || exit 1\n\necho \"$0: Got default targets for out-of-segments regions in $whole_data corresponding to segments in $data\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/segmentation/internal/arc_info_to_targets.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"\nThis script converts arc-info into targets for training\nspeech activity detection network. The output is a matrix archive\nwith each matrix having 3 columns -- silence, speech and garbage.\nThe posterior probabilities of the phones of each of the classes are\nsummed up to get the target matrix values.\n\"\"\"\n\nimport argparse\nimport logging\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script converts arc-info into targets for training\n        speech activity detection network. 
The output is a matrix archive\n        with each matrix having 3 columns -- silence, speech and garbage.\n        The posterior probabilities of the phones of each of the classes are\n        summed up to get the target matrix values.\n        \"\"\")\n\n    parser.add_argument(\"--silence-phones\", type=str,\n                        required=True,\n                        help=\"File containing a list of phones that will be \"\n                        \"treated as silence\")\n    parser.add_argument(\"--garbage-phones\", type=str,\n                        required=True,\n                        help=\"File containing a list of phones that will be \"\n                        \"treated as garbage class\")\n    parser.add_argument(\"--max-phone-length\", type=int, default=50,\n                        help=\"\"\"Maximum number of frames allowed for a speech\n                        phone above which the arc is treated as garbage.\"\"\")\n\n    parser.add_argument(\"arc_info\", type=str,\n                        help=\"Arc info file (output of lattice-arc-post). 
\"\n                        \"See the help for lattice-arc-post for information \"\n                        \"about the format of this input.\")\n    parser.add_argument(\"targets_file\", type=str,\n                        help=\"File to write targets matrix archive in text \"\n                        \"format\")\n    args = parser.parse_args()\n    return args\n\n\ndef run(args):\n    silence_phones = {}\n    with common_lib.smart_open(args.silence_phones) as silence_phones_fh:\n        for line in silence_phones_fh:\n            silence_phones[line.strip().split()[0]] = 1\n\n    if len(silence_phones) == 0:\n        raise RuntimeError(\"Could not find any phones in {silence}\"\n                           \"\".format(silence=args.silence_phones))\n\n    garbage_phones = {}\n    with common_lib.smart_open(args.garbage_phones) as garbage_phones_fh:\n        for line in garbage_phones_fh:\n            word = line.strip().split()[0]\n            if word in silence_phones:\n                raise RuntimeError(\"Word '{word}' is in both {silence} \"\n                                   \"and {garbage}\".format(\n                                       word=word,\n                                       silence=args.silence_phones,\n                                       garbage=args.garbage_phones))\n            garbage_phones[word] = 1\n\n    if len(garbage_phones) == 0:\n        raise RuntimeError(\"Could not find any phones in {garbage}\"\n                           \"\".format(garbage=args.garbage_phones))\n\n    num_utts = 0\n    num_err = 0\n    targets = []\n    prev_utt = \"\"\n\n    with common_lib.smart_open(args.arc_info) as arc_info_reader, \\\n            common_lib.smart_open(args.targets_file, 'w') as targets_writer:\n        for line in arc_info_reader:\n            try:\n                parts = line.strip().split()\n                utt = parts[0]\n\n                if utt != prev_utt:\n                    if prev_utt != \"\":\n                        if 
len(targets) > 0:\n                            num_utts += 1\n                            common_lib.write_matrix_ascii(\n                                targets_writer, targets, key=prev_utt)\n                        else:\n                            num_err += 1\n                    prev_utt = utt\n                    targets = []\n\n                start_frame = int(parts[1])\n                num_frames = int(parts[2])\n                post = float(parts[3])\n                phone = parts[4]\n\n                if start_frame + num_frames > len(targets):\n                    for t in range(len(targets), start_frame + num_frames):\n                        targets.append([0, 0, 0])\n                    assert start_frame + num_frames == len(targets)\n\n                for t in range(start_frame, start_frame + num_frames):\n                    if phone in silence_phones:\n                        targets[t][0] += post\n                    elif num_frames > args.max_phone_length:\n                        targets[t][2] += post\n                    elif phone in garbage_phones:\n                        targets[t][2] += post\n                    else:\n                        targets[t][1] += post\n            except Exception:\n                logger.error(\"Failed to process line {line} in {f}\"\n                             \"\".format(line=line.strip(), f=args.arc_info))\n                logger.error(\"len(targets) = {l}\".format(l=len(targets)))\n                raise\n\n        # Write the last utterance while the output file is still open.\n        if prev_utt != \"\":\n            if len(targets) > 0:\n                num_utts += 1\n                common_lib.write_matrix_ascii(\n                    targets_writer, targets, key=prev_utt)\n            else:\n                num_err += 1\n\n    logger.info(\"Wrote {num_utts} targets; failed with {num_err}\"\n                \"\".format(num_utts=num_utts, num_err=num_err))\n    if num_utts == 0 or num_err >= num_utts // 2:\n        raise RuntimeError\n\n\ndef main():\n    args = 
get_args()\n\n    try:\n        run(args)\n    except Exception:\n        raise\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/find_oov_phone.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"This script finds the OOV phone by reading the OOV word from\noov.int in the input <lang> directory and the lexicon\n<lang>/phones/align_lexicon.int.\nIt prints the OOV phone to stdout, if it can find a single phone\nmapping for the OOV word.\"\"\"\nfrom __future__ import print_function\n\nimport sys\n\n\ndef main():\n    if len(sys.argv) != 2:\n        raise RuntimeError(\"Usage: {0} <lang>\".format(sys.argv[0]))\n\n    lang = sys.argv[1]\n\n    oov_int = int(open(\"{0}/oov.int\").readline())\n    assert oov_int > 0\n\n    oov_mapped_to_multiple_phones = False\n    for line in open(\"{0}/phones/align_lexicon.int\"):\n        parts = line.strip().split()\n\n        if len(parts) < 3:\n            raise RuntimeError(\"Could not parse line {0} in \"\n                               \"{1}/phones/align_lexicon.int\"\n                               \"\".format(line, lang))\n\n        w = int(parts[0])\n        if w != oov_int:\n            continue\n\n        if len(parts[2:]) > 1:\n            # Try to find a single phone mapping for OOV\n            oov_mapped_to_multiple_phones = True\n            continue\n\n        p = int(parts[2])\n        print (\"{0}\".format(p))\n\n        raise SystemExit(0)\n\n    if oov_mapped_to_multiple_phones:\n        raise RuntimeError(\"OOV word found, but is mapped to multiples phones. \"\n                           \"This is an unusual case.\")\n\n    raise RuntimeError(\"Could not find OOV word in \"\n                       \"{0}/phones/align_lexicon.int\".format(lang))\n\n\nif __name__ != \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/get_default_targets_for_out_of_segments.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"\nThis script gets targets for the whole recording\nby adding 'default_targets' vector read from file specified by\n--default-targets option for the out-of-segments regions and\nzeros for all other frames. See steps/segmentation/lats_to_targets.sh\nfor details about the targets matrix.\nBy default, the 'default_targets' would be [ 1 0 0 ], which means all\nthe out-of-segment regions are assumed as silence. But depending, on\nthe application and data, this could be [ 0 0 0 ] or [ 0 0 1 ] or\nsomething with fractional weights.\n\"\"\"\nfrom __future__ import division\n\nimport argparse\nimport logging\nimport numpy as np\nimport subprocess\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script gets targets for the whole recording\n        by adding 'default_targets' vector read from file specified by\n        --default-targets option for the out-of-segments regions and\n        zeros for all other frames. See steps/segmentation/lats_to_targets.sh\n        for details about the targets matrix.\n        By default, the 'default_targets' would be [ 1 0 0 ], which means all\n        the out-of-segment regions are assumed as silence. 
But depending, on\n        the application and data, this could be [ 0 0 0 ] or [ 0 0 1 ] or\n        something with fractional weights.\n        \"\"\")\n\n    parser.add_argument(\"--frame-shift\", type=float, default=0.01,\n                        help=\"Frame shift value in seconds\")\n    parser.add_argument(\"--default-targets\", type=str, default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"Vector of default targets for out-of-segments \"\n                        \"region\")\n    parser.add_argument(\"--length-tolerance\", type=int, default=2,\n                        help=\"Tolerate length mismatches of this many frames\")\n    parser.add_argument(\"--verbose\", type=int, default=0, choices=[0,1,2],\n                        help=\"Verbose level\")\n\n    parser.add_argument(\"--reco2num-frames\", type=str, required=True,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"The number of frames per reco\n                        is used to determine the num-rows of the output matrix\n                        \"\"\")\n    parser.add_argument(\"reco2utt\", type=str,\n                        help=\"\"\"reco2utt file.\n                        The format is <reco> <utt-1> <utt-2> ... 
<utt-N>\"\"\")\n    parser.add_argument(\"segments\", type=str,\n                        help=\"Input kaldi segments file\")\n    parser.add_argument(\"out_targets_ark\", type=str,\n                        help=\"\"\"Output archive to which the\n                        recording-level matrix will be written in text\n                        format\"\"\")\n\n    args = parser.parse_args()\n\n    if args.frame_shift < 0.0001 or args.frame_shift > 1:\n        raise ValueError(\"--frame-shift should be in [0.0001, 1]; got {0}\"\n                         \"\".format(args.frame_shift))\n\n    if args.verbose >= 2:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n\n    return args\n\n\ndef run(args):\n    reco2utt = {}\n    with common_lib.smart_open(args.reco2utt) as f:\n        for line in f:\n            parts = line.strip().split()\n            if len(parts) < 2:\n                raise ValueError(\"Could not parse line {0}\".format(line))\n            reco2utt[parts[0]] = parts[1:]\n\n    reco2num_frames = {}\n    with common_lib.smart_open(args.reco2num_frames) as f:\n        for line in f:\n            parts = line.strip().split()\n            if len(parts) != 2:\n                raise ValueError(\"Could not parse line {0}\".format(line))\n            if parts[0] not in reco2utt:\n                continue\n            reco2num_frames[parts[0]] = int(parts[1])\n\n    segments = {}\n    with common_lib.smart_open(args.segments) as f:\n        for line in f:\n            parts = line.strip().split()\n            if len(parts) not in [4, 5]:\n                raise ValueError(\"Could not parse line {0}\".format(line))\n            utt = parts[0]\n            reco = parts[1]\n            if reco not in reco2utt:\n                continue\n            start_time = float(parts[2])\n            end_time = float(parts[3])\n            segments[utt] = [reco, start_time, end_time]\n\n    num_utt_err = 0\n    num_utt = 0\n    num_reco = 0\n\n    
if args.default_targets is not None:\n        default_targets = np.matrix(common_lib.read_matrix_ascii(args.default_targets))\n    else:\n        default_targets = np.matrix([[1, 0, 0]])\n    assert (np.shape(default_targets)[0] == 1\n            and np.shape(default_targets)[1] == 3)\n\n    with common_lib.smart_open(args.out_targets_ark, 'w') as f:\n        for reco, utts in reco2utt.items():\n            reco_mat = np.repeat(default_targets, reco2num_frames[reco],\n                                 axis=0)\n            utts.sort(key=lambda x: segments[x][1])   # sort on start time\n            for i, utt in enumerate(utts):\n                if utt not in segments:\n                    num_utt_err += 1\n                    continue\n                segment = segments[utt]\n\n                start_frame = int(segment[1] / args.frame_shift)\n                end_frame = int(segment[2] / args.frame_shift)\n                num_frames = end_frame - start_frame\n\n                if end_frame > reco2num_frames[reco]:\n                    end_frame = reco2num_frames[reco]\n                    num_frames = end_frame - start_frame\n\n                reco_mat[start_frame:end_frame] = np.zeros([num_frames, 3])\n                num_utt += 1\n\n            if reco_mat.shape[0] > 0:\n                common_lib.write_matrix_ascii(f, reco_mat.tolist(),\n                                              key=reco)\n                num_reco += 1\n\n    logger.info(\"Got default out-of-segment targets for {num_reco} recordings \"\n                \"containing {num_utt} in-segment regions; \"\n                \"failed to account {num_utt_err} utterances\"\n                \"\".format(num_reco=num_reco, num_utt=num_utt,\n                          num_utt_err=num_utt_err))\n\n    if num_utt == 0 or num_utt_err > num_utt // 2 or num_reco == 0:\n        raise RuntimeError\n\n\ndef main():\n    args = get_args()\n    try:\n        run(args)\n    except Exception:\n        raise\n\n\nif __name__ 
== \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/get_transform_probs_mat.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\nimport argparse\nimport sys\nsys.path.insert(0, 'steps')\n\nimport libs.common as common_lib\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script writes to stdout a transformation matrix\n    to convert a 3x1 probability vector to a\n    2x1 pseudo-likelihood vector by first dividing by 3x1 priors vector.\"\"\")\n\n    parser.add_argument(\"--priors\", type=str, default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"Priors vector used to remove the priors from \"\n                        \"the neural network output posteriors to \"\n                        \"convert them to likelihoods\")\n\n    parser.add_argument(\"--sil-in-speech-weight\", type=float,\n                        default=0.0,\n                        help=\"The fraction of silence probability \"\n                        \"to add to speech\")\n    parser.add_argument(\"--speech-in-sil-weight\", type=float,\n                        default=0.0,\n                        help=\"The fraction of speech probability \"\n                        \"to add to silence\")\n    parser.add_argument(\"--garbage-in-speech-weight\", type=float,\n                        default=0.0,\n                        help=\"The fraction of garbage probability \"\n                        \"to add to speech\")\n    parser.add_argument(\"--garbage-in-sil-weight\", type=float,\n                        default=0.0,\n                        help=\"The fraction of garbage probability \"\n                        \"to add to silence\")\n    parser.add_argument(\"--sil-scale\", type=float,\n                        default=1.0, help=\"\"\"Scale on the silence probability\n                        (make this more than one to encourage\n                        decoding silence).\"\"\")\n\n    args = parser.parse_args()\n\n    return args\n\n\ndef 
run(args):\n    priors = [[1.0, 1.0, 1.0]]\n    if args.priors is not None:\n        priors = common_lib.read_matrix_ascii(args.priors)\n        # Expect exactly one row holding the three class priors.\n        if len(priors) != 1 or len(priors[0]) != 3:\n            raise RuntimeError(\"Invalid dimension for priors {0}\"\n                               \"\".format(priors))\n\n    priors_sum = sum(priors[0])\n    sil_prior = priors[0][0] / priors_sum\n    speech_prior = priors[0][1] / priors_sum\n    garbage_prior = priors[0][2] / priors_sum\n\n    transform_mat = [[args.sil_scale / sil_prior,\n                      args.speech_in_sil_weight / speech_prior,\n                      args.garbage_in_sil_weight / garbage_prior],\n                     [args.sil_in_speech_weight / sil_prior,\n                      1.0 / speech_prior,\n                      args.garbage_in_speech_weight / garbage_prior]]\n\n    common_lib.write_matrix_ascii(sys.stdout, transform_mat)\n\n\ndef main():\n    args = get_args()\n    run(args)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/merge_segment_targets_to_recording.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"\nThis script merges targets matrices corresponding to\nsegments into targets matrix for whole recording. The frames that are not\nin any of the segments are assigned the default targets vector, specified by\nthe option --default-targets or [ 0 0 0 ] if unspecified.\n\"\"\"\nfrom __future__ import division\n\nimport argparse\nimport logging\nimport numpy as np\nimport subprocess\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script merges targets matrices corresponding to\n        segments into targets matrix for whole recording.\"\"\")\n\n    parser.add_argument(\"--frame-shift\", type=float, default=0.01,\n                        help=\"Frame shift value in seconds\")\n    parser.add_argument(\"--default-targets\", type=str, default=None,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"Vector of default targets for out-of-segments \"\n                        \"region\")\n    parser.add_argument(\"--length-tolerance\", type=int, default=4,\n                        help=\"Tolerate length mismatches of this many frames\")\n    parser.add_argument(\"--verbose\", type=int, default=0, choices=[0, 1, 2],\n                        help=\"Verbose level\")\n\n    parser.add_argument(\"--reco2num-frames\", type=str, required=True,\n                        action=common_lib.NullstrToNoneAction,\n                        help=\"\"\"The number of frames per reco\n         
               is used to determine the num-rows of the output matrix\n                        \"\"\")\n    parser.add_argument(\"reco2utt\", type=str,\n                        help=\"\"\"reco2utt file.\n                        The format is <reco> <utt-1> <utt-2> ... <utt-N>\"\"\")\n    parser.add_argument(\"segments\", type=str,\n                        help=\"Input kaldi segments file\")\n    parser.add_argument(\"targets_scp\", type=str,\n                        help=\"\"\"SCP of input targets matrices.\n                        The matrices are indexed by the utterance-id.\"\"\")\n    parser.add_argument(\"out_targets_ark\", type=str,\n                        help=\"\"\"Output archive to which the\n                        recording-level matrix will be written in text\n                        format\"\"\")\n\n    args = parser.parse_args()\n\n    if args.frame_shift < 0.0001 or args.frame_shift > 1:\n        raise ValueError(\"--frame-shift should be in [0.0001, 1]; got {0}\"\n                         \"\".format(args.frame_shift))\n\n    if args.verbose >= 2:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n\n    return args\n\n\ndef read_reco2utt_file(reco2utt_file):\n    # Read reco2utt file\n    reco2utt = {}\n    with common_lib.smart_open(reco2utt_file) as fh:\n        for line in fh:\n            parts = line.strip().split()\n            if len(parts) < 2:\n                raise ValueError(\"Could not parse line {0} in reco2utt \"\n                                 \"file {1}\".format(line, reco2utt_file))\n            reco2utt[parts[0]] = parts[1:]\n    return reco2utt\n\n\ndef read_reco2num_frames_file(reco2num_frames_file):\n    # Read reco2num_frames file\n    reco2num_frames = {}\n    with common_lib.smart_open(reco2num_frames_file) as fh:\n        for line in fh:\n            parts = line.strip().split()\n            if len(parts) != 2:\n                raise ValueError(\"Could not parse line {0} in \"\n              
                   \"reco2num-frames file {1}\".format(\n                                     line, reco2num_frames_file))\n            reco2num_frames[parts[0]] = int(parts[1])\n    return reco2num_frames\n\n\ndef read_segments_file(segments_file, reco2utt):\n    # Read segments from segments file\n    segments = {}\n    with common_lib.smart_open(segments_file) as fh:\n        for line in fh:\n            parts = line.strip().split()\n            if len(parts) not in [4, 5]:\n                raise ValueError(\"Could not parse line {0} in \"\n                                 \"segments file {1}\".format(line, segments))\n            utt = parts[0]\n            reco = parts[1]\n            if reco not in reco2utt:\n                continue\n            start_time = float(parts[2])\n            end_time = float(parts[3])\n            segments[utt] = [reco, start_time, end_time]\n    return segments\n\n\ndef read_targets_scp(targets_scp, segments):\n    # Read the SCP file containing targets\n    targets = {}\n    with common_lib.smart_open(targets_scp) as fh:\n        for line in fh:\n            parts = line.strip().split()\n            if len(parts) != 2:\n                raise ValueError(\"Could not parse line {0} in \"\n                                 \"targets scp file\".format(line, targets_scp))\n            utt = parts[0]\n            if utt not in segments:\n                continue\n            targets[utt] = parts[1]\n    return targets\n\n\ndef run(args):\n    reco2utt = read_reco2utt_file(args.reco2utt)\n    reco2num_frames = read_reco2num_frames_file(args.reco2num_frames)\n    segments = read_segments_file(args.segments, reco2utt)\n    targets = read_targets_scp(args.targets_scp, segments)\n\n    if args.default_targets is not None:\n        # Read the vector of default targets for out-of-segment regions\n        default_targets = np.matrix(\n            common_lib.read_matrix_ascii(args.default_targets))\n    else:\n        default_targets = 
np.zeros([1, 3])\n    assert (np.shape(default_targets)[0] == 1\n            and np.shape(default_targets)[1] == 3)\n\n    num_utt_err = 0\n    num_utt = 0\n    num_reco = 0\n\n    with common_lib.smart_open(args.out_targets_ark, 'w') as fh:\n        for reco, utts in reco2utt.items():\n            # Read a recording and the list of its utterances from the\n            # reco2utt dictionary\n            reco_mat = np.repeat(default_targets, reco2num_frames[reco],\n                                 axis=0)\n            utts.sort(key=lambda x: segments[x][1])   # sort on start time\n\n            end_frame_accounted = 0\n\n            for i, utt in enumerate(utts):\n                if utt not in segments or utt not in targets:\n                    num_utt_err += 1\n                    continue\n                segment = segments[utt]\n\n                # Read the targets corresponding to the segments\n                cmd = (\"copy-feats --binary=false {mat_fn} -\"\n                       \"\".format(mat_fn=targets[utt]))\n                p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,\n                                     stderr=subprocess.PIPE)\n\n                try:\n                    mat = np.matrix(common_lib.read_matrix_ascii(p.stdout),\n                                    dtype='float32')\n                except Exception:\n                    logger.error(\"Command '{cmd}' failed\".format(cmd=cmd))\n                    raise\n                finally:\n                    [stdout, stderr] = p.communicate()\n                    if p.returncode is not None and p.returncode != 0:\n                        raise RuntimeError(\n                            'Command \"{cmd}\" failed with status {status}; '\n                            'stderr = {stderr}'.format(cmd=cmd, status=-p.returncode,\n                                                       stderr=stderr))\n\n                start_frame = int(segment[1] / args.frame_shift + 0.5)\n                
end_frame = int(segment[2] / args.frame_shift + 0.5)\n                num_frames = end_frame - start_frame\n\n                if num_frames <= 0:\n                    raise ValueError(\"Invalid line in segments file {0}\"\n                                     \"\".format(segment))\n\n                if abs(mat.shape[0] - num_frames) > args.length_tolerance:\n                    logger.warning(\"For utterance {utt}, mismatch in segment \"\n                                   \"length and targets matrix size; \"\n                                   \"{s_len} vs {t_len}\".format(\n                                       utt=utt, s_len=num_frames,\n                                       t_len=mat.shape[0]))\n                    num_utt_err += 1\n                    continue\n\n                # Fix end_frame and num_frames if the segment goes beyond\n                # the length of the recording.\n                if end_frame > reco2num_frames[reco]:\n                    end_frame = reco2num_frames[reco]\n                    num_frames = end_frame - start_frame\n\n                # Fix \"num_frames\" and \"end_frame\" if \"num_frames\" is lower\n                # than the size of the targets matrix \"mat\"\n                num_frames = min(num_frames, mat.shape[0])\n                end_frame = start_frame + num_frames\n\n                if num_frames <= 0:\n                    logger.warning(\"For utterance {utt}, start-frame {start} \"\n                                   \"is outside the recording\"\n                                   \"\".format(utt=utt, start=start_frame))\n                    num_utt_err += 1\n                    continue\n\n                if end_frame < end_frame_accounted:\n                    logger.warning(\"For utterance {utt}, end-frame {end} \"\n                                   \"is before the end of a previous segment. \"\n                                   \"i.e. 
this segment is completely within \"\n                                   \"another segment. Ignoring this segment.\"\n                                   \"\".format(utt=utt, end=end_frame))\n                    num_utt_err +=1\n                    continue\n\n                if start_frame < end_frame_accounted:\n                    # Segment overlaps with a previous utterance\n                    # Combine targets using a weighted interpolation using a\n                    # triangular window with a weight of 1 at the start/end of\n                    # overlap and 0 at the end/start of the segment\n                    for n in range(0, end_frame_accounted - start_frame):\n                        w = float(n) / float(end_frame_accounted - start_frame)\n                        reco_mat[n + start_frame, :] = (\n                            reco_mat[n + start_frame, :] * (1.0 - w)\n                            + mat[n, :] * w)\n\n                    if end_frame > end_frame_accounted:\n                        reco_mat[end_frame_accounted:end_frame, :] = (\n                            mat[(end_frame_accounted-start_frame):\n                                (end_frame-start_frame), :])\n                else:\n                    # No overlap with the previous utterances.\n                    # So just add it to the output.\n                    reco_mat[start_frame:end_frame, :] = (\n                        mat[0:num_frames, :])\n                logger.debug(\"reco_mat shape = %s, mat shape = %s, \"\n                             \"start_frame = %d, end_frame = %d\", reco_mat.shape,\n                             mat.shape, start_frame, end_frame)\n\n                end_frame_accounted = end_frame\n                num_utt += 1\n\n            if reco_mat.shape[0] > 0:\n                common_lib.write_matrix_ascii(fh, reco_mat,\n                                              key=reco)\n                num_reco += 1\n\n    logger.info(\"Merged {num_utt} segment targets from 
{num_reco} recordings; \"\n                \"failed with {num_utt_err} utterances\"\n                \"\".format(num_utt=num_utt, num_reco=num_reco,\n                          num_utt_err=num_utt_err))\n\n    if num_utt == 0 or num_utt_err > num_utt // 2 or num_reco == 0:\n        raise RuntimeError(\n            \"Too many failures: merged {0} utterances from {1} recordings; \"\n            \"failed with {2} utterances\".format(num_utt, num_reco,\n                                                num_utt_err))\n\n\ndef main():\n    args = get_args()\n    try:\n        run(args)\n    except Exception:\n        raise\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/merge_targets.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"\nThis script merges targets created from multiple sources (systems) into\nsingle targets matrices.\n\nUsage: merge_targets.py [options] <pasted-targets> <out-targets>\n e.g.: paste-feats scp:targets1.scp scp:targets2.scp ark,t:- | merge_targets.py --dim=3 - - | copy-feats ark,t:- ark:-\n\n<pasted-targets> is matrix archive with matrices corresponding to\ntargets from multiple sources appended together using paste-feats.\nThe column dimension is num-sources * dim, which dim is specified by --dim\noption.\n\"\"\"\n\nimport argparse\nimport logging\nimport numpy as np\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"\n    This script merges targets created from multiple sources (systems) into\n    single targets matrices.\n    Usage: merge_targets.py [options] <pasted-targets> <out-targets>\n     e.g.: paste-feats scp:targets1.scp scp:targets2.scp ark,t:- | merge_targets.py --dim=3 - - | copy-feats ark,t:- ark:-\n    \"\"\",\n        formatter_class=argparse.RawTextHelpFormatter)\n\n    parser.add_argument(\"--weights\", type=str, default=\"\",\n                        help=\"A comma-separated list of weights corresponding \"\n                        \"to each targets source being combined. 
\"\n                        \"Weights will be normalized internally to sum-to-one.\")\n    parser.add_argument(\"--dim\", type=int, default=3,\n                        help=\"Number of columns corresponding to each \"\n                        \"target matrix\")\n    parser.add_argument(\"--remove-mismatch-frames\", type=str, default=False,\n                        choices=[\"true\", \"false\"],\n                        action=common_lib.StrToBoolAction,\n                        help=\"If true, the mismatch frames are removed by \"\n                        \"setting targets to 0 in the following cases:\\n\"\n                        \"a) If none of the sources have a column with value \"\n                        \"> 0.5\\n\"\n                        \"b) If two sources have columns with value > 0.5, but \"\n                        \"they occur at different indexes e.g. silence prob is \"\n                        \"> 0.5 for the targets from alignment, and speech prob \"\n                        \"> 0.5 for the targets from decoding.\")\n\n    parser.add_argument(\"pasted_targets\", type=str,\n                        help=\"Input target matrices with columns appended \"\n                        \"together using paste-feats. 
Its column dimension is \"\n                        \"num-sources * dim, where dim is specified by the --dim \"\n                        \"option.\")\n    parser.add_argument(\"out_targets\", type=str,\n                        help=\"Output target matrices\")\n\n    args = parser.parse_args()\n\n    if args.weights != \"\":\n        args.weights = [float(x) for x in args.weights.split(\",\")]\n        weights_sum = sum(args.weights)\n        args.weights = [x / weights_sum for x in args.weights]\n    else:\n        args.weights = None\n\n    return args\n\n\ndef should_remove_frame(row, dim):\n    \"\"\"Returns True if the frame needs to be removed.\n\n    Input:\n        row -- a list of values (of dimension num-sources x dim) corresponding\n               to the targets for one of the frames\n        dim -- Usually 3. The number of sources can be computed as\n               len(row) / dim.\n\n    The frame is determined to be removed in the following cases:\n        1) None of the values > 0.5.\n        2) More than one source has best value > 0.5, but at different\n           indexes in the source.\n    e.g. [ 1 0 0 0.6 0 0.4 0 0 0 ]   # kept because 1 and 0.6 are both > 0.5\n                                     # at the same class namely 0\n                                     # source[0] = [ 1 0 0 ]\n                                     # source[1] = [ 0.6 0 0.4 ]\n                                     # source[2] = [ 0 0 0 ]\n    e.g. 
[ 0 0 0 0.4 0 0.6 1 0 0 ]   # removed because source[1] has best value\n                                     # 0.6 > 0.5 at class 2 and source[2] has\n                                     # best value 1 > 0.5 at class 0.\n                                     # source[0] = [ 0 0 0 ]\n                                     # source[1] = [ 0.4 0 0.6 ]\n                                     # source[2] = [ 1 0 0 ]\n    \"\"\"\n    assert len(row) % dim == 0\n    num_sources = len(row) // dim\n\n    max_idx = np.argmax(row)\n    max_val = row[max_idx]\n\n    if max_val < 0.5:\n        # All the values < 0.5. So we are not confident of any sources.\n        # Remove frame.\n        return True\n\n    best_source = max_idx // dim\n    best_class = max_idx % dim\n\n    confident_in_source = []  # List of length num_sources\n                              # Element 'i' is True,\n                              # if the best value for the source 'i' is > 0.5\n    best_values_for_source = []  # Element 'i' is a pair (value, class),\n                                 # where 'class' is argmax over the scores\n                                 # corresponding to the source 'i' and\n                                 # 'value' is the corresponding score.\n    for source_idx in range(num_sources):\n        idx = np.argmax(row[(source_idx * dim):\n                            ((source_idx+1) * dim)])\n        val = row[source_idx * dim + idx]\n        confident_in_source.append(bool(val > 0.5))\n        best_values_for_source.append((val, idx))\n\n    if sum(confident_in_source) == 1:\n        # We are confident in only one source. 
Keep frame.\n        return False\n\n    for source_idx in range(num_sources):\n        if source_idx == best_source:\n            assert confident_in_source[source_idx]\n            continue\n        if not confident_in_source[source_idx]:\n            continue\n        else:\n            # We are confident in a source other than the 'best_source'.\n            # If its index is different from the 'best_class', then it is\n            # a mismatch and the frame must be removed.\n            val, idx = best_values_for_source[source_idx]\n            assert val > 0.5\n            if idx != best_class:\n                return True\n    return False\n\n\ndef run(args):\n    num_done = 0\n\n    with common_lib.smart_open(args.pasted_targets) as targets_reader, \\\n            common_lib.smart_open(args.out_targets, 'w') as targets_writer:\n        for key, mat in common_lib.read_mat_ark(targets_reader):\n            mat = np.matrix(mat)\n            if mat.shape[1] % args.dim != 0:\n                # args.pasted_targets is a plain string (type=str), so it is\n                # formatted directly rather than through a '.name' attribute.\n                raise RuntimeError(\n                    \"For utterance {utt} in {f}, num-columns {nc} \"\n                    \"is not a multiple of dim {dim}\"\n                    \"\".format(utt=key, f=args.pasted_targets,\n                              nc=mat.shape[1], dim=args.dim))\n            num_sources = mat.shape[1] // args.dim\n\n            out_mat = np.matrix(np.zeros([mat.shape[0], args.dim]))\n\n            if args.remove_mismatch_frames:\n                for n in range(mat.shape[0]):\n                    if should_remove_frame(mat[n, :].getA()[0], args.dim):\n                        out_mat[n, :] = np.zeros([1, args.dim])\n                    else:\n                        for i in range(num_sources):\n                            out_mat[n, :] += (\n                                mat[n, (i * args.dim) : ((i+1) * args.dim)]\n                                * (1.0 if args.weights is None\n                                   else args.weights[i]))\n            else:\n  
              # Just interpolate the targets\n                for i in range(num_sources):\n                    out_mat += (\n                        mat[:, (i * args.dim) : ((i+1) * args.dim)]\n                        * (1.0 if args.weights is None else args.weights[i]))\n\n            common_lib.write_matrix_ascii(targets_writer, out_mat.tolist(),\n                                          key=key)\n            num_done += 1\n\n    logger.info(\"Merged {num_done} target matrices\"\n                \"\".format(num_done=num_done))\n\n    if num_done == 0:\n        raise RuntimeError\n\n\ndef main():\n    args = get_args()\n    try:\n        run(args)\n    except Exception:\n        raise\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/prepare_sad_graph.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0\n\n\"\"\"Prepares a graph with a simple HMM topology for segmentation\nwith minimum and maximum speech duration constraints and minimum silence\nduration constraint. The graph is written to the 'output_graph', which\ncan be file or \"-\" for stdout.\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport math\nimport os\nimport sys\nimport traceback\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(filename)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"This script prepares a graph with a simple HMM topology\n        for segmentation with minimum and maximum speech duration constraints\n        and minimum silence duration constraint. The graph is written to the\n        'output_graph', which can be file or \"-\" for stdout.  
\"\"\",\n        formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n\n    parser.add_argument(\"--transition-scale\", type=float, default=1.0,\n                        help=\"\"\"Scale on transition probabilities relative to\n                        LM weights\"\"\")\n    parser.add_argument(\"--loopscale\", type=float, default=0.1,\n                        help=\"\"\"Scale on self-loop log-probabilities relative\n                        to LM weights\"\"\")\n\n    parser.add_argument(\"--min-silence-duration\", type=float, default=0.03,\n                        help=\"\"\"Minimum duration for silence in seconds\"\"\")\n    parser.add_argument(\"--min-speech-duration\", type=float, default=0.3,\n                        help=\"\"\"Minimum duration for speech in seconds\"\"\")\n    parser.add_argument(\"--max-speech-duration\", type=float, default=10.0,\n                        help=\"\"\"Maximum duration for speech in seconds\"\"\")\n    parser.add_argument(\"--frame-shift\", type=float, default=0.03,\n                        help=\"\"\"Frame shift in seconds\"\"\")\n\n    parser.add_argument(\"--edge-silence-probability\", type=float,\n                        default=0.5,\n                        help=\"Probability of silence at the edges.\")\n    parser.add_argument(\"--transition-probability\", type=float, default=0.1,\n                        help=\"Transition probability for silence to speech \"\n                        \"or vice-versa\")\n\n    parser.add_argument(\"output_graph\", type=str,\n                        help=\"Output graph\")\n    args = parser.parse_args()\n\n    args.min_states_silence = int(args.min_silence_duration / args.frame_shift\n                                  + 0.5)\n    args.min_states_speech = int(args.min_speech_duration / args.frame_shift\n                                 + 0.5)\n    args.max_states_speech = 
int(args.max_speech_duration / args.frame_shift\n                                 + 0.5)\n\n    return args\n\n\ndef print_states(args, file_handle):\n    # Initial transition to silence\n    print (\"0 1 silence silence {0}\".format(-math.log(args.edge_silence_probability)),\n           file=file_handle)\n    silence_start_state = 1\n\n    # Silence min duration transitions\n    # 1->2, 2->3 and so on until\n    # (1 + min_states_silence - 2) -> (1 + min_states_silence - 1)  ...\n    for state in range(silence_start_state,\n                       silence_start_state + args.min_states_silence - 1):\n        print (\"{state} {next_state} silence silence {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n    silence_last_state = silence_start_state + args.min_states_silence - 1\n\n    # Silence self-loop\n    print (\"{state} {state} silence silence {cost}\".format(\n                state=silence_last_state, cost=0.0),\n           file=file_handle)\n\n    speech_start_state = silence_last_state + 1\n    # Initial transition to speech\n    print (\"0 {state} speech speech {cost}\".format(\n                state=speech_start_state,\n                cost=-math.log(1.0 - args.edge_silence_probability)),\n           file=file_handle)\n\n    # Silence to speech transition\n    print (\"{sil_state} {speech_state} speech speech {cost}\".format(\n                sil_state=silence_last_state,\n                speech_state=speech_start_state,\n                cost=-math.log(args.transition_probability)),\n           file=file_handle)\n\n    # Speech min duration\n    for state in range(speech_start_state,\n                       speech_start_state + args.min_states_speech - 1):\n        print (\"{state} {next_state} speech speech {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n\n    # Speech max duration\n    for state in 
range(speech_start_state + args.min_states_speech - 1,\n                       speech_start_state + args.max_states_speech - 1):\n        print (\"{state} {next_state} speech speech {cost}\".format(\n                    state=state, next_state=state + 1, cost=0.0),\n               file=file_handle)\n\n        print (\"{state} {sil_state} silence silence {cost}\".format(\n                    state=state, sil_state=silence_start_state,\n                    cost=-math.log(args.transition_probability)),\n               file=file_handle)\n    speech_last_state = speech_start_state + args.max_states_speech - 1\n\n    # Transition to silence after max duration of speech\n    print (\"{state} {sil_state} silence silence {cost}\".format(\n                state=speech_last_state, sil_state=silence_start_state,\n                cost=0.0),\n           file=file_handle)\n\n    for state in range(1, speech_start_state):\n        print (\"{state} {cost}\".format(\n                    state=state, cost=-math.log(args.edge_silence_probability)),\n               file=file_handle)\n\n    for state in range(speech_start_state, speech_last_state + 1):\n        print (\"{state} {cost}\".format(\n                    state=state,\n                    cost=-math.log(1.0 - args.edge_silence_probability)),\n               file=file_handle)\n\n\ndef main():\n    try:\n        args = get_args()\n        with common_lib.smart_open(args.output_graph, 'w') as f:\n            print_states(args, f)\n    except Exception:\n        raise\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/resample_targets.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"\nThis script reads a Kaldi text archive of matrices from 'targets_in_ark' (e.g.\n'-' for standard input), modifies them by subsampling them, and writes the\nmodified archive to 'targets_out_ark'.\nThis form of 'subsampling' is similar to taking every n'th frame (specifically:\nevery n'th row), except that we average over blocks of size 'n' instead of\ntaking every n'th element.\nThus, this script is similar to the binary 'subsample-feats' except that\nit subsamples by averaging.\n\"\"\"\n\nimport argparse\nimport logging\nimport numpy as np\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"\nThis script reads a Kaldi text archive of matrices from 'targets_in_ark' (e.g.\n'-' for standard input), modifies them by subsampling them, and writes the\nmodified archive to 'targets_out_ark'.\nThis form of 'subsampling' is similar to taking every n'th frame (specifically:\nevery n'th row), except that we average over blocks of size 'n' instead of\ntaking every n'th element.\nThus, this script is similar to the binary 'subsample-feats' except that\nit subsamples by averaging.\"\"\")\n\n    parser.add_argument(\"--subsampling-factor\", type=int, default=1,\n                        help=\"The sampling rate is scaled by this factor\")\n    parser.add_argument(\"--verbose\", type=int, default=0, choices=[0,1,2],\n                        help=\"Verbose level\")\n\n    parser.add_argument(\"targets_in_ark\", 
type=argparse.FileType('r'),\n                        help=\"Input targets archive\")\n    parser.add_argument(\"targets_out_ark\", type=argparse.FileType('w'),\n                        help=\"Output targets archive\")\n\n    args = parser.parse_args()\n\n    if args.subsampling_factor < 1:\n        raise ValueError(\"Invalid --subsampling-factor value {0}\".format(\n                            args.subsampling_factor))\n\n    if args.verbose >= 2:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n\n    return args\n\n\ndef run(args):\n    num_utts = 0\n    for key, mat in common_lib.read_mat_ark(args.targets_in_ark):\n        mat = np.matrix(mat)\n        # Ceiling division. Integer division is used so that the result is\n        # an int under Python 3, as required by np.zeros below.\n        num_indexes = ((mat.shape[0] + args.subsampling_factor - 1)\n                       // args.subsampling_factor)\n\n        out_mat = np.zeros([num_indexes, mat.shape[1]])\n        i = 0\n        for k in range(int(args.subsampling_factor / 2.0),\n                       mat.shape[0], args.subsampling_factor):\n            st = int(k - float(args.subsampling_factor) / 2.0)\n            end = int(k + float(args.subsampling_factor) / 2.0)\n\n            if st < 0:\n                st = 0\n            if end > mat.shape[0]:\n                end = mat.shape[0]\n\n            try:\n                out_mat[i, :] = np.sum(mat[st:end, :], axis=0) / float(end - st)\n            except IndexError:\n                logger.error(\"mat.shape = {0}, st = {1}, end = {2}\"\n                             \"\".format(mat.shape, st, end))\n                raise\n            assert i == k // args.subsampling_factor\n            i += 1\n\n        common_lib.write_matrix_ascii(args.targets_out_ark, out_mat, key=key)\n        num_utts += 1\n    args.targets_in_ark.close()\n    args.targets_out_ark.close()\n\n    logger.info(\"Sub-sampled {num_utts} target matrices\"\n                \"\".format(num_utts=num_utts))\n\n\ndef main():\n    args = get_args()\n    
try:\n        run(args)\n    except Exception as e:\n        logger.error(\"Script failed; traceback = \", exc_info=True)\n        raise SystemExit(1)\n    finally:\n        for f in [args.targets_in_ark, args.targets_out_ark]:\n            if f is not None:\n                f.close()\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/sad_to_segments.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n#           2018  Capital One (Author: Zhiyuan Guan)\n# Apache 2.0\n\n\"\"\"\nThis script converts frame-level speech activity detection marks (in kaldi\ninteger vector text archive format) into kaldi segments and utt2spk.\nThe input integer vectors are expected to contain '1' for silence frames\nand '2' for speech frames.\n\"\"\"\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport sys\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\nglobal_verbose = 0\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"\nThis script converts frame-level speech activity detection marks (in kaldi\ninteger vector text archive format) into kaldi segments and utt2spk.\nThe input integer vectors are expected to contain 1 for silence frames\nand 2 for speech frames.\n\"\"\",\n        formatter_class=argparse.RawTextHelpFormatter)\n\n    parser.add_argument(\"--verbose\", type=int, choices=[0, 1, 2, 3],\n                        default=0, help=\"Higher verbosity for more logging\")\n\n    parser.add_argument(\"--utt2dur\", type=str,\n                        help=\"File containing durations of utterances.\")\n\n    parser.add_argument(\"--frame-shift\", type=float, default=0.01,\n                        help=\"Frame shift to convert frame indexes to time\")\n\n    parser.add_argument(\"--segment-padding\", type=float, default=0.2,\n                        help=\"Additional padding on speech segments. 
But we \"\n                             \"ensure that the padding does not go beyond the \"\n                             \"adjacent segment.\")\n\n    parser.add_argument(\"--min-segment-dur\", type=float, default=0,\n                        help=\"Minimum duration (in seconds) required for a segment \"\n                             \"to be included. This is before any padding. Segments \"\n                             \"shorter than this duration will be removed.\")\n\n    parser.add_argument(\"--merge-consecutive-max-dur\", type=float, default=0,\n                        help=\"Merge consecutive segments as long as the merged \"\n                             \"segment is no longer than this many seconds. The segments \"\n                             \"are only merged if their boundaries are touching. \"\n                             \"This is after padding by --segment-padding seconds.\"\n                             \"0 means do not merge. Use 'inf' to not limit the duration.\")\n\n    parser.add_argument(\"in_sad\", type=str,\n                        help=\"Input file containing alignments in \"\n                             \"text archive format\")\n\n    parser.add_argument(\"out_segments\", type=str,\n                        help=\"Output kaldi segments file\")\n\n    args = parser.parse_args()\n\n    global global_verbose\n    global_verbose = args.verbose\n\n    logger.info(\"Setting verbosity to {0}\".format(global_verbose))\n\n    if args.verbose >= 3:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n    return args\n\n\ndef to_str(segment):\n    assert len(segment) == 3\n    return \"[{0:.3f}, {1:.3f}, {2}]\".format(segment[0], segment[1],\n                                            segment[2])\n\n\nclass SegmenterStats(object):\n    \"\"\"Stores stats about the post-process stages\"\"\"\n\n    def __init__(self):\n        self.num_segments_initial = 0\n        self.num_short_segments_filtered = 0\n        
self.num_merges = 0\n        self.num_segments_final = 0\n        self.initial_duration = 0.0\n        self.padding_duration = 0.0\n        self.filter_short_duration = 0.0\n        self.final_duration = 0.0\n\n    def add(self, other):\n        \"\"\"Adds stats from another object\"\"\"\n        self.num_segments_initial += other.num_segments_initial\n        self.num_short_segments_filtered += other.num_short_segments_filtered\n        self.num_merges += other.num_merges\n        self.num_segments_final += other.num_segments_final\n        self.initial_duration += other.initial_duration\n        self.filter_short_duration += other.filter_short_duration\n        self.padding_duration += other.padding_duration\n        self.final_duration += other.final_duration\n\n    def __str__(self):\n        return (\"num-segments-initial={num_segments_initial}, \"\n                \"num-short-segments-filtered={num_short_segments_filtered}, \"\n                \"num-merges={num_merges}, \"\n                \"num-segments-final={num_segments_final}, \"\n                \"initial-duration={initial_duration}, \"\n                \"filter-short-duration={filter_short_duration}, \"\n                \"padding-duration={padding_duration}, \"\n                \"final-duration={final_duration}\".format(\n            num_segments_initial=self.num_segments_initial,\n            num_short_segments_filtered=self.num_short_segments_filtered,\n            num_merges=self.num_merges,\n            num_segments_final=self.num_segments_final,\n            initial_duration=self.initial_duration,\n            filter_short_duration=self.filter_short_duration,\n            padding_duration=self.padding_duration,\n            final_duration=self.final_duration))\n\n\ndef process_label(text_label):\n    \"\"\"Processes an input integer label and returns a 1 or 2,\n    where 1 is for silence and 2 is for speech.\n\n    Arguments:\n        text_label -- input label (must be integer)\n    \"\"\"\n    
prev_label = int(text_label)\n    if prev_label not in [1, 2]:\n        raise ValueError(\"Expecting label to be 1 (non-speech) or 2 (speech); \"\n                         \"got {}\".format(prev_label))\n\n    return prev_label\n\n\nclass Segmentation(object):\n    \"\"\"Stores segmentation for an utterance\"\"\"\n\n    def __init__(self):\n        self.segments = None\n        self.stats = SegmenterStats()\n\n    def initialize_segments(self, alignment, frame_shift=0.01):\n        \"\"\"Initializes segments from input alignment.\n        The alignment is frame-level speech-activity detection marks,\n        each of which must be 1 or 2.\"\"\"\n        self.segments = []\n\n        assert len(alignment) > 0\n\n        prev_label = None\n        prev_length = 0\n        for i, text_label in enumerate(alignment):\n            if prev_label is not None and int(text_label) != prev_label:\n                if prev_label == 2:\n                    self.segments.append(\n                        [float(i - prev_length) * frame_shift,\n                         float(i) * frame_shift, prev_label])\n                    self.stats.initial_duration += (prev_length * frame_shift)\n                prev_label = process_label(text_label)\n                prev_length = 0\n            elif prev_label is None:\n                prev_label = process_label(text_label)\n\n            prev_length += 1\n\n        if prev_length > 0 and prev_label == 2:\n            self.segments.append(\n                [float(len(alignment) - prev_length) * frame_shift,\n                 float(len(alignment)) * frame_shift, prev_label])\n            self.stats.initial_duration += (prev_length * frame_shift)\n\n        self.stats.num_segments_initial = len(self.segments)\n        self.stats.num_segments_final = len(self.segments)\n        self.stats.final_duration = self.stats.initial_duration\n\n    def filter_short_segments(self, min_dur):\n        \"\"\"Filters out segments with durations shorter than 
'min_dur'.\"\"\"\n        if min_dur <= 0:\n            return\n\n        segments_kept = []\n        for segment in self.segments:\n            assert segment[2] == 2, segment\n            dur = segment[1] - segment[0]\n            if dur < min_dur:\n                self.stats.filter_short_duration += dur\n                self.stats.num_short_segments_filtered += 1\n            else:\n                segments_kept.append(segment)\n        self.segments = segments_kept\n        self.stats.num_segments_final = len(self.segments)\n        self.stats.final_duration -= self.stats.filter_short_duration\n\n    def pad_speech_segments(self, segment_padding, max_duration=float(\"inf\")):\n        \"\"\"Pads segments by duration 'segment_padding' on either side, but\n        ensures that the segments don't go beyond the neighboring segments\n        or the duration of the utterance 'max_duration'.\"\"\"\n        if max_duration is None:\n            max_duration = float(\"inf\")\n        for i, segment in enumerate(self.segments):\n            assert segment[2] == 2, segment\n            segment[0] -= segment_padding  # try adding padding on the left side\n            self.stats.padding_duration += segment_padding\n            if segment[0] < 0.0:\n                # Padding takes the segment start to before the beginning of the utterance.\n                # Reduce padding.\n                self.stats.padding_duration += segment[0]\n                segment[0] = 0.0\n            if i >= 1 and self.segments[i - 1][1] > segment[0]:\n                # Padding takes the segment start to before the end of the previous segment.\n                # Reduce padding.\n                self.stats.padding_duration -= (\n                        self.segments[i - 1][1] - segment[0])\n                segment[0] = self.segments[i - 1][1]\n\n            segment[1] += segment_padding\n            self.stats.padding_duration += segment_padding\n            if segment[1] >= max_duration:\n           
     # Padding takes the segment end beyond the max duration of the utterance.\n                # Reduce padding.\n                self.stats.padding_duration -= (segment[1] - max_duration)\n                segment[1] = max_duration\n            if (i + 1 < len(self.segments)\n                    and segment[1] > self.segments[i + 1][0]):\n                # Padding takes the segment end beyond the start of the next segment.\n                # Reduce padding.\n                self.stats.padding_duration -= (\n                        segment[1] - self.segments[i + 1][0])\n                segment[1] = self.segments[i + 1][0]\n        self.stats.final_duration += self.stats.padding_duration\n\n    def merge_consecutive_segments(self, max_dur):\n        \"\"\"Merge consecutive segments (happens after padding), provided that\n        the merged segment is no longer than 'max_dur'.\"\"\"\n        if max_dur <= 0 or not self.segments:\n            return\n\n        merged_segments = [self.segments[0]]\n        for segment in self.segments[1:]:\n            assert segment[2] == 2, segment\n            if segment[0] == merged_segments[-1][1] and \\\n                    segment[1] - merged_segments[-1][0] <= max_dur:\n                # The segment starts at the same time the last segment ends,\n                # and the merged segment is shorter than 'max_dur'.\n                # Extend the previous segment.\n                merged_segments[-1][1] = segment[1]\n                self.stats.num_merges += 1\n            else:\n                merged_segments.append(segment)\n\n        self.segments = merged_segments\n        self.stats.num_segments_final = len(self.segments)\n\n    def write(self, key, file_handle):\n        \"\"\"Write segments to file\"\"\"\n        if global_verbose >= 2:\n            logger.info(\"For key {key}, got stats {stats}\".format(\n                key=key, stats=self.stats))\n        for segment in self.segments:\n            seg_id = 
\"{key}-{st:07d}-{end:07d}\".format(\n                key=key, st=int(segment[0] * 100), end=int(segment[1] * 100))\n            print(\"{seg_id} {key} {st:.2f} {end:.2f}\".format(\n                seg_id=seg_id, key=key, st=segment[0], end=segment[1]),\n                file=file_handle)\n\n\ndef run(args):\n    \"\"\"The main function that does everything.\"\"\"\n    utt2dur = {}\n    if args.utt2dur is not None:\n        with common_lib.smart_open(args.utt2dur) as utt2dur_fh:\n            for line in utt2dur_fh:\n                parts = line.strip().split()\n                if len(parts) != 2:\n                    raise RuntimeError(\"Unable to parse line '{0}' in {1}\"\n                                       \"\".format(line.strip(), args.utt2dur))\n                utt2dur[parts[0]] = float(parts[1])\n\n    global_stats = SegmenterStats()\n    with common_lib.smart_open(args.in_sad) as in_sad_fh, \\\n            common_lib.smart_open(args.out_segments, 'w') as out_segments_fh:\n        for line in in_sad_fh:\n            parts = line.strip().split()\n\n            # Validate the line before indexing into it, and report the input\n            # file name rather than the file handle object.\n            if len(parts) < 2:\n                raise RuntimeError(\"Unable to parse line '{0}' in {1}\"\n                                   \"\".format(line.strip(),\n                                             args.in_sad))\n\n            utt_id = parts[0]\n            segmentation = Segmentation()\n            segmentation.initialize_segments(\n                parts[1:], args.frame_shift)\n            segmentation.filter_short_segments(args.min_segment_dur)\n            segmentation.pad_speech_segments(args.segment_padding,\n                                             None if args.utt2dur is None\n                                             else utt2dur[utt_id])\n            segmentation.merge_consecutive_segments(args.merge_consecutive_max_dur)\n            segmentation.write(utt_id, out_segments_fh)\n            global_stats.add(segmentation.stats)\n    logger.info(global_stats)\n\n\ndef main():\n    
\"\"\"Parses arguments and calls the run method\"\"\"\n    args = get_args()\n    run(args)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/internal/verify_phones_list.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"This script verifies the list of phones read from stdin are valid\nphones present in lang/phones.txt.\"\"\"\n\nimport argparse\nimport sys\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"\n    This script verifies the list of phones read from stdin are valid\n    phones present in lang/phones.txt.\"\"\")\n\n    parser.add_argument(\"phones\", type=str,\n                        help=\"File containing the list of all phones as the \"\n                        \"first column\")\n\n    args = parser.parse_args()\n    return args\n\n\ndef main():\n    args = get_args()\n    phones = set()\n    for line in open(args.phones):\n        phones.add(line.strip().split()[0])\n\n    for line in sys.stdin.readlines():\n        p = line.strip()\n\n        if p not in phones:\n            sys.stderr.write(\"Could not find phone {p} in {f}\"\n                             \"\\n\".format(p=p, f=args.phones))\n            raise SystemExit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/steps/segmentation/lats_to_targets.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script converts lattices into targets for training neural network\n# for speech activity detection. The targets form a matrix of size \n# (num-frames-subsampled x 3)\n# with each row representing probabilities for speech, silence and \n# garbage classes for the corresponding frame (after subsampling). The \n# probability values are lattice posteriors for the 3 classes and are\n# obtained by summing up phone arc posteriors for the phones\n# corresponding to each class.\n# The mapping from phones to speech / silence / garbage classes\n# is defined by the options --silence-phones and --garbage-phones.\n# Also \"speech\" phones longer than --max-phone-duration seconds are \n# treated as \"garbage\".\n\nset -o pipefail\n\nsilence_phones=\ngarbage_phones=\nmax_phone_duration=0.5\nacwt=0.1\n\ncmd=run.pl\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  cat <<EOF\n  This script converts lattices into targets for training neural network\n  for speech activity detection. The targets form a matrix of size \n  (num-frames-subsampled x 3)\n  with each row representing probabilities for speech, silence and \n  garbage classes for the corresponding frame (after subsampling). 
The \n  probability values are lattice posteriors for the 3 classes and are\n  obtained by summing up phone arc posteriors for the phones\n  corresponding to each class.\n  The mapping from phones to speech / silence / garbage classes\n  is defined by the options --silence-phones and --garbage-phones.\n  Also \"speech\" phones longer than --max-phone-duration seconds are \n  treated as \"garbage\".\n\n  Usage: steps/segmentation/lats_to_targets.sh <data-dir> <lang> <lattice-dir> <targets-dir>\n  e.g.: steps/segmentation/lats_to_targets.sh \\\n  --silence-phones exp/segmentation1a/silence_phones.txt \\\n  --garbage-phones exp/segmentation1a/garbage_phones.txt \\\n  --max-phone-duration 0.5 \\\n  data/train_split10s data/lang \\\n  exp/segmentation1a/tri3b_train_split10s_lats \\\n  exp/segmentation1a/tri3b_train_split10s_targets\n\n  note: \n  silence_phones.txt and garbage_phones.txt must list phones, one per line.\n  garbage_phones.txt can contain phones corresponding to ambiguous items like \n  OOV, laugh and spoken noise that you want to map to \"garbage class\".\n  silence_phones.txt might just contain the phones from \n  data/lang/phones/silence_phones.txt other than the garbage phones. These\n  are mapped to the \"silence\" class.\nEOF\n  exit 1\nfi\n\ndata=$1\nlang=$2\nlats_dir=$3\ndir=$4\n\nif [ -f $lats_dir/final.mdl ]; then\n  srcdir=$lats_dir\nelse\n  srcdir=$lats_dir/..\nfi\n\nfor f in $data/utt2spk $lats_dir/lat.1.gz $srcdir/final.mdl; do \n  if [ ! 
-f $f ]; then\n    echo \"$0: Could not find file $f\"\n    exit 1\n  fi\ndone\n\nmkdir -p $dir\n\nif [ -z \"$garbage_phones\" ]; then\n  oov_phone=$(steps/segmentation/internal/get_oov_phone.py $lang) || exit 1\n  echo $oov_phone | utils/int2sym.pl $lang/phones.txt > $dir/garbage_phones.txt || exit 1\nelse \n  cp $garbage_phones $dir/garbage_phones.txt || exit 1\nfi\n\nif [ -z \"$silence_phones\" ]; then\n  cat $lang/silence_phones.txt | \\\n    utils/filter_scp.pl --exclude $dir/garbage_phones.txt > \\\n    $dir/silence_phones.txt\nelse \n  cp $silence_phones $dir/silence_phones.txt\nfi\n\nnj=$(cat $lats_dir/num_jobs) || exit 1\n\n$cmd JOB=1:$nj $dir/log/get_arc_info.JOB.log \\\n  lattice-push \"ark:gunzip -c $lats_dir/lat.JOB.gz |\" ark:- \\| \\\n  lattice-align-phones --replace-output-symbols=true $srcdir/final.mdl ark:- ark:- \\| \\\n  lattice-arc-post --acoustic-scale=$acwt $srcdir/final.mdl ark:- - \\| \\\n  utils/int2sym.pl -f 5 $lang/phones.txt '>' \\\n  $dir/arc_info_sym.JOB.txt || exit 1\n\n# make $dir an absolute pathname.\ndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nframe_subsampling_factor=1\nif [ -f $srcdir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $srcdir/frame_subsampling_factor)\n  echo $frame_subsampling_factor > $dir/frame_subsampling_factor\nfi\n\nframe_shift=$(utils/data/get_frame_shift.sh $data) || exit 1\nmax_phone_len=$(perl -e \"print int($max_phone_duration / $frame_shift)\")\n\n$cmd JOB=1:$nj $dir/log/get_targets.JOB.log \\\n  steps/segmentation/internal/arc_info_to_targets.py \\\n    --silence-phones=$dir/silence_phones.txt \\\n    --garbage-phones=$dir/garbage_phones.txt \\\n    --max-phone-length=$max_phone_len \\\n    $dir/arc_info_sym.JOB.txt - \\| \\\n  copy-feats ark,t:- \\\n    ark,scp:$dir/targets.JOB.ark,$dir/targets.JOB.scp || exit 1\n\nfor n in $(seq $nj); do\n  cat $dir/targets.$n.scp\ndone > 
$dir/targets.scp\n\nsteps/segmentation/validate_targets_dir.sh $dir $data || exit 1\n\necho \"$0: Done creating targets in $dir/targets.scp\"\n"
  },
  {
    "path": "egs/steps/segmentation/merge_targets_dirs.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script merges targets dirs created from multiple sources (systems) into\n# single targets matrices. See steps/segmentation/lats_to_targets.sh for \n# details about the format of the targets.\n\n# This script merges targets from multiple sources using weights supplied \n# by --weights option. Also the option --remove-mismatch-frames can be \n# used to remove frames where different sources have mismatched labels.\n# e.g. We can check if the labels from supervision-constrained lattices \n# and those from decoding match.\n\ncmd=run.pl \nnj=4\nweights=        # A comma-separated list of weights corresponding to each\n                # target source being combined. Must match the number of \n                # source target directories.\nremove_mismatch_frames=true     # If true, the mismatch frames are removed by \n                                # setting targets to 0 in the following cases:\n                                # a) If none of the sources have a column with value > 0.5\n                                # b) If two sources have columns with value > 0.5, but\n                                # they occur at different indexes e.g. silence prob is > 0.5 for the\n                                # targets from alignment, and speech prob > 0.5 for the targets from\n                                # decoding\n\n[ -f ./path.sh ] && . ./path.sh \n. utils/parse_options.sh\n\nif [ $# -lt 3 ]; then\n  cat <<EOF\n  This script merges targets dirs created from multiple sources (systems) into\n  single targets matrices.\n  See top of the script for more details.\n\n  Usage: steps/segmentation/merge_targets_dirs.sh <data> <targets-1> <targets-2> ... 
<merged-targets>\n  e.g.: steps/segmentation/merge_targets_dirs.sh --weights 1.0,0.5 \\\n      data/train_whole \\\n      exp/segmentation1a/tri3b_train_whole_sup_targets_sub3 \\\n      exp/segmentation1a/tri3b_train_whole_targets_sub3 \\\n      exp/segmentation1a/tri3b_train_whole_combined_targets_sub3\nEOF\n  exit 1\nfi\n\ndata=$1\ndir=${@: -1}  # last argument to the script\nshift;\n\ntargets_dirs=( $@ )  # read the remaining arguments into an array\nunset targets_dirs[${#targets_dirs[@]}-1]  # 'pop' the last argument which is odir\nnum_sources=${#targets_dirs[@]}  # number of targets to combine\n\nutils/data/split_data.sh --per-utt $data $nj\nsdata=${data}/split${nj}utt\n\nframe_subsampling_factor=1\nif [ -f ${targets_dirs[0]}/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat ${targets_dirs[0]}/frame_subsampling_factor) || exit 1\nfi\n\nmkdir -p $dir/split${nj}\n\ntarget_id=1\nfor t in ${targets_dirs[@]}; do\n  this_frame_subsampling_factor=1\n  if [ -f $t/frame_subsampling_factor ]; then\n    this_frame_subsampling_factor=$(cat $t/frame_subsampling_factor) || exit 1\n  fi\n  if [ $this_frame_subsampling_factor -ne $frame_subsampling_factor ]; then\n    echo \"$0: Mismatch in frame_subsampling_factor in $t and ${targets_dirs[0]}; $this_frame_subsampling_factor vs $frame_subsampling_factor\"\n    exit 1\n  fi\n\n  utils/filter_scps.pl JOB=1:$nj $sdata/JOB/utt2spk \\\n    $t/targets.scp $dir/split${nj}/in_targets.$target_id.JOB.scp\n\n  targets_rspecifiers+=(\"scp:$dir/split${nj}/in_targets.$target_id.JOB.scp\")\n  target_id=$[target_id+1]\ndone\n\n# convert $dir to an absolute pathname.\nfdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\n$cmd JOB=1:$nj $dir/log/merge_targets.JOB.log \\\n  paste-feats \"${targets_rspecifiers[@]}\" ark,t:- \\| \\\n  steps/segmentation/internal/merge_targets.py --weights=\"$weights\" \\\n    --remove-mismatch-frames=$remove_mismatch_frames - - \\| \\\n  
copy-feats ark,t:- ark,scp:$fdir/targets.JOB.ark,$fdir/targets.JOB.scp || exit 1\n\nfor n in `seq $nj`; do\n  cat $dir/targets.$n.scp\ndone > $dir/targets.scp\n\nrm $dir/targets.*.scp   # cleanup\n\nif [ $frame_subsampling_factor -ne 1 ]; then\n  echo $frame_subsampling_factor > $dir/frame_subsampling_factor\nfi\n\nsteps/segmentation/validate_targets_dir.sh $dir $data || exit 1\n\necho \"$0: Merged target directories to $dir\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/segmentation/post_process_sad_to_segments.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2015-17  Vimal Manohar\n# Apache 2.0.\n\n# This script post-processes the output of steps/segmentation/decode_sad.sh,\n# which is in the form of frame-level alignments, into a 'segments' file.\n# The alignments must be speech activity detection marks i.e. 1 for silence \n# and 2 for speech.\n\nset -e -o pipefail -u\n. ./path.sh\n\ncmd=run.pl\nstage=-10\nnj=18\n\n# The values below are in seconds\nframe_shift=0.01\nsegment_padding=0.2\nmin_segment_dur=0\nmerge_consecutive_max_dur=0\n\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"This script post-processes the output of steps/segmentation/decode_sad.sh, \"\n  echo \"which is in the form of frame-level alignments, into kaldi segments. \"\n  echo \"The alignments must be speech activity detection marks i.e. 1 for silence \"\n  echo \"and 2 for speech.\"\n  echo \"Usage: $0 <data-dir> <vad-dir> <segmentation-dir>\"\n  echo \" e.g.: $0 data/dev_aspire_whole exp/vad_dev_aspire\"\n  exit 1\nfi\n\ndata_dir=$1\nvad_dir=$2    # Alignment directory containing frame-level SAD labels\ndir=$3\n\nmkdir -p $dir\n\nfor f in $vad_dir/ali.1.gz $vad_dir/num_jobs; do\n  if [ ! -f $f ]; then\n    echo \"$0: Could not find file $f\" && exit 1\n  fi\ndone\n\nnj=`cat $vad_dir/num_jobs` || exit 1\nutils/split_data.sh $data_dir $nj\n\nutils/data/get_utt2dur.sh $data_dir\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/segmentation.JOB.log \\\n    copy-int-vector \"ark:gunzip -c $vad_dir/ali.JOB.gz |\" ark,t:- \\| \\\n    steps/segmentation/internal/sad_to_segments.py \\\n      --frame-shift=$frame_shift --segment-padding=$segment_padding \\\n      --min-segment-dur=$min_segment_dur --merge-consecutive-max-dur=$merge_consecutive_max_dur \\\n      --utt2dur=$data_dir/utt2dur - $dir/segments.JOB\nfi\n\necho $nj > $dir/num_jobs\n\nfor n in $(seq $nj); do \n  cat $dir/segments.$n\ndone > $dir/segments\n"
  },
  {
    "path": "egs/steps/segmentation/prepare_targets_gmm.sh",
    "content": "#! /bin/bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n  \n# This script prepares targets for training neural network for \n# speech activity detection. \n# See steps/segmentation/lats_to_targets.sh for details about the \n# format of the targets.\n\n# The targets are obtained from a combination\n# of supervision-constrained lattices and lattices obtained by decoding. \n# Also, we assume that the out-of-segment regions are all silence (target \n# values of [ 1 0 0 ]). We merge the targets from the multiple sources \n# by a weighted average using weights specified by --weights. Also, \n# the frames where the labels from multiple sources do not match are \n# removed in the script steps/segmentation/merge_targets_dirs.sh.\n\n# In this script, we use GMMs trained for ASR on in-domain data \n# to generate the lattices required for creating the targets. To generate\n# supervision-constrained lattices, we use speaker-adapted GMM models. To \n# generate lattices without supervision, we use speaker-independent GMM models\n# from the LDA+MLLT stage, but apply per-recording cepstral mean subtraction.\n# The phones in the lattices are mapped deterministically to \n# 0, 1, and 2 representing respectively silence, speech and garbage classes.\n# The mapping is defined by --garbage-phones-list and --silence-phones-list\n# options. But when these are unspecified, the silence phones other than\n# oov are mapped to silence class and the oov is mapped to garbage class.\n\nstage=-1\ntrain_cmd=run.pl\ndecode_cmd=run.pl\nnj=4\nreco_nj=4\n\nlang_test=    # If different from $lang\ngraph_dir=    # If not provided, a new one will be created using $lang_test\n\ngarbage_phones_list=\nsilence_phones_list=\n\n# Uniform segmentation options for decoding whole recordings. 
All values are in\n# seconds.\nmax_segment_duration=10\noverlap_duration=2.5\nmax_remaining_duration=5  # If the last remaining piece when splitting uniformly\n                          # is smaller than this duration, then the last piece \n                          # is merged with the previous.\nremove_mismatch_frames=true\n\n# List of weights on labels obtained from alignment, \n# labels obtained from decoding and default labels in out-of-segment regions\nmerge_weights=1.0,0.1,0.5\n\n[ -f ./path.sh ] && . ./path.sh \n\nset -e -u -o pipefail\n. utils/parse_options.sh \n\nif [ $# -ne 6 ]; then\n  cat <<EOF\n  This script prepares targets for training neural network for \n  speech activity detection. The targets are obtained from a combination\n  of supervision-constrained lattices and lattices obtained by decoding. \n  See comments in the script for more details.\n\n  Usage: $0 <lang> <data> <whole-recording-data> <ali-model-dir> <model-dir> <dir>\n   e.g.: $0 data/lang data/train data/train_whole exp/tri5 exp/tri4 exp/segmentation_1a\n  \n  Note: <whole-recording-data> is expected to have feats.scp and <data> is \n  expected to have a segments file. We will get the features for <data> by \n  using row ranges of <whole-recording-data>/feats.scp. This script will \n  work on a copy of <data> created to have the recording-id as the speaker-id.\nEOF\n  exit 1\nfi\n\nlang=$1   # Must match the one used to train the models\nin_data_dir=$2\nin_whole_data_dir=$3\nali_model_dir=$4  # Model directory used to align the $data_dir to get target \n                  # labels for training SAD. This should typically be a\n                  # speaker-adapted system.\nmodel_dir=$5      # Model directory used to decode the whole-recording version\n                  # of the $data_dir to get target labels for training SAD. 
This\n                  # should typically be a speaker-independent system like\n                  # LDA+MLLT system.\ndir=$6\n\nmkdir -p $dir\n\nif [ -z \"$lang_test\" ]; then\n  lang_test=$lang\nfi\n\nextra_files=\nif [ -z \"$graph_dir\" ]; then\n  extra_files=\"$extra_files $lang_test/G.fst $lang_test/phones.txt\"\nelse\n  extra_files=\"$extra_files $graph_dir/HCLG.fst $graph_dir/phones.txt\"\nfi\n\nfor f in $in_whole_data_dir/feats.scp $in_data_dir/segments \\\n  $lang/phones.txt $garbage_phones_list $silence_phones_list \\\n  $ali_model_dir/final.mdl $model_dir/final.mdl $extra_files; do\n  if [ ! -f $f ]; then\n    echo \"$0: Could not find file $f\"\n    exit 1\n  fi\ndone\n\nutils/validate_data_dir.sh --no-feats $in_data_dir || exit 1\nutils/validate_data_dir.sh --no-text $in_whole_data_dir || exit 1\n\nif ! cat $garbage_phones_list $silence_phones_list | \\\n  steps/segmentation/internal/verify_phones_list.py $lang/phones.txt; then\n  echo \"$0: Invalid $garbage_phones_list $silence_phones_list\"\n  exit 1\nfi\n\ndata_id=$(basename $in_data_dir)\nwhole_data_id=$(basename $in_whole_data_dir)\n\nif [ $stage -le 0 ]; then\n  rm -r $dir/$data_id 2>/dev/null || true\n  mkdir -p $dir/$data_id\n\n  utils/data/modify_speaker_info_to_recording.sh \\\n    $in_data_dir $dir/$data_id || exit 1\n  utils/validate_data_dir.sh --no-feats $dir/$data_id || exit 1\nfi \n\n# Work with a temporary data directory with recording-id as the speaker labels.\ndata_dir=$dir/${data_id}\n\n###############################################################################\n# Get feats for the manual segments\n###############################################################################\nif [ $stage -le 1 ]; then\n  utils/data/subsegment_data_dir.sh $in_whole_data_dir ${data_dir}/segments ${data_dir}/tmp\n  cp $data_dir/tmp/feats.scp $data_dir\n\n  steps/compute_cmvn_stats.sh $data_dir || exit 1\nfi\n\nif [ $stage -le 2 ]; then\n  utils/copy_data_dir.sh $in_whole_data_dir 
$dir/$whole_data_id\n\n  utils/fix_data_dir.sh $dir/$whole_data_id\n\n  # Copy the CMVN stats to the whole directory\n  cp $data_dir/cmvn.scp $dir/$whole_data_id\nfi\n\n# Work with a temporary data directory with CMVN stats computed using \n# only the segments from the original data directory.\nwhole_data_dir=$dir/$whole_data_id\n\n###############################################################################\n# Obtain supervision-constrained lattices\n###############################################################################\nsup_lats_dir=$dir/`basename ${ali_model_dir}`_sup_lats_${data_id}\nif [ $stage -le 3 ]; then\n  steps/align_fmllr_lats.sh --nj $nj --cmd \"$train_cmd\" \\\n    ${data_dir} ${lang} ${ali_model_dir} $sup_lats_dir || exit 1\nfi\n\n###############################################################################\n# Uniformly segment whole data directory for decoding\n###############################################################################\nuniform_seg_data_dir=$dir/${whole_data_id}_uniformseg_${max_segment_duration}sec\nuniform_seg_data_id=`basename $uniform_seg_data_dir`\n\nif [ $stage -le 4 ]; then\n  utils/data/get_segments_for_data.sh ${whole_data_dir} > \\\n    ${whole_data_dir}/segments\n\n  mkdir -p $uniform_seg_data_dir\n\n  utils/data/get_uniform_subsegments.py \\\n    --max-segment-duration $max_segment_duration \\\n    --overlap-duration $overlap_duration \\\n    --max-remaining-duration $max_remaining_duration \\\n    ${whole_data_dir}/segments > $uniform_seg_data_dir/sub_segments\n\n  utils/data/subsegment_data_dir.sh $whole_data_dir \\\n    $uniform_seg_data_dir/sub_segments $uniform_seg_data_dir\n  cp $whole_data_dir/cmvn.scp $uniform_seg_data_dir/\nfi\n\nmodel_id=$(basename $model_dir)\n###############################################################################\n# Create graph dir for decoding\n###############################################################################\nif [ -z \"$graph_dir\" ]; then\n  
graph_dir=$dir/$model_id/graph\n  if [ $stage -le 5 ]; then\n    if [ ! -f $graph_dir/HCLG.fst ]; then\n      rm -r $dir/lang_test 2>/dev/null || true\n      cp -r $lang_test/ $dir/lang_test\n      utils/mkgraph.sh $dir/lang_test $model_dir $graph_dir || exit 1\n    fi\n  fi\nfi\n\n###############################################################################\n# Decode uniformly segmented data directory\n###############################################################################\nmodel_id=$(basename $model_dir)\ndecode_dir=$dir/${model_id}/decode_${uniform_seg_data_id}\nif [ $stage -le 6 ]; then \n  mkdir -p $decode_dir\n  \n  cp $model_dir/{final.mdl,final.mat,*_opts,tree} $dir/${model_id}\n  cp $model_dir/phones.txt $dir/$model_id\n\n  # We use a small beam and max-active since we are only interested in \n  # the speech / silence decisions, not the exact word sequences.\n  steps/decode.sh --cmd \"$decode_cmd --mem 2G\" --nj $nj \\\n    --max-active 1000 --beam 10.0 \\\n    --decode-extra-opts \"--word-determinize=false\" --skip-scoring true \\\n    $graph_dir $uniform_seg_data_dir $decode_dir\nfi\n\nali_model_id=`basename $ali_model_dir`\n###############################################################################\n# Get frame-level targets from lattices for nnet training\n# Targets are matrices of 3 columns -- silence, speech and garbage\n# The target values are obtained by summing up posterior probabilites of \n# arcs from lattice-arc-post over silence, speech and garbage phones.\n###############################################################################\nif [ $stage -le 7 ]; then\n  steps/segmentation/lats_to_targets.sh --cmd \"$train_cmd\" \\\n    --silence-phones \"$silence_phones_list\" \\\n    --garbage-phones \"$garbage_phones_list\" \\\n    --max-phone-duration 0.5 \\\n    $data_dir $lang $sup_lats_dir \\\n    $dir/${ali_model_id}_${data_id}_sup_targets\nfi\n\nif [ $stage -le 8 ]; then\n  steps/segmentation/lats_to_targets.sh --cmd 
\"$train_cmd\" \\\n    --silence-phones \"$silence_phones_list\" \\\n    --garbage-phones \"$garbage_phones_list\" \\\n    --max-phone-duration 0.5 \\\n    $uniform_seg_data_dir $lang $decode_dir \\\n    $dir/${model_id}_${uniform_seg_data_id}_targets\nfi\n\n###############################################################################\n# Convert targets to be w.r.t. whole data directory and subsample the \n# targets by a factor of 3.\n# Since the targets from transcript-constrained lattices have only values \n# for the manual segments, these are converted to whole recording-levels \n# by inserting [ 0 0 0 ] for the out-of-manual segment regions.\n###############################################################################\nif [ $stage -le 9 ]; then\n  steps/segmentation/convert_targets_dir_to_whole_recording.sh --cmd \"$train_cmd\" --nj $reco_nj \\\n    $data_dir $whole_data_dir \\\n    $dir/${ali_model_id}_${data_id}_sup_targets \\\n    $dir/${ali_model_id}_${whole_data_id}_sup_targets\n  \n  steps/segmentation/resample_targets_dir.sh --cmd \"$train_cmd\" --nj $reco_nj 3 \\\n    $whole_data_dir \\\n    $dir/${ali_model_id}_${whole_data_id}_sup_targets \\\n    $dir/${ali_model_id}_${whole_data_id}_sup_targets_sub3\nfi\n\n###############################################################################\n# Convert the targets from decoding to whole recording. 
\n###############################################################################\nif [ $stage -le 10 ]; then\n  steps/segmentation/convert_targets_dir_to_whole_recording.sh --cmd \"$train_cmd\" --nj $reco_nj \\\n    $dir/${uniform_seg_data_id} $whole_data_dir \\\n    $dir/${model_id}_${uniform_seg_data_id}_targets \\\n    $dir/${model_id}_${whole_data_id}_targets\n\n  steps/segmentation/resample_targets_dir.sh --cmd \"$train_cmd\" --nj $reco_nj 3 \\\n    $whole_data_dir \\\n    $dir/${model_id}_${whole_data_id}_targets \\\n    $dir/${model_id}_${whole_data_id}_targets_sub3\nfi\n\n###############################################################################\n# \"default targets\" values for the out-of-manual-segment regions.\n# We assume in this setup that this is silence i.e. [ 1 0 0 ].\n###############################################################################\n\nif [ $stage -le 11 ]; then\n  echo \" [ 1 0 0 ]\" > $dir/default_targets.vec\n  steps/segmentation/get_targets_for_out_of_segments.sh --cmd \"$train_cmd\" \\\n    --nj $reco_nj --frame-subsampling-factor 3 \\\n    --default-targets $dir/default_targets.vec \\\n    $data_dir $whole_data_dir $dir/out_of_seg_${whole_data_id}_default_targets_sub3\nfi\n\n###############################################################################\n# Merge targets for the same data from multiple sources (systems)\n# --weights is used to give targets from alignment a higher weight than \n# the targets from decoding. \n# If --remove-mismatch-frames is true, frames where alignment and decoding \n# disagree (more than 0.5 probability on different classes) \n# are removed by setting targets to [ 0 0 0 ]. 
\n###############################################################################\nif [ $stage -le 12 ]; then\n  steps/segmentation/merge_targets_dirs.sh --cmd \"$train_cmd\" --nj $reco_nj \\\n    --weights $merge_weights --remove-mismatch-frames $remove_mismatch_frames \\\n    $whole_data_dir \\\n    $dir/${ali_model_id}_${whole_data_id}_sup_targets_sub3 \\\n    $dir/${model_id}_${whole_data_id}_targets_sub3 \\\n    $dir/out_of_seg_${whole_data_id}_default_targets_sub3 \\\n    $dir/${whole_data_id}_combined_targets_sub3\nfi\n\ncp $dir/${whole_data_id}_combined_targets_sub3/targets.scp $dir/\n\necho \"$0: Prepared targets in $dir/targets.scp\"\n"
  },
  {
    "path": "egs/steps/segmentation/resample_targets_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script resamples the targets matrix by the specified <subsampling-factor>.\n# If <subsampling-factor> is negative, then the targets will be upsampled \n# by -<subsampling-factor>.\n# This script is a wrapper to steps/segmentation/internal/resample_targets.py,\n# which works very similarly to the binary subsample-feats. See that script\n# for details about how the resampling is done.\n\n# See the script steps/segmentation/lats_to_targets.sh for details about \n# the format of the targets.\n\nnj=4\ncmd=run.pl\n\nset -o pipefail -u\n\n[ -f ./path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  cat <<EOF\n  This script resamples the targets matrix by the specified subsampling factor.\n  If <subsampling-factor> is negative, then the targets will be upsampled \n  by -<subsampling-factor>.\n  See top of the script for more details.\n\n  Usage: steps/segmentation/resample_targets_dir.sh <subsampling-factor> <data-dir> <targets-dir> <resampled-targets-dir>\n   e.g.: steps/segmentation/resample_targets_dir.sh 3 \\\n    data/train_whole \\\n    exp/segmentation1a/tri3b_train_whole_targets \\\n    exp/segmentation1a/tri3b_train_whole_targets_sub3\nEOF\n  exit 1\nfi\n\nsubsampling_factor=$1\ndata=$2\ntargets_dir=$3\ndir=$4\n\nframe_subsampling_factor=1\nif [ -f $targets_dir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $targets_dir/frame_subsampling_factor)\nfi\n\nfor f in $targets_dir/targets.scp $data/feats.scp; do \n  if [ ! 
-f $f ]; then \n    echo \"$0: Could not find file $f\" \n    exit 1\n  fi\ndone\n\nsteps/segmentation/validate_targets_dir.sh $targets_dir $data || exit 1\n\nmkdir -p $dir\n\nmkdir -p $targets_dir/split$nj\nsplit_scps=\nfor n in $(seq $nj); do\n  split_scps=\"$split_scps $targets_dir/split${nj}/targets.$n.scp\"\ndone\nutils/split_scp.pl $targets_dir/targets.scp $split_scps\n\n# make $dir an absolute pathname.\ndir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $dir ${PWD}`\n\nif [ $subsampling_factor -eq 1 ]; then\n  cp $targets_dir/targets.scp $dir\n  cp $targets_dir/frame_subsampling_factor $dir || true\nelif [ $subsampling_factor -gt 1 ]; then\n  $cmd JOB=1:$nj $dir/log/resample_targets.JOB.log \\\n    copy-feats scp:$targets_dir/split${nj}/targets.JOB.scp ark,t:- \\| \\\n    steps/segmentation/internal/resample_targets.py \\\n      --subsampling-factor=$subsampling_factor \\\n      - - \\| \\\n    copy-feats ark,t:- ark,scp:$dir/targets.JOB.ark,$dir/targets.JOB.scp || exit 1\n\n  perl -e \"print $frame_subsampling_factor * $subsampling_factor\" > \\\n    $dir/frame_subsampling_factor || exit 1\nelse\n  $cmd JOB=1:$nj $dir/log/resample_targets.JOB.log \\\n    subsample-feats --n=$subsampling_factor \\\n      scp:$targets_dir/split${nj}/targets.JOB.scp \\\n      ark,scp:$dir/targets.JOB.ark,$dir/targets.JOB.scp || exit 1\n\n  perl -e \"print $frame_subsampling_factor * (-$subsampling_factor)\" > \\\n    $dir/frame_subsampling_factor || exit 1\nfi \n \nif [ $subsampling_factor -ne 1 ]; then\n  # With a factor of 1, targets.scp was copied above and no per-job files exist.\n  for n in $(seq $nj); do\n    cat $dir/targets.$n.scp\n  done > $dir/targets.scp\nfi\n\n# Validate the resampled output directory (the input was validated above).\nsteps/segmentation/validate_targets_dir.sh $dir $data || exit 1\n\necho \"$0: Resampled targets in $dir\"\nexit 0\n"
  },
  {
    "path": "egs/steps/segmentation/validate_targets_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n# This script validates a 'targets_dir' as created by lats_to_targets.sh.\n# See that script for details about the format of the targets.\n\n[ -f ./path.sh ] && . ./path.sh\n\nif [ $# -ne 2 ]; then\n  cat <<EOF\n  This script validates a 'targets_dir' as created by lats_to_targets.sh.\n  See that script for details about the format of the targets.\n\n  Usage: steps/segmentation/validate_targets_dir.sh <targets-dir> <data-dir>\n  e.g.: steps/segmentation/validate_targets_dir.sh \\\n    exp/segmentation1a/tri3b_train_split10s_targets \\\n    data/train_split10s\nEOF\n  exit 1\nfi\n\ntargets_dir=$1\ndata=$2\n\ntmpdir=$(mktemp -d /tmp/kaldi.XXXX);\ntrap 'rm -rf \"$tmpdir\"' EXIT HUP INT PIPE TERM\n\nexport LC_ALL=C\n\nfunction check_sorted_and_uniq {\n  ! awk '{print $1}' $1 | sort | uniq | cmp -s - <(awk '{print $1}' $1) && \\\n    echo \"$0: file $1 is not in sorted order or has duplicates\" && exit 1;\n}\n\nfor f in $targets_dir/targets.scp $data/utt2spk; do \n  if [ ! -f $f ]; then\n    echo \"$0: Could not find $f\"\n    exit 1\n  fi\ndone\n\nutils/data/validate_data_dir.sh --no-text --no-wav --no-spk-sort \\\n  $data || exit 1\n\ncheck_sorted_and_uniq $targets_dir/targets.scp\n\nnu=`cat $data/utt2spk | wc -l` || exit 1\nnt=`cat $targets_dir/targets.scp | wc -l` || exit 1\nif [ $nt -ne $nu ]; then\n  echo \"WARNING: It seems not all of the targets files were successfully created in \"\n  echo \"$targets_dir/targets.scp for $data ($nt != $nu).\"\nfi\n\nif [ $nt -lt $[$nu - ($nu/20)] ]; then\n  echo \"Less than 95% the targets were successfully generated.  
Probably a serious error.\"\n  exit 1\nfi\n\nhead -n 100 $targets_dir/targets.scp | sort -k1,1 | feat-to-len scp:- ark,t:$tmpdir/len.targets || exit 1\nutils/filter_scp.pl $tmpdir/len.targets $data/feats.scp | sort -k1,1 | feat-to-len scp:- ark,t:$tmpdir/len.feats || exit 1\n\nframe_subsampling_factor=1\nif [ -f $targets_dir/frame_subsampling_factor ]; then\n  frame_subsampling_factor=$(cat $targets_dir/frame_subsampling_factor) || exit 1\nfi\n\nutils/filter_scp.pl $tmpdir/len.feats $tmpdir/len.targets | \\\n  paste -d ' ' - $tmpdir/len.feats | python -c \"\nimport sys\nnum_lines = 0\nfor line in sys.stdin:\n  parts = line.strip().split()\n  if parts[0] != parts[2]:\n    continue\n  len_target = int(parts[1])\n  len_feats = int(float(parts[3]) / $frame_subsampling_factor)\n  diff = abs(len_target - len_feats)\n  if diff > 3:\n    sys.stderr.write('Mismatch in length for utterance {utt} between '\n                     'targets and feats: {0} vs {1}; diff={2}\\n'.format(\n                      len_target, len_feats, diff, utt=parts[0]))\n    sys.exit(1)\n  num_lines += 1\" || exit 1\n\necho \"$0: Successfully validated targets directory $targets_dir against data directory $data\"\n"
  },
  {
    "path": "egs/steps/select_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# This script is deprecated. Use utils/data/limit_feature_dim.sh.\n\n# This script selects some specified dimensions of the features in the\n# input data directory.\n\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\ncmd=run.pl\nnj=4\ncompress=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 3 ] || [ $# -gt 5 ]; then\n   echo \"usage: $0 [options] <selector> <src-data-dir>  <dest-data-dir> [<log-dir> [<path-to-storage-dir>] ]\";\n   echo \"e.g.: $0 0-12 data/train_mfcc_pitch data/train_mfcconly exp/select_pitch_train mfcc\"\n   echo \"Note: <log-dir> defaults to <data-dir>/log, and <path-to-storage-dir> defaults to <data-dir>/data\"\n   echo \"options: \"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\nselector=\"$1\"\ndata_in=$2\ndata=$3\nif [ $# -gt 3 ];then\n  logdir=$4\nelse\n  logdir=$data/log\nfi\n\nif [ $# -gt 4 ];then\n  ark_dir=$5\nelse\n  ark_dir=$data/data\nfi\n\n# make $ark_dir an absolute pathname.\nark_dir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $ark_dir ${PWD}`\n\n\nutils/split_data.sh $data_in $nj || exit 1;\n\nmkdir -p $ark_dir $logdir\nmkdir -p $data\n\ncp $data_in/* $data/ 2>/dev/null # so we get the other files, such as utt2spk.\nrm $data/cmvn.scp 2>/dev/null\nrm $data/feats.scp 2>/dev/null\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nfor j in $(seq $nj); do\n  # the next command does nothing unless $mfccdir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $ark_dir/selected_$name.$j.ark\ndone\n\n$cmd JOB=1:$nj $logdir/append.JOB.log \\\n   select-feats 
\"$selector\" scp:$data_in/split$nj/JOB/feats.scp ark:- \\| \\\n   copy-feats --compress=$compress ark:- \\\n    ark,scp:$ark_dir/selected_$name.JOB.ark,$ark_dir/selected_$name.JOB.scp || exit 1;\n\n# concatenate the .scp files together.\nfor ((n=1; n<=nj; n++)); do\n  cat $ark_dir/selected_$name.$n.scp || exit 1;\ndone > $data/feats.scp || exit 1;\n\n\nnf=`cat $data/feats.scp | wc -l`\nnu=`cat $data/utt2spk | wc -l`\nif [ $nf -ne $nu ]; then\n  echo \"It seems not all of the feature files were successfully processed ($nf != $nu);\"\n  exit 1;\nfi\n\necho \"Succeeded selecting features for $name into $data\"\n"
  },
  {
    "path": "egs/steps/shift_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016    Vimal Manohar\n# Apache 2.0\n\n# This script is deprecated. The newer script utils/data/shift_feats.sh\n# should be used instead.\n\n# This script shifts the feats in the input data directory and creates a\n# new directory <input-data>_fs<num-frames-shift> with shifted feats.\n# If the shift is negative, the initial frames get truncated and the\n# last frame repeated; if positive, vice versa.\n# Used to prepare data for sequence training of models with\n# frame_subsampling_factor != 1 (e.g. chain models).\n\n# To be run from .. (one directory up from here)\n# see ../run.sh for example\n\n# Begin configuration section.\ncmd=run.pl\nnj=4\ncompress=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n   echo \"This script is deprecated. The newer script utils/data/shift_feats.sh\"\n   echo \"should be used instead.\"\n   echo \"usage: $0 [options] <frame-shift> <src-data-dir> <log-dir> <path-to-storage-dir>\";\n   echo \"e.g.: $0 -1 data/train exp/shift-1_train mfcc\"\n   echo \"options: \"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\nnum_frames_shift=$1\ndata_in=$2\nlogdir=$3\nfeatdir=$4\n\nutt_prefix=\"fs$num_frames_shift-\"\nspk_prefix=\"fs$num_frames_shift-\"\n\n# make $featdir an absolute pathname.\nfeatdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = \"$pwd/$dir\"; } print $dir; ' $featdir ${PWD}`\n\nutils/split_data.sh $data_in $nj || exit 1;\n\ndata=${data_in}_fs$num_frames_shift\n\nmkdir -p $featdir $logdir\nmkdir -p $data\n\nutils/copy_data_dir.sh --utt-prefix $utt_prefix --spk-prefix $spk_prefix \\\n  $data_in $data\n\nrm $data/feats.scp 2>/dev/null\n\n# use \"name\" as part of name of the archive.\nname=`basename $data`\n\nfor j in $(seq $nj); do\n  # the next command does nothing unless 
$featdir/storage/ exists, see\n  # utils/create_data_link.pl for more info.\n  utils/create_data_link.pl $featdir/raw_feats_$name.$j.ark\ndone\n\n$cmd JOB=1:$nj $logdir/shift.JOB.log \\\n  shift-feats --shift=$num_frames_shift \\\n  scp:$data_in/split$nj/JOB/feats.scp ark:- \\| \\\n  copy-feats --compress=$compress ark:- \\\n  ark,scp:$featdir/raw_feats_$name.JOB.ark,$featdir/raw_feats_$name.JOB.scp || exit 1;\n\n# concatenate the .scp files together.\nfor ((n=1; n<=nj; n++)); do\n  cat $featdir/raw_feats_$name.$n.scp\ndone | awk -v nfs=$num_frames_shift '{print \"fs\"nfs\"-\"$0}'>$data/feats.scp || exit 1;\n\nnf=`cat $data/feats.scp | wc -l`\nnu=`cat $data/utt2spk | wc -l`\nif [ $nf -ne $nu ]; then\n  echo \"It seems not all of the feature files were successfully processed ($nf != $nu);\"\n  exit 1;\nfi\n\necho \"Succeeded shifting features for $name into $data\"\n"
  },
  {
    "path": "egs/steps/subset_ali_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0.\n\ncmd=run.pl\n\nif [ -f ./path.sh ]; then . ./path.sh; fi\n\n. ./utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  cat <<EOF\n  This script creates an alignment directory containing a subset of \n  utterances contained in <subset-data-dir> from the \n  original alignment directory containing alignments for utterances in\n  <full-data-dir>.\n\n  The number of split jobs in the output alignment directory is \n  equal to the number of jobs in the original alignment directory, \n  unless the subset data directory has too few speakers.\n\n  Usage: $0 [options] <full-data-dir> <subset-data-dir> <ali-dir> <subset-ali-dir>\n   e.g.: $0 data/train_sp data/train exp/tri3_ali_sp exp/tri3_ali\n\n  Options: \n      --cmd (utils/run.pl|utils/queue.pl <queue opts>)  # how to run jobs.\nEOF\n  exit 1\nfi\n\ndata=$1\nsubset_data=$2\nali_dir=$3\ndir=$4\n\nnj=$(cat $ali_dir/num_jobs) || exit 1\nutils/split_data.sh $data $nj\n\nmkdir -p $dir\ncp $ali_dir/{final.mdl,*.mat,*_opts,tree} $dir/ || true\ncp -r $ali_dir/phones $dir 2>/dev/null || true\n\n$cmd JOB=1:$nj $dir/log/copy_alignments.JOB.log \\\n  copy-int-vector \"ark:gunzip -c $ali_dir/ali.JOB.gz |\" \\\n  ark,scp:$dir/ali_tmp.JOB.ark,$dir/ali_tmp.JOB.scp || exit 1\n\nfor n in `seq $nj`; do\n  cat $dir/ali_tmp.$n.scp \ndone > $dir/ali_tmp.scp\n\nnum_spk=$(cat $subset_data/spk2utt | wc -l)\nif [ $num_spk -lt $nj ]; then\n  nj=$num_spk\nfi\n\nutils/split_data.sh $subset_data $nj\n$cmd JOB=1:$nj $dir/log/filter_alignments.JOB.log \\\n  copy-int-vector \\\n  \"scp:utils/filter_scp.pl $subset_data/split${nj}/JOB/utt2spk $dir/ali_tmp.scp |\" \\\n  \"ark:| gzip -c > $dir/ali.JOB.gz\" || exit 1\n\necho $nj > $dir/num_jobs\n\nrm $dir/ali_tmp.*.{ark,scp} $dir/ali_tmp.scp\n\nexit 0\n"
  },
  {
    "path": "egs/steps/tandem/align_fmllr.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0\n\n# Computes training alignments; assumes features are (LDA+MLLT or delta+delta-delta)\n# + fMLLR (probably with SAT models).\n# It first computes an alignment with the final.alimdl (or the final.mdl if final.alimdl\n# is not present), then does 2 iterations of fMLLR estimation.\n\n# If you supply the --use-graphs option, it will use the training\n# graphs from the source directory (where the model is).  In this\n# case the number of jobs must match the source directory.\n\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nboost_silence=1.0 # factor by which to boost silence during alignment.\nfmllr_update_type=full\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"usage: steps/tandem/align_fmllr.sh <data1-dir> <data2-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/tandem/align_fmllr.sh {mfcc,bottleneck}/data/train data/lang exp/tri1 exp/tri1_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --fmllr-update-type (full|diag|offset|none)      # default full.\"\n   exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nsrcdir=$4\ndir=$5\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\n\n# Set up features.\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $srcdir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  this are usually spectral features, so we will add\n# deltas or splice 
them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nsifeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  sifeats=\"$sifeats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $srcdir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\nalimdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $alimdl - |\"\nmdl_cmd=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $mdl - |\"\n\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! 
-f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata1/JOB/text|\";\n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le 1 ]; then\n  echo \"$0: aligning data in $data1 ($data2) using $alimdl and speaker-independent features.\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$alimdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$sifeats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 2 ]; then\n  echo \"$0: computing fMLLR transforms\"\n  if [ \"$alimdl\" != \"$mdl\" ]; then\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alimdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata1/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  else\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n      gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata1/JOB/spk2utt $mdl \"$sifeats\" \\\n      ark,s,cs:- ark:$dir/trans.JOB || exit 1;\n  fi\nfi\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\nif [ $stage -le 3 ]; then\n  echo \"$0: doing final 
alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl_cmd\" \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tandem/align_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0\n\n# Computes training alignments and (if needed) speaker-vectors, given an\n# SGMM system.  If the system is built on top of SAT, you should supply\n# transforms with the --transform-dir option.\n\n# If you supply the --use-graphs option, it will use the training\n# graphs from the source directory.\n\n# Begin configuration section.\nstage=0\nnj=4\ncmd=run.pl\nuse_graphs=false # use graphs from srcdir\nuse_gselect=false # use gselect info from srcdir [regardless, we use\n   # Gaussian-selection info, we might have to compute it though.]\ngselect=15  # Number of Gaussian-selection indices for SGMMs.\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\ntransform_dir=  # directory to find fMLLR transforms in.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"usage: steps/tandem/align_sgmm2.sh <data-dir1> <data-dir2> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/tandem/align_sgmm2.sh --transform-dir exp/tri3b {mfcc,bottleneck}/data/train data/lang \\\\\"\n   echo \"           exp/sgmm4a exp/sgmm5a_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --transform-dir <transform-dir>                  # directory to find fMLLR transforms\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nsrcdir=$4\ndir=$5\n\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\n[ -f $srcdir/final.alimdl ] && cp $srcdir/final.alimdl $dir\ncp $srcdir/final.occs $dir;\n\n## Set up features.\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $srcdir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type 
$feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $srcdir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option during alignment.\"\nfi\n##\n\n## Set up model and alignment model.\nmdl=$srcdir/final.mdl\nif [ -f $srcdir/final.alimdl ]; then\n  alimdl=$srcdir/final.alimdl\nelse\n  alimdl=$srcdir/final.mdl\nfi\n[ ! -f $mdl ] && echo \"$0: no such model $mdl\" && exit 1;\n\n## Work out where we're getting the graphs from.\nif $use_graphs; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-graphs true, but #jobs mismatch.\" && exit 1;\n  [ ! -f $srcdir/fsts.1.gz ] && echo \"No graphs in $srcdir\" && exit 1;\n  graphdir=$srcdir\n  ln.pl $srcdir/fsts.*.gz $dir\nelse\n  graphdir=$dir\n  if [ $stage -le 0 ]; then\n    echo \"$0: compiling training graphs\"\n    tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata1/JOB/text|\";\n    $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log  \\\n      compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" \\\n        \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\n  fi\nfi\n\n## Work out where we're getting the Gaussian-selection info from\nif $use_gselect; then\n  [ \"$nj\" != \"`cat $srcdir/num_jobs`\" ] && \\\n    echo \"$0: you specified --use-gselect true, but #jobs mismatch.\" && exit 1;\n  [ ! 
-f $srcdir/gselect.1.gz ] && echo \"No gselect info in $srcdir\" && exit 1;\n  gselect_opt=\"--gselect=ark:gunzip -c $srcdir/gselect.JOB.gz|\"\n  ln.pl $srcdir/gselect.*.gz $dir\nelse\n  if [ $stage -le 1 ]; then\n    echo \"$0: computing Gaussian-selection info\"\n    # Note: doesn't matter whether we use $alimdl or $mdl, they will\n    # have the same gselect info.\n    $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n      sgmm2-gselect --full-gmm-nbest=$gselect $alimdl \\\n      \"$feats\" \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\n  fi\n  gselect_opt=\"--gselect=ark:gunzip -c $dir/gselect.JOB.gz|\"\nfi\n\n\nif [ $alimdl == $mdl ]; then\n  # Speaker-independent alignment-- just one pass.  Not normal.\n  T=`sgmm2-info $mdl | grep 'speaker vector space' | awk '{print $NF}'` || exit 1;\n  [ \"$T\" -ne 0 ] && echo \"No alignment model, yet speaker vector space nonempty\" && exit 1;\n\n  if [ $stage -le 2 ]; then\n    echo \"$0: aligning data in $data1 using model $mdl (no speaker-vectors)\"\n    $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n      sgmm2-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam $alimdl \\\n      \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n  echo \"$0: done aligning data.\"\n  exit 0;\nfi\n\n# Continue with system with speaker vectors.\nif [ $stage -le 2 ]; then\n  echo \"$0: aligning data in $data1 using model $alimdl\"\n  $cmd JOB=1:$nj $dir/log/align_pass1.JOB.log \\\n    sgmm2-align-compiled $scale_opts \"$gselect_opt\" --beam=$beam --retry-beam=$retry_beam $alimdl \\\n    \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/pre_ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 3 ]; then\n  echo \"$0: computing speaker vectors (1st pass)\"\n  $cmd JOB=1:$nj $dir/log/spk_vecs1.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alimdl ark:- 
ark:- \\| \\\n    sgmm2-post-to-gpost \"$gselect_opt\" $alimdl \"$feats\" ark:- ark:- \\| \\\n    sgmm2-est-spkvecs-gpost --spk2utt=ark:$sdata1/JOB/spk2utt \\\n     $mdl \"$feats\" ark,s,cs:- ark:$dir/pre_vecs.JOB || exit 1;\nfi\n\nif [ $stage -le 4 ]; then\n  echo \"$0: computing speaker vectors (2nd pass)\"\n  $cmd JOB=1:$nj $dir/log/spk_vecs2.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/pre_ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alimdl ark:- ark:- \\| \\\n    sgmm2-est-spkvecs --spk2utt=ark:$sdata1/JOB/spk2utt \"$gselect_opt\" \\\n     --spk-vecs=ark:$dir/pre_vecs.JOB $mdl \"$feats\" ark,s,cs:- ark:$dir/vecs.JOB || exit 1;\n  rm $dir/pre_vecs.*\nfi\n\nif [ $stage -le 5 ]; then\n  echo \"$0: doing final alignment.\"\n  $cmd JOB=1:$nj $dir/log/align_pass2.JOB.log \\\n    sgmm2-align-compiled $scale_opts \"$gselect_opt\" --beam=$beam --retry-beam=$retry_beam \\\n     --utt2spk=ark:$sdata1/JOB/utt2spk --spk-vecs=ark:$dir/vecs.JOB \\\n     $mdl \"ark:gunzip -c $graphdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nrm $dir/pre_ali.*.gz\n\necho \"$0: done aligning data.\"\n\nutils/summarize_warnings.pl $dir/log\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tandem/align_si.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0\n\n# Computes training alignments using a model with delta or\n# LDA+MLLT features.\n\n# If you supply the \"--use-graphs true\" option, it will use the training\n# graphs from the source directory (where the model is).  In this\n# case the number of jobs must match with the source directory.\n\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nuse_graphs=false\n# Begin configuration.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence during alignment.\n# End configuration options.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"usage: steps/tandem/align_si.sh <data1-dir> <data2-dir> <lang-dir> <src-dir> <align-dir>\"\n   echo \"e.g.:  steps/tandem/align_si.sh {mfcc,bottleneck}/data/train data/lang exp/tri1 exp/tri1_ali\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --use-graphs true                                # use graphs in src-dir\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nsrcdir=$4\ndir=$5\n\noov=`cat $lang/oov.int` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n# Set up the features\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && 
$data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\ncp $srcdir/{tree,final.mdl} $dir || exit 1;\ncp $srcdir/final.occs $dir;\n\n# Get some info on the feature types\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null` || exit 1;\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\n# for lda-type features, we need to copy both the lda (for baseft) and mllt\n# transformation (for the pasted features)\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $srcdir/{lda,final}.mat $dir/ || exit 1;\n   ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $srcdir/{tandem,splice_opts,normft2} $dir 2>/dev/null\n\necho \"$0: aligning data in $data1 using model from $srcdir, 
putting alignments in $dir\"\n\nmdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/final.mdl - |\"\n\nif $use_graphs; then\n  [ $nj != \"`cat $srcdir/num_jobs`\" ] && echo \"$0: mismatch in num-jobs\" && exit 1;\n  [ ! -f $srcdir/fsts.1.gz ] && echo \"$0: no such file $srcdir/fsts.1.gz\" && exit 1;\n\n  $cmd JOB=1:$nj $dir/log/align.JOB.log \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n      \"ark:gunzip -c $srcdir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nelse\n  tra=\"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt $sdata1/JOB/text|\";\n  # We could just use gmm-align in the next line, but it's less efficient as it compiles the\n  # training graphs one by one.\n  $cmd JOB=1:$nj $dir/log/align.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/final.mdl  $lang/L.fst \"$tra\" ark:- \\| \\\n    gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" ark:- \\\n      \"$feats\" \"ark,t:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\necho \"$0: done aligning data.\"\n"
  },
  {
    "path": "egs/steps/tandem/decode.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration section.\ntransform_dir=\niter=\nmodel= # You can specify the model to use (e.g. if you want to use the .alimdl)\nnj=4\ncmd=run.pl\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nacwt=0.083333 # note: only really affects pruning (scoring is on lattices).\nmin_lmwt=9\nmax_lmwt=20\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/tandem/decode.sh [options] <graph-dir> <data1-dir> <data2-dir> <decode-dir>\"\n   echo \"... where <decode-dir> is assumed to be a sub-directory of the directory\"\n   echo \" where the model is.\"\n   echo \"e.g.: steps/tandem/decode.sh exp/mono/graph {mfcc,bottleneck}/data/test_dev93 exp/mono/decode_dev93\"\n   echo \"\"\n   echo \"This script works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out\"\n   echo \"what type of features you used (assuming it's one of these two)\"\n   echo \"\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --iter <iter>                                    # Iteration of model to test.\"\n   echo \"  --model <model>                                  # which model to use (e.g. 
to\"\n   echo \"                                                   # specify the final.alimdl)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --transform-dir <trans-dir>                      # dir to find fMLLR transforms \"\n   echo \"                                                   # for speaker-adapted decoding\"\n   echo \"  --acwt <float>                                   # acoustic scale used for lattice generation \"\n   echo \"  --min-lmwt <int>                                 # minimum LM-weight for lattice rescoring \"\n   echo \"  --max-lmwt <int>                                 # maximum LM-weight for lattice rescoring \"\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata1=$2\ndata2=$3\ndir=$4\nsrcdir=`dirname $dir`; # The model directory is one level up from decoding directory.\n\nmkdir -p $dir/log\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\necho $nj > $dir/num_jobs\n\nif [ -z \"$model\" ]; then # if --model <mdl> was not specified on the command line...\n  if [ -z $iter ]; then model=$srcdir/final.mdl;\n  else model=$srcdir/$iter.mdl; fi\nfi\n\nfor f in $sdata1/1/feats.scp $sdata1/1/cmvn.scp $sdata2/1/feats.scp $model $graphdir/HCLG.fst; do\n  [ ! 
-f $f ] && echo \"decode.sh: no such file $f\" && exit 1;\ndone\n\n# Set up features.\n\n# Get some info on the feature types\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  if [ -e $srcdir/lda.mat ]; then\n    feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/lda.mat ark:- ark:- |\"\n  else\n    feats1=\"$feats1 add-deltas ark:- ark:- |\"\n  fi\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  echo \"Using cmvn for feats2\"\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $srcdir/final.mat ark:- ark:- |\"\nfi\n\n# speaker dependent transformations as requested\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"Using fMLLR transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\" && exit 1;\n  [ \"`cat $transform_dir/num_jobs`\" -ne $nj ] && \\\n     echo \"Mismatch in number of jobs with $transform_dir\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nfi\n\n$cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n gmm-latgen-faster --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n   --acoustic-scale=$acwt --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n  $model $graphdir/HCLG.fst \"$feats\" \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"Not scoring because local/score.sh does not exist or is not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" --min_lmwt $min_lmwt --max_lmwt $max_lmwt $data1 $graphdir $dir ||\n    { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tandem/decode_fmllr.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n\n# Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or\n# LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the\n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:\n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nfirst_beam=10.0 # Beam used in initial, speaker-indep. pass\nfirst_max_active=2000 # max-active used in initial pass.\nalignment_model=\nadapt_model=\nfinal_model=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in\n              # lattice generation.\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nfmllr_update_type=full\nskip_scoring=false\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/tandem/decode_fmllr.sh [options] <graph-dir> <data1-dir> <data2-dir> <decode-dir>\"\n   echo \" e.g.: steps/tandem/decode_fmllr.sh exp/tri2b/graph {mfcc,bottleneck}/data/test_dev93 exp/tri2b/decode_dev93\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <final-mdl>                # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... used to get posteriors\"\n\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata1=$2\ndata2=$3\ndir=`echo $4 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nmkdir -p $dir/log\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\necho $nj > $dir/num_jobs\n\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1;\n\n# Some checks.  
Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data1/feats.scp $data2/feats.scp $srcdir/tree; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n## Work out name of alignment model. ##\nif [ -z \"$alignment_model\" ]; then\n  if [ -f \"$srcdir/final.alimdl\" ]; then alignment_model=$srcdir/final.alimdl;\n  else alignment_model=$srcdir/final.mdl; fi\nfi\n[ ! -f \"$alignment_model\" ] && echo \"$0: no alignment model $alignment_model \" && exit 1;\n##\n\n## Do the speaker-independent decoding, if --si-dir option not present. ##\nif [ -z \"$si_dir\" ]; then # we need to do the speaker-independent decoding pass.\n  si_dir=${dir}.si # Name it as our decoding dir, but with suffix \".si\".\n  if [ $stage -le 0 ]; then\n    steps/tandem/decode_si.sh --acwt $acwt --nj $nj --cmd \"$cmd\" --beam $first_beam --model $alignment_model --max-active $first_max_active $graphdir $data1 $data2 $si_dir || exit 1;\n  fi\nfi\n##\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $si_dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ ! -f \"$si_dir/lat.1.gz\" ] && echo \"No such file $si_dir/lat.1.gz\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n\n\n# Set up features.\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  this are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  echo \"Using cmvn for feats2\"\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nsifeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  sifeats=\"$sifeats transform-feats $srcdir/final.mat ark:- ark:- |\"\nfi\n\n\n\n## Now get the first-pass fMLLR transforms.\nif [ $stage -le 1 ]; then\n  echo \"$0: getting first-pass fMLLR transforms.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass1.JOB.log \\\n    gunzip -c $si_dir/lat.JOB.gz \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $alignment_model ark:- ark:- \\| \\\n    
gmm-post-to-gpost $alignment_model \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-fmllr-gpost --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata1/JOB/spk2utt $adapt_model \"$sifeats\" ark,s,cs:- \\\n    ark:$dir/pre_trans.JOB || exit 1;\nfi\n##\n\npass1feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$dir/pre_trans.JOB ark:- ark:- |\"\n\n## Do the main lattice generation pass.  Note: we don't determinize the lattices at\n## this stage, as we're going to use them in acoustic rescoring with the larger\n## model, and it's more correct to store the full state-level lattice for this purpose.\nif [ $stage -le 2 ]; then\n  echo \"$0: doing main lattice generation phase\"\n  $cmd JOB=1:$nj $dir/log/decode.JOB.log \\\n    gmm-latgen-faster --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt  \\\n    --determinize-lattice=false --allow-partial=true --word-symbol-table=$graphdir/words.txt \\\n    $adapt_model $graphdir/HCLG.fst \"$pass1feats\" \"ark:|gzip -c > $dir/lat.tmp.JOB.gz\" \\\n    || exit 1;\nfi\n##\n\n## Do a second pass of estimating the transform-- this time with the lattices\n## generated from the alignment model.  
Compose the transforms to get\n## $dir/trans.1, etc.\nif [ $stage -le 3 ]; then\n  echo \"$0: estimating fMLLR transforms a second time.\"\n  $cmd JOB=1:$nj $dir/log/fmllr_pass2.JOB.log \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=4.0 \\\n    \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post $silence_weight $silphonelist $adapt_model ark:- ark:- \\| \\\n    gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n    --spk2utt=ark:$sdata1/JOB/spk2utt $adapt_model \"$pass1feats\" \\\n    ark,s,cs:- ark:$dir/trans_tmp.JOB '&&' \\\n    compose-transforms --b-is-affine=true ark:$dir/trans_tmp.JOB ark:$dir/pre_trans.JOB \\\n    ark:$dir/trans.JOB  || exit 1;\nfi\n##\n\nfeats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# At this point we prune and determinize the lattices and write them out, ready for\n# language model rescoring.\n\nif [ $stage -le 4 ]; then\n  echo \"$0: doing a final pass of acoustic rescoring.\"\n  $cmd JOB=1:$nj $dir/log/acoustic_rescore.JOB.log \\\n    gmm-rescore-lattice $final_model \"ark:gunzip -c $dir/lat.tmp.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" '&&' rm $dir/lat.tmp.JOB.gz || exit 1;\nfi\n\nif ! $skip_scoring ; then\n  [ ! -x local/score.sh ] && \\\n    echo \"$0: not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n  local/score.sh --cmd \"$cmd\" $data1 $graphdir $dir ||\n    { echo \"$0: Scoring failed. 
(ignore by '--skip-scoring true')\"; exit 1; }\nfi\n\nrm $dir/{trans_tmp,pre_trans}.*\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/tandem/decode_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# This script does decoding with an SGMM system, with speaker vectors.\n# If the SGMM system was\n# built on top of fMLLR transforms from a conventional system, you should\n# provide the --transform-dir option.\n\n# Begin configuration section.\nstage=1\ntransform_dir=    # dir to find fMLLR transforms.\nnj=4 # number of decoding jobs.\nacwt=0.1  # Just a default value, used for adaptation and beam-pruning..\ncmd=run.pl\nbeam=13.0\ngselect=15  # Number of Gaussian-selection indices for SGMMs.  [Note:\n            # the first_pass_gselect variable is used for the 1st pass of\n            # decoding and can be tighter.\nfirst_pass_gselect=3 # Use a smaller number of Gaussian-selection indices in\n            # the 1st pass of decoding (lattice generation).\nmax_active=7000\n\n#WARNING: This option is renamed lattice_beam (it was renamed to follow the naming\n#         in the other scripts\nlattice_beam=6.0 # Beam we use in lattice generation.\nvecs_beam=4.0 # Beam we use to prune lattices while getting posteriors for\n    # speaker-vector computation.  Can be quite tight (actually we could\n    # probably just do best-path.\nuse_fmllr=false\nfmllr_iters=10\nfmllr_min_count=1000\nskip_scoring=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: steps/tandem/decode_sgmm2.sh [options] <graph-dir> <data-dir1> <data-dir2> <decode-dir>\"\n  echo \" e.g.: steps/tandem/decode_sgmm2.sh --transform-dir exp/tri3b/decode_dev93_tgpr \\\\\"\n  echo \"      exp/sgmm3a/graph_tgpr {mfcc,bottleneck}/data/test_dev93 exp/sgmm3a/decode_dev93_tgpr\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --transform-dir <decoding-dir>           # directory of previous decoding\"\n  echo \"                                           # where we can find transforms for SAT systems.\"\n  echo \"  --config <config-file>                   # config containing options\"\n  echo \"  --nj <nj>                                # number of parallel jobs\"\n  echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n  echo \"  --beam <beam>                            # Decoding beam; default 13.0\"\n  exit 1;\nfi\n\ngraphdir=$1\ndata1=$2\ndata2=$3\ndir=$4\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nfor f in $graphdir/HCLG.fst $data1/feats.scp $data2/feats.scp $srcdir/final.mdl; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nsilphonelist=`cat $graphdir/phones/silence.csl` || exit 1\ngselect_opt=\"--gselect=ark:gunzip -c $dir/gselect.JOB.gz|\"\ngselect_opt_1stpass=\"$gselect_opt copy-gselect --n=$first_pass_gselect ark:- ark:- |\"\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\n## Set up features.\n\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $srcdir/{lda,final}.mat $dir/\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  this are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  echo \"Using cmvn for feats2\"\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if 
applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $srcdir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option in test time.\"\nfi\n##\n\n## Save Gaussian-selection info to disk.\n# Note: we can use final.mdl regardless of whether there is an alignment model--\n# they use the same UBM.\n\nif [ $stage -le 1 ]; then\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    sgmm2-gselect --full-gmm-nbest=$gselect $srcdir/final.mdl \\\n    \"$feats\" \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\n# Generate state-level lattice which we can rescore.  This is done with the alignment\n# model and no speaker-vectors.\nif [ $stage -le 2 ]; then\n  $cmd JOB=1:$nj $dir/log/decode_pass1.JOB.log \\\n    sgmm2-latgen-faster --max-active=$max_active --beam=$beam --lattice-beam=$lattice_beam \\\n    --acoustic-scale=$acwt --determinize-lattice=false --allow-partial=true \\\n    --word-symbol-table=$graphdir/words.txt \"$gselect_opt_1stpass\" $srcdir/final.alimdl \\\n    $graphdir/HCLG.fst \"$feats\" \"ark:|gzip -c > $dir/pre_lat.JOB.gz\" || exit 1;\nfi\n\n# Estimate speaker vectors (1st pass).  
Prune before determinizing\n# because determinization can take a while on un-pruned lattices.\n# Note: the sgmm2-post-to-gpost stage is necessary because we have\n# a separate alignment-model and final model, otherwise we'd skip it\n# and use sgmm2-est-spkvecs.\nif [ $stage -le 3 ]; then\n  $cmd JOB=1:$nj $dir/log/vecs_pass1.JOB.log \\\n    gunzip -c $dir/pre_lat.JOB.gz \\| \\\n    lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $srcdir/final.alimdl ark:- ark:- \\| \\\n    sgmm2-post-to-gpost \"$gselect_opt\" $srcdir/final.alimdl \"$feats\" ark:- ark:- \\| \\\n    sgmm2-est-spkvecs-gpost --spk2utt=ark:$sdata1/JOB/spk2utt \\\n     $srcdir/final.mdl \"$feats\" ark,s,cs:- \"ark:$dir/pre_vecs.JOB\" || exit 1;\nfi\n\n# Estimate speaker vectors (2nd pass).  Since we already have spk vectors,\n# at this point we need to rescore the lattice to get the correct posteriors.\nif [ $stage -le 4 ]; then\n  $cmd JOB=1:$nj $dir/log/vecs_pass2.JOB.log \\\n    gunzip -c $dir/pre_lat.JOB.gz \\| \\\n    sgmm2-rescore-lattice --spk-vecs=ark:$dir/pre_vecs.JOB --utt2spk=ark:$sdata1/JOB/utt2spk \\\n      \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n    lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n    lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n    sgmm2-est-spkvecs --spk2utt=ark:$sdata1/JOB/spk2utt \"$gselect_opt\" --spk-vecs=ark:$dir/pre_vecs.JOB \\\n     $srcdir/final.mdl \"$feats\" ark,s,cs:- \"ark:$dir/vecs.JOB\" || exit 1;\nfi\nrm $dir/pre_vecs.*\n\nif $use_fmllr; then\n  # Estimate fMLLR transforms (note: these may be on top of 
any\n  # fMLLR transforms estimated with the baseline GMM system.\n  if [ $stage -le 5 ]; then # compute fMLLR transforms.\n    echo \"$0: computing fMLLR transforms.\"\n    if [ ! -f $srcdir/final.fmllr_mdl ] || [ $srcdir/final.fmllr_mdl -ot $srcdir/final.mdl ]; then\n      echo \"$0: computing pre-transform for fMLLR computation.\"\n      sgmm2-comp-prexform $srcdir/final.mdl $srcdir/final.occs $srcdir/final.fmllr_mdl || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/fmllr.JOB.log \\\n      gunzip -c $dir/pre_lat.JOB.gz \\| \\\n      sgmm2-rescore-lattice --spk-vecs=ark:$dir/vecs.JOB --utt2spk=ark:$sdata1/JOB/utt2spk \\\n      \"$gselect_opt\" $srcdir/final.mdl ark:- \"$feats\" ark:- \\| \\\n      lattice-prune --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-determinize-pruned --acoustic-scale=$acwt --beam=$vecs_beam ark:- ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $srcdir/final.mdl ark:- ark:- \\| \\\n      sgmm2-est-fmllr --spk2utt=ark:$sdata1/JOB/spk2utt \"$gselect_opt\" --spk-vecs=ark:$dir/vecs.JOB \\\n       --fmllr-iters=$fmllr_iters --fmllr-min-count=$fmllr_min_count \\\n      $srcdir/final.fmllr_mdl \"$feats\" ark,s,cs:- \"ark:$dir/trans.JOB\" || exit 1;\n  fi\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\nfi\n\n# Now rescore the state-level lattices with the adapted features and the\n# corresponding model.  
Prune and determinize the lattices to limit\n# their size.\nif [ $stage -le 6 ]; then\n  $cmd JOB=1:$nj $dir/log/rescore.JOB.log \\\n    sgmm2-rescore-lattice \"$gselect_opt\" --utt2spk=ark:$sdata1/JOB/utt2spk --spk-vecs=ark:$dir/vecs.JOB \\\n    $srcdir/final.mdl \"ark:gunzip -c $dir/pre_lat.JOB.gz|\" \"$feats\" ark:- \\| \\\n    lattice-determinize-pruned --acoustic-scale=$acwt --beam=$lattice_beam ark:- \\\n    \"ark:|gzip -c > $dir/lat.JOB.gz\" || exit 1;\nfi\nrm $dir/pre_lat.*.gz\n\n# The output of this script is the files \"lat.*.gz\"-- we'll rescore this at different\n# acoustic scales to get the final output.\n\nif ! $skip_scoring ; then\n  if [ $stage -le 7 ]; then\n    [ ! -x local/score.sh ] && \\\n      echo \"Not scoring because local/score.sh does not exist or not executable.\" && exit 1;\n    local/score.sh --cmd \"$cmd\" $data1 $graphdir $dir ||\n      { echo \"$0: Scoring failed. (ignore by '--skip-scoring true')\"; exit 1; }\n  fi\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tandem/make_denlats.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# Create denominator lattices for MMI/MPE training.\n# Creates its output in $dir/lat.*.gz\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\ntransform_dir=\nmax_mem=20000000 # This will stop the processes getting too large.\n# This is in bytes, but not \"real\" bytes-- you have to multiply\n# by something like 5 or 10 to get real bytes (not sure why so large)\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"Usage: steps/make_tandem_denlats.sh [options] <data1-dir> <data2-dir> <lang-dir> <src-dir> <exp-dir>\"\n   echo \"  e.g.: steps/make_tandem_denlats.sh {mfcc,bottleneck}/data/train data/lang exp/tri1 exp/tri1_denlats\"\n   echo \"Works for (delta|lda) features, and (with --transform-dir option) such features\"\n   echo \" plus transforms.\"\n   echo \"\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n   echo \"                           # large databases so your jobs will be smaller and\"\n   echo \"                           # will (individually) finish reasonably soon.\"\n   echo \"  --transform-dir <transform-dir>   # directory to find fMLLR transforms.\"\n   exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nsrcdir=$4\ndir=$5\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\noov=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir\n\ncp -r $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\n\ncat $data1/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n  awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n  utils/make_unigram_grammar.pl | fstcompile > $dir/lang/G.fst \\\n   || exit 1;\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\n\nif [ -s $dir/dengraph/HCLG.fst ]; then\n   echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  utils/mkgraph.sh $dir/lang $srcdir $dir/dengraph || exit 1;\nfi\n\n\n## Set up features.\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $srcdir/{lda,final}.mat $dir/\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 
1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $srcdir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\n\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"$0: using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\"\n  [ \"`cat $transform_dir/num_jobs`\" -ne \"$nj\" ] \\\n    && echo \"$0: mismatch in number of jobs with $transform_dir\" && exit 1;\n  [ -f $srcdir/final.mat ] && ! 
cmp $transform_dir/final.mat $srcdir/final.mat && \\\n     echo \"$0: LDA transforms differ between $srcdir and $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nelse\n  if [ -f $srcdir/final.alimdl ]; then\n    echo \"$0: you seem to have a SAT system but you did not supply the --transform-dir option.\";\n    exit 1;\n  fi\nfi\n\n\n# if this job is interrupted by the user, we want any background jobs to be\n# killed too.\ncleanup() {\n  local pids=$(jobs -pr)\n  [ -n \"$pids\" ] && kill $pids\n}\ntrap \"cleanup\" INT QUIT TERM EXIT\n\n\nif [ $sub_split -eq 1 ]; then\n  $cmd JOB=1:$nj $dir/log/decode_den.JOB.log \\\n   gmm-latgen-faster --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n    --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n     $dir/dengraph/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nelse\n  # each job from 1 to $nj is split into multiple pieces (sub-split), and we aim\n  # to have at most two jobs running at each time.  
The idea is that if we have stragglers\n  # from one job, we can be processing another one at the same time.\n  rm $dir/.error 2>/dev/null\n\n  prev_pid=\n  for n in `seq $[nj+1]`; do\n    if [ $n -gt $nj ]; then\n      this_pid=\n    elif [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n      this_pid=\n    else\n      ssdata1=$data1/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata1/$n $sub_split || exit 1;\n      ssdata2=$data2/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata2/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed \"s/trans.JOB/trans.$n/g\" | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n\n      $cmd $parallel_opts JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        gmm-latgen-faster --beam=$beam --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n        --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n          $dir/dengraph/HCLG.fst \"$feats_subset\" \"ark:|gzip -c >$dir/lat.$n.JOB.gz\" || touch $dir/.error &\n      this_pid=$!\n    fi\n    if [ ! 
-z \"$prev_pid\" ]; then  # Wait for the previous job; merge the previous set of lattices.\n      wait $prev_pid\n      [ -f $dir/.error ] && echo \"$0: error generating denominator lattices\" && exit 1;\n      rm $dir/.merge_error 2>/dev/null\n      echo Merging archives for data subset $prev_n\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$prev_n.$k.gz || touch $dir/.merge_error;\n      done | gzip -c > $dir/lat.$prev_n.gz || touch $dir/.merge_error;\n      [ -f $dir/.merge_error ] && echo \"$0: Merging lattices for subset $prev_n failed (or maybe some other error)\" && exit 1;\n      rm $dir/lat.$prev_n.*.gz\n      touch $dir/.done.$prev_n\n    fi\n    prev_n=$n\n    prev_pid=$this_pid\n  done\nfi\n\n\necho \"$0: done generating denominator lattices.\"\n"
  },
  {
    "path": "egs/steps/tandem/make_denlats_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# Create denominator lattices for MMI/MPE training, with SGMM models.  If the\n# features have fMLLR transforms you have to supply the --transform-dir option.\n# It gets any speaker vectors from the \"alignment dir\" ($srcdir).  Note: this is\n# possibly a slight mismatch because the speaker vectors come from supervised\n# adaptation.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsub_split=1\nbeam=13.0\nlattice_beam=7.0\nacwt=0.1\nmax_active=5000\ntransform_dir=\nmax_mem=20000000 # This will stop the processes getting too large.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# != 5 ]; then\n   echo \"Usage: steps/tandem/make_denlats_sgmm2.sh [options] <data1-dir> <data2-dir> <lang-dir> <src-dir|srcdir> <exp-dir>\"\n   echo \"  e.g.: steps/tandem/make_denlats_sgmm2.sh {mfcc,bottleneck}/data1/train data1/lang exp/sgmm4a_ali exp/sgmm4a_denlats\"\n   echo \"Works for (delta|lda) features, and (with --transform-dir option) such features\"\n   echo \" plus transforms.\"\n   echo \"\"\n   echo \"Main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --nj <nj>                                        # number of parallel jobs\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --sub-split <n-split>                            # e.g. 
40; use this for \"\n   echo \"                           # large databases so your jobs will be smaller and\"\n   echo \"                           # will (individually) finish reasonably soon.\"\n   echo \"  --transform-dir <transform-dir>   # directory to find fMLLR transforms.\"\n   exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nsrcdir=$4 # could also be $srcdir, but only if no vectors supplied.\ndir=$5\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $srcdir/phones.txt || exit 1;\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\noov=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir\n\ncp -r $lang $dir/\n\n# Compute grammar FST which corresponds to unigram decoding graph.\n\ncat $data1/text | utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt | \\\n  awk '{for(n=2;n<=NF;n++){ printf(\"%s \", $n); } printf(\"\\n\"); }' | \\\n  utils/make_unigram_grammar.pl | fstcompile > $dir/lang/G.fst \\\n   || exit 1;\n\n# mkgraph.sh expects a whole directory \"lang\", so put everything in one directory...\n# it gets L_disambig.fst and G.fst (among other things) from $dir/lang, and\n# final.mdl from $srcdir; the output HCLG.fst goes in $dir/graph.\n\nif [ -s $dir/dengraph/HCLG.fst ]; then\n   echo \"Graph $dir/dengraph/HCLG.fst already exists: skipping graph creation.\"\nelse\n  utils/mkgraph.sh $dir/lang $srcdir $dir/dengraph || exit 1;\nfi\n\n\n## Set up features.\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $srcdir/{lda,final}.mat $dir/ || exit 1;\n  
  ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $srcdir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\nif [ ! -z \"$transform_dir\" ]; then # add transforms to features...\n  echo \"$0: using fMLLR transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"Expected $transform_dir/trans.1 to exist.\"\n  [ \"`cat $transform_dir/num_jobs`\" -ne \"$nj\" ] \\\n    && echo \"$0: mismatch in number of jobs with $transform_dir\" && exit 1;\n  [ -f $srcdir/final.mat ] && ! 
cmp $transform_dir/final.mat $srcdir/final.mat && \\\n     echo \"$0: LDA transforms differ between $srcdir and $transform_dir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$transform_dir/trans.JOB ark:- ark:- |\"\nelse\n  echo \"Assuming you don't have a SAT system, since no --transform-dir option supplied \"\nfi\n\nif [ -f $srcdir/gselect.1.gz ]; then\n  gselect_opt=\"--gselect=ark:gunzip -c $srcdir/gselect.JOB.gz|\"\nelse\n  echo \"$0: no such file $srcdir/gselect.1.gz\" && exit 1;\nfi\n\nif [ -f $srcdir/vecs.1 ]; then\n  spkvecs_opt=\"--spk-vecs=ark:$srcdir/vecs.JOB --utt2spk=ark:$sdata1/JOB/utt2spk\"\nelse\n  if [ -f $srcdir/final.alimdl ]; then\n    echo \"$0: You seem to have an SGMM system with speaker vectors,\"\n    echo \"yet we can't find speaker vectors.  Perhaps you supplied\"\n    echo \"the model directory instead of the alignment directory?\"\n    exit 1;\n  fi\nfi\n\nif [ $sub_split -eq 1 ]; then\n  $cmd JOB=1:$nj $dir/log/decode_den.JOB.log \\\n   sgmm2-latgen-faster $spkvecs_opt \"$gselect_opt\" --beam=$beam \\\n     --lattice-beam=$lattice_beam --acoustic-scale=$acwt \\\n     --max-mem=$max_mem --max-active=$max_active --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n     $dir/dengraph/HCLG.fst \"$feats\" \"ark:|gzip -c >$dir/lat.JOB.gz\" || exit 1;\nelse\n  for n in `seq $nj`; do\n    if [ -f $dir/.done.$n ] && [ $dir/.done.$n -nt $srcdir/final.mdl ]; then\n      echo \"Not processing subset $n as already done (delete $dir/.done.$n if not)\";\n    else\n      ssdata1=$data1/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata1/$n $sub_split || exit 1;\n      ssdata2=$data2/split$nj/$n/split${sub_split}utt;\n      split_data.sh --per-utt $sdata2/$n $sub_split || exit 1;\n      mkdir -p $dir/log/$n\n      mkdir -p $dir/part\n      feats_subset=`echo $feats | sed \"s/trans.JOB/trans.$n/g\" | sed s:JOB/:$n/split${sub_split}utt/JOB/:g`\n      spkvecs_opt_subset=`echo $spkvecs_opt | sed 
\"s/JOB/$n/g\"`\n      gselect_opt_subset=`echo $gselect_opt | sed \"s/JOB/$n/g\"`\n      $cmd JOB=1:$sub_split $dir/log/$n/decode_den.JOB.log \\\n        sgmm2-latgen-faster $spkvecs_opt_subset \"$gselect_opt_subset\" \\\n          --beam=$beam --lattice-beam=$lattice_beam \\\n          --acoustic-scale=$acwt --max-mem=$max_mem --max-active=$max_active \\\n          --word-symbol-table=$lang/words.txt $srcdir/final.mdl  \\\n          $dir/dengraph/HCLG.fst \"$feats_subset\" \"ark:|gzip -c >$dir/lat.$n.JOB.gz\" || exit 1;\n      echo Merging archives for data subset $n\n      rm $dir/.error 2>/dev/null;\n      for k in `seq $sub_split`; do\n        gunzip -c $dir/lat.$n.$k.gz || touch $dir/.error;\n      done | gzip -c > $dir/lat.$n.gz || touch $dir/.error;\n      [ -f $dir/.error ] && echo Merging lattices for subset $n failed && exit 1;\n      rm $dir/lat.$n.*.gz\n      touch $dir/.done.$n\n    fi\n  done\nfi\n\n\necho \"$0: done generating denominator lattices with SGMMs.\"\n"
  },
  {
    "path": "egs/steps/tandem/mk_aslf_lda_mllt.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n\n# Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or\n# LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the\n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:\n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nalignment_model=\nadapt_model=\nfinal_model=\ntransform_dir=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in\n              # lattice generation.\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nfmllr_update_type=full\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/tandem/mk_aslf_lda_mllt.sh [options] <graph-dir> <data1-dir> <data2-dir> <decode-dir>\"\n   echo \" e.g.: steps/tandem/mk_aslf_lda_mllt.sh exp/tri2b/graph {mfcc,bottleneck}/data/test_dev93 exp/tri2b/decode_dev93\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <final-mdl>                # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... used to get posteriors\"\n\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata1=$2\ndata2=$3\ndir=`echo $4 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nmkdir -p $dir/log\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\necho $nj > $dir/num_jobs\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data1/feats.scp $data2/feats.scp $srcdir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n\n# Set up features.\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  echo \"Using cmvn for feats2\"\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nsifeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  sifeats=\"$sifeats transform-feats $srcdir/final.mat ark:- ark:- |\"\nfi\n\nif [ -e 
$dir/trans.1 ]; then\n  echo \"Using fMLLR transforms in $dir\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\nelif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option at test time.\"\nfi\n\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# The lattices are word-aligned, rescored, and written out in SLF format\n# under $dir/slf.\n\necho \"Rescoring lattices, converting to slf\"\nmkdir -p $dir/slf\n$cmd JOB=1:$nj $dir/log/rescore.slf.JOB.log \\\n  lattice-align-words $graphdir/phones/word_boundary.int $final_model \"ark:gunzip -c $dir/lat.JOB.gz |\" ark:- \\| \\\n  gmm-rescore-lattice $final_model ark:- \"$feats\" ark,t:- \\| \\\n  utils/int2sym.pl -f 3 $graphdir/words.txt \\| \\\n  utils/convert_slf.pl - $dir/slf\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/tandem/mk_aslf_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n\n# Decoding script that does fMLLR.  This can be on top of delta+delta-delta, or\n# LDA+MLLT features.\n\n# There are 3 models involved potentially in this script,\n# and for a standard, speaker-independent system they will all be the same.\n# The \"alignment model\" is for the 1st-pass decoding and to get the\n# Gaussian-level alignments for the \"adaptation model\" the first time we\n# do fMLLR.  The \"adaptation model\" is used to estimate fMLLR transforms\n# and to generate state-level lattices.  The lattices are then rescored\n# with the \"final model\".\n#\n# The following table explains where we get these 3 models from.\n# Note: $srcdir is one level up from the decoding directory.\n#\n#   Model              Default source:\n#\n#  \"alignment model\"   $srcdir/final.alimdl              --alignment-model <model>\n#                     (or $srcdir/final.mdl if alimdl absent)\n#  \"adaptation model\"  $srcdir/final.mdl                 --adapt-model <model>\n#  \"final model\"       $srcdir/final.mdl                 --final-model <model>\n\n\n# Begin configuration section\nalignment_model=\nadapt_model=\nfinal_model=\ntransform_dir=\nstage=0\nacwt=0.083333 # Acoustic weight used in getting fMLLR transforms, and also in\n              # lattice generation.\nmax_active=7000\nbeam=13.0\nlattice_beam=6.0\nnj=4\nsilence_weight=0.01\ncmd=run.pl\nsi_dir=\nfmllr_update_type=full\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n   echo \"Usage: steps/tandem/mk_aslf_sgmm2.sh [options] <graph-dir> <data1-dir> <data2-dir> <decode-dir>\"\n   echo \" e.g.: steps/tandem/mk_aslf_sgmm2.sh exp/tri2b/graph {mfcc,bottleneck}/data/test_dev93 exp/tri2b/decode_dev93\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --config <config-file>                   # config containing options\"\n   echo \"  --nj <nj>                                # number of parallel jobs\"\n   echo \"  --cmd <cmd>                              # Command to run in parallel with\"\n   echo \"  --adapt-model <adapt-mdl>                # Model to compute transforms with\"\n   echo \"  --alignment-model <ali-mdl>              # Model to get Gaussian-level alignments for\"\n   echo \"                                           # 1st pass of transform computation.\"\n   echo \"  --final-model <final-mdl>                # Model to finally decode with\"\n   echo \"  --si-dir <speaker-indep-decoding-dir>    # use this to skip 1st pass of decoding\"\n   echo \"                                           # Caution-- must be with same tree\"\n   echo \"  --acwt <acoustic-weight>                 # default 0.08333 ... used to get posteriors\"\n\n   exit 1;\nfi\n\n\ngraphdir=$1\ndata1=$2\ndata2=$3\ndir=`echo $4 | sed 's:/$::g'` # remove any trailing slash.\n\nsrcdir=`dirname $dir`; # Assume model directory one level up from decoding directory.\n\nmkdir -p $dir/log\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\necho $nj > $dir/num_jobs\n\n# Some checks.  Note: we don't need $srcdir/tree but we expect\n# it should exist, given the current structure of the scripts.\nfor f in $graphdir/HCLG.fst $data1/feats.scp $data2/feats.scp $srcdir/tree; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n## Some checks, and setting of defaults for variables.\n[ \"$nj\" -ne \"`cat $dir/num_jobs`\" ] && echo \"Mismatch in #jobs with si-dir\" && exit 1;\n[ -z \"$adapt_model\" ] && adapt_model=$srcdir/final.mdl\n[ -z \"$final_model\" ] && final_model=$srcdir/final.mdl\nfor f in $adapt_model $final_model; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n##\n\n\n# Set up features.\n\nsplice_opts=`cat $srcdir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $srcdir/normft2 2>/dev/null`\n\nif [ -f $srcdir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $srcdir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  these are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  echo \"Using cmvn for feats2\"\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nsifeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  sifeats=\"$sifeats transform-feats $srcdir/final.mat ark:- ark:- |\"\nfi\n\nif [ -e 
$dir/trans.1. ]; then\n  echo \"Using fMLLR transforms in $dir\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\nelif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" && exit 1;\n  [ \"$nj\" -ne \"`cat $transform_dir/num_jobs`\" ] \\\n    && echo \"$0: #jobs mismatch with transform-dir.\" && exit 1;\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelif grep 'transform-feats --utt2spk' $srcdir/log/acc.0.1.log 2>/dev/null; then\n  echo \"$0: **WARNING**: you seem to be using an SGMM system trained with transforms,\"\n  echo \"  but you are not providing the --transform-dir option in test time.\"\nfi\n\n\n# Rescore the state-level lattices with the final adapted features, and the final model\n# (which by default is $srcdir/final.mdl, but which may be specified on the command line,\n# useful in case of discriminatively trained systems).\n# At this point we prune and determinize the lattices and write them out, ready for\n# language model rescoring.\n\necho \"Rescoring lattices, converting to slf\"\nmkdir -p $dir/slf\n$cmd JOB=1:$nj $dir/log/rescore.slf.JOB.log \\\n  lattice-align-words $graphdir/phones/word_boundary.int $final_model \"ark:gunzip -c $dir/lat.JOB.gz |\" ark:- \\| \\\n  sgmm2-rescore-lattice --spk-vecs=ark:$dir/vecs.JOB --utt2spk=ark:$sdata1/JOB/utt2spk \\\n    \"--gselect=ark:gunzip -c $dir/gselect.JOB.gz |\" $final_model ark:- \"$feats\" ark,t:- \\| \\\n  utils/int2sym.pl -f 3 $graphdir/words.txt \\| \\\n  utils/convert_slf.pl - $dir/slf\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/tandem/train_deltas.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0\n\n# Begin configuration.\nstage=-4 #  This allows restarting after partway, when something when wrong.\nconfig=\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.2 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nnormft2=true  # typically, the tandem features will be normalized already b/c of pca\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# != 7 ]; then\n   echo \"Usage: steps/tandem/train_deltas.sh <num-leaves> <tot-gauss> <data1-dir> <data2-dir> <lang-dir> <alignment-dir> <exp-dir>\"\n   echo \" e.g.: steps/tandem/train_deltas.sh 2000 10000 {mfcc,bottleneck}/data/train_si84_half data/lang exp/mono_ali exp/tri1\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   echo \"  --normft2 (true|false)                           # apply CMVN to second features?\"\n   exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata1=$3\ndata2=$4\nlang=$5\nalidir=$6\ndir=$7\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data1/feats.scp $data2/feats.scp $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter increment for #Gauss\noov=`cat $lang/oov.int` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n# Set up features\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\n# Set up stream 1 (usually spectral features, so we use deltas)\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\"\n\n# Set up stream 2 (usually bottleneck/posteriors), normalize if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# save config\necho $feats > $dir/tandem\necho $normft2 > $dir/normft2\n\nrm $dir/.error 2>/dev/null\n\nif [ $stage -le -3 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: getting questions for tree-building, via clustering\"\n  # preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int 
$dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n\n  gmm-mixup --mix-up=$numgauss $dir/1.mdl $dir/1.occs $dir/1.mdl 2>$dir/log/mixup.log || exit 1;\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $data1/split$nj/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: training pass $x\"\n  if [ $stage -le $x ]; then\n    if echo $realign_iters | grep -w $x >/dev/null; then\n      echo \"$0: aligning data\"\n      mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n      $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n        gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam 
\"$mdl\" \\\n         \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n         \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n       \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --mix-up=$numgauss --power=$power \\\n        --write-occs=$dir/$[$x+1].occs $dir/$x.mdl \\\n       \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\n\n# Summarize warning messages...\nutils/summarize_warnings.pl  $dir/log\n\necho \"$0: Done training tandem system in $dir\"\n\n"
  },
  {
    "path": "egs/steps/tandem/train_lda_mllt.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0.\n\n# Begin configuration.\ncmd=run.pl\nconfig=\nstage=-5\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nmllt_iters=\"2 4 6 12\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25  # Last iter to increase #Gauss on.\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.2 # Exponent for number of gaussians according to occurrence counts\nrandprune=4.0 # This is approximately the ratio by which we will speed up the\n              # LDA and MLLT calculations via randomized pruning.\nsplice_opts=\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\n\ndim1=30  # dimension first stream (spectral features)\ndim2=40  # dimension second stream (pasted features, usually bn/posteriors)\n\n# apply CMVN to the second feature stream\nnormft2=true\n\n# do an extra LDA after pasting the features?\nextra_lda=false\n\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 7 ]; then\n  echo \"Usage: steps/tandem/train_lda_mllt.sh [options] <#leaves> <#gauss> <data1> <data2> <lang> <alignments> <dir>\"\n  echo \" e.g.: steps/tandem/train_lda_mllt.sh 2500 15000 {mfcc,bottleneck}/data/train_si84 data/lang exp/tri1_ali_si84 exp/tri2b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --normft2 (true|false)                           # apply CMVN to second data set (true)\"\n  echo \"  --extra-lda (true|false)                         # apply extra LDA after feature paste (false)\"\n  echo \"  --dim1 <n>                                       # dimension of the first feature stream by HLDA\"\n  echo \"  --dim2 <m>                                       # dimension of of the pasted features after 2nd HLDA\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata1=$3\ndata2=$4\nlang=$5\nalidir=$6\ndir=$7\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data1/feats.scp $data2/feats.scp $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter #gauss increment\noov=`cat $lang/oov.int` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\n\nmkdir -p $dir/log\necho $nj >$dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n\n# Set up features.\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\n# set up feature stream 1;  here we assume spectral features which we will \n# splice instead of deltas\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- |\"\n\n# Now estimate LDA, which will only be applied to the spectral features\n# (assuming that the tandem features were already discriminatively trained).\n# This is instead of the deltas.\nif [ $stage -le -5 ]; then\n  echo \"Accumulating LDA statistics (this only applies to the base feature part).\"\n  $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      acc-lda --rand-prune=$randprune $alidir/final.mdl \"$feats1\" ark,s,cs:- \\\n       $dir/lda.JOB.acc || exit 1;\n  est-lda --write-full-matrix=$dir/full.mat --dim=$dim1 $dir/lda.mat $dir/lda.*.acc \\\n      2>$dir/log/lda_est.log || exit 1;\n  rm $dir/lda.*.acc\nfi\n\n# add transform to the features\nfeats1=\"$feats1 transform-feats $dir/lda.mat ark:- ark:- |\"\n\n# set up feature stream 2;  these are 
usually bottleneck or posterior features, \n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features;  note: $feats gets overwritten later in the script\n# once we have MLLT matrices\ntandemfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\nfeats=\"$tandemfeats\"\n\n# keep track of splicing/normalization options\necho $splice_opts > $dir/splice_opts\necho $normft2 > $dir/normft2\n\n\n# Begin training;  initially, we have no MLLT matrix\ncur_mllt_iter=0\n\nif [ $stage -le -4 -a $extra_lda == true ]; then\n  echo \"Accumulating LDA statistics (for tandem features this time).\"\n  $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n    acc-lda --rand-prune=$randprune $alidir/final.mdl \"$tandemfeats\" ark,s,cs:- \\\n    $dir/lda.JOB.acc || exit 1;\n  est-lda --write-full-matrix=$dir/full.mat --dim=$dim2 $dir/0.mat $dir/lda.*.acc \\\n    2>$dir/log/lda_est.log || exit 1;\n  rm $dir/lda.*.acc\n  \n  feats=\"$tandemfeats transform-feats $dir/0.mat ark:- ark:- |\"\nfi\n\n# keep track of the features\necho $feats > $dir/tandem\n\nif [ $stage -le -3 ]; then\n  echo \"Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n   acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ `ls $dir/*.treeacc | wc -w` -ne \"$nj\" ] && echo \"Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\n\nif [ $stage -le -2 ]; then\n  echo \"Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  
cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n\n  # could mix up if we wanted:\n  # gmm-mixup --mix-up=$numgauss $dir/1.mdl $dir/1.occs $dir/1.mdl 2>$dir/log/mixup.log || exit 1;\n  rm $dir/treeacc\nfi\n\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $data1/split$nj/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo Training pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts 
--beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n  if echo $mllt_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo \"Estimating MLLT\"\n      $cmd JOB=1:$nj $dir/log/macc.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n        weight-silence-post 0.0 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-acc-mllt --rand-prune=$randprune  $dir/$x.mdl \"$feats\" ark:- $dir/$x.JOB.macc \\\n        || exit 1;\n      est-mllt $dir/$x.mat.new $dir/$x.*.macc 2> $dir/log/mupdate.$x.log || exit 1;\n      gmm-transform-means  $dir/$x.mat.new $dir/$x.mdl $dir/$x.mdl \\\n        2> $dir/log/transform_means.$x.log || exit 1;\n      \n      # see if this is the first MLLT iteration and there is no lda;  otherwise compose transforms\n      if [ $cur_mllt_iter == 0 -a $extra_lda == false ]; then\n        mv $dir/$x.mat.new $dir/$x.mat || exit 1;\n      else\n        compose-transforms --print-args=false $dir/$x.mat.new $dir/$cur_mllt_iter.mat $dir/$x.mat || exit 1;\n      fi\n\n      rm $dir/$x.*.macc\n    fi\n\n    # update features\n    feats=\"$tandemfeats transform-feats $dir/$x.mat ark:- ark:- |\"\n    cur_mllt_iter=$x\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss --power=$power \\\n        $dir/$x.mdl \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs \n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nrm $dir/final.{mdl,mat,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s $cur_mllt_iter.mat 
$dir/final.mat\n\n# Summarize warning messages...\n\nutils/summarize_warnings.pl $dir/log\n\necho Done training system with LDA+MLLT tandem features in $dir\n"
  },
  {
    "path": "egs/steps/tandem/train_mllt.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0.\n\n# This is a vanilla tandem system where the first stream is just extended with\n# delta+deltadeltas, in contrast to the train_lda_mllt.sh script, where the\n# temoporal context of the first stream is modeled via HLDA\n\n# Begin configuration.\ncmd=run.pl\nconfig=\nstage=-5\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nmllt_iters=\"2 4 6 12\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25  # Last iter to increase #Gauss on.\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.2 # Exponent for number of gaussians according to occurrence counts\nrandprune=4.0 # This is approximately the ratio by which we will speed up the\n              # LDA and MLLT calculations via randomized pruning.\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\n\n# apply CMVN to the second feature stream?\nnormft2=true\n\n# Do additional LDA after pasting the features\ndim2=40\nextra_lda=false\n\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 7 ]; then\n  echo \"Usage: steps/tandem/train_mllt.sh [options] <#leaves> <#gauss> <data1> <data2> <lang> <alignments> <dir>\"\n  echo \" e.g.: steps/tandem/train_mllt.sh 2500 15000 {mfcc,bottleneck}/data/train_si84 data/lang exp/tri1_ali_si84 exp/tri2b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --normft2 (true|false)                           # apply CMVN to second data set (true)\"\n  echo \"  --extra-lda (true|false)                         # apply extra LDA after feature paste (false)\"\n  echo \"  --dim2 <n>                                       # dimension of the pasted features after 2nd HLDA\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata1=$3\ndata2=$4\nlang=$5\nalidir=$6\ndir=$7\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data1/feats.scp $data2/feats.scp $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter #gauss increment\noov=`cat $lang/oov.int` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\n\nmkdir -p $dir/log\necho $nj >$dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n\n# Set up features.\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\n# set up feature stream 1;  here we assume spectral features, which we\n# extend with deltas (no splicing in this script)\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\"\n\n# set up feature stream 2;  these are usually bottleneck or posterior features, \n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features;  note: $feats gets overwritten later in the script\n# once we have MLLT matrices\ntandemfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\nfeats=\"$tandemfeats\"\n\n# keep track of splicing/normalization options\necho $feats > $dir/tandem\necho $normft2 > $dir/normft2\n\n\n# Begin training;  initially, we have no MLLT matrix\ncur_mllt_iter=0\n\nif [ $stage -le -4 -a $extra_lda == true ]; then\n  echo \"Accumulating LDA statistics (for tandem features this time).\"\n  $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c 
$alidir/ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n    acc-lda --rand-prune=$randprune $alidir/final.mdl \"$tandemfeats\" ark,s,cs:- \\\n    $dir/lda.JOB.acc || exit 1;\n  est-lda --write-full-matrix=$dir/full.mat --dim=$dim2 $dir/0.mat $dir/lda.*.acc \\\n    2>$dir/log/lda_est.log || exit 1;\n  rm $dir/lda.*.acc\n  \n  feats=\"$tandemfeats transform-feats $dir/0.mat ark:- ark:- |\"\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n   acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ `ls $dir/*.treeacc | wc -w` -ne \"$nj\" ] && echo \"Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\n\nif [ $stage -le -2 ]; then\n  echo \"Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n\n  # could mix up if we wanted:\n  # gmm-mixup --mix-up=$numgauss $dir/1.mdl $dir/1.occs $dir/1.mdl 2>$dir/log/mixup.log || exit 1;\n  rm $dir/treeacc\nfi\n\n\nif [ $stage -le -1 ]; then\n  # 
Convert the alignments.\n  echo \"Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $data1/split$nj/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo Training pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n  if echo $mllt_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo \"Estimating MLLT\"\n      $cmd JOB=1:$nj $dir/log/macc.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n        weight-silence-post 0.0 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-acc-mllt --rand-prune=$randprune  $dir/$x.mdl \"$feats\" ark:- $dir/$x.JOB.macc \\\n        || exit 1;\n      est-mllt $dir/$x.mat.new $dir/$x.*.macc 2> $dir/log/mupdate.$x.log || exit 1;\n      gmm-transform-means  $dir/$x.mat.new $dir/$x.mdl $dir/$x.mdl \\\n        2> $dir/log/transform_means.$x.log || exit 1;\n      \n      # see if this is the first MLLT iteration and there is no lda;  otherwise compose transforms\n      if [ $cur_mllt_iter == 0 -a 
$extra_lda == false ]; then\n        mv $dir/$x.mat.new $dir/$x.mat || exit 1;\n      else\n        compose-transforms --print-args=false $dir/$x.mat.new $dir/$cur_mllt_iter.mat $dir/$x.mat || exit 1;\n      fi\n\n      rm $dir/$x.*.macc\n    fi\n\n    # update features\n    feats=\"$tandemfeats transform-feats $dir/$x.mat ark:- ark:- |\"\n    cur_mllt_iter=$x\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss --power=$power \\\n        $dir/$x.mdl \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs \n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nrm $dir/final.{mdl,mat,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s $cur_mllt_iter.mat $dir/final.mat\n\n# Summarize warning messages...\n\nutils/summarize_warnings.pl $dir/log\n\necho Done training system with LDA+MLLT tandem features in $dir\n"
  },
  {
    "path": "egs/steps/tandem/train_mmi.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# MMI training (or optionally boosted MMI, if you give the --boost option).\n# 4 iterations (by default) of Extended Baum-Welch update.\n#\n# For the numerator we have a fixed alignment rather than a lattice--\n# this actually follows from the way lattices are defined in Kaldi, which\n# is to have a single path for each word (output-symbol) sequence.\n\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nboost=0.0\ncancel=true # if true, cancel num and den counts on each frame.\ntau=400\nweight_tau=10\nacwt=0.1\nstage=0\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 6 ]; then\n  echo \"Usage: steps/train_tandem_mmi.sh <data1> <data2> <lang> <ali> <denlats> <exp>\"\n  echo \" e.g.: steps/train_tandem_mmi.sh {mfcc,bottleneck}/data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri2b_denlats_si84 exp/tri2b_mmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1), for boosted MMI.  
(default 0)\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --tau                                            # tau for i-smooth to last iter (default 400)\"\n\n  exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nalidir=$4\ndenlatdir=$5\ndir=$6\n\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data1/feats.scp $data2/feats.scp $alidir/{tree,final.mdl,ali.1.gz} $denlatdir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\ncp $alidir/{final.mdl,tree} $dir\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\n\n# Set up features\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $alidir/normft2 2>/dev/null`\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $alidir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice 
them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $alidir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\nfi\n##\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\n\ncur_mdl=$alidir/final.mdl\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Iteration $x of MMI training\"\n  # Note: the num and den states are accumulated at the same time, so we\n  # can cancel them per frame.\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-rescore-lattice $cur_mdl \"$lats\" \"$feats\" ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      sum-post --merge=$cancel --scale1=-1 \\\n      
ark:- \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" ark:- \\| \\\n      gmm-acc-stats2 $cur_mdl \"$feats\" ark,s,cs:- \\\n      $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n\n    n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of MMI accumulators $n versus 2*$nj\" && exit 1;\n    $cmd $dir/log/den_acc_sum.$x.log \\\n      gmm-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || exit 1;\n    rm $dir/den_acc.$x.*.acc\n    $cmd $dir/log/num_acc_sum.$x.log \\\n      gmm-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || exit 1;\n    rm $dir/num_acc.$x.*.acc\n\n  # note: this tau value is for smoothing towards model parameters, not\n  # as in the Boosted MMI paper, not towards the ML stats as in the earlier\n  # work on discriminative training (e.g. my thesis).\n  # You could use gmm-ismooth-stats to smooth to the ML stats, if you had\n  # them available [here they're not available if cancel=true].\n\n    $cmd $dir/log/update.$x.log \\\n      gmm-est-gaussians-ebw --tau=$tau $cur_mdl $dir/num_acc.$x.acc $dir/den_acc.$x.acc - \\| \\\n      gmm-est-weights-ebw --weight-tau=$weight_tau - $dir/num_acc.$x.acc $dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n    rm $dir/{den,num}_acc.$x.acc\n  fi\n  cur_mdl=$dir/$[$x+1].mdl\n\n  # Some diagnostics: the objective function progress and auxiliary-function\n  # improvement.\n\n  tail -n 50 $dir/log/acc.$x.*.log | perl -e '$acwt=shift @ARGV; while(<STDIN>) { if(m/gmm-acc-stats2.+Overall weighted acoustic likelihood per frame was (\\S+) over (\\S+) frames/) { $tot_aclike += $1*$2; $tot_frames1 += $2; } if(m|lattice-to-post.+Overall average log-like/frame is (\\S+) over (\\S+) frames.  
Average acoustic like/frame is (\\S+)|) { $tot_den_lat_like += $1*$2; $tot_frames2 += $2; $tot_den_aclike += $3*$2; } } if (abs($tot_frames1 - $tot_frames2) > 0.01*($tot_frames1 + $tot_frames2)) { print STDERR \"Frame-counts disagree $tot_frames1 versus $tot_frames2\\n\"; } $tot_den_lat_like /= $tot_frames2; $tot_den_aclike /= $tot_frames2; $tot_aclike *= ($acwt / $tot_frames1);  $num_like = $tot_aclike + $tot_den_aclike; $per_frame_objf = $num_like - $tot_den_lat_like; print \"$per_frame_objf $tot_frames1\\n\"; ' $acwt > $dir/tmpf\n  objf=`cat $dir/tmpf | awk '{print $1}'`;\n  nf=`cat $dir/tmpf | awk '{print $2}'`;\n  rm $dir/tmpf\n  impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n  impr=`perl -e \"print ($impr*$acwt/$nf);\"` # We multiply by acwt, and divide by $nf which is the \"real\" number of frames.\n  echo \"Iteration $x: objf was $objf, MMI auxf change was $impr\" | tee $dir/objf.$x.log\n  x=$[$x+1]\ndone\n\necho \"MMI training finished\"\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tandem/train_mmi_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# MMI training (or optionally boosted MMI, if you give the --boost option),\n# for SGMMs.  4 iterations (by default) of Extended Baum-Welch update.\n#\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nboost=0.0\ncancel=true # if true, cancel num and den counts on each frame.\nacwt=0.1\nstage=0\nupdate_opts=\ntransform_dir=\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 6 ]; then\n  echo \"Usage: steps/tandem/train_mmi_sgmm2.sh <data1> <data2> <lang> <ali> <denlats> <exp>\"\n  echo \" e.g.: steps/tandem/train_mmi_sgmm2.sh {mfcc,bottleneck}/data1/train_si84 data1/lang exp/tri2b_ali_si84 exp/tri2b_denlats_si84 exp/tri2b_mmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1), for boosted MMI.  (default 0)\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --transform-dir <transform-dir>                  # directory to find fMLLR transforms.\"\n  exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\nalidir=$4\ndenlatdir=$5\ndir=$6\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data1/feats.scp $alidir/{tree,final.mdl,ali.1.gz} $denlatdir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\ncp $alidir/{final.mdl,tree} $dir\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\n\n\n# Set up features\n\nsdata1=$data1/split$nj\nsdata2=$data2/split$nj\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $alidir/normft2 2>/dev/null`\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $alidir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  this are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add 
transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $alidir/{splice_opts,normft2,tandem} $dir 2>/dev/null\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! -f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" \\\n    && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelse\n  echo \"$0: no fMLLR transforms.\"\nfi\n\nif [ -f $alidir/vecs.1 ]; then\n  echo \"$0: using speaker vectors from $alidir\"\n  spkvecs_opt=\"--spk-vecs=ark:$alidir/vecs.JOB --utt2spk=ark:$sdata1/JOB/utt2spk\"\nelse\n  echo \"$0: no speaker vectors.\"\n  spkvecs_opt=\nfi\n\nif [ -f $alidir/gselect.1.gz ]; then\n  echo \"$0: using Gaussian-selection info from $alidir\"\n  gselect_opt=\"--gselect=ark:gunzip -c $alidir/gselect.JOB.gz|\"\nelse\n  echo \"$0: error: no Gaussian-selection info found\" && exit 1;\nfi\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\n\ncur_mdl=$alidir/final.mdl\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Iteration $x of MMI training\"\n  # Note: the num and den states are accumulated at the same time, so we\n  # can cancel them per frame.\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      sgmm2-rescore-lattice \"$gselect_opt\" $spkvecs_opt $cur_mdl \"$lats\" \"$feats\" ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      sum-post --merge=$cancel --scale1=-1 \\\n      ark:- \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" ark:- \\| \\\n      sgmm2-acc-stats2 \"$gselect_opt\" $spkvecs_opt $cur_mdl 
\"$feats\" ark,s,cs:- \\\n        $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n\n    n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of MMI accumulators $n versus 2*$nj\" && exit 1;\n    $cmd $dir/log/den_acc_sum.$x.log \\\n      sgmm2-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || exit 1;\n    rm $dir/den_acc.$x.*.acc\n    $cmd $dir/log/num_acc_sum.$x.log \\\n      sgmm2-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || exit 1;\n    rm $dir/num_acc.$x.*.acc\n\n    $cmd $dir/log/update.$x.log \\\n     sgmm2-est-ebw $update_opts $cur_mdl $dir/num_acc.$x.acc $dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n  fi\n  cur_mdl=$dir/$[$x+1].mdl\n\n\n  # Some diagnostics: the objective function progress and auxiliary-function\n  # improvement.  Note: this code is same as in train_mmi.sh\n  tail -n 50 $dir/log/acc.$x.*.log | perl -e '$acwt=shift @ARGV; while(<STDIN>) { if(m/sgmm2-acc-stats2.+Overall weighted acoustic likelihood per frame was (\\S+) over (\\S+) frames/) { $tot_aclike += $1*$2; $tot_frames1 += $2; } if(m|lattice-to-post.+Overall average log-like/frame is (\\S+) over (\\S+) frames.  
Average acoustic like/frame is (\\S+)|) { $tot_den_lat_like += $1*$2; $tot_frames2 += $2; $tot_den_aclike += $3*$2; } } if (abs($tot_frames1 - $tot_frames2) > 0.01*($tot_frames1 + $tot_frames2)) { print STDERR \"Frame-counts disagree $tot_frames1 versus $tot_frames2\\n\"; } $tot_den_lat_like /= $tot_frames2; $tot_den_aclike /= $tot_frames2; $tot_aclike *= ($acwt / $tot_frames1);  $num_like = $tot_aclike + $tot_den_aclike; $per_frame_objf = $num_like - $tot_den_lat_like; print \"$per_frame_objf $tot_frames1\\n\"; ' $acwt > $dir/tmpf\n  objf=`cat $dir/tmpf | awk '{print $1}'`;\n  nf=`cat $dir/tmpf | awk '{print $2}'`;\n  rm $dir/tmpf\n  impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n  impr=`perl -e \"print ($impr*$acwt/$nf);\"` # We multiply by acwt, and divide by $nf which is the \"real\" number of frames.\n  echo \"Iteration $x: objf was $objf, MMI auxf change was $impr\" | tee $dir/objf.$x.log\n  x=$[$x+1]\ndone\n\necho \"MMI training finished\"\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tandem/train_mono.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#                 Korbinian Riedhammer\n# Apache 2.0\n\n\n# To be run from ..\n# Flat start and monophone training, with delta-delta features.\n# This script applies cepstral mean normalization (per speaker).\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nnum_iters=40    # Number of iterations of training\nmax_iter_inc=30 # Last iter to increase #Gauss on.\ntotgauss=1000 # Target #Gaussians.  \nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\nrealign_iters=\"1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38\";\nconfig= # name of config file.\nstage=-4\npower=0.2 # exponent to determine number of gaussians from occurrence counts\nnormft2=true # typically, the tandem features will already be normalized due to pca\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/tandem/train_mono.sh [options] <data1-dir> <data2-dir> <lang-dir> <exp-dir>\"\n  echo \" e.g.: steps/tandem/train_mono.sh {mfcc,bottleneck}/data/train.1k data/lang exp/mono\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --normft2 (true|false)                           # apply CMVN to second features?\"\n  exit 1;\nfi\n\ndata1=$1\ndata2=$2\nlang=$3\ndir=$4\n\noov_sym=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\ncp $lang/phones.txt $dir || exit 1;\n\n# Set up features.\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\n# Use deltas on the first stream (most likely this will be MFCCs or similar)\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\"\n\n# Second stream will most likely be bottleneck or posteriors, so normalize\n# if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# paste features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\nexample_feats=\"`echo $feats | sed s/JOB/1/g`\";\n\n# get dimension\nallfeats=$(echo $feats | sed s:JOB:..:g)\nfeat_dim=$(feat-to-dim --print-args=false \"$allfeats\" - 2> $dir/log/feat_dim)\n\n# save stats\necho $feats > $dir/tandem\necho $normft2 > 
$dir/normft2\n\necho \"$0: Initializing monophone system.\"\n\n[ ! -f $lang/phones/sets.int ] && exit 1;\nshared_phones_opt=\"--shared-phones=$lang/phones/sets.int\"\n\nif [ $stage -le -3 ]; then\n# Note: JOB=. makes it use the whole set;  we want that to make sure we have phoneme \n  $cmd JOB=1 $dir/log/init.log \\\n    gmm-init-mono $shared_phones_opt \"--train-feats=$allfeats\" $lang/topo $feat_dim \\\n    $dir/0.mdl $dir/tree || exit 1;\nfi\n\nnumgauss=`gmm-info --print-args=false $dir/0.mdl | grep gaussians | awk '{print $NF}'`\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter increment for #Gauss\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Compiling training graphs\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/0.mdl  $lang/L.fst \\\n    \"ark:sym2int.pl --map-oov $oov_sym -f 2- $lang/words.txt < $sdata1/JOB/text|\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"$0: Aligning data equally (pass 0)\"\n  $cmd JOB=1:$nj $dir/log/align.0.JOB.log \\\n    align-equal-compiled \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" ark,t:-  \\| \\\n    gmm-acc-stats-ali --binary=true $dir/0.mdl \"$feats\" ark:- \\\n    $dir/0.JOB.acc || exit 1;\nfi\n\n# In the following steps, the --min-gaussian-occupancy=3 option is important, otherwise\n# we fail to est \"rare\" phones and later on, they never align properly.\n\nif [ $stage -le 0 ]; then\n  gmm-est --min-gaussian-occupancy=3  --mix-up=$numgauss --power=$power \\\n    $dir/0.mdl \"gmm-sum-accs - $dir/0.*.acc|\" $dir/1.mdl 2> $dir/log/update.0.log || exit 1;\n  rm $dir/0.*.acc\nfi\n\n\nbeam=6 # will change to 10 below after 1st pass\n# note: using slightly wider beams for WSJ vs. 
RM.\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: Pass $x\"\n  if [ $stage -le $x ]; then\n    if echo $realign_iters | grep -w $x >/dev/null; then\n      echo \"$0: Aligning data\"\n      mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n      $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n        gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$[$beam*4] \"$mdl\" \\\n        \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \"ark,t:|gzip -c >$dir/ali.JOB.gz\" \\\n        || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \"ark:gunzip -c $dir/ali.JOB.gz|\" \\\n      $dir/$x.JOB.acc || exit 1;\n\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss --power=$power $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs 2>/dev/null\n  fi\n  if [ $x -le $max_iter_inc ]; then\n     numgauss=$[$numgauss+$incgauss];\n  fi\n  beam=10\n  x=$[$x+1]\ndone\n\n( cd $dir; rm final.{mdl,occs} 2>/dev/null; ln -s $x.mdl final.mdl; ln -s $x.occs final.occs )\n\nutils/summarize_warnings.pl $dir/log\n\necho \"Done training tandem mono-phone system in $dir\"\n\n# example of showing the alignments:\n# show-alignments data/lang/phones.txt $dir/30.mdl \"ark:gunzip -c $dir/ali.0.gz|\" | head -4\n\n"
  },
  {
    "path": "egs/steps/tandem/train_sat.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# This does Speaker Adapted Training (SAT), i.e. train on\n# fMLLR-adapted features.  It can be done on top of either LDA+MLLT, or\n# delta and delta-delta features.  If there are no transforms supplied\n# in the alignment directory, it will estimate transforms itself before\n# building the tree (and in any case, it estimates transforms a number\n# of times during training).\n\n\n# Begin configuration section.\nstage=-5\nfmllr_update_type=full\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\nrealign_iters=\"10 20 30\";\nfmllr_iters=\"2 4 6 12\";\nsilence_weight=0.0 # Weight on silence in fMLLR estimation.\nnum_iters=35   # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\npower=0.2 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nnormft2=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 7 ]; then\n  echo \"Usage: steps/tandem/train_sat.sh <#leaves> <#gauss> <data1> <data2> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/tandem/train_sat.sh 2500 15000 {mfcc,bottleneck}/data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri3b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata1=$3\ndata2=$4\nlang=$5\nalidir=$6\ndir=$7\n\nfor f in $data1/feats.scp $data2/feats.scp $lang/phones.txt $alidir/final.mdl $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"train_tandem_sat.sh: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc]  # per-iter #gauss increment\noov=`cat $lang/oov.int`\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\n\n\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\n# Set up features.\n\n# We will use the same settings as with the alidir\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $alidir/normft2 2>/dev/null`\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp 
$alidir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  this are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nsifeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  sifeats=\"$sifeats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $alidir/{splice_opts,tandem,normft2} $dir 2>/dev/null\n\n\n\n## Get initial fMLLR transforms (possibly from alignment dir)\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: Using transforms from $alidir\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n  cur_trans_dir=$alidir\nelse\n  if [ $stage -le -4 ]; then\n    echo \"$0: obtaining initial fMLLR transforms since not present in $alidir\"\n    $cmd JOB=1:$nj $dir/log/fmllr.0.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata1/JOB/spk2utt 
$alidir/final.mdl \"$sifeats\" \\\n      ark:- ark:$dir/trans.JOB || exit 1;\n  fi\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\n  cur_trans_dir=$dir\nfi\n\nif [ $stage -le -3 ]; then\n  # Get tree stats.\n  echo \"$0: Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"$0: This is a bad warning.\";\n\n  rm $dir/treeacc\nfi\n\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  
echo \"$0: Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata1/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n   echo Pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n\n  if echo $fmllr_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo Estimating fMLLR transforms\n      # We estimate a transform that's additional to the previous transform;\n      # we'll compose them.\n      $cmd JOB=1:$nj $dir/log/fmllr.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n        weight-silence-post $silence_weight $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n        --spk2utt=ark:$sdata1/JOB/spk2utt $dir/$x.mdl \\\n        \"$feats\" ark:- ark:$dir/tmp_trans.JOB || exit 1;\n      for n in `seq $nj`; do\n        ! 
( compose-transforms --b-is-affine=true \\\n          ark:$dir/tmp_trans.$n ark:$cur_trans_dir/trans.$n ark:$dir/composed_trans.$n \\\n          && mv $dir/composed_trans.$n $dir/trans.$n && \\\n          rm $dir/tmp_trans.$n ) 2>$dir/log/compose_transforms.$x.log \\\n          && echo \"$0: Error composing transforms\" && exit 1;\n      done\n    fi\n    feats=\"$sifeats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n    cur_trans_dir=$dir\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --power=$power --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\n\nif [ $stage -le $x ]; then\n  # Accumulate stats for \"alignment model\"-- this model is\n  # computed with the speaker-independent features, but matches Gaussian-for-Gaussian\n  # with the final speaker-adapted model.\n  $cmd JOB=1:$nj $dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n    gmm-acc-stats-twofeats $dir/$x.mdl \"$feats\" \"$sifeats\" \\\n    ark,s,cs:- $dir/$x.JOB.acc || exit 1;\n  [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n  # Update model.\n  $cmd $dir/log/est_alimdl.log \\\n    gmm-est --power=$power --remove-low-count-gaussians=false $dir/$x.mdl \\\n    \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$x.alimdl  || exit 1;\n  rm $dir/$x.*.acc\nfi\n\nrm $dir/final.{mdl,alimdl,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s 
$x.alimdl $dir/final.alimdl\n\n\n\nutils/summarize_warnings.pl $dir/log\n(\n  echo \"$0: Likelihood evolution:\"\n  for x in `seq $[$num_iters-1]`; do\n    tail -n 30 $dir/log/acc.$x.*.log | awk '/Overall avg like/{l += $(NF-3)*$(NF-1); t += $(NF-1); }\n        /Overall average logdet/{d += $(NF-3)*$(NF-1); t2 += $(NF-1);}\n        END{ d /= t2; l /= t; printf(\"%s \", d+l); } '\n  done\n  echo\n) | tee $dir/log/summary.log\n\necho Done\n"
  },
  {
    "path": "egs/steps/tandem/train_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#                 Korbinian Riedhammer\n\n# SGMM training, with speaker vectors.  This script would normally be called on\n# top of fMLLR features obtained from a conventional system, but it also works\n# on top of any type of speaker-independent features (based on\n# deltas+delta-deltas or LDA+MLLT).  For more info on SGMMs, see the paper \"The\n# subspace Gaussian mixture model--A structured model for speech recognition\".\n# (Computer Speech and Language, 2011).\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nstage=-6 # use this to resume partially finished training\ncontext_opts= # e.g. set it to \"--context-width=5 --central-position=2\"  for a\n# quinphone system.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nnum_iters=25   # Total number of iterations of training\nnum_iters_alimdl=3 # Number of iterations for estimating alignment model.\nmax_iter_inc=15 # Last iter to increase #substates on.\nrealign_iters=\"5 10 15\"; # Iters to realign on.\nspkvec_iters=\"5 8 12 17\" # Iters to estimate speaker vectors on.\nincrease_iters=\"6 10 14\"; # Iters on which to increase phn dim and/or spk dim;\n    # rarely necessary, and if it is, only the 1st will normally be necessary.\nrand_prune=0.1 # Randomized-pruning parameter for posteriors, to speed up training.\n               # Bigger -> more pruning; zero = no pruning.\nphn_dim=  # You can use this to set the phonetic subspace dim. [default: feat-dim+1]\nspk_dim=  # You can use this to set the speaker subspace dim. 
[default: feat-dim]\npower=0.2 # Exponent for number of gaussians according to occurrence counts\nbeam=8\nself_weight=0.9\nretry_beam=40\nleaves_per_group=5 # Relates to the SCTM (state-clustered tied-mixture) aspect:\n                   # average number of pdfs in a \"group\" of pdfs.\nupdate_m_iter=4\nspk_dep_weights=true # [Symmetric SGMM] set this to false if you don't want \"u\" (i.e. to turn off\n                      # symmetric SGMM.\nnormft2=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 8 ]; then\n  echo \"Usage: steps/tandem/train_sgmm2.sh <num-leaves> <num-substates> <data1> <data2> <lang> <ali-dir> <ubm> <exp-dir>\"\n  echo \" e.g.: steps/tandem/train_sgmm2.sh 5000 8000 {mfcc,bottleneck}/data/train_si84 data/lang \\\\\"\n  echo \"                      exp/tri3b_ali_si84 exp/ubm4a/final.ubm exp/sgmm4a\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --silence-weight <sil-weight>                    # weight for silence (e.g. 0.5 or 0.0)\"\n  echo \"  --num-iters <#iters>                             # Number of iterations of E-M\"\n  echo \"  --leaves-per-group <#leaves>                     # Average #leaves shared in one group\"\n  exit 1;\nfi\n\nnum_pdfs=$1  # final #leaves, at 2nd level of tree.\ntotsubstates=$2\ndata1=$3\ndata2=$4\nlang=$5\nalidir=$6\nubm=$7\ndir=$8\n\nnum_groups=$[$num_pdfs/$leaves_per_group]\nfirst_spkvec_iter=`echo $spkvec_iters | awk '{print $1}'` || exit 1;\n\n# Check some files.\nfor f in $data1/feats.scp $data2/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $ubm; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\noov=`cat $lang/oov.int`\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1; # used by acc-tree-stats below\nif [ \"$self_weight\" == \"1.0\" ]; then\n  numsubstates=$num_groups # Initial #-substates.\nelse\n  numsubstates=$num_pdfs # Initial #-substates.\nfi\nincsubstates=$[($totsubstates-$numsubstates)/$max_iter_inc] # per-iter increment for #substates\nfeat_dim=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/feature dimension/{print $NF}'` || exit 1;\n[ $feat_dim -eq $feat_dim ] || exit 1; # make sure it's numeric.\n[ -z $phn_dim ] && phn_dim=$[$feat_dim+1]\n[ -z $spk_dim ] && spk_dim=$feat_dim\nnj=`cat $alidir/num_jobs` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\nspkvecs_opt=  # Empty option for now, until we estimate the speaker vectors.\ngselect_opt=\"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\"\n\n## Set up features.\n\n\n# We will use the same settings as with the alidir\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $alidir/normft2 2>/dev/null`\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $alidir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  these are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp 
scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $alidir/{splice_opts,tandem,normft2} $dir 2>/dev/null\n\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\nfi\n##\n\n\nif [ $stage -le -6 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-stats\" && exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -5 ]; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 
2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree-two-level --binary=false --verbose=1 --max-leaves-first=$num_groups \\\n     --max-leaves-second=$num_pdfs $dir/treeacc $lang/phones/roots.int \\\n     $dir/questions.qst $lang/topo $dir/tree $dir/pdf2group.map || exit 1;\nfi\n\nif [ $stage -le -4 ]; then\n  echo \"$0: Initializing the model\"\n  # Note: if phn_dim > feat_dim+1 or spk_dim > feat_dim, these dims\n  # will be truncated on initialization.\n  $cmd $dir/log/init_sgmm.log \\\n    sgmm2-init --spk-dep-weights=$spk_dep_weights --self-weight=$self_weight \\\n       --pdf-map=$dir/pdf2group.map --phn-space-dim=$phn_dim \\\n       --spk-space-dim=$spk_dim $lang/topo $dir/tree $ubm $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"$0: doing Gaussian selection\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    sgmm2-gselect $dir/0.mdl \"$feats\" \\\n    \"ark,t:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: compiling training graphs\"\n  text=\"ark:sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata1/JOB/text|\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/0.mdl  $lang/L.fst  \\\n    \"$text\" \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"$0: converting alignments\"\n  $cmd JOB=1:$nj $dir/log/convert_ali.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/0.mdl $dir/tree \"ark:gunzip -c $alidir/ali.JOB.gz|\" \\\n    \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n   echo \"$0: training pass $x ... 
\"\n   if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n     echo \"$0: re-aligning data\"\n     $cmd JOB=1:$nj $dir/log/align.$x.JOB.log  \\\n       sgmm2-align-compiled $spkvecs_opt $scale_opts \"$gselect_opt\" \\\n       --utt2spk=ark:$sdata1/JOB/utt2spk --beam=$beam --retry-beam=$retry_beam \\\n       $dir/$x.mdl \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n       \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n   fi\n   if [ $spk_dim -gt 0 ] && echo $spkvec_iters | grep -w $x >/dev/null; then\n     if [ $stage -le $x ]; then\n       $cmd JOB=1:$nj $dir/log/spkvecs.$x.JOB.log \\\n         ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n         weight-silence-post 0.01 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n         sgmm2-est-spkvecs --rand-prune=$rand_prune --spk2utt=ark:$sdata1/JOB/spk2utt \\\n         $spkvecs_opt \"$gselect_opt\" $dir/$x.mdl \"$feats\" ark,s,cs:- \\\n         ark:$dir/tmp_vecs.JOB '&&' mv $dir/tmp_vecs.JOB $dir/vecs.JOB || exit 1;\n     fi\n     spkvecs_opt=\"--spk-vecs=ark:$dir/vecs.JOB\"\n   fi\n   if [ $x -eq 0 ]; then\n     flags=vwcSt # on the first iteration, don't update projections M or N\n   elif [ $spk_dim -gt 0 -a $[$x%2] -eq 1 -a $x -ge $first_spkvec_iter ]; then\n     # Update N if we have speaker-vector space and x is odd,\n     # and we've already updated the speaker vectors...\n     flags=vNwSct\n   else\n     if [ $x -ge $update_m_iter ]; then\n       flags=vMwSct # update M.\n     else\n       flags=vwSct # no M on early iters, if --update-m-iter option given.\n     fi\n   fi\n   $spk_dep_weights && [ $x -ge $first_spkvec_iter ] && flags=${flags}u; # update\n   # spk-weight projections \"u\".\n\n   if [ $stage -le $x ]; then\n     $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n       sgmm2-acc-stats $spkvecs_opt --utt2spk=ark:$sdata1/JOB/utt2spk \\\n       --update-flags=$flags \"$gselect_opt\" --rand-prune=$rand_prune \\\n       $dir/$x.mdl \"$feats\" \"ark,s,cs:gunzip -c 
$dir/ali.JOB.gz | ali-to-post ark:- ark:-|\" \\\n       $dir/$x.JOB.acc || exit 1;\n   fi\n\n   # The next option is needed if the user specifies a phone or speaker sub-space\n   # dimension that's higher than the \"normal\" one.\n   increase_dim_opts=\n   if echo $increase_iters | grep -w $x >/dev/null; then\n     increase_dim_opts=\"--increase-phn-dim=$phn_dim --increase-spk-dim=$spk_dim\"\n     # Note: the command below might have a null effect on some iterations.\n     if [ $spk_dim -gt $feat_dim ]; then\n       $cmd JOB=1:$nj $dir/log/copy_vecs.$x.JOB.log \\\n         copy-vector --print-args=false --change-dim=$spk_dim \\\n         ark:$dir/vecs.JOB ark:$dir/vecs_tmp.JOB '&&' \\\n         mv $dir/vecs_tmp.JOB $dir/vecs.JOB || exit 1;\n     fi\n   fi\n\n   if [ $stage -le $x ]; then\n     $cmd $dir/log/update.$x.log \\\n       sgmm2-est --update-flags=$flags --split-substates=$numsubstates \\\n       $increase_dim_opts --power=$power --write-occs=$dir/$[$x+1].occs \\\n       $dir/$x.mdl \"sgmm2-sum-accs - $dir/$x.*.acc|\" $dir/$[$x+1].mdl || exit 1;\n     rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs 2>/dev/null\n   fi\n   if [ $x -lt $max_iter_inc ]; then\n     numsubstates=$[$numsubstates+$incsubstates]\n   fi\n   x=$[$x+1];\ndone\n\nrm $dir/final.mdl $dir/final.occs 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\n\nif [ $spk_dim -gt 0 ]; then\n  # We need to create an \"alignment model\" that's been trained\n  # without the speaker vectors, to do the first-pass decoding with\n  # at test time.\n\n  # We do this for a few iters, in this recipe.\n  final_mdl=$dir/$x.mdl\n  cur_alimdl=$dir/$x.mdl\n  while [ $x -lt $[$num_iters+$num_iters_alimdl] ]; do\n    echo \"$0: building alignment model (pass $x)\"\n    if [ $x -eq $num_iters ]; then # 1st pass of building alimdl.\n      flags=MwcS # don't update v the first time.  
Note-- we never update transitions.\n      # they wouldn't change anyway as we use the same alignment as previously.\n    else\n      flags=vMwcS\n    fi\n    if [ $stage -le $x ]; then\n      $cmd JOB=1:$nj $dir/log/acc_ali.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n        sgmm2-post-to-gpost $spkvecs_opt \"$gselect_opt\" \\\n         --utt2spk=ark:$sdata1/JOB/utt2spk $final_mdl \"$feats\" ark,s,cs:- ark:- \\| \\\n        sgmm2-acc-stats-gpost --rand-prune=$rand_prune --update-flags=$flags \\\n          $cur_alimdl \"$feats\" ark,s,cs:- $dir/$x.JOB.aliacc || exit 1;\n      $cmd $dir/log/update_ali.$x.log \\\n        sgmm2-est --update-flags=$flags --remove-speaker-space=true --power=$power \\\n        $cur_alimdl \"sgmm2-sum-accs - $dir/$x.*.aliacc|\" $dir/$[$x+1].alimdl || exit 1;\n      rm $dir/$x.*.aliacc || exit 1;\n      [ $x -gt $num_iters ]  && rm $dir/$x.alimdl\n    fi\n    cur_alimdl=$dir/$[$x+1].alimdl\n    x=$[$x+1]\n  done\n  rm $dir/final.alimdl 2>/dev/null\n  ln -s $x.alimdl $dir/final.alimdl\nfi\n\nutils/summarize_warnings.pl $dir/log\n\necho Done\n"
  },
  {
    "path": "egs/steps/tandem/train_ubm.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This trains a UBM (i.e. a mixture of Gaussians), by clustering\n# the Gaussians from a trained HMM/GMM system and then doing a few\n# iterations of UBM training.\n# We mostly use this for SGMM systems.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nsilence_weight=  # You can set it to e.g. 0.0, to weight down silence in training.\nstage=-2\nnum_gselect1=50 # first stage of Gaussian-selection\nnum_gselect2=25 # second stage.\nintermediate_num_gauss=2000\nnum_iters=3\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/tandem/train_ubm.sh <num-gauss> <data1> <data2> <lang> <ali-dir> <exp>\"\n  echo \" e.g.: steps/tandem/train_ubm.sh 400 {mfcc,bottleneck}/data/train_si84 data/lang exp/tri2b_ali_si84 exp/ubm3c\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --silence-weight <sil-weight>                    # weight for silence (e.g. 0.5 or 0.0)\"\n  echo \"  --num-iters <#iters>                             # Number of iterations of E-M\"\n  exit 1;\nfi\n\nnum_gauss=$1\ndata1=$2\ndata2=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $data1/feats.scp $data2/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl; do\n  [ ! 
-f $f ] && echo \"No such file $f\" && exit 1;\ndone\n\nif [ $[$num_gauss*2] -gt $intermediate_num_gauss ]; then\n  echo \"intermediate_num_gauss was too small $intermediate_num_gauss\"\n  intermediate_num_gauss=$[$num_gauss*2];\n  echo \"setting it to $intermediate_num_gauss\"\nfi\n\n\n# Set various variables.\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata1=$data1/split$nj;\nsdata2=$data2/split$nj;\n\n[[ -d $sdata1 && $data1/feats.scp -ot $sdata1 ]] || split_data.sh $data1 $nj || exit 1;\n[[ -d $sdata2 && $data2/feats.scp -ot $sdata2 ]] || split_data.sh $data2 $nj || exit 1;\n\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\nnormft2=`cat $alidir/normft2 2>/dev/null`\n\n## Set up features.\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\n\ncase $feat_type in\n  delta)\n    echo \"$0: feature type is $feat_type\"\n    ;;\n  lda)\n    echo \"$0: feature type is $feat_type\"\n    cp $alidir/{lda,final}.mat $dir/ || exit 1;\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n# set up feature stream 1;  this are usually spectral features, so we will add\n# deltas or splice them\nfeats1=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata1/JOB/utt2spk scp:$sdata1/JOB/cmvn.scp scp:$sdata1/JOB/feats.scp ark:- |\"\n\nif [ \"$feat_type\" == \"delta\" ]; then\n  feats1=\"$feats1 add-deltas ark:- ark:- |\"\nelif [ \"$feat_type\" == \"lda\" ]; then\n  feats1=\"$feats1 splice-feats $splice_opts ark:- ark:- | transform-feats $dir/lda.mat ark:- ark:- |\"\nfi\n\n# set up feature stream 2;  this are usually bottleneck or posterior features,\n# which may be normalized if desired\nfeats2=\"scp:$sdata2/JOB/feats.scp\"\n\nif [ \"$normft2\" == \"true\" ]; then\n  
feats2=\"ark,s,cs:apply-cmvn --norm-vars=false --utt2spk=ark:$sdata2/JOB/utt2spk scp:$sdata2/JOB/cmvn.scp $feats2 ark:- |\"\nfi\n\n# assemble tandem features\nfeats=\"ark,s,cs:paste-feats '$feats1' '$feats2' ark:- |\"\n\n# add transformation, if applicable\nif [ \"$feat_type\" == \"lda\" ]; then\n  feats=\"$feats transform-feats $dir/final.mat ark:- ark:- |\"\nfi\n\n# splicing/normalization options\ncp $alidir/{splice_opts,tandem,normft2} $dir 2>/dev/null\n\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata1/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\nfi\n##\n\nif [ ! -z \"$silence_weight\" ]; then\n  weights_opt=\"--weights='ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- | weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- | post-to-weights ark:- ark:- |'\"\nelse\n  weights_opt=\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: clustering model $alidir/final.mdl to get initial UBM\"\n  $cmd $dir/log/cluster.log \\\n    init-ubm --intermediate-num-gauss=$intermediate_num_gauss --ubm-num-gauss=$num_gauss \\\n    --verbose=2 --fullcov-ubm=true $alidir/final.mdl $alidir/final.occs \\\n    $dir/0.ubm   || exit 1;\nfi\n\n# Do initial phase of Gaussian selection and save it to disk -- later on we'll\n# do more Gaussian selection to further prune, as the model changes.\n\n\nif [ $stage -le -1 ]; then\n  echo \"$0: doing Gaussian selection\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$num_gselect1 \"fgmm-global-to-gmm $dir/0.ubm - |\" \"$feats\" \\\n    \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Pass $x\"\n  $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n    gmm-gselect --n=$num_gselect2 \"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" \\\n    \"fgmm-global-to-gmm $dir/$x.ubm - |\" \"$feats\" ark:- \\| \\\n    fgmm-global-acc-stats $weights_opt 
--gselect=ark,s,cs:- $dir/$x.ubm \"$feats\" \\\n    $dir/$x.JOB.acc || exit 1;\n  lowcount_opt=\"--remove-low-count-gaussians=false\"\n  [ $[$x+1] -eq $num_iters ] && lowcount_opt=   # Only remove low-count Gaussians\n  # on last iter-- we can't do it earlier, or the Gaussian-selection info would\n  # be mismatched.\n  $cmd $dir/log/update.$x.log \\\n    fgmm-global-est $lowcount_opt --verbose=2 $dir/$x.ubm \"fgmm-global-sum-accs - $dir/$x.*.acc |\" \\\n      $dir/$[$x+1].ubm || exit 1;\n  rm $dir/$x.*.acc $dir/$x.ubm\n  x=$[$x+1]\ndone\n\nrm $dir/gselect.*.gz\nrm $dir/final.ubm 2>/dev/null\nmv $dir/$x.ubm $dir/final.ubm || exit 1;\n"
  },
  {
    "path": "egs/steps/tfrnnlm/check_py.py",
    "content": "import numpy as np\nimport tensorflow as tf\n"
  },
  {
    "path": "egs/steps/tfrnnlm/check_tensorflow_installed.sh",
    "content": "#!/usr/bin/env bash\n\n# This script checks whether TF is installed for use with python\n#                    and whether the TF-related binaries in kaldi are ready to use\n. ./path.sh\n\nif which lattice-lmrescore-tf-rnnlm >/dev/null 2>&1; then\n  echo TensorFlow related binaries found. This is good.\nelse\n  echo TF related binaries not compiled.\n  echo You need to go to tools/ and run extras/install_tensorflow_cc.sh first\n  echo and then do \\\"make\\\" under both src/tfrnnlm and src/tfrnnlmbin\n  exit 1\nfi\n\necho\n\nif python steps/tfrnnlm/check_py.py 2>/dev/null; then\n  echo TensorFlow ready to use on the python side. This is good.\nelse\n  echo TensorFlow not found on the python side.\n  echo Please go to tools/ and run extras/install_tensorflow_py.sh to install it\n  echo If you already have TensorFlow installed somewhere else, you would need\n  echo to add it to your PATH\n  exit 1\nfi\n"
  },
  {
    "path": "egs/steps/tfrnnlm/lmrescore_rnnlm_lat.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2015  Guoguo Chen\n#           2017  Hainan Xu\n# Apache 2.0\n\n# This script rescores lattices with RNNLM trained with TensorFlow.\n# A faster and more accurate version of the algorithm is at\n# steps/tfrnnlm/lmrescore_rnnlm_lat_pruned.sh, which is preferred.\n# One example recipe of this script is at egs/ami/s5/local/tfrnnlm/run_lstm_fast.sh\n\n# Begin configuration section.\ncmd=run.pl\nskip_scoring=false\nmax_ngram_order=4 # Approximate the lattice-rescoring by limiting the max-ngram-order\n                  # if it's set, it merges histories in the lattice if they share\n                  # the same ngram history and this prevents the lattice from \n                  # exploding exponentially. Details of the n-gram approximation\n                  # method are described in section 2.3 of the paper\n                  # http://www.danielpovey.com/files/2018_icassp_lattice_pruning.pdf\nweight=0.5  # Interpolation weight for RNNLM.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# != 5 ]; then\n   echo \"Does language model rescoring of lattices (remove old LM, add new LM)\"\n   echo \"with TensorFlow RNNLM.\"\n   echo \"\"\n   echo \"Usage: $0 [options] <old-lang-dir> <rnnlm-dir> \\\\\"\n   echo \"                   <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \" e.g.: $0 data/lang_tg data/tensorflow_lstm data/test \\\\\"\n   echo \"                   exp/tri3/test_tg exp/tri3/test_tfrnnlm\"\n   echo \"options: [--cmd (run.pl|queue.pl [queue opts])]\"\n   exit 1;\nfi\n\n[ -f path.sh ] && . ./path.sh;\n\noldlang=$1\nrnnlm_dir=$2\ndata=$3\nindir=$4\noutdir=$5\n\noldlm=$oldlang/G.fst\nif [ -f $oldlang/G.carpa ]; then\n  oldlm=$oldlang/G.carpa\nelif [ ! -f $oldlm ]; then\n  echo \"$0: expecting either $oldlang/G.fst or $oldlang/G.carpa to exist\" &&\\\n    exit 1;\nfi\n\necho \"$0: using $oldlm as old LM\"\n\n[ ! 
-d $rnnlm_dir/rnnlm ] && echo \"$0: Missing tf model folder $rnnlm_dir/rnnlm\" && exit 1;\n\nfor f in $rnnlm_dir/unk.probs $oldlang/words.txt $indir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: Missing file $f\" && exit 1\ndone\n\nawk -v n=$0 -v w=$weight 'BEGIN {if (w < 0 || w > 1) {\n  print n\": Interpolation weight should be in the range of [0, 1]\"; exit 1;}}' \\\n  || exit 1;\n\noldlm_command=\"fstproject --project_output=true $oldlm |\"\n\nmkdir -p $outdir/log\nnj=`cat $indir/num_jobs` || exit 1;\ncp $indir/num_jobs $outdir\n\noldlm_weight=`perl -e \"print -1.0 * $weight;\"`\nif [ \"$oldlm\" == \"$oldlang/G.fst\" ]; then\n  $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n    lattice-lmrescore --lm-scale=$oldlm_weight \\\n    \"ark:gunzip -c $indir/lat.JOB.gz|\" \"$oldlm_command\" ark:-  \\| \\\n    lattice-lmrescore-tf-rnnlm --lm-scale=$weight \\\n    --max-ngram-order=$max_ngram_order \\\n    $rnnlm_dir/unk.probs $rnnlm_dir/wordlist.rnn.final $oldlang/words.txt ark:- \"$rnnlm_dir/rnnlm\" \\\n    \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" || exit 1;\nelse\n  $cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n    lattice-lmrescore-const-arpa --lm-scale=$oldlm_weight \\\n    \"ark:gunzip -c $indir/lat.JOB.gz|\" \"$oldlm\" ark:-  \\| \\\n    lattice-lmrescore-tf-rnnlm --lm-scale=$weight \\\n    --max-ngram-order=$max_ngram_order \\\n    $rnnlm_dir/unk.probs $rnnlm_dir/wordlist.rnn.final $oldlang/words.txt ark:- \"$rnnlm_dir/rnnlm\" \\\n    \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" || exit 1;\nfi\nif ! $skip_scoring ; then\n  err_msg=\"$0: Not scoring because local/score.sh does not exist or not executable.\"\n  [ ! -x local/score.sh ] && echo $err_msg && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $oldlang $outdir\nelse\n  echo \"$0: Not scoring because --skip-scoring was specified.\"\nfi\n\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/tfrnnlm/lmrescore_rnnlm_lat_pruned.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2015  Guoguo Chen\n#           2017  Hainan Xu\n# Apache 2.0\n\n# This script rescores lattices with RNNLM trained with TensorFlow.\n# It uses a pruned algorithm to speed up the runtime and improve the accuracy;\n# it is an improved version of steps/tfrnnlm/lmrescore_rnnlm_lat.sh\n# and uses the exact same interface.\n# The details of the pruning algorithm are described in\n# http://www.danielpovey.com/files/2018_icassp_lattice_pruning.pdf\n# One example recipe of this script is at egs/ami/s5/local/tfrnnlm/run_lstm_fast.sh\n\n# Begin configuration section.\ncmd=run.pl\nskip_scoring=false\nmax_ngram_order=4 # Approximate the lattice-rescoring by limiting the max-ngram-order\n                  # if it's set, it merges histories in the lattice if they share\n                  # the same ngram history and this prevents the lattice from \n                  # exploding exponentially. Details of the n-gram approximation\n                  # method are described in section 2.3 of the paper\n                  # http://www.danielpovey.com/files/2018_icassp_lattice_pruning.pdf\nacwt=0.1\nweight=0.5  # Interpolation weight for RNNLM.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# != 5 ]; then\n   echo \"Does language model rescoring of lattices (remove old LM, add new LM)\"\n   echo \"with RNNLM.\"\n   echo \"\"\n   echo \"Usage: $0 [options] <old-lang-dir> <rnnlm-dir> \\\\\"\n   echo \"                   <data-dir> <input-decode-dir> <output-decode-dir>\"\n   echo \" e.g.: $0 data/lang_tg data/tensorflow_lstm data/test \\\\\"\n   echo \"                   exp/tri3/test_tg exp/tri3/test_tfrnnlm\"\n   echo \"options: [--cmd (run.pl|queue.pl [queue opts])]\"\n   exit 1;\nfi\n\n[ -f path.sh ] && . 
./path.sh;\n\noldlang=$1\nrnnlm_dir=$2\ndata=$3\nindir=$4\noutdir=$5\n\noldlm=$oldlang/G.fst\ncarpa_option=\n\nif [ -f $oldlang/G.carpa ]; then\n  oldlm=$oldlang/G.carpa\n  carpa_option=\"--use-const-arpa=true\"\nelif [ ! -f $oldlm ]; then\n  echo \"$0: expecting either $oldlang/G.fst or $oldlang/G.carpa to exist\" &&\\\n    exit 1;\nfi\n\necho \"$0: using $oldlm as old LM\"\n\n[ ! -d $rnnlm_dir/rnnlm ] && echo \"$0: Missing tf model folder $rnnlm_dir/rnnlm\" && exit 1;\n\nfor f in $rnnlm_dir/unk.probs $oldlang/words.txt $indir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: Missing file $f\" && exit 1\ndone\n\nawk -v n=$0 -v w=$weight 'BEGIN {if (w < 0 || w > 1) {\n  print n\": Interpolation weight should be in the range of [0, 1]\"; exit 1;}}' \\\n  || exit 1;\n\nmkdir -p $outdir/log\nnj=`cat $indir/num_jobs` || exit 1;\ncp $indir/num_jobs $outdir\n\n$cmd JOB=1:$nj $outdir/log/rescorelm.JOB.log \\\n  lattice-lmrescore-tf-rnnlm-pruned --lm-scale=$weight \\\n  --acoustic-scale=$acwt --max-ngram-order=$max_ngram_order \\\n  $carpa_option $oldlm $oldlang/words.txt \\\n  $rnnlm_dir/unk.probs $rnnlm_dir/wordlist.rnn.final \"$rnnlm_dir/rnnlm\" \\\n  \"ark:gunzip -c $indir/lat.JOB.gz|\" \"ark,t:|gzip -c>$outdir/lat.JOB.gz\" || exit 1;\n\nif ! $skip_scoring ; then\n  err_msg=\"$0: Not scoring because local/score.sh does not exist or not executable.\"\n  [ ! -x local/score.sh ] && echo $err_msg && exit 1;\n  local/score.sh --cmd \"$cmd\" $data $oldlang $outdir\nelse\n  echo \"$0: Not scoring because --skip-scoring was specified.\"\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/tfrnnlm/lstm.py",
    "content": "# Copyright 2015 The TensorFlow Authors. All Rights Reserved.\n# Copyright (C) 2017 Intellisist, Inc. (Author: Hainan Xu)\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n# this script trains a vanilla RNNLM with TensorFlow. \n# to call the script, do\n# python steps/tfrnnlm/lstm.py --data_path=$datadir \\\n#        --save_path=$savepath --vocab_path=$rnn.wordlist [--hidden-size=$size]\n#\n# One example recipe is at egs/ami/s5/local/tfrnnlm/run_lstm.sh\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport absl\nimport absl.flags as flags\nimport tensorflow as tf\n\nimport reader\n\nflags.DEFINE_integer(\"hidden_size\", 200, \"hidden dim of RNN\")\n\nflags.DEFINE_string(\"data_path\", None,\n                    \"Where the training/test data is stored.\")\nflags.DEFINE_string(\"vocab_path\", None,\n                    \"Where the wordlist file is stored.\")\nflags.DEFINE_string(\"save_path\", \"export\",\n                    \"Model output directory.\")\nflags.DEFINE_bool(\"use_fp16\", False,\n                  \"Train using 16-bit floats instead of 32bit floats\")\n\nFLAGS = flags.FLAGS\n\n\nclass Config(object):\n  init_scale = 0.1\n  learning_rate = 1.0\n  max_grad_norm = 5\n  num_layers = 2\n  num_steps = 20\n  hidden_size = 200\n  max_epoch = 4\n  max_max_epoch = 13\n  keep_prob = 1.0\n  
lr_decay = 0.5\n  batch_size = 64\n\n\ndef data_type():\n  return tf.float16 if FLAGS.use_fp16 else tf.float32\n\n\nclass RNNLMModel(tf.Module):\n  \"\"\"The RNN model itself.\"\"\"\n\n  def __init__(self, config, logits_bias_initializer=None):\n    super().__init__()\n    self._config = config\n\n    size = config.hidden_size\n    vocab_size = config.vocab_size\n    dt = data_type()\n\n    def lstm_cell():\n      return tf.keras.layers.LSTMCell(size, dtype=dt, unit_forget_bias=False)\n\n    def add_dropout(cell):\n      if config.keep_prob < 1:\n        cell = tf.nn.RNNCellDropoutWrapper(cell=cell, output_keep_prob=config.keep_prob)\n      return cell\n\n    self.embedding = tf.keras.layers.Embedding(vocab_size, size, dtype=dt)\n    self.cells = [lstm_cell() for _ in range(config.num_layers)]\n    self.rnn = tf.keras.layers.RNN(self.cells, return_sequences=True)\n\n    if logits_bias_initializer is None:\n      logits_bias_initializer = 'zeros'\n    self.fc = tf.keras.layers.Dense(vocab_size, bias_initializer=logits_bias_initializer)\n\n    # only used in training\n    self.training_cells = [add_dropout(cell) for cell in self.cells]\n    self.training_rnn = tf.keras.layers.RNN(self.training_cells, return_sequences=True)\n\n  def get_logits(self, word_ids, is_training=False):\n    rnn = self.training_rnn if is_training else self.rnn\n    inputs = self.embedding(word_ids)\n    if is_training and self._config.keep_prob < 1:\n      inputs = tf.nn.dropout(inputs, 1 - self._config.keep_prob)\n    rnn_out = rnn(inputs)\n    logits = self.fc(rnn_out)\n    return logits\n\n  def get_loss(self, word_ids, labels, is_training=False):\n    logits = self.get_logits(word_ids, is_training)\n    loss_obj = tf.losses.SparseCategoricalCrossentropy(from_logits=True)\n    return loss_obj(labels, logits)\n\n  def get_score(self, logits):\n    \"\"\"Take logits as input, output a score.\"\"\"\n    return tf.nn.log_softmax(logits)\n\n  @tf.function\n  def get_initial_state(self):\n    
\"\"\"Exported function which emits zeroed RNN context vector.\"\"\"\n    # This seems a bug in TensorFlow, but passing tf.int32 makes the state tensor also int32.\n    fake_input = tf.constant(0, dtype=tf.float32, shape=[1, 1])\n    initial_state = tf.stack(self.rnn.get_initial_state(fake_input))\n    return {\"initial_state\": initial_state}\n\n  @tf.function\n  def single_step(self, context, word_id):\n    \"\"\"Exported function which perform one step of the RNN model.\"\"\"\n    rnn = tf.keras.layers.RNN(self.cells, return_state=True)\n    context = tf.unstack(context)\n    context = [tf.unstack(c) for c in context]\n\n    inputs = self.embedding(word_id)\n    rnn_out_and_states = rnn(inputs, initial_state=context)\n\n    rnn_out = rnn_out_and_states[0]\n    rnn_states = tf.stack(rnn_out_and_states[1:])\n\n    logits = self.fc(rnn_out)\n    output = self.get_score(logits)\n    log_prob = output[0, word_id[0, 0]]\n    return {\"log_prob\": log_prob, \"rnn_states\": rnn_states, \"rnn_out\": rnn_out}\n\n\nclass RNNLMModelTrainer(tf.Module):\n  \"\"\"This class contains training code.\"\"\"\n\n  def __init__(self, model: RNNLMModel, config):\n    super().__init__()\n    self.model = model\n    self.learning_rate = tf.Variable(1e-3, dtype=tf.float32, trainable=False)\n    self.optimizer = tf.optimizers.SGD(learning_rate=self.learning_rate)\n    self.max_grad_norm = config.max_grad_norm\n\n    self.eval_mean_loss = tf.metrics.Mean()\n\n  def train_one_epoch(self, data_producer, learning_rate, verbose=True):\n    print(\"start epoch with learning rate {}\".format(learning_rate))\n    self.learning_rate.assign(learning_rate)\n\n    for i, (inputs, labels) in enumerate(data_producer.iterate()):\n      loss = self._train_step(inputs, labels)\n      if verbose and i % (data_producer.epoch_size // 10) == 1:\n        print(\"{}/{}: loss={}\".format(i, data_producer.epoch_size, loss))\n\n  @tf.function\n  def evaluate(self, data_producer):\n    
self.eval_mean_loss.reset_states()\n    for i, (inputs, labels) in enumerate(data_producer.iterate()):\n      loss = self.model.get_loss(inputs, labels)\n      self.eval_mean_loss.update_state(loss)\n\n    return self.eval_mean_loss.result()\n\n  @tf.function\n  def _train_step(self, inputs, labels):\n    with tf.GradientTape() as tape:\n      loss = self.model.get_loss(inputs, labels, is_training=True)\n\n    tvars = self.model.trainable_variables\n    grads = tape.gradient(loss, tvars)\n    clipped_grads, _ = tf.clip_by_global_norm(grads, self.max_grad_norm)\n    self.optimizer.apply_gradients(zip(clipped_grads, tvars))\n    return loss\n\n\ndef get_config():\n  return Config()\n\n\ndef main(_):\n  # Turn this on to try the model code with this source file itself!\n  __TESTING = False\n\n  if __TESTING:\n    (train_data, valid_data), word_map = reader.rnnlm_gen_data(__file__, reader.__file__)\n  else:\n    if not FLAGS.data_path:\n      raise ValueError(\"Must set --data_path to RNNLM data directory\")\n\n    raw_data = reader.rnnlm_raw_data(FLAGS.data_path, FLAGS.vocab_path)\n    train_data, valid_data, _, word_map = raw_data\n\n  config = get_config()\n  config.hidden_size = FLAGS.hidden_size\n  config.vocab_size = len(word_map)\n\n  if __TESTING:\n    # use a much smaller scale on our tiny test data\n    config.num_steps = 8\n    config.batch_size = 4\n\n  model = RNNLMModel(config)\n  train_producer = reader.RNNLMProducer(train_data, config.batch_size, config.num_steps)\n  trainer = RNNLMModelTrainer(model, config)\n\n  valid_producer = reader.RNNLMProducer(valid_data, config.batch_size, config.num_steps)\n\n  # Save variables to disk so that training can be resumed after a crash.\n  # The data producer can also be saved to preserve feeding progress.\n  checkpoint = tf.train.Checkpoint(trainer=trainer, data_feeder=train_producer)\n  manager = tf.train.CheckpointManager(checkpoint, \"checkpoints/\", 5)\n\n  for i in range(config.max_max_epoch):\n    lr_decay = config.lr_decay ** 
max(i + 1 - config.max_epoch, 0.0)\n    lr = config.learning_rate * lr_decay\n    trainer.train_one_epoch(train_producer, lr)\n    manager.save()\n\n    eval_loss = trainer.evaluate(valid_producer)\n    print(\"validating: loss={}\".format(eval_loss))\n\n  # Export\n  print(\"Saving model to %s.\" % FLAGS.save_path)\n  spec = [tf.TensorSpec(shape=[config.num_layers, 2, 1, config.hidden_size], dtype=data_type(), name=\"context\"),\n          tf.TensorSpec(shape=[1, 1], dtype=tf.int32, name=\"word_id\")]\n  cfunc = model.single_step.get_concrete_function(*spec)\n  cfunc2 = model.get_initial_state.get_concrete_function()\n  tf.saved_model.save(model, FLAGS.save_path, signatures={\"single_step\": cfunc, \"get_initial_state\": cfunc2})\n\n\nif __name__ == \"__main__\":\n  absl.app.run(main)\n"
  },
  {
    "path": "egs/steps/tfrnnlm/lstm_fast.py",
    "content": "# Copyright 2015 The TensorFlow Authors. All Rights Reserved.\n# Copyright (C) 2017 Intellisist, Inc. (Author: Hainan Xu)\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n# this script trains a vanilla RNNLM with TensorFlow. \n# to call the script, do\n# python steps/tfrnnlm/lstm_fast.py --data_path=$datadir \\\n#        --save_path=$savepath --vocab_path=$rnn.wordlist [--hidden-size=$size]\n#\n# One example recipe is at egs/ami/s5/local/tfrnnlm/run_vanilla_rnnlm.sh\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport absl\nimport absl.flags as flags\nimport tensorflow as tf\nfrom tensorflow.python.keras.losses import LossFunctionWrapper\n\nimport reader\nfrom lstm import RNNLMModel, RNNLMModelTrainer\n\n# flags.DEFINE_integer(\"hidden_size\", 200, \"hidden dim of RNN\")\n#\n# flags.DEFINE_string(\"data_path\", None,\n#                     \"Where the training/test data is stored.\")\n# flags.DEFINE_string(\"vocab_path\", None,\n#                     \"Where the wordlist file is stored.\")\n# flags.DEFINE_string(\"save_path\", \"export\",\n#                     \"Model output directory.\")\n# flags.DEFINE_bool(\"use_fp16\", False,\n#                   \"Train using 16-bit floats instead of 32bit floats\")\n\nFLAGS = flags.FLAGS\n\n\nclass Config(object):\n  \"\"\"Small config.\"\"\"\n  
init_scale = 0.1\n  learning_rate = 1\n  max_grad_norm = 5\n  num_layers = 2\n  num_steps = 20\n  hidden_size = 200\n  max_epoch = 4\n  max_max_epoch = 13\n  keep_prob = 1.0\n  lr_decay = 0.8\n  batch_size = 64\n\n\ndef data_type():\n  return tf.float16 if FLAGS.use_fp16 else tf.float32\n\n\n# this new \"softmax\" function we show can train a \"self-normalized\" RNNLM where\n# the sum of the output is automatically (close to) 1.0\n# which saves a lot of computation for lattice-rescoring\ndef new_softmax(labels, logits):\n  flatten_labels = tf.reshape(labels, [-1])\n  n_samples = tf.shape(flatten_labels)[0]\n  flatten_logits = tf.reshape(logits, shape=[n_samples, -1])\n  f_logits = tf.exp(flatten_logits)\n  row_sums = tf.reduce_sum(f_logits, -1) # this is the negative part of the objf\n\n  t2 = tf.expand_dims(flatten_labels, 1)\n  range = tf.expand_dims(tf.range(n_samples), 1)\n  ind = tf.concat([range, t2], 1)\n  res = tf.gather_nd(flatten_logits, ind)\n\n  return -res + row_sums - 1\n\n\nclass MyFastLossFunction(LossFunctionWrapper):\n  def __init__(self):\n    super().__init__(new_softmax)\n\n\nclass FastRNNLMModel(RNNLMModel):\n  def __init__(self, config):\n    super().__init__(config, tf.constant_initializer(-9))\n\n  def get_loss(self, word_ids, labels, is_training=False):\n    logits = self.get_logits(word_ids, is_training)\n    loss_obj = MyFastLossFunction()\n    return loss_obj(labels, logits)\n\n  def get_score(self, logits):\n    # In this implementation, logits can be used as dist output\n    return logits\n\n\ndef get_config():\n  return Config()\n\n\ndef main(_):\n  # Turn this on to try the model code with this source file itself!\n  __TESTING = False\n\n  if __TESTING:\n    (train_data, valid_data), word_map = reader.rnnlm_gen_data(__file__, reader.__file__)\n  else:\n    if not FLAGS.data_path:\n      raise ValueError(\"Must set --data_path to RNNLM data directory\")\n\n    raw_data = reader.rnnlm_raw_data(FLAGS.data_path, FLAGS.vocab_path)\n    
train_data, valid_data, _, word_map = raw_data\n\n  config = get_config()\n  config.hidden_size = FLAGS.hidden_size\n  config.vocab_size = len(word_map)\n\n  if __TESTING:\n    # use a much smaller scale on our tiny test data\n    config.num_steps = 8\n    config.batch_size = 4\n\n  model = FastRNNLMModel(config)\n  train_producer = reader.RNNLMProducer(train_data, config.batch_size, config.num_steps)\n  trainer = RNNLMModelTrainer(model, config)\n\n  valid_producer = reader.RNNLMProducer(valid_data, config.batch_size, config.num_steps)\n\n  # Save variables to disk so that training can be resumed after a crash.\n  # The data producer can also be saved to preserve feeding progress.\n  checkpoint = tf.train.Checkpoint(trainer=trainer, data_feeder=train_producer)\n  manager = tf.train.CheckpointManager(checkpoint, \"checkpoints/\", 5)\n\n  for i in range(config.max_max_epoch):\n    lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)\n    lr = config.learning_rate * lr_decay\n    trainer.train_one_epoch(train_producer, lr)\n    manager.save()\n\n    eval_loss = trainer.evaluate(valid_producer)\n    print(\"validating: loss={}\".format(eval_loss))\n\n  # Export\n  print(\"Saving model to %s.\" % FLAGS.save_path)\n  spec = [tf.TensorSpec(shape=[config.num_layers, 2, 1, config.hidden_size], dtype=data_type(), name=\"context\"),\n          tf.TensorSpec(shape=[1, 1], dtype=tf.int32, name=\"word_id\")]\n  cfunc = model.single_step.get_concrete_function(*spec)\n  cfunc2 = model.get_initial_state.get_concrete_function()\n  tf.saved_model.save(model, FLAGS.save_path, signatures={\"single_step\": cfunc, \"get_initial_state\": cfunc2})\n\n\nif __name__ == \"__main__\":\n  absl.app.run(main)\n"
  },
  {
    "path": "egs/steps/tfrnnlm/reader.py",
    "content": "# Copyright 2015 The TensorFlow Authors. All Rights Reserved.\n# Copyright (C) 2017 Intellisist, Inc. (Author: Hainan Xu)\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\n\"\"\"Utilities for parsing RNNLM text files.\"\"\"\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport collections\nimport os\n\nimport tensorflow as tf\n\ndef _read_words(filename):\n  with tf.gfile.GFile(filename, \"r\") as f:\n    return f.read().decode(\"utf-8\").split()\n\ndef _build_vocab(filename):\n  words = _read_words(filename)\n  word_to_id = dict(list(zip(words, list(range(len(words))))))\n  return word_to_id\n\n\ndef _file_to_word_ids(filename, word_to_id):\n  data = _read_words(filename)\n  return [word_to_id[word] for word in data if word in word_to_id]\n\n\ndef rnnlm_raw_data(data_path, vocab_path):\n  \"\"\"Load RNNLM raw data from data directory \"data_path\".\n\n  Args:\n    data_path: string path to the directory where train/valid files are stored\n\n  Returns:\n    tuple (train_data, valid_data, test_data, vocabulary)\n    where each of the data objects can be passed to RNNLMIterator.\n  \"\"\"\n\n  train_path = os.path.join(data_path, \"train\")\n  valid_path = os.path.join(data_path, \"valid\")\n\n  word_to_id = _build_vocab(vocab_path)\n  train_data = _file_to_word_ids(train_path, word_to_id)\n  
valid_data = _file_to_word_ids(valid_path, word_to_id)\n  vocabulary = len(word_to_id)\n  return train_data, valid_data, vocabulary, word_to_id\n\n\ndef rnnlm_gen_data(*files):\n  \"\"\"Generates data and vocab from files.\n\n  This function is used solely for testing.\n  \"\"\"\n  import collections\n  import re\n\n  all_words = collections.Counter()\n  all_word_lists = []\n  for f in files:\n    with open(f, mode=\"r\") as fp:\n      text = fp.read()\n\n    word_list = re.split(\"[^A-Za-z]\", text)\n    word_list = list(filter(None, word_list))\n    all_words.update(word_list)\n    all_word_lists.append(word_list)\n\n  word_to_id = {word: i for i, (word, _) in enumerate(all_words.most_common())}\n\n  def convert(word_list):\n    return [word_to_id[word] for word in word_list]\n\n  all_word_ids = [convert(word_list) for word_list in all_word_lists]\n  return all_word_ids, word_to_id\n\n\nclass RNNLMProducer(tf.Module):\n  \"\"\"This is the data feeder.\"\"\"\n\n  def __init__(self, raw_data, batch_size, num_steps, name=None):\n    super().__init__(name)\n    self.batch_size = batch_size\n    self.num_steps = num_steps\n    self.epoch_size = (len(raw_data) - 1) // num_steps // batch_size\n\n    # load data into a variable so that it will be separated from graph\n    self._raw_data = tf.Variable(raw_data, dtype=tf.int32, trainable=False)\n\n    ds_x = tf.data.Dataset.from_tensor_slices(self._raw_data)\n    ds_y = ds_x.skip(1)\n    ds = tf.data.Dataset.zip((ds_x, ds_y))\n    # form samples\n    ds = ds.batch(num_steps, drop_remainder=True)\n    # form batches\n    self._ds = ds.batch(batch_size, drop_remainder=True)\n\n  def iterate(self):\n    return self._ds\n\n\nif __name__ == \"__main__\":\n  samples = list(range(100))\n  ds = RNNLMProducer(samples, 4, 8)\n  print(ds.epoch_size)\n  for data in ds.iterate():\n    print(data)\n"
  },
  {
    "path": "egs/steps/tfrnnlm/vanilla_rnnlm.py",
    "content": "# Copyright 2015 The TensorFlow Authors. All Rights Reserved.\n# Copyright (C) 2017 Intellisist, Inc. (Author: Hainan Xu)\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n# this script trains a vanilla RNNLM with TensorFlow. \n# to call the script, do\n# python steps/tfrnnlm/vanilla_rnnlm.py --data_path=$datadir \\\n#        --save_path=$savepath --vocab_path=$rnn.wordlist [--hidden-size=$size]\n#\n# One example recipe is at egs/ami/s5/local/tfrnnlm/run_vanilla_rnnlm.sh\n\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport sys\n\nimport inspect\nimport time\n\nimport numpy as np\nimport tensorflow as tf\n\nimport reader\n\nflags = tf.flags\nlogging = tf.logging\n\nflags.DEFINE_integer(\"hidden_size\", 200, \"hidden dim of RNN\")\n\nflags.DEFINE_string(\"data_path\", None,\n                    \"Where the training/test data is stored.\")\nflags.DEFINE_string(\"vocab_path\", None,\n                    \"Where the wordlist file is stored.\")\nflags.DEFINE_string(\"save_path\", None,\n                    \"Model output directory.\")\nflags.DEFINE_bool(\"use_fp16\", False,\n                  \"Train using 16-bit floats instead of 32bit floats\")\n\nFLAGS = flags.FLAGS\n\nclass Config(object):\n  \"\"\"Small config.\"\"\"\n  init_scale = 0.1\n  learning_rate = 0.2\n  max_grad_norm = 1\n  num_layers = 
1\n  num_steps = 20\n  hidden_size = 200\n  max_epoch = 4\n  max_max_epoch = 20\n  keep_prob = 1\n  lr_decay = 0.95\n  batch_size = 64\n\ndef data_type():\n  return tf.float16 if FLAGS.use_fp16 else tf.float32\n\n\nclass RnnlmInput(object):\n  \"\"\"The input data.\"\"\"\n\n  def __init__(self, config, data, name=None):\n    self.batch_size = batch_size = config.batch_size\n    self.num_steps = num_steps = config.num_steps\n    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps\n    self.input_data, self.targets = reader.rnnlm_producer(\n        data, batch_size, num_steps, name=name)\n\nclass RnnlmModel(object):\n  \"\"\"The RNNLM model.\"\"\"\n\n  def __init__(self, is_training, config, input_):\n    self._input = input_\n\n    batch_size = input_.batch_size\n    num_steps = input_.num_steps\n    size = config.hidden_size\n    vocab_size = config.vocab_size\n\n    def rnn_cell():\n      # With the latest TensorFlow source code (as of Mar 27, 2017),\n      # the BasicLSTMCell will need a reuse parameter which is unfortunately not\n      # defined in TensorFlow 1.0. 
To maintain backwards compatibility, we add\n      # an argument check here:\n      if 'reuse' in inspect.getargspec(\n          tf.contrib.rnn.BasicRNNCell.__init__).args:\n        return tf.contrib.rnn.BasicRNNCell(size,\n                                           reuse=tf.get_variable_scope().reuse)\n      else:\n        return tf.contrib.rnn.BasicRNNCell(size)\n    attn_cell = rnn_cell\n\n    if is_training and config.keep_prob < 1:\n      def attn_cell():\n        return tf.contrib.rnn.DropoutWrapper(\n            rnn_cell(), output_keep_prob=config.keep_prob)\n\n    self.cell = tf.contrib.rnn.MultiRNNCell(\n        [attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)\n\n    self._initial_state = self.cell.zero_state(batch_size, data_type())\n    self._initial_state_single = self.cell.zero_state(1, data_type())\n\n    self.initial = tf.reshape(tf.stack(axis=0, values=self._initial_state_single), [config.num_layers, 1, size], name=\"test_initial_state\")\n\n    # first implement the less efficient version\n    test_word_in = tf.placeholder(tf.int32, [1, 1], name=\"test_word_in\")\n\n    state_placeholder = tf.placeholder(tf.float32, [config.num_layers, 1, size], name=\"test_state_in\")\n    # unpacking the input state context \n    l = tf.unstack(state_placeholder, axis=0)\n    test_input_state = tuple(\n               [l[idx] for idx in range(config.num_layers)]\n    )\n\n    with tf.device(\"/cpu:0\"):\n      self.embedding = tf.get_variable(\n          \"embedding\", [vocab_size, size], dtype=data_type())\n\n      inputs = tf.nn.embedding_lookup(self.embedding, input_.input_data)\n      test_inputs = tf.nn.embedding_lookup(self.embedding, test_word_in)\n\n    # test time\n    with tf.variable_scope(\"RNN\"):\n      (test_cell_output, test_output_state) = self.cell(test_inputs[:, 0, :], test_input_state)\n\n    test_state_out = tf.reshape(tf.stack(axis=0, values=test_output_state), [config.num_layers, 1, size], name=\"test_state_out\")\n    
test_cell_out = tf.reshape(test_cell_output, [1, size], name=\"test_cell_out\")\n    # above is the first part of the graph for test\n    # test-word-in\n    #               > ---- > test-state-out\n    # test-state-in        > test-cell-out\n\n\n    # below is the 2nd part of the graph for test\n    # test-word-out\n    #               > prob(word | test-word-out)\n    # test-cell-in\n\n    test_word_out = tf.placeholder(tf.int32, [1, 1], name=\"test_word_out\")\n    cellout_placeholder = tf.placeholder(tf.float32, [1, size], name=\"test_cell_in\")\n\n    softmax_w = tf.get_variable(\n        \"softmax_w\", [size, vocab_size], dtype=data_type())\n    softmax_b = tf.get_variable(\"softmax_b\", [vocab_size], dtype=data_type())\n\n    test_logits = tf.matmul(cellout_placeholder, softmax_w) + softmax_b\n    test_softmaxed = tf.nn.log_softmax(test_logits)\n\n    p_word = test_softmaxed[0, test_word_out[0,0]]\n    test_out = tf.identity(p_word, name=\"test_out\")\n\n    if is_training and config.keep_prob < 1:\n      inputs = tf.nn.dropout(inputs, config.keep_prob)\n\n    # Simplified version of models/tutorials/rnn/rnn.py's rnn().\n    # This builds an unrolled LSTM for tutorial purposes only.\n    # In general, use the rnn() or state_saving_rnn() from rnn.py.\n    #\n    # The alternative version of the code below is:\n    #\n    # inputs = tf.unstack(inputs, num=num_steps, axis=1)\n    # outputs, state = tf.contrib.rnn.static_rnn(\n    #     cell, inputs, initial_state=self._initial_state)\n    outputs = []\n    state = self._initial_state\n    with tf.variable_scope(\"RNN\"):\n      for time_step in range(num_steps):\n        if time_step > -1: tf.get_variable_scope().reuse_variables()\n        (cell_output, state) = self.cell(inputs[:, time_step, :], state)\n        outputs.append(cell_output)\n\n    output = tf.reshape(tf.stack(axis=1, values=outputs), [-1, size])\n    logits = tf.matmul(output, softmax_w) + softmax_b\n    loss = 
tf.contrib.legacy_seq2seq.sequence_loss_by_example(\n        [logits],\n        [tf.reshape(input_.targets, [-1])],\n        [tf.ones([batch_size * num_steps], dtype=data_type())])\n    self._cost = cost = tf.reduce_sum(loss) / batch_size\n    self._final_state = state\n\n    if not is_training:\n      return\n\n    self._lr = tf.Variable(0.0, trainable=False)\n    tvars = tf.trainable_variables()\n    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),\n                                      config.max_grad_norm)\n    optimizer = tf.train.MomentumOptimizer(self._lr, 0.9)\n    self._train_op = optimizer.apply_gradients(\n        list(zip(grads, tvars)),\n        global_step=tf.contrib.framework.get_or_create_global_step())\n\n    self._new_lr = tf.placeholder(\n        tf.float32, shape=[], name=\"new_learning_rate\")\n    self._lr_update = tf.assign(self._lr, self._new_lr)\n\n  def assign_lr(self, session, lr_value):\n    session.run(self._lr_update, feed_dict={self._new_lr: lr_value})\n\n  @property\n  def input(self):\n    return self._input\n\n  @property\n  def initial_state(self):\n    return self._initial_state\n\n  @property\n  def cost(self):\n    return self._cost\n\n  @property\n  def final_state(self):\n    return self._final_state\n\n  @property\n  def lr(self):\n    return self._lr\n\n  @property\n  def train_op(self):\n    return self._train_op\n\ndef run_epoch(session, model, eval_op=None, verbose=False):\n  \"\"\"Runs the model on the given data.\"\"\"\n  start_time = time.time()\n  costs = 0.0\n  iters = 0\n  state = session.run(model.initial_state)\n\n  fetches = {\n      \"cost\": model.cost,\n      \"final_state\": model.final_state,\n  }\n  if eval_op is not None:\n    fetches[\"eval_op\"] = eval_op\n\n  for step in range(model.input.epoch_size):\n    feed_dict = {}\n    for i, h in enumerate(model.initial_state):\n      feed_dict[h] = state[i]\n\n    vals = session.run(fetches, feed_dict)\n    cost = vals[\"cost\"]\n    state = 
vals[\"final_state\"]\n\n    costs += cost\n    iters += model.input.num_steps\n\n    if verbose and step % (model.input.epoch_size // 10) == 10:\n      print(\"%.3f perplexity: %.3f speed: %.0f wps\" %\n            (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),\n             iters * model.input.batch_size / (time.time() - start_time)))\n\n  return np.exp(costs / iters)\n\n\ndef get_config():\n  return Config()\n\ndef main(_):\n  if not FLAGS.data_path:\n    raise ValueError(\"Must set --data_path to RNNLM data directory\")\n\n  raw_data = reader.rnnlm_raw_data(FLAGS.data_path, FLAGS.vocab_path)\n  train_data, valid_data, _, word_map = raw_data\n\n  config = get_config()\n  config.hidden_size = FLAGS.hidden_size\n  config.vocab_size = len(word_map)\n  eval_config = get_config()\n  eval_config.batch_size = 1\n  eval_config.num_steps = 1\n\n  with tf.Graph().as_default():\n    initializer = tf.random_uniform_initializer(-config.init_scale,\n                                                config.init_scale)\n\n    with tf.name_scope(\"Train\"):\n      train_input = RnnlmInput(config=config, data=train_data, name=\"TrainInput\")\n      with tf.variable_scope(\"Model\", reuse=None, initializer=initializer):\n        m = RnnlmModel(is_training=True, config=config, input_=train_input)\n      tf.summary.scalar(\"Training Loss\", m.cost)\n      tf.summary.scalar(\"Learning Rate\", m.lr)\n\n    with tf.name_scope(\"Valid\"):\n      valid_input = RnnlmInput(config=config, data=valid_data, name=\"ValidInput\")\n      with tf.variable_scope(\"Model\", reuse=True, initializer=initializer):\n        mvalid = RnnlmModel(is_training=False, config=config, input_=valid_input)\n      tf.summary.scalar(\"Validation Loss\", mvalid.cost)\n\n    sv = tf.train.Supervisor(logdir=FLAGS.save_path)\n    with sv.managed_session() as session:\n      for i in range(config.max_max_epoch):\n        lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)\n\n        
m.assign_lr(session, config.learning_rate * lr_decay)\n\n        print(\"Epoch: %d Learning rate: %.3f\" % (i + 1, session.run(m.lr)))\n        train_perplexity = run_epoch(session, m, eval_op=m.train_op,\n                                     verbose=True)\n\n        print(\"Epoch: %d Train Perplexity: %.3f\" % (i + 1, train_perplexity))\n        valid_perplexity = run_epoch(session, mvalid)\n        print(\"Epoch: %d Valid Perplexity: %.3f\" % (i + 1, valid_perplexity))\n\n      if FLAGS.save_path:\n        print(\"Saving model to %s.\" % FLAGS.save_path)\n        sv.saver.save(session, FLAGS.save_path)\n\nif __name__ == \"__main__\":\n  tf.app.run()\n"
  },
  {
    "path": "egs/steps/train_deltas.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration.\nstage=-4 #  This allows restarting after partway, when something when wrong.\nconfig=\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\nbeam=10\ncareful=false\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.25 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nnorm_vars=false # deprecated.  Prefer --cmvn-opts \"--norm-vars=true\"\n                # use the option --cmvn-opts \"--norm-means=false\"\ncmvn_opts=\ndelta_opts=\ncontext_opts=   # use\"--context-width=5 --central-position=2\" for quinphone\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n   echo \"Usage: steps/train_deltas.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>\"\n   echo \"e.g.: steps/train_deltas.sh 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"train_deltas.sh: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter increment for #Gauss\noov=`cat $lang/oov.int` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata=$data/split$nj;\nsplit_data.sh $data $nj || exit 1;\n\n\n[ $(cat $alidir/cmvn_opts 2>/dev/null | wc -c) -gt 1 ] && [ -z \"$cmvn_opts\" ] && \\\n  echo \"$0: warning: ignoring CMVN options from source directory $alidir\"\n$norm_vars && cmvn_opts=\"--norm-vars=true $cmvn_opts\"\necho $cmvn_opts  > $dir/cmvn_opts # keep track of options to CMVN.\n[ ! -z $delta_opts ] && echo $delta_opts > $dir/delta_opts\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\"\n\nrm $dir/.error 2>/dev/null\n\nif [ $stage -le -3 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts \\\n    --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: getting questions for tree-building, via clustering\"\n  # preparing questions, roots file...\n  cluster-phones $context_opts $dir/treeacc $lang/phones/sets.int \\\n    $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $context_opts $lang/topo $dir/questions.int \\\n    $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: building the tree\"\n  
$cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  $cmd $dir/log/init_model.log \\\n    gmm-init-model  --write-occs=$dir/1.occs  \\\n      $dir/tree $dir/treeacc $lang/topo $dir/1.mdl || exit 1;\n  if grep 'no stats' $dir/log/init_model.log; then\n     echo \"** The warnings above about 'no stats' generally mean you have phones **\"\n     echo \"** (or groups of phones) in your phone set that had no corresponding data. **\"\n     echo \"** You should probably figure out whether something went wrong, **\"\n     echo \"** or whether your data just doesn't happen to have examples of those **\"\n     echo \"** phones. **\"\n  fi\n\n  gmm-mixup --mix-up=$numgauss $dir/1.mdl $dir/1.occs $dir/1.mdl 2>$dir/log/mixup.log || exit 1;\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: training pass $x\"\n  if [ $stage -le $x ]; then\n    if echo $realign_iters | grep -w $x >/dev/null; then\n      echo \"$0: aligning data\"\n      mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n      $cmd JOB=1:$nj 
$dir/log/align.$x.JOB.log \\\n        gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl\" \\\n         \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n         \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n       \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --mix-up=$numgauss --power=$power \\\n        --write-occs=$dir/$[$x+1].occs $dir/$x.mdl \\\n       \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nrm $dir/final.mdl $dir/final.occs 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\n# Summarize warning messages...\nutils/summarize_warnings.pl  $dir/log\n\nsteps/info/gmm_dir_info.pl $dir\n\necho \"$0: Done training system with delta+delta-delta features in $dir\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/train_diag_ubm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright Johns Hopkins University (Author: Daniel Povey),  2012.\n# Apache 2.0.\n\n# Train a diagonal mixture of Gaussians.  This is trained without\n# reference to class labels-- except that, optionally, you can down-weight\n# silence phones, and alignments are needed for that.\n#\n# The current use for this is in fMMI training.\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nnum_iters=3\nsilence_weight=\nstage=-2\n# The value \"intermediate\" is a number of Gaussians we first obtain by clustering\n# the Gaussians within each state of the model, before clustering down to\n# $num_Gauss.  This is for efficiency.  It's not a very important parameter,\n# as far as I know.\nintermediate=2000\nnum_gselect=50 # Number of Gaussian-selection indices to use while training\n               # the model.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\n\nif [ $# != 5 ]; then\n  echo \"Usage: steps/train_diag_ubm.sh <num-gauss> <data> <lang> <alignment-dir|src-dir> <dir>\"\n  echo \" e.g.: steps/train_diag_ubm.sh 400 data/train_si84 data/lang exp/tri2b_ali_si84 exp/ubm3c\"\n  echo \"Options: \"\n  echo \"  --silence-weight <sil-weight>                  # default 1.0.  
Use to down-weight silence.\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --nj <num-job>                                 # number of parallel jobs to run.\"\n  echo \"  --num-iters <niter>                            # number of iterations of training (default: $num_iters)\"\n  echo \"  --stage <stage>                                # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnum_gauss=$1\ndata=$2\nlang=$3\nalidir=$4\ndir=$5\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\nmkdir -p $dir/log\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ -f $alidir/trans.1 ]; then\n  echo Using transforms from $alidir;\n  [ \"$nj\" -ne \"`cat $alidir/num_jobs`\" ] && \\\n    echo \"The number of jobs differs from alignment directory $alidir.\" && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$alidir/trans.JOB ark:- ark:- |\"\nfi\n\nif [ ! -z \"$silence_weight\" ]; then\n  [ ! 
-f $alidir/ali.1.gz ] && \\\n    echo \"You specified weighting for silence but $alidir/ali.1.gz does not exist.\" && exit 1;\n  [ \"$nj\" -ne \"`cat $alidir/num_jobs`\" ] && \\\n    echo \"You specified silence weight but $alidir has different #jobs.\" && exit 1;\n  weights=\"--weights='ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- | weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- | post-to-weights ark:- ark:- |'\"\nelse\n  weights=\nfi\n\n# $intermediate should be more than $num_gauss..\n[ $[$num_gauss*2] -gt $intermediate ] && intermediate=$[$num_gauss*2] \\\n  && echo \"Setting intermediate=$intermediate (it was too small)\";\n\nif [ $stage -le -2 ]; then\n echo \"Clustering Gaussians in $alidir/final.mdl\"\n $cmd $dir/log/cluster.log \\\n  init-ubm --fullcov-ubm=false --intermediate-num-gauss=$intermediate \\\n    --ubm-num-gauss=$num_gauss $alidir/final.mdl $alidir/final.occs $dir/0.dubm   || exit 1;\nfi\n\n# Store Gaussian selection indices on disk-- this speeds up the training passes.\nif [ $stage -le -1 ]; then\n  echo Getting Gaussian-selection info\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$num_gselect $dir/0.dubm \"$feats\" \\\n      \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\nfor x in `seq 0 $[$num_iters-1]`; do\n  echo \"Training pass $x\"\n  if [ $stage -le $x ]; then\n  # Accumulate stats.\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-global-acc-stats $weights \"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" \\\n      $dir/$x.dubm \"$feats\" $dir/$x.JOB.acc || exit 1;\n    if [ $x -lt $[$num_iters-1] ]; then # Don't remove low-count Gaussians till last iter,\n      opt=\"--remove-low-count-gaussians=false\" # or gselect info won't be valid any more.\n    else\n      opt= # Last iter: clear the option so low-count Gaussians can be removed.\n    fi\n    $cmd $dir/log/update.$x.log \\\n      gmm-global-est $opt $dir/$x.dubm \"gmm-global-sum-accs - $dir/$x.*.acc|\" \\\n      $dir/$[$x+1].dubm || exit 1;\n    rm $dir/$x.*.acc $dir/$x.dubm\n  
fi\ndone\n\nrm $dir/gselect.*.gz\nmv $dir/$num_iters.dubm $dir/final.dubm || exit 1;\nexit 0;\n"
  },
  {
    "path": "egs/steps/train_lda_mllt.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#\n# LDA+MLLT refers to the way we transform the features after computing\n# the MFCCs: we splice across several frames, reduce the dimension (to 40\n# by default) using Linear Discriminant Analysis), and then later estimate,\n# over multiple iterations, a diagonalizing transform known as MLLT or STC.\n# See http://kaldi-asr.org/doc/transform.html for more explanation.\n#\n# Apache 2.0.\n\n# Begin configuration.\ncmd=run.pl\nconfig=\nstage=-5\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nmllt_iters=\"2 4 6 12\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25  # Last iter to increase #Gauss on.\ndim=40\nbeam=10\nretry_beam=40\ncareful=false\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.25 # Exponent for number of gaussians according to occurrence counts\nrandprune=4.0 # This is approximately the ratio by which we will speed up the\n              # LDA and MLLT calculations via randomized pruning.\nsplice_opts=\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nnorm_vars=false # deprecated.  Prefer --cmvn-opts \"--norm-vars=false\"\ncmvn_opts=\ncontext_opts=   # use \"--context-width=5 --central-position=2\" for quinphone.\n# End configuration.\ntrain_tree=true  # if false, don't actually train the tree.\nuse_lda_mat=  # If supplied, use this LDA[+MLLT] matrix.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_lda_mllt.sh [options] <#leaves> <#gauss> <data> <lang> <alignments> <dir>\"\n  echo \" e.g.: steps/train_lda_mllt.sh 2500 15000 data/train_si84 data/lang exp/tri1_ali_si84 exp/tri2b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt; do\n  [ ! -f $f ] && echo \"train_lda_mllt.sh: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter #gauss increment\noov=`cat $lang/oov.int` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\n\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\necho \"$splice_opts\" >$dir/splice_opts # keep track of frame-splicing options\n           # so that later stages of system building can know what they were.\n\n\n[ $(cat $alidir/cmvn_opts 2>/dev/null | wc -c) -gt 1 ] && [ -z \"$cmvn_opts\" ] && \\\n  echo \"$0: warning: ignoring CMVN options from source directory $alidir\"\n$norm_vars && cmvn_opts=\"--norm-vars=true $cmvn_opts\"\necho $cmvn_opts > $dir/cmvn_opts # keep track of options to CMVN.\n\nsdata=$data/split$nj;\nsplit_data.sh $data $nj || exit 1;\n\nsplicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- 
|\"\n# Note: $feats gets overwritten later in the script.\nfeats=\"$splicedfeats transform-feats $dir/0.mat ark:- ark:- |\"\n\n\n\nif [ $stage -le -5 ]; then\n  if [ -z \"$use_lda_mat\" ]; then\n    echo \"$0: Accumulating LDA statistics.\"\n    rm $dir/lda.*.acc 2>/dev/null\n    $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      acc-lda --rand-prune=$randprune $alidir/final.mdl \"$splicedfeats\" ark,s,cs:- \\\n      $dir/lda.JOB.acc || exit 1;\n    est-lda --write-full-matrix=$dir/full.mat --dim=$dim $dir/0.mat $dir/lda.*.acc \\\n      2>$dir/log/lda_est.log || exit 1;\n    rm $dir/lda.*.acc\n  else\n    echo \"$0: Using supplied LDA matrix $use_lda_mat\"\n    cp $use_lda_mat $dir/0.mat || exit 1;\n    [ ! -z \"$mllt_iters\" ] && \\\n      echo \"$0: Warning: using supplied LDA matrix $use_lda_mat but we will do MLLT,\" && \\\n      echo \"     which you might not want; to disable MLLT, specify --mllt-iters ''\" && \\\n      sleep 5\n  fi\nfi\n\ncur_lda_iter=0\n\nif [ $stage -le -4 ] && $train_tree; then\n  echo \"$0: Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts \\\n    --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ `ls $dir/*.treeacc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\n\nif [ $stage -le -3 ] && $train_tree; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $context_opts $dir/treeacc $lang/phones/sets.int \\\n    $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions 
$context_opts $lang/topo $dir/questions.int \\\n    $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Initializing the model\"\n  if $train_tree; then\n    gmm-init-model  --write-occs=$dir/1.occs  \\\n      $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n    grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n    rm $dir/treeacc\n  else\n    cp $alidir/tree $dir/ || exit 1;\n    $cmd JOB=1 $dir/log/init_model.log \\\n      gmm-init-model-flat $dir/tree $lang/topo $dir/1.mdl \\\n        \"$feats subset-feats ark:- ark:-|\" || exit 1;\n  fi\nfi\n\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ] && [ \"$realign_iters\" != \"\" ]; then\n  echo \"$0: Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $data/split$nj/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo Training pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd 
JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl\" \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n  if echo $mllt_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo \"$0: Estimating MLLT\"\n      $cmd JOB=1:$nj $dir/log/macc.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n        weight-silence-post 0.0 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-acc-mllt --rand-prune=$randprune  $dir/$x.mdl \"$feats\" ark:- $dir/$x.JOB.macc \\\n        || exit 1;\n      est-mllt $dir/$x.mat.new $dir/$x.*.macc 2> $dir/log/mupdate.$x.log || exit 1;\n      gmm-transform-means  $dir/$x.mat.new $dir/$x.mdl $dir/$x.mdl \\\n        2> $dir/log/transform_means.$x.log || exit 1;\n      compose-transforms --print-args=false $dir/$x.mat.new $dir/$cur_lda_iter.mat $dir/$x.mat || exit 1;\n      rm $dir/$x.*.macc\n    fi\n    feats=\"$splicedfeats transform-feats $dir/$x.mat ark:- ark:- |\"\n    cur_lda_iter=$x\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss --power=$power \\\n        $dir/$x.mdl \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nrm $dir/final.{mdl,mat,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s $cur_lda_iter.mat $dir/final.mat\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\n# Summarize warning messages...\nutils/summarize_warnings.pl 
$dir/log\n\nsteps/info/gmm_dir_info.pl $dir\n\necho \"$0: Done training system with LDA+MLLT features in $dir\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/train_lvtln.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012-2014  Johns Hopkins University (Author: Daniel Povey)\n# Copyright 2014       Vimal Manohar\n# This training script trains linear-VTLN models starting from an existing\n# system based on either LDA+MLLT or delta+delta-delta features.\n# Works with either mfcc or plp features, but you need to set the \n# --base-feat-type option.\n# The resulting system can be used with align_lvtln.sh and/or decode_lvtln.sh\n# to get VTLN warping factors for data, for warped data extraction, or (for\n# the training data) you can use the warping factors this script outputs\n# in $dir/final.warp\n#\n# Apache 2.0\n\n# Begin configuration.\nstage=-6 #  This allows restarting after partway, when something when wrong.\nconfig=\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.25 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\ncmvn_opts=  # you can supply e.g. --cmvn-opts \"--norm-vars=true\" to turn on variance\n            # normalization, but only if base system is the delta type, not LDA.\nlvtln_iters=\"2 4 6 8 10 12 14 16 20\"; # iters on which to recompute LVTLN transform\"\nnum_utt_lvtln_init=200; # number of utterances (subset) to initialize\n                        # LVTLN transform.  Not too critical.\nmin_warp=0.85\nmax_warp=1.25\nwarp_step=0.01\nbase_feat_type=mfcc # or could be PLP.\nlogdet_scale=0.0\n\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. 
parse_options.sh || exit 1;\n\nnum_classes=$(perl -e \"print int(1.5 + ($max_warp - $min_warp) / $warp_step);\") || exit 1;\ndefault_class=$(perl -e \"print int(0.5 + (1.0 - $min_warp) / $warp_step);\") || exit 1;\n\nif [ $# != 6 ]; then\n   echo \"Usage: $0 <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>\"\n   echo \"e.g.: $0 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt $data/wav.scp; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter increment for #Gauss\noov=`cat $lang/oov.int` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata=$data/split$nj;\nsplit_data.sh $data $nj || exit 1;\n\n\ncp $alidir/splice_opts $dir 2>/dev/null\n\n\nif [ ! 
-f $alidir/final.mat ]; then\n  [ $(cat $alidir/cmvn_opts 2>/dev/null | wc -c) -gt 1 ] && [ -z \"$cmvn_opts\" ] && \\\n    echo \"$0: warning: ignoring CMVN options from $alidir.\";\n  echo $cmvn_opts > $dir/cmvn_opts\n\n  echo \"$0: Using delta+delta-delta features since $alidir/final.mat does not exist\"\n  sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n  # for the subsets of features that we use to estimate the linear transforms, we don't\n  # bother with CMVN.  This will give us wrong offsets on the transforms, but it will end\n  # up not mattering because we allow an arbitrary offset (bias) term when we apply\n  # these transforms.\n  featsub_warped=\"ark:add-deltas ark:$dir/feats.CLASS.ark ark:- |\" # you need to define CLASS when invoking $cmd.\n  featsub_unwarped=\"ark:add-deltas ark:$dir/feats.$default_class.ark ark:- |\"\nelse\n  echo \"$0: Using LDA features\"\n  [ ! 
-z \"$cmvn_opts\" ] && echo  \"$0: you cannot supply --cmvn-opts if base system is LDA.\"\n  cp $alidir/final.mat $alidir/full.mat $alidir/splice_opts $alidir/cmvn_opts $dir 2>/dev/null \n  cmvn_opts=`cat $dir/cmvn_opts 2>/dev/null`\n  sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $dir/final.mat ark:- ark:- |\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n  featsub_warped=\"ark:splice-feats $splice_opts ark:$dir/feats.CLASS.ark ark:- | transform-feats $dir/final.mat ark:- ark:- |\" # you need to define CLASS when invoking $cmd.\n  featsub_unwarped=\"ark:splice-feats $splice_opts ark:$dir/feats.$default_class.ark ark:- | transform-feats $dir/final.mat ark:- ark:- |\"  \nfi\n\nif [ -f $data/utt2warp ]; then\n  echo \"$0: source data directory $data appears to already have VTLN.\";\n  exit 1;\nfi\n\n# create a small subset of utterances for purposes of initializing the LVTLN transform\n# utils/shuffle_list.pl is deterministic, unlike sort -R.\ncat $data/utt2spk | awk '{print $1}' | utils/shuffle_list.pl | \\\n  head -n $num_utt_lvtln_init > $dir/utt_subset\n\nif [ $stage -le -6 ]; then\n  echo \"$0: computing warped subset of features\"\n  if [ -f $data/segments ]; then\n    echo \"$0 [info]: segments file exists: using that.\"\n    subset_feats=\"utils/filter_scp.pl $dir/utt_subset $data/segments | extract-segments scp:$data/wav.scp - ark:- \"\n  else\n    echo \"$0 [info]: no segments file exists: using wav.scp directly.\"\n    subset_feats=\"utils/filter_scp.pl $dir/utt_subset $data/wav.scp | wav-copy scp:- ark:- \"\n  fi\n  rm $dir/.error 2>/dev/null\n  for c in $(seq 0 $[$num_classes-1]); do\n    this_warp=$(perl -e \"print ($min_warp + ($c*$warp_step));\")\n    $cmd $dir/log/compute_warped_feats.$c.log \\\n      $subset_feats \\| compute-${base_feat_type}-feats 
--verbose=2 \\\n      --config=conf/${base_feat_type}.conf --vtln-warp=$this_warp ark:- ark:- \\| \\\n      copy-feats --compress=true ark:- ark:$dir/feats.$c.ark || touch $dir/.error &\n  done\n  wait;\n  if [ -f $dir/.error ]; then\n    echo \"$0: Computing warped features failed: check $dir/log/compute_warped_feats.*.log\"\n    exit 1;\n  fi\nfi\n\nif ! utils/filter_scp.pl $dir/utt_subset $data/feats.scp | \\\n  compare-feats --threshold=0.98 scp:-  ark:$dir/feats.$default_class.ark >&/dev/null; then\n  echo \"$0: features stored on disk differ from those computed with no warping.\"\n  echo \"    Possibly your feature type is wrong (--base-feat-type option)\"\n  exit 1;\nfi\n  \nif [ -f $data/segments ]; then\n  subset_utts=\"ark:extract-segments scp:$sdata/JOB/wav.scp $sdata/JOB/segments ark:- |\"\nelse\n  echo \"$0 [info]: no segments file exists: using wav.scp directly.\"\n  subset_utts=\"ark:wav-copy scp:$sdata/JOB/wav.scp ark:- |\"\nfi\n\nif [ $stage -le -5 ]; then\n  echo \"$0: initializing base LVTLN transforms in $dir/0.lvtln (ignore warnings below)\"\n  dim=$(feat-to-dim \"$featsub_unwarped\" - ) || exit 1;\n\n  $cmd $dir/log/init_lvtln.log \\\n    gmm-init-lvtln --dim=$dim --num-classes=$num_classes --default-class=$default_class \\\n      $dir/0.lvtln || exit 1;\n\n  $cmd JOB=1:$nj $dir/log/get_weights.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz |\" ark:- \\| \\\n    weight-silence-post 0.0 \"$silphonelist\" $alidir/final.mdl ark:- ark:- \\| \\\n    post-to-weights ark:- \"ark,scp:$dir/weights.JOB.ark,$dir/weights.JOB.scp\" || exit 1\n\n  for n in `seq 1 $nj`; do \n    cat $dir/weights.$n.scp\n  done > $dir/weights.scp\n\n  for c in $(seq 0 $[$num_classes-1]); do\n    this_warp=$(perl -e \"print ($min_warp + ($c*$warp_step));\")\n    orig_feats=ark:$dir/feats.$default_class.ark\n    warped_feats=ark:$dir/feats.$c.ark\n    logfile=$dir/log/train_special.$c.log\n    this_featsub_warped=\"$(echo $featsub_warped | sed s/CLASS/$c/)\"\n   
 if ! gmm-train-lvtln-special --warp=$this_warp --normalize-var=true \\\n      --weights-in=\"scp:$dir/weights.scp\" \\\n      $c $dir/0.lvtln $dir/0.lvtln \\\n      \"$featsub_unwarped\" \"$this_featsub_warped\" 2>$logfile; then\n      echo \"$0: Error training LVTLN transform, see $logfile\";\n      exit 1;\n    fi\n  done  \n  rm $dir/final.lvtln 2>/dev/null\n  ln -s 0.lvtln $dir/final.lvtln\nfi\n\nif [ $stage -le -4 ]; then\n  echo \"$0: computing initial LVTLN transforms for speakers\"\n\n  if [ -f $alidir/final.alimdl ]; then\n    # if the base system was trained with SAT, it's probably better\n    # to use the .alimdl, trained speaker-independent, to get the\n    # LVTLN transforms (LVTLN may be closer to an unadapted system).\n    echo \"$0: to get initial LVTLN transforms, using $alidir/final.alimdl\"\n    srcmodel=$alidir/final.alimdl\n  else\n    srcmodel=$alidir/final.mdl\n  fi\n\n  $cmd JOB=1:$nj $dir/log/lvtln.0.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n    weight-silence-post 0.0 \"$silphonelist\" $alidir/final.mdl ark:- ark:- \\| \\\n    gmm-post-to-gpost $srcmodel \"$sifeats\" ark:- ark:- \\| \\\n    gmm-est-lvtln-trans --logdet-scale=$logdet_scale --verbose=1 \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $srcmodel \\\n      $dir/0.lvtln \"$sifeats\" ark:- ark:$dir/trans.JOB ark,t:$dir/warp.0.JOB || exit 1\n  \n  # consolidate the warps into one file.\n  for j in $(seq $nj); do cat $dir/warp.0.$j; done > $dir/warp.0\n  rm $dir/warp.0.*\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: getting questions for tree-building, via clustering\"\n  # 
preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n\n  gmm-mixup --mix-up=$numgauss $dir/1.mdl $dir/1.occs $dir/1.mdl 2>$dir/log/mixup.log || exit 1;\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $data/split$nj/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: training pass $x\"\n  if echo $realign_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo \"$0: aligning data\"\n      mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n      $cmd JOB=1:$nj $dir/log/align.$x.JOB.log 
\\\n        gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n         \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n         \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n    fi\n  fi\n  if echo $lvtln_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo \"Re-estimating LVTLN transforms\"\n      $cmd JOB=1:$nj $dir/log/lvtln.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n        weight-silence-post 0.0 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-post-to-gpost $dir/$x.mdl \"$feats\" ark:- ark:- \\| \\\n        gmm-est-lvtln-trans --logdet-scale=$logdet_scale --verbose=1 \\\n          --spk2utt=ark:$sdata/JOB/spk2utt $dir/$x.mdl \\\n          $dir/0.lvtln \"$sifeats\" ark:- ark:$dir/new_trans.JOB ark,t:$dir/warp.$x.JOB || exit 1\n      # consolidate the warps into one file.\n      for j in $(seq $nj); do mv $dir/new_trans.$j $dir/trans.$j; done\n      for j in $(seq $nj); do cat $dir/warp.$x.$j; done > $dir/warp.$x\n      rm $dir/warp.$x.*\n    fi\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --mix-up=$numgauss --power=$power \\\n      --write-occs=$dir/$[$x+1].occs $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\n\nif [ $stage -le $x ]; then\n  # Accumulate stats for \"alignment model\"-- this model is computed with the\n  # speaker-independent features, but matches Gaussian-for-Gaussian with the\n  # final speaker-adapted model.\n  $cmd JOB=1:$nj $dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n    
gmm-acc-stats-twofeats $dir/$x.mdl \"$feats\" \"$sifeats\" \\\n    ark,s,cs:- $dir/$x.JOB.acc || exit 1;\n  [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n  # Update model.\n  $cmd $dir/log/est_alimdl.log \\\n    gmm-est --power=$power --remove-low-count-gaussians=false $dir/$x.mdl \\\n    \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$x.alimdl  || exit 1;\n  rm $dir/$x.*.acc\nfi\n\nif true; then # Diagnostics\n  last_iter=$(echo 0 $lvtln_iters  | awk '{print $NF;}')\n  ln -sf warp.$last_iter $dir/final.warp\n  if [ -f $data/spk2gender ]; then \n    # To make it easier to eyeball the male and female speakers' warps\n    # separately, separate them out.\n    for g in m f; do # means: for gender in male female\n      cat $dir/final.warp | \\\n        utils/filter_scp.pl <(grep -w $g $data/spk2gender | awk '{print $1}') > $dir/final.warp.$g\n      echo -n \"The last few warp factors for gender $g are: \"\n      tail -n 10 $dir/final.warp.$g | awk '{printf(\"%s \", $2);}'; \n      echo\n    done\n  fi\nfi\n\nln -sf $x.mdl $dir/final.mdl\nln -sf $x.occs $dir/final.occs\nln -sf $x.alimdl $dir/final.alimdl\n\n# Summarize warning messages...\nutils/summarize_warnings.pl  $dir/log\n\necho \"$0: Done training LVTLN system in $dir\"\n"
  },
  {
    "path": "egs/steps/train_map.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n\n# Train a model on top of existing features (no feature-space learning of any\n# kind is done).  This script does not re-train the tree, it just does one iteration\n# of MAP adaptation to the model in the input alignment-directory.  It's useful for\n# adapting a system to a specific gender, or new acoustic conditions.\n\n# Note: what we implement here is not the MAP from the paper by Gauvain and Lee,\n# it's the simpler (and, I believe, more widely used) so-called \"relevance MAP\",\n# implemented in HTK, where we add a fixed count \"tau\" of fake Gaussian stats\n# generated from the old model, to the new 'in-domain' stats from the features\n# and alignments provided;  and we only update the mean.  So if the new count\n# is zero it just gives you the Gaussian parameters from the old model, but as\n# you get more than about tau counts, it approaches the in-domain stats.\n# We use 'gmm-ismooth-stats' in the command line because the equations for this\n# are the same as the equations for i-smoothing in discriminative training\n# (for which, see my [Dan Povey's] PhD thesis).\n\n# Begin configuration..\ncmd=run.pl\nstage=0\ntau=20 # smoothing constant used in MAP estimation, corresponds to the number of\n       # \"fake counts\" that we add for the old model.  Larger tau corresponds to less\n       # aggressive re-estimation, and more smoothing.  You might want to try 10 or 15 also\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: steps/train_map.sh <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_map.sh data/train_si84_female data/lang exp/tri3c_ali_si84_female exp/tri4b_female\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndir=$4\n\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# Set various variables.\nnj=`cat $alidir/num_jobs` || exit 1;\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\n\n\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\ncp $alidir/tree $dir\n# link ali.*.gz from $alidir to dest directory.\nutils/ln.pl $alidir/ali.*.gz $dir\n\n\necho $nj >$dir/num_jobs\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n## Set up features.\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp 
scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    cp $alidir/full.mat $dir 2>/dev/null\n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  ln.pl $alidir/trans.* $dir # Link them to dest dir.\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\nelse\n  feats=\"$sifeats\"\nfi\n##\n\nif [ $stage -le 0 ]; then\n  $cmd JOB=1:$nj $dir/log/acc.JOB.log \\\n    gmm-acc-stats-ali  $alidir/final.mdl \"$feats\" \\\n    \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz|\"  $dir/0.JOB.acc || exit 1;\n\n  [ \"`ls $dir/0.*.acc | wc -w`\" -ne \"$nj\" ] && echo \"$0: wrong #accs\" && exit 1;\n\n  $cmd $dir/log/sum_accs.log \\\n    gmm-sum-accs $dir/0.acc $dir/0.*.acc || exit 1;\n\n  rm $dir/0.*.acc\nfi\n\nif [ $stage -le 1 ]; then\n  # Update only the model means.  This is traditional in MAP estimation.\n  $cmd $dir/log/update.log \\\n     gmm-ismooth-stats --smooth-from-model --tau=$tau $alidir/final.mdl $dir/0.acc - \\| \\\n     gmm-est --update-flags=m --write-occs=$dir/final.occs --remove-low-count-gaussians=false \\\n           $alidir/final.mdl - $dir/final.mdl || exit 1;\nfi\n\necho Done\n"
  },
  {
    "path": "egs/steps/train_mmi.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# MMI training (or optionally boosted MMI, if you give the --boost option).\n# 4 iterations (by default) of Extended Baum-Welch update.\n#\n# For the numerator we have a fixed alignment rather than a lattice--\n# this actually follows from the way lattices are defined in Kaldi, which\n# is to have a single path for each word (output-symbol) sequence.\n\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nboost=0.0\ncancel=true # if true, cancel num and den counts on each frame.\ndrop_frames=false # if true, ignore stats from frames where num + den\n                       # have no overlap. \ntau=400\nweight_tau=10\nacwt=0.1\nstage=0\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: steps/train_mmi.sh <data> <lang> <ali> <denlats> <exp>\"\n  echo \" e.g.: steps/train_mmi.sh data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri2b_denlats_si84 exp/tri2b_mmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1), for boosted MMI.  
(default 0)\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --tau                                            # tau for i-smooth to last iter (default 400)\"\n  \n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\ndir=$5\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data/feats.scp $alidir/{tree,final.mdl,ali.1.gz} $denlatdir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\n\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\ncp $alidir/tree $dir\ncp $alidir/final.mdl $dir/0.mdl\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\n# Set up features\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk 
scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n[ -f $alidir/trans.1 ] && echo Using transforms from $alidir && \\\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Iteration $x of MMI training\"\n  # Note: the num and den states are accumulated at the same time, so we\n  # can cancel them per frame.\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-rescore-lattice $dir/$x.mdl \"$lats\" \"$feats\" ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      sum-post --drop-frames=$drop_frames --merge=$cancel --scale1=-1 \\\n      ark:- \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" ark:- \\| \\\n      gmm-acc-stats2 $dir/$x.mdl \"$feats\" ark,s,cs:- \\\n      $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n\n    n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of MMI accumulators $n versus 2*$nj\" && exit 1;\n    $cmd $dir/log/den_acc_sum.$x.log \\\n      gmm-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || exit 1;\n    rm $dir/den_acc.$x.*.acc\n    $cmd $dir/log/num_acc_sum.$x.log \\\n      gmm-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || exit 1;\n    rm $dir/num_acc.$x.*.acc\n\n  # note: this tau value is for smoothing towards model parameters, not\n  # as in the Boosted MMI paper, not towards the ML stats as in the earlier\n 
 # work on discriminative training (e.g. my thesis).  \n  # You could use gmm-ismooth-stats to smooth to the ML stats, if you had\n  # them available [here they're not available if cancel=true].\n\n    $cmd $dir/log/update.$x.log \\\n      gmm-est-gaussians-ebw --tau=$tau $dir/$x.mdl $dir/num_acc.$x.acc $dir/den_acc.$x.acc - \\| \\\n      gmm-est-weights-ebw --weight-tau=$weight_tau - $dir/num_acc.$x.acc $dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n    rm $dir/{den,num}_acc.$x.acc\n  fi\n\n  # Some diagnostics: the objective function progress and auxiliary-function\n  # improvement.\n\n  tail -n 50 $dir/log/acc.$x.*.log | perl -e '$acwt=shift @ARGV; while(<STDIN>) { if(m/gmm-acc-stats2.+Overall weighted acoustic likelihood per frame was (\\S+) over (\\S+) frames/) { $tot_aclike += $1*$2; $tot_frames1 += $2; } if(m|lattice-to-post.+Overall average log-like/frame is (\\S+) over (\\S+) frames.  Average acoustic like/frame is (\\S+)|) { $tot_den_lat_like += $1*$2; $tot_frames2 += $2; $tot_den_aclike += $3*$2; } } if (abs($tot_frames1 - $tot_frames2) > 0.01*($tot_frames1 + $tot_frames2)) { print STDERR \"Frame-counts disagree $tot_frames1 versus $tot_frames2\\n\"; } $tot_den_lat_like /= $tot_frames2; $tot_den_aclike /= $tot_frames2; $tot_aclike *= ($acwt / $tot_frames1);  $num_like = $tot_aclike + $tot_den_aclike; $per_frame_objf = $num_like - $tot_den_lat_like; print \"$per_frame_objf $tot_frames1\\n\"; ' $acwt > $dir/tmpf\n  objf=`cat $dir/tmpf | awk '{print $1}'`;\n  nf=`cat $dir/tmpf | awk '{print $2}'`;\n  rm $dir/tmpf\n  impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n  impr=`perl -e \"print ($impr*$acwt/$nf);\"` # We multiply by acwt, and divide by $nf which is the \"real\" number of frames.\n  echo \"Iteration $x: objf was $objf, MMI auxf change was $impr\" | tee $dir/objf.$x.log\n  x=$[$x+1]\ndone\n\necho \"MMI training finished\"\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/train_mmi_fmmi.sh",
    "content": "#!/usr/bin/env bash\n# by Johns Hopkins University (Author: Daniel Povey), 2012.  Apache 2.0.\n\n# This script does MMI discriminative training, including\n# feature-space (like fMPE) and model-space components. \n# If you give the --boost option it does \"boosted MMI\" (BMMI).\n# On the iterations of training it alternates feature-space\n# and model-space training.  We do 8 iterations in total--\n# 4 of each type ((B)MMI, f(B)MMI)\n\n\n# Begin configuration section.\ncmd=run.pl\nschedule=\"fmmi fmmi fmmi fmmi mmi mmi mmi mmi\"\nboost=0.0\nlearning_rate=0.01\ntau=400 # For model.  Note: we're doing smoothing \"to the previous iteration\",\n    # so --smooth-from-model so 400 seems like a more sensible default\n    # than 100.  We smooth to the previous iteration because now\n    # we are discriminatively training the features (and not using\n    # the indirect differential), so it seems like it wouldn't make \n    # sense to use any element of ML.\nweight_tau=10 # for model weights.\ncancel=true # if true, cancel num and den counts as described in \n     # the boosted MMI paper. \ndrop_frames=false # if true, ignore stats from frames where num + den\n                       # have no overlap. \nindirect=true # if true, use indirect derivative.\nacwt=0.1\nstage=-1\nngselect=2; # Just the 2 top Gaussians.  Beyond that, adding more Gaussians\n            # wouldn't make much difference since the posteriors would be very small.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh;\n. 
parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_mmi_fmmi.sh <data> <lang> <ali-dir> <diag-ubm-dir> <denlat-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_mmi_fmmi.sh data/train_si84 data/lang exp/tri2b_ali_si84 exp/ubm2d exp/tri2b_denlats_si84 exp/tri2b_fmmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1) ... boosted MMI.\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --tau                                            # tau for i-smooth to last iter (default 400)\"\n  echo \"  --learning-rate                                  # learning rate for fMMI, default 0.01\"\n  echo \"  --schedule                                       # learning schedule: by default,\"\n  echo \"                                                   # \\\"fmmi fmmi fmmi fmmi mmi mmi mmi mmi\\\"\"\n  exit 1;\nfi\n\n\ndata=$1\nlang=$2\nalidir=$3\ndubmdir=$4  # where diagonal UBM is.\ndenlatdir=$5\ndir=$6\n\nsilphonelist=`cat $lang/phones/silence.csl`\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data/feats.scp $lang/phones.txt $dubmdir/final.dubm $alidir/final.mdl \\\n    $alidir/ali.1.gz $denlatdir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"Expected file $f to exist\" && exit 1;\ndone\ncp $alidir/final.mdl $alidir/tree $dir || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\n# Note: $feats is the features before fMPE.\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n[ -f $alidir/trans.1 ] && echo Using transforms from $alidir && \\\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$alidir/trans.JOB ark:- ark:- |\"\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\n\nfmpefeats=\"$feats\" # At first, the features \"after fMPE\" are the same as the \n                   # base 
features.\n\n\n# Initialize the fMPE object.  Note: we call it .fmpe because\n# that's what it was called in the original paper, but since\n# we're using the MMI objective function, it's really fMMI.\n\nfmpe-init $dubmdir/final.dubm $dir/0.fmpe 2>$dir/log/fmpe_init.log || exit 1;\n\n\nif [ $stage -le -1 ]; then\n  # Get the gselect (Gaussian selection) info for fMPE.\n  # Note: fMPE object starts with GMM object, so can be read\n  # as one.\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$ngselect $dir/0.fmpe \"$feats\" \\\n    \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\ncp $alidir/final.mdl $dir/0.mdl\n\nx=0\nnum_iters=`echo $schedule | wc -w`\n\nwhile [ $x -lt $num_iters ]; do\n  iter_type=`echo $schedule | cut -d ' ' -f $[$x+1]`\n  case $iter_type in \n    fmmi)\n    echo \"Iteration $x: doing fMMI\"\n    if [ $stage -le $x ]; then\n      numpost=\"ark,s,cs:gunzip -c $alidir/ali.JOB.gz| ali-to-post ark:- ark:-|\"\n        # Note: the command gmm-fmpe-acc-stats below requires the pre-fMPE features.\n      $cmd JOB=1:$nj $dir/log/acc_fmmi.$x.JOB.log \\\n        gmm-rescore-lattice $dir/$x.mdl \"$lats\" \"$fmpefeats\" ark:- \\| \\\n        lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n        sum-post --drop-frames=$drop_frames --scale1=-1 ark:- \"$numpost\" ark:- \\| \\\n        gmm-fmpe-acc-stats $dir/$x.mdl $dir/$x.fmpe \"$feats\" \\\n        \"ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" ark,s,cs:- \\\n        $dir/$x.JOB.fmpe_acc || exit 1;\n      \n      ( fmpe-sum-accs $dir/$x.fmpe_acc $dir/$x.*.fmpe_acc && \\\n        rm $dir/$x.*.fmpe_acc && \\\n        fmpe-est --learning-rate=$learning_rate $dir/$x.fmpe $dir/$x.fmpe_acc $dir/$[$x+1].fmpe ) \\\n        2>$dir/log/est_fmpe.$x.log || exit 1;\n    fi\n    # We need to set the features to use the correct fMPE object.\n    fmpefeats=\"$feats fmpe-apply-transform $dir/$[$x+1].fmpe ark:- 'ark,s,cs:gunzip -c $dir/gselect.JOB.gz|' ark:- |\" \n    rm $dir/$[x+1].mdl 
2>/dev/null; ln -s $x.mdl $dir/$[$x+1].mdl # link previous model.\n    # Now, diagnostics.\n    objf_nf=`grep Overall $dir/log/acc_fmmi.$x.*.log | grep gmm-fmpe-acc-stats | awk '{ p+=$10*$12; nf+=$12; } END{print p/nf, nf;}'`\n    objf=`echo $objf_nf | awk '{print $1}'`;\n    nf=`echo $objf_nf | awk '{print $2}'`;\n    impr=`grep Objf $dir/log/est_fmpe.$x.log | awk '{print $NF}'`\n    impr=`perl -e \"print ($impr/$nf);\"` # normalize by #frames.\n    echo On iter $x, objf was $objf, auxf improvement from fMMI was $impr | tee $dir/objf.$x.log\n    ;;\n    mmi) # MMI iteration.\n    echo \"Iteration $x: doing MMI (getting stats)...\"\n    # Get denominator stats...  For simplicity we rescore the lattice\n    # on all iterations, even though it shouldn't be necessary on the zeroth\n    # (but we want this script to work even if $alidir doesn't contain the\n    # model used to generate the lattice).\n    if [ $stage -le $x ]; then\n      $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n        gmm-rescore-lattice $dir/$x.mdl \"$lats\" \"$fmpefeats\" ark:- \\| \\\n        lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n        sum-post --drop-frames=$drop_frames --merge=$cancel --scale1=-1 \\\n        ark:- \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" ark:- \\| \\\n        gmm-acc-stats2 $dir/$x.mdl \"$fmpefeats\" ark,s,cs:- \\\n        $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n\n      n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n      [ \"$n\" -ne $[$nj*2] ] && \\\n        echo \"Wrong number of MMI accumulators $n versus 2*$nj\" && exit 1;\n      $cmd $dir/log/den_acc_sum.$x.log \\\n        gmm-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || exit 1;\n      rm $dir/den_acc.$x.*.acc\n      $cmd $dir/log/num_acc_sum.$x.log \\\n        gmm-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || exit 1;\n      rm $dir/num_acc.$x.*.acc\n\n      # note: this tau value is for smoothing to model parameters;\n      # you 
need to use gmm-ismooth-stats to smooth to the ML stats,\n      # but anyway this script does canceling of num and den stats on\n      # each frame (as suggested in the Boosted MMI paper) which would\n      # make smoothing to ML impossible without accumulating extra stats.\n      $cmd $dir/log/update.$x.log \\\n        gmm-est-gaussians-ebw --tau=$tau $dir/$x.mdl $dir/num_acc.$x.acc $dir/den_acc.$x.acc - \\| \\\n        gmm-est-weights-ebw --weight-tau=$weight_tau - $dir/num_acc.$x.acc $dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n    else \n      echo \"not doing this iteration because --stage=$stage\"\n    fi\n  \n    # Some diagnostics.. note, this objf is somewhat comparable to the\n    # MMI objective function divided by the acoustic weight, and differences in it\n    # are comparable to the auxf improvement printed by the update program.\n    objf_nf=`grep Overall $dir/log/acc.$x.*.log | grep gmm-acc-stats2 | awk '{ p+=$10*$12; nf+=$12; } END{print p/nf, nf;}'`\n    objf=`echo $objf_nf | awk '{print $1}'`;\n    nf=`echo $objf_nf | awk '{print $2}'`;\n    impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n    impr=`perl -e \"print ($impr/$nf);\"` # renormalize by \"real\" #frames, to correct\n    # for the canceling of stats.\n    echo On iter $x, objf was $objf, auxf improvement was $impr | tee $dir/objf.$x.log\n    rm $dir/$[x+1].fmpe 2>/dev/null; ln -s $x.fmpe $dir/$[$x+1].fmpe # link previous fMPE transform\n    ;;\n    *) echo \"Invalid --schedule option: expected only mmi or fmmi.\"; exit 1;;\n  esac\n  x=$[$x+1]\ndone\n\necho \"Succeeded with $num_iters iterations of MMI+fMMI training (boosting factor = $boost)\"\n\nrm $dir/final.mdl 2>/dev/null; ln -s $num_iters.mdl $dir/final.mdl\nrm $dir/final.fmpe 2>/dev/null; ln -s $num_iters.fmpe $dir/final.fmpe \n\n# Now do some cleanup.\nrm $dir/gselect.*.gz $dir/*.acc $dir/*.fmpe_acc\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/train_mmi_fmmi_indirect.sh",
    "content": "#!/usr/bin/env bash\n# by Johns Hopkins University (Author: Daniel Povey), 2012.  Apache 2.0.\n\n# This script does MMI discriminative training, including\n# feature-space (like fMPE) and model-space components. \n# If you give the --boost option it does \"boosted MMI\" (BMMI).\n# On the iterations of training it alternates feature-space\n# and model-space training.  We do 8 iterations in total--\n# 4 of each type ((B)MMI, f(B)MMI)\n\n\n# Begin configuration section.\ncmd=run.pl\nschedule=\"fmmi mmi fmmi mmi fmmi mmi fmmi mmi\"\nboost=0.0\nlearning_rate=0.02\ntau=200 # For model.  Note: we're doing smoothing \"to the previous iteration\",\n    # so --smooth-from-model so 200 seems like a more sensible default\n    # than 100.  We smooth to the previous iteration because now\n    # we are discriminatively training the features (and not using\n    # the indirect differential), so it seems like it wouldn't make \n    # sense to use any element of ML.\ncancel=true # if true, cancel num and den counts as described in \n     # the boosted MMI paper. \ndrop_frames=false # if true, ignore stats from frames where num + den\n                       # have no overlap. \nacwt=0.1\nstage=-1\nngselect=2; # Just the 2 top Gaussians.  Beyond that, adding more Gaussians\n            # wouldn't make much difference since the posteriors would be very small.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_mmi_fmmi.sh <data> <lang> <ali-dir> <diag-ubm-dir> <denlat-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_mmi_fmmi.sh data/train_si84 data/lang exp/tri2b_ali_si84 exp/ubm2d exp/tri2b_denlats_si84 exp/tri2b_fmmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1) ... 
boosted MMI.\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --tau                                            # tau for i-smooth to last iter (default 200)\"\n  echo \"  --learning-rate                                  # learning rate for fMMI, default 0.02\"\n  echo \"  --schedule                                       # learning schedule: by default,\"\n  echo \"                                                   # \\\"fmmi mmi fmmi mmi fmmi mmi fmmi mmi\\\"\"\n  exit 1;\nfi\n\n\ndata=$1\nlang=$2\nalidir=$3\ndubmdir=$4  # where diagonal UBM is.\ndenlatdir=$5\ndir=$6\n\nsilphonelist=`cat $lang/phones/silence.csl`\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\nutils/lang/check_phones_compatible.sh $lang/phones.txt $dubmdir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data/feats.scp $lang/phones.txt $dubmdir/final.dubm $alidir/final.mdl \\\n  $alidir/ali.1.gz $denlatdir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"Expected file $f to exist\" && exit 1;\ndone\ncp $alidir/final.mdl $alidir/tree $dir || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\n# Note: $feats is the features before fMPE.\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n[ -f $alidir/trans.1 ] && echo Using transforms from $alidir && \\\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$alidir/trans.JOB ark:- ark:- |\"\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\n\nfmpefeats=\"$feats\" # At first, the features \"after fMPE\" are the same as the \n                   # base 
features.\n\n\n# Initialize the fMPE object.  Note: we call it .fmpe because\n# that's what it was called in the original paper, but since\n# we're using the MMI objective function, it's really fMMI.\n\nfmpe-init $dubmdir/final.dubm $dir/0.fmpe 2>$dir/log/fmpe_init.log || exit 1;\n\n\nif [ $stage -le -1 ]; then\n  # Get the gselect (Gaussian selection) info for fMPE.\n  # Note: fMPE object starts with GMM object, so can be read\n  # as one.\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$ngselect $dir/0.fmpe \"$feats\" \\\n    \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\ncp $alidir/final.mdl $dir/0.mdl\n\nx=0\nnum_iters=`echo $schedule | wc -w`\n\nwhile [ $x -lt $num_iters ]; do\n  iter_type=`echo $schedule | cut -d ' ' -f $[$x+1]`\n  case $iter_type in \n    fmmi) fmmi_iter=true; local_cancel=false;;\n    mmi) fmmi_iter=false; local_cancel=$cancel;;\n    *) echo \"Bad iteration type $iter_type\"; exit 1;;\n  esac\n\n  echo \"Getting MMI stats (needed for fMMI and MMI iterations).\";\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-rescore-lattice $dir/$x.mdl \"$lats\" \"$fmpefeats\" ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      sum-post --merge=$local_cancel --scale1=-1 --drop-frames=$drop_frames \\\n      ark:- \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" ark:- \\| \\\n      gmm-acc-stats2 $dir/$x.mdl \"$fmpefeats\" ark,s,cs:- \\\n      $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n    n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of MMI accumulators $n versus 2*$nj\" && exit 1;\n    rm $dir/.error 2>/dev/null\n    $cmd $dir/log/den_acc_sum.$x.log \\\n      gmm-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || touch $dir/.error &\n    $cmd $dir/log/num_acc_sum.$x.log \\\n      gmm-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || touch $dir/.error 
&\n    wait\n    [ -f $dir/.error ] && echo \"Error summing accs\" && exit 1;\n    rm $dir/den_acc.$x.*.acc\n    rm $dir/num_acc.$x.*.acc\n  fi\n\n  if $fmmi_iter; then\n    echo \"Iteration $x: doing fMMI\"\n    if [ $stage -le $x ]; then\n      # Get model derivative.  Note: the \"ml accumulator\" is the same as the \"numerator\"\n      # since this is MMI.  We avoided doing the \"canceling of stats\" on this iteration\n      # so that this would be true (this canceling wouldn't affect the derivative anyway,\n      # so can have no benefit for fMMI, unlike MMI).\n      $cmd $dir/log/get_stats_deriv.$x.log \\\n        gmm-get-stats-deriv $dir/$x.mdl $dir/num_acc.$x.acc $dir/den_acc.$x.acc \\\n        $dir/num_acc.$x.acc $dir/model_deriv.$x.gmmacc\n      numpost=\"ark,s,cs:gunzip -c $alidir/ali.JOB.gz| ali-to-post ark:- ark:-|\"\n        # Note: the command gmm-fmpe-acc-stats below requires the pre-fMPE features.\n      $cmd JOB=1:$nj $dir/log/acc_fmmi.$x.JOB.log \\\n        gmm-rescore-lattice $dir/$x.mdl \"$lats\" \"$fmpefeats\" ark:- \\| \\\n        lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n        sum-post --drop-frames=$drop_frames --merge=false --scale1=-1 \\\n          ark:- \"$numpost\" ark:- \\| \\\n        gmm-fmpe-acc-stats --model-derivative=$dir/model_deriv.$x.gmmacc \\\n          $dir/$x.mdl $dir/$x.fmpe \"$feats\" \\\n         \"ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" ark,s,cs:-  \\\n         $dir/$x.JOB.fmpe_acc || exit 1;\n      \n      ( fmpe-sum-accs $dir/$x.fmpe_acc $dir/$x.*.fmpe_acc && \\\n        rm $dir/$x.*.fmpe_acc && \\\n        fmpe-est --learning-rate=$learning_rate $dir/$x.fmpe $dir/$x.fmpe_acc $dir/$[$x+1].fmpe ) \\\n        2>$dir/log/est_fmpe.$x.log || exit 1;\n\n      fmpefeats=\"$feats fmpe-apply-transform $dir/$[$x+1].fmpe ark:- 'ark,s,cs:gunzip -c $dir/gselect.JOB.gz|' ark:- |\" \n      # OK, now we do one iteration of the \"rescaling update\" where we use the\n      # old and new ML accs, and we shift and 
rescale the model to match the new\n      # features.\n      $cmd JOB=1:$nj $dir/log/acc_ml.$x.JOB.log \\\n        gmm-acc-stats-ali $dir/$x.mdl \"$fmpefeats\" \"ark:gunzip -c $alidir/ali.JOB.gz|\" \\\n          $dir/new_ml_acc.$x.JOB.acc || exit 1;\n      $cmd $dir/log/new_ml_acc_sum.$x.log \\\n        gmm-sum-accs $dir/new_ml_acc.$x.acc $dir/new_ml_acc.$x.*.acc || exit 1;\n      $cmd $dir/log/update_rescale.$x.log \\\n        gmm-est-rescale $dir/$x.mdl $dir/num_acc.$x.acc $dir/new_ml_acc.$x.acc \\\n        $dir/$[$x+1].mdl || exit 1;\n    fi\n    # We need to set the features to use the correct fMPE object.\n    # This is a repeat of a command above-- in case we didn't do this stage.\n    fmpefeats=\"$feats fmpe-apply-transform $dir/$[$x+1].fmpe ark:- 'ark,s,cs:gunzip -c $dir/gselect.JOB.gz|' ark:- |\" \n    # Now, diagnostics.\n    objf_nf=`grep Overall $dir/log/acc_fmmi.$x.*.log | grep gmm-fmpe-acc-stats | awk '{ p+=$10*$12; nf+=$12; } END{print p/nf, nf;}'`\n    objf=`echo $objf_nf | awk '{print $1}'`;\n    nf=`echo $objf_nf | awk '{print $2}'`;\n    impr=`grep Objf $dir/log/est_fmpe.$x.log | awk '{print $NF}'`\n    impr=`perl -e \"print ($impr/$nf);\"` # normalize by #frames.\n    echo On iter $x, objf was $objf, auxf improvement from fMMI was $impr | tee $dir/objf.$x.log\n  else # MMI iteration-- on this iteration do model-space update.\n    echo \"Iteration $x: doing MMI update\"\n      # note: this tau value is for smoothing to model parameters;\n      # you need to use gmm-ismooth-stats to smooth to the ML stats,\n      # but anyway this script does canceling of num and den stats on\n      # each frame (as suggested in the Boosted MMI paper) which would\n      # make smoothing to ML impossible without accumulating extra stats.\n    if [ $stage -le $x ]; then\n      $cmd $dir/log/update.$x.log \\\n        gmm-est-gaussians-ebw --tau=$tau $dir/$x.mdl $dir/num_acc.$x.acc $dir/den_acc.$x.acc - \\| \\\n        gmm-est-weights-ebw - $dir/num_acc.$x.acc 
$dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n    else \n      echo \"not doing this iteration because --stage=$stage\"\n    fi\n    \n    # Some diagnostics.. note, this objf is somewhat comparable to the\n    # MMI objective function divided by the acoustic weight, and differences in it\n    # are comparable to the auxf improvement printed by the update program.\n    objf_nf=`grep Overall $dir/log/acc.$x.*.log | grep gmm-acc-stats2 | awk '{ p+=$10*$12; nf+=$12; } END{print p/nf, nf;}'`\n    objf=`echo $objf_nf | awk '{print $1}'`;\n    nf=`echo $objf_nf | awk '{print $2}'`;\n    impr=`grep Overall $dir/log/update.$x.log | head -1 | awk '{print $10*$12;}'`\n    impr=`perl -e \"print ($impr/$nf);\"` # renormalize by \"real\" #frames, to correct\n    # for the canceling of stats.\n    echo On iter $x, objf was $objf, auxf improvement was $impr | tee $dir/objf.$x.log\n    rm $dir/$[x+1].fmpe 2>/dev/null; ln -s $x.fmpe $dir/$[$x+1].fmpe # link previous fMPE transform\n  fi\n  x=$[$x+1]\ndone\n\necho \"Succeeded with $num_iters iterations of MMI+fMMI training (boosting factor = $boost)\"\n\nrm $dir/final.mdl 2>/dev/null; ln -s $num_iters.mdl $dir/final.mdl\nrm $dir/final.fmpe 2>/dev/null; ln -s $num_iters.fmpe $dir/final.fmpe \n\n# Now do some cleanup.\nrm $dir/gselect.*.gz $dir/*.acc $dir/*.fmpe_acc\nexit 0;\n\n"
  },
  {
    "path": "egs/steps/train_mmi_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# MMI training (or optionally boosted MMI, if you give the --boost option),\n# for SGMMs.  4 iterations (by default) of Extended Baum-Welch update.\n#\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nboost=0.0\ncancel=true # if true, cancel num and den counts on each frame.\ndrop_frames=false # this is the same as frame dropping (see Karel's ICASSP2013 paper).\nacwt=0.1\nstage=0\nupdate_opts=\ntransform_dir=\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: steps/train_mmi_sgmm2.sh <data> <lang> <ali> <denlats> <exp>\"\n  echo \" e.g.: steps/train_mmi_sgmm2.sh data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri2b_denlats_si84 exp/tri2b_mmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1), for boosted MMI.  (default 0)\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"  \n  echo \"  --transform-dir <transform-dir>                  # directory to find fMLLR transforms.\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\ndir=$5\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data/feats.scp $alidir/{tree,final.mdl,ali.1.gz} $denlatdir/lat.1.gz; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\n\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\ncp $alidir/tree $dir\ncp $alidir/final.mdl $dir/0.mdl\ncp $alidir/final.alimdl $dir\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\n# Set up features\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\nif [ ! -z \"$transform_dir\" ]; then\n  echo \"$0: using transforms from $transform_dir\"\n  [ ! 
-f $transform_dir/trans.1 ] && echo \"$0: no such file $transform_dir/trans.1\" \\\n    && exit 1;\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$transform_dir/trans.JOB ark:- ark:- |\"\nelse\n  echo \"$0: no fMLLR transforms.\"\nfi\n\nif [ -f $alidir/vecs.1 ]; then\n  echo \"$0: using speaker vectors from $alidir\"\n  spkvecs_opt=\"--spk-vecs=ark:$alidir/vecs.JOB --utt2spk=ark:$sdata/JOB/utt2spk\"\nelse\n  echo \"$0: no speaker vectors.\"\n  spkvecs_opt=\nfi\n\nif [ -f $alidir/gselect.1.gz ]; then\n  echo \"$0: using Gaussian-selection info from $alidir\"\n  gselect_opt=\"--gselect=ark,s,cs:gunzip -c $alidir/gselect.JOB.gz|\"\nelse\n  echo \"$0: error: no Gaussian-selection info found\" && exit 1;\nfi\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Iteration $x of MMI training\"\n  # Note: the num and den states are accumulated at the same time: \n  # can cancel them per frame.\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      test -s $dir/den_acc.$x.JOB.gz -a -s $dir/num_acc.$x.JOB.gz '||' \\\n      sgmm2-rescore-lattice --speedup=true \"$gselect_opt\" $spkvecs_opt $dir/$x.mdl \"$lats\" \"$feats\" ark:- \\| \\\n      lattice-to-post --acoustic-scale=$acwt ark:- ark:- \\| \\\n      sum-post --drop-frames=$drop_frames --merge=$cancel --scale1=-1 \\\n      ark:- \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" ark:- \\| \\\n      sgmm2-acc-stats2 \"$gselect_opt\" $spkvecs_opt $dir/$x.mdl \"$feats\" ark,s,cs:- \\\n      \"|gzip -c >$dir/num_acc.$x.JOB.gz\" \"|gzip -c >$dir/den_acc.$x.JOB.gz\" || exit 1;\n\n    n=`echo $dir/{num,den}_acc.$x.*.gz | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of MMI 
accumulators $n versus 2*$nj\" && exit 1;\n    num_acc_sum=\"sgmm2-sum-accs - \";\n    den_acc_sum=\"sgmm2-sum-accs - \";\n    for j in `seq $nj`; do \n      num_acc_sum=\"$num_acc_sum 'gunzip -c $dir/num_acc.$x.$j.gz|'\"; \n      den_acc_sum=\"$den_acc_sum 'gunzip -c $dir/den_acc.$x.$j.gz|'\"; \n    done\n    $cmd $dir/log/update.$x.log \\\n     sgmm2-est-ebw $update_opts $dir/$x.mdl \"$num_acc_sum |\" \"$den_acc_sum |\" \\\n      $dir/$[$x+1].mdl || exit 1;\n    rm $dir/*_acc.$x.*.gz \n  fi\n\n  # Some diagnostics: the objective function progress and auxiliary-function\n  # improvement.  Note: this code is same as in train_mmi.sh\n  tail -n 50 $dir/log/acc.$x.*.log | perl -e '$acwt=shift @ARGV; while(<STDIN>) { if(m/sgmm2-acc-stats2.+Overall weighted acoustic likelihood per frame was (\\S+) over (\\S+) frames/) { $tot_aclike += $1*$2; $tot_frames1 += $2; } if(m|lattice-to-post.+Overall average log-like/frame is (\\S+) over (\\S+) frames.  Average acoustic like/frame is (\\S+)|) { $tot_den_lat_like += $1*$2; $tot_frames2 += $2; $tot_den_aclike += $3*$2; } } if (abs($tot_frames1 - $tot_frames2) > 0.01*($tot_frames1 + $tot_frames2)) { print STDERR \"Frame-counts disagree $tot_frames1 versus $tot_frames2\\n\"; } $tot_den_lat_like /= $tot_frames2; $tot_den_aclike /= $tot_frames2; $tot_aclike *= ($acwt / $tot_frames1);  $num_like = $tot_aclike + $tot_den_aclike; $per_frame_objf = $num_like - $tot_den_lat_like; print \"$per_frame_objf $tot_frames1\\n\"; ' $acwt > $dir/tmpf\n  objf=`cat $dir/tmpf | awk '{print $1}'`;\n  nf=`cat $dir/tmpf | awk '{print $2}'`;\n  rm $dir/tmpf\n  impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n  impr=`perl -e \"print ($impr*$acwt/$nf);\"` # We multiply by acwt, and divide by $nf which is the \"real\" number of frames.\n  echo \"Iteration $x: objf was $objf, MMI auxf change was $impr\" | tee $dir/objf.$x.log\n  x=$[$x+1]\ndone\n\necho \"MMI training finished\"\n\nrm $dir/final.mdl 2>/dev/null\nrm 
$dir/*.acc 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/train_mono.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n#           2019  Xiaohui Zhang\n# Apache 2.0\n\n\n# To be run from ..\n# Flat start and monophone training, with delta-delta features.\n# This script applies cepstral mean normalization (per speaker).\n\n# Begin configuration section.\nnj=4\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nnum_iters=40    # Number of iterations of training\nmax_iter_inc=30 # Last iter to increase #Gauss on.\ninitial_beam=6 # beam used in the first iteration (set smaller to speed up initialization)\nregular_beam=10 # beam used after the first iteration\nretry_beam=40\ntotgauss=1000 # Target #Gaussians.\ncareful=false\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\nrealign_iters=\"1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 23 26 29 32 35 38\";\nconfig= # name of config file.\nstage=-4\npower=0.25 # exponent to determine number of gaussians from occurrence counts\nnorm_vars=false # deprecated, prefer --cmvn-opts \"--norm-vars=false\"\ncmvn_opts=  # can be used to add extra options to cmvn.\ndelta_opts= # can be used to add extra options to add-deltas\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. 
parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \"Usage: steps/train_mono.sh [options] <data-dir> <lang-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_mono.sh data/train.1k data/lang exp/mono\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\ndir=$3\n\noov_sym=`cat $lang/oov.int` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\ncp $lang/phones.txt $dir || exit 1;\n\n$norm_vars && cmvn_opts=\"--norm-vars=true $cmvn_opts\"\necho $cmvn_opts  > $dir/cmvn_opts # keep track of options to CMVN.\n[ ! -z $delta_opts ] && echo $delta_opts > $dir/delta_opts # keep track of options to delta\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\"\nexample_feats=\"`echo $feats | sed s/JOB/1/g`\";\n\necho \"$0: Initializing monophone system.\"\n\n[ ! -f $lang/phones/sets.int ] && exit 1;\nshared_phones_opt=\"--shared-phones=$lang/phones/sets.int\"\n\nif [ $stage -le -3 ]; then\n  # Note: JOB=1 just uses the 1st part of the features-- we only need a subset anyway.\n  if ! 
feat_dim=`feat-to-dim \"$example_feats\" - 2>/dev/null` || [ -z $feat_dim ]; then\n    feat-to-dim \"$example_feats\" -\n    echo \"error getting feature dimension\"\n    exit 1;\n  fi\n  $cmd JOB=1 $dir/log/init.log \\\n    gmm-init-mono $shared_phones_opt \"--train-feats=$feats subset-feats --n=10 ark:- ark:-|\" $lang/topo $feat_dim \\\n    $dir/0.mdl $dir/tree || exit 1;\nfi\n\nnumgauss=`gmm-info --print-args=false $dir/0.mdl | grep gaussians | awk '{print $NF}'`\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter increment for #Gauss\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Compiling training graphs\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/0.mdl  $lang/L.fst \\\n    \"ark:sym2int.pl --map-oov $oov_sym -f 2- $lang/words.txt < $sdata/JOB/text|\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"$0: Aligning data equally (pass 0)\"\n  $cmd JOB=1:$nj $dir/log/align.0.JOB.log \\\n    align-equal-compiled \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" ark,t:-  \\| \\\n    gmm-acc-stats-ali --binary=true $dir/0.mdl \"$feats\" ark:- \\\n    $dir/0.JOB.acc || exit 1;\nfi\n\n# In the following steps, the --min-gaussian-occupancy=3 option is important, otherwise\n# we fail to est \"rare\" phones and later on, they never align properly.\n\nif [ $stage -le 0 ]; then\n  gmm-est --min-gaussian-occupancy=3  --mix-up=$numgauss --power=$power \\\n    $dir/0.mdl \"gmm-sum-accs - $dir/0.*.acc|\" $dir/1.mdl 2> $dir/log/update.0.log || exit 1;\n  rm $dir/0.*.acc\nfi\n\nbeam=$initial_beam # will change to regular_beam below after 1st pass\n# note: using slightly wider beams for WSJ vs. 
RM.\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: Pass $x\"\n  if [ $stage -le $x ]; then\n    if echo $realign_iters | grep -w $x >/dev/null; then\n      echo \"$0: Aligning data\"\n      mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n      $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n        gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl\" \\\n        \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \"ark,t:|gzip -c >$dir/ali.JOB.gz\" \\\n        || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \"ark:gunzip -c $dir/ali.JOB.gz|\" \\\n      $dir/$x.JOB.acc || exit 1;\n\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss --power=$power $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs 2>/dev/null\n  fi\n  if [ $x -le $max_iter_inc ]; then\n     numgauss=$[$numgauss+$incgauss];\n  fi\n  beam=$regular_beam\n  x=$[$x+1]\ndone\n\n( cd $dir; rm final.{mdl,occs} 2>/dev/null; ln -s $x.mdl final.mdl; ln -s $x.occs final.occs )\n\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\nutils/summarize_warnings.pl $dir/log\n\nsteps/info/gmm_dir_info.pl $dir\n\necho \"$0: Done training monophone system in $dir\"\n\nexit 0\n\n# example of showing the alignments:\n# show-alignments data/lang/phones.txt $dir/30.mdl \"ark:gunzip -c $dir/ali.0.gz|\" | head -4\n\n"
  },
  {
    "path": "egs/steps/train_mpe.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# MMI training (or optionally boosted MMI, if you give the --boost option).\n# 4 iterations (by default) of Extended Baum-Welch update.\n#\n# For the numerator we have a fixed alignment rather than a lattice--\n# this actually follows from the way lattices are defined in Kaldi, which\n# is to have a single path for each word (output-symbol) sequence.\n\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\nboost=0.0\ncancel=true # if true, cancel num and den counts on each frame.\ntau=400\nweight_tau=10\nacwt=0.1\nstage=0\nsmooth_to_mode=true\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: steps/train_mmi.sh <data> <lang> <ali> <denlats> <exp>\"\n  echo \" e.g.: steps/train_mmi.sh data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri2b_denlats_si84 exp/tri2b_mmi\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --boost <boost-weight>                           # (e.g. 0.1), for boosted MMI.  
(default 0)\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --tau                                            # tau for i-smooth to last iter (default 400)\"\n  \n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\ndir=$5\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in $data/feats.scp $alidir/{tree,final.mdl,ali.1.gz} $denlatdir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\n\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\ncp $alidir/{final.mdl,tree} $dir\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\n# Set up features\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats 
$alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n[ -f $alidir/trans.1 ] && echo Using transforms from $alidir && \\\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\nif [[ \"$boost\" != \"0.0\" && \"$boost\" != 0 ]]; then\n  lats=\"$lats lattice-boost-ali --b=$boost --silence-phones=$silphonelist $alidir/final.mdl ark:- 'ark,s,cs:gunzip -c $alidir/ali.JOB.gz|' ark:- |\"\nfi\n\n\ncur_mdl=$alidir/final.mdl\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Iteration $x of MPE training\"\n  # Note: the num and den states are accumulated at the same time, so we\n  # can cancel them per frame.\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-rescore-lattice $cur_mdl \"$lats\" \"$feats\" ark:- \\| \\\n      lattice-to-mpe-post --acoustic-scale=$acwt $cur_mdl \\\n        \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz |\" ark:- ark:- \\| \\\n      gmm-acc-stats2 $cur_mdl \"$feats\" ark,s,cs:- \\\n        $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n\n    n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of MMI accumulators $n versus 2*$nj\" && exit 1;\n    $cmd $dir/log/den_acc_sum.$x.log \\\n      gmm-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || exit 1;\n    rm $dir/den_acc.$x.*.acc\n    $cmd $dir/log/num_acc_sum.$x.log \\\n      gmm-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || exit 1;\n    rm $dir/num_acc.$x.*.acc\n\n    # note: this tau value is for smoothing towards model parameters, not\n    # as in the Boosted MMI paper, not towards the ML stats as in the earlier\n    # work on discriminative training (e.g. my thesis).  
\n    # You could use gmm-ismooth-stats to smooth to the ML stats, if you had\n    # them available [here they're not available if cancel=true].\n    if ! $smooth_to_model; then\n      echo \"Iteration $x of MPE: computing ml (smoothing) stats\"\n      $cmd JOB=1:$nj $dir/log/acc_ml.$x.JOB.log \\\n        gmm-acc-stats $cur_mdl \"$feats\" \\\n          \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" \\\n          $dir/ml.$x.JOB.acc || exit 1;\n      $cmd $dir/log/acc_ml_sum.$x.log \\\n        gmm-sum-accs $dir/ml.$x.acc $dir/ml.$x.*.acc || exit 1;\n      rm $dir/ml.$x.*.acc\n      num_stats=\"gmm-ismooth-stats --tau=$tau $dir/ml.$x.acc $dir/num_acc.$x.acc -|\"\n    else \n      num_stats=\"gmm-ismooth-stats --smooth-from-model=true --tau=$tau $cur_mdl $dir/num_acc.$x.acc -|\"\n    fi  \n    \n    $cmd $dir/log/update.$x.log \\\n      gmm-est-gaussians-ebw $cur_mdl \"$num_stats\" $dir/den_acc.$x.acc - \\| \\\n      gmm-est-weights-ebw - $dir/num_acc.$x.acc $dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n    rm $dir/{den,num}_acc.$x.acc\n  fi\n  cur_mdl=$dir/$[$x+1].mdl\n\n  # Some diagnostics: the objective function progress and auxiliary-function\n  # improvement.\n\n tail -n 50 $dir/log/acc.$x.*.log | perl -e 'while(<STDIN>) { if(m/lattice-to-mpe-post.+Overall average frame-accuracy is (\\S+) over (\\S+) frames/) { $tot_objf += $1*$2; $tot_frames += $2; }} $tot_objf /= $tot_frames; print \"$tot_objf $tot_frames\\n\"; ' > $dir/tmpf\n  objf=`cat $dir/tmpf | awk '{print $1}'`;\n  nf=`cat $dir/tmpf | awk '{print $2}'`;\n  rm $dir/tmpf\n  impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n  impr=`perl -e \"print ($impr*$acwt/$nf);\"` # We multiply by acwt, and divide by $nf which is the \"real\" number of frames.\n  # This gives us a projected objective function improvement.\n  echo \"Iteration $x: objf was $objf, MPE auxf change was $impr\" | tee $dir/objf.$x.log\n  x=$[$x+1]\ndone\n\necho \"MPE training 
finished\"\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/train_quick.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n\n# Train a model on top of existing features (no feature-space learning of any\n# kind is done).  This script initializes the model (i.e., the GMMs) from the\n# previous system's model.  That is: for each state in the current model (after\n# tree building), it chooses the closes state in the old model, judging the\n# similarities based on overlap of counts in the tree stats.\n\n# Begin configuration..\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 15\"; # Only realign twice.\nnum_iters=20    # Number of iterations of training\nmaxiterinc=15 # Last iter to increase #Gauss on.\nbatch_size=750 # batch size to use while compiling graphs... memory/speed tradeoff.\nbeam=10 # alignment beam.\nretry_beam=40\nstage=-5\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_quick.sh <num-leaves> <num-gauss> <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_quick.sh 2500 15000 data/train_si284 data/lang exp/tri3c_ali_si284 exp/tri4b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n# Set various variables.\noov=`cat $lang/oov.int`\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl`\nnumgauss=$[totgauss/2] # Start with half the total number of Gaussians.  We won't have\n  # to mix up much probably, as we're initializing with the old (already mixed-up) pdf's.  \n[ $numgauss -lt $numleaves ] && numgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$maxiterinc] # per-iter increment for #Gauss\nnj=`cat $alidir/num_jobs` || exit 1;\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\n\nmkdir -p $dir/log\necho $nj >$dir/num_jobs\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n## Set up features.\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    cp $alidir/full.mat $dir 2>/dev/null\n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  ln.pl $alidir/trans.* $dir # Link them to dest 
dir.\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\nelse\n  feats=\"$sifeats\"\nfi\n##\n\n\nif [ $stage -le -5 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-stats\" && exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -4 ]; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"$0: Initializing the model\"\n\n  # The gmm-init-model command (with more than the normal # of command-line args)\n  # will initialize the p.d.f.'s to the p.d.f.'s in the alignment model.\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/tmp.mdl $alidir/tree $alidir/final.mdl  \\\n    2>$dir/log/init_model.log || exit 1;\n\n  grep 'no stats' $dir/log/init_model.log && echo \"$0: This is a bad warning.\";\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: mixing up old model.\"\n  # We do both mixing-down and mixing-up to get the target #Gauss in each state,\n  # since the initial model may have 
either more or fewer Gaussians than we want.\n  gmm-mixup --mix-down=$numgauss --mix-up=$numgauss $dir/tmp.mdl $dir/1.occs $dir/1.mdl \\\n    2> $dir/log/mixup.log || exit 1;\n  rm $dir/tmp.mdl \nfi\n\n# Convert alignments to the new tree.\nif [ $stage -le -1 ]; then\n  echo \"$0: converting old alignments\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling training graphs\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int --batch-size=$batch_size $dir/tree $dir/1.mdl $lang/L.fst  \\\n    \"ark:sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n    \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: pass $x\"\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo \"$0: aligning data\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam $dir/$x.mdl \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \"ark:|gzip -c >$dir/ali.JOB.gz\" \\\n      || exit 1;\n  fi\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\"  $dir/$x.JOB.acc || exit 1;\n    [ \"`ls $dir/$x.*.acc | wc -w`\" -ne \"$nj\" ] && echo \"$0: wrong #accs\" && exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs\n  fi\n  [[ $x -le $maxiterinc ]] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: 
estimating alignment model\"\n  $cmd JOB=1:$nj $dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n    gmm-acc-stats-twofeats $dir/$x.mdl \"$feats\" \"$sifeats\" \\\n    ark,s,cs:- $dir/$x.JOB.acc || exit 1;\n  [ \"`ls $dir/$x.*.acc | wc -w`\" -ne \"$nj\" ] && echo \"$0: wrong #accs\" && exit 1;\n\n  $cmd $dir/log/est_alimdl.log \\\n    gmm-est --write-occs=$dir/final.occs --remove-low-count-gaussians=false $dir/$x.mdl \\\n    \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$x.alimdl || exit 1;\n  rm $dir/$x.*.acc\n  rm $dir/final.alimdl 2>/dev/null \n  ln -s $x.alimdl $dir/final.alimdl\nfi\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\necho Done\n"
  },
  {
    "path": "egs/steps/train_raw_sat.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n\n# This does Speaker Adapted Training (SAT).  We train on fMLLR-adapted features,\n# but in this \"raw\" script, these transforms are at the level of the raw\n# cepstra.  The model must be built on top of LDA+MLLT features, and the\n# transforms are estimated using the model, in a rather clever way.  If there\n# are no raw transforms supplied in the alignment directory, it will estimate\n# transforms itself before building the tree (and in any case, it estimates\n# transforms a number of times during training).\n# You need to decode the models it builds with decode_raw_fmllr.sh\n\n# Begin configuration section.\nstage=-6\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\ncontext_opts=  # e.g. set this to \"--context-width 5 --central-position 2\" for quinphone.\nrealign_iters=\"10 20 30\";\nfmllr_iters=\"2 4 6 12\";\nmllt_iters=\"3 5 7 10\"\ndim=40\nrandprune=4.0 # This is approximately the ratio by which we will speed up the\n              # LDA and MLLT calculations via randomized pruning.\nsilence_weight=0.0 # Weight on silence in fMLLR estimation.\nnum_iters=35   # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\npower=0.2 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\ntrain_tree=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_raw_sat.sh <#leaves> <#gauss> <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_raw_sat.sh 2500 15000 data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri3b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/final.mdl $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc]  # per-iter #gauss increment\noov=`cat $lang/oov.int`\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nsdata=$data/split$nj;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\nraw_dim=$(feat-to-dim scp:$data/feats.scp -) || exit 1;\n! [ \"$raw_dim\" -gt 0 ] && echo \"raw feature dim not set\" && exit 1;\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n# Set up features.\n\nif [[ ! -f $alidir/final.mat || ! 
-f $alidir/full.mat ]]; then\n  echo \"$0: expected to find  $alidir/final.mat and $alidir/full.mat\"\n  exit 1\nfi\n\nsisplicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- |\"\nsifeats=\"$sisplicedfeats transform-feats $alidir/final.mat ark:- ark:- |\"\n\n\n## Get initial fMLLR transforms (possibly from alignment dir)\nif [ -f $alidir/raw_trans.1 ]; then\n  echo \"$0: Using transforms from $alidir\"\n  cur_trans_dir=$alidir\nelse \n  if [ $stage -le -6 ]; then\n    echo \"$0: obtaining initial fMLLR transforms since not present in $alidir\"\n    # The next line is necessary because of $silphonelist otherwise being incorrect; would require\n    # old $lang dir which would require another option.  Not needed anyway.\n    full_lda_mat=\"get-full-lda-mat --print-args=false $alidir/final.mat $alidir/full.mat -|\"\n    $cmd JOB=1:$nj $dir/log/fmllr.0.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      gmm-est-fmllr-raw --raw-feat-dim=$raw_dim --spk2utt=ark:$sdata/JOB/spk2utt $alidir/final.mdl \\\n        \"$full_lda_mat\" \"$sisplicedfeats\" ark:- ark:$dir/raw_trans.JOB || exit 1;\n  fi\n  cur_trans_dir=$dir\nfi\n\nsplicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$cur_trans_dir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- |\"\n\n\nif [ $stage -le -5 ]; then\n  echo \"Accumulating LDA statistics.\"\n  $cmd JOB=1:$nj $dir/log/lda_acc.JOB.log \\\n    ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post 0.0 $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      acc-lda --rand-prune=$randprune $alidir/final.mdl 
\"$splicedfeats\" ark,s,cs:- \\\n       $dir/lda.JOB.acc || exit 1;\n  est-lda --write-full-matrix=$dir/full.mat --dim=$dim $dir/0.mat $dir/lda.*.acc \\\n      2>$dir/log/lda_est.log || exit 1;  \n  rm $dir/lda.*.acc\nfi\n\ncur_lda_iter=0\nfeats=\"$splicedfeats transform-feats $dir/$cur_lda_iter.mat ark:- ark:- |\"\n\n# To build the tree, we use the previous directory's LDA transform, which\n# is better as it has MLLT also.  It leads to higher auxiliary function\n# improvements in tree building, which is generally a good thing.\ntree_feats=\"$splicedfeats transform-feats $alidir/final.mat ark:- ark:- |\"\n\n\nif [ $stage -le -4 ] && $train_tree; then\n  # Get tree stats.\n  echo \"$0: Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts --ci-phones=$ciphonelist $alidir/final.mdl \"$tree_feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -3 ] && $train_tree; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $context_opts $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $context_opts $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Initializing the model\"\n  # Since we trained the tree on different feats, 
we don't use gmm-init-model, which\n  # would initialize the tree with invalid features.  This doesn't really matter anyway,\n  # the first iteration of training will set suitable initial parameters.\n  cp $alidir/tree $dir/ || exit 1;\n  $cmd JOB=1 $dir/log/init_model.log \\\n    gmm-init-model-flat $dir/tree $lang/topo $dir/1.mdl \\\n    \"$tree_feats subset-feats ark:- ark:-|\" || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ] && [ \"$realign_iters\" != \"\" ]; then\n  echo \"$0: Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n   echo Pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n\n  if echo $fmllr_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo Estimating fMLLR transforms\n      # We estimate a transform that's additional to the previous transform;\n      # we'll compose them.\n\n      full_lda_mat=\"get-full-lda-mat --print-args=false $dir/$cur_lda_iter.mat 
$dir/full.mat - |\"\n      $cmd JOB=1:$nj $dir/log/fmllr.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n        weight-silence-post $silence_weight $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-est-fmllr-raw --raw-feat-dim=$raw_dim --spk2utt=ark:$sdata/JOB/spk2utt $dir/$x.mdl \"$full_lda_mat\" \\\n          \"$splicedfeats\" ark:- ark:$dir/tmp_trans.JOB || exit 1;\n      for n in `seq $nj`; do\n        ! ( compose-transforms --b-is-affine=true \\\n          ark:$dir/tmp_trans.$n ark:$cur_trans_dir/raw_trans.$n ark:$dir/composed_trans.$n \\\n          && mv $dir/composed_trans.$n $dir/raw_trans.$n && \\\n          rm $dir/tmp_trans.$n ) 2>$dir/log/compose_transforms.$x.log \\\n          && echo \"$0: Error composing transforms\" && exit 1;\n      done\n    fi\n    cur_trans_dir=$dir\n    splicedfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$cur_trans_dir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- |\"\n    feats=\"$splicedfeats transform-feats $dir/$cur_lda_iter.mat ark:- ark:- |\"\n  fi\n\n  if echo $mllt_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      echo \"Estimating MLLT\"\n      $cmd JOB=1:$nj $dir/log/macc.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n        weight-silence-post 0.0 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-acc-mllt --rand-prune=$randprune  $dir/$x.mdl \"$feats\" ark:- $dir/$x.JOB.macc \\\n        || exit 1;\n      est-mllt $dir/$x.mat.new $dir/$x.*.macc 2> $dir/log/mupdate.$x.log || exit 1;\n      gmm-transform-means  $dir/$x.mat.new $dir/$x.mdl $dir/$x.mdl \\\n        2> $dir/log/transform_means.$x.log || exit 1;\n      compose-transforms --print-args=false $dir/$x.mat.new $dir/$cur_lda_iter.mat $dir/$x.mat || exit 1;\n      rm $dir/$x.*.macc\n    fi\n 
   cur_lda_iter=$x\n    feats=\"$splicedfeats transform-feats $dir/$cur_lda_iter.mat ark:- ark:- |\"\n  fi\n  \n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --power=$power --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs 2>/dev/null\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\n\nif [ $stage -le $x ]; then\n  # Accumulate stats for \"alignment model\"-- this model is\n  # computed with the speaker-independent features, but matches Gaussian-for-Gaussian\n  # with the final speaker-adapted model.\n  sifeats=\"$sisplicedfeats transform-feats $dir/$cur_lda_iter.mat ark:- ark:- |\"\n  $cmd JOB=1:$nj $dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n    gmm-acc-stats-twofeats $dir/$x.mdl \"$feats\" \"$sifeats\" \\\n    ark,s,cs:- $dir/$x.JOB.acc || exit 1;\n  [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n  # Update model.\n  $cmd $dir/log/est_alimdl.log \\\n    gmm-est --power=$power --remove-low-count-gaussians=false $dir/$x.mdl \\\n    \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$x.alimdl  || exit 1;\n  rm $dir/$x.*.acc\nfi\n\nrm $dir/final.{mdl,alimdl,mat,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s $x.alimdl $dir/final.alimdl\nln -s $cur_lda_iter.mat $dir/final.mat\n\n\nutils/summarize_warnings.pl $dir/log\n(\n  echo \"$0: Likelihood evolution (not sure if this is totally correct):\"\n  for x in `seq $[$num_iters-1]`; do\n    tail -n 30 $dir/log/acc.$x.*.log | awk '/Overall 
avg like/{l += $(NF-3)*$(NF-1); t += $(NF-1); }\n        /Overall average logdet/{d += $(NF-3)*$(NF-1); t2 += $(NF-1);} \n        END{ d /= t2; l /= t; printf(\"%s \", d+l); } '\n  done\n  echo\n) | tee $dir/log/summary.log\n\necho Done\n"
  },
  {
    "path": "egs/steps/train_sat.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n\n# This does Speaker Adapted Training (SAT), i.e. train on\n# fMLLR-adapted features.  It can be done on top of either LDA+MLLT, or\n# delta and delta-delta features.  If there are no transforms supplied\n# in the alignment directory, it will estimate transforms itself before\n# building the tree (and in any case, it estimates transforms a number\n# of times during training).\n\n\n# Begin configuration section.\nstage=-5\nexit_stage=-100 # you can use this to require it to exit at the\n                # beginning of a specific stage.  Not all values are\n                # supported.\nfmllr_update_type=full\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\ncareful=false\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\ncontext_opts=  # e.g. set this to \"--context-width 5 --central-position 2\" for quinphone.\nrealign_iters=\"10 20 30\";\nfmllr_iters=\"2 4 6 12\";\nsilence_weight=0.0 # Weight on silence in fMLLR estimation.\nnum_iters=35   # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\npower=0.2 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\nphone_map=\ntrain_tree=true\ntree_stats_opts=\ncluster_phones_opts=\ncompile_questions_opts=\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_sat.sh <#leaves> <#gauss> <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_sat.sh 2500 15000 data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri3b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/final.mdl $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"train_sat.sh: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc]  # per-iter #gauss increment\noov=`cat $lang/oov.int`\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nsdata=$data/split$nj;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\nphone_map_opt=\n[ ! 
-z \"$phone_map\" ] && phone_map_opt=\"--phone-map='$phone_map'\"\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null # delta option.\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n# Set up features.\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\n## Set up speaker-independent features.\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    cp $alidir/full.mat $dir 2>/dev/null\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n## Get initial fMLLR transforms (possibly from alignment dir)\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: Using transforms from $alidir\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n  cur_trans_dir=$alidir\nelse\n  if [ $stage -le -5 ]; then\n    echo \"$0: obtaining initial fMLLR transforms since not present in $alidir\"\n    # The next line is necessary because of $silphonelist otherwise being incorrect; would require\n    # old $lang dir which would require another option.  Not needed anyway.\n    [ ! 
-z \"$phone_map\" ] && \\\n       echo \"$0: error: you must provide transforms if you use the --phone-map option.\" && exit 1;\n    $cmd JOB=1:$nj $dir/log/fmllr.0.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:- \\| \\\n      weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n      --spk2utt=ark:$sdata/JOB/spk2utt $alidir/final.mdl \"$sifeats\" \\\n      ark:- ark:$dir/trans.JOB || exit 1;\n  fi\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$dir/trans.JOB ark:- ark:- |\"\n  cur_trans_dir=$dir\nfi\n\nif [ $stage -le -4 ] && $train_tree; then\n  # Get tree stats.\n  echo \"$0: Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts $tree_stats_opts $phone_map_opt --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -3 ] && $train_tree; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $cluster_phones_opts $context_opts $dir/treeacc $lang/phones/sets.int $dir/questions.int 2>$dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $context_opts $compile_questions_opts $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 
1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Initializing the model\"\n  if $train_tree; then\n    gmm-init-model  --write-occs=$dir/1.occs  \\\n      $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n    grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n    rm $dir/treeacc\n  else\n    cp $alidir/tree $dir/ || exit 1;\n    $cmd JOB=1 $dir/log/init_model.log \\\n      gmm-init-model-flat $dir/tree $lang/topo $dir/1.mdl \\\n        \"$feats subset-feats ark:- ark:-|\" || exit 1;\n  fi\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $phone_map_opt $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\n[ \"$exit_stage\" -eq 0 ] && echo \"$0: Exiting early: --exit-stage $exit_stage\" && exit 0;\n\nif [ $stage -le 0 ] && [ \"$realign_iters\" != \"\" ]; then\n  echo \"$0: Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n   echo Pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam --careful=$careful \"$mdl\" \\\n      \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n      \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n\n  if echo $fmllr_iters | grep -w $x 
>/dev/null; then\n    if [ $stage -le $x ]; then\n      echo Estimating fMLLR transforms\n      # We estimate a transform that's additional to the previous transform;\n      # we'll compose them.\n      $cmd JOB=1:$nj $dir/log/fmllr.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n        weight-silence-post $silence_weight $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \\\n        --spk2utt=ark:$sdata/JOB/spk2utt $dir/$x.mdl \\\n        \"$feats\" ark:- ark:$dir/tmp_trans.JOB || exit 1;\n      for n in `seq $nj`; do\n        ! ( compose-transforms --b-is-affine=true \\\n          ark:$dir/tmp_trans.$n ark:$cur_trans_dir/trans.$n ark:$dir/composed_trans.$n \\\n          && mv $dir/composed_trans.$n $dir/trans.$n && \\\n          rm $dir/tmp_trans.$n ) 2>$dir/log/compose_transforms.$x.log \\\n          && echo \"$0: Error composing transforms\" && exit 1;\n      done\n    fi\n    feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n    cur_trans_dir=$dir\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --power=$power --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\n\nif [ $stage -le $x ]; then\n  # Accumulate stats for \"alignment model\"-- this model is\n  # computed with the speaker-independent features, but matches Gaussian-for-Gaussian\n  # with the final speaker-adapted model.\n  $cmd JOB=1:$nj 
$dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n    gmm-acc-stats-twofeats $dir/$x.mdl \"$feats\" \"$sifeats\" \\\n    ark,s,cs:- $dir/$x.JOB.acc || exit 1;\n  [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n  # Update model.\n  $cmd $dir/log/est_alimdl.log \\\n    gmm-est --power=$power --remove-low-count-gaussians=false $dir/$x.mdl \\\n    \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$x.alimdl  || exit 1;\n  rm $dir/$x.*.acc\nfi\n\nrm $dir/final.{mdl,alimdl,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s $x.alimdl $dir/final.alimdl\n\n\nsteps/diagnostic/analyze_alignments.sh --cmd \"$cmd\" $lang $dir\n\nutils/summarize_warnings.pl $dir/log\n(\n  echo \"$0: Likelihood evolution:\"\n  for x in `seq $[$num_iters-1]`; do\n    tail -n 30 $dir/log/acc.$x.*.log | awk '/Overall avg like/{l += $(NF-3)*$(NF-1); t += $(NF-1); }\n        /Overall average logdet/{d += $(NF-3)*$(NF-1); t2 += $(NF-1);}\n        END{ d /= t2; l /= t; printf(\"%s \", d+l); } '\n  done\n  echo\n) | tee $dir/log/summary.log\n\n\nsteps/info/gmm_dir_info.pl $dir\n\necho \"$0: done training SAT system in $dir\"\n\nexit 0\n"
  },
  {
    "path": "egs/steps/train_sat_basis.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n# Copyright 2013  GoVivace Inc. (Author: Nagendra Goel), Apache 2.0\n\n# This does Speaker Adapted Training (SAT), i.e. train on\n# fMLLR-adapted features.  It can be done on top of either LDA+MLLT, or\n# delta and delta-delta features.  If there are no transforms supplied\n# in the alignment directory, it will estimate transforms itself before\n# building the tree (and in any case, it estimates transforms a number\n# of times during training).\n\n\n# Begin configuration section.\nstage=-5\ncmd=run.pl\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\nbasis_fmllr_opts=\"--fmllr-min-count=22  --num-iters=10 --size-scale=0.2 --step-size-iters=3\"\ncontext_opts=  # e.g. set this to \"--context-width 5 --central-position 2\" for quinphone.\nrealign_iters=\"10 20 30\";\nfmllr_iters=\"2 4 6 12\";\nsilence_weight=0.0 # Weight on silence in fMLLR estimation.\nnum_iters=35   # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\npower=0.2 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\ntrain_tree=true\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh\n. 
parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n  echo \"Usage: steps/train_sat_basis.sh <#leaves> <#gauss> <data> <lang> <ali-dir> <exp-dir>\"\n  echo \" e.g.: steps/train_sat_basis.sh 2500 15000 data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri3b\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $data/feats.scp $lang/phones.txt $alidir/final.mdl $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"train_sat_basis.sh: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc]  # per-iter #gauss increment\noov=`cat $lang/oov.int`\nnj=`cat $alidir/num_jobs` || exit 1;\nsilphonelist=`cat $lang/phones/silence.csl`\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\nsdata=$data/split$nj;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\necho $nj >$dir/num_jobs\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\n# Set up features.\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\n## Set up speaker-independent features.\ncase $feat_type in\n  delta) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk 
scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) sifeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n## Get initial fMLLR transforms (possibly from alignment dir)\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: Using transforms from $alidir\"\n  feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n  cur_trans_dir=$alidir\nelse\n  if [ $stage -le -5 ]; then\n    echo \"$0: obtaining initial basis fMLLR transforms since not present in $alidir\"\n    # The next line is necessary because of $silphonelist otherwise being incorrect; would require\n    # old $lang dir which would require another option.  Not needed anyway.\n    $cmd JOB=1:$nj $dir/log/fmllr.0.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:-  \\| \\\n      weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alidir/final.mdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-basis-fmllr-accs-gpost \\\n      $alidir/final.mdl \"$sifeats\" ark,s,cs:- $dir/basis.acc.JOB || exit 1;\n\n    # Compute the basis matrices.\n    $cmd $dir/log/basis_training.log \\\n          gmm-basis-fmllr-training $alidir/final.mdl $alidir/fmllr.basis $dir/basis.acc.* || exit 1;\n    $cmd JOB=1:$nj $dir/log/fmllr.0.JOB.log \\\n      ali-to-post \"ark:gunzip -c $alidir/ali.JOB.gz|\" ark:-  \\| \\\n      weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- \\| \\\n      gmm-post-to-gpost $alidir/final.mdl \"$sifeats\" ark:- ark:- \\| \\\n      gmm-est-basis-fmllr-gpost $basis_fmllr_opts --spk2utt=ark:$sdata/JOB/spk2utt  \\\n      
$alidir/final.mdl $alidir/fmllr.basis \"$sifeats\"  ark,s,cs:- \\\n      ark:$alidir/trans.JOB || exit 1;\n\n    feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n    cur_trans_dir=$alidir\n  fi\nfi\n\nif [ $stage -le -4 ] && $train_tree; then\n  # Get tree stats.\n  echo \"$0: Accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-accs\" && exit 1;\n  $cmd $dir/log/sum_tree_acc.log \\\n    sum-tree-stats $dir/treeacc $dir/*.treeacc || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -3 ] && $train_tree; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $context_opts $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $context_opts $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree $context_opts --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: Initializing the model\"\n  if $train_tree; then\n    gmm-init-model  --write-occs=$dir/1.occs  \\\n      $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> $dir/log/init_model.log || exit 1;\n    grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n    rm $dir/treeacc\n  else\n    cp $alidir/tree $dir/ || exit 1;\n    $cmd JOB=1 $dir/log/init_model.log \\\n      gmm-init-model-flat $dir/tree $lang/topo 
$dir/1.mdl \\\n        \"$feats subset-feats ark:- ark:-|\" || exit 1;\n  fi\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: Converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ] && [ \"$realign_iters\" != \"\" ]; then\n  echo \"$0: Compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo Pass $x\n  if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n    echo Aligning data\n    mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n    $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n      gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n        \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n        \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n  fi\n\n  if echo $fmllr_iters | grep -w $x >/dev/null; then\n    if [ $stage -le $x ]; then\n      # Note: it's not really necessary to re-estimate the basis each time\n      # but this is the way the script does it right now.\n      echo Estimating basis and fMLLR transforms\n      $cmd JOB=1:$nj $dir/log/fmllr_est.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n        weight-silence-post $silence_weight $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-post-to-gpost $dir/$x.mdl \"$feats\" ark:- ark:- \\| \\\n        gmm-basis-fmllr-accs-gpost \\\n          $dir/$x.mdl \"$sifeats\" ark,s,cs:- 
$dir/basis.acc.JOB || exit 1;\n\n      # Compute the basis matrices.\n      $cmd $dir/log/basis_training.log \\\n        gmm-basis-fmllr-training $dir/$x.mdl $dir/fmllr.basis $dir/basis.acc.* || exit 1;\n\n      $cmd JOB=1:$nj $dir/log/fmllr_app.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n        weight-silence-post $silence_weight $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n        gmm-post-to-gpost $dir/$x.mdl \"$sifeats\" ark:- ark:- \\| \\\n        gmm-est-basis-fmllr-gpost $basis_fmllr_opts --spk2utt=ark:$sdata/JOB/spk2utt \\\n          $dir/$x.mdl $dir/fmllr.basis \"$sifeats\"  ark,s,cs:- \\\n          ark:$dir/trans.JOB || exit 1;\n\n    fi\n    feats=\"$sifeats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark:$dir/trans.JOB ark:- ark:- |\"\n    cur_trans_dir=$dir\n  fi\n\n  if [ $stage -le $x ]; then\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali $dir/$x.mdl \"$feats\" \\\n      \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --power=$power --write-occs=$dir/$[$x+1].occs --mix-up=$numgauss $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le $max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\n\nif [ $stage -le $x ]; then\n  # Accumulate stats for \"alignment model\"-- this model is\n  # computed with the speaker-independent features, but matches Gaussian-for-Gaussian\n  # with the final speaker-adapted model.\n  $cmd JOB=1:$nj $dir/log/acc_alimdl.JOB.log \\\n    ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:-  \\| \\\n    gmm-acc-stats-twofeats $dir/$x.mdl \"$feats\" \"$sifeats\" \\\n    ark,s,cs:- $dir/$x.JOB.acc || exit 1;\n  [ `ls $dir/$x.*.acc | wc -w` -ne \"$nj\" ] && echo \"$0: Wrong #accs\" && 
exit 1;\n  # Update model.\n  $cmd $dir/log/est_alimdl.log \\\n    gmm-est --power=$power --remove-low-count-gaussians=false $dir/$x.mdl \\\n      \"gmm-sum-accs - $dir/$x.*.acc|\" $dir/$x.alimdl  || exit 1;\n  rm $dir/$x.*.acc\nfi\n\nrm $dir/final.{mdl,alimdl,occs} 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\nln -s $x.alimdl $dir/final.alimdl\n\n\n\nutils/summarize_warnings.pl $dir/log\n(\n  echo \"$0: Likelihood evolution:\"\n  for x in `seq $[$num_iters-1]`; do\n    tail -n 30 $dir/log/acc.$x.*.log | awk '/Overall avg like/{l += $(NF-3)*$(NF-1); t += $(NF-1); }\n        /Overall average logdet/{d += $(NF-3)*$(NF-1); t2 += $(NF-1);}\n        END{ d /= t2; l /= t; printf(\"%s \", d+l); } '\n  done\n  echo\n) | tee $dir/log/summary.log\n\necho Done\n"
  },
  {
    "path": "egs/steps/train_segmenter.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# Begin configuration.\nstage=-4 # For restarting a process that went part way.\nconfig=\ncmd=run.pl\n\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nrealign_iters=\"10 20 30\";\nnum_iters=35    # Number of iterations of training\nmax_iter_inc=25 # Last iter to increase #Gauss on.\nbeam=10\nretry_beam=40\nboost_silence=1.0 # Factor by which to boost silence likelihoods in alignment\npower=0.25 # Exponent for number of gaussians according to occurrence counts\ncluster_thresh=-1  # for build-tree control final bottom-up clustering of leaves\n# End configuration.\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# != 6 ]; then\n   echo \"Usage: steps/train_segmenter.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>\"\n   echo \"e.g.: steps/train_segmenter.sh 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1\"\n   echo \"main options (for others, see top of script file)\"\n   echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n   echo \"  --config <config-file>                           # config containing options\"\n   echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n   exit 1;\nfi\n\nnumleaves=$1\ntotgauss=$2\ndata=$3\nlang=$4\nalidir=$5\ndir=$6\n\nfor f in $alidir/final.mdl $alidir/ali.1.gz $data/feats.scp $lang/phones.txt; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nnumgauss=$numleaves\nincgauss=$[($totgauss-$numgauss)/$max_iter_inc] # per-iter increment for #Gauss\noov=`cat $lang/oov.int` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\nnj=`cat $alidir/num_jobs` || exit 1;\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nfeats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\"\n\nrm $dir/.error 2>/dev/null\n\nif [ $stage -le -3 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats  --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: getting questions for tree-building, via clustering\"\n  # preparing questions, roots file...\n  cluster-phones $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree --verbose=1 --max-leaves=$numleaves \\\n    --cluster-thresh=$cluster_thresh $dir/treeacc $lang/phones/roots.int \\\n    $dir/questions.qst $lang/topo $dir/tree || exit 1;\n\n  gmm-init-model  --write-occs=$dir/1.occs  \\\n    $dir/tree $dir/treeacc $lang/topo $dir/1.mdl 2> 
$dir/log/init_model.log || exit 1;\n  grep 'no stats' $dir/log/init_model.log && echo \"This is a bad warning.\";\n\n  gmm-mixup --mix-up=$numgauss $dir/1.mdl $dir/1.occs $dir/1.mdl 2>$dir/log/mixup.log || exit 1;\n  rm $dir/treeacc\nfi\n\nif [ $stage -le -1 ]; then\n  # Convert the alignments.\n  echo \"$0: converting alignments from $alidir to use current tree\"\n  $cmd JOB=1:$nj $dir/log/convert.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/1.mdl $dir/tree \\\n     \"ark:gunzip -c $alidir/ali.JOB.gz|\" \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le 0 ]; then\n  echo \"$0: compiling graphs of transcripts\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/1.mdl  $lang/L.fst  \\\n     \"ark:utils/sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $data/split$nj/JOB/text |\" \\\n      \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nx=1\nwhile [ $x -lt $num_iters ]; do\n  echo \"$0: training pass $x\"\n  if [ $stage -le $x ]; then\n    if echo $realign_iters | grep -w $x >/dev/null; then\n      echo \"$0: aligning data\"\n      mdl=\"gmm-boost-silence --boost=$boost_silence `cat $lang/phones/optional_silence.csl` $dir/$x.mdl - |\"\n      $cmd JOB=1:$nj $dir/log/align.$x.JOB.log \\\n        gmm-align-compiled $scale_opts --beam=$beam --retry-beam=$retry_beam \"$mdl\" \\\n         \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n         \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n    fi\n    $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-acc-stats-ali  $dir/$x.mdl \"$feats\" \\\n       \"ark,s,cs:gunzip -c $dir/ali.JOB.gz|\" $dir/$x.JOB.acc || exit 1;\n    $cmd $dir/log/update.$x.log \\\n      gmm-est --mix-up=$numgauss --power=$power \\\n        --write-occs=$dir/$[$x+1].occs $dir/$x.mdl \\\n       \"gmm-sum-accs - $dir/$x.*.acc |\" $dir/$[$x+1].mdl || exit 1;\n    rm $dir/$x.mdl $dir/$x.*.acc\n    rm $dir/$x.occs\n  fi\n  [ $x -le 
$max_iter_inc ] && numgauss=$[$numgauss+$incgauss];\n  x=$[$x+1];\ndone\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\n\n# Summarize warning messages...\nutils/summarize_warnings.pl  $dir/log\n\necho \"$0: Done training system with delta+delta-delta features in $dir\"\n\n"
  },
  {
    "path": "egs/steps/train_sgmm2.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# SGMM training, with speaker vectors.  This script would normally be called on\n# top of fMLLR features obtained from a conventional system, but it also works\n# on top of any type of speaker-independent features (based on\n# deltas+delta-deltas or LDA+MLLT).  For more info on SGMMs, see the paper \"The\n# subspace Gaussian mixture model--A structured model for speech recognition\".\n# (Computer Speech and Language, 2011).\n\n# Begin configuration section.\ncmd=run.pl\nstage=-6 # use this to resume partially finished training\ncontext_opts= # e.g. set it to \"--context-width=5 --central-position=2\"  for a\n# quinphone system.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nnum_iters=25   # Total number of iterations of training\nnum_iters_alimdl=3 # Number of iterations for estimating alignment model.\nmax_iter_inc=15 # Last iter to increase #substates on.\nrealign_iters=\"5 10 15\"; # Iters to realign on.\nspkvec_iters=\"5 8 12 17\" # Iters to estimate speaker vectors on.\nincrease_iters=\"6 10 14\"; # Iters on which to increase phn dim and/or spk dim;\n    # rarely necessary, and if it is, only the 1st will normally be necessary.\nrand_prune=0.1 # Randomized-pruning parameter for posteriors, to speed up training.\n               # Bigger -> more pruning; zero = no pruning.\nphn_dim=  # You can use this to set the phonetic subspace dim. [default: feat-dim+1]\nspk_dim=  # You can use this to set the speaker subspace dim. [default: feat-dim]\npower=0.25 # Exponent for number of gaussians according to occurrence counts\nbeam=8\nself_weight=0.9\nretry_beam=40\nleaves_per_group=5 # Relates to the SCTM (state-clustered tied-mixture) aspect:\n                   # average number of pdfs in a \"group\" of pdfs.\nupdate_m_iter=4\nspk_dep_weights=true # [Symmetric SGMM] set this to false if you don't want \"u\" (i.e. 
to turn off\n                      # symmetric SGMM.\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 7 ]; then\n  echo \"Usage: steps/train_sgmm2.sh <num-leaves> <num-substates> <data> <lang> <ali-dir> <ubm> <exp-dir>\"\n  echo \" e.g.: steps/train_sgmm2.sh 5000 8000 data/train_si84 data/lang \\\\\"\n  echo \"                      exp/tri3b_ali_si84 exp/ubm4a/final.ubm exp/sgmm4a\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --silence-weight <sil-weight>                    # weight for silence (e.g. 0.5 or 0.0)\"\n  echo \"  --num-iters <#iters>                             # Number of iterations of E-M\"\n  echo \"  --leaves-per-group <#leaves>                     # Average #leaves shared in one group\"\n  exit 1;\nfi\n\nnum_pdfs=$1  # final #leaves, at 2nd level of tree.\ntotsubstates=$2\ndata=$3\nlang=$4\nalidir=$5\nubm=$6\ndir=$7\n\nnum_groups=$[$num_pdfs/$leaves_per_group]\nfirst_spkvec_iter=`echo $spkvec_iters | awk '{print $1}'` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $ubm $alidir/num_jobs; do\n  [ ! 
-f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\noov=`cat $lang/oov.int`\nsilphonelist=`cat $lang/phones/silence.csl`\nif [ \"$self_weight\" == \"1.0\" ]; then\n  numsubstates=$num_groups # Initial #-substates.\nelse\n  numsubstates=$num_pdfs # Initial #-substates.\nfi\nincsubstates=$[($totsubstates-$numsubstates)/$max_iter_inc] # per-iter increment for #substates\nfeat_dim=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/feature dimension/{print $NF}'` || exit 1;\n[ $feat_dim -eq $feat_dim ] || exit 1; # make sure it's numeric.\n[ -z $phn_dim ] && phn_dim=$[$feat_dim+1]\n[ -z $spk_dim ] && spk_dim=$feat_dim\nnj=`cat $alidir/num_jobs` || exit 1;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\necho $nj > $dir/num_jobs\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nspkvecs_opt=  # Empty option for now, until we estimate the speaker vectors.\ngselect_opt=\"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\"\n\n## Set up features.\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    ;;\n  *) echo \"$0: invalid feature type 
$feat_type\" && exit 1;\nesac\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\nelif [ -f $alidir/raw_trans.1 ]; then\n  echo \"$0: using raw-fMLLR transforms from $alidir\"\n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\nfi\n##\n\n\nif [ $stage -le -6 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-stats\" && exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -5 ]; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $context_opts $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  compile-questions $context_opts $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree-two-level $context_opts --binary=false --verbose=1 --max-leaves-first=$num_groups \\\n     --max-leaves-second=$num_pdfs $dir/treeacc $lang/phones/roots.int \\\n     $dir/questions.qst $lang/topo $dir/tree $dir/pdf2group.map || exit 1;\nfi\n\nif [ $stage -le -4 ]; then\n  echo \"$0: Initializing the model\"\n  # Note: if phn_dim > feat_dim+1 or 
spk_dim > feat_dim, these dims\n  # will be truncated on initialization.\n  $cmd $dir/log/init_sgmm.log \\\n    sgmm2-init --spk-dep-weights=$spk_dep_weights --self-weight=$self_weight \\\n       --pdf-map=$dir/pdf2group.map --phn-space-dim=$phn_dim \\\n       --spk-space-dim=$spk_dim $lang/topo $dir/tree $ubm $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"$0: doing Gaussian selection\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    sgmm2-gselect $dir/0.mdl \"$feats\" \\\n    \"ark,t:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: compiling training graphs\"\n  text=\"ark:sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text|\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/0.mdl  $lang/L.fst  \\\n    \"$text\" \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"$0: converting alignments\"\n  $cmd JOB=1:$nj $dir/log/convert_ali.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/0.mdl $dir/tree \"ark:gunzip -c $alidir/ali.JOB.gz|\" \\\n    \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n   echo \"$0: training pass $x ... 
\"\n   if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n     echo \"$0: re-aligning data\"\n     $cmd JOB=1:$nj $dir/log/align.$x.JOB.log  \\\n       sgmm2-align-compiled $spkvecs_opt $scale_opts \"$gselect_opt\" \\\n       --utt2spk=ark:$sdata/JOB/utt2spk --beam=$beam --retry-beam=$retry_beam \\\n       $dir/$x.mdl \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n       \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n   fi\n   if [ $spk_dim -gt 0 ] && echo $spkvec_iters | grep -w $x >/dev/null; then\n     if [ $stage -le $x ]; then\n       $cmd JOB=1:$nj $dir/log/spkvecs.$x.JOB.log \\\n         ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n         weight-silence-post 0.01 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n         sgmm2-est-spkvecs --rand-prune=$rand_prune --spk2utt=ark:$sdata/JOB/spk2utt \\\n         $spkvecs_opt \"$gselect_opt\" $dir/$x.mdl \"$feats\" ark,s,cs:- \\\n         ark:$dir/tmp_vecs.JOB '&&' mv $dir/tmp_vecs.JOB $dir/vecs.JOB || exit 1;\n     fi\n     spkvecs_opt=\"--spk-vecs=ark:$dir/vecs.JOB\"\n   fi\n   if [ $x -eq 0 ]; then\n     flags=vwcSt # on the first iteration, don't update projections M or N\n   elif [ $spk_dim -gt 0 -a $[$x%2] -eq 1 -a $x -ge $first_spkvec_iter ]; then\n     # Update N if we have speaker-vector space and x is odd,\n     # and we've already updated the speaker vectors...\n     flags=vNwSct\n   else\n     if [ $x -ge $update_m_iter ]; then\n       flags=vMwSct # update M.\n     else\n       flags=vwSct # no M on early iters, if --update-m-iter option given.\n     fi\n   fi\n   $spk_dep_weights && [ $x -ge $first_spkvec_iter ] && flags=${flags}u; # update\n   # spk-weight projections \"u\".\n\n   if [ $stage -le $x ]; then\n     $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n       sgmm2-acc-stats $spkvecs_opt --utt2spk=ark:$sdata/JOB/utt2spk \\\n       --update-flags=$flags \"$gselect_opt\" --rand-prune=$rand_prune \\\n       $dir/$x.mdl \"$feats\" \"ark,s,cs:gunzip -c 
$dir/ali.JOB.gz | ali-to-post ark:- ark:-|\" \\\n       $dir/$x.JOB.acc || exit 1;\n   fi\n\n   # The next option is needed if the user specifies a phone or speaker sub-space\n   # dimension that's higher than the \"normal\" one.\n   increase_dim_opts=\n   if echo $increase_iters | grep -w $x >/dev/null; then\n     increase_dim_opts=\"--increase-phn-dim=$phn_dim --increase-spk-dim=$spk_dim\"\n     # Note: the command below might have a null effect on some iterations.\n     if [ $spk_dim -gt $feat_dim ]; then\n       $cmd JOB=1:$nj $dir/log/copy_vecs.$x.JOB.log \\\n         copy-vector --print-args=false --change-dim=$spk_dim \\\n         ark:$dir/vecs.JOB ark:$dir/vecs_tmp.JOB '&&' \\\n         mv $dir/vecs_tmp.JOB $dir/vecs.JOB || exit 1;\n     fi\n   fi\n\n   if [ $stage -le $x ]; then\n     $cmd $dir/log/update.$x.log \\\n       sgmm2-est --update-flags=$flags --split-substates=$numsubstates \\\n       $increase_dim_opts --power=$power --write-occs=$dir/$[$x+1].occs \\\n       $dir/$x.mdl \"sgmm2-sum-accs - $dir/$x.*.acc|\" $dir/$[$x+1].mdl || exit 1;\n     rm $dir/$x.mdl $dir/$x.*.acc $dir/$x.occs 2>/dev/null\n   fi\n   if [ $x -lt $max_iter_inc ]; then\n     numsubstates=$[$numsubstates+$incsubstates]\n   fi\n   x=$[$x+1];\ndone\n\nrm $dir/final.mdl $dir/final.occs 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\n\nif [ $spk_dim -gt 0 ]; then\n  # We need to create an \"alignment model\" that's been trained\n  # without the speaker vectors, to do the first-pass decoding with\n  # at test time.\n\n  # We do this for a few iters, in this recipe.\n  final_mdl=$dir/$x.mdl\n  cur_alimdl=$dir/$x.mdl\n  while [ $x -lt $[$num_iters+$num_iters_alimdl] ]; do\n    echo \"$0: building alignment model (pass $x)\"\n    if [ $x -eq $num_iters ]; then # 1st pass of building alimdl.\n      flags=MwcS # don't update v the first time.  
Note-- we never update transitions.\n      # they wouldn't change anyway as we use the same alignment as previously.\n    else\n      flags=vMwcS\n    fi\n    if [ $stage -le $x ]; then\n      $cmd JOB=1:$nj $dir/log/acc_ali.$x.JOB.log \\\n        ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n        sgmm2-post-to-gpost $spkvecs_opt \"$gselect_opt\" \\\n         --utt2spk=ark:$sdata/JOB/utt2spk $final_mdl \"$feats\" ark,s,cs:- ark:- \\| \\\n        sgmm2-acc-stats-gpost --rand-prune=$rand_prune --update-flags=$flags \\\n          $cur_alimdl \"$feats\" ark,s,cs:- $dir/$x.JOB.aliacc || exit 1;\n      $cmd $dir/log/update_ali.$x.log \\\n        sgmm2-est --update-flags=$flags --remove-speaker-space=true --power=$power \\\n        $cur_alimdl \"sgmm2-sum-accs - $dir/$x.*.aliacc|\" $dir/$[$x+1].alimdl || exit 1;\n      rm $dir/$x.*.aliacc || exit 1;\n      [ $x -gt $num_iters ]  && rm $dir/$x.alimdl\n    fi\n    cur_alimdl=$dir/$[$x+1].alimdl\n    x=$[$x+1]\n  done\n  rm $dir/final.alimdl 2>/dev/null\n  ln -s $x.alimdl $dir/final.alimdl\nfi\n\nutils/summarize_warnings.pl $dir/log\n\necho Done\n"
  },
  {
    "path": "egs/steps/train_sgmm2_group.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This version of the train_sgmm2 script has several jobs on each machine, and adds the\n# accumulators up in memory.\n\n# SGMM training, with speaker vectors.  This script would normally be called on\n# top of fMLLR features obtained from a conventional system, but it also works\n# on top of any type of speaker-independent features (based on\n# deltas+delta-deltas or LDA+MLLT).  For more info on SGMMs, see the paper \"The\n# subspace Gaussian mixture model--A structured model for speech recognition\".\n# (Computer Speech and Language, 2011).\n\n# Begin configuration section.\ncmd=run.pl\nstage=-6 # use this to resume partially finished training\ncontext_opts= # e.g. set it to \"--context-width=5 --central-position=2\"  for a\n# quinphone system.\nscale_opts=\"--transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1\"\nnum_iters=25   # Total number of iterations of training\nnum_iters_alimdl=3 # Number of iterations for estimating alignment model.\nmax_iter_inc=15 # Last iter to increase #substates on.\nrealign_iters=\"5 10 15\"; # Iters to realign on.\nspkvec_iters=\"5 8 12 17\" # Iters to estimate speaker vectors on.\nincrease_iters=\"6 10 14\"; # Iters on which to increase phn dim and/or spk dim;\n    # rarely necessary, and if it is, only the 1st will normally be necessary.\nrand_prune=0.1 # Randomized-pruning parameter for posteriors, to speed up training.\n               # Bigger -> more pruning; zero = no pruning.\nphn_dim=  # You can use this to set the phonetic subspace dim. [default: feat-dim+1]\nspk_dim=  # You can use this to set the speaker subspace dim. 
[default: feat-dim]\npower=0.25 # Exponent for number of gaussians according to occurrence counts\nbeam=8\nself_weight=0.9\nretry_beam=40\nleaves_per_group=5 # Relates to the SCTM (state-clustered tied-mixture) aspect:\n                   # average number of pdfs in a \"group\" of pdfs.\nupdate_m_iter=4\nspk_dep_weights=true # [Symmetric SGMM] set this to false if you don't want \"u\" (i.e. to turn off\n                      # symmetric SGMM).\ngroup=3 # Number of jobs to group together on a single machine, and add the\n        # stats locally.  Don't confuse this with leaves_per_group and so on,\n        # they are totally unrelated.\nparallel_opts=  # this option is now ignored.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 7 ]; then\n  echo \"Usage: steps/train_sgmm2_group.sh <num-leaves> <num-substates> <data> <lang> <ali-dir> <ubm> <exp-dir>\"\n  echo \" e.g.: steps/train_sgmm2_group.sh 5000 8000 data/train_si84 data/lang \\\\\"\n  echo \"                      exp/tri3b_ali_si84 exp/ubm4a/final.ubm exp/sgmm4a\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --group <n>                                      # number of jobs on one machine, default 3.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --silence-weight <sil-weight>                    # weight for silence (e.g. 
0.5 or 0.0)\"\n  echo \"  --num-iters <#iters>                             # Number of iterations of E-M\"\n  exit 1;\nfi\n\nnum_pdfs=$1  # final #leaves, at 2nd level of tree.\ntotsubstates=$2\ndata=$3\nlang=$4\nalidir=$5\nubm=$6\ndir=$7\n\nnum_groups=$[$num_pdfs/$leaves_per_group]\nfirst_spkvec_iter=`echo $spkvec_iters | awk '{print $1}'` || exit 1;\nciphonelist=`cat $lang/phones/context_indep.csl` || exit 1;\n\n# Check some files.\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $ubm $alidir/num_jobs; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\n\n# Set some variables.\noov=`cat $lang/oov.int`\nsilphonelist=`cat $lang/phones/silence.csl`\nif [ \"$self_weight\" == \"1.0\" ]; then\n  numsubstates=$num_groups # Initial #-substates.\nelse\n  numsubstates=$num_pdfs # Initial #-substates.\nfi\nincsubstates=$[($totsubstates-$numsubstates)/$max_iter_inc] # per-iter increment for #substates\nfeat_dim=`gmm-info $alidir/final.mdl 2>/dev/null | awk '/feature dimension/{print $NF}'` || exit 1;\n[ $feat_dim -eq $feat_dim ] || exit 1; # make sure it's numeric.\n[ -z $phn_dim ] && phn_dim=$[$feat_dim+1]\n[ -z $spk_dim ] && spk_dim=$feat_dim\nnj=`cat $alidir/num_jobs` || exit 1;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\n\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null # frame-splicing options.\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\necho $nj > $dir/num_jobs\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nspkvecs_opt=  # Empty option for now, until we estimate the speaker vectors.\ngselect_opt=\"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\"\n\n## Set up features.\nif [ -f $alidir/final.mat ]; then feat_type=lda; else 
feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\nif [ -f $alidir/trans.1 ]; then\n  echo \"$0: using transforms from $alidir\"\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\nelif [ -f $alidir/raw_trans.1 ]; then\n  echo \"$0: using raw-fMLLR transforms from $alidir\"\n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\nfi\n##\n\n\nif [ $stage -le -6 ]; then\n  echo \"$0: accumulating tree stats\"\n  $cmd JOB=1:$nj $dir/log/acc_tree.JOB.log \\\n    acc-tree-stats $context_opts --ci-phones=$ciphonelist $alidir/final.mdl \"$feats\" \\\n    \"ark:gunzip -c $alidir/ali.JOB.gz|\" $dir/JOB.treeacc || exit 1;\n  [ \"`ls $dir/*.treeacc | wc -w`\" -ne \"$nj\" ] && echo \"$0: Wrong #tree-stats\" && exit 1;\n  sum-tree-stats $dir/treeacc $dir/*.treeacc 2>$dir/log/sum_tree_acc.log || exit 1;\n  rm $dir/*.treeacc\nfi\n\nif [ $stage -le -5 ]; then\n  echo \"$0: Getting questions for tree clustering.\"\n  # preparing questions, roots file...\n  cluster-phones $context_opts $dir/treeacc $lang/phones/sets.int $dir/questions.int 2> $dir/log/questions.log || exit 1;\n  cat $lang/phones/extra_questions.int >> $dir/questions.int\n  
compile-questions $context_opts $lang/topo $dir/questions.int $dir/questions.qst 2>$dir/log/compile_questions.log || exit 1;\n\n  echo \"$0: Building the tree\"\n  $cmd $dir/log/build_tree.log \\\n    build-tree-two-level $context_opts --binary=false --verbose=1 --max-leaves-first=$num_groups \\\n     --max-leaves-second=$num_pdfs $dir/treeacc $lang/phones/roots.int \\\n     $dir/questions.qst $lang/topo $dir/tree $dir/pdf2group.map || exit 1;\nfi\n\nif [ $stage -le -4 ]; then\n  echo \"$0: Initializing the model\"\n  # Note: if phn_dim > feat_dim+1 or spk_dim > feat_dim, these dims\n  # will be truncated on initialization.\n  $cmd $dir/log/init_sgmm.log \\\n    sgmm2-init --spk-dep-weights=$spk_dep_weights --self-weight=$self_weight \\\n       --pdf-map=$dir/pdf2group.map --phn-space-dim=$phn_dim \\\n       --spk-space-dim=$spk_dim $lang/topo $dir/tree $ubm $dir/0.mdl || exit 1;\nfi\n\nif [ $stage -le -3 ]; then\n  echo \"$0: doing Gaussian selection\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    sgmm2-gselect $dir/0.mdl \"$feats\" \\\n    \"ark,t:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: compiling training graphs\"\n  text=\"ark:sym2int.pl --map-oov $oov -f 2- $lang/words.txt < $sdata/JOB/text|\"\n  $cmd JOB=1:$nj $dir/log/compile_graphs.JOB.log \\\n    compile-train-graphs --read-disambig-syms=$lang/phones/disambig.int $dir/tree $dir/0.mdl  $lang/L.fst  \\\n    \"$text\" \"ark:|gzip -c >$dir/fsts.JOB.gz\" || exit 1;\nfi\n\nif [ $stage -le -1 ]; then\n  echo \"$0: converting alignments\"\n  $cmd JOB=1:$nj $dir/log/convert_ali.JOB.log \\\n    convert-ali $alidir/final.mdl $dir/0.mdl $dir/tree \"ark:gunzip -c $alidir/ali.JOB.gz|\" \\\n    \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\nfi\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n   echo \"$0: training pass $x ... 
\"\n   if echo $realign_iters | grep -w $x >/dev/null && [ $stage -le $x ]; then\n     echo \"$0: re-aligning data\"\n     $cmd JOB=1:$nj $dir/log/align.$x.JOB.log  \\\n       sgmm2-align-compiled $spkvecs_opt $scale_opts \"$gselect_opt\" \\\n       --utt2spk=ark:$sdata/JOB/utt2spk --beam=$beam --retry-beam=$retry_beam \\\n       $dir/$x.mdl \"ark:gunzip -c $dir/fsts.JOB.gz|\" \"$feats\" \\\n       \"ark:|gzip -c >$dir/ali.JOB.gz\" || exit 1;\n   fi\n   if [ $spk_dim -gt 0 ] && echo $spkvec_iters | grep -w $x >/dev/null; then\n     if [ $stage -le $x ]; then\n       $cmd JOB=1:$nj $dir/log/spkvecs.$x.JOB.log \\\n         ali-to-post \"ark:gunzip -c $dir/ali.JOB.gz|\" ark:- \\| \\\n         weight-silence-post 0.01 $silphonelist $dir/$x.mdl ark:- ark:- \\| \\\n         sgmm2-est-spkvecs --rand-prune=$rand_prune --spk2utt=ark:$sdata/JOB/spk2utt \\\n         $spkvecs_opt \"$gselect_opt\" $dir/$x.mdl \"$feats\" ark,s,cs:- \\\n         ark:$dir/tmp_vecs.JOB '&&' mv $dir/tmp_vecs.JOB $dir/vecs.JOB || exit 1;\n     fi\n     spkvecs_opt=\"--spk-vecs=ark:$dir/vecs.JOB\"\n   fi\n   if [ $x -eq 0 ]; then\n     flags=vwcSt # on the first iteration, don't update projections M or N\n   elif [ $spk_dim -gt 0 -a $[$x%2] -eq 1 -a $x -ge $first_spkvec_iter ]; then\n     # Update N if we have speaker-vector space and x is odd,\n     # and we've already updated the speaker vectors...\n     flags=vNwSct\n   else\n     if [ $x -ge $update_m_iter ]; then\n       flags=vMwSct # update M.\n     else\n       flags=vwSct # no M on early iters, if --update-m-iter option given.\n     fi\n   fi\n   $spk_dep_weights && [ $x -ge $first_spkvec_iter ] && flags=${flags}u; # update\n   # spk-weight projections \"u\".\n\n   # Submit separate jobs for small groups (of size $group) of accumulators.\n   Args=() # bash array of training commands for 1:nj, that put accs to stdout.\n   for n in `seq $nj`; do\n     Args[$n]=`echo \"sgmm2-acc-stats $spkvecs_opt --utt2spk=ark:$sdata/JOB/utt2spk \\\n            
 --update-flags=$flags '$gselect_opt' --rand-prune=$rand_prune \\\n             $dir/$x.mdl '$feats' 'ark,s,cs:gunzip -c $dir/ali.JOB.gz | ali-to-post ark:- ark:-|' - |\" | sed s/JOB/$n/g`\n   done\n\n   g=0\n   rm $dir/.error 2>/dev/null\n   if [ $stage -le $x ]; then\n     while [ $[$g*$group] -lt $nj ]; do\n       if [ -s $dir/acc.$x.$g.gz ]; then\n         echo \"Skipping creation of acc $dir/acc.$x.$g.gz as it already exists.\"\n       else\n         start=$[$g*$group + 1]; # start-position in array Args.\n         # see http://www.thegeekstuff.com/2010/06/bash-array-tutorial/, this uses Bash arrays.\n         # The syntax \"${Args[@]:$start:$group}\" is equivalent to, say,\n         # \"${Args[3]}\" \"${Args[4]}\" if start=3 and group=2.  Except it's smart about the end\n         # of the array, it won't give you empty quoted strings if the length \"group\" takes you off\n         # the end of the array.\n         $cmd --num-threads \"$group\" $dir/log/acc.$x.$g.log \\\n           sgmm2-sum-accs --parallel=true \"|gzip -c >$dir/acc.$x.$g.gz\" \"${Args[@]:$start:$group}\"  || touch $dir/.error &\n       fi\n       g=$[$g+1];\n     done\n     wait\n     if [ -f $dir/.error ]; then\n       echo \"Something went wrong during accumulation on pass $x\"\n       exit 1;\n     fi\n   fi\n\n   # The next option is needed if the user specifies a phone or speaker sub-space\n   # dimension that's higher than the \"normal\" one.\n   increase_dim_opts=\n   if echo $increase_iters | grep -w $x >/dev/null; then\n     increase_dim_opts=\"--increase-phn-dim=$phn_dim --increase-spk-dim=$spk_dim\"\n     # Note: the command below might have a null effect on some iterations.\n     if [ $spk_dim -gt $feat_dim ]; then\n       $cmd JOB=1:$nj $dir/log/copy_vecs.$x.JOB.log \\\n         copy-vector --print-args=false --change-dim=$spk_dim \\\n         ark:$dir/vecs.JOB ark:$dir/vecs_tmp.JOB '&&' \\\n         mv $dir/vecs_tmp.JOB $dir/vecs.JOB || exit 1;\n     fi\n   fi\n\n   if [ 
$stage -le $x ]; then\n     acc_sum=\"sgmm2-sum-accs - \";\n     for j in `seq 0 $[$g-1]`; do acc_sum=\"$acc_sum 'gunzip -c $dir/acc.$x.$j.gz|'\"; done\n     $cmd $dir/log/update.$x.log \\\n       sgmm2-est --update-flags=$flags --split-substates=$numsubstates \\\n       $increase_dim_opts --power=$power --write-occs=$dir/$[$x+1].occs \\\n       $dir/$x.mdl \"$acc_sum|\" $dir/$[$x+1].mdl || exit 1;\n     rm $dir/$x.mdl $dir/acc.$x.*.gz $dir/$x.occs 2>/dev/null\n   fi\n   if [ $x -lt $max_iter_inc ]; then\n     numsubstates=$[$numsubstates+$incsubstates]\n   fi\n   x=$[$x+1];\ndone\n\nrm $dir/final.mdl $dir/final.occs 2>/dev/null\nln -s $x.mdl $dir/final.mdl\nln -s $x.occs $dir/final.occs\n\nif [ $spk_dim -gt 0 ]; then\n  # We need to create an \"alignment model\" that's been trained\n  # without the speaker vectors, to do the first-pass decoding with.\n  # in test time.\n\n  # We do this for a few iters, in this recipe.\n  final_mdl=$dir/$x.mdl\n  cur_alimdl=$dir/$x.mdl\n  while [ $x -lt $[$num_iters+$num_iters_alimdl] ]; do\n    echo \"$0: building alignment model (pass $x)\"\n    if [ $x -eq $num_iters ]; then # 1st pass of building alimdl.\n      flags=MwcS # don't update v the first time.  
Note-- we never update transitions.\n      # they wouldn't change anyway as we use the same alignment as previously.\n    else\n      flags=vMwcS\n    fi\n    if [ $stage -le $x ]; then\n      Args=() # bash array of training commands for 1:nj, that put accs to stdout.\n      for n in `seq $nj`; do\n        Args[$n]=`echo \"ali-to-post 'ark:gunzip -c $dir/ali.JOB.gz|' ark:- | \\\n          sgmm2-post-to-gpost $spkvecs_opt '$gselect_opt' \\\n          --utt2spk=ark:$sdata/JOB/utt2spk $final_mdl '$feats' ark,s,cs:- ark:- | \\\n                  sgmm2-acc-stats-gpost --rand-prune=$rand_prune --update-flags=$flags \\\n          $cur_alimdl '$feats' ark,s,cs:- - |\" | sed s/JOB/$n/g`\n      done\n      g=0\n      rm $dir/.error 2>/dev/null\n      while [ $[$g*$group] -lt $nj ]; do\n        if [ -s $dir/acc.$x.$g.gz ]; then\n          echo \"Skipping creation of acc $dir/acc.$x.$g.gz as it already exists.\"\n        else\n          start=$[$g*$group + 1]; # start-position in array Args.\n          $cmd --num-threads \"$group\" $dir/log/acc.$x.$g.log \\\n            sgmm2-sum-accs --parallel=true \"|gzip -c >$dir/acc.$x.$g.gz\" \"${Args[@]:$start:$group}\"  || touch $dir/.error &\n        fi\n        g=$[$g+1];\n      done\n      wait\n      if [ -f $dir/.error ]; then\n        echo \"Something went wrong during accumulation on pass $x\"\n        exit 1;\n      fi\n      acc_sum=\"sgmm2-sum-accs - \";\n      for j in `seq 0 $[$g-1]`; do acc_sum=\"$acc_sum 'gunzip -c $dir/acc.$x.$j.gz|'\"; done\n      $cmd $dir/log/update_ali.$x.log \\\n        sgmm2-est --update-flags=$flags --remove-speaker-space=true --power=$power \\\n        $cur_alimdl \"$acc_sum|\" $dir/$[$x+1].alimdl || exit 1;\n      rm $dir/acc.$x.*.gz || exit 1;\n      [ $x -gt $num_iters ]  && rm $dir/$x.alimdl\n    fi\n    cur_alimdl=$dir/$[$x+1].alimdl\n    x=$[$x+1]\n  done\n  rm $dir/final.alimdl 2>/dev/null\n  ln -s $x.alimdl $dir/final.alimdl\nfi\n\nutils/summarize_warnings.pl $dir/log\n\necho Done\n"
  },
  {
    "path": "egs/steps/train_smbr.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# sMBR training \n# 4 iterations (by default) of Extended Baum-Welch update.\n#\n# For the numerator we have a fixed alignment rather than a lattice--\n# this actually follows from the way lattices are defined in Kaldi, which\n# is to have a single path for each word (output-symbol) sequence.\n\n# Begin configuration section.\ncmd=run.pl\nnum_iters=4\ncancel=true # if true, cancel num and den counts on each frame.\ntau=400\nweight_tau=10\nacwt=0.1\nstage=0\nsmooth_to_mode=true\none_silence_class=false\n# End configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\nif [ $# -ne 5 ]; then\n  echo \"Usage: steps/train_smbr.sh <data> <lang> <ali> <denlats> <exp>\"\n  echo \" e.g.: steps/train_smbr.sh data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri2b_denlats_si84 exp/tri2b_smbr\"\n  echo \"Main options (for others, see top of script file)\"\n  echo \"  --cancel (true|false)                            # cancel stats (true by default)\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --stage <stage>                                  # stage to do partial re-run from.\"\n  echo \"  --tau                                            # tau for i-smooth to last iter (default 200)\"\n  echo \"  --one-silence-class <true|false>                 # If true, newer approach which will tend\"\n  echo \"                                                   # to reduce insertions (default: false)\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2\nalidir=$3\ndenlatdir=$4\ndir=$5\nmkdir -p $dir/log\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\nfor f in 
$data/feats.scp $alidir/{tree,final.mdl,ali.1.gz} $denlatdir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\nnj=`cat $alidir/num_jobs` || exit 1;\n[ \"$nj\" -ne \"`cat $denlatdir/num_jobs`\" ] && \\\n  echo \"$alidir and $denlatdir have different num-jobs\" && exit 1;\n\nsdata=$data/split$nj\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null`\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\nmkdir -p $dir/log\ncp $alidir/splice_opts $dir 2>/dev/null\ncp $alidir/cmvn_opts $dir 2>/dev/null # cmn/cmvn option.\ncp $alidir/delta_opts $dir 2>/dev/null\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\necho $nj > $dir/num_jobs\n\ncp $alidir/{final.mdl,tree} $dir\n\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\n\n# Set up features\n\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir    \n    ;;\n  *) echo \"Invalid feature type $feat_type\" && exit 1;\nesac\n\n[ -f $alidir/trans.1 ] && echo Using transforms from $alidir && \\\n  feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n\nlats=\"ark:gunzip -c $denlatdir/lat.JOB.gz|\"\n\ncur_mdl=$alidir/final.mdl\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Iteration $x of sMBR training\"\n  # Note: the num and den states are accumulated at the same time, so we\n  # can cancel them per frame.\n  if [ $stage -le $x ]; then\n    $cmd 
JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n      gmm-rescore-lattice $cur_mdl \"$lats\" \"$feats\" ark:- \\| \\\n      lattice-to-smbr-post --acoustic-scale=$acwt \\\n        --one-silence-class=$one_silence_class \\\n        --silence-phones=$silphonelist  $cur_mdl \\\n        \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz |\" ark:- ark:- \\| \\\n      gmm-acc-stats2 $cur_mdl \"$feats\" ark,s,cs:- \\\n        $dir/num_acc.$x.JOB.acc $dir/den_acc.$x.JOB.acc || exit 1;\n\n    n=`echo $dir/{num,den}_acc.$x.*.acc | wc -w`;\n    [ \"$n\" -ne $[$nj*2] ] && \\\n      echo \"Wrong number of sMBR accumulators $n versus 2*$nj\" && exit 1;\n    $cmd $dir/log/den_acc_sum.$x.log \\\n      gmm-sum-accs $dir/den_acc.$x.acc $dir/den_acc.$x.*.acc || exit 1;\n    rm $dir/den_acc.$x.*.acc\n    $cmd $dir/log/num_acc_sum.$x.log \\\n      gmm-sum-accs $dir/num_acc.$x.acc $dir/num_acc.$x.*.acc || exit 1;\n    rm $dir/num_acc.$x.*.acc\n\n  # note: this tau value is for smoothing towards model parameters,\n  # as in the Boosted MMI paper, not towards the ML stats as in the earlier\n  # work on discriminative training (e.g. my thesis).  \n  # You could use gmm-ismooth-stats to smooth to the ML stats, if you had\n  # them available [here they're not available if cancel=true].\n    if ! 
$smooth_to_model; then\n      echo \"Iteration $x of sMBR: computing ml (smoothing) stats\"\n      $cmd JOB=1:$nj $dir/log/acc_ml.$x.JOB.log \\\n        gmm-acc-stats $cur_mdl \"$feats\" \\\n          \"ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- |\" \\\n          $dir/ml.$x.JOB.acc || exit 1;\n      $cmd $dir/log/acc_ml_sum.$x.log \\\n        gmm-sum-accs $dir/ml.$x.acc $dir/ml.$x.*.acc || exit 1;\n      rm $dir/ml.$x.*.acc\n      num_stats=\"gmm-ismooth-stats --tau=$tau $dir/ml.$x.acc $dir/num_acc.$x.acc -|\"\n    else \n      num_stats=\"gmm-ismooth-stats --smooth-from-model=true --tau=$tau $cur_mdl $dir/num_acc.$x.acc -|\"\n    fi  \n    \n    $cmd $dir/log/update.$x.log \\\n      gmm-est-gaussians-ebw $cur_mdl \"$num_stats\" $dir/den_acc.$x.acc - \\| \\\n      gmm-est-weights-ebw - $dir/num_acc.$x.acc $dir/den_acc.$x.acc $dir/$[$x+1].mdl || exit 1;\n    rm $dir/{den,num}_acc.$x.acc\n  fi\n  cur_mdl=$dir/$[$x+1].mdl\n\n  # Some diagnostics: the objective function progress and auxiliary-function\n  # improvement.\n\n tail -n 50 $dir/log/acc.$x.*.log | perl -e 'while(<STDIN>) { if(m/lattice-to-smbr-post.+Overall average frame-accuracy is (\\S+) over (\\S+) frames/) { $tot_objf += $1*$2; $tot_frames += $2; }} $tot_objf /= $tot_frames; print \"$tot_objf $tot_frames\\n\"; ' > $dir/tmpf\n  objf=`cat $dir/tmpf | awk '{print $1}'`;\n  nf=`cat $dir/tmpf | awk '{print $2}'`;\n  rm $dir/tmpf\n  impr=`grep -w Overall $dir/log/update.$x.log | awk '{x += $10*$12;} END{print x;}'`\n  impr=`perl -e \"print ($impr*$acwt/$nf);\"` # We multiply by acwt, and divide by $nf which is the \"real\" number of frames.\n  # This gives us a projected objective function improvement.\n  echo \"Iteration $x: objf was $objf, sMBR auxf change was $impr\" | tee $dir/objf.$x.log\n  x=$[$x+1]\ndone\n\necho \"sMBR training finished\"\n\nrm $dir/final.mdl 2>/dev/null\nln -s $x.mdl $dir/final.mdl\n\nexit 0;\n"
  },
  {
    "path": "egs/steps/train_ubm.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This trains a UBM (i.e. a mixture of Gaussians), by clustering\n# the Gaussians from a trained HMM/GMM system and then doing a few\n# iterations of UBM training.\n# We mostly use this for SGMM systems.\n\n# Begin configuration section.\ncmd=run.pl\nsilence_weight=  # You can set it to e.g. 0.0, to weight down silence in training.\nstage=-2\nnum_gselect1=50 # first stage of Gaussian-selection\nnum_gselect2=25 # second stage.\nintermediate_num_gauss=2000\nnum_iters=3\nno_fmllr=false\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\n\nif [ $# != 5 ]; then\n  echo \"Usage: steps/train_ubm.sh <num-gauss> <data> <lang> <ali-dir> <exp>\"\n  echo \" e.g.: steps/train_ubm.sh 400 data/train_si84 data/lang exp/tri2b_ali_si84 exp/ubm3c\"\n  echo \"main options (for others, see top of script file)\"\n  echo \"  --config <config-file>                           # config containing options\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  echo \"  --silence-weight <sil-weight>                    # weight for silence (e.g. 0.5 or 0.0)\"\n  echo \"  --num-iters <#iters>                             # Number of iterations of E-M\"\\\n  echo \"  --no-fmllr (true|false)                          # ignore speaker matrices even if present\"\n  exit 1;\nfi\n\nnum_gauss=$1\ndata=$2\nlang=$3\nalidir=$4\ndir=$5\n\nfor f in $data/feats.scp $lang/L.fst $alidir/ali.1.gz $alidir/final.mdl $alidir/num_jobs; do\n  [ ! 
-f $f ] && echo \"No such file $f\" && exit 1;\ndone\n\nif [ $[$num_gauss*2] -gt $intermediate_num_gauss ]; then\n  echo \"intermediate_num_gauss was too small $intermediate_num_gauss\"\n  intermediate_num_gauss=$[$num_gauss*2];\n  echo \"setting it to $intermediate_num_gauss\"\nfi\n\n\n# Set various variables.\nsilphonelist=`cat $lang/phones/silence.csl` || exit 1;\nnj=`cat $alidir/num_jobs` || exit 1;\n\nmkdir -p $dir/log\necho $nj > $dir/num_jobs\nsdata=$data/split$nj;\n[[ -d $sdata && $data/feats.scp -ot $sdata ]] || split_data.sh $data $nj || exit 1;\nsplice_opts=`cat $alidir/splice_opts 2>/dev/null` # frame-splicing options.\ncmvn_opts=`cat $alidir/cmvn_opts 2>/dev/null`\ndelta_opts=`cat $alidir/delta_opts 2>/dev/null`\n\nutils/lang/check_phones_compatible.sh $lang/phones.txt $alidir/phones.txt || exit 1;\ncp $lang/phones.txt $dir || exit 1;\n\n## Set up features.\nif [ -f $alidir/final.mat ]; then feat_type=lda; else feat_type=delta; fi\necho \"$0: feature type is $feat_type\"\n\ncase $feat_type in\n  delta) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | add-deltas $delta_opts ark:- ark:- |\";;\n  lda) feats=\"ark,s,cs:apply-cmvn $cmvn_opts --utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\n    cp $alidir/final.mat $dir\n    ;;\n  *) echo \"$0: invalid feature type $feat_type\" && exit 1;\nesac\n\n\nif [ -f $alidir/trans.1 ]; then\n  if $no_fmllr; then\n    echo \"$0: deliberately ignoring speaker transforms from $alidir\"\n  else\n    echo \"$0: using transforms from $alidir\"\n    feats=\"$feats transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/trans.JOB ark:- ark:- |\"\n  fi\nelif [ -f $alidir/raw_trans.1 ]; then\n  echo \"$0: using raw-FMLLR transforms from $alidir\"\n  feats=\"ark,s,cs:apply-cmvn $cmvn_opts 
--utt2spk=ark:$sdata/JOB/utt2spk scp:$sdata/JOB/cmvn.scp scp:$sdata/JOB/feats.scp ark:- | transform-feats --utt2spk=ark:$sdata/JOB/utt2spk ark,s,cs:$alidir/raw_trans.JOB ark:- ark:- | splice-feats $splice_opts ark:- ark:- | transform-feats $alidir/final.mat ark:- ark:- |\"\nfi\n##\n\nif [ ! -z \"$silence_weight\" ]; then\n  weights_opt=\"--weights='ark,s,cs:gunzip -c $alidir/ali.JOB.gz | ali-to-post ark:- ark:- | weight-silence-post $silence_weight $silphonelist $alidir/final.mdl ark:- ark:- | post-to-weights ark:- ark:- |'\"\nelse\n  weights_opt=\nfi\n\nif [ $stage -le -2 ]; then\n  echo \"$0: clustering model $alidir/final.mdl to get initial UBM\"\n  $cmd $dir/log/cluster.log \\\n    init-ubm --intermediate-num-gauss=$intermediate_num_gauss --ubm-num-gauss=$num_gauss \\\n    --verbose=2 --fullcov-ubm=true $alidir/final.mdl $alidir/final.occs \\\n    $dir/0.ubm   || exit 1;\nfi\n\n# Do initial phase of Gaussian selection and save it to disk -- later on we'll\n# do more Gaussian selection to further prune, as the model changes.\n\n\nif [ $stage -le -1 ]; then\n  echo \"$0: doing Gaussian selection\"\n  $cmd JOB=1:$nj $dir/log/gselect.JOB.log \\\n    gmm-gselect --n=$num_gselect1 \"fgmm-global-to-gmm $dir/0.ubm - |\" \"$feats\" \\\n    \"ark:|gzip -c >$dir/gselect.JOB.gz\" || exit 1;\nfi\n\n\nx=0\nwhile [ $x -lt $num_iters ]; do\n  echo \"Pass $x\"\n  $cmd JOB=1:$nj $dir/log/acc.$x.JOB.log \\\n    gmm-gselect --n=$num_gselect2 \"--gselect=ark,s,cs:gunzip -c $dir/gselect.JOB.gz|\" \\\n    \"fgmm-global-to-gmm $dir/$x.ubm - |\" \"$feats\" ark:- \\| \\\n    fgmm-global-acc-stats $weights_opt --gselect=ark,s,cs:- $dir/$x.ubm \"$feats\" \\\n    $dir/$x.JOB.acc || exit 1;\n  lowcount_opt=\"--remove-low-count-gaussians=false\"\n  [ $[$x+1] -eq $num_iters ] && lowcount_opt=   # Only remove low-count Gaussians\n  # on last iter-- we can't do it earlier, or the Gaussian-selection info would\n  # be mismatched.\n  $cmd $dir/log/update.$x.log \\\n    fgmm-global-est 
$lowcount_opt --verbose=2 $dir/$x.ubm \"fgmm-global-sum-accs - $dir/$x.*.acc |\" \\\n      $dir/$[$x+1].ubm || exit 1;\n  rm $dir/$x.*.acc $dir/$x.ubm\n  x=$[$x+1]\ndone\n\nrm $dir/gselect.*.gz\nrm $dir/final.ubm 2>/dev/null\nmv $dir/$x.ubm $dir/final.ubm || exit 1;\n"
  },
  {
    "path": "egs/steps/word_align_lattices.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright Johns Hopkins University (Author: Daniel Povey)  2012\n# Apache 2.0.\n\n# Begin configuration section.\nsilence_label=0\ncmd=run.pl\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nfor x in `seq 2`; do\n  [ \"$1\" == \"--silence-label\" ] && silence_label=$2 && shift 2;\n  [ \"$1\" == \"--cmd\" ] && cmd=\"$2\" && shift 2;\ndone\n\nif [ $# != 3 ]; then\n   echo \"Word-align lattices (make the arcs sync up with words)\"\n   echo \"\"\n   echo \"Usage: $0 [options] <lang-dir> <decode-dir-in> <decode-dir-out>\"\n   echo \"options: [--cmd (run.pl|queue.pl [queue opts])] [--silence-label <integer-id-of-silence-word>]\"\n   exit 1;\nfi\n\n. ./path.sh || exit 1;\n\nlang=$1\nindir=$2\noutdir=$3\n\nmdl=`dirname $indir`/final.mdl\nwbfile=$lang/phones/word_boundary.int\n\nfor f in $mdl $wbfile $indir/num_jobs; do\n  [ ! -f $f ] && echo \"word_align_lattices.sh: no such file $f\" && exit 1;\ndone\n\nmkdir -p $outdir/log\n\n\ncp $indir/num_jobs $outdir;\nnj=`cat $indir/num_jobs`\n\n$cmd JOB=1:$nj $outdir/log/align.JOB.log \\\n  lattice-align-words --silence-label=$silence_label --test=true \\\n   $wbfile $mdl \"ark:gunzip -c $indir/lat.JOB.gz|\" \"ark,t:|gzip -c >$outdir/lat.JOB.gz\" || exit 1;\n\n"
  },
  {
    "path": "egs/utils/add_disambig.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Adds some specified number of disambig symbols to a symbol table.\n# Adds these as #1, #2, etc.\n# If the --include-zero option is specified, includes an extra one\n# #0.\n\n$include_zero = 0;\nif($ARGV[0] eq \"--include-zero\") {\n    $include_zero = 1;\n    shift @ARGV;\n}\n\nif(@ARGV != 2) {\n    die \"Usage: add_disambig.pl [--include-zero] symtab.txt num_extra > symtab_out.txt \";\n}\n\n\n$input = $ARGV[0];\n$nsyms = $ARGV[1];\n\nopen(F, \"<$input\") || die \"Opening file $input\";\n\nwhile(<F>) {\n    @A = split(\" \", $_);\n    @A == 2 || die \"Bad line $_\";\n    $lastsym = $A[1];\n    print;\n}\n\nif(!defined($lastsym)){\n die \"Empty symbol file?\";\n}\n\nif($include_zero) {\n    $lastsym++;\n    print \"#0  $lastsym\\n\";\n}\n\nfor($n = 1; $n <= $nsyms; $n++) {\n    $y = $n + $lastsym;\n    print \"#$n  $y\\n\";\n}\n"
  },
  {
    "path": "egs/utils/add_lex_disambig.pl",
    "content": "#!/usr/bin/env perl\n#  Copyright 2010-2011  Microsoft Corporation\n#            2013-2016  Johns Hopkins University (author: Daniel Povey)\n#                 2015  Hainan Xu\n#                 2015  Guoguo Chen\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Adds disambiguation symbols to a lexicon.\n# Outputs still in the normal lexicon format.\n# Disambig syms are numbered #1, #2, #3, etc. (#0\n# reserved for symbol in grammar).\n# Outputs the number of disambig syms to the standard output.\n# With the --pron-probs option, expects the second field\n# of each lexicon line to be a pron-prob.\n# With the --sil-probs option, expects three additional\n# fields after the pron-prob, representing various components\n# of the silence probability model.\n\n$pron_probs = 0;\n$sil_probs = 0;\n$first_allowed_disambig = 1;\n\nfor ($n = 1; $n <= 3 && @ARGV > 0; $n++) {\n  if ($ARGV[0] eq \"--pron-probs\") {\n    $pron_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--sil-probs\") {\n    $sil_probs = 1;\n    shift @ARGV;\n  }\n  if ($ARGV[0] eq \"--first-allowed-disambig\") {\n    $first_allowed_disambig = 0 + $ARGV[1];\n    if ($first_allowed_disambig < 1) {\n      die \"add_lex_disambig.pl: invalid --first-allowed-disambig option: $first_allowed_disambig\\n\";\n    }\n    shift @ARGV;\n    shift @ARGV;\n  }\n}\n\nif (@ARGV != 2) {\n  die \"Usage: add_lex_disambig.pl [opts] <lexicon-in> 
<lexicon-out>\\n\" .\n    \"This script adds disambiguation symbols to a lexicon in order to\\n\" .\n    \"make decoding graphs determinizable; it adds pseudo-phone\\n\" .\n    \"disambiguation symbols #1, #2 and so on at the ends of phones\\n\" .\n    \"to ensure that all pronunciations are different, and that none\\n\" .\n    \"is a prefix of another.\\n\" .\n    \"It prints to the standard output the number of the largest-numbered \" .\n    \"disambiguation symbol that was used.\\n\" .\n    \"\\n\" .\n    \"Options:   --pron-probs       Expect pronunciation probabilities in the 2nd field\\n\" .\n    \"           --sil-probs        [should be with --pron-probs option]\\n\" .\n    \"                              Expect 3 extra fields after the pron-probs, for aspects of\\n\" .\n    \"                              the silence probability model\\n\" .\n    \"           --first-allowed-disambig <n>  The number of the first disambiguation symbol\\n\" .\n    \"                              that this script is allowed to add.  
By default this is\\n\" .\n    \"                              #1, but you can set this to a larger value using this option.\\n\" .\n    \"e.g.:\\n\" .\n    \" add_lex_disambig.pl lexicon.txt lexicon_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs lexiconp.txt lexiconp_disambig.txt\\n\" .\n    \" add_lex_disambig.pl --pron-probs --sil-probs lexiconp_silprob.txt lexiconp_silprob_disambig.txt\\n\";\n}\n\n\n$lexfn = shift @ARGV;\n$lexoutfn = shift @ARGV;\n\nopen(L, \"<$lexfn\") || die \"Error opening lexicon $lexfn\";\n\n# (1)  Read in the lexicon.\n@L = ( );\nwhile(<L>) {\n    @A = split(\" \", $_);\n    push @L, join(\" \", @A);\n}\n\n# (2) Work out the count of each phone-sequence in the\n# lexicon.\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) {\n      $p = shift @A;\n      if (!($p > 0.0 && $p <= 1.0)) { die \"Bad lexicon line $l (expecting pron-prob as second field)\"; }\n    }\n    if ($sil_probs) {\n      $silp = shift @A;\n      if (!($silp > 0.0 && $silp <= 1.0)) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n      $correction = shift @A;\n      if ($correction <= 0.0) { die \"Bad lexicon line $l for silprobs\"; }\n    }\n    if (!(@A)) {\n      die \"Bad lexicon line $l, no phone in phone list\";\n    }\n    $count{join(\" \",@A)}++;\n}\n\n# (3) For each left sub-sequence of each phone-sequence, note down\n# that it exists (for identifying prefixes of longer strings).\n\nforeach $l (@L) {\n    @A = split(\" \", $l);\n    shift @A; # Remove word.\n    if ($pron_probs) { shift @A; } # remove pron-prob.\n    if ($sil_probs) {\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob\n      shift @A; # Remove silprob, there are three numbers for sil_probs\n    }\n    while(@A > 0) {\n        pop @A;  # Remove last phone\n        $issubseq{join(\" \",@A)} = 1;\n    }\n}\n\n# (4) For 
each entry in the lexicon:\n#  if the phone sequence is unique and is not a\n#  prefix of another word, no disambig symbol.\n#  Else output #1, or #2, #3, ... if the same phone-seq\n#  has already been assigned a disambig symbol.\n\n\nopen(O, \">$lexoutfn\") || die \"Opening lexicon file $lexoutfn for writing.\\n\";\n\n# max_disambig will always be the highest-numbered disambiguation symbol that\n# has been used so far.\n$max_disambig = $first_allowed_disambig - 1;\n\nforeach $l (@L) {\n  @A = split(\" \", $l);\n  $word = shift @A;\n  if ($pron_probs) {\n    $pron_prob = shift @A;\n  }\n  if ($sil_probs) {\n    $sil_word_prob = shift @A;\n    $word_sil_correction = shift @A;\n    $prev_nonsil_correction = shift @A;\n  }\n  $phnseq = join(\" \", @A);\n  if (!defined $issubseq{$phnseq}\n      && $count{$phnseq} == 1) {\n    ;                           # Do nothing.\n  } else {\n    if ($phnseq eq \"\") {        # need disambig symbols for the empty string\n      # that are not used anywhere else.\n      $max_disambig++;\n      $reserved_for_the_empty_string{$max_disambig} = 1;\n      $phnseq = \"#$max_disambig\";\n    } else {\n      $cur_disambig = $last_used_disambig_symbol_of{$phnseq};\n      if (!defined $cur_disambig) {\n        $cur_disambig = $first_allowed_disambig;\n      } else {\n        $cur_disambig++;           # Get a number that has not been used yet for\n                                   # this phone sequence.\n      }\n      while (defined $reserved_for_the_empty_string{$cur_disambig}) {\n        $cur_disambig++;\n      }\n      if ($cur_disambig > $max_disambig) {\n        $max_disambig = $cur_disambig;\n      }\n      $last_used_disambig_symbol_of{$phnseq} = $cur_disambig;\n      $phnseq = $phnseq . \" #\" . 
$cur_disambig;\n    }\n  }\n  if ($pron_probs) {\n    if ($sil_probs) {\n      print O \"$word\\t$pron_prob\\t$sil_word_prob\\t$word_sil_correction\\t$prev_nonsil_correction\\t$phnseq\\n\";\n    } else {\n      print O \"$word\\t$pron_prob\\t$phnseq\\n\";\n    }\n  } else {\n    print O \"$word\\t$phnseq\\n\";\n  }\n}\n\nprint $max_disambig . \"\\n\";\n"
  },
  {
    "path": "egs/utils/analyze_segments.pl",
    "content": "#!/usr/bin/perl\n# Copyright 2015 GoVivace Inc. (Author: Nagendra Kumar Goel)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# Analyze a segments file and print important stats on it.\n\n$dur = $total = 0;\n$maxDur = 0;\n$minDur = 9999999999;\n$n = 0;\nwhile(<>){\n    chomp;\n    @t = split(/\\s+/);\n    $dur = $t[3] - $t[2];\n    $total += $dur;\n    if ($dur > $maxDur) {\n        $maxSegId = $t[0];\n        $maxDur = $dur;\n    }\n    if ($dur < $minDur) {\n        $minSegId = $t[0];\n        $minDur = $dur;\n    }\n    $n++;\n}\n$avg=$total/$n;\n$hrs = $total/3600;\nprint \"Total $hrs hours of data\\n\";\nprint \"Average segment length $avg seconds\\n\";\nprint \"Segment $maxSegId has length of $maxDur seconds\\n\";\nprint \"Segment $minSegId has length of $minDur seconds\\n\";\n"
  },
  {
    "path": "egs/utils/apply_map.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\n# This program is a bit like ./sym2int.pl in that it applies a map\n# to things in a file, but it's a bit more general in that it doesn't\n# assume the things being mapped to are single tokens, they could\n# be sequences of tokens.  See the usage message.\n\n\n$permissive = 0;\n\nfor ($x = 0; $x <= 2; $x++) {\n\n  if (@ARGV > 0 && $ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 1:10 as a courtesty (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n\n  if (@ARGV > 0 && $ARGV[0] eq '--permissive') {\n    shift @ARGV;\n    # Mapping is optional (missing key is printed to output)\n    $permissive = 1;\n  }\n}\n\nif(@ARGV != 1) {\n  print STDERR \"Invalid usage: \" . join(\" \", @ARGV) . 
\"\\n\";\n  print STDERR <<'EOF';\nUsage: apply_map.pl [options] map <input >output\n options: [-f <field-range> ] [--permissive]\n   This applies a map to some specified fields of some input text:\n   For each line in the map file: the first field is the thing we\n   map from, and the remaining fields are the sequence we map it to.\n   The -f (field-range) option says which fields of the input file the map\n   map should apply to.\n   If the --permissive option is supplied, fields which are not present\n   in the map will be left as they were.\n Applies the map 'map' to all input text, where each line of the map\n is interpreted as a map from the first field to the list of the other fields\n Note: <field-range> can look like 4-5, or 4-, or 5-, or 1, it means the field\n range in the input to apply the map to.\n e.g.: echo A B | apply_map.pl a.txt\n where a.txt is:\n A a1 a2\n B b\n will produce:\n a1 a2 b\nEOF\n  exit(1);\n}\n\n($map_file) = @ARGV;\nopen(M, \"<$map_file\") || die \"Error opening map file $map_file: $!\";\n\nwhile (<M>) {\n  @A = split(\" \", $_);\n  @A >= 1 || die \"apply_map.pl: empty line.\";\n  $i = shift @A;\n  $o = join(\" \", @A);\n  $map{$i} = $o;\n}\n\nwhile(<STDIN>) {\n  @A = split(\" \", $_);\n  for ($x = 0; $x < @A; $x++) {\n    if ( (!defined $field_begin || $x >= $field_begin)\n         && (!defined $field_end || $x <= $field_end)) {\n      $a = $A[$x];\n      if (!defined $map{$a}) {\n        if (!$permissive) {\n          die \"apply_map.pl: undefined key $a in $map_file\\n\";\n        } else {\n          print STDERR \"apply_map.pl: warning! missing key $a in $map_file\\n\";\n        }\n      } else {\n        $A[$x] = $map{$a};\n      }\n    }\n  }\n  print join(\" \", @A) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/utils/best_wer.sh",
    "content": "#!/usr/bin/env bash\n#\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# To be run from one directory above this script.\n\nperl -e 'while(<>){ \n    s/\\|(\\d)/\\| $1/g; s/(\\d)\\|/$1 \\|/g;\n    if (m/[WS]ER (\\S+)/ && (!defined $bestwer || $bestwer > $1)){ $bestwer = $1; $bestline=$_; } # kaldi \"compute-wer\" tool.\n    elsif (m: (Mean|Sum/Avg|)\\s*\\|\\s*\\S+\\s+\\S+\\s+\\|\\s+\\S+\\s+\\S+\\s+\\S+\\s+\\S+\\s+(\\S+)\\s+\\S+\\s+\\|:\n        && (!defined $bestwer || $bestwer > $2)){ $bestwer = $2; $bestline=$_; } }  # sclite.\n   if (defined $bestline){ print $bestline; } ' | \\\n  awk 'BEGIN{ FS=\"%WER\"; } { if(NF == 2) { print FS$2\" \"$1; } else { print $0; }}' | \\\n  awk 'BEGIN{ FS=\"Sum/Avg\"; } { if(NF == 2) { print $2\" \"$1; } else { print $0; }}' | \\\n  awk '{ if($1!~/%WER/) { print \"%WER \"$9\" \"$0; } else { print $0; }}' | \\\n  sed -e 's|\\s\\s*| |g' -e 's|\\:$||' -e 's|\\:\\s*\\|\\s*$||'\n\n\n\n"
  },
  {
    "path": "egs/utils/build_const_arpa_lm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2014  Guoguo Chen\n# Apache 2.0\n\n# This script reads in an Arpa format language model, and converts it into the\n# ConstArpaLm format language model.\n\n# begin configuration section\n# end configuration section\n\n[ -f path.sh ] && . ./path.sh;\n\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <arpa-lm-path> <old-lang-dir> <new-lang-dir>\"\n  echo \"e.g.:\"\n  echo \"  $0 data/local/lm/3-gram.full.arpa.gz data/lang/ data/lang_test_tgmed\"\n  echo \"Options\"\n  exit 1;\nfi\n\nexport LC_ALL=C\n\narpa_lm=$1\nold_lang=$2\nnew_lang=$3\n\nmkdir -p $new_lang\n\nmkdir -p $new_lang\ncp -r $old_lang/* $new_lang\n\nunk=`cat $old_lang/oov.int`\nbos=`grep \"^<s>\\s\" $old_lang/words.txt | awk '{print $2}'`\neos=`grep \"^</s>\\s\" $old_lang/words.txt | awk '{print $2}'`\nif [[ -z $bos || -z $eos ]]; then\n  echo \"$0: <s> and </s> symbols are not in $old_lang/words.txt\"\n  exit 1\nfi\nif [[ -z $unk ]]; then\n  echo \"$0: can't find oov symbol id in $old_lang/oov.int\"\n  exit 1\nfi\n\n\narpa-to-const-arpa --bos-symbol=$bos \\\n  --eos-symbol=$eos --unk-symbol=$unk \\\n  \"gunzip -c $arpa_lm | utils/map_arpa_lm.pl $new_lang/words.txt|\"  $new_lang/G.carpa  || exit 1;\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/combine_data.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n#           2014  David Snyder\n\n# This script combines the data from multiple source directories into\n# a single destination directory.\n\n# See http://kaldi-asr.org/doc/data_prep.html#data_prep_data for information\n# about what these directories contain.\n\n# Begin configuration section.\nextra_files= # specify additional files in 'src-data-dir' to merge, ex. \"file1 file2 ...\"\nskip_fix=false # skip the fix_data_dir.sh in the end\n# End configuration section.\n\necho \"$0 $@\"  # Print the command line for logging\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -lt 2 ]; then\n  echo \"Usage: combine_data.sh [--extra-files 'file1 file2'] <dest-data-dir> <src-data-dir1> <src-data-dir2> ...\"\n  echo \"Note, files that don't appear in all source dirs will not be combined,\"\n  echo \"with the exception of utt2uniq and segments, which are created where necessary.\"\n  exit 1\nfi\n\ndest=$1;\nshift;\n\nfirst_src=$1;\n\nrm -r $dest 2>/dev/null\nmkdir -p $dest;\n\nexport LC_ALL=C\n\nfor dir in $*; do\n  if [ ! -f $dir/utt2spk ]; then\n    echo \"$0: no such file $dir/utt2spk\"\n    exit 1;\n  fi\ndone\n\n# Check that frame_shift are compatible, where present together with features.\ndir_with_frame_shift=\nfor dir in $*; do\n  if [[ -f $dir/feats.scp && -f $dir/frame_shift ]]; then\n    if [[ $dir_with_frame_shift ]] &&\n       ! cmp -s $dir_with_frame_shift/frame_shift $dir/frame_shift; then\n      echo \"$0:error: different frame_shift in directories $dir and \" \\\n           \"$dir_with_frame_shift. Cannot combine features.\"\n      exit 1;\n    fi\n    dir_with_frame_shift=$dir\n  fi\ndone\n\n# W.r.t. utt2uniq file the script has different behavior compared to other files\n# it is not compulsary for it to exist in src directories, but if it exists in\n# even one it should exist in all. 
We will create the files where necessary\nhas_utt2uniq=false\nfor in_dir in $*; do\n  if [ -f $in_dir/utt2uniq ]; then\n    has_utt2uniq=true\n    break\n  fi\ndone\n\nif $has_utt2uniq; then\n  # we are going to create an utt2uniq file in the destdir\n  for in_dir in $*; do\n    if [ ! -f $in_dir/utt2uniq ]; then\n      # we assume that utt2uniq is a one to one mapping\n      cat $in_dir/utt2spk | awk '{printf(\"%s %s\\n\", $1, $1);}'\n    else\n      cat $in_dir/utt2uniq\n    fi\n  done | sort -k1 > $dest/utt2uniq\n  echo \"$0: combined utt2uniq\"\nelse\n  echo \"$0 [info]: not combining utt2uniq as it does not exist\"\nfi\n# some of the old scripts might provide utt2uniq as an extrafile, so just remove it\nextra_files=$(echo \"$extra_files\"|sed -e \"s/utt2uniq//g\")\n\n# segments are treated similarly to utt2uniq. If it exists in some, but not all\n# src directories, then we generate segments where necessary.\nhas_segments=false\nfor in_dir in $*; do\n  if [ -f $in_dir/segments ]; then\n    has_segments=true\n    break\n  fi\ndone\n\nif $has_segments; then\n  for in_dir in $*; do\n    if [ ! -f $in_dir/segments ]; then\n      echo \"$0 [info]: will generate missing segments for $in_dir\" 1>&2\n      utils/data/get_segments_for_data.sh $in_dir\n    else\n      cat $in_dir/segments\n    fi\n  done | sort -k1 > $dest/segments\n  echo \"$0: combined segments\"\nelse\n  echo \"$0 [info]: not combining segments as it does not exist\"\nfi\n\nfor file in utt2spk utt2lang utt2dur utt2num_frames reco2dur feats.scp text cmvn.scp vad.scp reco2file_and_channel wav.scp spk2gender $extra_files; do\n  exists_somewhere=false\n  absent_somewhere=false\n  for d in $*; do\n    if [ -f $d/$file ]; then\n      exists_somewhere=true\n    else\n      absent_somewhere=true\n      fi\n  done\n\n  if ! 
$absent_somewhere; then\n    set -o pipefail\n    ( for f in $*; do cat $f/$file; done ) | sort -k1 > $dest/$file || exit 1;\n    set +o pipefail\n    echo \"$0: combined $file\"\n  else\n    if ! $exists_somewhere; then\n      echo \"$0 [info]: not combining $file as it does not exist\"\n    else\n      echo \"$0 [info]: **not combining $file as it does not exist everywhere**\"\n    fi\n  fi\ndone\n\nutils/utt2spk_to_spk2utt.pl <$dest/utt2spk >$dest/spk2utt\n\nif [[ $dir_with_frame_shift ]]; then\n  cp $dir_with_frame_shift/frame_shift $dest\nfi\n\nif ! $skip_fix ; then\n  utils/fix_data_dir.sh $dest || exit 1;\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/utils/convert_slf.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2014  Brno University of Technology (author Karel Vesely)\n# Copyright 2013  Korbinian Riedhammer\n\n# Convert a kaldi-lattice to HTK SLF format;  if given an output\n# directory, each lattice will be put in an individual gzipped file.\n\n# Internal representation of nodes, links:\n# node hash:\n# { W=>[word], t=>[time], n_out_arcs=>[number_of_outgoing_arcs] };\n# (Time internally represented as integer number of frames.)\n# link hash:\n# { S=>[start_node], E=>[end_node], W=>[word], v=>[0], a=>[acoustic_score], l=>[graph_score] }\n# \n# The HTK output supports:\n# - words on links [default],\n#   - simpler, same as in kaldi lattices, node-ids in output correspond to kaldi lattices\n# - words on nodes,\n#   - apart from original nodes, there are extra nodes containing the words.\n#   - each original ark is replaced by word-node and two links, connecting it with original nodes.\n\n\nuse utf8;\nuse List::Util qw(max);\n\nbinmode(STDIN, \":encoding(utf8)\");\nbinmode(STDOUT, \":encoding(utf8)\");\n\n# defaults\n$framerate=0.01;\n$wordtonode=0;\n\n$usage=\"Convert kaldi lattices to HTK SLF (v1.1) format.\\n\".\n       \"Usage: convert_slf.pl [options] lat-file.txt [out-dir]\\n\".\n       \"  e.g. 
lattice-align-words lang/phones/word_boundary.int final.mdl 'ark:gunzip -c lat.gz |' ark,t:- | utils/int2sym.pl -f 3 lang/words.txt | $0 - slf/\\n\".\n       \"\\n\".\n       \"Options regarding the SLF output:\\n\".\n       \"  --frame-rate x  Frame rate to compute timing information (default: $framerate)\\n\".\n       \"  --word-to-node  Print the word symbols on nodes (adds extra nodes+links; default: words at links)\\n\".\n       \"\\n\";\n\n# parse options\nwhile (@ARGV > 0 and $ARGV[0] =~ m/^--/) {\n  $param = shift @ARGV;\n  if ($param eq \"--frame-rate\") { $framerate = shift @ARGV; }\n  elsif ($param eq \"--word-to-node\") { $wordtonode = 1;}\n  else {\n    print STDERR \"Unknown option $param\\n\";\n    print STDERR \"\\n\";\n    print STDERR $usage;\n    exit 1;\n  }\n}\n\n# check positional arg count\nif (@ARGV < 1 || @ARGV > 2) {\n  print STDERR $usage;\n  exit 1;\n}\n\n# store gzipped lattices individually to outdir:\n$outdir = \"\";\nif (@ARGV == 2) {\n  $outdir = pop @ARGV;\n  unless (-d $outdir) { system(\"mkdir -p $outdir\"); }\n  unless (-d $outdir) {\n    print STDERR \"Could not create directory $outdir\\n\";\n    exit 1;\n  }\n}\n# or we'll print lattices to stdout:\nif ($outdir eq \"\") {\n  open(FH, \">-\") or die \"Could not write to stdout (???)\\n\";\n}\n\n\n### parse kaldi lattices:\n\n$utt = \"\";\n$arc = 0;\n$latest_time = 0.0;\n@links = ();\n%nodes = ();\n%nodes_extra = ();\n%accepting_states = ();\n\nopen (FI, $ARGV[0]) or die \"Could not read from file\\n\";\nbinmode(FI, \":encoding(utf8)\");\n\nwhile(<FI>) {\n  chomp;\n\n  @A = split /\\s+/;\n\n  if (@A == 1 and $utt eq \"\") {\n    # new lattice\n    $utt = $A[0];\n    $nodes{0} = { W=>\"!NULL\", t=>0.0, n_out_arcs=>0 }; #initial node\n\n  } elsif (@A == 1) {\n    # accepting node without FST weight, store data for link to terminal super-state\n    $accepting_states{$A[0]} = { W=>\"!NULL\", v=>0, a=>0, l=>0 };\n\n  } elsif (@A == 2) {\n    # accepting state with FST weight on it, again 
store data for the link\n    ($s, $info) = @A;\n    ($gs, $as, $ss) = split(/,/, $info);\n\n    # kaldi saves -log, but HTK does it the other way round\n    $gs *= -1;\n    $as *= -1;\n\n    # the state sequence is something like 1_2_4_56_45, get number of tokens after splitting by '_':\n    $ss = scalar split(/_/, $ss);\n    \n    # update the end time\n    die \"Node $s not yet visited, is lattice sorted topologically? $utt\" unless exists $nodes{$s}{t};\n    $time_end = $nodes{$s}{t} + $ss;\n    if ($latest_time < $time_end) { $latest_time = $time_end; }\n\n    # add the link data\n    $accepting_states{$A[0]} = { W=>\"!NULL\", v=>0, a=>$as, l=>$gs };\n\n  } elsif (@A == 4 or @A == 3) {\n    # FSA arc\n    ($s, $e, $w, $info) = @A;\n    if ($info ne \"\") {\n      ($gs, $as, $ss) = split(/,/, $info);\n    } else {\n      $gs = 0; $as = 0; $ss = \"\";\n    }\n\n    # rename epsilons to null\n    $w = \"!NULL\" if $w eq \"<eps>\";\n\n    # kaldi saves -log, but HTK does it the other way round\n    $gs *= -1;\n    $as *= -1;\n    \n    # the state sequence is something like 1_2_4_56_45, get number of tokens after splitting by '_':\n    $ss = scalar split(/_/, $ss);\n    \n    # keep track of the number of outgoing arcs for each node \n    # (later, we will connect sinks to the terminal state)\n    $nodes{$s}{n_out_arcs} += 1;\n\n    # keep track of timing\n    die \"Node $s not yet visited, is lattice sorted topologically? 
$utt\" unless exists $nodes{$s};\n    $time_end = $nodes{$s}{t} + $ss;\n    if ($latest_time < $time_end) { $latest_time = $time_end; }\n\n    # sanity check on already existing node\n    if (exists $nodes{$e}) {\n      die \"Node $e previously stored with different time \".$nodes{$e}{t}.\" now $time_end, $utt.\\n\"\n       if $time_end ne $nodes{$e}{t};\n    }\n\n    # store internal representation of the arc\n    if (not $wordtonode) {\n      # The words on links, the lattice keeps it's original structure,\n      # add node; do not overwrite\n      $nodes{$e} = { t=>$time_end, n_out_arcs=>0 } unless defined $nodes{$e};\n      # add the link data\n      push @links, { S=>$s, E=>$e, W=>$w, v=>0, a=>$as, l=>$gs };\n\n    } else {\n      # The problem here was that, if we have a node with several incoming links,\n      # the links can have different words on it, so we cannot simply put word from \n      # link into the node.\n      #\n      # The simple solution is:\n      # each FST arc gets replaced by extra node with word and two links,\n      # connecting it with original nodes.\n      #\n      # The lattice gets larger, and it is good to minimize the lattice during importing.\n      #\n      # During reading the FST, we don't know how many nodes there are in total, \n      # so the extra nodes are stored separately, indexed by arc number, \n      # and links have flags describing which type of node are they connected to.\n\n      # add 'extra node' containing the word:\n      $nodes_extra{$arc} = { W=>$w, t=>$time_end };\n      # add 'original node'; do not overwrite\n      $nodes{$e} = { W=>\"!NULL\", t=>$time_end, n_out_arcs=>0 } unless defined $nodes{$e};\n      \n      # add the link from 'original node' to 'extra node'\n      push @links, { S=>$s, E=>$arc, W=>$w, v=>0, a=>$as, l=>$gs, to_extra_node=>1 };\n      # add the link from 'extra node' to 'original node'\n      push @links, { S=>$arc, E=>$e, W=>$w, v=>0, a=>0, l=>0, from_extra_node=>1 };\n   \n      
# increase arc counter \n      $arc++;\n    }\n\n  } elsif (@A == 0) { # end of lattice reading, we'll add terminal super-state, and print it soon...\n    # find sinks\n    %sinks = ();\n    for $n (keys %nodes) { \n      $sinks{$n} = 1 if ($nodes{$n}{n_out_arcs} == 0);\n    }\n\n    # sanity check: lattices need at least one sink!\n    if (scalar keys %sinks == 0) {\n      print STDERR \"Error: $utt does not have at least one sink node-- cyclic lattice??\\n\";\n    }\n\n    # add terminal super-state,\n    $last_node = max(keys(%nodes)) + 1;\n    $nodes{$last_node} = { W=>\"!NULL\", t=>$latest_time };\n\n    # connect all accepting states with terminal super-state,\n    for $accept (sort { $a <=> $b } keys %accepting_states) {\n      %a = %{$accepting_states{$accept}};\n      push @links, { S=>$accept, E=>$last_node, W=>$a{W}, v=>$a{v}, a=>$a{a}, l=>$a{l} };\n    }\n\n    # connect also all sinks that are not accepting states,\n    for $sink (sort { $a <=> $b } keys %sinks) {\n      unless(exists($accepting_states{$sink})) {\n        print STDERR \"WARNING: detected sink node which is not accepting state in lattice $utt, incomplete lattice?\\n\";\n        push @links, { S=>$sink, E=>$last_node, W=>\"!NULL\", v=>0, a=>0, l=>0 };\n      }\n    }\n\n    # print out the lattice;  open file handle first\n    unless ($outdir eq \"\") {\n      open(FH, \"|-\", \"gzip -c > $outdir/$utt.lat.gz\") or die \"Could not write to $outdir/$utt.lat.gz\\n\";\n      binmode(FH, \":encoding(utf8)\");\n    } \n\n    if (not $wordtonode) {\n      # print lattice with words on links:\n      \n      # header\n      print FH \"VERSION=1.1\\n\";\n      print FH \"UTTERANCE=$utt\\n\";\n      print FH \"N=\".(keys %nodes).\"\\tL=\".(@links).\"\\n\";\n\n      # nodes\n      for $n (sort { $a <=> $b } keys %nodes) {\n        printf FH \"I=%d\\tt=%.2f\\n\", $n, $nodes{$n}{t}*$framerate;\n      }\n\n      # links/arcs\n      for $i (0 .. 
$#links) {\n        %l = %{$links[$i]}; # get hash representing the link...\n        printf FH \"J=$i\\tS=%d\\tE=%d\\tW=%s\\tv=%f\\ta=%f\\tl=%f\\n\", $l{S}, $l{E}, $l{W}, $l{v}, $l{a}, $l{l};\n      }\n\n    } else {\n      # print lattice with words in the nodes:\n\n      # header\n      print FH \"VERSION=1.1\\n\";\n      print FH \"UTTERANCE=$utt\\n\";\n      print FH \"N=\".(scalar(keys(%nodes))+scalar(keys(%nodes_extra))).\"\\tL=\".(@links).\"\\n\";\n\n      # number of original nodes, offset of extra_nodes\n      $node_id_offset = scalar keys %nodes;\n\n      # nodes\n      for $n (sort { $a <=> $b } keys %nodes) {\n        printf FH \"I=%d\\tW=%s\\tt=%.2f\\n\", $n, $nodes{$n}{W}, $nodes{$n}{t}*$framerate;\n      }\n      # extra nodes\n      for $n (sort { $a <=> $b } keys %nodes_extra) {\n        printf FH \"I=%d\\tW=%s\\tt=%.2f\\n\", $n+$node_id_offset, $nodes_extra{$n}{W}, $nodes_extra{$n}{t}*$framerate;\n      }\n\n      # links/arcs\n      for $i (0 .. $#links) {\n        %l = %{$links[$i]}; # get hash representing the link...\n        if ($l{from_extra_node}) { $l{S} += $node_id_offset; }\n        if ($l{to_extra_node}) { $l{E} += $node_id_offset; }\n        printf FH \"J=$i\\tS=%d\\tE=%d\\tv=%f\\ta=%f\\tl=%f\\n\", $l{S}, $l{E}, $l{v}, $l{a}, $l{l};\n      }\n    }\n\n    print FH \"\\n\";\n\n    # close handle if it was a file\n    close(FH) unless ($outdir eq \"\");\n\n    # clear data\n    $utt = \"\";\n    $arc = 0;\n    $latest_time = 0.0;\n    @links = ();\n    %nodes = ();\n    %nodes_extra = ();\n    %accepting_states = ();\n  } else {\n    die \"Unexpected column number of input line\\n$_\";\n  }\n}\n\n# $utt is a string, so compare it with 'ne' rather than the numeric '!='\nif ($utt ne \"\") {\n  print STDERR \"Last lattice was not printed as it might be incomplete?  Missing empty line?\\n\";\n}\n\n"
  },
  {
    "path": "egs/utils/convert_slf_parallel.sh",
    "content": "#!/usr/bin/env bash\n# Copyright Brno University of Technology (Author: Karel Vesely) 2014.  Apache 2.0.\n\n# This script converts lattices to HTK format compatible with other toolkits.\n# We can choose to put words to nodes or arcs, as both is valid in the SLF format.\n\n# begin configuration section.\ncmd=run.pl\ndirname=lats-in-htk-slf\nparallel_opts=\"--max-jobs-run 50\" # We should limit disk stress\nword_to_node=false # Words in arcs or nodes? [default:arcs]\n#end configuration section.\n\necho \"$0 $@\"\n\n[ -f ./path.sh ] && . ./path.sh\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <data-dir> <lang-dir|graph-dir> <decode-dir>\"\n  echo \" Options:\"\n  echo \"    --cmd (run.pl|queue.pl...)      # specify how to run the sub-processes.\"\n  echo \"    --word-to-link (true|false)     # put word symbols on links or nodes.\"\n  echo \"    --parallel-opts STR             # parallelization options (def.: '--max-jobs-run 50').\"\n  echo \"e.g.:\"\n  echo \"$0 data/dev data/lang exp/tri4a/decode_dev\"\n  exit 1;\nfi\n\ndata=$1\nlang=$2 # Note: may be graph directory not lang directory, but has the necessary stuff copied.\ndir=$3\n\nmodel=$(dirname $dir)/final.mdl # assume model one level up from decoding dir.\n\nfor f in $lang/words.txt $lang/phones/align_lexicon.int $model $dir/lat.1.gz; do\n  [ ! -f $f ] && echo \"$0: expecting file $f to exist\" && exit 1;\ndone\n\n[ ! -d $dir/$dirname/log ] && mkdir -p $dir/$dirname\n\necho \"$0: Converting lattices into '$dir/$dirname'\"\n\n# Words in arcs or nodes? 
[default:arcs]\nword_to_node_arg=\n$word_to_node && word_to_node_arg=\"--word-to-node\"\n\nnj=$(cat $dir/num_jobs)\n\n# convert the lattices (individually, gzipped)\n$cmd $parallel_opts JOB=1:$nj $dir/$dirname/log/lat_convert.JOB.log \\\n  mkdir -p $dir/$dirname/JOB/ '&&' \\\n  lattice-align-words-lexicon --output-error-lats=true --output-if-empty=true \\\n    $lang/phones/align_lexicon.int $model \"ark:gunzip -c $dir/lat.JOB.gz |\" ark,t:- \\| \\\n  utils/int2sym.pl -f 3 $lang/words.txt \\| \\\n  utils/convert_slf.pl $word_to_node_arg - $dir/$dirname/JOB/ || exit 1\n\n# make list of lattices (the pattern is quoted so the shell does not expand it)\nfind -L $PWD/$dir/$dirname -name '*.lat.gz' > $dir/$dirname/lat_htk.scp || exit 1\n\n# check number of lattices:\nnseg=$(cat $data/segments | wc -l)\nnlat_out=$(cat $dir/$dirname/lat_htk.scp | wc -l)\necho \"segments $nseg, saved-lattices $nlat_out\"\n#\n[ $nseg -ne $nlat_out ] && echo \"WARNING: missing $((nseg-nlat_out)) lattices for some segments!\" \\\n  && exit 1\n\necho \"success, converted lats to HTK : $PWD/$dir/$dirname/lat_htk.scp\"\nexit 0\n\n"
  },
  {
    "path": "egs/utils/copy_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script operates on a directory, such as in data/train/,\n# that contains some subset of the following files:\n#  feats.scp\n#  wav.scp\n#  vad.scp\n#  spk2utt\n#  utt2spk\n#  text\n#\n# It copies to another directory, possibly adding a specified prefix or a suffix\n# to the utterance and/or speaker names.  Note, the recording-ids stay the same.\n#\n\n\n# begin configuration section\nspk_prefix=\nutt_prefix=\nspk_suffix=\nutt_suffix=\nvalidate_opts=   # should rarely be needed.\n# end configuration section\n\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <destdir>\"\n  echo \"e.g.:\"\n  echo \" $0 --spk-prefix=1- --utt-prefix=1- data/train data/train_1\"\n  echo \"Options\"\n  echo \"   --spk-prefix=<prefix>     # Prefix for speaker ids, default empty\"\n  echo \"   --utt-prefix=<prefix>     # Prefix for utterance ids, default empty\"\n  echo \"   --spk-suffix=<suffix>     # Suffix for speaker ids, default empty\"\n  echo \"   --utt-suffix=<suffix>     # Suffix for utterance ids, default empty\"\n  exit 1;\nfi\n\n\nexport LC_ALL=C\n\nsrcdir=$1\ndestdir=$2\n\nif [ ! -f $srcdir/utt2spk ]; then\n  echo \"copy_data_dir.sh: no such file $srcdir/utt2spk\"\n  exit 1;\nfi\n\nif [ \"$destdir\" == \"$srcdir\" ]; then\n  echo \"$0: this script requires <srcdir> and <destdir> to be different.\"\n  exit 1\nfi\n\nset -e;\n\nmkdir -p $destdir\n\ncat $srcdir/utt2spk | awk -v p=$utt_prefix -v s=$utt_suffix '{printf(\"%s %s%s%s\\n\", $1, p, $1, s);}' > $destdir/utt_map\ncat $srcdir/spk2utt | awk -v p=$spk_prefix -v s=$spk_suffix '{printf(\"%s %s%s%s\\n\", $1, p, $1, s);}' > $destdir/spk_map\n\nif [ ! -f $srcdir/utt2uniq ]; then\n  if [[ ! -z $utt_prefix  ||  ! 
-z $utt_suffix ]]; then\n    cat $srcdir/utt2spk | awk -v p=$utt_prefix -v s=$utt_suffix '{printf(\"%s%s%s %s\\n\", p, $1, s, $1);}' > $destdir/utt2uniq\n  fi\nelse\n  cat $srcdir/utt2uniq | awk -v p=$utt_prefix -v s=$utt_suffix '{printf(\"%s%s%s %s\\n\", p, $1, s, $2);}' > $destdir/utt2uniq\nfi\n\ncat $srcdir/utt2spk | utils/apply_map.pl -f 1 $destdir/utt_map  | \\\n  utils/apply_map.pl -f 2 $destdir/spk_map >$destdir/utt2spk\n\nutils/utt2spk_to_spk2utt.pl <$destdir/utt2spk >$destdir/spk2utt\n\nif [ -f $srcdir/feats.scp ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/feats.scp >$destdir/feats.scp\nfi\n\nif [ -f $srcdir/vad.scp ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/vad.scp >$destdir/vad.scp\nfi\n\nif [ -f $srcdir/segments ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/segments >$destdir/segments\n  cp $srcdir/wav.scp $destdir\nelse # no segments->wav indexed by utt.\n  if [ -f $srcdir/wav.scp ]; then\n    utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/wav.scp >$destdir/wav.scp\n  fi\nfi\n\nif [ -f $srcdir/reco2file_and_channel ]; then\n  cp $srcdir/reco2file_and_channel $destdir/\nfi\n\nif [ -f $srcdir/text ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/text >$destdir/text\nfi\nif [ -f $srcdir/utt2dur ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/utt2dur >$destdir/utt2dur\nfi\nif [ -f $srcdir/utt2num_frames ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/utt2num_frames >$destdir/utt2num_frames\nfi\nif [ -f $srcdir/reco2dur ]; then\n  if [ -f $srcdir/segments ]; then\n    cp $srcdir/reco2dur $destdir/reco2dur\n  else\n    utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/reco2dur >$destdir/reco2dur\n  fi\nfi\nif [ -f $srcdir/spk2gender ]; then\n  utils/apply_map.pl -f 1 $destdir/spk_map <$srcdir/spk2gender >$destdir/spk2gender\nfi\nif [ -f $srcdir/cmvn.scp ]; then\n  utils/apply_map.pl -f 1 $destdir/spk_map <$srcdir/cmvn.scp >$destdir/cmvn.scp\nfi\nfor f in frame_shift stm 
glm ctm; do\n  if [ -f $srcdir/$f ]; then\n    cp $srcdir/$f $destdir\n  fi\ndone\n\nrm $destdir/spk_map $destdir/utt_map\n\necho \"$0: copied data from $srcdir to $destdir\"\n\nfor f in feats.scp cmvn.scp vad.scp utt2lang utt2uniq utt2dur utt2num_frames text wav.scp reco2file_and_channel frame_shift stm glm ctm; do\n  if [ -f $destdir/$f ] && [ ! -f $srcdir/$f ]; then\n    echo \"$0: file $f exists in dest $destdir but not in src $srcdir.  Moving it to\"\n    echo \" ... $destdir/.backup/$f\"\n    mkdir -p $destdir/.backup\n    mv $destdir/$f $destdir/.backup/\n  fi\ndone\n\n\n[ ! -f $srcdir/feats.scp ] && validate_opts=\"$validate_opts --no-feats\"\n[ ! -f $srcdir/text ] && validate_opts=\"$validate_opts --no-text\"\n\nutils/validate_data_dir.sh $validate_opts $destdir\n"
  },
  {
    "path": "egs/utils/create_data_link.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2013  Guoguo Chen\n#           2014  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n#\n# This script distributes data onto different file systems by making symbolic\n# links. It is supposed to use together with utils/create_split_dir.pl, which\n# creates a \"storage\" directory that links to different file systems.\n#\n# If a sub-directory egs/storage does not exist, it does nothing. If it exists,\n# then it selects pseudo-randomly a number from those available in egs/storage/*\n# creates a link such as\n#\n#   egs/egs.3.4.ark -> storage/4/egs.3.4.ark\n#\nuse strict;\nuse warnings;\nuse File::Basename;\nuse File::Spec;\nuse Getopt::Long;\n\nsub GetGCD {\n  my ($a, $b) = @_;\n  while ($a != $b) {\n    if ($a > $b) {\n      $a = $a - $b;\n    } else {\n      $b = $b - $a;\n    }\n  }\n  return $a;\n}\n\nmy $Usage = <<EOU;\ncreate_data_link.pl:\nThis script distributes data onto different file systems by making symbolic\nlinks. It is supposed to use together with utils/create_split_dir.pl, which\ncreates a \"storage\" directory that links to different file systems.\n\nIf a sub-directory foo/storage does not exist, it does nothing. If it exists,\nthen it selects pseudo-randomly a number from those available in foo/storage/*\ncreates a link such as\n\n  foo/egs.3.4.ark -> storage/4/egs.3.4.ark\n\nUsage: utils/create_data_link.pl <data-archive1> [<data-archive2> ... ]\n e.g.: utils/create_data_link.pl foo/bar/egs.3.4.ark foo/bar/egs.3.5.ark\n (note: the dirname, e.g. foo/bar/, must be the same in all cases).\n\nSee also utils/remove_data_links.sh\nEOU\n\nGetOptions();\n\nif (@ARGV == 0) {\n  die $Usage;\n}\n\nmy $example_fullpath = $ARGV[0];\n\n# Check if the storage has been created. If so, do nothing.\nmy $dirname = dirname($example_fullpath);\nif (! 
-d \"$dirname/storage\") {\n  exit(0);\n}\n\n# Storage exists, create symbolic links in the next few steps.\n\n# First, get a list of the available storage directories, and check if they are\n# properly created.\nopendir(my $dh, \"$dirname/storage/\") || die \"$0: Fail to open $dirname/storage/\\n\";\nmy @storage_dirs = grep(/^[0-9]*$/, readdir($dh));\nclosedir($dh);\nmy $num_storage = scalar(@storage_dirs);\nfor (my $x = 1; $x <= $num_storage; $x++) {\n  (-d \"$dirname/storage/$x\") || die \"$0: $dirname/storage/$x does not exist\\n\";\n}\n\n# Second, get the coprime list.\nmy @coprimes;\nfor (my $n = 1; $n <= $num_storage; $n++) {\n  if (GetGCD($n, $num_storage) == 1) {\n    push(@coprimes, $n);\n  }\n}\n\nmy $ret = 0;\n\nforeach my $fullpath (@ARGV) {\n  if ($dirname ne dirname($fullpath)) {\n    die \"Mismatch in directory names of arguments: $example_fullpath versus $fullpath\";\n  }\n\n  # Finally, work out the directory index where we should put the data to.\n  my $basename = basename($fullpath);\n  my $filename_numbers = $basename;\n  $filename_numbers =~ s/[^0-9]+/ /g;\n  my @filename_numbers = split(\" \", $filename_numbers);\n  my $total = 0;\n  my $index = 0;\n  foreach my $x (@filename_numbers) {\n    if ($index >= scalar(@coprimes)) {\n      $index = 0;\n    }\n    $total += $x * $coprimes[$index];\n    $index++;\n  }\n  my $dir_index = $total % $num_storage + 1;\n\n  # Make the symbolic link.\n  if (-e $fullpath) {\n    unlink($fullpath);\n  }\n  if (symlink(\"storage/$dir_index/$basename\", $fullpath) != 1) { # failure\n    $ret = 1;  # will exit with error status.\n  }\n}\n\nexit($ret);\n\n## testing:\n# rm -rf foo bar\n# mkdir -p bar/{1,2,3,4}\n# mkdir -p foo/storage\n# for x in 1 2 3 4; do ln -s ../../bar/$x foo/storage/$x; done\n# utils/create_data_link.pl foo/1.3.ark  foo/2.3.ark\n# ls -l foo\n# total 0\n# lrwxrwxrwx 1 dpovey fax 17 Sep  2 17:41 1.3.ark -> storage/3/1.3.ark\n# lrwxrwxrwx 1 dpovey fax 17 Sep  2 17:41 
2.3.ark -> storage/4/2.3.ark\n# drwxr-xr-x 2 dpovey fax 38 Sep  2 17:40 storage\n"
  },
  {
    "path": "egs/utils/create_split_dir.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2013  Guoguo Chen\n# Apache 2.0.\n#\n# This script creates storage directories on different file systems, and creates\n# symbolic links to those directories. For example, a command\n#\n#   utils/create_split_dir.pl /export/gpu-0{3,4,5}/egs/storage egs/storage\n#\n# will mkdir -p all of those directories, and will create links\n#\n#   egs/storage/1 -> /export/gpu-03/egs/storage\n#   egs/storage/2 -> /export/gpu-03/egs/storage\n#   ...\n#\nuse strict;\nuse warnings;\nuse File::Spec;\nuse Getopt::Long;\n\nmy $Usage = <<EOU;\ncreate_split_dir.pl:\nThis script creates storage directories on different file systems, and creates\nsymbolic links to those directories.\n\nUsage: utils/create_split_dir.pl <actual_storage_dirs> <pseudo_storage_dir>\n e.g.: utils/create_split_dir.pl /export/gpu-0{3,4,5}/egs/storage egs/storage\n\nAllowed options:\n  --suffix    : Common suffix to <actual_storage_dirs>    (string, default = \"\")\n\nSee also create_data_link.pl, which is intended to work with the resulting\ndirectory structure, and remove_data_links.sh\nEOU\n\nmy $suffix=\"\";\nGetOptions('suffix=s' => \\$suffix);\n\nif (@ARGV < 2) {\n  die $Usage;\n}\n\nmy $ans = 1;\n\nmy $dir = pop(@ARGV);\nsystem(\"mkdir -p $dir 2>/dev/null\");\n\nmy @all_actual_storage = ();\nforeach my $file (@ARGV) {\n  push @all_actual_storage, File::Spec->rel2abs($file . \"/\" . $suffix);\n}\n\nmy $index = 1;\nforeach my $actual_storage (@all_actual_storage) {\n  my $pseudo_storage = \"$dir/$index\";\n\n  # If the symbolic link already exists, delete it.\n  if (-l $pseudo_storage) {\n    print STDERR \"$0: link $pseudo_storage already exists, not overwriting.\\n\";\n    $index++;\n    next;\n  }\n\n  # Create the destination directory and make the link.\n  system(\"mkdir -p $actual_storage 2>/dev/null\");\n  if ($? 
!= 0) {\n    print STDERR \"$0: error creating directory $actual_storage\\n\";\n    exit(1);\n  }\n  { # create a README file for easier deletion.\n    open(R, \">$actual_storage/README.txt\");\n    my $storage_dir = File::Spec->rel2abs($dir);\n    print R \"# This directory is linked from $storage_dir, as part of Kaldi striped data\\n\";\n    print R \"# The full list of directories where this data resides is:\\n\";\n    foreach my $d (@all_actual_storage) {\n      print R \"$d\\n\";\n    }\n    close(R);\n  }\n  my $ret = symlink($actual_storage, $pseudo_storage);\n\n  # Process the returned values\n  $ans = $ans && $ret;\n  if (! $ret) {\n    print STDERR \"Error linking $actual_storage to $pseudo_storage\\n\";\n  }\n\n  $index++;\n}\n\nexit($ans == 1 ? 0 : 1);\n"
  },
  {
    "path": "egs/utils/ctm/convert_ctm.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n# This takes as standard input a ctm file that's \"relative to the utterance\",\n# i.e. times are measured relative to the beginning of the segments, and it\n# uses a \"segments\" file (format:\n# utterance-id recording-id start-time end-time\n# ) and a \"reco2file_and_channel\" file (format:\n# recording-id basename-of-file\n\n$skip_unknown=undef;\nif ( $ARGV[0] eq \"--skip-unknown\" ) {\n  $skip_unknown=1;\n  shift @ARGV;\n}\n\nif (@ARGV < 2 || @ARGV > 3) {\n  print STDERR \"Usage: convert_ctm.pl <segments-file> <reco2file_and_channel-file> [<utterance-ctm>] > real-ctm\\n\";\n  exit(1);\n}\n\n$segments = shift @ARGV;\n$reco2file_and_channel = shift @ARGV;\n\nopen(S, \"<$segments\") || die \"opening segments file $segments\";\nwhile(<S>) {\n  @A = split(\" \", $_);\n  @A == 4 || die \"Bad line in segments file: $_\";\n  ($utt, $recording_id, $begin_time, $end_time) = @A;\n  $utt2reco{$utt} = $recording_id;\n  $begin{$utt} = $begin_time;\n  $end{$utt} = $end_time;\n}\nclose(S);\nopen(R, \"<$reco2file_and_channel\") || die \"open reco2file_and_channel file $reco2file_and_channel\";\nwhile(<R>) {\n  @A = split(\" \", $_);\n  @A == 3 || die \"Bad line in reco2file_and_channel file: $_\";\n  ($recording_id, $file, $channel) = @A;\n  $reco2file{$recording_id} = $file;\n  $reco2channel{$recording_id} = $channel;\n}\n\n\n# Now process the ctm file, which is either the standard input or the third\n# command-line argument.\n$num_done = 0;\nwhile(<>) {\n  @A= split(\" \", $_);\n  ( @A == 5 || @A == 6 ) || die \"Unexpected ctm format: $_\";\n  # lines look like:\n  # <utterance-id> 1 <begin-time> <length> <word> [ confidence ]\n  ($utt, $one, $wbegin, $wlen, $w, $conf) = @A;\n  $reco = $utt2reco{$utt};\n  if (!defined $reco) { \n      next if defined $skip_unknown;\n      die \"Utterance-id $utt not defined in segments file $segments\"; \n  }\n  $file = 
$reco2file{$reco};\n  $channel = $reco2channel{$reco};\n  if (!defined $file || !defined $channel) { \n    die \"Recording-id $reco not defined in reco2file_and_channel file $reco2file_and_channel\"; \n  }\n  $b = $begin{$utt};\n  $e = $end{$utt};\n  $wbegin_r = $wbegin + $b; # Make it relative to beginning of the recording.\n  $wbegin_r = sprintf(\"%.2f\", $wbegin_r);\n  $wlen = sprintf(\"%.2f\", $wlen);\n  if (defined $conf) {\n    $line = \"$file $channel $wbegin_r $wlen $w $conf\\n\"; \n  } else {\n    $line = \"$file $channel $wbegin_r $wlen $w\\n\"; \n  }\n  if ($wbegin_r + $wlen > $e + 0.01) {\n    print STDERR \"Warning: word appears to be past end of recording; line is $line\";\n  }\n  print $line; # goes to stdout.\n  $num_done++;\n}\n\nif ($num_done == 0) { exit 1; } else { exit 0; }\n\n__END__\n\n# Test example [also test it without the 0.5's]\necho utt reco 10.0 20.0 > segments\necho reco file A > reco2file_and_channel\necho utt 1 8.0 1.0 word 0.5 > ctm_in\necho file A 18.00 1.00 word 0.5 > ctm_out\nutils/convert_ctm.pl segments reco2file_and_channel ctm_in | cmp - ctm_out || echo error\nrm segments reco2file_and_channel ctm_in ctm_out\n\n\n\n\n"
  },
  {
    "path": "egs/utils/ctm/fix_ctm.sh",
    "content": "#! /bin/bash\n\nstmfile=$1\nctmfile=$2\n\nsegments_stm=`cat $stmfile | cut -f 1 -d ' ' | sort -u`\nsegments_ctm=`cat $ctmfile | cut -f 1 -d ' ' | sort -u`\n\nsegments_stm_count=`echo \"$segments_stm\" | wc -l `\nsegments_ctm_count=`echo \"$segments_ctm\" | wc -l `\n\n#echo $segments_stm_count\n#echo $segments_ctm_count\n\nif [ \"$segments_stm_count\" -gt \"$segments_ctm_count\"  ] ; then\n  pp=$( diff <(echo \"$segments_stm\") <(echo \"$segments_ctm\" ) | grep \"^<\" | sed \"s/^< *//g\")\n  (\n    for elem in $pp ; do\n      echo \"$elem 1 0 0 EMPTY_RECOGNIZED_PHRASE\"\n    done\n  ) >> $ctmfile\n  echo \"FIXED CTM FILE\"\n  exit 0\nelif [ \"$segments_stm_count\" -lt \"$segments_ctm_count\"  ] ; then\n  echo \"Segment STM count: $segments_stm_count\"\n  echo \"Segment CTM count: $segments_ctm_count\"\n  echo \"FAILURE FIXING CTM FILE\"\n  exit 1\nelse\n  exit 0\nfi\n\n"
  },
  {
    "path": "egs/utils/ctm/resolve_ctm_overlaps.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2014  Johns Hopkins University (Authors: Daniel Povey)\n#           2014  Vijayaditya Peddinti\n#           2016  Vimal Manohar\n# Apache 2.0.\n\n\"\"\"\nScript to combine ctms with overlapping segments.\nThe current approach is very simple. It ignores the words,\nwhich are hypothesized in the half of the overlapped region\nthat is closer to the utterance boundary.\nSo if there are two segments\nin the region 0s to 30s and 25s to 55s, with overlap of 5s,\nthe last 2.5s of the first utterance i.e. from 27.5s to 30s is truncated\nand the first 2.5s of the second utterance i.e. from 25s to 27.5s is truncated.\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport argparse\nimport collections\nimport logging\n\nfrom collections import defaultdict\n\nlogger = logging.getLogger(__name__)\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\n    '%(asctime)s [%(pathname)s:%(lineno)s - '\n    '%(funcName)s - %(levelname)s ] %(message)s')\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\n\ndef get_args():\n    \"\"\"gets command line arguments\"\"\"\n\n    usage = \"\"\" Python script to resolve overlaps in ctms.  May be used with\n                utils/data/subsegment_data_dir.sh. 
\"\"\"\n    parser = argparse.ArgumentParser(usage)\n    parser.add_argument('segments', type=argparse.FileType('r'),\n                        help='use segments to resolve overlaps')\n    parser.add_argument('ctm_in', type=argparse.FileType('r'),\n                        help='input_ctm_file')\n    parser.add_argument('ctm_out', type=argparse.FileType('w'),\n                        help='output_ctm_file')\n    parser.add_argument('--verbose', type=int, default=0,\n                        help=\"Higher value for more verbose logging.\")\n    args = parser.parse_args()\n\n    if args.verbose > 2:\n        logger.setLevel(logging.DEBUG)\n        handler.setLevel(logging.DEBUG)\n\n    return args\n\n\ndef read_segments(segments_file):\n    \"\"\"Read from segments and returns two dictionaries,\n    {utterance-id: (recording_id, start_time, end_time)}\n    {recording_id: list-of-utterances}\n    \"\"\"\n    segments = {}\n    reco2utt = defaultdict(list)\n\n    num_lines = 0\n    for line in segments_file:\n        num_lines += 1\n        parts = line.strip().split()\n        assert len(parts) in [4, 5]\n        segments[parts[0]] = (parts[1], float(parts[2]), float(parts[3]))\n        reco2utt[parts[1]].append(parts[0])\n\n    logger.info(\"Read %d lines from segments file %s\",\n                num_lines, segments_file.name)\n    segments_file.close()\n\n    return segments, reco2utt\n\n\ndef read_ctm(ctm_file, segments):\n    \"\"\"Read CTM from ctm_file into a dictionary of values indexed by the\n    recording.\n    It is assumed to be sorted by the recording-id and utterance-id.\n\n    Returns a dictionary {recording : ctm_lines}\n        where ctm_lines is a list of lines of CTM corresponding to the\n        utterances in the recording.\n        The format is as follows:\n        [[(utteranceA, channelA, start_time1, duration1, hyp_word1, conf1),\n          (utteranceA, channelA, start_time2, duration2, hyp_word2, conf2),\n          ...\n          (utteranceA, 
channelA, start_timeN, durationN, hyp_wordN, confN)],\n         [(utteranceB, channelB, start_time1, duration1, hyp_word1, conf1),\n          (utteranceB, channelB, start_time2, duration2, hyp_word2, conf2),\n          ...],\n         ...\n         [...\n          (utteranceZ, channelZ, start_timeN, durationN, hyp_wordN, confN)]\n        ]\n    \"\"\"\n    ctms = {}\n\n    num_lines = 0\n    for line in ctm_file:\n        num_lines += 1\n        parts = line.split()\n\n        utt = parts[0]\n        reco = segments[utt][0]\n\n        if (reco, utt) not in ctms:\n            ctms[(reco, utt)] = []\n\n        ctms[(reco, utt)].append([parts[0], parts[1], float(parts[2]),\n                                  float(parts[3])] + parts[4:])\n\n    logger.info(\"Read %d lines from CTM %s\", num_lines, ctm_file.name)\n\n    ctm_file.close()\n    return ctms\n\n\ndef resolve_overlaps(ctms, segments):\n    \"\"\"Resolve overlaps within segments of the same recording.\n\n    Returns new lines of CTM for the recording.\n\n    Arguments:\n        ctms - The CTM lines for a single recording. This is one value stored\n            in the dictionary read by read_ctm(). 
Assumes that the lines\n            are sorted by the utterance-ids.\n            The format is the following:\n            [[(utteranceA, channelA, start_time1, duration1, hyp_word1, conf1),\n              (utteranceA, channelA, start_time2, duration2, hyp_word2, conf2),\n              ...\n              (utteranceA, channelA, start_timeN, durationN, hyp_wordN, confN)\n             ],\n             [(utteranceB, channelB, start_time1, duration1, hyp_word1, conf1),\n              (utteranceB, channelB, start_time2, duration2, hyp_word2, conf2),\n              ...],\n             ...\n             [...\n              (utteranceZ, channelZ, start_timeN, durationN, hyp_wordN, confN)]\n            ]\n        segments - Dictionary containing the output of read_segments()\n            { utterance_id: (recording_id, start_time, end_time) }\n        \"\"\"\n    total_ctm = []\n    if len(ctms) == 0:\n        raise RuntimeError('CTMs for recording is empty. '\n                           'Something wrong with the input ctms')\n\n    # First column of first line in CTM for first utterance\n    next_utt = ctms[0][0][0]\n    for utt_index, ctm_for_cur_utt in enumerate(ctms):\n        if utt_index == len(ctms) - 1:\n            break\n\n        if len(ctm_for_cur_utt) == 0:\n            next_utt = ctms[utt_index + 1][0][0]\n            continue\n\n        cur_utt = ctm_for_cur_utt[0][0]\n        if cur_utt != next_utt:\n            logger.error(\n                \"Current utterance %s is not the same as the next \"\n                \"utterance %s in previous iteration.\\n\"\n                \"CTM is not sorted by utterance-id?\",\n                cur_utt, next_utt)\n            raise ValueError\n\n        # Assumption here is that the segments are written in\n        # consecutive order?\n        ctm_for_next_utt = ctms[utt_index + 1]\n        next_utt = ctm_for_next_utt[0][0]\n        if segments[next_utt][1] < segments[cur_utt][1]:\n            logger.error(\n                
                \"Next utterance %s <= Current utterance %s. \"\n                \"CTM is not sorted by start-time of utterance-id.\",\n                next_utt, cur_utt)\n            raise ValueError\n\n        try:\n            # length of this utterance\n            window_length = segments[cur_utt][2] - segments[cur_utt][1]\n\n            # overlap of this segment with the next segment\n            # i.e. current_utterance_end_time - next_utterance_start_time\n            # Note: It is possible for this to be negative when there is\n            # actually no overlap between consecutive segments.\n            try:\n                overlap = segments[cur_utt][2] - segments[next_utt][1]\n            except KeyError:\n                logger.error(\"Could not find utterance %s in segments\",\n                             next_utt)\n                raise\n\n            if overlap > 0 and segments[next_utt][2] <= segments[cur_utt][2]:\n                # Next utterance is entirely within this utterance.\n                # So we leave this ctm as is and make the next one empty.\n                total_ctm.extend(ctm_for_cur_utt)\n                ctms[utt_index + 1] = []\n                continue\n\n            # find a break point (a line in the CTM) for the current utterance\n            # i.e. 
the first line that has more than half of it outside\n            # the first half of the overlap region.\n            # Note: This line will not be included in the output CTM, which is\n            # only upto the line before this.\n            try:\n                index = next(\n                    (i for i, line in enumerate(ctm_for_cur_utt)\n                     if (line[2] + line[3] / 2.0\n                         > window_length - overlap / 2.0)))\n            except StopIteration:\n                # It is possible for such a word to not exist, e.g the last\n                # word in the CTM is longer than overlap length and starts\n                # before the beginning of the overlap.\n                # or the last word ends before the middle of the overlap.\n                index = len(ctm_for_cur_utt)\n\n            # Ignore the hypotheses beyond this midpoint. They will be\n            # considered as part of the next segment.\n            total_ctm.extend(ctm_for_cur_utt[:index])\n\n            # Find a break point (a line in the CTM) for the next utterance\n            # i.e. 
the first line that has more than half of it outside\n            # the first half of the overlap region.\n            try:\n                index = next(\n                    (i for i, line in enumerate(ctm_for_next_utt)\n                    if line[2] + line[3] / 2.0 > overlap / 2.0))\n            except StopIteration:\n                # This can happen if there is no word hypothesized after\n                # half the overlap region.\n                ctms[utt_index + 1] = []\n                continue\n\n            if index > 0:\n                # Update the ctm_for_next_utt to include only the lines\n                # starting from index.\n                ctms[utt_index + 1] = ctm_for_next_utt[index:]\n            # else leave the ctm as is.\n        except:\n            logger.error(\"Could not resolve overlaps between CTMs for \"\n                         \"%s and %s\", cur_utt, next_utt)\n            logger.error(\"Current CTM:\")\n            for line in ctm_for_cur_utt:\n                logger.error(ctm_line_to_string(line))\n            logger.error(\"Next CTM:\")\n            for line in ctm_for_next_utt:\n                logger.error(ctm_line_to_string(line))\n            raise\n\n    # merge the last ctm entirely\n    total_ctm.extend(ctms[-1])\n\n    return total_ctm\n\n\ndef ctm_line_to_string(line):\n    \"\"\"Converts a line of CTM to string.\"\"\"\n    return \"{0} {1} {2} {3} {4}\".format(line[0], line[1], line[2], line[3],\n                                        \" \".join(line[4:]))\n\n\ndef write_ctm(ctm_lines, out_file):\n    \"\"\"Writes CTM lines stored in a list to file.\"\"\"\n    for line in ctm_lines:\n        print(ctm_line_to_string(line), file=out_file)\n\n\ndef run(args):\n    \"\"\"this method does everything in this script\"\"\"\n    segments, reco2utt = read_segments(args.segments)\n    ctms = read_ctm(args.ctm_in, segments)\n\n    for reco, utts in reco2utt.items():\n        ctms_for_reco = []\n        for utt in sorted(utts, 
key=lambda x: segments[x][1]):\n            if (reco, utt) in ctms:\n                ctms_for_reco.append(ctms[(reco, utt)])\n        if len(ctms_for_reco) == 0:\n            logger.info(\"CTM for recording {0} was empty\".format(reco))\n            continue\n        try:\n            # Process CTMs in the recordings\n            ctms_for_reco = resolve_overlaps(ctms_for_reco, segments)\n            write_ctm(ctms_for_reco, args.ctm_out)\n        except Exception:\n            logger.error(\"Failed to process CTM for recording %s\",\n                         reco)\n            raise\n    args.ctm_out.close()\n    logger.info(\"Wrote CTM for %d recordings.\", len(ctms))\n\n\ndef main():\n    \"\"\"The main function which parses arguments and call run().\"\"\"\n    args = get_args()\n    try:\n        run(args)\n    except:\n        logger.error(\"Failed to resolve overlaps\", exc_info=True)\n        raise SystemExit(1)\n    finally:\n        try:\n            for f in [args.segments, args.ctm_in, args.ctm_out]:\n                if f is not None:\n                    f.close()\n        except IOError:\n            logger.error(\"Could not close some files. \"\n                         \"Disk error or broken pipes?\")\n            raise\n        except UnboundLocalError:\n            raise SystemExit(1)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/utils/data/combine_short_segments.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script copies and modifies a data directory while combining\n# segments whose duration is lower than a specified minimum segment\n# length.\n#\n# Note: this does not work for the wav.scp, since there is no natural way to\n# concatenate segments; you have to operate on directories that already have\n# features extracted.\n\n#\n\n\n# begin configuration section\ncleanup=true\nspeaker_only=false  # If true, utterances are only combined from the same speaker.\n                    # It may be useful for the speaker recognition task.\n                    # If false, utterances are preferentially combined from the same speaker,\n                    # and then combined across different speakers.\n# end configuration section\n\n\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <min-segment-length-in-seconds> <dir>\"\n  echo \"e.g.:\"\n  echo \" $0 data/train 1.55 data/train_comb\"\n  echo \" Options:\"\n  echo \"  --speaker-only <true|false>  # options to internal/choose_utts_to_combine.py, default false.\"\n  exit 1;\nfi\n\n\nexport LC_ALL=C\n\nsrcdir=$1\nmin_seg_len=$2\ndir=$3\n\nif [ \"$dir\" == \"$srcdir\" ]; then\n  echo \"$0: this script requires <srcdir> and <dir> to be different.\"\n  exit 1\nfi\n\nfor f in $srcdir/utt2spk $srcdir/feats.scp; do\n  [ ! -s $f ] && echo \"$0: expected file $f to exist and be nonempty\" && exit 1\ndone\n\nif ! awk '{if (NF != 2) exit(1);}' <$srcdir/feats.scp; then\n  echo \"$0: could not combine short segments because $srcdir/feats.scp has \"\n  echo \" entries with too many fields\"\nfi\n\nif ! mkdir -p $dir; then\n  echo \"$0: could not create directory $dir\"\n  exit 1;\nfi\n\nif ! utils/validate_data_dir.sh --no-text $srcdir; then\n  echo \"$0: failed to validate input directory $srcdir.  
If needed, run   utils/fix_data_dir.sh $srcdir\"\n  exit 1\nfi\n\nif ! python -c \"x=float('$min_seg_len'); assert(x>0.0 and x<100.0);\" 2>/dev/null; then\n  echo \"$0: bad <min-segment-length-in-seconds>: got '$min_seg_len'\"\n  exit 1\nfi\n\nset -e\nset -o pipefail\n\n# make sure $srcdir/utt2dur exists.\nutils/data/get_utt2dur.sh $srcdir\n\nutils/data/internal/choose_utts_to_combine.py --min-duration=$min_seg_len \\\n  --merge-within-speakers-only=$speaker_only \\\n  $srcdir/spk2utt $srcdir/utt2dur $dir/utt2utts $dir/utt2spk $dir/utt2dur\n\nutils/utt2spk_to_spk2utt.pl < $dir/utt2spk > $dir/spk2utt\n\n# create the feats.scp.\n# if a line of utt2utts is like 'utt2-comb2 utt2 utt3', then\n# the utils/apply_map.pl will create a line that looks like\n# 'utt2-comb2 foo.ark:4315 foo.ark:431423'\n# and the awk command creates suitable command lines like:\n# 'utt2-comb2 concat-feats --print-args=false foo.ark:4315 foo.ark:431423 - |'\nutils/apply_map.pl -f 2- $srcdir/feats.scp <$dir/utt2utts | \\\n  awk '{if (NF<=2){print;} else { $1 = $1 \" concat-feats --print-args=false\"; $NF = $NF \" - |\"; print; }}' > $dir/feats.scp\n\n# create $dir/text by concatenating the source 'text' entries for the original\n# utts.\nif [ -f $srcdir/text ]; then\n  utils/apply_map.pl -f 2- $srcdir/text <$dir/utt2utts > $dir/text\nfi\n\nif [ -f $srcdir/utt2uniq ]; then\n  # the utt2uniq file is such that if 2 utts were derived from the same original\n  # utt (e.g. by speed perturbing) they map to the same 'uniq' value.  This is\n  # so that we can properly hold out validation data for neural net training and\n  # know that we're not training on perturbed versions of that utterance.  
We\n  # need to obtain the utt2uniq file so that if any 2 'new' utts contain any of\n  # the same 'old' utts, their 'uniq' values are the same [but otherwise as far\n  # as possible, the 'uniq' values are different.]\n  #\n  # we'll do this by arranging the old 'uniq' values into groups as necessary to\n  # capture this property.\n\n  # The following command creates 'uniq_sets', each line of which contains\n  # a set of original 'uniq' values, and effectively we assert that they must\n  # be grouped together to the same 'uniq' value.\n  # the first awk command prints a group of the original utterance-ids that\n  # are combined together into a single new utterance, and the apply_map\n  # command converts those into a list of original 'uniq' values.\n  awk '{$1 = \"\"; print;}' < $dir/utt2utts | \\\n    utils/apply_map.pl $srcdir/utt2uniq > $dir/uniq_sets\n\n  # The next command creates $dir/uniq2merged_uniq, which is a map from the\n  # original 'uniq' values to the 'merged' uniq values.\n  # for example, if $dir/uniq_sets were to contain\n  # a b\n  # b c\n  # d\n  # then we'd obtain a uniq2merged_uniq file that looks like:\n  # a a\n  # b a\n  # c a\n  # d d\n  # ... 
because a and b appear together, and b and c appear together,\n  # they have to be merged into the same set, and we name that set 'a'\n  # (in general, we take the lowest string in lexicographical order).\n\n  cat $dir/uniq_sets | LC_ALL=C python3 -c '\nimport sys;\nfrom collections import defaultdict\nuniq2orig_uniq = dict()\nequal_pairs = set()  # set of 2-tuples (a,b) which should have equal orig_uniq\nwhile True:\n    line = sys.stdin.readline()\n    if line == \"\": break\n    split_line = line.split() # list of uniq strings that should map in same set\n    # initialize uniq2orig_uniq to the identity mapping\n    for uniq in split_line: uniq2orig_uniq[uniq] = uniq\n    for a in split_line[1:]: equal_pairs.add((split_line[0], a))\n\nchanged = True\nwhile changed:\n    changed = False\n    for a,b in equal_pairs:\n         min_orig_uniq = min(uniq2orig_uniq[a], uniq2orig_uniq[b])\n         for x in [a,b]:\n             if uniq2orig_uniq[x] != min_orig_uniq:\n                 uniq2orig_uniq[x] = min_orig_uniq\n                 changed = True\n\nfor uniq in sorted(uniq2orig_uniq.keys()):\n    print(uniq, uniq2orig_uniq[uniq])\n' > $dir/uniq_to_orig_uniq\n  rm $dir/uniq_sets\n\n\n  # In the following command, suppose we have a line like:\n  # utt1-comb2 utt1 utt2\n  # .. 
the first awk command retains only the first original utt, to give\n  # utt1-comb2 utt1\n  # [we can pick one arbitrarily since we know any of them would map to the same\n  # orig_uniq value.]\n  # the first apply_map.pl command maps the 'utt1' to the 'uniq' value it mapped to\n  # in $srcdir, and the second apply_map.pl command maps it to the grouped 'uniq'\n  # value obtained by the inline python script above.\n  awk '{print $1, $2}' < $dir/utt2utts | utils/apply_map.pl -f 2 $srcdir/utt2uniq | \\\n    utils/apply_map.pl -f 2 $dir/uniq_to_orig_uniq > $dir/utt2uniq\n  rm $dir/uniq_to_orig_uniq\nfi\n\n# note: the user will have to recompute the cmvn, as the speakers may have changed.\nrm $dir/cmvn.scp 2>/dev/null || true\n\nutils/validate_data_dir.sh --no-text --no-wav $dir\n\nif $cleanup; then\n  rm $dir/utt2utts\nfi\n"
  },
  {
    "path": "egs/utils/data/convert_data_dir_to_whole.sh",
    "content": "#! /bin/bash\n\n# Copyright 2016-2018  Vimal Manohar\n# Apache 2.0\n\n# This script converts a data directory into a \"whole\" data directory\n# by removing the segments and using the recordings themselves as \n# utterances\n\nset -o pipefail\n\n. ./path.sh\n\n. utils/parse_options.sh\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: convert_data_dir_to_whole.sh <in-data> <out-data>\"\n  echo \" e.g.: convert_data_dir_to_whole.sh data/dev data/dev_whole\"\n  exit 1\nfi\n\ndata=$1\ndir=$2\n\nif [ ! -f $data/segments ]; then\n  echo \"$0: Data directory does not contain a segments file, so just copying it.\"\n  utils/copy_data_dir.sh $data $dir\n  exit 0\nfi\n\nmkdir -p $dir\ncp $data/wav.scp $dir\nif [ -f $data/reco2file_and_channel ]; then \n  cp $data/reco2file_and_channel $dir; \nfi\n\nmkdir -p $dir/.backup\nif [ -f $dir/feats.scp ]; then\n  mv $dir/feats.scp $dir/.backup\nfi\nif [ -f $dir/cmvn.scp ]; then\n  mv $dir/cmvn.scp $dir/.backup\nfi\nif [ -f $dir/utt2spk ]; then\n  mv $dir/utt2spk $dir/.backup\nfi\n\n[ -f $data/stm ] && cp $data/stm $dir\n[ -f $data/glm ] && cp $data/glm $dir\n\nutils/data/internal/combine_segments_to_recording.py \\\n  --write-reco2utt=$dir/reco2sorted_utts $data/segments $dir/utt2spk || exit 1\n\nif [ -f $data/text ]; then\n  utils/apply_map.pl -f 2- $data/text < $dir/reco2sorted_utts > $dir/text || exit 1\nfi\n\nrm $dir/reco2sorted_utts\n\nutils/fix_data_dir.sh $dir || exit 1\n\nexit 0\n"
  },
  {
    "path": "egs/utils/data/extend_segment_times.py",
    "content": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport sys\nimport argparse\nfrom collections import defaultdict\n\n\nparser = argparse.ArgumentParser(description=\"\"\"\n Usage: extend_segment_times.py [options] <input-segments >output-segments\n This program pads the times in a 'segments' file (e.g. data/train/segments)\n with specified left and right context (for cases where there was no\n silence padding in the original segments file)\"\"\")\n\nparser.add_argument(\"--start-padding\", type = float, default = 0.1,\n                    help=\"Amount of padding, in seconds, for the start time of \"\n                    \"each segment (start times <0 will be set to zero).\")\nparser.add_argument(\"--end-padding\", type = float, default = 0.1,\n                    help=\"Amount of padding, in seconds, for the end time of \"\n                    \"each segment.\")\nparser.add_argument(\"--last-segment-end-padding\", type = float, default = 0.1,\n                    help=\"Amount of padding, in seconds, for the end time of \"\n                    \"the last segment of each file (maximum allowed).\")\nparser.add_argument(\"--fix-overlapping-segments\", type = str,\n                    default = 'true', choices=['true', 'false'],\n                    help=\"If true, prevent segments from overlapping as a result \"\n                    \"of the padding (or that were already overlapping)\")\nargs = parser.parse_args()\n\n\n# the input file will be a sequence of lines which are each of the form:\n# <utterance-id> <recording-id> <start-time> <end-time>\n# e.g.\n# utt-1 recording-1 0.62 5.40\n# The output will be in the same format and in the same\n# order, except with modified times.\n\n# This variable maps from a recording-id to a list of the utterance\n# indexes (as integer indexes into 'entries')\n# that are part of that recording.\nrecording_to_utt_indexes = defaultdict(list)\n\n# This is an array of the entries in the segments file, in 
the format:\n# (utterance-id as string, recording-id as string,\n#  start-time as float, end-time as float)\nentries = []\n\n\nwhile True:\n    line = sys.stdin.readline()\n    if line == '':\n        break\n    try:\n        [ utt_id, recording_id, start_time, end_time ] = line.split()\n        start_time = float(start_time)\n        end_time = float(end_time)\n    except:\n        sys.exit(\"extend_segment_times.py: could not interpret line: \" + line)\n    if not end_time > start_time:\n        print(\"extend_segment_times.py: bad segment (ignoring): \" + line,\n              file = sys.stderr)\n    recording_to_utt_indexes[recording_id].append(len(entries))\n    entries.append([utt_id, recording_id, start_time, end_time])\n\nnum_times_fixed = 0\n\nfor recording, utt_indexes in recording_to_utt_indexes.items():\n    # this_entries is a list of lists, sorted on mid-time.\n    # Notice: because lists are objects, when we change 'this_entries'\n    # we change the underlying entries.\n    this_entries = sorted([ entries[x] for x in utt_indexes ],\n                          key = lambda x : 0.5 * (x[2] + x[3]))\n    min_time = 0\n    max_time = max([ x[3] for x in this_entries ]) + args.last_segment_end_padding\n    start_padding = args.start_padding\n    end_padding = args.end_padding\n    for n in range(len(this_entries)):\n        this_entries[n][2] = max(min_time, this_entries[n][2] - start_padding)\n        this_entries[n][3] = min(max_time, this_entries[n][3] + end_padding)\n\n    for n in range(len(this_entries) - 1):\n        this_end_time = this_entries[n][3]\n        next_start_time = this_entries[n+1][2]\n        if this_end_time > next_start_time and args.fix_overlapping_segments == 'true':\n            midpoint = 0.5 * (this_end_time + next_start_time)\n            this_entries[n][3] = midpoint\n            this_entries[n+1][2] = midpoint\n            num_times_fixed += 1\n\n\n# this prints a number with a certain number of digits after\n# the point, 
while removing trailing zeros.\ndef FloatToString(f):\n    num_digits = 6 # we want to print 6 digits after the decimal point\n    g = f\n    while abs(g) > 1.0:\n        g *= 0.1\n        num_digits += 1\n    format_str = '%.{0}g'.format(num_digits)\n    return format_str % f\n\nfor entry in entries:\n    [ utt_id, recording_id, start_time, end_time ] = entry\n    if not start_time < end_time:\n        # entry holds floats as well as strings, so convert before joining.\n        print(\"extend_segment_times.py: bad segment after processing (ignoring): \" +\n              ' '.join(str(x) for x in entry), file = sys.stderr)\n        continue\n    print(utt_id, recording_id, FloatToString(start_time), FloatToString(end_time))\n\n\nprint(\"extend_segment_times.py: extended {0} segments; fixed {1} \"\n      \"overlapping segments\".format(len(entries), num_times_fixed),\n      file = sys.stderr)\n\n## test:\n#  (echo utt1 reco1 0.2 6.2; echo utt2 reco1 6.3 9.8 )| extend_segment_times.py\n# and also try the above with the options --last-segment-end-padding=0.0 --fix-overlapping-segments=false\n\n"
  },
  {
    "path": "egs/utils/data/extract_wav_segments_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright    2017  Hossein Hadian\n# Apache 2.0\n\n# This script copies a data directory (which has a 'segments' file), extracting\n# wav segments (according to the 'segments' file)\n# so that the resulting data directory does not have a 'segments' file anymore.\n\nnj=4\ncmd=run.pl\n\n. ./utils/parse_options.sh\n. ./path.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 <srcdir> <destdir>\"\n  echo \" This script copies data directory <srcdir> to <destdir> and removes\"\n  echo \" the 'segments' file by extracting the wav segments.\"\n  echo \"Options: \"\n  echo \"  --nj <nj>                                        # number of parallel jobs\"\n  echo \"  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.\"\n  exit 1;\nfi\n\n\nexport LC_ALL=C\n\nsrcdir=$1\ndir=$2\nlogdir=$dir/log\n\nif ! mkdir -p $dir/data; then\n  echo \"$0: failed to create directory $dir/data\"\n  exit 1\nfi\nmkdir -p $logdir\n\nset -eu -o pipefail\nutils/copy_data_dir.sh $srcdir $dir\n\nsplit_segments=\"\"\nfor n in $(seq $nj); do\n  split_segments=\"$split_segments $logdir/segments.$n\"\ndone\n\nutils/split_scp.pl $srcdir/segments $split_segments\n\n$cmd JOB=1:$nj $logdir/extract_wav_segments.JOB.log \\\n     extract-segments scp,p:$srcdir/wav.scp $logdir/segments.JOB \\\n     ark,scp:$dir/data/wav_segments.JOB.ark,$dir/data/wav_segments.JOB.scp\n\n# concatenate the .scp files together.\nfor n in $(seq $nj); do\n  cat $dir/data/wav_segments.$n.scp\ndone > $dir/data/wav_segments.scp\n\ncat $dir/data/wav_segments.scp | awk '{ print $1 \" wav-copy \" $2 \" - |\" }' >$dir/wav.scp\nrm $dir/{segments,reco2file_and_channel} 2>/dev/null || true\n"
  },
  {
    "path": "egs/utils/data/fix_subsegment_feats.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0.\n\nuse warnings;\n\n# This script reads from stdin a feats.scp file that contains frame ranges and\n# ensures that they don't exceed the maximum number of frames supplied in the\n# <utt2max-frames> file. \n# <utt2max-frames> is usually computed using get_utt2num_frames.sh on the \n# original directory which will be segmented using \n# utils/data/subsegment_data_dir.sh.\n# \n# e.g. feats.scp\n# utt_foo-1 foo-bar.ark:514231[721:892]\n# \n# utt2max-frames\n# utt_foo-1 891\n# \n# fixed_feats.scp\n# utt_foo-1 foo-bar.ark:514231[721:890]\n# \n# Note: Here 891 is the number of frames in the archive foo-bar.ark\n# The frame end for utt_foo-1, i.e. 892 (0-indexed) exceeds the archive size\n# (891) by two frames. This script fixes that line by truncating the range \n# to 890.\n\nif (scalar @ARGV != 1) {\n  my $usage = <<END;\nThis script reads from stdin a feats.scp file that contains frame ranges and\nensures that they don't exceed the maximum number of frames supplied in the\n<utt2max-frames> file. 
\n\nUsage: $0 <utt2max-frames> < feats.scp > fixed_feats.scp\nEND\n  die \"$usage\";\n}\n\nmy $utt2max_frames_file = $ARGV[0];\n\nopen MAX_FRAMES, $utt2max_frames_file or die \"$0: Could not open file $utt2max_frames_file\";\n\nmy %utt2max_frames;\n\nwhile (<MAX_FRAMES>) {\n  chomp;\n  my @F = split;\n  \n  (scalar @F == 2) or die \"$0: Invalid line $_ in $utt2max_frames_file\";\n\n  $utt2max_frames{$F[0]} = $F[1];\n}\n\nwhile (<STDIN>) {\n  my $line = $_;\n  \n  #if (m/\\[([^][]*)\\]\\[([^][]*)\\]\\s*$/) {\n  #  print STDERR (\"fix_subsegment_feats.pl: this script only supports single indices\");\n  #  exit(1);\n  #}\n  \n  my $before_range = \"\";\n  my $range = \"\";\n\n  if (m/^(.*)\\[([^][]*)\\]\\s*$/) {\n    $before_range = $1;\n    $range = $2;\n  } else {\n    print;\n    next;\n  }\n\n  my @F = split(/ /, $before_range);\n  my $utt = shift @F;\n  defined $utt2max_frames{$utt} or die \"fix_subsegment_feats.pl: Could not find key $utt in $utt2max_frames_file.\\nError with line $line\";\n\n  if ($range !~ m/^(\\d*):(\\d*)([,]?.*)$/) {\n    print STDERR \"fix_subsegment_feats.pl: could not make sense of input line $_\";\n    exit(1);\n  }\n    \n  my $row_start = $1;\n  my $row_end = $2;\n  my $col_range = $3;\n  \n  if ($row_start >= $utt2max_frames{$utt}) {\n    print STDERR \"Removing $utt because row_start $row_start >= file max length $utt2max_frames{$utt}\\n\";\n    next;\n  }  \n  if ($row_end >= $utt2max_frames{$utt}) {\n    print STDERR \"Fixed row_end for $utt from $row_end to $utt2max_frames{$utt}-1\\n\";\n    $row_end = $utt2max_frames{$utt} - 1;\n  } \n   \n  if ($row_start ne \"\") {\n    $range = \"$row_start:$row_end\";\n  } else {\n    $range = \"\";\n  }\n\n  if ($col_range ne \"\") {\n    $range .= \",$col_range\";\n  }\n  print (\"$utt \" . join(\" \", @F) . \"[\" . $range . \"]\\n\");\n}\n"
  },
  {
    "path": "egs/utils/data/get_allowed_durations.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright     2017  Hossein Hadian\n#               2019  Facebook Inc. (Author: Vimal Manohar)\n# Apache 2.0\n\n\n\"\"\" This script generates a set of allowed lengths of utterances\n    spaced by a factor (like 10%). This is useful for generating\n    fixed-length chunks for chain training.\n\"\"\"\n\nimport argparse\nimport os\nimport sys\nimport copy\nimport math\nimport logging\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"\n    This script creates a list of allowed durations of utterances for flatstart\n    LF-MMI training corresponding to input data directory 'data_dir' and writes\n    it in two files in output directory 'dir':\n    1) allowed_durs.txt -- durations are in seconds\n    2) allowed_lengths.txt -- lengths are in number of frames\n\n    Both the allowed_durs.txt and allowed_lengths.txt are formatted to\n    have one entry on each line. 
Examples are as follows:\n\n    $ cat data/train/allowed_lengths.txt\n    414\n    435\n    468\n\n    $ cat data/train/allowed_durs.txt\n    4.16\n    4.37\n    4.70\n\n    These files can then be used by a downstream script to perturb the\n    utterances to these lengths.\n    A perturbed data directory (created by a downstream script\n    similar to utils/data/perturb_speed_to_allowed_lengths.py)\n    that only contains utterances of these allowed durations,\n    along with the corresponding allowed_lengths.txt, is\n    consumed by the e2e chain egs preparation script.\n    See steps/nnet3/chain/e2e/get_egs_e2e.sh for how these are used.\n\n    See also:\n    * egs/cifar/v1/image/get_allowed_lengths.py -- a similar script for OCR datasets\n    * utils/data/perturb_speed_to_allowed_lengths.py --\n        creates the allowed_lengths.txt AND perturbs the data directory\n    \"\"\")\n    parser.add_argument('factor', type=float, default=12,\n                        help='Spacing (in percentage) between allowed lengths. '\n                        'Can be 0, which means all seen lengths that are a multiple of '\n                        'frame_subsampling_factor will be allowed.')\n    parser.add_argument('data_dir', type=str, help='path to data dir. 
Assumes that '\n                        'it contains the utt2dur file.')\n    parser.add_argument('dir', type=str, help='We write the output files '\n                        'allowed_lengths.txt and allowed_durs.txt to this directory.')\n    parser.add_argument('--coverage-factor', type=float, default=0.05,\n                        help=\"\"\"Percentage of durations not covered from each\n                             side of duration histogram.\"\"\")\n    parser.add_argument('--frame-shift', type=int, default=10,\n                        help=\"\"\"Frame shift in milliseconds.\"\"\")\n    parser.add_argument('--frame-length', type=int, default=25,\n                        help=\"\"\"Frame length in milliseconds.\"\"\")\n    parser.add_argument('--frame-subsampling-factor', type=int, default=3,\n                        help=\"\"\"Chain frame subsampling factor.\n                             See steps/nnet3/chain/train.py\"\"\")\n    args = parser.parse_args()\n    return args\n\n\ndef read_kaldi_mapfile(path):\n    \"\"\" Read any Kaldi mapping file - like text, .scp files, etc.\n    \"\"\"\n\n    m = {}\n    with open(path, 'r', encoding='latin-1') as f:\n        for line in f:\n            line = line.strip(\" \\t\\r\\n\")\n            sp_pos = line.find(' ')\n            key = line[:sp_pos]\n            val = line[sp_pos+1:]\n            m[key] = val\n    return m\n\n\ndef find_duration_range(utt2dur, coverage_factor):\n    \"\"\"Given a list of utterance durations, find the start and end duration to cover\n\n     If we try to cover\n     all durations which occur in the training set, the number of\n     allowed lengths could become very large.\n\n     Returns\n     -------\n     start_dur: float\n     end_dur: float\n    \"\"\"\n    durs = [float(val) for key, val in utt2dur.items()]\n    durs.sort()\n    to_ignore_dur = 0\n    tot_dur = sum(durs)\n    for d in durs:\n        to_ignore_dur += d\n        if to_ignore_dur * 100.0 / tot_dur > coverage_factor:\n    
        start_dur = d\n            break\n    to_ignore_dur = 0\n    for d in reversed(durs):\n        to_ignore_dur += d\n        if to_ignore_dur * 100.0 / tot_dur > coverage_factor:\n            end_dur = d\n            break\n    if start_dur < 0.3:\n        start_dur = 0.3  # a hard limit to avoid too many allowed lengths --not critical\n    return start_dur, end_dur\n\n\ndef get_allowed_durations(start_dur, end_dur, args):\n    \"\"\"Given the start and end duration, find a set of\n       allowed durations spaced by args.factor%. Also write\n       out the list of allowed durations and the corresponding\n       allowed lengths (in frames) on disk.\n\n     Returns\n     -------\n     allowed_durations: list of allowed durations (in seconds)\n    \"\"\"\n\n    allowed_durations = []\n    d = start_dur\n    with open(os.path.join(args.dir, 'allowed_durs.txt'), 'w', encoding='latin-1') as durs_fp, \\\n           open(os.path.join(args.dir, 'allowed_lengths.txt'), 'w', encoding='latin-1') as lengths_fp:\n        while d < end_dur:\n            length = int(d * 1000 - args.frame_length) / args.frame_shift + 1\n            if length % args.frame_subsampling_factor != 0:\n                length = (args.frame_subsampling_factor *\n                              (length // args.frame_subsampling_factor))\n                d = (args.frame_shift * (length - 1.0)\n                     + args.frame_length + args.frame_shift / 2) / 1000.0\n            allowed_durations.append(d)\n            durs_fp.write(\"{}\\n\".format(d))\n            lengths_fp.write(\"{}\\n\".format(int(length)))\n            d *= args.factor\n    return allowed_durations\n\n\ndef get_trivial_allowed_durations(utt2dur, args):\n    lengths = list(set(\n        [int(float(d) * 1000 - args.frame_length) / args.frame_shift + 1\n         for key, d in utt2dur.items()]\n    ))\n    lengths.sort()\n\n    allowed_durations = []\n    with open(os.path.join(args.dir, 'allowed_durs.txt'), 'w', encoding='latin-1') 
as durs_fp, \\\n           open(os.path.join(args.dir, 'allowed_lengths.txt'), 'w', encoding='latin-1') as lengths_fp:\n        for length in lengths:\n            if length % args.frame_subsampling_factor != 0:\n                length = (args.frame_subsampling_factor *\n                              (length // args.frame_subsampling_factor))\n            # Always derive the duration from the (possibly rounded) length;\n            # otherwise d is unbound or stale when no rounding was needed.\n            d = (args.frame_shift * (length - 1.0)\n                 + args.frame_length + args.frame_shift / 2) / 1000.0\n            allowed_durations.append(d)\n            durs_fp.write(\"{}\\n\".format(d))\n            lengths_fp.write(\"{}\\n\".format(int(length)))\n\n    assert len(allowed_durations) > 0\n    start_dur = allowed_durations[0]\n    end_dur = allowed_durations[-1]\n\n    logger.info(\"Durations in the range [{},{}] will be covered.\"\n                \"\".format(start_dur, end_dur))\n    logger.info(\"There will be {} unique allowed lengths \"\n                \"for the utterances.\".format(len(allowed_durations)))\n\n    return allowed_durations\n\n\ndef main():\n    args = get_args()\n    utt2dur = read_kaldi_mapfile(os.path.join(args.data_dir, 'utt2dur'))\n\n    if args.factor == 0.0:\n        get_trivial_allowed_durations(utt2dur, args)\n        return\n\n    args.factor = 1.0 + args.factor / 100.0\n\n    start_dur, end_dur = find_duration_range(utt2dur, args.coverage_factor)\n    logger.info(\"Durations in the range [{},{}] will be covered. \"\n                \"Coverage rate: {}%\".format(start_dur, end_dur,\n                                      100.0 - args.coverage_factor * 2))\n    logger.info(\"There will be {} unique allowed lengths \"\n                \"for the utterances.\".format(int(math.log(end_dur / start_dur)/\n                                                 math.log(args.factor))))\n\n    get_allowed_durations(start_dur, end_dur, args)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/utils/data/get_frame_shift.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script takes as input a data directory, such as data/train/, preferably\n# with utt2dur file already existing (or the utt2dur file will be created if\n# not), and it attempts to work out the approximate frame shift by comparing the\n# utt2dur with the output of feat-to-len on the feats.scp.  It prints it out.\n# if the shift is very close to, but above, 0.01 (the normal frame shift) it\n# rounds it down.\n\n. utils/parse_options.sh\n. ./path.sh\n\nif [ $# != 1 ]; then\n  cat >&2 <<EOF\nUsage: frame_shift=\\$($0 <datadir>)\ne.g.:  frame_shift=\\$($0 data/train)\n\nThis script prints the frame-shift in seconds (e.g. 0.01) to the standard out.\nIts output is intended to be captured in a shell variable.\n\nIf <datadir> does not contain the file utt2dur, this script may invoke\nutils/data/get_utt2dur.sh, which will require write permission to <datadir>.\nEOF\n  exit 1\nfi\n\nexport LC_ALL=C\n\ndir=$1\n\nif [[ -s $dir/frame_shift ]]; then\n  cat $dir/frame_shift\n  exit\nfi\n\nif [ ! -f $dir/feats.scp ]; then\n  echo \"$0: $dir/feats.scp does not exist\" 1>&2\n  exit 1\nfi\n\nif [ ! -s $dir/utt2dur ]; then\n  if [ ! -e $dir/wav.scp ] && [ ! -s $dir/segments ]; then\n    echo \"$0: neither $dir/wav.scp nor $dir/segments exist; assuming a frame shift of 0.01.\" 1>&2\n    echo 0.01\n    exit 0\n  fi\n  echo \"$0: $dir/utt2dur does not exist: creating it\" 1>&2\n  utils/data/get_utt2dur.sh 1>&2 $dir || exit 1\nfi\n\ntemp=$(mktemp /tmp/tmp.XXXX) || exit 1\n\nfeat-to-len --print-args=false \"scp:head -n 10 $dir/feats.scp|\" ark,t:- > $temp\n\nif [[ ! 
-s $temp ]]; then\n  rm $temp\n  echo \"$0: error running feat-to-len\" 1>&2\n  exit 1\nfi\n\nframe_shift=$(head -n 10 $dir/utt2dur | paste - $temp | awk '\n      { dur += $2; frames += $4; }\n  END { shift = dur / frames;\n        if (shift > 0.01 && shift < 0.0102) shift = 0.01;\n        print shift; }') || exit 1;\n\nrm $temp\n\necho $frame_shift > $dir/frame_shift\necho $frame_shift\nexit 0\n"
  },
  {
    "path": "egs/utils/data/get_num_frames.sh",
    "content": "#!/usr/bin/env bash\n\n# This script works out the approximate number of frames in a training directory.\n# This is sometimes needed by higher-level scripts\n\n\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# -ne 1 ]; then\n  (\n    echo \"Usage: $0 <data-dir>\"\n    echo \"Prints the number of frames of data in the data-dir\"\n  ) 1>&2\nfi\n\ndata=$1\n\nif [ ! -f $data/utt2dur ]; then\n  utils/data/get_utt2dur.sh $data 1>&2 || exit 1\nfi\n\nframe_shift=$(utils/data/get_frame_shift.sh $data) || exit 1\n\nawk -v s=$frame_shift '{n += $2} END{printf(\"%.0f\\n\", (n / s))}' <$data/utt2dur\n"
  },
  {
    "path": "egs/utils/data/get_reco2dur.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Johns Hopkins University (author: Daniel Povey)\n#           2018  Andrea Carmantini\n# Apache 2.0\n\n# This script operates on a data directory, such as in data/train/, and adds the\n# reco2dur file if it does not already exist.  The file 'reco2dur' maps from\n# recording to the duration of the recording in seconds.  This script works it\n# out from the 'wav.scp' file, or, if utterance-ids are the same as recording-ids, from the\n# utt2dur file (it first tries interrogating the headers, and if this fails, it reads the wave\n# files in entirely.)\n# We could use durations from segments file, but that's not the duration of the recordings\n# but the sum of utterance lenghts (silence in between could be excluded from segments)\n# For sum of utterance lenghts:\n# awk 'FNR==NR{uttdur[$1]=$2;next}\n# { for(i=2;i<=NF;i++){dur+=uttdur[$i];}\n#   print $1 FS dur; dur=0  }'  $data/utt2dur $data/reco2utt\n\n\nframe_shift=0.01\ncmd=run.pl\nnj=48\n\n. utils/parse_options.sh\n. ./path.sh\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0 [options] <datadir>\"\n  echo \"e.g.:\"\n  echo \" $0 data/train\"\n  echo \" Options:\"\n  echo \" --frame-shift      # frame shift in seconds. Only relevant when we are\"\n  echo \"                    # getting duration from feats.scp (default: 0.01). \"\n  exit 1\nfi\n\nexport LC_ALL=C\n\ndata=$1\n\n\nif [ -s $data/reco2dur ] && \\\n  [ $(wc -l < $data/wav.scp) -eq $(wc -l < $data/reco2dur) ]; then\n  echo \"$0: $data/reco2dur already exists with the expected length.  We won't recompute it.\"\n  exit 0;\nfi\n\nif [ -s $data/utt2dur ] && \\\n   [ $(wc -l < $data/utt2spk) -eq $(wc -l < $data/utt2dur) ] && \\\n   [ ! 
-s $data/segments ]; then\n\n  echo \"$0: $data/wav.scp indexed by utt-id; copying utt2dur to reco2dur\"\n  cp $data/utt2dur $data/reco2dur && exit 0;\n\nelif [ -f $data/wav.scp ]; then\n  echo \"$0: obtaining durations from recordings\"\n\n  # if the wav.scp contains only lines of the form\n  # utt1  /foo/bar/sph2pipe -f wav /baz/foo.sph |\n  if cat $data/wav.scp | perl -e '\n     while (<>) { s/\\|\\s*$/ |/;  # make sure final | is preceded by space.\n             @A = split; if (!($#A == 5 && $A[1] =~ m/sph2pipe$/ &&\n                               $A[2] eq \"-f\" && $A[3] eq \"wav\" && $A[5] eq \"|\")) { exit(1); }\n             $reco = $A[0]; $sphere_file = $A[4];\n\n             if (!open(F, \"<$sphere_file\")) { die \"Error opening sphere file $sphere_file\"; }\n             $sample_rate = -1;  $sample_count = -1;\n             for ($n = 0; $n <= 30; $n++) {\n                $line = <F>;\n                if ($line =~ m/sample_rate -i (\\d+)/) { $sample_rate = $1; }\n                if ($line =~ m/sample_count -i (\\d+)/) { $sample_count = $1; }\n                if ($line =~ m/end_head/) { last; }\n             }\n             close(F);\n             if ($sample_rate == -1 || $sample_count == -1) {\n               die \"could not parse sphere header from $sphere_file\";\n             }\n             $duration = $sample_count * 1.0 / $sample_rate;\n             print \"$reco $duration\\n\";\n     } ' > $data/reco2dur; then\n    echo \"$0: successfully obtained recording lengths from sphere-file headers\"\n  else\n    echo \"$0: could not get recording lengths from sphere-file headers, using wav-to-duration\"\n    if ! 
command -v wav-to-duration >/dev/null; then\n      echo  \"$0: wav-to-duration is not on your path\"\n      exit 1;\n    fi\n\n    read_entire_file=false\n    if grep -q 'sox.*speed' $data/wav.scp; then\n      read_entire_file=true\n      echo \"$0: reading from the entire wav file to fix the problem caused by sox commands with speed perturbation. It is going to be slow.\"\n      echo \"... It is much faster if you call get_reco2dur.sh *before* doing the speed perturbation via e.g. perturb_data_dir_speed.sh or \"\n      echo \"... perturb_data_dir_speed_3way.sh.\"\n    fi\n\n    num_recos=$(wc -l <$data/wav.scp)\n    if [ $nj -gt $num_recos ]; then\n      nj=$num_recos\n    fi\n\n    temp_data_dir=$data/wav${nj}split\n    wavscps=$(for n in `seq $nj`; do echo $temp_data_dir/$n/wav.scp; done)\n    subdirs=$(for n in `seq $nj`; do echo $temp_data_dir/$n; done)\n\n    if ! mkdir -p $subdirs >&/dev/null; then\n\tfor n in `seq $nj`; do\n\t    mkdir -p $temp_data_dir/$n\n\tdone\n    fi\n\n    utils/split_scp.pl $data/wav.scp $wavscps\n\n\n    $cmd JOB=1:$nj $data/log/get_reco_durations.JOB.log \\\n      wav-to-duration --read-entire-file=$read_entire_file \\\n      scp:$temp_data_dir/JOB/wav.scp ark,t:$temp_data_dir/JOB/reco2dur || \\\n        { echo \"$0: there was a problem getting the durations\"; exit 1; } # This could\n\n    for n in `seq $nj`; do\n      cat $temp_data_dir/$n/reco2dur\n    done > $data/reco2dur\n  fi\n  rm -r $temp_data_dir\nelse\n  echo \"$0: Expected $data/wav.scp to exist\"\n  exit 1\nfi\n\nlen1=$(wc -l < $data/wav.scp)\nlen2=$(wc -l < $data/reco2dur)\nif [ \"$len1\" != \"$len2\" ]; then\n  echo \"$0: warning: length of reco2dur does not equal that of wav.scp, $len2 != $len1\"\n  if [ $len1 -gt $[$len2*2] ]; then\n    echo \"$0: less than half of recordings got a duration: failing.\"\n    exit 1\n  fi\nfi\n\necho \"$0: computed $data/reco2dur\"\n\nexit 0\n"
  },
  {
    "path": "egs/utils/data/get_reco2utt_for_data.sh",
    "content": "#! /bin/bash\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0\n\nif [ $# -ne 1 ]; then\n  echo \"This script outputs a mapping from recording to a list of utterances \"\n  echo \"corresponding to the recording. It is analogous to the content of \"\n  echo \"a spk2utt file, but is indexed by recording instead of speaker.\"\n  echo \"Usage: get_reco2utt.sh <data>\"\n  echo \" e.g.: get_reco2utt.sh data/train\"\n  exit 1\nfi\n\ndata=$1\n\nif [ ! -s $data/segments ]; then\n  utils/data/get_segments_for_data.sh $data > $data/segments\nfi\n\ncut -d ' ' -f 1,2 $data/segments | utils/utt2spk_to_spk2utt.pl\n"
  },
  {
    "path": "egs/utils/data/get_segments_for_data.sh",
    "content": "#!/usr/bin/env bash\n\n# This script operates on a data directory, such as in data/train/,\n# and writes new segments to stdout. The file 'segments' maps from\n# utterance to time offsets into a recording, with the format:\n#   <utterance-id> <recording-id> <segment-begin> <segment-end>\n# This script assumes utterance and recording ids are the same (i.e., that\n# wav.scp is indexed by utterance), and uses durations from 'utt2dur', \n# created if necessary by get_utt2dur.sh.\n\n. ./path.sh\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0 [options] <datadir>\"\n  echo \"e.g.:\"\n  echo \" $0 data/train > data/train/segments\"\n  exit 1\nfi\n\ndata=$1\n\nif [ ! -s $data/utt2dur ]; then\n  utils/data/get_utt2dur.sh $data 1>&2 || exit 1;\nfi\n\n# <utt-id> <utt-id> 0 <utt-dur>\nawk '{ print $1, $1, 0, $2 }' $data/utt2dur\n\nexit 0\n"
  },
  {
    "path": "egs/utils/data/get_uniform_subsegments.py",
    "content": "#! /usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n#           2017  Matthew Maciejewski\n# Apache 2.0.\n\nfrom __future__ import print_function\nimport argparse\nimport logging\nimport sys\nimport textwrap\n\ndef get_args():\n    parser = argparse.ArgumentParser(\n        description=textwrap.dedent(\"\"\"\n        Creates a subsegments file from an input segments file.\n\n        The output format is\n\n        <subsegment-id> <utterance-id> <start-time> <end-time>,\n\n        where the timings are relative to the start-time of the\n         <utterance-id> in the input segments file.\n        Reminder: the format of the input segments file is:\n\n         <utterance-id> <recording-id> <start-time> <end-time>\n\n        where the recording-id corresponds to a wav file (or a channel of\n        a wav file) from wav.scp.  Note: you can use\n        utils/data/get_segments_for_data.sh to generate a 'default'\n        segments file for your data if one doesn't already exist.\n\n        e.g.: get_uniform_subsegments.py data/dev/segments > \\\\\n                data/dev_uniform_segments/sub_segments\n\n        utils/data/subsegment_data_dir.sh data/dev \\\\\n            data/dev_uniform_segments/sub_segments data/dev_uniform_segments\n\n        The output is written to stdout. 
The resulting file can be\n        passed to utils/data/subsegment_data_dir.sh to sub-segment\n        the data directory.\"\"\"),\n        formatter_class=argparse.RawDescriptionHelpFormatter)\n    parser.add_argument(\"--max-segment-duration\", type=float,\n                        default=30, help=\"\"\"Maximum duration of the\n                        subsegments (in seconds)\"\"\")\n    parser.add_argument(\"--overlap-duration\", type=float,\n                        default=5, help=\"\"\"Overlap between\n                        adjacent segments (in seconds)\"\"\")\n    parser.add_argument(\"--max-remaining-duration\", type=float,\n                        default=10, help=\"\"\"Segment is not split\n                        if the left-over duration is no more than this\n                        many seconds\"\"\")\n    parser.add_argument(\"--constant-duration\", type=bool,\n                        default=False, help=\"\"\"Final segment is given\n                        a start time max-segment-duration before the\n                        end to force a constant segment duration. 
This\n                        overrides the max-remaining-duration parameter\"\"\")\n    parser.add_argument(\"segments_file\", type=argparse.FileType('r'),\n                        help=\"\"\"Input kaldi segments file\"\"\")\n\n    args = parser.parse_args()\n    return args\n\n\ndef run(args):\n    if (args.constant_duration):\n        dur_threshold = args.max_segment_duration\n    else:\n        dur_threshold = args.max_segment_duration + args.max_remaining_duration\n\n    for line in args.segments_file:\n        parts = line.strip().split()\n        utt_id = parts[0]\n        start_time = float(parts[2])\n        end_time = float(parts[3])\n\n        dur = end_time - start_time\n\n        start = start_time\n        while (dur > dur_threshold):\n            end = start + args.max_segment_duration\n            start_relative = start - start_time\n            end_relative = end - start_time\n            new_utt = \"{utt_id}-{s:08d}-{e:08d}\".format(\n                utt_id=utt_id, s=int(100 * start_relative),\n                e=int(100 * end_relative))\n            print (\"{new_utt} {utt_id} {s:.3f} {e:.3f}\".format(\n                new_utt=new_utt, utt_id=utt_id, s=start_relative,\n                e=start_relative + args.max_segment_duration))\n            start += args.max_segment_duration - args.overlap_duration\n            dur -= args.max_segment_duration - args.overlap_duration\n\n        if (args.constant_duration):\n            if (dur < 0):\n              continue\n            if (dur < args.max_remaining_duration):\n              start = max(end_time - args.max_segment_duration, start_time)\n            end = min(start + args.max_segment_duration, end_time)\n        else:\n            end = end_time\n        new_utt = \"{utt_id}-{s:08d}-{e:08d}\".format(\n            utt_id=utt_id, s=int(round(100 * (start - start_time))),\n            e=int(round(100 * (end - start_time))))\n        print (\"{new_utt} {utt_id} {s:.3f} {e:.3f}\".format(\n            
new_utt=new_utt, utt_id=utt_id, s=start - start_time,\n            e=end - start_time))\n\n\ndef main():\n    args = get_args()\n    try:\n        run(args)\n    except Exception:\n        logging.error(\"Failed creating subsegments\", exc_info=True)\n        raise SystemExit(1)\n    finally:\n        args.segments_file.close()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "egs/utils/data/get_utt2dur.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script operates on a data directory, such as in data/train/, and adds the\n# utt2dur file if it does not already exist.  The file 'utt2dur' maps from\n# utterance to the duration of the utterance in seconds.  This script works it\n# out from the 'segments' file, or, if not present, from the wav.scp file (it\n# first tries interrogating the headers, and if this fails, it reads the wave\n# files in entirely.)\n\nframe_shift=0.01\ncmd=run.pl\nnj=48\n\n. utils/parse_options.sh\n. ./path.sh\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0 [options] <datadir>\"\n  echo \"e.g.:\"\n  echo \" $0 data/train\"\n  echo \" Options:\"\n  echo \" --frame-shift      # frame shift in seconds. Only relevant when we are\"\n  echo \"                    # getting duration from feats.scp, and only if the \"\n  echo \"                    # file frame_shift does not exist (default: 0.01). \"\n  exit 1\nfi\n\nexport LC_ALL=C\n\ndata=$1\n\nif [ -s $data/utt2dur ] && \\\n  [ $(wc -l < $data/utt2spk) -eq $(wc -l < $data/utt2dur) ]; then\n  echo \"$0: $data/utt2dur already exists with the expected length.  
We won't recompute it.\"\n  exit 0;\nfi\n\nif [ -s $data/segments ]; then\n  echo \"$0: working out $data/utt2dur from $data/segments\"\n  awk '{len=$4-$3; print $1, len;}' < $data/segments  > $data/utt2dur\nelif [[ -s $data/frame_shift && -f $data/utt2num_frames ]]; then\n  echo \"$0: computing $data/utt2dur from $data/{frame_shift,utt2num_frames}.\"\n  frame_shift=$(cat $data/frame_shift) || exit 1\n  # The 1.5 correction is the typical value of (frame_length-frame_shift)/frame_shift.\n  awk -v fs=$frame_shift '{ $2=($2+1.5)*fs; print }' <$data/utt2num_frames  >$data/utt2dur\nelif [ -f $data/wav.scp ]; then\n  echo \"$0: segments file does not exist so getting durations from wave files\"\n\n  # if the wav.scp contains only lines of the form\n  # utt1  /foo/bar/sph2pipe -f wav /baz/foo.sph |\n  if perl <$data/wav.scp -e '\n     while (<>) { s/\\|\\s*$/ |/;  # make sure final | is preceded by space.\n             @A = split; if (!($#A == 5 && $A[1] =~ m/sph2pipe$/ &&\n                               $A[2] eq \"-f\" && $A[3] eq \"wav\" && $A[5] eq \"|\")) { exit(1); }\n             $utt = $A[0]; $sphere_file = $A[4];\n\n             if (!open(F, \"<$sphere_file\")) { die \"Error opening sphere file $sphere_file\"; }\n             $sample_rate = -1;  $sample_count = -1;\n             for ($n = 0; $n <= 30; $n++) {\n                $line = <F>;\n                if ($line =~ m/sample_rate -i (\\d+)/) { $sample_rate = $1; }\n                if ($line =~ m/sample_count -i (\\d+)/) { $sample_count = $1; }\n                if ($line =~ m/end_head/) { last; }\n             }\n             close(F);\n             if ($sample_rate == -1 || $sample_count == -1) {\n               die \"could not parse sphere header from $sphere_file\";\n             }\n             $duration = $sample_count * 1.0 / $sample_rate;\n             print \"$utt $duration\\n\";\n     } ' > $data/utt2dur; then\n    echo \"$0: successfully obtained utterance lengths from sphere-file headers\"\n  else\n  
  echo \"$0: could not get utterance lengths from sphere-file headers, using wav-to-duration\"\n    if ! command -v wav-to-duration >/dev/null; then\n      echo  \"$0: wav-to-duration is not on your path\"\n      exit 1;\n    fi\n\n    read_entire_file=true\n    if grep -q 'sox.*speed' $data/wav.scp; then\n      read_entire_file=true\n      echo \"$0: reading from the entire wav file to fix the problem caused by sox commands with speed perturbation. It is going to be slow.\"\n      echo \"... It is much faster if you call get_utt2dur.sh *before* doing the speed perturbation via e.g. perturb_data_dir_speed.sh or \"\n      echo \"... perturb_data_dir_speed_3way.sh.\"\n    fi\n\n\n    num_utts=$(wc -l <$data/utt2spk)\n    if [ $nj -gt $num_utts ]; then\n      nj=$num_utts\n    fi\n\n    utils/data/split_data.sh --per-utt $data $nj\n    sdata=$data/split${nj}utt\n\n    $cmd JOB=1:$nj $data/log/get_durations.JOB.log \\\n      wav-to-duration --read-entire-file=$read_entire_file \\\n      scp:$sdata/JOB/wav.scp ark,t:$sdata/JOB/utt2dur || \\\n        { echo \"$0: there was a problem getting the durations\"; exit 1; }\n\n    for n in `seq $nj`; do\n      cat $sdata/$n/utt2dur\n    done > $data/utt2dur\n  fi\nelif [ -f $data/feats.scp ]; then\n  echo \"$0: wave file does not exist so getting durations from feats files\"\n  if [[ -s $data/frame_shift ]]; then\n    frame_shift=$(cat $data/frame_shift) || exit 1\n    echo \"$0: using frame_shift=$frame_shift from file $data/frame_shift\"\n  fi\n  # The 1.5 correction is the typical value of (frame_length-frame_shift)/frame_shift.\n  feat-to-len scp:$data/feats.scp ark,t:- |\n    awk -v frame_shift=$frame_shift '{print $1, ($2+1.5)*frame_shift}' >$data/utt2dur\nelse\n  echo \"$0: Expected $data/wav.scp, $data/segments or $data/feats.scp to exist\"\n  exit 1\nfi\n\nlen1=$(wc -l < $data/utt2spk)\nlen2=$(wc -l < $data/utt2dur)\nif [ \"$len1\" != \"$len2\" ]; then\n  echo \"$0: warning: length of utt2dur does not equal that of 
utt2spk, $len2 != $len1\"\n  if [ $len1 -gt $[$len2*2] ]; then\n    echo \"$0: less than half of utterances got a duration: failing.\"\n    exit 1\n  fi\nfi\n\necho \"$0: computed $data/utt2dur\"\n\nexit 0\n"
  },
  {
    "path": "egs/utils/data/get_utt2num_frames.sh",
    "content": "#! /bin/bash\n\n# Copyright 2016  Vimal Manohar\n# Apache 2.0.\n\ncmd=run.pl\nnj=4\n\nframe_shift=0.01\nframe_overlap=0.015\n\n. utils/parse_options.sh\n. ./path.sh\n\nif [ $# -ne 1 ]; then\n  echo \"This script writes a file utt2num_frames with the \"\n  echo \"number of frames in each utterance as measured based on the \"\n  echo \"duration of the utterances (in utt2dur) and the specified \"\n  echo \"frame_shift and frame_overlap.\"\n  echo \"Usage: $0 <data>\"\n  exit 1\nfi\n\ndata=$1\n\nif [ -s $data/utt2num_frames ]; then\n  echo \"$0: $data/utt2num_frames already present!\"\n  exit 0;\nfi\n\nif [ ! -f $data/feats.scp ]; then\n  utils/data/get_utt2dur.sh --nj ${nj} --cmd \"$cmd\" $data\n  awk -v fs=$frame_shift -v fovlp=$frame_overlap \\\n    '{print $1\" \"int( ($2 - fovlp) / fs)}' $data/utt2dur > $data/utt2num_frames\n  exit 0\nfi\n\nutils/split_data.sh --per-utt $data $nj || exit 1\n$cmd JOB=1:$nj $data/log/get_utt2num_frames.JOB.log \\\n  feat-to-len scp:$data/split${nj}utt/JOB/feats.scp ark,t:$data/split${nj}utt/JOB/utt2num_frames || exit 1\n\nfor n in `seq $nj`; do\n  cat $data/split${nj}utt/$n/utt2num_frames\ndone > $data/utt2num_frames\n\necho \"$0: Computed and wrote $data/utt2num_frames\"\n"
  },
  {
    "path": "egs/utils/data/internal/choose_utts_to_combine.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Vijayaditya Peddinti\n#           2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\nfrom __future__ import print_function\nimport argparse\nfrom random import randint\nimport sys\nimport os\nfrom collections import defaultdict\n\n\nparser = argparse.ArgumentParser(description=\"\"\"\nThis script, called from data/utils/combine_short_segments.sh, chooses consecutive\nutterances to concatenate that will satisfy the minimum segment length.  It uses the\n--spk2utt file to ensure that utterances from the same speaker are preferentially\ncombined (as far as possible while respecting the minimum segment length).\nIf it has to combine utterances across different speakers in order to satisfy the\nduration constraint, it will assign the combined utterances to the speaker which\ncontributed the most to the duration of the combined utterances.\n\n\nThe utt2uts output of this program is a map from new\nutterance-id to a list of old utterance-ids, so for example if the inputs were\nutt1, utt2 and utt3, and utterances 2 and 3 were combined, the output might look\nlike:\nutt1 utt1\nutt2-combine2 utt2 utt3\nThe utt2spk output of this program assigns utterances to the speakers of the input;\nin the (hopefully rare) case where utterances were combined across speakers, it\nwill assign the utterance to whichever of the original speakers contributed the most\nto the grouped utterance.\n\"\"\")\n\n\nparser.add_argument(\"--min-duration\", type = float, default = 1.55,\n                    help=\"Minimum utterance duration\")\nparser.add_argument(\"--merge-within-speakers-only\", type = str, default = 'false',\n                    choices = ['true', 'false'],\n                    help=\"If true, utterances are only combined from the same speaker.\"\n                    \"It may be useful for the speaker recognition task.\"\n                    \"If false, utterances are preferentially combined from the same 
speaker, \"\n                    \"and then combined across different speakers.\")\nparser.add_argument(\"spk2utt_in\", type = str, metavar = \"<spk2utt-in>\",\n                    help=\"Filename of [input] speaker to utterance map needed \"\n                    \"because this script tries to merge utterances from the \"\n                    \"same speaker as much as possible, and also needs to produce \"\n                    \"an output utt2spk map.\")\nparser.add_argument(\"utt2dur_in\", type = str, metavar = \"<utt2dur-in>\",\n                    help=\"Filename of [input] utterance-to-duration map, with lines like 'utt1 1.23'.\")\nparser.add_argument(\"utt2utts_out\", type = str, metavar = \"<utt2utts-out>\",\n                    help=\"Filename of [output] new-utterance-to-old-utterances map, with lines \"\n                    \"like 'utt1 utt1' or 'utt2-comb2 utt2 utt3'\")\nparser.add_argument(\"utt2spk_out\", type = str, metavar = \"<utt2spk-out>\",\n                    help=\"Filename of [output] utt2spk map, which maps new utterances to original \"\n                    \"speakers.  If utterances were combined across speakers, we map the new \"\n                    \"utterance to the speaker that contributed the most to them.\")\nparser.add_argument(\"utt2dur_out\", type = str, metavar = \"<utt2dur-out>\",\n                    help=\"Filename of [output] utt2dur map, which is just the sum of \"\n                    \"the durations of the source utterances.\")\n\n\nargs = parser.parse_args()\n\n\n# This LessThan is designed to be impervious to roundoff effects in cases where\n# numbers are really always separated by a distance >> 1.0e-05.  
It will return\n# false if x and y are almost identical, differing only by roundoff effects.\ndef LessThan(x, y):\n    return x < y - 1.0e-5\n\n\n# This function implements the core of the utterance-combination code.\n# The input 'durations' is a list of durations, which must all be\n# > 0.0.  This function tries to combine consecutive indexes\n# into groups such that for each group, the total duration is at\n# least 'min_duration'.  It returns a list of (start,end) indexes.\n# For example, CombineList(0.1, [5.0,6.0,7.0]) would return\n# [ (0,1), (1,2), (2,3) ] because no combination is necessary; each\n# returned pair represents a singleton group.\n# Or CombineList(1.0, [0.5, 0.6, 0.7]) would return\n# [ (0,3) ].\n# Or CombineList(1.0, [0.5, 0.6, 1.7]) would return\n# [ (0,2), (2,3) ].\n# Note: if sum(durations) < min_duration, this function will\n# return everything in one group, but of course the duration of that\n# group will still be less than min_duration.\ndef CombineList(min_duration, durations):\n    assert min_duration >= 0.0 and min(durations) > 0.0\n\n    num_utts = len(durations)\n\n    # for each utterance-index i, group_start[i] gives us the\n    # start-index of the group of utterances of which it's currently\n    # a member.\n    group_start = list(range(num_utts))\n    # if utterance-index i currently corresponds to the start of a group\n    # of utterances, then group_durations[i] is the total duration of\n    # that utterance-group, otherwise undefined.\n    group_durations = list(durations)\n    # if utterance-index i currently corresponds to the start of a group\n    # of utterances, then group_end[i] is the end-index (i.e. 
last index plus one\n    # of that utterance-group, otherwise undefined.\n    group_end = [ x + 1 for x in range(num_utts) ]\n\n    queue = [ i for i in range(num_utts) if LessThan(group_durations[i], min_duration) ]\n\n    while len(queue) > 0:\n        i = queue.pop()\n        if group_start[i] != i or not LessThan(group_durations[i], min_duration):\n            # this group no longer exists or already has at least the minimum duration.\n            continue\n        this_dur = group_durations[i]\n        # left_dur is the duration of the group to the left of this group,\n        # or 0.0 if there is no such group.\n        left_dur = group_durations[group_start[i-1]] if i > 0 else 0.0\n        # right_dur is the duration of the group to the right of this group,\n        # or 0.0 if there is no such group.\n        right_dur = group_durations[group_end[i]] if group_end[i] < num_utts else 0.0\n\n\n        if left_dur == 0.0 and right_dur == 0.0:\n            # there is only one group.  Nothing more to merge; break\n            assert group_start[i] == 0 and group_end[i] == num_utts\n            break\n        # work out whether to combine left or right,\n        # by means of the combine_left variable [ True or False ]\n        if left_dur == 0.0:\n            combine_left = False\n        elif right_dur == 0.0:\n            combine_left = True\n        elif LessThan(left_dur + this_dur, min_duration):\n            # combining left would still be below the minimum duration->\n            # combine right... 
if it's above the min duration then good;\n            # otherwise it still doesn't really matter so we might as well\n            # pick one.\n            combine_left = False\n        elif LessThan(right_dur + this_dur, min_duration):\n            # combining right would still be below the minimum duration,\n            # and combining left would be >= the min duration (else we wouldn't\n            # have reached this line) -> combine left.\n            combine_left = True\n        elif LessThan(left_dur, right_dur):\n            # if we reached here then combining either way would take us >= the\n            # minimum duration; but if left_dur < right_dur then we combine left\n            # because that would give us more evenly sized segments.\n            combine_left = True\n        else:\n            # if we reached here then combining either way would take us >= the\n            # minimum duration; but  left_dur >= right_dur, so we combine right\n            # because that would give us more evenly sized segments.\n            combine_left = False\n\n        if combine_left:\n            assert left_dur != 0.0\n            new_group_start = group_start[i-1]\n            group_end[new_group_start] = group_end[i]\n            for j in range(group_start[i], group_end[i]):\n                group_start[j] = new_group_start\n                group_durations[new_group_start] += durations[j]\n            # note: there is no need to add group_durations[new_group_start] to\n            # the queue even if it is still below the minimum length, because it\n            # would have previously had to have been below the minimum length,\n            # therefore it would already be in the queue.\n        else:\n            assert right_dur != 0.0\n            # group start doesn't change, group end changes.\n            old_group_end = group_end[i]\n            new_group_end = group_end[old_group_end]\n            group_end[i] = new_group_end\n            for j in 
range(old_group_end, new_group_end):\n                group_durations[i] += durations[j]\n                group_start[j] = i\n            if LessThan(group_durations[i], min_duration):\n                # the group starting at i is still below the minimum length, so\n                # we need to put it back on the queue.\n                queue.append(i)\n\n    ans = []\n    cur_group_start = 0\n    while cur_group_start < num_utts:\n        ans.append( (cur_group_start, group_end[cur_group_start]) )\n        cur_group_start = group_end[cur_group_start]\n    return ans\n\ndef SelfTest():\n    assert CombineList(0.1, [5.0, 6.0, 7.0]) == [ (0,1), (1,2), (2,3) ]\n    assert CombineList(0.5, [0.1, 6.0, 7.0]) == [ (0,2), (2,3) ]\n    assert CombineList(0.5, [6.0, 7.0, 0.1]) == [ (0,1), (1,3) ]\n    # in the two examples below, it combines with the shorter one if both would\n    # be above min-dur.\n    assert CombineList(0.5, [6.0, 0.1, 7.0]) == [ (0,2), (2,3) ]\n    assert CombineList(0.5, [7.0, 0.1, 6.0]) == [ (0,1), (1,3) ]\n    # in the example below, it combines with whichever one would\n    # take it above the min-dur, if there is only one such.\n    # note, it tests the 0.1 first as the queue is popped from the end.\n    assert CombineList(1.0, [1.0, 0.5, 0.1, 6.0]) == [ (0,2), (2,4) ]\n\n    for x in range(100):\n        min_duration = 0.05\n        num_utts = randint(1, 15)\n        durations = []\n        for i in range(num_utts):\n            durations.append(0.01 * randint(1, 10))\n        ranges = CombineList(min_duration, durations)\n        if len(ranges) > 1:  # check that each range's duration is >= min_duration\n            for j in range(len(ranges)):\n                (start, end) = ranges[j]\n                this_dur = sum([ durations[k] for k in range(start, end) ])\n                assert not LessThan(this_dur, min_duration)\n\n        # check that the list returned is not affected by very tiny differences\n        # in the inputs.\n        
durations2 = list(durations)\n        for i in range(len(durations2)):\n            durations2[i] += 1.0e-07 * randint(-5, 5)\n        ranges2 = CombineList(min_duration, durations2)\n        assert ranges2 == ranges\n\n# This function figures out the grouping of utterances.\n# The input is:\n# 'min_duration' which is the minimum utterance length in seconds.\n# 'merge_within_speakers_only' which is a ['true', 'false'] choice.\n# If true, then utterances are only combined if they belong to the same speaker.\n# 'spk2utt' which is a list of pairs (speaker-id, [list-of-utterances])\n# 'utt2dur' which is a dict from utterance-id to duration (as a float)\n# It returns a lists of lists of utterances; each list corresponds to\n# a group, e.g.\n# [ ['utt1'], ['utt2', 'utt3'] ]\ndef GetUtteranceGroups(min_duration, merge_within_speakers_only, spk2utt, utt2dur):\n    # utt_groups will be a list of lists of utterance-ids formed from the\n    # first pass of combination.\n    utt_groups = []\n    # group_durations will be the durations of the corresponding elements of\n    # 'utt_groups'.\n    group_durations = []\n\n    # This block calls CombineList for the utterances of each speaker\n    # separately, in the 'first pass' of combination.\n    for i in range(len(spk2utt)):\n        (spk, utts) = spk2utt[i]\n        durations = [] # durations for this group of utts.\n        for utt in utts:\n            try:\n                durations.append(utt2dur[utt])\n            except:\n                sys.exit(\"choose_utts_to_combine.py: no duration available \"\n                         \"in utt2dur file {0} for utterance {1}\".format(\n                        args.utt2dur_in, utt))\n        ranges = CombineList(min_duration, durations)\n        for start, end in ranges:  # each element of 'ranges' is a 2-tuple (start, end)\n            utt_groups.append( [ utts[i] for i in range(start, end) ])\n            group_durations.append(sum([ durations[i] for i in range(start, end) ]))\n\n  
  old_dur_sum = sum(utt2dur.values())\n    new_dur_sum = sum(group_durations)\n    if abs(old_dur_sum - new_dur_sum) > 0.0001 * old_dur_sum:\n        print(\"choose_utts_to_combine.py: large difference in total \"\n              \"durations: {0} vs {1} \".format(old_dur_sum, new_dur_sum),\n              file = sys.stderr)\n\n    # Now we combine the groups obtained above, in case we had situations where\n    # the combination of all the utterances of one speaker was still below\n    # the minimum duration.\n    if merge_within_speakers_only == 'true':\n      return utt_groups\n    else:\n      new_utt_groups = []\n      ranges = CombineList(min_duration, group_durations)\n      for start, end in ranges:\n          # the following code is destructive of 'utt_groups' but it doesn't\n          # matter.\n          this_group = utt_groups[start]\n          for i in range(start + 1, end):\n              this_group += utt_groups[i]\n          new_utt_groups.append(this_group)\n      print(\"choose_utts_to_combine.py: combined {0} utterances to {1} utterances \"\n            \"while respecting speaker boundaries, and then to {2} utterances \"\n            \"with merging across speaker boundaries.\".format(\n              len(utt2dur), len(utt_groups), len(new_utt_groups)),\n            file = sys.stderr)\n      return new_utt_groups\n\n\nSelfTest()\n\nif args.min_duration < 0.0:\n    sys.exit(\"choose_utts_to_combine.py: bad minimum duration {0}\".format(\n            args.min_duration))\n\n# spk2utt is a list of 2-tuples (speaker-id, [list-of-utterances])\nspk2utt = []\n# utt2spk is a dict from utterance-id to speaker-id.\nutt2spk = dict()\ntry:\n    f = open(args.spk2utt_in)\nexcept:\n    sys.exit(\"choose_utts_to_combine.py: error opening --spk2utt={0}\".format(args.spk2utt_in))\nwhile True:\n    line = f.readline()\n    if line == '':\n        break\n    a = line.split()\n    if len(a) < 2:\n        sys.exit(\"choose_utts_to_combine.py: bad line in spk2utt file: \" + 
line)\n    spk = a[0]\n    utts = a[1:]\n    spk2utt.append((spk, utts))\n    for utt in utts:\n        if utt in utt2spk:\n            sys.exit(\"choose_utts_to_combine.py: utterance {0} is listed more than once \"\n                     \"in the spk2utt file {1}\".format(utt, args.spk2utt_in))\n        utt2spk[utt] = spk\nf.close()\n\n# utt2dur is a dict from utterance-id (as a string) to duration in seconds (as a float)\nutt2dur = dict()\ntry:\n    f = open(args.utt2dur_in)\nexcept:\n    sys.exit(\"choose_utts_to_combine.py: error opening utt2dur file {0}\".format(args.utt2dur_in))\nwhile True:\n    line = f.readline()\n    if line == '':\n        break\n    try:\n        [ utt, dur ] = line.split()\n        dur = float(dur)\n        utt2dur[utt] = dur\n    except:\n        sys.exit(\"choose_utts_to_combine.py: bad line in utt2dur file {0}: {1}\".format(\n                args.utt2dur_in, line))\nf.close()\n\n\nutt_groups = GetUtteranceGroups(args.min_duration, args.merge_within_speakers_only, spk2utt, utt2dur)\n\n# set utt_group names to an array like [ 'utt1', 'utt2-comb2', 'utt4', ... 
]\nutt_group_names = [ group[0] if len(group)==1 else \"{0}-comb{1}\".format(group[0], len(group))\n                    for group in utt_groups ]\n\n\n# write the utt2utts file.\ntry:\n    with open(args.utt2utts_out, 'w') as f:\n        for i in range(len(utt_groups)):\n            print(utt_group_names[i], ' '.join(utt_groups[i]), file = f)\nexcept Exception as e:\n    sys.exit(\"choose_utts_to_combine.py: exception writing to \"\n             \"<utt2utts-out>={0}: {1}\".format(args.utt2utts_out, str(e)))\n\n# write the utt2spk file.\ntry:\n    with open(args.utt2spk_out, 'w') as f:\n        for i in range(len(utt_groups)):\n            utt_group = utt_groups[i]\n            spk_list = [ utt2spk[utt] for utt in utt_group ]\n            if spk_list == [ spk_list[0] ] * len(utt_group):\n                spk = spk_list[0]\n            else:\n                spk2dur = defaultdict(float)\n                # spk2dur maps each speaker-id to the total duration that\n                # speaker contributes to this combined utterance.\n                for utt in utt_group:\n                    spk2dur[utt2spk[utt]] += utt2dur[utt]\n                # the following code, which picks the speaker that contributed\n                # the most to the duration of this utterance, is a little\n                # complex because we want to break ties in a deterministic way,\n                # picking the earlier speaker in case of a tied duration.\n                longest_spk_dur = -1.0\n                spk = None\n                for this_spk in sorted(spk2dur.keys()):\n                    if LessThan(longest_spk_dur, spk2dur[this_spk]):\n                        longest_spk_dur = spk2dur[this_spk]\n                        spk = this_spk\n                assert spk is not None\n            new_utt = utt_group_names[i]\n            print(new_utt, spk, file = f)\nexcept Exception as e:\n    sys.exit(\"choose_utts_to_combine.py: exception writing to \"\n             \"<utt2spk-out>={0}: 
{1}\".format(args.utt2spk_out, str(e)))\n\n# write the utt2dur file.\ntry:\n    with open(args.utt2dur_out, 'w') as f:\n        for i in range(len(utt_groups)):\n            utt_name = utt_group_names[i]\n            duration = sum([ utt2dur[utt] for utt in utt_groups[i]])\n            print(utt_name, duration, file = f)\nexcept Exception as e:\n    sys.exit(\"choose_utts_to_combine.py: exception writing to \"\n             \"<utt2dur-out>={0}: {1}\".format(args.utt2dur_out, str(e)))\n\n"
  },
  {
    "path": "egs/utils/data/internal/combine_segments_to_recording.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018  Vimal Manohar\n# Apache 2.0\n\nfrom __future__ import print_function\nimport argparse\nimport sys\nimport collections\nfrom collections import defaultdict\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"\n        This script combines segments into utterances at\n        recording-level and write out new utt2spk file with reco-id as the\n        speakers. If --write-reco2utt is provided, it writes a mapping from\n        recording-id to the list of utterances sorted by start and end times.\n        This map can be used to combine text corresponding to the segments to\n        recording-level.\"\"\")\n\n    parser.add_argument(\"--write-reco2utt\", help=\"If provided, writes a \"\n                        \"mapping from recording-id to list of utterances \"\n                        \"sorted by start and end times.\")\n    parser.add_argument(\"segments_in\", help=\"Input segments file\")\n    parser.add_argument(\"utt2spk_out\", help=\"Output utt2spk file\")\n\n    args = parser.parse_args()\n\n    return args\n\n\ndef main():\n    args = get_args()\n\n    utt2reco = {}\n    segments_for_reco = defaultdict(list)\n    for line in open(args.segments_in):\n        parts = line.strip().split()\n\n        if len(parts) < 4:\n            raise TypeError(\"bad line in segments file {}\".format(line))\n\n        utt = parts[0]\n        reco = parts[1]\n        start_time = parts[2]\n        end_time = parts[3]\n\n        segments_for_reco[reco].append((utt, start_time, end_time))\n        utt2reco[utt] = reco\n\n    if args.write_reco2utt is not None:\n        with open(args.write_reco2utt, 'w') as reco2utt_writer, \\\n                open(args.utt2spk_out, 'w') as utt2spk_writer:\n            for reco, segments_in_reco in segments_for_reco.items():\n                utts = ' '.join([seg[0] for seg in sorted(\n                    segments_in_reco, key=lambda x:(x[1], x[2]))])\n               
 print(\"{0} {1}\".format(reco, utts), file=reco2utt_writer)\n                print (\"{0} {0}\".format(reco), file=utt2spk_writer)\n    else:\n        with open(args.utt2spk_out, 'w') as utt2spk_writer:\n            for reco in segments_for_reco.keys():\n                print (\"{0} {0}\".format(reco), file=utt2spk_writer)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/utils/data/internal/modify_speaker_info.py",
    "content": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nimport argparse, sys,os\nfrom collections import defaultdict\nparser = argparse.ArgumentParser(description=\"\"\"\nCombine consecutive utterances into fake speaker ids for a kind of\npoor man's segmentation.  Reads old utt2spk from standard input,\noutputs new utt2spk to standard output.\"\"\")\nparser.add_argument(\"--utts-per-spk-max\", type = int, required = True,\n                    help=\"Maximum number of utterances allowed per speaker\")\nparser.add_argument(\"--seconds-per-spk-max\", type = float, required = True,\n                    help=\"\"\"Maximum duration in seconds allowed per speaker.\n                         If this option is >0, --utt2dur option must be provided.\"\"\")\nparser.add_argument(\"--utt2dur\", type = str,\n                    help=\"\"\"Filename of input 'utt2dur' file (needed only if\n                    --seconds-per-spk-max is provided)\"\"\")\nparser.add_argument(\"--respect-speaker-info\", type = str, default = 'true',\n                    choices = ['true', 'false'],\n                    help=\"\"\"If true, the output speakers will be split from \"\n                    \"existing speakers.\"\"\")\n\nargs = parser.parse_args()\n\nutt2spk = dict()\n# an undefined spk2utt entry will default to an empty list.\nspk2utt = defaultdict(lambda: [])\n\nwhile True:\n    line = sys.stdin.readline()\n    if line == '':\n        break;\n    a = line.split()\n    if len(a) != 2:\n        sys.exit(\"modify_speaker_info.py: bad utt2spk line from standard input (expected two fields): \" +\n                 line)\n    [ utt, spk ] = a\n    utt2spk[utt] = spk\n    spk2utt[spk].append(utt)\n\nif args.seconds_per_spk_max > 0:\n    utt2dur = dict()\n    try:\n        f = open(args.utt2dur)\n        while True:\n            line = f.readline()\n            if line == '':\n                break\n            a = line.split()\n            if len(a) != 2:\n                
sys.exit(\"modify_speaker_info.py: bad line in utt2dur file (expected two fields): \" +\n                         line)\n            [ utt, dur ] = a\n            utt2dur[utt] = float(dur)\n        for utt in utt2spk:\n            if utt not in utt2dur:\n                sys.exit(\"modify_speaker_info.py: utterance {0} not in utt2dur file {1}\".format(\n                        utt, args.utt2dur))\n    except Exception as e:\n        sys.exit(\"modify_speaker_info.py: problem reading utt2dur info: \" + str(e))\n\n# splits a list of utts into a list of lists, based on constraints from the\n# command line args.  Note: the last list will tend to be shorter than the others,\n# we make no attempt to fix this.\ndef SplitIntoGroups(uttlist):\n    ans = [] # list of lists.\n    cur_uttlist = []\n    cur_dur = 0.0\n    for utt in uttlist:\n        if ((args.utts_per_spk_max > 0 and len(cur_uttlist) == args.utts_per_spk_max) or\n            (args.seconds_per_spk_max > 0 and len(cur_uttlist) > 0 and\n             cur_dur + utt2dur[utt] > args.seconds_per_spk_max)):\n            ans.append(cur_uttlist)\n            cur_uttlist = []\n            cur_dur = 0.0\n        cur_uttlist.append(utt)\n        if args.seconds_per_spk_max > 0:\n            cur_dur += utt2dur[utt]\n    if len(cur_uttlist) > 0:\n        ans.append(cur_uttlist)\n    return ans\n\n\n# This function will return '%01d' if d < 10, '%02d' if d < 100, and so on.\n# It's for printf printing of numbers in such a way that sorted order will be\n# correct.\ndef GetFormatString(d):\n    ans = 1\n    while (d >= 10):\n        d //= 10  # integer division\n        ans += 1\n    # e.g. 
we might return the string '%01d' or '%02d'\n    return '%0{0}d'.format(ans)\n\n\nif args.respect_speaker_info == 'true':\n    for spk in sorted(spk2utt.keys()):\n        uttlists = SplitIntoGroups(spk2utt[spk])\n        format_string = '%s-' + GetFormatString(len(uttlists))\n        for i in range(len(uttlists)):\n            # the following might look like: '%s-%02d' % ('john_smith', 9 + 1),\n            # giving 'john_smith-10'.\n            this_spk = format_string % (spk, i + 1)\n            for utt in uttlists[i]:\n                print(utt, this_spk)\nelse:\n    uttlists = SplitIntoGroups(sorted(utt2spk.keys()))\n    format_string = 'speaker-' + GetFormatString(len(uttlists))\n    for i in range(len(uttlists)):\n        # the following might look like: 'speaker-%04d' % (105 + 1),\n        # giving 'speaker-0106'.\n        this_spk = format_string % (i + 1)\n        for utt in uttlists[i]:\n            print(utt, this_spk)\n\n"
  },
  {
    "path": "egs/utils/data/internal/perturb_volume.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0\n\n\"\"\"\nThis script reads a wav.scp file from the input and perturbs the\nvolume of the recordings and writes to stdout the contents of\na new wav.scp file.\n\"\"\"\nfrom __future__ import print_function\n\nimport argparse\nimport re\nimport random\nimport sys\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"\n        This script reads a wav.scp file from the input and perturbs the\n        volume of the recordings and writes to stdout the contents of\n        a new wav.scp file.\n        If --reco2vol is provided, then for each recording, the volume factor\n        specified in that file is applied.\n        Otherwise, a volume factor is chosen randomly from a uniform\n        distribution between --scale-low and --scale-high.\n        \"\"\")\n\n    parser.add_argument(\"--scale-low\", type=float, default=0.125,\n                        help=\"Minimum volume scale to be applied.\")\n    parser.add_argument(\"--scale-high\", type=float, default=2,\n                        help=\"Maximum volume scale to tbe applid.\")\n    parser.add_argument(\"--reco2vol\", type=str, default=None,\n                        help=\"If supplied, it must be a file of the format \"\n                        \"<reco-id> <volume-scale>, which specifies the \"\n                        \"volume scale to be applied for each recording.\")\n    parser.add_argument(\"--write-reco2vol\", type=str, default=None,\n                        help=\"If provided, the volume scale used for each \"\n                        \"recording will be written to this file\")\n    args = parser.parse_args()\n\n    if args.reco2vol == \"\":\n        args.reco2vol = None\n    if args.write_reco2vol == \"\":\n        args.write_reco2vol = None\n\n    return args\n\n\ndef read_reco2vol(volumes_file):\n    \"\"\"Read volume scales for recordings.\n    The format of volumes_file is <reco-id> <volume-scale>\n 
   Returns a dictionary { reco-id : volume-scale }\n    \"\"\"\n    volumes = {}\n    with open(volumes_file) as volume_reader:\n        for line in volume_reader.readlines():\n            if len(line.strip()) == 0:\n                continue\n\n            parts = line.strip().split()\n            if len(parts) != 2:\n                raise RuntimeError(\"Unable to parse the line {0} in file {1}.\"\n                                   \"\".format(line.strip(), volumes_file))\n            volumes[parts[0]] = float(parts[1])\n    return volumes\n\n\ndef run(args):\n    random.seed(0)\n\n    volumes = None\n    if args.reco2vol is not None:\n        volumes = read_reco2vol(args.reco2vol)\n\n    if args.write_reco2vol is not None:\n        volume_writer = open(args.write_reco2vol, 'w')\n\n    for line in sys.stdin.readlines():\n        if len(line.strip()) == 0:\n            continue\n        parts = line.strip().split()\n        reco_id = parts[0]\n\n        vol = random.uniform(args.scale_low, args.scale_high)\n        if volumes is not None:\n            if reco_id not in volumes:\n                raise RuntimeError('Could not find volume for id {0} in '\n                                   '{1}'.format(reco_id, args.reco2vol))\n            vol = volumes[reco_id]\n\n        # Handle three cases of rxfilenames appropriately;\n        # 'input piped command', 'file offset' and 'filename'\n        if line.strip()[-1] == '|':\n            print ('{0} sox --vol {1} -t wav - -t wav - |'.format(\n                line.strip(), vol))\n        elif re.search(':[0-9]+$', line.strip()) is not None:\n            print ('{id} wav-copy {wav} - | '\n                   'sox --vol {vol} -t wav - -t wav - |'.format(\n                       id=parts[0], wav=' '.join(parts[1:]), vol=vol))\n        else:\n            print ('{id} sox --vol {vol} -t wav {wav} -t wav - |'.format(\n                id=parts[0], wav=' '.join(parts[1:]), vol=vol))\n\n        if args.write_reco2vol is not None:\n  
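          volume_writer.write('{id} {vol}\\n'.format(id=parts[0], vol=vol))\n\n    # close the reco2vol file so that everything written is flushed to disk.\n    if args.write_reco2vol is not None:\n        volume_writer.close()\n\n\ndef main():\n    args = get_args()\n    run(args)\n\n\nif __name__ == \"__main__\":\n    main()\n"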
  },
  {
    "path": "egs/utils/data/limit_feature_dim.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Alibaba Robotics Corp. (author: Xingyu Na)\n# Apache 2.0\n\n# The script creates a new data directory by selecting a specified\n# dimension range of the features in the source directory.\n\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n  echo \"Usage: \"\n  echo \"  $0 <feat-dim-range> <srcdir> <destdir>\"\n  echo \"The script creates a new data directory by selecting a specified\"\n  echo \"dimension range of the features in the source directory.\"\n  echo \"e.g.:\"\n  echo \" $0 0:39 data/train_hires_pitch data/train_hires\"\n  exit 1;\nfi\n\nfeat_dim_range=$1\nsrcdir=$2\ndestdir=$3\n\nif [ \"$destdir\" == \"$srcdir\" ]; then\n  echo \"$0: this script requires <srcdir> and <destdir> to be different.\"\n  exit 1\nfi\n\nif [ ! -f $srcdir/feats.scp ]; then\n  echo \"$0: no such file $srcdir/feats.scp\"\n  exit 1;\nfi\n\nmkdir -p $destdir\nutils/copy_data_dir.sh $srcdir $destdir\n\nif [ -f $destdir/cmvn.scp ]; then\n  rm $destdir/cmvn.scp\n  echo \"$0: warning: removing $destdir/cmvn.cp, you will have to regenerate it from the features.\"\nfi\n\nrm $destdir/feats.scp\nsed 's/$/\\[:,'${feat_dim_range}'\\]/' $srcdir/feats.scp | \\\n  utils/data/normalize_data_range.pl > $destdir/feats.scp\n\n[ ! -f $srcdir/text ] && validate_opts=\"$validate_opts --no-text\"\nutils/validate_data_dir.sh $validate_opts $destdir\n"
  },
  {
    "path": "egs/utils/data/modify_speaker_info.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013-2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script copies a data directory (like utils/copy_data.sh) while\n# modifying (splitting or merging) the speaker information in that data directory.\n#\n# This is done without looking at the data at all; we use only duration\n# constraints and maximum-num-utts-per-speaker to assign contiguous\n# sets of utterances to speakers.\n#\n# This has two general uses:\n# (1) when dumping iVectors for training purposes, it's helpful to have\n#   a good variety of iVectors, and this can be accomplished by splitting\n#   speakers up into multiple copies of those speakers.  We typically\n#   use the --utts-per-spk-max 2 option for this.\n# (2) when dealing with data that is not diarized, and given that we\n#   haven't checked any diarization scripts into Kaldi yet, this\n#   script can do a \"dumb\" diarization that just groups consecutive\n#   utterances into groups based on length constraints.\n#   There are two cases here:\n\n#       a) With --respect-speaker-info true (the default),\n#         it only splits within existing speakers.\n#         This is suitable when you have existing speaker\n#         info that's meaningful in some way, e.g. represents\n#         individual recordings.\n#      b) With --respect-speaker-info false,\n#        it completely ignores the existing speaker information\n#        and constructs new speaker identities based on\n#        utterance names.  This is suitable in scenarios when\n#        you have a one-to-one map between speakers and\n#        utterances.\n\n# begin configuration section\nutts_per_spk_max=-1\nseconds_per_spk_max=-1\nrespect_speaker_info=true\n# end configuration section\n\n. 
utils/parse_options.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <destdir>\"\n  echo \"e.g.:\"\n  echo \" $0 --utts-per-spk-max 2 data/train data/train-max2\"\n  echo \"Options\"\n  echo \"   --utts-per-spk-max <n>  # number of utterances per speaker maximum,\"\n  echo \"                           # default -1 (meaning no maximum).  E.g. 2.\"\n  echo \"   --seconds-per-spk-max <n> # number of seconds per speaker maximum,\"\n  echo \"                             # default -1 (meaning no maximum).  E.g. 60.\"\n  echo \"   --respect-speaker-info <true|false>  # If true, respect the\"\n  echo \"                                        # existing speaker map (i.e. do not\"\n  echo \"                                        # assign utterances from different\"\n  echo \"                                        # speakers to the same generated speaker).\"\n  echo \"                                        # Default: true.\"\n  echo \"Note: one or both of the --utts-per-spk-max or --seconds-per-spk-max\"\n  echo \"options is required.\"\n  exit 1;\nfi\n\nexport LC_ALL=C\n\nsrcdir=$1\ndestdir=$2\n\nif [ \"$destdir\"  == \"$srcdir\" ]; then\n  echo \"$0: <srcdir> must be different from <destdir>.\"\n  exit 1\nfi\n\nif [ \"$seconds_per_spk_max\" == \"-1\" ] && ! [ \"$utts_per_spk_max\" -gt 0 ]; then\n  echo \"$0: one or both of the --utts-per-spk-max or --seconds-per-spk-max options must be provided.\"\n  exit 1\nfi\n\nif [ ! 
-f $srcdir/utt2spk ]; then\n  echo \"$0: no such file $srcdir/utt2spk\"\n  exit 1;\nfi\n\nset -e;\nset -o pipefail\n\nmkdir -p $destdir\n\nif [ \"$seconds_per_spk_max\" != -1 ]; then\n  # we need the utt2dur file.\n  utils/data/get_utt2dur.sh $srcdir\n  utt2dur_opt=\"--utt2dur=$srcdir/utt2dur\"\nelse\n  utt2dur_opt=\nfi\n\nutils/data/internal/modify_speaker_info.py \\\n   $utt2dur_opt --respect-speaker-info=$respect_speaker_info \\\n  --utts-per-spk-max=$utts_per_spk_max --seconds-per-spk-max=$seconds_per_spk_max \\\n  <$srcdir/utt2spk >$destdir/utt2spk\n\nutils/utt2spk_to_spk2utt.pl <$destdir/utt2spk >$destdir/spk2utt\n\n# This script won't create the new cmvn.scp, it should be recomputed.\nif [ -f $destdir/cmvn.scp ]; then\n  mkdir -p $destdir/.backup\n  mv $destdir/cmvn.scp $destdir/.backup\n  echo \"$0: moving $destdir/cmvn.scp to $destdir/.backup/cmvn.scp\"\nfi\n\n# these things won't be affected by the change of speaker mapping.\nfor f in feats.scp segments wav.scp reco2file_and_channel text stm glm ctm; do\n  [ -f $srcdir/$f ] && cp $srcdir/$f $destdir/\ndone\n\n\norig_num_spk=$(wc -l <$srcdir/spk2utt)\nnew_num_spk=$(wc -l <$destdir/spk2utt)\n\necho \"$0: copied data from $srcdir to $destdir, number of speakers changed from $orig_num_spk to $new_num_spk\"\nopts=\n[ ! -f $srcdir/feats.scp ] && opts=\"--no-feats\"\n[ ! -f $srcdir/text ] && opts=\"$opts --no-text\"\n[ ! -f $srcdir/wav.scp ] && opts=\"$opts --no-wav\"\n\nutils/validate_data_dir.sh $opts $destdir\n"
  },
  {
    "path": "egs/utils/data/modify_speaker_info_to_recording.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Vimal Manohar\n# Apache 2.0.\n\n# Copy the data directory, but modify it to use the recording-id as the \n# speaker. This is useful to get matching speaker information in the \n# whole recording data directory.\n# Note that this also appends the recording-id as a prefix to the \n# utterance-id.\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 <in-data> <out-data>\"\n  echo \" e.g.: $0 data/train data/train_recospk\"\n  exit 1\nfi\n\nin_data=$1\nout_data=$2\n\nmkdir -p $out_data\n\nfor f in wav.scp segments utt2spk; do \n  if [ ! -f $in_data/$f ]; then\n    echo \"$0: Could not find file $in_data/$f\" \n    exit 1\n  fi\ndone\n\ncp $in_data/wav.scp $out_data/ || exit 1\ncp $in_data/reco2file_and_channel $out_data/ 2> /dev/null || true\nawk '{print $1\" \"$2\"-\"$1}' $in_data/segments > \\\n  $out_data/old2new.uttmap || exit 1\nutils/apply_map.pl -f 1 $out_data/old2new.uttmap < $in_data/segments > \\\n  $out_data/segments || exit 1\nawk '{print $1\" \"$2}' $out_data/segments > $out_data/utt2spk || exit 1\nutils/utt2spk_to_spk2utt.pl $out_data/utt2spk > $out_data/spk2utt || exit 1\n\nif [ -f $in_data/text ]; then\n  utils/apply_map.pl -f 1 $out_data/old2new.uttmap < $in_data/text > \\\n    $out_data/text || exit 1\nfi\n\nif [ -f $in_data/feats.scp ]; then\n  utils/apply_map.pl -f 1 $out_data/old2new.uttmap < $in_data/feats.scp > \\\n    $out_data/feats.scp || exit 1\nfi\n\nutils/fix_data_dir.sh $out_data || exit 1\nutils/validate_data_dir.sh --no-text --no-feats $out_data || exit 1\n"
  },
  {
    "path": "egs/utils/data/normalize_data_range.pl",
    "content": "#!/usr/bin/env perl\n\n# This script is intended to read and write scp files possibly containing indexes for\n# sub-ranges of features, like\n# foo-123  bar.ark:431423[78:89]\n# meaning rows 78 through 89 of the matrix located at bar.ark:431423.\n#\n# Its purpose is to normalize lines which have ranges on top of ranges, like\n#\n# foo-123  bar.ark:431423[78:89][3:4]\n#\n# This program interprets the later [] expression as a sub-range of the matrix returned by the first []\n# expression; in this case, we'd get\n#\n# foo-123  bar.ark:431423[81:82]\n#\n# Note that these ranges are based on zero-indexing, and have a 'first:last'\n# interpretation, so the range [0:0] is a matrix with one row.  And also note\n# that column ranges are permitted, after row ranges, and the row range may be\n# empty, e.g.\n\n# foo-123  bar.ark:431423[81:82,0:13]\n# or\n# foo-123  bar.ark:431423[,0:13]\n#\n\n# This program reads from the standard input (or command-line file or files),\n# and writes to the standard output.\n\n\n# This function combines ranges, either row or column ranges.  start1 and end1\n# are the first range, and start2 and end2 are interpreted as a sub-range of the\n# first range.  
It is acceptable for either start1 and end1, or start2 and end2, to\n# be empty.\n# This function returns the start and end of the range, as an array.\nsub combine_ranges {\n  ($row_or_column, $start1, $end1, $start2, $end2) = @_;\n\n  if ($start1 eq \"\" && $end1 eq \"\") {\n    return ($start2, $end2);\n  } elsif ($start2 eq \"\" && $end2 eq \"\") {\n    return ($start1, $end1);\n  } else {\n    # For now this script doesn't support the case of ranges like [20:], even\n    # though they are supported at the C++ level.\n    if ($start1 eq \"\" || $start2 eq \"\" || $end1 eq \"\" || $end2 eq \"\") {\n      chop $line;\n      print STDERR (\"normalize_data_range.pl: could not make sense of line $line\\n\");\n      exit(1)\n    }\n    if ($start1 + $end2 > $end1) {\n      chop $line;\n      print STDERR (\"normalize_data_range.pl: could not make sense of line $line \" .\n            \"[second $row_or_column range too large vs first range, $start1 + $end2 > $end1]\\n\");\n          # exit(1);\n      return ($start2+$start1, $end1);\n    }\n    return ($start2+$start1, $end2+$start1);\n  }\n}\n\n\nwhile (<>) {\n  $line = $_;\n  # we only need to do something if we detect two of these ranges.\n  # The following regexp matches strings of the form ...[foo][bar]\n  # where foo and bar have no square brackets in them.\n  if (m/\\[([^][]*)\\]\\[([^][]*)\\]\\s*$/) {\n    $before_range = $`;\n    $first_range = $1;   # e.g. '0:500,20:21', or '0:500', or ',0:13'.\n    $second_range = $2;  # has same general format as first_range.\n    if ($_ =~ m/concat-feats /) {\n      # sometimes in scp files, we use the command concat-feats to splice together\n      # two feature matrices.  
Handling this correctly is complicated and we don't\n      # anticipate needing it, so we just refuse to process this type of data.\n      print STDERR (\"normalize_data_range.pl: this script cannot [yet] normalize the data ranges \" .\n        \"if concat-feats was in the input data\\n\");\n      exit(1);\n    }\n    # print STDERR \"matched: $before_range $first_range $second_range\\n\";\n    if ($first_range !~ m/^((\\d*):(\\d*)|)(,(\\d*):(\\d*)|)$/) {\n      print STDERR \"normalize_data_range.pl: could not make sense of input line $_\";\n      exit(1);\n    }\n    $row_start1 = $2;\n    $row_end1 = $3;\n    $col_start1 = $5;\n    $col_end1 = $6;\n\n    if ($second_range !~ m/^((\\d*):(\\d*)|)(,(\\d*):(\\d*)|)$/) {\n      print STDERR \"normalize_data_range.pl: could not make sense of input line $_\";\n      exit(1);\n    }\n    $row_start2 = $2;\n    $row_end2 = $3;\n    $col_start2 = $5;\n    $col_end2 = $6;\n\n    ($row_start, $row_end) = combine_ranges(\"row\", $row_start1, $row_end1, $row_start2, $row_end2);\n    ($col_start, $col_end) = combine_ranges(\"column\", $col_start1, $col_end1, $col_start2, $col_end2);\n\n\n    if ($row_start ne \"\") {\n      $range = \"$row_start:$row_end\";\n    } else {\n      $range = \"\";\n    }\n    if ($col_start ne \"\") {\n      $range .= \",$col_start:$col_end\";\n    }\n    print $before_range . \"[\" . $range . 
\"]\\n\";\n  } else {\n    print;\n  }\n}\n\n__END__\n\n# Testing\n# echo foo |  utils/data/normalize_data_range.pl -> foo\n# echo 'foo[bar:baz]' |  utils/data/normalize_data_range.pl -> foo[bar:baz]\n# echo 'foo[bar:baz][bin:bang]' |  utils/data/normalize_data_range.pl -> normalize_data_range.pl: could not make sense of input line foo[bar:baz][bin:bang]\n# echo 'foo[10:20][0:5]' |  utils/data/normalize_data_range.pl -> foo[10:15]\n# echo 'foo[,10:20][,0:5]' |  utils/data/normalize_data_range.pl -> foo[,10:15]\n# echo 'foo[,0:100][1:15]' |  utils/data/normalize_data_range.pl -> foo[1:15,0:100]\n# echo 'foo[1:15][,0:100]' |  utils/data/normalize_data_range.pl -> foo[1:15,0:100]\n# echo 'foo[10:20][0:11]' |  utils/data/normalize_data_range.pl -> normalize_data_range.pl: could not make sense of line foo[10:20][0:11] [second row range too large vs first range, 10 + 11 > 20]\n# echo 'foo[,10:20][,0:11]' |  utils/data/normalize_data_range.pl -> normalize_data_range.pl: could not make sense of line foo[,10:20][,0:11] [second column range too large vs first range, 10 + 11 > 20]\n"
  },
  {
    "path": "egs/utils/data/perturb_data_dir_speed_3way.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016-2018  Johns Hopkins University (author: Daniel Povey)\n#                2018  Hossein Hadian\n\n# Apache 2.0\n\n# This script does the standard 3-way speed perturbing of\n# a data directory (it operates on the wav.scp).\n\n# If you add the option \"--always-include-prefix true\", it will include the\n# prefix \"sp1.0-\" for the original un-perturbed data.  This can help resolve\n# problems with sorting.\n# We don't make '--always-include-prefix true' the default  behavior because\n# it can break some older scripts that relied on the original utterance-ids\n# being a subset of the perturbed data's utterance-ids.\n\nalways_include_prefix=false\n\n. utils/parse_options.sh\n\nif [ $# != 2 ]; then\n  echo \"Usage: perturb_data_dir_speed_3way.sh <srcdir> <destdir>\"\n  echo \"Applies standard 3-way speed perturbation using factors of 0.9, 1.0 and 1.1.\"\n  echo \"e.g.:\"\n  echo \" $0 [options] data/train data/train_sp\"\n  echo \"Note: if <destdir>/feats.scp already exists, this will refuse to run.\"\n  echo \"Options:\"\n  echo \"    --always-include-prefix [true|false]   # default: false.  If set to true,\"\n  echo \"                                           # it will add the prefix 'sp1.0-' to\"\n  echo \"                                           # utterance and speaker-ids for data at\"\n  echo \"                                           # the original speed.  Can resolve\"\n  echo \"                                           # issues RE data sorting.\"\n  exit 1\nfi\n\nsrcdir=$1\ndestdir=$2\n\nif [ ! -f $srcdir/wav.scp ]; then\n  echo \"$0: expected $srcdir/wav.scp to exist\"\n  exit 1\nfi\n\nif [ -f $destdir/feats.scp ]; then\n  echo \"$0: $destdir/feats.scp already exists: refusing to run this (please delete $destdir/feats.scp if you want this to run)\"\n  exit 1\nfi\n\necho \"$0: making sure the utt2dur and the reco2dur files are present\"\necho \"... 
in ${srcdir}, because obtaining it after speed-perturbing\"\necho \"... would be very slow, and you might need them.\"\nutils/data/get_utt2dur.sh ${srcdir}\nutils/data/get_reco2dur.sh ${srcdir}\n\nutils/data/perturb_data_dir_speed.sh 0.9 ${srcdir} ${destdir}_speed0.9 || exit 1\nutils/data/perturb_data_dir_speed.sh 1.1 ${srcdir} ${destdir}_speed1.1 || exit 1\n\nif $always_include_prefix; then\n  utils/copy_data_dir.sh --spk-prefix sp1.0- --utt-prefix sp1.0- ${srcdir} ${destdir}_speed1.0\n  if [ ! -f $srcdir/utt2uniq ]; then\n    cat $srcdir/utt2spk | awk  '{printf(\"sp1.0-%s %s\\n\", $1, $1);}' > ${destdir}_speed1.0/utt2uniq\n  else\n    cat $srcdir/utt2uniq | awk '{printf(\"sp1.0-%s %s\\n\", $1, $2);}' > ${destdir}_speed1.0/utt2uniq\n  fi\n  utils/data/combine_data.sh $destdir ${destdir}_speed1.0 ${destdir}_speed0.9 ${destdir}_speed1.1 || exit 1\n\n  rm -r ${destdir}_speed0.9 ${destdir}_speed1.1 ${destdir}_speed1.0\nelse\n  utils/data/combine_data.sh $destdir ${srcdir} ${destdir}_speed0.9 ${destdir}_speed1.1 || exit 1\n  rm -r ${destdir}_speed0.9 ${destdir}_speed1.1\nfi\n\necho \"$0: generated 3-way speed-perturbed version of data in $srcdir, in $destdir\"\nif ! utils/validate_data_dir.sh --no-feats --no-text $destdir; then\n  echo \"$0: Validation failed.  If it is a sorting issue, try the option '--always-include-prefix true'.\"\n  exit 1\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/utils/data/perturb_data_dir_volume.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n# This script operates on a data directory, such as in data/train/, and modifies\n# the wav.scp to perturb the volume (typically useful for training data when\n# using systems that don't have cepstral mean normalization).\n\nreco2vol=   # A file with the format <reco-id> <volume> that specifies the \n            # factor by which the volume of the recording must be scaled.\n            # If not provided, then the volume will be chosen randomly to \n            # be between --scale-low and --scale-high.\nwrite_reco2vol=     # File to write volume-scales applied to the recordings.\n                    # Can be passed to --reco2vol to use the same volumes for \n                    # another data directory. \n                    # e.g. the unperturbed data directory.\nscale_low=0.125\nscale_high=2\n\n. utils/parse_options.sh\n\nif [ $# != 1 ]; then\n  echo \"Usage: $0 <datadir>\"\n  echo \"e.g.:\"\n  echo \" $0 data/train\"\n  exit 1\nfi\n\nexport LC_ALL=C\n\ndata=$1\n\nif [ ! -f $data/wav.scp ]; then\n  echo \"$0: Expected $data/wav.scp to exist\"\n  exit 1\nfi\n\n# Check if volume perturbation is already done. We assume that the volume\n# perturbation is done if it has a line 'sox --vol' applied on the whole \n# recording.\n# e.g. 
\n# foo-1 cat foo.wav | sox --vol 1.6 -t wav - -t wav - |    # volume perturbation done\n# bar-1 sox --vol 1.2 bar.wav -t wav - |                   # volume perturbation done\n# foo-2 wav-reverberate --additive-signals=\"sox --vol=0.1 noise1.wav -t wav -|\" foo.wav |   # volume perturbation not done\nvolume_perturb_done=`head -n100 $data/wav.scp | python -c \"\nimport sys, re\nfor line in sys.stdin.readlines():\n  if len(line.strip()) == 0:\n    continue\n  # Handle three cases of rxfilenames appropriately; 'input piped command', 'file offset' and 'filename'\n  parts = line.strip().split()\n  if line.strip()[-1] == '|':\n    if re.search('sox --vol', ' '.join(parts[-11:])):\n      print('true')\n      sys.exit(0)\n  elif re.search(':[0-9]+$', line.strip()) is not None:\n    continue\n  else:\n    if ' '.join(parts[1:3]) == 'sox --vol':\n      print('true')\n      sys.exit(0)\nprint('false')\n\"` || exit 1\n\nif $volume_perturb_done; then\n  echo \"$0: It looks like the data was already volume perturbed.  Not doing anything.\"\n  exit 0\nfi\n\ncat $data/wav.scp | utils/data/internal/perturb_volume.py \\\n  --reco2vol=$reco2vol ${write_reco2vol:+--write-reco2vol=$write_reco2vol} \\\n  --scale-low=$scale_low --scale-high=$scale_high > \\\n  $data/wav.scp_scaled || exit 1;\n\nlen1=$(cat $data/wav.scp | wc -l)\nlen2=$(cat $data/wav.scp_scaled | wc -l)\nif [ \"$len1\" != \"$len2\" ]; then\n  echo \"$0: error detected: number of lines changed $len1 vs $len2\";\n  exit 1\nfi\n\nmv $data/wav.scp_scaled $data/wav.scp\n\nif [ -f $data/feats.scp ]; then\n  echo \"$0: $data/feats.scp exists; moving it to $data/.backup/ as it wouldn't be valid any more.\"\n  mkdir -p $data/.backup/\n  mv $data/feats.scp $data/.backup/\nfi\n\necho \"$0: added volume perturbation to the data in $data\"\nexit 0\n\n"
  },
  {
    "path": "egs/utils/data/perturb_speed_to_allowed_lengths.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright     2017  Hossein Hadian\n# Apache 2.0\n\n\n\"\"\" This script perturbs speeds of utterances to force their lengths to some\n    allowed lengths spaced by a factor (like 10%)\n\"\"\"\n\nimport argparse\nimport os\nimport sys\nimport copy\nimport math\nimport logging\n\nsys.path.insert(0, 'steps')\nimport libs.common as common_lib\n\nlogger = logging.getLogger('libs')\nlogger.setLevel(logging.INFO)\nhandler = logging.StreamHandler()\nhandler.setLevel(logging.INFO)\nformatter = logging.Formatter(\"%(asctime)s [%(pathname)s:%(lineno)s - \"\n                              \"%(funcName)s - %(levelname)s ] %(message)s\")\nhandler.setFormatter(formatter)\nlogger.addHandler(handler)\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script copies the 'srcdir'\n                                   data directory to output data directory 'dir'\n                                   while modifying the utterances so that there are\n                                   3 copies of each utterance: one with the same\n                                   speed, one with a higher speed (not more than\n                                   factor% faster) and one with a lower speed\n                                   (not more than factor% slower)\"\"\")\n    parser.add_argument('factor', type=float, default=12,\n                        help='Spacing (in percentage) between allowed lengths.')\n    parser.add_argument('srcdir', type=str,\n                        help='path to source data dir')\n    parser.add_argument('dir', type=str, help='output dir')\n    parser.add_argument('--coverage-factor', type=float, default=0.05,\n                        help=\"\"\"Percentage of durations not covered from each\n                             side of duration histogram.\"\"\")\n    parser.add_argument('--frame-shift', type=int, default=10,\n                        help=\"\"\"Frame shift in milliseconds.\"\"\")\n    
parser.add_argument('--frame-length', type=int, default=25,\n                        help=\"\"\"Frame length in milliseconds.\"\"\")\n    parser.add_argument('--frame-subsampling-factor', type=int, default=3,\n                        help=\"\"\"Chain frame subsampling factor.\n                             See steps/nnet3/chain/train.py\"\"\")\n    parser.add_argument('--speed-perturb', type=str, choices=['true','false'],\n                        default='true',\n                        help=\"\"\"If false, no speed perturbation will occur, i.e.\n                             only 1 copy of each utterance will be\n                             saved, which is modified to have an allowed length\n                             by using extend-wav-with-silence.\"\"\")\n    args = parser.parse_args()\n    args.speed_perturb = True if args.speed_perturb == 'true' else False\n    return args\n\nclass Utterance(object):\n    \"\"\" This class represents a Kaldi utterance\n        in a data directory like data/train\n    \"\"\"\n\n    def __init__(self, uid, wavefile, speaker, transcription, dur):\n        self.wavefile = (wavefile if wavefile.rstrip(\" \\t\\r\\n\").endswith('|') else\n                         'cat {} |'.format(wavefile))\n        self.speaker = speaker\n        self.transcription = transcription\n        self.id = uid\n        self.dur = float(dur)\n\n    def to_kaldi_utt_str(self):\n        return self.id + \" \" + self.transcription\n\n    def to_kaldi_wave_str(self):\n        return self.id + \" \" + self.wavefile\n\n    def to_kaldi_dur_str(self):\n        return \"{} {:0.3f}\".format(self.id, self.dur)\n\n\ndef read_kaldi_datadir(dir):\n    \"\"\" Read a data directory like\n        data/train as a list of utterances\n    \"\"\"\n\n    # check to make sure that no segments file exists as this script won't work\n    # with data directories which use a segments file.\n    if os.path.isfile(os.path.join(dir, 'segments')):\n        logger.info(\"The data 
directory '{}' seems to use a 'segments' file. \"\n                    \"This script does not yet support a 'segments' file. You'll need \"\n                    \"to use utils/data/extract_wav_segments_data_dir.sh \"\n                    \"to convert the data dir so it does not use a 'segments' file. \"\n                    \"Exiting...\".format(dir))\n        sys.exit(1)\n\n    logger.info(\"Loading the data from {}...\".format(dir))\n    utterances = []\n    wav_scp = read_kaldi_mapfile(os.path.join(dir, 'wav.scp'))\n    text = read_kaldi_mapfile(os.path.join(dir, 'text'))\n    utt2dur = read_kaldi_mapfile(os.path.join(dir, 'utt2dur'))\n    utt2spk = read_kaldi_mapfile(os.path.join(dir, 'utt2spk'))\n\n    num_fail = 0\n    for utt in wav_scp:\n        if utt in text and utt in utt2dur and utt in utt2spk:\n            utterances.append(Utterance(utt, wav_scp[utt], utt2spk[utt],\n                                  text[utt], utt2dur[utt]))\n        else:\n            num_fail += 1\n\n    if float(len(utterances)) / len(wav_scp) < 0.5:\n        logger.info(\"More than half your data is problematic. Try \"\n                    \"fixing using fix_data_dir.sh.\")\n        sys.exit(1)\n\n    logger.info(\"Successfully read {} utterances. 
Failed for {} \"\n                \"utterances.\".format(len(utterances), num_fail))\n    return utterances\n\n\ndef read_kaldi_mapfile(path):\n    \"\"\" Read any Kaldi mapping file - like text, .scp files, etc.\n    \"\"\"\n\n    m = {}\n    with open(path, 'r', encoding='latin-1') as f:\n        for line in f:\n            line = line.strip(\" \\t\\r\\n\")\n            sp_pos = line.find(' ')\n            key = line[:sp_pos]\n            val = line[sp_pos+1:]\n            m[key] = val\n    return m\n\ndef generate_kaldi_data_files(utterances, outdir):\n    \"\"\" Write out a list of utterances as Kaldi data files into an\n        output data directory.\n    \"\"\"\n\n    logger.info(\"Exporting to {}...\".format(outdir))\n    speakers = {}\n\n    with open(os.path.join(outdir, 'text'), 'w', encoding='latin-1') as f:\n        for utt in utterances:\n            f.write(utt.to_kaldi_utt_str() + \"\\n\")\n\n    with open(os.path.join(outdir, 'wav.scp'), 'w', encoding='latin-1') as f:\n        for utt in utterances:\n            f.write(utt.to_kaldi_wave_str() + \"\\n\")\n\n    with open(os.path.join(outdir, 'utt2dur'), 'w', encoding='latin-1') as f:\n        for utt in utterances:\n            f.write(utt.to_kaldi_dur_str() + \"\\n\")\n\n    with open(os.path.join(outdir, 'utt2spk'), 'w', encoding='latin-1') as f:\n        for utt in utterances:\n            f.write(utt.id + \" \" + utt.speaker + \"\\n\")\n            if utt.speaker not in speakers:\n                speakers[utt.speaker] = [utt.id]\n            else:\n                speakers[utt.speaker].append(utt.id)\n\n    with open(os.path.join(outdir, 'spk2utt'), 'w', encoding='latin-1') as f:\n        for s in speakers:\n            f.write(s + \" \")\n            for utt in speakers[s]:\n                f.write(utt + \" \")\n            f.write('\\n')\n\n    logger.info(\"Successfully wrote {} utterances to data \"\n                \"directory '{}'\".format(len(utterances), outdir))\n\ndef 
find_duration_range(utterances, coverage_factor):\n    \"\"\"Given a list of utterances, find the start and end duration to cover\n\n     If we try to cover\n     all durations which occur in the training set, the number of\n     allowed lengths could become very large.\n\n     Returns\n     -------\n     start_dur: int\n     end_dur: int\n    \"\"\"\n    durs = []\n    for u in utterances:\n        durs.append(u.dur)\n    durs.sort()\n    to_ignore_dur = 0\n    tot_dur = sum(durs)\n    for d in durs:\n        to_ignore_dur += d\n        if to_ignore_dur * 100.0 / tot_dur > coverage_factor:\n            start_dur = d\n            break\n    to_ignore_dur = 0\n    for d in reversed(durs):\n        to_ignore_dur += d\n        if to_ignore_dur * 100.0 / tot_dur > coverage_factor:\n            end_dur = d\n            break\n    if start_dur < 0.3:\n        start_dur = 0.3  # a hard limit to avoid too many allowed lengths --not critical\n    return start_dur, end_dur\n\n\ndef find_allowed_durations(start_dur, end_dur, args):\n    \"\"\"Given the start and end duration, find a set of\n       allowed durations spaced by args.factor%. 
Also write\n       out the list of allowed durations and the corresponding\n       allowed lengths (in frames) on disk.\n\n     Returns\n     -------\n     allowed_durations: list of allowed durations (in seconds)\n    \"\"\"\n\n    allowed_durations = []\n    d = start_dur\n    with open(os.path.join(args.dir, 'allowed_durs.txt'), 'w', encoding='latin-1') as durs_fp, \\\n           open(os.path.join(args.dir, 'allowed_lengths.txt'), 'w', encoding='latin-1') as lengths_fp:\n        while d < end_dur:\n            length = int(d * 1000 - args.frame_length) / args.frame_shift + 1\n            if length % args.frame_subsampling_factor != 0:\n                length = (args.frame_subsampling_factor *\n                              (length // args.frame_subsampling_factor))\n                d = (args.frame_shift * (length - 1.0)\n                     + args.frame_length + args.frame_shift / 2) / 1000.0\n            allowed_durations.append(d)\n            durs_fp.write(\"{}\\n\".format(d))\n            lengths_fp.write(\"{}\\n\".format(int(length)))\n            d *= args.factor\n    return allowed_durations\n\n\n\ndef perturb_utterances(utterances, allowed_durations, args):\n    \"\"\"Given a set of utterances and a set of allowed durations, generate\n       an extended set of perturbed utterances (all having an allowed duration)\n\n     Returns\n     -------\n     perturbed_utterances: list of perturbed utterances\n    \"\"\"\n\n    perturbed_utterances = []\n    for u in utterances:\n        # find i such that: allowed_durations[i-1] <= u.dur <= allowed_durations[i]\n        # i = len(allowed_durations) --> no upper bound\n        # i = 0         --> no lower bound\n        if u.dur < allowed_durations[0]:\n            i = 0\n        elif u.dur > allowed_durations[-1]:\n            i = len(allowed_durations)\n        else:\n            i = 1\n            while i < len(allowed_durations):\n                if u.dur <= allowed_durations[i] and u.dur >= allowed_durations[i 
- 1]:\n                    break\n                i += 1\n\n        if i > 0 and args.speed_perturb:  # we have a smaller allowed duration\n            allowed_dur = allowed_durations[i - 1]\n            speed = u.dur / allowed_dur\n            if max(speed, 1.0/speed) > args.factor:  # this could happen for very short/long utterances\n                continue\n            u1 = copy.deepcopy(u)\n            u1.id = 'pv1-' + u.id\n            u1.speaker = 'pv1-' + u.speaker\n            u1.wavefile = '{} sox -t wav - -t wav - speed {} | '.format(u.wavefile, speed)\n            u1.dur = allowed_dur\n            perturbed_utterances.append(u1)\n\n\n        if i < len(allowed_durations):  # we have a larger allowed duration\n            allowed_dur2 = allowed_durations[i]\n            speed = u.dur / allowed_dur2\n            if max(speed, 1.0/speed) > args.factor:\n                continue\n\n            ## Add two versions for the second allowed_duration\n            ## one version is by using speed modification using sox\n            ## the other is by extending by silence\n            if args.speed_perturb:\n                u2 = copy.deepcopy(u)\n                u2.id = 'pv2-' + u.id\n                u2.speaker = 'pv2-' + u.speaker\n                u2.wavefile = '{} sox -t wav - -t wav - speed {} | '.format(u.wavefile, speed)\n                u2.dur = allowed_dur2\n                perturbed_utterances.append(u2)\n\n            delta = allowed_dur2 - u.dur\n            if delta <= 1e-4:\n                continue\n            u3 = copy.deepcopy(u)\n            u3.id = 'pv3-' + u.id\n            u3.speaker = 'pv3-' + u.speaker\n            u3.wavefile = '{} extend-wav-with-silence --extra-silence-length={} - - | '.format(u.wavefile, delta)\n            u3.dur = allowed_dur2\n            perturbed_utterances.append(u3)\n    return perturbed_utterances\n\n\n\ndef main():\n    args = get_args()\n    args.factor = 1.0 + args.factor / 100.0\n\n    if not 
os.path.exists(args.dir):\n        os.makedirs(args.dir)\n\n    utterances = read_kaldi_datadir(args.srcdir)\n\n    start_dur, end_dur = find_duration_range(utterances, args.coverage_factor)\n    logger.info(\"Durations in the range [{},{}] will be covered. \"\n                \"Coverage rate: {}%\".format(start_dur, end_dur,\n                                      100.0 - args.coverage_factor * 2))\n    logger.info(\"There will be {} unique allowed lengths \"\n                \"for the utterances.\".format(int(math.log(end_dur / start_dur)/\n                                                 math.log(args.factor))))\n\n    allowed_durations = find_allowed_durations(start_dur, end_dur, args)\n\n    perturbed_utterances = perturb_utterances(utterances, allowed_durations,\n                                              args)\n\n    generate_kaldi_data_files(perturbed_utterances, args.dir)\n\n\nif __name__ == '__main__':\n      main()\n"
  },
  {
    "path": "egs/utils/data/remove_dup_utts.sh",
    "content": "#!/usr/bin/env bash\n\n# Remove excess utterances once they appear more than a specified\n# number of times with the same transcription, in a data set.\n# E.g. useful for removing excess \"uh-huh\" from training.\n\nif [ $# != 3 ]; then\n  echo \"Usage: remove_dup_utts.sh max-count <src-data-dir> <dest-data-dir>\"\n  echo \"e.g.: remove_dup_utts.sh 10 data/train data/train_nodup\"\n  echo \"This script is used to filter out utterances that come from over-represented\"\n  echo \"transcriptions (such as 'uh-huh'), by limiting the number of repetitions of\"\n  echo \"any given word-sequence to a specified value.  It's often used to get\"\n  echo \"subsets for early stages of training.\"\n  exit 1;\nfi\n\nmaxcount=$1\nsrcdir=$2\ndestdir=$3\nmkdir -p $destdir\n\n[ ! -f $srcdir/text ] && echo \"$0: Invalid input directory $srcdir\" && exit 1;\n\n! mkdir -p $destdir && echo \"$0: could not create directory $destdir\" && exit 1;\n\n! [ \"$maxcount\" -gt 1 ] && echo \"$0: invalid max-count '$maxcount'\" && exit 1;\n\ncp $srcdir/* $destdir\ncat $srcdir/text | \\\n  perl -e '\n  $maxcount = shift @ARGV;\n  @all = ();\n   $p1 = 103349; $p2 = 71147; $k = 0;\n   sub random { # our own random number generator: predictable.\n     $k = ($k + $p1) % $p2;\n     return ($k / $p2);\n  }\n  while(<>) {\n    push @all, $_;\n    @A = split(\" \", $_);\n    shift @A;\n    $text = join(\" \", @A);\n    $count{$text} ++;\n  }\n  foreach $line (@all) {\n    @A = split(\" \", $line);\n    shift @A;\n    $text = join(\" \", @A);\n    $n = $count{$text};\n    if ($n < $maxcount || random() < ($maxcount / $n)) {\n      print $line;\n    }\n  }'  $maxcount >$destdir/text\n\necho \"Reduced number of utterances from `cat $srcdir/text | wc -l` to `cat $destdir/text | wc -l`\"\n\necho \"Using fix_data_dir.sh to reconcile the other files.\"\nutils/fix_data_dir.sh $destdir\nrm -r $destdir/.backup\n\nexit 0\n"
  },
  {
    "path": "egs/utils/data/resample_data_dir.sh",
    "content": "#! /bin/bash\n\n# Copyright 2016  Vimal Manohar\n#           2018  Xiaohui Zhang\n# Apache 2.0.\n\nif [ $# -ne 2 ]; then\n  echo \"This script adds a sox line in wav.scp to resample the audio at a \"\n  echo \"different sampling-rate\"\n  echo \"Usage: $0 <frequency> <data-dir>\"\n  echo \" e.g.: $0 8000 data/dev\"\n  exit 1\nfi\n\nfreq=$1\ndir=$2\n\nsox=`which sox` || { echo \"Could not find sox in PATH\"; exit 1; }\n\nif [ -f $dir/feats.scp ]; then\n  mkdir -p $dir/.backup\n  mv $dir/feats.scp $dir/.backup/\n  if [ -f $dir/cmvn.scp ]; then\n    mv $dir/cmvn.scp $dir/.backup/\n  fi\n  echo \"$0: feats.scp already exists. Moving it to $dir/.backup\"\nfi\n\n# After resampling we cannot compute utt2dur from wav.scp any more,\n# so we create utt2dur now, in case it's needed later\nif [ ! -s $dir/utt2dur ]; then\n  utils/data/get_utt2dur.sh $dir 1>&2 || exit 1;\nfi\n\nmv $dir/wav.scp $dir/wav.scp.tmp\ncat $dir/wav.scp.tmp | python -c \"import sys\nfor line in sys.stdin.readlines():\n  splits = line.strip().split()\n  if splits[-1] == '|':\n    out_line = line.strip() + ' $sox -t wav - -c 1 -b 16 -t wav - rate $freq |'\n  else:\n    out_line = '{0} cat {1} | $sox -t wav - -c 1 -b 16 -t wav - rate $freq |'.format(splits[0], ' '.join(splits[1:]))\n  print (out_line)\" > ${dir}/wav.scp\nrm $dir/wav.scp.tmp\n"
  },
  {
    "path": "egs/utils/data/shift_and_combine_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2017  Hossein Hadian\n\n# Apache 2.0\n\nwrite_utt2orig=              # if provided, this script will write\n                             # a mapping of shifted utterance ids\n                             # to the original ones into the file\n                             # specified by this option\n\necho \"$0 $@\"  # Print the command line for logging\nif [ -f path.sh ]; then . ./path.sh; fi\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 <frame-subsampling-factor> <srcdir> <destdir>\"\n  echo \"e.g.: $0 3 data/train data/train_fs3\"\n  echo \"For use in perturbing data for discriminative training and alignment of\"\n  echo \"frame-subsampled systems, this script uses utils/data/shift_feats.sh\"\n  echo \"and utils/data/combine_data.sh to shift the features\"\n  echo \"<frame-subsampling-factor> different ways and combine them.\"\n  echo \"E.g. if <frame-subsampling-factor> is 3, this script will combine\"\n  echo \"the data frame-shifted by -1, 0 and 1 (c.f. shift-feats).\"\n  exit 1\nfi\n\nframe_subsampling_factor=$1\nsrcdir=$2\ndestdir=$3\n\nif [ ! -f $srcdir/feats.scp ]; then\n  echo \"$0: expected $srcdir/feats.scp to exist\"\n  exit 1\nfi\n\nif [ -f $destdir/feats.scp ]; then\n  echo \"$0: $destdir/feats.scp already exists: refusing to run this (please delete $destdir/feats.scp if you want this to run)\"\n  exit 1\nfi\n\nif [ ! -z $write_utt2orig ]; then\n  awk '{print $1 \" \" $1}' $srcdir/feats.scp >$write_utt2orig\nfi\n\ntmp_shift_destdirs=()\nfor frame_shift in `seq $[-(frame_subsampling_factor/2)] $[-(frame_subsampling_factor/2) + frame_subsampling_factor - 1]`; do\n  if [ \"$frame_shift\" == 0 ]; then continue; fi\n  utils/data/shift_feats.sh $frame_shift $srcdir ${destdir}_fs$frame_shift || exit 1\n  tmp_shift_destdirs+=(\"${destdir}_fs$frame_shift\")\n  if [ ! 
-z $write_utt2orig ]; then\n    awk -v prefix=\"fs$frame_shift-\" '{printf(\"%s%s %s\\n\", prefix, $1, $1);}' $srcdir/feats.scp >>$write_utt2orig\n  fi  \ndone\nutils/data/combine_data.sh $destdir $srcdir ${tmp_shift_destdirs[@]} || exit 1\nrm -r ${tmp_shift_destdirs[@]}\n\nutils/validate_data_dir.sh $destdir\n\nsrc_nf=`cat $srcdir/feats.scp | wc -l`\ndest_nf=`cat $destdir/feats.scp | wc -l`\nif [ $[src_nf*frame_subsampling_factor] -ne $dest_nf ]; then\n  echo \"There was a problem. Expected number of feature lines in destination dir to be $[src_nf*frame_subsampling_factor];\"\n  exit 1;\nfi\n\necho \"$0: Successfully generated $frame_subsampling_factor-way shifted version of data in $srcdir, in $destdir\"\n"
  },
  {
    "path": "egs/utils/data/shift_feats.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2016    Vimal Manohar\n#           2017    Hossein Hadian\n# Apache 2.0\n\necho \"$0 $@\"  # Print the command line for logging\nif [ -f path.sh ]; then . ./path.sh; fi\n. parse_options.sh || exit 1;\n\nif [ $# != 3 ]; then\n  echo \" Usage: $0 <frame-shift> <srcdir> <destdir>\"\n  echo \"e.g.: $0 -1 data/train data/train_fs-1\"\n  echo \"The script creates a new data directory with the features modified\"\n  echo \"using the program shift-feats with the specified frame-shift.\"\n  echo \"This program automatically adds the prefix 'fs<frame-shift>-' to the\"\n  echo \"utterance and speaker names. See also utils/data/shift_and_combine_feats.sh\"\n  exit 1\nfi\n\nframe_shift=$1\nsrcdir=$2\ndestdir=$3\n\n\nif [ \"$destdir\" == \"$srcdir\" ]; then\n  echo \"$0: this script requires <srcdir> and <destdir> to be different.\"\n  exit 1\nfi\n\nif [ ! -f $srcdir/feats.scp ]; then\n  echo \"$0: no such file $srcdir/feats.scp\"\n  exit 1;\nfi\n\nutt_prefix=\"fs$frame_shift-\"\nspk_prefix=\"fs$frame_shift-\"\n\nmkdir -p $destdir\nutils/copy_data_dir.sh --utt-prefix $utt_prefix --spk-prefix $spk_prefix \\\n  $srcdir $destdir\n\nif grep --quiet \"'\" $srcdir/feats.scp; then\n  echo \"$0: the input features already use single quotes. Can't proceed.\"\n  exit 1;\nfi\n\nawk -v shift=$frame_shift 'NF == 2 {uttid=$1; feat=$2; qt=\"\";} \\\nNF > 2 {idx=index($0, \" \"); uttid=$1; feat=substr($0, idx + 1); qt=\"\\x27\";} \\\nNF {print uttid \" shift-feats --print-args=false --shift=\" shift, qt feat qt \" - |\";}' \\\n  $destdir/feats.scp >$destdir/feats_shifted.scp\nmv -f $destdir/feats_shifted.scp $destdir/feats.scp\n\necho \"$0: Done\"\n\n"
  },
  {
    "path": "egs/utils/data/subsegment_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0\n\n\n# This script allows you to specify a 'segments' file with segments\n# relative to existing utterances, with lines like\n#  utterance_foo-1 utterance_foo 7.5 8.2\n#  utterance_foo-2 utterance_foo 8.9 10.1\n# and a 'text' file with sub-segmented text like\n#  utterance_foo-1 hello there\n#  utterance_foo-2 how are you\n# and combine this with an existing data-dir that was all relative\n# to the original utterance-ids like 'utterance_foo', producing\n# a new subsegmented output directory.\n#\n# It does the right thing for you on the various files that the\n# data directory contained (except you have to recreate\n# the CMVN stats).\n\n\nsegment_end_padding=0.0\ncmd=run.pl\nnj=1\n\n. utils/parse_options.sh\n\nif [ $# != 4 ] && [ $# != 3 ]; then\n  echo \"Usage: \"\n  echo \"  $0 [options] <srcdir> <subsegments-file> [<text-file>] <destdir>\"\n  echo \"This script sub-segments a data directory.  <subsegments-file> is to\"\n  echo \"have lines of the form <new-utt> <old-utt> <start-time-within-old-utt> <end-time-within-old-utt>\"\n  echo \"and <text-file> is of the form <new-utt> <word1> <word2> ... <wordN>.\"\n  echo \"This script appropriately combines the <subsegments-file> with the original\"\n  echo \"segments file, if necessary, and if not, creates a segments file.\"\n  echo \"e.g.:\"\n  echo \" $0 data/train [options] exp/tri3b_resegment/segments exp/tri3b_resegment/text data/train_resegmented\"\n  echo \" Options:\"\n  echo \"  --segment-end-padding <padding-time>       # e.g. 0.02.  Default 0.0.  If provided,\"\n  echo \"                                             # we will add this value to the end times of <destdir>/segments\"\n  echo \"                                             # when creating it.  This can be useful to account for\"\n  echo \"                                             # end effects in feature generation.  
The reason this is\"\n  echo \"                                             # not just applied to the input segments file, is that\"\n  echo \"                                             # for purposes of computing the num-frames of the parts of\"\n  echo \"                                             # matrices in feats.scp, the padding should not be done.\"\n  echo \"  See also: resolve_ctm_overlaps.py\"\n  exit 1;\nfi\n\n\nexport LC_ALL=C\n\nsrcdir=$1\nsubsegments=$2\n\nadd_subsegment_text=false\nif [ $# -eq 4 ]; then\n  new_text=$3\n  dir=$4\n  add_subsegment_text=true\n\n  if [ ! -f \"$new_text\" ]; then\n    echo \"$0: no such file $new_text\"\n    exit 1\n  fi\n\nelse\n  dir=$3\nfi\n\nfor f in \"$subsegments\" \"$srcdir/utt2spk\"; do\n  if [ ! -f \"$f\" ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\ndone\n\nif ! mkdir -p $dir; then\n  echo \"$0: failed to create directory $dir\"\n  exit 1\nfi\n\nif $add_subsegment_text; then\n  if ! cmp <(awk '{print $1}' <$subsegments)  <(awk '{print $1}' <$new_text); then\n    echo \"$0: expected the first fields of the files $subsegments and $new_text to be identical\"\n    exit 1\n  fi\nfi\n\n# check the format of the subsegments file\nif ! awk '{if (NF != 4 || !($4 > $3)) { print(\"Bad line: \" $0); exit(1) } }' <$subsegments; then\n  echo \"$0: failed checking subsegments file $subsegments\"\n  exit 1\nfi\n\nset -e\nset -o pipefail\n\n# Create a mapping from the new to old utterances.  This file will be deleted later.\nawk '{print $1, $2}' < $subsegments > $dir/new2old_utt\n\n# Create the new utt2spk file [just map from the second field]\nutils/apply_map.pl -f 2 $srcdir/utt2spk < $dir/new2old_utt >$dir/utt2spk\n# .. 
and the new spk2utt file.\nutils/utt2spk_to_spk2utt.pl  <$dir/utt2spk >$dir/spk2utt\n\nif $add_subsegment_text; then\n  # the new text file is just what the user provides.\n  cp $new_text $dir/text\nfi\n\n# copy the source wav.scp\ncp $srcdir/wav.scp $dir\nif [ -f $srcdir/reco2file_and_channel ]; then\n  cp $srcdir/reco2file_and_channel $dir\nfi\n\n# copy the source reco2dur\nif [ -f $srcdir/reco2dur ]; then\n  cp $srcdir/reco2dur $dir\nfi\n\nif [ -f $srcdir/segments ]; then\n  # we have to map the segments file.\n  # What's going on below is a little subtle.\n  # $srcdir/segments has lines like: <old-utt-id> <recording-id> <start-time> <end-time>\n  # and $subsegments has lines like: <new-utt-id> <old-utt-id> <start-time> <end-time>\n  # The apply-map command replaces <old-utt-id> [the 2nd field of $subsegments]\n  # with <recording-id> <start-time> <end-time>.\n  # so after that first command we have lines like\n  # <new-utt-id> <recording-id> <start-time-of-old-utt-within-recording> <end-time-old-utt-within-recording> \\\n  #   <start-time-of-new-utt-within-old-utt> <end-time-of-new-utt-within-old-utt>\n  # which the awk command turns into:\n  # <new-utt-id> <recording-id> <start-time-of-new-utt-within-recording> <end-time-of-new-utt-within-recording>\n  utils/apply_map.pl -f 2 $srcdir/segments <$subsegments | \\\n    awk -v pad=$segment_end_padding '{ print $1, $2, $5+$3, $6+$3+pad; }' >$dir/segments\nelse\n  # the subsegments file just becomes the segments file.\n  awk -v pad=$segment_end_padding '{$4 += pad; print}' <$subsegments >$dir/segments\nfi\n\nif [ -f $srcdir/utt2uniq ]; then\n  utils/apply_map.pl -f 2 $srcdir/utt2uniq <$dir/new2old_utt >$dir/utt2uniq\nfi\n\nif [ -f $srcdir/feats.scp ]; then\n  # We want to avoid recomputing the features.   We'll use sub-matrices of the\n  # original feature matrices, using the [] notation that is available for\n  # matrices in Kaldi.\n  if [ ! 
-s $srcdir/frame_shift ]; then\n    frame_shift=$(utils/data/get_frame_shift.sh $srcdir) || exit 1\n  else\n    frame_shift=$(cat $srcdir/frame_shift)\n  fi\n  echo \"$0: note: frame shift is $frame_shift [affects feats.scp]\"\n\n  # The subsegments format is <new-utt-id> <old-utt-id> <start-time> <end-time>.\n  # e.g. 'utt_foo-1 utt_foo 7.21 8.93'\n  # The first awk command replaces this with the format:\n  # <new-utt-id> <old-utt-id> <first-frame> <last-frame>\n  # e.g. 'utt_foo-1 utt_foo 721 892'\n  # and the apply_map.pl command replaces 'utt_foo' (the 2nd field) with its corresponding entry\n  # from the original feats.scp, so we get a line like:\n  # e.g. 'utt_foo-1 foo-bar.ark:514231 721 892'\n  # Note: the reason we subtract one from the last time is that it's going to\n  # represent the 'last' frame, not the 'end' frame [i.e. not one past the last],\n  # in the matlab-like, but zero-indexed [first:last] notation.  For instance, a segment with 1 frame\n  # would have start-time 0.00 and end-time 0.01, which would become the frame range\n  # [0:0]\n  # The second awk command turns this into something like\n  # utt_foo-1 foo-bar.ark:514231[721:892]\n  # It has to be a bit careful because the format actually allows for more general things\n  # like pipes that might contain spaces, so it has to be able to produce output like the\n  # following:\n  # utt_foo-1 some command|[721:892]\n  # The 'end' frame is ensured to not exceed the feature archive size of\n  # <old-utt-id>. 
This is done using the script fix_subsegment_feats.pl.\n  # e.g. if the number of frames in foo-bar.ark is 891, then the features are\n  # truncated to that many frames.\n  # utt_foo-1 foo-bar.ark:514231[721:890]\n  # Lastly, utils/data/normalize_data_range.pl will only do something nontrivial if\n  # the original data-dir already had data-ranges in square brackets.\n\n  # Here, we compute the maximum 'end' frame allowed for each <new-utt-id>.\n  # This is equal to the number of frames in the feature archive for <old-utt-id>.\n  if [ ! -f $srcdir/utt2num_frames ]; then\n    echo \"$0: WARNING: Could not find $srcdir/utt2num_frames. It might take a long time to run get_utt2num_frames.sh.\"\n    echo \"Increase the number of jobs or write this file while extracting features by passing --write-utt2num-frames true to steps/make_mfcc.sh etc.\"\n  fi\n  utils/data/get_utt2num_frames.sh --cmd \"$cmd\" --nj $nj $srcdir\n  awk '{print $1\" \"$2}' $subsegments | \\\n    utils/apply_map.pl -f 2 $srcdir/utt2num_frames > \\\n    $dir/utt2max_frames\n\n  awk -v s=$frame_shift '{print $1, $2, int(($3/s)+0.5), int(($4/s)-0.5);}' <$subsegments| \\\n    utils/apply_map.pl -f 2 $srcdir/feats.scp | \\\n    awk '{p=NF-1; for (n=1;n<NF-2;n++) printf(\"%s \", $n); k=NF-2; l=NF-1; printf(\"%s[%d:%d]\\n\", $k, $l, $NF)}' | \\\n    utils/data/fix_subsegment_feats.pl $dir/utt2max_frames | \\\n    utils/data/normalize_data_range.pl >$dir/feats.scp || { echo \"Failed to create $dir/feats.scp\" && exit 1; }\n\n  # Parse the frame ranges from feats.scp, which is in the form of [first-frame:last-frame]\n  # and write the number-of-frames = last-frame - first-frame + 1 for the utterance.\n  cat $dir/feats.scp | perl -ne 'm/^(\\S+) .+\\[(\\d+):(\\d+)\\]$/; print \"$1 \" . ($3-$2+1) . 
\"\\n\"' > \\\n    $dir/utt2num_frames\n\n  # Here we add frame ranges to the elements of vad.scp, as we did for rows of feats.scp above.\n  if [ -f $srcdir/vad.scp ]; then\n    cat $subsegments | awk -v s=$frame_shift '{print $1, $2, int(($3/s)+0.5), int(($4/s)-0.5);}' | \\\n      utils/apply_map.pl -f 2 $srcdir/vad.scp | \\\n      awk '{p=NF-1; for (n=1;n<NF-2;n++) printf(\"%s \", $n); k=NF-2; l=NF-1; printf(\"%s[%d:%d]\\n\", $k, $l, $NF)}' | \\\n      utils/data/fix_subsegment_feats.pl $dir/utt2max_frames | \\\n      utils/data/normalize_data_range.pl >$dir/vad.scp\n  fi\nfi\n\n\nif [ -f $dir/cmvn.scp ]; then\n  rm $dir/cmvn.scp\n  echo \"$0: warning: removing $dir/cmvn.scp, you will have to regenerate it from the features.\"\nfi\n\n# remove the utt2dur file in case it's now invalid-- it be regenerated from the segments file.\nrm $dir/utt2dur 2>/dev/null || true\n\nif [ -f $srcdir/spk2gender ]; then\n  cp $srcdir/spk2gender $dir\nfi\nif [ -f $srcdir/glm ]; then\n  cp $srcdir/glm $dir\nfi\nif [ -f $srcdir/stm ]; then\n  cp $srcdir/stm $dir\nfi\n\nfor f in ctm; do\n  if [ -f $srcdir/$f ]; then\n    echo \"$0: not copying $srcdir/$f to $dir because sub-segmenting it is \"\n    echo \" ... not implemented yet (and probably it's not needed.)\"\n  fi\ndone\n\nrm $dir/new2old_utt\n\necho \"$0: subsegmented data from $srcdir to $dir\"\n"
  },
  {
    "path": "egs/utils/dict_dir_add_pronprobs.sh",
    "content": "#!/usr/bin/env bash\n\n# Apache 2.0.\n# Copyright  2014  Johns Hopkins University (author: Daniel Povey)\n#            2014  Guoguo Chen\n#            2015  Hainan Xu\n\n\n# The thing that this script implements is described in the paper:\n# \"PRONUNCIATION AND SILENCE PROBABILITY MODELING FOR ASR\"\n# by Guoguo Chen et al, see\n# http://www.danielpovey.com/files/2015_interspeech_silprob.pdf\n\n. ./path.sh || exit 1;\n\n# begin configuration\nmax_normalize=true\n# end configuration\n\n. utils/parse_options.sh || exit 1;\n\nset -e\n\nif [[ $# -ne 3 && $# -ne 5 ]]; then\n  echo \"Usage: $0 [options] <input-dict-dir> <input-pron-counts> \\\\\"\n  echo \"          [input-sil-counts] [input-bigram-counts] <output-dict-dir>\"\n  echo \" e.g.: $0 data/local/dict \\\\\"\n  echo \"          exp/tri3/pron_counts_nowb.txt exp/tri3/sil_counts_nowb.txt \\\\\"\n  echo \"          exp/tri3/pron_bigram_counts_nowb.txt data/local/dict_prons\"\n  echo \" e.g.: $0 data/local/dict \\\\\"\n  echo \"          exp/tri3/pron_counts_nowb.txt data/local/dict_prons\"\n  echo \"\"\n  echo \"This script takes pronunciation counts, e.g. generated by aligning your training\"\n  echo \"data and getting the prons using steps/get_prons.sh, and creates a modified\"\n  echo \"dictionary directory with pronunciation probabilities. If the [input-sil-counts]\"\n  echo \"parameter is provided, it will also include silprobs in the generated lexicon.\"\n  echo \"Options:\"\n  echo \"   --max-normalize   (true|false)             # default true.  If true,\"\n  echo \"                                              # divide each pron-prob by the\"\n  echo \"                                              # most likely pron-prob per word.\"\n  exit 1;\nfi\n\nif [ $# -eq 3 ]; then\n  srcdir=$1\n  pron_counts=$2\n  dir=$3\nelif [ $# -eq 5 ]; then\n  srcdir=$1\n  pron_counts=$2\n  sil_counts=$3\n  bigram_counts=$4\n  dir=$5\nfi\n\nif [ ! 
-s $pron_counts ]; then\n  echo \"$0: expected file $pron_counts to exist\";\n  exit 1;\nfi\n\nmkdir -p $dir || exit 1;\nutils/validate_dict_dir.pl $srcdir;\n\nif [ -f $srcdir/lexicon.txt ]; then\n  src_lex=$srcdir/lexicon.txt\n  perl -ane 'print join(\" \", split(\" \", $_)) . \"\\n\";' < $src_lex |\\\n    sort -u > $dir/lexicon.txt\nelif [ -f $srcdir/lexiconp.txt ]; then\n  echo \"$0: removing the pron-probs from $srcdir/lexiconp.txt to create $dir/lexicon.txt\"\n  # the Perl command below normalizes the spaces (avoid double space).\n  src_lex=$srcdir/lexiconp.txt\n  awk '{$2 = \"\"; print $0;}' <$srcdir/lexiconp.txt |\\\n    perl -ane 'print join(\" \", split(\" \" ,$_)) . \"\\n\";' |\\\n    sort -u > $dir/lexicon.txt || exit 1;\nfi\n\n\n# the cat and awk commands below are implementing add-one smoothing.\ncat <(awk '{print 1, $0;}' <$dir/lexicon.txt) $pron_counts | \\\n  awk '{ count = $1; $1 = \"\"; word_count[$2] += count; pron_count[$0] += count; pron2word[$0] = $2; }\n       END{ for (p in pron_count) { word = pron2word[p]; num = pron_count[p]; den = word_count[word];\n          print num / den, p } } ' | \\\n    awk '{ word = $2; $2 = $1; $1 = word; print; }' | grep -v '^<eps>' |\\\n    sort -k1,1 -k2g,2 -k3 > $dir/lexiconp.txt\n\n\nn_old=$(wc -l <$dir/lexicon.txt)\nn_new=$(wc -l <$dir/lexiconp.txt)\n\nif [ \"$n_old\" != \"$n_new\" ]; then\n  echo \"$0: number of lines differs from $dir/lexicon.txt $n_old vs $dir/lexiconp.txt $n_new\"\n  echo \"Probably something went wrong (e.g. input prons were generated from a different lexicon\"\n  echo \"than $srcdir, or you used pron_counts.txt when you should have used pron_counts_nowb.txt\"\n  echo \"or something else.  
Make sure the prons in $src_lex $pron_counts look\"\n  echo \"the same.\"\n  exit 1;\nfi\n\nif $max_normalize; then\n  echo \"$0: normalizing pronprobs so maximum is 1 for each word.\"\n  cat $dir/lexiconp.txt | awk '{if ($2 > max[$1]) { max[$1] = $2; }} END{for (w in max) { print w, max[w]; }}' > $dir/maxp.txt\n\n  awk -v maxf=$dir/maxp.txt  'BEGIN{ while (getline <maxf) { max[$1] = $2; }} { $2 = $2 / max[$1]; print }' <$dir/lexiconp.txt > $dir/lexicon_tmp.txt || exit 1;\n\n  if ! [ $(wc -l  <$dir/lexicon_tmp.txt)  -eq $(wc -l  <$dir/lexiconp.txt) ]; then\n    echo \"$0: error max-normalizing pron-probs\"\n    exit 1;\n  fi\n  mv $dir/lexicon_tmp.txt $dir/lexiconp.txt\n  rm $dir/maxp.txt\nfi\n\n# Create $dir/lexiconp_silprob.txt and $dir/silprob.txt if silence counts file\n# exists. The format of $dir/lexiconp_silprob.txt is:\n# word pron-prob P(s_r | w)  F(s_l | w) F(n_l | w) pron\n#  where:  P(s_r | w) is the probability of silence to the right of the word\n#          F(s_l | w) is a factor which is greater than one if silence to the\n#                  left of the word is more than averagely probable.\n#          F(n_l | w) is a factor which is greater than one if nonsilence to the\n#                  left of the word is more than averagely probable.\nif [ -n \"$sil_counts\" ]; then\n  if [ ! 
-s \"$sil_counts\" ]; then\n    echo \"$0: expected file $sil_counts to exist and not empty\" && exit 1;\n  fi\n  cat $sil_counts | perl -e '\n    # Load silence counts\n    %sil_wpron = (); %nonsil_wpron = (); %wpron_sil = (); %wpron_nonsil = ();\n    $sil_count = 0; $nonsil_count = 0;\n    while (<STDIN>) {\n      chomp; @col = split; @col >= 5 || die \"'$0': bad line \\\"$_\\\"\\n\";\n      $wpron = join(\" \", @col[4..scalar(@col)-1]);\n      ($sil_wpron{$wpron}, $nonsil_wpron{$wpron},\n       $wpron_sil{$wpron}, $wpron_nonsil{$wpron}) = @col[0..3];\n      $sil_count += $sil_wpron{$wpron}; $nonsil_count += $nonsil_wpron{$wpron};\n    }\n\n    # Open files.\n    ($lexiconp, $bigram_counts, $lexiconp_silprob, $silprob) = @ARGV;\n    open(LP, \"<$lexiconp\") || die \"'$0': fail to open $lexiconp\\n\";\n    open(WPC, \"<$bigram_counts\") || die \"'$0': fail to open $bigram_counts\\n\";\n    open(SP, \">$silprob\") || die \"'$0': fail to open $silprob\\n\";\n    open(LPSP, \">$lexiconp_silprob\") ||\n      die \"'$0': fail to open $lexiconp_silprob\\n\";\n\n    # Computes P(s_r | w) in the paper.\n    $lambda2 = 2;             # Smoothing term, \\lambda_2 in the paper.\n    %P_w_sr = ();\n    %all_wprons = ();\n    $sil_prob = sprintf(\"%.2f\", $sil_count / ($sil_count + $nonsil_count));\n    while (<LP>) {\n      chomp; @col = split; @col >= 3 || die \"'$0': bad line \\\"$_\\\"\\n\";\n      $word = shift @col; $pron_prob = shift @col; $pron = join(\" \", @col);\n      unshift(@col, $word); $wpron = join(\" \", @col);\n\n      $wpron_sil_count = $wpron_sil{$wpron} + $sil_prob * $lambda2;\n      $wpron_nonsil_count = $wpron_nonsil{$wpron} + (1 - $sil_prob) * $lambda2;\n      $sil_after_prob = sprintf(\"%.2f\",\n        $wpron_sil_count / ($wpron_sil_count + $wpron_nonsil_count));\n      if ($sil_after_prob == \"0.00\") { $sil_after_prob = \"0.01\"; }\n      if ($sil_after_prob == \"1.00\") { $sil_after_prob = \"0.99\"; }\n      $P_w_sr{$wpron} = $sil_after_prob;\n\n  
    $all_wprons{$wpron} = $pron_prob;\n    }\n\n    # Reads C(v ? w) in the paper.\n    %wpron_pair_count = ();\n    while (<WPC>) {\n      chomp; @col = split(\"\\t\"); @col == 3 || die \"'$0': bad line \\\"$_\\\"\\n\";\n      $count = shift @col; $wpron1 = shift @col; $wpron2 = shift @col;\n      $key = \"${wpron1}\\t${wpron2}\";\n      $wpron_pair_count{$key} = $count;\n    }\n\n    # Computes \\bar{C}(s w) and \\bar{C}(n w) in the paper.\n    %bar_C_s_w = ();\n    %bar_C_n_w = ();\n    foreach my $key (keys %wpron_pair_count) {\n      $count = $wpron_pair_count{$key};\n      ($wpron1, $wpron2) = split(\"\\t\", $key);\n      $bar_C_s_w{$wpron2} += $count * $P_w_sr{$wpron1};\n      $bar_C_n_w{$wpron2} += $count * (1 - $P_w_sr{$wpron1});\n    }\n\n    # Computes F(s_l | w) and F(n_l | w) in the paper.\n    $lambda3 = 2;             # Smoothing term, \\lambda_3 in the paper.\n    foreach my $wpron (keys %all_wprons) {\n      @col = split(\" \", $wpron);\n      $word = shift @col;\n      $pron = join(\" \", @col);\n      $pron_prob = $all_wprons{$wpron};\n\n      $F_sl_w = ($sil_wpron{$wpron} + $lambda3) / ($bar_C_s_w{$wpron} + $lambda3);\n      $F_nl_w = ($nonsil_wpron{$wpron} + $lambda3) / ($bar_C_n_w{$wpron} + $lambda3);\n      $F_sl_w = sprintf(\"%.2f\", $F_sl_w);\n      $F_nl_w = sprintf(\"%.2f\", $F_nl_w);\n      if ($F_sl_w == \"0.00\") { $F_sl_w = \"0.01\"; }\n      if ($F_nl_w == \"0.00\") { $F_nl_w = \"0.01\"; }\n\n      print LPSP \"$word $pron_prob $P_w_sr{$wpron} $F_sl_w $F_nl_w $pron\\n\";\n    }\n\n    # Create silprob.txt\n    $BOS_sil_count = $wpron_sil{\"<s>\"} + $sil_prob * $lambda2;\n    $BOS_nonsil_count = $wpron_nonsil{\"<s>\"} + (1 - $sil_prob) * $lambda2;\n    $P_BOS_sr = sprintf(\"%.2f\", $BOS_sil_count / ($BOS_sil_count + $BOS_nonsil_count));\n    $F_sl_EOS = ($sil_wpron{\"</s>\"} + $lambda3) / ($bar_C_s_w{\"</s>\"} + $lambda3);\n    $F_nl_EOS = ($nonsil_wpron{\"</s>\"} + $lambda3) / ($bar_C_n_w{\"</s>\"} + $lambda3);\n    if ($P_BOS_sr == 
\"1.00\") { $P_BOS_sr = \"0.99\"; }\n    if ($P_BOS_sr == \"0.00\") { $P_BOS_sr = \"0.01\"; }\n    if ($F_sl_EOS == \"0.00\") { $F_sl_EOS = \"0.01\"; }\n    if ($F_nl_EOS == \"0.00\") { $F_nl_EOS = \"0.01\"; }\n    print SP \"<s> $P_BOS_sr\\n</s>_s $F_sl_EOS\\n</s>_n $F_nl_EOS\\noverall $sil_prob\\n\";\n    ' $dir/lexiconp.txt $bigram_counts $dir/lexiconp_silprob_unsorted.txt $dir/silprob.txt\n    sort -k1,1 -k2g,2 -k6 $dir/lexiconp_silprob_unsorted.txt > $dir/lexiconp_silprob.txt\nfi\n\n# now regenerate lexicon.txt from lexiconp.txt, to make sure the lines are\n# in the same order.\ncat $dir/lexiconp.txt | awk '{$2 = \"\"; print;}' | sed 's/  / /g' >$dir/lexicon.txt\n\n\n# add mandatory files.\nfor f in silence_phones.txt nonsilence_phones.txt; do\n  if [ ! -f $srcdir/$f ]; then\n    echo \"$0: expected $srcdir/$f to exist.\"\n    exit 1;\n  fi\n  cp $srcdir/$f $dir/ || exit 1;\ndone\n\n\n# add optional files (at least, I think these are optional; would have to check the docs).\nfor f in optional_silence.txt extra_questions.txt; do\n  if [ -f $srcdir/$f ]; then\n    cp $srcdir/$f $dir || exit 1;\n  fi\ndone\n\n\necho \"$0: produced dictionary directory with probabilities in $dir/\"\necho \"$0: validating $dir ..\"\nsleep 1\nutils/validate_dict_dir.pl $dir || exit 1;\n\n\necho \"Some low-probability prons include: \"\necho \"# sort -k2,2 -n $dir/lexiconp.txt  | head -n 8\"\n\nsort -k2,2 -n $dir/lexiconp.txt  | head -n 8\n\nexit 0\n"
  },
  {
    "path": "egs/utils/eps2disambig.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n#                2015 Guoguo Chen\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script replaces epsilon with #0 on the input side only, of the G.fst\n# acceptor.  \n\nwhile(<>){\n  if (/\\s+#0\\s+/) {\n    print STDERR \"$0: ERROR: LM has word #0, \" .\n                 \"which is reserved as disambiguation symbol\\n\";\n    exit 1;\n  }\n  s:^(\\d+\\s+\\d+\\s+)\\<eps\\>(\\s+):$1#0$2:;\n  print;\n}\n"
  },
  {
    "path": "egs/utils/filt.py",
    "content": "#!/usr/bin/env python\n\n# Apache 2.0\n\nfrom __future__ import print_function\nimport sys\n\nvocab=set()\nwith open(sys.argv[1]) as vocabfile:\n    for line in vocabfile:\n        vocab.add(line.strip())\n\nwith open(sys.argv[2]) as textfile:\n    for line in textfile:\n        print(\" \".join([word if word in vocab else '<UNK>' for word in line.strip().split()]))\n"
  },
  {
    "path": "egs/utils/filter_scp.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation\n#                     Johns Hopkins University (author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# This script takes a list of utterance-ids or any file whose first field\n# of each line is an utterance-id, and filters an scp\n# file (or any file whose \"n-th\" field is an utterance id), printing\n# out only those lines whose \"n-th\" field is in id_list. The index of\n# the \"n-th\" field is 1, by default, but can be changed by using\n# the -f <n> switch\n\n$exclude = 0;\n$field = 1;\n$shifted = 0;\n\ndo {\n  $shifted=0;\n  if ($ARGV[0] eq \"--exclude\") {\n    $exclude = 1;\n    shift @ARGV;\n    $shifted=1;\n  }\n  if ($ARGV[0] eq \"-f\") {\n    $field = $ARGV[1];\n    shift @ARGV; shift @ARGV;\n    $shifted=1\n  }\n} while ($shifted);\n\nif(@ARGV < 1 || @ARGV > 2) {\n  die \"Usage: filter_scp.pl [--exclude] [-f <field-to-filter-on>] id_list [in.scp] > out.scp \\n\" .\n      \"Prints only the input lines whose f'th field (default: first) is in 'id_list'.\\n\" .\n      \"Note: only the first field of each line in id_list matters.  
With --exclude, prints\\n\" .\n      \"only the lines that were *not* in id_list.\\n\" .\n      \"Caution: previously, the -f option was interpreted as a zero-based field index.\\n\" .\n      \"If your older scripts (written before Oct 2014) stopped working and you used the\\n\" .\n      \"-f option, add 1 to the argument.\\n\" .\n      \"See also: utils/filter_scps.pl .\\n\";\n}\n\n\n$idlist = shift @ARGV;\nopen(F, \"<$idlist\") || die \"Could not open id-list file $idlist\";\nwhile(<F>) {\n  @A = split;\n  @A>=1 || die \"Invalid id-list file line $_\";\n  $seen{$A[0]} = 1;\n}\n\nif ($field == 1) { # Treat this as special case, since it is common.\n  while(<>) {\n    $_ =~ m/\\s*(\\S+)\\s*/ || die \"Bad line $_, could not get first field.\";\n    # $1 is what we filter on.\n    if ((!$exclude && $seen{$1}) || ($exclude && !defined $seen{$1})) {\n      print $_;\n    }\n  }\n} else {\n  while(<>) {\n    @A = split;\n    @A > 0 || die \"Invalid scp file line $_\";\n    @A >= $field || die \"Invalid scp file line $_\";\n    if ((!$exclude && $seen{$A[$field-1]}) || ($exclude && !defined $seen{$A[$field-1]})) {\n      print $_;\n    }\n  }\n}\n\n# tests:\n# the following should print \"foo 1\"\n# ( echo foo 1; echo bar 2 ) | utils/filter_scp.pl <(echo foo)\n# the following should print \"bar 2\".\n# ( echo foo 1; echo bar 2 ) | utils/filter_scp.pl -f 2 <(echo 2)\n"
  },
  {
    "path": "egs/utils/filter_scps.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012   Microsoft Corporation\n#           2012-2016   Johns Hopkins University (author: Daniel Povey)\n#                2015   Xiaohui Zhang\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# This script takes multiple lists of utterance-ids or any file whose first field\n# of each line is an utterance-id, as filters, and filters an scp\n# file (or any file whose \"n-th\" field is an utterance id), printing\n# out only those lines whose \"n-th\" field is in filter. 
The index of\n# the \"n-th\" field is 1, by default, but can be changed by using\n# the -f <n> switch\n\n\n$field = 1;\n$shifted = 0;\n$print_warnings = 1;\ndo {\n  $shifted=0;\n  if ($ARGV[0] eq \"-f\") {\n    $field = $ARGV[1];\n    shift @ARGV; shift @ARGV;\n    $shifted = 1;\n  }\n  if ($ARGV[0] eq \"--no-warn\") {\n    $print_warnings = 0;\n    shift @ARGV;\n    $shifted = 1;\n  }\n} while ($shifted);\n\n\nif(@ARGV != 4) {\n  die \"Usage: utils/filter_scps.pl [-f <field-to-filter-on>] [--no-warn] <job-range-specifier> <filter-pattern> <input-scp> <output-scp-pattern>\\n\" .\n       \"e.g.:  utils/filter_scps.pl  JOB=1:10 data/train/split10/JOB/spk2utt data/train/feats.scp data/train/split10/JOB/feats.scp\\n\" .\n       \"similar to utils/filter_scp.pl, but it uses multiple filters and output multiple filtered files.\\n\".\n       \"The -f option specifies the field in <input-scp> that we filter on (default: 1).\\n\" .\n       \"See also: utils/filter_scp.pl\\n\";\n}\n\nif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. JOB=1:10\n  $jobname = $1;\n  $jobstart = $2;\n  $jobend = $3;\n  shift;\n  if ($jobstart > $jobend) {\n    die \"filter_scps.pl: invalid job range ${jobname}=${jobstart}:${jobend}\";\n  }\n} else {\n  die \"filter_scps.pl: bad job-range specifier $ARGV[0]: expected e.g. JOB=1:10\";\n}\n\n$idlist = shift @ARGV;\n\nif ($idlist !~ m/$jobname/ &&\n    $jobend > $jobstart) {\n  print STDERR \"filter_scps.pl: you are trying to use multiple filter files as filter patterns but \"\n    . \"you are providing just one filter file ($idlist)\\n\";\n  exit(1);\n}\n\n\n$infile = shift @ARGV;\n\n$outfile = shift @ARGV;\n\nif ($outfile !~ m/$jobname/ &&  $jobend > $jobstart) {\n  print STDERR \"filter_scps.pl: you are trying to create multiple filtered files but \"\n    . \"you are providing just one output file ($outfile)\\n\";\n  exit(1);\n}\n\n# This hashes from the id (e.g. utterance-id) to an array of the relevant\n# job-ids (which are integers).  
In any normal use-case, this array will contain\n# exactly one job-id for any given id, but we want to be agnostic about this.\n%id2jobs = ( );\n\n# Some variables that we set to produce a warning.\n$warn_uncovered = 0;\n$warn_multiply_covered = 0;\n\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $idlist_n = $idlist;\n  $idlist_n =~ s/$jobname/$jobid/g;\n\n  open(F, \"<$idlist_n\") || die \"Could not open id-list file $idlist_n\";\n\n  while(<F>) {\n    @A = split;\n    @A >= 1 || die \"Invalid line $_ in id-list file $idlist_n\";\n    $id = $A[0];\n    if (! defined $id2jobs{$id}) {\n      $id2jobs{$id} = [ ];  # new anonymous array.\n    }\n    push @{$id2jobs{$id}}, $jobid;\n  }\n  close(F);\n}\n\n# job2output hashes from the job-id, to an anonymous array containing\n# a sequence of output lines.\n%job2output = ( );\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $job2output{$jobid} = [ ];  # new anonymous array.\n}\n\nopen (F, \"< $infile\") or die \"Can't open $infile for read: $!\";\nwhile (<F>) {\n  if ($field == 1) {           # Treat this as special case, since it is common.\n    $_ =~ m/\\s*(\\S+)\\s*/ || die \"Bad line $_, could not get first field.\";\n    # $1 is what we filter on.\n    $id = $1;\n  } else {\n    @A = split;\n    @A > 0 || die \"Invalid scp file line $_\";\n    @A >= $field || die \"Invalid scp file line $_\";\n    $id = $A[$field-1];\n  }\n  if ( ! 
defined $id2jobs{$id}) {\n    $warn_uncovered = 1;\n  } else {\n    @jobs = @{$id2jobs{$id}};   # this dereferences the array reference.\n    if (@jobs > 1) {\n      $warn_multiply_covered = 1;\n    }\n    foreach $job_id (@jobs) {\n      if (!defined $job2output{$job_id}) {\n        die \"Likely code error\";\n      }\n      push @{$job2output{$job_id}}, $_;\n    }\n  }\n}\nclose(F);\n\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $outfile_n = $outfile;\n  $outfile_n =~ s/$jobname/$jobid/g;\n  open(FW, \">$outfile_n\") || die \"Could not open output file $outfile_n\";\n  $printed = 0;\n  foreach $line (@{$job2output{$jobid}}) {\n    print FW $line;\n    $printed = 1;\n  }\n  if (!$printed) {\n    print STDERR \"filter_scps.pl: warning: output to $outfile_n is empty\\n\";\n  }\n  close(FW);\n}\n\nif ($warn_uncovered && $print_warnings) {\n  print STDERR \"filter_scps.pl: warning: some input lines did not get output\\n\";\n}\nif ($warn_multiply_covered && $print_warnings) {\n  print STDERR \"filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt] \" .\n    join(\" \", @ARGV) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/utils/find_arpa_oovs.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\nif (  @ARGV < 1 && @ARGV > 2) {\n    die \"Usage: find_arpa_oovs.pl words.txt [lm.arpa]\\n\";\n    # This program finds words in the arpa file that are not symbols\n    # in the OpenFst-format symbol table words.txt.  It prints them\n    # on the standard output, one per line.\n}\n\n$symtab = shift @ARGV;\nopen(S, \"<$symtab\") || die \"Failed opening symbol table file $symtab\\n\";\nwhile(<S>){\n    @A = split(\" \", $_);\n    @A == 2 || die \"Bad line in symbol table file: $_\";\n    $seen{$A[0]} = 1;\n}\n\n$found_data=0;\n$curgram=0;\nwhile(<>) { # Find the \\data\\ marker.\n    if(m:^\\\\data\\\\\\s*$:) { $found_data=1; last; }\n}\n\nif ($found_data==0) {\n  print STDERR \"find_arpa_oovs.pl: found no \\\\data\\\\ marker in the ARPA input.\\n\";\n  exit(1);\n}\n\nwhile(<>) {\n    if(m/^\\\\(\\d+)\\-grams:\\s*$/) {\n        $curgram = $1;\n        if($curgram > 1) {\n            last; # This is an optimization as we can get the vocab from the 1-grams\n        }\n    } elsif($curgram > 0) {\n        @A = split(\" \", $_);\n        if(@A > 1) {\n            shift @A;\n            for($n=0;$n<$curgram;$n++) {\n                $word = $A[$n];\n                if(!defined $word) { print STDERR \"Unusual line $_ (line $.) 
in arpa file.\\n\"; }\n                $in_arpa{$word} = 1;\n            }\n        } else {\n            if(@A > 0 && $A[0] !~ m:\\\\end\\\\:) {\n                print STDERR \"Unusual line $_ (line $.) in arpa file\\n\";\n            }\n        }\n    }\n}\n\nforeach $w (keys %in_arpa) {\n    if(!defined $seen{$w} && $w ne \"<s>\" && $w ne \"</s>\") {\n        print \"$w\\n\";\n    }\n}\n"
  },
  {
    "path": "egs/utils/fix_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\n# This script makes sure that only the segments present in\n# all of \"feats.scp\", \"wav.scp\" [if present], segments [if present]\n# text, and utt2spk are present in any of them.\n# It puts the original contents of data-dir into\n# data-dir/.backup\n\ncmd=\"$@\"\n\nutt_extra_files=\nspk_extra_files=\n\n. utils/parse_options.sh\n\nif [ $# != 1 ]; then\n  echo \"Usage: utils/data/fix_data_dir.sh <data-dir>\"\n  echo \"e.g.: utils/data/fix_data_dir.sh data/train\"\n  echo \"This script helps ensure that the various files in a data directory\"\n  echo \"are correctly sorted and filtered, for example removing utterances\"\n  echo \"that have no features (if feats.scp is present)\"\n  exit 1\nfi\n\ndata=$1\n\nif [ -f $data/images.scp ]; then\n  image/fix_data_dir.sh $cmd\n  exit $?\nfi\n\nmkdir -p $data/.backup\n\n[ ! -d $data ] && echo \"$0: no such directory $data\" && exit 1;\n\n[ ! -f $data/utt2spk ] && echo \"$0: no such file $data/utt2spk\" && exit 1;\n\nset -e -o pipefail -u\n\ntmpdir=$(mktemp -d /tmp/kaldi.XXXX);\ntrap 'rm -rf \"$tmpdir\"' EXIT HUP INT PIPE TERM\n\nexport LC_ALL=C\n\nfunction check_sorted {\n  file=$1\n  sort -k1,1 -u <$file >$file.tmp\n  if ! cmp -s $file $file.tmp; then\n    echo \"$0: file $1 is not in sorted order or not unique, sorting it\"\n    mv $file.tmp $file\n  else\n    rm $file.tmp\n  fi\n}\n\nfor x in utt2spk spk2utt feats.scp text segments wav.scp cmvn.scp vad.scp \\\n    reco2file_and_channel spk2gender utt2lang utt2uniq utt2dur reco2dur utt2num_frames; do\n  if [ -f $data/$x ]; then\n    cp $data/$x $data/.backup/$x\n    check_sorted $data/$x\n  fi\ndone\n\n\nfunction filter_file {\n  filter=$1\n  file_to_filter=$2\n  cp $file_to_filter ${file_to_filter}.tmp\n  utils/filter_scp.pl $filter ${file_to_filter}.tmp > $file_to_filter\n  if ! 
cmp ${file_to_filter}.tmp  $file_to_filter >&/dev/null; then\n    length1=$(cat ${file_to_filter}.tmp | wc -l)\n    length2=$(cat ${file_to_filter} | wc -l)\n    if [ $length1 -ne $length2 ]; then\n      echo \"$0: filtered $file_to_filter from $length1 to $length2 lines based on filter $filter.\"\n    fi\n  fi\n  rm $file_to_filter.tmp\n}\n\nfunction filter_recordings {\n  # We call this once before the stage when we filter on utterance-id, and once\n  # after.\n\n  if [ -f $data/segments ]; then\n  # We have a segments file -> we need to filter this and the file wav.scp, and\n  # reco2file_and_utt, if it exists, to make sure they have the same list of\n  # recording-ids.\n\n    if [ ! -f $data/wav.scp ]; then\n      echo \"$0: $data/segments exists but not $data/wav.scp\"\n      exit 1;\n    fi\n    awk '{print $2}' < $data/segments | sort | uniq > $tmpdir/recordings\n    n1=$(cat $tmpdir/recordings | wc -l)\n    [ ! -s $tmpdir/recordings ] && \\\n      echo \"Empty list of recordings (bad file $data/segments)?\" && exit 1;\n    utils/filter_scp.pl $data/wav.scp $tmpdir/recordings > $tmpdir/recordings.tmp\n    mv $tmpdir/recordings.tmp $tmpdir/recordings\n\n\n    cp $data/segments{,.tmp}; awk '{print $2, $1, $3, $4}' <$data/segments.tmp >$data/segments\n    filter_file $tmpdir/recordings $data/segments\n    cp $data/segments{,.tmp}; awk '{print $2, $1, $3, $4}' <$data/segments.tmp >$data/segments\n    rm $data/segments.tmp\n\n    filter_file $tmpdir/recordings $data/wav.scp\n    [ -f $data/reco2file_and_channel ] && filter_file $tmpdir/recordings $data/reco2file_and_channel\n    [ -f $data/reco2dur ] && filter_file $tmpdir/recordings $data/reco2dur\n    true\n  fi\n}\n\nfunction filter_speakers {\n  # throughout this program, we regard utt2spk as primary and spk2utt as derived, so...\n  utils/utt2spk_to_spk2utt.pl $data/utt2spk > $data/spk2utt\n\n  cat $data/spk2utt | awk '{print $1}' > $tmpdir/speakers\n  for s in cmvn.scp spk2gender; do\n    f=$data/$s\n    if 
[ -f $f ]; then\n      filter_file $f $tmpdir/speakers\n    fi\n  done\n\n  filter_file $tmpdir/speakers $data/spk2utt\n  utils/spk2utt_to_utt2spk.pl $data/spk2utt > $data/utt2spk\n\n  for s in cmvn.scp spk2gender $spk_extra_files; do\n    f=$data/$s\n    if [ -f $f ]; then\n      filter_file $tmpdir/speakers $f\n    fi\n  done\n}\n\nfunction filter_utts {\n  cat $data/utt2spk | awk '{print $1}' > $tmpdir/utts\n\n  ! cat $data/utt2spk | sort | cmp - $data/utt2spk && \\\n    echo \"utt2spk is not in sorted order (fix this yourself)\" && exit 1;\n\n  ! cat $data/utt2spk | sort -k2 | cmp - $data/utt2spk && \\\n    echo \"utt2spk is not in sorted order when sorted first on speaker-id \" && \\\n    echo \"(fix this by making speaker-ids prefixes of utt-ids)\" && exit 1;\n\n  ! cat $data/spk2utt | sort | cmp - $data/spk2utt && \\\n    echo \"spk2utt is not in sorted order (fix this yourself)\" && exit 1;\n\n  if [ -f $data/utt2uniq ]; then\n    ! cat $data/utt2uniq | sort | cmp - $data/utt2uniq && \\\n      echo \"utt2uniq is not in sorted order (fix this yourself)\" && exit 1;\n  fi\n\n  maybe_wav=\n  maybe_reco2dur=\n  [ ! -f $data/segments ] && maybe_wav=wav.scp # wav indexed by utts only if segments does not exist.\n  [ -s $data/reco2dur ] && [ ! 
-f $data/segments ] && maybe_reco2dur=reco2dur # reco2dur indexed by utts\n\n  maybe_utt2dur=\n  if [ -f $data/utt2dur ]; then\n    cat $data/utt2dur | \\\n      awk '{ if (NF == 2 && $2 > 0) { print }}' > $data/utt2dur.ok || exit 1\n    maybe_utt2dur=utt2dur.ok\n  fi\n\n  maybe_utt2num_frames=\n  if [ -f $data/utt2num_frames ]; then\n    cat $data/utt2num_frames | \\\n      awk '{ if (NF == 2 && $2 > 0) { print }}' > $data/utt2num_frames.ok || exit 1\n    maybe_utt2num_frames=utt2num_frames.ok\n  fi\n\n  for x in feats.scp text segments utt2lang $maybe_wav $maybe_utt2dur $maybe_utt2num_frames; do\n    if [ -f $data/$x ]; then\n      utils/filter_scp.pl $data/$x $tmpdir/utts > $tmpdir/utts.tmp\n      mv $tmpdir/utts.tmp $tmpdir/utts\n    fi\n  done\n  rm $data/utt2dur.ok 2>/dev/null || true\n  rm $data/utt2num_frames.ok 2>/dev/null || true\n\n  [ ! -s $tmpdir/utts ] && echo \"fix_data_dir.sh: no utterances remained: not proceeding further.\" && \\\n    rm $tmpdir/utts && exit 1;\n\n\n  if [ -f $data/utt2spk ]; then\n    new_nutts=$(cat $tmpdir/utts | wc -l)\n    old_nutts=$(cat $data/utt2spk | wc -l)\n    if [ $new_nutts -ne $old_nutts ]; then\n      echo \"fix_data_dir.sh: kept $new_nutts utterances out of $old_nutts\"\n    else\n      echo \"fix_data_dir.sh: kept all $old_nutts utterances.\"\n    fi\n  fi\n\n  for x in utt2spk utt2uniq feats.scp vad.scp text segments utt2lang utt2dur utt2num_frames $maybe_wav $maybe_reco2dur $utt_extra_files; do\n    if [ -f $data/$x ]; then\n      cp $data/$x $data/.backup/$x\n      if ! cmp -s $data/$x <( utils/filter_scp.pl $tmpdir/utts $data/$x ) ; then\n        utils/filter_scp.pl $tmpdir/utts $data/.backup/$x > $data/$x\n      fi\n    fi\n  done\n\n}\n\nfilter_recordings\nfilter_speakers\nfilter_utts\nfilter_speakers\nfilter_recordings\n\nutils/utt2spk_to_spk2utt.pl $data/utt2spk > $data/spk2utt\n\necho \"fix_data_dir.sh: old files are kept in $data/.backup\"\n"
  },
  {
    "path": "egs/utils/format_lm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Arnab Ghoshal\n#           2010-2011  Microsoft Corporation\n#           2016-2018  Johns Hopkins University (author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\nset -e\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: $0 <lang_dir> <arpa-LM> <lexicon> <out_dir>\"\n  echo \"E.g.: $0 data/lang data/local/lm/foo.kn.gz data/local/dict/lexicon.txt data/lang_test\"\n  echo \"Convert ARPA-format language models to FSTs.\";\n  exit 1;\nfi\n\nlang_dir=$1\nlm=$2\nlexicon=$3\nout_dir=$4\nmkdir -p $out_dir\n\n[ -f ./path.sh ] && . ./path.sh\n\necho \"Converting '$lm' to FST\"\n\n# the -ef test checks if  source and target directory\n# are two different directories in the filesystem\n# if they are the same, the section guarded by the test\n# would be actually harmfull (deleting the phones/ subdirectory)\nif [ -e $out_dir ] && [ ! 
$lang_dir -ef $out_dir ] ; then\n  if [ -e $out_dir/phones ] ; then\n    rm -r $out_dir/phones\n  fi\n\n  for f in phones.txt words.txt topo L.fst L_disambig.fst phones oov.int oov.txt; do\n     cp -r $lang_dir/$f $out_dir\n  done\nfi\n\nlm_base=$(basename $lm '.gz')\ngunzip -c $lm \\\n  | arpa2fst --disambig-symbol=#0 \\\n             --read-symbol-table=$out_dir/words.txt - $out_dir/G.fst\nset +e\nfstisstochastic $out_dir/G.fst\nset -e\n# The output is like:\n# 9.14233e-05 -0.259833\n# we do expect the first of these 2 numbers to be close to zero (the second is\n# nonzero because the backoff weights make the states sum to >1).\n\n# Everything below is only for diagnostic.\n# Checking that G has no cycles with empty words on them (e.g. <s>, </s>);\n# this might cause determinization failure of CLG.\n# #0 is treated as an empty word.\nmkdir -p $out_dir/tmpdir.g\nawk '{if(NF==1){ printf(\"0 0 %s %s\\n\", $1,$1); }}\n     END{print \"0 0 #0 #0\"; print \"0\";}' \\\n     < \"$lexicon\" > $out_dir/tmpdir.g/select_empty.fst.txt\n\nfstcompile --isymbols=$out_dir/words.txt --osymbols=$out_dir/words.txt \\\n  $out_dir/tmpdir.g/select_empty.fst.txt \\\n  | fstarcsort --sort_type=olabel \\\n  | fstcompose - $out_dir/G.fst > $out_dir/tmpdir.g/empty_words.fst\n\nfstinfo $out_dir/tmpdir.g/empty_words.fst | grep cyclic | grep -w 'y' \\\n  && echo \"Language model has cycles with empty words\" && exit 1\n\nrm -r $out_dir/tmpdir.g\n\n\necho \"Succeeded in formatting LM: '$lm'\"\n"
  },
  {
    "path": "egs/utils/format_lm_sri.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Arnab Ghoshal\n# Copyright 2010-2011  Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# Begin configuration section.\nsrilm_opts=\"-subset -prune-lowprobs -unk -tolower\"\n# end configuration sections\n\n\n. utils/parse_options.sh\n\nif [ $# -ne 4 ] && [ $# -ne 3 ]; then\n  echo \"Usage: $0 [options] <lang-dir> <arpa-LM> [<lexicon>] <out-dir>\"\n  echo \"The <lexicon> argument is no longer needed but is supported for back compatibility\"\n  echo \"E.g.: utils/format_lm_sri.sh data/lang data/local/lm/foo.kn.gz data/local/dict/lexicon.txt data/lang_test\"\n  echo \"Converts ARPA-format language models to FSTs. Change the LM vocabulary using SRILM.\"\n  echo \"Note: if you want to just convert ARPA LMs to FSTs, there is a simpler way to do this\"\n  echo \"that doesn't require SRILM: see utils/format_lm.sh\"\n  echo \"options:\"\n  echo \" --help                 # print this message and exit\"\n  echo \" --srilm-opts STRING      # options to pass to SRILM tools (default: '$srilm_opts')\"\n  exit 1;\nfi\n\n\nif [ $# -eq 4 ] ; then\n  lang_dir=$1\n  lm=$2\n  lexicon=$3\n  out_dir=$4\nelse\n  lang_dir=$1\n  lm=$2\n  out_dir=$3\nfi\n\nfor f in $lm $lang_dir/words.txt; do\n  if [ ! -f $f ]; then\n    echo \"$0: expected input file $f to exist.\"\n    exit 1;\n  fi\ndone\n\n[ -f ./path.sh ] && . 
./path.sh\n\nloc=`which change-lm-vocab`\nif [ -z $loc ]; then\n  echo You appear to not have SRILM tools installed.\n  echo cd to $KALDI_ROOT/tools and run extras/install_srilm.sh.\n  exit 1\nfi\n\necho \"Converting '$lm' to FST\"\ntmpdir=$(mktemp -d /tmp/kaldi.XXXX);\ntrap 'rm -rf \"$tmpdir\"' EXIT\n\nmkdir -p $out_dir\ncp -r $lang_dir/* $out_dir || exit 1;\n\nawk '{print $1}' $out_dir/words.txt > $tmpdir/voc || exit 1;\n\n# Change the LM vocabulary to be the intersection of the current LM vocabulary\n# and the set of words in the pronunciation lexicon. This also renormalizes the\n# LM by recomputing the backoff weights, and remove those ngrams whose\n# probabilities are lower than the backed-off estimates.\nchange-lm-vocab -vocab $tmpdir/voc -lm $lm -write-lm - $srilm_opts | \\\n  arpa2fst --disambig-symbol=#0 \\\n           --read-symbol-table=$out_dir/words.txt - $out_dir/G.fst || exit 1\n\nfstisstochastic $out_dir/G.fst\n\n# The output is like:\n# 9.14233e-05 -0.259833\n# we do expect the first of these 2 numbers to be close to zero (the second is\n# nonzero because the backoff weights make the states sum to >1).\n\necho \"Succeeded in formatting LM '$lm' -> '$out_dir/G.fst'\"\n"
  },
  {
    "path": "egs/utils/gen_topo.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012  Johns Hopkins University (author: Daniel Povey)\n\n# Generate a topology file.  This allows control of the number of states in the\n# non-silence HMMs, and in the silence HMMs.\n\nif (@ARGV != 4) {\n  print STDERR \"Usage: utils/gen_topo.pl <num-nonsilence-states> <num-silence-states> <colon-separated-nonsilence-phones> <colon-separated-silence-phones>\\n\";\n  print STDERR \"e.g.:  utils/gen_topo.pl 3 5 4:5:6:7:8:9:10 1:2:3\\n\";\n  exit (1);\n}\n\n($num_nonsil_states, $num_sil_states, $nonsil_phones, $sil_phones) = @ARGV;\n\n( $num_nonsil_states >= 1 && $num_nonsil_states <= 100 ) ||\n  die \"Unexpected number of nonsilence-model states $num_nonsil_states\\n\";\n(( $num_sil_states == 1 || $num_sil_states >= 3) && $num_sil_states <= 100 ) ||\n  die \"Unexpected number of silence-model states $num_sil_states\\n\";\n\n$nonsil_phones =~ s/:/ /g;\n$sil_phones =~ s/:/ /g;\n$nonsil_phones =~ m/^\\d[ \\d]*$/ || die \"$0: bad arguments @ARGV\\n\";\n$sil_phones =~ m/^\\d[ \\d]*$/ || die \"$0: bad arguments @ARGV\\n\";\n\nprint \"<Topology>\\n\";\nprint \"<TopologyEntry>\\n\";\nprint \"<ForPhones>\\n\";\nprint \"$nonsil_phones\\n\";\nprint \"</ForPhones>\\n\";\nfor ($state = 0; $state < $num_nonsil_states; $state++) {\n  $statep1 = $state+1;\n  print \"<State> $state <PdfClass> $state <Transition> $state 0.75 <Transition> $statep1 0.25 </State>\\n\";\n}\nprint \"<State> $num_nonsil_states </State>\\n\"; # non-emitting final state.\nprint \"</TopologyEntry>\\n\";\n# Now silence phones.  
They have a different topology-- apart from the first and\n# last states, it's fully connected, as long as you have >= 3 states.\n\nif ($num_sil_states > 1) {\n  $transp = 1.0 / ($num_sil_states-1);\n  print \"<TopologyEntry>\\n\";\n  print \"<ForPhones>\\n\";\n  print \"$sil_phones\\n\";\n  print \"</ForPhones>\\n\";\n  print \"<State> 0 <PdfClass> 0 \";\n  for ($nextstate = 0; $nextstate < $num_sil_states-1; $nextstate++) { # Transitions to all but last\n    # emitting state.\n    print \"<Transition> $nextstate $transp \";\n  }\n  print \"</State>\\n\";\n  for ($state = 1; $state < $num_sil_states-1; $state++) { # the central states all have transitions to\n    # themselves and to the last emitting state.\n    print \"<State> $state <PdfClass> $state \";\n    for ($nextstate = 1; $nextstate < $num_sil_states; $nextstate++) {\n      print \"<Transition> $nextstate $transp \";\n    }\n    print \"</State>\\n\";\n  }\n  # Final emitting state (non-skippable).\n  $state = $num_sil_states-1;\n  print \"<State> $state <PdfClass> $state <Transition> $state 0.75 <Transition> $num_sil_states 0.25 </State>\\n\";\n  # Final nonemitting state:\n  print \"<State> $num_sil_states </State>\\n\";\n  print \"</TopologyEntry>\\n\";\n} else {\n  print \"<TopologyEntry>\\n\";\n  print \"<ForPhones>\\n\";\n  print \"$sil_phones\\n\";\n  print \"</ForPhones>\\n\";\n  print \"<State> 0 <PdfClass> 0 \";\n  print \"<Transition> 0 0.75 \";\n  print \"<Transition> 1 0.25 \";\n  print \"</State>\\n\";\n  print \"<State> $num_sil_states </State>\\n\"; # non-emitting final state.\n  print \"</TopologyEntry>\\n\";\n}\n\nprint \"</Topology>\\n\";\n"
  },
  {
    "path": "egs/utils/int2sym.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\nundef $field_begin;\nundef $field_end;\n\n\nif ($ARGV[0] eq \"-f\") {\n  shift @ARGV;\n  $field_spec = shift @ARGV;\n  if ($field_spec =~ m/^\\d+$/) {\n    $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n  }\n  if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 1:10 as a courtesy (properly, 1-10)\n    if ($1 ne \"\") {\n      $field_begin = $1 - 1; # Change to zero-based indexing.\n    }\n    if ($2 ne \"\") {\n      $field_end = $2 - 1; # Change to zero-based indexing.\n    }\n  }\n  if (!defined $field_begin && !defined $field_end) {\n    die \"Bad argument to -f option: $field_spec\";\n  }\n}\n$symtab = shift @ARGV;\nif(!defined $symtab) {\n    print STDERR \"Usage: int2sym.pl [options] symtab [input] > output\\n\" .\n      \"options: [-f (<field>|<field_start>-<field-end>)]\\n\" .\n      \"e.g.: -f 2, or -f 3-4\\n\";\n    exit(1);\n}\n\nopen(F, \"<$symtab\") || die \"Error opening symbol table file $symtab\";\nwhile(<F>) {\n    @A = split(\" \", $_);\n    @A == 2 || die \"bad line in symbol table file: $_\";\n    $int2sym{$A[1]} = $A[0];\n}\n\nsub int2sym {\n    my $a = shift @_;\n    my $pos = shift @_;\n    if($a !~  m:^\\d+$:) { # not all digits..\n      $pos1 = $pos+1; # make it one-based.\n      die \"int2sym.pl: found noninteger token $a [in position $pos1]\\n\";\n    }\n    $s = $int2sym{$a};\n    if(!defined ($s)) {\n      die \"int2sym.pl: integer $a not in symbol table $symtab.\";\n    }\n    return $s;\n}\n\n$error = 0;\nwhile (<>) {\n  @A = split(\" \", $_);\n  for ($pos = 0; $pos <= $#A; $pos++) {\n    $a = $A[$pos];\n    if ( (!defined $field_begin || $pos >= $field_begin)\n         && (!defined $field_end || $pos <= $field_end)) {\n      $a = int2sym($a, $pos);\n    }\n    print $a . \" \";\n  }\n  print \"\\n\";\n}\n\n\n\n"
  },
  {
    "path": "egs/utils/kwslist_post_process.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012  Johns Hopkins University (Author: Guoguo Chen)\n# Apache 2.0.\n#\n\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\nsub ReadKwslist {\n  my $kwslist_in = shift @_;\n\n  my $source = \"STDIN\";\n  if ($kwslist_in ne \"-\") {\n    open(I, \"<$kwslist_in\") || die \"$0: Fail to open kwslist $kwslist_in\\n\";\n    $source = \"I\";\n  }\n\n  # Read in the kwslist and parse it. Note that this is a naive parse -- I simply\n  # assume that the kwslist is \"properly\" generated\n  my @KWS;\n  my (@info, $kwid, $tbeg, $dur, $file, $score, $channel);\n  my ($kwlist_filename, $language, $system_id) = (\"\", \"\", \"\");\n  while (<$source>) {\n    chomp;\n\n    if (/<kwslist/) {\n      /language=\"(\\S+)\"/ && ($language = $1);\n      /system_id=\"(\\S+)\"/ && ($system_id = $1);\n      /kwlist_filename=\"(\\S+)\"/ && ($kwlist_filename = $1);\n      @info = ($kwlist_filename, $language, $system_id);\n      next;\n    }\n\n    if (/<detected_kwlist/) {\n      ($kwid) = /kwid=\"(\\S+)\"/;\n      next;\n    }\n\n    if (/<kw/) {\n      ($dur) = /dur=\"(\\S+)\"/;\n      ($file) = /file=\"(\\S+)\"/;\n      ($tbeg) = /tbeg=\"(\\S+)\"/;\n      ($score) = /score=\"(\\S+)\"/;\n      ($channel) = /channel=\"(\\S+)\"/;\n      push(@KWS, [$kwid, $file, $channel, $tbeg, $dur, $score, \"\"]);\n    }\n  }\n\n  $kwslist_in eq \"-\" || close(I);\n\n  return [\\@info, \\@KWS];\n}\n\nsub PrintKwslist {\n  my ($info, $KWS) = @_;\n\n  my $kwslist = \"\";\n\n  # Start printing\n  $kwslist .= \"<kwslist kwlist_filename=\\\"$info->[0]\\\" language=\\\"$info->[1]\\\" system_id=\\\"$info->[2]\\\">\\n\";\n  my $prev_kw = \"\";\n  foreach my $kwentry (@{$KWS}) {\n    if ($prev_kw ne $kwentry->[0]) {\n      if ($prev_kw ne \"\") {$kwslist .= \"  </detected_kwlist>\\n\";}\n      $kwslist .= \"  <detected_kwlist search_time=\\\"1\\\" kwid=\\\"$kwentry->[0]\\\" oov_count=\\\"0\\\">\\n\";\n      $prev_kw = $kwentry->[0];\n    }\n    $kwslist .= \"    
<kw file=\\\"$kwentry->[1]\\\" channel=\\\"$kwentry->[2]\\\" tbeg=\\\"$kwentry->[3]\\\" dur=\\\"$kwentry->[4]\\\" score=\\\"$kwentry->[5]\\\" decision=\\\"$kwentry->[6]\\\"\";\n    if (defined($kwentry->[7])) {$kwslist .= \" threshold=\\\"$kwentry->[7]\\\"\";}\n    if (defined($kwentry->[8])) {$kwslist .= \" raw_score=\\\"$kwentry->[8]\\\"\";}\n    $kwslist .= \"/>\\n\";\n  }\n  $kwslist .= \"  </detected_kwlist>\\n\";\n  $kwslist .= \"</kwslist>\\n\";\n\n  return $kwslist;\n}\n\nsub KwslistOutputSort {\n  if ($a->[0] ne $b->[0]) {\n    if ($a->[0] =~ m/[0-9]+$/ and $b->[0] =~ m/[0-9]+$/) {\n      ($a->[0] =~ /([0-9]*)$/)[0] <=> ($b->[0] =~ /([0-9]*)$/)[0]\n    } else {\n      $a->[0] cmp $b->[0];\n    }\n  } elsif ($a->[5] ne $b->[5]) {\n    $b->[5] <=> $a->[5];\n  } else {\n    $a->[1] cmp $b->[1];\n  }\n}\nsub KwslistDupSort {\n  my ($a, $b, $duptime) = @_;\n  if ($a->[0] ne $b->[0]) {\n    $a->[0] cmp $b->[0];\n  } elsif ($a->[1] ne $b->[1]) {\n    $a->[1] cmp $b->[1];\n  } elsif ($a->[2] ne $b->[2]) {\n    $a->[2] cmp $b->[2];\n  } elsif (abs($a->[3]-$b->[3]) >= $duptime){\n    $a->[3] <=> $b->[3];\n  } elsif ($a->[5] ne $b->[5]) {\n    $b->[5] <=> $a->[5];\n  } else {\n    $b->[4] <=> $a->[4];\n  }\n}\n\nmy $Usage = <<EOU;\nThis script reads a kwslist.xml file, does the post processing such as making decisions,\nnormalizing score, removing duplicates, etc. 
It writes the results to another kwslist.xml\nfile.\n\nUsage: kwslist_post_process.pl [options] <kwslist_in|-> <kwslist_out|->\n e.g.: kwslist_post_process.pl kwslist.in.xml kwslist.out.xml\n\nAllowed options:\n  --beta        : Beta value when computing ATWV                (float,   default = 999.9)\n  --digits      : How many digits should the score use          (int,     default = \"infinite\")\n  --duptime     : Tolerance for duplicates                      (float,   default = 0.5)\n  --duration    : Duration of the audio (Actual length/2)       (float,   default = 3600)\n  --normalize   : Normalize scores or not                       (boolean, default = false)\n  --Ntrue-scale : Keyword independent scale factor for Ntrue    (float,   default = 1.0)\n  --remove-dup  : Remove duplicates                             (boolean, default = false)\n  --remove-NO   : Remove the \"NO\" decision instances            (boolean, default = false)\n  --verbose     : Verbose level (higher --> more kws section)   (integer, default 0)\n  --YES-cutoff  : Only keep \"\\$YES-cutoff\" yeses for each kw     (int,     default = -1)\n\nEOU\n\nmy $beta = 999.9;\nmy $duration = 3600;\nmy $normalize = \"false\";\nmy $verbose = 0;\nmy $Ntrue_scale = 1.0;\nmy $remove_dup = \"false\";\nmy $duptime = 0.5;\nmy $remove_NO = \"false\";\nmy $digits = 0;\nmy $YES_cutoff = -1;\nGetOptions('beta=f'     => \\$beta,\n  'duration=f'          => \\$duration,\n  'normalize=s'         => \\$normalize,\n  'verbose=i'           => \\$verbose,\n  'Ntrue-scale=f'       => \\$Ntrue_scale,\n  'remove-dup=s'        => \\$remove_dup,\n  'duptime=f'           => \\$duptime,\n  'remove-NO=s'         => \\$remove_NO,\n  'digits=i'            => \\$digits,\n  'YES-cutoff=i'        => \\$YES_cutoff);\n\n($normalize eq \"true\" || $normalize eq \"false\") || die \"$0: Bad value for option --normalize\\n\";\n($remove_dup eq \"true\" || $remove_dup eq \"false\") || die \"$0: Bad value for option 
--remove-dup\\n\";\n($remove_NO eq \"true\" || $remove_NO eq \"false\") || die \"$0: Bad value for option --remove-NO\\n\";\n\n@ARGV == 2 || die $Usage;\n\n# Workout the input/output source\nmy $kwslist_in = shift @ARGV;\nmy $kwslist_out = shift @ARGV;\n\nmy ($info, $KWS) = @{ReadKwslist($kwslist_in)};\n\n# Work out the Ntrue\nmy %Ntrue;\nforeach my $kwentry (@{$KWS}) {\n  if (!defined($Ntrue{$kwentry->[0]})) {\n    $Ntrue{$kwentry->[0]} = 0.0;\n  }\n  $Ntrue{$kwentry->[0]} += $kwentry->[5];\n}\n\n# Scale the Ntrue and work out the expected count based threshold\nmy %threshold;\nforeach my $key (keys %Ntrue) {\n  $Ntrue{$key} *= $Ntrue_scale;\n  $threshold{$key} = $Ntrue{$key}/($duration/$beta+($beta-1)/$beta*$Ntrue{$key});\n}\n\n# Removing duplicates\nif ($remove_dup eq \"true\") {\n  my @tmp = sort {KwslistDupSort($a, $b, $duptime)} @{$KWS};\n  my @KWS = ();\n  push(@KWS, $tmp[0]);\n  for (my $i = 1; $i < scalar(@tmp); $i ++) {\n    my $prev = $KWS[-1];\n    my $curr = $tmp[$i];\n    if ((abs($prev->[3]-$curr->[3]) < $duptime ) &&\n        ($prev->[2] eq $curr->[2]) &&\n        ($prev->[1] eq $curr->[1]) &&\n        ($prev->[0] eq $curr->[0])) {\n      next;\n    } else {\n      push(@KWS, $curr);\n    }\n  }\n  $KWS = \\@KWS;\n}\n\nmy $format_string = \"%g\";\nif ($digits gt 0 ) {\n  $format_string = \"%.\" . 
$digits .\"f\";\n}\n\n# Making decisions...\nmy %YES_count;\nforeach my $kwentry (@{$KWS}) {\n  my $threshold = $threshold{$kwentry->[0]};\n  if ($kwentry->[5] > $threshold) {\n    $kwentry->[6] = \"YES\";\n    if (defined($YES_count{$kwentry->[0]})) {\n      $YES_count{$kwentry->[0]} ++;\n    } else {\n      $YES_count{$kwentry->[0]} = 1;\n    }\n  } else {\n    $kwentry->[6] = \"NO\";\n    if (!defined($YES_count{$kwentry->[0]})) {\n      $YES_count{$kwentry->[0]} = 0;\n    }\n  }\n  if ($verbose > 0) {\n    push(@{$kwentry}, sprintf(\"%g\", $threshold));\n  }\n  if ($normalize eq \"true\") {\n    if ($verbose > 0) {\n      push(@{$kwentry}, $kwentry->[5]);\n    }\n    my $numerator = (1-$threshold)*$kwentry->[5];\n    my $denominator = (1-$threshold)*$kwentry->[5]+(1-$kwentry->[5])*$threshold;\n    if ($denominator != 0) {\n      $kwentry->[5] = sprintf($format_string, $numerator/$denominator);\n    } else {\n      $kwentry->[5] = sprintf($format_string, $kwentry->[5]);\n    }\n  } else {\n    $kwentry->[5] = sprintf($format_string, $kwentry->[5]);\n  }\n}\n\n# Sorting and printing\nmy @tmp = sort KwslistOutputSort @{$KWS};\n\n# Process the YES-cutoff. 
Note that you don't need this for the normal cases where\n# hits and false alarms are balanced\nif ($YES_cutoff != -1) {\n  my $count = 1;\n  for (my $i = 1; $i < scalar(@tmp); $i ++) {\n    if ($tmp[$i]->[0] ne $tmp[$i-1]->[0]) {\n      $count = 1;\n      next;\n    }\n    if ($YES_count{$tmp[$i]->[0]} > $YES_cutoff*2) {\n      $tmp[$i]->[6] = \"NO\";\n      $tmp[$i]->[5] = 0;\n      next;\n    }\n    if (($count == $YES_cutoff) && ($tmp[$i]->[6] eq \"YES\")) {\n      $tmp[$i]->[6] = \"NO\";\n      $tmp[$i]->[5] = 0;\n      next;\n    }\n    if ($tmp[$i]->[6] eq \"YES\") {\n      $count ++;\n    }\n  }\n}\n\n# Process the remove-NO decision\nif ($remove_NO eq \"true\") {\n  my @KWS = @tmp;\n  @tmp = ();\n  for (my $i = 0; $i < scalar(@KWS); $i ++) {\n    if ($KWS[$i]->[6] eq \"YES\") {\n      push(@tmp, $KWS[$i]);\n    }\n  }\n}\n\n# Printing\nmy $kwslist = PrintKwslist($info, \\@tmp);\n\nif ($kwslist_out eq \"-\") {\n  print $kwslist;\n} else {\n  open(O, \">$kwslist_out\") || die \"$0: Fail to open output file $kwslist_out\\n\";\n  print O $kwslist;\n  close(O);\n}\n"
  },
  {
    "path": "egs/utils/lang/add_unigrams_arpa.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2018  Xiaohui Zhang\n# Apache 2.0.\n#\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\nmy $Usage = <<EOU;\n# This is a simple script to add unigrams to an ARPA lm file.\nUsage: utils/lang/add_unigrams_arpa.pl [options] <oov-prob-file> <scale> <input-arpa >output-arpa\n<oov-prob-file> contains a list of words and their probabilities, e.g. \"jack 0.2\". All probs will be\nscaled by a positive scalar <scale> and then be used as the unigram prob. of the added word.\nThe scale should approximately reflect the OOV rate of the language concerned.\nEOU\n\nmy @F;\nmy @OOVS;\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\n# Gets parameters.\nmy $oov_prob_file = shift @ARGV;\nmy $scale = shift @ARGV;\nmy $arpa_in = shift @ARGV;\nmy $arpa_out = shift @ARGV;\n\n# Opens files.\nopen(F, \"<$oov_prob_file\") || die \"$0: Fail to open $oov_prob_file\\n\";\nwhile (<F>) { push @OOVS, $_; }\nmy $num_oovs = @OOVS;\n\n$scale > 0.0 || die \"Bad scale\";\nprint STDERR \"$0: Creating LM file with additional unigrams, using $oov_prob_file\\n\";\n\nmy %vocab;\nmy $unigram = 0;\nmy $num_unigrams = 0;\nmy @lines;\n\n# Parse and record the head and unigrams in the ARPA LM.\nwhile(<STDIN>) {\n  if (m/^ngram 1=(\\d+)/) { $num_unigrams = $1; }\n  \n  if (m/^\\\\2-grams:$/) { last; }\n  if (m/^\\\\1-grams:$/) { $unigram = 1; push(@lines, $_); next; }\n  if (m/^\\\\2-grams:$/) { $unigram = 0; }\n\n  my @col = split(\" \", $_);\n  if ( $unigram == 1 ) {\n    # Record in-vocab words into a map.\n    if ( @col > 0 ) {\n      my $word = $col[1];\n      $vocab{$word} = 1;\n      push(@lines, $_);\n    } else {\n      # Insert out-of-vocab words and their probs into the unigram list.\n      foreach my $l (@OOVS) {\n        my @A = split(\" \", $l);\n        @A == 2 || die \"bad line in oov2prob: $l;\";\n        my $word = $A[0];\n        my $prob = $A[1];\n        if (exists($vocab{$word})) { next; }\n        $num_unigrams ++;\n        my $log10prob = 
(log($prob * $scale) / log(10.0));\n        $vocab{$word} = 1;\n        my $line = sprintf(\"%.6f\\t%s\\n\", $log10prob, $word);\n        push(@lines, $line);\n      }\n    }\n  } else { push(@lines, $_); }\n}\n\n# Print the head and unigrams, with the updated number of unigrams in the head.\nforeach my $l (@lines) {\n  if ($l =~ m/ngram 1=/) {\n    print \"ngram 1=$num_unigrams\\n\";\n  } else {\n    print $l;\n  }\n}\n\n# Print the remaining sections (2-grams and higher) unchanged.\nprint \"\\n\\\\2-grams:\\n\";\nwhile(<STDIN>) {\n  print;\n}\n\nclose(F);\nexit 0\n"
  },
  {
    "path": "egs/utils/lang/adjust_unk_arpa.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2018  Xiaohui Zhang\n# Apache 2.0.\n#\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\nmy $Usage = <<EOU;\n# This is a simple script to set/scale the prob of n-grams where the OOV dict entry is the predicted word, in an ARPA lm file.\nUsage: utils/lang/adjust_unk_arpa.pl [options] <oov-dict-entry> <unk-scale> <input-arpa >output-arpa\n\nAllowed options:\n  --fixed-value (true|false)   : If true, interpret the unk-scale as a fixed value we'll set to\n                                 the unigram prob of the OOV dict entry, rather than using it to\n                                 scale the probs. In this case higher order n-grams containing\n                                 the OOV dict entry remain untouched. This is useful when the OOV\n                                 dict entry doesn't appear in n-grams (n>1) as the predicted word.\nEOU\n\nmy $fixed_value = \"false\";\nGetOptions('fixed-value=s' => \\$fixed_value);\n\n($fixed_value eq \"true\" || $fixed_value eq \"false\") ||\n  die \"$0: Bad value for option --fixed-value\\n\";\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\n# Gets parameters.\nmy $unk_word = shift @ARGV;\nmy $unk_scale = shift @ARGV;\nmy $arpa_in = shift @ARGV;\nmy $arpa_out = shift @ARGV;\n\n$unk_scale > 0.0 || die \"Bad unk_scale\"; # this must be positive\nif ( $fixed_value eq \"true\" ) {\n  print STDERR \"$0: Setting the unigram prob of $unk_word in LM file as $unk_scale.\\n\";\n} else {\n  print STDERR \"$0: Scaling the probs of ngrams where $unk_word is the predicted word in LM file by $unk_scale.\\n\";\n}\n\nmy $ngram = 0; # the order of ngram we are visiting\n\n# Change the unigram prob of the unk-word in the ARPA LM.\nwhile(<STDIN>) {\n  if (m/^\\\\1-grams:$/) { $ngram = 1; }\n  if (m/^\\\\2-grams:$/) { $ngram = 2; }\n  if (m/^\\\\3-grams:$/) { $ngram = 3; }\n  if (m/^\\\\4-grams:$/) { $ngram = 4; }\n  if (m/^\\\\5-grams:$/) { $ngram = 5; }\n  my @col = split(\" \", $_);\n  if ( @col > 1 
&& $ngram > 0 && $col[$ngram] eq $unk_word ) {\n    if ( $fixed_value eq \"true\" && $ngram == 1 ) {\n      $col[0] = (log($unk_scale) / log(10.0));\n    } elsif ($fixed_value eq \"false\" ) {\n      $col[0] += (log($unk_scale) / log(10.0));\n    }\n    my $line = join(\"\\t\", @col);\n    print \"$line\\n\";\n  } else {\n    print;\n  }\n}\n\nexit 0\n"
  },
  {
    "path": "egs/utils/lang/adjust_unk_graph.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2018 Xiaohui Zhang\n# Apache 2.0\n\n# This script copies a fully expanded decoding graph (HCLG.fst) and scales the scores\n# of all arcs whose output symbol is a user-specified OOV symbol (or any other word).\n# This achieves an effect equivalent to utils/lang/adjust_unk_arpa.pl, which scales\n# the LM prob of all ngrams predicting an OOV symbol, while avoiding re-creating the graph.\n\nset -o pipefail\n\nif [ $# != 4 ]; then\n   echo \"Usage: utils/lang/adjust_unk_graph.sh <oov-dict-entry> <scale> <in-graph-dir> <out-graph-dir>\"\n   echo \"e.g.: utils/lang/adjust_unk_graph.sh \\\"<unk>\\\" 0.1 exp/tri1/graph exp/tri1/graph_unk_scale_0.1\"\n   exit 1;\nfi\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\noov_word=$1\nunk_scale=$2\ngraphdir_in=$3\ngraphdir_out=$4\n\nmkdir -p $graphdir_out\n\n# Check that every required file exists and copy it to the output directory.\n# HCLG.fst is overwritten below with the rescaled graph.\nrequired=\"HCLG.fst words.txt disambig_tid.int num_pdfs phones phones.txt\"\nfor f in $required; do\n  [ ! -e $graphdir_in/$f ] && echo \"adjust_unk_graph.sh: expected $graphdir_in/$f to exist\" && exit 1;\n  cp -r $graphdir_in/$f $graphdir_out\ndone\n\noov_id=`echo $oov_word | utils/sym2int.pl $graphdir_in/words.txt`\n[ -z \"$oov_id\" ] && echo \"adjust_unk_graph.sh: the specified oov symbol $oov_word is out of the vocabulary.\" && exit 1;\nfstprint $graphdir_in/HCLG.fst | awk -v oov=$oov_id -v unk_scale=$unk_scale '{if($4==oov) $5=$5-log(unk_scale);print $0}' | \\\n  fstcompile | fstconvert --fst_type=const  > $graphdir_out/HCLG.fst || exit 1;\n"
  },
  {
    "path": "egs/utils/lang/bpe/add_final_optional_silence.sh",
    "content": "#!/usr/bin/env bash\n. ./path.sh\n\nfinal_sil_prob=0.5\n\necho \"$0 $@\"  # Print the command line for logging\n\n. ./utils/parse_options.sh\n\nif [ $# -ne 1 ]; then\n  echo \"Usage: $0  <lang>\"\n  echo \" Add final optional silence to lexicon FSTs (L.fst and L_disambig.fst) in\"\n  echo \" lang/ directory <lang>.\"\n  echo \" This can be useful in systems with byte-pair encoded (BPE) lexicons, in which\"\n  echo \" the word-initial silence is part of the lexicon, so we turn off the standard\"\n  echo \" optional silence in the lexicon\"\n  echo \"options:\"\n  echo \"   --final-sil-prob <final silence probability>      # default 0.5\"\n  exit 1;\nfi\n\nlang=$1\n\nif [ $lang/phones/final_sil_prob -nt $lang/phones/nonsilence.txt ]; then\n  echo \"$0 $lang/phones/final_sil_prob exists. Exiting...\"\n  exit 1;\nfi\n\nsilphone=$(cat $lang/phones/optional_silence.int)\n\nsil_eq_zero=$(echo $(perl -e \"if ( $final_sil_prob == 0.0) {print 'true';} else {print 'false';}\"))\nsil_eq_one=$(echo $(perl -e \"if ( $final_sil_prob == 1.0) {print 'true';} else {print 'false';}\"))\nsil_lt_zero=$(echo $(perl -e \"if ( $final_sil_prob < 0.0) {print 'true';} else {print 'false';}\"))\nsil_gt_one=$(echo $(perl -e \"if ( $final_sil_prob > 1.0) {print 'true';} else {print 'false';}\"))\n\nif  $sil_lt_zero || $sil_gt_one; then\n  echo \"$0 final-sil-prob should be between 0.0 and 1.0. 
Final silence was not added.\"\n  exit 1;\nelse\n  if $sil_eq_zero; then\n    echo \"$0 final-sil-prob = 0 => Final silence was not added.\"\n    exit 0;\n  elif $sil_eq_one; then\n    ( echo \"0 1 $silphone 0\";\n      echo \"1\" ) | fstcompile > $lang/final_sil.fst\n  else\n    log_silprob=$(echo $(perl -e \"print log $final_sil_prob\"))\n    ( echo \"0 1 $silphone 0 $log_silprob\";\n      echo \"0 $log_silprob\";\n      echo \"1\" ) | fstcompile > $lang/final_sil.fst\n  fi\n  mv $lang/L.fst $lang/L.fst.orig\n  mv $lang/L_disambig.fst $lang/L_disambig.fst.orig\n  fstconcat $lang/L.fst.orig $lang/final_sil.fst | fstarcsort --sort_type=olabel > $lang/L.fst\n  fstconcat $lang/L_disambig.fst.orig $lang/final_sil.fst | fstarcsort --sort_type=olabel > $lang/L_disambig.fst\n  echo \"$final_sil_prob\" > $lang/phones/final_sil_prob\nfi\n"
  },
  {
    "path": "egs/utils/lang/bpe/apply_bpe.py",
    "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n# Author: Rico Sennrich\n# Released under the MIT License.\n\n\"\"\"Use operations learned with learn_bpe.py to encode a new text.\nThe text will not be smaller, but use only a fixed vocabulary, with rare words\nencoded as variable-length sequences of subword units.\n\nReference:\nRico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units.\nProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.\n\"\"\"\n\nfrom __future__ import unicode_literals, division\n\nimport sys\nimport codecs\nimport io\nimport argparse\nimport re\n\n# hack for python2/3 compatibility\nfrom io import open\nargparse.open = open\n\nclass BPE(object):\n\n    def __init__(self, codes, merges=-1, separator='@@', vocab=None, glossaries=None):\n\n        codes.seek(0)\n\n        # check version information\n        firstline = codes.readline()\n        if firstline.startswith('#version:'):\n            self.version = tuple([int(x) for x in re.sub(r'(\\.0+)*$','', firstline.split()[-1]).split(\".\")])\n        else:\n            self.version = (0, 1)\n            codes.seek(0)\n\n        self.bpe_codes = [tuple(item.strip().split(' ')) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]\n\n        for item in self.bpe_codes:\n            if len(item) != 2:\n                sys.stderr.write('Error: invalid line in BPE codes file: {0}\\n'.format(' '.join(item)))\n                sys.stderr.write('The line should consist of exactly two subword units, separated by whitespace\\n')\n                sys.exit(1)\n\n        # some hacking to deal with duplicates (only consider first instance)\n        self.bpe_codes = dict([(code,i) for (i,code) in reversed(list(enumerate(self.bpe_codes)))])\n\n        self.bpe_codes_reverse = dict([(pair[0] + pair[1], pair) for pair,i in 
self.bpe_codes.items()])\n\n        self.separator = separator\n\n        self.vocab = vocab\n\n        self.glossaries = glossaries if glossaries else []\n\n        self.cache = {}\n\n    def process_line(self, line):\n        \"\"\"segment line, dealing with leading and trailing whitespace\"\"\"\n\n        out = \"\"\n\n        leading_whitespace = len(line)-len(line.lstrip())\n        if leading_whitespace:\n            out += line[:leading_whitespace]\n\n        out += self.segment(line)\n\n        trailing_whitespace = len(line)-len(line.rstrip())\n        if trailing_whitespace:\n            out += line[-trailing_whitespace:]\n\n        return out\n\n    def segment(self, sentence):\n        \"\"\"segment single sentence (whitespace-tokenized string) with BPE encoding\"\"\"\n        output = []\n        for word in sentence.strip().split(' '):\n            # eliminate double spaces\n            if not word:\n                continue\n            new_word = [out for segment in self._isolate_glossaries(word)\n                        for out in encode(segment,\n                                          self.bpe_codes,\n                                          self.bpe_codes_reverse,\n                                          self.vocab,\n                                          self.separator,\n                                          self.version,\n                                          self.cache,\n                                          self.glossaries)]\n\n            for item in new_word[:-1]:\n                output.append(item + self.separator)\n            output.append(new_word[-1])\n\n        return ' '.join(output)\n\n    def _isolate_glossaries(self, word):\n        word_segments = [word]\n        for gloss in self.glossaries:\n            word_segments = [out_segments for segment in word_segments\n                                 for out_segments in isolate_glossary(segment, gloss)]\n        return word_segments\n\ndef create_parser():\n    
parser = argparse.ArgumentParser(\n        formatter_class=argparse.RawDescriptionHelpFormatter,\n        description=\"apply BPE-based word segmentation\")\n\n    parser.add_argument(\n        '--input', '-i', type=argparse.FileType('r'), default=sys.stdin,\n        metavar='PATH',\n        help=\"Input file (default: standard input).\")\n    parser.add_argument(\n        '--codes', '-c', type=argparse.FileType('r'), metavar='PATH',\n        required=True,\n        help=\"File with BPE codes (created by learn_bpe.py).\")\n    parser.add_argument(\n        '--merges', '-m', type=int, default=-1,\n        metavar='INT',\n        help=\"Use this many BPE operations (<= number of learned symbols). \"+\n             \"Default: apply all the learned merge operations.\")\n    parser.add_argument(\n        '--output', '-o', type=argparse.FileType('w'), default=sys.stdout,\n        metavar='PATH',\n        help=\"Output file (default: standard output)\")\n    parser.add_argument(\n        '--separator', '-s', type=str, default='@@', metavar='STR',\n        help=\"Separator between non-final subword units (default: '%(default)s')\")\n    parser.add_argument(\n        '--vocabulary', type=argparse.FileType('r'), default=None,\n        metavar=\"PATH\",\n        help=\"Vocabulary file (built with get_vocab.py). If provided, this script reverts any merge operations that produce an OOV.\")\n    parser.add_argument(\n        '--vocabulary-threshold', type=int, default=None,\n        metavar=\"INT\",\n        help=\"Vocabulary threshold. If vocabulary is provided, any word with frequency < threshold will be treated as OOV\")\n    parser.add_argument(\n        '--glossaries', type=str, nargs='+', default=None,\n        metavar=\"STR\",\n        help=\"Glossaries. The strings provided in glossaries will not be affected \"+\n             \"by the BPE (i.e. 
they will neither be broken into subwords, nor concatenated with other subwords\")\n\n    return parser\n\ndef get_pairs(word):\n    \"\"\"Return set of symbol pairs in a word.\n\n    word is represented as tuple of symbols (symbols being variable-length strings)\n    \"\"\"\n    pairs = set()\n    prev_char = word[0]\n    for char in word[1:]:\n        pairs.add((prev_char, char))\n        prev_char = char\n    return pairs\n\ndef encode(orig, bpe_codes, bpe_codes_reverse, vocab, separator, version, cache, glossaries=None):\n    \"\"\"Encode word based on list of BPE merge operations, which are applied consecutively\n    \"\"\"\n\n    if orig in cache:\n        return cache[orig]\n\n    if orig in glossaries:\n        cache[orig] = (orig,)\n        return (orig,)\n\n    if version == (0, 1):\n        word = tuple(orig) + ('</w>',)\n    elif version == (0, 2): # more consistent handling of word-final segments\n        word = tuple(orig[:-1]) + ( orig[-1] + '</w>',)\n    else:\n        raise NotImplementedError\n\n    pairs = get_pairs(word)\n\n    if not pairs:\n        return orig\n\n    while True:\n        bigram = min(pairs, key = lambda pair: bpe_codes.get(pair, float('inf')))\n        if bigram not in bpe_codes:\n            break\n        first, second = bigram\n        new_word = []\n        i = 0\n        while i < len(word):\n            try:\n                j = word.index(first, i)\n                new_word.extend(word[i:j])\n                i = j\n            except:\n                new_word.extend(word[i:])\n                break\n\n            if word[i] == first and i < len(word)-1 and word[i+1] == second:\n                new_word.append(first+second)\n                i += 2\n            else:\n                new_word.append(word[i])\n                i += 1\n        new_word = tuple(new_word)\n        word = new_word\n        if len(word) == 1:\n            break\n        else:\n            pairs = get_pairs(word)\n\n    # don't print end-of-word 
symbols\n    if word[-1] == '</w>':\n        word = word[:-1]\n    elif word[-1].endswith('</w>'):\n        word = word[:-1] + (word[-1].replace('</w>',''),)\n\n    if vocab:\n        word = check_vocab_and_split(word, bpe_codes_reverse, vocab, separator)\n\n    cache[orig] = word\n    return word\n\ndef recursive_split(segment, bpe_codes, vocab, separator, final=False):\n    \"\"\"Recursively split segment into smaller units (by reversing BPE merges)\n    until all units are either in-vocabulary, or cannot be split further.\"\"\"\n\n    try:\n        if final:\n            left, right = bpe_codes[segment + '</w>']\n            right = right[:-4]\n        else:\n            left, right = bpe_codes[segment]\n    except KeyError:\n        #sys.stderr.write('cannot split {0} further.\\n'.format(segment))\n        yield segment\n        return\n\n    if left + separator in vocab:\n        yield left\n    else:\n        for item in recursive_split(left, bpe_codes, vocab, separator, False):\n            yield item\n\n    if (final and right in vocab) or (not final and right + separator in vocab):\n        yield right\n    else:\n        for item in recursive_split(right, bpe_codes, vocab, separator, final):\n            yield item\n\ndef check_vocab_and_split(orig, bpe_codes, vocab, separator):\n    \"\"\"Check for each segment in word if it is in-vocabulary,\n    and segment OOV segments into smaller units by reversing the BPE merge operations\"\"\"\n\n    out = []\n\n    for segment in orig[:-1]:\n        if segment + separator in vocab:\n            out.append(segment)\n        else:\n            #sys.stderr.write('OOV: {0}\\n'.format(segment))\n            for item in recursive_split(segment, bpe_codes, vocab, separator, False):\n                out.append(item)\n\n    segment = orig[-1]\n    if segment in vocab:\n        out.append(segment)\n    else:\n        #sys.stderr.write('OOV: {0}\\n'.format(segment))\n        for item in recursive_split(segment, bpe_codes, vocab, 
separator, True):\n            out.append(item)\n\n    return out\n\n\ndef read_vocabulary(vocab_file, threshold):\n    \"\"\"read vocabulary file produced by get_vocab.py, and filter according to frequency threshold.\n    \"\"\"\n\n    vocabulary = set()\n\n    for line in vocab_file:\n        word, freq = line.strip().split(' ')\n        freq = int(freq)\n        if threshold == None or freq >= threshold:\n            vocabulary.add(word)\n\n    return vocabulary\n\ndef isolate_glossary(word, glossary):\n    \"\"\"\n    Isolate a glossary present inside a word.\n\n    Returns a list of subwords. In which all 'glossary' glossaries are isolated \n\n    For example, if 'USA' is the glossary and '1934USABUSA' the word, the return value is:\n        ['1934', 'USA', 'B', 'USA']\n    \"\"\"\n    if word == glossary or glossary not in word:\n        return [word]\n    else:\n        splits = word.split(glossary)\n        segments = [segment.strip() for split in splits[:-1] for segment in [split, glossary] if segment != '']\n        return segments + [splits[-1].strip()] if splits[-1] != '' else segments\n\nif __name__ == '__main__':\n\n    # python 2/3 compatibility\n    if sys.version_info < (3, 0):\n        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)\n        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)\n        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)\n    else:\n        sys.stdin = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')\n        sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')\n        sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', write_through=True, line_buffering=True)\n\n    parser = create_parser()\n    args = parser.parse_args()\n\n    # read/write files as UTF-8\n    args.codes = codecs.open(args.codes.name, encoding='utf-8')\n    if args.input.name != '<stdin>':\n        args.input = codecs.open(args.input.name, encoding='utf-8')\n    if args.output.name != '<stdout>':\n        
args.output = codecs.open(args.output.name, 'w', encoding='utf-8')\n    if args.vocabulary:\n        args.vocabulary = codecs.open(args.vocabulary.name, encoding='utf-8')\n\n    if args.vocabulary:\n        vocabulary = read_vocabulary(args.vocabulary, args.vocabulary_threshold)\n    else:\n        vocabulary = None\n\n    bpe = BPE(args.codes, args.merges, args.separator, vocabulary, args.glossaries)\n\n    for line in args.input:\n        args.output.write(bpe.process_line(line))\n"
  },
  {
    "path": "egs/utils/lang/bpe/bidi.py",
    "content": "#!/usr/bin/env python3\n# Copyright   2018 Chun-Chieh Chang\n\n# This script is largely written by Stephen Rawls\n# and uses the python package https://pypi.org/project/PyICU_BiDi/\n# The code leaves right to left text alone and reverses left to right text.\n\nimport icu_bidi\nimport io\nimport sys\nimport unicodedata\n# R=strong right-to-left;  AL=strong arabic right-to-left\nrtl_set =  set(chr(i) for i in range(sys.maxunicode)\n               if unicodedata.bidirectional(chr(i)) in ['R','AL'])\ndef determine_text_direction(text):\n    # Easy case first\n    for char in text:\n        if char in rtl_set:\n            return icu_bidi.UBiDiLevel.UBIDI_RTL\n    # If we made it here we did not encounter any strongly rtl char\n    return icu_bidi.UBiDiLevel.UBIDI_LTR\n\ndef utf8_visual_to_logical(text):\n    text_dir = determine_text_direction(text)\n\n    bidi = icu_bidi.Bidi()\n    bidi.inverse = True\n    bidi.reordering_mode = icu_bidi.UBiDiReorderingMode.UBIDI_REORDER_INVERSE_LIKE_DIRECT\n    bidi.reordering_options = icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_DEFAULT # icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_INSERT_MARKS\n\n    bidi.set_para(text, text_dir, None)\n\n    res = bidi.get_reordered(0 | icu_bidi.UBidiWriteReorderedOpt.UBIDI_DO_MIRRORING | icu_bidi.UBidiWriteReorderedOpt.UBIDI_KEEP_BASE_COMBINING)\n\n    return res\n\ndef utf8_logical_to_visual(text):\n    text_dir = determine_text_direction(text)\n\n    bidi = icu_bidi.Bidi()\n\n    bidi.reordering_mode = icu_bidi.UBiDiReorderingMode.UBIDI_REORDER_DEFAULT\n    bidi.reordering_options = icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_DEFAULT  #icu_bidi.UBiDiReorderingOption.UBIDI_OPTION_INSERT_MARKS\n\n    bidi.set_para(text, text_dir, None)\n\n    res = bidi.get_reordered(0 | icu_bidi.UBidiWriteReorderedOpt.UBIDI_DO_MIRRORING | icu_bidi.UBidiWriteReorderedOpt.UBIDI_KEEP_BASE_COMBINING)\n\n    return res\n\n\n##main##\nsys.stdin = io.TextIOWrapper(sys.stdin.buffer, 
encoding=\"utf8\")\nsys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding=\"utf8\")\nfor line in sys.stdin:\n    line = line.strip()\n    line = utf8_logical_to_visual(line)[::-1]\n    sys.stdout.write(line + '\\n')\n"
  },
  {
    "path": "egs/utils/lang/bpe/learn_bpe.py",
    "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n# Author: Rico Sennrich\n# Released under the MIT License.\n\n\"\"\"Use byte pair encoding (BPE) to learn a variable-length encoding of the vocabulary in a text.\nUnlike the original BPE, it does not compress the plain text, but can be used to reduce the vocabulary\nof a text to a configurable number of symbols, with only a small increase in the number of tokens.\n\nReference:\nRico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units.\nProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.\n\"\"\"\n\nfrom __future__ import unicode_literals\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport sys\nimport codecs\nimport re\nimport copy\nimport argparse\nfrom collections import defaultdict, Counter\n\n# hack for python2/3 compatibility\nfrom io import open\nargparse.open = open\n\ndef create_parser():\n    parser = argparse.ArgumentParser(\n        formatter_class=argparse.RawDescriptionHelpFormatter,\n        description=\"learn BPE-based word segmentation\")\n\n    parser.add_argument(\n        '--input', '-i', type=argparse.FileType('r'), default=sys.stdin,\n        metavar='PATH',\n        help=\"Input text (default: standard input).\")\n\n    parser.add_argument(\n        '--output', '-o', type=argparse.FileType('w'), default=sys.stdout,\n        metavar='PATH',\n        help=\"Output file for BPE codes (default: standard output)\")\n    parser.add_argument(\n        '--symbols', '-s', type=int, default=10000,\n        help=\"Create this many new symbols (each representing a character n-gram) (default: %(default)s))\")\n    parser.add_argument(\n        '--min-frequency', type=int, default=2, metavar='FREQ',\n        help='Stop if no symbol pair has frequency >= FREQ (default: %(default)s))')\n    parser.add_argument('--dict-input', 
action=\"store_true\",\n        help=\"If set, input file is interpreted as a dictionary where each line contains a word-count pair\")\n    parser.add_argument(\n        '--verbose', '-v', action=\"store_true\",\n        help=\"verbose mode.\")\n\n    return parser\n\ndef get_vocabulary(fobj, is_dict=False):\n    \"\"\"Read text and return dictionary that encodes vocabulary\n    \"\"\"\n    vocab = Counter()\n    for i, line in enumerate(fobj):\n        if is_dict:\n            try:\n                word, count = line.strip().split(' ')\n            except:\n                print('Failed reading vocabulary file at line {0}: {1}'.format(i, line))\n                sys.exit(1)\n            vocab[word] += int(count)\n        else:\n            for word in line.strip().split(' '):\n                if word:\n                    vocab[word] += 1\n    return vocab\n\ndef update_pair_statistics(pair, changed, stats, indices):\n    \"\"\"Minimally update the indices and frequency of symbol pairs\n\n    if we merge a pair of symbols, only pairs that overlap with occurrences\n    of this pair are affected, and need to be updated.\n    \"\"\"\n    stats[pair] = 0\n    indices[pair] = defaultdict(int)\n    first, second = pair\n    new_pair = first+second\n    for j, word, old_word, freq in changed:\n\n        # find all instances of pair, and update frequency/indices around it\n        i = 0\n        while True:\n            # find first symbol\n            try:\n                i = old_word.index(first, i)\n            except ValueError:\n                break\n            # if first symbol is followed by second symbol, we've found an occurrence of pair (old_word[i:i+2])\n            if i < len(old_word)-1 and old_word[i+1] == second:\n                # assuming a symbol sequence \"A B C\", if \"B C\" is merged, reduce the frequency of \"A B\"\n                if i:\n                    prev = old_word[i-1:i+1]\n                    stats[prev] -= freq\n                    
indices[prev][j] -= 1\n                if i < len(old_word)-2:\n                    # assuming a symbol sequence \"A B C B\", if \"B C\" is merged, reduce the frequency of \"C B\".\n                    # however, skip this if the sequence is A B C B C, because the frequency of \"C B\" will be reduced by the previous code block\n                    if old_word[i+2] != first or i >= len(old_word)-3 or old_word[i+3] != second:\n                        nex = old_word[i+1:i+3]\n                        stats[nex] -= freq\n                        indices[nex][j] -= 1\n                i += 2\n            else:\n                i += 1\n\n        i = 0\n        while True:\n            try:\n                # find new pair\n                i = word.index(new_pair, i)\n            except ValueError:\n                break\n            # assuming a symbol sequence \"A BC D\", if \"B C\" is merged, increase the frequency of \"A BC\"\n            if i:\n                prev = word[i-1:i+1]\n                stats[prev] += freq\n                indices[prev][j] += 1\n            # assuming a symbol sequence \"A BC B\", if \"B C\" is merged, increase the frequency of \"BC B\"\n            # however, if the sequence is A BC BC, skip this step because the count of \"BC BC\" will be incremented by the previous code block\n            if i < len(word)-1 and word[i+1] != new_pair:\n                nex = word[i:i+2]\n                stats[nex] += freq\n                indices[nex][j] += 1\n            i += 1\n\n\ndef get_pair_statistics(vocab):\n    \"\"\"Count frequency of all symbol pairs, and create index\"\"\"\n\n    # data structure of pair frequencies\n    stats = defaultdict(int)\n\n    #index from pairs to words\n    indices = defaultdict(lambda: defaultdict(int))\n\n    for i, (word, freq) in enumerate(vocab):\n        prev_char = word[0]\n        for char in word[1:]:\n            stats[prev_char, char] += freq\n            indices[prev_char, char][i] += 1\n            
prev_char = char\n\n    return stats, indices\n\n\ndef replace_pair(pair, vocab, indices):\n    \"\"\"Replace all occurrences of a symbol pair ('A', 'B') with a new symbol 'AB'\"\"\"\n    first, second = pair\n    pair_str = ''.join(pair)\n    pair_str = pair_str.replace('\\\\','\\\\\\\\')\n    changes = []\n    pattern = re.compile(r'(?<!\\S)' + re.escape(first + ' ' + second) + r'(?!\\S)')\n    if sys.version_info < (3, 0):\n        iterator = indices[pair].iteritems()\n    else:\n        iterator = indices[pair].items()\n    for j, freq in iterator:\n        if freq < 1:\n            continue\n        word, freq = vocab[j]\n        new_word = ' '.join(word)\n        new_word = pattern.sub(pair_str, new_word)\n        new_word = tuple(new_word.split(' '))\n\n        vocab[j] = (new_word, freq)\n        changes.append((j, new_word, word, freq))\n\n    return changes\n\ndef prune_stats(stats, big_stats, threshold):\n    \"\"\"Prune statistics dict for efficiency of max()\n\n    The frequency of a symbol pair never increases, so pruning is generally safe\n    (until the most frequent pair is less frequent than a pair we previously pruned)\n    big_stats keeps full statistics for when we need to access pruned items\n    \"\"\"\n    for item,freq in list(stats.items()):\n        if freq < threshold:\n            del stats[item]\n            if freq < 0:\n                big_stats[item] += freq\n            else:\n                big_stats[item] = freq\n\n\ndef main(infile, outfile, num_symbols, min_frequency=2, verbose=False, is_dict=False):\n    \"\"\"Learn num_symbols BPE operations from vocabulary, and write to outfile.\n    \"\"\"\n\n    # version 0.2 changes the handling of the end-of-word token ('</w>');\n    # version numbering allows backward compatibility\n    outfile.write('#version: 0.2\\n')\n\n    vocab = get_vocabulary(infile, is_dict)\n    vocab = dict([(tuple(x[:-1])+(x[-1]+'</w>',) ,y) for (x,y) in vocab.items()])\n    sorted_vocab = 
sorted(vocab.items(), key=lambda x: x[1], reverse=True)\n\n    stats, indices = get_pair_statistics(sorted_vocab)\n    big_stats = copy.deepcopy(stats)\n    # threshold is inspired by Zipfian assumption, but should only affect speed\n    threshold = max(stats.values()) / 10\n    for i in range(num_symbols):\n        if stats:\n            most_frequent = max(stats, key=lambda x: (stats[x], x))\n\n        # we probably missed the best pair because of pruning; go back to full statistics\n        if not stats or (i and stats[most_frequent] < threshold):\n            prune_stats(stats, big_stats, threshold)\n            stats = copy.deepcopy(big_stats)\n            most_frequent = max(stats, key=lambda x: (stats[x], x))\n            # threshold is inspired by Zipfian assumption, but should only affect speed\n            threshold = stats[most_frequent] * i/(i+10000.0)\n            prune_stats(stats, big_stats, threshold)\n\n        if stats[most_frequent] < min_frequency:\n            sys.stderr.write('no pair has frequency >= {0}. 
Stopping\\n'.format(min_frequency))\n            break\n\n        if verbose:\n            sys.stderr.write('pair {0}: {1} {2} -> {1}{2} (frequency {3})\\n'.format(i, most_frequent[0], most_frequent[1], stats[most_frequent]))\n        outfile.write('{0} {1}\\n'.format(*most_frequent))\n        changes = replace_pair(most_frequent, sorted_vocab, indices)\n        update_pair_statistics(most_frequent, changes, stats, indices)\n        stats[most_frequent] = 0\n        if not i % 100:\n            prune_stats(stats, big_stats, threshold)\n\n\nif __name__ == '__main__':\n\n    # python 2/3 compatibility\n    if sys.version_info < (3, 0):\n        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr)\n        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)\n        sys.stdin = codecs.getreader('UTF-8')(sys.stdin)\n    else:\n        sys.stderr = codecs.getwriter('UTF-8')(sys.stderr.buffer)\n        sys.stdout = codecs.getwriter('UTF-8')(sys.stdout.buffer)\n        sys.stdin = codecs.getreader('UTF-8')(sys.stdin.buffer)\n\n    parser = create_parser()\n    args = parser.parse_args()\n\n    # read/write files as UTF-8\n    if args.input.name != '<stdin>':\n        args.input = codecs.open(args.input.name, encoding='utf-8')\n    if args.output.name != '<stdout>':\n        args.output = codecs.open(args.output.name, 'w', encoding='utf-8')\n\n    main(args.input, args.output, args.symbols, args.min_frequency, args.verbose, is_dict=args.dict_input)\n"
  },
  {
    "path": "egs/utils/lang/bpe/prepend_words.py",
    "content": "#!/usr/bin/env python3\n\n# This script, prepend '|' to every words in the transcript to mark\n# the beginning of the words for finding the initial-space of every word\n# after decoding.\n\nimport sys\nimport io\nimport re\n\nwhitespace = re.compile(\"[ \\t]+\")\ninfile = io.TextIOWrapper(sys.stdin.buffer, encoding='latin-1')\noutput = io.TextIOWrapper(sys.stdout.buffer, encoding='latin-1')\nfor line in infile:\n    words = whitespace.split(line.strip(\" \\t\\r\\n\"))\n    output.write(' '.join([ \"|\"+word for word in words]) + '\\n')\n"
  },
  {
    "path": "egs/utils/lang/bpe/reverse.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# This script, reverse all latin and digits sequences\n# (including words like MP3) to put them in the right order in the images.\n\nimport re, os, sys, io\n\nin_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')\nout_stream = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')\nfor line in in_stream:\n    out_stream.write(re.sub(r'[a-zA-Z0-9][a-zA-Z0-9\\s\\.\\:]*[a-zA-Z0-9]',\n                            lambda m:m.group(0)[::-1], line))\n"
  },
  {
    "path": "egs/utils/lang/check_g_properties.pl",
    "content": "#!/usr/bin/env perl\n\nuse IPC::Open2;\n\nif (@ARGV != 1) {\n  print \"Usage: $0 [options] <lang_directory>\\n\";\n  print \"e.g.:  $0 data/lang\\n\";\n  exit(1);\n}\n\n$lang = shift @ARGV;\n\n# This script checks that G.fst in the lang.fst directory is OK with respect\n# to certain expected properties, and returns nonzero exit status if a problem was\n# detected.  It is called from validate_lang.pl.\n# This only checks the properties of G that relate to disambiguation symbols,\n# epsilons and forbidden symbols <s> and </s>.\n\nif (! -e \"$lang/G.fst\") {\n  print \"$0: error: $lang/G.fst does not exist\\n\";\n  exit(1);\n}\n\nopen(W, \"<$lang/words.txt\") || die \"opening $lang/words.txt\";\n$hash_zero = -1;\nwhile (<W>) {\n  @A = split(\" \", $_);\n  ($sym, $int) = @A;\n  if ($sym eq \"<s>\" || $sym eq \"</s>\") { $is_forbidden{$int} = 1; }\n  if ($sym eq \"#0\") { $hash_zero = $int; }\n  if ($sym =~ m/^#nonterm/) { $is_nonterminal{$int} = 1; }\n}\n\nif (-e \"$lang/phones/wdisambig_words.int\") {\n  open(F, \"<$lang/phones/wdisambig_words.int\") || die \"opening $lang/phones/wdisambig_words.int\";\n  while (<F>) {\n    chop;\n    $is_disambig{$_} = 1;\n  }\n} else {\n  $is_disambig{$hash_zero} = 1;\n}\n\n$input_cmd = \". ./path.sh; fstprint $lang/G.fst|\";\nopen(G, $input_cmd) || die \"running command $input_cmd\";\n\n$info_cmd = \". 
./path.sh; fstcompile | fstinfo \";\nopen2(O, I, \"$info_cmd\") || die \"running command $info_cmd\";\n\n$has_epsilons = 0;\n\nwhile (<G>) {\n  @A = split(\" \", $_);\n  if (@A >= 4) {\n    if ($is_forbidden{$A[2]} || $is_forbidden{$A[3]}) {\n      chop;\n      print \"$0: validating $lang: error: line $_ in G.fst contains forbidden symbol <s> or </s>\\n\";\n      exit(1);\n    } elsif ($is_disambig{$A[2]}) {\n      print I $_;\n      if ($A[3] != 0) {\n        chop;\n        print \"$0: validating $lang: error: line $_ in G.fst has disambig on input but no epsilon on output\\n\";\n        exit(1);\n      }\n    } elsif ($A[2] == 0) {\n      print I $_;\n      $has_epsilons = 1;\n    } elsif ($A[2] != $A[3] && !$is_nonterminal{$A[2]} ) {\n      chop;\n      print \"$0: validating $lang: error: line $_ in G.fst has inputs and outputs different but input is not disambig symbol or nonterminal.\\n\";\n      exit(1);\n    }\n  }\n}\n\nclose(I);  # tell 'fstcompile | fstinfo' pipeline that its input is done.\nwhile (<O>) {\n  if (m/cyclic\\s+y/) {\n    print \"$0: validating $lang: error: G.fst has cycles containing only disambig symbols and epsilons.  Would cause determinization failure\\n\";\n    exit(1);\n  }\n}\n\nif ($has_epsilons) {\n  print \"$0: warning: validating $lang: G.fst has epsilon-input arcs.  We don't expect these in most setups.\\n\";\n}\n\nprint \"--> $0 successfully validated $lang/G.fst\\n\";\nexit(0);\n"
  },
  {
    "path": "egs/utils/lang/check_phones_compatible.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2016 Hang Lyu\n\n# Licensed udner the Apache License, Version 2.0 (the \"Lincense\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OF IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script exits with status zero if the phone symbols tables are the same\n# except for possible differences in disambiguation symbols (meaning that all\n# symbols except those beginning with a # are mapped to the same values).\n# Otherwise it prints a warning and exits with status 1.\n# For the sake of compatibility with other scripts that did not write the\n# phones.txt to model directories, this script exits silently with status 0\n# if one of the phone symbol tables does not exist.\n\n. utils/parse_options.sh || exit 1;\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: utils/lang/check_phones_compatible.sh <phones-symbol-table1> <phones-symbol-table2>\"\n  echo \"e.g.: utils/lang/check_phones_compatible.sh data/lang/phones.txt exp/tri3/phones.txt\"\n  exit 1;\nfi\n\ntable_first=$1\ntable_second=$2\n\n# check if the files exist or not\nif [ ! -f $table_first ]; then\n  if [ ! -f $table_second ]; then\n    echo \"$0: Error! Both of the two phones-symbol tables are absent.\"\n    echo \"Please check your command\"\n    exit 1;\n  else\n    # The phones-symbol-table1 is absent. The model directory maybe created by old script.\n    # For back compatibility, this script exits silently with status 0.\n    exit 0;\n  fi\nelif [ ! -f $table_second ]; then\n  # The phones-symbol-table2 is absent. 
The model directory may have been created by an old script.\n  # For backward compatibility, this script exits silently with status 0.\n  exit 0;\nfi\n\n# Check if the two tables are the same (except for possible differences in disambiguation symbols).\nif ! cmp -s <(grep -v \"^#\" $table_first) <(grep -v \"^#\" $table_second); then\n  echo \"$0: phone symbol tables $table_first and $table_second are not compatible.\"\n  exit 1;\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/lang/compute_sentence_probs_arpa.py",
    "content": "#!/usr/bin/env python\n\n# Dongji Gao\n\n# We're using python 3.x style but want it to work in python 2.x\n\nfrom __future__ import print_function\nimport argparse\nimport sys\nimport math\n\nparser = argparse.ArgumentParser(description=\"This script evaluates the log probabilty (default log base is e) of each sentence \"\n                                             \"from data (in text form), given a language model in arpa form \"\n                                             \"and a specific ngram order.\",\n                                 epilog=\"e.g. ./compute_sentence_probs_arpa.py ARPA_LM NGRAM_ORDER TEXT_IN PROB_FILE --log-base=LOG_BASE\",\n                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)\nparser.add_argument(\"arpa_lm\", type=str,\n                    help=\"Input language model in arpa form.\")\nparser.add_argument(\"ngram_order\", type=int,\n                    help=\"Order of ngram\")\nparser.add_argument(\"text_in\", type=str,\n                    help=\"Filename of input text file (each line will be interpreted as a sentence).\")\nparser.add_argument(\"prob_file\", type=str,\n                    help=\"Filename of output probability file.\")\nparser.add_argument(\"--log-base\", type=float, default=math.exp(1),\n                    help=\"Log base for log porbability\")\nargs = parser.parse_args()\n\ndef check_args(args):\n    args.text_in_handle = sys.stdin if args.text_in == \"-\" else open(args.text_in, \"r\")\n    args.prob_file_handle = sys.stdout if args.prob_file == \"-\" else open(args.prob_file, \"w\")\n    if args.log_base <= 0:\n        sys.exit(\"compute_sentence_probs_arpa.py: Invalid log base (must be greater than 0)\")\n\ndef is_logprob(input):\n    if input[0] == \"-\":\n        try:\n            float(input[1:])\n            return True\n        except:\n            return False\n    else:\n        return False\n\ndef check_number(model_file, tot_num):\n    cur_num = 0\n    
max_ngram_order = 0\n    with open(model_file) as model:\n        lines = model.readlines()\n        for line in lines[1:]:\n            if \"=\" not in line:\n                return (cur_num == tot_num), max_ngram_order\n            cur_num += int(line.split(\"=\")[-1])\n            max_ngram_order = int(line.split(\"=\")[0].split()[-1])\n\n# This function load language model in arpa form and save in a dictionary for\n# computing sentence probabilty of input text file.\ndef load_model(model_file):\n    with open(model_file) as model:\n        ngram_dict = {}\n        lines = model.readlines()\n\n        # check arpa form\n        if lines[0][:-1] != \"\\\\data\\\\\":\n            sys.exit(\"compute_sentence_probs_arpa.py: Please make sure that language model is in arpa form.\")\n\n        # read line\n        for line in lines:\n            if line[0] == \"-\":\n                line_split = line.split()\n                if is_logprob(line_split[-1]):\n                    ngram_key = \" \".join(line_split[1:-1])\n                    if ngram_key in ngram_dict:\n                        sys.exit(\"compute_sentence_probs_arpa.py: Duplicated ngram in arpa language model: {}.\".format(ngram_key))\n                    ngram_dict[ngram_key] = (line_split[0], line_split[-1])\n                else:\n                    ngram_key = \" \".join(line_split[1:])\n                    if ngram_key in ngram_dict:\n                        sys.exit(\"compute_sentence_probs_arpa.py: Duplicated ngram in arpa language model: {}.\".format(ngram_key))\n                    ngram_dict[ngram_key] = (line_split[0],)\n\n    return ngram_dict, len(ngram_dict)\n\ndef compute_sublist_prob(sub_list):\n    if len(sub_list) == 0:\n        sys.exit(\"compute_sentence_probs_arpa.py: Ngram substring not found in arpa language model, please check.\")\n\n    sub_string = \" \".join(sub_list)\n    if sub_string in ngram_dict:\n        return -float(ngram_dict[sub_string][0][1:])\n    else:\n        
backoff_substring = \" \".join(sub_list[:-1])\n        backoff_weight = 0.0 if (backoff_substring not in ngram_dict or len(ngram_dict[backoff_substring]) < 2) \\\n                         else -float(ngram_dict[backoff_substring][1][1:])\n        return compute_sublist_prob(sub_list[1:]) + backoff_weight\n\ndef compute_begin_prob(sub_list):\n    logprob = 0\n    for i in range(1, len(sub_list) - 1):\n        logprob += compute_sublist_prob(sub_list[:i + 1])\n    return logprob\n\n# The probability is computed in this way:\n# p(word_N | word_N-1 ... word_1) = ngram_dict[word_1 ... word_N][0].\n# Here ngram_dict is a dictionary that stores a tuple for each ngram.\n# The first element of the tuple is the log probability and the second is the backoff weight (if it exists).\n# If the particular ngram (word_1 ... word_N) is not in the dictionary, then\n# p(word_N | word_N-1 ... word_1) = p(word_N | word_(N-1) ... word_2) * backoff_weight(word_(N-1) | word_(N-2) ... word_1)\n# If the sequence (word_(N-1) ... 
word_1) is not in the dictionary, then the backoff_weight gets replaced with 0.0 (i.e. log 1)\n# More details can be found in https://cmusphinx.github.io/wiki/arpaformat/\ndef compute_sentence_prob(sentence, ngram_order):\n    sentence_split = sentence.split()\n    for i in range(len(sentence_split)):\n        if sentence_split[i] not in ngram_dict:\n            sentence_split[i] = \"<unk>\"\n    sen_length = len(sentence_split)\n\n    if sen_length < ngram_order:\n        return compute_begin_prob(sentence_split)\n    else:\n        logprob = 0\n        begin_sublist = sentence_split[:ngram_order]\n        logprob += compute_begin_prob(begin_sublist)\n\n        for i in range(sen_length - ngram_order + 1):\n            cur_sublist = sentence_split[i : i + ngram_order]\n            logprob += compute_sublist_prob(cur_sublist)\n\n    return logprob\n\n\ndef output_result(text_in_handle, output_file_handle, ngram_order):\n    lines = text_in_handle.readlines()\n    logbase_modifier = math.log(10, args.log_base)\n    for line in lines:\n        new_line = \"<s> \" + line[:-1] + \" </s>\"\n        logprob = compute_sentence_prob(new_line, ngram_order)\n        new_logprob = logprob * logbase_modifier\n        output_file_handle.write(\"{}\\n\".format(new_logprob))\n    text_in_handle.close()\n    output_file_handle.close()\n\n\nif __name__ == \"__main__\":\n    check_args(args)\n    ngram_dict, tot_num = load_model(args.arpa_lm)\n\n    num_valid, max_ngram_order = check_number(args.arpa_lm, tot_num)\n    if not num_valid:\n        sys.exit(\"compute_sentence_probs_arpa.py: Failed to load the model: ngram counts in the header do not match the loaded entries.\")\n    if args.ngram_order <= 0 or args.ngram_order > max_ngram_order:\n        sys.exit(\"compute_sentence_probs_arpa.py: \" +\n            \"Invalid ngram_order (must be positive and no greater than the maximum ngram order ({}))\".format(max_ngram_order))\n\n    output_result(args.text_in_handle, args.prob_file_handle, args.ngram_order)\n"
  },
  {
    "path": "egs/utils/lang/extend_lang.sh",
    "content": "#!/usr/bin/env bash\n# Copyright     2018  Johns Hopkins University (Author: Daniel Povey);\n#               2019  Dongji Gao\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# derived files, that go in data/lang/.\n\n# Begin configuration section.\nsil_prob=0.5\nsilprob_file=\n# end configuration section\n\necho \"$0 $@\"  # Print the command line for logging\n\n. utils/parse_options.sh\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: utils/extend_lang.sh <old-lang-dir> <lexicon> <new-lang-dir>\"\n  echo \"e.g.: utils/extend_lang.sh data/lang data/local/dict_new_words/lexiconp.txt data/lang_new_words\"\n  echo \"\"\n  echo \"This script creates a lang/ directory <new-lang-dir> with L.fst and L_disambig.fst\"\n  echo \"derived from the provided lexicon, but all other information being the same as the old\"\n  echo \"lang/ directory, including the phones.txt and words.txt being compatible (however,\"\n  echo \"words.txt may have new words, and phones.txt may have extra disambiguation symbols\"\n  echo \"if needed).  We do not allow new phones.\"\n  echo \"\"\n  echo \"CAUTION: the lexicon generated will only cover the words in the provided lexicon,\"\n  echo \"which might not include all the words in words.txt.  
You should make sure your\"\n  echo \"lexicon is a superset of the original lexicon used to generate <old-lang-dir>,\"\n  echo \"if this would be a problem for your scenario.\"\n  echo \"\"\n  echo \"The basename of <lexicon> must be either lexiconp.txt or lexiconp_silprob.txt.\"\n  echo \"\"\n  echo \"Options\"\n  echo \"     --sil-prob <probability of silence>             # default: 0.5 [must have 0 <= silprob < 1]\"\n  echo \"     --silprob-file <file containing silence probabilities>    # must be provided if lexicon is lexiconp_silprob.txt\"\n  exit 1;\nfi\n\nsrcdir=$1\nlexicon=$2\ndir=$3\n\n[ -f path.sh ] && . ./path.sh\n\n\nfor f in $srcdir/phones.txt $lexicon; do\n  if [ ! -f $f ]; then\n    echo \"$0: expected file $f to exist\"\n    exit 1\n  fi\ndone\n\nif ! awk '{if(NF < 2) exit(1)} END{if(NR==0) exit(1)}' <$lexicon; then\n  echo \"$0: it looks like there are words without pronunciations or..\"\n  echo \"  ...blank lines in $lexicon, or it is empty.\"\n  exit 1\nfi\n\nmkdir -p $dir\n\nif [ -d $dir/phones ]; then rm -r $dir/phones; fi\n\ncp -r $srcdir/phones $dir/\n\nfor f in oov.int oov.txt phones.txt topo words.txt; do\n  cp $srcdir/$f $dir/\ndone\n\ntmpdir=$dir/temp\nrm -r $tmpdir 2>/dev/null\nmkdir -p $tmpdir\n\nsilprob=false\n\nif [ $(basename $lexicon) == \"lexiconp_silprob.txt\" ]; then\n  silprob=true\n  if [ -z $silprob_file ] ; then\n    echo \"silprob_file not provided, checking $srcdir\"\n    if [ -f $srcdir/silprob.txt ]; then\n        silprob_file=$srcdir/silprob.txt\n        echo \"silprob_file found in $srcdir\"\n    else\n        echo \"silprob_file not found in $srcdir\" && exit 1;\n    fi\n  else\n    if [ ! -f $silprob_file ]; then\n      echo \"$silprob_file does not exist\" && exit 1;\n    fi\n  fi\nelif [ $(basename $lexicon) != lexiconp.txt ]; then\n  echo \"$0: currently this script only supports the lexiconp.txt or lexiconp_silprob.txt format;\"\n  echo \" ... 
your lexicon has to have that filename.\"\nfi\n\n# Get the list of extra words.\nawk -v w=$srcdir/words.txt 'BEGIN{while(getline <w) seen[$1] = $1} { if (!($1 in seen)) oov[$1] = 1}\n                     END{ for(k in oov) print k;}' <$lexicon >$tmpdir/extra_words.txt\n\n# Add entries to words.txt for all the words that were not previously in the\n# lexicon.\nhighest_number=$(tail -n 1 $srcdir/words.txt | awk '{print $2}')\nawk -v start=$highest_number '{print $1, NR+start}' <$tmpdir/extra_words.txt >>$dir/words.txt\necho \"$0: added $(wc -l <$tmpdir/extra_words.txt) extra words to words.txt\"\n\nif [ -f $dir/phones/nonterminals.txt ]; then\n  # extra grammar-decoding-related options for getting the lexicon.\n  grammar_opts=\"--left-context-phones=$dir/phones/left_context_phones.txt --nonterminals=$srcdir/phones/nonterminals.txt\"\nelse\n  grammar_opts=\"\"\nfi\n\nif [ -f $dir/phones/word_boundary.txt ]; then\n  # was `if $position_dependent_phones; then..` in prepare_lang.sh\n  if \"$silprob\"; then\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; $silword_p = shift @A;\n              $wordsil_f = shift @A; $wordnonsil_f = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_S\\n\"; }\n         else { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n         < $lexicon > $tmpdir/lexiconp_silprob.txt\n  else\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; @A>0||die;\n           if(@A==1) { print \"$w $p $A[0]_S\\n\"; } else { print \"$w $p $A[0]_B \";\n           for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n         < $lexicon > $tmpdir/lexiconp.txt || exit 1;\n  fi\nelse\n  if \"$silprob\"; then\n    cp $lexicon $tmpdir/lexiconp_silprob.txt\n  else\n    cp $lexicon $tmpdir/lexiconp.txt\n  fi\nfi\n\n# Check that there are no unseen phones in the 
lexicon.\nif \"$silprob\"; then\n  if ! utils/sym2int.pl -f 6- $srcdir/phones.txt $tmpdir/lexiconp_silprob.txt >/dev/null; then\n    echo \"$0: it looks like there are unseen phones in your lexicon $lexicon\"\n    exit 1\n  fi\nelse \n  if ! utils/sym2int.pl -f 3- $srcdir/phones.txt $tmpdir/lexiconp.txt >/dev/null; then\n    echo \"$0: it looks like there are unseen phones in your lexicon $lexicon\"\n    exit 1\n  fi\nfi\n\nif \"$silprob\"; then\n  ndisambig=$(utils/add_lex_disambig.pl --pron-probs --sil-probs $tmpdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob_disambig.txt)\nelse\n  ndisambig=$(utils/add_lex_disambig.pl --pron-probs $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt)\nfi\n\nndisambig=$[ndisambig+1]  # Add one to disambiguate silence.\n\n# we'll need to figure out whether any of these disambiguation symbols are\n# absent from our current disambiguation phones.. if they are, then we need to\n# add them as new disambiguation symbols to phones.txt.\nfor n in $(seq 0 $ndisambig); do\n  sym='#'$n; if ! 
grep -w -q \"$sym\" $dir/phones/disambig.txt; then echo \"$sym\"; fi\ndone > $tmpdir/extra_disambig.txt\nhighest_number=$(tail -n 1 $srcdir/phones.txt | awk '{print $2}')\nawk -v start=$highest_number '{print $1, NR+start}' <$tmpdir/extra_disambig.txt >>$dir/phones.txt\necho \"$0: added $(wc -l <$tmpdir/extra_disambig.txt) extra disambiguation symbols to phones.txt\"\n\n# add extra_disambig symbols into disambig.txt\ncat $tmpdir/extra_disambig.txt >> $dir/phones/disambig.txt\nutils/sym2int.pl $dir/phones.txt <$dir/phones/disambig.txt >$dir/phones/disambig.int\nutils/sym2int.pl $dir/phones.txt <$dir/phones/disambig.txt | \\\n  awk '{printf(\":%d\", $1);} END{printf \"\\n\"}' | sed s/:// > $dir/phones/disambig.csl\n\nsilphone=`cat $srcdir/phones/optional_silence.txt` || exit 1;\n[ -z \"$silphone\" ] && \\\n  ( echo \"You have no optional-silence phone; it is required in the current scripts\"\n    echo \"but you may use the option --sil-prob 0.0 to stop it being used.\" ) && \\\n   exit 1;\n\nif \"$silprob\"; then\n  # remove the silprob\n  cat $tmpdir/lexiconp_silprob.txt |\\\n    awk '{\n      for(i=1; i<=NF; i++) {\n        if(i!=3 && i!=4 && i!=5) printf(\"%s\\t\", $i); if(i==NF) print \"\";\n      }\n    }' > $tmpdir/lexiconp.txt\nfi\n\n# First remove pron-probs from the lexicon.\nperl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' <$tmpdir/lexiconp.txt >$tmpdir/align_lexicon.txt\n\n# Note: here, $silphone will have no suffix e.g. _S because it occurs as optional-silence,\n# and is not part of a word.\n[ ! 
-z \"$silphone\" ] && echo \"<eps> $silphone\" >> $tmpdir/align_lexicon.txt\n\ncat $tmpdir/align_lexicon.txt | \\\n  perl -ane '@A = split; print $A[0], \" \", join(\" \", @A), \"\\n\";' | sort | uniq > $dir/phones/align_lexicon.txt\n\nif [ -f $dir/phones/nonterminals.txt ]; then\n  for w in \"#nonterm_begin\" \"#nonterm_end\" $(cat $dir/phones/nonterminals.txt); do\n    echo $w $w  # These are words without pronunciations, so leave those prons\n                # empty.\n    done >> $dir/phones/align_lexicon.txt\nfi\n\n# create phones/align_lexicon.int from phones/align_lexicon.txt\ncat $dir/phones/align_lexicon.txt | utils/sym2int.pl -f 3- $dir/phones.txt | \\\n  utils/sym2int.pl -f 1-2 $dir/words.txt > $dir/phones/align_lexicon.int\n\n# Create the basic L.fst without disambiguation symbols, for use\n# in training.\nif \"$silprob\"; then\n  utils/lang/make_lexicon_fst_silprob.py $grammar_opts --sil-phone=$silphone \\\n         $tmpdir/lexiconp_silprob.txt $silprob_file | \\\n      fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n        --keep_isymbols=false --keep_osymbols=false |   \\\n      fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nelse\n  utils/lang/make_lexicon_fst.py $grammar_opts --sil-prob=$sil_prob --sil-phone=$silphone \\\n           $tmpdir/lexiconp.txt | \\\n      fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n        --keep_isymbols=false --keep_osymbols=false | \\\n      fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nfi\n\n\n# and create the version that has disambiguation symbols.\nif \"$silprob\"; then\n  utils/lang/make_lexicon_fst_silprob.py $grammar_opts \\\n    --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n    $tmpdir/lexiconp_silprob_disambig.txt $silprob_file | \\\n    fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n      --keep_isymbols=false --keep_osymbols=false |   \\\n    fstaddselfloops  $dir/phones/wdisambig_phones.int 
$dir/phones/wdisambig_words.int | \\\n    fstarcsort --sort_type=olabel > $dir/L_disambig.fst || exit 1;\nelse\n  utils/lang/make_lexicon_fst.py $grammar_opts \\\n    --sil-prob=$sil_prob --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n       $tmpdir/lexiconp_disambig.txt | \\\n     fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n                --keep_isymbols=false --keep_osymbols=false | \\\n     fstaddselfloops $dir/phones/wdisambig_phones.int $dir/phones/wdisambig_words.int | \\\n     fstarcsort --sort_type=olabel > $dir/L_disambig.fst || exit 1;\nfi\n\n\necho \"$(basename $0): validating output directory\"\n# the --skip-generate-words-check option is needed because L.fst may not actually\n# contain all the words in words.txt.\n! utils/validate_lang.pl --skip-generate-words-check $dir && echo \"$(basename $0): error validating output\" &&  exit 1;\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/lang/get_word_position_phone_map.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2018  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n#\nuse strict;\nuse warnings;\n\nmy $Usage = <<EOU;\n# This script is for creating a mapping from word-position-dependent phones\n# (with _I, _B, _E, _S suffixes) to word-position-independent phones,\n# along with a word-position-independent version of phones.txt.\n# It should only be required in very unusual situations.\n\nUsage: utils/lang/get_word_position_phone_map.pl <lang-dir> <output-dir>\n\n<lang-dir> is a conventional lang dir as validated by validate_lang.pl.\nIt is an error if <lang-dir> does not have word-position-dependent phones.\n\nTo <output-dir> will be written the following files:\n  phones.txt is a conventional symbol table, similar in format to the one\n   in <lang-dir>, but without word-position-dependency or disambiguation\n   symbols.\n  phone_map.int is a mapping from the input <lang-dir>'s phones to\n   the phones in <output-dir>/phones.txt, containing integers, i.e.\n   <word-position-dependent-phone> <word-position-independent-phone>.\n  phone_map.txt is the text form of the mapping in phone_map.int, mostly\n   provided for reference.\nEOU\n\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\nmy $lang_dir = shift @ARGV;\nmy $output_dir = shift @ARGV;\n\nforeach my $filename ( (\"phones.txt\", \"phones/disambig.int\") ) {\n  if (! -f \"$lang_dir/$filename\") {\n    die \"$0: expected file $lang_dir/$filename to exist\";\n  }\n}\n\nif (! 
-d $output_dir) {\n  die \"$0: expected directory $output_dir to exist\";\n}\n\n\n# %is_disambig is a hash indexed by integer phone index in the input $lang_dir,\n# which will contain 1 for each (integer) disambiguation symbol.\nmy %is_disambig;\n\nopen(D, \"<$lang_dir/phones/disambig.int\") || die \"opening $lang_dir/phones/disambig.int\";\nwhile (<D>) {\n  my $disambig_sym = int($_);\n  $is_disambig{$disambig_sym} = 1;\n}\nclose(D);\n\n## @orig_phone_list will be an array indexed by integer index, containing\n## the written form of the original, non-word-position-dependent phones.\n## (but excluding disambiguation symbols like #0, #1 and so on).\n## E.g. @orig_phone_list = ( \"<eps>\", \"SIL\", \"SIL_B\", \"SIL_E\", \"SIL_I\", \"SIL_S\", ... )\nmy @orig_phone_list = ();\n\n\n## @mapped_phones will be an array of the same size as @orig_phone_list, but\n## containing the same phone mapped to context-independent form,\n## e.g. ( \"<eps>\", \"SIL\", \"SIL\", \"SIL\", SIL\", \"SIL\",... )\nmy @mapped_phones = ();\n\n\n## @mapped_phone_list will contain the distinct mapped phones in order,\n## e.g. ( \"<eps>\", \"SIL\", \"AA\", ... )\nmy @mapped_phone_list = ();\n\n## mapped_phone_to_int will be a mapping from the strings in @mapped_phones,\n## such as \"<eps>\" and \"SIL\", to an integer like 0, 1, ....\nmy %mapped_phone_to_int;\n\n# $cur_mapped_int keeps track of the symbols we've used in the output\n# phones.txt.\nmy $cur_mapped_int = 0;\n\n# $cur_line is the current line index in input phones.txt\nmy $cur_line = 0;\n\nopen(F, \"<$lang_dir/phones.txt\") || die \"$0: failed to open $lang_dir/phones.txt for reading\";\n\nwhile (<F>) {\n  chop;  # remove newline from $_ (the line we just read) for easy printing.\n  my @A = split;  # split $_ on space.\n  if (@A != 2) {  # if the array @A does not have length 2...\n    die \"$0: bad line $_ in file $lang_dir/phones.txt\";\n  }\n  my $phone_name = $A[0];  # e.g. 
\"<eps>\" or \"SIL\" or \"SIL_B\" ...\n  my $phone_int = int($A[1]);\n  if ($phone_int != $cur_line) {\n    die (\"$0: unexpected line $_ in $lang_dir/phones.txt, expected integer to be $cur_line\");\n  }\n  if (! $is_disambig{$phone_int}) {\n    # if it's not a disambiguation symbol...\n    my $mapped_phone_name = $phone_name;\n    $mapped_phone_name =~ s/_[BESI]$//;\n\n    push @orig_phone_list, $phone_name;\n    push @mapped_phones, $mapped_phone_name;\n\n    if (!defined $mapped_phone_to_int{$mapped_phone_name}) {\n      $mapped_phone_to_int{$mapped_phone_name} = $cur_mapped_int++;\n      push @mapped_phone_list, $mapped_phone_name;\n    }\n  }\n  $cur_line++;\n}\nclose(F);\n\nif ($cur_line == 0) {\n  die \"$0: empty $lang_dir/phones.txt\";\n}\n\nif ($cur_mapped_int == @orig_phone_list) {\n  # if the number of distinct mapped phones is the same as the\n  # number of input phones (including epsilon), it means the mapping\n  # was a no-op.  This is an error, because it doesn't make sense to\n  # run this script on input that was not word-position-dependent.\n  die \"input lang dir $lang_dir was not word-position-dependent.\";\n}\n\nopen(P, \">$output_dir/phones.txt\") || die \"failed to open $output_dir/phones.txt for writing.\";\nopen(I, \">$output_dir/phone_map.int\") || die \"failed to open $output_dir/phone_map.int for writing.\";\nopen(T, \">$output_dir/phone_map.txt\") || die \"failed to open $output_dir/phone_map.txt for writing.\";\n\nfor (my $x = 0; $x <= $#mapped_phone_list; $x++) {\n  print P \"$mapped_phone_list[$x] $x\\n\";\n}\n\n\nfor (my $x = 0; $x <= $#orig_phone_list; $x++) {\n  my $orig_phone_name = $orig_phone_list[$x];\n  my $mapped_phone_name = $mapped_phones[$x];\n  my $y = $mapped_phone_to_int{$mapped_phone_name};\n  defined $y || die \"code error\";\n\n  print I \"$x $y\\n\";\n  print T \"$orig_phone_name $mapped_phone_name\\n\";\n}\n\n\n(close(I) && close(T) && close(P)) || die \"failed to close file (disk full?)\";\n\n\nexit(0);\n"
  },
  {
    "path": "egs/utils/lang/grammar/augment_phones_txt.py",
    "content": "#!/usr/bin/env python3\n\n\nimport argparse\nimport re\nimport os\nimport sys\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script augments a phones.txt\n       file (a phone-level symbol table) by adding certain special symbols\n       relating to grammar support.  See ../add_nonterminals.sh for context.\"\"\")\n\n    parser.add_argument('input_phones_txt', type=str,\n                        help='Filename of input phones.txt file, to be augmented')\n    parser.add_argument('nonterminal_symbols_list', type=str,\n                        help='Filename of a file containing a list of nonterminal '\n                        'symbols, one per line.  E.g. #nonterm:contact_list')\n    parser.add_argument('output_phones_txt', type=str, help='Filename of output '\n                        'phones.txt file.  May be the same as input-phones-txt.')\n    args = parser.parse_args()\n    return args\n\n\n\n\ndef read_phones_txt(filename):\n    \"\"\"Reads the phones.txt file in 'filename', returns a 2-tuple (lines, highest_symbol)\n       where 'lines' is all the lines the phones.txt as a list of strings,\n       and 'highest_symbol' is the integer value of the highest-numbered symbol\n       in the symbol table.  It is an error if the phones.txt is empty or mis-formatted.\"\"\"\n\n    # The use of latin-1 encoding does not preclude reading utf-8.  latin-1\n    # encoding means \"treat words as sequences of bytes\", and it is compatible\n    # with utf-8 encoding as well as other encodings such as gbk, as long as the\n    # spaces are also spaces in ascii (which we check).  
It is basically how we\n    # emulate the behavior of python before python3.\n    whitespace = re.compile(\"[ \\t]+\")\n    with open(filename, 'r', encoding='latin-1') as f:\n        lines = [line.strip(\" \\t\\r\\n\") for line in f]\n        highest_numbered_symbol = 0\n        for line in lines:\n            s = whitespace.split(line)\n            try:\n                i = int(s[1])\n                if i > highest_numbered_symbol:\n                    highest_numbered_symbol = i\n            except:\n                raise RuntimeError(\"Could not interpret line '{0}' in file '{1}'\".format(\n                line, filename))\n            if s[0] == '#nonterm_bos':\n                raise RuntimeError(\"It looks like the symbol table {0} already has nonterminals \"\n                                   \"in it.\".format(filename))\n        return lines, highest_numbered_symbol\n\n\ndef read_nonterminals(filename):\n    \"\"\"Reads the user-defined nonterminal symbols in 'filename', checks that\n       it has the expected format and has no duplicates, and returns the nonterminal\n       symbols as a list of strings, e.g.\n       ['#nonterm:contact_list', '#nonterm:phone_number', ... ]. \"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no nonterminal symbols.\".format(filename))\n    for nonterm in ans:\n        if nonterm[:9] != '#nonterm:':\n            raise RuntimeError(\"In file '{0}', expected nonterminal symbols to start with '#nonterm:', found '{1}'\"\n                               .format(filename, nonterm))\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\ndef write_phones_txt(orig_lines, highest_numbered_symbol, nonterminals, filename):\n    \"\"\"Writes updated phones.txt to 'filename'.  
'orig_lines' is the original lines\n       in the phones.txt file as a list of strings (without the newlines);\n       highest_numbered_symbol is the highest numbered symbol in the original\n       phones.txt; nonterminals is a list of strings like '#nonterm:foo'.\"\"\"\n    with open(filename, 'w', encoding='latin-1') as f:\n        for l in orig_lines:\n            print(l, file=f)\n        cur_symbol = highest_numbered_symbol + 1\n        for n in ['#nonterm_bos', '#nonterm_begin', '#nonterm_end', '#nonterm_reenter' ] + nonterminals:\n            print(\"{0} {1}\".format(n, cur_symbol), file=f)\n            cur_symbol = cur_symbol + 1\n\n\n\ndef main():\n    args = get_args()\n    (lines, highest_symbol) = read_phones_txt(args.input_phones_txt)\n    nonterminals = read_nonterminals(args.nonterminal_symbols_list)\n    write_phones_txt(lines, highest_symbol, nonterminals, args.output_phones_txt)\n\n\nif __name__ == '__main__':\n      main()\n"
  },
  {
    "path": "egs/utils/lang/grammar/augment_words_txt.py",
    "content": "#!/usr/bin/env python3\n\n\nimport argparse\nimport os\nimport sys\nimport re\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script augments a words.txt\n       file (a word-level symbol table) by adding certain special symbols\n       relating to grammar support.  See ../add_nonterminals.sh for context,\n       and augment_phones_txt.py.\"\"\")\n\n    parser.add_argument('input_words_txt', type=str,\n                        help='Filename of input words.txt file, to be augmented')\n    parser.add_argument('nonterminal_symbols_list', type=str,\n                        help='Filename of a file containing a list of nonterminal '\n                        'symbols, one per line.  E.g. #nonterm:contact_list')\n    parser.add_argument('output_words_txt', type=str, help='Filename of output '\n                        'words.txt file.  May be the same as input-words-txt.')\n    args = parser.parse_args()\n    return args\n\n\n\n\ndef read_words_txt(filename):\n    \"\"\"Reads the words.txt file in 'filename', returns a 2-tuple (lines, highest_symbol)\n       where 'lines' is all the lines of the words.txt as a list of strings,\n       and 'highest_symbol' is the integer value of the highest-numbered symbol\n       in the symbol table.  It is an error if the words.txt is empty or mis-formatted.\"\"\"\n\n    # The use of latin-1 encoding does not preclude reading utf-8.  latin-1\n    # encoding means \"treat words as sequences of bytes\", and it is compatible\n    # with utf-8 encoding as well as other encodings such as gbk, as long as the\n    # spaces are also spaces in ascii (which we check).  
It is basically how we\n    # emulate the behavior of python before python3.\n    whitespace = re.compile(\"[ \\t]+\")\n    with open(filename, 'r', encoding='latin-1') as f:\n        lines = [line.strip(\" \\t\\r\\n\") for line in f]\n        highest_numbered_symbol = 0\n        for line in lines:\n            s = whitespace.split(line)\n            try:\n                i = int(s[1])\n                if i > highest_numbered_symbol:\n                    highest_numbered_symbol = i\n            except:\n                raise RuntimeError(\"Could not interpret line '{0}' in file '{1}'\".format(\n                line, filename))\n            if s[0] in [ '#nonterm_begin', '#nonterm_end' ]:\n                raise RuntimeError(\"It looks like the symbol table {0} already has nonterminals \"\n                                   \"in it.\".format(filename))\n        return lines, highest_numbered_symbol\n\n\ndef read_nonterminals(filename):\n    \"\"\"Reads the user-defined nonterminal symbols in 'filename', checks that\n       it has the expected format and has no duplicates, and returns the nonterminal\n       symbols as a list of strings, e.g.\n       ['#nonterm:contact_list', '#nonterm:phone_number', ... ]. \"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no nonterminal symbols.\".format(filename))\n    for nonterm in ans:\n        if nonterm[:9] != '#nonterm:':\n            raise RuntimeError(\"In file '{0}', expected nonterminal symbols to start with '#nonterm:', found '{1}'\"\n                               .format(filename, nonterm))\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\ndef write_words_txt(orig_lines, highest_numbered_symbol, nonterminals, filename):\n    \"\"\"Writes updated words.txt to 'filename'.  
'orig_lines' is the original lines\n       in the words.txt file as a list of strings (without the newlines);\n       highest_numbered_symbol is the highest numbered symbol in the original\n       words.txt; nonterminals is a list of strings like '#nonterm:foo'.\"\"\"\n    with open(filename, 'w', encoding='latin-1') as f:\n        for l in orig_lines:\n            print(l, file=f)\n        cur_symbol = highest_numbered_symbol + 1\n        for n in [ '#nonterm_begin', '#nonterm_end' ] + nonterminals:\n            print(\"{0} {1}\".format(n, cur_symbol), file=f)\n            cur_symbol = cur_symbol + 1\n\n\ndef main():\n    args = get_args()\n    (lines, highest_symbol) = read_words_txt(args.input_words_txt)\n    nonterminals = read_nonterminals(args.nonterminal_symbols_list)\n    write_words_txt(lines, highest_symbol, nonterminals, args.output_words_txt)\n\n\nif __name__ == '__main__':\n      main()\n"
  },
  {
    "path": "egs/utils/lang/internal/apply_unk_lm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright      2016 Johns Hopkins University (Author: Daniel Povey);\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Begin configuration section.\n\n# end configuration sections\n\necho \"$0 $@\"  # Print the command line for logging\n[ -f path.sh ] && . ./path.sh\n\n\n. utils/parse_options.sh\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [options] <input-unk-lm-fst> <lang-dir>\"\n  echo \"e.g.: $0 exp/make_unk/unk_fst.txt data/lang_unk\"\n  echo \"\"\n  echo \"This script, which is called from the end of prepare_lang.sh,\"\n  echo \"inserts the unknown-word LM FST into the lexicon FSTs\"\n  echo \"<lang-dir>/L.fst and <lang-dir>/L_disambig.fst in place of\"\n  echo \"the special disambiguation symbol #2 (which was inserted by\"\n  echo \"add_lex_disambig.pl as a placeholder for this FST).\"\n  echo \"\"\n  echo \"  <input-unk-lm-fst>:  A text-form FST, typically with the name\"\n  echo \"                unk_fst.txt.  We will remove all symbols from the\"\n  echo \"                output before applying it.\"\n  echo \"  <lang-dir>:  A partially built lang/ directory.  We modify\"\n  echo \"               L.fst and L_disambig.fst, and read only words.txt.\"\n  exit 1;\nfi\n\n\nunk_lm_fst=$1\nlang=$2\n\nset -e\n\nfor f in \"$unk_lm_fst\" $lang/L.fst $lang/L_disambig.fst $lang/words.txt $lang/oov.int; do\n  [ ! 
-f $f ] && echo \"$0: expected file $f to exist\" && exit 1;\ndone\n\nunused_phone_label=$(tail -n 1 $lang/phones.txt | awk '{print $2 + 1}')\nlabel_to_replace=$(awk '{if ($1 == \"#2\") {print $2;}}' <$lang/phones.txt)\n! [ \"$unused_phone_label\" -eq \"$unused_phone_label\" -a \"$label_to_replace\" -eq \"$label_to_replace\" ] && \\\n   echo \"$0: error getting unused phone label or label for #2\" && exit 1\n\n\n# OK, now fstreplace works based on olabels, but we actually want to deal with ilabels,\n# so we need to invert all the FSTs before and after doing fstreplace.\nawk '{if(NF>=4) $4 = \"<eps>\"; print }' <$unk_lm_fst | \\\n  fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt | \\\n  fstinvert > $lang/unk_temp.fst\n\nnum_states_unk=$(fstinfo $lang/unk_temp.fst | grep '# of states' | awk '{print $NF}')\n\n# fstreplace usage is:\n# Usage: fstreplace root.fst rootlabel [rule1.fst label1 ...] [out.fst]\n# ... the rootlabel should just be an otherwise unused symbol.\n# all the labels are olabels (word labels).. that is hardcoded in fstreplace.\n\nfor f in L.fst L_disambig.fst; do\n\n  # with OpenFst tools, to refer to the standard input/output you need to use\n  # the empty string '' and not '-'.\n  fstinvert $lang/$f | fstreplace '' \"$unused_phone_label\" $lang/unk_temp.fst \"$label_to_replace\" | fstinvert > $lang/${f}.temp\n\n  num_states_old=$(fstinfo $lang/$f | grep '# of states' | awk '{print $NF}')\n  num_states_new=$(fstinfo $lang/${f}.temp | grep '# of states' | awk '{print $NF}')\n  num_states_added=$[$num_states_new-$num_states_old]\n  echo \"$0: in $f, substituting in the unknown-word LM (which had $num_states_unk states) added $num_states_added new FST states.\"\n  mv -f $lang/${f}.temp $lang/$f\ndone\n\nrm $lang/unk_temp.fst\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/lang/internal/arpa2fst_constrained.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport sys\nimport argparse\nimport math\nfrom collections import defaultdict\n\n# note, this was originally based\n\nparser = argparse.ArgumentParser(description=\"\"\"\nThis script converts an ARPA-format language model to FST format\n(like the C++ program arpa2fst), but does so while applying bigram\nconstraints supplied in a separate file.  The resulting language\nmodel will have no unigram state, and there will be no backoff from\nthe bigram level.\nThis is useful for phone-level language models in order to keep\ngraphs small and impose things like linguistic constraints on\nallowable phone sequences.\nThis script writes its output to the stdout.  It is a text-form FST,\nsuitable for compilation by fstcompile.\n\"\"\")\n\n\nparser.add_argument('--disambig-symbol', type = str, default = \"#0\",\n                    help = 'Disambiguation symbol (e.g. #0), '\n                    'that is printed on the input side only of backoff '\n                    'arcs (output side would be epsilon)')\nparser.add_argument('arpa_in', type = str,\n                    help = 'The input ARPA file (must not be gzipped)')\nparser.add_argument('allowed_bigrams_in', type = str,\n                    help = \"A file containing the list of allowed bigram pairs.  
\"\n                    \"Must include pairs like '<s> foo' and 'foo </s>', as well as \"\n                    \"pairs like 'foo bar'.\")\nparser.add_argument('--verbose', type = int, default = 0,\n                    choices=[0,1,2,3,4,5], help = 'Verbose level')\n\nargs = parser.parse_args()\n\nif args.verbose >= 1:\n    print(' '.join(sys.argv), file = sys.stderr)\n\n\nclass HistoryState(object):\n    def __init__(self):\n        # note: neither backoff_prob nor the floats\n        # in word_to_prob are in log space.\n        self.backoff_prob = 1.0\n        # will be a dict from string to float.  the prob is\n        # the actual probability of the word, including any probability\n        # mass from backoff (they get added together while writing out\n        # the arpa, and these probs are read in from the arpa).\n        self.word_to_prob = dict()\n\n\nclass ArpaModel(object):\n    def __init__(self):\n        # self.orders is indexed by history-length [i.e. 0 for unigram,\n        # 1 for bigram and so on], and is then a dict indexed\n        # by tuples of history-words.  E.g. for trigrams, we'd index\n        # it as self.orders[2][('a', 'b')].\n        # The value-type of the dict is HistoryState.  E.g. 
to set the\n        # probability of the trigram a b -> c to 0.2, we'd do\n        # self.orders[2][('a', 'b')].word_to_prob['c'] = 0.2\n        self.orders = []\n\n    def Read(self, arpa_in):\n        assert len(self.orders) == 0\n        log10 = math.log(10.0)\n        if arpa_in == \"\" or arpa_in == \"-\":\n            arpa_in = \"/dev/stdin\"\n        try:\n            f = open(arpa_in, \"r\")\n        except:\n            sys.exit(\"{0}: error opening ARPA file {1}\".format(\n                     sys.argv[0], arpa_in))\n        # first read till the \\data\\ marker.\n        while True:\n            line = f.readline()\n            if line == '':\n                sys.exit(\"{0}: reading {1}, got EOF looking for \\\\data\\\\ marker.\".format(\n                    sys.argv[0], arpa_in))\n            if line[0:6] == '\\\\data\\\\':\n                break\n        while True:\n            # read, and ignore, the lines like 'ngram 1=1264'...\n            line = f.readline()\n            if line == '\\n' or line == '\\r\\n':\n                break\n            if line[0:5] != 'ngram':\n                sys.exit(\"{0}: reading {1}, read something unexpected in header: {2}\".format(\n                    sys.argv[0], arpa_in, line[:-1]))\n            rest=line[5:]\n            a = rest.split('=')  # e.g. 
a = [ '1', '1264' ]\n            if len(a) != 2:\n                sys.exit(\"{0}: reading {1}, read something unexpected in header: {2}\".format(\n                    sys.argv[0], arpa_in, line[:-1]))\n            max_order = int(a[0])\n\n\n        for n in range(max_order):\n            # self.orders[n], indexed by history-length (length of the\n            # history-vector, == order-1), is a map from history as a tuple\n            # of strings, to class HistoryState.\n            self.orders.append(defaultdict(lambda: HistoryState()))\n\n        cur_order = 0\n        while True:\n            line = f.readline()\n            if line == '':\n                sys.exit(\"{0}: reading {1}, found EOF while looking for \\\\end\\\\ marker.\".format(\n                    sys.argv[0], arpa_in))\n            elif line[0:5] == '\\\\end\\\\':\n                if len(self.orders) == 0:\n                    sys.exit(\"{0}: reading {1}, read no n-grams.\".format(sys.argv[0], arpa_in))\n                break\n            else:\n                cur_order += 1\n                expected_line = '\\\\{0}-grams:'.format(cur_order)\n                if not expected_line in line:  # e.g. 
allow trailing whitespace and newline\n                    sys.exit(\"{0}: reading {1}, expected line {2}, got {3}\".format(\n                        sys.argv[0], arpa_in, expected_line, line[:-1]))\n                if args.verbose >= 2:\n                    print(\"{0}: reading {1}-grams\".format(\n                        sys.argv[0], cur_order), file = sys.stderr)\n\n                # now read all the n-grams from this order.\n                while True:\n                    line = f.readline()\n                    # the section of n-grams is terminated by a blank line.\n                    if line == '\\n' or line == '\\r\\n':\n                        break\n                    a = line.split()\n                    l = len(a)\n                    if l != cur_order + 1 and l != cur_order + 2:\n                        sys.exit(\"{0}: reading {1}: in {2}-grams section, got bad line: {3}\".format(\n                            sys.argv[0], arpa_in, cur_order, line[:-1]))\n                    try:\n                        prob = math.exp(float(a[0]) * log10)\n                        hist = tuple(a[1:cur_order])  # tuple of strings\n                        word = a[cur_order]  # a string\n                        backoff_prob = math.exp(float(a[cur_order+1]) * log10) if l == cur_order + 2 else None\n                    except Exception as e:\n                        sys.exit(\"{0}: reading {1}: in {2}-grams section, got bad \"\n                                 \"line (exception is: {3}): {4}\".format(\n                                     sys.argv[0], arpa_in, cur_order,\n                                     str(type(e)) + ',' + str(e), line[:-1]))\n                    self.orders[cur_order-1][hist].word_to_prob[word] = prob\n                    if backoff_prob is not None:\n                        self.orders[cur_order][hist + (word,)].backoff_prob = backoff_prob\n\n        if args.verbose >= 2:\n            print(\"{0}: read {1}-gram model from {2}\".format(\n                sys.argv[0], cur_order, arpa_in), 
file = sys.stderr)\n        if cur_order < 2:\n            # we'd have to have some if-statements in the code to make this work,\n            # and I don't want to have to test it.\n            sys.exit(\"{0}: this script does not work when the ARPA language model \"\n                     \"is unigram.\".format(sys.argv[0]))\n\n    # Returns the probability of word 'word' in history-state 'hist'.\n    # Dies with error if this word is not predicted at all by the LM (not in vocab).\n    # Backs off to shorter histories if the history-state does not exist.\n    def GetProb(self, hist, word):\n        assert len(hist) < len(self.orders)\n        if len(hist) == 0:\n            word_to_prob = self.orders[0][()].word_to_prob\n            if not word in word_to_prob:\n                sys.exit(\"{0}: no probability in unigram for word {1}\".format(\n                    sys.argv[0], word))\n            return word_to_prob[word]\n        else:\n            if hist in self.orders[len(hist)]:\n                hist_state = self.orders[len(hist)][hist]\n                if word in hist_state.word_to_prob:\n                    return hist_state.word_to_prob[word]\n                else:\n                    return hist_state.backoff_prob * self.GetProb(hist[1:], word)\n            else:\n                return self.GetProb(hist[1:], word)\n\n    # This gets the state corresponding to 'hist' in 'hist_to_state', but backs\n    # off for us if there is no such state.\n    def GetStateForHist(self, hist_to_state, hist):\n        if hist in hist_to_state:\n            return hist_to_state[hist]\n        else:\n            if len(hist) <= 1:\n                # this would likely be a code error, but possibly an error\n                # in the ARPA file\n                sys.exit(\"{0}: error processing histories: history-state {1} \"\n                         \"does not exist.\".format(sys.argv[0], hist))\n            return self.GetStateForHist(hist_to_state, hist[1:])\n\n\n    def GetHistToStateMap(self):\n        # This 
function, called from PrintAsFst, returns (hist_to_state,\n        # state_to_hist), which map from history (as a tuple of strings) to\n        # integer FST-state and vice versa.\n\n        hist_to_state = dict()\n        state_to_hist = []\n\n        # Make sure the initial bigram state comes first (and that\n        # we have such a state even if it was completely pruned\n        # away in the bigram LM.. which is unlikely of course)\n        hist = ('<s>',)\n        hist_to_state[hist] = len(state_to_hist)\n        state_to_hist.append(hist)\n\n        # create a bigram state for each of the 'real' words...  even if the LM\n        # didn't naturally have such bigram states, we'll create them so that we\n        # can enforce the bigram constraints supplied in 'bigrams_file' by the\n        # user.\n        for word in self.orders[0][()].word_to_prob:\n            if word != '<s>' and word != '</s>':\n                hist = (word,)\n                hist_to_state[hist] = len(state_to_hist)\n                state_to_hist.append(hist)\n\n        # note: we do not allocate an FST state for the unigram state, because\n        # we don't have a unigram state in the output FST, only bigram states; and\n        # we don't iterate over bigram histories because we covered them all above;\n        # that's why we start 'n' from 2 below instead of from 0.\n        for n in range(2, len(self.orders)):\n            for hist in self.orders[n].keys():\n                # note: hist is a tuple of strings.\n                assert not hist in hist_to_state\n                hist_to_state[hist] = len(state_to_hist)\n                state_to_hist.append(hist)\n\n        return (hist_to_state, state_to_hist)\n\n    # This function prints the estimated language model as an FST.\n    # disambig_symbol will be something like '#0' (a symbol introduced\n    # to make the result determinizable).\n    # bigram_map represent the allowed bigrams (left-word, right-word): it's a map\n    # from 
left-word to a set of right-words (both are strings).\n    def PrintAsFst(self, disambig_symbol, bigram_map):\n        # History will map from history (as a tuple) to integer FST-state.\n        (hist_to_state, state_to_hist) = self.GetHistToStateMap()\n\n\n        # The following 3 things are just for diagnostics.\n        normalization_stats = [ [0, 0.0] for x in range(len(self.orders)) ]\n        num_ngrams_allowed = 0\n        num_ngrams_disallowed = 0\n\n        for state in range(len(state_to_hist)):\n            hist = state_to_hist[state]\n            hist_len = len(hist)\n            assert hist_len > 0\n            if hist_len == 1:  # it's a bigram state...\n                context_word = hist[0]\n                if not context_word in bigram_map:\n                    print(\"{0}: warning: word {1} appears in ARPA but is not listed \"\n                          \"as a left context in the bigram map\".format(\n                              sys.argv[0], context_word), file = sys.stderr)\n                    continue\n                # word list is a list of words that can follow this word.  
It must be nonempty.\n                word_list = list(bigram_map[context_word])\n\n                normalization_stats[hist_len][0] += 1\n\n                for word in word_list:\n                    prob = self.GetProb((context_word,), word)\n                    assert prob != 0\n                    normalization_stats[hist_len][1] += prob\n                    cost = -math.log(prob)\n                    if abs(cost) < 0.01 and args.verbose >= 3:\n                        print(\"{0}: warning: very small cost {1} for {2}->{3}\".format(\n                            sys.argv[0], cost, context_word, word), file=sys.stderr)\n                    if word == '</s>':\n                        # print the final-prob of this state.\n                        print(\"%d %.3f\" % (state, cost))\n                    else:\n                        next_state = self.GetStateForHist(hist_to_state,\n                                                          (context_word, word))\n                        print(\"%d %d %s %s %.3f\" %\n                              (state, next_state, word, word, cost))\n            else:  # it's a higher-order than bigram state.\n                assert hist in self.orders[hist_len]\n                hist_state = self.orders[hist_len][hist]\n                most_recent_word = hist[-1]\n\n                normalization_stats[hist_len][0] += 1\n                normalization_stats[hist_len][1] += \\\n                  sum([ self.GetProb(hist, word) for word in bigram_map[most_recent_word]])\n\n                for word, prob in hist_state.word_to_prob.items():\n                    cost = -math.log(prob)\n                    if word in bigram_map[most_recent_word]:\n                        num_ngrams_allowed += 1\n                    else:\n                        num_ngrams_disallowed += 1\n                        continue\n                    if word == '</s>':\n                        # print the final-prob of this state.\n                        print(\"%d 
%.3f\" % (state, cost))\n                    else:\n                        next_state = self.GetStateForHist(hist_to_state,\n                                                          (hist) + (word,))\n                        print(\"%d %d %s %s %.3f\" %\n                              (state, next_state, word, word, cost))\n                # Now deal with the backoff probability of this state (back off\n                # to the lower-order state).\n                assert hist in self.orders[hist_len]\n                backoff_prob = self.orders[hist_len][hist].backoff_prob\n                assert backoff_prob != 0.0\n                cost = -math.log(backoff_prob)\n                backoff_hist = hist[1:]\n                backoff_state = self.GetStateForHist(hist_to_state, backoff_hist)\n                # note: we only print the disambig symbol on the input side.\n                if args.verbose >= 3 and abs(cost) < 0.001:\n                    print(\"{0}: very low backoff cost {1} for history {2}, state = {3}\".format(\n                        sys.argv[0], cost, str(hist), state), file = sys.stderr)\n\n                # For hist-states that completely back off (they have no words coming out of them),\n                # there is no need to disambiguate, we can print an epsilon that will later be removed.\n                this_disambig_symbol = disambig_symbol if len(hist_state.word_to_prob) != 0 else '<eps>'\n                print(\"%d %d %s <eps> %.3f\" %\n                      (state, backoff_state, this_disambig_symbol, cost))\n        if args.verbose >= 1:\n            for hist_len in range(1, len(self.orders)):\n                num_states = normalization_stats[hist_len][0]\n                avg_prob_sum = normalization_stats[hist_len][1] / num_states if num_states > 0 else 0.0\n                print(\"{0}: for {1}-gram states, over {2} states the average sum of \"\n                      \"probs was {3} (would be 1.0 if properly normalized).\".format(\n            
              sys.argv[0], hist_len + 1, num_states, avg_prob_sum),\n                      file = sys.stderr)\n            if num_ngrams_disallowed != 0:\n                print(\"{0}: for explicit n-grams higher than bigram from the ARPA model, {1} \"\n                      \"were allowed by the bigram constraints and {2} were disallowed (we \"\n                      \"normally expect all or almost all of them to be allowed).\".format(\n                          sys.argv[0], num_ngrams_allowed, num_ngrams_disallowed),\n                      file = sys.stderr)\n\n\n\n# returns a map which is a dict [indexed by left-hand word] of sets [containing\n# the right-hand word].\ndef ReadBigramMap(bigrams_file):\n    ans = defaultdict(lambda: set())\n\n    have_one_bos = False\n    have_one_eos = False\n    have_one_regular = False\n\n    try:\n        f = open(bigrams_file, \"r\")\n    except:\n        sys.exit(\"utils/lang/internal/arpa2fst_constrained.py: error opening \"\n                 \"bigrams file \" + bigrams_file)\n    while True:\n        line = f.readline()\n        if line == '':\n            break\n        a = line.split()\n        if len(a) != 2:\n            sys.exit(\"utils/lang/internal/arpa2fst_constrained.py: bad line in \"\n                     \"bigrams file {0} (expect 2 fields): {1}\".format(\n                         bigrams_file, line[:-1]))\n        [word1, word2] = a\n        if word1 in ans and word2 in ans[word1]:\n            # note: sys.exit() takes no 'file' argument; it already prints to stderr.\n            sys.exit(\"{0}: bigrams file contained duplicate entry: {1} {2}\".format(\n                sys.argv[0], word1, word2))\n        if word2 == '<s>' or word1 == '</s>':\n            sys.exit(\"{0}: bad sequence of BOS/EOS symbols: {1} {2}\".format(\n                sys.argv[0], word1, word2))\n        if word1 == '<s>':\n            have_one_bos = True\n        elif word2 == '</s>':\n            have_one_eos = True\n        else:\n            have_one_regular = True\n        ans[word1].add(word2)\n    # check for at least one pair with 
BOS\n    if len(ans) == 0:\n        sys.exit(\"{0}: no data found in bigrams file {1}\".format(\n            sys.argv[0], bigrams_file))\n    elif not (have_one_bos and have_one_eos and have_one_regular):\n        sys.exit(\"{0}: the bigrams file {1} does not look right \"\n                 \"(make sure BOS and EOS symbols are there)\".format(\n            sys.argv[0], bigrams_file))\n    return ans\n\narpa_model = ArpaModel()\narpa_model.Read(args.arpa_in)\nbigrams_map = ReadBigramMap(args.allowed_bigrams_in)\nif len(args.disambig_symbol.split()) != 1:\n    sys.exit(\"{0}: invalid option --disambig-symbol={1}\".format(\n        sys.argv[0], args.disambig_symbol))\narpa_model.PrintAsFst(args.disambig_symbol, bigrams_map)\n"
  },
  {
    "path": "egs/utils/lang/internal/modify_unk_pron.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\nfrom __future__ import print_function\nimport sys\nimport os\nimport argparse\nfrom collections import defaultdict\n\n# note, this was originally based\n\nparser = argparse.ArgumentParser(description=\"\"\"\nThis script replaces the existing pronunciation of the\nunknown word in the provided lexicon, with a pronunciation\nconsisting of three disambiguation symbols: #1 followed by #2\nfollowed by #3.\nThe #2 will later be replaced by a phone-level LM by\napply_unk_lm.sh (called later on by prepare_lang.sh).\nCaution: this script is sensitive to the basename of the\nlexicon: it should be called either lexiconp.txt, in which\ncase the format is 'word pron-prob p1 p2 p3 ...'\nor lexiconp_silprob.txt, in which case the format is\n'word pron-prob sil-prob1 sil-prob2 sil-prob3 p1 p2 p3....'.\nIt is an error if there is not exactly one pronunciation of\nthe unknown word in the lexicon.\"\"\",\nepilog=\"\"\"E.g.: modify_unk_pron.py data/local/lang/lexiconp.txt '<unk>'.\nThis script is called from prepare_lang.sh.\"\"\")\n\nparser.add_argument('lexicon_file', type = str,\n                    help = 'Filename of the lexicon file to operate on (this is '\n                    'both an input and output of this script).')\nparser.add_argument('unk_word', type = str,\n                    help = \"The printed form of the unknown/OOV word, normally '<unk>'.\")\n\nargs = parser.parse_args()\n\nif len(args.unk_word.split()) != 1:\n    sys.exit(\"{0}: invalid unknown-word '{1}'\".format(\n        sys.argv[0], args.unk_word))\n\nbasename = os.path.basename(args.lexicon_file)\nif basename != 'lexiconp.txt' and basename != 'lexiconp_silprob.txt':\n    sys.exit(\"{0}: expected the basename of the lexicon file to be either \"\n             \"'lexiconp.txt' or 'lexiconp_silprob.txt', got: {1}\".format(\n                 sys.argv[0], args.lexicon_file))\n# the 
lexiconp.txt format is: word pron-prob p1 p2 p3...\n# lexiconp_silprob.txt has 3 extra real-valued fields after the pron-prob.\nnum_fields_before_pron = 2 if basename == 'lexiconp.txt' else 5\n\nprint(' '.join(sys.argv), file = sys.stderr)\n\ntry:\n    lexicon_in = open(args.lexicon_file, 'r')\nexcept:\n    sys.exit(\"{0}: failed to open lexicon file {1}\".format(\n        sys.argv[0], args.lexicon_file))\n\nsplit_lines = []\nunk_index = -1\nwhile True:\n    line = lexicon_in.readline()\n    if line == '':\n        break\n    this_split_line = line.split()\n    if this_split_line[0] == args.unk_word:\n        if unk_index != -1:\n            sys.exit(\"{0}: expected there to be exactly one pronunciation of the \"\n                     \"unknown word {1} in {2}, but there are more than one.\".format(\n                         sys.argv[0], args.unk_word, args.lexicon_file))\n        unk_index = len(split_lines)\n    if len(this_split_line) <= num_fields_before_pron:\n        sys.exit(\"{0}: input file {1} had a bad line (too few fields): {2}\".format(\n            sys.argv[0], args.lexicon_file, line[:-1]))\n    split_lines.append(this_split_line)\n\nif len(split_lines) == 0:\n    sys.exit(\"{0}: read no data from lexicon file {1}.\".format(\n        sys.argv[0], args.lexicon_file))\n\n\nif unk_index == -1:\n    sys.exit(\"{0}: expected there to be exactly one pronunciation of the \"\n             \"unknown word {1} in {2}, but there are none.\".format(\n                 sys.argv[0], args.unk_word, args.lexicon_file))\n\nlexicon_in.close()\n\n# now modify the pron.\nsplit_lines[unk_index] = split_lines[unk_index][0:num_fields_before_pron] + [ '#1', '#2', '#3' ]\n\n\ntry:\n    # write to the same file.\n    lexicon_out = open(args.lexicon_file, 'w')\nexcept:\n    sys.exit(\"{0}: failed to open lexicon file {1} for writing (permissions problem?)\".format(\n        sys.argv[0], args.lexicon_file))\n\nfor split_line in split_lines:\n    print(' '.join(split_line), file = 
lexicon_out)\n\ntry:\n    lexicon_out.close()\nexcept:\n    sys.exit(\"{0}: failed to close lexicon file {1} after writing (disk full?)\".format(\n        sys.argv[0], args.lexicon_file))\n"
  },
  {
    "path": "egs/utils/lang/limit_arpa_unk_history.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018    Armin Oliya\n# Apache 2.0.\n\n'''\nThis script takes an existing ARPA language model and limits the <unk> history\nto make it suitable for downstream <unk> modeling.\nThis is for the case when you don't have access\nto the original text corpus that is used for creating the LM.\nIf you do, you can use pocolm with the option --limit-unk-history=true.\nThis keeps the graph compact after adding the unk model.\n'''\n\nimport argparse\nimport io\nimport re\nimport sys\nfrom collections import defaultdict\n\n\nparser = argparse.ArgumentParser(\n    description='''This script takes an existing ARPA language model\n    and limits the <unk> history to make it suitable\n    for downstream <unk> modeling.\n    It supports up to 5-grams.''',\n    usage='''utils/lang/limit_arpa_unk_history.py\n    <oov-dict-entry> <input-arpa >output-arpa''',\n    epilog='''E.g.: gunzip -c src.arpa.gz |\n    utils/lang/limit_arpa_unk_history.py \"<unk>\" | gzip -c >dest.arpa.gz''')\n\nparser.add_argument(\n    'oov_dict_entry',\n    help='oov identifier, for example \"<unk>\"', type=str)\nargs = parser.parse_args()\n\n\ndef get_ngram_stats(old_lm_lines):\n    ngram_counts = defaultdict(int)\n\n    for i in range(10):\n        g = re.search(r\"ngram (\\d)=(\\d+)\", old_lm_lines[i])\n        if g:\n            ngram_counts[int(g.group(1))] = int(g.group(2))\n\n    if len(ngram_counts) == 0:\n        sys.exit(\"\"\"Couldn't get counts per ngram section.\n            The input doesn't seem to be a valid ARPA language model.\"\"\")\n\n    max_ngrams = list(ngram_counts.keys())[-1]\n    skip_rows = ngram_counts[1]\n\n    if max_ngrams > 5:\n        sys.exit(\"This script supports up to 5-gram language models.\")\n\n    return max_ngrams, skip_rows, ngram_counts\n\n\ndef find_and_replace_unks(old_lm_lines, max_ngrams, skip_rows):\n    ngram_diffs = defaultdict(int)\n    whitespace_pattern = re.compile(\"[ \\t]+\")\n    unk_pattern = 
re.compile(\n        \"[0-9.-]+(?:[\s\\t]\S+){1,3}[\s\\t]\" + args.oov_dict_entry +\n        \"[\s\\t](?!-[0-9]+\.[0-9]+).*\")\n    backoff_pattern = re.compile(\n        \"[0-9.-]+(?:[\s\\t]\S+){1,3}[\s\\t]<unk>[\s\\t]-[0-9]+\.[0-9]+\")\n    passed_2grams, last_ngram = False, False\n    unk_row_count, backoff_row_count = 0, 0\n\n    print(\"Updating the language model .. \", file=sys.stderr)\n    new_lm_lines = old_lm_lines[:skip_rows]\n\n    for i in range(skip_rows, len(old_lm_lines)):\n            line = old_lm_lines[i].strip(\" \\t\\r\\n\")\n\n            if \"\{}-grams:\".format(3) in line:\n                passed_2grams = True\n            if \"\{}-grams:\".format(max_ngrams) in line:\n                last_ngram = True\n\n            for i in range(max_ngrams):\n                if \"\{}-grams:\".format(i+1) in line:\n                    ngram = i+1\n\n            # remove any n-gram states of the form: foo <unk> -> X\n            # that is, any n-grams of order > 2 where <unk>\n            # is the second-to-last word\n            # here we skip 1-gram and 2-gram sections of arpa\n\n            if passed_2grams:\n                g_unk = unk_pattern.search(line)\n                if g_unk:\n                    ngram_diffs[ngram] = ngram_diffs[ngram] - 1\n                    unk_row_count += 1\n                    continue\n\n            # remove backoff probability from the lines that end with <unk>\n            # for example, the -0.64 in -4.09 every <unk> -0.64\n            # here we skip the last n-gram section because it\n            # doesn't include backoff probabilities\n\n            if not last_ngram:\n                g_backoff = backoff_pattern.search(line)\n                if g_backoff:\n                    updated_row = whitespace_pattern.split(g_backoff.group(0))[:-1]\n                    updated_row = updated_row[0] + \\\n                        \"\\t\" + \" \".join(updated_row[1:]) + \"\\n\"\n
                    new_lm_lines.append(updated_row)\n                    backoff_row_count += 1\n                    continue\n\n            new_lm_lines.append(line+\"\\n\")\n\n    print(\"Removed {} lines including {} as second-to-last term.\".format(\n        unk_row_count, args.oov_dict_entry), file=sys.stderr)\n    print(\"Removed backoff probabilities from {} lines.\".format(\n        backoff_row_count), file=sys.stderr)\n\n    return new_lm_lines, ngram_diffs\n\n\ndef read_old_lm():\n    print(\"Reading ARPA LM from input stream .. \", file=sys.stderr)\n\n    with io.TextIOWrapper(\n            sys.stdin.buffer,\n            encoding=\"latin-1\") as input_stream:\n        old_lm_lines = input_stream.readlines()\n\n    return old_lm_lines\n\n\ndef write_new_lm(new_lm_lines, ngram_counts, ngram_diffs):\n    ''' Update n-gram counts that go in the header of the arpa lm '''\n\n    for i in range(10):\n        g = re.search(r\"ngram (\\d)=(\\d+)\", new_lm_lines[i])\n        if g:\n            n = int(g.group(1))\n            if n in ngram_diffs:\n                # ngram_diffs contains negative values\n                new_num_ngrams = ngram_counts[n] + ngram_diffs[n]\n                new_lm_lines[i] = \"ngram {}={}\\n\".format(\n                    n, new_num_ngrams)\n\n    with io.TextIOWrapper(\n            sys.stdout.buffer,\n            encoding=\"latin-1\") as output_stream:\n        output_stream.writelines(new_lm_lines)\n\n\ndef main():\n    old_lm_lines = read_old_lm()\n    max_ngrams, skip_rows,  ngram_counts = get_ngram_stats(old_lm_lines)\n    new_lm_lines, ngram_diffs = find_and_replace_unks(\n        old_lm_lines, max_ngrams, skip_rows)\n    write_new_lm(new_lm_lines, ngram_counts, ngram_diffs)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/utils/lang/make_kn_lm.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2016  Johns Hopkins University (Author: Daniel Povey)\n#           2018  Ruizhe Huang\n# Apache 2.0.\n\n# This is an implementation of computing Kneser-Ney smoothed language model\n# in the same way as srilm. This is a back-off, unmodified version of\n# Kneser-Ney smoothing, which produces the same results as the following\n# command (as an example) of srilm:\n#\n# $ ngram-count -order 4 -kn-modify-counts-at-end -ukndiscount -gt1min 0 -gt2min 0 -gt3min 0 -gt4min 0 \\\n# -text corpus.txt -lm lm.arpa\n#\n# The data structure is based on: kaldi/egs/wsj/s5/utils/lang/make_phone_lm.py\n# The smoothing algorithm is based on: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html\n\nimport sys\nimport os\nimport re\nimport io\nimport math\nimport argparse\nfrom collections import Counter, defaultdict\n\n\nparser = argparse.ArgumentParser(description=\"\"\"\n    Generate kneser-ney language model as arpa format. By default,\n    it will read the corpus from standard input, and output to standard output.\n    \"\"\")\nparser.add_argument(\"-ngram-order\", type=int, default=4, choices=[2, 3, 4, 5, 6, 7], help=\"Order of n-gram\")\nparser.add_argument(\"-text\", type=str, default=None, help=\"Path to the corpus file\")\nparser.add_argument(\"-lm\", type=str, default=None, help=\"Path to output arpa file for language models\")\nparser.add_argument(\"-verbose\", type=int, default=0, choices=[0, 1, 2, 3, 4, 5], help=\"Verbose level\")\nargs = parser.parse_args()\n\ndefault_encoding = \"latin-1\"  # For encoding-agnostic scripts, we assume byte stream as input.\n                              # Need to be very careful about the use of strip() and split()\n                              # in this case, because there is a latin-1 whitespace character\n                              # (nbsp) which is part of the unicode encoding range.\n                              # Ref: 
kaldi/egs/wsj/s5/utils/lang/bpe/prepend_words.py @ 69cd717\nstrip_chars = \" \\t\\r\\n\"\nwhitespace = re.compile(\"[ \\t]+\")\n\n\nclass CountsForHistory:\n    # This class (which is more like a struct) stores the counts seen in a\n    # particular history-state.  It is used inside class NgramCounts.\n    # It really does the job of a dict from int to float, but it also\n    # keeps track of the total count.\n    def __init__(self):\n        # The 'lambda: defaultdict(float)' is an anonymous function taking no\n        # arguments that returns a new defaultdict(float).\n        self.word_to_count = defaultdict(int)\n        self.word_to_context = defaultdict(set)  # using a set to count the number of unique contexts\n        self.word_to_f = dict()  # discounted probability\n        self.word_to_bow = dict()  # back-off weight\n        self.total_count = 0\n\n    def words(self):\n        return self.word_to_count.keys()\n\n    def __str__(self):\n        # e.g. returns ' total=12: 3->4, 4->6, -1->2'\n        return ' total={0}: {1}'.format(\n            str(self.total_count),\n            ', '.join(['{0} -> {1}'.format(word, count)\n                      for word, count in self.word_to_count.items()]))\n\n    def add_count(self, predicted_word, context_word, count):\n        assert count >= 0\n\n        self.total_count += count\n        self.word_to_count[predicted_word] += count\n        if context_word is not None:\n            self.word_to_context[predicted_word].add(context_word)\n\n\nclass NgramCounts:\n    # A note on data-structure.  Firstly, all words are represented as\n    # integers.  We store n-gram counts as an array, indexed by (history-length\n    # == n-gram order minus one) (note: python calls arrays \"lists\") of dicts\n    # from histories to counts, where histories are arrays of integers and\n    # \"counts\" are dicts from integer to float.  
For instance, when\n    # accumulating the 4-gram count for the '8' in the sequence '5 6 7 8', we'd\n    # do as follows: self.counts[3][[5,6,7]][8] += 1.0 where the [3] indexes an\n    # array, the [[5,6,7]] indexes a dict, and the [8] indexes a dict.\n    def __init__(self, ngram_order, bos_symbol='<s>', eos_symbol='</s>'):\n        assert ngram_order >= 2\n\n        self.ngram_order = ngram_order\n        self.bos_symbol = bos_symbol\n        self.eos_symbol = eos_symbol\n\n        self.counts = []\n        for n in range(ngram_order):\n            self.counts.append(defaultdict(lambda: CountsForHistory()))\n\n        self.d = []  # list of discounting factor for each order of ngram\n\n    # adds a raw count (called while processing input data).\n    # Suppose we see the sequence '6 7 8 9' and ngram_order=4, 'history'\n    # would be (6,7,8) and 'predicted_word' would be 9; 'count' would be\n    # 1.\n    def add_count(self, history, predicted_word, context_word, count):\n        self.counts[len(history)][history].add_count(predicted_word, context_word, count)\n\n    # 'line' is a string containing a sequence of integer word-ids.\n    # This function adds the un-smoothed counts from this line of text.\n    def add_raw_counts_from_line(self, line):\n        words = [self.bos_symbol] + whitespace.split(line) + [self.eos_symbol]\n\n        for i in range(len(words)):\n            for n in range(1, self.ngram_order+1):\n                if i + n > len(words):\n                    break\n\n                ngram = words[i: i + n]\n                predicted_word = ngram[-1]\n                history = tuple(ngram[: -1])\n                if i == 0 or n == self.ngram_order:\n                    context_word = None\n                else:\n                    context_word = words[i-1]\n\n                self.add_count(history, predicted_word, context_word, 1)\n\n    def add_raw_counts_from_standard_input(self):\n        lines_processed = 0\n        infile = 
io.TextIOWrapper(sys.stdin.buffer, encoding=default_encoding)  # byte stream as input\n        for line in infile:\n            line = line.strip(strip_chars)\n            if line == '':\n                break\n            self.add_raw_counts_from_line(line)\n            lines_processed += 1\n        if lines_processed == 0 or args.verbose > 0:\n            print(\"make_kn_lm.py: processed {0} lines of input\".format(lines_processed), file=sys.stderr)\n\n    def add_raw_counts_from_file(self, filename):\n        lines_processed = 0\n        with open(filename, encoding=default_encoding) as fp:\n            for line in fp:\n                line = line.strip(strip_chars)\n                if line == '':\n                    break\n                self.add_raw_counts_from_line(line)\n                lines_processed += 1\n        if lines_processed == 0 or args.verbose > 0:\n            print(\"make_kn_lm.py: processed {0} lines of input\".format(lines_processed), file=sys.stderr)\n\n    def cal_discounting_constants(self):\n        # For each order N of N-grams, we calculate discounting constant D_N = n1_N / (n1_N + 2 * n2_N),\n        # where n1_N (resp. n2_N) is the number of unique N-grams with count = 1 (resp. 2), i.e. counts-of-counts.\n        # This constant is used similarly to absolute discounting.\n        # Result: self.d is a list of floats, where self.d[N-1] = D_N\n\n        self.d = [0]  # for the lowest order, i.e., 1-gram, we do not need to discount, thus the constant is 0\n                      # This is a special case: as we currently assumed having seen all vocabularies in the dictionary,\n                      # but perhaps this is not the case for some other scenarios.\n        for n in range(1, self.ngram_order):\n            this_order_counts = self.counts[n]\n            n1 = 0\n            n2 = 0\n            for hist, counts_for_hist in this_order_counts.items():\n                stat = Counter(counts_for_hist.word_to_count.values())\n                n1 += stat[1]\n
                n2 += stat[2]\n            assert n1 + 2 * n2 > 0\n            self.d.append(n1 * 1.0 / (n1 + 2 * n2))\n\n    def cal_f(self):\n        # f(a_z) is a probability distribution of word sequence a_z.\n        # Typically f(a_z) is discounted to be less than the ML estimate so we have\n        # some leftover probability for the z words unseen in the context (a_).\n        #\n        # f(a_z) = (c(a_z) - D0) / c(a_)    ;; for highest order N-grams\n        # f(_z)  = (n(*_z) - D1) / n(*_*)\t;; for lower order N-grams\n\n        # highest order N-grams\n        n = self.ngram_order - 1\n        this_order_counts = self.counts[n]\n        for hist, counts_for_hist in this_order_counts.items():\n            for w, c in counts_for_hist.word_to_count.items():\n                counts_for_hist.word_to_f[w] = max((c - self.d[n]), 0) * 1.0 / counts_for_hist.total_count\n\n        # lower order N-grams\n        for n in range(0, self.ngram_order - 1):\n            this_order_counts = self.counts[n]\n            for hist, counts_for_hist in this_order_counts.items():\n\n                n_star_star = 0\n                for w in counts_for_hist.word_to_count.keys():\n                    n_star_star += len(counts_for_hist.word_to_context[w])\n\n                if n_star_star != 0:\n                    for w in counts_for_hist.word_to_count.keys():\n                        n_star_z = len(counts_for_hist.word_to_context[w])\n                        counts_for_hist.word_to_f[w] = max((n_star_z - self.d[n]), 0) * 1.0 / n_star_star\n                else:  # patterns begin with <s>, they do not have \"modified count\", so use raw count instead\n                    for w in counts_for_hist.word_to_count.keys():\n                        n_star_z = counts_for_hist.word_to_count[w]\n                        counts_for_hist.word_to_f[w] = max((n_star_z - self.d[n]), 0) * 1.0 / counts_for_hist.total_count\n\n    def cal_bow(self):\n        # Backoff weights are only necessary for 
ngrams which form a prefix of a longer ngram.\n        # Thus, two sorts of ngrams do not have a bow:\n        # 1) highest order ngram\n        # 2) ngrams ending in </s>\n        #\n        # bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))\n        # Note that Z1 is the set of all words with c(a_z) > 0\n\n        # highest order N-grams\n        n = self.ngram_order - 1\n        this_order_counts = self.counts[n]\n        for hist, counts_for_hist in this_order_counts.items():\n            for w in counts_for_hist.word_to_count.keys():\n                counts_for_hist.word_to_bow[w] = None\n\n        # lower order N-grams\n        for n in range(0, self.ngram_order - 1):\n            this_order_counts = self.counts[n]\n            for hist, counts_for_hist in this_order_counts.items():\n                for w in counts_for_hist.word_to_count.keys():\n                    if w == self.eos_symbol:\n                        counts_for_hist.word_to_bow[w] = None\n                    else:\n                        a_ = hist + (w,)\n\n                        assert len(a_) < self.ngram_order\n                        assert a_ in self.counts[len(a_)].keys()\n\n                        a_counts_for_hist = self.counts[len(a_)][a_]\n\n                        sum_z1_f_a_z = 0\n                        for u in a_counts_for_hist.word_to_count.keys():\n                            sum_z1_f_a_z += a_counts_for_hist.word_to_f[u]\n\n                        sum_z1_f_z = 0\n                        _ = a_[1:]\n                        _counts_for_hist = self.counts[len(_)][_]\n                        for u in a_counts_for_hist.word_to_count.keys():  # Should be careful here: what is Z1\n                            sum_z1_f_z += _counts_for_hist.word_to_f[u]\n\n                        counts_for_hist.word_to_bow[w] = (1.0 - sum_z1_f_a_z) / (1.0 - sum_z1_f_z)\n\n    def print_raw_counts(self, info_string):\n        # these are useful for debug.\n        print(info_string)\n        res = 
[]\n        for this_order_counts in self.counts:\n            for hist, counts_for_hist in this_order_counts.items():\n                for w in counts_for_hist.word_to_count.keys():\n                    ngram = \" \".join(hist) + \" \" + w\n                    ngram = ngram.strip(strip_chars)\n\n                    res.append(\"{0}\\t{1}\".format(ngram, counts_for_hist.word_to_count[w]))\n        res.sort(reverse=True)\n        for r in res:\n            print(r)\n\n    def print_modified_counts(self, info_string):\n        # these are useful for debug.\n        print(info_string)\n        res = []\n        for this_order_counts in self.counts:\n            for hist, counts_for_hist in this_order_counts.items():\n                for w in counts_for_hist.word_to_count.keys():\n                    ngram = \" \".join(hist) + \" \" + w\n                    ngram = ngram.strip(strip_chars)\n\n                    modified_count = len(counts_for_hist.word_to_context[w])\n                    raw_count = counts_for_hist.word_to_count[w]\n\n                    if modified_count == 0:\n                        res.append(\"{0}\\t{1}\".format(ngram, raw_count))\n                    else:\n                        res.append(\"{0}\\t{1}\".format(ngram, modified_count))\n        res.sort(reverse=True)\n        for r in res:\n            print(r)\n\n    def print_f(self, info_string):\n        # these are useful for debug.\n        print(info_string)\n        res = []\n        for this_order_counts in self.counts:\n            for hist, counts_for_hist in this_order_counts.items():\n                for w in counts_for_hist.word_to_count.keys():\n                    ngram = \" \".join(hist) + \" \" + w\n                    ngram = ngram.strip(strip_chars)\n\n                    f = counts_for_hist.word_to_f[w]\n                    if f == 0:  # f(<s>) is always 0\n                        f = 1e-99\n\n                    res.append(\"{0}\\t{1}\".format(ngram, math.log(f, 10)))\n     
   res.sort(reverse=True)\n        for r in res:\n            print(r)\n\n    def print_f_and_bow(self, info_string):\n        # these are useful for debug.\n        print(info_string)\n        res = []\n        for this_order_counts in self.counts:\n            for hist, counts_for_hist in this_order_counts.items():\n                for w in counts_for_hist.word_to_count.keys():\n                    ngram = \" \".join(hist) + \" \" + w\n                    ngram = ngram.strip(strip_chars)\n\n                    f = counts_for_hist.word_to_f[w]\n                    if f == 0:  # f(<s>) is always 0\n                        f = 1e-99\n\n                    bow = counts_for_hist.word_to_bow[w]\n                    if bow is None:\n                        res.append(\"{1}\\t{0}\".format(ngram, math.log(f, 10)))\n                    else:\n                        res.append(\"{1}\\t{0}\\t{2}\".format(ngram, math.log(f, 10), math.log(bow, 10)))\n        res.sort(reverse=True)\n        for r in res:\n            print(r)\n\n    def print_as_arpa(self, fout=io.TextIOWrapper(sys.stdout.buffer, encoding='latin-1')):\n        # print as ARPA format.\n\n        print('\\\\data\\\\', file=fout)\n        for hist_len in range(self.ngram_order):\n            # print the number of n-grams.\n            print('ngram {0}={1}'.format(\n                hist_len + 1,\n                sum([len(counts_for_hist.word_to_f) for counts_for_hist in self.counts[hist_len].values()])),\n                file=fout\n            )\n\n        print('', file=fout)\n\n        for hist_len in range(self.ngram_order):\n            print('\\\\{0}-grams:'.format(hist_len + 1), file=fout)\n\n            this_order_counts = self.counts[hist_len]\n            for hist, counts_for_hist in this_order_counts.items():\n                for word in counts_for_hist.word_to_count.keys():\n                    ngram = hist + (word,)\n                    prob = counts_for_hist.word_to_f[word]\n                    bow = 
counts_for_hist.word_to_bow[word]\n\n                    if prob == 0:  # f(<s>) is always 0\n                        prob = 1e-99\n\n                    line = '{0}\\t{1}'.format('%.7f' % math.log10(prob), ' '.join(ngram))\n                    if bow is not None:\n                        line += '\\t{0}'.format('%.7f' % math.log10(bow))\n                    print(line, file=fout)\n            print('', file=fout)\n        print('\\\\end\\\\', file=fout)\n\n\nif __name__ == \"__main__\":\n\n    ngram_counts = NgramCounts(args.ngram_order)\n\n    if args.text is None:\n        ngram_counts.add_raw_counts_from_standard_input()\n    else:\n        assert os.path.isfile(args.text)\n        ngram_counts.add_raw_counts_from_file(args.text)\n\n    ngram_counts.cal_discounting_constants()\n    ngram_counts.cal_f()\n    ngram_counts.cal_bow()\n\n    if args.lm is None:\n        ngram_counts.print_as_arpa()\n    else:\n        with open(args.lm, 'w', encoding=default_encoding) as f:\n            ngram_counts.print_as_arpa(fout=f)\n"
  },
  {
    "path": "egs/utils/lang/make_lexicon_fst.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright   2018  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n\n# see get_args() below for usage message.\nimport argparse\nimport os\nimport sys\nimport math\nimport re\n\n# The use of latin-1 encoding does not preclude reading utf-8.  latin-1\n# encoding means \"treat words as sequences of bytes\", and it is compatible\n# with utf-8 encoding as well as other encodings such as gbk, as long as the\n# spaces are also spaces in ascii (which we check).  It is basically how we\n# emulate the behavior of python before python3.\nsys.stdout = open(1, 'w', encoding='latin-1', closefd=False)\nsys.stderr = open(2, 'w', encoding='latin-1', closefd=False)\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script creates the\n       text form of a lexicon FST, to be compiled by fstcompile using the\n       appropriate symbol tables (phones.txt and words.txt).  It will mostly\n       be invoked indirectly via utils/prepare_lang.sh.  The output goes to\n       the stdout.\"\"\")\n\n    parser.add_argument('--sil-phone', dest='sil_phone', type=str,\n                        help=\"\"\"Text form of optional-silence phone, e.g. 'SIL'.  See also\n                        the --sil-prob option.\"\"\")\n    parser.add_argument('--sil-prob', dest='sil_prob', type=float, default=0.0,\n                        help=\"\"\"Probability of silence between words (including at the\n                        beginning and end of word sequences).  Must be in the range [0.0, 1.0].\n                        This refers to the optional silence inserted by the lexicon; see\n                        the --sil-phone option.\"\"\")\n    parser.add_argument('--sil-disambig', dest='sil_disambig', type=str,\n                        help=\"\"\"Disambiguation symbol to disambiguate silence, e.g. 
#5.\n                        Will only be supplied if you are creating the version of L.fst\n                        with disambiguation symbols, intended for use with cyclic G.fst.\n                        This symbol was introduced to fix a rather obscure source of\n                        nondeterminism of CLG.fst, that has to do with reordering of\n                        disambiguation symbols and phone symbols.\"\"\")\n    parser.add_argument('--left-context-phones', dest='left_context_phones', type=str,\n                        help=\"\"\"Only relevant if --nonterminals is also supplied; this relates\n                        to grammar decoding (see http://kaldi-asr.org/doc/grammar.html or\n                        src/doc/grammar.dox).  Format is a list of left-context phones,\n                        in text form, one per line.  E.g. data/lang/phones/left_context_phones.txt\"\"\")\n    parser.add_argument('--nonterminals', type=str,\n                        help=\"\"\"If supplied, --left-context-phones must also be supplied.\n                        List of user-defined nonterminal symbols such as #nonterm:contact_list,\n                        one per line.  E.g. data/local/dict/nonterminals.txt.\"\"\")\n    parser.add_argument('lexiconp', type=str,\n                        help=\"\"\"Filename of lexicon with pronunciation probabilities\n                        (normally lexiconp.txt), with lines of the form 'word prob p1 p2...',\n                        e.g. 
'a   1.0    ay'\"\"\")\n    args = parser.parse_args()\n    return args\n\n\ndef read_lexiconp(filename):\n    \"\"\"Reads the lexiconp.txt file in 'filename', with lines like 'word pron-prob p1 p2 ...'.\n    Returns a list of tuples (word, pron_prob, pron), where 'word' is a string,\n    'pron_prob', a float, is the pronunciation probability (which must be >0.0\n    and would normally be <=1.0),  and 'pron' is a list of strings representing phones.\n    An element in the returned list might be ('hello', 1.0, ['h', 'eh', 'l', 'ow']).\n    \"\"\"\n\n    ans = []\n    found_empty_prons = False\n    found_large_pronprobs = False\n    # See the comment near the top of this file, RE why we use latin-1.\n    with open(filename, 'r', encoding='latin-1') as f:\n        whitespace = re.compile(\"[ \\t]+\")\n        for line in f:\n            a = whitespace.split(line.strip(\" \\t\\r\\n\"))\n            if len(a) < 2:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2} \".format(\n                    sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            word = a[0]\n            if word == \"<eps>\":\n                # This would clash with the epsilon symbol normally used in OpenFst.\n                print(\"{0}: error: found <eps> as a word in lexicon file \"\n                      \"{1}\".format(sys.argv[0], filename), file=sys.stderr)\n                sys.exit(1)\n            try:\n                pron_prob = float(a[1])\n            except:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2}, 2nd field \"\n                      \"should be pron-prob\".format(sys.argv[0], line.strip(\" \\t\\r\\n\"), filename),\n                      file=sys.stderr)\n                sys.exit(1)\n            prons = a[2:]\n            if pron_prob <= 0.0:\n                print(\"{0}: error: invalid pron-prob in line '{1}' of lexicon file {2} \".format(\n                   
 sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            if len(prons) == 0:\n                found_empty_prons = True\n            ans.append( (word, pron_prob, prons) )\n            if pron_prob > 1.0:\n                found_large_pronprobs = True\n    if found_empty_prons:\n        print(\"{0}: warning: found at least one word with an empty pronunciation \"\n              \"in lexicon file {1}.\".format(sys.argv[0], filename),\n              file=sys.stderr)\n    if found_large_pronprobs:\n        print(\"{0}: warning: found at least one word with pron-prob >1.0 \"\n              \"in {1}\".format(sys.argv[0], filename), file=sys.stderr)\n\n\n    if len(ans) == 0:\n        print(\"{0}: error: found no pronunciations in lexicon file {1}\".format(\n            sys.argv[0], filename), file=sys.stderr)\n        sys.exit(1)\n    return ans\n\n\ndef write_nonterminal_arcs(start_state, loop_state, next_state,\n                           nonterminals, left_context_phones):\n    \"\"\"This function relates to the grammar-decoding setup, see\n    kaldi-asr.org/doc/grammar.html.  It is called from write_fst_no_silence\n    and write_fst_with_silence, and writes to the stdout some extra arcs\n    in the lexicon FST that relate to nonterminal symbols.\n    See the section \"Special symbols in L.fst\",\n    kaldi-asr.org/doc/grammar.html#grammar_special_l.\n       start_state: the start-state of L.fst.\n       loop_state:  the state of high out-degree in L.fst where words leave\n                  and enter.\n       next_state: the number from which this function can start allocating its\n                  own states.  the updated value of next_state will be returned.\n       nonterminals: the user-defined nonterminal symbols as a list of\n          strings, e.g. ['#nonterm:contact_list', ... ].\n       left_context_phones: a list of phones that may appear as left-context,\n          e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n    shared_state = next_state\n    next_state += 1\n    final_state = next_state\n    next_state += 1\n\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=start_state, dest=shared_state,\n        phone='#nonterm_begin', word='#nonterm_begin',\n        cost=0.0))\n\n    for nonterminal in nonterminals:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=loop_state, dest=shared_state,\n            phone=nonterminal, word=nonterminal,\n            cost=0.0))\n    # this_cost equals log(len(left_context_phones)) but the expression below\n    # better captures the meaning.  Applying this cost to arcs keeps the FST\n    # stochastic (sum-to-one, like an HMM), so that if we do weight pushing\n    # things won't get weird.  In the grammar-FST code when we splice things\n    # together we will cancel out this cost, see the function CombineArcs().\n    this_cost = -math.log(1.0 / len(left_context_phones))\n\n    for left_context_phone in left_context_phones:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=shared_state, dest=loop_state,\n            phone=left_context_phone, word='<eps>', cost=this_cost))\n    # arc from loop-state to a final-state with #nonterm_end as ilabel and olabel\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=loop_state, dest=final_state,\n        phone='#nonterm_end', word='#nonterm_end', cost=0.0))\n    print(\"{state}\\t{final_cost}\".format(\n        state=final_state, final_cost=0.0))\n    return next_state\n\n\n\ndef write_fst_no_silence(lexicon, nonterminals=None, left_context_phones=None):\n    \"\"\"Writes the text format of L.fst to the standard output.  
This version is for\n    when --sil-prob=0.0, meaning there is no optional silence allowed.\n\n      'lexicon' is a list of 3-tuples (word, pron-prob, prons) as returned by\n        read_lexiconp().\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... '#nonterm_bos'].\n    \"\"\"\n\n    loop_state = 0\n    next_state = 1  # the next un-allocated state, will be incremented as we go.\n    for (word, pronprob, pron) in lexicon:\n        cost = -math.log(pronprob)\n        cur_state = loop_state\n        for i in range(len(pron) - 1):\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=cur_state,\n                dest=next_state,\n                phone=pron[i],\n                word=(word if i == 0 else '<eps>'),\n                cost=(cost if i == 0 else 0.0)))\n            cur_state = next_state\n            next_state += 1\n\n        i = len(pron) - 1  # note: i == -1 if pron is empty.\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=loop_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=(cost if i <= 0 else 0.0)))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            loop_state, loop_state, next_state,\n            nonterminals, left_context_phones)\n\n    print(\"{state}\\t{final_cost}\".format(\n        state=loop_state,\n        final_cost=0.0))\n\n\ndef write_fst_with_silence(lexicon, sil_prob, sil_phone, sil_disambig,\n                        
   nonterminals=None, left_context_phones=None):\n    \"\"\"Writes the text format of L.fst to the standard output.  This version is for\n       when --sil-prob != 0.0, meaning there is optional silence\n     'lexicon' is a list of 3-tuples (word, pron-prob, prons)\n         as returned by read_lexiconp().\n     'sil_prob', which is expected to be strictly between 0.0 and 1.0, is the\n         probability of silence\n     'sil_phone' is the silence phone, e.g. \"SIL\".\n     'sil_disambig' is either None, or the silence disambiguation symbol, e.g. \"#5\".\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied, is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n\n    assert sil_prob > 0.0 and sil_prob < 1.0\n    sil_cost = -math.log(sil_prob)\n    no_sil_cost = -math.log(1.0 - sil_prob);\n\n    start_state = 0\n    loop_state = 1  # words enter and leave from here\n    sil_state = 2   # words terminate here when followed by silence; this state\n                    # has a silence transition to loop_state.\n    next_state = 3  # the next un-allocated state, will be incremented as we go.\n\n\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=loop_state,\n        phone='<eps>', word='<eps>', cost=no_sil_cost))\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=sil_state,\n        phone='<eps>', word='<eps>', cost=sil_cost))\n    if sil_disambig is None:\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_state, dest=loop_state,\n            phone=sil_phone, word='<eps>', cost=0.0))\n    else:\n        sil_disambig_state = next_state\n        next_state += 1\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_state, dest=sil_disambig_state,\n            phone=sil_phone, word='<eps>', cost=0.0))\n        print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n            src=sil_disambig_state, dest=loop_state,\n            phone=sil_disambig, word='<eps>', cost=0.0))\n\n\n    for (word, pronprob, pron) in lexicon:\n        pron_cost = -math.log(pronprob)\n        cur_state = loop_state\n        for i in range(len(pron) - 1):\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=cur_state, dest=next_state,\n                phone=pron[i],\n                word=(word if i == 0 else '<eps>'),\n                cost=(pron_cost if i == 0 else 0.0)))\n            cur_state = next_state\n            next_state += 1\n\n        i = len(pron) - 1  # note: i == -1 if pron is empty.\n        
print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=loop_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=no_sil_cost + (pron_cost if i <= 0 else 0.0)))\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,\n            dest=sil_state,\n            phone=(pron[i] if i >= 0 else '<eps>'),\n            word=(word if i <= 0 else '<eps>'),\n            cost=sil_cost + (pron_cost if i <= 0 else 0.0)))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            start_state, loop_state, next_state,\n            nonterminals, left_context_phones)\n\n    print(\"{state}\\t{final_cost}\".format(\n        state=loop_state,\n        final_cost=0.0))\n\n\n\n\ndef write_words_txt(orig_lines, highest_numbered_symbol, nonterminals, filename):\n    \"\"\"Writes updated words.txt to 'filename'.  'orig_lines' is the original lines\n       in the words.txt file as a list of strings (without the newlines);\n       highest_numbered_symbol is the highest numbered symbol in the original\n       words.txt; nonterminals is a list of strings like '#nonterm:foo'.\"\"\"\n    with open(filename, 'w', encoding='latin-1') as f:\n        for l in orig_lines:\n            print(l, file=f)\n        cur_symbol = highest_numbered_symbol + 1\n        for n in [ '#nonterm_begin', '#nonterm_end' ] + nonterminals:\n            print(\"{0} {1}\".format(n, cur_symbol), file=f)\n            cur_symbol = cur_symbol + 1\n\n\ndef read_nonterminals(filename):\n    \"\"\"Reads the user-defined nonterminal symbols in 'filename', checks that\n       it has the expected format and has no duplicates, and returns the nonterminal\n       symbols as a list of strings, e.g.\n       ['#nonterm:contact_list', '#nonterm:phone_number', ... ]. 
\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no nonterminals symbols.\".format(filename))\n    for nonterm in ans:\n        if nonterm[:9] != '#nonterm:':\n            raise RuntimeError(\"In file '{0}', expected nonterminal symbols to start with '#nonterm:', found '{1}'\"\n                               .format(filename, nonterm))\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\ndef read_left_context_phones(filename):\n    \"\"\"Reads, checks, and returns a list of left-context phones, in text form, one\n       per line.  Returns a list of strings, e.g. ['a', 'ah', ..., '#nonterm_bos' ]\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no left-context phones.\".format(filename))\n    whitespace = re.compile(\"[ \\t]+\")\n    for s in ans:\n        if len(whitespace.split(s)) != 1:\n            raise RuntimeError(\"The file {0} contains an invalid line '{1}'\".format(filename, s)   )\n\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate left-context phones are present in file {0}\".format(filename))\n    return ans\n\n\ndef is_token(s):\n    \"\"\"Returns true if s is a string and is space-free.\"\"\"\n    if not isinstance(s, str):\n        return False\n    whitespace = re.compile(\"[ \\t\\r\\n]+\")\n    split_str = whitespace.split(s);\n    return len(split_str) == 1 and s == split_str[0]\n\n\ndef main():\n    args = get_args()\n\n    lexicon = read_lexiconp(args.lexiconp)\n\n    if args.nonterminals is None:\n        nonterminals, left_context_phones = None, None\n    else:\n        if args.left_context_phones is None:\n            print(\"{0}: if --nonterminals is specified, 
--left-context-phones must also \"\n                  \"be specified\".format(sys.argv[0]), file=sys.stderr)\n            sys.exit(1)\n        nonterminals = read_nonterminals(args.nonterminals)\n        left_context_phones = read_left_context_phones(args.left_context_phones)\n\n    if args.sil_prob == 0.0:\n          write_fst_no_silence(lexicon,\n                               nonterminals=nonterminals,\n                               left_context_phones=left_context_phones)\n    else:\n        # Do some checking that the options make sense.\n        if args.sil_prob < 0.0 or args.sil_prob >= 1.0:\n            print(\"{0}: invalid value specified --sil-prob={1}\".format(\n                sys.argv[0], args.sil_prob), file=sys.stderr)\n            sys.exit(1)\n\n        if not is_token(args.sil_phone):\n            print(\"{0}: you specified --sil-prob={1} but --sil-phone is set \"\n                  \"to '{2}'\".format(sys.argv[0], args.sil_prob, args.sil_phone),\n                  file=sys.stderr)\n            sys.exit(1)\n        if args.sil_disambig is not None and not is_token(args.sil_disambig):\n            print(\"{0}: invalid value --sil-disambig='{1}' was specified.\"\n                  \"\".format(sys.argv[0], args.sil_disambig), file=sys.stderr)\n            sys.exit(1)\n        write_fst_with_silence(lexicon, args.sil_prob, args.sil_phone,\n                               args.sil_disambig,\n                               nonterminals=nonterminals,\n                               left_context_phones=left_context_phones)\n\n\n\n#    (lines, highest_symbol) = read_words_txt(args.input_words_txt)\n#    nonterminals = read_nonterminals(args.nonterminal_symbols_list)\n#    write_words_txt(lines, highest_symbol, nonterminals, args.output_words_txt)\n\n\nif __name__ == '__main__':\n      main()\n"
  },
  {
    "path": "egs/utils/lang/make_lexicon_fst_silprob.py",
    "content": "#!/usr/bin/env python3\n# Copyright   2018  Johns Hopkins University (author: Daniel Povey)\n#             2018  Jiedan Zhu\n# Apache 2.0.\n# see get_args() below for usage message.\n\nimport argparse\nimport os\nimport sys\nimport math\nimport re\n\n# The use of latin-1 encoding does not preclude reading utf-8.  latin-1\n# encoding means \"treat words as sequences of bytes\", and it is compatible\n# with utf-8 encoding as well as other encodings such as gbk, as long as the\n# spaces are also spaces in ascii (which we check).  It is basically how we\n# emulate the behavior of python before python3.\n\nsys.stdout = open(1, 'w', encoding='latin-1', closefd=False)\nsys.stderr = open(2, 'w', encoding='latin-1', closefd=False)\n\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script creates the\n       text form of a lexicon FST, to be compiled by fstcompile using the\n       appropriate symbol tables (phones.txt and words.txt) .  It will mostly\n       be invoked indirectly via utils/prepare_lang.sh.  The output goes to\n       the stdout.\n\n       This version is for a lexicon with word-specific silence probabilities,\n       see http://www.danielpovey.com/files/2015_interspeech_silprob.pdf\n       for an explanation\"\"\")\n\n    parser.add_argument('--sil-phone', dest='sil_phone', type=str,\n                        help=\"Text form of optional-silence phone, e.g. 'SIL'.\")\n    parser.add_argument('--sil-disambig', dest='sil_disambig', type=str, default=\"<eps>\",\n                        help=\"\"\"Disambiguation symbol to disambiguate silence, e.g. 
#5.\n                        Will only be supplied if you are creating the version of L.fst\n                        with disambiguation symbols, intended for use with cyclic G.fst.\n                        This symbol was introduced to fix a rather obscure source of\n                        nondeterminism of CLG.fst, that has to do with reordering of\n                        disambiguation symbols and phone symbols.\"\"\")\n    parser.add_argument('lexiconp', type=str,\n                        help=\"\"\"Filename of lexicon with pronunciation probabilities\n                        (normally lexiconp.txt), with lines of the form\n                        'word pron-prob prob-of-sil correction-term-for-sil correction-term-for-no-sil p1 p2...',\n                        e.g. 'a   1.0  0.8 1.2  0.6  ay'\"\"\")\n    parser.add_argument('silprobs', type=str,\n                        help=\"\"\"Filename with silence probabilities, with lines of the form\n                        '<s> p(sil-after|<s>) //\n                        </s>_s correction-term-for-sil-for-</s> //\n                        </s>_n correction-term-for-no-sil-for-</s> //\n                        overall p(overall-sil), where // represents line break.\n                        See also utils/dict_dir_add_pronprobs.sh,\n                        which creates this file as silprob.txt.\"\"\")\n    parser.add_argument('--left-context-phones', dest='left_context_phones', type=str,\n                        help=\"\"\"Only relevant if --nonterminals is also supplied; this relates\n                        to grammar decoding (see http://kaldi-asr.org/doc/grammar.html or\n                        src/doc/grammar.dox).  Format is a list of left-context phones,\n                        in text form, one per line.  E.g. 
data/lang/phones/left_context_phones.txt\"\"\")\n    parser.add_argument('--nonterminals', type=str,\n                        help=\"\"\"If supplied, --left-context-phones must also be supplied.\n                        List of user-defined nonterminal symbols such as #nonterm:contact_list,\n                        one per line.  E.g. data/local/dict/nonterminals.txt.\"\"\")\n\n    args = parser.parse_args()\n    return args\n\n\ndef read_silprobs(filename):\n    \"\"\" Reads the silprobs file (e.g. silprobs.txt) which will have a format like this:\n     <s> 0.99\n     </s>_s 2.50607106867326\n     </s>_n 0.00653829808100956\n     overall 0.20\n    and returns it as a 4-tuple, e.g. in this example (0.99, 2.50, 0.006, 0.20)\n    \"\"\"\n    silbeginprob = -1\n    silendcorrection = -1\n    nonsilendcorrection = -1\n    siloverallprob = -1\n    with open(filename, 'r', encoding='latin-1') as f:\n        whitespace = re.compile(\"[ \\t]+\")\n        for line in f:\n            a = whitespace.split(line.strip(\" \\t\\r\\n\"))\n            if len(a) != 2:\n                print(\"{0}: error: found bad line '{1}' in silprobs file {2} \".format(\n                    sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            label = a[0]\n            try:\n                if label == \"<s>\":\n                    silbeginprob = float(a[1])\n                elif label == \"</s>_s\":\n                    silendcorrection = float(a[1])\n                elif label == \"</s>_n\":\n                    nonsilendcorrection = float(a[1])\n                elif label == \"overall\":\n                    siloverallprob = float(a[1]) # this is not in use, still keep it?\n                else:\n                    raise RuntimeError()\n            except:\n                print(\"{0}: error: found bad line '{1}' in silprobs file {2}\"\n                      .format(sys.argv[0], line.strip(\" \\t\\r\\n\"), filename),\n                   
   file=sys.stderr)\n                sys.exit(1)\n    if (silbeginprob <= 0.0 or silbeginprob > 1.0 or\n        silendcorrection <= 0.0 or nonsilendcorrection <= 0.0 or\n        siloverallprob <= 0.0 or siloverallprob > 1.0):\n        print(\"{0}: error: prob is not correct in silprobs file {1}.\"\n            .format(sys.argv[0], filename), file=sys.stderr)\n        sys.exit(1)\n    return (silbeginprob, silendcorrection, nonsilendcorrection, siloverallprob)\n\n\ndef read_lexiconp(filename):\n    \"\"\"Reads the lexiconp.txt file in 'filename', with lines like\n    'word p(pronunciation|word) p(sil-after|word) correction-term-for-sil\n    correction-term-for-no-sil p1 p2 ...'.\n    Returns a list of tuples (word, pron_prob, word_sil_prob,\n    sil_word_correction, non_sil_word_correction, prons), where 'word' is a string,\n   'pron_prob', a float, is the pronunciation probability (which must be >0.0\n    and would normally be <=1.0), 'word_sil_prob' is a float,\n    'sil_word_correction' is a float, 'non_sil_word_correction' is a float,\n    and 'pron' is a list of strings representing phones.\n    An element in the returned list might be\n    ('hello', 1.0, 0.5, 0.3, 0.6, ['h', 'eh', 'l', 'ow']).\n    \"\"\"\n    ans = []\n    found_empty_prons = False\n    found_large_pronprobs = False\n    # See the comment near the top of this file, RE why we use latin-1.\n    whitespace = re.compile(\"[ \\t]+\")\n    with open(filename, 'r', encoding='latin-1') as f:\n        for line in f:\n            a = whitespace.split(line.strip(\" \\t\\r\\n\"))\n            if len(a) < 2:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2} \".format(\n                    sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            word = a[0]\n            if word == \"<eps>\":\n                # This would clash with the epsilon symbol normally used in OpenFst.\n                print(\"{0}: error: found <eps> 
as a word in lexicon file \"\n                      \"{1}\".format(sys.argv[0], filename), file=sys.stderr)\n                sys.exit(1)\n            try:\n                pron_prob = float(a[1])\n                word_sil_prob = float(a[2])\n                sil_word_correction = float(a[3])\n                non_sil_word_correction = float(a[4])\n            except:\n                print(\"{0}: error: found bad line '{1}' in lexicon file {2}, 2nd field \"\n                      \"through 5th field should be numbers\".format(sys.argv[0],\n                                                                   line.strip(\" \\t\\r\\n\"), filename),\n                      file=sys.stderr)\n                sys.exit(1)\n            prons = a[5:]\n            if pron_prob <= 0.0:\n                print(\"{0}: error: invalid pron-prob in line '{1}' of lexicon file {2} \".format(\n                    sys.argv[0], line.strip(\" \\t\\r\\n\"), filename), file=sys.stderr)\n                sys.exit(1)\n            if len(prons) == 0:\n                found_empty_prons = True\n            ans.append((\n                word, pron_prob, word_sil_prob,\n                sil_word_correction, non_sil_word_correction, prons))\n            if pron_prob > 1.0:\n                found_large_pronprobs = True\n    if found_empty_prons:\n        print(\"{0}: warning: found at least one word with an empty pronunciation \"\n              \"in lexicon file {1}.\".format(sys.argv[0], filename),\n              file=sys.stderr)\n    if found_large_pronprobs:\n        print(\"{0}: warning: found at least one word with pron-prob >1.0 \"\n              \"in {1}\".format(sys.argv[0], filename), file=sys.stderr)\n    if len(ans) == 0:\n        print(\"{0}: error: found no pronunciations in lexicon file {1}\".format(\n            sys.argv[0], filename), file=sys.stderr)\n        sys.exit(1)\n    return ans\n\n\ndef write_nonterminal_arcs(start_state, sil_state, non_sil_state,\n                  
         next_state, sil_phone,\n                           nonterminals, left_context_phones):\n    \"\"\"This function relates to the grammar-decoding setup, see\n    kaldi-asr.org/doc/grammar.html.  It is called from write_fst, and writes to\n    the stdout some extra arcs in the lexicon FST that relate to nonterminal\n    symbols.\n\n    See the section \"Special symbols in L.fst,\n    kaldi-asr.org/doc/grammar.html#grammar_special_l.\n       start_state: the start-state of L.fst.\n       sil_state:  the state of high out-degree in L.fst where words leave\n                   when preceded by optional silence\n       non_sil_state:   the state of high out-degree in L.fst where words leave\n                   when not preceded by optional silence\n       next_state: the number from which this function can start allocating its\n                  own states.  the updated value of next_state will be returned.\n       sil_phone:  the optional-silence phone (a string, e.g 'sil')\n       nonterminals: the user-defined nonterminal symbols as a list of\n          strings, e.g. ['#nonterm:contact_list', ... ].\n       left_context_phones: a list of phones that may appear as left-context,\n          e.g. ['a', 'ah', ... '#nonterm_bos'].\n    \"\"\"\n    shared_state = next_state\n    next_state += 1\n    final_state = next_state\n    next_state += 1\n\n    print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n        src=start_state, dest=shared_state,\n        phone='#nonterm_begin', word='#nonterm_begin',\n        cost=0.0))\n\n    for nonterminal in nonterminals:\n        # What we are doing here could be viewed as a little lazy, by going to\n        # 'shared_state' instead of a state specific to nonsilence vs. silence\n        # left-context vs. unknown (for #nonterm_begin).  
If we made them\n        # separate we could improve (by half) the correctness of how it\n        # interacts with sil-probs in the hard-to-handle case where\n        # word-position-dependent phones are not used and some words end\n        # in the optional-silence phone.\n        for src in [sil_state, non_sil_state]:\n            print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n                src=src, dest=shared_state,\n                phone=nonterminal, word=nonterminal,\n                cost=0.0))\n\n    # this_cost equals log(len(left_context_phones)) but the expression below\n    # better captures the meaning.  Applying this cost to arcs keeps the FST\n    # stochastic (sum-to-one, like an HMM), so that if we do weight pushing\n    # things won't get weird.  In the grammar-FST code when we splice things\n    # together we will cancel out this cost, see the function CombineArcs().\n    this_cost = -math.log(1.0 / len(left_context_phones))\n\n    for left_context_phone in left_context_phones:\n        # The following line is part of how we get this to interact correctly with\n        # the silence probabilities: if the left-context phone was the silence\n        # phone, it goes to sil_state, else nonsil_state.  This won't always\n        # do the right thing if you have a system without word-position-dependent\n        # phones (--position-dependent-phones false to prepare_lang.sh) and\n        # you have words that end in the optional-silence phone.\n        dest = (sil_state if left_context_phone == sil_phone else non_sil_state)\n\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=shared_state, dest=dest,\n            phone=left_context_phone, word='<eps>', cost=this_cost))\n\n    # arc from sil_state and non_sil_state to a final-state with #nonterm_end as\n    # ilabel and olabel.  
The costs on these arcs are zero because if you take\n    # that arc, you are not really terminating the sequence, you are just\n    # skipping to sil_state or non_sil_state in the FST one level up.  It\n    # takes the correct path because of the code around 'dest = ...' a few\n    # lines above this, after reaching 'shared_state' because it saw the\n    # user-defined nonterminal.\n    for src in [sil_state, non_sil_state]:\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=src, dest=final_state,\n            phone='#nonterm_end', word='#nonterm_end', cost=0.0))\n    print(\"{state}\\t{final_cost}\".format(\n        state=final_state, final_cost=0.0))\n    return next_state\n\ndef write_fst(lexicon, silprobs, sil_phone, sil_disambig,\n              nonterminals = None, left_context_phones = None):\n    \"\"\"Writes the text format of L.fst (or L_disambig.fst) to the standard output.\n     'lexicon' is a list of 6-tuples\n     (word, pronprob, wordsilprob, silwordcorrection, nonsilwordcorrection, pron)\n         as returned by read_lexiconp().\n     'silprobs' is a 4-tuple of probabilities as returned by read_silprobs().\n     'sil_phone' is the silence phone, e.g. \"SIL\".\n     'sil_disambig' is either '<eps>', or the silence disambiguation symbol, e.g. \"#5\".\n     'nonterminals', which relates to grammar decoding (see kaldi-asr.org/doc/grammar.html),\n        is either None, or the user-defined nonterminal symbols as a list of\n        strings, e.g. ['#nonterm:contact_list', ... ].\n     'left_context_phones', which also relates to grammar decoding, and must be\n        supplied if 'nonterminals' is supplied, is either None or a list of\n        phones that may appear as left-context, e.g. ['a', 'ah', ... 
'#nonterm_bos'].\n    \"\"\"\n    silbeginprob, silendcorrection, nonsilendcorrection, siloverallprob = silprobs\n    initial_sil_cost = -math.log(silbeginprob)\n    initial_non_sil_cost = -math.log(1.0 - silbeginprob);\n    sil_end_correction_cost = -math.log(silendcorrection)\n    non_sil_end_correction_cost = -math.log(nonsilendcorrection);\n    start_state = 0\n    non_sil_state = 1  # words enter and leave from here\n    sil_state = 2   # words terminate here when followed by silence; this state\n                    # has a silence transition to loop_state.\n    next_state = 3  # the next un-allocated state, will be incremented as we go.\n\n    # Arcs from the start state to the silence and nonsilence loop states\n    # The one to the nonsilence state has the silence disambiguation symbol\n    # (We always use that symbol on the *non*-silence-containing arcs, which\n    # avoids having to introduce extra arcs).\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=non_sil_state,\n        phone=sil_disambig, word='<eps>', cost=initial_non_sil_cost))\n    print('{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}'.format(\n        src=start_state, dest=sil_state,\n        phone=sil_phone, word='<eps>', cost=initial_sil_cost))\n\n    for (word, pronprob, wordsilprob, silwordcorrection, nonsilwordcorrection, pron) in lexicon:\n        pron_cost = -math.log(pronprob)\n        word_to_sil_cost = -math.log(wordsilprob)\n        word_to_non_sil_cost = -math.log(1.0 - wordsilprob)\n        sil_to_word_cost = -math.log(silwordcorrection)\n        non_sil_to_word_cost = -math.log(nonsilwordcorrection)\n\n        if len(pron) == 0:\n            # this is not really expected but we try to handle it gracefully.\n            pron = ['<eps>']\n\n        new_state = next_state  # allocate a new state\n        next_state += 1\n        # Create transitions from both non_sil_state and sil_state to 'new_state',\n        # with the word label and the 
word's first phone on them\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=non_sil_state, dest=new_state,\n            phone=pron[0], word=word, cost=(pron_cost + non_sil_to_word_cost)))\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=sil_state, dest=new_state,\n            phone=pron[0], word=word, cost=(pron_cost + sil_to_word_cost)))\n        cur_state = new_state\n\n        # add states and arcs for all but the first phone.\n        for i in range(1, len(pron)):\n            new_state = next_state\n            next_state += 1\n            print(\"{src}\\t{dest}\\t{phone}\\t<eps>\".format(\n                src=cur_state, dest=new_state, phone=pron[i]))\n            cur_state = new_state\n\n        # ... and from there we return via two arcs to the silence and\n        # nonsilence state.  the silence-disambig symbol, if used,\n        # goes on the nonsilence arc; this saves us having to insert an epsilon.\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state,  dest=non_sil_state,\n            phone=sil_disambig, word='<eps>',\n            cost=word_to_non_sil_cost))\n        print(\"{src}\\t{dest}\\t{phone}\\t{word}\\t{cost}\".format(\n            src=cur_state, dest=sil_state,\n            phone=sil_phone, word='<eps>',\n            cost=word_to_sil_cost))\n\n    if nonterminals is not None:\n        next_state = write_nonterminal_arcs(\n            start_state, sil_state, non_sil_state,\n            next_state, sil_phone,\n            nonterminals, left_context_phones)\n\n    print('{src}\\t{cost}'.format(src=sil_state, cost=sil_end_correction_cost))\n    print('{src}\\t{cost}'.format(src=non_sil_state, cost=non_sil_end_correction_cost))\n\ndef read_nonterminals(filename):\n    \"\"\"Reads the user-defined nonterminal symbols in 'filename', checks that\n       it has the expected format and has no duplicates, and returns the nonterminal\n  
     symbols as a list of strings, e.g.\n       ['#nonterm:contact_list', '#nonterm:phone_number', ... ]. \"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no nonterminal symbols.\".format(filename))\n    for nonterm in ans:\n        if nonterm[:9] != '#nonterm:':\n            raise RuntimeError(\"In file '{0}', expected nonterminal symbols to start with '#nonterm:', found '{1}'\"\n                               .format(filename, nonterm))\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate nonterminal symbols are present in file {0}\".format(filename))\n    return ans\n\ndef read_left_context_phones(filename):\n    \"\"\"Reads, checks, and returns a list of left-context phones, in text form, one\n       per line.  Returns a list of strings, e.g. ['a', 'ah', ..., '#nonterm_bos' ]\"\"\"\n    ans = [line.strip(\" \\t\\r\\n\") for line in open(filename, 'r', encoding='latin-1')]\n    if len(ans) == 0:\n        raise RuntimeError(\"The file {0} contains no left-context phones.\".format(filename))\n    for s in ans:\n        if len(s.split()) != 1:\n            raise RuntimeError(\"The file {0} contains an invalid line '{1}'\".format(filename, s))\n\n    if len(set(ans)) != len(ans):\n        raise RuntimeError(\"Duplicate left-context phones are present in file {0}\".format(filename))\n    return ans\n\n\ndef main():\n    args = get_args()\n    silprobs = read_silprobs(args.silprobs)\n    lexicon = read_lexiconp(args.lexiconp)\n\n\n    if args.nonterminals is None:\n        nonterminals, left_context_phones = None, None\n    else:\n        if args.left_context_phones is None:\n            print(\"{0}: if --nonterminals is specified, --left-context-phones must also \"\n                  \"be specified\".format(sys.argv[0]), file=sys.stderr)\n            sys.exit(1)\n        nonterminals = read_nonterminals(args.nonterminals)\n        
left_context_phones = read_left_context_phones(args.left_context_phones)\n\n    write_fst(lexicon, silprobs, args.sil_phone, args.sil_disambig,\n              nonterminals, left_context_phones)\n\n\nif __name__ == '__main__':\n      main()\n"
  },
  {
    "path": "egs/utils/lang/make_phone_bigram_lang.sh",
    "content": "#!/usr/bin/env bash\n\n# Apache 2.0.  Copyright 2012, Johns Hopkins University (author: Daniel Povey)\n\n# This script creates a \"lang\" directory of the \"testing\" type (including G.fst)\n# given an existing \"alignment\" directory and an existing \"lang\" directory.\n# The directory contains only single-phone words, and a bigram language model that\n# is built without smoothing, on top of single phones.  The point of no smoothing\n# is to limit the number of transitions, so we can decode reasonably fast, and the\n# graph won't blow up.  This is probably going to be most useful for things like\n# language-id.\n#\n#  See also steps/make_phone_graph.sh\n\n\necho \"$0 $@\"  # Print the command line for logging\n\n[ -f ./path.sh ] && . ./path.sh; # source the path.\n. parse_options.sh || exit 1;\n\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0: [options] <lang-dir> <ali-dir> <output-lang-dir>\"\n  echo \"e.g.: $0: data/lang exp/tri3b_ali data/lang_phone_bg\"\n  exit 1;\nfi\n\nlang=$1\nalidir=$2\nlang_out=$3\n\nfor f in $lang/phones.txt $alidir/ali.1.gz; do\n  [ ! -f $f ] && echo \"Expected file $f to exist\" && exit 1;\ndone\n\nmkdir -p $lang_out || exit 1;\n\ngrep -v '#' $lang/phones.txt >  $lang_out/phones.txt # no disambig symbols\n      # needed; G and L . G will be deterministic.\ncp $lang/topo $lang_out\nrm -r $lang_out/phones 2>/dev/null\ncp -r $lang/phones/ $lang_out/\nrm $lang_out/phones/word_boundary.* 2>/dev/null # these would\n  # no longer be valid.\nrm $lang_out/phones/wdisambig* 2>/dev/null  # ditto this.\n\n# List of disambig symbols will be empty: not needed, since G.fst and L.fst * G.fst\n# are determinizable without any.\necho -n > $lang_out/phones/disambig.txt\necho -n > $lang_out/phones/disambig.int\necho -n > $lang_out/phones/disambig.csl\necho -n > $lang_out/phones/wdisambig.txt\necho -n > $lang_out/phones/wdisambig_phones.int\necho -n > $lang_out/phones/wdisambig_words.int\n\n# Let OOV symbol be the first phone.  
This is arbitrary, it's just\n# so that validate_lang.pl succeeds.  We should never actually use\n# this.\noov_sym=$(tail -n +2 $lang_out/phones.txt | head -n 1 | awk '{print $1}')\noov_int=$(tail -n +2 $lang_out/phones.txt | head -n 1 | awk '{print $2}')\necho $oov_sym > $lang_out/oov.txt\necho $oov_int > $lang_out/oov.int\n\n\n# Get phone-level transcripts of training data and create a\n# language model.\nali-to-phones $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz|\" ark,t:- | \\\n  perl -e 'while(<>) {\n    @A = split(\" \", $_);\n    shift @A; # Remove the utterance-id.\n    foreach $p ( @A ) { $phones{$p} = 1; } # assoc. array of phones.\n    unshift @A, \"<s>\";\n    push @A, \"</s>\";\n    for ($n = 0; $n+1 < @A; $n++) {\n      $p = $A[$n]; $q = $A[$n+1];\n      $count{$p,$q}++;\n      $histcount{$p}++;\n    }\n  }\n  @phones = keys %phones;\n  unshift @phones, \"<s>\";\n  # @phones is now all real phones, plus <s>.\n  for ($n = 0; $n < @phones; $n++) {\n    $phn2state{$phones[$n]} = $n;\n  }\n  foreach $p (@phones) {\n    $src = $phn2state{$p};\n    $hist = $histcount{$p};\n    $hist > 0 || die;\n    foreach $q (@phones) {\n      $c = $count{$p,$q};\n      if (defined $c) {\n        $cost = -log($c / $hist); # cost on FST arc.\n        $dest = $phn2state{$q};\n        print \"$src $dest $q $cost\\n\";  # Note: q is actually numeric.\n      }\n    }\n    $c = $count{$p,\"</s>\"};\n    if (defined $c) {\n      $cost = -log($c / $hist); # cost on FST arc.\n      print \"$src $cost\\n\"; # final-prob.\n    }\n  } ' | fstcompile --acceptor=true | \\\n    fstarcsort --sort_type=ilabel > $lang_out/G.fst\n\n# symbols for phones and words are the same.\n# Neither has disambig symbols.\ncp $lang_out/phones.txt $lang_out/words.txt\n\ngrep -v '<eps>' $lang_out/phones.txt | awk '{printf(\"0 0 %s %s\\n\", $2, $2);} END{print(\"0 0.0\");}' | \\\n   fstcompile  > $lang_out/L.fst\n\n# note: first two fields of align_lexicon.txt are interpreted as the word; the 
remaining\n# fields are the phones that are in the pron of the word.  These are all the same, for us.\nfor p in $(grep -v '<eps>' $lang_out/phones.txt | awk '{print $1}'); do echo $p $p $p; done > $lang_out/phones/align_lexicon.txt\n\n# just use one sym2int.pl command, since phones.txt and words.txt are identical.\nutils/sym2int.pl $lang_out/phones.txt <$lang_out/phones/align_lexicon.txt >$lang_out/phones/align_lexicon.int\n\n# L and L_disambig are the same.\ncp $lang_out/L.fst $lang_out/L_disambig.fst\n\nutils/validate_lang.pl --skip-disambig-check $lang_out || exit 1;\n"
  },
  {
    "path": "egs/utils/lang/make_phone_lm.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2016  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\nfrom __future__ import print_function\nfrom __future__ import division\nimport sys\nimport argparse\nimport math\nfrom collections import defaultdict\n\n# note, this was originally based\n\nparser = argparse.ArgumentParser(description=\"\"\"\nThis script creates a language model that's intended to be used in modeling\nphone sequences (either of sentences or of dictionary entries), although of\ncourse it will work for any type of data.  The easiest way\nto describe it is as a a Kneser-Ney language model (unmodified, with addition)\nwith a fixed discounting constant equal to 1, except with no smoothing of the\nbigrams (and hence no unigram state).  This is (a) because we want to keep the\ngraph after context expansion small, (b) because languages tend to have\nconstraints on which phones can follow each other, and (c) in order to get valid\nsequences of word-position-dependent phones so that lattice-align-words can\nwork.  It also includes have a special entropy-based pruning technique that\nbacks off the statistics of pruned n-grams to lower-order states.\n\nThis script reads lines from its standard input, each\nconsisting of a sequence of integer symbol-ids (which should be > 0),\nrepresenting the phone sequences of a sentence or dictionary entry.\nThis script outputs a backoff language model in FST format\"\"\",\n                                 epilog=\"See also utils/lang/make_phone_bigram_lang.sh\")\n\n\nparser.add_argument(\"--phone-disambig-symbol\", type = int, required = False,\n                    help = \"Integer corresponding to an otherwise-unused \"\n                    \"phone-level disambiguation symbol (e.g. #5).  
This is \"\n                    \"inserted at the beginning of the phone sequence and \"\n                    \"whenever we back off.\")\nparser.add_argument(\"--ngram-order\", type = int, default = 4,\n                    choices = [2,3,4,5,6,7],\n                    help = \"Order of n-gram to use (but see also --num-extra-ngrams; \"\n                    \"the effective order after pruning may be less).\")\nparser.add_argument(\"--num-extra-ngrams\", type = int, default = 20000,\n                    help = \"Target number of n-grams in addition to the n-grams in \"\n                    \"the bigram LM states which can't be pruned away.  n-grams \"\n                    \"will be pruned to reach this target.\")\nparser.add_argument(\"--no-backoff-ngram-order\", type = int, default = 2,\n                    choices = [1,2,3,4,5],\n                    help = \"This specifies the n-gram order at which (and below which) \"\n                    \"no backoff or pruning should be done.  This is expected to normally \"\n                    \"be bigram, but for testing purposes you may want to set it to \"\n                    \"1.\")\nparser.add_argument(\"--print-as-arpa\", type = str, default = \"false\",\n                    choices = [\"true\", \"false\"],\n                    help = \"If true, print LM in ARPA format (default is to print \"\n                    \"as FST).  You must also set --no-backoff-ngram-order=1 or \"\n                    \"this is not allowed.\")\nparser.add_argument(\"--verbose\", type = int, default = 0,\n                    choices=[0,1,2,3,4,5], help = \"Verbose level\")\n\nargs = parser.parse_args()\n\nif args.verbose >= 1:\n    print(' '.join(sys.argv), file = sys.stderr)\n\n\n\nclass CountsForHistory(object):\n    ## This class (which is more like a struct) stores the counts seen in a\n    ## particular history-state.  
It is used inside class NgramCounts.\n    ## It really does the job of a dict from int (the word) to its count, but it\n    ## also keeps track of the total count.\n    def __init__(self):\n        # word_to_count maps each predicted word (an int) to its count;\n        # defaultdict(int) makes words that have not been seen read as 0.\n        self.word_to_count = defaultdict(int)\n        self.total_count = 0\n\n    def Words(self):\n        return list(self.word_to_count.keys())\n\n    def __str__(self):\n        # e.g. returns ' total=12 3 -> 4 4 -> 6 -1 -> 2'\n        return ' total={0} {1}'.format(\n            str(self.total_count),\n            ' '.join(['{0} -> {1}'.format(word, count)\n                      for word, count in self.word_to_count.items()]))\n\n\n    ## Adds a certain count (expected to be integer, but might be negative).  If\n    ## the resulting count for this word is zero, removes the dict entry from\n    ## word_to_count.\n    ## [note, though, that in some circumstances we 'add back' zero counts\n    ## where the presence of n-grams would be structurally required by the arpa,\n    ## specifically if a higher-order history state has a nonzero count,\n    ## we need to structurally have the count there in the states it backs\n    ## off to.]\n    def AddCount(self, predicted_word, count):\n        self.total_count += count\n        assert self.total_count >= 0\n        old_count = self.word_to_count[predicted_word]\n        new_count = old_count + count\n        if new_count < 0:\n            print(\"predicted-word={0}, old-count={1}, count={2}\".format(\n                    predicted_word, old_count, count))\n        assert new_count >= 0\n        if new_count == 0:\n            del self.word_to_count[predicted_word]\n        else:\n            self.word_to_count[predicted_word] = new_count\n\nclass NgramCounts(object):\n    ## A note on data-structure.  Firstly, all words are represented as\n    ## integers.  
We store n-gram counts as an array, indexed by (history-length\n    ## == n-gram order minus one) (note: python calls arrays \"lists\") of dicts\n    ## from histories to counts, where histories are arrays of integers and\n    ## \"counts\" are dicts from integer to float.  For instance, when\n    ## accumulating the 4-gram count for the '8' in the sequence '5 6 7 8', we'd\n    ## do as follows: self.counts[3][[5,6,7]][8] += 1.0 where the [3] indexes an\n    ## array, the [[5,6,7]] indexes a dict, and the [8] indexes a dict.\n    def __init__(self, ngram_order):\n        assert ngram_order >= 2\n        # Integerized counts will never contain negative numbers, so\n        # inside this program, we use -3 and -2 for the BOS and EOS symbols\n        # respectively.\n        # Note: it's actually important that the bos-symbol is the most negative;\n        # it helps ensure that we print the state with left-context <s> first\n        # when we print the FST, and this means that the start-state will have\n        # the correct value.\n        self.bos_symbol = -3\n        self.eos_symbol = -2\n        # backoff_symbol is kind of a pseudo-word, it's used in keeping track of\n        # the backoff counts in each state.\n        self.backoff_symbol = -1\n        self.total_num_words = 0  # count includes EOS but not BOS.\n        self.counts = []\n        for n in range(ngram_order):\n            self.counts.append(defaultdict(lambda: CountsForHistory()))\n\n    # adds a raw count (called while processing input data).\n    # Suppose we see the sequence '6 7 8 9' and ngram_order=4, 'history'\n    # would be (6,7,8) and 'predicted_word' would be 9; 'count' would be\n    # 1.\n    def AddCount(self, history, predicted_word, count):\n        self.counts[len(history)][history].AddCount(predicted_word, count)\n\n\n    # 'line' is a string containing a sequence of integer word-ids.\n    # This function adds the un-smoothed counts from this line of text.\n    def 
AddRawCountsFromLine(self, line):\n        try:\n            words = [self.bos_symbol] + [ int(x) for x in line.split() ] + [self.eos_symbol]\n        except:\n            sys.exit(\"make_phone_lm.py: bad input line {0} (expected a sequence \"\n                     \"of integers)\".format(line))\n\n        for n in range(1, len(words)):\n            predicted_word = words[n]\n            history_start = max(0, n + 1 - args.ngram_order)\n            history = tuple(words[history_start:n])\n            self.AddCount(history, predicted_word, 1)\n            self.total_num_words += 1\n\n    def AddRawCountsFromStandardInput(self):\n        lines_processed = 0\n        while True:\n            line = sys.stdin.readline()\n            if line == '':\n                break\n            self.AddRawCountsFromLine(line)\n            lines_processed += 1\n        if lines_processed == 0 or args.verbose > 0:\n            print(\"make_phone_lm.py: processed {0} lines of input\".format(\n                    lines_processed), file = sys.stderr)\n\n\n    # This backs off the counts by subtracting 1 and assigning the subtracted\n    # count to the backoff state.  It's like a special case of Kneser-Ney with D\n    # = 1.  The optimal D would likely be something like 0.9, but we plan to\n    # later do entropy-pruning, and the remaining small counts of 0.1 would\n    # essentially all get pruned away anyway, so we don't lose much by doing it\n    # like this.\n    def ApplyBackoff(self):\n        # note: in the normal case where args.no_backoff_ngram_order == 2 we\n        # don't do backoff for history-length = 1 (i.e. for bigrams)... 
this is\n        # a kind of special LM where we're not going to back off to unigram,\n        # there will be no unigram.\n        if args.verbose >= 1:\n            initial_num_ngrams = self.GetNumNgrams()\n        for n in reversed(list(range(args.no_backoff_ngram_order, args.ngram_order))):\n            this_order_counts = self.counts[n]\n            for hist, counts_for_hist in this_order_counts.items():\n                backoff_hist = hist[1:]\n                backoff_counts_for_hist = self.counts[n-1][backoff_hist]\n                this_discount_total = 0\n                for word in counts_for_hist.Words():\n                    counts_for_hist.AddCount(word, -1)\n                    # You can interpret the following line as incrementing the\n                    # count-of-counts for the next-lower order.  Note, however,\n                    # that later when we remove n-grams, we'll also add their\n                    # counts to the next-lower-order history state, so the\n                    # resulting counts won't strictly speaking be\n                    # counts-of-counts.\n                    backoff_counts_for_hist.AddCount(word, 1)\n                    this_discount_total += 1\n                counts_for_hist.AddCount(self.backoff_symbol, this_discount_total)\n\n        if args.verbose >= 1:\n            # Note: because D == 1, we completely back off singletons.\n            print(\"make_phone_lm.py: ApplyBackoff() reduced the num-ngrams from \"\n                  \"{0} to {1}\".format(initial_num_ngrams, self.GetNumNgrams()),\n                  file = sys.stderr)\n\n\n    # This function prints out to stderr the n-gram counts stored in this\n    # object; it's used for debugging.\n    def Print(self, info_string):\n        print(info_string, file=sys.stderr)\n        # these are useful for debug.\n        total = 0.0\n        total_excluding_backoff = 0.0\n        for this_order_counts in self.counts:\n            for hist, counts_for_hist in 
this_order_counts.items():\n                print(str(hist) + str(counts_for_hist), file = sys.stderr)\n                total += counts_for_hist.total_count\n                total_excluding_backoff += counts_for_hist.total_count\n                if self.backoff_symbol in counts_for_hist.word_to_count:\n                    total_excluding_backoff -= counts_for_hist.word_to_count[self.backoff_symbol]\n        print('total count = {0}, excluding backoff = {1}'.format(\n                total, total_excluding_backoff), file = sys.stderr)\n\n    def GetHistToStateMap(self):\n        # This function, called from PrintAsFst, returns a map from\n        # history to integer FST-state.\n        hist_to_state = dict()\n        fst_state_counter = 0\n        for n in range(0, args.ngram_order):\n            for hist in self.counts[n].keys():\n                hist_to_state[hist] = fst_state_counter\n                fst_state_counter += 1\n        return hist_to_state\n\n    # Returns the probability of word 'word' in history-state 'hist'.\n    # If 'word' is self.backoff_symbol, returns the backoff prob\n    # of this history-state.\n    # Returns None if there is no such word in this history-state, or this\n    # history-state does not exist.\n    def GetProb(self, hist, word):\n        if len(hist) >= args.ngram_order or not hist in self.counts[len(hist)]:\n            return None\n        counts_for_hist = self.counts[len(hist)][hist]\n        total_count = float(counts_for_hist.total_count)\n        if not word in counts_for_hist.word_to_count:\n            print(\"make_phone_lm.py: no prob for {0} -> {1} \"\n                  \"[no such count]\".format(hist, word),\n                  file = sys.stderr)\n            return None\n        prob = float(counts_for_hist.word_to_count[word]) / total_count\n        if len(hist) > 0 and word != self.backoff_symbol and \\\n          self.backoff_symbol in counts_for_hist.word_to_count:\n            prob_in_backoff = 
self.GetProb(hist[1:], word)\n            backoff_prob = float(counts_for_hist.word_to_count[self.backoff_symbol]) / total_count\n            try:\n                prob += backoff_prob * prob_in_backoff\n            except:\n                sys.exit(\"problem, hist is {0}, word is {1}\".format(hist, word))\n        return prob\n\n    def PruneEmptyStates(self):\n        # Removes history-states that have no counts.\n\n        # It's possible in principle for history-states to have no counts and\n        # yet they cannot be pruned away because a higher-order version of the\n        # state exists with nonzero counts, so we have to keep track of this.\n        protected_histories = set()\n\n        states_removed_per_hist_len = [ 0 ] * args.ngram_order\n\n        for n in reversed(list(range(args.no_backoff_ngram_order,\n                                args.ngram_order))):\n            num_states_removed = 0\n            for hist, counts_for_hist in self.counts[n].items():\n                l = len(counts_for_hist.word_to_count)\n                assert l > 0 and self.backoff_symbol in counts_for_hist.word_to_count\n                if l == 1 and not hist in protected_histories:  # only the backoff symbol has a count.\n                    del self.counts[n][hist]\n                    num_states_removed += 1\n                else:\n                    # if this state was not pruned away, then the state that\n                    # it backs off to may not be pruned away either.\n                    backoff_hist = hist[1:]\n                    protected_histories.add(backoff_hist)\n            states_removed_per_hist_len[n] = num_states_removed\n        if args.verbose >= 1:\n            print(\"make_phone_lm.py: in PruneEmptyStates(), num states removed for \"\n                  \"each history-length was: \" + str(states_removed_per_hist_len),\n                  file = sys.stderr)\n\n    def EnsureStructurallyNeededNgramsExist(self):\n        # makes sure that if an 
n-gram like (6, 7, 8) -> 9 exists,\n        # then counts exist for (7, 8) -> 9 and (8,) -> 9.  It does so\n        # by adding zero counts where such counts were absent.\n        # [note: () -> 9 is guaranteed anyway by the backoff method, if\n        # we have a unigram state].\n        if args.verbose >= 1:\n            num_ngrams_initial = self.GetNumNgrams()\n        for n in reversed(list(range(args.no_backoff_ngram_order,\n                                args.ngram_order))):\n\n            for hist, counts_for_hist in self.counts[n].items():\n                # This loop ensures that if we have an n-gram like (6, 7, 8) -> 9,\n                # then, say, (7, 8) -> 9 and (8) -> 9 exist.\n                reduced_hist = hist\n                for m in reversed(list(range(args.no_backoff_ngram_order, n))):\n                    reduced_hist = reduced_hist[1:]  # shift an element off\n                                                     # the history.\n                    counts_for_backoff_hist = self.counts[m][reduced_hist]\n                    for word in counts_for_hist.word_to_count.keys():\n                        counts_for_backoff_hist.word_to_count[word] += 0\n                # This loop ensures that if we have an n-gram like (6, 7, 8) -> 9,\n                # then, say, (6, 7) -> 8 and (6) -> 7 exist.  
This will be needed\n                # for FST representations of the ARPA LM.\n                reduced_hist = hist\n                for m in reversed(list(range(args.no_backoff_ngram_order, n))):\n                    this_word = reduced_hist[-1]\n                    reduced_hist = reduced_hist[:-1]  # pop an element off the\n                                                      # history\n                    counts_for_backoff_hist = self.counts[m][reduced_hist]\n                    counts_for_backoff_hist.word_to_count[this_word] += 0\n        if args.verbose >= 1:\n            print(\"make_phone_lm.py: in EnsureStructurallyNeededNgramsExist(), \"\n                  \"added {0} n-grams\".format(self.GetNumNgrams() - num_ngrams_initial),\n                  file = sys.stderr)\n\n\n\n    # This function prints the estimated language model as an FST.\n    def PrintAsFst(self, word_disambig_symbol):\n        # n is the history-length (== order - 1).  We iterate over the\n        # history-length in the order 1, 0, 2, 3, and then iterate over the\n        # histories of each length.  Putting history-length 1 first\n        # and sorting on the histories of that length\n        # ensures that the bigram state with <s> as the left context comes first.\n        # (note: self.bos_symbol is the most negative symbol)\n\n        # hist_to_state maps from history (as a tuple) to integer FST-state.\n        hist_to_state = self.GetHistToStateMap()\n\n        for n in [ 1, 0 ] + list(range(2, args.ngram_order)):\n            this_order_counts = self.counts[n]\n            # For history-length 1, make sure the keys are sorted.\n            keys = this_order_counts.keys() if n != 1 else sorted(this_order_counts.keys())\n            for hist in keys:\n                word_to_count = this_order_counts[hist].word_to_count\n                this_fst_state = hist_to_state[hist]\n\n                for word in word_to_count.keys():\n                    # work out this_cost.  
Costs in OpenFst are negative logs.\n                    this_cost = -math.log(self.GetProb(hist, word))\n\n                    if word > 0: # a real word.\n                        next_hist = hist + (word,)  # appending tuples\n                        while not next_hist in hist_to_state:\n                            next_hist = next_hist[1:]\n                        next_fst_state = hist_to_state[next_hist]\n                        print(this_fst_state, next_fst_state, word, word,\n                              this_cost)\n                    elif word == self.eos_symbol:\n                        # print final-prob for this state.\n                        print(this_fst_state, this_cost)\n                    else:\n                        assert word == self.backoff_symbol\n                        backoff_fst_state = hist_to_state[hist[1:len(hist)]]\n                        print(this_fst_state, backoff_fst_state,\n                              word_disambig_symbol, 0, this_cost)\n\n    # This function returns a set of n-grams that cannot currently be pruned\n    # away, either because a higher-order form of the same n-gram already exists,\n    # or because the n-gram leads to an n-gram state that exists.\n    # [Note: as we prune, we remove any states that can be removed; see that\n    # PruneToIntermediateTarget() calls PruneEmptyStates().\n\n    def GetProtectedNgrams(self):\n        ans = set()\n        for n in range(args.no_backoff_ngram_order + 1, args.ngram_order):\n            for hist, counts_for_hist in self.counts[n].items():\n                # If we have an n-gram (6, 7, 8) -> 9, the following loop will\n                # add the backed-off n-grams (7, 8) -> 9 and (8) -> 9 to\n                # 'protected-ngrams'.\n                reduced_hist = hist\n                for m in reversed(list(range(args.no_backoff_ngram_order, n))):\n                    reduced_hist = reduced_hist[1:]  # shift an element off\n                                             
        # the history.\n\n                    for word in counts_for_hist.word_to_count.keys():\n                        if word != self.backoff_symbol:\n                            ans.add(reduced_hist + (word,))\n                # The following statement ensures that if we are in a\n                # history-state (6, 7, 8), then n-grams (6, 7, 8) and (6, 7) are\n                # protected.  This assures that the FST states are accessible.\n                reduced_hist = hist\n                for m in reversed(list(range(args.no_backoff_ngram_order, n))):\n                    ans.add(reduced_hist)\n                    reduced_hist = reduced_hist[:-1]  # pop an element off the\n                                                      # history\n        return ans\n\n    def PruneNgram(self, hist, word):\n        counts_for_hist = self.counts[len(hist)][hist]\n        assert word != self.backoff_symbol and word in counts_for_hist.word_to_count\n        count = counts_for_hist.word_to_count[word]\n        del counts_for_hist.word_to_count[word]\n        counts_for_hist.word_to_count[self.backoff_symbol] += count\n        # the next call adds the count to the symbol 'word' in the backoff\n        # history-state, and also updates its 'total_count'.\n        self.counts[len(hist) - 1][hist[1:]].AddCount(word, count)\n\n    # The function PruningLogprobChange is the same as the same-named\n    # function in float-counts-prune.cc in pocolm.  Note, it doesn't access\n    # any class members.\n\n    # This function computes the log-likelihood change (<= 0) from backing off\n    # a particular symbol to the lower-order state.\n    # The value it returns can be interpreted as a lower bound the actual log-likelihood\n    # change.  By \"the actual log-likelihood change\" we mean of data generated by\n    # the model itself before making the change, then modeled with the changed model\n    # [and comparing the log-like with the log-like before changing the model].  
That is,\n    # it's a K-L divergence, but with the caveat that we don't normalize by the\n    # overall count of the data, so it's a K-L divergence multiplied by the training-data\n    # count.\n\n    #  'count' is the count of the word (call it 'a') in this state.  It's an integer.\n    #  'discount' is the discount-count in this state (represented as the count\n    #         for the symbol self.backoff_symbol).  It's an integer.\n    #  [note: we don't care about the total-count in this state, it cancels out.]\n    #  'backoff_count' is the count of word 'a' in the lower-order state.\n    #                 [actually it is the augmented count, treating any\n    #                  extra probability from even-lower-order states as\n    #                  if it were a count].  It's a float.\n    #  'backoff_total' is the total count in the lower-order state.  It's a float.\n    def PruningLogprobChange(self, count, discount, backoff_count, backoff_total):\n        if count == 0:\n            return 0.0\n\n        assert discount > 0 and backoff_total >= backoff_count and backoff_total >= 0.99 * discount\n\n\n        # augmented_count is like 'count', but with the extra count for symbol\n        # 'a' due to backoff included.\n        augmented_count = count + discount * backoff_count / backoff_total\n\n        # We imagine a phantom symbol 'b' that represents all symbols other than\n        # 'a' appearing in this history-state that are accessed via backoff.  We\n        # treat these as being distinct symbols from the same symbol if accessed\n        # not-via-backoff.  (Treating same symbols as distinct gives an upper bound\n        # on the divergence).  We also treat them as distinct from the same symbols\n        # that are being accessed via backoff from other states.  b_count is the\n        # observed count of symbol 'b' in this state (the backed-off count is\n        # zero).  
b_count is also the count of symbol 'b' in the backoff state.\n        # Note: b_count will not be negative because backoff_total >= backoff_count.\n        b_count = discount * ((backoff_total - backoff_count) / backoff_total)\n        assert b_count >= -0.001 * backoff_total\n\n        # We imagine a phantom symbol 'c' that represents all symbols other than\n        # 'a' and 'b' appearing in the backoff state, which got there from\n        # backing off other states (other than 'this' state).  Again, we imagine\n        # the symbols are distinct even though they may not be (i.e. that c and\n        # b represent disjoint sets of symbol, even though they might not really\n        # be disjoint), and this gives us an upper bound on the divergence.\n        c_count = backoff_total - backoff_count - b_count\n        assert c_count >= -0.001 * backoff_total\n\n        # a_other is the count of 'a' in the backoff state that comes from\n        # 'other sources', i.e. it was backed off from history-states other than\n        # the current history state.\n        a_other_count = backoff_count - discount * backoff_count / backoff_total\n        assert a_other_count >= -0.001 * backoff_count\n\n        # the following sub-expressions are the 'new' versions of certain\n        # quantities after we assign the total count 'count' to backoff.  
it\n        # increases the backoff count in 'this' state, and also the total count\n        # in the backoff state, and the count of symbol 'a' in the backoff\n        # state.\n        new_backoff_count = backoff_count + count  # new count of symbol 'a' in\n                                                    # backoff state\n        new_backoff_total = backoff_total + count  # new total count in\n                                                    # backoff state.\n        new_discount = discount + count  # new discount-count in 'this' state.\n\n\n        # all the loglike changes below are of the form\n        # count-of-symbol * log(new prob / old prob)\n        # which can be more conveniently written (by canceling the denominators),\n        # count-of-symbol * log(new count / old count).\n\n        # this_a_change is the log-like change of symbol 'a' coming from 'this'\n        # state.  bear in mind that\n        # augmented_count = count + discount * backoff_count / backoff_total,\n        # and the 'count' term is zero in the numerator part of the log expression,\n        # because symbol 'a' is completely backed off in 'this' state.\n        this_a_change = augmented_count * \\\n            math.log((new_discount * new_backoff_count / new_backoff_total)/ \\\n                         augmented_count)\n\n        # other_a_change is the log-like change of symbol 'a' coming from all\n        # other states than 'this'.  For speed reasons we don't examine the\n        # direct (non-backoff) counts of symbol 'a' in all other states than\n        # 'this' that back off to the backoff state-- it would be slower.\n        # Instead we just treat the direct part of the prob for symbol 'a' as a\n        # distinct symbol when it comes from those other states... 
as usual,\n        # doing so gives us an upper bound on the divergence.\n        other_a_change = \\\n            a_other_count * math.log((new_backoff_count / new_backoff_total) / \\\n                                         (backoff_count / backoff_total))\n\n        # b_change is the log-like change of phantom symbol 'b' coming from\n        # 'this' state (and note: it only comes from this state, that's how we\n        # defined it).\n        # note: the expression below could be more directly written as a\n        # ratio of pseudo-counts as follows, by converting the backoff probabilities\n        # into pseudo-counts in 'this' state:\n        #  b_count * logf((new_discount * b_count / new_backoff_total) /\n        #                 (discount * b_count / backoff_total)),\n        # but we cancel b_count to give us the expression below.\n        b_change = b_count * math.log((new_discount / new_backoff_total) / \\\n                                          (discount / backoff_total))\n\n        # c_change is the log-like change of phantom symbol 'c' coming from\n        # all other states that back off to the backoff state (and all prob. mass of\n        # 'c' comes from those other states).  
The expression below could be more\n        # directly written as a ratio of counts, as c_count * logf((c_count /\n        # new_backoff_total) / (c_count / backoff_total)), but we simplified it to\n        # the expression below.\n        c_change = c_count * math.log(backoff_total / new_backoff_total)\n\n        ans = this_a_change + other_a_change + b_change + c_change\n        # the answer should not be positive.\n        assert ans <= 0.0001 * (count + discount + backoff_count + backoff_total)\n        if args.verbose >= 4:\n            print(\"pruning-logprob-change for {0},{1},{2},{3} is {4}\".format(\n                    count, discount, backoff_count, backoff_total, ans),\n                  file = sys.stderr)\n        return ans\n\n\n    def GetLikeChangeFromPruningNgram(self, hist, word):\n        counts_for_hist = self.counts[len(hist)][hist]\n        counts_for_backoff_hist = self.counts[len(hist) - 1][hist[1:]]\n        assert word != self.backoff_symbol and word in counts_for_hist.word_to_count\n        count = counts_for_hist.word_to_count[word]\n        discount = counts_for_hist.word_to_count[self.backoff_symbol]\n        backoff_total = counts_for_backoff_hist.total_count\n        # backoff_count is a pseudo-count: it's like the count of 'word' in the\n        # backoff history-state, but adding something to account for further\n        # levels of backoff.\n        try:\n            backoff_count = self.GetProb(hist[1:], word) * backoff_total\n        except:\n            print(\"problem getting backoff count: hist = {0}, word = {1}\".format(hist, word),\n                  file = sys.stderr)\n            sys.exit(1)\n\n        return self.PruningLogprobChange(float(count), float(discount),\n                                         backoff_count, float(backoff_total))\n\n    # note: returns loglike change per word.\n    def PruneToIntermediateTarget(self, num_extra_ngrams):\n        protected_ngrams = self.GetProtectedNgrams()\n        
initial_num_extra_ngrams = self.GetNumExtraNgrams()\n        num_ngrams_to_prune = initial_num_extra_ngrams - num_extra_ngrams\n        assert num_ngrams_to_prune > 0\n\n        num_candidates_per_order = [ 0 ] * args.ngram_order\n        num_pruned_per_order = [ 0 ] * args.ngram_order\n\n\n        # like_change_and_ngrams this will be a list of tuples consisting\n        # of the likelihood change as a float and then the words of the n-gram\n        # that we're considering pruning,\n        # e.g. (-0.164, 7, 8, 9)\n        # meaning that pruning the n-gram (7, 8) -> 9 leads to\n        # a likelihood change of -0.164.  We'll later sort this list\n        # so we can prune the n-grams that made the least-negative\n        # likelihood change.\n        like_change_and_ngrams = []\n        for n in range(args.no_backoff_ngram_order, args.ngram_order):\n            for hist, counts_for_hist in self.counts[n].items():\n                for word, count in counts_for_hist.word_to_count.items():\n                    if word != self.backoff_symbol:\n                        if not hist + (word,) in protected_ngrams:\n                            like_change = self.GetLikeChangeFromPruningNgram(hist, word)\n                            like_change_and_ngrams.append((like_change,) + hist + (word,))\n                            num_candidates_per_order[len(hist)] += 1\n\n        like_change_and_ngrams.sort(reverse = True)\n\n        if num_ngrams_to_prune > len(like_change_and_ngrams):\n            print('make_phone_lm.py: aimed to prune {0} n-grams but could only '\n                  'prune {1}'.format(num_ngrams_to_prune, len(like_change_and_ngrams)),\n                  file = sys.stderr)\n            num_ngrams_to_prune = len(like_change_and_ngrams)\n\n        total_loglike_change = 0.0\n\n        for i in range(num_ngrams_to_prune):\n            total_loglike_change += like_change_and_ngrams[i][0]\n            hist = like_change_and_ngrams[i][1:-1]  # all but 1st and last 
elements\n            word = like_change_and_ngrams[i][-1]  # last element\n            num_pruned_per_order[len(hist)] += 1\n            self.PruneNgram(hist, word)\n\n        like_change_per_word = total_loglike_change / self.total_num_words\n\n        if args.verbose >= 1:\n            # guard against num_ngrams_to_prune == 0, where index -1 would be invalid.\n            effective_threshold = (like_change_and_ngrams[num_ngrams_to_prune - 1][0]\n                                   if num_ngrams_to_prune > 0 else 0.0)\n            print(\"Pruned from {0} ngrams to {1}, with threshold {2}.  Candidates per order were {3}, \"\n                  \"num-ngrams pruned per order were {4}.  Like-change per word was {5}\".format(\n                    initial_num_extra_ngrams,\n                    initial_num_extra_ngrams - num_ngrams_to_prune,\n                    '%.4f' % effective_threshold,\n                    num_candidates_per_order,\n                    num_pruned_per_order,\n                    like_change_per_word), file = sys.stderr)\n\n        if args.verbose >= 3:\n            print(\"Pruning: like_change_and_ngrams is:\\n\" +\n                  '\\n'.join([str(x) for x in like_change_and_ngrams[:num_ngrams_to_prune]]) +\n                  \"\\n-------- stop pruning here: ----------\\n\" +\n                  '\\n'.join([str(x) for x in like_change_and_ngrams[num_ngrams_to_prune:]]),\n                  file = sys.stderr)\n            self.Print(\"Counts after pruning to num-extra-ngrams={0}\".format(\n                    initial_num_extra_ngrams - num_ngrams_to_prune))\n\n        self.PruneEmptyStates()\n        if args.verbose >= 3:\n            self.Print(\"Counts after removing empty states [inside pruning algorithm]:\")\n        return like_change_per_word\n\n\n\n    def PruneToFinalTarget(self, num_extra_ngrams):\n        # prunes to a specified num_extra_ngrams.  
The 'extra_ngrams' refers to\n        # the count of n-grams of order higher than args.no_backoff_ngram_order.\n        # We construct a sequence of targets that gradually approaches\n        # this value.  Doing it iteratively like this is a good way\n        # to deal with the fact that sometimes we can't prune a certain\n        # n-gram before certain other n-grams are pruned (because\n        # they lead to a state that must be kept, or an n-gram exists\n        # that backs off to this n-gram).\n\n        current_num_extra_ngrams = self.GetNumExtraNgrams()\n\n        if num_extra_ngrams >= current_num_extra_ngrams:\n            print('make_phone_lm.py: not pruning since target num-extra-ngrams={0} is >= '\n                  'current num-extra-ngrams={1}'.format(num_extra_ngrams, current_num_extra_ngrams),\n                  file=sys.stderr)\n            return\n\n        target_sequence = [num_extra_ngrams]\n        # two final iterations where the targets differ by factors of 1.1,\n        # preceded by two iterations where the targets differ by factors of 1.2.\n        for this_factor in [ 1.1, 1.2 ]:\n            for n in range(0,2):\n                if int((target_sequence[-1]+1) * this_factor) < current_num_extra_ngrams:\n                    target_sequence.append(int((target_sequence[-1]+1) * this_factor))\n        # then change in factors of 1.3\n        while True:\n            this_factor = 1.3\n            if int((target_sequence[-1]+1) * this_factor) < current_num_extra_ngrams:\n                target_sequence.append(int((target_sequence[-1]+1) * this_factor))\n            else:\n                break\n\n        target_sequence = list(set(target_sequence))  # only keep unique targets.\n        target_sequence.sort(reverse = True)\n\n        print('make_phone_lm.py: current num-extra-ngrams={0}, pruning with '\n              'following sequence of targets: {1}'.format(current_num_extra_ngrams,\n                                                        
  target_sequence),\n              file = sys.stderr)\n        total_like_change_per_word = 0.0\n        for target in target_sequence:\n            total_like_change_per_word += self.PruneToIntermediateTarget(target)\n\n        if args.verbose >= 1:\n            print('make_phone_lm.py: K-L divergence from pruning (upper bound) is '\n                  '%.4f' % total_like_change_per_word, file = sys.stderr)\n\n\n    # returns the number of n-grams on top of those that can't be pruned away\n    # because their order is <= args.no_backoff_ngram_order.\n    def GetNumExtraNgrams(self):\n        ans = 0\n        for hist_len in range(args.no_backoff_ngram_order, args.ngram_order):\n            # note: hist_len + 1 is the actual order.\n            ans += self.GetNumNgrams(hist_len)\n        return ans\n\n\n    def GetNumNgrams(self, hist_len = None):\n        ans = 0\n        if hist_len == None:\n            for hist_len in range(args.ngram_order):\n                # note: hist_len + 1 is the actual order.\n                ans += self.GetNumNgrams(hist_len)\n            return ans\n        else:\n            for counts_for_hist in self.counts[hist_len].values():\n                ans += len(counts_for_hist.word_to_count)\n                if self.backoff_symbol in counts_for_hist.word_to_count:\n                    ans -= 1  # don't count the backoff symbol, it doesn't produce\n                              # its own n-gram line.\n            return ans\n\n\n    # this function, used in PrintAsArpa, converts an integer to\n    # a string by either printing it as a string, or for self.bos_symbol\n    # and self.eos_symbol, printing them as \"<s>\" and \"</s>\" respectively.\n    def IntToString(self, i):\n        if i == self.bos_symbol:\n            return '<s>'\n        elif i == self.eos_symbol:\n            return '</s>'\n        else:\n            assert i != self.backoff_symbol\n            return str(i)\n\n\n\n    def PrintAsArpa(self):\n        # Prints out the 
FST in ARPA format.\n        assert args.no_backoff_ngram_order == 1  # without unigrams we couldn't\n                                                 # print as ARPA format.\n\n        print('\\\\data\\\\');\n        for hist_len in range(args.ngram_order):\n            # print the number of n-grams.  Add 1 for the 1-gram\n            # section because of <s>, we print -99 as the prob so we\n            # have a place to put the backoff prob.\n            print('ngram {0}={1}'.format(\n                    hist_len + 1,\n                    self.GetNumNgrams(hist_len) + (1 if hist_len == 0 else 0)))\n\n        print('')\n\n        for hist_len in range(args.ngram_order):\n            print('\\\\{0}-grams:'.format(hist_len + 1))\n\n            # print fake n-gram for <s>, for its backoff prob.\n            if hist_len == 0:\n                backoff_prob = self.GetProb((self.bos_symbol,), self.backoff_symbol)\n                if backoff_prob != None:\n                    print('-99\\t<s>\\t{0}'.format('%.5f' % math.log10(backoff_prob)))\n\n            for hist in self.counts[hist_len].keys():\n                for word in self.counts[hist_len][hist].word_to_count.keys():\n                    if word != self.backoff_symbol:\n                        prob = self.GetProb(hist, word)\n                        assert prob != None and prob > 0\n                        backoff_prob = self.GetProb((hist)+(word,), self.backoff_symbol)\n                        line = '{0}\\t{1}'.format('%.5f' % math.log10(prob),\n                                                 ' '.join(self.IntToString(x) for x in hist + (word,)))\n                        if backoff_prob != None:\n                            line += '\\t{0}'.format('%.5f' % math.log10(backoff_prob))\n                        print(line)\n            print('')\n        print('\\\\end\\\\')\n\n\n\nngram_counts = NgramCounts(args.ngram_order)\nngram_counts.AddRawCountsFromStandardInput()\n\nif args.verbose >= 3:\n    
ngram_counts.Print(\"Raw counts:\")\nngram_counts.ApplyBackoff()\nif args.verbose >= 3:\n    ngram_counts.Print(\"Counts after applying Kneser-Ney discounting:\")\nngram_counts.EnsureStructurallyNeededNgramsExist()\nif args.verbose >= 3:\n    ngram_counts.Print(\"Counts after adding structurally-needed n-grams (1st time):\")\nngram_counts.PruneEmptyStates()\nif args.verbose >= 3:\n    ngram_counts.Print(\"Counts after removing empty states:\")\nngram_counts.PruneToFinalTarget(args.num_extra_ngrams)\n\nngram_counts.EnsureStructurallyNeededNgramsExist()\nif args.verbose >= 3:\n    ngram_counts.Print(\"Counts after adding structurally-needed n-grams (2nd time):\")\n\n\n\n\nif args.print_as_arpa == \"true\":\n    ngram_counts.PrintAsArpa()\nelse:\n    if args.phone_disambig_symbol == None:\n        sys.exit(\"make_phone_lm.py: --phone-disambig-symbol must be provided (unless \"\n                 \"you are writing as ARPA\")\n    ngram_counts.PrintAsFst(args.phone_disambig_symbol)\n\n\n## Below are some little test commands that can be used to look at the detailed stats\n## for a kind of sanity check.\n# test comand:\n# (echo 6 7 8 4; echo 7 8 9; echo 7 8; echo 7 4; echo 8 4 ) | utils/lang/make_phone_lm.py --phone-disambig-symbol=400  --verbose=3\n#  (echo 6 7 8 4; echo 7 8 9; echo 7 8; echo 7 4; echo 8 4 ) | utils/lang/make_phone_lm.py --phone-disambig-symbol=400  --verbose=3 --num-extra-ngrams=0\n# (echo 6 7 8 4; echo 6 7 ) | utils/lang/make_phone_lm.py --print-as-arpa=true --no-backoff-ngram-order=1  --verbose=3\n\n\n## The following shows how we created some data suitable to do comparisons with\n## other language modeling toolkits.  Note: we're running in a configuration\n## where --no-backoff-ngram-order=1 (i.e. we have a unigram LM state) because\n## it's the only way to get perplexity calculations and to write an ARPA file.\n##\n# cd egs/tedlium/s5_r2\n# . 
./path.sh\n# mkdir -p lm_test\n# ali-to-phones exp/tri3/final.mdl \"ark:gunzip -c exp/tri3/ali.*.gz|\" ark,t:-  | awk '{$1 = \"\"; print}' > lm_test/phone_seqs\n# wc lm_test/phone_seqs\n# 92464  8409563 27953288 lm_test/phone_seqs\n# head -n 20000 lm_test/phone_seqs > lm_test/train.txt\n# tail -n 1000 lm_test/phone_seqs > lm_test/test.txt\n\n## This shows make_phone_lm.py with the default number of extra-lm-states (20k)\n## You have to have SRILM on your path to get perplexities [note: it should be on the\n## path if you installed it and you sourced the tedlium s5_r2 path.sh, as above.]\n# utils/lang/make_phone_lm.py --print-as-arpa=true --no-backoff-ngram-order=1 --verbose=1 < lm_test/train.txt > lm_test/arpa_pr20k\n# ngram -order 4 -unk -lm lm_test/arpa_pr20k -ppl lm_test/test.txt\n# file lm_test/test.txt: 1000 sentences, 86489 words, 3 OOVs\n# 0 zeroprobs, logprob= -80130.1 ppl=*8.23985* ppl1= 8.44325\n# on training data: 0 zeroprobs, logprob= -1.6264e+06 ppl= 7.46947 ppl1= 7.63431\n\n## This shows make_phone_lm.py without any pruning (make --num-extra-ngrams very large).\n# utils/lang/make_phone_lm.py --print-as-arpa=true --num-extra-ngrams=1000000 --no-backoff-ngram-order=1 --verbose=1 < lm_test/train.txt > lm_test/arpa\n# ngram -order 4 -unk -lm lm_test/arpa -ppl lm_test/test.txt\n# file lm_test/test.txt: 1000 sentences, 86489 words, 3 OOVs\n# 0 zeroprobs, logprob= -74976 ppl=*7.19459* ppl1= 7.36064\n# on training data: 0 zeroprobs, logprob= -1.44198e+06 ppl= 5.94659 ppl1= 6.06279\n\n## This is SRILM without pruning (c.f. the 7.19 above, it's slightly better).\n# ngram-count -text lm_test/train.txt -order 4 -kndiscount2 -kndiscount3 -kndiscount4 -interpolate -lm lm_test/arpa_srilm\n# ngram -order 4 -unk -lm lm_test/arpa_srilm -ppl lm_test/test.txt\n# file lm_test/test.txt: 1000 sentences, 86489 words, 3 OOVs\n# 0 zeroprobs, logprob= -74742.2 ppl= *7.15044* ppl1= 7.31494\n\n\n## This is SRILM with a pruning beam tuned to get 20k n-grams above unigram\n##  (c.f. 
the 8.23 above, it's a lot worse).\n# ngram-count -text lm_test/train.txt -order 4 -kndiscount2 -kndiscount3 -kndiscount4 -interpolate -prune 1.65e-05 -lm lm_test/arpa_srilm.pr1.65e-5\n# the model has 20249 n-grams above unigram [c.f. our 20k]\n# ngram -order 4 -unk -lm lm_test/arpa_srilm.pr1.65e-5 -ppl lm_test/test.txt\n# file lm_test/test.txt: 1000 sentences, 86489 words, 3 OOVs\n# 0 zeroprobs, logprob= -86803.7 ppl=*9.82202* ppl1= 10.0849\n\n\n## This is pocolm.\n## Note: we have to hold out some of the training data as dev to\n## estimate the hyperparameters, but we'll fold it back in before\n## making the final LM. [--fold-dev-into=train]\n# mkdir -p lm_test/data/text\n# head -n 1000 lm_test/train.txt > lm_test/data/text/dev.txt\n# tail -n +1001 lm_test/train.txt > lm_test/data/text/train.txt\n## give it a 'large' num-words so it picks them all.\n# export PATH=$PATH:../../../tools/pocolm/scripts\n# train_lm.py --num-word=100000 --fold-dev-into=train lm_test/data/text 4 lm_test/data/lm_unpruned\n# get_data_prob.py lm_test/test.txt lm_test/data/lm_unpruned/100000_4.pocolm\n## compute-probs: average log-prob per word was -1.95956 (perplexity = *7.0962*) over 87489 words.\n## Note: we can compare this perplexity with 7.15 from SRILM and 7.19 from make_phone_lm.py.\n\n#   pruned_lm_dir=${lm_dir}/${num_word}_${order}_prune${threshold}.pocolm\n# prune_lm_dir.py --target-num-ngrams=20100 lm_test/data/lm_unpruned/100000_4.pocolm lm_test/data/lm_unpruned/100000_4_pr20k.pocolm\n# get_data_prob.py lm_test/test.txt lm_test/data/lm_unpruned/100000_4_pr20k.pocolm\n## compute-probs: average log-prob per word was -2.0409 (perplexity = 7.69757) over 87489 words.\n## note: the 7.69 can be compared with 9.82 from SRILM and 8.23 from make_phone_lm.py.\n## format_arpa_lm.py lm_test/data/lm_unpruned/100000_4_pr20k.pocolm | head\n## .. it has 20488 n-grams above unigram.  More than 20k but not enough to explain the difference\n## .. in perplexity.\n\n## OK... 
when I reran after modifying prune_lm_dir.py to comment out the line\n## 'steps += 'EM EM'.split()', which adds the two EM stages per step, and computed the\n## perplexity again, I got the following:\n## compute-probs: average log-prob per word was -2.09722 (perplexity = 8.14353) over 87489 words.\n## .. so it turns out the E-M is actually important.\n"
  },
  {
    "path": "egs/utils/lang/make_position_dependent_subword_lexicon.py",
    "content": "#!/usr/bin/env python3\n\n# 2019 Dongji Gao\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\nfrom make_lexicon_fst import read_lexiconp\nimport argparse\nimport math\n\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script creates a\n        position-dependent subword lexicon from a position-independent subword lexicon\n        by adding suffixes (\"_B\", \"_I\", \"_E\", \"_S\") to the related phones.\n        It assumes that the input lexicon does not contain disambiguation symbols.\"\"\")\n    parser.add_argument(\"--separator\", type=str, default=\"@@\", help=\"\"\"Separator\n        indicates the position of a subword in a word. \n        Subword ends with separator can only appear at the beginning or middle of a word. \n        Subword without separator can only appear at the end of a word or is a word itself.\n        E.g. 
\"international -> inter@@ nation@@ al\";\n             \"nation        -> nation\"\n        The separator should match the separator used in the input lexicon.\"\"\")\n    parser.add_argument(\"lexiconp\", type=str, help=\"\"\"Filename of subword position-independent \n        lexicon with pronunciation probabilities, with lines of the form 'subword prob p1 p2 ...'\"\"\")\n    args = parser.parse_args()\n    return args\n\ndef is_end(subword, separator):\n    \"\"\"Return true if the subword can appear at the end of a word (i.e., the subword \n    does not end with separator). Return false otherwise.\"\"\"\n    return not subword.endswith(separator)\n\ndef write_position_dependent_lexicon(lexiconp, separator):\n    \"\"\"Print a position-dependent lexicon for each subword from the input lexiconp by adding\n    appropriate suffixes (\"_B\", \"_I\", \"_E\", \"_S\") to the phone sequence related to the subword.\n    There are 4 types of position-dependent subword:\n    1) Beginning subword. It can only appear at the beginning of a word.\n       The first phone suffix should be \"_B\" and other suffixes should be \"_I\"s:\n        nation@@ 1.0 n_B ey_I sh_I ih_I n_I\n        n@@      1.0 n_B\n    2) Middle subword. It can only appear at the middle of a word.\n       All phone suffixes should be \"_I\"s:\n        nation@@ 1.0 n_I ey_I sh_I ih_I n_I\n    3) End subword. It can only appear at the end of a word.\n       The last phone suffix should be \"_E\" and other suffixes should be \"_I\"s:\n        nation   1.0 n_I ey_I sh_I ih_I n_E\n        n        1.0 n_E\n    4) Singleton subword (i.e., the subword is word it self). \n       The first phone suffix should be \"_B\" and the last suffix should be \"_E\".\n       All other suffixes should be \"_I\"s. 
If there is only one phone, its suffix should be \"_S\":\n        nation   1.0 n_B ey_I sh_I ih_I n_E\n        n        1.0 n_S\n    In most cases (i.e., subwords have more than 1 phones), the suffixes of phones in the middle are \"_I\"s.\n    So the suffix_list is initialized with all _I and we only replace the first and last phone suffix when\n    dealing with different cases when necessary.\n    \"\"\"\n    for (word, prob, phones) in lexiconp:\n        phones_length = len(phones)\n\n        # suffix_list is initialized by all \"_I\"s.\n        suffix_list = [\"_I\" for i in range(phones_length)]\n\n        if is_end(word, separator):\n            # print end subword lexicon by replacing the last phone suffix by \"_E\"\n            suffix_list[-1] = \"_E\"\n            phones_list = [phone + suffix for (phone, suffix) in zip(phones, suffix_list)]\n            print(\"{} {} {}\".format(word, prob, ' '.join(phones_list)))\n\n            # print singleton subword lexicon\n            # the phone suffix is \"_S\" if the there is only 1 phone.\n            if phones_length == 1:\n                suffix_list[0] = \"_S\"\n                phones_list = [phone + suffix for (phone, suffix) in zip(phones, suffix_list)]\n                print(\"{} {} {}\".format(word, prob, ' '.join(phones_list)))\n            # the first phone suffix is \"_B\" is there is more than 1 phones.\n            else:\n                suffix_list[0] = \"_B\"\n                phones_list = [phone + suffix for (phone, suffix) in zip(phones, suffix_list)]\n                print(\"{} {} {}\".format(word, prob, ' '.join(phones_list)))\n        else:\n            # print middle subword lexicon\n            phones_list = [phone + suffix for (phone, suffix) in zip(phones, suffix_list)]\n            print(\"{} {} {}\".format(word, prob, ' '.join(phones_list)))\n\n            # print beginning subword lexicon by replacing the first phone suffix by \"_B\"\n            suffix_list[0] = \"_B\"\n            
phones_list = [phone + suffix for (phone, suffix) in zip(phones, suffix_list)]\n            print(\"{} {} {}\".format(word, prob, ' '.join(phones_list)))\n\ndef main():\n    args = get_args()\n    lexiconp = read_lexiconp(args.lexiconp)\n    write_position_dependent_lexicon(lexiconp, args.separator)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/utils/lang/make_subword_lexicon_fst.py",
    "content": "#!/usr/bin/env python3\n\n# 2019 Dongji Gao\n# Apache 2.0.\n\nfrom make_lexicon_fst import read_lexiconp\nimport argparse\nimport math\nimport sys\n\n# see get_args() below for usage mesage\ndef get_args():\n    parser = argparse.ArgumentParser(description=\"\"\"This script creates the\n        text form of a subword lexicon FST to be compiled by fstcompile using\n        the appropriate symbol tables (phones.txt and words.txt). It will mostly\n        be invoked indirectly via utils/prepare_lang_subword.sh. The output\n        goes to the stdout. This script is the subword version of make_lexicon_fst.py.\n        It only allows optional silence to appear after end-subword or singleton-subword,\n        (i.e., subwords without separator). In this version we do not support\n        pronunciation probability. (i.e., pron-prob = 1.0)\"\"\")\n\n    parser.add_argument('--sil-phone', type=str, help=\"\"\"Text form of\n        optional-silence phone, e.g. 'SIL'. See also the --sil-prob option.\"\"\")\n    parser.add_argument('--sil-prob', type=float, default=0.0, help=\"\"\"Probability\n        of silence between words (including the beginning and end of word sequence).\n        Must be in range [0.0, 1.0). This refer to the optional silence inserted by\n        the lexicon; see the --sil-phone option.\"\"\")\n    parser.add_argument('--sil-disambig', type=str, help=\"\"\"Disambiguation symbol\n        to disambiguate silence, e.g. #5. Will only be supplied if you are creating \n        the version of L.fst with disambiguation symbols, intended for use with cyclic \n        G.fst. 
This symbol was introduced to fix a rather obscure source of nondeterminism \n        of CLG.fst, that has to do with reordering of disambiguation symbols and phone symbols.\"\"\")\n    parser.add_argument('--position-dependent', action=\"store_true\", help=\"\"\"Whether \n        the input lexicon is position-dependent.\"\"\")\n    parser.add_argument(\"--separator\", type=str, default=\"@@\", help=\"\"\"Separator\n        indicates the position of a subword in a word.\n        Subword followed by separator can only appear at the beginning or middle of a word.\n        Subword without separator can only appear at the end of a word or is a word itself.\n        E.g. \"international -> inter@@ nation@@ al\";\n             \"nation        -> nation\"\n    The separator should match the separator used in the input lexicon.\"\"\")\n    parser.add_argument('lexiconp', type=str, help=\"\"\"Filename of lexicon with\n        pronunciation probabilities (normally lexiconp.txt), with lines of the\n        form 'subword prob p1 p2...', e.g. 'a, 1.0 ay'\"\"\")\n    args = parser.parse_args()\n    return args\n\ndef contain_disambig_symbol(phones):\n    \"\"\"Return true if the phone sequence contains disambiguation symbol.\n    Return false otherwise. Disambiguation symbol is at the end of phones \n    in the form of #1, #2... There is at most one disambiguation \n    symbol for each phone sequence\"\"\"\n    return True if phones[-1].startswith(\"#\") else False\n\ndef print_arc(src, dest, phone, word, cost):\n    print('{}\\t{}\\t{}\\t{}\\t{}'.format(src, dest, phone, word, cost))\n\ndef is_end(word, separator):\n    \"\"\"Return true if the subword can appear at the end of a word (i.e., the subword\n    does not end with separator). Return false otherwise.\"\"\"\n    return not word.endswith(separator)\n\ndef get_suffix(phone):\n    \"\"\"Return the suffix of a phone. 
The suffix is in the form of '_B', '_I'...\"\"\"\n    if len(phone) < 3:\n        print(\"{}: invalid phone {} (please check if the phone is position-dependent)\".format(\n              sys.argv[0], phone), file=sys.stderr)\n        sys.exit(1)\n    return phone[-2:]\n\ndef write_fst_no_silence(lexicon, position_dependent, separator):\n    \"\"\"Writes the text format of L.fst to the standard output.  This version is for\n    when --sil-prob=0.0, meaning there is no optional silence allowed.\n    loop_state here is the start and final state of the fst. It goes to word_start_state\n    via epsilon transition.\n    In position-independent case, there is no difference between beginning word and \n    middle word. So all subwords with separator would leave from and enter word_start_state.\n    All subword without separator would leave from word_start_state and enter loop_state.\n    This guarantees that optional silence can only follow a word-end subword.\n\n    In position-dependent case, there are 4 types of position-dependent subword:\n    1) Beginning subword. The first phone suffix should be \"_B\" and other suffixes should be \"_I\"s:\n        nation@@ 1.0 n_B ey_I sh_I ih_I n_I\n        n@@      1.0 n_B\n    2) Middle subword. All phone suffixes should be \"_I\"s:\n        nation@@ 1.0 n_I ey_I sh_I ih_I n_I\n    3) End subword. The last phone suffix should be \"_E\" and other suffixes be should \"_I\"s:\n        nation   1.0 n_I ey_I sh_I ih_I n_E\n        n        1.0 n_E\n    4) Singleton subword (i.e., the subword is word it self).\n       The first phone suffix should be \"_B\" and the last suffix should be \"_E\".\n       All other suffix should be \"_I\"s. If there is only one phone, its suffix should be \"_S\":\n        nation   1.0 n_B ey_I sh_I ih_I n_E\n        n        1.0 n_S\n\n    So we need an extra word_internal_state. 
A beginning subword\n    leaves word_start_state and enters word_internal_state, and a middle subword\n    leaves from and enters word_internal_state. The rest is the same.\n\n      'lexicon' is a list of 3-tuples (subword, pron-prob, prons) as returned by\n         read_lexiconp().\n      'position_dependent' is true if the lexicon is position-dependent.\n      'separator' is a symbol which indicates the position of a subword in a word.\n    \"\"\"\n    # regular setting\n    loop_state = 0\n    word_start_state = 1\n    next_state = 2\n\n    print_arc(loop_state, word_start_state, \"<eps>\", \"<eps>\", 0.0)\n\n    # optional setting for word_internal_state\n    if position_dependent:\n        word_internal_state = next_state\n        next_state += 1\n\n    for (word, pron_prob, phones) in lexicon:\n        pron_cost = 0.0                # do not support pron_prob\n        phones_len = len(phones)\n\n        # set start and end state for different cases\n        if position_dependent:\n            first_phone_suffix = get_suffix(phones[0])\n            last_phone = phones[-2] if contain_disambig_symbol(phones) else phones[-1]\n            last_phone_suffix = get_suffix(last_phone)\n\n            # singleton word\n            if first_phone_suffix == \"_S\":\n                current_state = word_start_state\n                end_state = loop_state\n            # set the current_state\n            elif first_phone_suffix == \"_B\":\n                current_state = word_start_state\n            elif first_phone_suffix == \"_I\" or first_phone_suffix == \"_E\":\n                current_state = word_internal_state\n            # then set the end_state\n            if last_phone_suffix == \"_B\" or last_phone_suffix == \"_I\":\n                end_state = word_internal_state\n            elif last_phone_suffix == \"_E\":\n                end_state = loop_state\n        else:\n            current_state = word_start_state\n            end_state = loop_state if is_end(word, separator) 
else word_start_state\n\n        # print arcs (except the last one) for the subword\n        for i in range(phones_len - 1):\n            word = word if i == 0 else \"<eps>\"\n            cost = pron_cost if i == 0 else 0.0\n            print_arc(current_state, next_state, phones[i], word, cost)\n            current_state = next_state\n            next_state += 1\n\n        # print the last arc\n        i = phones_len - 1\n        phone = phones[i] if i >= 0 else \"<eps>\"\n        word = word if i <= 0 else \"<eps>\"\n        cost = pron_cost if i <= 0 else 0.0\n        print_arc(current_state, end_state, phone, word, cost)\n\n    # set the final state\n    print(\"{state}\\t{final_cost}\".format(state=loop_state, final_cost=0.0))\n\ndef write_fst_with_silence(lexicon, sil_phone, sil_prob, sil_disambig, position_dependent, separator):\n    \"\"\"Writes the text format of L.fst to the standard output.  This version is for\n    when --sil-prob != 0.0, meaning that optional silence is allowed.\n    loop_state here is the final state of the fst (start_state is a separate state). It goes to\n    word_start_state via epsilon transition.\n\n    In position-independent case, there is no difference between beginning word and \n    middle word. So all subwords with separator would leave from and enter word_start_state.\n    All subwords without separator would leave from word_start_state and enter loop_state or sil_state.\n    This guarantees that optional silence can only follow a word-end subword and such subwords\n    must appear at the end of the whole subword sequence.\n\n    In position-dependent case, there are 4 types of position-dependent subword:\n    1) Beginning subword. The first phone suffix should be \"_B\" and other suffixes should be \"_I\"s:\n        nation@@ 1.0 n_B ey_I sh_I ih_I n_I\n        n@@      1.0 n_B\n    2) Middle subword. All phone suffixes should be \"_I\"s:\n        nation@@ 1.0 n_I ey_I sh_I ih_I n_I\n    3) End subword. 
The last phone suffix should be \"_E\" and other suffixes should be \"_I\"s:\n        nation   1.0 n_I ey_I sh_I ih_I n_E\n        n        1.0 n_E\n    4) Singleton subword (i.e., the subword is the word itself).\n       The first phone suffix should be \"_B\" and the last suffix should be \"_E\".\n       All other suffixes should be \"_I\"s. If there is only one phone, its suffix should be \"_S\":\n        nation   1.0 n_B ey_I sh_I ih_I n_E\n        n        1.0 n_S\n\n    So we need an extra word_internal_state. The beginning word \n    would leave from word_start_state and enter word_internal_state and middle word\n    would leave from and enter word_internal_state. The rest is the same.\n\n      'lexicon' is a list of 3-tuples (subword, pron-prob, prons)\n         as returned by read_lexiconp().\n      'sil_prob', which is expected to be strictly between 0.0 and 1.0, is the\n         probability of silence\n      'sil_phone' is the silence phone, e.g. \"SIL\".\n      'sil_disambig' is either None, or the silence disambiguation symbol, e.g. 
\"#5\".\n      'position_dependent', which is True is the lexicion is position-dependent.\n      'separator' is the symbol we use to indicate the position of a subword in word.\n    \"\"\"\n\n    sil_cost = -math.log(sil_prob)\n    no_sil_cost = -math.log(1 - sil_prob)\n\n    # regular setting\n    start_state = 0\n    loop_state = 1         # also the final state\n    sil_state = 2          # words terminate here when followed by silence; this state\n                           # has a licence transition to loop_state\n    word_start_state = 3   # subword leave from here\n    next_state = 4         # the next un-allocated state, will be incremented as we go\n\n    print_arc(start_state, loop_state, \"<eps>\", \"<eps>\", no_sil_cost)\n    print_arc(start_state, sil_state, \"<eps>\", \"<eps>\", sil_cost)\n    print_arc(loop_state, word_start_state, \"<eps>\", \"<eps>\", 0.0)\n\n    # optional setting for disambig_state\n    if sil_disambig is None:\n        print_arc(sil_state, loop_state, sil_phone, \"<eps>\", 0.0)\n    else:\n        disambig_state = next_state\n        next_state += 1\n        print_arc(sil_state, disambig_state, sil_phone, \"<eps>\", 0.0)\n        print_arc(disambig_state, loop_state, sil_disambig, \"<eps>\", 0.0)\n\n    # optional setting for word_internal_state\n    if position_dependent:\n        word_internal_state = next_state\n        next_state += 1\n\n    for (word, pron_prob, phones) in lexicon:\n        pron_cost = 0.0           # do not support pron_prob\n        phones_len = len(phones)\n        \n        # set start and end state for different cases\n        if position_dependent:\n            first_phone_suffix = get_suffix(phones[0])\n            last_phone = phones[-2] if contain_disambig_symbol(phones) else phones[-1]\n            last_phone_suffix = get_suffix(last_phone)\n\n            # singleton subword\n            if first_phone_suffix == \"_S\":\n                current_state = word_start_state\n                
end_state_list = [loop_state, sil_state]\n                end_cost_list = [no_sil_cost, sil_cost]\n            # first set the current_state\n            elif first_phone_suffix == \"_B\":\n                current_state = word_start_state\n            elif first_phone_suffix == \"_I\" or first_phone_suffix == \"_E\":\n                current_state = word_internal_state\n            # then set the end_state (end_state_list)\n            if last_phone_suffix == \"_B\" or last_phone_suffix == \"_I\":\n                end_state_list = [word_internal_state]\n                end_cost_list = [0.0]\n            elif last_phone_suffix == \"_E\":\n                end_state_list = [loop_state, sil_state]\n                end_cost_list = [no_sil_cost, sil_cost]\n        else:\n            current_state = word_start_state\n            if is_end(word, separator):\n                end_state_list = [loop_state, sil_state]\n                end_cost_list = [no_sil_cost, sil_cost]\n            else:\n                end_state_list = [word_start_state]\n                end_cost_list = [0.0]\n\n        # print arcs (except the last one) for the subword\n        for i in range(phones_len - 1):\n            word = word if i == 0 else \"<eps>\"\n            cost = pron_cost if i == 0 else 0.0\n            print_arc(current_state, next_state, phones[i], word, cost)\n            current_state = next_state\n            next_state += 1\n\n        # print the last arc\n        i = phones_len - 1\n        phone = phones[i] if i >= 0 else \"<eps>\"\n        word = word if i <= 0 else \"<eps>\"\n        cost = pron_cost if i <= 0 else 0.0\n        for (end_state, end_cost) in zip(end_state_list, end_cost_list):\n            print_arc(current_state, end_state, phone, word, cost + end_cost)\n\n    # set the final state\n    print(\"{state}\\t{final_cost}\".format(state=loop_state, final_cost=0.0))\n\ndef main():\n    args = get_args()\n    if args.sil_prob < 0.0 or args.sil_prob >= 1.0:\n        
print(\"{}: invalid value specified --sil-prob={}\".format(\n              sys.argv[0], args.sil_prob), file=sys.stderr)\n        sys.exit(1)\n    lexicon = read_lexiconp(args.lexiconp)\n    if args.sil_prob == 0.0:\n        write_fst_no_silence(lexicon, args.position_dependent, args.separator)\n    else:\n        write_fst_with_silence(lexicon, args.sil_phone, args.sil_prob, \n            args.sil_disambig, args.position_dependent, args.separator)\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "egs/utils/lang/make_unk_lm.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright      2016 Johns Hopkins University (Author: Daniel Povey);\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Begin configuration section.\ncmd=run.pl\nngram_order=4\nnum_extra_ngrams=10000\nposition_dependent_phones=true\nuse_pocolm=true\nmin_word_length=2\nstage=0\nphone_disambig_symbol=\"#1\"\n\n# end configuration sections\n\n[ -f path.sh ] && . ./path.sh\n. utils/parse_options.sh\n\nif [ $# -ne 2 ]; then\n  echo \"Usage: $0 [options] <input-dict-dir> <work-dir>\"\n  echo \"e.g.: $0 data/local/dict exp/make_unk\"\n  echo \"\"\n  echo \"This script creates, as an FST, a phone language model suitable for modeling\"\n  echo \"the unknown word.  It first trains a language model on the phone sequences of the\"\n  echo \"provided dictionary entries (which should be without any word-position-dependency\"\n  echo \"tags); it then creates an FST from it, while, for compactness after context-dependency\"\n  echo \"limiting the transitions to seen bigram pairs of phones.  
Then, by composing with\"\n  echo \"a separate FST it converts it into word-position-dependent phones if applicable,\"\n  echo \"while imposing a minimum-number-of-phones constraint.\"\n  echo \"\"\n  echo \"  <input-dict-dir>:  A dictionary directory (as validated by validate_dict_dir.pl);\"\n  echo \"             the dictionary from this location (lexicon.txt, lexiconp.txt, or\"\n  echo \"             lexiconp_silprob.txt) will be used to train the language model on\"\n  echo \"             phones.  The files silence_phones.txt and nonsilence_phones.txt will\"\n  echo \"             be used to construct a symbol table used internally, and to\"\n  echo \"             exclude lexicon entries containing silences.\"\n  echo \" <work-dir>:    A place to put logs and the output of this script.  The output of\"\n  echo \"                this script will be written to <work-dir>/unk_fst.txt (we write in\"\n  echo \"                text form so that it's independent of the phones.txt).\"\n  echo \"Options:\"\n  echo \"    --ngram-order <n>                 # (default: 4)  N-gram order of the phone-level language\"\n  echo \"                                      # model.  Must be in range [2, 7]\"\n  echo \"    --num-extra-ngrams <n>            # (default: 10000).  The maximum number of n-grams\"\n  echo \"                                      # that may be present in the language model in addition\"\n  echo \"                                      # to the unigrams.  The LM will be pruned to achieve this.\"\n  echo \"    --use-pocolm <true|false>         # (default: true).  If true, use pocolm to estimate the\"\n  echo \"                                      # language model; you will be prompted to install it if\"\n  echo \"                                      # needed.  
(If false, we use the script make_phone_lm.py,\"\n  echo \"                                      # which is simpler but the perplexity is not as good).\"\n  echo \"    --position-dependent-phones <true|false>  # (default: true).  If true, assume position-dependent\"\n  echo \"                                      # phones (although in any case the lexicon should use position-\"\n  echo \"                                      # independent phones).  If position-dependent phones are used,\"\n  echo \"                                      # after creating the LM we compose with an FST that converts\"\n  echo \"                                      # into position-dependent phones while enforcing the natural\"\n  echo \"                                      # constraints that they form a single word.\"\n  echo \"    --min-word-length <1|2>           # (default: 2).  May only be 1 or 2.  The minimum word length\"\n  echo \"                                      # (in number of phones) that is allowed\"\n  echo \"    --phone-disambig-symbol <symbol>  # default: '#1'.  This is the symbol that will be put on the\"\n  echo \"                                      # input side of backoff arcs.  You won't normally have to change\"\n  echo \"                                      # this because prepare_lang.sh expects '#1' there.\"\n  exit 1;\nfi\n\n\ndict_dir=$1\ndir=$2\n\nset -e\n\nmkdir -p $dir/log\n\nif [ $stage -le 0 ]; then\n  if ! utils/validate_dict_dir.pl $dict_dir >&$dir/log/validate_dict_dir.log; then\n    cat $dir/log/validate_dict_dir.log\n    echo \"$0: failed to validate input dict-dir $dict_dir\"\n    exit 1\n  fi\nfi\n\nif ! [ $ngram_order -ge 2 ] || ! [ $ngram_order -le 7 ]; then\n  echo \"$0: invalid --ngram-order $ngram_order (must be in [2,7])\"\n  exit 1\nfi\n\nif ! [ $min_word_length -ge 1 ] || ! 
[ $min_word_length -le 2 ]; then\n  echo \"$0: invalid --min-word-length $min_word_length (must be in [1,2])\"\n  exit 1\nfi\n\n# The next command creates a symbol table that will cover all the symbols we might\n# possibly need in this script.  The word-position-dependent suffixes (_B and so on)\n# won't be needed if --position-dependent-phones is false, but it won't hurt.\ncat $dict_dir/silence_phones.txt $dict_dir/nonsilence_phones.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n; }' | \\\n  awk '{print $1; print $1 \"_B\"; print $1 \"_I\"; print $1 \"_S\"; print $1 \"_E\";}' | \\\n      cat - <(echo \"$phone_disambig_symbol\") | \\\n  awk 'BEGIN{print \"<eps> 0\";} {print $1, NR;}' > $dir/phones.txt\n\nphone_disambig_int=$(tail -n 1 <$dir/phones.txt | awk '{print $2}')\nif ! [ $phone_disambig_int == $phone_disambig_int ]; then\n  echo \"$0: problem working out integer form of phone-disambig symbol.\"\n  exit 1;\nfi\n\nif [ -e $dict_dir/lexicon.txt ]; then\n  src_dict=$dict_dir/lexicon.txt\n  first_phone_field=2\nelif [ -e $dict_dir/lexiconp.txt ]; then\n  src_dict=$dict_dir/lexiconp.txt\n  first_phone_field=3\nelse\n  [ ! -e $dict_dir/lexiconp_silprob.txt ] && \\\n    echo \"$0: expected file $dict_dir/lexiconp_silprob.txt to exist\" && exit 1\n  src_dict=$dict_dir/lexiconp_silprob.txt\n  first_phone_field=6\nfi\n\ncat $dict_dir/silence_phones.txt | awk '{for(n=1;n<=NF;n++) print $n; }' > $dir/silence_phones.txt\n\n# prepare the cleaned up version of the dictionary (to train our phone LM), with\n# the first field (the word) removed, with prons that have silence phones in\n# them removed, and with empty prons (which should not be allowed anyway, but\n# just in case..) 
removed.\nawk -v dir=$dir -v ff=$first_phone_field \\\n   'BEGIN{ while ((getline <(dir\"/silence_phones.txt\")) > 0) sil[$1]=1;  }\n         { ok=1; for (n=ff; n<=NF; n++) { if ($n in sil) ok=0; }\n           if (ok && NF>=ff) { for (n=ff;n<=NF;n++) printf(\"%s \",$n); print \"\"; } else {\n            print(\"make_unk_lm.sh: info: not including dict line: \", $0) >\"/dev/stderr\" }}' <$src_dict >$dir/training.txt\ncat $dir/training.txt | awk '{for(n=1;n<=NF;n++) seen[$n]=1; } END{for (k in seen) print k;}' > $dir/all_nonsil_phones\n\nnum_dict_lines=$(wc -l <$src_dict)\nnum_train_lines=$(wc -l < $dir/training.txt)\nif ! [ $num_train_lines -gt 0 ]; then\n  echo \"$0: something went wrong getting text to train phone-level LM.\"\n  exit 1\nfi\necho \"$0: training on $num_train_lines words out of $num_dict_lines in the \"\necho \"     ... original dictionary (excluding words with silence phones).\"\n\n\nif [ $num_train_lines -lt 2000 ] && $use_pocolm; then\n  echo \"$0: the number of lines of training data is very small [$num_train_lines].\"\n  echo \"    Setting --use-pocolm to false since it probably won't work well\"\n  echo \"    on so little data (e.g. hard to estimate the discounting parameters)\"\n  echo \"    Using make_phone_lm.py instead.\"\n  use_pocolm=false\nfi\n\nif $use_pocolm; then\n  if [ ! -e $KALDI_ROOT/tools/pocolm ]; then\n    echo \"$0: $KALDI_ROOT/tools/pocolm does not exist:\"\n    echo \" ... please do:  cd $KALDI_ROOT/tools; extras/install_pocolm.sh\"\n    echo \" ... 
and then rerun this script.\"\n    exit 1\n  fi\n\n  PATH=$KALDI_ROOT/tools/pocolm/scripts:$PATH\n\n  if [ $stage -le 1 ]; then\n    echo \"$0: training $ngram_order-gram LM with pocolm\"\n\n    mkdir -p $dir/pocolm/text\n    heldout_ratio=5  # hold out one fifth of the data as validation to estimate\n    # metaparameters; we'll fold it back in before estimating the\n    # final LM.\n    cat $dir/training.txt | awk -v h=$heldout_ratio '{if(NR%h == 0) print; }' > $dir/pocolm/text/dev.txt\n    cat $dir/training.txt | awk -v h=$heldout_ratio '{if(NR%h != 0) print; }' > $dir/pocolm/text/train.txt\n\n\n    # the following options are because we expect the amount of data to be small,\n    # all the data subsampling isn't really needed and will increase the chance of\n    # something going wrong.\n\n    small_data_opts=\"--num-splits 4 --warm-start-ratio 1\"\n    $cmd $dir/log/train_lm.log \\\n         train_lm.py --wordlist $dir/all_nonsil_phones $small_data_opts \\\n         --fold-dev-into=train $dir/pocolm/text $ngram_order $dir/pocolm\n  fi\n\n  if [ $stage -le 2 ]; then\n    echo \"$0: pruning LM with pocolm\"\n    num_words=$(wc -l <$dir/all_nonsil_phones)\n    num_ngrams=$[$num_extra_ngrams+$num_words]\n\n\n    $cmd $dir/log/prune_lm_dir.log \\\n         prune_lm_dir.py --target-num-ngrams=$num_ngrams \\\n         $dir/pocolm/all_nonsil_phones_${ngram_order}.pocolm $dir/poclm/lm_pruned\n\n    # format as arpa.\n    format_arpa_lm.py $dir/poclm/lm_pruned > $dir/pocolm.arpa\n  fi\n\n  if [ $stage -le 3 ]; then\n    echo \"$0: applying bigram constraints and converting from ARPA to FST\"\n    # now get bigram constraints: we want to get an FST that only allows phone\n    # bigrams that we've seen (this may enforce certain linguistic constraints,\n    # and also stops the graph from blowing up too much once we introduce\n    # phonetic context.\n    # The NF > 0 is just a double-check that there are no empty prons, which\n    # would be bad as it would allow an empty 
pronunciation of the unknown word.\n    cat $dir/training.txt | awk '{ if (NF > 0) printf(\"<s> %s </s>\\n\", $0); }' | \\\n      awk '{for(n=1;n<NF;n++) { m=n+1; seen[ $n \" \" $m ] = 1; }} END{for(k in seen) print k;}' \\\n          > $dir/allowed_bigrams\n\n    $cmd $dir/log/arpa2fst.log \\\n         utils/lang/internal/arpa2fst_constrained.py --verbose=3 \\\n           --disambig-symbol=\"$phone_disambig_symbol\" \\\n         $dir/pocolm.arpa $dir/allowed_bigrams '>' $dir/unk_fst_orig.txt\n  fi\nelse\n\n  if [ $stage -le 1 ]; then\n    echo \"$0: using make_phone_lm.py to create $ngram_order-gram language-model FST\"\n    $cmd $dir/log/make_phone_lm.log \\\n         utils/sym2int.pl $dir/phones.txt $dir/training.txt '|' \\\n         utils/lang/make_phone_lm.py --verbose=2 \\\n         --phone-disambig-symbol=$phone_disambig_int \\\n         --num-extra-ngrams=$num_extra_ngrams \\\n         --ngram-order=$ngram_order '|' \\\n         utils/int2sym.pl -f 3-4 $dir/phones.txt '>'$dir/unk_fst_orig.txt\n  fi\nfi\n\n\nsym_opts=\"--isymbols=$dir/phones.txt --osymbols=$dir/phones.txt\"\n\nif ! $position_dependent_phones; then\n  if  [ $min_word_length == 1 ]; then\n    echo \"$0: no word-length constraint or word-position-dependency, so exiting.\"\n    # There is no need to compose unk_fst_orig.txt with a separate FST: because of\n    # the bigram constraints and because we ensure that there were no empty prons\n    # in the dictionary (no empty lines in training.txt), the FST wouldn't allow\n    # length-zero words anyway.\n    cp $dir/unk_fst_orig.txt $dir/unk_fst.txt\n    fstcompile $sym_opts <$dir/unk_fst.txt >$dir/unk.fst\n    exit 0;\n  else\n    echo \"$0: creating constraint_fst.txt for min-word-length=2 constraint.\"\n    # min-word-length is 2; we need to apply that constraint.  
A note on the FST\n    # states: 0 is start state, 1 is \"seen one phone\", 2 is \"seen two or more\n    # phones\".\n    # We don't need to take into account the disambig symbol because we compose on\n    # the right with this FST, and it doesn't appear on the output side.\n    cat $dir/all_nonsil_phones | \\\n      awk '{ph[$1]=1} END{ for (p in ph) { print 0,1,p,p; print 1,2,p,p; print 2,2,p,p; }\n                 print 2,0.0; }' > $dir/constraint_fst.txt\n  fi\nelse\n  echo \"$0: creating constraint_fst.txt for min-word-length=$min_word_length constraint, plus word-position-dependency conversion.\"\n\n  # Add constraints and convert phones without tags into phones with the _B, _E, _I and _S\n  # tags (begin, end, internal, singleton).\n\n  # States:\n  # 0 is start state,\n  # 1 is \"seen initial phone (and maybe internal phones) of multi-phone word\",\n  # 2 is \"seen final phone of multi-phone word\".\n  # 3 is \"seen phone of single-phone word\"; note, if --min-word-length is 2,\n  #      then state 3 will not exist.\n\n  cat $dir/all_nonsil_phones | \\\n    awk -v mwl=$min_word_length -v \"disambig=$phone_disambig_symbol\" \\\n '{ph[$1]=1} END{ for (n=0; n<3; n++) print n,n,disambig,disambig;\n                  for (p in ph) { printf(\"0 1 %s %s_B\\n\", p, p); printf(\"1 1 %s %s_I\\n\", p, p);\n                                  printf(\"1 2 %s %s_E\\n\", p, p); if (mwl==1) printf(\"0 3 %s %s_S\\n\", p, p);  }\n                 print 2,0.0; if (mwl==1) print 3,0.0; }' >$dir/constraint_fst.txt\nfi\n\n\necho \"$0: creating final FST via composition, etc.\"\n\nfstcompile $sym_opts <$dir/constraint_fst.txt | fstarcsort > $dir/constraint.fst\nfstcompile $sym_opts <$dir/unk_fst_orig.txt >$dir/unk_orig.fst\n\n# The first 'fstproject' below projects on the input; it makes sure the\n# disambiguation symbol appears on the output side also.\n# The fstcompose actually applies the constraints and does the conversion, but\n# after this the \"correct\" phones appear only 
on the output side.\n# The second 'fstproject' copies the word-position-dependent phones to\n# the input side.\n# The 'fstpushspecial' pushes the weights, as the composition with the\n#  constraint FST makes the FST quite non-stochastic [weights per state do not\n#  sum up to one].\n# The 'fstrmsymbols' command makes sure the disambiguation symbol appears only\n# on the input side.\n# 'fstminimizeencoded' combines states that are the same as far as their output\n# arcs are concerned; in the case where --min-word-length is 1, this combines\n# a lot of final-states that have no transitions out of them.\nfstproject $dir/unk_orig.fst | \\\n  fstcompose - $dir/constraint.fst | \\\n  fstproject --project_output=true | \\\n  fstpushspecial | \\\n  fstminimizeencoded | \\\n  fstrmsymbols --remove-from-output=true <(echo $phone_disambig_int) >$dir/unk.fst\n\nfstprint $sym_opts <$dir/unk.fst >$dir/unk_fst.txt\n\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/lang/validate_disambig_sym_file.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2016 FAU Erlangen (Author: Axel Horndasch)\n# Apache 2.0.\n#\n# Concept: Dan Povey\n\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\nmy $Usage = <<EOU;\nUsage:  validate_disambig_sym_file.pl [options] disambig_syms.txt\n\nThis scripts checks if the entries of a file containing disambiguation symbols\n(word or phone level) are all valid. To be valid the symbols\n- must start with the hash mark '#',\n- must not contain any whitespace,\n- must not be equal to '#-1' (disallowed because it is used internally in some\n  FST stuff).\n\nIn case the option '--allow-numeric' is used with 'false', the symbols must\nalso be non-numeric (to avoid overlap with the automatically generated symbols).\n\nAllowed options:\n  --allow-numeric (true|false) : Default true. If false, disallow numeric\n                                 disambiguation symbols like #0, #1 and so on.\n\nEOU\n\n# Command line options\nmy $allow_numeric = \"true\";\n\n# Get the optional command line options\nGetOptions(\n    \"allow-numeric=s\" => \\$allow_numeric,\n    ) or die ($Usage);\n\nif (@ARGV != 1) {\n  die($Usage);\n}\n\nmy $disambig_sym_file = shift @ARGV;\n\nprint \"$0: Checking validity of file \\\"$disambig_sym_file\\\" ...\\n\";\nif (-z $disambig_sym_file) {\n  print \"$0: The file \\\"$disambig_sym_file\\\" is empty or does not exist, exiting ...\\n\"; exit 1;\n}\n\nif (not open(SYMS, \"<$disambig_sym_file\")) {\n  print \"$0: Could not open file \\\"$disambig_sym_file\\\", exiting ...\\n\"; exit 1;\n}\n\n# Go through the file containing disambiguation symbols line by line\nwhile (<SYMS>) {\n  chomp;\n  my $symbol = $_;\n\n  if ($symbol =~ /^#(.*)$/) {\n    my $sympart = $1;\n    if ($sympart eq \"\") {\n      print \"$0: Only \\\"$symbol\\\" is not allowed as a disambiguation symbol, exiting ...\\n\"; exit 1;\n    }\n    if ($sympart =~/\\s+/) {\n      print \"$0: The disambiguation symbol \\\"$symbol\\\" contains whitespace, exiting ...\\n\"; 
exit 1;\n    }\n    if ($sympart eq \"-1\") {\n      print \"$0: The disambiguation symbol \\\"$symbol\\\" is not allowed, exiting ...\\n\"; exit 1;\n    }\n    if ($allow_numeric eq \"false\" &&\n\t$sympart =~/^[0-9]+$/) {\n      print \"$0: Since \\\"$symbol\\\" is supposed to be an extra disambiguation symbol, it must not be numeric, exiting ...\\n\"; exit 1;\n    }\n  } else {\n    print \"$0: The disambiguation symbol \\\"$symbol\\\" does not start with a '#', exiting ...\\n\"; exit 1;\n  }\n}\n\nprint \"--> SUCCESS [validating disambiguation symbol file \\\"$disambig_sym_file\\\"]\\n\";\nexit 0;\n\n"
  },
  {
    "path": "egs/utils/ln.pl",
    "content": "#!/usr/bin/env perl\nuse File::Spec;\n\nif ( @ARGV < 2 ) {\n  print STDERR \"usage: ln.pl input1 input2 dest-dir\\n\" .\n    \"This script does a soft link of input1, input2, etc. \" .\n    \"to dest-dir, using relative links where possible\\n\" .\n    \"Note: input-n and dest-dir may both be absolute pathnames,\\n\" .\n    \"or relative pathnames, relative to the current directory.\\n\";\n  exit(1);\n}  \n\n$dir = pop @ARGV;\nif ( ! -d $dir ) {\n  print STDERR \"ln.pl: last argument must be a directory ($dir is not a directory)\\n\";\n  exit(1);\n}\n\n$ans = 1; # true.\n\n$absdir = File::Spec->rel2abs($dir); # Get $dir as abs path.\ndefined $absdir || die \"No such directory $dir\";\nforeach $file (@ARGV) {\n  $absfile =  File::Spec->rel2abs($file); # Get $file as abs path.\n  defined $absfile || die \"No such file or directory: $file\";\n  @absdir_split = split(\"/\", $absdir);\n  @absfile_split = split(\"/\", $absfile);\n\n  $newfile = $absdir . \"/\" . $absfile_split[$#absfile_split]; # we'll use this\n  # as the destination in the link command.\n  $num_removed = 0;\n  while (@absdir_split > 0 && $absdir_split[0] eq $absfile_split[0]) {\n    shift @absdir_split;\n    shift @absfile_split;\n    $num_removed++;\n  }\n  if (-l $newfile) { # newfile is already a link -> safe to delete it.\n    unlink($newfile); # \"unlink\" just means delete.\n  }\n  if ($num_removed == 0) { # will use absolute pathnames.\n    $oldfile = \"/\" . join(\"/\", @absfile_split);\n    $ret = symlink($oldfile, $newfile);\n  } else {\n    $num_dots = @absdir_split;\n    $oldfile = join(\"/\", @absfile_split);\n    for ($n = 0; $n < $num_dots; $n++) {\n      $oldfile = \"../\" . $oldfile;\n    }\n    $ret = symlink($oldfile, $newfile);\n  }\n  $ans = $ans && $ret;\n  if (! $ret) {\n    print STDERR \"Error linking $oldfile to $newfile\\n\";\n  }\n}\n\nexit ($ans == 1 ? 0 : 1);\n\n"
  },
  {
    "path": "egs/utils/make_absolute.sh",
    "content": "#!/usr/bin/env bash\n\n# This script replaces the command readlink -f (which is not portable).\n# It turns a pathname into an absolute pathname, including following soft links.\ntarget_file=$1\n\ncd \"$(dirname \"$target_file\")\"\ntarget_file=$(basename \"$target_file\")\n\n# Iterate down a (possible) chain of symlinks\nwhile [ -L \"$target_file\" ]; do\n    target_file=$(readlink \"$target_file\")\n    cd \"$(dirname \"$target_file\")\"\n    target_file=$(basename \"$target_file\")\ndone\n\n# Compute the canonicalized name by finding the physical path\n# for the directory we're in and appending the target file.\nphys_dir=$(pwd -P)\nresult=$phys_dir/$target_file\necho \"$result\"\n"
  },
  {
    "path": "egs/utils/make_lexicon_fst.pl",
    "content": "#!/usr/bin/env perl\n\n# THIS SCRIPT IS DEPRECATED AND WILL BE REMOVED.  See\n# utils/lang/make_lexicon_fst.py which is the python-based replacement.\n\n\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2010-2011  Microsoft Corporation\n#                2013  Johns Hopkins University (author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# makes lexicon FST, in text form, from lexicon (pronunciation probabilities optional).\n\n$pron_probs = 0;\n\nif ((@ARGV > 0) && ($ARGV[0] eq \"--pron-probs\")) {\n  $pron_probs = 1;\n  shift @ARGV;\n}\n\nif (@ARGV != 1 && @ARGV != 3 && @ARGV != 4) {\n  print STDERR \"Usage: make_lexicon_fst.pl [--pron-probs] lexicon.txt [silprob silphone [sil_disambig_sym]] >lexiconfst.txt\\n\\n\";\n  print STDERR \"Creates a lexicon FST that transduces phones to words, and may allow optional silence.\\n\\n\";\n  print STDERR \"Note: ordinarily, each line of lexicon.txt is:\\n\";\n  print STDERR \"  word phone1 phone2 ... phoneN;\\n\";\n  print STDERR \"if the --pron-probs option is used, each line is:\\n\";\n  print STDERR \"  word pronunciation-probability phone1 phone2 ... 
phoneN.\\n\\n\";\n  print STDERR \"The probability 'prob' will typically be between zero and one, and note that\\n\";\n  print STDERR \"it's generally helpful to normalize so the largest one for each word is 1.0, but\\n\";\n  print STDERR \"this is your responsibility.\\n\\n\";\n  print STDERR \"The silence disambiguation symbol, e.g. something like #5, is used only\\n\";\n  print STDERR \"when creating a lexicon with disambiguation symbols, e.g. L_disambig.fst,\\n\";\n  print STDERR \"and was introduced to fix a particular case of non-determinism of decoding graphs.\\n\\n\";\n  exit(1);\n}\n\n$lexfn = shift @ARGV;\nif (@ARGV == 0) {\n  $silprob = 0.0;\n} elsif (@ARGV == 2) {\n  ($silprob,$silphone) = @ARGV;\n} else {\n  ($silprob,$silphone,$sildisambig) = @ARGV;\n}\nif ($silprob != 0.0) {\n  $silprob < 1.0 || die \"Sil prob cannot be >= 1.0\";\n  $silcost = -log($silprob);\n  $nosilcost = -log(1.0 - $silprob);\n}\n\n\nopen(L, \"<$lexfn\") || die \"Error opening lexicon $lexfn\";\n\n\nif ( $silprob == 0.0 ) { # No optional silences: just have one (loop+final) state which is numbered zero.\n  $loopstate = 0;\n  $nextstate = 1;               # next unallocated state.\n  while (<L>) {\n    @A = split(\" \", $_);\n    @A == 0 && die \"Empty lexicon line.\";\n    foreach $a (@A) {\n      if ($a eq \"<eps>\") {\n        die \"Bad lexicon line $_ (<eps> is forbidden)\";\n      }\n    }\n    $w = shift @A;\n    if (! $pron_probs) {\n      $pron_cost = 0.0;\n    } else {\n      $pron_prob = shift @A;\n      if (! 
defined $pron_prob || !($pron_prob > 0.0 && $pron_prob <= 1.0)) {\n        die \"Bad pronunciation probability in line $_\";\n      }\n      $pron_cost = -log($pron_prob);\n    }\n    if ($pron_cost != 0.0) { $pron_cost_string = \"\\t$pron_cost\"; } else { $pron_cost_string = \"\"; }\n\n    $s = $loopstate;\n    $word_or_eps = $w;\n    while (@A > 0) {\n      $p = shift @A;\n      if (@A > 0) {\n        $ns = $nextstate++;\n      } else {\n        $ns = $loopstate;\n      }\n      print \"$s\\t$ns\\t$p\\t$word_or_eps$pron_cost_string\\n\";\n      $word_or_eps = \"<eps>\";\n      $pron_cost_string = \"\"; # so we only print it on the first arc of the word.\n      $s = $ns;\n    }\n  }\n  print \"$loopstate\\t0\\n\";      # final-cost.\n} else {                        # have silence probs.\n  $startstate = 0;\n  $loopstate = 1;\n  $silstate = 2;   # state from where we go to loopstate after emitting silence.\n  print \"$startstate\\t$loopstate\\t<eps>\\t<eps>\\t$nosilcost\\n\"; # no silence.\n  if (!defined $sildisambig) {\n    print \"$startstate\\t$loopstate\\t$silphone\\t<eps>\\t$silcost\\n\"; # silence.\n    print \"$silstate\\t$loopstate\\t$silphone\\t<eps>\\n\";             # no cost.\n    $nextstate = 3;\n  } else {\n    $disambigstate = 3;\n    $nextstate = 4;\n    print \"$startstate\\t$disambigstate\\t$silphone\\t<eps>\\t$silcost\\n\"; # silence.\n    print \"$silstate\\t$disambigstate\\t$silphone\\t<eps>\\n\"; # no cost.\n    print \"$disambigstate\\t$loopstate\\t$sildisambig\\t<eps>\\n\"; # silence disambiguation symbol.\n  }\n  while (<L>) {\n    @A = split(\" \", $_);\n    $w = shift @A;\n    if (! $pron_probs) {\n      $pron_cost = 0.0;\n    } else {\n      $pron_prob = shift @A;\n      if (! 
defined $pron_prob || !($pron_prob > 0.0 && $pron_prob <= 1.0)) {\n        die \"Bad pronunciation probability in line $_\";\n      }\n      $pron_cost = -log($pron_prob);\n    }\n    if ($pron_cost != 0.0) { $pron_cost_string = \"\\t$pron_cost\"; } else { $pron_cost_string = \"\"; }\n    $s = $loopstate;\n    $word_or_eps = $w;\n    while (@A > 0) {\n      $p = shift @A;\n      if (@A > 0) {\n        $ns = $nextstate++;\n        print \"$s\\t$ns\\t$p\\t$word_or_eps$pron_cost_string\\n\";\n        $word_or_eps = \"<eps>\";\n        $pron_cost_string = \"\"; $pron_cost = 0.0; # so we only print it the 1st time.\n        $s = $ns;\n      } elsif (!defined($silphone) || $p ne $silphone) {\n        # This is non-deterministic but relatively compact,\n        # and avoids epsilons.\n        $local_nosilcost = $nosilcost + $pron_cost;\n        $local_silcost = $silcost + $pron_cost;\n        print \"$s\\t$loopstate\\t$p\\t$word_or_eps\\t$local_nosilcost\\n\";\n        print \"$s\\t$silstate\\t$p\\t$word_or_eps\\t$local_silcost\\n\";\n      } else {\n        # no point putting opt-sil after silence word.\n        print \"$s\\t$loopstate\\t$p\\t$word_or_eps$pron_cost_string\\n\";\n      }\n    }\n  }\n  print \"$loopstate\\t0\\n\";      # final-cost.\n}\n"
  },
  {
    "path": "egs/utils/make_lexicon_fst_silprob.pl",
    "content": "#!/usr/bin/env perl\n\n# THIS SCRIPT IS DEPRECATED AND WILL BE REMOVED.  See\n# utils/lang/make_lexicon_fst_silprob.py which is the python-based replacement.\n\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2010-2011  Microsoft Corporation\n#                2013  Johns Hopkins University (author: Daniel Povey)\n#                2015  Hainan Xu\n#                2015  Guoguo Chen\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# makes lexicon FST, in text form, from lexicon which contains (optional)\n# probabilities of pronuniations, and (mandatory) probabilities of silence\n# before and after the pronunciation. This script is almost the same with\n# the make_lexicon_fst.pl script except for the word-dependent silprobs part\n\nif (@ARGV != 4) {\n  print STDERR \"Usage: $0 lexiconp_silprob_disambig.txt \\\\\\n\";\n  print STDERR \"       silprob.txt silphone_string sil_disambig_sym > lexiconfst.txt \\n\";\n  print STDERR \"\\n\";\n  print STDERR \"This script is almost the same as the utils/make_lexicon_fst.pl\\n\";\n  print STDERR \"except here we include word-dependent silence probabilities\\n\";\n  print STDERR \"when making the lexicon FSTs. 
\";\n  print STDERR \"For details, see paper \\nhttp://danielpovey.com/files/2015_interspeech_silprob.pdf\\n\\n\";\n  print STDERR \"The lexiconp_silprob_disambig.txt file should have each line like \\n\\n\";\n  print STDERR \"word p(pronunciation|word) p(sil-after|word) correction-term-for-sil \";\n  print STDERR \"correction-term-for-no-sil phone-1 phone-2 ... phone-N\\n\\n\";\n  print STDERR \"The pronunciation would have to include disambiguation symbols;\\n\";\n  print STDERR \"the 2 correction terms above are computed to reflect how much a \\n\";\n  print STDERR \"word affects the probability of a [non-]silence before it. \\n\";\n  print STDERR \"Please see the paper (link given above) for detailed descriptions\\n\";\n  print STDERR \"for how the 2 terms are computed.\\n\\n\";\n  print STDERR \"The silprob.txt file contains 4 lines, \\n\\n\";\n  print STDERR \"<s> p(sil-after|<s>)\\n\";\n  print STDERR \"</s>_s correction-term-for-sil-for-</s>\\n\";\n  print STDERR \"</s>_n correction-term-for-no-sil-for-</s>\\n\";\n  print STDERR \"overall p(overall-sil)\\n\\n\";\n  print STDERR \"Other files are the same as utils/make_lexicon_fst.pl\\n\";\n\n  exit(1);\n}\n\n$lexfn = shift @ARGV;\n$silprobfile = shift @ARGV;\n\n($silphone,$sildisambig) = @ARGV;\n\nopen(L, \"<$lexfn\") || die \"Error opening lexicon $lexfn\";\nopen(SP, \"<$silprobfile\") || die \"Error opening word-sil-probs $SP\";\n\n$silbeginprob = -1;\n$silendcorrection = -1;\n$nonsilendcorrection = -1;\n$siloverallprob = -1;\n\nwhile (<SP>) {\n  @A = split(\" \", $_);\n  $w = shift @A;\n  if ($w eq \"<s>\") {\n    $silbeginprob = shift @A;\n  }\n  if ($w eq \"</s>_s\") {\n    $silendcorrection = shift @A;\n  }\n  if ($w eq \"</s>_n\") {\n    $nonsilendcorrection = shift @A;\n  }\n  if ($w eq \"overall\") {\n    $siloverallprob = shift @A;\n  }\n}\n\n$startstate = 0;\n$nonsilstart = 1;\n$silstart = 2;\n$nextstate = 3;\n\n$cost = -log($silbeginprob);\nprint 
\"$startstate\\t$silstart\\t$silphone\\t<eps>\\t$cost\\n\"; # will change these\n$cost = -log(1 - $silbeginprob);\nprint \"$startstate\\t$nonsilstart\\t$sildisambig\\t<eps>\\t$cost\\n\";\n\nwhile (<L>) {\n  @A = split(\" \", $_);\n  $w = shift @A;\n  $pron_prob = shift @A;\n  if (! defined $pron_prob || !($pron_prob > 0.0 && $pron_prob <= 1.0)) {\n    die \"Bad pronunciation probability in line $_\";\n  }\n\n  $wordsilprob = shift @A;\n  $silwordcorrection = shift @A;\n  $nonsilwordcorrection = shift @A;\n\n  $pron_cost = -log($pron_prob);\n  $wordsilcost = -log($wordsilprob);\n  $wordnonsilcost = -log(1.0 - $wordsilprob);\n  $silwordcost = -log($silwordcorrection);\n  $nonsilwordcost = -log($nonsilwordcorrection);\n\n  $first = 1;  # used as a bool, to handle the first phone (adding sils)\n  while (@A > 0) {\n    $p = shift @A;\n\n    if ($first == 1) {\n      $newstate = $nextstate++;\n\n      # for nonsil before w\n      $cost = $nonsilwordcost + $pron_cost;\n      print \"$nonsilstart\\t$newstate\\t$p\\t$w\\t$cost\\n\";\n\n      # for sil before w\n      $cost = $silwordcost + $pron_cost;\n      print \"$silstart\\t$newstate\\t$p\\t$w\\t$cost\\n\";\n      $first = 0;\n    }\n    else {\n      $oldstate = $nextstate - 1;\n      print \"$oldstate\\t$nextstate\\t$p\\t<eps>\\n\";\n      $nextstate++;\n    }\n    if (@A == 0) {\n      $oldstate = $nextstate - 1;\n      # for no sil after w\n      $cost = $wordnonsilcost;\n      print \"$oldstate\\t$nonsilstart\\t$sildisambig\\t<eps>\\t$cost\\n\";\n\n      # for sil after w\n      $cost = $wordsilcost;\n      print \"$oldstate\\t$silstart\\t$silphone\\t<eps>\\t$cost\\n\";\n    }\n  }\n}\n$cost = -log($silendcorrection);\nprint \"$silstart\\t$cost\\n\";\n$cost = -log($nonsilendcorrection);\nprint \"$nonsilstart\\t$cost\\n\";\n"
  },
  {
    "path": "egs/utils/make_unigram_grammar.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script is used in discriminative training.\n# This script makes a simple unigram-loop version of G.fst\n# using a unigram grammar estimated from some training transcripts.\n# This is for MMI training.\n# We don't have any silences in G.fst; these are supplied by the\n# optional silences in the lexicon.\n\n# Note: the symbols in the transcripts become the input and output\n# symbols of G.txt; these can be numeric or not.\n\nif(@ARGV != 0) {\n    die \"Usage: make_unigram_grammar.pl < text-transcripts > G.txt\"\n}\n\n$totcount = 0;\n$nl = 0;\nwhile (<>) {\n  @A = split(\" \", $_);\n  foreach $a (@A) {\n    $count{$a}++;\n    $totcount++;\n  }\n  $nl++;\n  $totcount++; # Treat end-of-sentence as a symbol for purposes of\n  # $totcount, so the grammar is properly stochastic.  This doesn't\n  # become </s>, it just becomes the final-prob.\n}\n\nforeach $a (keys %count) {\n  $prob = $count{$a} / $totcount;\n  $cost = -log($prob);          # Negated natural-log probs.\n  print \"0\\t0\\t$a\\t$a\\t$cost\\n\";\n}\n# Zero final-cost.\n$final_prob = $nl / $totcount;\n$final_cost = -log($final_prob);\nprint \"0\\t$final_cost\\n\";\n\n"
  },
  {
    "path": "egs/utils/map_arpa_lm.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2014  Guoguo Chen\n#           2014  Johns Hopkins University (author: Daniel Povey)\n# Apache 2.0.\n#\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\nmy $Usage = <<EOU;\nThis script reads the Arpa format language model, and maps the words into\nintegers or vice versa. It ignores the words that are not in the symbol table,\nand updates the head information.\n\nIt will be used joinly with lmbin/arpa-to-const-arpa to build ConstArpaLm format\nlanguage model. We first map the words in an Arpa format language model to\nintegers, and then use lmbin/arpa-to-const-arpa to build a ConstArpaLm format\nlanguage model.\n\nUsage: utils/map_arpa_lm.pl [options] <vocab-file> < input-arpa >output-arpa\n e.g.: utils/map_arpa_lm.pl words.txt <arpa_lm.txt >arpa_lm.int\n\nAllowed options:\n  --sym2int   : If true, maps words to integers, other wise maps integers to\n                words. (boolean, default = true)\n\nEOU\n\nmy $sym2int = \"true\";\nGetOptions('sym2int=s' => \\$sym2int);\n\n($sym2int eq \"true\" || $sym2int eq \"false\") ||\n  die \"$0: Bad value for option --sym2int\\n\";\n\nif (@ARGV != 1) {\n  die $Usage;\n}\n\n# Gets parameters.\nmy $symtab = shift @ARGV;\nmy $arpa_in = shift @ARGV;\nmy $arpa_out = shift @ARGV;\n\n# Opens files.\nopen(M, \"<$symtab\") || die \"$0: Fail to open $symtab\\n\";\n\n# Reads in the mapper.\nmy %mapper;\nwhile (<M>) {\n  chomp;\n  my @col = split(/[\\s]+/, $_);\n  @col == 2 || die \"$0: Bad line in mapper file \\\"$_\\\"\\n\";\n  if ($sym2int eq \"true\") {\n    if (defined($mapper{$col[0]})) {\n      die \"$0: Duplicate entry \\\"$col[0]\\\"\\n\";\n    }\n    $mapper{$col[0]} = $col[1];\n  } else {\n    if (defined($mapper{$col[1]})) {\n      die \"$0: Duplicate entry \\\"$col[1]\\\"\\n\";\n    }\n    $mapper{$col[1]} = $col[0];\n  }\n}\n\nmy $num_oov_lines = 0;\nmy $max_oov_warn = 20;\n\n# Parses Arpa n-gram language model.\nmy $arpa = \"\";\nmy $current_order = -1;\nmy 
%head_ngram_count;\nmy %actual_ngram_count;\nwhile (<STDIN>) {\n  chomp;\n  my @col = split(\" \", $_);\n\n  if ($current_order == -1 and ! m/^\\\\data\\\\$/) {\n    next;\n  }\n\n  if (m/^\\\\data\\\\$/) {\n    print STDERR \"$0: Processing \\\"\\\\data\\\\\\\"\\n\";\n    print \"$_\\n\";\n    $current_order = 0;\n  } elsif (m/^\\\\[0-9]*-grams:$/) {\n    $current_order = $_;\n    $current_order =~ s/-grams:$//g;\n    $current_order =~ s/^\\\\//g;\n    print \"$_\\n\";\n    print STDERR \"$0: Processing \\\"\\\\$current_order-grams:\\\\\\\"\\n\";\n  } elsif (m/^\\\\end\\\\/) {\n    print \"$_\\n\";\n  } elsif ($_ eq \"\") {\n    if ($current_order >= 1) {\n      print \"\\n\";\n    }\n  } else {\n    if ($current_order == 0) {\n      # echo head section.\n      print \"$_\\n\";\n    } else {\n      # Parses n-gram section.\n      if (@col > 2 + $current_order || @col < 1 + $current_order) {\n        die \"$0: Bad line in arpa lm \\\"$_\\\"\\n\";\n      }\n      my $prob = shift @col;\n      my $is_oov = 0;\n      for (my $i = 0; $i < $current_order; $i++) {\n        my $temp = $mapper{$col[$i]};\n        if (!defined($temp)) {\n          $is_oov = 1;\n          $num_oov_lines++;\n          last;\n        } else {\n          $col[$i] = $temp;\n        }\n      }\n      if (!$is_oov) {\n        my $rest_of_line = join(\" \", @col);\n        print \"$prob\\t$rest_of_line\\n\";\n      } else {\n        if ($num_oov_lines < $max_oov_warn) {\n          print STDERR \"$0: Warning: OOV line $_\\n\";\n        }\n      }\n    }\n  }\n}\n\nif ($num_oov_lines > 0) {\n  print STDERR \"$0: $num_oov_lines lines of the Arpa file contained OOVs and \";\n  print STDERR \"were not printed.\\n\";\n}\n\nclose(M);\n"
  },
  {
    "path": "egs/utils/mkgraph.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2010-2012 Microsoft Corporation\n#           2012-2013 Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# This script creates a fully expanded decoding graph (HCLG) that represents\n# all the language-model, pronunciation dictionary (lexicon), context-dependency,\n# and HMM structure in our model.  The output is a Finite State Transducer\n# that has word-ids on the output, and transition-ids on the input (these are indexes\n# that resolve to pdf-ids).\n# See\n#  http://kaldi-asr.org/doc/graph_recipe_test.html\n# (this is compiled from this repository using Doxygen,\n# the source for this part is in src/doc/graph_recipe_test.dox)\n\nset -o pipefail\n\ntscale=1.0\nloopscale=0.1\n\nremove_oov=false\n\nfor x in `seq 4`; do\n  [ \"$1\" == \"--mono\" -o \"$1\" == \"--left-biphone\" -o \"$1\" == \"--quinphone\" ] && shift && \\\n    echo \"WARNING: the --mono, --left-biphone and --quinphone options are now deprecated and ignored.\"\n  [ \"$1\" == \"--remove-oov\" ] && remove_oov=true && shift;\n  [ \"$1\" == \"--transition-scale\" ] && tscale=$2 && shift 2;\n  [ \"$1\" == \"--self-loop-scale\" ] && loopscale=$2 && shift 2;\ndone\n\nif [ $# != 3 ]; then\n   echo \"Usage: utils/mkgraph.sh [options] <lang-dir> <model-dir> <graphdir>\"\n   echo \"e.g.: utils/mkgraph.sh data/lang_test exp/tri1/ exp/tri1/graph\"\n   echo \" Options:\"\n   echo \" --remove-oov       #  If true, any paths containing the OOV symbol (obtained from oov.int\"\n   echo \"                    #  in the lang directory) are removed from the G.fst during compilation.\"\n   echo \" --transition-scale #  Scaling factor on transition probabilities.\"\n   echo \" --self-loop-scale  #  Please see: http://kaldi-asr.org/doc/hmm.html#hmm_scale.\"\n   echo \"Note: the --mono, --left-biphone and --quinphone options are now deprecated\"\n   echo \"and will be ignored.\"\n   exit 1;\nfi\n\nif [ -f path.sh ]; then . 
./path.sh; fi\n\nlang=$1\ntree=$2/tree\nmodel=$2/final.mdl\ndir=$3\n\nmkdir -p $dir\n\n# If $lang/tmp/LG.fst does not exist or is older than its sources, make it...\n# (note: the [[ ]] brackets make the || type operators work (inside [ ], we\n# would have to use -o instead),  -f means file exists, and -ot means older than).\n\nrequired=\"$lang/L_disambig.fst $lang/G.fst $lang/phones.txt $lang/words.txt $lang/phones/silence.csl $lang/phones/disambig.int $model $tree\"\nfor f in $required; do\n  [ ! -f $f ] && echo \"mkgraph.sh: expected $f to exist\" && exit 1;\ndone\n\nif [ -f $dir/HCLG.fst ]; then\n  # detect when the result already exists, and avoid overwriting it.\n  must_rebuild=false\n  for f in $required; do\n    [ $f -nt $dir/HCLG.fst ] && must_rebuild=true\n  done\n  if ! $must_rebuild; then\n    echo \"$0: $dir/HCLG.fst is up to date.\"\n    exit 0\n  fi\nfi\n\n\nN=$(tree-info $tree | grep \"context-width\" | cut -d' ' -f2) || { echo \"Error when getting context-width\"; exit 1; }\nP=$(tree-info $tree | grep \"central-position\" | cut -d' ' -f2) || { echo \"Error when getting central-position\"; exit 1; }\n\n[[ -f $2/frame_subsampling_factor && \"$loopscale\" == \"0.1\" ]] && \\\n  echo \"$0: WARNING: chain models need '--self-loop-scale 1.0'\";\n\nif [ -f $lang/phones/nonterm_phones_offset.int ]; then\n  if [[ $N != 2  || $P != 1 ]]; then\n    echo \"$0: when doing grammar decoding, you can only build graphs for left-biphone trees.\"\n    exit 1\n  fi\n  nonterm_phones_offset=$(cat $lang/phones/nonterm_phones_offset.int)\n  nonterm_opt=\"--nonterm-phones-offset=$nonterm_phones_offset\"\n  prepare_grammar_command=\"make-grammar-fst --nonterm-phones-offset=$nonterm_phones_offset - -\"\nelse\n  prepare_grammar_command=\"cat\"\n  nonterm_opt=\nfi\n\nmkdir -p $lang/tmp\ntrap \"rm -f $lang/tmp/LG.fst.$$\" EXIT HUP INT PIPE TERM\n# Note: [[ ]] is like [ ] but enables certain extra constructs, e.g. || in\n# place of -o\nif [[ ! 
-s $lang/tmp/LG.fst || $lang/tmp/LG.fst -ot $lang/G.fst || \\\n      $lang/tmp/LG.fst -ot $lang/L_disambig.fst ]]; then\n  fsttablecompose $lang/L_disambig.fst $lang/G.fst | fstdeterminizestar --use-log=true | \\\n    fstminimizeencoded | fstpushspecial > $lang/tmp/LG.fst.$$ || exit 1;\n  mv $lang/tmp/LG.fst.$$ $lang/tmp/LG.fst\n  fstisstochastic $lang/tmp/LG.fst || echo \"[info]: LG not stochastic.\"\nfi\n\nclg=$lang/tmp/CLG_${N}_${P}.fst\nclg_tmp=$clg.$$\nilabels=$lang/tmp/ilabels_${N}_${P}\nilabels_tmp=$ilabels.$$\ntrap \"rm -f $clg_tmp $ilabels_tmp\" EXIT HUP INT PIPE TERM\nif [[ ! -s $clg || $clg -ot $lang/tmp/LG.fst \\\n    || ! -s $ilabels || $ilabels -ot $lang/tmp/LG.fst ]]; then\n  fstcomposecontext $nonterm_opt --context-size=$N --central-position=$P \\\n   --read-disambig-syms=$lang/phones/disambig.int \\\n   --write-disambig-syms=$lang/tmp/disambig_ilabels_${N}_${P}.int \\\n    $ilabels_tmp $lang/tmp/LG.fst |\\\n    fstarcsort --sort_type=ilabel > $clg_tmp\n  mv $clg_tmp $clg\n  mv $ilabels_tmp $ilabels\n  fstisstochastic $clg || echo \"[info]: CLG not stochastic.\"\nfi\n\ntrap \"rm -f $dir/Ha.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ ! -s $dir/Ha.fst || $dir/Ha.fst -ot $model  \\\n    || $dir/Ha.fst -ot $lang/tmp/ilabels_${N}_${P} ]]; then\n  make-h-transducer $nonterm_opt --disambig-syms-out=$dir/disambig_tid.int \\\n    --transition-scale=$tscale $lang/tmp/ilabels_${N}_${P} $tree $model \\\n     > $dir/Ha.fst.$$  || exit 1;\n  mv $dir/Ha.fst.$$ $dir/Ha.fst\nfi\n\ntrap \"rm -f $dir/HCLGa.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ ! -s $dir/HCLGa.fst || $dir/HCLGa.fst -ot $dir/Ha.fst || \\\n      $dir/HCLGa.fst -ot $clg ]]; then\n  if $remove_oov; then\n    [ ! 
-f $lang/oov.int ] && \\\n      echo \"$0: --remove-oov option: no file $lang/oov.int\" && exit 1;\n    clg=\"fstrmsymbols --remove-arcs=true --apply-to-output=true $lang/oov.int $clg|\"\n  fi\n  fsttablecompose $dir/Ha.fst \"$clg\" | fstdeterminizestar --use-log=true \\\n    | fstrmsymbols $dir/disambig_tid.int | fstrmepslocal | \\\n     fstminimizeencoded > $dir/HCLGa.fst.$$ || exit 1;\n  mv $dir/HCLGa.fst.$$ $dir/HCLGa.fst\n  fstisstochastic $dir/HCLGa.fst || echo \"HCLGa is not stochastic\"\nfi\n\ntrap \"rm -f $dir/HCLG.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ ! -s $dir/HCLG.fst || $dir/HCLG.fst -ot $dir/HCLGa.fst ]]; then\n  add-self-loops --self-loop-scale=$loopscale --reorder=true $model $dir/HCLGa.fst | \\\n    $prepare_grammar_command | \\\n    fstconvert --fst_type=const > $dir/HCLG.fst.$$ || exit 1;\n  mv $dir/HCLG.fst.$$ $dir/HCLG.fst\n  if [ $tscale == 1.0 -a $loopscale == 1.0 ]; then\n    # No point doing this test if transition-scale not 1, as it is bound to fail.\n    fstisstochastic $dir/HCLG.fst || echo \"[info]: final HCLG is not stochastic.\"\n  fi\nfi\n\n# note: the empty FST has 66 bytes.  this check is for whether the final FST\n# is the empty file or is the empty FST.\nif ! 
[ $(head -c 67 $dir/HCLG.fst | wc -c) -eq 67 ]; then\n  echo \"$0: it looks like the result in $dir/HCLG.fst is empty\"\n  exit 1\nfi\n\n# save space.\nrm $dir/HCLGa.fst $dir/Ha.fst 2>/dev/null || true\n\n# keep a copy of the lexicon and a list of silence phones with HCLG...\n# this means we can decode without reference to the $lang directory.\n\n\ncp $lang/words.txt $dir/ || exit 1;\nmkdir -p $dir/phones\ncp $lang/phones/word_boundary.* $dir/phones/ 2>/dev/null # might be needed for ctm scoring,\ncp $lang/phones/align_lexicon.* $dir/phones/ 2>/dev/null # might be needed for ctm scoring,\ncp $lang/phones/optional_silence.* $dir/phones/ 2>/dev/null # might be needed for analyzing alignments.\n    # but ignore the error if it's not there.\n\n\ncp $lang/phones/disambig.{txt,int} $dir/phones/ 2> /dev/null\ncp $lang/phones/silence.csl $dir/phones/ || exit 1;\ncp $lang/phones.txt $dir/ 2> /dev/null # ignore the error if it's not there.\n\nam-info --print-args=false $model | grep pdfs | awk '{print $NF}' > $dir/num_pdfs\n"
  },
  {
    "path": "egs/utils/mkgraph_lookahead.sh",
    "content": "#!/bin/bash\n# Copyright 2019 Alpha Cephei Inc.\n# Copyright 2018 Joan Puigcerver\n# Copyright 2010-2012 Microsoft Corporation\n#           2012-2013 Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n# This script creates setup for decoding with lookahead online composition. The \n# graph HCLr.fst represents pronunciation dictionary (lexicon), context-dependency,\n# and HMM structure in our model. The graph Gr.fst represents the language model.\n# If arpa model is provided it compiles ngram model into compact LOUDS-encoded\n# structure with opengrm. Both HCLr.fst and Gr.fst are optionally combined into\n# single graph HCLG for testing with default decoders.\n#\n# See\n#  http://kaldi-asr.org/doc/graph_recipe_test.html\n# (this is compiled from this repository using Doxygen,\n# the source for this part is in src/doc/graph_recipe_test.dox)\n#\n# Note that most of the fsts here are not stochastic, so many kaldi operations like\n# fstpushspecial or fstdeterminizestar in log domain do not really work for them. \n# Instead most operations are in tropical domain.\nset -o pipefail\n\ntscale=1.0\nloopscale=0.1\ncompose_graph=false\nremove_oov=false\n\nfor x in `seq 4`; do\n  [ \"$1\" == \"--remove-oov\" ] && remove_oov=true && shift;\n  [ \"$1\" == \"--compose-graph\" ] && compose_graph=true && shift;\n  [ \"$1\" == \"--transition-scale\" ] && tscale=$2 && shift 2;\n  [ \"$1\" == \"--self-loop-scale\" ] && loopscale=$2 && shift 2;\ndone\n\n# Note: [[ ]] is like [ ] but enables certain extra constructs, e.g. 
|| in\n# place of -o\nif [[ $# != 3 && $# != 4 ]]; then\n   echo \"Usage: $0 [options] <lang-dir> <model-dir> [<arpa_file>] <graphdir>\"\n   echo \"e.g.: $0 data/lang exp/tri1 db/trigram.lm.gz exp/tri1/lgraph\"\n   echo \" Options:\"\n   echo \" --remove-oov       #  If true, any paths containing the OOV symbol (obtained from oov.int\"\n   echo \"                    #  in the lang directory) are removed from the G.fst during compilation.\"\n   echo \" --transition-scale #  Scaling factor on transition probabilities.\"\n   echo \" --self-loop-scale  #  Please see: http://kaldi-asr.org/doc/hmm.html#hmm_scale.\"\n   echo \" --compose-graph    #  Compile composed graph for testing with other decoders (default: false)\"\n   exit 1;\nfi\n\nif [ -f path.sh ]; then . ./path.sh; fi\n\nlang=$1\ntree=$2/tree\nmodel=$2/final.mdl\n\nif [ $# == 3 ]; then\n  echo \"$0 : compiling grammar $1/G.fst\"\n  arpa=\n  dir=$3\nelse\n  echo \"$0 : compiling grammar $3\"\n  arpa=$3\n  dir=$4\n  loc=`which ngramread`\n  if [ -z $loc ]; then\n    echo You appear to not have OpenGRM tools installed.\n    echo cd to $KALDI_ROOT/tools and run extras/install_opengrm.sh.\n    exit 1\n  fi\nfi\n\nmkdir -p $dir\n\nrequired=\"$lang/L_disambig.fst $arpa $lang/phones.txt $lang/words.txt $lang/phones/silence.csl $lang/phones/disambig.int $arpa $model $tree\"\nfor f in $required; do\n  [ ! -f $f ] && echo \"$0 : expected $f to exist\" && exit 1;\ndone\n\nif [ -f $dir/HCLG.fst ]; then\n  # detect when the result already exists, and avoid overwriting it.\n  must_rebuild=false\n  for f in $required; do\n    [ $f -nt $dir/HCLG.fst ] && must_rebuild=true\n  done\n  if ! 
$must_rebuild; then\n    echo \"$0: $dir/HCLG.fst is up to date.\"\n    exit 0\n  fi\nfi\n\n\nN=$(tree-info $tree | grep \"context-width\" | cut -d' ' -f2) || { echo \"Error when getting context-width\"; exit 1; }\nP=$(tree-info $tree | grep \"central-position\" | cut -d' ' -f2) || { echo \"Error when getting central-position\"; exit 1; }\n\n[[ -f $2/frame_subsampling_factor && \"$loopscale\" == \"0.1\" ]] && \\\n  echo \"$0: WARNING: chain models need '--self-loop-scale 1.0'\";\n\ntrap \"rm -f $dir/L_disambig_det.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ ! -s $dir/L_disambig_det.fst || $dir/L_disambig_det.fst -ot $lang/L_disambig.fst ]]; then\n  fstdeterminizestar $lang/L_disambig.fst | fstarcsort --sort_type=ilabel > $dir/L_disambig_det.fst.$$ || exit 1;\n  mv $dir/L_disambig_det.fst.$$ $dir/L_disambig_det.fst\nfi\n\ncl=$dir/CL_${N}_${P}.fst\ncl_tmp=$cl.$$\nilabels=$dir/ilabels_${N}_${P}\nilabels_tmp=$ilabels.$$\ntrap \"rm -f $cl_tmp $ilabels_tmp\" EXIT HUP INT PIPE TERM\nif [[ ! -s $cl || $cl -ot $dir/L_disambig_det.fst \\\n    || ! -s $ilabels || $ilabels -ot $dir/L_disambig_det.fst ]]; then\n  fstcomposecontext $nonterm_opt --context-size=$N --central-position=$P \\\n   --read-disambig-syms=$lang/phones/disambig.int \\\n   --write-disambig-syms=$dir/disambig_ilabels_${N}_${P}.int \\\n    $ilabels_tmp $dir/L_disambig_det.fst | \\\n    fstarcsort --sort_type=ilabel > $cl_tmp\n  mv $cl_tmp $cl\n  mv $ilabels_tmp $ilabels\nfi\n\ntrap \"rm -f $dir/Ha.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ ! -s $dir/Ha.fst || $dir/Ha.fst -ot $model  \\\n    || $dir/Ha.fst -ot $dir/ilabels_${N}_${P} ]]; then\n  make-h-transducer $nonterm_opt --disambig-syms-out=$dir/disambig_tid.int \\\n    --transition-scale=$tscale $dir/ilabels_${N}_${P} $tree $model | \\\n  fstarcsort --sort_type=olabel \\\n     > $dir/Ha.fst.$$  || exit 1;\n  mv $dir/Ha.fst.$$ $dir/Ha.fst\nfi\n\ntrap \"rm -f $dir/HCLr.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ ! 
-s $dir/HCLr.fst || $dir/HCLr.fst -ot $dir/Ha.fst || \\\n      $dir/HCLr.fst -ot $cl ]]; then\n  fstcompose $dir/Ha.fst \"$cl\" | fstdeterminizestar | \\\n     add-self-loops --disambig-syms=$dir/disambig_tid.int --self-loop-scale=$loopscale --reorder=true $model | \\\n     fstarcsort --sort_type=olabel | \\\n     fstconvert --fst_type=olabel_lookahead --save_relabel_opairs=${dir}/relabel \\\n      > $dir/HCLr.fst.$$ || exit 1;\n  mv $dir/HCLr.fst.$$ $dir/HCLr.fst\nfi\n\ntrap \"rm -f $dir/Gr.fst.$$\" EXIT HUP INT PIPE TERM\nif [[ -z $arpa ]]; then\n  if [[ ! -s $dir/Gr.fst || $dir/Gr.fst -ot $lang/G.fst ]]; then\n    gr=${lang}/G.fst\n    if $remove_oov; then\n      [ ! -f $lang/oov.int ] && \\\n        echo \"$0: --remove-oov option: no file $lang/oov.int\" && exit 1;\n      fstrmsymbols --remove-arcs=true --apply-to-output=true $lang/oov.int $gr | \\\n        fstrelabel --relabel_ipairs=${dir}/relabel | \\\n        fstarcsort --sort_type=ilabel | \\\n        fstconvert --fst_type=const > ${dir}/Gr.fst.$$\n    else\n      fstrelabel --relabel_ipairs=${dir}/relabel \"$gr\" | \\\n        fstarcsort --sort_type=ilabel | \\\n        fstconvert --fst_type=const > ${dir}/Gr.fst.$$\n    fi\n    mv $dir/Gr.fst.$$ $dir/Gr.fst\n    cp $lang/words.txt $dir/ || exit 1;\n  fi\nelse\n  if [[ ! -s $dir/Gr.fst || $dir/Gr.fst -ot $arpa ]]; then\n    # Opengrm builds acceptors, so we need to reorder words in symboltable\n    utils/apply_map.pl --permissive -f 2 ${dir}/relabel < ${lang}/words.txt > ${dir}/words.txt\n    gunzip -c $arpa | ngramread --OOV_symbol=`cat ${lang}/oov.txt` --symbols=${dir}/words.txt --ARPA | \\\n    fstarcsort --sort_type=ilabel | \\\n      fstconvert --fst_type=ngram > ${dir}/Gr.fst.$$\n    mv $dir/Gr.fst.$$ $dir/Gr.fst\n  fi\nfi\n\nif $compose_graph; then\n  trap \"rm -f $dir/HCLG.fst.$$\" EXIT HUP INT PIPE TERM\n  if [[ ! 
-s $dir/HCLG.fst || $dir/HCLG.fst -ot $dir/HCLr.fst \\\n        || $dir/HCLG.fst -ot $dir/Gr.fst ]]; then\n    fstcompose ${dir}/HCLr.fst ${dir}/Gr.fst | \\\n    fstrmsymbols $dir/disambig_tid.int  | \\\n    fstconvert --fst_type=const > $dir/HCLG.fst.$$ || exit 1;\n    mv $dir/HCLG.fst.$$ $dir/HCLG.fst\n    if [ $tscale == 1.0 -a $loopscale == 1.0 ]; then\n      # No point doing this test if transition-scale not 1, as it is bound to fail.\n      fstisstochastic $dir/HCLG.fst || echo \"[info]: final HCLG is not stochastic.\"\n    fi\n  fi\n\n  # note: the empty FST has 66 bytes.  this check is for whether the final FST\n  # is the empty file or is the empty FST.\n  if ! [ $(head -c 67 $dir/HCLG.fst | wc -c) -eq 67 ]; then\n    echo \"$0: it looks like the result in $dir/HCLG.fst is empty\"\n    exit 1\n  fi\nfi\n\n# keep a copy of the lexicon and a list of silence phones with HCLG...\n# this means we can decode without reference to the $lang directory.\n\nmkdir -p $dir/phones\ncp $lang/phones/word_boundary.* $dir/phones/ 2>/dev/null # might be needed for ctm scoring,\ncp $lang/phones/align_lexicon.* $dir/phones/ 2>/dev/null # might be needed for ctm scoring,\ncp $lang/phones/optional_silence.* $dir/phones/ 2>/dev/null # might be needed for analyzing alignments.\n    # but ignore the error if it's not there.\n\n\ncp $lang/phones/disambig.{txt,int} $dir/phones/ 2> /dev/null\ncp $lang/phones/silence.csl $dir/phones/ || exit 1;\ncp $lang/phones.txt $dir/ 2> /dev/null # ignore the error if it's not there.\n\nam-info --print-args=false $model | grep pdfs | awk '{print $NF}' > $dir/num_pdfs\n"
  },
  {
    "path": "egs/utils/nnet/gen_dct_mat.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# ./gen_dct_mat.py\n# script generates matrix with DCT transform, which is sparse\n# and takes into account that data-layout is along frequency axis,\n# while DCT is done along temporal axis.\n\nfrom __future__ import division\nfrom __future__ import print_function\nfrom math import *\nimport sys\n\n\nfrom optparse import OptionParser\n\ndef print_on_same_line(text):\n    print(text, end=' ')\n\nparser = OptionParser()\nparser.add_option('--fea-dim', dest='dim', help='feature dimension')\nparser.add_option('--splice', dest='splice', help='applied splice value')\nparser.add_option('--dct-basis', dest='dct_basis', help='number of DCT basis')\n(options, args) = parser.parse_args()\n\nif(options.dim == None):\n    parser.print_help()\n    sys.exit(1)\n\ndim=int(options.dim)\nsplice=int(options.splice)\ndct_basis=int(options.dct_basis)\n\ntimeContext=2*splice+1\n\n\n#generate the DCT matrix\nM_PI = 3.1415926535897932384626433832795\nM_SQRT2 = 1.4142135623730950488016887\n\n\n#generate sparse DCT matrix\nprint('[')\nfor k in range(dct_basis):\n    for m in range(dim):\n        for n in range(timeContext):\n          if(n==0):\n              print_on_same_line(m*'0 ')\n          else:\n              print_on_same_line((dim-1)*'0 ')\n          
print_on_same_line(str(sqrt(2.0/timeContext)*cos(M_PI/timeContext*k*(n+0.5))))\n          if(n==timeContext-1):\n              print_on_same_line((dim-m-1)*'0 ')\n        print()\n    print()\n\nprint(']')\n\n"
  },
  {
    "path": "egs/utils/nnet/gen_hamm_mat.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# ./gen_hamm_mat.py\n# script generates diagonal matrix with hamming window values\n\nfrom __future__ import division\nfrom __future__ import print_function\nfrom math import *\nimport sys\n\n\nfrom optparse import OptionParser\n\ndef print_on_same_line(text):\n    print(text, end=' ')\n\nparser = OptionParser()\nparser.add_option('--fea-dim', dest='dim', help='feature dimension')\nparser.add_option('--splice', dest='splice', help='applied splice value')\n(options, args) = parser.parse_args()\n\nif(options.dim == None):\n    parser.print_help()\n    sys.exit(1)\n\ndim=int(options.dim)\nsplice=int(options.splice)\n\n\n#generate the diagonal matrix with hammings\nM_2PI = 6.283185307179586476925286766559005\n\ndim_mat=(2*splice+1)*dim\ntimeContext=2*splice+1\nprint('[')\nfor row in range(dim_mat):\n    for col in range(dim_mat):\n        if col!=row:\n            print_on_same_line('0')\n        else:\n            i=int(row/dim)\n            print_on_same_line(str(0.54 - 0.46*cos((M_2PI * i) / (timeContext-1))))\n    print()\n\nprint(']')\n\n\n"
  },
  {
    "path": "egs/utils/nnet/gen_splice.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2012  Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# ./gen_splice.py\n# generates <splice> Component\n\nfrom __future__ import print_function\nfrom math import *\nimport sys\n\n\nfrom optparse import OptionParser\n\ndef print_on_same_line(text):\n    print(text, end=' ')\n\nparser = OptionParser()\nparser.add_option('--fea-dim', dest='dim_in', help='feature dimension')\nparser.add_option('--splice', dest='splice', help='number of frames to concatenate with the central frame')\nparser.add_option('--splice-step', dest='splice_step', help='splicing step (frames dont need to be consecutive, --splice 3 --splice-step 2 will select offsets: -6 -4 -2 0 2 4 6)', default='1' )\n(options, args) = parser.parse_args()\n\nif(options.dim_in == None):\n    parser.print_help()\n    sys.exit(1)\n\ndim_in=int(options.dim_in)\nsplice=int(options.splice)\nsplice_step=int(options.splice_step)\n\ndim_out=(2*splice+1)*dim_in\n\nprint('<splice> {0} {1}'.format(dim_out, dim_in))\nprint_on_same_line('[')\n\nsplice_vec = list(range(-splice*splice_step, splice*splice_step+1, splice_step))\nfor idx in range(len(splice_vec)):\n    print_on_same_line(splice_vec[idx])\n\nprint(']')\n\n"
  },
  {
    "path": "egs/utils/nnet/make_blstm_proto.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015-2016  Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# Generated Nnet prototype, to be initialized by 'nnet-initialize'.\n\nfrom __future__ import print_function\nimport sys\n\n###\n### Parse options\n###\nfrom optparse import OptionParser\nusage=\"%prog [options] <feat-dim> <num-leaves> >nnet-proto-file\"\nparser = OptionParser(usage)\n# Required,\nparser.add_option('--cell-dim', dest='cell_dim', type='int', default=320,\n                   help='Number of cells for one direction in BLSTM [default: %default]');\nparser.add_option('--proj-dim', dest='proj_dim', type='int', default=200,\n                   help='Dim reduction for one direction in BLSTM [default: %default]');\nparser.add_option('--proj-dim-last', dest='proj_dim_last', type='int', default=320,\n                   help='Dim reduction for one direction in BLSTM (last BLSTM component) [default: %default]');\nparser.add_option('--num-layers', dest='num_layers', type='int', default=2,\n                   help='Number of BLSTM layers [default: %default]');\n# Optional (default == 'None'),\nparser.add_option('--lstm-param-range', dest='lstm_param_range', type='float',\n                   help='Range of initial BLSTM parameters [default: %default]');\nparser.add_option('--param-stddev', dest='param_stddev', type='float',\n                 
  help='Standard deviation for initial weights of Softmax layer [default: %default]');\nparser.add_option('--cell-clip', dest='cell_clip', type='float',\n                   help='Clipping cell values during propagation (per-frame) [default: %default]');\nparser.add_option('--diff-clip', dest='diff_clip', type='float',\n                   help='Clipping partial-derivatives during BPTT (per-frame) [default: %default]');\nparser.add_option('--cell-diff-clip', dest='cell_diff_clip', type='float',\n                   help='Clipping partial-derivatives of \"cells\" during BPTT (per-frame, those accumulated by CEC) [default: %default]');\nparser.add_option('--grad-clip', dest='grad_clip', type='float',\n                   help='Clipping the accumulated gradients (per-updates) [default: %default]');\n#\n\n(o,args) = parser.parse_args()\nif len(args) != 2 :\n  parser.print_help()\n  sys.exit(1)\n\n(feat_dim, num_leaves) = [int(i) for i in args];\n\n# Original prototype from Jiayu,\n#<NnetProto>\n#<Transmit> <InputDim> 40 <OutputDim> 40\n#<LstmProjectedStreams> <InputDim> 40 <OutputDim> 512 <CellDim> 800 <ParamScale> 0.01 <NumStream> 4\n#<AffineTransform> <InputDim> 512 <OutputDim> 8000 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.04\n#<Softmax> <InputDim> 8000 <OutputDim> 8000\n#</NnetProto>\n\nlstm_extra_opts=\"\"\nif None != o.lstm_param_range: lstm_extra_opts += \"<ParamRange> %f \"   % o.lstm_param_range\nif None != o.cell_clip:        lstm_extra_opts += \"<CellClip> %f \"     % o.cell_clip\nif None != o.diff_clip:        lstm_extra_opts += \"<DiffClip> %f \"     % o.diff_clip\nif None != o.cell_diff_clip:   lstm_extra_opts += \"<CellDiffClip> %f \" % o.cell_diff_clip\nif None != o.grad_clip:        lstm_extra_opts += \"<GradClip> %f \"     % o.grad_clip\n\nsoftmax_affine_opts=\"\"\nif None != o.param_stddev:     softmax_affine_opts += \"<ParamStddev> %f \" % o.param_stddev\n\n# The BLSTM layers,\nif o.num_layers == 1:\n  # Single BLSTM,\n  
print(\"<BlstmProjected> <InputDim> %d <OutputDim> %d <CellDim> %s\" % (feat_dim, 2*o.proj_dim_last, o.cell_dim) + lstm_extra_opts)\nelse:\n  # >1 BLSTM,\n  print(\"<BlstmProjected> <InputDim> %d <OutputDim> %d <CellDim> %s\" % (feat_dim, 2*o.proj_dim, o.cell_dim) + lstm_extra_opts)\n  for l in range(o.num_layers - 2):\n    print(\"<BlstmProjected> <InputDim> %d <OutputDim> %d <CellDim> %s\" % (2*o.proj_dim, 2*o.proj_dim, o.cell_dim) + lstm_extra_opts)\n  print(\"<BlstmProjected> <InputDim> %d <OutputDim> %d <CellDim> %s\" % (2*o.proj_dim, 2*o.proj_dim_last, o.cell_dim) + lstm_extra_opts)\n\n# Adding <Tanh> for more stability,\nprint(\"<Tanh> <InputDim> %d <OutputDim> %d\" % (2*o.proj_dim_last, 2*o.proj_dim_last))\n\n# Softmax layer,\nprint(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> 0.0 <BiasRange> 0.0\" % (2*o.proj_dim_last, num_leaves) + softmax_affine_opts)\nprint(\"<Softmax> <InputDim> %d <OutputDim> %d\" % (num_leaves, num_leaves))\n\n"
  },
  {
    "path": "egs/utils/nnet/make_cnn_proto.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2014  Brno University of Technology (author: Katerina Zmolikova, Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# Generated Nnet prototype, to be initialized by 'nnet-initialize'.\n\nfrom __future__ import division\nfrom __future__ import print_function\nimport math, random, sys\nfrom optparse import OptionParser\n\n###\n### Parse options\n###\nusage=\"%prog [options] <feat-dim> <num-leaves> <num-hidden-layers> <num-hidden-neurons>  >nnet-proto-file\"\nparser = OptionParser(usage)\n\nparser.add_option('--activation-type', dest='activation_type', \n                   help='Select type of activation function : (<Sigmoid>|<Tanh>) [default: %default]', \n                   default='<Sigmoid>', type='string');\nparser.add_option('--num-filters1', dest='num_filters1',\n\t\t   help='Number of filters in first convolutional layer [default: %default]',\n\t\t   default=128, type='int')\nparser.add_option('--num-filters2', dest='num_filters2',\n\t\t   help='Number of filters in second convolutional layer [default: %default]',\n\t\t   default=256, type='int')\nparser.add_option('--pool-size', dest='pool_size',\n\t  \t   help='Size of pooling [default: %default]',\n\t\t   default=3, type='int')\nparser.add_option('--pool-step', dest='pool_step',\n\t\t  help='Step of pooling [default: %default]',\n\t\t  default=3, 
type='int')\nparser.add_option('--pool-type', dest='pool_type',\n\t\t  help='Type of pooling (Max || Average) [default: %default]',\n\t\t  default='Max', type='string')\nparser.add_option('--pitch-dim', dest='pitch_dim',\n\t\t  help='Number of features representing pitch [default: %default]',\n\t\t  default=0, type='int')\nparser.add_option('--delta-order', dest='delta_order',\n\t\t  help='Order of delta features [default: %default]',\n\t\t  default=2, type='int')\nparser.add_option('--splice', dest='splice',\n\t\t  help='Length of splice [default: %default]',\n\t\t  default=5,type='int')\nparser.add_option('--patch-step1', dest='patch_step1',\n\t\t  help='Patch step of first convolutional layer [default: %default]',\n\t\t  default=1, type='int')\nparser.add_option('--patch-dim1', dest='patch_dim1',\n\t\t  help='Dim of convolutional kernel in 1st layer (freq. axis) [default: %default]',\n  \t\t  default=8, type='int')\nparser.add_option('--patch-dim2', dest='patch_dim2',\n\t\t  help='Dim of convolutional kernel in 2nd layer (freq. 
axis) [default: %default]',\n  \t\t  default=4, type='int')\nparser.add_option('--dir', dest='protodir',\n\t\t  help='Directory, where network prototypes will be saved [default: %default]',\n\t\t  default='.', type='string')\nparser.add_option('--num-pitch-neurons', dest='num_pitch_neurons',\n\t\t  help='Number of neurons in layers processing pitch features [default: %default]',\n\t\t  default='200', type='int')\n\n(o,args) = parser.parse_args()\nif len(args) != 1 : \n  parser.print_help()\n  sys.exit(1)\n \nfeat_dim = int(args[0]);\n### End parse options \n\nfeat_raw_dim = feat_dim / (o.delta_order+1) / (o.splice*2+1) - o.pitch_dim # we need number of feats without deltas and splice and pitch\n\n# Check\nassert(feat_dim > 0)\nassert(o.pool_type == 'Max' or o.pool_type == 'Average')\n\n###\n### Print prototype of the network\n###\n\n# Begin the prototype\nprint(\"<NnetProto>\")\n\n# Convolutional part of network\nnum_patch1 = 1 + (feat_raw_dim - o.patch_dim1) / o.patch_step1\nnum_pool = 1 + (num_patch1 - o.pool_size) / o.pool_step\npatch_dim2 = o.patch_dim2\npatch_step2 = o.patch_step1\npatch_stride2 = num_pool # same as layer1 outputs \nnum_patch2 = 1 + (num_pool - patch_dim2) / patch_step2\n\ninputdim_of_cnn = feat_dim\noutputdim_of_cnn = o.num_filters2*num_patch2\n\nconvolution_proto = ''  \n\nconvolution_proto += \"<ConvolutionalComponent> <InputDim> %d <OutputDim> %d <PatchDim> %d <PatchStep> %d <PatchStride> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <MaxNorm> %f\\n\" % \\\n\t\t\t(feat_raw_dim * (o.delta_order+1) * (o.splice*2+1), o.num_filters1 * num_patch1, o.patch_dim1, o.patch_step1, feat_raw_dim, -1.0, 2.0, 0.02, 30) #~8x11x3 = 264 inputs\nconvolution_proto += \"<%sPoolingComponent> <InputDim> %d <OutputDim> %d <PoolSize> %d <PoolStep> %d <PoolStride> %d\\n\" % \\\n\t\t\t(o.pool_type, o.num_filters1*num_patch1, o.num_filters1*num_pool, o.pool_size, o.pool_step, o.num_filters1)\nconvolution_proto += \"<Rescale> <InputDim> %d <OutputDim> %d 
<InitParam> %f\\n\" % \\\n\t\t\t(o.num_filters1*num_pool, o.num_filters1*num_pool, 1)\nconvolution_proto += \"<AddShift> <InputDim> %d <OutputDim> %d <InitParam> %f\\n\" % \\\n\t\t\t(o.num_filters1*num_pool, o.num_filters1*num_pool, 0)\nconvolution_proto += \"%s <InputDim> %d <OutputDim> %d\\n\" % \\\n\t\t\t(o.activation_type, o.num_filters1*num_pool, o.num_filters1*num_pool)\nconvolution_proto += \"<ConvolutionalComponent> <InputDim> %d <OutputDim> %d <PatchDim> %d <PatchStep> %d <PatchStride> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <MaxNorm> %f\\n\" % \\\n\t\t\t(o.num_filters1*num_pool, outputdim_of_cnn, patch_dim2, patch_step2, patch_stride2, -2.0, 4.0, 0.1, 50) #~4x128 = 512 inputs\nconvolution_proto += \"<Rescale> <InputDim> %d <OutputDim> %d <InitParam> %f\\n\" % \\\n\t\t\t(outputdim_of_cnn, outputdim_of_cnn, 1)\nconvolution_proto += \"<AddShift> <InputDim> %d <OutputDim> %d <InitParam> %f\\n\" % \\\n\t\t\t(outputdim_of_cnn, outputdim_of_cnn, 0)\nconvolution_proto += \"%s <InputDim> %d <OutputDim> %d\\n\" % \\\n\t\t\t(o.activation_type, outputdim_of_cnn, outputdim_of_cnn)\n\nif (o.pitch_dim > 0):\n  # convolutional part\n  f_conv = open('%s/nnet.proto.convolution' % o.protodir, 'w')\n  f_conv.write('<NnetProto>\\n')\n  f_conv.write(convolution_proto)\n  f_conv.write('</NnetProto>\\n')\n  f_conv.close()\n  \n  # pitch part\n  f_pitch = open('%s/nnet.proto.pitch' % o.protodir, 'w')\n  f_pitch.write('<NnetProto>\\n')\n  f_pitch.write('<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f\\n' % \\\n\t\t((o.pitch_dim * (o.delta_order+1) * (o.splice*2+1)), o.num_pitch_neurons, -2, 4, 0.02))\n  f_pitch.write('%s <InputDim> %d <OutputDim> %d\\n' % \\\n\t\t(o.activation_type, o.num_pitch_neurons, o.num_pitch_neurons))\n  f_pitch.write('<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f\\n' % \\\n\t\t(o.num_pitch_neurons, o.num_pitch_neurons, -2, 4, 0.1))\n  f_pitch.write('%s 
<InputDim> %d <OutputDim> %d\\n' % \\\n\t\t(o.activation_type, o.num_pitch_neurons, o.num_pitch_neurons))\n  f_pitch.write('</NnetProto>\\n')\n  f_pitch.close()\n\n  # parallel part\n  # range() requires integers; feat_raw_dim is a float under true division,\n  vector = ''\n  for i in range(1, inputdim_of_cnn, int(feat_raw_dim + o.pitch_dim)):\n    vector += '%d:1:%d ' % (i, i + feat_raw_dim - 1)\n  for i in range(int(feat_raw_dim)+1, inputdim_of_cnn + 1, int(feat_raw_dim + o.pitch_dim)):\n    vector += '%d:1:%d ' % (i, i + o.pitch_dim - 1)\n  print('<Copy> <InputDim> %d <OutputDim> %d <BuildVector> %s </BuildVector>' % \\\n\t(inputdim_of_cnn, inputdim_of_cnn, vector))\n  print('<ParallelComponent> <InputDim> %d <OutputDim> %d <NestedNnetProto> %s %s </NestedNnetProto>' % \\\n\t(inputdim_of_cnn, o.num_pitch_neurons + outputdim_of_cnn, '%s/nnet.proto.convolution' % o.protodir, '%s/nnet.proto.pitch' % o.protodir))\n\nelse: # no pitch\n  print(convolution_proto)\n\n# We are done!\nsys.exit(0)\n\n"
  },
  {
    "path": "egs/utils/nnet/make_lstm_proto.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2015-2016  Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# Generated Nnet prototype, to be initialized by 'nnet-initialize'.\n\nfrom __future__ import print_function\nimport sys\n\n###\n### Parse options\n###\nfrom optparse import OptionParser\nusage=\"%prog [options] <feat-dim> <num-leaves> >nnet-proto-file\"\nparser = OptionParser(usage)\n# Required,\nparser.add_option('--cell-dim', dest='cell_dim', type='int', default=320,\n                   help='Number of cells for one direction in LSTM [default: %default]');\nparser.add_option('--proj-dim', dest='proj_dim', type='int', default=400,\n                   help='Number of LSTM recurrent units [default: %default]');\nparser.add_option('--num-layers', dest='num_layers', type='int', default=2,\n                   help='Number of LSTM layers [default: %default]');\n# Optional (default == 'None'),\nparser.add_option('--lstm-param-range', dest='lstm_param_range', type='float',\n                   help='Range of initial LSTM parameters [default: %default]');\nparser.add_option('--param-stddev', dest='param_stddev', type='float',\n                   help='Standard deviation for initial weights of Softmax layer [default: %default]');\nparser.add_option('--cell-clip', dest='cell_clip', type='float',\n                   help='Clipping cell values during 
propagation (per-frame) [default: %default]');\nparser.add_option('--diff-clip', dest='diff_clip', type='float',\n                   help='Clipping partial-derivatives during BPTT (per-frame) [default: %default]');\nparser.add_option('--cell-diff-clip', dest='cell_diff_clip', type='float',\n                   help='Clipping partial-derivatives of \"cells\" during BPTT (per-frame, those accumulated by CEC) [default: %default]');\nparser.add_option('--grad-clip', dest='grad_clip', type='float',\n                   help='Clipping the accumulated gradients (per-updates) [default: %default]');\n#\n\n(o,args) = parser.parse_args()\nif len(args) != 2 :\n  parser.print_help()\n  sys.exit(1)\n\n(feat_dim, num_leaves) = [int(i) for i in args];\n\n# Original prototype from Jiayu,\n#<NnetProto>\n#<Transmit> <InputDim> 40 <OutputDim> 40\n#<LstmProjectedStreams> <InputDim> 40 <OutputDim> 512 <CellDim> 800 <ParamScale> 0.01 <NumStream> 4\n#<AffineTransform> <InputDim> 512 <OutputDim> 8000 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.04\n#<Softmax> <InputDim> 8000 <OutputDim> 8000\n#</NnetProto>\n\nlstm_extra_opts=\"\"\nif None != o.lstm_param_range: lstm_extra_opts += \"<ParamRange> %f \"   % o.lstm_param_range\nif None != o.cell_clip:        lstm_extra_opts += \"<CellClip> %f \"     % o.cell_clip\nif None != o.diff_clip:        lstm_extra_opts += \"<DiffClip> %f \"     % o.diff_clip\nif None != o.cell_diff_clip:   lstm_extra_opts += \"<CellDiffClip> %f \" % o.cell_diff_clip\nif None != o.grad_clip:        lstm_extra_opts += \"<GradClip> %f \"     % o.grad_clip\n\nsoftmax_affine_opts=\"\"\nif None != o.param_stddev:     softmax_affine_opts += \"<ParamStddev> %f \" % o.param_stddev\n\n# The LSTM layers,\nprint(\"<LstmProjected> <InputDim> %d <OutputDim> %d <CellDim> %s\" % (feat_dim, o.proj_dim, o.cell_dim) + lstm_extra_opts)\nfor l in range(o.num_layers - 1):\n  print(\"<LstmProjected> <InputDim> %d <OutputDim> %d <CellDim> %s\" % (o.proj_dim, o.proj_dim, o.cell_dim) 
+ lstm_extra_opts)\n\n# Adding <Tanh> for more stability,\nprint(\"<Tanh> <InputDim> %d <OutputDim> %d\" % (o.proj_dim, o.proj_dim))\n\n# Softmax layer,\nprint(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> 0.0 <BiasRange> 0.0\" % (o.proj_dim, num_leaves) + softmax_affine_opts)\nprint(\"<Softmax> <InputDim> %d <OutputDim> %d\" % (num_leaves, num_leaves))\n\n"
  },
  {
    "path": "egs/utils/nnet/make_nnet_proto.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2014-2016  Brno University of Technology (author: Karel Vesely)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# Generated Nnet prototype, to be initialized by 'nnet-initialize'.\n\nfrom __future__ import division\nfrom __future__ import print_function\nimport math, random, sys, re\n\n###\n### Parse options\n###\nfrom optparse import OptionParser\nusage=\"%prog [options] <feat-dim> <num-leaves> <num-hid-layers> <num-hid-neurons> >nnet-proto-file\"\nparser = OptionParser(usage)\n\n# Softmax related,\nparser.add_option('--no-softmax', dest='with_softmax',\n                   help='Do not put <SoftMax> in the prototype [default: %default]',\n                   default=True, action='store_false');\nparser.add_option('--block-softmax-dims', dest='block_softmax_dims',\n                   help='Generate <BlockSoftmax> with dims D1:D2:D3 [default: %default]',\n                   default=\"\", type='string');\n# Activation related,\nparser.add_option('--activation-type', dest='activation_type',\n                   help='Select type of activation function : (<Sigmoid>|<Tanh>|<ParametricRelu>) [default: %default]',\n                   default='<Sigmoid>', type='string');\nparser.add_option('--activation-opts', dest='activation_opts',\n                   help='Additional options for protoype of activation function [default: %default]',\n                   
default='', type='string');\n# Affine-transform related,\nparser.add_option('--hid-bias-mean', dest='hid_bias_mean',\n                   help='Set bias for hidden activations [default: %default]',\n                   default=-2.0, type='float');\nparser.add_option('--hid-bias-range', dest='hid_bias_range',\n                   help='Set bias range for hidden activations (+/- 1/2 range around mean) [default: %default]',\n                   default=4.0, type='float');\nparser.add_option('--param-stddev-factor', dest='param_stddev_factor',\n                   help='Factor to rescale Normal distriburtion for initalizing weight matrices [default: %default]',\n                   default=0.1, type='float');\nparser.add_option('--no-glorot-scaled-stddev', dest='with_glorot',\n                   help='Generate normalized weights according to X.Glorot paper, but mapping U->N with same variance (factor sqrt(x/(dim_in+dim_out)))',\n                   action='store_false', default=True);\nparser.add_option('--no-smaller-input-weights', dest='smaller_input_weights',\n                   help='Disable 1/12 reduction of stddef in input layer [default: %default]',\n                   action='store_false', default=True);\nparser.add_option('--no-bottleneck-trick', dest='bottleneck_trick',\n                   help='Disable smaller initial weights and learning rate around bottleneck',\n                   action='store_false', default=True);\nparser.add_option('--max-norm', dest='max_norm',\n                   help='Max radius of neuron-weights in L2 space (if longer weights get shrinked, not applied to last layer, 0.0 = disable) [default: %default]',\n                   default=0.0, type='float');\nparser.add_option('--affine-opts', dest='affine_opts',\n                   help='Additional options for protoype of affine tranform [default: %default]',\n                   default='', type='string');\n# Topology related,\nparser.add_option('--bottleneck-dim', dest='bottleneck_dim',\n        
           help='Make bottleneck network with desired bn-dim (0 = no bottleneck) [default: %default]',\n                   default=0, type='int');\nparser.add_option('--with-dropout', dest='with_dropout',\n                   help='Add <Dropout> after the non-linearity of hidden layer.',\n                   action='store_true', default=False);\nparser.add_option('--dropout-opts', dest='dropout_opts',\n                   help='Extra options for dropout [default: %default]',\n                   default='', type='string');\n\n\n(o,args) = parser.parse_args()\nif len(args) != 4 :\n  parser.print_help()\n  sys.exit(1)\n\n# A HACK TO PASS MULTI-WORD OPTIONS, WORDS ARE CONNECTED BY UNDERSCORES '_',\no.activation_opts = o.activation_opts.replace(\"_\",\" \")\no.affine_opts = o.affine_opts.replace(\"_\",\" \")\no.dropout_opts = o.dropout_opts.replace(\"_\",\" \")\n\n(feat_dim, num_leaves, num_hid_layers, num_hid_neurons) = [int(i) for i in args];\n### End parse options\n\n\n# Check\nassert(feat_dim > 0)\nassert(num_leaves > 0)\nassert(num_hid_layers >= 0)\nassert(num_hid_neurons > 0)\nif o.block_softmax_dims:\n  assert(sum(map(int, re.split(\"[,:]\", o.block_softmax_dims))) == num_leaves) # posible separators : ',' ':'\n\n# Optionaly scale\ndef Glorot(dim1, dim2):\n  if o.with_glorot:\n    # 35.0 = magic number, gives ~1.0 in inner layers for hid-dim 1024dim,\n    return 35.0 * math.sqrt(2.0/(dim1+dim2));\n  else:\n    return 1.0\n\n\n###\n### Print prototype of the network\n###\n\n# NO HIDDEN LAYER, ADDING BOTTLENECK!\n# No hidden layer while adding bottleneck means:\n# - add bottleneck layer + hidden layer + output layer\nif num_hid_layers == 0 and o.bottleneck_dim != 0:\n  assert(o.bottleneck_dim > 0)\n  assert(num_hid_layers == 0)\n  if o.bottleneck_trick:\n    # 25% smaller stddev -> small bottleneck range, 10x smaller learning rate\n    print(\"<LinearTransform> <InputDim> %d <OutputDim> %d <ParamStddev> %f <LearnRateCoef> %f\" % \\\n     (feat_dim, o.bottleneck_dim, 
\\\n      (o.param_stddev_factor * Glorot(feat_dim, o.bottleneck_dim) * 0.75 ), 0.1))\n    # 25% smaller stddev -> smaller gradient in prev. layer, 10x smaller learning rate for weigts & biases\n    print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <LearnRateCoef> %f <BiasLearnRateCoef> %f <MaxNorm> %f\" % \\\n     (o.bottleneck_dim, num_hid_neurons, o.hid_bias_mean, o.hid_bias_range, \\\n      (o.param_stddev_factor * Glorot(o.bottleneck_dim, num_hid_neurons) * 0.75 ), 0.1, 0.1, o.max_norm))\n  else:\n    print(\"<LinearTransform> <InputDim> %d <OutputDim> %d <ParamStddev> %f\" % \\\n     (feat_dim, o.bottleneck_dim, \\\n      (o.param_stddev_factor * Glorot(feat_dim, o.bottleneck_dim))))\n    print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <MaxNorm> %f\" % \\\n     (o.bottleneck_dim, num_hid_neurons, o.hid_bias_mean, o.hid_bias_range, \\\n      (o.param_stddev_factor * Glorot(o.bottleneck_dim, num_hid_neurons)), o.max_norm))\n  print(\"%s <InputDim> %d <OutputDim> %d %s\" % (o.activation_type, num_hid_neurons, num_hid_neurons, o.activation_opts)) # Non-linearity\n  # Last AffineTransform (10x smaller learning rate on bias)\n  print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <LearnRateCoef> %f <BiasLearnRateCoef> %f\" % \\\n   (num_hid_neurons, num_leaves, 0.0, 0.0, \\\n    (o.param_stddev_factor * Glorot(num_hid_neurons, num_leaves)), 1.0, 0.1))\n  # Optionaly append softmax\n  if o.with_softmax:\n    if o.block_softmax_dims == \"\":\n      print(\"<Softmax> <InputDim> %d <OutputDim> %d\" % (num_leaves, num_leaves))\n    else:\n      print(\"<BlockSoftmax> <InputDim> %d <OutputDim> %d <BlockDims> %s\" % (num_leaves, num_leaves, o.block_softmax_dims))\n  print(\"</NnetProto>\")\n  # We are done!\n  sys.exit(0)\n\n# NO HIDDEN LAYERS!\n# Add only last layer (logistic regression)\nif num_hid_layers == 0:\n  
print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f\" % \\\n        (feat_dim, num_leaves, 0.0, 0.0, (o.param_stddev_factor * Glorot(feat_dim, num_leaves))))\n  if o.with_softmax:\n    if o.block_softmax_dims == \"\":\n      print(\"<Softmax> <InputDim> %d <OutputDim> %d\" % (num_leaves, num_leaves))\n    else:\n      print(\"<BlockSoftmax> <InputDim> %d <OutputDim> %d <BlockDims> %s\" % (num_leaves, num_leaves, o.block_softmax_dims))\n  print(\"</NnetProto>\")\n  # We are done!\n  sys.exit(0)\n\n\n# THE USUAL DNN PROTOTYPE STARTS HERE!\n# Assuming we have >0 hidden layers,\nassert(num_hid_layers > 0)\n\n# Begin the prototype,\n# First AffineTranform,\nprint(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <MaxNorm> %f %s\" % \\\n      (feat_dim, num_hid_neurons, o.hid_bias_mean, o.hid_bias_range, \\\n       (o.param_stddev_factor * Glorot(feat_dim, num_hid_neurons) * \\\n        (math.sqrt(1.0/12.0) if o.smaller_input_weights else 1.0)), o.max_norm, o.affine_opts))\n      # Note.: compensating dynamic range mismatch between input features and Sigmoid-hidden layers,\n      # i.e. 
mapping the std-dev of N(0,1) (input features) to std-dev of U[0,1] (sigmoid-outputs).\n      # This is done by multiplying with stddev(U[0,1]) = sqrt(1/12).\n      # The stddev of weights is consequently reduced with scale 0.29,\nprint(\"%s <InputDim> %d <OutputDim> %d %s\" % (o.activation_type, num_hid_neurons, num_hid_neurons, o.activation_opts))\nif o.with_dropout:\n  print(\"<Dropout> <InputDim> %d <OutputDim> %d %s\" % (num_hid_neurons, num_hid_neurons, o.dropout_opts))\n\n\n# Internal AffineTransforms,\nfor i in range(num_hid_layers-1):\n  print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <MaxNorm> %f %s\" % \\\n        (num_hid_neurons, num_hid_neurons, o.hid_bias_mean, o.hid_bias_range, \\\n         (o.param_stddev_factor * Glorot(num_hid_neurons, num_hid_neurons)), o.max_norm, o.affine_opts))\n  print(\"%s <InputDim> %d <OutputDim> %d %s\" % (o.activation_type, num_hid_neurons, num_hid_neurons, o.activation_opts))\n  if o.with_dropout:\n    print(\"<Dropout> <InputDim> %d <OutputDim> %d %s\" % (num_hid_neurons, num_hid_neurons, o.dropout_opts))\n\n# Optionaly add bottleneck,\nif o.bottleneck_dim != 0:\n  assert(o.bottleneck_dim > 0)\n  if o.bottleneck_trick:\n    # 25% smaller stddev -> small bottleneck range, 10x smaller learning rate\n    print(\"<LinearTransform> <InputDim> %d <OutputDim> %d <ParamStddev> %f <LearnRateCoef> %f\" % \\\n     (num_hid_neurons, o.bottleneck_dim, \\\n      (o.param_stddev_factor * Glorot(num_hid_neurons, o.bottleneck_dim) * 0.75 ), 0.1))\n    # 25% smaller stddev -> smaller gradient in prev. 
layer, 10x smaller learning rate for weights & biases\n    print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <LearnRateCoef> %f <BiasLearnRateCoef> %f <MaxNorm> %f %s\" % \\\n     (o.bottleneck_dim, num_hid_neurons, o.hid_bias_mean, o.hid_bias_range, \\\n      (o.param_stddev_factor * Glorot(o.bottleneck_dim, num_hid_neurons) * 0.75 ), 0.1, 0.1, o.max_norm, o.affine_opts))\n  else:\n    # Same learning-rate and stddev-formula everywhere,\n    print(\"<LinearTransform> <InputDim> %d <OutputDim> %d <ParamStddev> %f\" % \\\n     (num_hid_neurons, o.bottleneck_dim, \\\n      (o.param_stddev_factor * Glorot(num_hid_neurons, o.bottleneck_dim))))\n    print(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <MaxNorm> %f %s\" % \\\n     (o.bottleneck_dim, num_hid_neurons, o.hid_bias_mean, o.hid_bias_range, \\\n      (o.param_stddev_factor * Glorot(o.bottleneck_dim, num_hid_neurons)), o.max_norm, o.affine_opts))\n  print(\"%s <InputDim> %d <OutputDim> %d %s\" % (o.activation_type, num_hid_neurons, num_hid_neurons, o.activation_opts))\n  if o.with_dropout:\n    print(\"<Dropout> <InputDim> %d <OutputDim> %d %s\" % (num_hid_neurons, num_hid_neurons, o.dropout_opts))\n\n# Last AffineTransform (10x smaller learning rate on bias)\nprint(\"<AffineTransform> <InputDim> %d <OutputDim> %d <BiasMean> %f <BiasRange> %f <ParamStddev> %f <LearnRateCoef> %f <BiasLearnRateCoef> %f\" % \\\n      (num_hid_neurons, num_leaves, 0.0, 0.0, \\\n       (o.param_stddev_factor * Glorot(num_hid_neurons, num_leaves)), 1.0, 0.1))\n\n# Optionally append softmax\nif o.with_softmax:\n  if o.block_softmax_dims == \"\":\n    print(\"<Softmax> <InputDim> %d <OutputDim> %d\" % (num_leaves, num_leaves))\n  else:\n    print(\"<BlockSoftmax> <InputDim> %d <OutputDim> %d <BlockDims> %s\" % (num_leaves, num_leaves, o.block_softmax_dims))\n\n# We are done!\nsys.exit(0)\n\n"
  },
  {
    "path": "egs/utils/nnet/subset_data_tr_cv.sh",
    "content": "#!/usr/bin/env bash\n#\n# Copyright 2017  Brno University of Technology (Author: Karel Vesely);\n# Apache 2.0\n\n# This scripts splits 'data' directory into two parts:\n# - training set with 90% of speakers\n# - held-out set with 10% of speakers (cv)\n# (to be used in frame cross-entropy training of 'nnet1' models),\n\n# The script also accepts a list of held-out set speakers by '--cv-spk-list'\n# (with perturbed data, we pass the list of speakers externally).\n# The remaining set of speakers is the the training set.\n\ncv_spk_percent=10\ncv_spk_list= # To be used with perturbed data,\nseed=777\ncv_utt_percent= # ignored (compatibility),\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n  echo \"Usage: $0 [opts] <src-data> <train-data> <cv-data>\"\n  echo \"  --cv-spk-percent N (default 10)\"\n  echo \"  --cv-spk-list <file> (a pre-defined list with cv speakers)\"\n  exit 1;\nfi\n\nset -euo pipefail\n\nsrc_data=$1\ntrn_data=$2\ncv_data=$3\n\n[ ! -r $src_data/spk2utt ] && echo \"Missing '$src_data/spk2utt'. 
Error!\" && exit 1\n\ntmp=$(mktemp -d /tmp/${USER}_XXXXX)\n\nif [ -z \"$cv_spk_list\" ]; then\n  # Select 'cv_spk_percent' speakers randomly,\n  cat $src_data/spk2utt | awk '{ print $1; }' | utils/shuffle_list.pl --srand $seed >$tmp/speakers\n  n_spk=$(wc -l <$tmp/speakers)\n  n_spk_cv=$(perl -e \"print int($cv_spk_percent * $n_spk / 100); \")\n  #\n  head -n $n_spk_cv $tmp/speakers >$tmp/speakers_cv\n  tail -n+$((n_spk_cv+1)) $tmp/speakers >$tmp/speakers_trn\nelse\n  # Use pre-defined list of speakers,\n  cp $cv_spk_list $tmp/speakers_cv\n  join -v2 <(sort $cv_spk_list) <(awk '{ print $1; }' <$src_data/spk2utt | sort) >$tmp/speakers_trn\nfi\n\n# Sanity checks,\nn_spk=$(wc -l <$src_data/spk2utt)\necho \"Speakers, src=$n_spk, trn=$(wc -l <$tmp/speakers_trn), cv=$(wc -l <$tmp/speakers_cv)\"\noverlap=$(join <(sort $tmp/speakers_trn) <(sort $tmp/speakers_cv) | wc -l)\n[ $overlap != 0 ] && \\\n  echo \"WARNING, speaker overlap detected!\" && \\\n  join <(sort $tmp/speakers_trn) <(sort $tmp/speakers_cv) | head && \\\n  echo '...'\n\n# Create new data dirs,\nutils/data/subset_data_dir.sh --spk-list $tmp/speakers_trn $src_data $trn_data\nutils/data/subset_data_dir.sh --spk-list $tmp/speakers_cv $src_data $cv_data\n\n"
  },
  {
    "path": "egs/utils/nnet-cpu/make_nnet_config.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# These options can be useful if we want to splice the input\n# features across time.\n$input_left_context = 0;\n$input_right_context = 0;\n$param_stddev_factor = 1.0;  # can be used to adjust initial variance\n  # of parameters.\n$initial_num_hidden_layers = -1; # if >= 0, the number of hidden layers\n  # the model should start with, which may be less than the final number\n  # (the final number is used to calculate the #neurons).\n$single_layer_config = \"\";\n$bias_stddev = 2.0;\n$learning_rate = 0.001;\n$nobias = \"\";\n\nfor ($x = 1; $x < 10; $x++) {\n  if ($ARGV[0] eq \"--input-left-context\") {\n    $input_left_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--input-right-context\") {\n    $input_right_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--param-stddev-factor\") {\n    $param_stddev_factor = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--bias-stddev\") {\n    $bias_stddev = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--nobias\") {\n    $nobias = \"Nobias\";\n    shift;\n  }\n  if ($ARGV[0] eq \"--learning-rate\") {\n    $learning_rate = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--initial-num-hidden-layers\") {\n    
$initial_num_hidden_layers = $ARGV[1];\n    $single_layer_config = $ARGV[2];\n    shift; shift; shift;\n  }\n}\n\n\nif (@ARGV != 4) {\n  print STDERR \"Usage: make_nnet_config.pl  [options] <feat-dim> <num-leaves> <num-hidden-layers> <num-parameters>  >config-file\nOptions:\n   --input-left-context <n>        #  #frames of left context for input features; default 0.\n   --input-right-context <n>       #  #frames of right context for input features; default 0.\n   --param-stddev-factor <f>       #  Factor which can be used to modify the standard deviation of\n                                   #  randomly initialized parameters (default, 1.  Gets multiplied by\n                                   #  1/sqrt of number of inputs).\n   --initial-num-hidden-layers <n> <config-file>   #  If >0, number of hidden layers to initialize the network with.\n                                   #  In this case, the positional parameter <num-hidden-layers> is only\n                                   #  used to work out the number of units per hidden layer (based on\n                                   #  parameter count), and we write to <config-file> the config corresponding\n                                   #  to a single hidden layer.\n   --learning-rate <f>             # Initial learning rate, default 0.001\\n\";\n     exit(1);\n}\n\n($feat_dim, $num_leaves, $num_hidden_layers, $num_params) = @ARGV;\n($input_left_context < 0) &&  die \"Invalid input left context $input_left_context\";\n($input_right_context < 0) &&  die \"Invalid input right context $input_right_context\";\n($feat_dim <= 0) &&  die \"Invalid feature dimension $feat_dim\";\n($num_leaves <= 0) && die \"Invalid number of leaves $num_leaves\";\n($num_hidden_layers <= 0) && die \"Invalid number of hidden layers $num_hidden_layers\";\nif ($initial_num_hidden_layers < 0) {\n  $initial_num_hidden_layers = $num_hidden_layers;\n}\nif ($initial_num_hidden_layers > $num_hidden_layers) {\n  print STDERR \"Initial number of 
hidden layers is more than #hidden layers.\\n\" .\n    \"This does not really make sense but continuing anyway.\";\n}\n\n$context_size = 1 + $input_left_context + $input_right_context;\n($num_params < ($num_leaves + ($feat_dim * $context_size) + $num_hidden_layers + 1))\n  && die \"Invalid number of params $num_params\";\n\n## num_params = hidden_layer_size^2 * (num_hidden_layers-1)\n##            + hidden_layer_size * (num_leaves + feat_dim * context_size)\n## solve for hidden_layer_size = x.\n## a x^2 + b x + c, with\n## a = num_hidden_layers - 1\n## b = num_leaves + feat_dim * context_size\n## c = -num_params\n\n$a = $num_hidden_layers - 1;\n$b = $num_leaves + $feat_dim * $context_size;\n$c = -$num_params;\n\nif ($a > 0) {\n  $hidden_layer_size =  int((-$b + sqrt($b*$b - 4*$a*$c)) / (2*$a));\n} else {\n  $hidden_layer_size = int(-$c/$b);\n}\n\n\n$actual_num_params = $hidden_layer_size * $hidden_layer_size * ($num_hidden_layers - 1)\n                   + $hidden_layer_size * ($num_leaves + $feat_dim * $context_size);\n\nif (abs($actual_num_params - $num_params) > 0.1 * $num_params) {\n  print STDERR \"Warning: make_nnet_config.pl: possible failure $actual_num_params != $num_params\";\n}\n\nif ($input_left_context + $input_right_context != 0) {\n  # First component has to be splicing component...\n  # Note: we might be interested in decorrelating this e.g. 
with\n  # DCT layer at some point, but for now, splicing isn't seeming to be\n  # that useful.\n  print \"SpliceComponent input-dim=$feat_dim left-context=$input_left_context right-context=$input_right_context\\n\";\n}\n$cur_input_dim = $feat_dim * (1 + $input_left_context + $input_right_context);\n\nfor ($hidden_layer = 0; $hidden_layer < $initial_num_hidden_layers; $hidden_layer++) {\n  $param_stddev = $param_stddev_factor * 1.0 / sqrt($cur_input_dim);\n  print \"AffineComponent$nobias input-dim=$cur_input_dim output-dim=$hidden_layer_size \" .\n    \"learning-rate=$learning_rate param-stddev=$param_stddev bias-stddev=$bias_stddev\\n\";\n  $cur_input_dim = $hidden_layer_size;\n  print \"TanhComponent dim=$cur_input_dim\\n\";\n}\n\nif ($single_layer_config ne \"\") {\n  # Create a config file we'll use to add new hidden layers.\n  open(F, \">$single_layer_config\") || die \"Error opening $single_layer_config for output\";\n  $param_stddev = $param_stddev_factor * 1.0 / sqrt($hidden_layer_size);\n  print F \"AffineComponent$nobias input-dim=$hidden_layer_size output-dim=$hidden_layer_size \" .\n    \"learning-rate=$learning_rate param-stddev=$param_stddev bias-stddev=$bias_stddev\\n\";\n  print F \"TanhComponent dim=$hidden_layer_size\\n\";\n  close (F) || die \"Closing config file\";\n}\n\n## Now the output layer.\nprint \"AffineComponent$nobias input-dim=$cur_input_dim output-dim=$num_leaves \" .\n  \"learning-rate=$learning_rate param-stddev=0 bias-stddev=0\\n\"; # we just set the parameters to zero for this layer.\n## the softmax nonlinearity.\nprint \"SoftmaxComponent dim=$num_leaves\\n\";\n\n##\n"
  },
  {
    "path": "egs/utils/nnet-cpu/make_nnet_config_block.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# These options can be useful if we want to splice the input\n# features across time.\n$input_left_context = 0;\n$input_right_context = 0;\n$param_stddev_factor = 1.0;  # can be used to adjust initial variance\n  # of parameters.\n$initial_num_hidden_layers = -1; # if >= 0, the number of hidden layers\n  # the model should start with, which may be less than the final number\n  # (the final number is used to calculate the #neurons).\n$single_layer_config = \"\";\n\nfor ($x = 1; $x < 10; $x++) {\n  if ($ARGV[0] eq \"--input-left-context\") {\n    $input_left_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--input-right-context\") {\n    $input_right_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--param-stddev-factor\") {\n    $param_stddev_factor = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--initial-num-hidden-layers\") {\n    $initial_num_hidden_layers = $ARGV[1];\n    $single_layer_config = $ARGV[2];\n    shift; shift; shift;\n  }\n}\n\n\nif (@ARGV != 5) {\n  print STDERR \"Usage: make_nnet_config_block.pl  [options] <feat-dim> <num-leaves> <num-hidden-layers> <num-blocks> <num-parameters>  >config-file\nOptions:\n   --input-left-context <n>        #  
#frames of left context for input features; default 0.\n   --input-right-context <n>       #  #frames of right context for input features; default 0.\n   --param-stddev-factor <f>       #  Factor which can be used to modify the standard deviation of\n                                   #  randomly initialized parameters (default, 1.  Gets multiplied by\n                                   #  1/sqrt of number of inputs).\n   --initial-num-hidden-layers <n> <config-file>   #  If >0, number of hidden layers to initialize the network with.\n                                   #  In this case, the positional parameter <num-hidden-layers> is only\n                                   #  used to work out the number of units per hidden layer (based on\n                                   #  parameter count), and we write to <config-file> the config corresponding\n                                   #  to a single hidden layer.\\n\";\n     exit(1);\n}\n\n($feat_dim, $num_leaves, $num_hidden_layers, $num_blocks, $num_params) = @ARGV;\n\n($input_left_context < 0) &&  die \"Invalid input left context $input_left_context\";\n($input_right_context < 0) &&  die \"Invalid input right context $input_right_context\";\n($feat_dim <= 0) &&  die \"Invalid feature dimension $feat_dim\";\n($num_leaves <= 0) && die \"Invalid number of leaves $num_leaves\";\n($num_blocks <= 0) && die \"Invalid number of blocks $num_blocks\";\n($num_blocks > 20) && die \"Implausibly high number of blocks $num_blocks\";\n($num_hidden_layers <= 0) && die \"Invalid number of hidden layers $num_hidden_layers\";\nif ($initial_num_hidden_layers < 0) {\n  $initial_num_hidden_layers = $num_hidden_layers;\n}\nif ($initial_num_hidden_layers > $num_hidden_layers) {\n  print STDERR \"Initial number of hidden layers is more than #hidden layers.\\n\" .\n    \"This does not really make sense but continuing anyway.\";\n}\n\n$context_size = 1 + $input_left_context + $input_right_context;\n($num_params < ($num_leaves + ($feat_dim * 
$context_size) + $num_hidden_layers + 1))\n  && die \"Invalid number of params $num_params\";\n\n## num_params = hidden_layer_size^2/num_blocks * (num_hidden_layers-1)\n##            + hidden_layer_size * (num_leaves + feat_dim * context_size)\n## solve for hidden_layer_size = x.\n## a x^2 + b x + c, with\n## a = (num_hidden_layers - 1) / num_blocks\n## b = num_leaves + feat_dim * context_size\n## c = -num_params\n\n$a = ($num_hidden_layers - 1) / ($num_blocks * 1.0); # * 1.0 to make sure it's float.\n$b = $num_leaves + $feat_dim * $context_size;\n$c = -$num_params;\n\nif ($a > 0) {\n  $hidden_layer_size =  int((-$b + sqrt($b*$b - 4*$a*$c)) / (2*$a));\n} else {\n  $hidden_layer_size = int(-$c/$b);\n}\n##  make sure num_blocks divides hidden_layer_size.\n$hidden_layer_size -= $hidden_layer_size % $num_blocks;\n\n$actual_num_params = ($hidden_layer_size * $hidden_layer_size)/$num_blocks * ($num_hidden_layers - 1)\n                   + $hidden_layer_size * ($num_leaves + $feat_dim * $context_size);\n\nif (abs($actual_num_params - $num_params) > 0.1 * $num_params) {\n  print STDERR \"Warning: $0: possible failure $actual_num_params != $num_params\";\n}\n\nif ($input_left_context + $input_right_context != 0) {\n  # First component has to be splicing component...\n  # Note: we might be interested in decorrelating this e.g. 
with\n  # DCT layer at some point, but for now, splicing isn't seeming to be\n  # that useful.\n  print \"SpliceComponent input-dim=$feat_dim left-context=$input_left_context right-context=$input_right_context\\n\";\n}\n$cur_input_dim = $feat_dim * (1 + $input_left_context + $input_right_context);\n\nfor ($hidden_layer = 0; $hidden_layer < $initial_num_hidden_layers; $hidden_layer++) {\n  if ($hidden_layer == 0) {\n    $param_stddev = $param_stddev_factor * 1.0 / sqrt($cur_input_dim);\n    print \"AffineComponent input-dim=$cur_input_dim output-dim=$hidden_layer_size \" .\n      \"param-stddev=$param_stddev\\n\";\n    print \"TanhComponent dim=$hidden_layer_size\\n\";\n  } else {\n    $param_stddev = $param_stddev_factor * 1.0 / sqrt($cur_input_dim / $num_blocks);\n    print \"PermuteComponent dim=$cur_input_dim\\n\";\n    print \"BlockAffineComponent num-blocks=$num_blocks input-dim=$cur_input_dim output-dim=$hidden_layer_size \" .\n      \"param-stddev=$param_stddev\\n\";\n    print \"TanhComponent dim=$hidden_layer_size\\n\";\n  }\n  $cur_input_dim = $hidden_layer_size;\n}\n\nif ($single_layer_config ne \"\") {\n  # Create a config file we'll use to add new hidden layers.\n  open(F, \">$single_layer_config\") || die \"Error opening $single_layer_config for output\";\n  $param_stddev = $param_stddev_factor * 1.0 / sqrt($hidden_layer_size);\n  print F \"PermuteComponent dim=$hidden_layer_size\\n\";\n  print F \"BlockAffineComponent num-blocks=$num_blocks input-dim=$hidden_layer_size output-dim=$hidden_layer_size \" .\n    \"param-stddev=$param_stddev\\n\";\n  print F \"TanhComponent dim=$hidden_layer_size\\n\";\n  close (F) || die \"Closing config file\";\n}\n\n## Now the output layer.\nprint \"AffineComponent input-dim=$cur_input_dim output-dim=$num_leaves \" .\n  \"param-stddev=0\\n\"; # we just set the parameters to zero for this layer.\n## the softmax nonlinearity.\nprint \"SoftmaxComponent dim=$num_leaves\\n\";\n\n##\n"
  },
  {
    "path": "egs/utils/nnet-cpu/make_nnet_config_preconditioned.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# These options can be useful if we want to splice the input\n# features across time.\n$input_left_context = 0;\n$input_right_context = 0;\n$param_stddev_factor = 1.0;  # can be used to adjust initial variance\n  # of parameters.\n$initial_num_hidden_layers = -1; # if >= 0, the number of hidden layers\n  # the model should start with, which may be less than the final number\n  # (the final number is used to calculate the #neurons).\n$single_layer_config = \"\"; # a file to which we'll output a config corresponding\n       # to a single layer; we'll later use this to add layers to the neural\n       # network.\n$bias_stddev = 2.0;  # Standard deviation for random initialization of the\n                     # bias terms (mean is zero).\n$splice_max_context = 0; # Relates to SpliceMaxComponent (experimental feature)\n$learning_rate = 0.001;\n$max_change = 0.0;\n$nonlinear_component_type = \"Tanh\";\n\n$alpha = 4.0;\n$l2_penalty_opt = \"\"; # Option for AffineComponentPreconditioned layer.\n$tree_map = \"\"; # If supplied, a text file that maps from l2 to l1 tree nodes (output\n   # by build-tree-two-level).  
Used for initializing mixture-prob component.\n\n$splice_context = 0;\n$dropout_scale = -1.0; # if not -1.0, scale for \"lower\" part of \n                       # dropout scale, typically 0 <= dropout_scale < 1.\n$additive_noise_stddev = 0.0; # I didn't find this helpful either.\n$lda_dim = 0;\n$expand_power = 1;\n$expand_scale = 1.0;\n$lda_mat = \"\";\n\nfor ($x = 1; $x < 10; $x++) {\n  if ($ARGV[0] eq \"--input-left-context\") {\n    $input_left_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--l2-penalty\") {\n    my $l2_penalty = $ARGV[1];\n    $l2_penalty_opt = \"l2-penalty=$l2_penalty\";\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--dropout-scale\") {\n    $dropout_scale = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--expand-power\") {\n    $expand_power = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--expand-scale\") {\n    $expand_scale = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--max-change\") {\n    $max_change = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--additive-noise-stddev\") {\n    $additive_noise_stddev = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--nonlinear-component-type\") {\n    $nonlinear_component_type = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--lda-mat\") {\n    $splice_context = $ARGV[1];\n    $lda_dim = $ARGV[2];\n    $lda_mat = $ARGV[3];\n    shift; shift; shift; shift;\n  }\n  if ($ARGV[0] eq \"--input-right-context\") {\n    $input_right_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--param-stddev-factor\") {\n    $param_stddev_factor = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--bias-stddev\") {\n    $bias_stddev = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--alpha\") {\n    $alpha = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--splice-max-context\") {\n    $splice_max_context = $ARGV[1];\n    shift; shift;\n  }\n  if ($ARGV[0] eq \"--learning-rate\") {\n    $learning_rate = $ARGV[1];\n   
 shift; shift;\n  }\n  if ($ARGV[0] eq \"--initial-num-hidden-layers\") {\n    $initial_num_hidden_layers = $ARGV[1];\n    $single_layer_config = $ARGV[2];\n    shift; shift; shift;\n  }\n  if ($ARGV[0] eq \"--tree-map\") { # Note: this was for an idea that\n    # didn't end up working for me; it relates to SCTM-like systems.\n    $tree_map = $ARGV[1];\n    shift; shift;\n  }\n}\n\n\nif (@ARGV != 4) {\n  print STDERR \"Usage: make_nnet_config_preconditioned.pl  [options] <feat-dim> <num-leaves> <num-hidden-layers> <num-parameters>  >config-file\nOptions:\n   --input-left-context <n>        #  #frames of left context for input features; default 0 (this is separate from pre-LDA splicing).\n   --input-right-context <n>       #  #frames of right context for input features; default 0 (this is separate from pre-LDA splicing).\n   --param-stddev-factor <f>       #  Factor which can be used to modify the standard deviation of\n                                   #  randomly initialized parameters (default, 1.  Gets multiplied by\n                                   #  1/sqrt of number of inputs).\n   --initial-num-hidden-layers <n> <config-file>   #  If >0, number of hidden layers to initialize the network with.\n                                   #  In this case, the positional parameter <num-hidden-layers> is only\n                                   #  used to work out the number of units per hidden layer (based on\n                                   #  parameter count), and we write to <config-file> the config corresponding\n                                   #  to a single hidden layer.\n   --alpha <f>                     #  Factor (default 0.1) which affects the preconditioning.  
0 < alpha <= 1;\n                                   #  smaller means more aggressive preconditioning / less smoothing of the Fisher\n                                   #  matrix.\n   --learning-rate <f>             # Initial learning rate, default 0.001\n   --lda-mat <splice-width> <lda-dimension> <lda-matrix-filename>  # Allows the user to specify splice-and-lda\n                                   # with a given transformation, as a fixed component in the network.  E.g.\n                                   # splice-width of 4 represents context of +- 4 frames.  Here, lda-dimension is\n                                   # the output dimension of LDA, which must be the same as in the file.\\n\";\n  exit(1);\n}\n\n($feat_dim, $num_leaves, $num_hidden_layers, $num_params) = @ARGV;\n($input_left_context < 0) &&  die \"Invalid input left context $input_left_context\";\n($input_right_context < 0) &&  die \"Invalid input right context $input_right_context\";\n($feat_dim <= 0) &&  die \"Invalid feature dimension $feat_dim\";\n($num_leaves <= 0) && die \"Invalid number of leaves $num_leaves\";\n($num_hidden_layers <= 0) && die \"Invalid number of hidden layers $num_hidden_layers\";\nif ($initial_num_hidden_layers < 0) {\n  $initial_num_hidden_layers = $num_hidden_layers;\n}\nif ($initial_num_hidden_layers > $num_hidden_layers) {\n  print STDERR \"Initial number of hidden layers is more than #hidden layers.\\n\" .\n    \"This does not really make sense but continuing anyway.\";\n}\n\n$context_size = 1 + $input_left_context + $input_right_context;\n($num_params < ($num_leaves + ($feat_dim * $context_size) + $num_hidden_layers + 1))\n  && die \"Invalid number of params $num_params\";\n\n## num_params = hidden_layer_size^2 * (num_hidden_layers-1)\n##            + hidden_layer_size * (num_leaves + feat_dim * context_size * expand_power)\n## solve for hidden_layer_size = x.\n## a x^2 + b x + c, with\n## a = num_hidden_layers - 1\n## b = num_leaves + feat_dim * context_size\n## c = 
-num_params\n\n$a = $num_hidden_layers - 1;\n$b = $num_leaves + $feat_dim * $context_size * $expand_power;\n$c = -$num_params;\n\nif ($a > 0) {\n  $hidden_layer_size =  int((-$b + sqrt($b*$b - 4*$a*$c)) / (2*$a));\n} else {\n  $hidden_layer_size = int(-$c/$b);\n}\n\n\n$actual_num_params = $hidden_layer_size * $hidden_layer_size * ($num_hidden_layers - 1)\n                   + $hidden_layer_size * ($num_leaves + $feat_dim * $context_size * $expand_power);\n\nif (abs($actual_num_params - $num_params) > 0.1 * $num_params) {\n  print STDERR \"Warning: make_nnet_config.pl: possible failure $actual_num_params != $num_params\";\n}\n\nif ($splice_context > 0) { # --lda-mat <splice-context> <lda-matrix> was specified...\n  print \"SpliceComponent input-dim=$feat_dim left-context=$splice_context right-context=$splice_context\\n\";\n  print \"FixedLinearComponent matrix=$lda_mat\\n\"; # specify the filename.\n  $feat_dim = $lda_dim; # This is now the input dimension.\n}\n\nif ($splice_max_context > 0) {\n  print \"SpliceMaxComponent dim=$feat_dim left-context=$splice_max_context right-context=$splice_max_context\\n\";\n}\n\n\nif ($input_left_context + $input_right_context != 0) {\n  # First component has to be splicing component...\n  # Note: we might be interested in decorrelating this e.g. 
with\n  # DCT layer at some point, but for now, splicing isn't seeming to be\n  # that useful.\n  print \"SpliceComponent input-dim=$feat_dim left-context=$input_left_context right-context=$input_right_context\\n\";\n}\n$cur_input_dim = $feat_dim * (1 + $input_left_context + $input_right_context);\n\nif ($expand_power > 1) {\n  print \"PowerExpandComponent input-dim=$cur_input_dim max-power=$expand_power higher-power-scale=$expand_scale\\n\";\n  $cur_input_dim *= $expand_power;\n}\n\nfor ($hidden_layer = 0; $hidden_layer < $initial_num_hidden_layers; $hidden_layer++) {\n  $param_stddev = $param_stddev_factor * 1.0 / sqrt($cur_input_dim);\n  print \"AffineComponentPreconditioned input-dim=$cur_input_dim output-dim=$hidden_layer_size alpha=$alpha max-change=$max_change \" .\n    \"$l2_penalty_opt learning-rate=$learning_rate param-stddev=$param_stddev bias-stddev=$bias_stddev\\n\";\n  $cur_input_dim = $hidden_layer_size;\n  print \"${nonlinear_component_type}Component dim=$cur_input_dim\\n\";\n  if ($dropout_scale != -1.0) {\n    print \"DropoutComponent dim=$cur_input_dim dropout-scale=$dropout_scale\\n\";\n  }\n  if ($additive_noise_stddev != 0.0) {\n    print \"AdditiveNoiseComponent dim=$cur_input_dim stddev=$additive_noise_stddev\\n\";\n  }\n}\n\nif ($single_layer_config ne \"\") {\n  # Create a config file we'll use to add new hidden layers.\n  open(F, \">$single_layer_config\") || die \"Error opening $single_layer_config for output\";\n  $param_stddev = $param_stddev_factor * 1.0 / sqrt($hidden_layer_size);\n  print F \"AffineComponentPreconditioned input-dim=$hidden_layer_size output-dim=$hidden_layer_size alpha=$alpha max-change=$max_change \" .\n    \"$l2_penalty_opt learning-rate=$learning_rate param-stddev=$param_stddev bias-stddev=$bias_stddev\\n\";\n  print F \"${nonlinear_component_type}Component dim=$hidden_layer_size\\n\";\n  if ($dropout_scale != -1.0) {\n    print F \"DropoutComponent dim=$cur_input_dim dropout-scale=$dropout_scale\\n\";\n  }\n  if 
($additive_noise_stddev != 0.0) {\n    print F \"AdditiveNoiseComponent dim=$cur_input_dim stddev=$additive_noise_stddev\\n\";\n  }\n  close (F) || die \"Closing config file\";\n}\n\n## Now the output layer.\nprint \"AffineComponentPreconditioned input-dim=$cur_input_dim output-dim=$num_leaves alpha=$alpha max-change=$max_change \" .\n  \"$l2_penalty_opt learning-rate=$learning_rate param-stddev=0 bias-stddev=0\\n\"; # we just set the parameters to zero for this layer.\n## the softmax nonlinearity.\nprint \"SoftmaxComponent dim=$num_leaves\\n\";\n\nif ($tree_map ne \"\") {\n  # Create a MixtureProbComponent at the end, that shares \"Gaussians\"\n  # among leaves that share the same level-1 tree index.\n  open(F, \"<$tree_map\") || die \"opening tree map file $tree_map\";\n  $map = <F>;\n  close(F);\n  $map =~ s/\\s*\\[\\s*// || die \"Unexpected data in tree map file $tree_map\";\n  $map =~ s/\\s*\\]\\s*// || die \"Unexpected data in tree map file $tree_map\";\n  @map = split(\" \", $map);\n  @dims = ();\n  while (@map > 0) {\n    $index = shift @map;\n    $n = 1;\n    while (@map > 0 && $map[0] == $index) { shift @map; $n++; }\n    push @dims, $n;\n  }\n  $dims = join(\":\", @dims);\n  print \"MixtureProbComponent learning-rate=$learning_rate diag-element=0.9 dims=$dims\\n\";\n}\n\n##\n"
  },
  {
    "path": "egs/utils/nnet-cpu/update_learning_rates.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# This script takes three command-line arguments.\n# The first is a log-file such as exp/tri4b_nnet/log/combine.10.log,\n# which is the output of nnet-combine.  The second is a file such\n# as exp/tri4b_nnet/11.tmp.mdl, i.e. a model file, for which we will\n# update the learning rates; the third is the output nnet file e.g.\n# exp/tri4b_nnet/11.mdl\n\n# This script assumes that the \"combine\" script is called as:\n# nnet-combine <old-model> <new-model-1> <new-model-2> ... 
<new-model-n> <validation-examples> <output-model>.\n# It gets from the logging output a line like this:\n# LOG (nnet-combine:CombineNnets():combine-nnet.cc:184) Combining nnets, validation objf per frame changed from -1.43424 to -1.42067, scale factors are  [ 0.727508 0.79889 0.299533 0.137696 -0.0479123 0.210445 0.0195638 0.123843 0.167453 0.0193894 -0.0128672 0.178384 0.0516549 0.0958205 0.125495 ]\n# [in this case the 1st 3 numbers correspond to the <old-model> ] and for each\n# updatable layer, it works out the total weight on the new models.\n# It interprets this as being (for each layer) a step length along\n# the path old-model -> new-model.\n# Basically, we change the learning rate by a factor equal to this step length,\n# subject to limits on the change  [by default limit to halving/doubling].\n# It's fairly obvious why we would want do do this.\n\n# These options can be useful if we want to splice the input\n# features across time.\n$sources_to_exclude = 1; # may make this configurable later.\n$min_learning_rate_factor = 0.5;\n$max_learning_rate_factor = 2.0;\n$min_learning_rate = 0.0001; # Put a floor because if too small,\n  # the changes become zero due to roundoff.\n\nif (@ARGV > 0) {\n  for ($x = 1; $x < 10; $x++) {\n    if ($ARGV[0] eq \"--min-learning-rate-factor\") {\n      $min_learning_rate_factor = $ARGV[1];\n      shift; shift;\n    }\n    if ($ARGV[0] eq \"--max-learning-rate-factor\") {\n      $max_learning_rate_factor = $ARGV[1];\n      shift; shift;\n    }\n    if ($ARGV[0] eq \"--min-learning-rate\") {\n      $min_learning_rate = $ARGV[1];\n      shift; shift;\n    }\n  }\n}\n\n\nif (@ARGV != 3) {\n  print STDERR \"Usage: update_learning_rates.pl [options] <log-file-for-nnet-combine> <nnet-in> <nnet-out>\nOptions:\n   --min-learning-rate-factor       #  minimum factor to change learning rate by (default: 0.5)\n   --max-learning-rate-factor       #  maximum factor to change learning rate by (default: 2.0)\\n\";\n   
exit(1);\n}\n\n($combine_log, $nnet_in, $nnet_out) = @ARGV;\n\nopen(L, \"<$combine_log\") || die \"Opening log file \\\"$combine_log\\\"\";\n\n\nwhile(<L>) {\n  if (m/Objective functions for the source neural nets are\\s+\\[(.+)\\]/) {\n    ## a line like:\n    ##  LOG (nnet-combine:GetInitialScaleParams():combine-nnet.cc:66) Objective functions for the source neural nets are  [ -1.37002 -1.52115 -1.52103 -1.50189 -1.51912 ]\n    @A = split(\" \", $1);\n    $num_sources = @A; # number of source neural nets (dimension of @A); 5 in this case.\n  }\n  ## a line like:\n  ## LOG (nnet-combine:CombineNnets():combine-nnet.cc:184) Combining nnets, validation objf per frame changed from -1.37002 to -1.36574, scale factors are  [ 0.819379 0.696122 0.458798 0.040513 -0.0448875 0.171431 0.0274615 0.139143 0.133846 0.0372585 0.114193 0.17944 0.0491838 0.0668778 0.0328936 ]\n  if (m/Combining nnets.+scale factors are\\s+\\[(.+)\\]/) {\n    @scale_factors = split(\" \", $1);\n  }\n}\n\nif (!defined $num_sources) {\n  die \"Log file $combine_log did not have expected format: no line with \\\"Objective functions\\\"\\n\";\n}\n# Test for an empty array; \"defined @array\" is a fatal error in Perl 5.22 and later.\nif (!@scale_factors) {\n  die \"Log file $combine_log did not have expected format: no line with \\\"Combining nnets\\\"\\n\";\n}\n\n\n$num_scales = @scale_factors; # length of the array.\nif ($num_scales % $num_sources != 0) {\n  die \"Error interpreting log file $combine_log: $num_sources does not divide $num_scales\\n\";\n}\nclose(L);\n\nopen(P, \"nnet-am-info $nnet_in |\") || die \"Opening pipe from nnet-am-info\";\n@learning_rates = ();\nwhile(<P>) {\n  if (m/learning rate = ([^,]+),/) {\n    push @learning_rates, $1;\n  }\n}\nclose(P);\n\n$num_layers = $num_scales / $num_sources;\n\n$num_info_learning_rates = @learning_rates;\nif ($num_layers != $num_info_learning_rates) {\n  die \"From log file we expect there to be $num_layers updatable components, but from the output of nnet-am-info we saw $num_info_learning_rates\";\n}\n\nfor ($layer = 0; 
$layer < $num_layers; $layer++) {\n  # getting the sum of the weights for this layer from all the non-excluded sources.\n  $sum = 0.0;\n  for ($source = $sources_to_exclude; $source < $num_sources; $source++) {\n    $index = ($source * $num_layers) + $layer;\n    $sum += $scale_factors[$index];\n  }\n  $learning_rate_factor = $sum;\n  if ($learning_rate_factor > $max_learning_rate_factor) { $learning_rate_factor = $max_learning_rate_factor; }\n  if ($learning_rate_factor < $min_learning_rate_factor) { $learning_rate_factor = $min_learning_rate_factor; }\n  $old_learning_rate = $learning_rates[$layer];\n  $new_learning_rate = $old_learning_rate * $learning_rate_factor;\n  if ($new_learning_rate < $min_learning_rate) { $new_learning_rate = $min_learning_rate; }\n  print STDERR \"For layer $layer, sum of weights of non-excluded sources is $sum, learning-rate factor is $learning_rate_factor\\n\";\n  $learning_rates[$layer] = $new_learning_rate;\n}\n\n$lrates_string=join(\":\", @learning_rates);\n\n$ret = system(\"nnet-am-copy --learning-rates=$lrates_string $nnet_in $nnet_out\");\n\nexit($ret != 0);\n"
  },
  {
    "path": "egs/utils/nnet3/convert_config_tdnn_to_affine.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2020  Yiming Wang\n#\n# Apache 2.0.\n\nimport argparse\nimport re\nimport sys\n\n\ndef get_parser():\n    parser = argparse.ArgumentParser(\n        description=\"\"\"\n        Convert a config file with Tdnn components to their equivalent\n        Affine/Linear components. Useful when we are using MACE (a deep learning\n        inference framework using Kaldi's trained models) that doesn't\n        support Tdnn components.\n        Usage:\n            convert_config_tdnn_to_affine.py exp/chain/tdnn_1a/configs/final.config > \\\\\n              exp/chain/tdnn_1a/configs/converted.config\n        \"\"\")\n    # fmt: off\n    parser.add_argument('input', type=str)\n    # fmt: on\n\n    return parser\n\n\ndef main(args):\n    offsets_dict = {}  # mapping from each TdnnComponent's name to its offsets\n    with open(args.input, 'r', encoding='utf-8') as f:\n        for line in f:\n            if (\n                (line.startswith('component ') and not 'type=TdnnComponent' in line)\n                or line.startswith('input-node')\n                or line.startswith('output-node')\n                or line.startswith('#')\n                or (line.strip() == '' and len(line) > 0)\n            ):  # normal component line (all but Tdnn) or input/output node or comments or empty\n                print(line.strip())\n            elif line.startswith('component-node'):\n                new_split_line = []\n                offsets = None\n                component = re.findall(r'component=(\\S+)', line)[-1]\n                if component in offsets_dict:\n                    offsets = offsets_dict[component]\n                for col in line.strip().split():\n                    if col.startswith('input=') and offsets is not None:  # converted from Tdnn with input splices\n                        inp = col.split('=')[1]\n                        offsets_str = [\n                            'Offset({}, {})'.format(inp, o) if o 
!= '0' else inp for o in offsets\n                        ]\n                        if len(offsets_str) > 1:\n                            new_split_line.append('input=Append({})'.format((', ').join(offsets_str)))\n                        else:\n                            new_split_line.append('input={}'.format(offsets_str[0]))\n                    else:\n                        new_split_line.append(col)\n                print(' '.join(new_split_line))\n            elif line != '':  # Tdnn component line\n                assert 'type=TdnnComponent' in line, line\n                use_bias = True\n                m = re.findall(r'use-bias=(\\w+)', line)\n                if len(m) > 0 and m[-1] == 'false':  # determine converting to Affine or Linear\n                    use_bias = False\n                new_split_line = []\n                offsets = re.findall(r'time-offsets=(\\S+)', line)\n                if len(offsets) > 0:  # extract time-offsets for determining input-dim below\n                    offsets = offsets[-1].split(',')  # -1 in case multiple fields of \"time-offsets\"\n                else:\n                    offsets = None\n                for col in line.strip().split():\n                    if col.startswith('name='):  # keep the name of Component\n                        name = col.split('=')[1]\n                        assert name not in offsets_dict\n                        new_split_line.append(col)\n                    elif col == 'type=TdnnComponent':  # convert Component type\n                        type_str = 'type={}'.format(\n                            'NaturalGradientAffineComponent' if use_bias else\n                            'LinearComponent'\n                        )\n                        new_split_line.append(type_str)\n                    elif col.startswith('input-dim='):  # change input-dim for Affine/Linear Component\n                        input_dim = int(col.split('=')[1])\n                        if offsets is 
not None:\n                            input_dim *= len(offsets)\n                        new_split_line.append('input-dim={}'.format(input_dim))\n                    elif col.startswith('time-offsets='):  # record time-offsets for component-node\n                        offsets_dict[name] = offsets\n                    elif not col.startswith('use-bias='):  # all the other fields: simply copy over\n                        new_split_line.append(col)\n                print(' '.join(new_split_line))\n\n\nif __name__ == '__main__':\n    parser = get_parser()\n    args = parser.parse_args()\n    main(args)\n"
  },
  {
    "path": "egs/utils/parallel/limit_num_gpus.sh",
    "content": "#!/usr/bin/env bash\n\n# This script functions as a wrapper of a bash command that uses GPUs.\n#\n# It sets the CUDA_VISIBLE_DEVICES variable so that it limits the number of GPUs\n# used for programs. It is neccesary for running a job on the grid if the job\n# would automatically grabs all resources available on the system, e.g. a\n# TensorFlow program.\n\nnum_gpus=1 # this variable indicates how many GPUs we will allow the command\n           # passed to this script will run on. We achieve this by setting the\n           # CUDA_VISIBLE_DEVICES variable\nset -e\n\nif [ \"$1\" == \"--num-gpus\" ]; then\n  num_gpus=$2\n  shift\n  shift\nfi\n\nif ! printf \"%d\" \"$num_gpus\" >/dev/null || [ $num_gpus -le -1 ]; then\n  echo $0: Must pass a positive interger or 0 after --num-gpus\n  echo e.g. $0 --num-gpus 2 local/tfrnnlm/run_lstm.sh\n  exit 1\nfi\n\nif [ $# -eq 0 ]; then\n  echo \"Usage:  $0 [--num-gpus <num-gpus>] <command> [<arg1>...]\"\n  echo \"Runs <command> with args after setting CUDA_VISIBLE_DEVICES to \"\n  echo \"make sure exactly <num-gpus> GPUs are visible (default: 1).\"\n  exit 1\nfi\n\nCUDA_VISIBLE_DEVICES=\nnum_total_gpus=`nvidia-smi -L | wc -l`\nnum_gpus_assigned=0\n\nif [ $num_gpus -eq 0 ] ; then\n    echo \"$0: Running the job on CPU. 
Disabling submitting to gpu\"\n    export CUDA_VISIBLE_DEVICES=\"\"\nelse\n    for i in `seq 0 $[$num_total_gpus-1]`; do\n    # going over all GPUs and check if it is idle, and add to the list if yes\n      if nvidia-smi -i $i | grep \"No running processes found\" >/dev/null; then\n        CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}$i, && num_gpus_assigned=$[$num_gpus_assigned+1]\n      fi\n    # once we have enough GPUs, break out of the loop\n      [ $num_gpus_assigned -eq $num_gpus ] && break\n    done\n\n    [ $num_gpus_assigned -ne $num_gpus ] && echo Could not find enough idle GPUs && exit 1\n\n    export CUDA_VISIBLE_DEVICES=$(echo $CUDA_VISIBLE_DEVICES | sed \"s=,$==g\")\n\n    echo \"$0: Running the job on GPU(s) $CUDA_VISIBLE_DEVICES\"\nfi\n\n\"$@\"\n"
  },
  {
    "path": "egs/utils/parallel/pbs.pl",
    "content": "#!/usr/bin/env perl\nuse strict;\nuse warnings;\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2014  Johns Hopkins University (Author: Vimal Manohar)\n#           2015  Queensland University of Technology (Author: Ahilan Kanagasundaram <a.kanagasundaram@qut.edu.au>)\n# Apache 2.0.\n\nuse File::Basename;\nuse Cwd;\nuse Getopt::Long;\n\n# This is a version of the queue.pl modified so that it works under PBS\n# The PBS is one of the several \"almost compatible\" queueing systems. The\n# command switches and environment variables are different, so we are adding\n# a this script. An optimal solution might probably be to make the variable\n# names and the commands configurable, as similar problems can be expected\n# with Torque, Univa... and who knows what else\n#\n# pbs.pl has the same functionality as run.pl, except that\n# it runs the job in question on the queue (PBS).\n# This version of pbs.pl uses the task array functionality\n# of PBS.  \n# The script now supports configuring the queue system using a config file\n# (default in conf/pbs.conf; but can be passed specified with --config option)\n# and a set of command line options.\n# The current script handles:\n# 1) Normal configuration arguments\n# For e.g. a command line option of \"--gpu 1\" could be converted into the option\n# \"-q g.q -l gpu=1\" to qsub. 
How the CLI option is handled is determined by a\n# line in the config file like\n# gpu=* -q g.q -l gpu=$0\n# $0 here in the line is replaced with the argument read from the CLI and the\n# resulting string is passed to qsub.\n# 2) Special arguments to options such as\n# gpu=0\n# If --gpu 0 is given in the command line, then no special \"-q\" is given.\n# 3) Default argument\n# default gpu=0\n# If --gpu option is not passed in the command line, then the script behaves as\n# if --gpu 0 was passed since 0 is specified as the default argument for that\n# option\n# 4) Arbitrary options and arguments.\n# Any command line option starting with '--' and its argument would be handled\n# as long as its defined in the config file.\n# 5) Default behavior\n# If the config file that is passed using is not readable, then the script\n# behaves as if the queue has the following config file:\n# $ cat conf/pbs.conf\n# # Default configuration\n# command qsub -v PATH -S /bin/bash -l arch=*64*\n# option mem=* -l mem_free=$0,ram_free=$0\n# option mem=0          # Do not add anything to qsub_opts\n# option num_threads=* -pe smp $0\n# option num_threads=1  # Do not add anything to qsub_opts\n# option max_jobs_run=* -tc $0\n# default gpu=0\n# option gpu=0 -q all.q\n# option gpu=* -l gpu=$0 -q g.q\n\nmy $qsub_opts = \"\";\nmy $sync = 0;\nmy $num_threads = 1;\nmy $gpu = 0;\n\nmy $config = \"conf/pbs.conf\";\n\nmy %cli_options = ();\n\nmy $jobname;\nmy $jobstart;\nmy $jobend;\n\nmy $array_job = 0;\n\nsub print_usage() {\n  print STDERR\n   \"Usage: pbs.pl [options] [JOB=1:n] log-file command-line arguments...\\n\" .\n   \"e.g.: pbs.pl foo.log echo baz\\n\" .\n   \" (which will echo \\\"baz\\\", with stdout and stderr directed to foo.log)\\n\" .\n   \"or: pbs.pl -q all.q\\@xyz foo.log echo bar \\| sed s/bar/baz/ \\n\" .\n   \" (which is an example of using a pipe; you can provide other escaped bash constructs)\\n\" .\n   \"or: pbs.pl -q all.q\\@qyz JOB=1:10 foo.JOB.log echo JOB \\n\" .\n   \" 
(which illustrates the mechanism to submit parallel jobs; note, you can use \\n\" .\n   \"  another string other than JOB)\\n\" .\n   \"Note: if you pass the \\\"-sync y\\\" option to qsub, this script will take note\\n\" .\n   \"and change its behavior.  Otherwise it uses qstat to work out when the job finished\\n\" .\n   \"Options:\\n\" .\n   \"  --config <config-file> (default: $config)\\n\" .\n   \"  --mem <mem-requirement> (e.g. --mem 2G, --mem 500M, \\n\" .\n   \"                           also support K and numbers mean bytes)\\n\" .\n   \"  --num-threads <num-threads> (default: $num_threads)\\n\" .\n   \"  --max-jobs-run <num-jobs>\\n\" .\n   \"  --gpu <0|1> (default: $gpu)\\n\";\n  exit 1;\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nfor (my $x = 1; $x <= 2; $x++) { # This for-loop is to\n  # allow the JOB=1:n option to be interleaved with the\n  # options to qsub.\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {\n    my $switch = shift @ARGV;\n\n    if ($switch eq \"-V\") {\n      $qsub_opts .= \"-V \";\n    } else {\n      my $argument = shift @ARGV;\n      if ($argument =~ m/^--/) {\n        print STDERR \"pbs.pl: Warning: suspicious argument '$argument' to $switch; starts with '-'\\n\";\n      }\n      if ($switch eq \"-sync\" && $argument =~ m/^[yY]/) {\n        $sync = 1;\n        $qsub_opts .= \"$switch $argument \";\n      } elsif ($switch eq \"-pe\") { # e.g. 
-pe smp 5\n        my $argument2 = shift @ARGV;\n        $qsub_opts .= \"$switch $argument $argument2 \";\n        $num_threads = $argument2;\n      } elsif ($switch =~ m/^--/) { # Config options\n        # Convert CLI option to variable name\n        # by removing '--' from the switch and replacing any\n        # '-' with a '_'\n        $switch =~ s/^--//;\n        $switch =~ s/-/_/g;\n        $cli_options{$switch} = $argument;\n      } else {  # Other qsub options - passed as is\n        $qsub_opts .= \"$switch $argument \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. JOB=1:20\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    shift;\n    # The job-range argument has already been shifted off @ARGV, so report\n    # it from the captured values rather than from $ARGV[0].\n    if ($jobstart > $jobend) {\n      die \"pbs.pl: invalid job range ${jobname}=${jobstart}:${jobend}\";\n    }\n    if ($jobstart <= 0) {\n      die \"pbs.pl: invalid job range ${jobname}=${jobstart}:${jobend}, start must be strictly positive (this is a GridEngine limitation).\";\n    }\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. JOB=1.\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"pbs.pl: Warning: suspicious first argument to pbs.pl: $ARGV[0]\\n\";\n  }\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nif (exists $cli_options{\"config\"}) {\n  $config = $cli_options{\"config\"};\n}\n\nmy $default_config_file = <<'EOF';\n# Default configuration\ncommand qsub -V -v PATH -S /bin/bash -l mem=4G\noption mem=* -l mem=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -l ncpus=$0\noption num_threads=1  # Do not add anything to qsub_opts\ndefault gpu=0\noption gpu=0\noption gpu=* -l ncpus=$0\nEOF\n\n# Here the configuration options specified by the user on the command line\n# (e.g. --mem 2G) are converted to options to the qsub system as defined in\n# the config file. (e.g. 
if the config file has the line\n# \"option mem=* -l ram_free=$0,mem_free=$0\"\n# and the user has specified '--mem 2G' on the command line, the options\n# passed to queue system would be \"-l ram_free=2G,mem_free=2G\n# A more detailed description of the ways the options would be handled is at\n# the top of this file.\n\nmy $opened_config_file = 1;\n\nopen CONFIG, \"<$config\" or $opened_config_file = 0;\n\nmy %cli_config_options = ();\nmy %cli_default_options = ();\n\nif ($opened_config_file == 0 && exists($cli_options{\"config\"})) {\n  print STDERR \"Could not open config file $config\\n\";\n  exit(1);\n} elsif ($opened_config_file == 0 && !exists($cli_options{\"config\"})) {\n  # Open the default config file instead\n  open (CONFIG, \"echo '$default_config_file' |\") or die \"Unable to open pipe\\n\";\n  $config = \"Default config\";\n}\n\nmy $qsub_cmd = \"\";\nmy $read_command = 0;\n\nwhile(<CONFIG>) {\n  chomp;\n  my $line = $_;\n  $_ =~ s/\\s*#.*//g;\n  if ($_ eq \"\") { next; }\n  if ($_ =~ /^command (.+)/) {\n    $read_command = 1;\n    $qsub_cmd = $1 . \" \";\n  } elsif ($_ =~ m/^option ([^=]+)=\\* (.+)$/) {\n    # Config option that needs replacement with parameter value read from CLI\n    # e.g.: option mem=* -l mem_free=$0,ram_free=$0\n    my $option = $1;     # mem\n    my $arg= $2;         # -l mem_free=$0,ram_free=$0\n    if ($arg !~ m:\\$0:) {\n      die \"Unable to parse line '$line' in config file ($config)\\n\";\n    }\n    if (exists $cli_options{$option}) {\n      # Replace $0 with the argument read from command line.\n      # e.g. \"-l mem_free=$0,ram_free=$0\" -> \"-l mem_free=2G,ram_free=2G\"\n      $arg =~ s/\\$0/$cli_options{$option}/g;\n      $cli_config_options{$option} = $arg;\n    }\n  } elsif ($_ =~ m/^option ([^=]+)=(\\S+)\\s?(.*)$/) {\n    # Config option that does not need replacement\n    # e.g. 
option gpu=0 -q all.q\n    my $option = $1;      # gpu\n    my $value = $2;       # 0\n    my $arg = $3;         # -q all.q\n    if (exists $cli_options{$option}) {\n      $cli_default_options{($option,$value)} = $arg;\n    }\n  } elsif ($_ =~ m/^default (\\S+)=(\\S+)/) {\n    # Default options. Used for setting default values to options i.e. when\n    # the user does not specify the option on the command line\n    # e.g. default gpu=0\n    my $option = $1;  # gpu\n    my $value = $2;   # 0\n    if (!exists $cli_options{$option}) {\n      # If the user has specified this option on the command line, then we\n      # don't have to do anything\n      $cli_options{$option} = $value;\n    }\n  } else {\n    print STDERR \"pbs.pl: unable to parse line '$line' in config file ($config)\\n\";\n    exit(1);\n  }\n}\n\nclose(CONFIG);\n\nif ($read_command != 1) {\n  print STDERR \"pbs.pl: config file ($config) does not contain the line \\\"command .*\\\"\\n\";\n  exit(1);\n}\n\nfor my $option (keys %cli_options) {\n  if ($option eq \"config\") { next; }\n  if ($option eq \"max_jobs_run\" && $array_job != 1) { next; }\n  my $value = $cli_options{$option};\n\n  if (exists $cli_default_options{($option,$value)}) {\n    $qsub_opts .= \"$cli_default_options{($option,$value)} \";\n  } elsif (exists $cli_config_options{$option}) {\n    $qsub_opts .= \"$cli_config_options{$option} \";\n  } else {\n    if ($opened_config_file == 0) { $config = \"default config file\"; }\n    die \"pbs.pl: Command line option $option not described in $config (or value '$value' not allowed)\\n\";\n  }\n}\n\nmy $cwd = getcwd();\nmy $logfile = shift @ARGV;\n\nif ($array_job == 1 && $logfile !~ m/$jobname/\n    && $jobend > $jobstart) {\n  print STDERR \"pbs.pl: you are trying to run a parallel job but \"\n    . 
\"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n#\n# Work out the command; quote escaping is done here.\n# Note: the rules for escaping stuff are worked out pretty\n# arbitrarily, based on what we want it to do.  Some things that\n# we pass as arguments to pbs.pl, such as \"|\", we want to be\n# interpreted by bash, so we don't escape them.  Other things,\n# such as archive specifiers like 'ark:gunzip -c foo.gz|', we want\n# to be passed, in quotes, to the Kaldi program.  Our heuristic\n# is that stuff with spaces in should be quoted.  This doesn't\n# always work.\n#\nmy $cmd = \"\";\n\nforeach my $x (@ARGV) {\n  if ($x =~ m/^\\S+$/) { $cmd .= $x . \" \"; } # If string contains no spaces, take\n                                            # as-is.\n  elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; } # else if no dbl-quotes, use single\n  else { $cmd .= \"\\\"$x\\\" \"; }  # else use double.\n}\n\n#\n# Work out the location of the script file, and open it for writing.\n#\nmy $dir = dirname($logfile);\nmy $base = basename($logfile);\nmy $qdir = \"$dir/q\";\n$qdir =~ s:/(log|LOG)/*q:/q:; # If qdir ends in .../log/q, make it just .../q.\nmy $queue_logfile = \"$qdir/$base\";\n\nif (!-d $dir) { system \"mkdir -p $dir 2>/dev/null\"; } # another job may be doing this...\nif (!-d $dir) { die \"Cannot make the directory $dir\\n\"; }\n# make a directory called \"q\",\n# where we will put the log created by qsub... normally this doesn't contain\n# anything interesting, evertyhing goes to $logfile.\nif (! -d \"$qdir\") {\n  system \"mkdir $qdir 2>/dev/null\";\n  sleep(5); ## This is to fix an issue we encountered in denominator lattice creation,\n  ## where if e.g. the exp/tri2b_denlats/log/15/q directory had just been\n  ## created and the job immediately ran, it would die with an error because nfs\n  ## had not yet synced.  
I'm also decreasing the acdirmin and acdirmax in our\n  ## NFS settings to something like 5 seconds.\n}\n\nmy $queue_array_opt = \"\";\nif ($array_job == 1) { # It's an array job.\n  $queue_array_opt = \"-J $jobstart-$jobend\";\n  $logfile =~ s/$jobname/\\$PBS_ARRAY_INDEX/g; # This variable will get\n  # replaced by qsub, in each job, with the job-id.\n  $cmd =~ s/$jobname/\\$\\{PBS_ARRAY_INDEX\\}/g; # same for the command...\n  $queue_logfile =~ s/\\.?$jobname//; # the log file in the q/ subdirectory\n  # is for the queue to put its log, and this doesn't need the task array subscript\n  # so we remove it.\n}\n\n# queue_scriptfile is as $queue_logfile [e.g. dir/q/foo.log] but\n# with the suffix .sh.\nmy $queue_scriptfile = $queue_logfile;\n($queue_scriptfile =~ s/\\.[a-zA-Z]{1,5}$/.sh/) || ($queue_scriptfile .= \".sh\");\nif ($queue_scriptfile !~ m:^/:) {\n  $queue_scriptfile = $cwd . \"/\" . $queue_scriptfile; # just in case.\n}\n\n# We'll write to the standard input of \"qsub\" (the file-handle Q),\n# the job that we want it to execute.\n# Also keep our current PATH around, just in case there was something\n# in it that we need (although we also source ./path.sh)\n\nmy $syncfile = \"$qdir/done.$$\";\n\nsystem(\"rm $queue_logfile $syncfile 2>/dev/null\");\n#\n# Write to the script file, and then close it.\n#\nopen(Q, \">$queue_scriptfile\") || die \"Failed to write to $queue_scriptfile\";\n\nprint Q \"#!/bin/bash\\n\";\nprint Q \"cd $cwd\\n\";\nprint Q \". 
./path.sh\\n\";\nprint Q \"( echo '#' Running on \\`hostname\\`\\n\";\nprint Q \"  echo '#' Started at \\`date\\`\\n\";\nprint Q \"  echo -n '# '; cat <<EOF\\n\";\nprint Q \"$cmd\\n\"; # this is a way of echoing the command into a comment in the log file,\nprint Q \"EOF\\n\"; # without having to escape things like \"|\" and quote characters.\nprint Q \") >$logfile\\n\";\nprint Q \"time1=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \" ( $cmd ) 2>>$logfile >>$logfile\\n\";\nprint Q \"ret=\\$?\\n\";\nprint Q \"time2=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \"echo '#' Accounting: time=\\$((\\$time2-\\$time1)) threads=$num_threads >>$logfile\\n\";\nprint Q \"echo '#' Finished at \\`date\\` with status \\$ret >>$logfile\\n\";\nprint Q \"[ \\$ret -eq 137 ] && exit 100;\\n\"; # If process was killed (e.g. oom) it will exit with status 137;\n  # let the script return with status 100 which will put it to E state; more easily rerunnable.\nif ($array_job == 0) { # not an array job\n  print Q \"touch $syncfile\\n\"; # so we know it's done.\n} else {\n  print Q \"touch $syncfile.\\$PBS_ARRAY_INDEX\\n\"; # touch a bunch of sync-files.\n}\nprint Q \"exit \\$[\\$ret ? 1 : 0]\\n\"; # avoid status 100 which grid-engine\nprint Q \"## submitted with:\\n\";       # treats specially.\n$qsub_cmd .= \"-o $queue_logfile $qsub_opts $queue_array_opt $queue_scriptfile >>$queue_logfile 2>&1\";\nprint Q \"# $qsub_cmd\\n\";\nif (!close(Q)) { # close was not successful... 
|| die \"Could not close script file $shfile\";\n  die \"Failed to close the script file (full disk?)\";\n}\n\nmy $ret = system ($qsub_cmd);\nif ($ret != 0) {\n  if ($sync && $ret == 256) { # this is the exit status when a job failed (bad exit status)\n    if (defined $jobname) { $logfile =~ s/\\$PBS_ARRAY_INDEX/*/g; }\n    print STDERR \"pbs.pl: job writing to $logfile failed\\n\";\n  } else {\n    print STDERR \"pbs.pl: error submitting jobs to queue (return status was $ret)\\n\";\n    print STDERR \"queue log file is $queue_logfile, command was $qsub_cmd\\n\";\n    print STDERR `tail $queue_logfile`;\n  }\n  exit(1);\n}\n\nmy $pbs_job_id;\nif (! $sync) { # We're not submitting with -sync y, so we\n  # need to wait for the jobs to finish.  We wait for the\n  # sync-files we \"touched\" in the script to exist.\n  my @syncfiles = ();\n  if (!defined $jobname) { # not an array job.\n    push @syncfiles, $syncfile;\n  } else {\n    for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n      push @syncfiles, \"$syncfile.$jobid\";\n    }\n  }\n  # We will need the pbs_job_id, to check that job still exists\n  { # Get the PBS job-id from the log file in q/\n    # Parenthesize the open() call so that \"|| die\" applies to open's\n    # return value and not to the file name.\n    open(my $L, '<', $queue_logfile) || die \"Error opening log file $queue_logfile\";\n    undef $pbs_job_id;\n    while (<$L>) {\n      if (/(\\d+.+\\.pbsserver)/) {\n        if (defined $pbs_job_id) {\n          die \"Error: your job was submitted more than once (see $queue_logfile)\";\n        } else {\n          $pbs_job_id = $1;\n        }\n      }\n    }\n    close $L;\n    if (!defined $pbs_job_id) {\n      die \"Error: log file $queue_logfile does not specify the PBS job-id.\";\n    }\n  }\n  my $check_pbs_job_ctr=1;\n  #\n  my $wait = 0.1;\n  my $counter = 0;\n  foreach my $f (@syncfiles) {\n    # wait for them to finish one by one.\n    while (! 
-f $f) {\n      sleep($wait);\n      $wait *= 1.2;\n      if ($wait > 3.0) {\n        $wait = 3.0; # never wait more than 3 seconds.\n        # the following (.kick) commands are basically workarounds for NFS bugs.\n        if (rand() < 0.25) { # don't do this every time...\n          if (rand() > 0.5) {\n            system(\"touch $qdir/.kick\");\n          } else {\n            system(\"rm $qdir/.kick 2>/dev/null\");\n          }\n        }\n        if ($counter++ % 10 == 0) {\n          # This seems to kick NFS in the teeth to cause it to refresh the\n          # directory.  I've seen cases where it would indefinitely fail to get\n          # updated, even though the file exists on the server.\n          # Only do this every 10 waits (every 30 seconds) though, or if there\n          # are many jobs waiting they can overwhelm the file server.\n          system(\"ls $qdir >/dev/null\");\n        }\n      }\n\n      # Check that the job exists in PBS. Job can be killed if duration\n      # exceeds some hard limit, or in case of a machine shutdown.\n      if (($check_pbs_job_ctr++ % 10) == 0) { # Don't run qstat too often, avoid stress on PBS.\n        if ( -f $f ) { next; }; #syncfile appeared: OK.\n        $ret = system(\"qstat -t $pbs_job_id >/dev/null 2>/dev/null\");\n        # system(...) : To get the actual exit value, shift $ret right by eight bits.\n        if ($ret>>8 == 1) {     # Job does not seem to exist\n          # Don't consider immediately missing job as error, first wait some\n          # time to make sure it is not just delayed creation of the syncfile.\n\n          sleep(3);\n          # Sometimes NFS gets confused and thinks it's transmitted the directory\n          # but it hasn't, due to timestamp issues.  
Changing something in the\n          # directory will usually fix that.\n          system(\"touch $qdir/.kick\");\n          system(\"rm $qdir/.kick 2>/dev/null\");\n          if ( -f $f ) { next; }   #syncfile appeared, ok\n          sleep(7);\n          system(\"touch $qdir/.kick\");\n          sleep(1);\n          system(\"rm $qdir/.kick 2>/dev/null\");\n          if ( -f $f ) {  next; }   #syncfile appeared, ok\n          sleep(60);\n          system(\"touch $qdir/.kick\");\n          sleep(1);\n          system(\"rm $qdir/.kick 2>/dev/null\");\n          if ( -f $f ) { next; }  #syncfile appeared, ok\n          $f =~ m/\\.(\\d+)$/ || die \"Bad sync-file name $f\";\n          my $job_id = $1;\n          if (defined $jobname) {\n            $logfile =~ s/\\$PBS_ARRAY_INDEX/$job_id/g;\n          }\n          my $last_line = `tail -n 1 $logfile`;\n          if ($last_line =~ m/status 0$/ && (-M $logfile) < 0) {\n            # if the last line of $logfile ended with \"status 0\" and\n            # $logfile is newer than this program [(-M $logfile) gives the\n            # time elapsed between file modification and the start of this\n            # program], then we assume the program really finished OK,\n            # and maybe something is up with the file system.\n            print STDERR \"**pbs.pl: syncfile $f was not created but job seems\\n\" .\n              \"**to have finished OK.  Probably your file-system has problems.\\n\" .\n              \"**This is just a warning.\\n\";\n            last;\n          } else {\n            chop $last_line;\n            print STDERR \"pbs.pl: Error, unfinished job no \" .\n              \"longer exists, log is in $logfile, last line is '$last_line', \" .\n              \"syncfile is $f, return status of qstat was $ret\\n\" .\n              \"Possible reasons: a) Exceeded time limit? -> Use more jobs!\" .\n              \" b) Shutdown/Frozen machine? 
-> Run again!\\n\";\n            exit(1);\n          }\n        } elsif ($ret != 0) {\n          print STDERR \"pbs.pl: Warning: qstat command returned status $ret (qstat -t $pbs_job_id,$!)\\n\";\n        }\n      }\n    }\n  }\n  my $all_syncfiles = join(\" \", @syncfiles);\n  system(\"rm $all_syncfiles 2>/dev/null\");\n}\n\n# OK, at this point we are synced; we know the job is done.\n# But we don't know about its exit status.  We'll look at $logfile for this.\n# First work out an array @logfiles of file-locations we need to\n# read (just one, unless it's an array job).\nmy @logfiles = ();\nif (!defined $jobname) { # not an array job.\n  push @logfiles, $logfile;\n} else {\n  for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n    my $l = $logfile;\n    $l =~ s/\\$PBS_ARRAY_INDEX/$jobid/g;\n    push @logfiles, $l;\n  }\n}\n\nmy $num_failed = 0;\nmy $status = 1;\nforeach my $l (@logfiles) {\n  my @wait_times = (0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 1.0, 2.0, 5.0, 5.0, 5.0, 10.0, 25.0);\n  for (my $iter = 0; $iter <= @wait_times; $iter++) {\n    my $line = `tail -10 $l 2>/dev/null`; # Note: although this line should be the last\n    # line of the file, I've seen cases where it was not quite the last line because\n    # of delayed output by the process that was running, or processes it had called.\n    # so tail -10 gives it a little leeway.\n    if ($line =~ m/with status (\\d+)/) {\n      $status = $1;\n      last;\n    } else {\n      if ($iter < @wait_times) {\n        sleep($wait_times[$iter]);\n      } else {\n        if (! -f $l) {\n          print STDERR \"Log-file $l does not exist.\\n\";\n        } else {\n          print STDERR \"The last line of log-file $l does not seem to indicate the \"\n            . 
\"return status as expected\\n\";\n        }\n        exit(1);                # Something went wrong with the queue, or the\n        # machine it was running on, probably.\n      }\n    }\n  }\n  # OK, now we have $status, which is the return-status of\n  # the command in the job.\n  if ($status != 0) { $num_failed++; }\n}\nif ($num_failed == 0) { exit(0); }\nelse { # we failed.\n  if (@logfiles == 1) {\n    if (defined $jobname) { $logfile =~ s/\\$PBS_ARRAY_INDEX/$jobstart/g; }\n    print STDERR \"pbs.pl: job failed with status $status, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"pbs.pl: probably you forgot to put JOB=1:\\$nj in your script.\\n\";\n    }\n  } else {\n    if (defined $jobname) { $logfile =~ s/\\$PBS_ARRAY_INDEX/*/g; }\n    my $numjobs = 1 + $jobend - $jobstart;\n    print STDERR \"pbs.pl: $num_failed / $numjobs failed, log is in $logfile\\n\";\n  }\n  exit(1);\n}\n"
  },
  {
    "path": "egs/utils/parallel/queue.pl",
    "content": "#!/usr/bin/env perl\nuse strict;\nuse warnings;\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2014  Vimal Manohar (Johns Hopkins University)\n# Apache 2.0.\n\nuse File::Basename;\nuse Cwd;\nuse Getopt::Long;\n\n# queue.pl has the same functionality as run.pl, except that\n# it runs the job in question on the queue (Sun GridEngine).\n# This version of queue.pl uses the task array functionality\n# of the grid engine.  Note: it's different from the queue.pl\n# in the s4 and earlier scripts.\n\n# The script now supports configuring the queue system using a config file\n# (default in conf/queue.conf; but can be passed specified with --config option)\n# and a set of command line options.\n# The current script handles:\n# 1) Normal configuration arguments\n# For e.g. a command line option of \"--gpu 1\" could be converted into the option\n# \"-q g.q -l gpu=1\" to qsub. How the CLI option is handled is determined by a\n# line in the config file like\n# gpu=* -q g.q -l gpu=$0\n# $0 here in the line is replaced with the argument read from the CLI and the\n# resulting string is passed to qsub.\n# 2) Special arguments to options such as\n# gpu=0\n# If --gpu 0 is given in the command line, then no special \"-q\" is given.\n# 3) Default argument\n# default gpu=0\n# If --gpu option is not passed in the command line, then the script behaves as\n# if --gpu 0 was passed since 0 is specified as the default argument for that\n# option\n# 4) Arbitrary options and arguments.\n# Any command line option starting with '--' and its argument would be handled\n# as long as its defined in the config file.\n# 5) Default behavior\n# If the config file that is passed using is not readable, then the script\n# behaves as if the queue has the following config file:\n# $ cat conf/queue.conf\n# # Default configuration\n# command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\n# option mem=* -l mem_free=$0,ram_free=$0\n# option mem=0          # Do 
not add anything to qsub_opts\n# option num_threads=* -pe smp $0\n# option num_threads=1  # Do not add anything to qsub_opts\n# option max_jobs_run=* -tc $0\n# default gpu=0\n# option gpu=0 -q all.q\n# option gpu=* -l gpu=$0 -q g.q\n\nmy $qsub_opts = \"\";\nmy $sync = 0;\nmy $num_threads = 1;\nmy $gpu = 0;\n\nmy $config = \"conf/queue.conf\";\n\nmy %cli_options = ();\n\nmy $jobname;\nmy $jobstart;\nmy $jobend;\nmy $array_job = 0;\nmy $sge_job_id;\n\nsub print_usage() {\n  print STDERR\n   \"Usage: queue.pl [options] [JOB=1:n] log-file command-line arguments...\\n\" .\n   \"e.g.: queue.pl foo.log echo baz\\n\" .\n   \" (which will echo \\\"baz\\\", with stdout and stderr directed to foo.log)\\n\" .\n   \"or: queue.pl -q all.q\\@xyz foo.log echo bar \\| sed s/bar/baz/ \\n\" .\n   \" (which is an example of using a pipe; you can provide other escaped bash constructs)\\n\" .\n   \"or: queue.pl -q all.q\\@qyz JOB=1:10 foo.JOB.log echo JOB \\n\" .\n   \" (which illustrates the mechanism to submit parallel jobs; note, you can use \\n\" .\n   \"  another string other than JOB)\\n\" .\n   \"Note: if you pass the \\\"-sync y\\\" option to qsub, this script will take note\\n\" .\n   \"and change its behavior.  Otherwise it uses qstat to work out when the job finished\\n\" .\n   \"Options:\\n\" .\n   \"  --config <config-file> (default: $config)\\n\" .\n   \"  --mem <mem-requirement> (e.g. 
--mem 2G, --mem 500M, \\n\" .\n   \"                           also support K and numbers mean bytes)\\n\" .\n   \"  --num-threads <num-threads> (default: $num_threads)\\n\" .\n   \"  --max-jobs-run <num-jobs>\\n\" .\n   \"  --gpu <0|1> (default: $gpu)\\n\";\n  exit 1;\n}\n\nsub caught_signal {\n  if ( defined $sge_job_id ) { # Signal trapped after submitting jobs\n    my $signal = $!;\n    system (\"qdel $sge_job_id\");\n    print STDERR \"Caught a signal: $signal , deleting SGE task: $sge_job_id and exiting\\n\";\n    exit(2);\n  }\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nfor (my $x = 1; $x <= 2; $x++) { # This for-loop is to\n  # allow the JOB=1:n option to be interleaved with the\n  # options to qsub.\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {\n    my $switch = shift @ARGV;\n\n    if ($switch eq \"-V\") {\n      $qsub_opts .= \"-V \";\n    } else {\n      my $argument = shift @ARGV;\n      if ($argument =~ m/^--/) {\n        print STDERR \"WARNING: suspicious argument '$argument' to $switch; starts with '-'\\n\";\n      }\n      if ($switch eq \"-sync\" && $argument =~ m/^[yY]/) {\n        $sync = 1;\n        $qsub_opts .= \"$switch $argument \";\n      } elsif ($switch eq \"-pe\") { # e.g. -pe smp 5\n        my $argument2 = shift @ARGV;\n        $qsub_opts .= \"$switch $argument $argument2 \";\n        $num_threads = $argument2;\n      } elsif ($switch =~ m/^--/) { # Config options\n        # Convert CLI option to variable name\n        # by removing '--' from the switch and replacing any\n        # '-' with a '_'\n        $switch =~ s/^--//;\n        $switch =~ s/-/_/g;\n        $cli_options{$switch} = $argument;\n      } else {  # Other qsub options - passed as is\n        $qsub_opts .= \"$switch $argument \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. 
JOB=1:20\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    if ($jobstart > $jobend) {\n      die \"queue.pl: invalid job range $ARGV[0]\";\n    }\n    if ($jobstart <= 0) {\n      die \"queue.pl: invalid job range $ARGV[0], start must be strictly positive (this is a GridEngine limitation).\";\n    }\n    shift;\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. JOB=1.\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"queue.pl: Warning: suspicious first argument to queue.pl: $ARGV[0]\\n\";\n  }\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nif (exists $cli_options{\"config\"}) {\n  $config = $cli_options{\"config\"};\n}\n\nmy $default_config_file = <<'EOF';\n# Default configuration\ncommand qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*\noption mem=* -l mem_free=$0,ram_free=$0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* -pe smp $0\noption num_threads=1  # Do not add anything to qsub_opts\noption max_jobs_run=* -tc $0\ndefault gpu=0\noption gpu=0\noption gpu=* -l gpu=$0 -q '*.q'\nEOF\n\n# Here the configuration options specified by the user on the command line\n# (e.g. --mem 2G) are converted to options to the qsub system as defined in\n# the config file. (e.g. 
if the config file has the line\n# \"option mem=* -l ram_free=$0,mem_free=$0\"\n# and the user has specified '--mem 2G' on the command line, the options\n# passed to queue system would be \"-l ram_free=2G,mem_free=2G\n# A more detailed description of the ways the options would be handled is at\n# the top of this file.\n\n$SIG{INT} = \\&caught_signal;\n$SIG{TERM} = \\&caught_signal;\n\nmy $opened_config_file = 1;\n\nopen CONFIG, \"<$config\" or $opened_config_file = 0;\n\nmy %cli_config_options = ();\nmy %cli_default_options = ();\n\nif ($opened_config_file == 0 && exists($cli_options{\"config\"})) {\n  print STDERR \"Could not open config file $config\\n\";\n  exit(1);\n} elsif ($opened_config_file == 0 && !exists($cli_options{\"config\"})) {\n  # Open the default config file instead\n  open (CONFIG, \"echo '$default_config_file' |\") or die \"Unable to open pipe\\n\";\n  $config = \"Default config\";\n}\n\nmy $qsub_cmd = \"\";\nmy $read_command = 0;\n\nwhile(<CONFIG>) {\n  chomp;\n  my $line = $_;\n  $_ =~ s/\\s*#.*//g;\n  if ($_ eq \"\") { next; }\n  if ($_ =~ /^command (.+)/) {\n    $read_command = 1;\n    $qsub_cmd = $1 . \" \";\n  } elsif ($_ =~ m/^option ([^=]+)=\\* (.+)$/) {\n    # Config option that needs replacement with parameter value read from CLI\n    # e.g.: option mem=* -l mem_free=$0,ram_free=$0\n    my $option = $1;     # mem\n    my $arg= $2;         # -l mem_free=$0,ram_free=$0\n    if ($arg !~ m:\\$0:) {\n      die \"Unable to parse line '$line' in config file ($config)\\n\";\n    }\n    if (exists $cli_options{$option}) {\n      # Replace $0 with the argument read from command line.\n      # e.g. \"-l mem_free=$0,ram_free=$0\" -> \"-l mem_free=2G,ram_free=2G\"\n      $arg =~ s/\\$0/$cli_options{$option}/g;\n      $cli_config_options{$option} = $arg;\n    }\n  } elsif ($_ =~ m/^option ([^=]+)=(\\S+)\\s?(.*)$/) {\n    # Config option that does not need replacement\n    # e.g. 
option gpu=0 -q all.q\n    my $option = $1;      # gpu\n    my $value = $2;       # 0\n    my $arg = $3;         # -q all.q\n    if (exists $cli_options{$option}) {\n      $cli_default_options{($option,$value)} = $arg;\n    }\n  } elsif ($_ =~ m/^default (\\S+)=(\\S+)/) {\n    # Default options. Used for setting default values to options i.e. when\n    # the user does not specify the option on the command line\n    # e.g. default gpu=0\n    my $option = $1;  # gpu\n    my $value = $2;   # 0\n    if (!exists $cli_options{$option}) {\n      # If the user has specified this option on the command line, then we\n      # don't have to do anything\n      $cli_options{$option} = $value;\n    }\n  } else {\n    print STDERR \"queue.pl: unable to parse line '$line' in config file ($config)\\n\";\n    exit(1);\n  }\n}\n\nclose(CONFIG);\n\nif ($read_command != 1) {\n  print STDERR \"queue.pl: config file ($config) does not contain the line \\\"command .*\\\"\\n\";\n  exit(1);\n}\n\nfor my $option (keys %cli_options) {\n  if ($option eq \"config\") { next; }\n  if ($option eq \"max_jobs_run\" && $array_job != 1) { next; }\n  my $value = $cli_options{$option};\n\n  if (exists $cli_default_options{($option,$value)}) {\n    $qsub_opts .= \"$cli_default_options{($option,$value)} \";\n  } elsif (exists $cli_config_options{$option}) {\n    $qsub_opts .= \"$cli_config_options{$option} \";\n  } else {\n    if ($opened_config_file == 0) { $config = \"default config file\"; }\n    die \"queue.pl: Command line option $option not described in $config (or value '$value' not allowed)\\n\";\n  }\n}\n\nmy $cwd = getcwd();\nmy $logfile = shift @ARGV;\n\nif ($array_job == 1 && $logfile !~ m/$jobname/\n    && $jobend > $jobstart) {\n  print STDERR \"queue.pl: you are trying to run a parallel job but \"\n    . 
\"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n#\n# Work out the command; quote escaping is done here.\n# Note: the rules for escaping stuff are worked out pretty\n# arbitrarily, based on what we want it to do.  Some things that\n# we pass as arguments to queue.pl, such as \"|\", we want to be\n# interpreted by bash, so we don't escape them.  Other things,\n# such as archive specifiers like 'ark:gunzip -c foo.gz|', we want\n# to be passed, in quotes, to the Kaldi program.  Our heuristic\n# is that stuff with spaces in should be quoted.  This doesn't\n# always work.\n#\nmy $cmd = \"\";\n\nforeach my $x (@ARGV) {\n  if ($x =~ m/^\\S+$/) { $cmd .= $x . \" \"; } # If string contains no spaces, take\n                                            # as-is.\n  elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; } # else if no dbl-quotes, use single\n  else { $cmd .= \"\\\"$x\\\" \"; }  # else use double.\n}\n\n#\n# Work out the location of the script file, and open it for writing.\n#\nmy $dir = dirname($logfile);\nmy $base = basename($logfile);\nmy $qdir = \"$dir/q\";\n$qdir =~ s:/(log|LOG)/*q:/q:; # If qdir ends in .../log/q, make it just .../q.\nmy $queue_logfile = \"$qdir/$base\";\n\nif (!-d $dir) { system \"mkdir -p $dir 2>/dev/null\"; } # another job may be doing this...\nif (!-d $dir) { die \"Cannot make the directory $dir\\n\"; }\n# make a directory called \"q\",\n# where we will put the log created by qsub... normally this doesn't contain\n# anything interesting, evertyhing goes to $logfile.\n# in $qdir/sync we'll put the done.* files... we try to keep this\n# directory small because it's transmitted over NFS many times.\nif (! -d \"$qdir/sync\") {\n  system \"mkdir -p $qdir/sync 2>/dev/null\";\n  sleep(5); ## This is to fix an issue we encountered in denominator lattice creation,\n  ## where if e.g. 
the exp/tri2b_denlats/log/15/q directory had just been\n  ## created and the job immediately ran, it would die with an error because nfs\n  ## had not yet synced.  I'm also decreasing the acdirmin and acdirmax in our\n  ## NFS settings to something like 5 seconds.\n}\n\nmy $queue_array_opt = \"\";\nif ($array_job == 1) { # It's an array job.\n  $queue_array_opt = \"-t $jobstart:$jobend\";\n  $logfile =~ s/$jobname/\\$SGE_TASK_ID/g; # This variable will get\n  # replaced by qsub, in each job, with the job-id.\n  $cmd =~ s/$jobname/\\$\\{SGE_TASK_ID\\}/g; # same for the command...\n  $queue_logfile =~ s/\\.?$jobname//; # the log file in the q/ subdirectory\n  # is for the queue to put its log, and this doesn't need the task array subscript\n  # so we remove it.\n}\n\n# queue_scriptfile is as $queue_logfile [e.g. dir/q/foo.log] but\n# with the suffix .sh.\nmy $queue_scriptfile = $queue_logfile;\n($queue_scriptfile =~ s/\\.[a-zA-Z]{1,5}$/.sh/) || ($queue_scriptfile .= \".sh\");\nif ($queue_scriptfile !~ m:^/:) {\n  $queue_scriptfile = $cwd . \"/\" . $queue_scriptfile; # just in case.\n}\n\n# We'll write to the standard input of \"qsub\" (the file-handle Q),\n# the job that we want it to execute.\n# Also keep our current PATH around, just in case there was something\n# in it that we need (although we also source ./path.sh)\n\nmy $syncfile = \"$qdir/sync/done.$$\";\n\nunlink($queue_logfile, $syncfile);\n#\n# Write to the script file, and then close it.\n#\nopen(Q, \">$queue_scriptfile\") || die \"Failed to write to $queue_scriptfile\";\n\nprint Q \"#!/bin/bash\\n\";\nprint Q \"cd $cwd\\n\";\nprint Q \". 
./path.sh\\n\";\nprint Q \"( echo '#' Running on \\`hostname\\`\\n\";\nprint Q \"  echo '#' Started at \\`date\\`\\n\";\nprint Q \"  echo -n '# '; cat <<EOF\\n\";\nprint Q \"$cmd\\n\"; # this is a way of echoing the command into a comment in the log file,\nprint Q \"EOF\\n\"; # without having to escape things like \"|\" and quote characters.\nprint Q \") >$logfile\\n\";\nprint Q \"time1=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \" ( $cmd ) 2>>$logfile >>$logfile\\n\";\nprint Q \"ret=\\$?\\n\";\nprint Q \"time2=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \"echo '#' Accounting: time=\\$((\\$time2-\\$time1)) threads=$num_threads >>$logfile\\n\";\nprint Q \"echo '#' Finished at \\`date\\` with status \\$ret >>$logfile\\n\";\nprint Q \"[ \\$ret -eq 137 ] && exit 100;\\n\"; # If process was killed (e.g. oom) it will exit with status 137;\n  # let the script return with status 100 which will put it to E state; more easily rerunnable.\nif ($array_job == 0) { # not an array job\n  print Q \"touch $syncfile\\n\"; # so we know it's done.\n} else {\n  print Q \"touch $syncfile.\\$SGE_TASK_ID\\n\"; # touch a bunch of sync-files.\n}\nprint Q \"exit \\$[\\$ret ? 1 : 0]\\n\"; # avoid status 100 which grid-engine\nprint Q \"## submitted with:\\n\";       # treats specially.\n$qsub_cmd .= \"-o $queue_logfile $qsub_opts $queue_array_opt $queue_scriptfile >>$queue_logfile 2>&1\";\nprint Q \"# $qsub_cmd\\n\";\nif (!close(Q)) { # close was not successful... 
|| die \"Could not close script file $shfile\";\n  die \"Failed to close the script file (full disk?)\";\n}\nchmod 0755, $queue_scriptfile;\n\n# This block submits the job to the queue.\nfor (my $try = 1; $try < 5; $try++) {\n  my $ret = system ($qsub_cmd);\n  if ($ret != 0) {\n    if ($sync && $ret == 256) { # this is the exit status when a job failed (bad exit status)\n      if (defined $jobname) {\n        $logfile =~ s/\\$SGE_TASK_ID/*/g;\n      }\n      print STDERR \"queue.pl: job writing to $logfile failed\\n\";\n      exit(1);\n    } else {\n      print STDERR \"queue.pl: Error submitting jobs to queue (return status was $ret)\\n\";\n      print STDERR \"queue log file is $queue_logfile, command was $qsub_cmd\\n\";\n      my $err = `tail $queue_logfile`;\n      print STDERR \"Output of qsub was: $err\\n\";\n      if ($err =~ m/gdi request/ || $err =~ m/qmaster/) {\n        # When we get queue connectivity problems we usually see a message like:\n        # Unable to run job: failed receiving gdi request response for mid=1 (got\n        # syncron message receive timeout error)..\n        my $waitfor = 20;\n        print STDERR \"queue.pl: It looks like the queue master may be inaccessible. \" .\n          \" Trying again after $waitfor seconds\\n\";\n        sleep($waitfor);\n        # ... and continue through the loop.\n      } else {\n        exit(1);\n      }\n    }\n  } else {\n    last;  # break from the loop.\n  }\n}\n\nif (! $sync) { # We're not submitting with -sync y, so we\n  # need to wait for the jobs to finish.  
We wait for the\n  # sync-files we \"touched\" in the script to exist.\n  my @syncfiles = ();\n  if (!defined $jobname) { # not an array job.\n    push @syncfiles, $syncfile;\n  } else {\n    for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n      push @syncfiles, \"$syncfile.$jobid\";\n    }\n  }\n  # We will need the sge_job_id, to check that job still exists\n  { # This block extracts the numeric SGE job-id from the log file in q/.\n    # It may be used later to query 'qstat' about the job.\n    open(L, \"<$queue_logfile\") || die \"Error opening log file $queue_logfile\";\n    undef $sge_job_id;\n    while (<L>) {\n      if (m/Your job\\S* (\\d+)[. ].+ has been submitted/) {\n        if (defined $sge_job_id) {\n          die \"Error: your job was submitted more than once (see $queue_logfile)\";\n        } else {\n          $sge_job_id = $1;\n        }\n      }\n    }\n    close(L);\n    if (!defined $sge_job_id) {\n      die \"Error: log file $queue_logfile does not specify the SGE job-id.\";\n    }\n  }\n  my $check_sge_job_ctr=1;\n\n  my $wait = 0.1;\n  my $counter = 0;\n  foreach my $f (@syncfiles) {\n    # wait for the jobs to finish one by one.\n    while (! -f $f) {\n      sleep($wait);\n      $wait *= 1.2;\n      if ($wait > 3.0) {\n        $wait = 3.0; # never wait more than 3 seconds.\n        # the following (.kick) commands are basically workarounds for NFS bugs.\n        if (rand() < 0.25) { # don't do this every time...\n          if (rand() > 0.5) {\n            system(\"touch $qdir/sync/.kick\");\n          } else {\n            unlink(\"$qdir/sync/.kick\");\n          }\n        }\n        if ($counter++ % 10 == 0) {\n          # This seems to kick NFS in the teeth to cause it to refresh the\n          # directory.  
I've seen cases where it would indefinitely fail to get\n          # updated, even though the file exists on the server.\n          # Only do this every 10 waits (every 30 seconds) though, or if there\n          # are many jobs waiting they can overwhelm the file server.\n          system(\"ls $qdir/sync >/dev/null\");\n        }\n      }\n\n      # The purpose of the next block is so that queue.pl can exit if the job\n      # was killed without terminating.  It's a bit complicated because (a) we\n      # don't want to overload the qmaster by querying it too frequently), and\n      # (b) sometimes the qmaster is unreachable or temporarily down, and we\n      # don't want this to necessarily kill the job.\n      if (($check_sge_job_ctr < 100 && ($check_sge_job_ctr++ % 10) == 0) ||\n          ($check_sge_job_ctr >= 100 && ($check_sge_job_ctr++ % 50) == 0)) {\n        # Don't run qstat too often, avoid stress on SGE; the if-condition above\n        # is designed to check every 10 waits at first, and eventually every 50\n        # waits.\n        if ( -f $f ) { next; }  #syncfile appeared: OK.\n        my $output = `qstat -j $sge_job_id 2>&1`;\n        my $ret = $?;\n        if ($ret >> 8 == 1 && $output !~ m/qmaster/ &&\n            $output !~ m/gdi request/) {\n          # Don't consider immediately missing job as error, first wait some\n          # time to make sure it is not just delayed creation of the syncfile.\n\n          sleep(3);\n          # Sometimes NFS gets confused and thinks it's transmitted the directory\n          # but it hasn't, due to timestamp issues.  
Changing something in the\n          # directory will usually fix that.\n          system(\"touch $qdir/sync/.kick\");\n          unlink(\"$qdir/sync/.kick\");\n          if ( -f $f ) { next; }   #syncfile appeared, ok\n          sleep(7);\n          system(\"touch $qdir/sync/.kick\");\n          sleep(1);\n          unlink(\"$qdir/sync/.kick\");\n          if ( -f $f ) {  next; }   #syncfile appeared, ok\n          sleep(60);\n          system(\"touch $qdir/sync/.kick\");\n          sleep(1);\n          unlink(\"$qdir/sync/.kick\");\n          if ( -f $f ) { next; }  #syncfile appeared, ok\n          $f =~ m/\\.(\\d+)$/ || die \"Bad sync-file name $f\";\n          my $job_id = $1;\n          if (defined $jobname) {\n            $logfile =~ s/\\$SGE_TASK_ID/$job_id/g;\n          }\n          my $last_line = `tail -n 1 $logfile`;\n          if ($last_line =~ m/status 0$/ && (-M $logfile) < 0) {\n            # if the last line of $logfile ended with \"status 0\" and\n            # $logfile is newer than this program [(-M $logfile) gives the\n            # time elapsed between file modification and the start of this\n            # program], then we assume the program really finished OK,\n            # and maybe something is up with the file system.\n            print STDERR \"**queue.pl: syncfile $f was not created but job seems\\n\" .\n              \"**to have finished OK.  Probably your file-system has problems.\\n\" .\n              \"**This is just a warning.\\n\";\n            last;\n          } else {\n            chop $last_line;\n            print STDERR \"queue.pl: Error, unfinished job no \" .\n              \"longer exists, log is in $logfile, last line is '$last_line', \" .\n              \"syncfile is $f, return status of qstat was $ret\\n\" .\n              \"Possible reasons: a) Exceeded time limit? -> Use more jobs!\" .\n              \" b) Shutdown/Frozen machine? -> Run again!  
Qmaster output \" .\n              \"was: $output\\n\";\n            exit(1);\n          }\n        } elsif ($ret != 0) {\n          print STDERR \"queue.pl: Warning: qstat command returned status $ret (qstat -j $sge_job_id,$!)\\n\";\n          print STDERR \"queue.pl: output was: $output\";\n        }\n      }\n    }\n  }\n  unlink(@syncfiles);\n}\n\n# OK, at this point we are synced; we know the job is done.\n# But we don't know about its exit status.  We'll look at $logfile for this.\n# First work out an array @logfiles of file-locations we need to\n# read (just one, unless it's an array job).\nmy @logfiles = ();\nif (!defined $jobname) { # not an array job.\n  push @logfiles, $logfile;\n} else {\n  for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n    my $l = $logfile;\n    $l =~ s/\\$SGE_TASK_ID/$jobid/g;\n    push @logfiles, $l;\n  }\n}\n\nmy $num_failed = 0;\nmy $status = 1;\nforeach my $l (@logfiles) {\n  my @wait_times = (0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 1.0, 2.0, 5.0, 5.0, 5.0, 10.0, 25.0);\n  for (my $iter = 0; $iter <= @wait_times; $iter++) {\n    my $line = `tail -10 $l 2>/dev/null`; # Note: although this line should be the last\n    # line of the file, I've seen cases where it was not quite the last line because\n    # of delayed output by the process that was running, or processes it had called.\n    # so tail -10 gives it a little leeway.\n    if ($line =~ m/with status (\\d+)/) {\n      $status = $1;\n      last;\n    } else {\n      if ($iter < @wait_times) {\n        sleep($wait_times[$iter]);\n      } else {\n        if (! -f $l) {\n          print STDERR \"Log-file $l does not exist.\\n\";\n        } else {\n          print STDERR \"The last line of log-file $l does not seem to indicate the \"\n            . 
\"return status as expected\\n\";\n        }\n        exit(1);                # Something went wrong with the queue, or the\n        # machine it was running on, probably.\n      }\n    }\n  }\n  # OK, now we have $status, which is the return-status of\n  # the command in the job.\n  if ($status != 0) { $num_failed++; }\n}\nif ($num_failed == 0) { exit(0); }\nelse { # we failed.\n  if (@logfiles == 1) {\n    if (defined $jobname) { $logfile =~ s/\\$SGE_TASK_ID/$jobstart/g; }\n    print STDERR \"queue.pl: job failed with status $status, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"queue.pl: probably you forgot to put JOB=1:\\$nj in your script.\\n\";\n    }\n  } else {\n    if (defined $jobname) { $logfile =~ s/\\$SGE_TASK_ID/*/g; }\n    my $numjobs = 1 + $jobend - $jobstart;\n    print STDERR \"queue.pl: $num_failed / $numjobs failed, log is in $logfile\\n\";\n  }\n  exit(1);\n}\n"
  },
  {
    "path": "egs/utils/parallel/retry.pl",
    "content": "#!/usr/bin/env perl\nuse strict;\nuse warnings;\n\n# Copyright 2018  Johns Hopkins University (Author: Daniel Povey).\n# Apache 2.0.\n\nuse File::Basename;\nuse Cwd;\nuse Getopt::Long;\n\n\n# retry.pl is a wrapper for queue.pl.  It can be used to retry jobs that failed,\n# e.g. if your command line was \"queue.pl [args]\", you can replace that\n# with \"retry.pl queue.pl [args]\" and it will retry jobs that failed.\n\n\nmy $num_tries = 2;\n\nsub print_usage() {\n  print STDERR\n    \"Usage: retry.pl  <some-other-wrapper-script> <rest-of-command>\\n\" .\n    \"  e.g.:  retry.pl [options] queue.pl foo.log do_something\\n\" .\n    \"This will retry jobs that failed (only once)\\n\" .\n    \"Options:\\n\" .\n    \"      --num-tries <n>        # default: 2\\n\";\n  exit 1;\n}\n\nif ($ARGV[0] eq \"--num-tries\") {\n  shift;\n  $num_tries =  $ARGV[0] + 0;\n  if ($num_tries < 1) {\n    die \"$0: invalid option --num-tries $ARGV[0]\";\n  }\n  shift;\n}\n\nif (@ARGV < 3) {\n  print_usage();\n}\n\n\nsub get_log_file {\n  my $n;\n  # First just look for the first command-line arg that ends in \".log\".  If that\n  # exists, it's almost certainly the log file.\n  for ($n = 1; $n < @ARGV; $n++) {\n    if ($ARGV[$n] =~ m/\\.log$/) {\n      return $ARGV[$n];\n    }\n  }\n  for ($n = 1; $n < @ARGV; $n++) {\n    # If this arg isn't of the form \"-some-option', and isn't of the form\n    # \"JOB=1:10\", and the previous arg wasn't of the form \"-some-option\", and this\n    # isn't just a number (note: the 'not-a-number' things is mostly to exclude\n    # things like the 5 in \"-pe smp 5\" which is an older but still-supported\n    # option to queue.pl)... then assume it's a log file.\n    if ($ARGV[$n] !~ m/^-=/ &&  $ARGV[$n] !~ m/=/ && $ARGV[$n] !~ m/^\\d+$/ &&\n        $ARGV[$n-1] !~ m/^-/) {\n      return $ARGV[$n];\n    }\n  }\n  print STDERR \"$0: failed to parse log-file name from args:\" . 
join(\" \", @ARGV);\n  exit(1);\n}\n\n\nmy $log_file = get_log_file();\nmy $return_status;\n\nfor (my $n = 1; $n <= $num_tries; $n++) {\n  system(@ARGV);\n  $return_status = $?;\n  if ($return_status == 0) {\n    exit(0);  # The command succeeded.  We return success.\n  } elsif ($return_status != 256) {\n    # The command did not \"die normally\".  When queue.pl and similar scripts\n    # detect a normal error, they exit(1), which becomes a status of 256\n    # in perl's $? variable.\n    # See http://perldoc.perl.org/perlvar.html#%24CHILD_ERROR for more info.\n    # An example of an abnormal death that would cause us to want to exit\n    # immediately, is when the user does ctrl-c or KILLs the script,\n    # which gets caught by 'caught_signal' in queue.pl and causes that program\n    # to return with exit status 2.\n    exit(1);\n  }\n\n\n  if ($n < $num_tries) {\n    if (! -f $log_file) {\n      # $log_file doesn't exist as a file.  Maybe it was an array job.\n      # This script doesn't yet support array jobs.  We just give up.\n      # Later on we might want to figure out which array jobs failed\n      # and have to be rerun, but for now we just die.\n      print STDERR \"$0: job failed and log file $log_file does not exist (array job?).\\n\";\n    } else {\n      rename($log_file, $log_file . \".bak\");\n      print STDERR \"$0: job failed; renaming log file to ${log_file}.bak and rerunning\\n\";\n    }\n  }\n}\n\nprint STDERR \"$0: job failed $num_tries times; log is in $log_file\\n\";\nexit(1);\n"
  },
  {
    "path": "egs/utils/parallel/run.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# In general, doing\n#  run.pl some.log a b c is like running the command a b c in\n# the bash shell, and putting the standard error and output into some.log.\n# To run parallel jobs (backgrounded on the host machine), you can do (e.g.)\n#  run.pl JOB=1:4 some.JOB.log a b c JOB is like running the command a b c JOB\n# and putting it in some.JOB.log, for each one. [Note: JOB can be any identifier].\n# If any of the jobs fails, this script will fail.\n\n# A typical example is:\n#  run.pl some.log my-prog \"--opt=foo bar\" foo \\|  other-prog baz\n# and run.pl will run something like:\n# ( my-prog '--opt=foo bar' foo |  other-prog baz ) >& some.log\n#\n# Basically it takes the command-line arguments, quotes them\n# as necessary to preserve spaces, and evaluates them with bash.\n# In addition it puts the command line at the top of the log, and\n# the start and end times of the command at the beginning and end.\n# The reason why this is useful is so that we can create a different\n# version of this program that uses a queueing system instead.\n\n#use Data::Dumper;\n\n@ARGV < 2 && die \"usage: run.pl log-file command-line arguments...\";\n\n#print STDERR \"COMMAND-LINE: \" .  Dumper(\\@ARGV) . 
\"\\n\";\n$job_pick = 'all';\n$max_jobs_run = -1;\n$jobstart = 1;\n$jobend = 1;\n$ignored_opts = \"\"; # These will be ignored.\n\n# First parse an option like JOB=1:4, and any\n# options that would normally be given to\n# queue.pl, which we will just discard.\n\nfor (my $x = 1; $x <= 2; $x++) { # This for-loop is to\n  # allow the JOB=1:n option to be interleaved with the\n  # options to qsub.\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {\n    # parse any options that would normally go to qsub, but which will be ignored here.\n    my $switch = shift @ARGV;\n    if ($switch eq \"-V\") {\n      $ignored_opts .= \"-V \";\n    } elsif ($switch eq \"--max-jobs-run\" || $switch eq \"-tc\") {\n      # we do support the option --max-jobs-run n, and its GridEngine form -tc n.\n      # if the command appears multiple times uses the smallest option.\n      if ( $max_jobs_run <= 0 ) {\n          $max_jobs_run =  shift @ARGV;\n      } else {\n        my $new_constraint = shift @ARGV;\n        if ( ($new_constraint < $max_jobs_run) ) {\n          $max_jobs_run = $new_constraint;\n        }\n      }\n      \n      if (! ($max_jobs_run > 0)) {\n        die \"run.pl: invalid option --max-jobs-run $max_jobs_run\";\n      }\n    } else {\n      my $argument = shift @ARGV;\n      if ($argument =~ m/^--/) {\n        print STDERR \"run.pl: WARNING: suspicious argument '$argument' to $switch; starts with '-'\\n\";\n      }\n      if ($switch eq \"-sync\" && $argument =~ m/^[yY]/) {\n        $ignored_opts .= \"-sync \"; # Note: in the\n        # corresponding code in queue.pl it says instead, just \"$sync = 1;\".\n      } elsif ($switch eq \"-pe\") { # e.g. 
-pe smp 5\n        my $argument2 = shift @ARGV;\n        $ignored_opts .= \"$switch $argument $argument2 \";\n      } elsif ($switch eq \"--gpu\") {\n        $using_gpu = $argument;\n      } elsif ($switch eq \"--pick\") {\n        if($argument =~ m/^(all|failed|incomplete)$/) {\n          $job_pick = $argument;\n        } else {\n          print STDERR \"run.pl: ERROR: --pick argument must be one of 'all', 'failed' or 'incomplete'\\n\";\n        }\n      } else {\n        # Ignore option.\n        $ignored_opts .= \"$switch $argument \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. JOB=1:20\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    if ($jobstart > $jobend) {\n      die \"run.pl: invalid job range $ARGV[0]\";\n    }\n    if ($jobstart <= 0) {\n      die \"run.pl: invalid job range $ARGV[0], start must be strictly positive (this is required for GridEngine compatibility).\";\n    }\n    shift;\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. 
JOB=1.\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"run.pl: Warning: suspicious first argument to run.pl: $ARGV[0]\\n\";\n  }\n}\n\n# Users found this message confusing so we are removing it.\n# if ($ignored_opts ne \"\") {\n#   print STDERR \"run.pl: Warning: ignoring options \\\"$ignored_opts\\\"\\n\";\n# }\n\nif ($max_jobs_run == -1) { # If --max-jobs-run option not set,\n                           # then work out the number of processors if possible,\n                           # and set it based on that.\n  $max_jobs_run = 0;\n  if ($using_gpu) {\n    if (open(P, \"nvidia-smi -L |\")) {\n      $max_jobs_run++ while (<P>);\n      close(P);\n    }\n    if ($max_jobs_run == 0) {\n      $max_jobs_run = 1;\n      print STDERR \"run.pl: Warning: failed to detect number of GPUs from nvidia-smi, using ${max_jobs_run}\\n\";\n    }\n  } elsif (open(P, \"</proc/cpuinfo\")) {  # Linux\n    while (<P>) { if (m/^processor/) { $max_jobs_run++; } }\n    if ($max_jobs_run == 0) {\n      print STDERR \"run.pl: Warning: failed to detect any processors from /proc/cpuinfo\\n\";\n      $max_jobs_run = 10;  # reasonable default.\n    }\n    close(P);\n  } elsif (open(P, \"sysctl -a |\")) {  # BSD/Darwin\n    while (<P>) {\n      if (m/hw\\.ncpu\\s*[:=]\\s*(\\d+)/) { # hw.ncpu = 4, or hw.ncpu: 4\n        $max_jobs_run = $1;\n        last;\n      }\n    }\n    close(P);\n    if ($max_jobs_run == 0) {\n      print STDERR \"run.pl: Warning: failed to detect any processors from sysctl -a\\n\";\n      $max_jobs_run = 10;  # reasonable default.\n    }\n  } else {\n    # allow at most 32 jobs at once, on non-UNIX systems; change this code\n    # if you need to change this default.\n    $max_jobs_run = 32;\n  }\n  # The just-computed value of $max_jobs_run is just the number of processors\n  # (or our best guess); and if it happens that the number of jobs we need to\n  # run is just slightly above 
$max_jobs_run, it will make sense to increase\n  # $max_jobs_run to equal the number of jobs, so we don't have a small number\n  # of leftover jobs.\n  $num_jobs = $jobend - $jobstart + 1;\n  if (!$using_gpu &&\n      $num_jobs > $max_jobs_run && $num_jobs < 1.4 * $max_jobs_run) {\n    $max_jobs_run = $num_jobs;\n  }\n}\n\nsub pick_or_exit {\n  # pick_or_exit ( $logfile ) \n  # Invoked before each job is started helps to run jobs selectively.\n  #\n  # Given the name of the output logfile decides whether the job must be \n  # executed (by returning from the subroutine) or not (by terminating the\n  # process calling exit)\n  # \n  # PRE: $job_pick is a global variable set by command line switch --pick\n  #      and indicates which class of jobs must be executed.\n  #\n  # 1) If a failed job is not executed the process exit code will indicate \n  #    failure, just as if the task was just executed  and failed.\n  #\n  # 2) If a task is incomplete it will be executed. Incomplete may be either\n  #    a job whose log file does not contain the accounting notes in the end,\n  #    or a job whose log file does not exist.\n  #\n  # 3) If the $job_pick is set to 'all' (default behavior) a task will be\n  #    executed regardless of the result of previous attempts.\n  #\n  # This logic could have been implemented in the main execution loop\n  # but a subroutine to preserve the current level of readability of\n  # that part of the code.\n  #\n  # Alexandre Felipe, (o.alexandre.felipe@gmail.com) 14th of August of 2020\n  #\n  if($job_pick eq 'all'){\n    return; # no need to bother with the previous log\n  }\n  open my $fh, \"<\", $_[0] or return; # job not executed yet\n  my $log_line;\n  my $cur_line;\n  while ($cur_line = <$fh>) {\n    if( $cur_line =~ m/# Ended \\(code .*/ ) {\n      $log_line = $cur_line;\n    }\n  }\n  close $fh;\n  if (! 
defined($log_line)){\n    return; # incomplete\n  }\n  if ( $log_line =~ m/# Ended \\(code 0\\).*/ ) {\n    exit(0); # complete\n  } elsif ( $log_line =~ m/# Ended \\(code \\d+(; signal \\d+)?\\).*/ ){\n    if ($job_pick !~ m/^(failed|all)$/) {\n      exit(1); # failed but not going to run\n    } else {\n      return; # failed\n    }\n  } elsif ( $log_line =~ m/.*\\S.*/ ) {\n    return; # incomplete jobs are always run\n  }\n}\n\n\n$logfile = shift @ARGV;\n\nif (defined $jobname && $logfile !~ m/$jobname/ &&\n    $jobend > $jobstart) {\n  print STDERR \"run.pl: you are trying to run a parallel job but \"\n    . \"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n$cmd = \"\";\n\nforeach $x (@ARGV) {\n    if ($x =~ m/^\\S+$/) { $cmd .=  $x . \" \"; }\n    elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; }\n    else { $cmd .= \"\\\"$x\\\" \"; }\n}\n\n#$Data::Dumper::Indent=0;\n$ret = 0;\n$numfail = 0;\n%active_pids=();\n\nuse POSIX \":sys_wait_h\";\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  if (scalar(keys %active_pids) >= $max_jobs_run) {\n\n    # Lets wait for a change in any child's status\n    # Then we have to work out which child finished\n    $r = waitpid(-1, 0);\n    $code = $?;\n    if ($r < 0 ) { die \"run.pl: Error waiting for child process\"; } # should never happen.\n    if ( defined $active_pids{$r} ) {\n        $jid=$active_pids{$r};\n        $fail[$jid]=$code;\n        if ($code !=0) { $numfail++;}\n        delete $active_pids{$r};\n        # print STDERR \"Finished: $r/$jid \" .  Dumper(\\%active_pids) . 
\"\\n\";\n    } else {\n        die \"run.pl: Cannot find the PID of the child process that just finished.\";\n    }\n\n    # In theory we could do a non-blocking waitpid over all jobs running just\n    # to find out if only one or more jobs finished during the previous waitpid()\n    # However, we just omit this and will reap the next one in the next pass\n    # through the for(;;) cycle\n  }\n  $childpid = fork();\n  if (!defined $childpid) { die \"run.pl: Error forking in run.pl (writing to $logfile)\"; }\n  if ($childpid == 0) { # We're in the child... this branch\n    # executes the job and returns (possibly with an error status).\n    if (defined $jobname) {\n      $cmd =~ s/$jobname/$jobid/g;\n      $logfile =~ s/$jobname/$jobid/g;\n    }\n    # exit if the job does not need to be executed\n    pick_or_exit( $logfile );\n\n    system(\"mkdir -p `dirname $logfile` 2>/dev/null\");\n    open(F, \">$logfile\") || die \"run.pl: Error opening log file $logfile\";\n    print F \"# \" . $cmd . \"\\n\";\n    print F \"# Started at \" . `date`;\n    $starttime = `date +'%s'`;\n    print F \"#\\n\";\n    close(F);\n\n    # Pipe into bash.. make sure we're not using any other shell.\n    open(B, \"|bash\") || die \"run.pl: Error opening shell command\";\n    print B \"( \" . $cmd . \") 2>>$logfile >> $logfile\";\n    close(B);                   # If there was an error, exit status is in $?\n    $ret = $?;\n\n    $lowbits = $ret & 127;\n    $highbits = $ret >> 8;\n    if ($lowbits != 0) { $return_str = \"code $highbits; signal $lowbits\" }\n    else { $return_str = \"code $highbits\"; }\n\n    $endtime = `date +'%s'`;\n    open(F, \">>$logfile\") || die \"run.pl: Error opening log file $logfile (again)\";\n    $enddate = `date`;\n    chop $enddate;\n    print F \"# Accounting: time=\" . ($endtime - $starttime) . \" threads=1\\n\";\n    print F \"# Ended ($return_str) at \" . $enddate . \", elapsed time \" . ($endtime-$starttime) . 
\" seconds\\n\";\n    close(F);\n    exit($ret == 0 ? 0 : 1);\n  } else {\n    $pid[$jobid] = $childpid;\n    $active_pids{$childpid} = $jobid;\n    # print STDERR \"Queued: \" .  Dumper(\\%active_pids) . \"\\n\";\n  }\n}\n\n# Now we have submitted all the jobs, lets wait until all the jobs finish\nforeach $child (keys %active_pids) {\n    $jobid=$active_pids{$child};\n    $r = waitpid($pid[$jobid], 0);\n    $code = $?;\n    if ($r == -1) { die \"run.pl: Error waiting for child process\"; } # should never happen.\n    if ($r != 0) { $fail[$jobid]=$code; $numfail++ if $code!=0; } # Completed successfully\n}\n\n# Some sanity checks:\n# The $fail array should not contain undefined codes\n# The number of non-zeros in that array  should be equal to $numfail\n# We cannot do foreach() here, as the JOB ids do not start at zero\n$failed_jids=0;\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $job_return = $fail[$jobid];\n  if (not defined $job_return ) {\n    # print Dumper(\\@fail);\n\n    die \"run.pl: Sanity check failed: we have indication that some jobs are running \" .\n      \"even after we waited for all jobs to finish\" ;\n  }\n  if ($job_return != 0 ){ $failed_jids++;}\n}\nif ($failed_jids != $numfail) {\n  die \"run.pl: Sanity check failed: cannot find out how many jobs failed ($failed_jids x $numfail).\"\n}\nif ($numfail > 0) { $ret = 1; }\n\nif ($ret != 0) {\n  $njobs = $jobend - $jobstart + 1;\n  if ($njobs == 1) {\n    if (defined $jobname) {\n      $logfile =~ s/$jobname/$jobstart/; # only one numbered job, so replace name with\n                                         # that job.\n    }\n    print STDERR \"run.pl: job failed, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"run.pl: probably you forgot to put JOB=1:\\$nj in your script.\";\n    }\n  }\n  else {\n    $logfile =~ s/$jobname/*/g;\n    print STDERR \"run.pl: $numfail / $njobs failed, log is in $logfile\\n\";\n  }\n}\n\n\nexit ($ret);\n"
  },
  {
    "path": "egs/utils/parallel/slurm.pl",
    "content": "#!/usr/bin/env perl\nuse strict;\nuse warnings;\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey).\n#           2014  Vimal Manohar (Johns Hopkins University)\n#           2015  Johns Hopkins University (Yenda Trmal <jtrmal@gmail.com>>)\n# Apache 2.0.\n\nuse File::Basename;\nuse Cwd;\nuse Getopt::Long;\n\n# slurm.pl was created from the queue.pl\n# queue.pl has the same functionality as run.pl, except that\n# it runs the job in question on the queue (Sun GridEngine).\n# This version of queue.pl uses the task array functionality\n# of the grid engine.  Note: it's different from the queue.pl\n# in the s4 and earlier scripts.\n\n# The script now supports configuring the queue system using a config file\n# (default in conf/queue.conf; but can be passed specified with --config option)\n# and a set of command line options.\n# The current script handles:\n# 1) Normal configuration arguments\n# For e.g. a command line option of \"--gpu 1\" could be converted into the option\n# \"-q g.q -l gpu=1\" to qsub. 
How the CLI option is handled is determined by a\n# line in the config file like\n# gpu=* -q g.q -l gpu=$0\n# $0 here in the line is replaced with the argument read from the CLI and the\n# resulting string is passed to qsub.\n# 2) Special arguments to options such as\n# gpu=0\n# If --gpu 0 is given in the command line, then no special \"-q\" is given.\n# 3) Default argument\n# default gpu=0\n# If --gpu option is not passed in the command line, then the script behaves as\n# if --gpu 0 was passed since 0 is specified as the default argument for that\n# option\n# 4) Arbitrary options and arguments.\n# Any command line option starting with '--' and its argument would be handled\n# as long as its defined in the config file.\n# 5) Default behavior\n# If the config file that is passed using is not readable, then the script\n# behaves as if the queue has the following config file:\n# $ cat conf/queue.conf\n# # Default configuration\n# command sbatch --export=PATH  -S /bin/bash -j y -l arch=*64*\n# option mem=* --mem-per-cpu $0\n# option mem=0          # Do not add anything to qsub_opts\n# option num_threads=* --cpus-per-task $0\n# option num_threads=1  # Do not add anything to qsub_opts\n# option max_jobs_run=* -tc $0\n# default gpu=0\n# option gpu=0 -p shared\n# option gpu=*  -p gpu  #this has to be figured out\n\n#print STDERR \"$0 \" . join(\" \", @ARGV) . 
\"\\n\";\n\nmy $qsub_opts = \"\";\nmy $sync = 0;\nmy $num_threads = 1;\nmy $max_jobs_run;\nmy $gpu = 0;\n\nmy $config = \"conf/slurm.conf\";\n\nmy %cli_options = ();\n\nmy $jobname;\nmy $jobstart;\nmy $jobend;\n\nmy $array_job = 0;\n\nsub print_usage() {\n  print STDERR\n   \"Usage: $0 [options] [JOB=1:n] log-file command-line arguments...\\n\" .\n   \"e.g.: $0 foo.log echo baz\\n\" .\n   \" (which will echo \\\"baz\\\", with stdout and stderr directed to foo.log)\\n\" .\n   \"or: $0 -q all.q\\@xyz foo.log echo bar \\| sed s/bar/baz/ \\n\" .\n   \" (which is an example of using a pipe; you can provide other escaped bash constructs)\\n\" .\n   \"or: $0 -q all.q\\@qyz JOB=1:10 foo.JOB.log echo JOB \\n\" .\n   \" (which illustrates the mechanism to submit parallel jobs; note, you can use \\n\" .\n   \"  another string other than JOB)\\n\" .\n   \"Note: if you pass the \\\"-sync y\\\" option to qsub, this script will take note\\n\" .\n   \"and change its behavior.  Otherwise it uses squeue to work out when the job finished\\n\" .\n   \"Options:\\n\" .\n   \"  --config <config-file> (default: $config)\\n\" .\n   \"  --mem <mem-requirement> (e.g. --mem 2G, --mem 500M, \\n\" .\n   \"                           also support K and numbers mean bytes)\\n\" .\n   \"  --num-threads <num-threads> (default: $num_threads)\\n\" .\n   \"  --max-jobs-run <num-jobs>\\n\" .\n   \"  --gpu <0|1> (default: $gpu)\\n\";\n  exit 1;\n}\n\nsub exec_command {\n  # Execute command and return a tuple of stdout and exit code\n  my $command = join ' ', @_;\n  # To get the actual exit value, shift right by eight bits.\n  ($_ = `$command 2>&1`, $? 
>> 8);\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nfor (my $x = 1; $x <= 3; $x++) { # This for-loop is to\n  # allow the JOB=1:n option to be interleaved with the\n  # options to qsub.\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {\n    my $switch = shift @ARGV;\n\n    if ($switch eq \"-V\") {\n      $qsub_opts .= \"-V \";\n    } else {\n      my $argument = shift @ARGV;\n      if ($argument =~ m/^--/) {\n        print STDERR \"WARNING: suspicious argument '$argument' to $switch; starts with '-'\\n\";\n      }\n      if ($switch eq \"-sync\" && $argument =~ m/^[yY]/) {\n        $sync = 1;\n        $qsub_opts .= \"$switch $argument \";\n      } elsif ($switch eq \"-pe\") { # e.g. -pe smp 5\n        my $argument2 = shift @ARGV;\n        $qsub_opts .= \"$switch $argument $argument2 \";\n        $num_threads = $argument2;\n      } elsif ($switch =~ m/^--/) { # Config options\n        # Convert CLI option to variable name\n        # by removing '--' from the switch and replacing any\n        # '-' with a '_'\n        $switch =~ s/^--//;\n        $switch =~ s/-/_/g;\n        $cli_options{$switch} = $argument;\n      } else {  # Other qsub options - passed as is\n        $qsub_opts .= \"$switch $argument \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. JOB=1:20\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    # Validate the range before shifting, so that the error messages below\n    # still refer to the JOB=... argument rather than to the log file.\n    if ($jobstart > $jobend) {\n      die \"$0: invalid job range $ARGV[0]\";\n    }\n    if ($jobstart <= 0) {\n      die \"$0: invalid job range $ARGV[0], start must be strictly positive (this is a GridEngine limitation).\";\n    }\n    shift;\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. 
JOB=1.\n    $array_job = 1;\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"Warning: suspicious first argument to $0: $ARGV[0]\\n\";\n  }\n}\n\nif (@ARGV < 2) {\n  print_usage();\n}\n\nif (exists $cli_options{\"config\"}) {\n  $config = $cli_options{\"config\"};\n}\n\nmy $default_config_file = <<'EOF';\n# Default configuration\ncommand sbatch --export=PATH  --ntasks-per-node=1\noption time=* --time $0\noption mem=* --mem-per-cpu $0\noption mem=0          # Do not add anything to qsub_opts\noption num_threads=* --cpus-per-task $0 --ntasks-per-node=1\noption num_threads=1 --cpus-per-task 1  --ntasks-per-node=1 # Do not add anything to qsub_opts\ndefault gpu=0\noption gpu=0 -p shared\noption gpu=* -p gpu --gres=gpu:$0 --time 4:0:0  # this has to be figured out\nEOF\n\n# note: the --max-jobs-run option is supported as a special case\n# by slurm.pl and you don't have to handle it in the config file.\n\n# Here the configuration options specified by the user on the command line\n# (e.g. --mem 2G) are converted to options to the qsub system as defined in\n# the config file. (e.g. 
if the config file has the line\n# \"option mem=* -l ram_free=$0,mem_free=$0\"\n# and the user has specified '--mem 2G' on the command line, the options\n# passed to queue system would be \"-l ram_free=2G,mem_free=2G\").\n# A more detailed description of the ways the options would be handled is at\n# the top of this file.\n\nmy $opened_config_file = 1;\n\nopen CONFIG, \"<$config\" or $opened_config_file = 0;\n\nmy %cli_config_options = ();\nmy %cli_default_options = ();\n\nif ($opened_config_file == 0 && exists($cli_options{\"config\"})) {\n  print STDERR \"Could not open config file $config\\n\";\n  exit(1);\n} elsif ($opened_config_file == 0 && !exists($cli_options{\"config\"})) {\n  # Open the default config file instead\n  open (CONFIG, \"echo '$default_config_file' |\") or die \"Unable to open pipe\\n\";\n  $config = \"Default config\";\n}\n\nmy $qsub_cmd = \"\";\nmy $read_command = 0;\n\nwhile(<CONFIG>) {\n  chomp;\n  my $line = $_;\n  $_ =~ s/\\s*#.*//g;\n  if ($_ eq \"\") { next; }\n  if ($_ =~ /^command (.+)/) {\n    $read_command = 1;\n    $qsub_cmd = $1 . \" \";\n  } elsif ($_ =~ m/^option ([^=]+)=\\* (.+)$/) {\n    # Config option that needs replacement with parameter value read from CLI\n    # e.g.: option mem=* -l mem_free=$0,ram_free=$0\n    my $option = $1;     # mem\n    my $arg= $2;         # -l mem_free=$0,ram_free=$0\n    if ($arg !~ m:\\$0:) {\n      print STDERR \"Warning: the line '$line' in config file ($config) does not contain the substitution variable \\$0\\n\";\n    }\n    if (exists $cli_options{$option}) {\n      # Replace $0 with the argument read from command line.\n      # e.g. \"-l mem_free=$0,ram_free=$0\" -> \"-l mem_free=2G,ram_free=2G\"\n      $arg =~ s/\\$0/$cli_options{$option}/g;\n      $cli_config_options{$option} = $arg;\n    }\n  } elsif ($_ =~ m/^option ([^=]+)=(\\S+)\\s?(.*)$/) {\n    # Config option that does not need replacement\n    # e.g. 
option gpu=0 -q all.q\n    my $option = $1;      # gpu\n    my $value = $2;       # 0\n    my $arg = $3;         # -q all.q\n    if (exists $cli_options{$option}) {\n      $cli_default_options{($option,$value)} = $arg;\n    }\n  } elsif ($_ =~ m/^default (\\S+)=(\\S+)/) {\n    # Default options. Used for setting default values to options i.e. when\n    # the user does not specify the option on the command line\n    # e.g. default gpu=0\n    my $option = $1;  # gpu\n    my $value = $2;   # 0\n    if (!exists $cli_options{$option}) {\n      # If the user has specified this option on the command line, then we\n      # don't have to do anything\n      $cli_options{$option} = $value;\n    }\n  } else {\n    print STDERR \"$0: unable to parse line '$line' in config file ($config)\\n\";\n    exit(1);\n  }\n}\n\nclose(CONFIG);\n\nif ($read_command != 1) {\n  print STDERR \"$0: config file ($config) does not contain the line \\\"command .*\\\"\\n\";\n  exit(1);\n}\n\nfor my $option (keys %cli_options) {\n  if ($option eq \"config\") { next; }\n\n  my $value = $cli_options{$option};\n\n  if ($option eq \"max_jobs_run\") {\n    if ($array_job != 1) {\n      print STDERR \"Ignoring $option since this is not an array task.\";\n    } else {\n      $max_jobs_run = $value;\n    }\n  } elsif (exists $cli_default_options{($option,$value)}) {\n    $qsub_opts .= \"$cli_default_options{($option,$value)} \";\n  } elsif (exists $cli_config_options{$option}) {\n    $qsub_opts .= \"$cli_config_options{$option} \";\n  } elsif (exists $cli_default_options{($option,\"*\")}) {\n    $qsub_opts .= $cli_default_options{($option,\"*\")} . 
\" \";\n  } else {\n    if ($opened_config_file == 0) {\n      $config = \"default config file\";\n    }\n    die \"$0: Command line option $option not described in $config (or value '$value' not allowed)\\n\";\n  }\n}\n\nmy $cwd = getcwd();\nmy $logfile = shift @ARGV;\n\nif ($array_job == 1 && $logfile !~ m/$jobname/\n    && $jobend > $jobstart) {\n  print STDERR \"$0: you are trying to run a parallel job but \"\n    . \"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n#\n# Work out the command; quote escaping is done here.\n# Note: the rules for escaping stuff are worked out pretty\n# arbitrarily, based on what we want it to do.  Some things that\n# we pass as arguments to $0, such as \"|\", we want to be\n# interpreted by bash, so we don't escape them.  Other things,\n# such as archive specifiers like 'ark:gunzip -c foo.gz|', we want\n# to be passed, in quotes, to the Kaldi program.  Our heuristic\n# is that stuff with spaces in should be quoted.  This doesn't\n# always work.\n#\nmy $cmd = \"\";\n\nforeach my $x (@ARGV) {\n  if ($x =~ m/^\\S+$/) { $cmd .= $x . \" \"; } # If string contains no spaces, take\n                                            # as-is.\n  elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; } # else if no dbl-quotes, use single\n  else { $cmd .= \"\\\"$x\\\" \"; }  # else use double.\n}\n\n#\n# Work out the location of the script file, and open it for writing.\n#\nmy $dir = dirname($logfile);\nmy $base = basename($logfile);\nmy $qdir = \"$dir/q\";\n$qdir =~ s:/(log|LOG)/*q:/q:; # If qdir ends in .../log/q, make it just .../q.\nmy $queue_logfile = \"$qdir/$base\";\n\nif (!-d $dir) { system \"mkdir -p $dir 2>/dev/null\"; } # another job may be doing this...\nif (!-d $dir) { die \"Cannot make the directory $dir\\n\"; }\n# make a directory called \"q\",\n# where we will put the log created by qsub... normally this doesn't contain\n# anything interesting, evertyhing goes to $logfile.\nif (! 
-d \"$qdir\") {\n  system \"mkdir $qdir 2>/dev/null\";\n  sleep(5); ## This is to fix an issue we encountered in denominator lattice creation,\n  ## where if e.g. the exp/tri2b_denlats/log/15/q directory had just been\n  ## created and the job immediately ran, it would die with an error because nfs\n  ## had not yet synced.  I'm also decreasing the acdirmin and acdirmax in our\n  ## NFS settings to something like 5 seconds.\n}\n\nmy $queue_array_opt = \"\";\nif ($array_job == 1) { # It's an array job.\n  if ($max_jobs_run) {\n      $queue_array_opt = \"--array ${jobstart}-${jobend}%${max_jobs_run}\";\n  } else {\n      $queue_array_opt = \"--array ${jobstart}-${jobend}\";\n  }\n  $logfile =~ s/$jobname/\\$SLURM_ARRAY_TASK_ID/g; # This variable will get\n  # replaced by qsub, in each job, with the job-id.\n  $cmd =~ s/$jobname/\\$\\{SLURM_ARRAY_TASK_ID\\}/g; # same for the command...\n  $queue_logfile =~ s/\\.?$jobname//; # the log file in the q/ subdirectory\n  # is for the queue to put its log, and this doesn't need the task array subscript\n  # so we remove it.\n}\n\n# queue_scriptfile is as $queue_logfile [e.g. dir/q/foo.log] but\n# with the suffix .sh.\nmy $queue_scriptfile = $queue_logfile;\n($queue_scriptfile =~ s/\\.[a-zA-Z]{1,5}$/.sh/) || ($queue_scriptfile .= \".sh\");\nif ($queue_scriptfile !~ m:^/:) {\n  $queue_scriptfile = $cwd . \"/\" . $queue_scriptfile; # just in case.\n}\n\n# We'll write to the standard input of \"qsub\" (the file-handle Q),\n# the job that we want it to execute.\n# Also keep our current PATH around, just in case there was something\n# in it that we need (although we also source ./path.sh)\n\nmy $syncfile = \"$qdir/done.$$\";\n\nsystem(\"rm $queue_logfile $syncfile 2>/dev/null\");\n#\n# Write to the script file, and then close it.\n#\nopen(Q, \">$queue_scriptfile\") || die \"Failed to write to $queue_scriptfile\";\n\nprint Q \"#!/bin/bash\\n\";\nprint Q \"cd $cwd\\n\";\nprint Q \". 
./path.sh\\n\";\nprint Q \"( echo '#' Running on \\`hostname\\`\\n\";\nprint Q \"  echo '#' Started at \\`date\\`\\n\";\nprint Q \"  set | grep SLURM | while read line; do echo \\\"# \\$line\\\"; done\\n\";\nprint Q \"  echo -n '# '; cat <<EOF\\n\";\nprint Q \"$cmd\\n\"; # this is a way of echoing the command into a comment in the log file,\nprint Q \"EOF\\n\"; # without having to escape things like \"|\" and quote characters.\nprint Q \") >$logfile\\n\";\nprint Q \"if [ \\\"\\$CUDA_VISIBLE_DEVICES\\\" == \\\"NoDevFiles\\\" ]; then\\n\";\nprint Q \"  ( echo CUDA_VISIBLE_DEVICES set to NoDevFiles, unsetting it... \\n\";\nprint Q \"  )>>$logfile\\n\";\nprint Q \"  unset CUDA_VISIBLE_DEVICES\\n\";\nprint Q \"fi\\n\";\nprint Q \"time1=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \" ( $cmd ) &>>$logfile\\n\";\nprint Q \"ret=\\$?\\n\";\nprint Q \"sync || true\\n\";\nprint Q \"time2=\\`date +\\\"%s\\\"\\`\\n\";\nprint Q \"echo '#' Accounting: begin_time=\\$time1 >>$logfile\\n\";\nprint Q \"echo '#' Accounting: end_time=\\$time2 >>$logfile\\n\";\nprint Q \"echo '#' Accounting: time=\\$((\\$time2-\\$time1)) threads=$num_threads >>$logfile\\n\";\nprint Q \"echo '#' Finished at \\`date\\` with status \\$ret >>$logfile\\n\";\nprint Q \"[ \\$ret -eq 137 ] && exit 100;\\n\"; # If process was killed (e.g. oom) it will exit with status 137;\n  # let the script return with status 100 which will put it to E state; more easily rerunnable.\nif ($array_job == 0) { # not an array job\n  print Q \"touch $syncfile\\n\"; # so we know it's done.\n} else {\n  print Q \"touch $syncfile.\\$SLURM_ARRAY_TASK_ID\\n\"; # touch a bunch of sync-files.\n}\nprint Q \"exit \\$[\\$ret ? 
1 : 0]\\n\"; # avoid status 100 which grid-engine\nprint Q \"## submitted with:\\n\";       # treats specially.\n$qsub_cmd .= \" $qsub_opts --open-mode=append -e ${queue_logfile} -o ${queue_logfile} $queue_array_opt $queue_scriptfile >>$queue_logfile 2>&1\";\nprint Q \"# $qsub_cmd\\n\";\nif (!close(Q)) { # close was not successful... || die \"Could not close script file $shfile\";\n  die \"Failed to close the script file (full disk?)\";\n}\n\nmy $ret = system ($qsub_cmd);\nif ($ret != 0) {\n  if ($sync && $ret == 256) { # this is the exit status when a job failed (bad exit status)\n    if (defined $jobname) { $logfile =~ s/\\$SLURM_ARRAY_TASK_ID/*/g; }\n    print STDERR \"$0: job writing to $logfile failed\\n\";\n  } else {\n    print STDERR \"$0: error submitting jobs to queue (return status was $ret)\\n\";\n    print STDERR \"queue log file is $queue_logfile, command was $qsub_cmd\\n\";\n    print STDERR `tail $queue_logfile`;\n  }\n  exit(1);\n}\n\nmy $sge_job_id;\nif (! $sync) { # We're not submitting with -sync y, so we\n  # need to wait for the jobs to finish.  
We wait for the\n  # sync-files we \"touched\" in the script to exist.\n  my @syncfiles = ();\n  if (!defined $jobname) { # not an array job.\n    push @syncfiles, $syncfile;\n  } else {\n    for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n      push @syncfiles, \"$syncfile.$jobid\";\n    }\n  }\n  # We will need the sge_job_id, to check that job still exists\n  { # Get the SLURM job-id from the log file in q/\n    open(L, \"<$queue_logfile\") || die \"Error opening log file $queue_logfile\";\n    undef $sge_job_id;\n    while (<L>) {\n      if (m/Submitted batch job (\\d+)/) {\n        if (defined $sge_job_id) {\n          die \"Error: your job was submitted more than once (see $queue_logfile)\";\n        } else {\n          $sge_job_id = $1;\n        }\n      }\n    }\n    close(L);\n    if (!defined $sge_job_id) {\n      die \"Error: log file $queue_logfile does not specify the SLURM job-id.\";\n    }\n  }\n  my $check_sge_job_ctr=1;\n  #\n  my $wait = 0.1;\n  my $counter = 0;\n  foreach my $f (@syncfiles) {\n    # wait for them to finish one by one.\n    while (! -f $f) {\n      sleep($wait);\n      $wait *= 1.2;\n      if ($wait > 3.0) {\n        $wait = 3.0; # never wait more than 3 seconds.\n        # the following (.kick) commands are basically workarounds for NFS bugs.\n        if (rand() < 0.25) { # don't do this every time...\n          if (rand() > 0.5) {\n            system(\"touch $qdir/.kick 2>/dev/null\");\n          } else {\n            system(\"rm $qdir/.kick 2>/dev/null\");\n          }\n        }\n        if ($counter++ % 10 == 0) {\n          # This seems to kick NFS in the teeth to cause it to refresh the\n          # directory.  
I've seen cases where it would indefinitely fail to get\n          # updated, even though the file exists on the server.\n          # Only do this every 10 waits (every 30 seconds) though, or if there\n          # are many jobs waiting they can overwhelm the file server.\n          system(\"ls $qdir >/dev/null\");\n        }\n      }\n\n      # Check that the job exists in SLURM. Job can be killed if duration\n      # exceeds some hard limit, or in case of a machine shutdown.\n      if (($check_sge_job_ctr++ % 10) == 0) { # Don't run squeue too often, avoid stress on SLURM.\n        if ( -f $f ) { next; }; #syncfile appeared: OK.\n        # exec_command(...) returns (output, exit status); the status has\n        # already been shifted right by eight bits.\n        my ($squeue_output, $squeue_status) = exec_command(\"squeue -j $sge_job_id\");\n        if ($squeue_status == 1) {\n          # Don't consider an immediately missing job as an error; first wait\n          # a little and query squeue again.\n          sleep(4);\n          ($squeue_output, $squeue_status) = exec_command(\"squeue -j $sge_job_id\");\n        }\n        if ($squeue_status == 1) {\n          # The job is still missing. Wait some more time to make sure it is\n          # not just delayed creation of the syncfile.\n          sleep(4);\n          # Sometimes NFS gets confused and thinks it's transmitted the directory\n          # but it hasn't, due to timestamp issues.  
Changing something in the\n          # directory will usually fix that.\n          system(\"touch $qdir/.kick\");\n          system(\"rm $qdir/.kick 2>/dev/null\");\n          if ( -f $f ) { next; }   #syncfile appeared, ok\n          sleep(7);\n          system(\"touch $qdir/.kick\");\n          sleep(1);\n          system(\"rm $qdir/.kick 2>/dev/null\");\n          if ( -f $f ) {  next; }   #syncfile appeared, ok\n          sleep(60);\n          system(\"touch $qdir/.kick\");\n          sleep(1);\n          system(\"rm $qdir/.kick 2>/dev/null\");\n          if ( -f $f ) { next; }  #syncfile appeared, ok\n          $f =~ m/\\.(\\d+)$/ || die \"Bad sync-file name $f\";\n          my $job_id = $1;\n          if (defined $jobname) {\n            $logfile =~ s/\\$SLURM_ARRAY_TASK_ID/$job_id/g;\n          }\n          my $last_line = `tail -n 1 $logfile`;\n          if ($last_line =~ m/status 0$/ && (-M $logfile) < 0) {\n            # if the last line of $logfile ended with \"status 0\" and\n            # $logfile is newer than this program [(-M $logfile) gives the\n            # time elapsed between file modification and the start of this\n            # program], then we assume the program really finished OK,\n            # and maybe something is up with the file system.\n            print STDERR \"**$0: syncfile $f was not created but job seems\\n\" .\n              \"**to have finished OK.  Probably your file-system has problems.\\n\" .\n              \"**This is just a warning.\\n\";\n            last;\n          } else {\n            chop $last_line;\n            print STDERR \"$0: Error: Job $sge_job_id seems to no longer exist:\\n\" .\n              \"'squeue -j $sge_job_id' returned error code $squeue_status and said:\\n\" .\n              \"  $squeue_output\\n\" .\n              \"Syncfile $f does not exist, meaning that the job did not finish.\\n\" .\n              \"Log is in $logfile. 
Last line '$last_line' does not end in 'status 0'.\\n\" .\n              \"Possible reasons:\\n\" .\n              \"  a) Exceeded time limit? -> Use more jobs!\\n\" .\n              \"  b) Shutdown/Frozen machine? -> Run again! squeue:\\n\";\n            system(\"squeue -j $sge_job_id\");\n            exit(1);\n          }\n        } elsif ($squeue_status != 0) {\n          print STDERR \"$0: Warning: squeue command returned status $squeue_status (squeue -j $sge_job_id,$!)\\n\";\n        }\n      }\n    }\n  }\n  my $all_syncfiles = join(\" \", @syncfiles);\n  system(\"rm $all_syncfiles 2>/dev/null\");\n}\n\n# OK, at this point we are synced; we know the job is done.\n# But we don't know about its exit status.  We'll look at $logfile for this.\n# First work out an array @logfiles of file-locations we need to\n# read (just one, unless it's an array job).\nmy @logfiles = ();\nif (!defined $jobname) { # not an array job.\n  push @logfiles, $logfile;\n} else {\n  for (my $jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n    my $l = $logfile;\n    $l =~ s/\\$SLURM_ARRAY_TASK_ID/$jobid/g;\n    push @logfiles, $l;\n  }\n}\n\nmy $num_failed = 0;\nmy $status = 1;\nforeach my $l (@logfiles) {\n  my @wait_times = (0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 1.0, 2.0, 5.0, 5.0, 5.0, 10.0, 25.0);\n  for (my $iter = 0; $iter <= @wait_times; $iter++) {\n    my $line = `tail -10 $l 2>/dev/null`; # Note: although this line should be the last\n    # line of the file, I've seen cases where it was not quite the last line because\n    # of delayed output by the process that was running, or processes it had called.\n    # so tail -10 gives it a little leeway.\n    if ($line =~ m/with status (\\d+)/) {\n      $status = $1;\n      last;\n    } else {\n      if ($iter < @wait_times) {\n        sleep($wait_times[$iter]);\n      } else {\n        if (! 
-f $l) {\n          print STDERR \"Log-file $l does not exist.\\n\";\n        } else {\n          print STDERR \"The last line of log-file $l does not seem to indicate the \"\n            . \"return status as expected\\n\";\n        }\n        exit(1);                # Something went wrong with the queue, or the\n        # machine it was running on, probably.\n      }\n    }\n  }\n  # OK, now we have $status, which is the return-status of\n  # the command in the job.\n  if ($status != 0) { $num_failed++; }\n}\nif ($num_failed == 0) { exit(0); }\nelse { # we failed.\n  if (@logfiles == 1) {\n    if (defined $jobname) { $logfile =~ s/\\$SLURM_ARRAY_TASK_ID/$jobstart/g; }\n    print STDERR \"$0: job failed with status $status, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"$0: probably you forgot to put JOB=1:\\$nj in your script.\\n\";\n    }\n  } else {\n    if (defined $jobname) { $logfile =~ s/\\$SLURM_ARRAY_TASK_ID/*/g; }\n    my $numjobs = 1 + $jobend - $jobstart;\n    print STDERR \"$0: $num_failed / $numjobs failed, log is in $logfile\\n\";\n  }\n  exit(1);\n}\n"
  },
  {
    "path": "egs/utils/parse_options.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2012  Johns Hopkins University (Author: Daniel Povey);\n#                 Arnab Ghoshal, Karel Vesely\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# Parse command-line options.\n# To be sourced by another script (as in \". parse_options.sh\").\n# Option format is: --option-name arg\n# and shell variable \"option_name\" gets set to value \"arg.\"\n# The exception is --help, which takes no arguments, but prints the\n# $help_message variable (if defined).\n\n\n###\n### The --config file options have lower priority to command line\n### options, so we need to import them first...\n###\n\n# Now import all the configs specified by command-line, in left-to-right order\nfor ((argpos=1; argpos<$#; argpos++)); do\n  if [ \"${!argpos}\" == \"--config\" ]; then\n    argpos_plus1=$((argpos+1))\n    config=${!argpos_plus1}\n    [ ! -r $config ] && echo \"$0: missing config '$config'\" && exit 1\n    . $config  # source the config file.\n  fi\ndone\n\n\n###\n### Now we process the command line options\n###\nwhile true; do\n  [ -z \"${1:-}\" ] && break;  # break if there are no arguments\n  case \"$1\" in\n    # If the enclosing script is called with --help option, print the help\n    # message and exit.  
Scripts should put help messages in $help_message\n    --help|-h) if [ -z \"$help_message\" ]; then echo \"No help found.\" 1>&2;\n      else printf \"$help_message\\n\" 1>&2 ; fi;\n      exit 0 ;;\n    --*=*) echo \"$0: options to scripts must be of the form --name value, got '$1'\"\n      exit 1 ;;\n    # If the first command-line argument begins with \"--\" (e.g. --foo-bar),\n    # then work out the variable name as $name, which will equal \"foo_bar\".\n    --*) name=`echo \"$1\" | sed s/^--// | sed s/-/_/g`;\n      # Next we test whether the variable in question is undefined-- if so it's\n      # an invalid option and we die.  Note: $0 evaluates to the name of the\n      # enclosing script.\n      # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar\n      # is undefined.  We then have to wrap this test inside \"eval\" because\n      # foo_bar is itself inside a variable ($name).\n      eval '[ -z \"${'$name'+xxx}\" ]' && echo \"$0: invalid option $1\" 1>&2 && exit 1;\n\n      oldval=\"`eval echo \\\\$$name`\";\n      # Work out whether we seem to be expecting a Boolean argument.\n      if [ \"$oldval\" == \"true\" ] || [ \"$oldval\" == \"false\" ]; then\n        was_bool=true;\n      else\n        was_bool=false;\n      fi\n\n      # Set the variable to the right value-- the escaped quotes make it work if\n      # the option had spaces, like --cmd \"queue.pl -sync y\"\n      eval $name=\\\"$2\\\";\n\n      # Check that Boolean-valued arguments are really Boolean.\n      if $was_bool && [[ \"$2\" != \"true\" && \"$2\" != \"false\" ]]; then\n        echo \"$0: expected \\\"true\\\" or \\\"false\\\": $1 $2\" 1>&2\n        exit 1;\n      fi\n      shift 2;\n      ;;\n  *) break;\n  esac\ndone\n\n\n# Check for an empty argument to the --cmd option, which can easily occur as a\n# result of scripting errors.\n[ ! 
-z \"${cmd+xxx}\" ] && [ -z \"$cmd\" ] && echo \"$0: empty argument to --cmd option\" 1>&2 && exit 1;\n\n\ntrue; # so this script returns exit code 0.\n"
  },
  {
    "path": "egs/utils/perturb_data_dir_speed.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2013  Johns Hopkins University (author: Daniel Povey)\n#           2014  Tom Ko\n#           2018  Emotech LTD (author: Pawel Swietojanski)\n# Apache 2.0\n\n# This script operates on a directory, such as in data/train/,\n# that contains some subset of the following files:\n#  wav.scp\n#  spk2utt\n#  utt2spk\n#  text\n#  utt2dur\n#  reco2dur\n#\n# It generates the files which are used for perturbing the speed of the original data.\n\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n  echo \"Usage: perturb_data_dir_speed.sh <warping-factor> <srcdir> <destdir>\"\n  echo \"e.g.:\"\n  echo \" $0 0.9 data/train_si284 data/train_si284p\"\n  exit 1\nfi\n\nexport LC_ALL=C\n\nfactor=$1\nsrcdir=$2\ndestdir=$3\nlabel=\"sp\"\nspk_prefix=$label$factor\"-\"\nutt_prefix=$label$factor\"-\"\n\n#check is sox on the path\nwhich sox &>/dev/null\n! [ $? -eq 0 ] && echo \"sox: command not found\" && exit 1;\n\nif [ ! -f $srcdir/utt2spk ]; then\n  echo \"$0: no such file $srcdir/utt2spk\"\n  exit 1;\nfi\n\nif [ \"$destdir\" == \"$srcdir\" ]; then\n  echo \"$0: this script requires <srcdir> and <destdir> to be different.\"\n  exit 1\nfi\n\nset -e;\nset -o pipefail\n\nmkdir -p $destdir\n\ncat $srcdir/utt2spk | awk -v p=$utt_prefix '{printf(\"%s %s%s\\n\", $1, p, $1);}' > $destdir/utt_map\ncat $srcdir/spk2utt | awk -v p=$spk_prefix '{printf(\"%s %s%s\\n\", $1, p, $1);}' > $destdir/spk_map\ncat $srcdir/wav.scp | awk -v p=$spk_prefix '{printf(\"%s %s%s\\n\", $1, p, $1);}' > $destdir/reco_map\nif [ ! 
-f $srcdir/utt2uniq ]; then\n  cat $srcdir/utt2spk | awk -v p=$utt_prefix '{printf(\"%s%s %s\\n\", p, $1, $1);}' > $destdir/utt2uniq\nelse\n  cat $srcdir/utt2uniq | awk -v p=$utt_prefix '{printf(\"%s%s %s\\n\", p, $1, $2);}' > $destdir/utt2uniq\nfi\n\n\ncat $srcdir/utt2spk | utils/apply_map.pl -f 1 $destdir/utt_map  | \\\n  utils/apply_map.pl -f 2 $destdir/spk_map >$destdir/utt2spk\n\nutils/utt2spk_to_spk2utt.pl <$destdir/utt2spk >$destdir/spk2utt\n\nif [ -f $srcdir/segments ]; then\n\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/segments | \\\n    utils/apply_map.pl -f 2 $destdir/reco_map | \\\n      awk -v factor=$factor \\\n        '{s=$3/factor; e=$4/factor; if (e > s + 0.01) { printf(\"%s %s %.2f %.2f\\n\", $1, $2, $3/factor, $4/factor);} }' >$destdir/segments\n\n  utils/apply_map.pl -f 1 $destdir/reco_map <$srcdir/wav.scp | sed 's/| *$/ |/' | \\\n    # Handle three cases of rxfilenames appropriately; \"input piped command\", \"file offset\" and \"filename\" \n    awk -v factor=$factor \\\n        '{wid=$1; $1=\"\"; if ($NF==\"|\") {print wid $_ \" sox -t wav - -t wav - speed \" factor \" |\"}\n          else if (match($0, /:[0-9]+$/)) {print wid \" wav-copy\" $_ \" - | sox -t wav - -t wav - speed \" factor \" |\" } \n          else  {print wid \" sox -t wav\" $_ \" -t wav - speed \" factor \" |\"}}' > $destdir/wav.scp\n  if [ -f $srcdir/reco2file_and_channel ]; then\n    utils/apply_map.pl -f 1 $destdir/reco_map <$srcdir/reco2file_and_channel >$destdir/reco2file_and_channel\n  fi\n\nelse # no segments->wav indexed by utterance.\n  if [ -f $srcdir/wav.scp ]; then\n    utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/wav.scp | sed 's/| *$/ |/' | \\\n     # Handle three cases of rxfilenames appropriately; \"input piped command\", \"file offset\" and \"filename\" \n     awk -v factor=$factor \\\n       '{wid=$1; $1=\"\"; if ($NF==\"|\") {print wid $_ \" sox -t wav - -t wav - speed \" factor \" |\"}\n         else if (match($0, /:[0-9]+$/)) {print wid \" 
wav-copy\" $_ \" - | sox -t wav - -t wav - speed \" factor \" |\" } \n         else {print wid \" sox -t wav\" $_ \" -t wav - speed \" factor \" |\"}}' > $destdir/wav.scp\n  fi\nfi\n\nif [ -f $srcdir/text ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/text >$destdir/text\nfi\nif [ -f $srcdir/spk2gender ]; then\n  utils/apply_map.pl -f 1 $destdir/spk_map <$srcdir/spk2gender >$destdir/spk2gender\nfi\nif [ -f $srcdir/utt2lang ]; then\n  utils/apply_map.pl -f 1 $destdir/utt_map <$srcdir/utt2lang >$destdir/utt2lang\nfi\n\n#prepare speed-perturbed utt2dur\nif [ ! -f $srcdir/utt2dur ]; then\n  # generate utt2dur if it does not exist in srcdir\n  utils/data/get_utt2dur.sh $srcdir\nfi\ncat $srcdir/utt2dur | utils/apply_map.pl -f 1 $destdir/utt_map  | \\\n  awk -v factor=$factor '{print $1, $2/factor;}' >$destdir/utt2dur\n\n#prepare speed-perturbed reco2dur \nif [ ! -f $srcdir/reco2dur ]; then\n  # generate reco2dur if it does not exist in srcdir\n  utils/data/get_reco2dur.sh $srcdir\nfi\ncat $srcdir/reco2dur | utils/apply_map.pl -f 1 $destdir/reco_map  | \\\n  awk -v factor=$factor '{print $1, $2/factor;}' >$destdir/reco2dur\n\nrm $destdir/spk_map $destdir/utt_map $destdir/reco_map 2>/dev/null\necho \"$0: generated speed-perturbed version of data in $srcdir, in $destdir\"\n\nutils/validate_data_dir.sh --no-feats --no-text $destdir\n"
  },
  {
    "path": "egs/utils/pinyin_map.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n\n$num_args = $#ARGV + 1;\nif ($num_args != 1) {\n  print \"\\nUsage: pinyin2phone.pl pinyin2phone\\n\";\n  exit;\n}\n\nopen(MAPS, $ARGV[0]) or die(\"Could not open pinyin map file.\");\nmy %py2ph; foreach $line (<MAPS>) { @A = split(\" \", $line);\n  $py = shift(@A);\n  $py2ph{$py} = [@A];\n}\n\n#foreach $word ( keys %py2ph ) {\n     #foreach $i ( 0 .. $#{ $py2ph{$word} } ) {\n     #    print \" $word = $py2ph{$word}[$i]\";\n     #}\n     #print \" $#{ $py2ph{$word} }\";\n     #print \"\\n\";\n#}\n\nmy @entry;\n\nwhile (<STDIN>) {\n  @A = split(\" \", $_);\n  @entry = ();\n  $W = shift(@A);\n  push(@entry, $W);\n  for($i = 0; $i < @A; $i++) {\n    $initial= $A[$i]; $final = $A[$i];\n    #print $initial, \" \", $final, \"\\n\";\n    if ($A[$i] =~ /^CH[A-Z0-9]+$/) {$initial =~ s:(CH)[A-Z0-9]+:$1:; $final =~ s:CH([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^SH[A-Z0-9]+$/) {$initial =~ s:(SH)[A-Z0-9]+:$1:; $final =~ s:SH([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^ZH[A-Z0-9]+$/) {$initial =~ s:(ZH)[A-Z0-9]+:$1:; $final =~ s:ZH([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^B[A-Z0-9]+$/) {$initial =~ s:(B)[A-Z0-9]+:$1:; $final =~ s:B([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^C[A-Z0-9]+$/) {$initial =~ s:(C)[A-Z0-9]+:$1:; $final =~ s:C([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^D[A-Z0-9]+$/) {$initial =~ s:(D)[A-Z0-9]+:$1:; $final =~ s:D([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^F[A-Z0-9]+$/) {$initial =~ s:(F)[A-Z0-9]+:$1:; $final =~ s:F([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^G[A-Z0-9]+$/) {$initial =~ s:(G)[A-Z0-9]+:$1:; $final =~ s:G([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^H[A-Z0-9]+$/) {$initial =~ s:(H)[A-Z0-9]+:$1:; $final =~ s:H([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^J[A-Z0-9]+$/) {$initial =~ s:(J)[A-Z0-9]+:$1:; $final =~ s:J([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^K[A-Z0-9]+$/) {$initial =~ s:(K)[A-Z0-9]+:$1:; $final =~ s:K([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^L[A-Z0-9]+$/) 
{$initial =~ s:(L)[A-Z0-9]+:$1:; $final =~ s:L([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^M[A-Z0-9]+$/) {$initial =~ s:(M)[A-Z0-9]+:$1:; $final =~ s:M([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^N[A-Z0-9]+$/) {$initial =~ s:(N)[A-Z0-9]+:$1:; $final =~ s:N([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^P[A-Z0-9]+$/) {$initial =~ s:(P)[A-Z0-9]+:$1:; $final =~ s:P([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^Q[A-Z0-9]+$/) {$initial =~ s:(Q)[A-Z0-9]+:$1:; $final =~ s:Q([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^R[A-Z0-9]+$/) {$initial =~ s:(R)[A-Z0-9]+:$1:; $final =~ s:R([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^S[A-Z0-9]+$/) {$initial =~ s:(S)[A-Z0-9]+:$1:; $final =~ s:S([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^T[A-Z0-9]+$/) {$initial =~ s:(T)[A-Z0-9]+:$1:; $final =~ s:T([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^W[A-Z0-9]+$/) {$initial =~ s:(W)[A-Z0-9]+:$1:; $final =~ s:W([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^X[A-Z0-9]+$/) {$initial =~ s:(X)[A-Z0-9]+:$1:; $final =~ s:X([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^Y[A-Z0-9]+$/) {$initial =~ s:(Y)[A-Z0-9]+:$1:; $final =~ s:Y([A-Z0-9]+):$1:;}\n    elsif ($A[$i] =~ /^Z[A-Z0-9]+$/) {$initial =~ s:(Z)[A-Z0-9]+:$1:; $final =~ s:Z([A-Z0-9]+):$1:;}\n    if ($initial ne $A[$i]) {\n      $tone = $final;\n      $final =~ s:([A-Z]+)[0-9]:$1:;\n      $tone =~ s:[A-Z]+([0-9]):$1:;\n      if (!(exists $py2ph{$initial}) or !(exists $py2ph{$final})) { die \"$0: no entry find for \", $A[$i], \" \", $initial, \" \", $final;}\n      push(@entry, @{$py2ph{$initial}});\n      @tmp = @{$py2ph{$final}};\n      for($j = 0; $j < @tmp ; $j++) {$tmp[$j] = $tmp[$j].$tone;}\n      push(@entry, @tmp);\n    }\n    else {\n      $tone = $A[$i];\n      $A[$i] =~ s:([A-Z]+)[0-9]:$1:;\n      $tone =~ s:[A-Z]+([0-9]):$1:;\n      if (!(exists $py2ph{$A[$i]})) { die \"$0: no entry find for \", $A[$i];}\n      @tmp = @{$py2ph{$A[$i]}};\n      for($j = 0; $j < @tmp ; $j++) {$tmp[$j] = $tmp[$j].$tone;}\n      push(@entry, @tmp);\n    }\n  }\n  print \"@entry\";\n  print 
\"\\n\";\n}\n"
  },
  {
    "path": "egs/utils/prepare_extended_lang.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2018  Xiaohui Zhang\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script adds word-position-dependent phones and constructs a host of other\n# derived files, that go in data/lang/.\n\n# Begin configuration section.\nprep_lang_opts=\nstage=0\nword_list= # if a word list (mapping words from the srcdict to IDs) is provided,\n# we'll make sure the IDs of these words are kept as before.\n# end configuration sections\n\necho \"$0: warning: This sript is is now deprecated. You may want to use utils/lang/extend_lang.sh\"\necho \"$0 $@\"  # Print the command line for logging\n\n. utils/parse_options.sh\n\nif [ $# -ne 7 ]; then\n  echo \"usage: utils/prepare_extended_lang.sh <dict-src-dir> <oov-dict-entry> <extra-lexicon> \"\n  echo \"<phone-symbol-table> <extended-dict-dir> <tmp-dir> <extended-lang-dir>\"\n  echo \"e.g.: utils/prepare_extended_lang.sh data/local/dict '<SPOKEN_NOISE>' lexicon_extra.txt\"\n  echo \"data/lang/phones.txt data/local/dict_ext data/local/lang_ext data/lang_ext\"\n  echo \"The goal is to extend the lexicon from <dict-src-dir> with extra lexical entries from \"\n  echo \"<extra-lexicon>, putting the extended lexicon into <extended-dict-dir>, and then build\"\n  echo \"a valid lang dir <extended-lang-dir>. 
This is useful when we want to extend the vocab\"\n  echo \"at test time.\"\n  echo \"<dict-src-dir> must be a valid dictionary dir and <oov-dict-entry> is the oov word \"\n  echo \"(see utils/prepare_lang.sh for details). A phone symbol table from a previously built \"\n  echo \"lang dir is required, for validating provided lexical entries.\"\n  echo \"options: \"\n  echo \"     --prep-lang-opts STRING              # options to pass to utils/prepare_lang.sh\"\n  echo \"     --word-list <filename>               # default: \\\"\\\"; if not empty, re-order the \"\n  echo \"                                          # words in the generated words.txt so that the\"\n  echo \"                                          # words from the provided list have their ids\"\n  echo \"                                          # kept unchanged.\"\n  exit 1;\nfi\n\nsrcdict=$1\noov_word=$2\nextra_lexicon=$3\nphone_symbol_table=$4\nextdict=$5 # extended dict dir\ntmpdir=$6\nextlang=$7 # extended lang dir\n\nmkdir -p $extlang $tmpdir \n\n[ -f path.sh ] && . ./path.sh\n\n! utils/validate_dict_dir.pl $srcdict && \\\n  echo \"*Error validating directory $srcdict*\" && exit 1;\n\nif [[ ! -f $srcdict/lexicon.txt ]]; then\n  echo \"**Creating $srcdict/lexicon.txt from $srcdict/lexiconp.txt\"\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' < $srcdict/lexiconp.txt \\\n    > $srcdict/lexicon.txt || exit 1;\nfi\n\nif [[ ! 
-f $srcdict/lexiconp.txt ]]; then\n  echo \"**Creating $srcdict/lexiconp.txt from $srcdict/lexicon.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1.0\\t$2/;' < $srcdict/lexicon.txt > $srcdict/lexiconp.txt || exit 1;\nfi\n\n# Checks if the phone sets match.\necho \"$(basename $0): Validating the source lexicon\"\ncat $srcdict/lexicon.txt | awk -v f=$phone_symbol_table '\nBEGIN { while ((getline < f) > 0) { sub(/_[BEIS]$/, \"\", $1); phones[$1] = 1; }}\n{ for (x = 2; x <= NF; ++x) { \n  if (!($x in phones)) {\n    print \"The source lexicon contains a phone not in the phones.txt: \"$x;\n    print \"You must provide a phones.txt from the lang built with the source lexicon.\";\n    exit 1; \n  }\n}}' || exit 1;\n\necho \"$(basename $0): Validating the extra lexicon\"\ncat $extra_lexicon | awk -v f=$phone_symbol_table '\nBEGIN { while ((getline < f) > 0) { sub(/_[BEIS]$/, \"\", $1); phones[$1] = 1; }}\n{ for (x = 2; x <= NF; ++x) { if (!($x in phones)) {\n    print \"The extra lexicon contains a phone not in the phone symbol table: \"$x; exit 1; }\n  }\n}' || exit 1;\n\nif [ $stage -le 0 ]; then\n  # Generate the extended dict dir\n  echo \"$(basename $0): Creating the extended lexicon $extdict/lexicon.txt\"\n  [ -d $extdict ] && rm -r $extdict 2>/dev/null\n  cp -R $srcdict $extdict 2>/dev/null\n  \n  # Reformat the source lexicon\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' <$srcdict/lexiconp.txt | awk '{ gsub(/\\t/, \" \"); print }' \\\n   >$tmpdir/lexicon.txt || exit 1;\n  \n  # Filter lexical entries which are already in the source lexicon\n  awk '{ gsub(/\\t/, \" \"); print }' $extra_lexicon | sort -u | \\\n    awk 'NR==FNR{a[$0]=1;next} {if (!($0 in a)) print $0 }' $tmpdir/lexicon.txt - \\\n    > $extdict/lexicon_extra.txt || exit 1;\n  \n  echo \"$(basename $0): Creating $extdict/lexiconp.txt from $srcdict/lexiconp.txt and $extdict/lexicon_extra.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1 $2/;' < $extdict/lexicon_extra.txt | \\\n    cat $srcdict/lexiconp.txt - | awk 
'{ gsub(/\\t/, \" \"); print }' | \\\n    sort -u -k1,1 -k2g,2 -k3 > $extdict/lexiconp.txt || exit 1;\n  \n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' <$extdict/lexiconp.txt  >$extdict/lexicon.txt || exit 1;\n  \n  # Create lexicon_silprobs.txt\n  silprob=false\n  [ -f $srcdict/lexiconp_silprob.txt ] && silprob=true\n  if \"$silprob\"; then\n    echo \"$(basename $0): Creating $extdict/lexiconp_silprob.txt from $srcdict/lexiconp_silprob.txt\"\n    # Here we assume no acoustic evidence for the extra word-pron pairs.\n    # So we assign silprob1 = overall_silprob, silprob2 = silprob3 = 1.00\n    overall_silprob=`awk '{if ($1==\"overall\") print $2}' $srcdict/silprob.txt`\n    awk -v overall=$overall_silprob '{\n      printf(\"%s %d %.1f %.2f %.2f\",$1, 1, overall, 1.00, 1.00); \n      for(n=2;n<=NF;n++) printf \" \"$n; printf(\"\\n\");\n      }' $extdict/lexicon_extra.txt | cat $srcdict/lexiconp_silprob.txt - | \\\n      sort -k1,1 -k2g,2 -k6 \\\n      > $extdict/lexiconp_silprob.txt || exit 1;\n  fi\n  \n  if ! 
utils/validate_dict_dir.pl $extdict >&/dev/null; then\n    utils/validate_dict_dir.pl $extdict  # show the output.\n    echo \"$(basename $0): Validation failed on the extended dict\"\n    exit 1;\n  fi\nfi\n\nif [ $stage -le 1 ]; then\n  echo \"$(basename $0): Preparing the extended lang dir.\"\n  [ -d $extlang ] && rm -r $extlang 2>/dev/null\n  utils/prepare_lang.sh $prep_lang_opts $extdict \\\n    $oov_word $tmpdir $extlang || exit 1;\n  \n  # If a word list is provided, make sure the word-ids of these words are kept unchanged\n  # in the extended word list.\n  # Note: $word_list must be quoted; an unquoted empty value turns the test\n  # into '[ -f ]', which is always true.\n  if [ -f \"$word_list\" ]; then\n    # First, make sure there's no OOV in the provided word-list.\n    if [ `awk -v s=$extlang/words.txt 'BEGIN{ while((getline < s) > 0) { vocab[$1] = 1;}} \\\n        {if (!($1 in vocab)) print $0}' $word_list | wc -l` -gt 0 ]; then\n      echo \"$(basename $0): The provided word list contains words out of the extended vocab.\"\n      exit 1;\n    fi\n    awk -v s=$word_list \\\n      'BEGIN{ while((getline < s) > 0) { vocab[$1] = 1; n+=1; print $0}} \\\n       { if (!($1 in vocab)) {print $1\" \"n; n+=1;}}' $extlang/words.txt > $extlang/words.txt.$$\n    mv $extlang/words.txt.$$ $extlang/words.txt\n  fi\nfi\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/prepare_lang.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey);\n#                      Arnab Ghoshal\n#                2014  Guoguo Chen\n#                2015  Hainan Xu\n#                2016  FAU Erlangen (Author: Axel Horndasch)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script prepares a directory such as data/lang/, in the standard format,\n# given a source directory containing a dictionary lexicon.txt in a form like:\n# word phone1 phone2 ... phoneN\n# per line (alternate prons would be separate lines), or a dictionary with probabilities\n# called lexiconp.txt in a form:\n# word pron-prob phone1 phone2 ... phoneN\n# (with 0.0 < pron-prob <= 1.0); note: if lexiconp.txt exists, we use it even if\n# lexicon.txt exists.\n# and also files silence_phones.txt, nonsilence_phones.txt, optional_silence.txt\n# and extra_questions.txt\n# Here, silence_phones.txt and nonsilence_phones.txt are lists of silence and\n# non-silence phones respectively (where silence includes various kinds of\n# noise, laugh, cough, filled pauses etc., and nonsilence phones includes the\n# \"real\" phones.)\n# In each line of those files is a list of phones, and the phones on each line\n# are assumed to correspond to the same \"base phone\", i.e. 
they will be\n# different stress or tone variations of the same basic phone.\n# The file \"optional_silence.txt\" contains just a single phone (typically SIL)\n# which is used for optional silence in the lexicon.\n# extra_questions.txt might be empty; typically will consist of lists of phones,\n# all members of each list with the same stress or tone; and also possibly a\n# list for the silence phones.  This will augment the automatically generated\n# questions (note: the automatically generated ones will treat all the\n# stress/tone versions of a phone the same, so will not \"get to ask\" about\n# stress or tone).\n#\n\n# This script adds word-position-dependent phones and constructs a host of other\n# derived files, that go in data/lang/.\n\n# Begin configuration section.\nnum_sil_states=5\nnum_nonsil_states=3\nposition_dependent_phones=true\n# position_dependent_phones is false also when position dependent phones and word_boundary.txt\n# have been generated by another source\nshare_silence_phones=false  # if true, then share pdfs of different silence\n                            # phones together.\nsil_prob=0.5\nunk_fst=        # if you want to model the unknown-word (<oov-dict-entry>)\n                # with a phone-level LM as created by make_unk_lm.sh,\n                # provide the text-form FST via this flag, e.g. 
<work-dir>/unk_fst.txt\n                # where <work-dir> was the 2nd argument of make_unk_lm.sh.\nphone_symbol_table=              # if set, use a specified phones.txt file.\nextra_word_disambig_syms=        # if set, add disambiguation symbols from this file (one per line)\n                                 # to phones/disambig.txt, phones/wdisambig.txt and words.txt\nnum_extra_phone_disambig_syms=1 # Standard one phone disambiguation symbol is used for optional silence.\n                                # Increasing this number does not harm, but is only useful if you later\n                                # want to introduce this labels to L_disambig.fst\n\n\n# end configuration sections\n\necho \"$0 $@\"  # Print the command line for logging\n\n. utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>\"\n  echo \"e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang\"\n  echo \"<dict-src-dir> should contain the following files:\"\n  echo \" extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt\"\n  echo \"See http://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating for more info.\"\n  echo \"options: \"\n  echo \"<dict-src-dir> may also, for the grammar-decoding case (see http://kaldi-asr.org/doc/grammar.html)\"\n  echo \"contain a file nonterminals.txt containing symbols like #nonterm:contact_list, one per line.\"\n  echo \"     --num-sil-states <number of states>             # default: 5, #states in silence models.\"\n  echo \"     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.\"\n  echo \"     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I\"\n  echo \"                                                     # markers on phones to indicate word-internal positions. 
\"\n  echo \"     --share-silence-phones (true|false)             # default: false; if true, share pdfs of \"\n  echo \"                                                     # all silence phones. \"\n  echo \"     --sil-prob <probability of silence>             # default: 0.5 [must have 0 <= silprob < 1]\"\n  echo \"     --phone-symbol-table <filename>                 # default: \\\"\\\"; if not empty, use the provided \"\n  echo \"                                                     # phones.txt as phone symbol table. This is useful \"\n  echo \"                                                     # if you use a new dictionary for the existing setup.\"\n  echo \"     --unk-fst <text-fst>                            # default: none.  e.g. exp/make_unk_lm/unk_fst.txt.\"\n  echo \"                                                     # This is for if you want to model the unknown word\"\n  echo \"                                                     # via a phone-level LM rather than a special phone\"\n  echo \"                                                     # (this should be more useful for test-time than train-time).\"\n  echo \"     --extra-word-disambig-syms <filename>           # default: \\\"\\\"; if not empty, add disambiguation symbols\"\n  echo \"                                                     # from this file (one per line) to phones/disambig.txt,\"\n  echo \"                                                     # phones/wdisambig.txt and words.txt\"\n  exit 1;\nfi\n\nsrcdir=$1\noov_word=$2\ntmpdir=$3\ndir=$4\n\n\nif [ -d $dir/phones ]; then\n  rm -r $dir/phones\nfi\nmkdir -p $dir $tmpdir $dir/phones\n\nsilprob=false\n[ -f $srcdir/lexiconp_silprob.txt ] && silprob=true\n\n[ -f path.sh ] && . ./path.sh\n\n! utils/validate_dict_dir.pl $srcdir && \\\n  echo \"*Error validating directory $srcdir*\" && exit 1;\n\nif [[ ! 
-f $srcdir/lexicon.txt ]]; then\n  echo \"**Creating $srcdir/lexicon.txt from $srcdir/lexiconp.txt\"\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' < $srcdir/lexiconp.txt > $srcdir/lexicon.txt || exit 1;\nfi\nif [[ ! -f $srcdir/lexiconp.txt ]]; then\n  echo \"**Creating $srcdir/lexiconp.txt from $srcdir/lexicon.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1.0\\t$2/;' < $srcdir/lexicon.txt > $srcdir/lexiconp.txt || exit 1;\nfi\n\nif [ ! -z \"$unk_fst\" ] && [ ! -f \"$unk_fst\" ]; then\n  echo \"$0: expected --unk-fst $unk_fst to exist as a file\"\n  exit 1\nfi\n\nif ! utils/validate_dict_dir.pl $srcdir >&/dev/null; then\n  utils/validate_dict_dir.pl $srcdir  # show the output.\n  echo \"Validation failed (second time)\"\n  exit 1;\nfi\n\n# phones.txt file provided, we will do some sanity check here.\nif [[ ! -z $phone_symbol_table ]]; then\n  # Checks if we have position dependent phones\n  n1=`cat $phone_symbol_table | grep -v -E \"^#[0-9]+$\" | cut -d' ' -f1 | sort -u | wc -l`\n  n2=`cat $phone_symbol_table | grep -v -E \"^#[0-9]+$\" | cut -d' ' -f1 | sed 's/_[BIES]$//g' | sort -u | wc -l`\n  $position_dependent_phones && [ $n1 -eq $n2 ] &&\\\n    echo \"$0: Position dependent phones requested, but not in provided phone symbols\" && exit 1;\n  ! $position_dependent_phones && [ $n1 -ne $n2 ] &&\\\n      echo \"$0: Position dependent phones not requested, but appear in the provided phones.txt\" && exit 1;\n\n  # Checks if the phone sets match.\n  cat $srcdir/{,non}silence_phones.txt | awk -v f=$phone_symbol_table '\n  BEGIN { while ((getline < f) > 0) { sub(/_[BEIS]$/, \"\", $1); phones[$1] = 1; }}\n  { for (x = 1; x <= NF; ++x) { if (!($x in phones)) {\n      print \"Phone appears in the lexicon but not in the provided phones.txt: \"$x; exit 1; }}}' || exit 1;\nfi\n\n# In case there are extra word-level disambiguation symbols we need\n# to make sure that all symbols in the provided file are valid.\nif [ ! -z \"$extra_word_disambig_syms\" ]; then\n  if ! 
utils/lang/validate_disambig_sym_file.pl --allow-numeric \"false\" $extra_word_disambig_syms; then\n    echo \"$0: Validation of disambiguation file \\\"$extra_word_disambig_syms\\\" failed.\"\n    exit 1;\n  fi\nfi\n\nif $position_dependent_phones; then\n  # Create $tmpdir/lexiconp.txt from $srcdir/lexiconp.txt (or\n  # $tmpdir/lexiconp_silprob.txt from $srcdir/lexiconp_silprob.txt) by\n  # adding the markers _B, _E, _S, _I depending on word position.\n  # In this recipe, these markers apply to silence also.\n  # Do this starting from lexiconp.txt only.\n  if \"$silprob\"; then\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; $silword_p = shift @A;\n              $wordsil_f = shift @A; $wordnonsil_f = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_S\\n\"; }\n         else { print \"$w $p $silword_p $wordsil_f $wordnonsil_f $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n                < $srcdir/lexiconp_silprob.txt > $tmpdir/lexiconp_silprob.txt\n  else\n    perl -ane '@A=split(\" \",$_); $w = shift @A; $p = shift @A; @A>0||die;\n         if(@A==1) { print \"$w $p $A[0]_S\\n\"; } else { print \"$w $p $A[0]_B \";\n         for($n=1;$n<@A-1;$n++) { print \"$A[$n]_I \"; } print \"$A[$n]_E\\n\"; } ' \\\n         < $srcdir/lexiconp.txt > $tmpdir/lexiconp.txt || exit 1;\n  fi\n\n  # create $tmpdir/phone_map.txt\n  # this has the format (on each line)\n  # <original phone> <version 1 of original phone> <version 2> ...\n  # where the versions depend on the position of the phone within a word.\n  # For instance, we'd have:\n  # AA AA_B AA_E AA_I AA_S\n  # for (B)egin, (E)nd, (I)nternal and (S)ingleton\n  # and in the case of silence\n  # SIL SIL SIL_B SIL_E SIL_I SIL_S\n  # [because SIL on its own is one of the variants; this is for when it doesn't\n  #  occur inside a word but as an option in the lexicon.]\n\n  # This phone map expands the phone lists 
into all the word-position-dependent\n  # versions of the phone lists.\n  cat <(set -f; for x in `cat $srcdir/silence_phones.txt`; do for y in \"\" \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    <(set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do for y in \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    > $tmpdir/phone_map.txt\nelse\n  if \"$silprob\"; then\n    cp $srcdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob.txt\n  else\n    cp $srcdir/lexiconp.txt $tmpdir/lexiconp.txt\n  fi\n\n  cat $srcdir/silence_phones.txt $srcdir/nonsilence_phones.txt | \\\n    awk '{for(n=1;n<=NF;n++) print $n; }' > $tmpdir/phones\n  paste -d' ' $tmpdir/phones $tmpdir/phones > $tmpdir/phone_map.txt\nfi\n\n\n# Sets of phones for use in clustering, and making monophone systems.\n\nif $share_silence_phones; then\n  # build a roots file that will force all the silence phones to share the\n  # same pdf's. [three distinct states, only the transitions will differ.]\n  # 'shared'/'not-shared' means, do we share the 3 states of the HMM\n  # in the same tree-root?\n  # Sharing across models(phones) is achieved by writing several phones\n  # into one line of roots.txt (shared/not-shared doesn't affect this).\n  # 'not-shared not-split' means we have separate tree roots for the 3 states,\n  # but we never split the tree so they remain stumps,\n  # so all phones in the line correspond to the same model.\n\n  cat $srcdir/silence_phones.txt | awk '{printf(\"%s \", $0); } END{printf(\"\\n\");}' | cat - $srcdir/nonsilence_phones.txt | \\\n    utils/apply_map.pl $tmpdir/phone_map.txt > $dir/phones/sets.txt\n  cat $dir/phones/sets.txt | \\\n    awk '{if(NR==1) print \"not-shared\", \"not-split\", $0; else print \"shared\", \"split\", $0;}' > $dir/phones/roots.txt\nelse\n  # different silence phones will have different GMMs.  
[note: here, all \"shared split\" means\n  # is that we may have one GMM for all the states, or we can split on states.  because they're\n  # context-independent phones, they don't see the context.]\n  cat $srcdir/{,non}silence_phones.txt | utils/apply_map.pl $tmpdir/phone_map.txt > $dir/phones/sets.txt\n  cat $dir/phones/sets.txt | awk '{print \"shared\", \"split\", $0;}' > $dir/phones/roots.txt\nfi\n\ncat $srcdir/silence_phones.txt | utils/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/silence.txt\ncat $srcdir/nonsilence_phones.txt | utils/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/nonsilence.txt\ncp $srcdir/optional_silence.txt $dir/phones/optional_silence.txt\ncp $dir/phones/silence.txt $dir/phones/context_indep.txt\n\n# if extra_questions.txt is empty, it's OK.\ncat $srcdir/extra_questions.txt 2>/dev/null | utils/apply_map.pl $tmpdir/phone_map.txt \\\n  >$dir/phones/extra_questions.txt\n\n# Want extra questions about the word-start/word-end stuff. Make it separate for\n# silence and non-silence. Probably doesn't matter, as silence will rarely\n# be inside a word.\nif $position_dependent_phones; then\n  for suffix in _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\n  for suffix in \"\" _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/silence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\nfi\n\n# add_lex_disambig.pl is responsible for adding disambiguation symbols to\n# the lexicon, for telling us how many disambiguation symbols it used,\n# and also for modifying the unknown-word's pronunciation (if the\n# --unk-fst was provided) to the sequence \"#1 #2 #3\", and reserving those\n# disambig symbols for that purpose.\n# The #2 will later be replaced with the actual unk model.  
The reason\n# for the #1 and the #3 is for disambiguation and also to keep the\n# FST compact.  If we didn't have the #1, we might have a different copy of\n# the unk-model FST, or at least some of its arcs, for each start-state from\n# which an <unk> transition comes (instead of per end-state, which is more compact);\n# and adding the #3 prevents us from potentially having 2 copies of the unk-model\n# FST due to the optional-silence [the last phone of any word gets 2 arcs].\nif [ ! -z \"$unk_fst\" ]; then  # if the --unk-fst option was provided...\n  if \"$silprob\"; then\n    utils/lang/internal/modify_unk_pron.py $tmpdir/lexiconp_silprob.txt \"$oov_word\" || exit 1\n  else\n    utils/lang/internal/modify_unk_pron.py $tmpdir/lexiconp.txt \"$oov_word\" || exit 1\n  fi\n  unk_opt=\"--first-allowed-disambig 4\"\nelse\n  unk_opt=\nfi\n\nif \"$silprob\"; then\n  ndisambig=$(utils/add_lex_disambig.pl $unk_opt --pron-probs --sil-probs $tmpdir/lexiconp_silprob.txt $tmpdir/lexiconp_silprob_disambig.txt)\nelse\n  ndisambig=$(utils/add_lex_disambig.pl $unk_opt --pron-probs $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt)\nfi\nndisambig=$[$ndisambig+$num_extra_phone_disambig_syms]; # add (at least) one disambig symbol for silence in lexicon FST.\necho $ndisambig > $tmpdir/lex_ndisambig\n\n# Format of lexiconp_disambig.txt:\n# !SIL\t1.0   SIL_S\n# <SPOKEN_NOISE>\t1.0   SPN_S #1\n# <UNK>\t1.0  SPN_S #2\n# <NOISE>\t1.0  NSN_S\n# !EXCLAMATION-POINT\t1.0  EH2_B K_I S_I K_I L_I AH0_I M_I EY1_I SH_I AH0_I N_I P_I OY2_I N_I T_E\n\n( for n in `seq 0 $ndisambig`; do echo '#'$n; done ) >$dir/phones/disambig.txt\n\n# In case there are extra word-level disambiguation symbols they also\n# need to be added to the list of phone-level disambiguation symbols.\nif [ ! 
-z \"$extra_word_disambig_syms\" ]; then\n  # We expect a file containing valid word-level disambiguation symbols.\n  cat $extra_word_disambig_syms | awk '{ print $1 }' >> $dir/phones/disambig.txt\nfi\n\n# Create phone symbol table.\nif [[ ! -z $phone_symbol_table ]]; then\n  start_symbol=`grep \\#0 $phone_symbol_table | awk '{print $2}'`\n  echo \"<eps>\" | cat - $dir/phones/{silence,nonsilence}.txt | awk -v f=$phone_symbol_table '\n  BEGIN { while ((getline < f) > 0) { phones[$1] = $2; }} { print $1\" \"phones[$1]; }' | sort -k2 -g |\\\n    cat - <(cat $dir/phones/disambig.txt | awk -v x=$start_symbol '{n=x+NR-1; print $1, n;}') > $dir/phones.txt\nelse\n  echo \"<eps>\" | cat - $dir/phones/{silence,nonsilence,disambig}.txt | \\\n    awk '{n=NR-1; print $1, n;}' > $dir/phones.txt\nfi\n\n# Create a file that describes the word-boundary information for\n# each phone.  5 categories.\nif $position_dependent_phones; then\n  cat $dir/phones/{silence,nonsilence}.txt | \\\n    awk '/_I$/{print $1, \"internal\"; next;} /_B$/{print $1, \"begin\"; next; }\n         /_S$/{print $1, \"singleton\"; next;} /_E$/{print $1, \"end\"; next; }\n         {print $1, \"nonword\";} ' > $dir/phones/word_boundary.txt\nelse\n  # word_boundary.txt might have been generated by another source\n  [ -f $srcdir/word_boundary.txt ] && cp $srcdir/word_boundary.txt $dir/phones/word_boundary.txt\nfi\n\n# Create word symbol table.\n# <s> and </s> are only needed due to the need to rescore lattices with\n# ConstArpaLm format language model. 
They do not normally appear in G.fst or\n# L.fst.\n\nif \"$silprob\"; then\n  # remove the silprob\n  cat $tmpdir/lexiconp_silprob.txt |\\\n    awk '{\n      for(i=1; i<=NF; i++) {\n        if(i!=3 && i!=4 && i!=5) printf(\"%s\\t\", $i); if(i==NF) print \"\";\n      }\n    }' > $tmpdir/lexiconp.txt\nfi\n\ncat $tmpdir/lexiconp.txt | awk '{print $1}' | sort | uniq  | awk '\n  BEGIN {\n    print \"<eps> 0\";\n  }\n  {\n    if ($1 == \"<s>\") {\n      print \"<s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    if ($1 == \"</s>\") {\n      print \"</s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    printf(\"%s %d\\n\", $1, NR);\n  }\n  END {\n    printf(\"#0 %d\\n\", NR+1);\n    printf(\"<s> %d\\n\", NR+2);\n    printf(\"</s> %d\\n\", NR+3);\n  }' > $dir/words.txt || exit 1;\n\n# In case there are extra word-level disambiguation symbols they also\n# need to be added to words.txt\nif [ ! -z \"$extra_word_disambig_syms\" ]; then\n  # Since words.txt already exists, we need to extract the current word count.\n  word_count=`tail -n 1 $dir/words.txt | awk '{ print $2 }'`\n\n  # We expect a file containing valid word-level disambiguation symbols.\n  # The list of symbols is attached to the current words.txt (including\n  # a numeric identifier for each symbol).\n  cat $extra_word_disambig_syms | \\\n    awk -v WC=$word_count '{ printf(\"%s %d\\n\", $1, ++WC); }' >> $dir/words.txt || exit 1;\nfi\n\n# format of $dir/words.txt:\n#<eps> 0\n#a 1\n#aa 2\n#aarvark 3\n#...\n\nsilphone=`cat $srcdir/optional_silence.txt` || exit 1;\n[ -z \"$silphone\" ] && \\\n  ( echo \"You have no optional-silence phone; it is required in the current scripts\"\n    echo \"but you may use the option --sil-prob 0.0 to stop it being used.\" ) && \\\n   exit 1;\n\n# create $dir/phones/align_lexicon.{txt,int}.\n# This is the method we use for lattice word alignment if we are not\n# using word-position-dependent phones.\n\n# First remove pron-probs from the 
lexicon.\nperl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' <$tmpdir/lexiconp.txt >$tmpdir/align_lexicon.txt\n\n# Note: here, $silphone will have no suffix e.g. _S because it occurs as optional-silence,\n# and is not part of a word.\n[ ! -z \"$silphone\" ] && echo \"<eps> $silphone\" >> $tmpdir/align_lexicon.txt\n\ncat $tmpdir/align_lexicon.txt | \\\n  perl -ane '@A = split; print $A[0], \" \", join(\" \", @A), \"\\n\";' | sort | uniq > $dir/phones/align_lexicon.txt\n\nif [ -f $srcdir/nonterminals.txt ]; then\n  utils/lang/grammar/augment_phones_txt.py $dir/phones.txt $srcdir/nonterminals.txt $dir/phones.txt\n  utils/lang/grammar/augment_words_txt.py $dir/words.txt $srcdir/nonterminals.txt $dir/words.txt\n  cp $srcdir/nonterminals.txt $dir/phones/nonterminals.txt\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/nonterminals.txt >$dir/phones/nonterminals.int\n\n  for w in \"#nonterm_begin\" \"#nonterm_end\" $(cat $srcdir/nonterminals.txt); do\n    echo $w $w  # These are words without pronunciations, so leave those prons\n                # empty.\n  done >> $dir/phones/align_lexicon.txt\n  nonterm_phones_offset=$(grep '#nonterm_bos' <$dir/phones.txt | awk '{print $2}')\n  echo $nonterm_phones_offset > $dir/phones/nonterm_phones_offset.int\n  echo '#nonterm_bos' > $dir/phones/nonterm_phones_offset.txt  # temporary.\n\n  if [ -f $dir/phones/word_boundary.txt ]; then\n    # word-position-dependent system.  
Only include the optional-silence phone,\n    # and phones that can end a word, plus the special symbol #nonterm_bos, in the\n    # left-context phones.\n    awk '{if ($2 == \"end\" || $2 == \"singleton\") print $1; }' <$dir/phones/word_boundary.txt | \\\n        cat - $dir/phones/optional_silence.txt $dir/phones/nonterm_phones_offset.txt > $dir/phones/left_context_phones.txt\n  else\n    cat $dir/phones/{silence,nonsilence}.txt $dir/phones/nonterm_phones_offset.txt > $dir/phones/left_context_phones.txt\n  fi\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/left_context_phones.txt >$dir/phones/left_context_phones.int\n\n  # we need to write utils/lang/make_lexicon_fst_silprob.py before this can work.\n  grammar_opts=\"--left-context-phones=$dir/phones/left_context_phones.txt --nonterminals=$srcdir/nonterminals.txt\"\nelse\n  grammar_opts=\nfi\n\n# create phones/align_lexicon.int from phones/align_lexicon.txt\ncat $dir/phones/align_lexicon.txt | utils/sym2int.pl -f 3- $dir/phones.txt | \\\n  utils/sym2int.pl -f 1-2 $dir/words.txt > $dir/phones/align_lexicon.int\n\n# Create the basic L.fst without disambiguation symbols, for use\n# in training.\n\nif $silprob; then\n  # Add silence probabilities (models the prob. of silence before and after each\n  # word).  On some setups this helps a bit.  
See utils/dict_dir_add_pronprobs.sh\n  # and where it's called in the example scripts (run.sh).\n  utils/lang/make_lexicon_fst_silprob.py $grammar_opts --sil-phone=$silphone \\\n         $tmpdir/lexiconp_silprob.txt $srcdir/silprob.txt | \\\n     fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n       --keep_isymbols=false --keep_osymbols=false |   \\\n     fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nelse\n  utils/lang/make_lexicon_fst.py $grammar_opts --sil-prob=$sil_prob --sil-phone=$silphone \\\n            $tmpdir/lexiconp.txt | \\\n    fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n      --keep_isymbols=false --keep_osymbols=false | \\\n    fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nfi\n\n# The file oov.txt contains a word that we will map any OOVs to during\n# training.\necho \"$oov_word\" > $dir/oov.txt || exit 1;\ncat $dir/oov.txt | utils/sym2int.pl $dir/words.txt >$dir/oov.int || exit 1;\n# integer version of oov symbol, used in some scripts.\n\n\n# the file wdisambig.txt contains a (line-by-line) list of the text-form of the\n# disambiguation symbols that are used in the grammar and passed through by the\n# lexicon.  At this stage it's hardcoded as '#0', but we're laying the groundwork\n# for more generality (which probably would be added by another script).\n# wdisambig_words.int contains the corresponding list interpreted by the\n# symbol table words.txt, and wdisambig_phones.int contains the corresponding\n# list interpreted by the symbol table phones.txt.\necho '#0' >$dir/phones/wdisambig.txt\n\n# In case there are extra word-level disambiguation symbols they need\n# to be added to the existing word-level disambiguation symbols file.\nif [ ! -z \"$extra_word_disambig_syms\" ]; then\n  # We expect a file containing valid word-level disambiguation symbols.\n  # The regular expression for awk is just a paranoia filter (e.g. 
for empty lines).\n  cat $extra_word_disambig_syms | awk '{ print $1 }' >> $dir/phones/wdisambig.txt\nfi\n\nutils/sym2int.pl $dir/phones.txt <$dir/phones/wdisambig.txt >$dir/phones/wdisambig_phones.int\nutils/sym2int.pl $dir/words.txt <$dir/phones/wdisambig.txt >$dir/phones/wdisambig_words.int\n\n# Create these lists of phones in colon-separated integer list form too,\n# for purposes of being given to programs as command-line options.\nfor f in silence nonsilence optional_silence disambig context_indep; do\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt >$dir/phones/$f.int\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt | \\\n   awk '{printf(\":%d\", $1);} END{printf \"\\n\"}' | sed s/:// > $dir/phones/$f.csl || exit 1;\ndone\n\nfor x in sets extra_questions; do\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/$x.txt > $dir/phones/$x.int || exit 1;\ndone\n\nutils/sym2int.pl -f 3- $dir/phones.txt <$dir/phones/roots.txt \\\n   > $dir/phones/roots.int || exit 1;\n\nif [ -f $dir/phones/word_boundary.txt ]; then\n  utils/sym2int.pl -f 1 $dir/phones.txt <$dir/phones/word_boundary.txt \\\n    > $dir/phones/word_boundary.int || exit 1;\nfi\n\nsilphonelist=`cat $dir/phones/silence.csl`\nnonsilphonelist=`cat $dir/phones/nonsilence.csl`\n\n# Note: it's OK, after generating the 'lang' directory, to overwrite the topo file\n# with another one of your choice if the 'topo' file you want can't be generated by\n# utils/gen_topo.pl.  We do this in the 'chain' recipes.  Of course, the 'topo' file\n# should cover all the phones.  
Try running utils/validate_lang.pl to check that\n# everything is OK after modifying the topo file.\nutils/gen_topo.pl $num_nonsil_states $num_sil_states $nonsilphonelist $silphonelist >$dir/topo\n\n\n# Create the lexicon FST with disambiguation symbols, and put it in lang_test.\n# There is an extra step where we create a loop to \"pass through\" the\n# disambiguation symbols from G.fst.\n\nif $silprob; then\n  utils/lang/make_lexicon_fst_silprob.py $grammar_opts \\\n     --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n     $tmpdir/lexiconp_silprob_disambig.txt $srcdir/silprob.txt | \\\n     fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n       --keep_isymbols=false --keep_osymbols=false |   \\\n     fstaddselfloops  $dir/phones/wdisambig_phones.int $dir/phones/wdisambig_words.int | \\\n     fstarcsort --sort_type=olabel > $dir/L_disambig.fst || exit 1;\nelse\n  utils/lang/make_lexicon_fst.py $grammar_opts \\\n       --sil-prob=$sil_prob --sil-phone=$silphone --sil-disambig='#'$ndisambig \\\n         $tmpdir/lexiconp_disambig.txt | \\\n     fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n       --keep_isymbols=false --keep_osymbols=false |   \\\n     fstaddselfloops  $dir/phones/wdisambig_phones.int $dir/phones/wdisambig_words.int | \\\n     fstarcsort --sort_type=olabel > $dir/L_disambig.fst || exit 1;\nfi\n\n\nif [ ! -z \"$unk_fst\" ]; then\n  utils/lang/internal/apply_unk_lm.sh $unk_fst $dir || exit 1\n\n  if ! $position_dependent_phones; then\n    echo \"$0: warning: you are using the --unk-fst option and setting --position-dependent-phones false.\"\n    echo \" ... this will make it impossible to properly work out the word boundaries after\"\n    echo \" ... decoding; quite a few scripts will not work as a result, and many scoring scripts\"\n    echo \" ... will die.\"\n    sleep 4\n  fi\nfi\n\necho \"$(basename $0): validating output directory\"\n! 
utils/validate_lang.pl $dir && echo \"$(basename $0): error validating output\" &&  exit 1;\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/prepare_online_nnet_dist_build.sh",
    "content": "#!/usr/bin/env bash\n\n# Copyright 2015  Johns Hopkins University (Author: Vijayaditya Peddinti)\n#                 Guoguo Chen\n# Apache 2.0\n# Script to prepare the distribution from the online-nnet build\n\nother_files= #other files to be included in the build\nother_dirs=\nconf_files=\"ivector_extractor.conf mfcc.conf online_cmvn.conf online_nnet2_decoding.conf splice.conf\"\nivec_extractor_files=\"final.dubm final.ie final.mat global_cmvn.stats online_cmvn.conf splice_opts\"\n\necho \"$0 $@\"  # Print the command line for logging\n[ -f path.sh ] && . ./path.sh;\n. parse_options.sh || exit 1;\n\nif [ $# -ne 3 ]; then\n   echo \"Usage: $0 <lang-dir> <model-dir> <output-tgz>\"\n   echo \"e.g.: $0 data/lang exp/nnet2_online/nnet_ms_a_online tedlium.tgz\"\n   exit 1;\nfi\n\nlang=$1\nmodeldir=$2\ntgzfile=$3\n\nfor f in $lang/phones.txt $other_files; do\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\ndone\n\nbuild_files=\nfor d in $modeldir/conf $modeldir/ivector_extractor; do\n  [ ! -d $d ] && echo \"$0: no such directory $d\" && exit 1;\ndone\n\nfor f in $ivec_extractor_files; do\n  f=$modeldir/ivector_extractor/$f\n  [ ! -f $f ] && echo \"$0: no such file $f\" && exit 1;\n  build_files=\"$build_files $f\"\ndone\n\n# Makes a copy of the original config files, as we will change the absolute path\n# to relative.\nrm -rf $modeldir/conf_abs_path\nmkdir -p $modeldir/conf_abs_path\ncp -r $modeldir/conf/* $modeldir/conf_abs_path\n\nfor f in $conf_files; do \n  [ ! -f $modeldir/conf/$f ] && \\\n    echo \"$0: no such file $modeldir/conf/$f\" && exit 1;\n  # Changes absolute path to relative path. 
The path entries in the config file\n  # are generated by scripts and it is safe to assume that they have structure:\n  # variable=path\n  cat $modeldir/conf_abs_path/$f | perl -e '\n    use File::Spec;\n    while(<STDIN>) {\n      chomp;\n      @col = split(\"=\", $_);\n      if (@col == 2 && (-f $col[1])) {\n        $col[1] = File::Spec->abs2rel($col[1]);\n        print \"$col[0]=$col[1]\\n\";\n      } else {\n        print \"$_\\n\";\n      }\n    }\n  ' > $modeldir/conf/$f\n  build_files=\"$build_files $modeldir/conf/$f\"\ndone\n\ntar -hczvf $tgzfile $lang $build_files $other_files $other_dirs \\\n  $modeldir/final.mdl $modeldir/tree >/dev/null\n\n# Changes back to absolute path.\nrm -rf $modeldir/conf\nmv $modeldir/conf_abs_path $modeldir/conf\n"
  },
  {
    "path": "egs/utils/remove_data_links.sh",
    "content": "#!/usr/bin/env bash\n\n# This program searches within a directory for soft links that\n# appear to be created by 'create_data_link.pl' to a 'storage/' subdirectory,\n# and it removes both the soft links and the things they point to.\n# for instance, if you have a soft link \n#   foo/egs/1.1.egs -> storage/2/1.1.egs\n# it will remove both foo/egs/storage/2/1.1.egs, and foo/egs/1.1.egs.\n\nret=0\n\ndry_run=false\n\nif [ \"$1\" == \"--dry-run\" ]; then\n  dry_run=true\n  shift\nfi\n\nif [ $# == 0 ]; then\n  echo \"Usage:  $0 [--dry-run] <list-of-directories>\"\n  echo \"e.g.: $0 exp/nnet4a/egs/\"\n  echo \" Removes from any subdirectories of the command-line arguments, soft links that \"\n  echo \" appear to have been created by utils/create_data_link.pl, as well as the things\"\n  echo \" that those soft links point to.  Will typically be called on a directory prior\"\n  echo \" to 'rm -r' on that directory, to ensure that data that was distributed on other\"\n  echo \" volumes also gets deleted.\"\n  echo \" With --dry-run, just prints what it would do.\"\nfi\n\nfor dir in $*; do\n  if [ ! -d $dir ]; then\n    echo \"$0: not a directory: $dir\"\n    ret=1\n  else\n    for subdir in $(find $dir -type d); do\n      if [ -d $subdir/storage ]; then\n        for x in $(ls $subdir); do\n          f=$subdir/$x\n          if [ -L $f ] && [[ $(readlink $f) == storage/* ]]; then\n            target=$subdir/$(readlink $f)\n            if $dry_run; then\n              echo rm $f $target\n            else\n              rm $f $target\n            fi\n          fi\n        done\n      fi\n    done\n  fi\ndone\n\nexit $ret\n"
  },
  {
    "path": "egs/utils/remove_oovs.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script removes lines that contain these OOVs on either the\n# third or fourth fields  of the line.  It is intended to remove arcs\n# with OOVs on, from FSTs (probably compiled from ARPAs with OOVs in).\n\nif (  @ARGV < 1 && @ARGV > 2) {\n    die \"Usage: remove_oovs.pl unk_list.txt [ printed-fst ]\\n\";\n}\n\n$unklist = shift @ARGV;\nopen(S, \"<$unklist\") || die \"Failed opening unknown-symbol list $unklist\\n\";\nwhile(<S>){ \n    @A = split(\" \", $_);\n    @A == 1 || die \"Bad line in unknown-symbol list: $_\";\n    $unk{$A[0]} = 1;\n}\n\n$num_removed = 0;\nwhile(<>){ \n    @A = split(\" \", $_);\n    if(defined $unk{$A[2]} || defined $unk{$A[3]}) {\n        $num_removed++;\n    } else {\n        print;\n    }\n}\nprint STDERR \"remove_oovs.pl: removed $num_removed lines.\\n\";\n\n"
  },
  {
    "path": "egs/utils/reverse_arpa.py",
    "content": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n# Copyright 2012 Mirko Hannemann BUT, mirko.hannemann@gmail.com\n\nfrom __future__ import print_function\nimport sys\nimport codecs # for UTF-8/unicode\n\nif len(sys.argv) != 2:\n    print('usage: reverse_arpa arpa.in')\n    sys.exit()\narpaname = sys.argv[1]\n\n#\\data\\\n#ngram 1=4\n#ngram 2=2\n#ngram 3=2\n#\n#\\1-grams:\n#-5.234679\ta -3.3\n#-3.456783\tb\n#0.0000000\t<s> -2.5\n#-4.333333\t</s>\n#\n#\\2-grams:\n#-1.45678\ta b -3.23\n#-1.30490\t<s> a -4.2\n#\n#\\3-grams:\n#-0.34958\t<s> a b\n#-0.23940\ta b </s>\n#\\end\\\n\n# read language model in ARPA format\ntry:\n  file = codecs.open(arpaname, \"r\", \"utf-8\")\nexcept IOError:\n  print('file not found: ' + arpaname)\n  sys.exit()\n\ntext=file.readline()\nwhile (text and text[:6] != \"\\\\data\\\\\"): text=file.readline()\nif not text:\n  print(\"invalid ARPA file\")\n  sys.exit()\n#print text,\nwhile (text and text[:5] != \"ngram\"): text=file.readline()\n\n# get ngram counts\ncngrams=[]\nn=0\nwhile (text and text[:5] == \"ngram\"):\n  ind = text.split(\"=\")\n  counts = int(ind[1].strip())\n  r = ind[0].split()\n  read_n = int(r[1].strip())\n  if read_n != n+1:\n    print(\"invalid ARPA file: {}\".format(text))\n    sys.exit()\n  n = read_n\n  cngrams.append(counts)\n  #print text,\n  text=file.readline()\n\n# read all n-grams order by order\nsentprob = 0.0 # sentence begin unigram\nngrams=[]\ninf=float(\"inf\")\nfor n in range(1,len(cngrams)+1): # unigrams, bigrams, trigrams\n  while (text and \"-grams:\" not in text): text=file.readline()\n  if n != int(text[1]):\n    print(\"invalid ARPA file:{}\".format(text))\n    sys.exit()\n  #print text,cngrams[n-1]\n  this_ngrams={} # stores all read ngrams\n  for ng in range(cngrams[n-1]):\n    while (text and len(text.split())<2):\n      text=file.readline()\n      if (not text) or ((len(text.split())==1) and ((\"-grams:\" in text) or (text[:5] == \"\\\\end\\\\\"))): break\n    if (not text) or 
((len(text.split())==1) and ((\"-grams:\" in text) or (text[:5] == \"\\\\end\\\\\"))):\n      break # to deal with incorrect ARPA files\n    entry = text.split()\n    prob = float(entry[0])\n    if len(entry)>n+1:\n      back = float(entry[-1])\n      words = entry[1:n+1]\n    else:\n      back = 0.0\n      words = entry[1:]\n    ngram = \" \".join(words)\n    if (n==1) and words[0]==\"<s>\":\n      sentprob = prob\n      prob = 0.0\n    this_ngrams[ngram] = (prob,back)\n    #print prob,ngram.encode(\"utf-8\"),back\n\n    for x in range(n-1,0,-1):\n      # add all missing backoff ngrams for reversed lm\n      l_ngram = \" \".join(words[:x]) # shortened ngram\n      r_ngram = \" \".join(words[1:1+x]) # shortened ngram with offset one\n      if l_ngram not in ngrams[x-1]: # create missing ngram\n        ngrams[x-1][l_ngram] = (0.0,inf)\n        #print ngram, \"create 0.0\", l_ngram, \"inf\"\n      if r_ngram not in ngrams[x-1]: # create missing ngram\n        ngrams[x-1][r_ngram] = (0.0,inf)\n        #print ngram, \"create 0.0\", r_ngram, \"inf\",x,n,h_ngram\n\n      # add all missing backoff ngrams for forward lm\n      h_ngram = \" \".join(words[n-x:]) # shortened history\n      if h_ngram not in ngrams[x-1]: # create missing ngram\n        ngrams[x-1][h_ngram] = (0.0,inf)\n        #print \"create inf\", h_ngram, \"0.0\"\n    text=file.readline()\n    if (not text) or ((len(text.split())==1) and ((\"-grams:\" in text) or (text[:5] == \"\\\\end\\\\\"))): break\n  ngrams.append(this_ngrams)\n\nwhile (text and text[:5] != \"\\\\end\\\\\"): text=file.readline()\nif not text:\n  print(\"invalid ARPA file\")\n  sys.exit()\nfile.close()\n#print text,\n\n#fourgram \"maxent\" model (b(ABCD)=0):\n#p(A)+b(A) A 0\n#p(AB)+b(AB)-b(A)-p(B) AB 0\n#p(ABC)+b(ABC)-b(AB)-p(BC) ABC 0\n#p(ABCD)+b(ABCD)-b(ABC)-p(BCD) ABCD 0\n\n#fourgram reverse ARPA model (b(ABCD)=0):\n#p(A)+b(A) A 0\n#p(AB)+b(AB)-p(B)+p(A) BA 0\n#p(ABC)+b(ABC)-p(BC)+p(AB)-p(B)+p(A) CBA 
0\n#p(ABCD)+b(ABCD)-p(BCD)+p(ABC)-p(BC)+p(AB)-p(B)+p(A) DCBA 0\n\n# compute new reversed ARPA model\nprint(\"\\\\data\\\\\")\nfor n in range(1,len(cngrams)+1): # unigrams, bigrams, trigrams\n  print(\"ngram {0} = {1}\".format(n, len(ngrams[n-1].keys())))\noffset = 0.0\nfor n in range(1,len(cngrams)+1): # unigrams, bigrams, trigrams\n  print(\"\\\\{}-grams:\".format(n))\n  keys = sorted(ngrams[n-1].keys())\n  for ngram in keys:\n    prob = ngrams[n-1][ngram]\n    # reverse word order\n    words = ngram.split()\n    rstr = \" \".join(reversed(words))\n    # swap <s> and </s>\n    rev_ngram = rstr.replace(\"<s>\",\"<temp>\").replace(\"</s>\",\"<s>\").replace(\"<temp>\",\"</s>\")\n\n    revprob = prob[0]\n    if (prob[1] != inf): # only backoff weights from not newly created ngrams\n      revprob = revprob + prob[1]\n    #print prob[0],prob[1]\n    # sum all missing terms in decreasing ngram order\n    for x in range(n-1,0,-1): \n      l_ngram = \" \".join(words[:x]) # shortened ngram\n      if l_ngram not in ngrams[x-1]:\n        sys.stderr.write(rev_ngram+\": not found \"+l_ngram+\"\\n\")\n      p_l = ngrams[x-1][l_ngram][0]\n      #print p_l,l_ngram\n      revprob = revprob + p_l\n\n      r_ngram = \" \".join(words[1:1+x]) # shortened ngram with offset one\n      if r_ngram not in ngrams[x-1]:\n        sys.stderr.write(rev_ngram+\": not found \"+r_ngram+\"\\n\")\n      p_r = ngrams[x-1][r_ngram][0]\n      #print -p_r,r_ngram\n      revprob = revprob - p_r\n\n    if n != len(cngrams): #not highest order\n      back = 0.0\n      if rev_ngram[:3] == \"<s>\": # special handling since arpa2fst ignores <s> weight\n        if n == 1:\n          offset = revprob # remember <s> weight\n          revprob = sentprob # apply <s> weight from forward model\n          back = offset\n        elif n == 2:\n          revprob = revprob + offset # add <s> weight to bigrams starting with <s>\n      if (prob[1] != inf): # only backoff weights from not newly created ngrams\n        
print(revprob,rev_ngram.encode(\"utf-8\"),back)\n      else:\n        print(revprob,rev_ngram.encode(\"utf-8\"),\"-100000.0\")\n    else: # highest order - no backoff weights\n      if (n==2) and (rev_ngram[:3] == \"<s>\"): revprob = revprob + offset\n      print(revprob,rev_ngram.encode(\"utf-8\"))\nprint(\"\\\\end\\\\\")\n"
  },
  {
    "path": "egs/utils/rnnlm_compute_scores.sh",
    "content": "#!/usr/bin/env bash\n\n# Compute scores from RNNLM.  This script takes a directory\n# $dir (e.g. dir=local/rnnlm/rnnlm.voc30.hl30 ),\n# where it expects the files:\n#  rnnlm  wordlist.rnn  unk.probs,\n# and also an input file location where it can get the sentences to score, and\n# an output file location to put the scores (negated logprobs) for each\n# sentence.  This script uses the Kaldi-style \"archive\" format, so the input and\n# output files will have a first field that corresponds to some kind of\n# utterance-id or, in practice, utterance-id-1, utterance-id-2, etc., for the\n# N-best list.\n#\n# Here, \"wordlist.rnn\" is the set of words, like a vocabulary,\n# that the RNN was trained on (note, it won't include <s> or </s>),\n# plus <RNN_UNK> which is a kind of class where we put low-frequency\n# words; unk.probs gives the probs for words given this class, and it\n# has, on each line, \"word prob\".\n\nrnnlm_ver=rnnlm-0.3e\nensure_normalized_probs=false  # if true then we add the neccesary options to\n                               # normalize the probabilities of RNNLM\n                               # e.g. when using faster-rnnlm in the nce mode\n\n. ./path.sh || exit 1;\n. utils/parse_options.sh\n\nrnnlm=$KALDI_ROOT/tools/$rnnlm_ver/rnnlm\n\n[ ! -f $rnnlm ] && echo No such program $rnnlm && exit 1;\n\nif [ $# != 4 ]; then\n  echo \"Usage: rnnlm_compute_scores.sh <rnn-dir> <temp-dir> <input-text> <output-scores>\"\n  exit 1;\nfi\n\ndir=$1\ntempdir=$2\ntext_in=$3\nscores_out=$4\n\nfor x in rnnlm wordlist.rnn unk.probs; do\n  if [ ! -f $dir/$x ]; then \n    echo \"rnnlm_compute_scores.sh: expected file $dir/$x to exist.\"\n    exit 1;\n  fi\ndone\n\nmkdir -p $tempdir\ncat $text_in | awk '{for (x=2;x<=NF;x++) {printf(\"%s \", $x)} printf(\"\\n\");}' >$tempdir/text\ncat $text_in | awk '{print $1}' > $tempdir/ids # e.g. 
utterance ids.\ncat $tempdir/text | awk -v voc=$dir/wordlist.rnn -v unk=$dir/unk.probs \\\n  -v logprobs=$tempdir/loglikes.oov \\\n 'BEGIN{ while((getline<voc)>0) { invoc[$1]=1; } while ((getline<unk)>0){ unkprob[$1]=$2;} }\n  { logprob=0;\n    if (NF==0) { printf \"<RNN_UNK>\"; logprob = log(1.0e-07);\n      print \"Warning: empty sequence.\" | \"cat 1>&2\"; }\n    for (x=1;x<=NF;x++) { w=$x;  \n    if (invoc[w]) { printf(\"%s \",w); } else {\n      printf(\"<RNN_UNK> \");\n      if (unkprob[w] != 0) { logprob += log(unkprob[w]); }\n      else { print \"Warning: unknown word \", w | \"cat 1>&2\"; logprob += log(1.0e-07); }}}\n    printf(\"\\n\"); print logprob > logprobs } ' > $tempdir/text.nounk\n\n# OK, now we compute the scores on the text with OOVs replaced\n# with <RNN_UNK>\n\nif [ $rnnlm_ver == \"faster-rnnlm\" ]; then\n  extra_options=\n  if [ \"$ensure_normalized_probs\" = true ]; then\n    extra_options=\"--nce-accurate-test 1\"\n  fi\n  $rnnlm $extra_options -independent -rnnlm $dir/rnnlm -test $tempdir/text.nounk -nbest -debug 0 | \\\n     awk '{print $1*log(10);}' > $tempdir/loglikes.rnn\nelse\n  # add the utterance_id as required by Mikolove's rnnlm\n  paste $tempdir/ids $tempdir/text.nounk > $tempdir/id_text.nounk\n\n  $rnnlm -independent -rnnlm $dir/rnnlm -test $tempdir/id_text.nounk -nbest -debug 0 | \\\n     awk '{print $1*log(10);}' > $tempdir/loglikes.rnn\nfi\n\n[ `cat $tempdir/loglikes.rnn | wc -l` -ne `cat $tempdir/loglikes.oov | wc -l` ] && \\\n  echo \"rnnlm rescoring failed\" && exit 1;\n\npaste $tempdir/loglikes.rnn $tempdir/loglikes.oov | awk '{print -($1+$2);}' >$tempdir/scores\n\n# scores out, with utterance-ids.\npaste $tempdir/ids $tempdir/scores  > $scores_out\n\n"
  },
  {
    "path": "egs/utils/s2eps.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script replaces <s> and </s> with <eps> (on both input and output sides),\n# for the G.fst acceptor.\n\nwhile(<>){\n    @A = split(\" \", $_);\n    if ( @A >= 4 ) {\n        if ($A[2] eq \"<s>\" || $A[2] eq \"</s>\") { $A[2] = \"<eps>\"; }\n        if ($A[3] eq \"<s>\" || $A[3] eq \"</s>\") { $A[3] = \"<eps>\"; }\n    }\n    print join(\"\\t\", @A) . \"\\n\";\n}\n"
  },
  {
    "path": "egs/utils/scoring/wer_ops_details.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2015 Johns Hopkins University (Author: Yenda Trmal <jtrmal@gmail.com>)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# These scripts are (or can be) used by scoring scripts to generate\n# additional information (such as per-spk wer, per-sentence alignments and so on)\n# during the scoring. See the wsj/local/score.sh script for example how\n# the scripts are used\n# For help and instructions about usage, see the bottom of this file,\n# or call it with the parameter --help\n\nuse strict;\nuse warnings;\nuse Getopt::Long;\nuse Pod::Usage;\n\n\nmy $help;\nmy $special_symbol= \"<eps>\";\nmy $separator=\";\";\nmy $extra_size=4;\nmy $max_size=16;\n\n# this function reads the opened file (supplied as a first\n# parameter) into an array of lines. For each\n# line, it tests whether it's a valid utf-8 compatible\n# line. 
If all lines are valid utf-8, it returns the lines \n# decoded as utf-8, otherwise it assumes the file's encoding\n# is one of those 1-byte encodings, such as ISO-8859-x\n# or Windows CP-X.\n# Please recall we do not really care about\n# the actually encoding, we just need to \n# make sure the length of the (decoded) string \n# is correct (to make the output formatting looking right).\nsub get_utf8_or_bytestream {\n  use Encode qw(decode encode);\n  my $is_utf_compatible = 1;\n  my @unicode_lines;\n  my @raw_lines;\n  my $raw_text;\n  my $lineno = 0;\n  my $file = shift;\n\n  while (<$file>) {\n    $raw_text = $_;\n    last unless $raw_text;\n    if ($is_utf_compatible) {\n      my $decoded_text = eval { decode(\"UTF-8\", $raw_text, Encode::FB_CROAK) } ;\n      $is_utf_compatible = $is_utf_compatible && defined($decoded_text); \n      push @unicode_lines, $decoded_text;\n    }\n    push @raw_lines, $raw_text;\n    $lineno += 1;\n  }\n\n  if (!$is_utf_compatible) {\n    print STDERR \"$0: Note: handling as byte stream\\n\";\n    return (0, @raw_lines);\n  } else {\n    print STDERR \"$0: Note: handling as utf-8 text\\n\";\n    return (1, @unicode_lines);\n  }\n\n  return 0;\n}\nsub print_line {\n  my $op = $_[0];\n  my $rewf = $_[1];\n  my $hypw = $_[2];\n  my $nofop = $_[3];\n\n}\n\nsub max {\n  $_[ 0 ] < $_[ -1 ] ? 
shift : pop while @_ > 1;\n  return @_;\n}\n\n\nGetOptions(\"special-symbol=s\" => \\$special_symbol,\n           \"separator=s\" => \\$separator,\n           \"help|?\" => \\$help\n           ) or pod2usage(2);\npod2usage(1) if $help;\npod2usage(\"$0: Too many files given.\\n\")  if (@ARGV != 0);\n\nmy %EDIT_OPS;\nmy %UTT;\n(my $is_utf8, my @text) = get_utf8_or_bytestream(\\*STDIN);\nif ($is_utf8) {\n  binmode(STDOUT, \":utf8\");\n}\n\nwhile (@text) {\n  my $line = shift @text;\n  chomp $line;\n  my @entries = split(\" \", $line);\n  next if  @entries < 2;\n  next if  ($entries[1] ne \"hyp\") and ($entries[1] ne \"ref\") ;\n  if (scalar @entries <= 2 ) {\n    print STDERR \"$0: Warning: skipping entry \\\"$line\\\", either an empty phrase or incompatible format\\n\" ;\n    next;\n  }\n\n  die \"The input stream contains duplicate entry $entries[0] $entries[1]\\n\"\n    if exists $UTT{$entries[0]}->{$entries[1]};\n  push @{$UTT{$entries[0]}->{$entries[1]}}, @entries[2..$#entries];\n  #print join(\" \", @{$UTT{$entries[0]}->{$entries[1]}}) . \"\\n\";\n  #print $_ . \"\\n\";\n}\n\nfor my $utterance( sort (keys %UTT) ) {\n\n  die \"The input stream does not contain entry \\\"hyp\\\" for utterance $utterance\\n\"\n    unless exists $UTT{$utterance}->{\"hyp\"};\n  die \"The input stream does not contain entry \\\"ref\\\" for utterance $utterance\\n\"\n    unless exists $UTT{$utterance}->{\"ref\"};\n\n  my $hyp = $UTT{$utterance}->{\"hyp\"};\n  my $ref = $UTT{$utterance}->{\"ref\"};\n\n  die \"The \\\"ref\\\" and \\\"hyp\\\" entries do not have the same number of fields\"\n    unless (scalar @{$hyp}) == (scalar @{$ref});\n\n  for ( my $i = 0; $i < @{$hyp}; $i += 1) {\n    $EDIT_OPS{$ref->[$i]}->{$hyp->[$i]} += 1;\n  }\n}\n\nmy $word_len = 0;\nmy $ops_len =0;\nforeach my $refw ( sort (keys %EDIT_OPS) ) {\n  foreach my $hypw ( sort (keys %{$EDIT_OPS{$refw}} ) ) {\n    my $q = length($refw) > length($hypw) ? 
length($refw):  length($hypw) ;\n    if ( $q > $max_size ) {\n      #print STDERR Dumper( [$refw, $hypw, $q, length($refw), length($hypw) ]);\n      ;\n    }\n    $word_len = $q > $word_len ? $q : $word_len ;\n\n    my $d = length(sprintf(\"%d\", $EDIT_OPS{$refw}->{$hypw}));\n    $ops_len =  $d > $ops_len ? $d: $ops_len ;\n  }\n}\n\nif ($word_len > $max_size) {\n  ## We used to warn about this, but it was just confusing-- dan.\n  ## print STDERR \"wer_ops_details.pl [info; affects only whitespace]: we are limiting the width to $max_size, max word len was $word_len\\n\";\n  $word_len = $max_size\n};\n\n\nforeach my $refw ( sort (keys %EDIT_OPS) ) {\n  foreach my $hypw ( sort (keys %{$EDIT_OPS{$refw}} ) ) {\n    if ( $refw eq $hypw ) {\n      printf \"correct       %${word_len}s    %${word_len}s    %${ops_len}d\\n\", ($refw,  $hypw,  $EDIT_OPS{$refw}->{$hypw});\n    } elsif ( $refw eq   $special_symbol ) {\n      printf \"insertion     %${word_len}s    %${word_len}s    %${ops_len}d\\n\", ($refw,  $hypw,  $EDIT_OPS{$refw}->{$hypw});\n    } elsif ( $hypw eq $special_symbol ) {\n      printf \"deletion      %${word_len}s    %${word_len}s    %${ops_len}d\\n\", ($refw,  $hypw,  $EDIT_OPS{$refw}->{$hypw});\n    } else {\n      printf \"substitution  %${word_len}s    %${word_len}s    %${ops_len}d\\n\", ($refw,  $hypw,  $EDIT_OPS{$refw}->{$hypw});\n    }\n  }\n}\nexit 0;\n__END__\n=head1 NAME\n  wer_ops_details.pl -- generate aggregated ops statistics\n\n=head1 SYNOPSIS\n\n  wer_ops_details.pl\n\n  Options:\n    --special-symbol        special symbol used in align-text to denote empty word\n                            in case insertion or deletion (\"<eps>\" by default)\n    --help                  Print this help\n\n=head1 DESCRIPTION\n  The program generates global statistics on how many times each word was\n  recognized correctly, confused as another word, incorrectly deleted or inserted.\n  The output will contain similar info as the sclite dtl file, the format is,\n  
however, completely different.\n\n\n\n=head1 EXAMPLE INPUT AND OUTPUT\n  Input:\n    UTT-A ref  word-A   <eps>  word-B  word-C  word-D  word-E\n    UTT-A hyp  word-A  word-A  word-B   <eps>  word-D  word-X\n\n  Output:\n    correct       word-A  word-A  1\n    correct       word-B  word-B  1\n    correct       word-D  word-D  1\n    deletion      word-C  <eps>   1\n    insertion     <eps>   word-A  1\n    substitution  word-E  word-X  1\n\n\n  Note:\n    The input can contain other lines as well -- those will be ignored during\n    reading the input. That is, the following is perfectly valid input:\n\n      UTT-A ref  word-A   <eps>  word-B  word-C  word-D  word-E\n      UTT-A hyp  word-A  word-A  word-B   <eps>  word-D  word-X\n      UTT-A op      C       I       C       D       C       S\n      UTT-A #csid 3 1 1 1\n=cut\n"
  },
  {
    "path": "egs/utils/scoring/wer_per_spk_details.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2015 Johns Hopkins University (Author: Yenda Trmal <jtrmal@gmail.com>)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# These scripts are (or can be) used by scoring scripts to generate \n# additional information (such as per-spk wer, per-sentence alignments and so on) \n# during the scoring. See the wsj/local/score.sh script for example how \n# the scripts are used\n# For help and instructions about usage, see the bottom of this file, \n# or call it with the parameter --help\n \nuse strict;\nuse warnings;\nuse List::Util qw[max];\nuse Getopt::Long;\nuse Pod::Usage;\n\n\n#use Data::Dumper;\n\nmy $WIDTH=10;\nmy $SPK_WIDTH=15;\nmy $help;\n\nGetOptions(\"spk-field-width\" => \\$SPK_WIDTH,\n           \"field-width\" => \\$WIDTH,\n           \"help|?\" => \\$help\n           ) or pod2usage(2);\npod2usage(1) if $help;\npod2usage(\"$0: Too many files given.\\n\")  if (@ARGV != 1);\n\nmy %UTTMAP;\nmy %PERSPK_STATS;\n\n# this function reads the opened file (supplied as a first\n# parameter) into an array of lines. For each\n# line, it tests whether it's a valid utf-8 compatible\n# line. 
If all lines are valid utf-8, it returns the lines \n# decoded as utf-8, otherwise it assumes the file's encoding\n# is one of those 1-byte encodings, such as ISO-8859-x\n# or Windows CP-X.\n# Please recall we do not really care about\n# the actually encoding, we just need to \n# make sure the length of the (decoded) string \n# is correct (to make the output formatting looking right).\nsub get_utf8_or_bytestream {\n  use Encode qw(decode encode);\n  my $is_utf_compatible = 1;\n  my @unicode_lines;\n  my @raw_lines;\n  my $raw_text;\n  my $lineno = 0;\n  my $file = shift;\n\n  while (<$file>) {\n    $raw_text = $_;\n    last unless $raw_text;\n    if ($is_utf_compatible) {\n      my $decoded_text = eval { decode(\"UTF-8\", $raw_text, Encode::FB_CROAK) } ;\n      $is_utf_compatible = $is_utf_compatible && defined($decoded_text); \n      push @unicode_lines, $decoded_text;\n    }\n    push @raw_lines, $raw_text;\n    $lineno += 1;\n  }\n\n  if (!$is_utf_compatible) {\n    print STDERR \"$0: Note: handling as byte stream\\n\";\n    return (0, @raw_lines);\n  } else {\n    print STDERR \"$0: Note: handling as utf-8 text\\n\";\n    return (1, @unicode_lines);\n  }\n\n  return 0;\n}\n\nsub print_header {\n  \n  my $f=\"%${WIDTH}s\";\n  my $str = sprintf(\"%-${SPK_WIDTH}s id  $f $f $f $f $f $f $f $f\\n\", \"SPEAKER\", \n                    \"#SENT\", \"#WORD\", \"Corr\", \"Sub\", \"Ins\", \"Del\", \"Err\", \"S.Err\");\n  return $str;\n}\nsub format_raw {\n  my $spk = $_[0];\n  my $sent = $_[1];\n  my $word = $_[2];\n  my $c = $_[3];\n  my $s = $_[4];\n  my $i = $_[5];\n  my $d = $_[6];\n  my $err = $_[7];\n  my $serr = $_[8];\n\n  my $f = \"%${WIDTH}d\"; \n  my $str = sprintf(\"%-${SPK_WIDTH}s raw $f $f $f $f $f $f $f $f\\n\", $spk, \n                    $sent, $word, $c, $s, $i, $d, $err, $serr);\n  return $str;\n}\nsub format_sys {\n  my $spk = $_[0];\n  my $sent = $_[1];\n  my $word = $_[2];\n  my $c = $_[3];\n  my $s = $_[4];\n  my $i = $_[5];\n  my $d = $_[6];\n  my 
$err = $_[7];\n  my $serr = $_[8];\n\n  my $fd = \"%${WIDTH}d\"; \n  my $ff = \"%${WIDTH}.2f\"; \n  my $str = sprintf(\"%-${SPK_WIDTH}s sys $fd $fd $ff $ff $ff $ff $ff $ff\\n\", $spk, \n                    $sent, $word, $c, $s, $i, $d, $err, $serr);\n  return $str;\n}\n\nopen(UTT2SPK,$ARGV[0]) or die \"Could not open the utt2spk file $ARGV[0]\";\n\n(my $utt_is_utf8, my @utt_lines) = get_utf8_or_bytestream(\\*UTT2SPK);\ndie \"Cannot read file\" unless @utt_lines;\n\nwhile (@utt_lines) {\n  my $line = shift @utt_lines;\n  chomp $line;\n  my @F=split(\" \", $line);\n  die \"Incompatible format of the utt2spk file: $line\" if @F != 2;\n  $UTTMAP{$F[0]} = $F[1];\n  # Set width of speaker column by its longest label,\n  if($SPK_WIDTH < length($F[1])) { $SPK_WIDTH = length($F[1]) }\n}\nclose(UTT2SPK);\n\n(my $is_utf8, my @text) = get_utf8_or_bytestream(\\*STDIN);\nif ($is_utf8) {\n  binmode(STDOUT, \":utf8\");\n}\n\nwhile (@text) {\n  my $line = shift @text;\n  chomp $line;\n  my @entries = split(\" \", $line);\n  next if  @entries < 2;\n  next if  $entries[1] ne \"#csid\" ; \n  die \"Incompatible entry $line \" if @entries != 6;\n\n  my $c=$entries[2]; \n  my $s=$entries[3]; \n  my $i=$entries[4]; \n  my $d=$entries[5]; \n  \n  my $UTT=$entries[0];\n  my $SPK=$UTTMAP{$UTT};\n  $PERSPK_STATS{$SPK}->{\"C\"} += $c;\n  $PERSPK_STATS{$SPK}->{\"S\"} += $s;\n  $PERSPK_STATS{$SPK}->{\"I\"} += $i;\n  $PERSPK_STATS{$SPK}->{\"D\"} += $d;\n  $PERSPK_STATS{$SPK}->{\"SENT\"} += 1;\n  $PERSPK_STATS{$SPK}->{\"SERR\"} += 1 if ($s + $i + $d != 0);\n}\n\nmy $C = 0;\nmy $S = 0;\nmy $I = 0;\nmy $D = 0;\nmy $SENT = 0;\nmy $WORD = 0;\nmy $ERR = 0;\nmy $SERR = 0;\n\nprint print_header;\n\nfor my $SPK (sort (keys %PERSPK_STATS)) {\n  my $c=$PERSPK_STATS{$SPK}->{\"C\"}; \n  my $s=$PERSPK_STATS{$SPK}->{\"S\"}; \n  my $i=$PERSPK_STATS{$SPK}->{\"I\"}; \n  my $d=$PERSPK_STATS{$SPK}->{\"D\"}; \n  my $sent=$PERSPK_STATS{$SPK}->{\"SENT\"} ;\n  my $word=$c+$s+$d;\n  my $err =$s+$d+$i;\n  my $serr = 
$PERSPK_STATS{$SPK}->{\"SERR\"} // 0;\n\n  my $spk = \"$SPK\";\n  $C += $c; $S += $s; $I += $i; $D += $d; \n  $SENT += $sent; $SERR += $serr;\n\n  my $w = 1.0 *$word;\n  print format_raw($spk, $sent, $word, $c, $s, $i, $d, $err, $serr);\n  print format_sys($spk, $sent, $word, 100 * $c/$w, 100 * $s/$w, \n                   100 * $i/$w, 100 * $d/$w, 100 * $err/$w, 100.0 * $serr/$sent) unless $w == 0;\n\n}\n$WORD= $C + $S + $D;\n$ERR= $S + $D + $I;\nmy $W = 1.0 * $WORD;\n\nprint format_raw(\"SUM\", $SENT, $WORD, $C, $S, $I, $D, $ERR, $SERR);\nprint format_sys(\"SUM\", $SENT, $WORD, 100* $C/$W, 100*$S/$W, \n                         100*$I/$W,100*$D/$W,100*$ERR/$W, 100.0 * $SERR/$SENT) unless $W==0;\n\n\n __END__\n\n=head1 NAME\n  wer_per_spk_details.pl -- generate aggregated per-speaker details\n\n=head1 SYNOPSIS\n\n  wer_per_spk_details.pl  data/dev/utt2spk\n\n  Options:\n    --spk-field-width         Width of the first field (spk ID field)\n    --field-width             Width of the fields (with exception of the SPK ID \n                              field)\n\n=head1 DESCRIPTION\n  This program aggregates the per-utterance output from utils/wer_per_utt_details.pl.\n  It cares only about the \"#csid\" field (counts of Corr, Sub, Ins and Del).\n\n  It expects one parameter -- file in the format of the kaldi utt2spk.\n  In case the SPK ID is longer than 15 characters, the parameter spk-field-width\n  can be used; the same for all other fields and field-width parameter.\n  The field-width parameter should not be necessary under normal circumstances.\n\n=head1 EXAMPLE INPUT AND OUTPUT\n  Input:\n    UTT-A #csid 3 1 1 1\n\n  Output:\n    SPEAKER         id       #SENT      #WORD       Corr        Sub        Ins        Del        Err      S.Err\n    A               raw          1          5          3          1          1          1          3          1\n    A               sys          1          5      60.00      20.00      20.00      20.00      60.00     100.00\n    
SUM             raw          1          5          3          1          1          1          3          1\n    SUM             sys          1          5      60.00      20.00      20.00      20.00      60.00     100.00\n    \n    The input can contain other lines as well -- those will be ignored during\n    reading the input. I.E. this is a completely legal input:\n      \n      UTT-A ref  word-A   <eps>  word-B  word-C  word-D  word-E\n      UTT-A hyp  word-A  word-A  word-B   <eps>  word-D  word-X\n      UTT-A op      C       I       C       D       C       S\n      UTT-A #csid 3 1 1 1\n\n=cut\n"
  },
  {
    "path": "egs/utils/scoring/wer_per_utt_details.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2015 Johns Hopkins University (Author: Yenda Trmal <jtrmal@gmail.com>)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n#These scripts are (or can be) used by scoring scripts to generate \n#additional information (such as per-spk wer, per-sentence alignments and so on) \n#during the scoring. See the wsj/local/score.sh script for example how \n#the scripts are used\n#For help and instructions about usage, see the bottom of this file, \n#or call it with the parameter --help\n#\nuse strict;\nuse warnings;\nuse List::Util qw[max];\nuse Getopt::Long;\nuse Pod::Usage;\n\n\n#use Data::Dumper;\n\nmy $special_symbol= \"<eps>\";\nmy $separator=\";\";\nmy $output_hyp = 1;\nmy $output_ref = 1;\nmy $output_ops = 1;\nmy $output_csid = 1;\nmy $help;\n\nGetOptions(\"special-symbol=s\" => \\$special_symbol,\n           \"separator=s\" => \\$separator,\n           \"output-hyp!\" => \\$output_hyp,\n           \"output-ref!\" => \\$output_ref,\n           \"output-ops!\" => \\$output_ops,\n           \"output-csid!\" => \\$output_csid,\n           \"help|?\" => \\$help\n           ) or pod2usage(2);\npod2usage(1) if $help;\npod2usage(\"$0: Too many parameters.\\n\")  if (@ARGV != 0);\n\nsub rjustify {\n  my $maxlen =  $_[1];\n  my $str =  $_[0];\n  return sprintf(\"%-${maxlen}s\", $str);\n}\nsub ljustify {\n  my $maxlen =  $_[1];\n  my $str =  $_[0];\n  return 
sprintf(\"%${maxlen}s\", $str);\n}\nsub cjustify {\n  my $maxlen =  $_[1];\n  my $str =  $_[0];\n  my $right_spaces = int(($maxlen - length($str)) / 2);\n  my $left_spaces =$maxlen - length($str) - $right_spaces;\n  return sprintf(\"%s%s%s\", \" \" x $left_spaces,  $str, \" \" x $right_spaces);\n}\n\n# this function reads the opened file (supplied as a first\n# parameter) into an array of lines. For each\n# line, it tests whether it's a valid utf-8 compatible\n# line. If all lines are valid utf-8, it returns the lines \n# decoded as utf-8, otherwise it assumes the file's encoding\n# is one of those 1-byte encodings, such as ISO-8859-x\n# or Windows CP-X.\n# Please recall we do not really care about\n# the actually encoding, we just need to \n# make sure the length of the (decoded) string \n# is correct (to make the output formatting looking right).\nsub get_utf8_or_bytestream {\n  use Encode qw(decode encode);\n  my $is_utf_compatible = 1;\n  my @unicode_lines;\n  my @raw_lines;\n  my $raw_text;\n  my $lineno = 0;\n  my $file = shift;\n\n  while (<$file>) {\n    $raw_text = $_;\n    last unless $raw_text;\n    if ($is_utf_compatible) {\n      my $decoded_text = eval { decode(\"UTF-8\", $raw_text, Encode::FB_CROAK) } ;\n      $is_utf_compatible = $is_utf_compatible && defined($decoded_text); \n      push @unicode_lines, $decoded_text;\n    }\n    push @raw_lines, $raw_text;\n    $lineno += 1;\n  }\n\n  if (!$is_utf_compatible) {\n    print STDERR \"$0: Note: handling as byte stream\\n\";\n    return (0, @raw_lines);\n  } else {\n    print STDERR \"$0: Note: handling as utf-8 text\\n\";\n    return (1, @unicode_lines);\n  }\n}\n\n(my $is_utf8, my @text) = get_utf8_or_bytestream(\\*STDIN);\nif ($is_utf8) {\n  binmode(STDOUT, \":utf8\");\n}\n\nwhile (@text) {\n  my $line = shift @text;\n  chomp $line;\n  (my $utt_id, my $alignment) = split (\" \", $line, 2);\n  my @alignment_pairs = split(\" \", $alignment); #splits on spaces, does not create empty fields\n \n  my 
@HYP;\n  my @REF;\n  my @OP;\n  my %OPCOUNTS= (\n    \"I\" => 0,\n    \"D\" => 0,\n    \"S\" => 0,\n    \"C\" => 0\n  );\n\n  while(@alignment_pairs) {\n    my $ref = shift @alignment_pairs;\n    my $hyp = shift @alignment_pairs;\n    if (@alignment_pairs) {\n      my $sep = shift @alignment_pairs;\n      die \"Detected incorrect separator $sep (expected $separator).\\n\" unless ($sep eq $separator);\n    }\n\n    push @HYP, $hyp;\n    push @REF, $ref;\n\n    if ( $hyp eq $special_symbol ) {\n      push @OP, \"D\";\n      $OPCOUNTS{\"D\"} +=1;\n    } elsif ( $ref eq $special_symbol ) {\n      push @OP, \"I\";\n      $OPCOUNTS{\"I\"} +=1;\n    } elsif ($ref ne $hyp ) {\n      push @OP, \"S\";\n      $OPCOUNTS{\"S\"} +=1;\n    } else {\n      push @OP, \"C\";\n      $OPCOUNTS{\"C\"} +=1;\n    }\n  }\n\n  die \"Number of edit ops is not equal to the length of the text for utterance $utt_id\\n\" if scalar(@OP) != scalar(@HYP);\n   \n  my @hyp_str;\n  my @ref_str;\n  my @op_str;\n  for (my $i=0; $i <= $#OP; $i+=1) {\n    my $maxlen=max(length($REF[$i]), length($HYP[$i]), length($OP[$i]));\n\n    push @ref_str, cjustify($REF[$i], $maxlen);\n    push @hyp_str, cjustify($HYP[$i], $maxlen);\n    push @op_str, cjustify($OP[$i], ${maxlen});\n  }\n  print $utt_id . \" ref  \" . join(\"  \", @ref_str) . \"\\n\" if $output_ref;\n  print $utt_id . \" hyp  \" . join(\"  \", @hyp_str) . \"\\n\" if $output_hyp;\n  print $utt_id . \" op   \" . join(\"  \", @op_str) . \"\\n\" if $output_ops;\n  print $utt_id . \" #csid\" . \" \" .$OPCOUNTS{\"C\"} . \" \" . $OPCOUNTS{\"S\"} . \" \" . $OPCOUNTS{\"I\"} . \" \" . $OPCOUNTS{\"D\"} . 
\"\\n\" if $output_csid;\n}\n\n\n __END__\n\n=head1 NAME\n  wer_per_utt_details.pl -- generate detailed stats\n\n=head1 SYNOPSIS\n\n  Example:\n    align-text ark:text.filt ark:10.txt ark,t:-  | wer_per_utt_details.pl\n\n  Options:\n    --special-symbol        special symbol used in align-text to denote empty word \n                            in case insertion or deletion (\"<eps>\" by default)\n    --separator             special symbol used to separate individual word-pairs\n                            in the align-text output (\";\" by default)\n\n    --[no]output-hyp        disable/enable printing of the hyp (hypothesis) entry\n    --[no]output-ref        disable/enable printing of the ref (reference) entry\n    --[no]output-ops        disable/enable printing of the ops (edit operations) entry\n    --[no]output-csid       disable/enable printing of the #csid entry (counts\n                            of the individual edit operations)\n\n=head1 DESCRIPTION\n    The program works as a filter -- reads the output of the align-text program,\n    parses it and outputs the requested entries on the output. The format of\n    the entries was chosen so that it allows for easy parsing while being human\n    readable.\n\n    By default, all entries (hyp, ref, ops, #csid) are printed. \n\n    The filter can be used (for example) to generate detailed statistics\n    from scoring (similar to the dtl/prf output of the sctk sclite output)\n\n=head1 EXAMPLE INPUT AND OUTPUT\n  Input:\n    UTT-A word-A word-A; <eps> word-A; word-B word-B; word-C <eps>; word-D word-D; word-E word-X;\n\n  Output:\n    UTT-A ref  word-A   <eps>  word-B  word-C  word-D  word-E\n    UTT-A hyp  word-A  word-A  word-B   <eps>  word-D  word-X\n    UTT-A op      C       I       C       D       C       S\n    UTT-A #csid 3 1 1 1\n\n=cut\n\n"
  },
  {
    "path": "egs/utils/scoring/wer_report.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2015 Johns Hopkins University (author: Jan Trmal <jtrmal@gmail.com>)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# This script reads per-utt table generated for example during scoring\n# and outputs the WER similar to the format the compute-wer utility \n# or the utils/best_wer.pl produces\n# i.e. from table containing lines in this format\n# SUM raw 23344 243230 176178 46771 9975 20281 77027 16463\n# produces output like this\n# %WER 31.67 [ 77027 / 243230, 9975 ins, 20281 del, 46771 sub ] \n# NB: if the STDIN stream contains more than one SUM raw entry,\n#     the best one will be found and printed \n#\n# If the script is called with parameters, it uses them to provide \n# a description of the output\n# i.e.\n# cat per-spk-report | utils/scoring/wer_report.pl Full set\n# the following output will be produced\n# %WER 31.67 [ 77027 / 243230, 9975 ins, 20281 del, 46771 sub ] Full set\n\n\nwhile (<STDIN>) {\n  if ( m:SUM\\s+raw:) {\n    @F = split;\n    if ((!defined $wer) || ($wer > $F[8])) {\n      $corr=$F[4];\n      $sub=$F[5];\n      $ins=$F[6];\n      $del=$F[7];\n      $wer=$F[8];\n      $words=$F[3];\n    }\n  }\n}\n\nif (defined $wer) {\n  $wer_str = sprintf(\"%.2f\", (100.0 * $wer) / $words);\n  print \"%WER $wer_str [ $wer / $words,  $ins ins, $del del, $sub sub ]\";\n  print \" \" . 
join(\" \", @ARGV) if @ARGV > 0;\n  print \"\\n\";\n}\n"
  },
  {
    "path": "egs/utils/segmentation.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2013  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0.\n\n# This program is for segmentation of data, e.g. long telephone conversations,\n# into short chunks.  The input (stdin) should be a sequence of lines like\n# sw0-20348-A  0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 ...  2 2 0 0 0\n# where there is a number for each frame and the numbers mean 0 for silence, 1\n# for noise, laughter and other nonspeech events, and 2 for speech.  This will\n# typically be derived from some kind of fast recognition (see\n# ../steps/resegment_data.sh), followed by ali-to-phones --per-frame=true and\n# then mapping phones to these classes 0, 1 and 2.\n#\n# The algorithm is as follows:\n#  (1) Find contiguous sequences of classes 1 or 2 (i.e. speech and/or noise), with e.g.\n#      \"1 1 1 2 2\" counted as a single contiguous sequence.  Each such sequence is an\n#      initial segment.\n#  (2) While the proportion of silence in the segments is less than $silence_proportion,\n#      add a single silence frame to the left and right of each segment, as long\n#      as this does not take us past the ends of the file or into another segment.  \n#      At this point, do not merge segments.\n#  (3) Merging segments:\n#      Get a list of all boundaries between segments that ended up touching each other\n#      during phase 2.  Sort them according to the number of silence frames at the boundary,\n#      with those with the least silence to be processed first.  
Go through the boundaries\n#      in order, merging each pair of segments, as long as doing so does not create\n#      a segment larger than $max_segment_length.\n#  (4) Splitting excessively long segments:\n#      For all segments that are longer than $hard_max_segment_length, split them equally\n#      into the smallest number of pieces such that the pieces will be no longer than\n#      $hard_max_segment_length.  Print a warning.\n#  (5) Removing any segments that contain no speech (i.e. remove segments that have only\n#      silence and noise).\n#\n#  By default, the utterance-ids will be of the form <RECORDING-ID>-<START-TIME>-<END-TIME>,\n#  where <START-TIME> and <END-TIME> are measured in units of 0.01 seconds, using fixed-width\n#  integers with enough digits to print out all the segments (the number of digits being\n#  decided per line of the input).  For instance, if the input recording-id was\n#  sw0-20348-A, an example line of the \"segments-file\" output would be:\n#   sw0-20348-A-00124-00298 sw0-20348-A 1.24 2.98\n#  (interpreted as <UTTERANCE-ID> <RECORDING-ID> <START-TIME> <END-TIME>)\n#  and the number of digits has to be that large because the same recording has\n#  a segment something like\n#   sw0-20348-A-13491-13606 sw0-20348-A 134.91 136.06\n#  The two separators in the utterance-id are separately configurable by means of the\n#  --first-separator and --second-separator options.  
However, generally speaking,\n#  it is safer to use \"-\" than, say, \"_\", because \"-\" appears very early in the\n#  ASCII table, and using it as the separator will tend to ensure that when\n#  you sort the utterances and the recording-ids they will sort the same way.\n#  This matters because recording-ids will often equal speaker-ids, and Kaldi scripts\n#  require that the utterance-ids and speaker-ids sort in the \"same order\".\n\n\nuse Getopt::Long;\n\n$silence_proportion = 0.2; # The amount of silence at the sides of segments is\n                           # tuned to give this proportion of silence.\n\n$frame_shift = 0.01; # Affects the interpretation of the options such as max_segment_length,\n                     # and the seconds in the \"segments\" file.\n$max_segment_length = 15.0; # Maximum segment length while we are merging segments...\n                            # it will not allow merging segments to make segments longer than this.\n$hard_max_segment_length = 30.0; # A hard maximum on the segment length; it will\n                                 # break segments to get below this, even if there is\n                                 # no silence, and print a warning.\n$first_separator = \"-\";   # separator between recording-id and start-time, in utterance-id.\n$second_separator = \"-\";  # separator between start-time and end-time, in utterance-id.\n$remove_noise_only_segments = \"true\";  # boolean option; if true,\n                                       # remove segments that have no speech.\n\n\nGetOptions('silence-proportion:f' => \\$silence_proportion,\n           'frame-shift:f' => \\$frame_shift,\n           'max-segment-length:f' => \\$max_segment_length,\n           'hard-max-segment-length:f' => \\$hard_max_segment_length,\n           'first-separator:s' => \\$first_separator,\n           'second-separator:s' => \\$second_separator,\n           'remove-noise-only-segments:s' => \\$remove_noise_only_segments);\n\nif (@ARGV != 0) {\n  
print STDERR \"$0:\\n\" .\n               \"Usage: segmentation.pl [options] < per-frame-archive > segments-file\\n\" .\n               \"This program is called from steps/resegment_data.sh.  Please see\\n\" .\n               \"the extensive comment in the source.  Options:\\n\" .\n               \"--silence-proportion <float> (default: $silence_proportion)\\n\" .\n               \"--frame-shift <float> (default: $frame_shift, in seconds)\\n\" .\n               \"--max-segment-length <float> (default: $max_segment_length, in seconds)\\n\" .\n               \"--hard-max-segment-length <float> (default: $hard_max_segment_length, in seconds)\\n\" .\n               \"--first-separator <string> (default: $first_separator), affects utterance-ids\\n\" .\n               \"--second-separator <string> (default: $second_separator), affects utterance-ids\\n\" .\n               \"--remove-noise-only-segments <true|false> (default: true)\\n\";\n  exit 1;\n}\n\n($silence_proportion > 0.01 && $silence_proportion < 0.99) ||\n  die \"Invalid silence-proportion value '$silence_proportion'\";\n($frame_shift > 0.0001 && $frame_shift <= 1.0) ||\n  die \"Very strange frame-shift value '$frame_shift'\";\n($max_segment_length > 1.0 && $max_segment_length < 100.0) ||\n  die \"Very strange max-segment-length value '$max_segment_length'\";\n($hard_max_segment_length > 4.0 && $hard_max_segment_length < 500.0) ||\n  die \"Very strange hard-max-segment-length value '$hard_max_segment_length'\";\n($hard_max_segment_length >= $max_segment_length) ||\n  die \"hard-max-segment-length may not be less than max-segment-length\";\n($remove_noise_only_segments eq 'false' ||\n $remove_noise_only_segments eq 'true') || \n  die \"Option --remove-noise-only-segments takes args true or false\";\n\n\nsub get_initial_segments {\n  # This operates on the global arrays @A, @S and @E.  
It sets the elements of\n  # @S to 1 if start of segment, and @E to 1 if end of segment, end of segment\n  # being defined as one past the last frame in the segment.\n\n  for (my $n = 0; $n < $N; $n++) {\n    if ($A[$n] == 0) {\n      if ($n > 0 && $A[$n-1] != 0) {\n        $E[$n] = 1;\n      }\n    } else {\n      if ($n == 0 || $A[$n-1] == 0) {\n        $S[$n] = 1;\n      }\n    }\n  }\n  if ($A[$N-1] != 0) { # Handle the special case\n    $E[$N] = 1;        # where the last frame of the file is noise or speech.\n  }\n}\n\n\nsub set_silence_proportion {\n  $num_nonsil_frames = 0;\n  # Get number of frames that are inside segments.  Initially, this will\n  # all be non-silence.\n  $in_segment = 0;\n\n  my @active_frames = (); # active_frames are segment start/end frames.\n  for (my $n = 0; $n <= $N; $n++) {\n    if ($n < $N && $S[$n] == 1) {\n      $in_segment == 0 || die; \n      $in_segment = 1; \n      push @active_frames, $n;\n    }\n    if ($E[$n] == 1) { \n      $in_segment == 1 || die; \n      $in_segment = 0; \n      push @active_frames, $n;\n    }\n    if ($n < $N) {\n      ($in_segment == ($A[$n] != 0 ? 
1 : 0)) || die; # Just a check.\n      if ($in_segment) { $num_nonsil_frames++; }\n    }\n  }\n  $in_segment == 0 || die; # should not be still in a segment after file-end.\n  if ($num_nonsil_frames == 0) {\n    print STDERR \"$0: warning: no segments found for recording $recording_id\\n\";\n    return;\n  }\n  #(target-segment-frames - num-nonsil-frames) / target-segment-frames =  sil-proportion\n  # -> target-segment-frames = (num-nonsil-frames) / (1 - sil-proportion).\n  my $target_segment_frames = int($num_nonsil_frames / (1.0 - $silence_proportion));\n  my $num_segment_frames = $num_nonsil_frames;\n  while ($num_segment_frames < $target_segment_frames) {\n    $changed = 0;\n    for (my $i = 0; $i < @active_frames; $i++) {\n      my $n = $active_frames[$i];\n      if ($E[$n] == 1 && $n < $N && $S[$n] != 1) {\n        # shift the end of this segment one frame to the right.\n        $E[$n] = 0;\n        $E[$n+1] = 1;\n        $active_frames[$i] = $n + 1;\n        $num_segment_frames++;\n        $changed = 1;\n      }\n      if ($n < $N && $S[$n] == 1 && $n > 0 && $E[$n] != 1) {\n        # shift the start of this segment one frame to the left\n        $S[$n] = 0;\n        $S[$n-1] = 1;\n        $active_frames[$i] = $n - 1;\n        $num_segment_frames++;\n        $changed = 1;\n      }\n      if ($num_segment_frames == $target_segment_frames) {\n        last;\n      }\n    }\n    if ($changed == 0) { last; } # avoid an infinite loop.\n  }\n  if ($num_segment_frames < $target_segment_frames) {\n    my $proportion = \n      ($num_segment_frames - $num_nonsil_frames) / $num_segment_frames;\n    print STDERR \"$0: warning: for recording $recording_id, only got a proportion \" .\n      \"$proportion of silence frames, versus target $silence_proportion\\n\";\n  }\n}\n\nsub merge_segments() {\n  my @boundaries = ();\n  my @num_silence_phones = (); # for any index into @S where there\n                               # is a boundary between contiguous segments\n             
                  # (i.e. an index which is both a segment-start\n                               # and segment-end index), the number of silence\n                               # frames at that boundary (i.e. at the end of the\n                               # previous segment and the beginning of the next\n                               # one.\n  for ($n = 0; $n < $N; $n++) {\n    if ($S[$n] == 1 && $E[$n] == 1) {\n      push @boundaries, $n;\n      my $num_sil = 0;\n      my $p;\n      # note: here we can count the silence phones without regard to the\n      # segment boundaries, since we'll hit nonsilence before we get to\n      # the end/beginning of these segments.\n      for ($p = $n; $p < $N; $p++) {\n        if ($A[$p] == 0) { $num_sil++; }\n        else { last; }\n      }\n      for ($p = $n - 1; $p >= 0; $p--) {\n        if ($A[$p] == 0) { $num_sil++; }\n        else { last; }\n      }\n      \n      $num_silence_phones[$n] = $num_sil; # should be the num of silence\n    }\n  }\n\n  # Sort on increasing number of silence-phones, so we join the segments with\n  # the smallest amount of silence at the boundary first.\n  my @sorted_boundaries = \n    sort { $num_silence_phones[$a] <=> $num_silence_phones[$b] } @boundaries;\n\n  foreach $n (@sorted_boundaries) {\n    # Join the segments only if the length of the resulting segment would\n    # be no more than $max_segment_length.\n    ($S[$n] == 1 && $E[$n] == 1) || die;\n    my $num_frames = 2; # total number of frames in the two segments we'll be merging..\n                        # start the count from 2 because the loops below do not\n                        # count the 1st frame of the segment to the right and\n                        # the last frame of the segment to the left.\n    my $p;\n    for ($p = $n + 1; $p <= @A && $E[$p] == 0; $p++) {\n      $num_frames++;\n    }\n    $E[$p] == 1 || die;\n    for ($p = $n - 1; $p >= 0 && $S[$p] == 0; $p--) {\n      $num_frames++;\n    }\n    $S[$p] == 1 || 
die;\n    if ($num_frames * $frame_shift <= $max_segment_length) {\n      # Join this pair of segments.\n      $S[$n] = 0;\n      $E[$n] = 0;\n    }\n  }\n}\n\nsub split_long_segments {\n  for (my $n = 0; $n < @A; $n++) {\n    if ($S[$n] == 1) { # segment starts here...\n      my $p;\n      for ($p = $n + 1; $p <= @A; $p++) {\n        if ($E[$p] == 1) { last; }\n      }\n      ($E[$p] == 1) || die;\n      my $segment_length = $p - $n;\n      my $max_frames = int($hard_max_segment_length / $frame_shift);\n      if ($segment_length > $max_frames) {\n        # The segment is too long, we need to split it.  First work out\n        # how many pieces to split it into.\n        # We divide and round up to nearest larger int.\n        my $num_pieces = int(($segment_length / $max_frames) + 0.99999);\n        my $segment_length_in_seconds = $segment_length * $frame_shift;\n        print STDERR \"$0: warning: for recording $recording_id, splitting segment of \" .\n          \"length $segment_length_in_seconds seconds into $num_pieces pieces \" .\n          \"(--hard-max-segment-length $hard_max_segment_length)\\n\";\n        my $frames_per_piece = int($segment_length / $num_pieces);\n        my $i;\n        for ($i = 1; $i < $num_pieces; $i++) {\n          my $q = $n + $i * $frames_per_piece;\n          # Insert a segment boundary at frame $q.\n          $S[$q] = 1;\n          $E[$q] = 1;\n        }\n      }\n      if ($p - 1 > $n) {\n        $n = $p - 1; # avoids some redundant work.\n      }\n    }\n  }\n}\n\nsub remove_noise_only_segments {\n  for (my $n = 0; $n < $N; $n++) {\n    if ($S[$n] == 1) { # segment starts here...\n      my $p;\n      my $saw_speech = 0;\n      for ($p = $n; $p <= $N; $p++) {\n        if ($E[$p] == 1 && $p != $n) { last; }\n        if ($A[$p] == 2) { $saw_speech = 1; }\n      }\n      $E[$p] == 1 || die;\n      if (! 
$saw_speech) { # There was no speech in this segment, so remove it.\n        $S[$n] = 0;\n        $E[$p] = 0;\n      }\n      if ($p - 1 > $n) {\n        $n = $p - 1; # Avoid some redundant work.\n      }\n    }\n  }\n}\n\nsub print_segments {\n  # We also do some sanity checking here.\n  my @segments = (); # each element will be a string start-time:end-time, in frames.\n\n  $N == @S || die; # check array size.\n  ($N+1) == @E || die; # check array size.\n\n  my $max_end_time = 0;\n\n  for (my $n = 0; $n < $N; $n++) {\n    if ($E[$n] == 1 && $S[$n] != 1) {\n      die \"Ending segment before starting it: n=$n.\\n\";\n    }\n    if ($S[$n]) {\n      my $p;\n      for ($p = $n + 1; $p < $N && $E[$p] != 1; $p++) {\n        $S[$p] && die; # should not start a segment again, before ending it.\n      }\n      $E[$p] == 1 || die;\n      push @segments, \"$n:$p\"; # push the start/end times.\n      $max_end_time = $p;\n      if ($p < $N && $S[$p] == 1) { $n = $p - 1; }\n      else { $n = $p; }\n      # note: we increment $n again before the next loop instance.\n    }\n  }\n\n  if (@segments == 0) {\n    print STDERR \"$0: warning: no segments for recording $recording_id\\n\";\n  }\n\n  # we'll be printing the times out in hundredths of a second (regardless of the\n  # value of $frame_shift), and first need to know how many digits we need (we'll be\n  # printing with \"%05d\" or similar, for zero-padding.\n  $max_end_time_hundredths_second = int(100.0 * $frame_shift * $max_end_time);\n  $num_digits = 1;\n  my $i = 1;\n  while ($i < $max_end_time_hundredths_second) {\n    $i *= 10;\n    $num_digits++;\n  }\n  $format_str = \"%0${num_digits}d\"; # e.g. 
\"%05d\"\n\n  foreach $s (@segments) {\n    my ($start,$end) = split(\":\", $s);\n    ($end > $start) || die;\n    my $start_seconds = sprintf(\"%.2f\", $frame_shift * $start);\n    my $end_seconds = sprintf(\"%.2f\", $frame_shift * $end);\n    my $start_str = sprintf($format_str, $start_seconds * 100);\n    my $end_str = sprintf($format_str, $end_seconds * 100);\n    my $utterance_id = \"${recording_id}${first_separator}${start_str}${second_separator}${end_str}\";\n    print \"$utterance_id $recording_id $start_seconds $end_seconds\\n\"; # <-- Here is where the output happens.\n  }\n}\n\n\n\nwhile (<STDIN>) {\n  @A = split; # split line on whitespace.\n  if (@A <= 1) {\n    print STDERR \"$0: warning: invalid input line $_\";\n    next;\n  }\n  $recording_id = shift @A;  # e.g. sw0-12430\n  for ($n = 0; $n < @A; $n++) {\n    $a = $A[$n];\n    if ($a != 0 && $a != 1 && $a != 2) {\n      die \"Invalid value $a: expecting 0, 1 or 2.  Line is: $_\";\n    }\n    $A[$n] = 0 + $a; # cast to integer, might be a bit faster.\n  }\n  # The array @S will contain 1 if a segment starts there and 0\n  # otherwise.  The array @E will contain 1 if a segment ends there\n  # and 0 otherwise.\n  $N = @A; # number of elements in @A.  Used globally.\n  @S = (0) x $N;         # 0 repeated $N times.\n  @E = (0) x ($N + 1);   # 0 repeated $N+1 times (one more since if the last frame is\n                         # in a segment, the end-marker goes one past that, at index $N.)\n\n  get_initial_segments();       # stage (1) in the comment above.\n  set_silence_proportion();     # stage (2)\n  merge_segments();             # stage (3)\n  split_long_segments();        # stage (4)\n  if ($remove_noise_only_segments eq 'true') {\n    remove_noise_only_segments(); # stage (5)\n  }\n  print_segments();\n}\n\n"
  },
  {
    "path": "egs/utils/show_lattice.sh",
    "content": "#!/usr/bin/env bash\n\nformat=pdf # pdf svg\nmode=save # display save\nlm_scale=0.0\nacoustic_scale=0.0\noutdir=\n#end of config\n\n. utils/parse_options.sh\n\nif [ $# != 3 ]; then\n   echo \"usage: $0 [--mode display|save] [--format pdf|svg] <utt-id> <lattice-ark> <word-list>\"\n   echo \"e.g.:  $0 utt-0001 \\\"test/lat.*.gz\\\" tri1/graph/words.txt\"\n   exit 1;\nfi\n\n. ./path.sh\n\nuttid=$1\nlat=$2\nwords=$3\n\ntmpdir=$outdir; # trap \"rm -r $tmpdir\" EXIT # cleanup\n\ngunzip -c $lat | lattice-to-fst --lm-scale=$lm_scale --acoustic-scale=$acoustic_scale ark:- \"scp,p:echo $uttid $tmpdir/$uttid.fst|\" || exit 1;\n! [ -s $tmpdir/$uttid.fst ] && \\\n  echo \"Failed to extract lattice for utterance $uttid (not present?)\" && exit 1;\nfstdraw --portrait=true --osymbols=$words $tmpdir/$uttid.fst | dot -T${format} > $tmpdir/$uttid.${format}\n\nif [ \"$(uname)\" == \"Darwin\" ]; then\n    doc_open=open\nelif [ \"$(expr substr $(uname -s) 1 5)\" == \"Linux\" ]; then\n    doc_open=xdg-open\nelif [ $mode == \"display\" ] ; then\n        echo \"Can not automaticaly open file on your operating system\"\n        mode=save\nfi\n\n[ $mode == \"display\" ] && $doc_open $tmpdir/$uttid.${format}\n[[ $mode == \"display\" && $? -ne 0 ]] && echo \"Failed to open ${format} format.\" && mode=save\n# [ $mode == \"save\" ] && echo \"Saving to $uttid.${format}\" && cp $tmpdir/$uttid.${format} $outdir\n\nexit 0\n"
  },
  {
    "path": "egs/utils/shuffle_list.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2013  Johns Hopkins University (author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\nif ($ARGV[0] eq \"--srand\") {\n  $n = $ARGV[1];\n  $n =~ m/\\d+/ || die \"Bad argument to --srand option: \\\"$n\\\"\";\n  srand($ARGV[1]);\n  shift;\n  shift;\n} else {\n  srand(0); # Gives inconsistent behavior if we don't seed.\n}\n\nif (@ARGV > 1 || $ARGV[0] =~ m/^-.+/) { # >1 args, or an option we \n  # don't understand.\n  print \"Usage: shuffle_list.pl [--srand N] [input file]  > output\\n\";\n  print \"randomizes the order of lines of input.\\n\";\n  exit(1);\n}\n\n@lines;\nwhile (<>) {\n  push @lines, [ (rand(), $_)] ;\n}\n\n@lines = sort { $a->[0] cmp $b->[0] } @lines;\nforeach $l (@lines) {\n    print $l->[1];\n}\n"
  },
  {
    "path": "egs/utils/spk2utt_to_utt2spk.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\nwhile(<>){ \n    @A = split(\" \", $_);\n    @A > 1 || die \"Invalid line in spk2utt file: $_\";\n    $s = shift @A;\n    foreach $u ( @A ) {\n        print \"$u $s\\n\";\n    }\n}\n\n\n"
  },
  {
    "path": "egs/utils/split_data.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2010-2013 Microsoft Corporation\n#                     Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\nsplit_per_spk=true\nif [ \"$1\" == \"--per-utt\" ]; then\n  split_per_spk=false\n  shift\nfi\n\nif [ $# != 2 ]; then\n  echo \"Usage: $0 [--per-utt] <data-dir> <num-to-split>\"\n  echo \"E.g.: $0 data/train 50\"\n  echo \"It creates its output in e.g. data/train/split50/{1,2,3,...50}, or if the \"\n  echo \"--per-utt option was given, in e.g. data/train/split50utt/{1,2,3,...50}.\"\n  echo \"\"\n  echo \"This script will not split the data-dir if it detects that the output is newer than the input.\"\n  echo \"By default it splits per speaker (so each speaker is in only one split dir),\"\n  echo \"but with the --per-utt option it will ignore the speaker information while splitting.\"\n  exit 1\nfi\n\ndata=$1\nnumsplit=$2\n\nif ! 
[ \"$numsplit\" -gt 0 ]; then\n  echo \"Invalid num-split argument $numsplit\";\n  exit 1;\nfi\n\nif $split_per_spk; then\n  warning_opt=\nelse\n  # suppress warnings from filter_scps.pl about 'some input lines were output\n  # to multiple files'.\n  warning_opt=\"--no-warn\"\nfi\n\nn=0;\nfeats=\"\"\nwavs=\"\"\nutt2spks=\"\"\ntexts=\"\"\n\nnu=`cat $data/utt2spk | wc -l`\nnf=`cat $data/feats.scp 2>/dev/null | wc -l`\nnt=`cat $data/text 2>/dev/null | wc -l` # take it as zero if no such file\nif [ -f $data/feats.scp ] && [ $nu -ne $nf ]; then\n  echo \"** split_data.sh: warning, #lines is (utt2spk,feats.scp) is ($nu,$nf); you can \"\n  echo \"**  use utils/fix_data_dir.sh $data to fix this.\"\nfi\nif [ -f $data/text ] && [ $nu -ne $nt ]; then\n  echo \"** split_data.sh: warning, #lines is (utt2spk,text) is ($nu,$nt); you can \"\n  echo \"** use utils/fix_data_dir.sh to fix this.\"\nfi\n\n\nif $split_per_spk; then\n  utt2spk_opt=\"--utt2spk=$data/utt2spk\"\n  utt=\"\"\nelse\n  utt2spk_opt=\n  utt=\"utt\"\nfi\n\ns1=$data/split${numsplit}${utt}/1\nif [ ! -d $s1 ]; then\n  need_to_split=true\nelse\n  need_to_split=false\n  for f in utt2spk spk2utt spk2warp feats.scp text wav.scp cmvn.scp spk2gender \\\n    vad.scp segments reco2file_and_channel utt2lang; do\n    if [[ -f $data/$f && ( ! -f $s1/$f || $s1/$f -ot $data/$f ) ]]; then\n      need_to_split=true\n    fi\n  done\nfi\n\nif ! $need_to_split; then\n  exit 0;\nfi\n\nutt2spks=$(for n in `seq $numsplit`; do echo $data/split${numsplit}${utt}/$n/utt2spk; done)\n\ndirectories=$(for n in `seq $numsplit`; do echo $data/split${numsplit}${utt}/$n; done)\n\n# if this mkdir fails due to argument-list being too long, iterate.\nif ! mkdir -p $directories >&/dev/null; then\n  for n in `seq $numsplit`; do\n    mkdir -p $data/split${numsplit}${utt}/$n\n  done\nfi\n\n# If lockfile is not installed, just don't lock it.  
It's not a big deal.\nwhich lockfile >&/dev/null && lockfile -l 60 $data/.split_lock\ntrap 'rm -f $data/.split_lock' EXIT HUP INT PIPE TERM\n\nutils/split_scp.pl $utt2spk_opt $data/utt2spk $utt2spks || exit 1\n\nfor n in `seq $numsplit`; do\n  dsn=$data/split${numsplit}${utt}/$n\n  utils/utt2spk_to_spk2utt.pl $dsn/utt2spk > $dsn/spk2utt || exit 1;\ndone\n\nmaybe_wav_scp=\nif [ ! -f $data/segments ]; then\n  maybe_wav_scp=wav.scp  # If there is no segments file, then wav file is\n                         # indexed per utt.\nfi\n\n# split some things that are indexed by utterance.\nfor f in feats.scp text vad.scp utt2lang $maybe_wav_scp utt2dur utt2num_frames; do\n  if [ -f $data/$f ]; then\n    utils/filter_scps.pl JOB=1:$numsplit \\\n      $data/split${numsplit}${utt}/JOB/utt2spk $data/$f $data/split${numsplit}${utt}/JOB/$f || exit 1;\n  fi\ndone\n\n# split some things that are indexed by speaker\nfor f in spk2gender spk2warp cmvn.scp; do\n  if [ -f $data/$f ]; then\n    utils/filter_scps.pl $warning_opt JOB=1:$numsplit \\\n      $data/split${numsplit}${utt}/JOB/spk2utt $data/$f $data/split${numsplit}${utt}/JOB/$f || exit 1;\n  fi\ndone\n\nif [ -f $data/segments ]; then\n  utils/filter_scps.pl JOB=1:$numsplit \\\n     $data/split${numsplit}${utt}/JOB/utt2spk $data/segments $data/split${numsplit}${utt}/JOB/segments || exit 1\n  for n in `seq $numsplit`; do\n    dsn=$data/split${numsplit}${utt}/$n\n    awk '{print $2;}' $dsn/segments | sort | uniq > $dsn/tmp.reco # recording-ids.\n  done\n  if [ -f $data/reco2file_and_channel ]; then\n    utils/filter_scps.pl $warning_opt JOB=1:$numsplit \\\n      $data/split${numsplit}${utt}/JOB/tmp.reco $data/reco2file_and_channel \\\n      $data/split${numsplit}${utt}/JOB/reco2file_and_channel || exit 1\n  fi\n  if [ -f $data/wav.scp ]; then\n    utils/filter_scps.pl $warning_opt JOB=1:$numsplit \\\n      $data/split${numsplit}${utt}/JOB/tmp.reco $data/wav.scp \\\n      $data/split${numsplit}${utt}/JOB/wav.scp || exit 1\n  fi\n  
for f in $data/split${numsplit}${utt}/*/tmp.reco; do rm $f; done\nfi\n\nexit 0\n"
  },
  {
    "path": "egs/utils/split_scp.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2010-2011 Microsoft Corporation\n\n# See ../../COPYING for clarification regarding multiple authors\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n# This program splits up any kind of .scp or archive-type file.\n# If there is no utt2spk option it will work on any text  file and\n# will split it up with an approximately equal number of lines in\n# each but.\n# With the --utt2spk option it will work on anything that has the\n# utterance-id as the first entry on each line; the utt2spk file is\n# of the form \"utterance speaker\" (on each line).\n# It splits it into equal size chunks as far as it can.  If you use the utt2spk\n# option it will make sure these chunks coincide with speaker boundaries.  In\n# this case, if there are more chunks than speakers (and in some other\n# circumstances), some of the resulting chunks will be empty and it will print\n# an error message and exit with nonzero status.\n# You will normally call this like:\n# split_scp.pl scp scp.1 scp.2 scp.3 ...\n# or\n# split_scp.pl --utt2spk=utt2spk scp scp.1 scp.2 scp.3 ...\n# Note that you can use this script to split the utt2spk file itself,\n# e.g. split_scp.pl --utt2spk=utt2spk utt2spk utt2spk.1 utt2spk.2 ...\n\n# You can also call the scripts like:\n# split_scp.pl -j 3 0 scp scp.0\n# [note: with this option, it assumes zero-based indexing of the split parts,\n# i.e. 
the second number must be 0 <= n < num-jobs.]\n\nuse warnings;\n\n$num_jobs = 0;\n$job_id = 0;\n$utt2spk_file = \"\";\n$one_based = 0;\n\nfor ($x = 1; $x <= 3 && @ARGV > 0; $x++) {\n    if ($ARGV[0] eq \"-j\") {\n        shift @ARGV;\n        $num_jobs = shift @ARGV;\n        $job_id = shift @ARGV;\n    }\n    if ($ARGV[0] =~ /--utt2spk=(.+)/) {\n        $utt2spk_file=$1;\n        shift;\n    }\n    if ($ARGV[0] eq '--one-based') {\n        $one_based = 1;\n        shift @ARGV;\n    }\n}\n\nif ($num_jobs != 0 && ($num_jobs < 0 || $job_id - $one_based < 0 ||\n                       $job_id - $one_based >= $num_jobs)) {\n  die \"$0: Invalid job number/index values for '-j $num_jobs $job_id\" .\n      ($one_based ? \" --one-based\" : \"\") . \"'\\n\"\n}\n\n$one_based\n    and $job_id--;\n\nif(($num_jobs == 0 && @ARGV < 2) || ($num_jobs > 0 && (@ARGV < 1 || @ARGV > 2))) {\n    die\n\"Usage: split_scp.pl [--utt2spk=<utt2spk_file>] in.scp out1.scp out2.scp ...\n   or: split_scp.pl -j num-jobs job-id [--one-based] [--utt2spk=<utt2spk_file>] in.scp [out.scp]\n ... 
where 0 <= job-id < num-jobs, or 1 <= job-id <= num-jobs if --one-based.\\n\";\n}\n\n$error = 0;\n$inscp = shift @ARGV;\nif ($num_jobs == 0) { # without -j option\n    @OUTPUTS = @ARGV;\n} else {\n    for ($j = 0; $j < $num_jobs; $j++) {\n        if ($j == $job_id) {\n            if (@ARGV > 0) { push @OUTPUTS, $ARGV[0]; }\n            else { push @OUTPUTS, \"-\"; }\n        } else {\n            push @OUTPUTS, \"/dev/null\";\n        }\n    }\n}\n\nif ($utt2spk_file ne \"\") {  # We have the --utt2spk option...\n    open($u_fh, '<', $utt2spk_file) || die \"$0: Error opening utt2spk file $utt2spk_file: $!\\n\";\n    while(<$u_fh>) {\n        @A = split;\n        @A == 2 || die \"$0: Bad line $_ in utt2spk file $utt2spk_file\\n\";\n        ($u,$s) = @A;\n        $utt2spk{$u} = $s;\n    }\n    close $u_fh;\n    open($i_fh, '<', $inscp) || die \"$0: Error opening input scp file $inscp: $!\\n\";\n    @spkrs = ();\n    while(<$i_fh>) {\n        @A = split;\n        if(@A == 0) { die \"$0: Empty or space-only line in scp file $inscp\\n\"; }\n        $u = $A[0];\n        $s = $utt2spk{$u};\n        defined $s || die \"$0: No utterance $u in utt2spk file $utt2spk_file\\n\";\n        if(!defined $spk_count{$s}) {\n            push @spkrs, $s;\n            $spk_count{$s} = 0;\n            $spk_data{$s} = [];  # ref to new empty array.\n        }\n        $spk_count{$s}++;\n        push @{$spk_data{$s}}, $_;\n    }\n    # Now split as equally as possible ..\n    # First allocate spks to files by allocating an approximately\n    # equal number of speakers.\n    $numspks = @spkrs;  # number of speakers.\n    $numscps = @OUTPUTS; # number of output files.\n    if ($numspks < $numscps) {\n      die \"$0: Refusing to split data because number of speakers $numspks \" .\n          \"is less than the number of output .scp files $numscps\\n\";\n    }\n    for($scpidx = 0; $scpidx < $numscps; $scpidx++) {\n        $scparray[$scpidx] = []; # [] is array reference.\n    }\n    for 
($spkidx = 0; $spkidx < $numspks; $spkidx++) {\n        $scpidx = int(($spkidx*$numscps) / $numspks);\n        $spk = $spkrs[$spkidx];\n        push @{$scparray[$scpidx]}, $spk;\n        $scpcount[$scpidx] += $spk_count{$spk};\n    }\n\n    # Now will try to reassign beginning + ending speakers\n    # to different scp's and see if it gets more balanced.\n    # Suppose objf we're minimizing is sum_i (num utts in scp[i] - average)^2.\n    # We can show that if considering changing just 2 scp's, we minimize\n    # this by minimizing the squared difference in sizes.  This is\n    # equivalent to minimizing the absolute difference in sizes.  This\n    # shows this method is bound to converge.\n\n    $changed = 1;\n    while($changed) {\n        $changed = 0;\n        for($scpidx = 0; $scpidx < $numscps; $scpidx++) {\n            # First try to reassign ending spk of this scp.\n            if($scpidx < $numscps-1) {\n                $sz = @{$scparray[$scpidx]};\n                if($sz > 0) {\n                    $spk = $scparray[$scpidx]->[$sz-1];\n                    $count = $spk_count{$spk};\n                    $nutt1 = $scpcount[$scpidx];\n                    $nutt2 = $scpcount[$scpidx+1];\n                    if( abs( ($nutt2+$count) - ($nutt1-$count))\n                        < abs($nutt2 - $nutt1))  { # Would decrease\n                        # size-diff by reassigning spk...\n                        $scpcount[$scpidx+1] += $count;\n                        $scpcount[$scpidx] -= $count;\n                        pop @{$scparray[$scpidx]};\n                        unshift @{$scparray[$scpidx+1]}, $spk;\n                        $changed = 1;\n                    }\n                }\n            }\n            if($scpidx > 0 && @{$scparray[$scpidx]} > 0) {\n                $spk = $scparray[$scpidx]->[0];\n                $count = $spk_count{$spk};\n                $nutt1 = $scpcount[$scpidx-1];\n                $nutt2 = $scpcount[$scpidx];\n                if( abs( 
($nutt2-$count) - ($nutt1+$count))\n                    < abs($nutt2 - $nutt1))  { # Would decrease\n                    # size-diff by reassigning spk...\n                    $scpcount[$scpidx-1] += $count;\n                    $scpcount[$scpidx] -= $count;\n                    shift @{$scparray[$scpidx]};\n                    push @{$scparray[$scpidx-1]}, $spk;\n                    $changed = 1;\n                }\n            }\n        }\n    }\n    # Now print out the files...\n    for($scpidx = 0; $scpidx < $numscps; $scpidx++) {\n        $scpfile = $OUTPUTS[$scpidx];\n        ($scpfile ne '-' ? open($f_fh, '>', $scpfile)\n                         : open($f_fh, '>&', \\*STDOUT)) ||\n            die \"$0: Could not open scp file $scpfile for writing: $!\\n\";\n        $count = 0;\n        if(@{$scparray[$scpidx]} == 0) {\n            print STDERR \"$0: error: split_scp.pl producing empty .scp file \" .\n                         \"$scpfile (too many splits and too few speakers?)\\n\";\n            $error = 1;\n        } else {\n            foreach $spk ( @{$scparray[$scpidx]} ) {\n                print $f_fh @{$spk_data{$spk}};\n                $count += $spk_count{$spk};\n            }\n            $count == $scpcount[$scpidx] || die \"Count mismatch [code error]\";\n        }\n        close($f_fh);\n    }\n} else {\n   # This block is the \"normal\" case where there is no --utt2spk\n   # option and we just break into equal size chunks.\n\n    open($i_fh, '<', $inscp) || die \"$0: Error opening input scp file $inscp: $!\\n\";\n\n    $numscps = @OUTPUTS;  # size of array.\n    @F = ();\n    while(<$i_fh>) {\n        push @F, $_;\n    }\n    $numlines = @F;\n    if($numlines == 0) {\n        print STDERR \"$0: error: empty input scp file $inscp\\n\";\n        $error = 1;\n    }\n    $linesperscp = int( $numlines / $numscps); # the \"whole part\"..\n    $linesperscp >= 1 || die \"$0: You are splitting into too many pieces! 
[reduce \\$nj ($numscps) to be smaller than the number of lines ($numlines) in $inscp]\\n\";\n    $remainder = $numlines - ($linesperscp * $numscps);\n    ($remainder >= 0 && $remainder < $numlines) || die \"bad remainder $remainder\";\n    # [just doing int() rounds down].\n    $n = 0;\n    for($scpidx = 0; $scpidx < @OUTPUTS; $scpidx++) {\n        $scpfile = $OUTPUTS[$scpidx];\n        ($scpfile ne '-' ? open($o_fh, '>', $scpfile)\n                         : open($o_fh, '>&', \\*STDOUT)) ||\n            die \"$0: Could not open scp file $scpfile for writing: $!\\n\";\n        for($k = 0; $k < $linesperscp + ($scpidx < $remainder ? 1 : 0); $k++) {\n            print $o_fh $F[$n++];\n        }\n        close($o_fh) || die \"$0: Error closing scp file $scpfile: $!\\n\";\n    }\n    $n == $numlines || die \"$n != $numlines [code error]\";\n}\n\nexit ($error);\n"
  },
  {
    "path": "egs/utils/ssh.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n\nuse Cwd;\nuse File::Basename;\n\n# This program is like run.pl except rather than just running on a local\n# machine, it can be configured to run on remote machines via ssh.\n# It requires that you have set up passwordless access to those machines,\n# and that Kaldi is running from a location that is accessible via the\n# same path on those machines (presumably via an NFS mount).\n#\n# It looks for a file .queue/machines that should have, on each line, the name\n# of a machine that you can ssh to (which may include this machine).  It doesn't\n# have to be a fully qualified name.\n#\n# Later we may extend this so that on each line of .queue/machines you\n# can specify various resources that each machine has, such as how\n# many slots and how much memory, and make it wait if machines are \n# busy.  But for now it simply ssh's to a machine from those in the list.\n\n# The command-line interface of this program is the same as run.pl;\n# see run.pl for more information about the usage.\n\n\n@ARGV < 2 && die \"usage: ssh.pl log-file command-line arguments...\";\n\n$jobstart = 1;\n$jobend = 1;\n$qsub_opts=\"\"; # These will be ignored.\n\n# First parse an option like JOB=1:4, and any\n# options that would normally be given to \n# ssh.pl, which we will just discard.\n\nif (@ARGV > 0) {\n  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) { # parse any options\n    # that would normally go to qsub, but which will be ignored here.\n    $switch = shift @ARGV;\n    if ($switch eq \"-V\") {\n      $qsub_opts .= \"-V \";\n    } else {\n      $option = shift @ARGV;\n      if ($switch eq \"-sync\" && $option =~ m/^[yY]/) {\n        $qsub_opts .= \"-sync \"; # Note: in the\n        # corresponding code in queue.pl it says instead, just \"$sync = 1;\".\n      }\n      $qsub_opts .= \"$switch $option \";\n      if ($switch eq \"-pe\") { # e.g. 
-pe smp 5\n        $option2 = shift @ARGV;\n        $qsub_opts .= \"$option2 \";\n      }\n    }\n  }\n  if ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+):(\\d+)$/) { # e.g. JOB=1:10\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $3;\n    shift;\n    if ($jobstart > $jobend) {\n      die \"run.pl: invalid job range $ARGV[0]\";\n    }\n    if ($jobstart <= 0) {\n      die \"run.pl: invalid job range $ARGV[0], start must be strictly positive (this is required for GridEngine compatibility)\";\n    }\n  } elsif ($ARGV[0] =~ m/^([\\w_][\\w\\d_]*)+=(\\d+)$/) { # e.g. JOB=1.\n    $jobname = $1;\n    $jobstart = $2;\n    $jobend = $2;\n    shift;\n  } elsif ($ARGV[0] =~ m/.+\\=.*\\:.*$/) {\n    print STDERR \"Warning: suspicious first argument to run.pl: $ARGV[0]\\n\";\n  }\n}\n\nif ($qsub_opts ne \"\") {\n  print STDERR \"Warning: ssh.pl ignoring options \\\"$qsub_opts\\\"\\n\";\n}\n\n{ # Read .queue/machines\n  if (!open(Q, \"<.queue/machines\")) {\n    print STDERR \"ssh.pl: expected the file .queue/machines to exist.\\n\";\n    exit(1);\n  }\n  @machines = ();\n  while (<Q>) {\n    chop;\n    if ($_ ne \"\") {\n      @A = split;\n      if (@A != 1) {\n        die \"ssh.pl: bad line '$_' in .queue/machines.\";\n      }\n      if ($A[0] !~ m/^[a-z0-9\\.\\-]+/) {\n        die \"ssh.pl: invalid machine name '$A[0]'\";\n      }\n      push @machines, $A[0];\n    }\n  }\n  if (@machines == 0) {   die \"ssh.pl: no machines listed in .queue/machines\";  }\n}\n\n$logfile = shift @ARGV;\n\nif (defined $jobname && $logfile !~ m/$jobname/ &&\n    $jobend > $jobstart) {\n  print STDERR \"ssh.pl: you are trying to run a parallel job but \"\n    . \"you are putting the output into just one log file ($logfile)\\n\";\n  exit(1);\n}\n\n{\n  $offset = 0;  # $offset will be an offset added to any index from the job-id\n                # specified if the user does JOB=1:10.  
The main point of this is\n                # that there are instances where a script will manually submit a\n                # number of jobs to the queue, e.g. with log files foo.1.log,\n                # foo.2.log and so on, and we don't want all of these to go\n                # to the first machine.\n  # Split on a literal dot; split(\".\", ...) would treat \".\" as a regex that matches any character.\n  @A = split(/\\./, basename($logfile));\n  # if $logfile looks like foo.9.log, add 9 to $offset.\n  foreach $a (@A) {  if ($a =~ m/^\\d+$/) { $offset += $a; } }\n}\n\n$cmd = \"\";\n\nforeach $x (@ARGV) { \n    if ($x =~ m/^\\S+$/) { $cmd .=  $x . \" \"; }\n    elsif ($x =~ m:\\\":) { $cmd .= \"'$x' \"; }\n    else { $cmd .= \"\\\"$x\\\" \"; } \n}\n\n\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $childpid = fork();\n  if (!defined $childpid) { die \"Error forking in ssh.pl (writing to $logfile)\"; }\n  if ($childpid == 0) {\n    # We're in the child... this branch executes the job and returns (possibly\n    # with an error status).\n    if (defined $jobname) {\n      $cmd =~ s/$jobname/$jobid/g;\n      $logfile =~ s/$jobname/$jobid/g;\n    }\n    { # work out the machine to ssh to.\n      $local_offset = $offset + $jobid - 1;  # subtract 1 since jobs never start\n                                             # from 0; we'd like the first job\n                                             # to normally run on the first\n                                             # machine.\n      $num_machines = scalar @machines;\n      # in the next line, the \"+ $num_machines\" is in case $local_offset is\n      # negative, to ensure the modulus is calculated in the mathematical way, not\n      # in the C way where (negative number % positive number) is negative.\n      $machines_index = ($local_offset + $num_machines) % $num_machines;\n      $machine = $machines[$machines_index];\n    }\n    if (!open(S, \"|ssh $machine bash\")) {\n      print STDERR \"ssh.pl failed to ssh to $machine\";\n      exit(1);  # exits from the forked process within ssh.pl.\n    }\n    
$cwd = getcwd();\n    $logdir = dirname($logfile);\n    # Below, we're printing into ssh which has opened a bash session; these are\n    # bash commands.\n    print S \"set -e\\n\";  # if any of the later commands fails, we want it to exit.\n    print S \"cd $cwd\\n\";\n    print S \". ./path.sh\\n\";\n    print S \"mkdir -p $logdir\\n\";\n    print S \"time1=\\`date +\\\"%s\\\"\\`\\n\";\n    print S \"( echo '#' Running on \\`hostname\\`\\n\";\n    print S \"  echo '#' Started at \\`date\\`\\n\";\n    print S \"  echo -n '# '; cat <<EOF\\n\";\n    print S \"$cmd\\n\";\n    print S \"EOF\\n\";\n    print S \") >$logfile\\n\";\n    print S \"set +e\\n\";  # we don't want bash to exit if the next line fails.\n    # in the next line, || true means allow this one to fail and not have bash exit immediately.\n    print S \" ( $cmd ) 2>>$logfile >>$logfile\\n\"; \n    print S \"ret=\\$?\\n\";\n    print S \"set -e\\n\"; # back into mode where it will exit on error.\n    print S \"time2=\\`date +\\\"%s\\\"\\`\\n\";\n    print S \"echo '#' Accounting: time=\\$((\\$time2-\\$time1)) threads=1 >>$logfile\\n\";\n    print S \"echo '#' Finished at \\`date\\` with status \\$ret >>$logfile\\n\";\n    print S \"exit \\$ret\";  # return with the status the command exited with.\n    $ret = close(S);\n    $ssh_return_status = $?;\n    # see http://perldoc.perl.org/functions/close.html for explanation of return\n    # status of close() and the variables it sets.\n    if (! $ret && $! != 0) { die \"ssh.pl: unexpected problem ssh'ing to machine $machine\"; }\n    if ($ssh_return_status != 0) { exit(1); } # exit with error status from this forked process.\n    else { exit(0); } # else exit with non-error status.\n  }\n}\n\n$ret = 0;\n$numfail = 0;\nfor ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {\n  $r = wait();\n  if ($r == -1) { die \"Error waiting for child process\"; } # should never happen.\n  if ($? 
!= 0) { $numfail++; $ret = 1; } # The child process failed.\n}\n\nif ($ret != 0) {\n  $njobs = $jobend - $jobstart + 1;\n  if ($njobs == 1) { \n    if (defined $jobname) {\n      $logfile =~ s/$jobname/$jobstart/; # only one numbered job, so replace name with\n                                         # that job.\n    }\n    print STDERR \"ssh.pl: job failed, log is in $logfile\\n\";\n    if ($logfile =~ m/JOB/) {\n      print STDERR \"ssh.pl: probably you forgot to put JOB=1:\\$nj in your script.\\n\";\n    }\n  }\n  else {\n    $logfile =~ s/$jobname/*/g;\n    print STDERR \"ssh.pl: $numfail / $njobs failed, log is in $logfile\\n\";\n  }\n}\n\n\nexit ($ret);\n"
  },
  {
    "path": "egs/utils/subset_data_dir.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2010-2011  Microsoft Corporation\n#           2012-2013  Johns Hopkins University (Author: Daniel Povey)\n# Apache 2.0\n\n\n# This script operates on a data directory, such as in data/train/.\n# See http://kaldi-asr.org/doc/data_prep.html#data_prep_data\n# for what these directories contain.\n\n# This script creates a subset of that data, consisting of some specified\n# number of utterances.  (The selected utterances are distributed evenly\n# throughout the file, by the program ./subset_scp.pl).\n\n# There are six options, none compatible with any other.\n\n# If you give the --per-spk option, it will attempt to select the supplied\n# number of utterances for each speaker (typically you would supply a much\n# smaller number in this case).\n\n# If you give the --speakers option, it selects a subset of n randomly\n# selected speakers.\n\n# If you give the --shortest option, it will give you the n shortest utterances.\n\n# If you give the --first option, it will just give you the n first utterances.\n\n# If you give the --last option, it will just give you the n last utterances.\n\n# If you give the --spk-list or --utt-list option, it reads the\n# speakers/utterances to keep from <speaker-list-file>/<utt-list-file>\" (note,\n# in this case there is no <num-utt> positional parameter; see usage message.)\n\n\nshortest=false\nperspk=false\nspeakers=false\nfirst_opt=\nspk_list=\nutt_list=\n\nexpect_args=3\ncase $1 in\n  --first|--last) first_opt=$1; shift ;;\n  --per-spk)  perspk=true; shift ;;\n  --shortest) shortest=true; shift ;;\n  --speakers) speakers=true; shift ;;\n  --spk-list) shift; spk_list=$1; shift; expect_args=2 ;;\n  --utt-list) shift; utt_list=$1; shift; expect_args=2 ;;\n  --*) echo \"$0: invalid option '$1'\"; exit 1\nesac\n\nif [ $# != $expect_args ]; then\n  echo \"Usage:\"\n  echo \"  subset_data_dir.sh [--speakers|--shortest|--first|--last|--per-spk] <srcdir> <num-utt> <destdir>\"\n  echo \"  
subset_data_dir.sh [--spk-list <speaker-list-file>] <srcdir> <destdir>\"\n  echo \"  subset_data_dir.sh [--utt-list <utt-list-file>] <srcdir> <destdir>\"\n  echo \"By default, randomly selects <num-utt> utterances from the data directory.\"\n  echo \"With --speakers, randomly selects enough speakers that we have <num-utt> utterances\"\n  echo \"With --per-spk, selects <num-utt> utterances per speaker, if available.\"\n  echo \"With --first, selects the first <num-utt> utterances\"\n  echo \"With --last, selects the last <num-utt> utterances\"\n  echo \"With --shortest, selects the shortest <num-utt> utterances.\"\n  echo \"With --spk-list, reads the speakers to keep from <speaker-list-file>\"\n  echo \"With --utt-list, reads the utterances to keep from <utt-list-file>\"\n  exit 1;\nfi\n\nsrcdir=$1\nif [[ $spk_list || $utt_list ]]; then\n  numutt=\n  destdir=$2\nelse\n  numutt=$2\n  destdir=$3\nfi\n\nexport LC_ALL=C\n\nif [ ! -f $srcdir/utt2spk ]; then\n  echo \"$0: no such file $srcdir/utt2spk\"\n  exit 1\nfi\n\nif [[ $numutt && $numutt -gt $(wc -l <$srcdir/utt2spk) ]]; then\n  echo \"$0: cannot subset to more utterances than you originally had.\"\n  exit 1\nfi\n\nif $shortest && [ ! 
-f $srcdir/feats.scp ]; then\n  echo \"$0: you selected --shortest but no feats.scp exist.\"\n  exit 1\nfi\n\nmkdir -p $destdir || exit 1\n\nif [[ $spk_list ]]; then\n  utils/filter_scp.pl \"$spk_list\" $srcdir/spk2utt > $destdir/spk2utt || exit 1;\n  utils/spk2utt_to_utt2spk.pl < $destdir/spk2utt > $destdir/utt2spk || exit 1;\nelif [[ $utt_list ]]; then\n  utils/filter_scp.pl \"$utt_list\" $srcdir/utt2spk > $destdir/utt2spk || exit 1;\n  utils/utt2spk_to_spk2utt.pl < $destdir/utt2spk > $destdir/spk2utt || exit 1;\nelif $speakers; then\n  utils/shuffle_list.pl < $srcdir/spk2utt |\n    awk -v numutt=$numutt '{ if (tot < numutt){ print; } tot += (NF-1); }' |\n    sort > $destdir/spk2utt\n  utils/spk2utt_to_utt2spk.pl < $destdir/spk2utt > $destdir/utt2spk\nelif $perspk; then\n  awk '{ n='$numutt'; printf(\"%s \",$1);\n         skip=1; while(n*(skip+1) <= NF-1) { skip++; }\n         for(x=2; x<=NF && x <= (n*skip+1); x += skip) { printf(\"%s \", $x); }\n         printf(\"\\n\"); }' <$srcdir/spk2utt >$destdir/spk2utt\n  utils/spk2utt_to_utt2spk.pl < $destdir/spk2utt > $destdir/utt2spk\nelse\n  if $shortest; then\n    # Select $numutt shortest utterances.\n    . ./path.sh\n    feat-to-len scp:$srcdir/feats.scp ark,t:$destdir/tmp.len || exit 1;\n    sort -n -k2 $destdir/tmp.len |\n      awk '{print $1}' |\n      head -$numutt >$destdir/tmp.uttlist\n    utils/filter_scp.pl $destdir/tmp.uttlist $srcdir/utt2spk >$destdir/utt2spk\n    rm $destdir/tmp.uttlist $destdir/tmp.len\n  else\n    # Select $numutt random utterances.\n    utils/subset_scp.pl $first_opt $numutt $srcdir/utt2spk > $destdir/utt2spk || exit 1;\n  fi\n  utils/utt2spk_to_spk2utt.pl < $destdir/utt2spk > $destdir/spk2utt\nfi\n\n# Perform filtering. 
utt2spk and spk2utt files already exist by this point.\n# Filter by utterance.\n[ -f $srcdir/feats.scp ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/feats.scp >$destdir/feats.scp\n[ -f $srcdir/vad.scp ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/vad.scp >$destdir/vad.scp\n[ -f $srcdir/utt2lang ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/utt2lang >$destdir/utt2lang\n[ -f $srcdir/utt2dur ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/utt2dur >$destdir/utt2dur\n[ -f $srcdir/utt2num_frames ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/utt2num_frames >$destdir/utt2num_frames\n[ -f $srcdir/utt2uniq ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/utt2uniq >$destdir/utt2uniq\n[ -f $srcdir/wav.scp ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/wav.scp >$destdir/wav.scp\n[ -f $srcdir/utt2warp ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/utt2warp >$destdir/utt2warp\n[ -f $srcdir/text ] &&\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/text >$destdir/text\n\n# Filter by speaker.\n[ -f $srcdir/spk2warp ] &&\n  utils/filter_scp.pl $destdir/spk2utt <$srcdir/spk2warp >$destdir/spk2warp\n[ -f $srcdir/spk2gender ] &&\n  utils/filter_scp.pl $destdir/spk2utt <$srcdir/spk2gender >$destdir/spk2gender\n[ -f $srcdir/cmvn.scp ] &&\n  utils/filter_scp.pl $destdir/spk2utt <$srcdir/cmvn.scp >$destdir/cmvn.scp\n\n# Filter by recording-id.\nif [ -f $srcdir/segments ]; then\n  utils/filter_scp.pl $destdir/utt2spk <$srcdir/segments >$destdir/segments\n  # Recording-ids are in segments.\n  awk '{print $2}' $destdir/segments | sort | uniq >$destdir/reco\n  # The next line overrides the command above for wav.scp, which would be incorrect.\n  [ -f $srcdir/wav.scp ] &&\n    utils/filter_scp.pl $destdir/reco <$srcdir/wav.scp >$destdir/wav.scp\nelse\n  # No segments; recording-ids are in wav.scp.\n  awk '{print $1}' $destdir/wav.scp | sort | uniq >$destdir/reco\nfi\n\n[ -f $srcdir/reco2file_and_channel ] &&\n  utils/filter_scp.pl 
$destdir/reco <$srcdir/reco2file_and_channel >$destdir/reco2file_and_channel\n[ -f $srcdir/reco2dur ] &&\n  utils/filter_scp.pl $destdir/reco <$srcdir/reco2dur >$destdir/reco2dur\n\n# Filter the STM file for proper sclite scoring.\n# Copy over the comments from STM file.\n[ -f $srcdir/stm ] &&\n  (grep \"^;;\" $srcdir/stm\n   utils/filter_scp.pl $destdir/reco $srcdir/stm) >$destdir/stm\n\nrm $destdir/reco\n\n# Copy frame_shift if present.\n[ -f $srcdir/frame_shift ] && cp $srcdir/frame_shift $destdir\n\nsrcutts=$(wc -l <$srcdir/utt2spk)\ndestutts=$(wc -l <$destdir/utt2spk)\necho \"$0: reducing #utt from $srcutts to $destutts\"\nexit 0\n"
  },
  {
    "path": "egs/utils/subset_scp.pl",
    "content": "#!/usr/bin/env perl\nuse warnings; #sed replacement for -w perl parameter\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This program selects a subset of N elements in the scp.\n\n# By default, it selects them evenly from throughout the scp, in order to avoid\n# selecting too many from the same speaker.  It prints them on the standard\n# output.\n# With the option --first, it just selects the N first utterances.\n# With the option --last, it just selects the N last utterances.\n\n# Last modified by JHU & HKUST @2013\n\n\n$quiet = 0;\n$first = 0;\n$last = 0;\n\nif (@ARGV > 0 && $ARGV[0] eq \"--quiet\") {\n  shift;\n  $quiet = 1;\n}\nif (@ARGV > 0 && $ARGV[0] eq \"--first\") {\n  shift;\n  $first = 1;\n}\nif (@ARGV > 0 && $ARGV[0] eq \"--last\") {\n  shift;\n  $last = 1;\n}\n\nif(@ARGV < 2 ) {\n    die \"Usage: subset_scp.pl [--quiet][--first|--last] N in.scp\\n\" .\n        \" --quiet  causes it to not die if N < num lines in scp.\\n\" .\n        \" --first and --last make it equivalent to head or tail.\\n\" .\n        \"See also: filter_scp.pl\\n\";\n}\n\n$N = shift @ARGV;\nif($N == 0) {\n    die \"First command-line parameter to subset_scp.pl must be an integer, got \\\"$N\\\"\";\n}\n$inscp = shift @ARGV;\nopen(I, \"<$inscp\") || die \"Opening input scp file $inscp\";\n\n@F = ();\nwhile(<I>) {\n    push @F, $_;\n}\n$numlines = @F;\nif($N > 
$numlines) {\n  if ($quiet) {\n    $N = $numlines;\n  } else {\n    die \"You requested from subset_scp.pl more elements than available: $N > $numlines\";\n  }\n}\n\nsub select_n {\n  my ($start,$end,$num_needed) = @_;\n  my $diff = $end - $start;\n  if ($num_needed > $diff) {\n    die \"select_n: code error\";\n  }\n  if ($diff == 1 ) {\n    if ($num_needed  > 0) {\n      print $F[$start];\n    }\n  } else {\n    my $halfdiff = int($diff/2);\n    my $halfneeded = int($num_needed/2);\n    select_n($start, $start+$halfdiff, $halfneeded);\n    select_n($start+$halfdiff, $end, $num_needed - $halfneeded);\n  }\n}\n\nif ( ! $first && ! $last) {\n  if ($N > 0) {\n    select_n(0, $numlines, $N);\n  }\n} else {\n  if ($first) { # --first option: same as head.\n    for ($n = 0; $n < $N; $n++) {\n      print $F[$n];\n    }\n  } else { # --last option: same as tail.\n    for ($n = @F - $N; $n < @F; $n++) {\n      print $F[$n];\n    }\n  }\n}\n"
  },
  {
    "path": "egs/utils/subword/prepare_lang_subword.sh",
    "content": "#!/usr/bin/env bash\n# Copyright 2012-2013  Johns Hopkins University (Author: Daniel Povey);\n#                      Arnab Ghoshal\n#                2014  Guoguo Chen\n#                2015  Hainan Xu\n#                2016  FAU Erlangen (Author: Axel Horndasch)\n#                2019  Dongji Gao\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# This script prepares a directory (for subword) such as data/lang_subword/, in the standard format,\n# given a source directory containing a subword dictionary lexicon.txt in a form like:\n# subword phone1 phone2 ... phoneN\n# per line (alternate prons would be separate lines), or a dictionary with probabilities\n# called lexiconp.txt in a form:\n# subword pron-prob phone1 phone2 ... phoneN\n# (with 0.0 < pron-prob <= 1.0); note: if lexiconp.txt exists, we use it even if\n# lexicon.txt exists.\n# and also files silence_phones.txt, nonsilence_phones.txt, optional_silence.txt\n# and extra_questions.txt\n# Here, silence_phones.txt and nonsilence_phones.txt are lists of silence and\n# non-silence phones respectively (where silence includes various kinds of\n# noise, laugh, cough, filled pauses etc., and nonsilence phones includes the\n# \"real\" phones.)\n# In each line of those files is a list of phones, and the phones on each line\n# are assumed to correspond to the same \"base phone\", i.e. 
they will be\n# different stress or tone variations of the same basic phone.\n# The file \"optional_silence.txt\" contains just a single phone (typically SIL)\n# which is used for optional silence in the lexicon.\n# extra_questions.txt might be empty; typically will consist of lists of phones,\n# all members of each list with the same stress or tone; and also possibly a\n# list for the silence phones.  This will augment the automatically generated\n# questions (note: the automatically generated ones will treat all the\n# stress/tone versions of a phone the same, so will not \"get to ask\" about\n# stress or tone).\n#\n\n# This script adds word-position-dependent phones and constructs a host of other\n# derived files, that go in data/lang_subword/.\n\n# Currently it only support the most basic functions.\n# Begin configuration section.\nnum_sil_states=5\nnum_nonsil_states=3\nposition_dependent_phones=true\n# position_dependent_phones is false also when position dependent phones and word_boundary.txt\n# have been generated by another source\nshare_silence_phones=false  # if true, then share pdfs of different silence\n                            # phones together.\nsil_prob=0.5\nphone_symbol_table=             # if set, use a specified phones.txt file\nnum_extra_phone_disambig_syms=1 # Standard one phone disambiguation symbol is used for optional silence.\n                                # Increasing this number does not harm, but is only useful if you later\n                                # want to introduce this labels to L_disambig.fst\nseparator=\"@@\"   # Separator is a suffix or prefix of subword indicating the position of this subword in word.\n                 # By default, subword which is not at the end of word would have separator as suffix.\n                 # For example: international -> inter@@ nation@@ al\n\n# end configuration sections\n\necho \"$0 $@\"  # Print the command line for logging\n\n. 
utils/parse_options.sh\n\nif [ $# -ne 4 ]; then\n  echo \"Usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>\"\n  echo \"e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang\"\n  echo \"<dict-src-dir> should contain the following files:\"\n  echo \" extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt\"\n  echo \"See http://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating for more info.\"\n  echo \"options: \"\n  echo \"<dict-src-dir> may also, for the grammar-decoding case (see http://kaldi-asr.org/doc/grammar.html)\"\n  echo \"contain a file nonterminals.txt containing symbols like #nonterm:contact_list, one per line.\"\n  echo \"     --num-sil-states <number of states>             # default: 5, #states in silence models.\"\n  echo \"     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.\"\n  echo \"     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I\"\n  echo \"                                                     # markers on phones to indicate word-internal positions. \"\n  echo \"     --share-silence-phones (true|false)             # default: false; if true, share pdfs of \"\n  echo \"                                                     # all silence phones. \"\n  echo \"     --sil-prob <probability of silence>             # default: 0.5 [must have 0 <= silprob < 1]\"\n  echo \"     --separator <separator>                         # default: @@\"\n  exit 1;\nfi\n\nsrcdir=$1\noov_word=$2\ntmpdir=$3\ndir=$4\nmkdir -p $dir $tmpdir $dir/phones\n\nsilprob=false\n[ -f $srcdir/lexiconp_silprob.txt ] && echo \"$0: Currently we do not support word-dependent silence probability.\" && exit 1;\n\nif [ -f $srcdir/nonterminals.txt ]; then\n  echo \"$0: Currently we do not support nonterminals\" && exit 1;\nelse\n  grammar_opts=\nfi\n\n[ -f path.sh ] && . 
./path.sh\n\n# Validate dict directory\n! utils/validate_dict_dir.pl $srcdir && \\\n  echo \"*Error validating directory $srcdir*\" && exit 1;\n\nif [[ ! -f $srcdir/lexicon.txt ]]; then\n  echo \"**Creating $srcdir/lexicon.txt from $srcdir/lexiconp.txt\"\n  perl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' < $srcdir/lexiconp.txt > $srcdir/lexicon.txt || exit 1;\nfi\nif [[ ! -f $srcdir/lexiconp.txt ]]; then\n  echo \"**Creating $srcdir/lexiconp.txt from $srcdir/lexicon.txt\"\n  perl -ape 's/(\\S+\\s+)(.+)/${1}1.0\\t$2/;' < $srcdir/lexicon.txt > $srcdir/lexiconp.txt || exit 1;\nfi\n\n# Currently The lexicon in dict directory have to be a subword lexicon.\n# If the lexicon is for word and is not phonemic, we can not get a subword lexicon without knowing the alignment.\n! grep -q $separator $srcdir/lexiconp.txt && \\\necho \"$0: Warning, this lexicon contains no separator \\\"$separator\\\" and may not be a subword lexicon.\" && exit 1;\n\n# Write the separator into file for future use.\necho $separator > $dir/subword_separator.txt\n\nif ! utils/validate_dict_dir.pl $srcdir >&/dev/null; then\n  utils/validate_dict_dir.pl $srcdir  # show the output.\n  echo \"Validation failed (second time)\"\n  exit 1;\nfi\n\n# phones.txt file provided, we will do some sanity check here.\nif [ ! -z $phone_symbol_table ]; then\n  # Checks if we have position dependent phones\n  n1=`cat $phone_symbol_table | grep -v -E \"^#[0-9]+$\" | cut -d' ' -f1 | sort -u | wc -l`\n  n2=`cat $phone_symbol_table | grep -v -E \"^#[0-9]+$\" | cut -d' ' -f1 | sed 's/_[BIES]$//g' | sort -u | wc -l`\n  $position_dependent_phones && [ $n1 -eq $n2 ] &&\\\n    echo \"$0: Position dependent phones requested, but not in provided phone symbols\" && exit 1;\n  ! 
$position_dependent_phones && [ $n1 -ne $n2 ] &&\\\n    echo \"$0: Position dependent phones not requested, but appear in the provided phones.txt\" && exit 1;\n\n  # Checks if the phone sets match.\n  cat $srcdir/{,non}silence_phones.txt | awk -v f=$phone_symbol_table '\n  BEGIN { while ((getline < f) > 0) { sub(/_[BEIS]$/, \"\", $1); phones[$1] = 1; }}\n  { for (x = 1; x <= NF; ++x) { if (!($x in phones)) {\n      print \"Phone appears in the lexicon but not in the provided phones.txt: \"$x; exit 1; }}}' || exit 1;\nfi\n\nif $position_dependent_phones; then\n  # Create $tmpdir/lexiconp.txt from $srcdir/lexiconp.txt (or\n  # $tmpdir/lexiconp_silprob.txt from $srcdir/lexiconp_silprob.txt) by\n  # adding the markers _B, _E, _S, _I depending on word position.\n  # In this recipe, these markers apply to silence also.\n  # Do this starting from lexiconp.txt only.\n  if \"$silprob\"; then\n    echo \"$0: Currently we do not support word-dependent silence probability\" && exit 1;\n  else\n    utils/lang/make_position_dependent_subword_lexicon.py $srcdir/lexiconp.txt > $tmpdir/lexiconp.txt || exit 1;\n  fi\n\n  # create $tmpdir/phone_map.txt\n  # this has the format (on each line)\n  # <original phone> <version 1 of original phone> <version 2> ...\n  # where the versions depend on the position of the phone within a word.\n  # For instance, we'd have:\n  # AA AA_B AA_E AA_I AA_S\n  # for (B)egin, (E)nd, (I)nternal and (S)ingleton\n  # and in the case of silence\n  # SIL SIL SIL_B SIL_E SIL_I SIL_S\n  # [because SIL on its own is one of the variants; this is for when it doesn't\n  #  occur inside a word but as an option in the lexicon.]\n\n  # This phone map expands the phone lists into all the word-position-dependent\n  # versions of the phone lists.\n  cat <(set -f; for x in `cat $srcdir/silence_phones.txt`; do for y in \"\" \"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    <(set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do for y in 
\"\" \"_B\" \"_E\" \"_I\" \"_S\"; do echo -n \"$x$y \"; done; echo; done) \\\n    > $tmpdir/phone_map.txt\nelse\n  if \"$silprob\"; then\n    echo \"$0: Currently we do not support word-dependent silence probability\" && exit 1;\n  else\n    cp $srcdir/lexiconp.txt $tmpdir/lexiconp.txt\n  fi\n\n  cat $srcdir/silence_phones.txt $srcdir/nonsilence_phones.txt | \\\n    awk '{for(n=1;n<=NF;n++) print $n; }' > $tmpdir/phones\n  paste -d' ' $tmpdir/phones $tmpdir/phones > $tmpdir/phone_map.txt\nfi\n\nmkdir -p $dir/phones  # various sets of phones...\n\n# Sets of phones for use in clustering, and making monophone systems.\n\nif $share_silence_phones; then\n  # build a roots file that will force all the silence phones to share the\n  # same pdf's. [three distinct states, only the transitions will differ.]\n  # 'shared'/'not-shared' means, do we share the 3 states of the HMM\n  # in the same tree-root?\n  # Sharing across models(phones) is achieved by writing several phones\n  # into one line of roots.txt (shared/not-shared doesn't affect this).\n  # 'not-shared not-split' means we have separate tree roots for the 3 states,\n  # but we never split the tree so they remain stumps,\n  # so all phones in the line correspond to the same model.\n\n  cat $srcdir/silence_phones.txt | awk '{printf(\"%s \", $0); } END{printf(\"\\n\");}' | cat - $srcdir/nonsilence_phones.txt | \\\n    utils/apply_map.pl $tmpdir/phone_map.txt > $dir/phones/sets.txt\n  cat $dir/phones/sets.txt | \\\n    awk '{if(NR==1) print \"not-shared\", \"not-split\", $0; else print \"shared\", \"split\", $0;}' > $dir/phones/roots.txt\nelse\n  # different silence phones will have different GMMs.  [note: here, all \"shared split\" means\n  # is that we may have one GMM for all the states, or we can split on states.  
because they're\n  # context-independent phones, they don't see the context.]\n  cat $srcdir/{,non}silence_phones.txt | utils/apply_map.pl $tmpdir/phone_map.txt > $dir/phones/sets.txt\n  cat $dir/phones/sets.txt | awk '{print \"shared\", \"split\", $0;}' > $dir/phones/roots.txt\nfi\n\ncat $srcdir/silence_phones.txt | utils/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/silence.txt\ncat $srcdir/nonsilence_phones.txt | utils/apply_map.pl $tmpdir/phone_map.txt | \\\n  awk '{for(n=1;n<=NF;n++) print $n;}' > $dir/phones/nonsilence.txt\ncp $srcdir/optional_silence.txt $dir/phones/optional_silence.txt\ncp $dir/phones/silence.txt $dir/phones/context_indep.txt\n\n# if extra_questions.txt is empty, it's OK.\ncat $srcdir/extra_questions.txt 2>/dev/null | utils/apply_map.pl $tmpdir/phone_map.txt \\\n  >$dir/phones/extra_questions.txt\n\n# Want extra questions about the word-start/word-end stuff. Make it separate for\n# silence and non-silence. Probably doesn't matter, as silence will rarely\n# be inside a word.\nif $position_dependent_phones; then\n  for suffix in _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/nonsilence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\n  for suffix in \"\" _B _E _I _S; do\n    (set -f; for x in `cat $srcdir/silence_phones.txt`; do echo -n \"$x$suffix \"; done; echo) >>$dir/phones/extra_questions.txt\n  done\nfi\n\n# add_lex_disambig.pl is responsible for adding disambiguation symbols to\n# the lexicon, for telling us how many disambiguation symbols it used,\n# and and also for modifying the unknown-word's pronunciation (if the\n# --unk-fst was provided) to the sequence \"#1 #2 #3\", and reserving those\n# disambig symbols for that purpose.\n# The #2 will later be replaced with the actual unk model.  The reason\n# for the #1 and the #3 is for disambiguation and also to keep the\n# FST compact.  
If we didn't have the #1, we might have a different copy of\n# the unk-model FST, or at least some of its arcs, for each start-state from\n# which an <unk> transition comes (instead of per end-state, which is more compact);\n# and adding the #3 prevents us from potentially having 2 copies of the unk-model\n# FST due to the optional-silence [the last phone of any word gets 2 arcs].\n\nif \"$silprob\"; then\n  echo \"$0: Currently we do not support word-dependent silence probability\" && exit 1;\nelse\n  ndisambig=$(utils/add_lex_disambig.pl $unk_opt --pron-probs $tmpdir/lexiconp.txt $tmpdir/lexiconp_disambig.txt)\nfi\nndisambig=$[$ndisambig+$num_extra_phone_disambig_syms]; # add (at least) one disambig symbol for silence in lexicon FST.\necho $ndisambig > $tmpdir/lex_ndisambig\n\n# Format of lexiconp_disambig.txt:\n# !SIL\t1.0   SIL_S\n# <SPOKEN_NOISE>\t1.0   SPN_S #1\n# <UNK>\t1.0  SPN_S #2\n# <NOISE>\t1.0  NSN_S\n# !EXCLAMATION-POINT\t1.0  EH2_B K_I S_I K_I L_I AH0_I M_I EY1_I SH_I AH0_I N_I P_I OY2_I N_I T_E\n\n( for n in `seq 0 $ndisambig`; do echo '#'$n; done ) >$dir/phones/disambig.txt\n\n# Create phone symbol table.\nif [ ! -z $phone_symbol_table ]; then\n  start_symbol=`grep \\#0 $phone_symbol_table | awk '{print $2}'`\n  echo \"<eps>\" | cat - $dir/phones/{silence,nonsilence}.txt | awk -v f=$phone_symbol_table '\n  BEGIN { while ((getline < f) > 0) { phones[$1] = $2; }} { print $1\" \"phones[$1]; }' | sort -k2 -g |\\\n    cat - <(cat $dir/phones/disambig.txt | awk -v x=$start_symbol '{n=x+NR-1; print $1, n;}') > $dir/phones.txt\nelse\n  echo \"<eps>\" | cat - $dir/phones/{silence,nonsilence,disambig}.txt | \\\n     awk '{n=NR-1; print $1, n;}' > $dir/phones.txt\nfi\n\n# Create a file that describes the word-boundary information for\n# each phone.  
5 categories.\nif $position_dependent_phones; then\n  cat $dir/phones/{silence,nonsilence}.txt | \\\n    awk '/_I$/{print $1, \"internal\"; next;} /_B$/{print $1, \"begin\"; next; }\n         /_S$/{print $1, \"singleton\"; next;} /_E$/{print $1, \"end\"; next; }\n         {print $1, \"nonword\";} ' > $dir/phones/word_boundary_moved.txt\nelse\n  # word_boundary.txt might have been generated by another source\n  [ -f $srcdir/word_boundary.txt ] && cp $srcdir/word_boundary.txt $dir/phones/word_boundary_moved.txt\nfi\n\n# Create word symbol table.\n# <s> and </s> are only needed due to the need to rescore lattices with\n# ConstArpaLm format language model. They do not normally appear in G.fst or\n# L.fst.\n\nif \"$silprob\"; then\n  echo \"$0: Currently we do not support word-dependent silence probability\" && exit 1;\nfi\n\ncat $tmpdir/lexiconp.txt | awk '{print $1}' | sort | uniq  | awk '\n  BEGIN {\n    print \"<eps> 0\";\n  }\n  {\n    if ($1 == \"<s>\") {\n      print \"<s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    if ($1 == \"</s>\") {\n      print \"</s> is in the vocabulary!\" | \"cat 1>&2\"\n      exit 1;\n    }\n    printf(\"%s %d\\n\", $1, NR);\n  }\n  END {\n    printf(\"#0 %d\\n\", NR+1);\n    printf(\"<s> %d\\n\", NR+2);\n    printf(\"</s> %d\\n\", NR+3);\n  }' > $dir/words.txt || exit 1;\n\n# In case there are extra word-level disambiguation symbols they also\n# need to be added to words.txt\n\n# format of $dir/words.txt:\n# <eps> 0\n# a 1\n# aa 2\n# aarvark 3\n# ...\n\nsilphone=`cat $srcdir/optional_silence.txt` || exit 1;\n[ -z \"$silphone\" ] && \\\n  ( echo \"You have no optional-silence phone; it is required in the current scripts\"\n    echo \"but you may use the option --sil-prob 0.0 to stop it being used.\" ) && \\\n   exit 1;\n\n# create $dir/phones/align_lexicon.{txt,int}.\n# This is the method we use for lattice word alignment if we are not\n# using word-position-dependent phones.\n\n# First remove pron-probs from the 
lexicon.\nperl -ape 's/(\\S+\\s+)\\S+\\s+(.+)/$1$2/;' <$tmpdir/lexiconp.txt >$tmpdir/align_lexicon.txt\n\n# Note: here, $silphone will have no suffix e.g. _S because it occurs as optional-silence,\n# and is not part of a word.\n[ ! -z \"$silphone\" ] && echo \"<eps> $silphone\" >> $tmpdir/align_lexicon.txt\n\ncat $tmpdir/align_lexicon.txt | \\\n  perl -ane '@A = split; print $A[0], \" \", join(\" \", @A), \"\\n\";' | sort | uniq > $dir/phones/align_lexicon.txt\n\n# create phones/align_lexicon.int from phones/align_lexicon.txt\ncat $dir/phones/align_lexicon.txt | utils/sym2int.pl -f 3- $dir/phones.txt | \\\n  utils/sym2int.pl -f 1-2 $dir/words.txt > $dir/phones/align_lexicon.int\n\n# Create the basic L.fst without disambiguation symbols, for use\n# in training.\n\nif $silprob; then\n#  # Add silence probabilities (models the prob. of silence before and after each\n#  # word).  On some setups this helps a bit.  See utils/dict_dir_add_pronprobs.sh\n#  # and where it's called in the example scripts (run.sh).\n  echo \"$0: Currently we do not support word-dependent silence probability\" && exit 1;\nelse\n  utils/lang/make_subword_lexicon_fst.py $grammar_opts --sil-prob=$sil_prob --sil-phone=$silphone --position-dependent\\\n            --separator=$separator $tmpdir/lexiconp.txt | \\\n    fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n      --keep_isymbols=false --keep_osymbols=false | \\\n    fstarcsort --sort_type=olabel > $dir/L.fst || exit 1;\nfi\n\n# The file oov.txt contains a word that we will map any OOVs to during\n# training.\necho \"$oov_word\" > $dir/oov.txt || exit 1;\ncat $dir/oov.txt | utils/sym2int.pl $dir/words.txt >$dir/oov.int || exit 1;\n# integer version of oov symbol, used in some scripts.\n\n# the file wdisambig.txt contains a (line-by-line) list of the text-form of the\n# disambiguation symbols that are used in the grammar and passed through by the\n# lexicon.  
At this stage it's hardcoded as '#0', but we're laying the groundwork\n# for more generality (which probably would be added by another script).\n# wdisambig_words.int contains the corresponding list interpreted by the\n# symbol table words.txt, and wdisambig_phones.int contains the corresponding\n# list interpreted by the symbol table phones.txt.\necho '#0' >$dir/phones/wdisambig.txt\n\nutils/sym2int.pl $dir/phones.txt <$dir/phones/wdisambig.txt >$dir/phones/wdisambig_phones.int\nutils/sym2int.pl $dir/words.txt <$dir/phones/wdisambig.txt >$dir/phones/wdisambig_words.int\n\n# Create these lists of phones in colon-separated integer list form too,\n# for purposes of being given to programs as command-line options.\nfor f in silence nonsilence optional_silence disambig context_indep; do\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt >$dir/phones/$f.int\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/$f.txt | \\\n   awk '{printf(\":%d\", $1);} END{printf \"\\n\"}' | sed s/:// > $dir/phones/$f.csl || exit 1;\ndone\n\nfor x in sets extra_questions; do\n  utils/sym2int.pl $dir/phones.txt <$dir/phones/$x.txt > $dir/phones/$x.int || exit 1;\ndone\n\nutils/sym2int.pl -f 3- $dir/phones.txt <$dir/phones/roots.txt \\\n   > $dir/phones/roots.int || exit 1;\n\nif [ -f $dir/phones/word_boundary_moved.txt ]; then\n  utils/sym2int.pl -f 1 $dir/phones.txt <$dir/phones/word_boundary_moved.txt \\\n    > $dir/phones/word_boundary_moved.int || exit 1;\nfi\n\nsilphonelist=`cat $dir/phones/silence.csl`\nnonsilphonelist=`cat $dir/phones/nonsilence.csl`\n\n# Note: it's OK, after generating the 'lang' directory, to overwrite the topo file\n# with another one of your choice if the 'topo' file you want can't be generated by\n# utils/gen_topo.pl.  We do this in the 'chain' recipes.  Of course, the 'topo' file\n# should cover all the phones.  
Try running utils/validate_lang.pl to check that\n# everything is OK after modifying the topo file.\nutils/gen_topo.pl $num_nonsil_states $num_sil_states $nonsilphonelist $silphonelist >$dir/topo\n\n# Create the lexicon FST with disambiguation symbols, and put it in lang_test.\n# There is an extra step where we create a loop to \"pass through\" the\n# disambiguation symbols from G.fst.\n\nif $silprob; then\n  echo \"$0: Currently we do not support word-dependent silence probability\" && exit 1;\nelse\n  utils/lang/make_subword_lexicon_fst.py $grammar_opts \\\n       --sil-prob=$sil_prob --sil-phone=$silphone --sil-disambig='#'$ndisambig --position-dependent \\\n       --separator=$separator $tmpdir/lexiconp_disambig.txt | \\\n     fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt \\\n       --keep_isymbols=false --keep_osymbols=false |   \\\n     fstaddselfloops  $dir/phones/wdisambig_phones.int $dir/phones/wdisambig_words.int | \\\n     fstarcsort --sort_type=olabel > $dir/L_disambig.fst || exit 1;\nfi\n\necho \"$(basename $0): validating output directory\"\n! utils/validate_lang.pl $dir && echo \"$(basename $0): error validating output\" &&  exit 1;\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/subword/prepare_subword_text.sh",
    "content": "#!/usr/bin/env bash\n\n# 2019 Dongji Gao\n\n# This script generates subword text form word text.\n# For example, <noise> internatioal -> <noise> inter@@ nation@@ al\n# @@ here is the separator indicate the poisition of subword in word.\n# Subword directly followed by separator can only appear at he begining or middle of word.\n# \"<noise>\" here can be reserved if added to the option \"--glossaries\"\n\n# Begin configuration section\nseparator=\"@@\"\nglossaries=\n# End configuration section\n\n. utils/parse_options.sh\n\necho \"$0 $@\"\n\nif [ $# -ne 3 ]; then\n  echo \"Usage: utils/prepare_subword_text.sh <word-text> <pair_code> <subword-text>\"\n  echo \"e.g.: utils/prepare_subword_text.sh data/train/text data/local/pair_code.txt data/train/text_subword\"\n  echo \"    --seperator <separator>         # default: @@\"\n  echo \"    --glossaries <reserved-words>   # glossaries are words reserved\"\n  exit 1;\nfi\n\nword_text=$1\npair_code=$2\nsubword_text=$3\n\n[ ! -f $word_text ] && echo \"Word text $word_text does not exits.\" && exit 1;\n\ngrep -q $separator $word_text && echo \"$0: Error, word text file contains separator $separator. This might be a subword text file or you need to choose a different separator\" && exit 1;\n\nglossaries_opt=\n[ -z $glossaires ] && glossaries_opt=\"--glossaries $glossaries\"\ncut -d ' ' -f2- $word_text | \\\n  utils/lang/bpe/apply_bpe.py -c $pair_code --separator $separator $glossaires_opt > ${word_text}.sub\n  if [ $word_text == $subword_text ]; then\n    mv $word_text ${word_text}.old\n    cut -d ' ' -f1 ${word_text}.old | paste -d ' ' - ${word_text}.sub > $subword_text\n  else\n    cut -d ' ' -f1 $word_text | paste -d ' ' - ${word_text}.sub > $subword_text\n  fi\n\nrm ${word_text}.sub\necho \"Subword text created.\"\n"
  },
  {
    "path": "egs/utils/summarize_logs.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n#scalar(@ARGV) >= 1 && print STDERR \"Usage: summarize_warnings.pl <log-dir>\\n\" && exit 1;\n\nsub split_hundreds { # split list of filenames into groups of 100.\n  my $names = shift @_;\n  my @A = split(\" \", $names);\n  my @ans = ();\n  while (@A > 0) {\n    my $group = \"\";\n    for ($x = 0; $x < 100 && @A>0; $x++) {\n      $fname = pop @A;\n      $group .= \"$fname \";\n    }\n    push @ans, $group;\n  }\n  return @ans;\n}\n\nsub parse_accounting_entry {\n  $entry= shift @_;\n\n  @elems = split \" \", $entry;\n  \n  $time=undef;\n  $threads=undef;\n  foreach $elem (@elems) {\n    if ( $elem=~ m/time=(\\d+)/ ) {\n      $elem =~ s/time=(\\d+)/$1/;\n      $time = $elem;\n    } elsif ( $elem=~ m/threads=(\\d+)/ ) {\n      $elem =~ s/threads=(\\d+)/$1/g;\n      $threads = $elem;\n    } else {\n      die \"Unknown entry \\\"$elem\\\" when parsing \\\"$entry\\\" \\n\";\n    }\n  }\n\n  if (defined($time) and defined($threads) ) {\n    return ($time, $threads);\n  } else {\n    die \"The accounting entry \\\"$entry\\\" did not contain all necessary attributes\";\n  }\n}\n\nforeach $dir (@ARGV) {\n\n  #$dir = $ARGV[0];\n  print $dir\n\n  ! 
-d $dir && print STDERR \"summarize_logs.pl: no such directory $dir\\n\" ;\n\n  $dir =~ s:/$::; # Remove trailing slash.\n\n\n  # Group the files into categories where all have the same base-name.\n  foreach $f (glob (\"$dir/*.log\")) {\n    $f_category = $f;\n    # do next expression twice; s///g doesn't work as they overlap.\n    $f_category =~ s:\\.\\d+\\.(?!\\d+):.*.:;\n    #$f_category =~ s:\\.\\d+\\.:.*.:;\n    $fmap{$f_category} .= \" $f\";\n  }\n}\n\nforeach $c (sort (keys %fmap) ) {\n  $n = 0;\n  foreach $fgroup (split_hundreds($fmap{$c})) {\n    $n += `grep -w WARNING $fgroup | wc -l`;\n  }\n  if ($n != 0) {\n    print \"$n warnings in $c\\n\"\n  }\n}\nforeach $c (sort (keys %fmap)) {\n  $n = 0;\n  foreach $fgroup (split_hundreds($fmap{$c})) {\n    $n += `grep -w ERROR $fgroup | wc -l`;\n  }\n  if ($n != 0) {\n    print \"$n errors in $c\\n\"\n  }\n}\n\n$supertotal_cpu_time=0.0;\n$supertotal_clock_time=0.0;\n$supertotal_threads=0.0;\n\nforeach $c (sort (keys %fmap)) {\n  $n = 0;\n\n  $total_cpu_time=0.0;\n  $total_clock_time=0.0;\n  $total_threads=0.0;\n  foreach $fgroup (split_hundreds($fmap{$c})) {\n    $lines=`grep -a \"# Accounting: \" $fgroup |sed 's/.* Accounting: *//g'`;\n    \n    #print $lines .\"\\n\";\n\n    @entries = split \"\\n\", $lines;\n\n    foreach $line (@entries) {\n      # List assignment needs parentheses; without them only $threads is assigned.\n      ($time, $threads) = parse_accounting_entry($line);\n\n      $total_cpu_time += $time * $threads;\n      $total_threads += $threads;\n      if ( $time > $total_clock_time ) {\n        $total_clock_time = $time;\n      }\n    }\n  }\n  print \"total_cpu_time=$total_cpu_time clock_time=$total_clock_time total_threads=$total_threads group=$c\\n\";\n\n  $supertotal_cpu_time += $total_cpu_time;\n  $supertotal_clock_time += $total_clock_time;\n  $supertotal_threads += $total_threads;\n}\nprint \"total_cpu_time=$supertotal_cpu_time clock_time=$supertotal_clock_time total_threads=$supertotal_threads group=all\\n\";\n\n"
  },
  {
    "path": "egs/utils/summarize_warnings.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012 Johns Hopkins University (Author: Daniel Povey).  Apache 2.0.\n\n @ARGV != 1 && print STDERR \"Usage: summarize_warnings.pl <log-dir>\\n\" && exit 1;\n\n$dir = $ARGV[0];\n\n! -d $dir && print STDERR \"summarize_warnings.pl: no such directory $dir\\n\" && exit 1;\n\n$dir =~ s:/$::; # Remove trailing slash.\n\n\n# Group the files into categories where all have the same base-name.\nforeach $f (glob (\"$dir/*.log\")) {\n  $f_category = $f;\n  # do next expression twice; s///g doesn't work as they overlap.\n  $f_category =~ s:\\.\\d+\\.:.*.:;\n  $f_category =~ s:\\.\\d+\\.:.*.:;\n  $fmap{$f_category} .= \" $f\";\n}\n\nsub split_hundreds { # split list of filenames into groups of 100.\n  my $names = shift @_;\n  my @A = split(\" \", $names);\n  my @ans = ();\n  while (@A > 0) {\n    my $group = \"\";\n    for ($x = 0; $x < 100 && @A>0; $x++) {\n      $fname = pop @A;\n      $group .= \"$fname \";\n    }\n    push @ans, $group;\n  }\n  return @ans;\n}\n\nforeach $c (keys %fmap) {\n  $n = 0;\n  foreach $fgroup (split_hundreds($fmap{$c})) {\n    $n += `grep -w WARNING $fgroup | wc -l`;\n  }\n  if ($n != 0) {\n    print \"$n warnings in $c\\n\"\n  }\n}\n"
  },
  {
    "path": "egs/utils/sym2int.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2012 Microsoft Corporation  Johns Hopkins University (Author: Daniel Povey)\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n\n$ignore_oov = 0;\n\nfor($x = 0; $x < 2; $x++) {\n  if ($ARGV[0] eq \"--map-oov\") {\n    shift @ARGV;\n    $map_oov = shift @ARGV;\n    if ($map_oov eq \"-f\" || $map_oov =~ m/words\\.txt$/ || $map_oov eq \"\") {\n      # disallow '-f', the empty string and anything ending in words.txt as the\n      # OOV symbol because these are likely command-line errors.\n      die \"the --map-oov option requires an argument\";\n    }\n  }\n  if ($ARGV[0] eq \"-f\") {\n    shift @ARGV;\n    $field_spec = shift @ARGV;\n    if ($field_spec =~ m/^\\d+$/) {\n      $field_begin = $field_spec - 1; $field_end = $field_spec - 1;\n    }\n    if ($field_spec =~ m/^(\\d*)[-:](\\d*)/) { # accept e.g. 
1:10 as a courtesy (properly, 1-10)\n      if ($1 ne \"\") {\n        $field_begin = $1 - 1;  # Change to zero-based indexing.\n      }\n      if ($2 ne \"\") {\n        $field_end = $2 - 1;    # Change to zero-based indexing.\n      }\n    }\n    if (!defined $field_begin && !defined $field_end) {\n      die \"Bad argument to -f option: $field_spec\";\n    }\n  }\n}\n\n$symtab = shift @ARGV;\nif (!defined $symtab) {\n  print STDERR \"Usage: sym2int.pl [options] symtab [input transcriptions] > output transcriptions\\n\" .\n    \"options: [--map-oov <oov-symbol> ]  [-f <field-range> ]\\n\" .\n      \"note: <field-range> can look like 4-5, or 4-, or 5-, or 1.\\n\";\n}\nopen(F, \"<$symtab\") || die \"Error opening symbol table file $symtab\";\nwhile(<F>) {\n    @A = split(\" \", $_);\n    @A == 2 || die \"bad line in symbol table file: $_\";\n    $sym2int{$A[0]} = $A[1] + 0;\n}\n\nif (defined $map_oov && $map_oov !~ m/^\\d+$/) { # not numeric-> look it up\n  if (!defined $sym2int{$map_oov}) { die \"OOV symbol $map_oov not defined.\"; }\n  $map_oov = $sym2int{$map_oov};\n}\n\n$num_warning = 0;\n$max_warning = 20;\n\nwhile (<>) {\n  @A = split(\" \", $_);\n  @B = ();\n  for ($n = 0; $n < @A; $n++) {\n    $a = $A[$n];\n    if ( (!defined $field_begin || $n >= $field_begin)\n         && (!defined $field_end || $n <= $field_end)) {\n      $i = $sym2int{$a};\n      if (!defined ($i)) {\n        if (defined $map_oov) {\n          if ($num_warning++ < $max_warning) {\n            print STDERR \"sym2int.pl: replacing $a with $map_oov\\n\";\n            if ($num_warning == $max_warning) {\n              print STDERR \"sym2int.pl: not warning for OOVs any more times\\n\";\n            }\n          }\n          $i = $map_oov;\n        } else {\n          $pos = $n+1;\n          die \"sym2int.pl: undefined symbol $a (in position $pos)\\n\";\n        }\n      }\n      $a = $i;\n    }\n    push @B, $a;\n  }\n  print join(\" \", @B);\n  print \"\\n\";\n}\nif ($num_warning > 0) {\n  
print STDERR \"** Replaced $num_warning instances of OOVs with $map_oov\\n\";\n}\n\nexit(0);\n"
  },
  {
    "path": "egs/utils/utt2spk_to_spk2utt.pl",
    "content": "#!/usr/bin/env perl\n# Copyright 2010-2011 Microsoft Corporation\n\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n\n# converts an utt2spk file to a spk2utt file.\n# Takes input from the stdin or from a file argument;\n# output goes to the standard out.\n\nif ( @ARGV > 1 ) {\n    die \"Usage: utt2spk_to_spk2utt.pl [ utt2spk ] > spk2utt\";\n}\n\nwhile(<>){ \n    @A = split(\" \", $_);\n    @A == 2 || die \"Invalid line in utt2spk file: $_\";\n    ($u,$s) = @A;\n    if(!$seen_spk{$s}) {\n        $seen_spk{$s} = 1;\n        push @spklist, $s;\n    }\n    push (@{$spk_hash{$s}}, \"$u\");\n}\nforeach $s (@spklist) {\n    $l = join(' ',@{$spk_hash{$s}});\n    print \"$s $l\\n\";\n}\n"
  },
  {
    "path": "egs/utils/validate_data_dir.sh",
    "content": "#!/usr/bin/env bash\n\ncmd=\"$@\"\n\nno_feats=false\nno_wav=false\nno_text=false\nno_spk_sort=false\nnon_print=false\n\n\nfunction show_help\n{\n      echo \"Usage: $0 [--no-feats] [--no-text] [--non-print] [--no-wav] [--no-spk-sort] <data-dir>\"\n      echo \"The --no-xxx options mean that the script does not require \"\n      echo \"xxx.scp to be present, but it will check it if it is present.\"\n      echo \"--no-spk-sort means that the script does not require the utt2spk to be \"\n      echo \"sorted by the speaker-id in addition to being sorted by utterance-id.\"\n      echo \"--non-print ignore the presence of non-printable characters.\"\n      echo \"By default, utt2spk is expected to be sorted by both, which can be \"\n      echo \"achieved by making the speaker-id prefixes of the utterance-ids\"\n      echo \"e.g.: $0 data/train\"\n}      \n\nwhile [ $# -ne 0 ] ; do\n  case \"$1\" in\n    \"--no-feats\")\n      no_feats=true;\n      ;;\n    \"--no-text\")\n      no_text=true;\n      ;;\n    \"--non-print\")\n      non_print=true;\n      ;;\n    \"--no-wav\")\n      no_wav=true;\n      ;;\n    \"--no-spk-sort\")\n      no_spk_sort=true;\n      ;;\n    *)\n      if ! [ -z \"$data\" ] ; then\n        show_help;\n        exit 1\n      fi\n      data=$1\n      ;;\n  esac\n  shift\ndone\n\n\n\nif [ ! -d $data ]; then\n  echo \"$0: no such directory $data\"\n  exit 1;\nfi\n\nif [ -f $data/images.scp ]; then\n  cmd=${cmd/--no-wav/}  # remove --no-wav if supplied\n  image/validate_data_dir.sh $cmd\n  exit $?\nfi\n\nfor f in spk2utt utt2spk; do\n  if [ ! -f $data/$f ]; then\n    echo \"$0: no such file $f\"\n    exit 1;\n  fi\n  if [ ! -s $data/$f ]; then\n    echo \"$0: empty file $f\"\n    exit 1;\n  fi\ndone\n\n! cat $data/utt2spk | awk '{if (NF != 2) exit(1); }' && \\\n  echo \"$0: $data/utt2spk has wrong format.\" && exit;\n\nns=$(wc -l < $data/spk2utt)\nif [ \"$ns\" == 1 ]; then\n  echo \"$0: WARNING: you have only one speaker.  
This is probably a bad idea.\"\n  echo \"   Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html\"\n  echo \"   for more information.\"\nfi\n\n\ntmpdir=$(mktemp -d /tmp/kaldi.XXXX);\ntrap 'rm -rf \"$tmpdir\"' EXIT HUP INT PIPE TERM\n\nexport LC_ALL=C\n\nfunction check_sorted_and_uniq {\n  ! perl -ne '((substr $_,-1) eq \"\\n\") or die \"file $ARGV has invalid newline\";' $1 && exit 1;\n  ! awk '{print $1}' < $1 | sort -uC && echo \"$0: file $1 is not sorted or has duplicates\" && exit 1;\n}\n\nfunction partial_diff {\n  diff -U1 $1 $2 | (head -n 6; echo \"...\"; tail -n 6)\n  n1=`cat $1 | wc -l`\n  n2=`cat $2 | wc -l`\n  echo \"[Lengths are $1=$n1 versus $2=$n2]\"\n}\n\ncheck_sorted_and_uniq $data/utt2spk\n\nif ! $no_spk_sort; then\n  ! sort -k2 -C $data/utt2spk && \\\n     echo \"$0: utt2spk is not in sorted order when sorted first on speaker-id \" && \\\n     echo \"(fix this by making speaker-ids prefixes of utt-ids)\" && exit 1;\nfi\n\ncheck_sorted_and_uniq $data/spk2utt\n\n! cmp -s <(cat $data/utt2spk | awk '{print $1, $2;}') \\\n     <(utils/spk2utt_to_utt2spk.pl $data/spk2utt)  && \\\n   echo \"$0: spk2utt and utt2spk do not seem to match\" && exit 1;\n\ncat $data/utt2spk | awk '{print $1;}' > $tmpdir/utts\n\nif [ ! -f $data/text ] && ! $no_text; then\n  echo \"$0: no such file $data/text (if this is by design, specify --no-text)\"\n  exit 1;\nfi\n\nnum_utts=`cat $tmpdir/utts | wc -l`\nif ! $no_text; then\n  if ! 
$non_print; then\n    n_non_print=$(LC_ALL=\"C.UTF-8\" grep -c '[^[:print:][:space:]]' $data/text) && \\\n    echo \"$0: text contains $n_non_print lines with non-printable characters\" &&\\\n    exit 1;\n  fi\n  utils/validate_text.pl $data/text || exit 1;\n  check_sorted_and_uniq $data/text\n  text_len=`cat $data/text | wc -l`\n  illegal_sym_list=\"<s> </s> #0\"\n  for x in $illegal_sym_list; do\n    if grep -w \"$x\" $data/text > /dev/null; then\n      echo \"$0: Error: in $data, text contains illegal symbol $x\"\n      exit 1;\n    fi\n  done\n  awk '{print $1}' < $data/text > $tmpdir/utts.txt\n  if ! cmp -s $tmpdir/utts{,.txt}; then\n    echo \"$0: Error: in $data, utterance lists extracted from utt2spk and text\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/utts{,.txt}\n    exit 1;\n  fi\nfi\n\nif [ -f $data/segments ] && [ ! -f $data/wav.scp ]; then\n  echo \"$0: in directory $data, segments file exists but no wav.scp\"\n  exit 1;\nfi\n\n\nif [ ! -f $data/wav.scp ] && ! $no_wav; then\n  echo \"$0: no such file $data/wav.scp (if this is by design, specify --no-wav)\"\n  exit 1;\nfi\n\nif [ -f $data/wav.scp ]; then\n  check_sorted_and_uniq $data/wav.scp\n\n  if grep -E -q '^\\S+\\s+~' $data/wav.scp; then\n    # note: it's not a good idea to have any kind of tilde in wav.scp, even if\n    # part of a command, as it would cause compatibility problems if run by\n    # other users, but this used to be not checked for so we let it slide unless\n    # it's something of the form \"foo ~/foo.wav\" (i.e. a plain file name) which\n    # would definitely cause problems as the fopen system call does not do\n    # tilde expansion.\n    echo \"$0: Please do not use tilde (~) in your wav.scp.\"\n    exit 1;\n  fi\n\n  if [ -f $data/segments ]; then\n\n    check_sorted_and_uniq $data/segments\n    # We have a segments file -> interpret wav file as \"recording-ids\" not utterance-ids.\n    ! 
cat $data/segments | \\\n      awk '{if (NF != 4 || $4 <= $3) { print \"Bad line in segments file\", $0; exit(1); }}' && \\\n      echo \"$0: badly formatted segments file\" && exit 1;\n\n    segments_len=`cat $data/segments | wc -l`\n    if [ -f $data/text ]; then\n      ! cmp -s $tmpdir/utts <(awk '{print $1}' <$data/segments) && \\\n        echo \"$0: Utterance list differs between $data/utt2spk and $data/segments \" && \\\n        echo \"$0: Lengths are $segments_len vs $num_utts\" && \\\n        exit 1\n    fi\n\n    cat $data/segments | awk '{print $2}' | sort | uniq > $tmpdir/recordings\n    awk '{print $1}' $data/wav.scp > $tmpdir/recordings.wav\n    if ! cmp -s $tmpdir/recordings{,.wav}; then\n      echo \"$0: Error: in $data, recording-ids extracted from segments and wav.scp\"\n      echo \"$0: differ, partial diff is:\"\n      partial_diff $tmpdir/recordings{,.wav}\n      exit 1;\n    fi\n    if [ -f $data/reco2file_and_channel ]; then\n      # this file is needed only for ctm scoring; it's indexed by recording-id.\n      check_sorted_and_uniq $data/reco2file_and_channel\n      ! cat $data/reco2file_and_channel | \\\n        awk '{if (NF != 3 || ($3 != \"A\" && $3 != \"B\" )) {\n                if ( NF == 3 && $3 == \"1\" ) {\n                  warning_issued = 1;\n                } else {\n                  print \"Bad line \", $0; exit 1;\n                }\n              }\n            }\n            END {\n              if (warning_issued == 1) {\n                print \"The channel should be marked as A or B, not 1! You should change it ASAP! \"\n              }\n            }' && echo \"$0: badly formatted reco2file_and_channel file\" && exit 1;\n      cat $data/reco2file_and_channel | awk '{print $1}' > $tmpdir/recordings.r2fc\n      if ! 
cmp -s $tmpdir/recordings{,.r2fc}; then\n        echo \"$0: Error: in $data, recording-ids extracted from segments and reco2file_and_channel\"\n        echo \"$0: differ, partial diff is:\"\n        partial_diff $tmpdir/recordings{,.r2fc}\n        exit 1;\n      fi\n    fi\n  else\n    # No segments file -> assume wav.scp indexed by utterance.\n    cat $data/wav.scp | awk '{print $1}' > $tmpdir/utts.wav\n    if ! cmp -s $tmpdir/utts{,.wav}; then\n      echo \"$0: Error: in $data, utterance lists extracted from utt2spk and wav.scp\"\n      echo \"$0: differ, partial diff is:\"\n      partial_diff $tmpdir/utts{,.wav}\n      exit 1;\n    fi\n\n    if [ -f $data/reco2file_and_channel ]; then\n      # this file is needed only for ctm scoring; it's indexed by recording-id.\n      check_sorted_and_uniq $data/reco2file_and_channel\n      ! cat $data/reco2file_and_channel | \\\n        awk '{if (NF != 3 || ($3 != \"A\" && $3 != \"B\" )) {\n                if ( NF == 3 && $3 == \"1\" ) {\n                  warning_issued = 1;\n                } else {\n                  print \"Bad line \", $0; exit 1;\n                }\n              }\n            }\n            END {\n              if (warning_issued == 1) {\n                print \"The channel should be marked as A or B, not 1! You should change it ASAP! \"\n              }\n            }' && echo \"$0: badly formatted reco2file_and_channel file\" && exit 1;\n      cat $data/reco2file_and_channel | awk '{print $1}' > $tmpdir/utts.r2fc\n      if ! cmp -s $tmpdir/utts{,.r2fc}; then\n        echo \"$0: Error: in $data, utterance-ids extracted from segments and reco2file_and_channel\"\n        echo \"$0: differ, partial diff is:\"\n        partial_diff $tmpdir/utts{,.r2fc}\n        exit 1;\n      fi\n    fi\n  fi\nfi\n\nif [ ! -f $data/feats.scp ] && ! 
$no_feats; then\n  echo \"$0: no such file $data/feats.scp (if this is by design, specify --no-feats)\"\n  exit 1;\nfi\n\nif [ -f $data/feats.scp ]; then\n  check_sorted_and_uniq $data/feats.scp\n  cat $data/feats.scp | awk '{print $1}' > $tmpdir/utts.feats\n  if ! cmp -s $tmpdir/utts{,.feats}; then\n    echo \"$0: Error: in $data, utterance-ids extracted from utt2spk and features\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/utts{,.feats}\n    exit 1;\n  fi\nfi\n\n\nif [ -f $data/cmvn.scp ]; then\n  check_sorted_and_uniq $data/cmvn.scp\n  cat $data/cmvn.scp | awk '{print $1}' > $tmpdir/speakers.cmvn\n  cat $data/spk2utt | awk '{print $1}' > $tmpdir/speakers\n  if ! cmp -s $tmpdir/speakers{,.cmvn}; then\n    echo \"$0: Error: in $data, speaker lists extracted from spk2utt and cmvn\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/speakers{,.cmvn}\n    exit 1;\n  fi\nfi\n\nif [ -f $data/spk2gender ]; then\n  check_sorted_and_uniq $data/spk2gender\n  ! cat $data/spk2gender | awk '{if (!((NF == 2 && ($2 == \"m\" || $2 == \"f\")))) exit 1; }' && \\\n     echo \"$0: Mal-formed spk2gender file\" && exit 1;\n  cat $data/spk2gender | awk '{print $1}' > $tmpdir/speakers.spk2gender\n  cat $data/spk2utt | awk '{print $1}' > $tmpdir/speakers\n  if ! cmp -s $tmpdir/speakers{,.spk2gender}; then\n    echo \"$0: Error: in $data, speaker lists extracted from spk2utt and spk2gender\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/speakers{,.spk2gender}\n    exit 1;\n  fi\nfi\n\nif [ -f $data/spk2warp ]; then\n  check_sorted_and_uniq $data/spk2warp\n  ! cat $data/spk2warp | awk '{if (!((NF == 2 && ($2 > 0.5 && $2 < 1.5)))){ print; exit 1; }}' && \\\n     echo \"$0: Mal-formed spk2warp file\" && exit 1;\n  cat $data/spk2warp | awk '{print $1}' > $tmpdir/speakers.spk2warp\n  cat $data/spk2utt | awk '{print $1}' > $tmpdir/speakers\n  if ! 
cmp -s $tmpdir/speakers{,.spk2warp}; then\n    echo \"$0: Error: in $data, speaker lists extracted from spk2utt and spk2warp\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/speakers{,.spk2warp}\n    exit 1;\n  fi\nfi\n\nif [ -f $data/utt2warp ]; then\n  check_sorted_and_uniq $data/utt2warp\n  ! cat $data/utt2warp | awk '{if (!((NF == 2 && ($2 > 0.5 && $2 < 1.5)))){ print; exit 1; }}' && \\\n     echo \"$0: Mal-formed utt2warp file\" && exit 1;\n  cat $data/utt2warp | awk '{print $1}' > $tmpdir/utts.utt2warp\n  cat $data/utt2spk | awk '{print $1}' > $tmpdir/utts\n  if ! cmp -s $tmpdir/utts{,.utt2warp}; then\n    echo \"$0: Error: in $data, utterance lists extracted from utt2spk and utt2warp\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/utts{,.utt2warp}\n    exit 1;\n  fi\nfi\n\n# check some optionally-required things\nfor f in vad.scp utt2lang utt2uniq; do\n  if [ -f $data/$f ]; then\n    check_sorted_and_uniq $data/$f\n    if ! cmp -s <( awk '{print $1}' $data/utt2spk ) \\\n      <( awk '{print $1}' $data/$f ); then\n      echo \"$0: error: in $data, $f and utt2spk do not have identical utterance-id list\"\n      exit 1;\n    fi\n  fi\ndone\n\n\nif [ -f $data/utt2dur ]; then\n  check_sorted_and_uniq $data/utt2dur\n  cat $data/utt2dur | awk '{print $1}' > $tmpdir/utts.utt2dur\n  if ! cmp -s $tmpdir/utts{,.utt2dur}; then\n    echo \"$0: Error: in $data, utterance-ids extracted from utt2spk and utt2dur file\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/utts{,.utt2dur}\n    exit 1;\n  fi\n  cat $data/utt2dur | \\\n    awk '{ if (NF != 2 || !($2 > 0)) { print \"Bad line utt2dur:\" NR \":\" $0; exit(1) }}' || exit 1\nfi\n\nif [ -f $data/utt2num_frames ]; then\n  check_sorted_and_uniq $data/utt2num_frames\n  cat $data/utt2num_frames | awk '{print $1}' > $tmpdir/utts.utt2num_frames\n  if ! 
cmp -s $tmpdir/utts{,.utt2num_frames}; then\n    echo \"$0: Error: in $data, utterance-ids extracted from utt2spk and utt2num_frames file\"\n    echo \"$0: differ, partial diff is:\"\n    partial_diff $tmpdir/utts{,.utt2num_frames}\n    exit 1\n  fi\n  awk <$data/utt2num_frames '{\n    if (NF != 2 || !($2 > 0) || $2 != int($2)) {\n      print \"Bad line utt2num_frames:\" NR \":\" $0\n      exit 1 } }' || exit 1\nfi\n\nif [ -f $data/reco2dur ]; then\n  check_sorted_and_uniq $data/reco2dur\n  cat $data/reco2dur | awk '{print $1}' > $tmpdir/recordings.reco2dur\n  if [ -f $tmpdir/recordings ]; then\n    if ! cmp -s $tmpdir/recordings{,.reco2dur}; then\n      echo \"$0: Error: in $data, recording-ids extracted from segments and reco2dur file\"\n      echo \"$0: differ, partial diff is:\"\n      partial_diff $tmpdir/recordings{,.reco2dur}\n    exit 1;\n    fi\n  else\n    if ! cmp -s $tmpdir/{utts,recordings.reco2dur}; then\n      echo \"$0: Error: in $data, recording-ids extracted from wav.scp and reco2dur file\"\n      echo \"$0: differ, partial diff is:\"\n      partial_diff $tmpdir/{utts,recordings.reco2dur}\n    exit 1;\n    fi\n  fi\n  cat $data/reco2dur | \\\n    awk '{ if (NF != 2 || !($2 > 0)) { print \"Bad line : \" $0; exit(1) }}' || exit 1\nfi\n\n\necho \"$0: Successfully validated data-directory $data\"\n"
  },
  {
    "path": "egs/utils/validate_dict_dir.pl",
    "content": "#!/usr/bin/env perl\n\n# Apache 2.0.\n# Copyright  2012 Guoguo Chen\n#            2015 Daniel Povey\n#            2017 Johns Hopkins University (Jan \"Yenda\" Trmal <jtrmal@gmail.com>)\n#\n# Validation script for 'dict' directories (e.g. data/local/dict)\n\n# this function reads the opened file (supplied as a first\n# parameter) into an array of lines. For each\n# line, it tests whether it's a valid utf-8 compatible\n# line. If all lines are valid utf-8, it returns the lines\n# decoded as utf-8, otherwise it assumes the file's encoding\n# is one of those 1-byte encodings, such as ISO-8859-x\n# or Windows CP-X.\n# Please recall we do not really care about\n# the actually encoding, we just need to\n# make sure the length of the (decoded) string\n# is correct (to make the output formatting looking right).\nsub get_utf8_or_bytestream {\n  use Encode qw(decode encode);\n  my $is_utf_compatible = 1;\n  my @unicode_lines;\n  my @raw_lines;\n  my $raw_text;\n  my $lineno = 0;\n  my $file = shift;\n\n  while (<$file>) {\n    $raw_text = $_;\n    last unless $raw_text;\n    if ($is_utf_compatible) {\n      my $decoded_text = eval { decode(\"UTF-8\", $raw_text, Encode::FB_CROAK) } ;\n      $is_utf_compatible = $is_utf_compatible && defined($decoded_text);\n      push @unicode_lines, $decoded_text;\n    } else {\n      #print STDERR \"WARNING: the line($.) 
$raw_text cannot be interpreted as UTF-8: $decoded_text\\n\";\n      ;\n    }\n    push @raw_lines, $raw_text;\n    $lineno += 1;\n  }\n\n  if (!$is_utf_compatible) {\n    return (0, @raw_lines);\n  } else {\n    return (1, @unicode_lines);\n  }\n}\n\n# check if the given unicode string contain unicode whitespaces\n# other than the usual four: TAB, LF, CR and SPACE\nsub validate_utf8_whitespaces {\n  my $unicode_lines = shift;\n  use feature 'unicode_strings';\n  for (my $i = 0; $i < scalar @{$unicode_lines}; $i++) {\n    my $current_line = $unicode_lines->[$i];\n    if ((substr $current_line, -1) ne \"\\n\"){\n      print STDERR \"$0: The current line (nr. $i) has invalid newline\\n\";\n      return 1;\n    }\n    my @A = split(\" \", $current_line);\n    my $utt_id = $A[0];\n    # we replace TAB, LF, CR, and SPACE\n    # this is to simplify the test\n    if ($current_line =~ /\\x{000d}/) {\n      print STDERR \"$0: The line for utterance $utt_id contains CR (0x0D) character\\n\";\n      return 1;\n    }\n    $current_line =~ s/[\\x{0009}\\x{000a}\\x{0020}]/./g;\n    if ($current_line =~/\\s/) {\n      print STDERR \"$0: The line for utterance $utt_id contains disallowed Unicode whitespaces\\n\";\n      return 1;\n    }\n  }\n  return 0;\n}\n\n# checks if the text in the file (supplied as the argument) is utf-8 compatible\n# if yes, checks if it contains only allowed whitespaces. If no, then does not\n# do anything. 
The function seeks to the original position in the file after\n# reading the text.\nsub check_allowed_whitespace {\n  my $file = shift;\n  my $pos = tell($file);\n  (my $is_utf, my @lines) = get_utf8_or_bytestream($file);\n  seek($file, $pos, SEEK_SET);\n  if ($is_utf) {\n    my $has_invalid_whitespaces = validate_utf8_whitespaces(\\@lines);\n    print \"--> text seems to be UTF-8 or ASCII, checking whitespaces\\n\";\n    if ($has_invalid_whitespaces) {\n      print \"--> ERROR: the text contains disallowed UTF-8 whitespace character(s)\\n\";\n      return 0;\n    } else {\n      print \"--> text contains only allowed whitespaces\\n\";\n    }\n  } else {\n    print \"--> text doesn't seem to be UTF-8 or ASCII, won't check whitespaces\\n\";\n  }\n  return 1;\n}\n\n\nif(@ARGV != 1) {\n  die \"Usage: validate_dict_dir.pl <dict-dir>\\n\" .\n      \"e.g.: validate_dict_dir.pl data/local/dict\\n\";\n}\n\n$dict = shift @ARGV;\n$dict =~ s:/$::;\n\n$exit = 0;\n$success = 1;  # this is re-set each time we read a file.\n\nsub set_to_fail { $exit = 1; $success = 0; }\n\n# Checking silence_phones.txt -------------------------------\nprint \"Checking $dict/silence_phones.txt ...\\n\";\nif(-z \"$dict/silence_phones.txt\") {print \"--> ERROR: $dict/silence_phones.txt is empty or does not exist\\n\"; exit 1;}\nif(!open(S, \"<$dict/silence_phones.txt\")) {print \"--> ERROR: fail to open $dict/silence_phones.txt\\n\"; exit 1;}\n$idx = 1;\n%silence = ();\n$crlf = 1;\n\nprint \"--> reading $dict/silence_phones.txt\\n\";\ncheck_allowed_whitespace(\\*S) || set_to_fail();\nwhile(<S>) {\n  if (! 
s/\\n$//) {\n    print \"--> ERROR: last line '$_' of $dict/silence_phones.txt does not end in newline.\\n\";\n    set_to_fail();\n  }\n  if ($crlf == 1 && m/\\r/) {\n    print \"--> ERROR: $dict/silence_phones.txt contains Carriage Return (^M) characters.\\n\";\n    set_to_fail();\n    $crlf = 0;\n  }\n  my @col = split(\" \", $_);\n  if (@col == 0) {\n    set_to_fail();\n    print \"--> ERROR: empty line in $dict/silence_phones.txt (line $idx)\\n\";\n  }\n  foreach(0 .. @col-1) {\n    my $p = $col[$_];\n    if($silence{$p}) {\n      set_to_fail(); print \"--> ERROR: phone \\\"$p\\\" duplicates in $dict/silence_phones.txt (line $idx)\\n\";\n    } else {\n      $silence{$p} = 1;\n    }\n    # disambiguation symbols; phones ending in _B, _E, _S or _I will cause\n    # problems with word-position-dependent systems, and <eps> is obviously\n    # confusable with epsilon.\n    if ($p =~ m/^#/ || $p =~ m/_[BESI]$/ || $p eq \"<eps>\"){\n      set_to_fail();\n      print \"--> ERROR: phone \\\"$p\\\" has disallowed written form\\n\";\n    }\n  }\n  $idx ++;\n}\nclose(S);\n$success == 0 || print \"--> $dict/silence_phones.txt is OK\\n\";\nprint \"\\n\";\n\n# Checking optional_silence.txt -------------------------------\nprint \"Checking $dict/optional_silence.txt ...\\n\";\nif(-z \"$dict/optional_silence.txt\") {print \"--> ERROR: $dict/optional_silence.txt is empty or not exists\\n\"; exit 1;}\nif(!open(OS, \"<$dict/optional_silence.txt\")) {print \"--> ERROR: fail to open $dict/optional_silence.txt\\n\"; exit 1;}\n$idx = 1;\n$success = 1;\n$crlf = 1;\nprint \"--> reading $dict/optional_silence.txt\\n\";\ncheck_allowed_whitespace(\\*OS) or exit 1;\nwhile(<OS>) {\n  chomp;\n  my @col = split(\" \", $_);\n  if ($idx > 1 or @col > 1) {\n    set_to_fail(); print \"--> ERROR: only 1 phone expected in $dict/optional_silence.txt\\n\";\n  } elsif (!$silence{$col[0]}) {\n    set_to_fail(); print \"--> ERROR: phone $col[0] not found in $dict/silence_phones.txt\\n\";\n  }\n  if 
($crlf == 1 && m/\\r/) {\n    print \"--> ERROR: $dict/optional_silence.txt contains Carriage Return (^M) characters.\\n\";\n    set_to_fail();\n    $crlf = 0;\n  }\n  $idx ++;\n}\nclose(OS);\n$success == 0 || print \"--> $dict/optional_silence.txt is OK\\n\";\nprint \"\\n\";\n\n# Checking nonsilence_phones.txt -------------------------------\nprint \"Checking $dict/nonsilence_phones.txt ...\\n\";\nif(-z \"$dict/nonsilence_phones.txt\") {print \"--> ERROR: $dict/nonsilence_phones.txt is empty or not exists\\n\"; exit 1;}\nif(!open(NS, \"<$dict/nonsilence_phones.txt\")) {print \"--> ERROR: fail to open $dict/nonsilence_phones.txt\\n\"; exit 1;}\n$idx = 1;\n%nonsilence = ();\n$success = 1;\n$crlf = 1;\nprint \"--> reading $dict/nonsilence_phones.txt\\n\";\ncheck_allowed_whitespace(\\*NS) or set_to_fail();\nwhile(<NS>) {\n  if ($crlf == 1 && m/\\r/) {\n    print \"--> ERROR: $dict/nonsilence_phones.txt contains Carriage Return (^M) characters.\\n\";\n    set_to_fail();\n    $crlf = 0;\n  }\n  if (! s/\\n$//) {\n    print \"--> ERROR: last line '$_' of $dict/nonsilence_phones.txt does not end in newline.\\n\";\n    set_to_fail();\n  }\n  my @col = split(\" \", $_);\n  if (@col == 0) {\n    set_to_fail();\n    print \"--> ERROR: empty line in $dict/nonsilence_phones.txt (line $idx)\\n\";\n  }\n  foreach(0 .. 
@col-1) {\n    my $p = $col[$_];\n    if($nonsilence{$p}) {\n      set_to_fail(); print \"--> ERROR: phone \\\"$p\\\" duplicates in $dict/nonsilence_phones.txt (line $idx)\\n\";\n    } else {\n      $nonsilence{$p} = 1;\n    }\n    # phones that start with the pound sign/hash may be mistaken for\n    # disambiguation symbols; phones ending in _B, _E, _S or _I will cause\n    # problems with word-position-dependent systems, and <eps> is obviously\n    # confusable with epsilon.\n    if ($p =~ m/^#/ || $p =~ m/_[BESI]$/ || $p eq \"<eps>\"){\n      set_to_fail();\n      print \"--> ERROR: phone \\\"$p\\\" has disallowed written form\\n\";\n    }\n  }\n  $idx ++;\n}\nclose(NS);\n$success == 0 || print \"--> $dict/nonsilence_phones.txt is OK\\n\";\nprint \"\\n\";\n\n# Checking disjoint -------------------------------\nsub intersect {\n  my ($a, $b) = @_;\n  @itset = ();\n  %itset = ();\n  foreach(keys %$a) {\n    if(exists $b->{$_} and !$itset{$_}) {\n      push(@itset, $_);\n      $itset{$_} = 1;\n    }\n  }\n  return @itset;\n}\n\nprint \"Checking disjoint: silence_phones.txt, nonsilence_phones.txt\\n\";\n@itset = intersect(\\%silence, \\%nonsilence);\nif(@itset == 0) {print \"--> disjoint property is OK.\\n\";}\nelse {set_to_fail(); print \"--> ERROR: silence_phones.txt and nonsilence_phones.txt have overlap: \"; foreach(@itset) {print \"$_ \";} print \"\\n\";}\nprint \"\\n\";\n\n\nsub check_lexicon {\n  my ($lex, $num_prob_cols, $num_skipped_cols) = @_;\n  print \"Checking $lex\\n\";\n  !open(L, \"<$lex\") && print \"--> ERROR: fail to open $lex\\n\" && set_to_fail();\n  my %seen_line = ();\n  $idx = 1; $success = 1; $crlf = 1;\n  print \"--> reading $lex\\n\";\n  check_allowed_whitespace(\\*L) or set_to_fail();\n  while (<L>) {\n    if ($crlf == 1 && m/\\r/) {\n      print \"--> ERROR: $lex contains Carriage Return (^M) characters.\\n\";\n      set_to_fail();\n      $crlf = 0;\n    }\n    if (defined $seen_line{$_}) {\n      print \"--> ERROR: line '$_' of $lex is 
repeated\\n\";\n      set_to_fail();\n    }\n    $seen_line{$_} = 1;\n    if (! s/\\n$//) {\n      print \"--> ERROR: last line '$_' of $lex does not end in newline.\\n\";\n      set_to_fail();\n    }\n    my @col = split(\" \", $_);\n    $word = shift @col;\n    if (!defined $word) {\n      print \"--> ERROR: empty lexicon line in $lex\\n\"; set_to_fail();\n    }\n    if ($word eq \"<s>\" || $word eq \"</s>\" || $word eq \"<eps>\" || $word eq \"#0\") {\n      print \"--> ERROR: lexicon.txt contains forbidden word $word\\n\";\n      set_to_fail();\n    }\n    for ($n = 0; $n < $num_prob_cols; $n++) {\n      $prob = shift @col;\n      if (!($prob > 0.0 && $prob <= 1.0)) {\n        print \"--> ERROR: bad pron-prob in lexicon-line '$_', in $lex\\n\";\n        set_to_fail();\n      }\n    }\n    for ($n = 0; $n < $num_skipped_cols; $n++) { shift @col; }\n    if (@col == 0) {\n      print \"--> ERROR: lexicon.txt contains word $word with empty \";\n      print \"pronunciation.\\n\";\n      set_to_fail();\n    }\n    foreach (0 .. 
@col-1) {\n      if (!$silence{@col[$_]} and !$nonsilence{@col[$_]}) {\n        print \"--> ERROR: phone \\\"@col[$_]\\\" is not in {, non}silence.txt \";\n        print \"(line $idx)\\n\";\n        set_to_fail();\n      }\n    }\n    $idx ++;\n  }\n  close(L);\n  $success == 0 || print \"--> $lex is OK\\n\";\n  print \"\\n\";\n}\n\nif (-f \"$dict/lexicon.txt\") { check_lexicon(\"$dict/lexicon.txt\", 0, 0); }\nif (-f \"$dict/lexiconp.txt\") { check_lexicon(\"$dict/lexiconp.txt\", 1, 0); }\nif (-f \"$dict/lexiconp_silprob.txt\") {\n  # If $dict/lexiconp_silprob.txt exists, we expect $dict/silprob.txt to also\n  # exist.\n  check_lexicon(\"$dict/lexiconp_silprob.txt\", 2, 2);\n  if (-f \"$dict/silprob.txt\") {\n    !open(SP, \"<$dict/silprob.txt\") &&\n      print \"--> ERROR: fail to open $dict/silprob.txt\\n\" && set_to_fail();\n      $crlf = 1;\n    while (<SP>) {\n      if ($crlf == 1 && m/\\r/) {\n        print \"--> ERROR: $dict/silprob.txt contains Carriage Return (^M) characters.\\n\";\n        set_to_fail();\n        $crlf = 0;\n      }\n      chomp; my @col = split;\n      @col != 2 && die \"--> ERROR: bad line \\\"$_\\\"\\n\" && set_to_fail();\n      if ($col[0] eq \"<s>\" || $col[0] eq \"overall\") {\n        if (!($col[1] > 0.0 && $col[1] <= 1.0)) {\n          set_to_fail();\n          print \"--> ERROR: bad probability in $dict/silprob.txt \\\"$_\\\"\\n\";\n        }\n      } elsif ($col[0] eq \"</s>_s\" || $col[0] eq \"</s>_n\") {\n        if ($col[1] <= 0.0) {\n          set_to_fail();\n          print \"--> ERROR: bad correction term in $dict/silprob.txt \\\"$_\\\"\\n\";\n        }\n      } else {\n        print \"--> ERROR: unexpected line in $dict/silprob.txt \\\"$_\\\"\\n\";\n        set_to_fail();\n      }\n    }\n    close(SP);\n  } else {\n    set_to_fail();\n    print \"--> ERROR: expecting $dict/silprob.txt to exist\\n\";\n  }\n}\n\nif (!(-f \"$dict/lexicon.txt\" || -f \"$dict/lexiconp.txt\")) {\n  print \"--> ERROR: neither lexicon.txt nor 
lexiconp.txt exist in directory $dict\\n\";\n  set_to_fail();\n}\n\nsub check_lexicon_pair {\n  my ($lex1, $num_prob_cols1, $num_skipped_cols1,\n      $lex2, $num_prob_cols2, $num_skipped_cols2) = @_;\n  # We have checked individual lexicons already.\n  open(L1, \"<$lex1\"); open(L2, \"<$lex2\");\n  print \"Checking lexicon pair $lex1 and $lex2\\n\";\n  my $line_num = 0;\n  while(<L1>) {\n    $line_num++;\n    @A = split;\n    $line_B = <L2>;\n    if (!defined $line_B) {\n      print \"--> ERROR: $lex1 and $lex2 have different number of lines.\\n\";\n      set_to_fail(); last;\n    }\n    @B = split(\" \", $line_B);\n    # Check if the word matches.\n    if ($A[0] ne $B[0]) {\n      print \"--> ERROR: $lex1 and $lex2 mismatch at line $line_num. sorting?\\n\";\n      set_to_fail(); last;\n    }\n    shift @A; shift @B;\n    for ($n = 0; $n < $num_prob_cols1 + $num_skipped_cols1; $n ++) { shift @A; }\n    for ($n = 0; $n < $num_prob_cols2 + $num_skipped_cols2; $n ++) { shift @B; }\n    # Check if the pronunciation matches\n    if (join(\" \", @A) ne join(\" \", @B)) {\n      print \"--> ERROR: $lex1 and $lex2 mismatch at line $line_num. sorting?\\n\";\n      set_to_fail(); last;\n    }\n  }\n  $line_B = <L2>;\n  if (defined $line_B && $exit == 0) {\n    print \"--> ERROR: $lex1 and $lex2 have different number of lines.\\n\";\n    set_to_fail();\n  }\n  $success == 0 || print \"--> lexicon pair $lex1 and $lex2 match\\n\\n\";\n}\n\n# If more than one lexicon exist, we have to check if they correspond to each\n# other. 
It could be that the user overwrote one and we need to regenerate the\n# other, but we do not know which is which.\nif ( -f \"$dict/lexicon.txt\" && -f \"$dict/lexiconp.txt\") {\n  check_lexicon_pair(\"$dict/lexicon.txt\", 0, 0, \"$dict/lexiconp.txt\", 1, 0);\n}\nif ( -f \"$dict/lexiconp.txt\" && -f \"$dict/lexiconp_silprob.txt\") {\n  check_lexicon_pair(\"$dict/lexiconp.txt\", 1, 0,\n                     \"$dict/lexiconp_silprob.txt\", 2, 2);\n}\n\n# Checking extra_questions.txt -------------------------------\n%distinguished = (); # Keep track of all phone-pairs including nonsilence that\n                     # are distinguished (split apart) by extra_questions.txt,\n                     # as $distinguished{$p1,$p2} = 1.  This will be used to\n                     # make sure that we don't have pairs of phones on the same\n                     # line in nonsilence_phones.txt that can never be\n                     # distinguished from each other by questions.  (If any two\n                     # phones appear on the same line in nonsilence_phones.txt,\n                     # they share a tree root, and since the automatic\n                     # question-building treats all phones that appear on the\n                     # same line of nonsilence_phones.txt as being in the same\n                     # group, we can never distinguish them without resorting to\n                     # questions in extra_questions.txt.\nprint \"Checking $dict/extra_questions.txt ...\\n\";\nif (-s \"$dict/extra_questions.txt\") {\n  if (!open(EX, \"<$dict/extra_questions.txt\")) {\n    set_to_fail(); print \"--> ERROR: fail to open $dict/extra_questions.txt\\n\";\n  }\n  $idx = 1;\n  $success = 1;\n  $crlf = 1;\n  print \"--> reading $dict/extra_questions.txt\\n\";\n  check_allowed_whitespace(\\*EX) or set_to_fail();\n  while(<EX>) {\n    if ($crlf == 1 && m/\\r/) {\n      print \"--> ERROR: $dict/extra_questions.txt contains Carriage Return (^M) characters.\\n\";\n      
set_to_fail();\n      $crlf = 0;\n    }\n    if (! s/\\n$//) {\n      print \"--> ERROR: last line '$_' of $dict/extra_questions.txt does not end in newline.\\n\";\n      set_to_fail();\n    }\n    my @col = split(\" \", $_);\n    if (@col == 0) {\n      set_to_fail();  print \"--> ERROR: empty line in $dict/extra_questions.txt\\n\";\n    }\n    foreach (0 .. @col-1) {\n      if(!$silence{@col[$_]} and !$nonsilence{@col[$_]}) {\n        set_to_fail();  print \"--> ERROR: phone \\\"@col[$_]\\\" is not in {, non}silence_phones.txt (line $idx, block \", $_+1, \")\\n\";\n      }\n      $idx ++;\n    }\n    %col_hash = ();\n    foreach $p (@col) { $col_hash{$p} = 1; }\n    foreach $p1 (@col) {\n      # Update %distinguished hash.\n      foreach $p2 (keys %nonsilence) {\n        if (!defined $col_hash{$p2}) { # for each p1 in this question and p2 not\n                                       # in this question (and in nonsilence\n                                       # phones)... mark p1,p2 as being split apart\n          $distinguished{$p1,$p2} = 1;\n          $distinguished{$p2,$p1} = 1;\n        }\n      }\n    }\n  }\n  close(EX);\n  $success == 0 || print \"--> $dict/extra_questions.txt is OK\\n\";\n} else { print \"--> $dict/extra_questions.txt is empty (this is OK)\\n\";}\n\nif (-f \"$dict/nonterminals.txt\") {\n  open(NT, \"<$dict/nonterminals.txt\") || die \"opening $dict/nonterminals.txt\";\n  my %nonterminals = ();\n  my $line_number = 1;\n  while (<NT>) {\n    chop;\n    my @line = split(\" \", $_);\n    if (@line != 1 || ! m/^#nonterm:/ || defined $nonterminals{$line[0]}) {\n      print \"--> ERROR: bad (or duplicate) line $line_number: '$_' in $dict/nonterminals.txt\\n\"; exit 1;\n    }\n    $nonterminals{$line[0]} = 1;\n    $line_number++;\n  }\n  print \"--> $dict/nonterminals.txt is OK\\n\";\n}\n\n\n# check nonsilence_phones.txt again for phone-pairs that are never\n# distnguishable.  
(note: this situation is normal and expected for silence\n# phones, so we don't check it.)\nif(!open(NS, \"<$dict/nonsilence_phones.txt\")) {\n  print \"--> ERROR: fail to open $dict/nonsilence_phones.txt the second time\\n\"; exit 1;\n}\n\n$num_warn_nosplit = 0;\n$num_warn_nosplit_limit = 10;\nwhile(<NS>) {\n  my @col = split(\" \", $_);\n  foreach $p1 (@col) {\n    foreach $p2 (@col) {\n      if ($p1 ne $p2 && ! $distinguished{$p1,$p2}) {\n        set_to_fail();\n        if ($num_warn_nosplit <= $num_warn_nosplit_limit) {\n          print \"--> ERROR: phones $p1 and $p2 share a tree root but can never be distinguished by extra_questions.txt.\\n\";\n        }\n        if ($num_warn_nosplit == $num_warn_nosplit_limit) {\n          print \"... Not warning any more times about this issue.\\n\";\n        }\n        if ($num_warn_nosplit == 0) {\n          print \"    (note: we started checking for this only recently.  You can still build a system but\\n\";\n          print \"     phones $p1 and $p2 will be acoustically indistinguishable).\\n\";\n        }\n        $num_warn_nosplit++;\n      }\n    }\n  }\n}\n\n\nif ($exit == 1) {\n  print \"--> ERROR validating dictionary directory $dict (see detailed error \";\n  print \"messages above)\\n\\n\";\n  exit 1;\n} else {\n  print \"--> SUCCESS [validating dictionary directory $dict]\\n\\n\";\n}\n\nexit 0;\n"
  },
  {
    "path": "egs/utils/validate_lang.pl",
    "content": "#!/usr/bin/env perl\n\n# Apache 2.0.\n# Copyright  2012   Guoguo Chen\n#            2014   Neil Nelson\n#            2017   Johns Hopkins University (Jan \"Yenda\" Trmal <jtrmal@gmail.com>)\n#            2019   Dongji Gao\n#\n# Validation script for data/lang\n\n# this function reads the opened file (supplied as a first\n# parameter) into an array of lines. For each\n# line, it tests whether it's a valid utf-8 compatible\n# line. If all lines are valid utf-8, it returns the lines\n# decoded as utf-8, otherwise it assumes the file's encoding\n# is one of those 1-byte encodings, such as ISO-8859-x\n# or Windows CP-X.\n# Please recall we do not really care about\n# the actually encoding, we just need to\n# make sure the length of the (decoded) string\n# is correct (to make the output formatting looking right).\nsub get_utf8_or_bytestream {\n  use Encode qw(decode encode);\n  my $is_utf_compatible = 1;\n  my @unicode_lines;\n  my @raw_lines;\n  my $raw_text;\n  my $lineno = 0;\n  my $file = shift;\n\n  while (<$file>) {\n    $raw_text = $_;\n    last unless $raw_text;\n    if ($is_utf_compatible) {\n      my $decoded_text = eval { decode(\"UTF-8\", $raw_text, Encode::FB_CROAK) } ;\n      $is_utf_compatible = $is_utf_compatible && defined($decoded_text);\n      push @unicode_lines, $decoded_text;\n    } else {\n      #print STDERR \"WARNING: the line $raw_text cannot be interpreted as UTF-8: $decoded_text\\n\";\n      ;\n    }\n    push @raw_lines, $raw_text;\n    $lineno += 1;\n  }\n\n  if (!$is_utf_compatible) {\n    return (0, @raw_lines);\n  } else {\n    return (1, @unicode_lines);\n  }\n}\n\n# check if the given unicode string contain unicode whitespaces\n# other than the usual four: TAB, LF, CR and SPACE\nsub validate_utf8_whitespaces {\n  my $unicode_lines = shift;\n  use feature 'unicode_strings';\n  for (my $i = 0; $i < scalar @{$unicode_lines}; $i++) {\n    my $current_line = $unicode_lines->[$i];\n    if ((substr $current_line, -1) ne 
\"\\n\"){\n      print STDERR \"$0: The current line (nr. $i) has invalid newline\\n\";\n      return 1;\n    }\n    # we replace TAB, LF, CR, and SPACE\n    # this is to simplify the test\n    if ($current_line =~ /\\x{000d}/) {\n      print STDERR \"$0: The current line (nr. $i) contains CR (0x0D) character\\n\";\n      return 1;\n    }\n    $current_line =~ s/[\\x{0009}\\x{000a}\\x{0020}]/./g;\n    if ($current_line =~/\\s/) {\n      return 1;\n    }\n  }\n  return 0;\n}\n\n# checks if the text in the file (supplied as the argument) is utf-8 compatible\n# if yes, checks if it contains only allowed whitespaces. If no, then does not\n# do anything. The function seeks to the original position in the file after\n# reading the text.\nsub check_allowed_whitespace {\n  my $file = shift;\n  my $pos = tell($file);\n  (my $is_utf, my @lines) = get_utf8_or_bytestream($file);\n  seek($file, $pos, SEEK_SET);\n  if ($is_utf) {\n    my $has_invalid_whitespaces = validate_utf8_whitespaces(\\@lines);\n    print \"--> text seems to be UTF-8 or ASCII, checking whitespaces\\n\";\n    if ($has_invalid_whitespaces) {\n      print \"--> ERROR: the text containes disallowed UTF-8 whitespace character(s)\\n\";\n      return 0;\n    } else {\n      print \"--> text contains only allowed whitespaces\\n\";\n    }\n  } else {\n    print \"--> text doesn't seem to be UTF-8 or ASCII, won't check whitespaces\\n\";\n  }\n  return 1;\n}\n\n$skip_det_check = 0;\n$skip_disambig_check = 0;\n$skip_generate_words_check = 0;\n$subword_check = 0;\n\nfor ($x=0; $x <= 3; $x++) {\n  if (@ARGV > 0 && $ARGV[0] eq \"--skip-determinization-check\") {\n    $skip_det_check = 1;\n    shift @ARGV;\n  }\n  if (@ARGV > 0 && $ARGV[0] eq \"--skip-disambig-check\") {\n    $skip_disambig_check = 1;\n    shift @ARGV;\n  }\n  if (@ARGV > 0 && $ARGV[0] eq \"--skip-generate-words-check\") {\n    $skip_generate_words_check = 1;\n    shift @ARGV;\n  }\n}\n\nif (@ARGV != 1) {\n  print \"Usage: $0 [options] 
<lang_directory>\\n\";\n  print \"e.g.:  $0 data/lang\\n\";\n  print \"Options:\\n\";\n  print \" --skip-generate-words-check              (this flag causes it to skip a check of generated word sequences).\\n\";\n  print \" --skip-determinization-check             (this flag causes it to skip a time consuming check).\\n\";\n  print \" --skip-disambig-check                    (this flag causes it to skip a disambig check in phone bigram models).\\n\";\n  exit(1);\n}\n\nprint \"$0 \" . join(\" \", @ARGV) . \"\\n\";\n\n$lang = shift @ARGV;\n$exit = 0;\n$warning = 0;\n\n# Checking existence of separator file ------------------\nprint \"Checking existence of separator file\\n\";\nif (!-e \"$lang/subword_separator.txt\") {\n  print \"separator file $lang/subword_separator.txt is empty or does not exist, deal in word case.\\n\";\n} else {\n  if (!open(S, \"<$lang/subword_separator.txt\")) {\n    print \"--> ERROR: fail to open $lang/subword_separator.txt\\n\"; exit 1;\n  } else {\n    $line_num = `wc -l <$lang/subword_separator.txt`;\n    if ($line_num != 1) {\n      print \"--> ERROR, $lang/subword_separator.txt should only contain one line.\\n\"; exit 1;\n    } else {\n      while (<S>) {\n        chomp;\n        my @col = split(\" \", $_);\n        if (@col != 1) {\n          print \"--> ERROR, invalid separator.\\n\"; exit 1;\n        } else {\n         $separator = shift @col;\n         $separator_length = length $separator;\n         $subword_check = 1;\n        }\n      }\n    }\n  }\n}\n\nif (!$subword_check) {\n  $word_boundary = \"word_boundary\";\n} else {\n  $word_boundary = \"word_boundary_moved\";\n}\n\n# Checking phones.txt -------------------------------\nprint \"Checking $lang/phones.txt ...\\n\";\nif (-z \"$lang/phones.txt\") {\n  print \"--> ERROR: $lang/phones.txt is empty or does not exist\\n\"; exit 1;\n}\nif (!open(P, \"<$lang/phones.txt\")) {\n  print \"--> ERROR: fail to open $lang/phones.txt\\n\"; exit 1;\n}\n$idx = 1;\n%psymtab = 
();\ncheck_allowed_whitespace(\\*P) or exit 1;\nwhile (<P>) {\n  chomp;\n  my @col = split(\" \", $_);\n  if (@col != 2) {\n    print \"--> ERROR: expect 2 columns in $lang/phones.txt (break at line $idx)\\n\"; exit 1;\n  }\n  my $phone = shift @col;\n  my $id = shift @col;\n  $psymtab{$phone} = $id;\n  $idx ++;\n}\nclose(P);\n%pint2sym = ();\nforeach (keys %psymtab) {\n  if ($pint2sym{$psymtab{$_}}) {\n    print \"--> ERROR: ID \\\"$psymtab{$_}\\\" duplicates\\n\"; exit 1;\n  } else {\n    $pint2sym{$psymtab{$_}} = $_;\n  }\n}\nprint \"--> $lang/phones.txt is OK\\n\";\nprint \"\\n\";\n\n# Check word.txt -------------------------------\nprint \"Checking words.txt: #0 ...\\n\";\nif (-z \"$lang/words.txt\") {\n  print \"--> ERROR: $lang/words.txt is empty or does not exist\\n\"; exit 1;\n}\nif (!open(W, \"<$lang/words.txt\")) {\n  print \"--> ERROR: fail to open $lang/words.txt\\n\"; exit 1;\n}\n$idx = 1;\n%wsymtab = ();\ncheck_allowed_whitespace(\\*W) or exit 1;\nwhile (<W>) {\n  chomp;\n  my @col = split(\" \", $_);\n  if (@col != 2) {\n    print \"--> ERROR: expect 2 columns in $lang/words.txt (line $idx)\\n\"; exit 1;\n  }\n  $word = shift @col;\n  $id = shift @col;\n  $wsymtab{$word} = $id;\n  $idx ++;\n}\nclose(W);\n%wint2sym = ();\nforeach (keys %wsymtab) {\n  if ($wint2sym{$wsymtab{$_}}) {\n    print \"--> ERROR: ID \\\"$wsymtab{$_}\\\" duplicates\\n\"; exit 1;\n  } else {\n    $wint2sym{$wsymtab{$_}} = $_;\n  }\n}\nprint \"--> $lang/words.txt is OK\\n\";\nprint \"\\n\";\n\n# Checking phones/* -------------------------------\nsub check_txt_int_csl {\n  my ($cat, $symtab) = @_;\n  print \"Checking $cat.\\{txt, int, csl\\} ...\\n\";\n  if (!open(TXT, \"<$cat.txt\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $cat.txt\\n\";\n  }\n  if (!open(INT, \"<$cat.int\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $cat.int\\n\";\n  }\n  if (!open(CSL, \"<$cat.csl\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $cat.csl\\n\";\n  }\n 
 if (-z \"$cat.txt\") {\n    $warning = 1; print \"--> WARNING: $cat.txt is empty\\n\";\n  }\n  if (-z \"$cat.int\") {\n    $warning = 1; print \"--> WARNING: $cat.int is empty\\n\";\n  }\n  if (-z \"$cat.csl\") {\n    $warning = 1; print \"--> WARNING: $cat.csl is empty\\n\";\n  }\n\n  $idx1 = 1;\n  check_allowed_whitespace(\\*TXT) or $exit = 1;\n  while (<TXT>) {\n    chomp;\n    my @col = split(\" \", $_);\n    if (@col != 1) {\n      $exit = 1; return print \"--> ERROR: expect 1 column in $cat.txt (break at line $idx1)\\n\";\n    }\n    $entry[$idx1] = shift @col;\n    $idx1 ++;\n  }\n  close(TXT); $idx1 --;\n  print \"--> $idx1 entry/entries in $cat.txt\\n\";\n\n  $idx2 = 1;\n  while (<INT>) {\n    chomp;\n    my @col = split(\" \", $_);\n    if (@col != 1) {\n      $exit = 1; return print \"--> ERROR: expect 1 column in $cat.int (break at line $idx2)\\n\";\n    }\n    if ($symtab->{$entry[$idx2]} ne shift @col) {\n      $exit = 1; return print \"--> ERROR: $cat.int doesn't correspond to $cat.txt (break at line $idx2)\\n\";\n    }\n    $idx2 ++;\n  }\n  close(INT); $idx2 --;\n  if ($idx1 != $idx2) {\n    $exit = 1; return print \"--> ERROR: $cat.int doesn't correspond to $cat.txt (break at line \", $idx2+1, \")\\n\";\n  }\n  print \"--> $cat.int corresponds to $cat.txt\\n\";\n\n  $num_lines = 0;\n  while (<CSL>) {\n    chomp;\n    my @col = split(\":\", $_);\n    $num_lines++;\n    if (@col != $idx1) {\n      $exit = 1; return print \"--> ERROR: expect $idx1 block/blocks in $cat.csl (break at line $idx3)\\n\";\n    }\n    foreach (1 .. 
$idx1) {\n      if ($symtab->{$entry[$_]} ne @col[$_-1]) {\n        $exit = 1; return print \"--> ERROR: $cat.csl doesn't correspond to $cat.txt (break at line $idx3, block $_)\\n\";\n      }\n    }\n  }\n  close(CSL);\n  if ($idx1 != 0) {             # nonempty .txt,.int files\n    if ($num_lines != 1) {\n      $exit = 1;\n      return print \"--> ERROR: expect 1 line in $cat.csl\\n\";\n    }\n  } else {\n    if ($num_lines != 1 && $num_lines != 0) {\n      $exit = 1;\n      return print \"--> ERROR: expect 0 or 1 line in $cat.csl, since empty .txt,int\\n\";\n    }\n  }\n  print \"--> $cat.csl corresponds to $cat.txt\\n\";\n\n  return print \"--> $cat.\\{txt, int, csl\\} are OK\\n\";\n}\n\nsub check_txt_int {\n  my ($cat, $symtab, $sym_check) = @_;\n  print \"Checking $cat.\\{txt, int\\} ...\\n\";\n  if (-z \"$cat.txt\") {\n    $exit = 1; return print \"--> ERROR: $cat.txt is empty or does not exist\\n\";\n  }\n  if (-z \"$cat.int\") {\n    $exit = 1; return print \"--> ERROR: $cat.int is empty or does not exist\\n\";\n  }\n  if (!open(TXT, \"<$cat.txt\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $cat.txt\\n\";\n  }\n  if (!open(INT, \"<$cat.int\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $cat.int\\n\";\n  }\n\n  $idx1 = 1;\n  check_allowed_whitespace(\\*TXT) or $exit = 1;\n  while (<TXT>) {\n    chomp;\n    s/^(shared|not-shared) (split|not-split) //g;\n    s/ nonword$//g;\n    s/ begin$//g;\n    s/ end$//g;\n    s/ internal$//g;\n    s/ singleton$//g;\n    $entry[$idx1] = $_;\n    $idx1 ++;\n  }\n  close(TXT); $idx1 --;\n  print \"--> $idx1 entry/entries in $cat.txt\\n\";\n\n  my %used_syms = ();\n  $idx2 = 1;\n  while (<INT>) {\n    chomp;\n    s/^(shared|not-shared) (split|not-split) //g;\n    s/ nonword$//g;\n    s/ begin$//g;\n    s/ end$//g;\n    s/ internal$//g;\n    s/ singleton$//g;\n    my @col = split(\" \", $_);\n    @set = split(\" \", $entry[$idx2]);\n    if (@set != @col) {\n      $exit = 1; return print \"--> 
ERROR: $cat.int doesn't correspond to $cat.txt (break at line $idx2)\\n\";\n    }\n    foreach (0 .. @set-1) {\n      if ($symtab->{@set[$_]} ne @col[$_]) {\n        $exit = 1; return print \"--> ERROR: $cat.int doesn't correspond to $cat.txt (break at line $idx2, block \" ,$_+1, \")\\n\";\n      }\n      if ($sym_check && defined $used_syms{@set[$_]}) {\n        $exit = 1; return print \"--> ERROR: $cat.txt and $cat.int contain duplicate symbols (break at line $idx2, block \" ,$_+1, \")\\n\";\n      }\n      $used_syms{@set[$_]} = 1;\n    }\n    $idx2 ++;\n  }\n  close(INT); $idx2 --;\n  if ($idx1 != $idx2) {\n    $exit = 1; return print \"--> ERROR: $cat.int doesn't correspond to $cat.txt (break at line \", $idx2+1, \")\\n\";\n  }\n  print \"--> $cat.int corresponds to $cat.txt\\n\";\n\n  if ($sym_check) {\n    while ( my ($key, $value) = each(%silence) ) {\n      if (!defined $used_syms{$key}) {\n        $exit = 1; return print \"--> ERROR: $cat.txt and $cat.int do not contain all silence phones\\n\";\n      }\n    }\n    while ( my ($key, $value) = each(%nonsilence) ) {\n      if (!defined $used_syms{$key}) {\n        $exit = 1; return print \"--> ERROR: $cat.txt and $cat.int do not contain all non-silence phones\\n\";\n      }\n    }\n  }\n\n  return print \"--> $cat.\\{txt, int\\} are OK\\n\";\n}\n\n# Check disjoint and summation -------------------------------\nsub intersect {\n  my ($a, $b) = @_;\n  @itset = ();\n  %itset = ();\n  foreach (keys %$a) {\n    if (exists $b->{$_} and !$itset{$_}) {\n      push(@itset, $_);\n      $itset{$_} = 1;\n    }\n  }\n  return @itset;\n}\n\nsub check_disjoint {\n  print \"Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...\\n\";\n  if (!open(S, \"<$lang/phones/silence.txt\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $lang/phones/silence.txt\\n\";\n  }\n  if (!open(N, \"<$lang/phones/nonsilence.txt\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $lang/phones/nonsilence.txt\\n\";\n 
 }\n  if (!$skip_disambig_check && !open(D, \"<$lang/phones/disambig.txt\")) {\n    $exit = 1; return print \"--> ERROR: fail to open $lang/phones/disambig.txt\\n\";\n  }\n\n  $idx = 1;\n  while (<S>) {\n    chomp;\n    my @col = split(\" \", $_);\n    $phone = shift @col;\n    if ($silence{$phone}) {\n      $exit = 1; print \"--> ERROR: phone \\\"$phone\\\" duplicates in $lang/phones/silence.txt (line $idx)\\n\";\n    }\n    $silence{$phone} = 1;\n    push(@silence, $phone);\n    $idx ++;\n  }\n  close(S);\n\n  $idx = 1;\n  while (<N>) {\n    chomp;\n    my @col = split(\" \", $_);\n    $phone = shift @col;\n    if ($nonsilence{$phone}) {\n      $exit = 1; print \"--> ERROR: phone \\\"$phone\\\" duplicates in $lang/phones/nonsilence.txt (line $idx)\\n\";\n    }\n    $nonsilence{$phone} = 1;\n    push(@nonsilence, $phone);\n    $idx ++;\n  }\n  close(N);\n\n  $idx = 1;\n  while (<D>) {\n    chomp;\n    my @col = split(\" \", $_);\n    $phone = shift @col;\n    if ($disambig{$phone}) {\n      $exit = 1; print \"--> ERROR: phone \\\"$phone\\\" duplicates in $lang/phones/disambig.txt (line $idx)\\n\";\n    }\n    $disambig{$phone} = 1;\n    $idx ++;\n  }\n  close(D);\n\n  my @itsect1 = intersect(\\%silence, \\%nonsilence);\n  my @itsect2 = intersect(\\%silence, \\%disambig);\n  my @itsect3 = intersect(\\%disambig, \\%nonsilence);\n\n  $success = 1;\n  if (@itsect1 != 0) {\n    $success = 0;\n    $exit = 1; print \"--> ERROR: silence.txt and nonsilence.txt have intersection -- \";\n    foreach (@itsect1) {\n      print $_, \" \";\n    }\n    print \"\\n\";\n  } else {\n    print \"--> silence.txt and nonsilence.txt are disjoint\\n\";\n  }\n\n  if (@itsect2 != 0) {\n    $success = 0;\n    $exit = 1; print \"--> ERROR: silence.txt and disambig.txt have intersection -- \";\n    foreach (@itsect2) {\n      print $_, \" \";\n    }\n    print \"\\n\";\n  } else {\n    print \"--> silence.txt and disambig.txt are disjoint\\n\";\n  }\n\n  if (@itsect3 != 0) {\n    $success = 
0;\n    $exit = 1; print \"--> ERROR: disambig.txt and nonsilence.txt have intersection -- \";\n    foreach (@itsect3) {\n      print $_, \" \";\n    }\n    print \"\\n\";\n  } else {\n    print \"--> disambig.txt and nonsilence.txt are disjoint\\n\";\n  }\n\n  $success == 0 || print \"--> disjoint property is OK\\n\";\n  return;\n}\n\nsub check_summation {\n  print \"Checking summation: silence.txt, nonsilence.txt, disambig.txt ...\\n\";\n  if (scalar(keys %silence) == 0) {\n    $exit = 1; return print \"--> ERROR: $lang/phones/silence.txt is empty or does not exist\\n\";\n  }\n  if (scalar(keys %nonsilence) == 0) {\n    $exit = 1; return print \"--> ERROR: $lang/phones/nonsilence.txt is empty or does not exist\\n\";\n  }\n  if (!$skip_disambig_check && scalar(keys %disambig) == 0) {\n    $warning = 1; print \"--> WARNING: $lang/phones/disambig.txt is empty or does not exist\\n\";\n  }\n\n  %sum = (%silence, %nonsilence, %disambig);\n  $sum{\"<eps>\"} = 1;\n\n  my $ok = 1;\n  foreach $p (keys %psymtab) {\n    if (! 
defined $sum{$p} && $p !~ m/^#nonterm/) {\n      $exit = 1;  $ok = 0;  print(\"--> ERROR: phone $p is not in silence.txt, nonsilence.txt or disambig.txt...\\n\");\n    }\n  }\n\n  if ($ok) {\n    print \"--> found no unexplainable phones in phones.txt\\n\";\n  }\n  return;\n}\n\n%silence = ();\n@silence = ();\n%nonsilence = ();\n@nonsilence = ();\n%disambig = ();\ncheck_disjoint; print \"\\n\";\ncheck_summation; print \"\\n\";\n\n@list1 = (\"context_indep\", \"nonsilence\", \"silence\", \"optional_silence\");\n@list2 = (\"roots\", \"sets\");\nif (!$skip_disambig_check) {\n    push(@list1, \"disambig\");\n}\nforeach (@list1) {\n  check_txt_int_csl(\"$lang/phones/$_\", \\%psymtab); print \"\\n\";\n}\nforeach (@list2) {\n  check_txt_int(\"$lang/phones/$_\", \\%psymtab, 1); print \"\\n\";\n}\nif ((-s \"$lang/phones/extra_questions.txt\") || (-s \"$lang/phones/extra_questions.int\")) {\n  check_txt_int(\"$lang/phones/extra_questions\", \\%psymtab, 0); print \"\\n\";\n} else {\n  print \"Checking $lang/phones/extra_questions.\\{txt, int\\} ...\\n\";\n  if (!((-f \"$lang/phones/extra_questions.txt\") && (-f \"$lang/phones/extra_questions.int\"))) {\n    print \"--> ERROR: $lang/phones/extra_questions.\\{txt, int\\} do not exist (they may be empty, but should be present)\\n\\n\";\n    $exit = 1;\n  }\n}\nif (-e \"$lang/phones/$word_boundary.txt\") {\n  check_txt_int(\"$lang/phones/$word_boundary\", \\%psymtab, 0); print \"\\n\";\n}\n\n# Checking optional_silence.txt -------------------------------\nprint \"Checking optional_silence.txt ...\\n\";\n$idx = 1;\n$success = 1;\nif (-z \"$lang/phones/optional_silence.txt\") {\n  $exit = 1; $success = 0; print \"--> ERROR: $lang/phones/optional_silence.txt is empty or does not exist\\n\";\n}\nif (!open(OS, \"<$lang/phones/optional_silence.txt\")) {\n  $exit = 1; $success = 0; print \"--> ERROR: fail to open $lang/phones/optional_silence.txt\\n\";\n}\nprint \"--> reading $lang/phones/optional_silence.txt\\n\";\nwhile (<OS>) {\n  
chomp;\n  my @col = split(\" \", $_);\n  if ($idx > 1 or @col > 1) {\n    $exit = 1; print \"--> ERROR: only 1 phone expected in $lang/phones/optional_silence.txt\\n\"; $success = 0;\n  } elsif (!$silence{$col[0]}) {\n    $exit = 1; print \"--> ERROR: phone $col[0] not found in $lang/phones/silence.txt\\n\"; $success = 0;\n  }\n  $idx ++;\n}\nclose(OS);\n$success == 0 || print \"--> $lang/phones/optional_silence.txt is OK\\n\";\nprint \"\\n\";\n\nif (!$skip_disambig_check) {\n  # Check disambiguation symbols -------------------------------\n  print \"Checking disambiguation symbols: #0 and #1\\n\";\n  if (scalar(keys %disambig) == 0) {\n    $warning = 1; print \"--> WARNING: $lang/phones/disambig.txt is empty or does not exist\\n\";\n  }\n  if (exists $disambig{\"#0\"} and exists $disambig{\"#1\"}) {\n    print \"--> $lang/phones/disambig.txt has \\\"#0\\\" and \\\"#1\\\"\\n\";\n    print \"--> $lang/phones/disambig.txt is OK\\n\\n\";\n  } else {\n    print \"--> WARNING: $lang/phones/disambig.txt doesn't have \\\"#0\\\" or \\\"#1\\\";\\n\";\n    print \"-->          this would not be OK with a conventional ARPA-type language\\n\";\n    print \"-->          model or a conventional lexicon (L.fst)\\n\";\n    $warning = 1;\n  }\n}\n\n\n# Check topo -------------------------------\nprint \"Checking topo ...\\n\";\nif (-z \"$lang/topo\") {\n  $exit = 1; print \"--> ERROR: $lang/topo is empty or does not exist\\n\";\n}\nif (!open(T, \"<$lang/topo\")) {\n  $exit = 1; print \"--> ERROR: fail to open $lang/topo\\n\";\n} else {\n  $topo_ok = 1;\n  $idx = 1;\n  %phones_in_topo_int_hash = ( );\n  %phones_in_topo_hash = ( );\n  while (<T>) {\n    chomp;\n    next if (m/^<.*>[ ]*$/);\n    foreach $i (split(\" \", $_)) {\n      if (defined $phones_in_topo_int_hash{$i}) {\n        $topo_ok = 0;\n        $exit = 1; print \"--> ERROR: $lang/topo has phone $i twice\\n\";\n      }\n      if (!defined $pint2sym{$i}) {\n        $topo_ok = 0;\n        $exit = 1; print \"--> 
ERROR: $lang/topo has phone $i which is not in phones.txt\\n\";\n      }\n      $phones_in_topo_int_hash{$i} = 1;\n      $phones_in_topo_hash{$pint2sym{$i}} = 1;\n    }\n  }\n  close(T);\n  %phones_that_should_be_in_topo_hash = ();\n  foreach $p (@silence, @nonsilence) { $phones_that_should_be_in_topo_hash{$p} = 1; }\n  foreach $p (keys %phones_that_should_be_in_topo_hash) {\n    if ( ! defined $phones_in_topo_hash{$p}) {\n      $topo_ok = 0;\n      $i = $psymtab{$p};\n      $exit = 1; print \"--> ERROR: $lang/topo does not cover phone $p (label = $i)\\n\";\n    }\n  }\n  foreach $i (keys %phones_in_topo_int_hash) {\n    $p = $pint2sym{$i};\n    if ( ! defined $phones_that_should_be_in_topo_hash{$p}) {\n      $topo_ok = 0;\n      $exit = 1; print \"--> ERROR: $lang/topo covers phone $p (label = $i) which is not a real phone\\n\";\n    }\n  }\n  if ($topo_ok) {\n    print \"--> $lang/topo is OK\\n\";\n  }\n  print \"\\n\";\n}\n\n# Check word_boundary -------------------------------\n$nonword   = \"\";\n$begin     = \"\";\n$end       = \"\";\n$internal  = \"\";\n$singleton = \"\";\nif (-s \"$lang/phones/$word_boundary.txt\") {\n  print \"Checking $word_boundary.txt: silence.txt, nonsilence.txt, disambig.txt ...\\n\";\n  if (!open (W, \"<$lang/phones/$word_boundary.txt\")) {\n    $exit = 1; print \"--> ERROR: fail to open $lang/phones/$word_boundary.txt\\n\";\n  }\n  $idx = 1;\n  %wb = ();\n  while (<W>) {\n    chomp;\n    my @col;\n    if (m/^.*nonword$/  ) {\n      s/ nonword//g;    @col = split(\" \", $_); if (@col == 1) {$nonword   .= \"$col[0] \";}\n    }\n    if (m/^.*begin$/    ) {\n      s/ begin$//g;     @col = split(\" \", $_); if (@col == 1) {$begin     .= \"$col[0] \";}\n    }\n    if (m/^.*end$/      ) {\n      s/ end$//g;       @col = split(\" \", $_); if (@col == 1) {$end       .= \"$col[0] \";}\n    }\n    if (m/^.*internal$/ ) {\n      s/ internal$//g;  @col = split(\" \", $_); if (@col == 1) {$internal  .= \"$col[0] \";}\n    }\n    if (m/^.*singleton$/) 
{\n      s/ singleton$//g; @col = split(\" \", $_); if (@col == 1) {$singleton .= \"$col[0] \";}\n    }\n    if (@col != 1) {\n      $exit = 1; print \"--> ERROR: expect 1 column in $lang/phones/$word_boundary.txt (line $idx)\\n\";\n    }\n    $wb{shift @col} = 1;\n    $idx ++;\n  }\n  close(W);\n\n  @itset = intersect(\\%disambig, \\%wb);\n  $success1 = 1;\n  if (@itset != 0) {\n    $success1 = 0;\n    $exit = 1; print \"--> ERROR: $lang/phones/$word_boundary.txt has disambiguation symbols -- \";\n    foreach (@itset) {\n      print \"$_ \";\n    }\n    print \"\\n\";\n  }\n  $success1 == 0 || print \"--> $lang/phones/$word_boundary.txt doesn't include disambiguation symbols\\n\";\n\n  %sum = (%silence, %nonsilence);\n  @itset = intersect(\\%sum, \\%wb);\n  %itset = (); foreach(@itset) {$itset{$_} = 1;}\n  $success2 = 1;\n  if (@itset < scalar(keys %sum)) {\n    $success2 = 0;\n    $exit = 1; print \"--> ERROR: phones in nonsilence.txt and silence.txt but not in $word_boundary.txt -- \";\n    foreach (keys %sum) {\n      if (!$itset{$_}) {\n        print \"$_ \";\n      }\n    }\n    print \"\\n\";\n  }\n  if (@itset < scalar(keys %wb)) {\n    $success2 = 0;\n    $exit = 1; print \"--> ERROR: phones in $word_boundary.txt but not in nonsilence.txt or silence.txt -- \";\n    foreach (keys %wb) {\n      if (!$itset{$_}) {\n        print \"$_ \";\n      }\n    }\n    print \"\\n\";\n  }\n  $success2 == 0 || print \"--> $lang/phones/$word_boundary.txt is the union of nonsilence.txt and silence.txt\\n\";\n  $success1 != 1 or $success2 != 1 || print \"--> $lang/phones/$word_boundary.txt is OK\\n\";\n  print \"\\n\";\n}\n\n\n\n{\n  print \"Checking word-level disambiguation symbols...\\n\";\n  # This block checks that one of the two following conditions hold:\n  # (1) for lang diretories prepared by older versions of prepare_lang.sh:\n  #  The symbol  '#0' should appear in words.txt and phones.txt, and should\n  # or (2): the files wdisambig.txt, wdisambig_phones.int and 
wdisambig_words.int\n  #  exist, and have the expected properties (see below for details).\n\n  # note, %wdisambig_words_hash hashes from the integer word-id of word-level\n  # disambiguation symbols, to 1 if the word is a disambig symbol.\n\n  if (! -e \"$lang/phones/wdisambig.txt\") {\n    print \"--> no $lang/phones/wdisambig.txt (older prepare_lang.sh)\\n\";\n    if (exists $wsymtab{\"#0\"}) {\n      print \"--> $lang/words.txt has \\\"#0\\\"\\n\";\n      $wdisambig_words_hash{$wsymtab{\"#0\"}} = 1;\n    } else {\n      print \"--> WARNING: $lang/words.txt doesn't have \\\"#0\\\"\\n\";\n      print \"-->          (if you are using ARPA-type language models, you will normally\\n\";\n      print \"-->           need the disambiguation symbol \\\"#0\\\" to ensure determinizability)\\n\";\n    }\n  } else {\n    print \"--> $lang/phones/wdisambig.txt exists (newer prepare_lang.sh)\\n\";\n    if (!open(T, \"<$lang/phones/wdisambig.txt\")) {\n      print \"--> ERROR: fail to open $lang/phones/wdisambig.txt\\n\"; $exit = 1; return;\n    }\n    chomp(my @wdisambig = <T>);\n    close(T);\n    if (!open(W, \"<$lang/phones/wdisambig_words.int\")) {\n      print \"--> ERROR: fail to open $lang/phones/wdisambig_words.int\\n\"; $exit = 1; return;\n    }\n    chomp(my @wdisambig_words = <W>);\n    close(W);\n    if (!open(P, \"<$lang/phones/wdisambig_phones.int\")) {\n      print \"--> ERROR: fail to open $lang/phones/wdisambig_phones.int\\n\"; $exit = 1; return;\n    }\n    chomp(my @wdisambig_phones = <P>);\n    close(P);\n    my $len = @wdisambig, $len2;\n    if (($len2 = @wdisambig_words) != $len) {\n      print \"--> ERROR: files $lang/phones/wdisambig.txt and $lang/phones/wdisambig_words.int have different lengths\\n\";\n      $exit = 1; return;\n    }\n    if (($len2 = @wdisambig_phones) != $len) {\n      print \"--> ERROR: files $lang/phones/wdisambig.txt and $lang/phones/wdisambig_phones.int have different lengths\\n\";\n      $exit = 1; return;\n    }\n    for (my 
$i = 0; $i < $len; $i++) {\n      if ($wsymtab{$wdisambig[$i]} ne $wdisambig_words[$i]) {\n        my $ii = $i + 1;\n        print \"--> ERROR: line $ii of files $lang/phones/wdisambig.txt and $lang/phones/wdisambig_words.int mismatch\\n\";\n        $exit = 1; return;\n      }\n    }\n    for (my $i = 0; $i < $len; $i++) {\n      if ($psymtab{$wdisambig[$i]} ne $wdisambig_phones[$i]) {\n        my $ii = $i + 1;\n        print \"--> ERROR: line $ii of files $lang/phones/wdisambig.txt and $lang/phones/wdisambig_phones.int mismatch\\n\";\n        $exit = 1; return;\n      }\n    }\n    foreach my $i ( @wdisambig_words ) {\n      $wdisambig_words_hash{$i} = 1;\n    }\n  }\n}\n\n# Check validity of L.fst, L_disambig.fst, and word_boundary.int.\n# First we generate a random word/subword sequence. We then compile it into an FST and compose it with L.fst/L_disambig.fst.\n# In the subword case, the last subword of the sequence must be an end-subword\n# (i.e. a subword that can only appear at the end of a word, or that is a single word itself)\n# to guarantee the composition would not fail.\n# We then get the corresponding phone sequence and apply a transition matrix on it to get the number of valid boundaries.\n# In the word case, the number of valid boundaries should be equal to the number of words.\n# In the subword case, the number of valid boundaries should be equal to the number of end-subwords.\nif (-s \"$lang/phones/$word_boundary.int\") {\n  print \"Checking $word_boundary.int and disambig.int\\n\";\n  if (!open (W, \"<$lang/phones/$word_boundary.int\")) {\n    $exit = 1; print \"--> ERROR: fail to open $lang/phones/$word_boundary.int\\n\";\n  }\n  while (<W>) {\n    @A = split;\n    if (@A != 2) {\n      $exit = 1; print \"--> ERROR: bad line $_ in $lang/phones/$word_boundary.int\\n\";\n    }\n    $wbtype{$A[0]} = $A[1];\n  }\n  close(W);\n  if (!open (D, \"<$lang/phones/disambig.int\")) {\n    $exit = 1; print \"--> ERROR: fail to open $lang/phones/disambig.int\\n\";\n  }\n  while (<D>) {\n  
  @A = split;\n    if (@A != 1) {\n      $exit = 1; print \"--> ERROR: bad line $_ in $lang/phones/disambig.int\\n\";\n    }\n    $is_disambig{$A[0]} = 1;\n  }\n\n  $text = `. ./path.sh`;\n  if ($text ne \"\") {\n    print \"*** This script cannot continue because your path.sh or bash profile prints something: $text\" .\n      \"*** Please fix that and try again.\\n\";\n    exit(1);\n  }\n\n  foreach $fst (\"L.fst\", \"L_disambig.fst\") {\n    if ($skip_generate_words_check) {\n      next;\n    }\n    $wlen = int(rand(100)) + 1;\n    $end_subword = 0;\n    print \"--> generating a $wlen word/subword sequence\\n\";\n    $wordseq = \"\";\n    $sid = 0;\n    $wordseq_syms = \"\";\n    # exclude disambiguation symbols, BOS and EOS, epsilon, and\n    # grammar-related symbols from the word sequence.\n    while ($sid < ($wlen - 1)) {\n      $id = int(rand(scalar(keys %wint2sym)));\n      while (defined $wdisambig_words_hash{$id} or\n           $wint2sym{$id} eq \"<s>\" or $wint2sym{$id} eq \"</s>\" or\n           $wint2sym{$id} =~ m/^#nonterm/ or $id == 0) {\n        $id = int(rand(scalar(keys %wint2sym)));\n      }\n      $wordseq_syms = $wordseq_syms . $wint2sym{$id} . \" \";\n      $wordseq = $wordseq . \"$sid \". ($sid + 1) . 
\" $id $id 0\\n\";\n      $sid ++;\n\n      if ($subword_check) {\n        $subword = $wint2sym{$id};\n        $suffix = substr($subword, -$separator_length, $separator_length);\n        if ($suffix ne $separator) {\n          $end_subword ++;\n        }\n      }\n    } \n\n    # generate the last word (subword)\n    $id = int(rand(scalar(keys %wint2sym)));\n    if ($subword_check) {\n      $subword = $wint2sym{$id};\n      $suffix = substr($subword, -$separator_length, $separator_length);\n      # the last subword can not followed by separator  \n      while (defined $wdisambig_words_hash{$id} or\n           $wint2sym{$id} eq \"<s>\" or $wint2sym{$id} eq \"</s>\" or\n           $wint2sym{$id} =~ m/^#nonterm/ or $id == 0 or $suffix eq $separator) {\n        $id = int(rand(scalar(keys %wint2sym)));\n        $subword = $wint2sym{$id};\n        $suffix = substr($subword, -$separator_length, $separator_length);\n      }\n      $end_subword ++;\n    } else {\n      while (defined $wdisambig_words_hash{$id} or\n           $wint2sym{$id} eq \"<s>\" or $wint2sym{$id} eq \"</s>\" or\n           $wint2sym{$id} =~ m/^#nonterm/ or $id == 0) {\n       $id = int(rand(scalar(keys %wint2sym)));\n      }\n    }\n    $wordseq_syms = $wordseq_syms . $wint2sym{$id} . \" \";\n    $wordseq = $wordseq . \"$sid \". ($sid + 1) . \" $id $id 0\\n\";\n    $sid ++;\n\n    $wordseq = $wordseq . \"$sid 0\";\n    $phoneseq = `. ./path.sh; echo \\\"$wordseq\" | fstcompile | fstcompose $lang/$fst - | fstproject | fstrandgen | fstrmepsilon | fsttopsort | fstprint | awk '{if (NF > 2) {print \\$3}}';`;\n    $transition = { }; # empty assoc. array of allowed transitions between phone types.  1 means we count a word,\n    # 0 means transition is allowed.  
bos and eos are added as extra symbols here.\n    foreach $x (\"bos\", \"nonword\", \"end\", \"singleton\") {\n      $transition{$x, \"nonword\"} = 0;\n      $transition{$x, \"begin\"} = 1;\n      $transition{$x, \"singleton\"} = 1;\n      $transition{$x, \"eos\"} = 0;\n    }\n    $transition{\"begin\", \"end\"} = 0;\n    $transition{\"begin\", \"internal\"} = 0;\n    $transition{\"internal\", \"internal\"} = 0;\n    $transition{\"internal\", \"end\"} = 0;\n\n    $cur_state = \"bos\";\n    $num_words = 0;\n    foreach $phone (split (\" \", \"$phoneseq <<eos>>\")) {\n      # Note: now that we support unk-LMs (see the --unk-fst option to\n      # prepare_lang.sh), the regular L.fst may contain some disambiguation\n      # symbols.\n      if (! defined $is_disambig{$phone}) {\n        if ($phone eq \"<<eos>>\") {\n          $state = \"eos\";\n        } elsif ($phone == 0) {\n          $exit = 1; print \"--> ERROR: unexpected phone sequence=$phoneseq, wordseq=$wordseq\\n\"; last;\n        } else {\n          $state = $wbtype{$phone};\n        }\n        if (!defined $state) {\n          $exit = 1; print \"--> ERROR: phone $phone is not specified in $lang/phones/$word_boundary.int\\n\";\n          last;\n        } elsif (!defined $transition{$cur_state, $state}) {\n          $exit = 1; print \"--> ERROR: transition from state $cur_state to $state indicates error in $word_boundary.int or L.fst\\n\";\n          last;\n        } else {\n          $num_words += $transition{$cur_state, $state};\n          $cur_state = $state;\n        }\n      }\n    }\n    if (!$exit) {\n      if ($subword_check) { \n        $wlen = $end_subword;\n      }\n      if ($num_words != $wlen) {\n        $phoneseq_syms = \"\";\n        foreach my $id (split(\" \", $phoneseq)) { $phoneseq_syms = $phoneseq_syms . \" \" . 
$pint2sym{$id}; }\n        $exit = 1; print \"--> ERROR: number of reconstructed words $num_words does not match real number of words $wlen; indicates problem in $fst or $word_boundary.int.  phoneseq = $phoneseq_syms, wordseq = $wordseq_syms\\n\";\n      } else {\n        print \"--> resulting phone sequence from $fst corresponds to the word sequence\\n\";\n        print \"--> $fst is OK\\n\";\n      }\n    }\n  }\n  print \"\\n\";\n}\n\n# Check oov -------------------------------\ncheck_txt_int(\"$lang/oov\", \\%wsymtab, 0); print \"\\n\";\n\n# Check if L.fst is olabel sorted.\nif (-e \"$lang/L.fst\") {\n  $cmd = \"fstinfo $lang/L.fst | grep -E 'output label sorted.*y' > /dev/null\";\n  $res = system(\". ./path.sh; $cmd\");\n  if ($res == 0) {\n    print \"--> $lang/L.fst is olabel sorted\\n\";\n  } else {\n    print \"--> ERROR: $lang/L.fst is not olabel sorted\\n\";\n    $exit = 1;\n  }\n}\n\n# Check if L_disambig.fst is olabel sorted.\nif (-e \"$lang/L_disambig.fst\") {\n  $cmd = \"fstinfo $lang/L_disambig.fst | grep -E 'output label sorted.*y' > /dev/null\";\n  $res = system(\". ./path.sh; $cmd\");\n  if ($res == 0) {\n    print \"--> $lang/L_disambig.fst is olabel sorted\\n\";\n  } else {\n    print \"--> ERROR: $lang/L_disambig.fst is not olabel sorted\\n\";\n    $exit = 1;\n  }\n}\n\nif (-e \"$lang/G.fst\") {\n  # Check that G.fst is ilabel sorted and nonempty.\n  $text = `. ./path.sh; fstinfo $lang/G.fst`;\n  if ($? 
!= 0) {\n    print \"--> ERROR: fstinfo failed on $lang/G.fst\\n\";\n    $exit = 1;\n  }\n  if ($text =~ m/input label sorted\\s+y/) {\n    print \"--> $lang/G.fst is ilabel sorted\\n\";\n  } else {\n    print \"--> ERROR: $lang/G.fst is not ilabel sorted\\n\";\n    $exit = 1;\n  }\n  if ($text =~ m/# of states\\s+(\\d+)/) {\n    $num_states = $1;\n    if ($num_states == 0) {\n      print \"--> ERROR: $lang/G.fst is empty\\n\";\n      $exit = 1;\n    } else {\n      print \"--> $lang/G.fst has $num_states states\\n\";\n    }\n  }\n\n  # Check that G.fst is determinizable.\n  if (!$skip_det_check) {\n    # Check determinizability of G.fst\n    # fstdeterminizestar is much faster, and a more relevant test as it's what\n    # we do in the actual graph creation recipe.\n    if (-e \"$lang/G.fst\") {\n      $cmd = \"fstdeterminizestar $lang/G.fst /dev/null\";\n      $res = system(\". ./path.sh; $cmd\");\n      if ($res == 0) {\n        print \"--> $lang/G.fst is determinizable\\n\";\n      } else {\n        print \"--> ERROR: fail to determinize $lang/G.fst\\n\";\n        $exit = 1;\n      }\n    }\n  }\n\n  # Check that G.fst does not have cycles with only disambiguation symbols or\n  # epsilons on the input, or the forbidden symbols <s> and </s> (and a few\n  # related checks\n\n  if (-e \"$lang/G.fst\") {\n    system(\"utils/lang/check_g_properties.pl $lang\");\n    if ($? != 0) {\n      print \"--> ERROR: failure running check_g_properties.pl\\n\";\n      $exit = 1;\n    } else {\n      print(\"--> utils/lang/check_g_properties.pl succeeded.\\n\");\n    }\n  }\n}\n\n\nif (!$skip_det_check) {\n  if (-e \"$lang/G.fst\" && -e \"$lang/L_disambig.fst\") {\n    print \"--> Testing determinizability of L_disambig . G\\n\";\n    $output = `. ./path.sh; fsttablecompose $lang/L_disambig.fst $lang/G.fst | fstdeterminizestar | fstinfo 2>&1 `;\n    if ($output =~ m/# of states\\s*[1-9]/) {\n      print \"--> L_disambig . 
G is determinizable\\n\";\n    } else {\n      print \"--> ERROR: fail to determinize L_disambig . G.  Output is:\\n\";\n      print \"$output\\n\";\n      $exit = 1;\n    }\n  }\n}\n\nif ($exit == 1) {\n  print \"--> ERROR (see error messages above)\\n\"; exit 1;\n} else {\n  if ($warning == 1) {\n    print \"--> WARNING (check output above for warnings)\\n\"; exit 0;\n  } else {\n    print \"--> SUCCESS [validating lang directory $lang]\\n\"; exit 0;\n  }\n}\n"
  },
  {
    "path": "egs/utils/validate_text.pl",
    "content": "#!/usr/bin/env perl\n#\n#===============================================================================\n# Copyright 2017  Johns Hopkins University (author: Yenda Trmal <jtrmal@gmail.com>)\n#                 Johns Hopkins University (author: Daniel Povey)\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#  http://www.apache.org/licenses/LICENSE-2.0\n#\n# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED\n# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,\n# MERCHANTABLITY OR NON-INFRINGEMENT.\n# See the Apache 2 License for the specific language governing permissions and\n# limitations under the License.\n#===============================================================================\n\n# validation script for data/<dataset>/text\n# to be called (preferably) from utils/validate_data_dir.sh\nuse strict;\nuse warnings;\nuse utf8;\nuse Fcntl qw< SEEK_SET >;\n\n# this function reads the opened file (supplied as a first\n# parameter) into an array of lines. For each\n# line, it tests whether it's a valid utf-8 compatible\n# line. 
If all lines are valid utf-8, it returns the lines\n# decoded as utf-8, otherwise it assumes the file's encoding\n# is one of those 1-byte encodings, such as ISO-8859-x\n# or Windows CP-X.\n# Please recall we do not really care about\n# the actual encoding, we just need to\n# make sure the length of the (decoded) string\n# is correct (to make the output formatting look right).\nsub get_utf8_or_bytestream {\n  use Encode qw(decode encode);\n  my $is_utf_compatible = 1;\n  my @unicode_lines;\n  my @raw_lines;\n  my $raw_text;\n  my $lineno = 0;\n  my $file = shift;\n\n  while (<$file>) {\n    $raw_text = $_;\n    last unless $raw_text;\n    if ($is_utf_compatible) {\n      my $decoded_text = eval { decode(\"UTF-8\", $raw_text, Encode::FB_CROAK) } ;\n      $is_utf_compatible = $is_utf_compatible && defined($decoded_text);\n      push @unicode_lines, $decoded_text;\n    } else {\n      #print STDERR \"WARNING: the line $raw_text cannot be interpreted as UTF-8: $decoded_text\\n\";\n      ;\n    }\n    push @raw_lines, $raw_text;\n    $lineno += 1;\n  }\n\n  if (!$is_utf_compatible) {\n    return (0, @raw_lines);\n  } else {\n    return (1, @unicode_lines);\n  }\n}\n\n# check if the given unicode lines contain unicode whitespaces\n# other than the usual four: TAB, LF, CR and SPACE\nsub validate_utf8_whitespaces {\n  my $unicode_lines = shift;\n  use feature 'unicode_strings';\n  for (my $i = 0; $i < scalar @{$unicode_lines}; $i++) {\n    my $current_line = $unicode_lines->[$i];\n    if ((substr $current_line, -1) ne \"\\n\"){\n      print STDERR \"$0: The current line (nr. 
$i) has invalid newline\\n\";\n      return 1;\n    }\n    my @A = split(\" \", $current_line);\n    my $utt_id = $A[0];\n    # we replace TAB, LF, CR, and SPACE\n    # this is to simplify the test\n    if ($current_line =~ /\\x{000d}/) {\n      print STDERR \"$0: The line for utterance $utt_id contains CR (0x0D) character\\n\";\n      return 1;\n    }\n    $current_line =~ s/[\\x{0009}\\x{000a}\\x{0020}]/./g;\n    if ($current_line =~/\\s/) {\n      print STDERR \"$0: The line for utterance $utt_id contains disallowed Unicode whitespaces\\n\";\n      return 1;\n    }\n  }\n  return 0;\n}\n\n# checks if the text in the file (supplied as the argument) is utf-8 compatible\n# if yes, checks if it contains only allowed whitespaces. If no, then does not\n# do anything. The function seeks to the original position in the file after\n# reading the text.\nsub check_allowed_whitespace {\n  my $file = shift;\n  my $filename = shift;\n  my $pos = tell($file);\n  (my $is_utf, my @lines) = get_utf8_or_bytestream($file);\n  seek($file, $pos, SEEK_SET);\n  if ($is_utf) {\n    my $has_invalid_whitespaces = validate_utf8_whitespaces(\\@lines);\n    if ($has_invalid_whitespaces) {\n      print STDERR \"$0: ERROR: text file '$filename' contains disallowed UTF-8 whitespace character(s)\\n\";\n      return 0;\n    }\n  }\n  return 1;\n}\n\nif(@ARGV != 1) {\n  die \"Usage: validate_text.pl <text-file>\\n\" .\n      \"e.g.: validate_text.pl data/train/text\\n\";\n}\n\nmy $text = shift @ARGV;\n\nif (-z \"$text\") {\n  print STDERR \"$0: ERROR: file '$text' is empty or does not exist\\n\";\n  exit 1;\n}\n\nif(!open(FILE, \"<$text\")) {\n  print STDERR \"$0: ERROR: failed to open $text\\n\";\n  exit 1;\n}\n\ncheck_allowed_whitespace(\\*FILE, $text) or exit 1;\nclose(FILE);\n"
  },
  {
    "path": "egs/utils/write_kwslist.pl",
    "content": "#!/usr/bin/env perl\n\n# Copyright 2012  Johns Hopkins University (Author: Guoguo Chen)\n# Apache 2.0.\n#\nuse strict;\nuse warnings;\nuse Getopt::Long;\n\nmy $Usage = <<EOU;\nThis script reads the raw keyword search results [result.*] and writes them as the kwslist.xml file.\nIt can also do things like score normalization, decision making, duplicates removal, etc.\n\nUsage: utils/write_kwslist.pl [options] <raw_result_in|-> <kwslist_out|->\n e.g.: utils/write_kwslist.pl --flen=0.01 --duration=1000 --segments=data/eval/segments\n                              --normalize=true --map-utter=data/kws/utter_map raw_results kwslist.xml\n\nAllowed options:\n  --beta                      : Beta value when computing ATWV              (float,   default = 999.9)\n  --digits                    : How many digits should the score use        (int,     default = \"infinite\")\n  --duptime                   : Tolerance for duplicates                    (float,   default = 0.5)\n  --duration                  : Duration of all audio, you must set this    (float,   default = 999.9)\n  --ecf-filename              : ECF file name                               (string,  default = \"\") \n  --flen                      : Frame length                                (float,   default = 0.01)\n  --index-size                : Size of index                               (float,   default = 0)\n  --kwlist-filename           : Kwlist.xml file name                        (string,  default = \"\") \n  --language                  : Language type                               (string,  default = \"cantonese\")\n  --map-utter                 : Map utterance for evaluation                (string,  default = \"\")\n  --normalize                 : Normalize scores or not                     (boolean, default = false)\n  --Ntrue-scale               : Keyword independent scale factor for Ntrue  (float,   default = 1.0)\n  --remove-dup                : Remove duplicates                        
   (boolean, default = false)\n  --remove-NO                 : Remove the \"NO\" decision instances          (boolean, default = false)\n  --segments                  : Segments file from Kaldi                    (string,  default = \"\")\n  --system-id                 : System ID                                   (string,  default = \"\")\n  --verbose                   : Verbose level (higher --> more kws section) (integer, default = 0)\n  --YES-cutoff                : Only keep \"\\$YES-cutoff\" yeses for each kw   (int,    default = -1)\n  --nbest                     | Output upto nbest hits into the kwlist      (int,     default = -1)\n\nEOU\n\nmy $segment = \"\";\nmy $flen = 0.01;\nmy $beta = 999.9;\nmy $duration = 999.9;\nmy $language = \"cantonese\";\nmy $ecf_filename = \"\";\nmy $index_size = 0;\nmy $system_id = \"\";\nmy $normalize = \"false\";\nmy $map_utter = \"\";\nmy $Ntrue_scale = 1.0;\nmy $digits = 0;\nmy $kwlist_filename = \"\";\nmy $verbose = 0;\nmy $duptime = 0.5;\nmy $remove_dup = \"false\";\nmy $remove_NO = \"false\";\nmy $YES_cutoff = -1;\nmy $nbest_max = -1;\nGetOptions('segments=s'     => \\$segment,\n  'flen=f'         => \\$flen,\n  'beta=f'         => \\$beta,\n  'duration=f'     => \\$duration,\n  'language=s'     => \\$language,\n  'ecf-filename=s' => \\$ecf_filename,\n  'index-size=f'   => \\$index_size,\n  'system-id=s'    => \\$system_id,\n  'normalize=s'    => \\$normalize,\n  'map-utter=s'    => \\$map_utter,\n  'Ntrue-scale=f'  => \\$Ntrue_scale,\n  'digits=i'       => \\$digits,\n  'kwlist-filename=s' => \\$kwlist_filename,\n  'verbose=i'         => \\$verbose,\n  'duptime=f'         => \\$duptime,\n  'remove-dup=s'      => \\$remove_dup,\n  'YES-cutoff=i'      => \\$YES_cutoff,\n  'remove-NO=s'       => \\$remove_NO,\n  'nbest=i'           => \\$nbest_max) or die \"Cannot continue\\n\";\n\n($normalize eq \"true\" || $normalize eq \"false\") || die \"$0: Bad value for option --normalize\\n\";\n($remove_dup eq \"true\" || 
$remove_dup eq \"false\") || die \"$0: Bad value for option --remove-dup\\n\";\n($remove_NO eq \"true\" || $remove_NO eq \"false\") || die \"$0: Bad value for option --remove-NO\\n\";\n\nif ($segment) {\n  open(SEG, \"<$segment\") || die \"$0: Fail to open segment file $segment\\n\";\n}\n\nif ($map_utter) {\n  open(UTT, \"<$map_utter\") || die \"$0: Fail to open utterance table $map_utter\\n\";\n}\n\nif (@ARGV != 2) {\n  die $Usage;\n}\n\n# Get parameters\nmy $filein = shift @ARGV;\nmy $fileout = shift @ARGV;\n\n# Get input source\nmy $source = \"\";\nif ($filein eq \"-\") {\n  $source = \"STDIN\";\n} else {\n  open(I, \"<$filein\") || die \"$0: Fail to open input file $filein\\n\";\n  $source = \"I\";\n}\n\n# Get symbol table and start time\nmy %tbeg;\nif ($segment) {\n  while (<SEG>) {\n    chomp;\n    my @col = split(\" \", $_);\n    @col == 4 || die \"$0: Bad number of columns in $segment \\\"$_\\\"\\n\";\n    $tbeg{$col[0]} = $col[2];\n  }\n}\n\n# Get utterance mapper\nmy %utter_mapper;\nif ($map_utter) {\n  while (<UTT>) {\n    chomp;\n    my @col = split(\" \", $_);\n    @col == 2 || die \"$0: Bad number of columns in $map_utter \\\"$_\\\"\\n\";\n    $utter_mapper{$col[0]} = $col[1];\n  }\n}\n\n# Function for printing Kwslist.xml\nsub PrintKwslist {\n  my ($info, $KWS) = @_;\n\n  my $kwslist = \"\";\n\n  # Start printing\n  $kwslist .= \"<kwslist kwlist_filename=\\\"$info->[0]\\\" language=\\\"$info->[1]\\\" system_id=\\\"$info->[2]\\\">\\n\";\n  my $prev_kw = \"\";\n  my $nbest = $nbest_max;\n  foreach my $kwentry (@{$KWS}) {\n    # Skip further hits for this keyword once the nbest budget is used up\n    if (($prev_kw eq $kwentry->[0])  && ($nbest <= 0) && ($nbest_max > 0)) {\n      next;\n    }\n    if ($prev_kw ne $kwentry->[0]) {\n      if ($prev_kw ne \"\") {$kwslist .= \"  </detected_kwlist>\\n\";}\n      $kwslist .= \"  <detected_kwlist kwid=\\\"$kwentry->[0]\\\" search_time=\\\"1\\\" oov_count=\\\"0\\\">\\n\";\n      $prev_kw = $kwentry->[0];\n      $nbest = $nbest_max;\n    }\n    $nbest -= 1 if $nbest_max > 0;\n    my 
$score = sprintf(\"%g\", $kwentry->[5]);\n    $kwslist .= \"    <kw file=\\\"$kwentry->[1]\\\" channel=\\\"$kwentry->[2]\\\" tbeg=\\\"$kwentry->[3]\\\" dur=\\\"$kwentry->[4]\\\" score=\\\"$score\\\" decision=\\\"$kwentry->[6]\\\"\";\n    if (defined($kwentry->[7])) {$kwslist .= \" threshold=\\\"$kwentry->[7]\\\"\";}\n    if (defined($kwentry->[8])) {$kwslist .= \" raw_score=\\\"$kwentry->[8]\\\"\";}\n    $kwslist .= \"/>\\n\";\n  }\n  if ($prev_kw ne \"\") {$kwslist .= \"  </detected_kwlist>\\n\";}\n  $kwslist .= \"</kwslist>\\n\";\n\n  return $kwslist;\n}\n\n# Function for sorting\nsub KwslistOutputSort {\n  if ($a->[0] ne $b->[0]) {\n    if ($a->[0] =~ m/[0-9]+$/ && $b->[0] =~ m/[0-9]+$/) {\n      ($a->[0] =~ /([0-9]*)$/)[0] <=> ($b->[0] =~ /([0-9]*)$/)[0]\n    } else {\n      $a->[0] cmp $b->[0];\n    }\n  } elsif ($a->[5] ne $b->[5]) {\n    $b->[5] <=> $a->[5];\n  } else {\n    $a->[1] cmp $b->[1];\n  }\n}\nsub KwslistDupSort {\n  my ($a, $b, $duptime) = @_;\n  if ($a->[0] ne $b->[0]) {\n    $a->[0] cmp $b->[0];\n  } elsif ($a->[1] ne $b->[1]) {\n    $a->[1] cmp $b->[1];\n  } elsif ($a->[2] ne $b->[2]) {\n    $a->[2] cmp $b->[2];\n  } elsif (abs($a->[3]-$b->[3]) >= $duptime){\n    $a->[3] <=> $b->[3];\n  } elsif ($a->[5] ne $b->[5]) {\n    $b->[5] <=> $a->[5];\n  } else {\n    $b->[4] <=> $a->[4];\n  }\n}\n\n# Processing\nmy @KWS;\nwhile (<$source>) {\n  chomp;\n  my @col = split(\" \", $_);\n  @col == 5 || die \"$0: Bad number of columns in raw results \\\"$_\\\"\\n\";\n  my $kwid = shift @col;\n  my $utter = $col[0];\n  my $start = sprintf(\"%.2f\", $col[1]*$flen);\n  my $dur = sprintf(\"%.2f\", $col[2]*$flen-$start);\n  my $score = exp(-$col[3]);\n\n  if ($segment) {\n    $start = sprintf(\"%.2f\", $start+$tbeg{$utter});\n  }\n  if ($map_utter) {\n    my $utter_x = $utter_mapper{$utter};\n    die \"Unmapped utterance $utter\\n\" unless $utter_x;\n    $utter = $utter_x;\n  }\n\n  push(@KWS, [$kwid, $utter, 1, $start, $dur, $score, \"\"]);\n}\n\nmy %Ntrue = 
();\nforeach my $kwentry (@KWS) {\n  if (!defined($Ntrue{$kwentry->[0]})) {\n    $Ntrue{$kwentry->[0]} = 0.0;\n  }\n  $Ntrue{$kwentry->[0]} += $kwentry->[5];\n}\n\n# Scale the Ntrue\nmy %threshold;\nforeach my $key (keys %Ntrue) {\n  $Ntrue{$key} *= $Ntrue_scale;\n  $threshold{$key} = $Ntrue{$key}/($duration/$beta+($beta-1)/$beta*$Ntrue{$key});\n}\n\n# Removing duplicates\nif ($remove_dup eq \"true\") {\n  my @tmp = sort {KwslistDupSort($a, $b, $duptime)} @KWS;\n  @KWS = ();\n  if (@tmp >= 1) {push(@KWS, $tmp[0])};\n  for (my $i = 1; $i < scalar(@tmp); $i ++) {\n    my $prev = $KWS[-1];\n    my $curr = $tmp[$i];\n    if ((abs($prev->[3]-$curr->[3]) < $duptime ) &&\n        ($prev->[2] eq $curr->[2]) &&\n        ($prev->[1] eq $curr->[1]) &&\n        ($prev->[0] eq $curr->[0])) {\n      next;\n    } else {\n      push(@KWS, $curr);\n    }\n  }\n}\n\nmy $format_string = \"%g\";\nif ($digits > 0) {\n  $format_string = \"%.\" . $digits .\"f\";\n}\n\nmy @info = ($kwlist_filename, $language, $system_id);\nmy %YES_count;\nforeach my $kwentry (@KWS) {\n  my $threshold = $threshold{$kwentry->[0]};\n  if ($kwentry->[5] > $threshold) {\n    $kwentry->[6] = \"YES\";\n    if (defined($YES_count{$kwentry->[0]})) {\n      $YES_count{$kwentry->[0]} ++;\n    } else {\n      $YES_count{$kwentry->[0]} = 1;\n    }\n  } else {\n    $kwentry->[6] = \"NO\";\n    if (!defined($YES_count{$kwentry->[0]})) {\n      $YES_count{$kwentry->[0]} = 0;\n    }\n  }\n  if ($verbose > 0) {\n    push(@{$kwentry}, sprintf(\"%g\", $threshold));\n  }\n  if ($normalize eq \"true\") {\n    if ($verbose > 0) {\n      push(@{$kwentry}, $kwentry->[5]);\n    }\n    my $numerator = (1-$threshold)*$kwentry->[5];\n    my $denominator = (1-$threshold)*$kwentry->[5]+(1-$kwentry->[5])*$threshold;\n    if ($denominator != 0) {\n      $kwentry->[5] = sprintf($format_string, $numerator/$denominator);\n    } else {\n      $kwentry->[5] = sprintf($format_string, $kwentry->[5]);\n    }\n  } else {\n    $kwentry->[5] = 
sprintf($format_string, $kwentry->[5]);\n  }\n}\n\n# Output sorting\nmy @tmp = sort KwslistOutputSort @KWS;\n\n# Process the YES-cutoff. Note that you don't need this for the normal cases where\n# hits and false alarms are balanced\nif ($YES_cutoff != -1) {\n  my $count = 1;\n  for (my $i = 1; $i < scalar(@tmp); $i ++) { \n    if ($tmp[$i]->[0] ne $tmp[$i-1]->[0]) {\n      $count = 1;\n      next;\n    }\n    if ($YES_count{$tmp[$i]->[0]} > $YES_cutoff*2) {\n      $tmp[$i]->[6] = \"NO\";\n      $tmp[$i]->[5] = 0;\n      next;\n    }\n    if (($count == $YES_cutoff) && ($tmp[$i]->[6] eq \"YES\")) {\n      $tmp[$i]->[6] = \"NO\";\n      $tmp[$i]->[5] = 0;\n      next;\n    }\n    if ($tmp[$i]->[6] eq \"YES\") {\n      $count ++;\n    }\n  }\n}\n\n# Process the remove-NO decision\nif ($remove_NO eq \"true\") {\n  my @KWS = @tmp;\n  @tmp = ();\n  for (my $i = 0; $i < scalar(@KWS); $i ++) {\n    if ($KWS[$i]->[6] eq \"YES\") {\n      push(@tmp, $KWS[$i]);\n    }\n  }\n}\n\n# Printing\nmy $kwslist = PrintKwslist(\\@info, \\@tmp);\n\nif ($segment) {close(SEG);}\nif ($map_utter) {close(UTT);}\nif ($filein  ne \"-\") {close(I);}\nif ($fileout eq \"-\") {\n    print $kwslist;\n} else {\n  open(O, \">$fileout\") || die \"$0: Fail to open output file $fileout\\n\";\n  print O $kwslist;\n  close(O);\n}\n"
  },
  {
    "path": "env/build_env.sh",
    "content": "# Author: Jinchuan Tian; tianjinchuan@stu.pku.edu.cn\n# Build the environment for ASR repositry https://github.com/jctian98/e2e_lfmmi\n\n# Here is only an example, you may need to revise this script to suit your machine\n# This script can hardly run automatically. You may need to run it line-by-line\n\n# Our system:\n# Centos 7; GCC 7.3.1\n# Python; Pytorch 1.7.1, \n\nrootdir=/home/tian/tools/opensource\nstage=$1\nnj=48\n\ncd $rootdir\nif [ ${stage} -le 1 ]; then\n  echo \"Install GCC 7.3.1 and system dependency. You need root account for this\"\n  yum install -y cmake sox libsndfile ffmpeg flac\n\n  yum install -y centos-release-scl\n  yum install -y devtoolset-7\n  scl enable devtoolset-7 bash\n  # After this, run 'gcc -v' to ensure the GCC version is correct\nfi\n\nif [ ${stage} -le 2 ]; then\n  echo \"Install Kaldi and its auxiliary tools\"\n  git clone https://github.com/kaldi-asr/kaldi.git\n  cd kaldi/tools/\n  bash extras/check_dependencies.sh # make sure it's ok\n\n  make -j $nj\n  cd ../src/\n  ./configure --shared\n  make depend -j $nj\n  make -j $nj\n\n  # Additionally, you need kaldi_lm to train word-level N-gram LM\n  cd ../tools\n  bash extras/install_kaldi_lm.sh\nfi\n\nif [ ${stage} -le 3 ]; then\n  echo \"Install Espnet environment\"\n  # git clone https://github.com/espnet/espnet\n  cd $rootdir/espnet/tools\n  ln -s $rootdir/kaldi .\n\n  # Build espnet environment. You may choose other versions\n  CONDA_TOOLS_DIR=$(dirname ${CONDA_EXE})/..\n  ./setup_anaconda.sh ${CONDA_TOOLS_DIR} lfmmi 3.8\n  make TH_VERSION=1.7.1 CUDA_VERSION=10.1 \n\n  # NT warpper is required if you will run NT examples\n  # Installing this warpper is difficult. 
This issue might be\n  # helpful: https://github.com/HawkAaron/warp-transducer/pull/90\n  # Also, during our installation the PyTorch test did not pass\n  # because the gradients did not match the expected values\n  installers/install_warp-transducer.sh \n\n  # Uninstall the standard espnet package so that our code is used instead\n  pip3 uninstall espnet\nfi\n\nif [ ${stage} -le 4 ]; then\n  echo \"Install k2 library\"\n  conda install -c k2-fsa -c pytorch -c conda-forge k2 python=3.8 cudatoolkit=10.1 pytorch=1.7.1\n\nfi\n\nif [ ${stage} -le 5 ]; then\n  echo \"Install other python libraries\"\n  pip3 install kaldilm chainer==6.0.0 kaldialign graphviz lhotse numpy==1.20 \nfi\n\n"
  },
  {
    "path": "kaldi",
    "content": "../kaldi/"
  },
  {
    "path": "lm/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "lm/chainer_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "lm/chainer_backend/extlm.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Mitsubishi Electric Research Laboratories (Takaaki Hori)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport math\n\nimport chainer\nimport chainer.functions as F\nfrom espnet.lm.lm_utils import make_lexical_tree\n\n\n# Definition of a multi-level (subword/word) language model\nclass MultiLevelLM(chainer.Chain):\n    logzero = -10000000000.0\n    zero = 1.0e-10\n\n    def __init__(\n        self,\n        wordlm,\n        subwordlm,\n        word_dict,\n        subword_dict,\n        subwordlm_weight=0.8,\n        oov_penalty=1.0,\n        open_vocab=True,\n    ):\n        super(MultiLevelLM, self).__init__()\n        self.wordlm = wordlm\n        self.subwordlm = subwordlm\n        self.word_eos = word_dict[\"<eos>\"]\n        self.word_unk = word_dict[\"<unk>\"]\n        self.xp_word_eos = self.xp.full(1, self.word_eos, \"i\")\n        self.xp_word_unk = self.xp.full(1, self.word_unk, \"i\")\n        self.space = subword_dict[\"<space>\"]\n        self.eos = subword_dict[\"<eos>\"]\n        self.lexroot = make_lexical_tree(word_dict, subword_dict, self.word_unk)\n        self.log_oov_penalty = math.log(oov_penalty)\n        self.open_vocab = open_vocab\n        self.subword_dict_size = len(subword_dict)\n        self.subwordlm_weight = subwordlm_weight\n        self.normalized = True\n\n    def __call__(self, state, x):\n        # update state with input label x\n        if state is None:  # make initial states and log-prob vectors\n            wlm_state, z_wlm = self.wordlm(None, self.xp_word_eos)\n            wlm_logprobs = F.log_softmax(z_wlm).data\n            clm_state, z_clm = self.subwordlm(None, x)\n            log_y = F.log_softmax(z_clm).data * self.subwordlm_weight\n            new_node = self.lexroot\n            clm_logprob = 0.0\n            xi = self.space\n        else:\n            clm_state, wlm_state, wlm_logprobs, node, log_y, clm_logprob = state\n            xi = 
int(x)\n            if xi == self.space:  # inter-word transition\n                if node is not None and node[1] >= 0:  # check if the node is word end\n                    w = self.xp.full(1, node[1], \"i\")\n                else:  # this node is not a word end, which means <unk>\n                    w = self.xp_word_unk\n                # update wordlm state and log-prob vector\n                wlm_state, z_wlm = self.wordlm(wlm_state, w)\n                wlm_logprobs = F.log_softmax(z_wlm).data\n                new_node = self.lexroot  # move to the tree root\n                clm_logprob = 0.0\n            elif node is not None and xi in node[0]:  # intra-word transition\n                new_node = node[0][xi]\n                clm_logprob += log_y[0, xi]\n            elif self.open_vocab:  # if no path in the tree, enter open-vocabulary mode\n                new_node = None\n                clm_logprob += log_y[0, xi]\n            else:  # if open_vocab flag is disabled, return 0 probabilities\n                log_y = self.xp.full((1, self.subword_dict_size), self.logzero, \"f\")\n                return (clm_state, wlm_state, None, log_y, 0.0), log_y\n\n            clm_state, z_clm = self.subwordlm(clm_state, x)\n            log_y = F.log_softmax(z_clm).data * self.subwordlm_weight\n\n        # apply word-level probabilities for <space> and <eos> labels\n        if xi != self.space:\n            if new_node is not None and new_node[1] >= 0:  # if new node is word end\n                wlm_logprob = wlm_logprobs[:, new_node[1]] - clm_logprob\n            else:\n                wlm_logprob = wlm_logprobs[:, self.word_unk] + self.log_oov_penalty\n            log_y[:, self.space] = wlm_logprob\n            log_y[:, self.eos] = wlm_logprob\n        else:\n            log_y[:, self.space] = self.logzero\n            log_y[:, self.eos] = self.logzero\n\n        return (clm_state, wlm_state, wlm_logprobs, new_node, log_y, clm_logprob), log_y\n\n    def final(self, 
state):\n        clm_state, wlm_state, wlm_logprobs, node, log_y, clm_logprob = state\n        if node is not None and node[1] >= 0:  # check if the node is word end\n            w = self.xp.full(1, node[1], \"i\")\n        else:  # this node is not a word end, which means <unk>\n            w = self.xp_word_unk\n        wlm_state, z_wlm = self.wordlm(wlm_state, w)\n        return F.log_softmax(z_wlm).data[:, self.word_eos]\n\n\n# Definition of a look-ahead word language model\nclass LookAheadWordLM(chainer.Chain):\n    logzero = -10000000000.0\n    zero = 1.0e-10\n\n    def __init__(\n        self, wordlm, word_dict, subword_dict, oov_penalty=0.0001, open_vocab=True\n    ):\n        super(LookAheadWordLM, self).__init__()\n        self.wordlm = wordlm\n        self.word_eos = word_dict[\"<eos>\"]\n        self.word_unk = word_dict[\"<unk>\"]\n        self.xp_word_eos = self.xp.full(1, self.word_eos, \"i\")\n        self.xp_word_unk = self.xp.full(1, self.word_unk, \"i\")\n        self.space = subword_dict[\"<space>\"]\n        self.eos = subword_dict[\"<eos>\"]\n        self.lexroot = make_lexical_tree(word_dict, subword_dict, self.word_unk)\n        self.oov_penalty = oov_penalty\n        self.open_vocab = open_vocab\n        self.subword_dict_size = len(subword_dict)\n        self.normalized = True\n\n    def __call__(self, state, x):\n        # update state with input label x\n        if state is None:  # make initial states and cumlative probability vector\n            wlm_state, z_wlm = self.wordlm(None, self.xp_word_eos)\n            cumsum_probs = self.xp.cumsum(F.softmax(z_wlm).data, axis=1)\n            new_node = self.lexroot\n            xi = self.space\n        else:\n            wlm_state, cumsum_probs, node = state\n            xi = int(x)\n            if xi == self.space:  # inter-word transition\n                if node is not None and node[1] >= 0:  # check if the node is word end\n                    w = self.xp.full(1, node[1], \"i\")\n          
      else:  # this node is not a word end, which means <unk>\n                    w = self.xp_word_unk\n                # update wordlm state and cumlative probability vector\n                wlm_state, z_wlm = self.wordlm(wlm_state, w)\n                cumsum_probs = self.xp.cumsum(F.softmax(z_wlm).data, axis=1)\n                new_node = self.lexroot  # move to the tree root\n            elif node is not None and xi in node[0]:  # intra-word transition\n                new_node = node[0][xi]\n            elif self.open_vocab:  # if no path in the tree, enter open-vocabulary mode\n                new_node = None\n            else:  # if open_vocab flag is disabled, return 0 probabilities\n                log_y = self.xp.full((1, self.subword_dict_size), self.logzero, \"f\")\n                return (wlm_state, None, None), log_y\n\n        if new_node is not None:\n            succ, wid, wids = new_node\n            # compute parent node probability\n            sum_prob = (\n                (cumsum_probs[:, wids[1]] - cumsum_probs[:, wids[0]])\n                if wids is not None\n                else 1.0\n            )\n            if sum_prob < self.zero:\n                log_y = self.xp.full((1, self.subword_dict_size), self.logzero, \"f\")\n                return (wlm_state, cumsum_probs, new_node), log_y\n            # set <unk> probability as a default value\n            unk_prob = (\n                cumsum_probs[:, self.word_unk] - cumsum_probs[:, self.word_unk - 1]\n            )\n            y = self.xp.full(\n                (1, self.subword_dict_size), unk_prob * self.oov_penalty, \"f\"\n            )\n            # compute transition probabilities to child nodes\n            for cid, nd in succ.items():\n                y[:, cid] = (\n                    cumsum_probs[:, nd[2][1]] - cumsum_probs[:, nd[2][0]]\n                ) / sum_prob\n            # apply word-level probabilies for <space> and <eos> labels\n            if wid >= 0:\n                
wlm_prob = (cumsum_probs[:, wid] - cumsum_probs[:, wid - 1]) / sum_prob\n                y[:, self.space] = wlm_prob\n                y[:, self.eos] = wlm_prob\n            elif xi == self.space:\n                y[:, self.space] = self.zero\n                y[:, self.eos] = self.zero\n            log_y = self.xp.log(\n                self.xp.clip(y, self.zero, None)\n            )  # clip to avoid log(0)\n        else:  # if no path in the tree, transition probability is one\n            log_y = self.xp.zeros((1, self.subword_dict_size), \"f\")\n        return (wlm_state, cumsum_probs, new_node), log_y\n\n    def final(self, state):\n        wlm_state, cumsum_probs, node = state\n        if node is not None and node[1] >= 0:  # check if the node is word end\n            w = self.xp.full(1, node[1], \"i\")\n        else:  # this node is not a word end, which means <unk>\n            w = self.xp_word_unk\n        wlm_state, z_wlm = self.wordlm(wlm_state, w)\n        return F.log_softmax(z_wlm).data[:, self.word_eos]\n"
  },
  {
    "path": "lm/chainer_backend/lm.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# This code is ported from the following implementation written in Torch.\n# https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb_custom_loop.py\n\n\nimport copy\nimport json\nimport logging\nimport numpy as np\nimport six\n\nimport chainer\nfrom chainer.dataset import convert\nimport chainer.functions as F\nimport chainer.links as L\n\n# for classifier link\nfrom chainer.functions.loss import softmax_cross_entropy\nfrom chainer import link\nfrom chainer import reporter\nfrom chainer import training\nfrom chainer.training import extensions\n\nfrom espnet.lm.lm_utils import compute_perplexity\nfrom espnet.lm.lm_utils import count_tokens\nfrom espnet.lm.lm_utils import MakeSymlinkToBestModel\nfrom espnet.lm.lm_utils import ParallelSentenceIterator\nfrom espnet.lm.lm_utils import read_tokens\n\nimport espnet.nets.chainer_backend.deterministic_embed_id as DL\nfrom espnet.nets.lm_interface import LMInterface\nfrom espnet.optimizer.factory import dynamic_import_optimizer\nfrom espnet.scheduler.chainer import ChainerScheduler\nfrom espnet.scheduler.scheduler import dynamic_import_scheduler\n\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom tensorboardX import SummaryWriter\n\nfrom espnet.utils.deterministic_utils import set_deterministic_chainer\nfrom espnet.utils.training.evaluator import BaseEvaluator\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\n\n# TODO(karita): reimplement RNNLM with new interface\nclass DefaultRNNLM(LMInterface, link.Chain):\n    \"\"\"Default RNNLM wrapper to compute reduce framewise loss values.\n\n    Args:\n        n_vocab (int): The size of the vocabulary\n        args (argparse.Namespace): 
configurations. see `add_arguments`\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        parser.add_argument(\n            \"--type\",\n            type=str,\n            default=\"lstm\",\n            nargs=\"?\",\n            choices=[\"lstm\", \"gru\"],\n            help=\"Which type of RNN to use\",\n        )\n        parser.add_argument(\n            \"--layer\", \"-l\", type=int, default=2, help=\"Number of hidden layers\"\n        )\n        parser.add_argument(\n            \"--unit\", \"-u\", type=int, default=650, help=\"Number of hidden units\"\n        )\n        return parser\n\n\nclass ClassifierWithState(link.Chain):\n    \"\"\"A wrapper for a chainer RNNLM\n\n    :param link.Chain predictor : The RNNLM\n    :param function lossfun: The loss function to use\n    :param int/str label_key:\n    \"\"\"\n\n    def __init__(\n        self,\n        predictor,\n        lossfun=softmax_cross_entropy.softmax_cross_entropy,\n        label_key=-1,\n    ):\n        if not (isinstance(label_key, (int, str))):\n            raise TypeError(\"label_key must be int or str, but is %s\" % type(label_key))\n\n        super(ClassifierWithState, self).__init__()\n        self.lossfun = lossfun\n        self.y = None\n        self.loss = None\n        self.label_key = label_key\n\n        with self.init_scope():\n            self.predictor = predictor\n\n    def __call__(self, state, *args, **kwargs):\n        \"\"\"Computes the loss value for an input and label pair.\n\n            It also computes accuracy and stores it to the attribute.\n            When ``label_key`` is ``int``, the corresponding element in ``args``\n            is treated as ground truth labels. 
And when it is ``str``, the\n            element in ``kwargs`` is used.\n            All elements of ``args`` and ``kwargs`` except the ground truth\n            labels are features.\n            It feeds the features to the predictor and compares the result\n            with the ground truth labels.\n\n        :param state : The LM state\n        :param list[chainer.Variable] args : Input minibatch\n        :param dict[chainer.Variable] kwargs : Input minibatch\n        :return loss value\n        :rtype chainer.Variable\n        \"\"\"\n\n        if isinstance(self.label_key, int):\n            if not (-len(args) <= self.label_key < len(args)):\n                msg = \"Label key %d is out of bounds\" % self.label_key\n                raise ValueError(msg)\n            t = args[self.label_key]\n            if self.label_key == -1:\n                args = args[:-1]\n            else:\n                args = args[: self.label_key] + args[self.label_key + 1 :]\n        elif isinstance(self.label_key, str):\n            if self.label_key not in kwargs:\n                msg = 'Label key \"%s\" is not found' % self.label_key\n                raise ValueError(msg)\n            t = kwargs[self.label_key]\n            del kwargs[self.label_key]\n\n        self.y = None\n        self.loss = None\n        state, self.y = self.predictor(state, *args, **kwargs)\n        self.loss = self.lossfun(self.y, t)\n        return state, self.loss\n\n    def predict(self, state, x):\n        \"\"\"Predict log probabilities for given state and input x using the predictor\n\n        :param state : the state\n        :param x : the input\n        :return a tuple (state, log prob vector)\n        :rtype cupy/numpy array\n        \"\"\"\n        if hasattr(self.predictor, \"normalized\") and self.predictor.normalized:\n            return self.predictor(state, x)\n        else:\n            state, z = self.predictor(state, x)\n            return state, F.log_softmax(z).data\n\n    def final(self, 
state):\n        \"\"\"Predict final log probabilities for given state using the predictor\n\n        :param state : the state\n        :return log probability vector\n        :rtype cupy/numpy array\n\n        \"\"\"\n        if hasattr(self.predictor, \"final\"):\n            return self.predictor.final(state)\n        else:\n            return 0.0\n\n\n# Definition of a recurrent net for language modeling\nclass RNNLM(chainer.Chain):\n    \"\"\"A chainer RNNLM\n\n    :param int n_vocab: The size of the vocabulary\n    :param int n_layers: The number of layers to create\n    :param int n_units: The number of units per layer\n    :param str type: The RNN type\n    \"\"\"\n\n    def __init__(self, n_vocab, n_layers, n_units, typ=\"lstm\"):\n        super(RNNLM, self).__init__()\n        with self.init_scope():\n            self.embed = DL.EmbedID(n_vocab, n_units)\n            self.rnn = (\n                chainer.ChainList(\n                    *[L.StatelessLSTM(n_units, n_units) for _ in range(n_layers)]\n                )\n                if typ == \"lstm\"\n                else chainer.ChainList(\n                    *[L.StatelessGRU(n_units, n_units) for _ in range(n_layers)]\n                )\n            )\n            self.lo = L.Linear(n_units, n_vocab)\n\n        for param in self.params():\n            param.data[...] 
= np.random.uniform(-0.1, 0.1, param.data.shape)\n        self.n_layers = n_layers\n        self.n_units = n_units\n        self.typ = typ\n\n    def __call__(self, state, x):\n        if state is None:\n            if self.typ == \"lstm\":\n                state = {\"c\": [None] * self.n_layers, \"h\": [None] * self.n_layers}\n            else:\n                state = {\"h\": [None] * self.n_layers}\n\n        h = [None] * self.n_layers\n        emb = self.embed(x)\n        if self.typ == \"lstm\":\n            c = [None] * self.n_layers\n            c[0], h[0] = self.rnn[0](state[\"c\"][0], state[\"h\"][0], F.dropout(emb))\n            for n in six.moves.range(1, self.n_layers):\n                c[n], h[n] = self.rnn[n](\n                    state[\"c\"][n], state[\"h\"][n], F.dropout(h[n - 1])\n                )\n            state = {\"c\": c, \"h\": h}\n        else:\n            if state[\"h\"][0] is None:\n                xp = self.xp\n                with chainer.backends.cuda.get_device_from_id(self._device_id):\n                    state[\"h\"][0] = chainer.Variable(\n                        xp.zeros((emb.shape[0], self.n_units), dtype=emb.dtype)\n                    )\n            h[0] = self.rnn[0](state[\"h\"][0], F.dropout(emb))\n            for n in six.moves.range(1, self.n_layers):\n                if state[\"h\"][n] is None:\n                    xp = self.xp\n                    with chainer.backends.cuda.get_device_from_id(self._device_id):\n                        state[\"h\"][n] = chainer.Variable(\n                            xp.zeros(\n                                (h[n - 1].shape[0], self.n_units), dtype=h[n - 1].dtype\n                            )\n                        )\n                h[n] = self.rnn[n](state[\"h\"][n], F.dropout(h[n - 1]))\n            state = {\"h\": h}\n        y = self.lo(F.dropout(h[-1]))\n        return state, y\n\n\nclass BPTTUpdater(training.updaters.StandardUpdater):\n    \"\"\"An updater for a chainer 
LM\n\n    :param chainer.dataset.Iterator train_iter : The train iterator\n    :param optimizer:\n    :param schedulers:\n    :param int device : The device id\n    :param int accum_grad :\n    \"\"\"\n\n    def __init__(self, train_iter, optimizer, schedulers, device, accum_grad):\n        super(BPTTUpdater, self).__init__(train_iter, optimizer, device=device)\n        self.scheduler = ChainerScheduler(schedulers, optimizer)\n        self.accum_grad = accum_grad\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        # When we pass one iterator and optimizer to StandardUpdater.__init__,\n        # they are automatically named 'main'.\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n\n        count = 0\n        sum_loss = 0\n        optimizer.target.cleargrads()  # Clear the parameter gradients\n        for _ in range(self.accum_grad):\n            # Progress the dataset iterator for sentences at each iteration.\n            batch = train_iter.__next__()\n            x, t = convert.concat_examples(batch, device=self.device, padding=(0, -1))\n            # Concatenate the token IDs to matrices and send them to the device\n            # self.converter does this job\n            # (it is chainer.dataset.concat_examples by default)\n            xp = chainer.backends.cuda.get_array_module(x)\n            loss = 0\n            state = None\n            batch_size, sequence_length = x.shape\n            for i in six.moves.range(sequence_length):\n                # Compute the loss at this time step and accumulate it\n                state, loss_batch = optimizer.target(\n                    state, chainer.Variable(x[:, i]), chainer.Variable(t[:, i])\n                )\n                non_zeros = xp.count_nonzero(x[:, i])\n                loss += loss_batch * non_zeros\n                count += int(non_zeros)\n            # backward\n            loss /= 
batch_size * self.accum_grad  # normalized by batch size\n            sum_loss += float(loss.data)\n            loss.backward()  # Backprop\n            loss.unchain_backward()  # Truncate the graph\n\n        reporter.report({\"loss\": sum_loss}, optimizer.target)\n        reporter.report({\"count\": count}, optimizer.target)\n        # update\n        optimizer.update()  # Update the parameters\n        self.scheduler.step(self.iteration)\n\n\nclass LMEvaluator(BaseEvaluator):\n    \"\"\"A custom evaluator for a chainer LM\n\n    :param chainer.dataset.Iterator val_iter : The validation iterator\n    :param eval_model : The model to evaluate\n    :param int device : The device id to use\n    \"\"\"\n\n    def __init__(self, val_iter, eval_model, device):\n        super(LMEvaluator, self).__init__(val_iter, eval_model, device=device)\n\n    def evaluate(self):\n        val_iter = self.get_iterator(\"main\")\n        target = self.get_target(\"main\")\n        loss = 0\n        count = 0\n        for batch in copy.copy(val_iter):\n            x, t = convert.concat_examples(batch, device=self.device, padding=(0, -1))\n            xp = chainer.backends.cuda.get_array_module(x)\n            state = None\n            for i in six.moves.range(len(x[0])):\n                state, loss_batch = target(state, x[:, i], t[:, i])\n                non_zeros = xp.count_nonzero(x[:, i])\n                loss += loss_batch.data * non_zeros\n                count += int(non_zeros)\n        # report validation loss\n        observation = {}\n        with reporter.report_scope(observation):\n            reporter.report({\"loss\": float(loss / count)}, target)\n        return observation\n\n\ndef train(args):\n    \"\"\"Train with the given args\n\n    :param Namespace args: The program arguments\n    \"\"\"\n    # TODO(karita): support this\n    if args.model_module != \"default\":\n        raise NotImplementedError(\"chainer backend does not support --model-module\")\n\n    # display 
chainer version\n    logging.info(\"chainer version = \" + chainer.__version__)\n\n    set_deterministic_chainer(args)\n\n    # check cuda and cudnn availability\n    if not chainer.cuda.available:\n        logging.warning(\"cuda is not available\")\n    if not chainer.cuda.cudnn_enabled:\n        logging.warning(\"cudnn is not available\")\n\n    # get special label ids\n    unk = args.char_list_dict[\"<unk>\"]\n    eos = args.char_list_dict[\"<eos>\"]\n    # read tokens as a sequence of sentences\n    train = read_tokens(args.train_label, args.char_list_dict)\n    val = read_tokens(args.valid_label, args.char_list_dict)\n    # count tokens\n    n_train_tokens, n_train_oovs = count_tokens(train, unk)\n    n_val_tokens, n_val_oovs = count_tokens(val, unk)\n    logging.info(\"#vocab = \" + str(args.n_vocab))\n    logging.info(\"#sentences in the training data = \" + str(len(train)))\n    logging.info(\"#tokens in the training data = \" + str(n_train_tokens))\n    logging.info(\n        \"oov rate in the training data = %.2f %%\"\n        % (n_train_oovs / n_train_tokens * 100)\n    )\n    logging.info(\"#sentences in the validation data = \" + str(len(val)))\n    logging.info(\"#tokens in the validation data = \" + str(n_val_tokens))\n    logging.info(\n        \"oov rate in the validation data = %.2f %%\" % (n_val_oovs / n_val_tokens * 100)\n    )\n\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n\n    # Create the dataset iterators\n    train_iter = ParallelSentenceIterator(\n        train,\n        args.batchsize,\n        max_length=args.maxlen,\n        sos=eos,\n        eos=eos,\n        shuffle=not use_sortagrad,\n    )\n    val_iter = ParallelSentenceIterator(\n        val, args.batchsize, max_length=args.maxlen, sos=eos, eos=eos, repeat=False\n    )\n    epoch_iters = int(len(train_iter.batch_indices) / args.accum_grad)\n    logging.info(\"#iterations per epoch = %d\" % epoch_iters)\n    logging.info(\"#total iterations = \" + 
str(args.epoch * epoch_iters))\n    # Prepare an RNNLM model\n    rnn = RNNLM(args.n_vocab, args.layer, args.unit, args.type)\n    model = ClassifierWithState(rnn)\n    if args.ngpu > 1:\n        logging.warning(\"currently, multi-gpu is not supported. use single gpu.\")\n    if args.ngpu > 0:\n        # Make the specified GPU current\n        gpu_id = 0\n        chainer.cuda.get_device_from_id(gpu_id).use()\n        model.to_gpu()\n    else:\n        gpu_id = -1\n\n    # Save model conf to json\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(vars(args), indent=4, ensure_ascii=False, sort_keys=True).encode(\n                \"utf_8\"\n            )\n        )\n\n    # Set up an optimizer\n    opt_class = dynamic_import_optimizer(args.opt, args.backend)\n    optimizer = opt_class.from_args(model, args)\n    if args.schedulers is None:\n        schedulers = []\n    else:\n        schedulers = [dynamic_import_scheduler(v)(k, args) for k, v in args.schedulers]\n\n    optimizer.setup(model)\n    optimizer.add_hook(chainer.optimizer.GradientClipping(args.gradclip))\n\n    updater = BPTTUpdater(train_iter, optimizer, schedulers, gpu_id, args.accum_grad)\n    trainer = training.Trainer(updater, (args.epoch, \"epoch\"), out=args.outdir)\n    trainer.extend(LMEvaluator(val_iter, model, device=gpu_id))\n    trainer.extend(\n        extensions.LogReport(\n            postprocess=compute_perplexity,\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n    )\n    trainer.extend(\n        extensions.PrintReport(\n            [\"epoch\", \"iteration\", \"perplexity\", \"val_perplexity\", \"elapsed_time\"]\n        ),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n    trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n    
trainer.extend(extensions.snapshot(filename=\"snapshot.ep.{.updater.epoch}\"))\n    trainer.extend(extensions.snapshot_object(model, \"rnnlm.model.{.updater.epoch}\"))\n    # MEMO(Hori): wants to use MinValueTrigger, but it seems to fail in resuming\n    trainer.extend(MakeSymlinkToBestModel(\"validation/main/loss\", \"rnnlm.model\"))\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epoch, \"epoch\"),\n        )\n\n    if args.resume:\n        logging.info(\"resumed from %s\" % args.resume)\n        chainer.serializers.load_npz(args.resume, trainer)\n\n    set_early_stop(trainer, args, is_lm=True)\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        writer = SummaryWriter(args.tensorboard_dir)\n        trainer.extend(\n            TensorboardLogger(writer), trigger=(args.report_interval_iters, \"iteration\")\n        )\n\n    trainer.run()\n    check_early_stop(trainer, args.epoch)\n\n    # compute perplexity for test set\n    if args.test_label:\n        logging.info(\"test the best model\")\n        chainer.serializers.load_npz(args.outdir + \"/rnnlm.model.best\", model)\n        test = read_tokens(args.test_label, args.char_list_dict)\n        n_test_tokens, n_test_oovs = count_tokens(test, unk)\n        logging.info(\"#sentences in the test data = \" + str(len(test)))\n        logging.info(\"#tokens in the test data = \" + str(n_test_tokens))\n        logging.info(\n            \"oov rate in the test data = %.2f %%\" % (n_test_oovs / n_test_tokens * 100)\n        )\n        test_iter = ParallelSentenceIterator(\n            test, args.batchsize, max_length=args.maxlen, sos=eos, eos=eos, repeat=False\n        )\n        evaluator = LMEvaluator(test_iter, model, device=gpu_id)\n        with chainer.using_config(\"train\", False):\n            result = evaluator()\n        logging.info(\"test perplexity: \" + 
str(np.exp(float(result[\"main/loss\"]))))\n"
  },
  {
    "path": "lm/lm_utils.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n# This code is ported from the following implementation written in Torch.\n# https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb_custom_loop.py\n\nimport chainer\nimport h5py\nimport logging\nimport numpy as np\nimport os\nimport random\nimport six\nfrom tqdm import tqdm\n\nfrom chainer.training import extension\n\n\ndef load_dataset(path, label_dict, outdir=None):\n    \"\"\"Load and save HDF5 that contains a dataset and stats for LM\n\n    Args:\n        path (str): The path of an input text dataset file\n        label_dict (dict[str, int]):\n            dictionary that maps token label string to its ID number\n        outdir (str): The path of an output dir\n\n    Returns:\n        tuple[list[np.ndarray], int, int]: Tuple of\n            token IDs in np.int32 converted by `read_tokens`\n            the number of tokens by `count_tokens`,\n            and the number of OOVs by `count_tokens`\n    \"\"\"\n    if outdir is not None:\n        os.makedirs(outdir, exist_ok=True)\n        filename = outdir + \"/\" + os.path.basename(path) + \".h5\"\n        if os.path.exists(filename):\n            logging.info(f\"loading binary dataset: {filename}\")\n            f = h5py.File(filename, \"r\")\n            return f[\"data\"][:], f[\"n_tokens\"][()], f[\"n_oovs\"][()]\n    else:\n        logging.info(\"skip dump/load HDF5 because the output dir is not specified\")\n    logging.info(f\"reading text dataset: {path}\")\n    ret = read_tokens(path, label_dict)\n    n_tokens, n_oovs = count_tokens(ret, label_dict[\"<unk>\"])\n    if outdir is not None:\n        logging.info(f\"saving binary dataset: {filename}\")\n        with h5py.File(filename, \"w\") as f:\n            # http://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data\n            data = f.create_dataset(\n                
\"data\", (len(ret),), dtype=h5py.special_dtype(vlen=np.int32)\n            )\n            data[:] = ret\n            f[\"n_tokens\"] = n_tokens\n            f[\"n_oovs\"] = n_oovs\n    return ret, n_tokens, n_oovs\n\n\ndef read_tokens(filename, label_dict):\n    \"\"\"Read tokens as a sequence of sentences\n\n    :param str filename : The name of the input file\n    :param dict label_dict : dictionary that maps token label string to its ID number\n    :return list of ID sequences\n    :rtype list\n    \"\"\"\n\n    data = []\n    unk = label_dict[\"<unk>\"]\n    for ln in tqdm(open(filename, \"r\", encoding=\"utf-8\")):\n        data.append(\n            np.array(\n                [label_dict.get(label, unk) for label in ln.split()], dtype=np.int32\n            )\n        )\n    return data\n\n\ndef count_tokens(data, unk_id=None):\n    \"\"\"Count tokens and oovs in token ID sequences.\n\n    Args:\n        data (list[np.ndarray]): list of token ID sequences\n        unk_id (int): ID of unknown token\n\n    Returns:\n        tuple: tuple of number of token occurrences and number of oov tokens\n\n    \"\"\"\n\n    n_tokens = 0\n    n_oovs = 0\n    for sentence in data:\n        n_tokens += len(sentence)\n        if unk_id is not None:\n            n_oovs += np.count_nonzero(sentence == unk_id)\n    return n_tokens, n_oovs\n\n\ndef compute_perplexity(result):\n    \"\"\"Computes and add the perplexity to the LogReport\n\n    :param dict result: The current observations\n    \"\"\"\n    # Routine to rewrite the result dictionary of LogReport to add perplexity values\n    result[\"perplexity\"] = np.exp(result[\"main/loss\"] / result[\"main/count\"])\n    if \"validation/main/loss\" in result:\n        result[\"val_perplexity\"] = np.exp(result[\"validation/main/loss\"])\n\n\nclass ParallelSentenceIterator(chainer.dataset.Iterator):\n    \"\"\"Dataset iterator to create a batch of sentences.\n\n    This iterator returns a pair of sentences, where one token is 
shifted\n    between the sentences like '<sos> w1 w2 w3' and 'w1 w2 w3 <eos>'\n    Sentence batches are made in order of longer sentences, and then\n    randomly shuffled.\n    \"\"\"\n\n    def __init__(\n        self, dataset, batch_size, max_length=0, sos=0, eos=0, repeat=True, shuffle=True\n    ):\n        self.dataset = dataset\n        self.batch_size = batch_size  # batch size\n        # Number of completed sweeps over the dataset. In this case, it is\n        # incremented if every word is visited at least once after the last\n        # increment.\n        self.epoch = 0\n        # True if the epoch is incremented at the last iteration.\n        self.is_new_epoch = False\n        self.repeat = repeat\n        length = len(dataset)\n        self.batch_indices = []\n        # make mini-batches\n        if batch_size > 1:\n            indices = sorted(range(len(dataset)), key=lambda i: -len(dataset[i]))\n            bs = 0\n            while bs < length:\n                be = min(bs + batch_size, length)\n                # batch size is automatically reduced if the sentence length\n                # is larger than max_length\n                if max_length > 0:\n                    sent_length = len(dataset[indices[bs]])\n                    be = min(\n                        be, bs + max(batch_size // (sent_length // max_length + 1), 1)\n                    )\n                self.batch_indices.append(np.array(indices[bs:be]))\n                bs = be\n            if shuffle:\n                # shuffle batches\n                random.shuffle(self.batch_indices)\n        else:\n            self.batch_indices = [np.array([i]) for i in six.moves.range(length)]\n\n        # NOTE: this is not a count of parameter updates. 
It is just a count of\n        # calls of ``__next__``.\n        self.iteration = 0\n        self.sos = sos\n        self.eos = eos\n        # use -1 instead of None internally\n        self._previous_epoch_detail = -1.0\n\n    def __next__(self):\n        # This iterator returns a list representing a mini-batch. Each item\n        # indicates a sentence pair like '<sos> w1 w2 w3' and 'w1 w2 w3 <eos>'\n        # represented by token IDs.\n        n_batches = len(self.batch_indices)\n        if not self.repeat and self.iteration >= n_batches:\n            # If not self.repeat, this iterator stops at the end of the first\n            # epoch (i.e., when all words are visited once).\n            raise StopIteration\n\n        batch = []\n        for idx in self.batch_indices[self.iteration % n_batches]:\n            batch.append(\n                (\n                    np.append([self.sos], self.dataset[idx]),\n                    np.append(self.dataset[idx], [self.eos]),\n                )\n            )\n\n        self._previous_epoch_detail = self.epoch_detail\n        self.iteration += 1\n\n        epoch = self.iteration // n_batches\n        self.is_new_epoch = self.epoch < epoch\n        if self.is_new_epoch:\n            self.epoch = epoch\n\n        return batch\n\n    def start_shuffle(self):\n        random.shuffle(self.batch_indices)\n\n    @property\n    def epoch_detail(self):\n        # Floating point version of epoch.\n        return self.iteration / len(self.batch_indices)\n\n    @property\n    def previous_epoch_detail(self):\n        if self._previous_epoch_detail < 0:\n            return None\n        return self._previous_epoch_detail\n\n    def serialize(self, serializer):\n        # It is important to serialize the state to be recovered on resume.\n        self.iteration = serializer(\"iteration\", self.iteration)\n        self.epoch = serializer(\"epoch\", self.epoch)\n        try:\n            self._previous_epoch_detail = serializer(\n         
       \"previous_epoch_detail\", self._previous_epoch_detail\n            )\n        except KeyError:\n            # guess previous_epoch_detail for older version\n            self._previous_epoch_detail = self.epoch + (\n                self.current_position - 1\n            ) / len(self.batch_indices)\n            if self.epoch_detail > 0:\n                self._previous_epoch_detail = max(self._previous_epoch_detail, 0.0)\n            else:\n                self._previous_epoch_detail = -1.0\n\n\nclass MakeSymlinkToBestModel(extension.Extension):\n    \"\"\"Extension that makes a symbolic link to the best model\n\n    :param str key: Key of value\n    :param str prefix: Prefix of model files and link target\n    :param str suffix: Suffix of link target\n    \"\"\"\n\n    def __init__(self, key, prefix=\"model\", suffix=\"best\"):\n        super(MakeSymlinkToBestModel, self).__init__()\n        self.best_model = -1\n        self.min_loss = 0.0\n        self.key = key\n        self.prefix = prefix\n        self.suffix = suffix\n\n    def __call__(self, trainer):\n        observation = trainer.observation\n        if self.key in observation:\n            loss = observation[self.key]\n            if self.best_model == -1 or loss < self.min_loss:\n                self.min_loss = loss\n                self.best_model = trainer.updater.epoch\n                src = \"%s.%d\" % (self.prefix, self.best_model)\n                dest = os.path.join(trainer.out, \"%s.%s\" % (self.prefix, self.suffix))\n                if os.path.lexists(dest):\n                    os.remove(dest)\n                os.symlink(src, dest)\n                logging.info(\"best model is \" + src)\n\n    def serialize(self, serializer):\n        if isinstance(serializer, chainer.serializer.Serializer):\n            serializer(\"_best_model\", self.best_model)\n            serializer(\"_min_loss\", self.min_loss)\n            serializer(\"_key\", self.key)\n            serializer(\"_prefix\", 
self.prefix)\n            serializer(\"_suffix\", self.suffix)\n        else:\n            self.best_model = serializer(\"_best_model\", -1)\n            self.min_loss = serializer(\"_min_loss\", 0.0)\n            self.key = serializer(\"_key\", \"\")\n            self.prefix = serializer(\"_prefix\", \"model\")\n            self.suffix = serializer(\"_suffix\", \"best\")\n\n\n# TODO(Hori): currently it only works with character-word level LM.\n#             need to consider any types of subwords-to-word mapping.\ndef make_lexical_tree(word_dict, subword_dict, word_unk):\n    \"\"\"Make a lexical tree to compute word-level probabilities\"\"\"\n    # node [dict(subword_id -> node), word_id, word_set[start-1, end]]\n    root = [{}, -1, None]\n    for w, wid in word_dict.items():\n        if wid > 0 and wid != word_unk:  # skip <blank> and <unk>\n            if True in [c not in subword_dict for c in w]:  # skip unknown subword\n                print(f\"{w} is skipped due to invalid subword\")\n                continue\n            succ = root[0]  # get successors from root node\n            for i, c in enumerate(w):\n                cid = subword_dict[c]\n                if cid not in succ:  # if next node does not exist, make a new node\n                    succ[cid] = [{}, -1, (wid - 1, wid)]\n                else:\n                    prev = succ[cid][2]\n                    succ[cid][2] = (min(prev[0], wid - 1), max(prev[1], wid))\n                if i == len(w) - 1:  # if word end, set word id\n                    succ[cid][1] = wid\n                succ = succ[cid][0]  # move to the child successors\n        else:\n            print(f\"word {wid} is skipped\")\n    return root\n"
  },
  {
    "path": "lm/pytorch_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "lm/pytorch_backend/extlm.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Mitsubishi Electric Research Laboratories (Takaaki Hori)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport math\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom espnet.lm.lm_utils import make_lexical_tree\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\n\n\n# Definition of a multi-level (subword/word) language model\nclass MultiLevelLM(nn.Module):\n    logzero = -10000000000.0\n    zero = 1.0e-10\n\n    def __init__(\n        self,\n        wordlm,\n        subwordlm,\n        word_dict,\n        subword_dict,\n        subwordlm_weight=0.8,\n        oov_penalty=1.0,\n        open_vocab=True,\n    ):\n        super(MultiLevelLM, self).__init__()\n        self.wordlm = wordlm\n        self.subwordlm = subwordlm\n        self.word_eos = word_dict[\"<eos>\"]\n        self.word_unk = word_dict[\"<unk>\"]\n        self.var_word_eos = torch.LongTensor([self.word_eos])\n        self.var_word_unk = torch.LongTensor([self.word_unk])\n        self.space = subword_dict[\"<space>\"]\n        self.eos = subword_dict[\"<eos>\"]\n        self.lexroot = make_lexical_tree(word_dict, subword_dict, self.word_unk)\n        self.log_oov_penalty = math.log(oov_penalty)\n        self.open_vocab = open_vocab\n        self.subword_dict_size = len(subword_dict)\n        self.subwordlm_weight = subwordlm_weight\n        self.normalized = True\n\n        # lexroot: [dict(subword->node), word_id, range of word_id with this prefix(start-1, end)]\n\n    def forward(self, state, x):\n        # update state with input label x\n        if state is None:  # make initial states and log-prob vectors\n            self.var_word_eos = to_device(x, self.var_word_eos)\n            self.var_word_unk = to_device(x, self.var_word_eos)\n            wlm_state, z_wlm = self.wordlm(None, self.var_word_eos)\n            wlm_logprobs = F.log_softmax(z_wlm, dim=1)\n            clm_state, 
z_clm = self.subwordlm(None, x)\n            log_y = F.log_softmax(z_clm, dim=1) * self.subwordlm_weight\n            new_node = self.lexroot\n            clm_logprob = 0.0\n            xi = self.space\n        else:\n            clm_state, wlm_state, wlm_logprobs, node, log_y, clm_logprob = state\n            xi = int(x)\n            if xi == self.space:  # inter-word transition\n                if node is not None and node[1] >= 0:  # check if the node is word end\n                    w = to_device(x, torch.LongTensor([node[1]]))\n                else:  # this node is not a word end, which means <unk>\n                    w = self.var_word_unk\n                # update wordlm state and log-prob vector\n                wlm_state, z_wlm = self.wordlm(wlm_state, w)\n                wlm_logprobs = F.log_softmax(z_wlm, dim=1)\n                new_node = self.lexroot  # move to the tree root\n                clm_logprob = 0.0\n            elif node is not None and xi in node[0]:  # intra-word transition\n                new_node = node[0][xi]\n                clm_logprob += log_y[0, xi]\n            elif self.open_vocab:  # if no path in the tree, enter open-vocabulary mode\n                new_node = None\n                clm_logprob += log_y[0, xi]\n            else:  # if open_vocab flag is disabled, return 0 probabilities\n                log_y = to_device(\n                    x, torch.full((1, self.subword_dict_size), self.logzero)\n                )\n                return (clm_state, wlm_state, wlm_logprobs, None, log_y, 0.0), log_y\n\n            clm_state, z_clm = self.subwordlm(clm_state, x)\n            log_y = F.log_softmax(z_clm, dim=1) * self.subwordlm_weight\n\n        # apply word-level probabilities for <space> and <eos> labels\n        if xi != self.space:\n            if new_node is not None and new_node[1] >= 0:  # if new node is word end\n                wlm_logprob = wlm_logprobs[:, new_node[1]] - clm_logprob\n            else:\n                
wlm_logprob = wlm_logprobs[:, self.word_unk] + self.log_oov_penalty\n            log_y[:, self.space] = wlm_logprob\n            log_y[:, self.eos] = wlm_logprob\n        else:\n            log_y[:, self.space] = self.logzero\n            log_y[:, self.eos] = self.logzero\n\n        return (\n            (clm_state, wlm_state, wlm_logprobs, new_node, log_y, float(clm_logprob)),\n            log_y,\n        )\n\n    def final(self, state):\n        clm_state, wlm_state, wlm_logprobs, node, log_y, clm_logprob = state\n        if node is not None and node[1] >= 0:  # check if the node is word end\n            w = to_device(wlm_logprobs, torch.LongTensor([node[1]]))\n        else:  # this node is not a word end, which means <unk>\n            w = self.var_word_unk\n        wlm_state, z_wlm = self.wordlm(wlm_state, w)\n        return float(F.log_softmax(z_wlm, dim=1)[:, self.word_eos])\n\n\n# Definition of a look-ahead word language model\nclass LookAheadWordLM(nn.Module):\n    logzero = -10000000000.0\n    zero = 1.0e-10\n\n    def __init__(\n        self, wordlm, word_dict, subword_dict, oov_penalty=0.0001, open_vocab=True\n    ):\n        super(LookAheadWordLM, self).__init__()\n        self.wordlm = wordlm\n        self.word_eos = word_dict[\"<eos>\"]\n        self.word_unk = word_dict[\"<unk>\"]\n        self.var_word_eos = torch.LongTensor([self.word_eos])\n        self.var_word_unk = torch.LongTensor([self.word_unk])\n        self.space = subword_dict[\"<space>\"]\n        self.eos = subword_dict[\"<eos>\"]\n        self.lexroot = make_lexical_tree(word_dict, subword_dict, self.word_unk)\n        self.oov_penalty = oov_penalty\n        self.open_vocab = open_vocab\n        self.subword_dict_size = len(subword_dict)\n        self.zero_tensor = torch.FloatTensor([self.zero])\n        self.normalized = True\n\n        # every node, including lexroot: [dict(subword_id -> node), word_id,\n        # (start - 1, end) range of word ids sharing this prefix (None at the root)]\n\n    def forward(self, state, x):\n        
# update state with input label x\n        if state is None:  # make initial states and cumulative probability vector\n            self.var_word_eos = to_device(x, self.var_word_eos)\n            self.var_word_unk = to_device(x, self.var_word_unk)\n            self.zero_tensor = to_device(x, self.zero_tensor)\n            wlm_state, z_wlm = self.wordlm(None, self.var_word_eos)\n            cumsum_probs = torch.cumsum(F.softmax(z_wlm, dim=1), dim=1)\n            new_node = self.lexroot\n            xi = self.space\n        else:\n            wlm_state, cumsum_probs, node = state\n            xi = int(x)\n            if xi == self.space:  # inter-word transition\n                if node is not None and node[1] >= 0:  # check if the node is word end\n                    w = to_device(x, torch.LongTensor([node[1]]))\n                else:  # this node is not a word end, which means <unk>\n                    w = self.var_word_unk\n                # update wordlm state and cumulative probability vector\n                wlm_state, z_wlm = self.wordlm(wlm_state, w)\n                cumsum_probs = torch.cumsum(F.softmax(z_wlm, dim=1), dim=1)\n                new_node = self.lexroot  # move to the tree root\n            elif node is not None and xi in node[0]:  # intra-word transition\n                new_node = node[0][xi]\n            elif self.open_vocab:  # if no path in the tree, enter open-vocabulary mode\n                new_node = None\n            else:  # if open_vocab flag is disabled, return 0 probabilities\n                log_y = to_device(\n                    x, torch.full((1, self.subword_dict_size), self.logzero)\n                )\n                return (wlm_state, None, None), log_y\n\n        if new_node is not None:\n            succ, wid, wids = new_node\n            # compute parent node probability\n            sum_prob = (\n                (cumsum_probs[:, wids[1]] - cumsum_probs[:, wids[0]])\n                if wids is not None\n                
else 1.0\n            )\n            if sum_prob < self.zero:\n                log_y = to_device(\n                    x, torch.full((1, self.subword_dict_size), self.logzero)\n                )\n                return (wlm_state, cumsum_probs, new_node), log_y\n            # set <unk> probability as a default value\n            unk_prob = (\n                cumsum_probs[:, self.word_unk] - cumsum_probs[:, self.word_unk - 1]\n            )\n            y = to_device(\n                x,\n                torch.full(\n                    (1, self.subword_dict_size), float(unk_prob) * self.oov_penalty\n                ),\n            )\n            # compute transition probabilities to child nodes\n            for cid, nd in succ.items():\n                y[:, cid] = (\n                    cumsum_probs[:, nd[2][1]] - cumsum_probs[:, nd[2][0]]\n                ) / sum_prob\n            # apply word-level probabilities for <space> and <eos> labels\n            if wid >= 0:\n                wlm_prob = (cumsum_probs[:, wid] - cumsum_probs[:, wid - 1]) / sum_prob\n                y[:, self.space] = wlm_prob\n                y[:, self.eos] = wlm_prob\n            elif xi == self.space:\n                y[:, self.space] = self.zero\n                y[:, self.eos] = self.zero\n            log_y = torch.log(torch.max(y, self.zero_tensor))  # clip to avoid log(0)\n        else:  # if no path in the tree, transition probability is one\n            log_y = to_device(x, torch.zeros(1, self.subword_dict_size))\n        return (wlm_state, cumsum_probs, new_node), log_y\n\n    def final(self, state):\n        wlm_state, cumsum_probs, node = state\n        if node is not None and node[1] >= 0:  # check if the node is word end\n            w = to_device(cumsum_probs, torch.LongTensor([node[1]]))\n        else:  # this node is not a word end, which means <unk>\n            w = self.var_word_unk\n        wlm_state, z_wlm = self.wordlm(wlm_state, w)\n        return 
float(F.log_softmax(z_wlm, dim=1)[:, self.word_eos])\n"
  },
  {
    "path": "lm/pytorch_backend/lm.py",
    "content": "#!/usr/bin/env python3\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n# This code is ported from the following implementation written in Torch.\n# https://github.com/chainer/chainer/blob/master/examples/ptb/train_ptb_custom_loop.py\n\n\"\"\"LM training in pytorch.\"\"\"\n\nimport copy\nimport json\nimport logging\nimport numpy as np\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn.parallel import data_parallel\n\nfrom chainer import Chain\nfrom chainer.dataset import convert\nfrom chainer import reporter\nfrom chainer import training\nfrom chainer.training import extensions\n\nfrom espnet.lm.lm_utils import count_tokens\nfrom espnet.lm.lm_utils import load_dataset\nfrom espnet.lm.lm_utils import MakeSymlinkToBestModel\nfrom espnet.lm.lm_utils import ParallelSentenceIterator\nfrom espnet.lm.lm_utils import read_tokens\nfrom espnet.nets.lm_interface import dynamic_import_lm\nfrom espnet.nets.lm_interface import LMInterface\nfrom espnet.optimizer.factory import dynamic_import_optimizer\nfrom espnet.scheduler.pytorch import PyTorchScheduler\nfrom espnet.scheduler.scheduler import dynamic_import_scheduler\n\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\n\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom tensorboardX import SummaryWriter\n\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.training.evaluator import BaseEvaluator\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\n\ndef compute_perplexity(result):\n    \"\"\"Compute and add the perplexity to the LogReport.\n\n    :param dict result: The current observations\n    
\"\"\"\n    # Routine to rewrite the result dictionary of LogReport to add perplexity values\n    result[\"perplexity\"] = np.exp(result[\"main/nll\"] / result[\"main/count\"])\n    if \"validation/main/nll\" in result:\n        result[\"val_perplexity\"] = np.exp(\n            result[\"validation/main/nll\"] / result[\"validation/main/count\"]\n        )\n\n\nclass Reporter(Chain):\n    \"\"\"Dummy module to use chainer's trainer.\"\"\"\n\n    def report(self, loss):\n        \"\"\"Report nothing.\"\"\"\n        pass\n\n\ndef concat_examples(batch, device=None, padding=None):\n    \"\"\"Concat examples in minibatch.\n\n    :param np.ndarray batch: The batch to concatenate\n    :param int device: The device to send to\n    :param Tuple[int,int] padding: The padding to use\n    :return: (inputs, targets)\n    :rtype (torch.Tensor, torch.Tensor)\n    \"\"\"\n    x, t = convert.concat_examples(batch, padding=padding)\n    x = torch.from_numpy(x)\n    t = torch.from_numpy(t)\n    if device is not None and device >= 0:\n        x = x.cuda(device)\n        t = t.cuda(device)\n    return x, t\n\n\nclass BPTTUpdater(training.StandardUpdater):\n    \"\"\"An updater for a pytorch LM.\"\"\"\n\n    def __init__(\n        self,\n        train_iter,\n        model,\n        optimizer,\n        schedulers,\n        device,\n        gradclip=None,\n        use_apex=False,\n        accum_grad=1,\n    ):\n        \"\"\"Initialize class.\n\n        Args:\n            train_iter (chainer.dataset.Iterator): The train iterator\n            model (LMInterface) : The model to update\n            optimizer (torch.optim.Optimizer): The optimizer for training\n            schedulers (espnet.scheduler.scheduler.SchedulerInterface):\n                The schedulers of `optimizer`\n            device (int): The device id\n            gradclip (float): The gradient clipping value to use\n            use_apex (bool): The flag to use Apex in backprop.\n            accum_grad (int): The number of 
gradient accumulation.\n\n        \"\"\"\n        super(BPTTUpdater, self).__init__(train_iter, optimizer)\n        self.model = model\n        self.device = device\n        self.gradclip = gradclip\n        self.use_apex = use_apex\n        self.scheduler = PyTorchScheduler(schedulers, optimizer)\n        self.accum_grad = accum_grad\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Update the model.\"\"\"\n        # When we pass one iterator and optimizer to StandardUpdater.__init__,\n        # they are automatically named 'main'.\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n        # Progress the dataset iterator for sentences at each iteration.\n        self.model.zero_grad()  # Clear the parameter gradients\n        accum = {\"loss\": 0.0, \"nll\": 0.0, \"count\": 0}\n        for _ in range(self.accum_grad):\n            batch = train_iter.__next__()\n            # Concatenate the token IDs to matrices and send them to the device\n            # self.converter does this job\n            # (it is chainer.dataset.concat_examples by default)\n            x, t = concat_examples(batch, device=self.device[0], padding=(0, -100))\n            if self.device[0] == -1:\n                loss, nll, count = self.model(x, t)\n            else:\n                # apex does not support torch.nn.DataParallel\n                loss, nll, count = data_parallel(self.model, (x, t), self.device)\n\n            # backward\n            loss = loss.mean() / self.accum_grad\n            if self.use_apex:\n                from apex import amp\n\n                with amp.scale_loss(loss, optimizer) as scaled_loss:\n                    scaled_loss.backward()\n            else:\n                loss.backward()  # Backprop\n            # accumulate stats\n            accum[\"loss\"] += float(loss)\n            accum[\"nll\"] += float(nll.sum())\n            
accum[\"count\"] += int(count.sum())\n\n        for k, v in accum.items():\n            reporter.report({k: v}, optimizer.target)\n        if self.gradclip is not None:\n            nn.utils.clip_grad_norm_(self.model.parameters(), self.gradclip)\n        optimizer.step()  # Update the parameters\n        self.scheduler.step(n_iter=self.iteration)\n\n\nclass LMEvaluator(BaseEvaluator):\n    \"\"\"A custom evaluator for a pytorch LM.\"\"\"\n\n    def __init__(self, val_iter, eval_model, reporter, device):\n        \"\"\"Initialize class.\n\n        :param chainer.dataset.Iterator val_iter : The validation iterator\n        :param LMInterface eval_model : The model to evaluate\n        :param chainer.Reporter reporter : The observations reporter\n        :param int device : The device id to use\n\n        \"\"\"\n        super(LMEvaluator, self).__init__(val_iter, reporter, device=-1)\n        self.model = eval_model\n        self.device = device\n\n    def evaluate(self):\n        \"\"\"Evaluate the model.\"\"\"\n        val_iter = self.get_iterator(\"main\")\n        loss = 0\n        nll = 0\n        count = 0\n        self.model.eval()\n        with torch.no_grad():\n            for batch in copy.copy(val_iter):\n                x, t = concat_examples(batch, device=self.device[0], padding=(0, -100))\n                if self.device[0] == -1:\n                    l, n, c = self.model(x, t)\n                else:\n                    # apex does not support torch.nn.DataParallel\n                    l, n, c = data_parallel(self.model, (x, t), self.device)\n                loss += float(l.sum())\n                nll += float(n.sum())\n                count += int(c.sum())\n        self.model.train()\n        # report validation loss\n        observation = {}\n        with reporter.report_scope(observation):\n            reporter.report({\"loss\": loss}, self.model.reporter)\n            reporter.report({\"nll\": nll}, self.model.reporter)\n            
reporter.report({\"count\": count}, self.model.reporter)\n        return observation\n\n\ndef train(args):\n    \"\"\"Train with the given args.\n\n    :param Namespace args: The program arguments\n    :param type model_class: LMInterface class for training\n    \"\"\"\n    model_class = dynamic_import_lm(args.model_module, args.backend)\n    assert issubclass(model_class, LMInterface), \"model should implement LMInterface\"\n    # display torch version\n    logging.info(\"torch version = \" + torch.__version__)\n\n    set_deterministic_pytorch(args)\n\n    # check cuda and cudnn availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get special label ids\n    unk = args.char_list_dict[\"<unk>\"]\n    eos = args.char_list_dict[\"<eos>\"]\n    # read tokens as a sequence of sentences\n    val, n_val_tokens, n_val_oovs = load_dataset(\n        args.valid_label, args.char_list_dict, args.dump_hdf5_path\n    )\n    train, n_train_tokens, n_train_oovs = load_dataset(\n        args.train_label, args.char_list_dict, args.dump_hdf5_path\n    )\n    logging.info(\"#vocab = \" + str(args.n_vocab))\n    logging.info(\"#sentences in the training data = \" + str(len(train)))\n    logging.info(\"#tokens in the training data = \" + str(n_train_tokens))\n    logging.info(\n        \"oov rate in the training data = %.2f %%\"\n        % (n_train_oovs / n_train_tokens * 100)\n    )\n    logging.info(\"#sentences in the validation data = \" + str(len(val)))\n    logging.info(\"#tokens in the validation data = \" + str(n_val_tokens))\n    logging.info(\n        \"oov rate in the validation data = %.2f %%\" % (n_val_oovs / n_val_tokens * 100)\n    )\n\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n    # Create the dataset iterators\n    batch_size = args.batchsize * max(args.ngpu, 1)\n    if batch_size * args.accum_grad > args.batchsize:\n        logging.info(\n            f\"batch size is automatically 
increased \"\n            f\"({args.batchsize} -> {batch_size * args.accum_grad})\"\n        )\n    train_iter = ParallelSentenceIterator(\n        train,\n        batch_size,\n        max_length=args.maxlen,\n        sos=eos,\n        eos=eos,\n        shuffle=not use_sortagrad,\n    )\n    val_iter = ParallelSentenceIterator(\n        val, batch_size, max_length=args.maxlen, sos=eos, eos=eos, repeat=False\n    )\n    epoch_iters = int(len(train_iter.batch_indices) / args.accum_grad)\n    logging.info(\"#iterations per epoch = %d\" % epoch_iters)\n    logging.info(\"#total iterations = \" + str(args.epoch * epoch_iters))\n    # Prepare an RNNLM model\n    if args.train_dtype in (\"float16\", \"float32\", \"float64\"):\n        dtype = getattr(torch, args.train_dtype)\n    else:\n        dtype = torch.float32\n    model = model_class(args.n_vocab, args).to(dtype=dtype)\n    if args.ngpu > 0:\n        model.to(\"cuda\")\n        gpu_id = list(range(args.ngpu))\n    else:\n        gpu_id = [-1]\n\n    # Save model conf to json\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(vars(args), indent=4, ensure_ascii=False, sort_keys=True).encode(\n                \"utf_8\"\n            )\n        )\n\n    logging.warning(\n        \"num. model params: {:,} (num. 
trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # Set up an optimizer\n    opt_class = dynamic_import_optimizer(args.opt, args.backend)\n    optimizer = opt_class.from_args(model.parameters(), args)\n    if args.schedulers is None:\n        schedulers = []\n    else:\n        schedulers = [dynamic_import_scheduler(v)(k, args) for k, v in args.schedulers]\n\n    # setup apex.amp\n    if args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\"):\n        try:\n            from apex import amp\n        except ImportError as e:\n            logging.error(\n                f\"You need to install apex for --train-dtype {args.train_dtype}. \"\n                \"See https://github.com/NVIDIA/apex#linux\"\n            )\n            raise e\n        model, optimizer = amp.initialize(model, optimizer, opt_level=args.train_dtype)\n        use_apex = True\n    else:\n        use_apex = False\n\n    # FIXME: TOO DIRTY HACK\n    reporter = Reporter()\n    setattr(model, \"reporter\", reporter)\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    updater = BPTTUpdater(\n        train_iter,\n        model,\n        optimizer,\n        schedulers,\n        gpu_id,\n        gradclip=args.gradclip,\n        use_apex=use_apex,\n        accum_grad=args.accum_grad,\n    )\n    trainer = training.Trainer(updater, (args.epoch, \"epoch\"), out=args.outdir)\n    trainer.extend(LMEvaluator(val_iter, model, reporter, device=gpu_id))\n    trainer.extend(\n        extensions.LogReport(\n            postprocess=compute_perplexity,\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n    )\n    trainer.extend(\n 
       extensions.PrintReport(\n            [\n                \"epoch\",\n                \"iteration\",\n                \"main/loss\",\n                \"perplexity\",\n                \"val_perplexity\",\n                \"elapsed_time\",\n            ]\n        ),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n    trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n    # Save best models\n    trainer.extend(torch_snapshot(filename=\"snapshot.ep.{.updater.epoch}\"))\n    trainer.extend(snapshot_object(model, \"rnnlm.model.{.updater.epoch}\"))\n    # T.Hori: MinValueTrigger should be used, but it fails when resuming\n    trainer.extend(MakeSymlinkToBestModel(\"validation/main/loss\", \"rnnlm.model\"))\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epoch, \"epoch\"),\n        )\n    if args.resume:\n        logging.info(\"resumed from %s\" % args.resume)\n        torch_resume(args.resume, trainer)\n\n    set_early_stop(trainer, args, is_lm=True)\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        writer = SummaryWriter(args.tensorboard_dir)\n        trainer.extend(\n            TensorboardLogger(writer), trigger=(args.report_interval_iters, \"iteration\")\n        )\n\n    trainer.run()\n    check_early_stop(trainer, args.epoch)\n\n    # compute perplexity for test set\n    if args.test_label:\n        logging.info(\"test the best model\")\n        torch_load(args.outdir + \"/rnnlm.model.best\", model)\n        test = read_tokens(args.test_label, args.char_list_dict)\n        n_test_tokens, n_test_oovs = count_tokens(test, unk)\n        logging.info(\"#sentences in the test data = \" + str(len(test)))\n        logging.info(\"#tokens in the test data = \" + str(n_test_tokens))\n        logging.info(\n            \"oov rate in the test data = %.2f %%\" % 
(n_test_oovs / n_test_tokens * 100)\n        )\n        test_iter = ParallelSentenceIterator(\n            test, batch_size, max_length=args.maxlen, sos=eos, eos=eos, repeat=False\n        )\n        evaluator = LMEvaluator(test_iter, model, reporter, device=gpu_id)\n        result = evaluator()\n        compute_perplexity(result)\n        logging.info(f\"test perplexity: {result['perplexity']}\")\n"
  },
  {
    "path": "mt/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "mt/mt_utils.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Utility functions for the text translation task.\"\"\"\n\nimport logging\n\n\n# * ------------------ recognition related ------------------ *\ndef parse_hypothesis(hyp, char_list):\n    \"\"\"Parse hypothesis.\n\n    :param list hyp: recognition hypothesis\n    :param list char_list: list of characters\n    :return: recognition text string\n    :return: recognition token string\n    :return: recognition tokenid string\n    \"\"\"\n    # remove sos and get results\n    tokenid_as_list = list(map(int, hyp[\"yseq\"][1:]))\n    token_as_list = [char_list[idx] for idx in tokenid_as_list]\n    score = float(hyp[\"score\"])\n\n    # convert to string\n    tokenid = \" \".join([str(idx) for idx in tokenid_as_list])\n    token = \" \".join(token_as_list)\n    text = \"\".join(token_as_list).replace(\"<space>\", \" \")\n\n    return text, token, tokenid, score\n\n\ndef add_results_to_json(js, nbest_hyps, char_list):\n    \"\"\"Add N-best results to json.\n\n    :param dict js: groundtruth utterance dict\n    :param list nbest_hyps: list of hypothesis\n    :param list char_list: list of characters\n    :return: N-best results added utterance dict\n    \"\"\"\n    # copy old json info\n    new_js = dict()\n    if \"utt2spk\" in js.keys():\n        new_js[\"utt2spk\"] = js[\"utt2spk\"]\n    new_js[\"output\"] = []\n\n    for n, hyp in enumerate(nbest_hyps, 1):\n        # parse hypothesis\n        rec_text, rec_token, rec_tokenid, score = parse_hypothesis(hyp, char_list)\n\n        # copy ground-truth\n        if len(js[\"output\"]) > 0:\n            out_dic = dict(js[\"output\"][0].items())\n        else:\n            out_dic = {\"name\": \"\"}\n\n        # update name\n        out_dic[\"name\"] += \"[%d]\" % n\n\n        # add recognition results\n        out_dic[\"rec_text\"] = rec_text\n        
out_dic[\"rec_token\"] = rec_token\n        out_dic[\"rec_tokenid\"] = rec_tokenid\n        out_dic[\"score\"] = score\n\n        # add source reference\n        out_dic[\"text_src\"] = js[\"output\"][1][\"text\"]\n        out_dic[\"token_src\"] = js[\"output\"][1][\"token\"]\n        out_dic[\"tokenid_src\"] = js[\"output\"][1][\"tokenid\"]\n\n        # add to list of N-best result dicts\n        new_js[\"output\"].append(out_dic)\n\n        # show 1-best result\n        if n == 1:\n            if \"text\" in out_dic.keys():\n                logging.info(\"groundtruth: %s\" % out_dic[\"text\"])\n            logging.info(\"prediction : %s\" % out_dic[\"rec_text\"])\n            logging.info(\"source : %s\" % out_dic[\"token_src\"])\n\n    return new_js\n"
  },
  {
    "path": "mt/pytorch_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "mt/pytorch_backend/mt.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Training/decoding definition for the text translation task.\"\"\"\n\nimport json\nimport logging\nimport os\nimport sys\n\nfrom chainer import training\nfrom chainer.training import extensions\nimport numpy as np\nfrom tensorboardX import SummaryWriter\nimport torch\n\nfrom espnet.asr.asr_utils import adadelta_eps_decay\nfrom espnet.asr.asr_utils import adam_lr_decay\nfrom espnet.asr.asr_utils import add_results_to_json\nfrom espnet.asr.asr_utils import CompareValueTrigger\nfrom espnet.asr.asr_utils import restore_snapshot\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\nfrom espnet.nets.mt_interface import MTInterface\nfrom espnet.nets.pytorch_backend.e2e_asr import pad_list\nfrom espnet.utils.dataset import ChainerDataLoader\nfrom espnet.utils.dataset import TransformDataset\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\nfrom espnet.asr.pytorch_backend.asr import CustomEvaluator\nfrom espnet.asr.pytorch_backend.asr import CustomUpdater\nfrom espnet.asr.pytorch_backend.asr import load_trained_model\n\nimport matplotlib\n\nmatplotlib.use(\"Agg\")\n\nif sys.version_info[0] == 2:\n    from itertools import izip_longest as zip_longest\nelse:\n    from itertools import zip_longest as 
zip_longest\n\n\nclass CustomConverter(object):\n    \"\"\"Custom batch converter for Pytorch.\"\"\"\n\n    def __init__(self):\n        \"\"\"Construct a CustomConverter object.\"\"\"\n        self.ignore_id = -1\n        self.pad = 0\n        # NOTE: we reserve index:0 for <pad> although this is reserved for a blank class\n        # in ASR. However,\n        # blank labels are not used in NMT. To keep the vocabulary size,\n        # we use index:0 for padding instead of adding one more class.\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Transform a batch and send it to a device.\n\n        Args:\n            batch (list): The batch to transform.\n            device (torch.device): The device to send to.\n\n        Returns:\n            tuple(torch.Tensor, torch.Tensor, torch.Tensor)\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys = batch[0]\n\n        # get batch of lengths of input sequences\n        ilens = np.array([x.shape[0] for x in xs])\n\n        # perform padding and convert to tensor\n        xs_pad = pad_list([torch.from_numpy(x).long() for x in xs], self.pad).to(device)\n        ilens = torch.from_numpy(ilens).to(device)\n        ys_pad = pad_list([torch.from_numpy(y).long() for y in ys], self.ignore_id).to(\n            device\n        )\n\n        return xs_pad, ilens, ys_pad\n\n\ndef train(args):\n    \"\"\"Train with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n\n    # check cuda availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n    idim = int(valid_json[utts[0]][\"output\"][1][\"shape\"][1])\n    odim = 
int(valid_json[utts[0]][\"output\"][0][\"shape\"][1])\n    logging.info(\"#input dims : \" + str(idim))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # specify model architecture\n    model_class = dynamic_import(args.model_module)\n    model = model_class(idim, odim, args)\n    assert isinstance(model, MTInterface)\n\n    # write model config\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim, odim, vars(args)), indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    reporter = model.reporter\n\n    # check the use of multi-gpu\n    if args.ngpu > 1:\n        if args.batch_size != 0:\n            logging.warning(\n                \"batch size is automatically increased (%d -> %d)\"\n                % (args.batch_size, args.batch_size * args.ngpu)\n            )\n            args.batch_size *= args.ngpu\n\n    # set torch device\n    device = torch.device(\"cuda\" if args.ngpu > 0 else \"cpu\")\n    if args.train_dtype in (\"float16\", \"float32\", \"float64\"):\n        dtype = getattr(torch, args.train_dtype)\n    else:\n        dtype = torch.float32\n    model = model.to(device=device, dtype=dtype)\n\n    logging.warning(\n        \"num. model params: {:,} (num. 
trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # Setup an optimizer\n    if args.opt == \"adadelta\":\n        optimizer = torch.optim.Adadelta(\n            model.parameters(), rho=0.95, eps=args.eps, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"adam\":\n        optimizer = torch.optim.Adam(\n            model.parameters(), lr=args.lr, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"noam\":\n        from espnet.nets.pytorch_backend.transformer.optimizer import get_std_opt\n\n        optimizer = get_std_opt(\n            model.parameters(),\n            args.adim,\n            args.transformer_warmup_steps,\n            args.transformer_lr,\n        )\n    else:\n        raise NotImplementedError(\"unknown optimizer: \" + args.opt)\n\n    # setup apex.amp\n    if args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\"):\n        try:\n            from apex import amp\n        except ImportError as e:\n            logging.error(\n                f\"You need to install apex for --train-dtype {args.train_dtype}. 
\"\n                \"See https://github.com/NVIDIA/apex#linux\"\n            )\n            raise e\n        if args.opt == \"noam\":\n            model, optimizer.optimizer = amp.initialize(\n                model, optimizer.optimizer, opt_level=args.train_dtype\n            )\n        else:\n            model, optimizer = amp.initialize(\n                model, optimizer, opt_level=args.train_dtype\n            )\n        use_apex = True\n    else:\n        use_apex = False\n\n    # FIXME: TOO DIRTY HACK\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    # Setup a converter\n    converter = CustomConverter()\n\n    # read json data\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n    # make minibatch list (variable length)\n    train = make_batchset(\n        train_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        shortest_first=use_sortagrad,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        mt=True,\n        iaxis=1,\n        oaxis=0,\n    )\n    valid = make_batchset(\n        valid_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        mt=True,\n        
iaxis=1,\n        oaxis=0,\n    )\n\n    load_tr = LoadInputsAndTargets(mode=\"mt\", load_output=True)\n    load_cv = LoadInputsAndTargets(mode=\"mt\", load_output=True)\n    # hack to make batchsize argument as 1\n    # actual batchsize is included in a list\n    # default collate function converts numpy array to pytorch tensor\n    # we used an empty collate function instead which returns list\n    train_iter = ChainerDataLoader(\n        dataset=TransformDataset(train, lambda data: converter([load_tr(data)])),\n        batch_size=1,\n        num_workers=args.n_iter_processes,\n        shuffle=not use_sortagrad,\n        collate_fn=lambda x: x[0],\n    )\n    valid_iter = ChainerDataLoader(\n        dataset=TransformDataset(valid, lambda data: converter([load_cv(data)])),\n        batch_size=1,\n        shuffle=False,\n        collate_fn=lambda x: x[0],\n        num_workers=args.n_iter_processes,\n    )\n\n    # Set up a trainer\n    updater = CustomUpdater(\n        model,\n        args.grad_clip,\n        {\"main\": train_iter},\n        optimizer,\n        device,\n        args.ngpu,\n        False,\n        args.accum_grad,\n        use_apex=use_apex,\n    )\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n\n    # Resume from a snapshot\n    if args.resume:\n        logging.info(\"resumed from %s\" % args.resume)\n        torch_resume(args.resume, trainer)\n\n    # Evaluate the model with the validation dataset for each epoch\n    if args.save_interval_iters > 0:\n        trainer.extend(\n            CustomEvaluator(model, {\"main\": valid_iter}, reporter, device, args.ngpu),\n            trigger=(args.save_interval_iters, \"iteration\"),\n        )\n    else:\n        trainer.extend(\n            CustomEvaluator(model, {\"main\": 
valid_iter}, reporter, device, args.ngpu)\n        )\n\n    # Save attention weight each epoch\n    if args.num_save_attention > 0:\n        # NOTE: sort it by output lengths\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"output\"][0][\"shape\"][0]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n            ikey=\"output\",\n            iaxis=1,\n        )\n        trainer.extend(att_reporter, trigger=(1, \"epoch\"))\n    else:\n        att_reporter = None\n\n    # Make a plot for training and validation values\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/loss\", \"validation/main/loss\"], \"epoch\", file_name=\"loss.png\"\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/acc\", \"validation/main/acc\"], \"epoch\", file_name=\"acc.png\"\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/ppl\", \"validation/main/ppl\"], \"epoch\", file_name=\"ppl.png\"\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/bleu\", \"validation/main/bleu\"], \"epoch\", file_name=\"bleu.png\"\n        )\n    )\n\n    # Save best models\n    trainer.extend(\n        snapshot_object(model, \"model.loss.best\"),\n        trigger=training.triggers.MinValueTrigger(\"validation/main/loss\"),\n    )\n    trainer.extend(\n        snapshot_object(model, \"model.acc.best\"),\n       
 trigger=training.triggers.MaxValueTrigger(\"validation/main/acc\"),\n    )\n\n    # save snapshot which contains model and optimizer states\n    if args.save_interval_iters > 0:\n        trainer.extend(\n            torch_snapshot(filename=\"snapshot.iter.{.updater.iteration}\"),\n            trigger=(args.save_interval_iters, \"iteration\"),\n        )\n    else:\n        trainer.extend(torch_snapshot(), trigger=(1, \"epoch\"))\n\n    # epsilon decay in the optimizer\n    if args.opt == \"adadelta\":\n        if args.criterion == \"acc\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.acc.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.loss.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n    elif args.opt == \"adam\":\n        if args.criterion == 
\"acc\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.acc.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adam_lr_decay(args.lr_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.loss.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adam_lr_decay(args.lr_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n\n    # Write a log of evaluation statistics for each epoch\n    trainer.extend(\n        extensions.LogReport(trigger=(args.report_interval_iters, \"iteration\"))\n    )\n    report_keys = [\n        \"epoch\",\n        \"iteration\",\n        \"main/loss\",\n        \"validation/main/loss\",\n        \"main/acc\",\n        \"validation/main/acc\",\n        \"main/ppl\",\n        \"validation/main/ppl\",\n        \"elapsed_time\",\n    ]\n    if args.opt == \"adadelta\":\n        trainer.extend(\n            extensions.observe_value(\n                \"eps\",\n                lambda trainer: 
trainer.updater.get_optimizer(\"main\").param_groups[0][\n                    \"eps\"\n                ],\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"eps\")\n    elif args.opt in [\"adam\", \"noam\"]:\n        trainer.extend(\n            extensions.observe_value(\n                \"lr\",\n                lambda trainer: trainer.updater.get_optimizer(\"main\").param_groups[0][\n                    \"lr\"\n                ],\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"lr\")\n    if args.report_bleu:\n        report_keys.append(\"main/bleu\")\n        report_keys.append(\"validation/main/bleu\")\n    trainer.extend(\n        extensions.PrintReport(report_keys),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n\n    trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n    set_early_stop(trainer, args)\n\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        trainer.extend(\n            TensorboardLogger(SummaryWriter(args.tensorboard_dir), att_reporter),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, args.epochs)\n\n\ndef trans(args):\n    \"\"\"Decode with the given args.\n\n    Args:\n        args (namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n    model, train_args = load_trained_model(args.model)\n    assert isinstance(model, MTInterface)\n    model.trans_args = args\n\n    # gpu\n    if args.ngpu == 1:\n        gpu_id = list(range(args.ngpu))\n        logging.info(\"gpu id: \" + str(gpu_id))\n        model.cuda()\n\n    # read json data\n    with open(args.trans_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    new_js = {}\n\n    # remove empty utterances\n    if 
train_args.multilingual:\n        js = {\n            k: v\n            for k, v in js.items()\n            if v[\"output\"][0][\"shape\"][0] > 1 and v[\"output\"][1][\"shape\"][0] > 1\n        }\n    else:\n        js = {\n            k: v\n            for k, v in js.items()\n            if v[\"output\"][0][\"shape\"][0] > 0 and v[\"output\"][1][\"shape\"][0] > 0\n        }\n\n    if args.batchsize == 0:\n        with torch.no_grad():\n            for idx, name in enumerate(js.keys(), 1):\n                logging.info(\"(%d/%d) decoding \" + name, idx, len(js.keys()))\n                feat = [js[name][\"output\"][1][\"tokenid\"].split()]\n                nbest_hyps = model.translate(feat, args, train_args.char_list)\n                new_js[name] = add_results_to_json(\n                    js[name], nbest_hyps, train_args.char_list\n                )\n\n    else:\n\n        def grouper(n, iterable, fillvalue=None):\n            kargs = [iter(iterable)] * n\n            return zip_longest(*kargs, fillvalue=fillvalue)\n\n        # sort data\n        keys = list(js.keys())\n        feat_lens = [js[key][\"output\"][1][\"shape\"][0] for key in keys]\n        sorted_index = sorted(range(len(feat_lens)), key=lambda i: -feat_lens[i])\n        keys = [keys[i] for i in sorted_index]\n\n        with torch.no_grad():\n            for names in grouper(args.batchsize, keys, None):\n                names = [name for name in names if name]\n                feats = [\n                    np.fromiter(\n                        map(int, js[name][\"output\"][1][\"tokenid\"].split()),\n                        dtype=np.int64,\n                    )\n                    for name in names\n                ]\n                nbest_hyps = model.translate_batch(\n                    feats,\n                    args,\n                    train_args.char_list,\n                )\n\n                for i, nbest_hyp in enumerate(nbest_hyps):\n                    name = names[i]\n                  
  new_js[name] = add_results_to_json(\n                        js[name], nbest_hyp, train_args.char_list\n                    )\n\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n"
  },
  {
    "path": "nets/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/asr_interface.py",
    "content": "\"\"\"ASR Interface module.\"\"\"\nimport argparse\n\nfrom espnet.bin.asr_train import get_parser\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass ASRInterface:\n    \"\"\"ASR Interface for ESPnet model implementation.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments to parser.\"\"\"\n        return parser\n\n    @classmethod\n    def build(cls, idim: int, odim: int, **kwargs):\n        \"\"\"Initialize this class with python-level args.\n\n        Args:\n            idim (int): The number of an input feature dim.\n            odim (int): The number of output vocab.\n\n        Returns:\n            ASRInterface: A new instance of ASRInterface.\n\n        \"\"\"\n\n        def wrap(parser):\n            return get_parser(parser, required=False)\n\n        args = argparse.Namespace(**kwargs)\n        args = fill_missing_args(args, wrap)\n        args = fill_missing_args(args, cls.add_arguments)\n        return cls(idim, odim, args)\n\n    def forward(self, xs, ilens, ys):\n        \"\"\"Compute loss for training.\n\n        :param xs:\n            For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim)\n            For chainer, list of source sequences chainer.Variable\n        :param ilens: batch of lengths of source sequences (B)\n            For pytorch, torch.Tensor\n            For chainer, list of int\n        :param ys:\n            For pytorch, batch of padded target sequences torch.Tensor (B, Lmax)\n            For chainer, list of target sequences chainer.Variable\n        :return: loss value\n        :rtype: torch.Tensor for pytorch, chainer.Variable for chainer\n        \"\"\"\n        raise NotImplementedError(\"forward method is not implemented\")\n\n    def recognize(self, x, recog_args, char_list=None, rnnlm=None):\n        \"\"\"Recognize x for evaluation.\n\n        :param ndarray x: input acoustic 
feature (B, T, D) or (T, D)\n        :param namespace recog_args: argument namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        raise NotImplementedError(\"recognize method is not implemented\")\n\n    def recognize_batch(self, x, recog_args, char_list=None, rnnlm=None):\n        \"\"\"Beam search implementation for batch.\n\n        :param torch.Tensor x: encoder hidden state sequences (B, Tmax, Henc)\n        :param namespace recog_args: argument namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        raise NotImplementedError(\"Batch decoding is not supported yet.\")\n\n    def calculate_all_attentions(self, xs, ilens, ys):\n        \"\"\"Calculate attention.\n\n        :param list xs: list of padded input sequences [(T1, idim), (T2, idim), ...]\n        :param ndarray ilens: batch of lengths of input sequences (B)\n        :param list ys: list of character id sequence tensor [(L1), (L2), (L3), ...]\n        :return: attention weights (B, Lmax, Tmax)\n        :rtype: float ndarray\n        \"\"\"\n        raise NotImplementedError(\"calculate_all_attentions method is not implemented\")\n\n    def calculate_all_ctc_probs(self, xs, ilens, ys):\n        \"\"\"Calculate CTC probability.\n\n        :param list xs: list of padded input sequences [(T1, idim), (T2, idim), ...]\n        :param ndarray ilens: batch of lengths of input sequences (B)\n        :param list ys: list of character id sequence tensor [(L1), (L2), (L3), ...]\n        :return: CTC probabilities (B, Tmax, vocab)\n        :rtype: float ndarray\n        \"\"\"\n        raise NotImplementedError(\"calculate_all_ctc_probs method is not 
implemented\")\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Get attention plot class.\"\"\"\n        from espnet.asr.asr_utils import PlotAttentionReport\n\n        return PlotAttentionReport\n\n    @property\n    def ctc_plot_class(self):\n        \"\"\"Get CTC plot class.\"\"\"\n        from espnet.asr.asr_utils import PlotCTCReport\n\n        return PlotCTCReport\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        raise NotImplementedError(\n            \"get_total_subsampling_factor method is not implemented\"\n        )\n\n    def encode(self, feat):\n        \"\"\"Encode feature in `beam_search` (optional).\n\n        Args:\n            feat (numpy.ndarray): input feature (T, D)\n        Returns:\n            torch.Tensor for pytorch, chainer.Variable for chainer:\n                encoded feature (T, D)\n\n        \"\"\"\n        raise NotImplementedError(\"encode method is not implemented\")\n\n    def scorers(self):\n        \"\"\"Get scorers for `beam_search` (optional).\n\n        Returns:\n            dict[str, ScorerInterface]: dict of `ScorerInterface` objects\n\n        \"\"\"\n        raise NotImplementedError(\"scorers method is not implemented\")\n\n\npredefined_asr = {\n    \"pytorch\": {\n        \"rnn\": \"espnet.nets.pytorch_backend.e2e_asr:E2E\",\n        \"transducer\": \"espnet.nets.pytorch_backend.e2e_asr_transducer:E2E\",\n        \"transformer\": \"espnet.nets.pytorch_backend.e2e_asr_transformer:E2E\",\n        \"conformer\": \"espnet.nets.pytorch_backend.e2e_asr_conformer:E2E\",\n    },\n    \"chainer\": {\n        \"rnn\": \"espnet.nets.chainer_backend.e2e_asr:E2E\",\n        \"transformer\": \"espnet.nets.chainer_backend.e2e_asr_transformer:E2E\",\n    },\n}\n\n\ndef dynamic_import_asr(module, backend):\n    \"\"\"Import ASR models dynamically.\n\n    Args:\n        module (str): module_name:class_name or alias in `predefined_asr`\n        backend (str): NN 
backend. e.g., pytorch, chainer\n\n    Returns:\n        type: ASR class\n\n    \"\"\"\n    model_class = dynamic_import(module, predefined_asr.get(backend, dict()))\n    assert issubclass(\n        model_class, ASRInterface\n    ), f\"{module} does not implement ASRInterface\"\n    return model_class\n"
  },
  {
    "path": "nets/batch_beam_search.py",
    "content": "\"\"\"Parallel beam search module.\"\"\"\n\nimport logging\nfrom typing import Any\nfrom typing import Dict\nfrom typing import List\nfrom typing import NamedTuple\nfrom typing import Tuple\n\nimport torch\nfrom torch.nn.utils.rnn import pad_sequence\n\nfrom espnet.nets.beam_search import BeamSearch\nfrom espnet.nets.beam_search import Hypothesis\n\n\nclass BatchHypothesis(NamedTuple):\n    \"\"\"Batchfied/Vectorized hypothesis data type.\"\"\"\n\n    yseq: torch.Tensor = torch.tensor([])  # (batch, maxlen)\n    score: torch.Tensor = torch.tensor([])  # (batch,)\n    length: torch.Tensor = torch.tensor([])  # (batch,)\n    scores: Dict[str, torch.Tensor] = dict()  # values: (batch,)\n    states: Dict[str, Dict] = dict()\n\n    def __len__(self) -> int:\n        \"\"\"Return a batch size.\"\"\"\n        return len(self.length)\n\n\nclass BatchBeamSearch(BeamSearch):\n    \"\"\"Batch beam search implementation.\"\"\"\n\n    def batchfy(self, hyps: List[Hypothesis]) -> BatchHypothesis:\n        \"\"\"Convert list to batch.\"\"\"\n        if len(hyps) == 0:\n            return BatchHypothesis()\n        return BatchHypothesis(\n            yseq=pad_sequence(\n                [h.yseq for h in hyps], batch_first=True, padding_value=self.eos\n            ),\n            length=torch.tensor([len(h.yseq) for h in hyps], dtype=torch.int64),\n            score=torch.tensor([h.score for h in hyps]),\n            scores={k: torch.tensor([h.scores[k] for h in hyps]) for k in self.scorers},\n            states={k: [h.states[k] for h in hyps] for k in self.scorers},\n        )\n\n    def _batch_select(self, hyps: BatchHypothesis, ids: List[int]) -> BatchHypothesis:\n        return BatchHypothesis(\n            yseq=hyps.yseq[ids],\n            score=hyps.score[ids],\n            length=hyps.length[ids],\n            scores={k: v[ids] for k, v in hyps.scores.items()},\n            states={\n                k: [self.scorers[k].select_state(v, i) for i in ids]\n       
         for k, v in hyps.states.items()\n            },\n        )\n\n    def _select(self, hyps: BatchHypothesis, i: int) -> Hypothesis:\n        return Hypothesis(\n            yseq=hyps.yseq[i, : hyps.length[i]],\n            score=hyps.score[i],\n            scores={k: v[i] for k, v in hyps.scores.items()},\n            states={\n                k: self.scorers[k].select_state(v, i) for k, v in hyps.states.items()\n            },\n        )\n\n    def unbatchfy(self, batch_hyps: BatchHypothesis) -> List[Hypothesis]:\n        \"\"\"Revert batch to list.\"\"\"\n        return [\n            Hypothesis(\n                yseq=batch_hyps.yseq[i][: batch_hyps.length[i]],\n                score=batch_hyps.score[i],\n                scores={k: batch_hyps.scores[k][i] for k in self.scorers},\n                states={\n                    k: v.select_state(batch_hyps.states[k], i)\n                    for k, v in self.scorers.items()\n                },\n            )\n            for i in range(len(batch_hyps.length))\n        ]\n\n    def batch_beam(\n        self, weighted_scores: torch.Tensor, ids: torch.Tensor\n    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:\n        \"\"\"Batch-compute topk full token ids and partial token ids.\n\n        Args:\n            weighted_scores (torch.Tensor): The weighted sum scores for each tokens.\n                Its shape is `(n_beam, self.vocab_size)`.\n            ids (torch.Tensor): The partial token ids to compute topk.\n                Its shape is `(n_beam, self.pre_beam_size)`.\n\n        Returns:\n            Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:\n                The topk full (prev_hyp, new_token) ids\n                and partial (prev_hyp, new_token) ids.\n                Their shapes are all `(self.beam_size,)`\n\n        \"\"\"\n        top_ids = weighted_scores.view(-1).topk(self.beam_size)[1]\n        # Because of the flatten above, `top_ids` is organized as:\n        
# [hyp1 * V + token1, hyp2 * V + token2, ..., hypK * V + tokenK],\n        # where V is `self.n_vocab` and K is `self.beam_size`\n        prev_hyp_ids = top_ids // self.n_vocab\n        new_token_ids = top_ids % self.n_vocab\n        return prev_hyp_ids, new_token_ids, prev_hyp_ids, new_token_ids\n\n    def init_hyp(self, x: torch.Tensor) -> BatchHypothesis:\n        \"\"\"Get an initial hypothesis data.\n\n        Args:\n            x (torch.Tensor): The encoder output feature\n\n        Returns:\n            Hypothesis: The initial hypothesis.\n\n        \"\"\"\n        init_states = dict()\n        init_scores = dict()\n        for k, d in self.scorers.items():\n            init_states[k] = d.batch_init_state(x)\n            init_scores[k] = 0.0\n        return self.batchfy(\n            [\n                Hypothesis(\n                    score=0.0,\n                    scores=init_scores,\n                    states=init_states,\n                    yseq=torch.tensor([self.sos], device=x.device),\n                )\n            ]\n        )\n\n    def score_full(\n        self, hyp: BatchHypothesis, x: torch.Tensor\n    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:\n        \"\"\"Score new hypothesis by `self.full_scorers`.\n\n        Args:\n            hyp (Hypothesis): Hypothesis with prefix tokens to score\n            x (torch.Tensor): Corresponding input feature\n\n        Returns:\n            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of\n                score dict of `hyp` that has string keys of `self.full_scorers`\n                and tensor score values of shape: `(self.n_vocab,)`,\n                and state dict that has string keys\n                and state values of `self.full_scorers`\n\n        \"\"\"\n        scores = dict()\n        states = dict()\n        for k, d in self.full_scorers.items():\n            scores[k], states[k] = d.batch_score(hyp.yseq, hyp.states[k], x)\n        return scores, states\n\n    def 
score_partial(\n        self, hyp: BatchHypothesis, ids: torch.Tensor, x: torch.Tensor\n    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:\n        \"\"\"Score new hypothesis by `self.part_scorers`.\n\n        Args:\n            hyp (Hypothesis): Hypothesis with prefix tokens to score\n            ids (torch.Tensor): 2D tensor of new partial tokens to score\n            x (torch.Tensor): Corresponding input feature\n\n        Returns:\n            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of\n                score dict of `hyp` that has string keys of `self.part_scorers`\n                and tensor score values of shape: `(self.n_vocab,)`,\n                and state dict that has string keys\n                and state values of `self.part_scorers`\n\n        \"\"\"\n        scores = dict()\n        states = dict()\n        for k, d in self.part_scorers.items():\n            scores[k], states[k] = d.batch_score_partial(\n                hyp.yseq, ids, hyp.states[k], x\n            )\n        return scores, states\n\n    def merge_states(self, states: Any, part_states: Any, part_idx: int) -> Any:\n        \"\"\"Merge states for new hypothesis.\n\n        Args:\n            states: states of `self.full_scorers`\n            part_states: states of `self.part_scorers`\n            part_idx (int): The new token id for `part_scores`\n\n        Returns:\n            Dict[str, torch.Tensor]: The new state dict.\n                Its keys are names of `self.full_scorers` and `self.part_scorers`.\n                Its values are states of the scorers.\n\n        \"\"\"\n        new_states = dict()\n        for k, v in states.items():\n            new_states[k] = v\n        for k, v in part_states.items():\n            new_states[k] = v\n        return new_states\n\n    def search(self, running_hyps: BatchHypothesis, x: torch.Tensor) -> BatchHypothesis:\n        \"\"\"Search new tokens for running hypotheses and encoded speech x.\n\n        Args:\n            
running_hyps (BatchHypothesis): Running hypotheses on beam\n            x (torch.Tensor): Encoded speech feature (T, D)\n\n        Returns:\n            BatchHypothesis: Best sorted hypotheses\n\n        \"\"\"\n        n_batch = len(running_hyps)\n        part_ids = None  # no pre-beam\n        # batch scoring\n        weighted_scores = torch.zeros(\n            n_batch, self.n_vocab, dtype=x.dtype, device=x.device\n        )\n        scores, states = self.score_full(running_hyps, x.expand(n_batch, *x.shape))\n        for k in self.full_scorers:\n            weighted_scores += self.weights[k] * scores[k]\n        # partial scoring\n        if self.do_pre_beam:\n            pre_beam_scores = (\n                weighted_scores\n                if self.pre_beam_score_key == \"full\"\n                else scores[self.pre_beam_score_key]\n            )\n            part_ids = torch.topk(pre_beam_scores, self.pre_beam_size, dim=-1)[1]\n        # NOTE(takaaki-hori): Unlike BeamSearch, we assume that score_partial returns\n        # full-size score matrices, which has non-zero scores for part_ids and zeros\n        # for others.\n        part_scores, part_states = self.score_partial(running_hyps, part_ids, x)\n        for k in self.part_scorers:\n            weighted_scores += self.weights[k] * part_scores[k]\n        # add previous hyp scores\n        weighted_scores += running_hyps.score.to(\n            dtype=x.dtype, device=x.device\n        ).unsqueeze(1)\n\n        # TODO(karita): do not use list. 
use batch instead\n        # see also https://github.com/espnet/espnet/pull/1402#discussion_r354561029\n        # update hyps\n        best_hyps = []\n        prev_hyps = self.unbatchfy(running_hyps)\n        for (\n            full_prev_hyp_id,\n            full_new_token_id,\n            part_prev_hyp_id,\n            part_new_token_id,\n        ) in zip(*self.batch_beam(weighted_scores, part_ids)):\n            prev_hyp = prev_hyps[full_prev_hyp_id]\n            best_hyps.append(\n                Hypothesis(\n                    score=weighted_scores[full_prev_hyp_id, full_new_token_id],\n                    yseq=self.append_token(prev_hyp.yseq, full_new_token_id),\n                    scores=self.merge_scores(\n                        prev_hyp.scores,\n                        {k: v[full_prev_hyp_id] for k, v in scores.items()},\n                        full_new_token_id,\n                        {k: v[part_prev_hyp_id] for k, v in part_scores.items()},\n                        part_new_token_id,\n                    ),\n                    states=self.merge_states(\n                        {\n                            k: self.full_scorers[k].select_state(v, full_prev_hyp_id)\n                            for k, v in states.items()\n                        },\n                        {\n                            k: self.part_scorers[k].select_state(\n                                v, part_prev_hyp_id, part_new_token_id\n                            )\n                            for k, v in part_states.items()\n                        },\n                        part_new_token_id,\n                    ),\n                )\n            )\n        return self.batchfy(best_hyps)\n\n    def post_process(\n        self,\n        i: int,\n        maxlen: int,\n        maxlenratio: float,\n        running_hyps: BatchHypothesis,\n        ended_hyps: List[Hypothesis],\n    ) -> BatchHypothesis:\n        \"\"\"Perform post-processing of beam search iterations.\n\n     
   Args:\n            i (int): The length of hypothesis tokens.\n            maxlen (int): The maximum length of tokens in beam search.\n            maxlenratio (float): The maximum length ratio in beam search.\n            running_hyps (BatchHypothesis): The running hypotheses in beam search.\n            ended_hyps (List[Hypothesis]): The ended hypotheses in beam search.\n\n        Returns:\n            BatchHypothesis: The new running hypotheses.\n\n        \"\"\"\n        n_batch = running_hyps.yseq.shape[0]\n        logging.debug(f\"the number of running hypotheses: {n_batch}\")\n        if self.token_list is not None:\n            logging.debug(\n                \"best hypo: \"\n                + \"\".join(\n                    [\n                        self.token_list[x]\n                        for x in running_hyps.yseq[0, 1 : running_hyps.length[0]]\n                    ]\n                )\n            )\n        # add eos in the final loop to avoid that there are no ended hyps\n        if i == maxlen - 1:\n            logging.info(\"adding <eos> in the last position in the loop\")\n            yseq_eos = torch.cat(\n                (\n                    running_hyps.yseq,\n                    torch.full(\n                        (n_batch, 1),\n                        self.eos,\n                        device=running_hyps.yseq.device,\n                        dtype=torch.int64,\n                    ),\n                ),\n                1,\n            )\n            running_hyps.yseq.resize_as_(yseq_eos)\n            running_hyps.yseq[:] = yseq_eos\n            running_hyps.length[:] = yseq_eos.shape[1]\n\n        # add ended hypotheses to a final list, and remove them from current hypotheses\n        # (this will be a problem, number of hyps < beam)\n        is_eos = (\n            running_hyps.yseq[torch.arange(n_batch), running_hyps.length - 1]\n            == self.eos\n        )\n        for b in torch.nonzero(is_eos).view(-1):\n            hyp = 
self._select(running_hyps, b)\n            ended_hyps.append(hyp)\n        remained_ids = torch.nonzero(is_eos == 0).view(-1)\n        return self._batch_select(running_hyps, remained_ids)\n"
  },
  {
    "path": "nets/batch_beam_search_online_sim.py",
    "content": "\"\"\"Parallel beam search module for online simulation.\"\"\"\n\nimport logging\nfrom pathlib import Path\nfrom typing import List\n\nimport yaml\n\nimport torch\n\nfrom espnet.nets.batch_beam_search import BatchBeamSearch\nfrom espnet.nets.beam_search import Hypothesis\nfrom espnet.nets.e2e_asr_common import end_detect\n\n\nclass BatchBeamSearchOnlineSim(BatchBeamSearch):\n    \"\"\"Online beam search implementation.\n\n    This simulates streaming decoding.\n    It requires encoded features of entire utterance and\n    extracts block by block from it as it should be done\n    in streaming processing.\n    This is based on Tsunoo et al, \"STREAMING TRANSFORMER ASR\n    WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH\"\n    (https://arxiv.org/abs/2006.14941).\n    \"\"\"\n\n    def set_streaming_config(self, asr_config: str):\n        \"\"\"Set config file for streaming decoding.\n\n        Args:\n            asr_config (str): The config file for asr training\n\n        \"\"\"\n        train_config_file = Path(asr_config)\n        self.block_size = None\n        self.hop_size = None\n        self.look_ahead = None\n        config = None\n        with train_config_file.open(\"r\", encoding=\"utf-8\") as f:\n            args = yaml.safe_load(f)\n            if \"encoder_conf\" in args.keys():\n                if \"block_size\" in args[\"encoder_conf\"].keys():\n                    self.block_size = args[\"encoder_conf\"][\"block_size\"]\n                if \"hop_size\" in args[\"encoder_conf\"].keys():\n                    self.hop_size = args[\"encoder_conf\"][\"hop_size\"]\n                if \"look_ahead\" in args[\"encoder_conf\"].keys():\n                    self.look_ahead = args[\"encoder_conf\"][\"look_ahead\"]\n            elif \"config\" in args.keys():\n                config = args[\"config\"]\n                if config is None:\n                    logging.info(\n                        \"Cannot find config file for streaming decoding: \"\n        
                + \"apply batch beam search instead.\"\n                    )\n                    return\n        if (\n            self.block_size is None or self.hop_size is None or self.look_ahead is None\n        ) and config is not None:\n            config_file = Path(config)\n            with config_file.open(\"r\", encoding=\"utf-8\") as f:\n                args = yaml.safe_load(f)\n            # default to None so the checks below cannot hit an unbound name\n            enc_args = args.get(\"encoder_conf\")\n            if enc_args and \"block_size\" in enc_args:\n                self.block_size = enc_args[\"block_size\"]\n            if enc_args and \"hop_size\" in enc_args:\n                self.hop_size = enc_args[\"hop_size\"]\n            if enc_args and \"look_ahead\" in enc_args:\n                self.look_ahead = enc_args[\"look_ahead\"]\n\n    def set_block_size(self, block_size: int):\n        \"\"\"Set block size for streaming decoding.\n\n        Args:\n            block_size (int): The block size of encoder\n        \"\"\"\n        self.block_size = block_size\n\n    def set_hop_size(self, hop_size: int):\n        \"\"\"Set hop size for streaming decoding.\n\n        Args:\n            hop_size (int): The hop size of encoder\n        \"\"\"\n        self.hop_size = hop_size\n\n    def set_look_ahead(self, look_ahead: int):\n        \"\"\"Set look ahead size for streaming decoding.\n\n        Args:\n            look_ahead (int): The look ahead size of encoder\n        \"\"\"\n        self.look_ahead = look_ahead\n\n    def forward(\n        self, x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0\n    ) -> List[Hypothesis]:\n        \"\"\"Perform beam search.\n\n        Args:\n            x (torch.Tensor): Encoded speech feature (T, D)\n            maxlenratio (float): Input length ratio to obtain max output length.\n                If maxlenratio=0.0 (default), it uses an end-detect function\n                to automatically find maximum hypothesis 
lengths\n            minlenratio (float): Input length ratio to obtain min output length.\n\n        Returns:\n            list[Hypothesis]: N-best decoding results\n\n        \"\"\"\n        self.conservative = True  # always true\n\n        if self.block_size and self.hop_size and self.look_ahead:\n            cur_end_frame = int(self.block_size - self.look_ahead)\n        else:\n            cur_end_frame = x.shape[0]\n        process_idx = 0\n        if cur_end_frame < x.shape[0]:\n            h = x.narrow(0, 0, cur_end_frame)\n        else:\n            h = x\n\n        # set length bounds\n        if maxlenratio == 0:\n            maxlen = x.shape[0]\n        else:\n            maxlen = max(1, int(maxlenratio * x.size(0)))\n        minlen = int(minlenratio * x.size(0))\n        logging.info(\"decoder input length: \" + str(x.shape[0]))\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # main loop of prefix search\n        running_hyps = self.init_hyp(h)\n        prev_hyps = []\n        ended_hyps = []\n        prev_repeat = False\n\n        continue_decode = True\n\n        while continue_decode:\n            move_to_next_block = False\n            if cur_end_frame < x.shape[0]:\n                h = x.narrow(0, 0, cur_end_frame)\n            else:\n                h = x\n\n            # extend states for ctc\n            self.extend(h, running_hyps)\n\n            while process_idx < maxlen:\n                logging.debug(\"position \" + str(process_idx))\n                best = self.search(running_hyps, h)\n\n                if process_idx == maxlen - 1:\n                    # end decoding\n                    running_hyps = self.post_process(\n                        process_idx, maxlen, maxlenratio, best, ended_hyps\n                    )\n                n_batch = best.yseq.shape[0]\n                local_ended_hyps = []\n                is_local_eos = (\n                    
best.yseq[torch.arange(n_batch), best.length - 1] == self.eos\n                )\n                for i in range(is_local_eos.shape[0]):\n                    if is_local_eos[i]:\n                        hyp = self._select(best, i)\n                        local_ended_hyps.append(hyp)\n                    # NOTE(tsunoo): check repetitions here\n                    # This is a implicit implementation of\n                    # Eq (11) in https://arxiv.org/abs/2006.14941\n                    # A flag prev_repeat is used instead of using set\n                    elif (\n                        not prev_repeat\n                        and best.yseq[i, -1] in best.yseq[i, :-1]\n                        and cur_end_frame < x.shape[0]\n                    ):\n                        move_to_next_block = True\n                        prev_repeat = True\n                if maxlenratio == 0.0 and end_detect(\n                    [lh.asdict() for lh in local_ended_hyps], process_idx\n                ):\n                    logging.info(f\"end detected at {process_idx}\")\n                    continue_decode = False\n                    break\n                if len(local_ended_hyps) > 0 and cur_end_frame < x.shape[0]:\n                    move_to_next_block = True\n\n                if move_to_next_block:\n                    if (\n                        self.hop_size\n                        and cur_end_frame + int(self.hop_size) + int(self.look_ahead)\n                        < x.shape[0]\n                    ):\n                        cur_end_frame += int(self.hop_size)\n                    else:\n                        cur_end_frame = x.shape[0]\n                    logging.debug(\"Going to next block: %d\", cur_end_frame)\n                    if process_idx > 1 and len(prev_hyps) > 0 and self.conservative:\n                        running_hyps = prev_hyps\n                        process_idx -= 1\n                        prev_hyps = []\n                    break\n\n      
          prev_repeat = False\n                prev_hyps = running_hyps\n                running_hyps = self.post_process(\n                    process_idx, maxlen, maxlenratio, best, ended_hyps\n                )\n\n                if cur_end_frame >= x.shape[0]:\n                    for hyp in local_ended_hyps:\n                        ended_hyps.append(hyp)\n\n                if len(running_hyps) == 0:\n                    logging.info(\"no hypothesis. Finish decoding.\")\n                    continue_decode = False\n                    break\n                else:\n                    logging.debug(f\"remained hypotheses: {len(running_hyps)}\")\n                # increment number\n                process_idx += 1\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x.score, reverse=True)\n        # check the number of hypotheses reaching to eos\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there is no N-best results, perform recognition \"\n                \"again with smaller minlenratio.\"\n            )\n            return (\n                []\n                if minlenratio < 0.1\n                else self.forward(x, maxlenratio, max(0.0, minlenratio - 0.1))\n            )\n\n        # report the best result\n        best = nbest_hyps[0]\n        for k, v in best.scores.items():\n            logging.info(\n                f\"{v:6.2f} * {self.weights[k]:3} = {v * self.weights[k]:6.2f} for {k}\"\n            )\n        logging.info(f\"total log probability: {best.score:.2f}\")\n        logging.info(f\"normalized log probability: {best.score / len(best.yseq):.2f}\")\n        logging.info(f\"total number of ended hypotheses: {len(nbest_hyps)}\")\n        if self.token_list is not None:\n            logging.info(\n                \"best hypo: \"\n                + \"\".join([self.token_list[x] for x in best.yseq[1:-1]])\n                + \"\\n\"\n            )\n        return nbest_hyps\n\n    def extend(self, x: 
torch.Tensor, hyps: Hypothesis) -> List[Hypothesis]:\n        \"\"\"Extend probabilities and states with more encoded chunks.\n\n        Args:\n            x (torch.Tensor): The extended encoder output feature\n            hyps (Hypothesis): Current list of hypotheses\n\n        Returns:\n            Hypothesis: The extended hypothesis\n\n        \"\"\"\n        for k, d in self.scorers.items():\n            if hasattr(d, \"extend_prob\"):\n                d.extend_prob(x)\n            if hasattr(d, \"extend_state\"):\n                hyps.states[k] = d.extend_state(hyps.states[k])\n"
  },
  {
    "path": "nets/beam_search.py",
    "content": "\"\"\"Beam search module.\"\"\"\n\nfrom itertools import chain\nimport logging\nfrom typing import Any\nfrom typing import Dict\nfrom typing import List\nfrom typing import NamedTuple\nfrom typing import Tuple\nfrom typing import Union\n\nimport torch\n\nfrom espnet.nets.e2e_asr_common import end_detect\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom espnet.nets.scorer_interface import ScorerInterface\nfrom snowfall.warpper.mmi_utils import parse_step\n\nclass Hypothesis(NamedTuple):\n    \"\"\"Hypothesis data type.\"\"\"\n\n    yseq: torch.Tensor\n    score: Union[float, torch.Tensor] = 0\n    scores: Dict[str, Union[float, torch.Tensor]] = dict()\n    states: Dict[str, Any] = dict()\n\n    def asdict(self) -> dict:\n        \"\"\"Convert data to JSON-friendly dict.\"\"\"\n        return self._replace(\n            yseq=self.yseq.tolist(),\n            score=float(self.score),\n            scores={k: float(v) for k, v in self.scores.items()},\n        )._asdict()\n\n    def __str__(self):\n        ans = \"\"\n\n        ans += f\"Total Scores: {self.score}\\n\"\n        info = \"Scores -> \"\n        for k, v in self.scores.items():\n            info += \"| {}: {:<7.2f} |\".format(k, v)\n        ans += info\n        return ans\n\n\nclass BeamSearch(object):\n    \"\"\"Beam search implementation.\"\"\"\n\n    def __init__(\n        self,\n        scorers: Dict[str, ScorerInterface],\n        weights: Dict[str, float],\n        beam_size: int,\n        vocab_size: int,\n        sos: int,\n        eos: int,\n        token_list: List[str] = None,\n        pre_beam_ratio: float = 1.5,\n        pre_beam_score_key: str = None,\n        mmi_rescorer = None,\n    ):\n        \"\"\"Initialize beam search.\n\n        Args:\n            scorers (dict[str, ScorerInterface]): Dict of decoder modules\n                e.g., Decoder, CTCPrefixScorer, LM\n                The scorer will be ignored if it is `None`\n            weights (dict[str, 
float]): Dict of weights for each scorers\n                The scorer will be ignored if its weight is 0\n            beam_size (int): The number of hypotheses kept during search\n            vocab_size (int): The number of vocabulary\n            sos (int): Start of sequence id\n            eos (int): End of sequence id\n            token_list (list[str]): List of tokens for debug log\n            pre_beam_score_key (str): key of scores to perform pre-beam search\n            pre_beam_ratio (float): beam size in the pre-beam search\n                will be `int(pre_beam_ratio * beam_size)`\n\n        \"\"\"\n        super().__init__()\n        # set scorers\n        self.weights = weights\n        self.scorers = dict()\n        self.full_scorers = dict()\n        self.part_scorers = dict()\n        # this module dict is required for recursive cast\n        # `self.to(device, dtype)` in `recog.py`\n        self.nn_dict = torch.nn.ModuleDict()\n        for k, v in scorers.items():\n            w = weights.get(k, 0)\n            if w == 0 or v is None:\n                continue\n            assert isinstance(\n                v, ScorerInterface\n            ), f\"{k} ({type(v)}) does not implement ScorerInterface\"\n            self.scorers[k] = v\n            if isinstance(v, PartialScorerInterface):\n                self.part_scorers[k] = v\n                print(f\"Using part scorer: {k} with weight: {w}\", flush=True)\n            else:\n                self.full_scorers[k] = v\n                print(f\"Using full scorer: {k} with weight: {w}\", flush=True)\n            if isinstance(v, torch.nn.Module):\n                self.nn_dict[k] = v\n\n        # set configurations\n        self.sos = sos\n        self.eos = eos\n        self.token_list = token_list\n        self.pre_beam_size = int(pre_beam_ratio * beam_size)\n        self.beam_size = beam_size\n        self.n_vocab = vocab_size\n        if (\n            pre_beam_score_key is not None\n            and 
pre_beam_score_key != \"full\"\n            and pre_beam_score_key not in self.full_scorers\n        ):\n            raise KeyError(f\"{pre_beam_score_key} is not found in {self.full_scorers}\")\n        self.pre_beam_score_key = pre_beam_score_key\n        self.do_pre_beam = (\n            self.pre_beam_score_key is not None\n            and self.pre_beam_size < self.n_vocab\n            and len(self.part_scorers) > 0\n        )\n        print(f\"Do pre-beam: {self.do_pre_beam}\")\n\n        self.mmi_rescorer = mmi_rescorer\n        # hypotheses scoring below this are deleted even if they are in the beam\n        self.min_score = -1000\n\n    def init_hyp(self, x: torch.Tensor) -> List[Hypothesis]:\n        \"\"\"Get an initial hypothesis data.\n\n        Args:\n            x (torch.Tensor): The encoder output feature\n\n        Returns:\n            Hypothesis: The initial hypothesis.\n\n        \"\"\"\n        init_states = dict()\n        init_scores = dict()\n        for k, d in self.scorers.items():\n            init_states[k] = d.init_state(x)\n            init_scores[k] = 0.0\n        return [\n            Hypothesis(\n                score=0.0,\n                scores=init_scores,\n                states=init_states,\n                yseq=torch.tensor([self.sos], device=x.device),\n            )\n        ]\n\n    @staticmethod\n    def append_token(xs: torch.Tensor, x: int) -> torch.Tensor:\n        \"\"\"Append new token to prefix tokens.\n\n        Args:\n            xs (torch.Tensor): The prefix token\n            x (int): The new token to append\n\n        Returns:\n            torch.Tensor: New tensor contains: xs + [x] with xs.dtype and xs.device\n\n        \"\"\"\n        x = torch.tensor([x], dtype=xs.dtype, device=xs.device)\n        return torch.cat((xs, x))\n\n    def score_full(\n        self, hyp: Hypothesis, x: torch.Tensor\n    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:\n        \"\"\"Score new hypothesis by `self.full_scorers`.\n\n        Args:\n  
          hyp (Hypothesis): Hypothesis with prefix tokens to score\n            x (torch.Tensor): Corresponding input feature\n\n        Returns:\n            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of\n                score dict of `hyp` that has string keys of `self.full_scorers`\n                and tensor score values of shape: `(self.n_vocab,)`,\n                and state dict that has string keys\n                and state values of `self.full_scorers`\n\n        \"\"\"\n        scores = dict()\n        states = dict()\n        for k, d in self.full_scorers.items():\n            scores[k], states[k] = d.score(hyp.yseq, hyp.states[k], x)\n        return scores, states\n\n    def score_partial(\n        self, hyp: Hypothesis, ids: torch.Tensor, x: torch.Tensor\n    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:\n        \"\"\"Score new hypothesis by `self.part_scorers`.\n\n        Args:\n            hyp (Hypothesis): Hypothesis with prefix tokens to score\n            ids (torch.Tensor): 1D tensor of new partial tokens to score\n            x (torch.Tensor): Corresponding input feature\n\n        Returns:\n            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of\n                score dict of `hyp` that has string keys of `self.part_scorers`\n                and tensor score values of shape: `(len(ids),)`,\n                and state dict that has string keys\n                and state values of `self.part_scorers`\n\n        \"\"\"\n        scores = dict()\n        states = dict()\n        for k, d in self.part_scorers.items():\n            scores[k], states[k] = d.score_partial(hyp.yseq, ids, hyp.states[k], x)\n        return scores, states\n\n    def beam(\n        self, weighted_scores: torch.Tensor, ids: torch.Tensor\n    ) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"Compute topk full token ids and partial token ids.\n\n        Args:\n            weighted_scores (torch.Tensor): The weighted sum scores for each 
tokens.\n            Its shape is `(self.n_vocab,)`.\n            ids (torch.Tensor): The partial token ids to compute topk\n\n        Returns:\n            Tuple[torch.Tensor, torch.Tensor]:\n                The topk full token ids and partial token ids.\n                Their shapes are `(self.beam_size,)`\n\n        \"\"\"\n        # no pre beam performed\n        if weighted_scores.size(0) == ids.size(0):\n            top_ids = weighted_scores.topk(self.beam_size)[1]\n            return top_ids, top_ids\n\n        # mask pruned in pre-beam not to select in topk\n        tmp = weighted_scores[ids]\n        weighted_scores[:] = -float(\"inf\")\n        weighted_scores[ids] = tmp\n        top_ids = weighted_scores.topk(self.beam_size)[1]\n        local_ids = weighted_scores[ids].topk(self.beam_size)[1]\n        return top_ids, local_ids\n\n    @staticmethod\n    def merge_scores(\n        prev_scores: Dict[str, float],\n        next_full_scores: Dict[str, torch.Tensor],\n        full_idx: int,\n        next_part_scores: Dict[str, torch.Tensor],\n        part_idx: int,\n    ) -> Dict[str, torch.Tensor]:\n        \"\"\"Merge scores for new hypothesis.\n\n        Args:\n            prev_scores (Dict[str, float]):\n                The previous hypothesis scores by `self.scorers`\n            next_full_scores (Dict[str, torch.Tensor]): scores by `self.full_scorers`\n            full_idx (int): The next token id for `next_full_scores`\n            next_part_scores (Dict[str, torch.Tensor]):\n                scores of partial tokens by `self.part_scorers`\n            part_idx (int): The new token id for `next_part_scores`\n\n        Returns:\n            Dict[str, torch.Tensor]: The new score dict.\n                Its keys are names of `self.full_scorers` and `self.part_scorers`.\n                Its values are scalar tensors by the scorers.\n\n        \"\"\"\n        new_scores = dict()\n        for k, v in next_full_scores.items():\n            new_scores[k] = 
prev_scores[k] + v[full_idx]\n        for k, v in next_part_scores.items():\n            new_scores[k] = prev_scores[k] + v[part_idx]\n        return new_scores\n\n    def merge_states(self, states: Any, part_states: Any, part_idx: int) -> Any:\n        \"\"\"Merge states for new hypothesis.\n\n        Args:\n            states: states of `self.full_scorers`\n            part_states: states of `self.part_scorers`\n            part_idx (int): The new token id for `part_scores`\n\n        Returns:\n            Dict[str, torch.Tensor]: The new score dict.\n                Its keys are names of `self.full_scorers` and `self.part_scorers`.\n                Its values are states of the scorers.\n\n        \"\"\"\n        new_states = dict()\n        for k, v in states.items():\n            new_states[k] = v\n        for k, d in self.part_scorers.items():\n            new_states[k] = d.select_state(part_states[k], part_idx)\n        return new_states\n\n    def search(\n        self, running_hyps: List[Hypothesis], x: torch.Tensor\n    ) -> List[Hypothesis]:\n        \"\"\"Search new tokens for running hypotheses and encoded speech x.\n\n        Args:\n            running_hyps (List[Hypothesis]): Running hypotheses on beam\n            x (torch.Tensor): Encoded speech feature (T, D)\n\n        Returns:\n            List[Hypotheses]: Best sorted hypotheses\n\n        \"\"\"\n        best_hyps = []\n        part_ids = torch.arange(self.n_vocab, device=x.device)  # no pre-beam\n        for hyp in running_hyps:\n\n            # scoring\n            weighted_scores = torch.zeros(self.n_vocab, dtype=x.dtype, device=x.device)\n            scores, states = self.score_full(hyp, x)\n            for k in self.full_scorers:\n                weighted_scores += self.weights[k] * scores[k]\n            # partial scoring\n            if self.do_pre_beam:\n                pre_beam_scores = (\n                    weighted_scores\n                    if self.pre_beam_score_key == \"full\"\n 
                   else scores[self.pre_beam_score_key]\n                )\n                part_ids = torch.topk(pre_beam_scores, self.pre_beam_size)[1]\n            part_scores, part_states = self.score_partial(hyp, part_ids, x)\n            for k in self.part_scorers:\n                weighted_scores[part_ids] += self.weights[k] * part_scores[k]\n            # Show the scores step by step\n            # parse_step(hyp, self.token_list, part_ids,\n            #            self.weights, scores,\n            #            part_scores, weighted_scores)\n            weighted_scores += hyp.score\n\n            # update hyps\n            for j, part_j in zip(*self.beam(weighted_scores, part_ids)):\n                # will be (2 x beam at most)\n                this_hyp = Hypothesis(\n                        score=weighted_scores[j],\n                        yseq=self.append_token(hyp.yseq, j),\n                        scores=self.merge_scores(\n                            hyp.scores, scores, j, part_scores, part_j\n                        ),\n                        states=self.merge_states(states, part_states, part_j),\n                    )\n                best_hyps.append(this_hyp)\n\n            # sort and prune 2 x beam -> beam\n            best_hyps = sorted(best_hyps, key=lambda x: x.score, reverse=True)[\n                : min(len(best_hyps), self.beam_size)\n            ]\n        return best_hyps\n\n    def __call__(\n        self, x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0\n    ) -> List[Hypothesis]:\n        \"\"\"Perform beam search.\n\n        Args:\n            x (torch.Tensor): Encoded speech feature (T, D)\n            maxlenratio (float): Input length ratio to obtain max output length.\n                If maxlenratio=0.0 (default), it uses a end-detect function\n                to automatically find maximum hypothesis lengths\n            minlenratio (float): Input length ratio to obtain min output length.\n\n        Returns:\n 
           list[Hypothesis]: N-best decoding results\n\n        \"\"\"\n        # set length bounds\n        if maxlenratio == 0:\n            maxlen = x.shape[0]\n        else:\n            maxlen = max(1, int(maxlenratio * x.size(0)))\n        minlen = int(minlenratio * x.size(0))\n        logging.info(\"decoder input length: \" + str(x.shape[0]))\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # main loop of prefix search\n        running_hyps = self.init_hyp(x)\n        ended_hyps = []\n        for i in range(maxlen):\n            # print(f\"######### Iteration {i} #########\")\n            logging.debug(\"position \" + str(i))\n            best = self.search(running_hyps, x)\n\n            # post process of one iteration\n            running_hyps = self.post_process(i, maxlen, maxlenratio, best, ended_hyps)\n\n            # delete hypotheses whose score is below min_score, i.e., those killed by MMI\n            running_hyps = [h for h in running_hyps if h.score > self.min_score]\n\n            # end detection\n            if maxlenratio == 0.0 and end_detect([h.asdict() for h in ended_hyps], i):\n                logging.info(f\"end detected at {i}\")\n                break\n            if len(running_hyps) == 0:\n                logging.info(\"no hypothesis. 
Finish decoding.\")\n                break\n            else:\n                logging.debug(f\"remained hypotheses: {len(running_hyps)}\")\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x.score, reverse=True)\n        # print(\"#\" * 20, \"Details of Final Best Hypothesis\", \"#\" * 20)\n        # for h in nbest_hyps:\n        #     print(\"Hypothesis: \" + \"\".join([self.token_list[x] for x in h.yseq[1:-1]]))\n        #     print(h, flush=True)\n        # check the number of hypotheses reaching to eos\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there is no N-best results, perform recognition \"\n                \"again with smaller minlenratio.\"\n            )\n            # BeamSearch is a plain object with __call__ (no forward), so recurse via self(...)\n            return (\n                []\n                if minlenratio < 0.1\n                else self(x, maxlenratio, max(0.0, minlenratio - 0.1))\n            )\n\n        # report the best result\n        best = nbest_hyps[0]\n        for k, v in best.scores.items():\n            logging.info(\n                f\"{v:6.2f} * {self.weights[k]:3} = {v * self.weights[k]:6.2f} for {k}\"\n            )\n        logging.info(f\"total log probability: {best.score:.2f}\")\n        logging.info(f\"normalized log probability: {best.score / len(best.yseq):.2f}\")\n        logging.info(f\"total number of ended hypotheses: {len(nbest_hyps)}\")\n        if self.token_list is not None:\n            logging.info(\n                \"best hypo: \"\n                + \"\".join([self.token_list[x] for x in best.yseq[1:-1]])\n                + \"\\n\"\n            )\n        # print(\"Start MMI rescoring\", flush=True)\n        if self.mmi_rescorer:\n            nbest_hyps = self.mmi_rescorer.score(x, nbest_hyps, v2=True)\n\n        return nbest_hyps\n\n    def post_process(\n        self,\n        i: int,\n        maxlen: int,\n        maxlenratio: float,\n        running_hyps: List[Hypothesis],\n        ended_hyps: List[Hypothesis],\n    ) -> List[Hypothesis]:\n 
       \"\"\"Perform post-processing of beam search iterations.\n\n        Args:\n            i (int): The length of hypothesis tokens.\n            maxlen (int): The maximum length of tokens in beam search.\n            maxlenratio (int): The maximum length ratio in beam search.\n            running_hyps (List[Hypothesis]): The running hypotheses in beam search.\n            ended_hyps (List[Hypothesis]): The ended hypotheses in beam search.\n\n        Returns:\n            List[Hypothesis]: The new running hypotheses.\n\n        \"\"\"\n        logging.debug(f\"the number of running hypotheses: {len(running_hyps)}\")\n        if self.token_list is not None:\n            logging.debug(\n                \"best hypo: \"\n                + \"\".join([self.token_list[x] for x in running_hyps[0].yseq[1:]])\n            )\n        # add eos in the final loop to avoid that there are no ended hyps\n        if i == maxlen - 1:\n            logging.info(\"adding <eos> in the last position in the loop\")\n            running_hyps = [\n                h._replace(yseq=self.append_token(h.yseq, self.eos))\n                for h in running_hyps\n            ]\n\n        # add ended hypotheses to a final list, and removed them from current hypotheses\n        # (this will be a problem, number of hyps < beam)\n        remained_hyps = []\n        for hyp in running_hyps:\n            if hyp.yseq[-1] == self.eos:\n                # e.g., Word LM needs to add final <eos> score\n                for k, d in chain(self.full_scorers.items(), self.part_scorers.items()):\n                    s = d.final_score(hyp.states[k])\n                    hyp.scores[k] += s\n                    hyp = hyp._replace(score=hyp.score + self.weights[k] * s)\n                ended_hyps.append(hyp)\n            else:\n                remained_hyps.append(hyp)\n        return remained_hyps\n\n\ndef beam_search(\n    x: torch.Tensor,\n    sos: int,\n    eos: int,\n    beam_size: int,\n    vocab_size: int,\n    
scorers: Dict[str, ScorerInterface],\n    weights: Dict[str, float],\n    token_list: List[str] = None,\n    maxlenratio: float = 0.0,\n    minlenratio: float = 0.0,\n    pre_beam_ratio: float = 1.5,\n    pre_beam_score_key: str = \"full\",\n) -> list:\n    \"\"\"Perform beam search with scorers.\n\n    Args:\n        x (torch.Tensor): Encoded speech feature (T, D)\n        sos (int): Start of sequence id\n        eos (int): End of sequence id\n        beam_size (int): The number of hypotheses kept during search\n        vocab_size (int): The number of vocabulary\n        scorers (dict[str, ScorerInterface]): Dict of decoder modules\n            e.g., Decoder, CTCPrefixScorer, LM\n            The scorer will be ignored if it is `None`\n        weights (dict[str, float]): Dict of weights for each scorers\n            The scorer will be ignored if its weight is 0\n        token_list (list[str]): List of tokens for debug log\n        maxlenratio (float): Input length ratio to obtain max output length.\n            If maxlenratio=0.0 (default), it uses a end-detect function\n            to automatically find maximum hypothesis lengths\n        minlenratio (float): Input length ratio to obtain min output length.\n        pre_beam_score_key (str): key of scores to perform pre-beam search\n        pre_beam_ratio (float): beam size in the pre-beam search\n            will be `int(pre_beam_ratio * beam_size)`\n\n    Returns:\n        list: N-best decoding results\n\n    \"\"\"\n    # BeamSearch defines __call__ rather than forward, so invoke the instance directly\n    ret = BeamSearch(\n        scorers,\n        weights,\n        beam_size=beam_size,\n        vocab_size=vocab_size,\n        pre_beam_ratio=pre_beam_ratio,\n        pre_beam_score_key=pre_beam_score_key,\n        sos=sos,\n        eos=eos,\n        token_list=token_list,\n    )(x=x, maxlenratio=maxlenratio, minlenratio=minlenratio)\n    return [h.asdict() for h in ret]\n"
  },
  {
    "path": "nets/beam_search_transducer.py",
    "content": "\"\"\"Search algorithms for transducer models.\"\"\"\nfrom typing import List\nfrom typing import Union\nfrom collections import Counter, defaultdict\nimport numpy as np\nimport torch\nimport time\nimport math\nfrom itertools import groupby\nfrom espnet.nets.pytorch_backend.transducer.utils import create_lm_batch_state\nfrom espnet.nets.pytorch_backend.transducer.utils import init_lm_state\nfrom espnet.nets.pytorch_backend.transducer.utils import is_prefix\nfrom espnet.nets.pytorch_backend.transducer.utils import recombine_hyps\nfrom espnet.nets.pytorch_backend.transducer.utils import select_lm_state\nfrom espnet.nets.pytorch_backend.transducer.utils import substract\nfrom espnet.nets.transducer_decoder_interface import Hypothesis\nfrom espnet.nets.transducer_decoder_interface import NSCHypothesis\nfrom espnet.nets.transducer_decoder_interface import TransducerDecoderInterface\nfrom espnet.nets.scorers.mmi_rescorer import MMIRescorer\n# from espnet.nets.scorers.mmi_rnnt_scorer import MMIRNNTScorer\nfrom espnet.nets.scorers.mmi_alignment_score import MMIRNNTScorer\nfrom espnet.nets.scorers.ctc_rnnt_scorer import CTCRNNTScorer\nfrom espnet.nets.scorers.mmi_rnnt_lookahead_scorer import MMIRNNTLookaheadScorer\n\nclass BeamSearchTransducer:\n    \"\"\"Beam search implementation for transducer.\"\"\"\n\n    def __init__(\n        self,\n        decoder: Union[TransducerDecoderInterface, torch.nn.Module],\n        joint_network: torch.nn.Module,\n        beam_size: int,\n        lm: torch.nn.Module = None,\n        lm_weight: float = 0.1,\n        search_type: str = \"default\",\n        char_list = None,\n        max_sym_exp: int = 2,\n        u_max: int = 50,\n        nstep: int = 1,\n        prefix_alpha: int = 1,\n        score_norm: bool = True,\n        nbest: int = 1,\n        mmi_scorer=None,\n        mmi_weight=0.0,\n        ctc_module=None,\n        ctc_weight=0.0,\n        ngram_scorer=None,\n        ngram_weight=0.0,\n        
word_ngram_scorer=None,\n        word_ngram_weight=0.0,\n        tlg_scorer=None,\n        tlg_weight=0.0,\n        forbid_eng=False,\n        eng_vocab=None,\n    ):\n        \"\"\"Initialize transducer beam search.\n\n        Args:\n            decoder: Decoder class to use\n            joint_network: Joint Network class\n            beam_size: Number of hypotheses kept during search\n            lm: LM class to use\n            lm_weight: lm weight for soft fusion\n            search_type: type of algorithm to use for search\n            max_sym_exp: number of maximum symbol expansions at each time step (\"tsd\")\n            u_max: maximum output sequence length (\"alsd\")\n            nstep: number of maximum expansion steps at each time step (\"nsc\")\n            prefix_alpha: maximum prefix length in prefix search (\"nsc\")\n            score_norm: normalize final scores by length (\"default\")\n            nbest: number of returned final hypothesis\n        \"\"\"\n        self.decoder = decoder\n        self.joint_network = joint_network\n\n        self.beam_size = beam_size\n        self.hidden_size = decoder.dunits\n        self.vocab_size = decoder.odim\n        self.blank = decoder.blank\n\n        # MMI alignment scorer\n        self.mmi_scorer = mmi_scorer\n        self.mmi_weight = mmi_weight\n        print(f\"MMI scorer: {mmi_scorer} | MMI weight: {mmi_weight}\")\n\n        # deprecated. 
CTC scorer\n        self.ctc_module = ctc_module\n        self.ctc_weight = ctc_weight\n        print(f\"CTC scorer: {ctc_module} | CTC weight: {ctc_weight}\")\n\n        # character-level Ngram scorer implemented by kenlm\n        self.ngram_scorer = ngram_scorer\n        self.ngram_weight = ngram_weight\n        print(f\"ngram scorer: {ngram_scorer} | ngram weight: {ngram_weight}\")\n\n        # word-level Ngram scorer implemented by k2\n        self.word_ngram_scorer = word_ngram_scorer\n        self.word_ngram_weight = word_ngram_weight\n        print(f\"word ngram scorer: {word_ngram_scorer} | word ngram weight: {word_ngram_weight}\")\n\n        # word-level Ngram scorer implemented by pykaldi (cweng)\n        self.tlg_scorer = tlg_scorer\n        self.tlg_weight = tlg_weight\n        print(f\"tlg scorer: {tlg_scorer} | tlg weight: {tlg_weight}\")\n\n        if search_type == \"ctc_greedy\":\n            self.search_algorithm = self.ctc_greedy_search\n            assert self.ctc_module is not None\n        elif self.beam_size <= 1:\n            self.search_algorithm = self.greedy_search\n        elif search_type == \"ctc_beam\":\n            self.search_algorithm = self.ctc_beam_search\n            assert self.ctc_module is not None\n        elif search_type == \"default\":\n            self.search_algorithm = self.default_beam_search\n        elif search_type == \"tsd\":\n            self.search_algorithm = self.time_sync_decoding\n        elif search_type == \"alsd\":\n            self.search_algorithm = self.align_length_sync_decoding\n        elif search_type == \"nsc\":\n            self.search_algorithm = self.nsc_beam_search\n        else:\n            raise NotImplementedError\n\n        self.lm = lm\n        self.lm_weight = lm_weight\n        print(f\"Using LM {lm} with weight {lm_weight}\")\n\n        if lm is not None and lm_weight > 0.0:\n            self.use_lm = True\n            self.is_wordlm = True if hasattr(lm, \"predictor\") and \\\n       
                      hasattr(lm.predictor, \"wordlm\") else False\n            if hasattr(lm, \"predictor\"):\n                self.lm_predictor = lm.predictor.wordlm if self.is_wordlm else lm.predictor\n                self.lm_layers = len(self.lm_predictor.rnn)\n            else:\n                self.is_transformer_lm = True\n        else:\n            self.use_lm = False\n\n        self.max_sym_exp = max_sym_exp\n        self.u_max = u_max\n        self.nstep = nstep\n        self.prefix_alpha = prefix_alpha\n        self.score_norm = score_norm\n\n        self.nbest = nbest\n        self.char_list = char_list\n\n        self.forbid_lst = []\n        if forbid_eng:\n            self.forbid_lst = [self.char_list.index(x) \\\n                               for x in self.char_list \\\n                               if (x >= '\\u0041' and x <= '\\u005a') \\\n                               or (x >= '\\u0061' and x <= '\\u007a')]\n        print(\"Forbid chars: \", self.forbid_lst, flush=True)\n\n        self.eng_vocab = eng_vocab\n        if self.eng_vocab is not None:\n            self.eng_token_list = [x if not is_all_chinese(x) else \"\" \\\n                                   for x in self.char_list]\n\n    def __call__(self, h: torch.Tensor) -> Union[List[Hypothesis], List[NSCHypothesis]]:\n        \"\"\"Perform beam search.\n\n        Args:\n            h: Encoded speech features (T_max, D_enc)\n\n        Returns:\n            nbest_hyps: N-best decoding results\n\n        \"\"\"\n        self.decoder.set_device(h.device)\n\n        if len(h.size()) == 3:\n            h = h.squeeze(0)\n\n        if not hasattr(self.decoder, \"decoders\"):\n            self.decoder.set_data_type(h.dtype)\n\n        nbest_hyps = self.search_algorithm(h)\n        \n        if isinstance(self.mmi_scorer, MMIRescorer):\n            nbest_hyps = self.mmi_scorer.score(h, nbest_hyps, v2=True)\n        return nbest_hyps\n\n    def sort_nbest(\n        self, hyps: Union[List[Hypothesis], 
List[NSCHypothesis]]\n    ) -> Union[List[Hypothesis], List[NSCHypothesis]]:\n        \"\"\"Sort hypotheses by score or score given sequence length.\n\n        Args:\n            hyps: list of hypotheses\n\n        Return:\n            hyps: sorted list of hypotheses\n\n        \"\"\"\n        if self.score_norm:\n            hyps.sort(key=lambda x: x.score / len(x.yseq), reverse=True)\n        else:\n            hyps.sort(key=lambda x: x.score, reverse=True)\n\n        return hyps[: self.nbest]\n\n    def vocab_regularization(self, hyps):\n        bpe_seperator = u'\\u2581'\n \n        ans = []\n        for h in hyps:\n            yseq = h.yseq if isinstance(h, Hypothesis) else h[0] # rnnt or ctc hypothesis\n            text = \"\".join([self.eng_token_list[x] for x in yseq[1:]])\n            eng_words = [x for x in text.split(bpe_seperator)[:-1] if x != \"\"] # the last may not finish\n            if all([x in self.eng_vocab for x in eng_words]):\n                ans.append(h)\n \n        return ans\n\n    def greedy_search(self, h: torch.Tensor) -> List[Hypothesis]:\n        \"\"\"Greedy search implementation for transformer-transducer.\n\n        Args:\n            h: Encoded speech features (T_max, D_enc)\n\n        Returns:\n            hyp: 1-best decoding results\n\n        \"\"\"\n        dec_state = self.decoder.init_state(1)\n\n        hyp = Hypothesis(score=0.0, yseq=[self.blank], dec_state=dec_state)\n        cache = {}\n\n        y, state, _ = self.decoder.score(hyp, cache)\n\n        for i, hi in enumerate(h):\n            ytu = torch.log_softmax(self.joint_network(hi, y), dim=-1)\n            logp, pred = torch.max(ytu, dim=-1)\n            if pred != self.blank:\n                hyp.yseq.append(int(pred))\n                hyp.score += float(logp)\n\n                hyp.dec_state = state\n\n                y, state, _ = self.decoder.score(hyp, cache)\n        return [hyp]\n\n    def ctc_greedy_search(self, h: torch.Tensor) -> List[Hypothesis]:\n     
   if len(h.size()) == 2:\n            h = h.unsqueeze(0)       \n \n        lpz = self.ctc_module.argmax(h)\n        collapsed_indices = [x[0] for x in groupby(lpz[0])]\n        hyp = [x for x in filter(lambda x: x != self.blank, collapsed_indices)]\n        nbest_hyps = [Hypothesis(score=0.0, yseq=[self.blank] + hyp, dec_state=None)]\n        return nbest_hyps\n\n    # mainly derived from wenet\n    def ctc_beam_search(self, h: torch.Tensor) -> List[Hypothesis]:\n        if len(h.size()) == 2:\n            h = h.unsqueeze(0)\n\n        ctc_prob = self.ctc_module.log_softmax(h)[0]\n        maxlen = ctc_prob.size(0)\n\n        use_full_score = False\n        if self.word_ngram_weight > 0.0:\n            lm, lm_weight = self.word_ngram_scorer, self.word_ngram_weight\n        elif self.ngram_weight > 0.0:\n            lm, lm_weight = self.ngram_scorer, self.ngram_weight\n        elif self.lm_weight > 0.0:\n            lm, lm_weight = self.lm, self.lm_weight\n            use_full_score = True\n        else:\n            lm, lm_weight = None, 0.0\n        \n        if lm is not None:\n            # yseq: (lm_score, lm_state)\n            lm_cache = {(self.blank,): (0.0, lm.init_state(None))}\n            sort_fn = lambda x: log_add(list(x[1])) + lm_cache[x[0]][0]\n        else:\n            lm_cache = None\n            sort_fn = lambda x: log_add(list(x[1])) \n\n        # non-blank sequence; (blank_ending_score, non_blank_ending_score)\n        cur_hyps = [((self.blank,), (0.0, -float('inf')))]\n        for t in range(0, maxlen):\n            logp = ctc_prob[t]\n            next_hyps = defaultdict(lambda: (-float('inf'), -float('inf')))\n            top_k_logp, top_k_index = logp.topk(self.beam_size)\n     \n            for s in top_k_index:\n                s = s.item()\n                ps = logp[s].item()\n\n                for prefix, (pb, pnb) in cur_hyps:\n                    last = prefix[-1] if len(prefix) > 0 else None\n                    if s == self.blank: # 
blank\n                        n_pb, n_pnb = next_hyps[prefix]\n                        n_pb = log_add([n_pb, pb + ps, pnb + ps])\n                        next_hyps[prefix] = (n_pb, n_pnb)\n                    elif s == last:\n                        #  Update *ss -> *s;\n                        n_pb, n_pnb = next_hyps[prefix]\n                        n_pnb = log_add([n_pnb, pnb + ps])\n                        next_hyps[prefix] = (n_pb, n_pnb)\n                        #  Update *s-s -> *ss, - is for blank\n                        n_prefix = prefix + (s, )\n                        n_pb, n_pnb = next_hyps[n_prefix]\n                        n_pnb = log_add([n_pnb, pb + ps])\n                        next_hyps[n_prefix] = (n_pb, n_pnb)\n                    else:\n                        n_prefix = prefix + (s, )\n                        n_pb, n_pnb = next_hyps[n_prefix]\n                        n_pnb = log_add([n_pnb, pb + ps, pnb + ps])\n                        next_hyps[n_prefix] = (n_pb, n_pnb)\n\n            # LM on-the-fly rescore for unseen prefix\n            if lm is not None:\n                for prefix, (_, _) in next_hyps.items():\n                    if not prefix in lm_cache.keys():\n                        y = prefix[:-1]\n                        # update all children hypotheses: NNLM \n                        if use_full_score:\n                            scores, state = lm.score(torch.Tensor(y).long(), \n                                                     lm_cache[y][1], h)\n                            for k in range(len(scores)):\n                                lm_cache[y + (k,)] = (lm_cache[y][0] \\\n                                  + scores[k].item() * lm_weight, \n                                  lm.select_state(state, k)\n                                )\n                        # update only this hypothesis: N-gram LM                                \n                        else:\n                            next_token = prefix[-1:]\n         
                   score, state = lm.score_partial(\n                                             torch.Tensor(y).long(), \n                                             torch.Tensor(next_token).long(),\n                                             lm_cache[y][1], h\n                                           )\n                            lm_cache[prefix] = (lm_cache[y][0] + score[0].item() * lm_weight, \n                                                lm.select_state(state, 0)\n                                               )\n             \n            next_hyps = sorted(next_hyps.items(), key=sort_fn, reverse=True)\n            if self.eng_vocab:\n                next_hyps = self.vocab_regularization(next_hyps)\n            cur_hyps = next_hyps[:self.beam_size]\n\n        hyps = [Hypothesis(score=log_add([hyp[1][0], hyp[1][1]]),\n                           yseq=list(hyp[0]),\n                           dec_state=None,\n                           mmi_tot_score=lm_cache[hyp[0]][0] \\\n                             if lm_cache is not None else 0.0\n                           )\n                           for hyp in cur_hyps\n               ]\n\n        return hyps \n\n    def default_beam_search(self, h: torch.Tensor) -> List[Hypothesis]:\n        \"\"\"Beam search implementation.\n\n        Args:\n            h: Encoded speech features (T_max, D_enc)\n\n        Returns:\n            nbest_hyps: N-best decoding results\n\n        \"\"\"\n        beam = min(self.beam_size, self.vocab_size)\n        beam_k = min(beam, (self.vocab_size - 1))\n\n        dec_state = self.decoder.init_state(1)\n\n        kept_hyps = [Hypothesis(score=0.0, yseq=[self.blank], dec_state=dec_state)]\n        cache = {}\n\n        for hi in h:\n            hyps = kept_hyps\n            kept_hyps = []\n\n            while True:\n                max_hyp = max(hyps, key=lambda x: x.score)\n                hyps.remove(max_hyp)\n\n                y, state, lm_tokens = self.decoder.score(max_hyp, 
cache)\n\n                ytu = torch.log_softmax(self.joint_network(hi, y), dim=-1)\n                top_k = ytu[1:].topk(beam_k, dim=-1)\n                \n                # add a blank only\n                kept_hyps.append(\n                    Hypothesis(\n                        score=(max_hyp.score + float(ytu[0:1])),\n                        yseq=max_hyp.yseq[:],\n                        dec_state=max_hyp.dec_state,\n                        lm_state=max_hyp.lm_state,\n                    )\n                )\n\n                if self.use_lm:\n                    lm_state, lm_scores = self.lm.predict(max_hyp.lm_state, lm_tokens)\n                else:\n                    lm_state = max_hyp.lm_state\n\n                for logp, k in zip(*top_k):\n                    score = max_hyp.score + float(logp)\n\n                    if self.use_lm:\n                        score += self.lm_weight * lm_scores[0][k + 1]\n\n                    hyps.append(\n                        Hypothesis(\n                            score=score,\n                            yseq=max_hyp.yseq[:] + [int(k + 1)],\n                            dec_state=state,\n                            lm_state=lm_state,\n                        )\n                    )\n\n                hyps_max = float(max(hyps, key=lambda x: x.score).score)\n                kept_most_prob = sorted(\n                    [hyp for hyp in kept_hyps if hyp.score > hyps_max],\n                    key=lambda x: x.score,\n                )\n                if len(kept_most_prob) >= beam:\n                    kept_hyps = kept_most_prob\n                    break\n\n        return self.sort_nbest(kept_hyps)\n\n    def time_sync_decoding(self, h: torch.Tensor) -> List[Hypothesis]:\n        \"\"\"Time synchronous beam search implementation.\n\n        Based on https://ieeexplore.ieee.org/document/9053040\n\n        Args:\n            h: Encoded speech features (T_max, D_enc)\n\n        Returns:\n            nbest_hyps: 
N-best decoding results\n\n        \"\"\"\n        beam = min(self.beam_size, self.vocab_size)\n\n        beam_state = self.decoder.init_state(beam)\n\n        B = [\n            Hypothesis(\n                yseq=[self.blank],\n                score=0.0,\n                dec_state=self.decoder.select_state(beam_state, 0),\n            )\n        ]\n        cache = {}\n\n        if self.use_lm and not self.is_wordlm:\n            B[0].lm_state = init_lm_state(self.lm_predictor)\n\n        for hi in h:\n            A = []\n            C = B\n\n            h_enc = hi.unsqueeze(0)\n\n            for v in range(self.max_sym_exp):\n                D = []\n\n                beam_y, beam_state, beam_lm_tokens = self.decoder.batch_score(\n                    C,\n                    beam_state,\n                    cache,\n                    self.use_lm,\n                )\n\n                beam_logp = torch.log_softmax(self.joint_network(h_enc, beam_y), dim=-1)\n                beam_topk = beam_logp[:, 1:].topk(beam, dim=-1)\n\n                seq_A = [h.yseq for h in A]\n\n                for i, hyp in enumerate(C):\n                    if hyp.yseq not in seq_A:\n                        A.append(\n                            Hypothesis(\n                                score=(hyp.score + float(beam_logp[i, 0])),\n                                yseq=hyp.yseq[:],\n                                dec_state=hyp.dec_state,\n                                lm_state=hyp.lm_state,\n                            )\n                        )\n                    else:\n                        dict_pos = seq_A.index(hyp.yseq)\n\n                        A[dict_pos].score = np.logaddexp(\n                            A[dict_pos].score, (hyp.score + float(beam_logp[i, 0]))\n                        )\n\n                if v < (self.max_sym_exp - 1):\n                    if self.use_lm:\n                        beam_lm_states = create_lm_batch_state(\n                            
[c.lm_state for c in C], self.lm_layers, self.is_wordlm\n                        )\n\n                        beam_lm_states, beam_lm_scores = self.lm.buff_predict(\n                            beam_lm_states, beam_lm_tokens, len(C)\n                        )\n\n                    for i, hyp in enumerate(C):\n                        for logp, k in zip(beam_topk[0][i], beam_topk[1][i] + 1):\n                            new_hyp = Hypothesis(\n                                score=(hyp.score + float(logp)),\n                                yseq=(hyp.yseq + [int(k)]),\n                                dec_state=self.decoder.select_state(beam_state, i),\n                                lm_state=hyp.lm_state,\n                            )\n\n                            if self.use_lm:\n                                new_hyp.score += self.lm_weight * beam_lm_scores[i, k]\n\n                                new_hyp.lm_state = select_lm_state(\n                                    beam_lm_states, i, self.lm_layers, self.is_wordlm\n                                )\n\n                            D.append(new_hyp)\n\n                C = sorted(D, key=lambda x: x.score, reverse=True)[:beam]\n\n            B = sorted(A, key=lambda x: x.score, reverse=True)[:beam]\n\n        return self.sort_nbest(B)\n\n    def align_length_sync_decoding(self, h: torch.Tensor) -> List[Hypothesis]:\n        \"\"\"Alignment-length synchronous beam search implementation.\n\n        Based on https://ieeexplore.ieee.org/document/9053040\n\n        Args:\n            h: Encoded speech features (T_max, D_enc)\n\n        Returns:\n            nbest_hyps: N-best decoding results\n\n        \"\"\"\n\n        hidden = h\n        beam = min(self.beam_size, self.vocab_size)\n\n        h_length = int(h.size(0))\n        u_max = min(self.u_max, (h_length - 1))\n\n        beam_state = self.decoder.init_state(beam)        \n \n        B = [\n            Hypothesis(\n                yseq=[self.blank],\n           
     score=0.0,\n                dec_state=self.decoder.select_state(beam_state, 0),\n                mmi_tot_score=0.0,\n                word_ngram_score=0.0,\n                tlg_state=self.tlg_scorer.init_state() if self.tlg_scorer else None\n            )\n        ]\n        # final hypothesis set to return\n        final = []\n        # For hypothesis with same yseq, its decoder output could be cached\n        # yseq -> decoder_out, decoder_state\n        cache = {} \n\n        # lm initialization\n        if self.use_lm and not self.is_wordlm:\n            if hasattr(self, \"lm_predictor\"):\n                B[0].lm_state = init_lm_state(self.lm_predictor)\n            else:\n                B[0].lm_state = self.lm.init_state(h)\n\n        if self.mmi_scorer is not None and self.mmi_weight > 0.0:\n            mmi_nnet_output, mmi_den_scores = self.mmi_scorer.den_scores(h)\n\n        if self.ctc_module:\n            ctc_pred = self.ctc_module.log_softmax(h.unsqueeze(0))[0]\n\n        for tu_sum in range(h_length + u_max):\n            A = [] # collection for next step\n            B_ = [] # collection for search in this step. state of pred. 
net is kept\n            h_states = [] # collection of encoder_out frame for each hypothesis\n            p_ctc = [] # collection of ctc distribution\n            for j, hyp in enumerate(B): # skip all hypothesis that head the last frame.\n                u = len(hyp.yseq) - 1\n                t = tu_sum - u + 1\n\n                if t > (h_length - 1):\n                    continue\n\n                B_.append(hyp)\n                h_states.append((t, h[t]))\n\n                if self.ctc_module:\n                    p_ctc.append(ctc_pred[t])\n\n            if B_:\n                beam_y, beam_state, beam_lm_tokens = self.decoder.batch_score(\n                    B_,\n                    beam_state,\n                    cache,\n                    self.use_lm,\n                )\n                \n                h_enc = torch.stack([h[1] for h in h_states])\n\n                # [beam, h_dim], [beam, h_dim]\n                beam_logp = torch.log_softmax(self.joint_network(h_enc, beam_y), dim=-1) # [beam, vocab]\n                if self.forbid_lst:\n                    beam_logp[:, self.forbid_lst] = -1e20\n                    beam_logp = torch.log_softmax(beam_logp, dim=-1)              \n \n                if self.ctc_module:\n                    p_ctc = torch.stack([p for p in p_ctc])\n                    beam_logp += self.ctc_weight * p_ctc\n\n                # warning: like in LASCTC, the LM score would not be considered in top-k process\n                beam_topk = beam_logp[:, 1:].topk(beam, dim=-1) # values and indices: [beam, beam]. 
blank excluded\n\n                if self.use_lm and not self.is_transformer_lm:\n                    beam_lm_states = create_lm_batch_state(\n                        [b.lm_state for b in B_], self.lm_layers, self.is_wordlm\n                    )\n\n                    beam_lm_states, beam_lm_scores = self.lm.buff_predict(\n                        beam_lm_states, beam_lm_tokens, len(B_)\n                    )\n\n                for i, hyp in enumerate(B_):\n                    new_hyp = Hypothesis(\n                        score=(hyp.score + float(beam_logp[i, 0])),\n                        yseq=hyp.yseq[:],\n                        dec_state=hyp.dec_state,\n                        lm_state=hyp.lm_state,\n                        mmi_tot_score=hyp.mmi_tot_score,\n                        word_ngram_score=hyp.word_ngram_score,\n                        tlg_state=hyp.tlg_state,\n                    )\n\n                    if h_states[i][0] == (h_length - 1):\n                        final.append(new_hyp)\n                    \n                    A.append(new_hyp)\n\n                    # Only search a part of candidate tokens\n                    if self.word_ngram_scorer and self.word_ngram_weight > 0.0:\n                        next_tokens = beam_topk[1][i] + 1\n                        word_ngram_scores, word_ngram_states = self.word_ngram_scorer.score_partial(\n                                                                 hyp.yseq[1:], next_tokens, \n                                                                 hyp.word_ngram_score, None)\n                    else:\n                        word_ngram_scores = [0.0] * len(beam_topk[1][i])\n                        word_ngram_states = [0.0] * len(beam_topk[1][i])\n\n                    if self.tlg_scorer and self.tlg_weight > 0.0:\n                        next_tokens = beam_topk[1][i] + 1\n                        tlg_scores, tlg_states = self.tlg_scorer.score_partial(\n                                            
         None, next_tokens,\n                                                     hyp.tlg_state, None)\n                    else:\n                        tlg_scores = [0.0] * len(beam_topk[1][i])\n                        tlg_states = [None] * len(beam_topk[1][i])\n\n                    if self.use_lm and self.is_transformer_lm:\n                        lm_score, lm_state = self.lm.score(torch.Tensor(hyp.yseq).long(),\n                                                           hyp.lm_state,\n                                                           None)\n                     \n                    for j, (logp, k) in enumerate(zip(beam_topk[0][i], beam_topk[1][i] + 1)):\n \n                        new_hyp = Hypothesis(\n                            score=(hyp.score + float(logp)),\n                            yseq=(hyp.yseq[:] + [int(k)]),\n                            dec_state=self.decoder.select_state(beam_state, i),\n                            lm_state=hyp.lm_state,\n                            mmi_tot_score=hyp.mmi_tot_score,\n                            word_ngram_score=word_ngram_states[j],\n                            tlg_state=tlg_states[j] \n                        )\n\n                        # add LM scores. 
possibly 5 styles\n                        if self.use_lm and not self.is_transformer_lm:\n                            new_hyp.score += self.lm_weight * beam_lm_scores[i, k]\n\n                            new_hyp.lm_state = select_lm_state(\n                                beam_lm_states, i, self.lm_layers, self.is_wordlm\n                            )\n\n                        if self.use_lm and self.is_transformer_lm:\n                            new_hyp.score += self.lm_weight * lm_score[k]\n\n                            new_hyp.lm_state = lm_state   \n \n                        # Word-level N-gram LM\n                        if self.word_ngram_scorer and self.word_ngram_weight > 0.0:\n                            new_hyp.score += self.word_ngram_weight * word_ngram_scores[j]\n\n                        # TLG.fst\n                        if self.tlg_scorer and self.tlg_weight > 0.0:\n                            new_hyp.score += self.tlg_weight * tlg_scores[j]\n\n                        # N-gram LM\n                        if self.ngram_scorer and self.ngram_weight > 0.0:\n                            ngram_score, _ = self.ngram_scorer.score_partial(\n                                             torch.Tensor(hyp.yseq[:]).int(),\n                                             torch.Tensor([int(k)]).int(), \n                                             None, h)\n                            new_hyp.score += self.ngram_weight * ngram_score.item()\n                            \n                        A.append(new_hyp)\n           \n            if self.eng_vocab is not None:\n                A = self.vocab_regularization(A)\n \n            if self.mmi_scorer is not None and self.mmi_weight > 0.0:\n                A = self.mmi_scorer.batch_score(A, mmi_nnet_output, mmi_den_scores, tu_sum+1, self.mmi_weight)\n\n            # unlike the original implementation, we combine the hypotheses before pruning\n            # this allow the hypothesis different and possibly make the 
rescore more effective\n            B = recombine_hyps(A, self.mmi_weight)\n            B = sorted(B, key=lambda x: x.score, reverse=True)[:beam]\n \n        if self.tlg_scorer and self.tlg_weight > 0.0 and final:\n            tlg_final_states = [h.tlg_state for h in final]\n            tlg_final_scores = self.tlg_scorer.final_score(tlg_final_states)\n            for i, h in enumerate(final):\n                h.score += self.tlg_weight * tlg_final_scores[i]\n\n        if self.mmi_scorer is not None and self.mmi_weight == 0.0:\n            final = self.mmi_scorer.batch_rescore(final, hidden)\n\n        if final:\n            return self.sort_nbest(final)\n        else:\n            print(\"Warning: No finished hypothesis found. return the partial hypothesis\", flush=True)\n            return B\n\n    def nsc_beam_search(self, h: torch.Tensor) -> List[NSCHypothesis]:\n        \"\"\"N-step constrained beam search implementation.\n\n        Based and modified from https://arxiv.org/pdf/2002.03577.pdf.\n        Please reference ESPnet (b-flo, PR #2444) for any usage outside ESPnet\n        until further modifications.\n\n        Note: the algorithm is not in its \"complete\" form but works almost as\n        intended.\n\n        Args:\n            h: Encoded speech features (T_max, D_enc)\n\n        Returns:\n            nbest_hyps: N-best decoding results\n\n        \"\"\"\n        beam = min(self.beam_size, self.vocab_size)\n        beam_k = min(beam, (self.vocab_size - 1))\n\n        beam_state = self.decoder.init_state(beam)\n\n        init_tokens = [\n            NSCHypothesis(\n                yseq=[self.blank],\n                score=0.0,\n                dec_state=self.decoder.select_state(beam_state, 0),\n            )\n        ]\n\n        cache = {}\n\n        beam_y, beam_state, beam_lm_tokens = self.decoder.batch_score(\n            init_tokens,\n            beam_state,\n            cache,\n            self.use_lm,\n        )\n\n        state = 
self.decoder.select_state(beam_state, 0)\n\n        if self.use_lm:\n            beam_lm_states, beam_lm_scores = self.lm.buff_predict(\n                None, beam_lm_tokens, 1\n            )\n            lm_state = select_lm_state(\n                beam_lm_states, 0, self.lm_layers, self.is_wordlm\n            )\n            lm_scores = beam_lm_scores[0]\n        else:\n            lm_state = None\n            lm_scores = None\n\n        kept_hyps = [\n            NSCHypothesis(\n                yseq=[self.blank],\n                score=0.0,\n                dec_state=state,\n                y=[beam_y[0]],\n                lm_state=lm_state,\n                lm_scores=lm_scores,\n            )\n        ]\n\n        for hi in h:\n            hyps = sorted(kept_hyps, key=lambda x: len(x.yseq), reverse=True)\n            kept_hyps = []\n\n            h_enc = hi.unsqueeze(0)\n\n            for j, hyp_j in enumerate(hyps[:-1]):\n                for hyp_i in hyps[(j + 1) :]:\n                    curr_id = len(hyp_j.yseq)\n                    next_id = len(hyp_i.yseq)\n\n                    if (\n                        is_prefix(hyp_j.yseq, hyp_i.yseq)\n                        and (curr_id - next_id) <= self.prefix_alpha\n                    ):\n                        ytu = torch.log_softmax(\n                            self.joint_network(hi, hyp_i.y[-1]), dim=-1\n                        )\n\n                        curr_score = hyp_i.score + float(ytu[hyp_j.yseq[next_id]])\n\n                        for k in range(next_id, (curr_id - 1)):\n                            ytu = torch.log_softmax(\n                                self.joint_network(hi, hyp_j.y[k]), dim=-1\n                            )\n\n                            curr_score += float(ytu[hyp_j.yseq[k + 1]])\n\n                        hyp_j.score = np.logaddexp(hyp_j.score, curr_score)\n\n            S = []\n            V = []\n            for n in range(self.nstep):\n                beam_y = 
torch.stack([hyp.y[-1] for hyp in hyps])\n\n                beam_logp = torch.log_softmax(self.joint_network(h_enc, beam_y), dim=-1)\n                beam_topk = beam_logp[:, 1:].topk(beam_k, dim=-1)\n\n                for i, hyp in enumerate(hyps):\n                    S.append(\n                        NSCHypothesis(\n                            yseq=hyp.yseq[:],\n                            score=hyp.score + float(beam_logp[i, 0:1]),\n                            y=hyp.y[:],\n                            dec_state=hyp.dec_state,\n                            lm_state=hyp.lm_state,\n                            lm_scores=hyp.lm_scores,\n                        )\n                    )\n\n                    for logp, k in zip(beam_topk[0][i], beam_topk[1][i] + 1):\n                        score = hyp.score + float(logp)\n\n                        if self.use_lm:\n                            score += self.lm_weight * float(hyp.lm_scores[k])\n\n                        V.append(\n                            NSCHypothesis(\n                                yseq=hyp.yseq[:] + [int(k)],\n                                score=score,\n                                y=hyp.y[:],\n                                dec_state=hyp.dec_state,\n                                lm_state=hyp.lm_state,\n                                lm_scores=hyp.lm_scores,\n                            )\n                        )\n\n                V.sort(key=lambda x: x.score, reverse=True)\n                V = substract(V, hyps)[:beam]\n\n                beam_state = self.decoder.create_batch_states(\n                    beam_state,\n                    [v.dec_state for v in V],\n                    [v.yseq for v in V],\n                )\n                beam_y, beam_state, beam_lm_tokens = self.decoder.batch_score(\n                    V,\n                    beam_state,\n                    cache,\n                    self.use_lm,\n                )\n\n                if self.use_lm:\n             
       beam_lm_states = create_lm_batch_state(\n                        [v.lm_state for v in V], self.lm_layers, self.is_wordlm\n                    )\n                    beam_lm_states, beam_lm_scores = self.lm.buff_predict(\n                        beam_lm_states, beam_lm_tokens, len(V)\n                    )\n\n                if n < (self.nstep - 1):\n                    for i, v in enumerate(V):\n                        v.y.append(beam_y[i])\n\n                        v.dec_state = self.decoder.select_state(beam_state, i)\n\n                        if self.use_lm:\n                            v.lm_state = select_lm_state(\n                                beam_lm_states, i, self.lm_layers, self.is_wordlm\n                            )\n                            v.lm_scores = beam_lm_scores[i]\n\n                    hyps = V[:]\n                else:\n                    beam_logp = torch.log_softmax(\n                        self.joint_network(h_enc, beam_y), dim=-1\n                    )\n\n                    for i, v in enumerate(V):\n                        if self.nstep != 1:\n                            v.score += float(beam_logp[i, 0])\n\n                        v.y.append(beam_y[i])\n\n                        v.dec_state = self.decoder.select_state(beam_state, i)\n\n                        if self.use_lm:\n                            v.lm_state = select_lm_state(\n                                beam_lm_states, i, self.lm_layers, self.is_wordlm\n                            )\n                            v.lm_scores = beam_lm_scores[i]\n\n            kept_hyps = sorted((S + V), key=lambda x: x.score, reverse=True)[:beam]\n\n        return self.sort_nbest(kept_hyps)\n\n# wenet log_add implementation used in beam search\ndef log_add(args: List[float]) -> float:\n    \"\"\"\n    Stable log add\n    \"\"\"\n    if all(a == -float('inf') for a in args):\n        return -float('inf')\n    a_max = max(args)\n    lsp = math.log(sum(math.exp(a - a_max) for a in 
args))\n    return a_max + lsp\n\ndef is_all_chinese(strs):\n    for _char in strs:\n        if not '\\u4e00' <= _char <= '\\u9fa5':\n            return False\n    return True\n"
  },
  {
    "path": "nets/chainer_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/chainer_backend/asr_interface.py",
    "content": "\"\"\"ASR Interface module.\"\"\"\nimport chainer\n\nfrom espnet.nets.asr_interface import ASRInterface\n\n\nclass ChainerASRInterface(ASRInterface, chainer.Chain):\n    \"\"\"ASR Interface for ESPnet model implementation.\"\"\"\n\n    @staticmethod\n    def custom_converter(*args, **kw):\n        \"\"\"Get custom_converter of the model (Chainer only).\"\"\"\n        raise NotImplementedError(\"custom converter method is not implemented\")\n\n    @staticmethod\n    def custom_updater(*args, **kw):\n        \"\"\"Get custom_updater of the model (Chainer only).\"\"\"\n        raise NotImplementedError(\"custom updater method is not implemented\")\n\n    @staticmethod\n    def custom_parallel_updater(*args, **kw):\n        \"\"\"Get custom_parallel_updater of the model (Chainer only).\"\"\"\n        raise NotImplementedError(\"custom parallel updater method is not implemented\")\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        raise NotImplementedError(\n            \"get_total_subsampling_factor method is not implemented\"\n        )\n"
  },
  {
    "path": "nets/chainer_backend/ctc.py",
    "content": "import logging\n\nimport chainer\nfrom chainer import cuda\nimport chainer.functions as F\nimport chainer.links as L\nimport numpy as np\n\n\nclass CTC(chainer.Chain):\n    \"\"\"Chainer implementation of ctc layer.\n\n    Args:\n        odim (int): The output dimension.\n        eprojs (int | None): Dimension of input vectors from encoder.\n        dropout_rate (float): Dropout rate.\n\n    \"\"\"\n\n    def __init__(self, odim, eprojs, dropout_rate):\n        super(CTC, self).__init__()\n        self.dropout_rate = dropout_rate\n        self.loss = None\n\n        with self.init_scope():\n            self.ctc_lo = L.Linear(eprojs, odim)\n\n    def __call__(self, hs, ys):\n        \"\"\"CTC forward.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n            ys (list of chainer.Variable | N-dimension array):\n                Input variable of decoder.\n\n        Returns:\n            chainer.Variable: A variable holding a scalar value of the CTC loss.\n\n        \"\"\"\n        self.loss = None\n        ilens = [x.shape[0] for x in hs]\n        olens = [x.shape[0] for x in ys]\n\n        # zero padding for hs\n        y_hat = self.ctc_lo(\n            F.dropout(F.pad_sequence(hs), ratio=self.dropout_rate), n_batch_axes=2\n        )\n        y_hat = F.separate(y_hat, axis=1)  # ilen list of batch x hdim\n\n        # zero padding for ys\n        y_true = F.pad_sequence(ys, padding=-1)  # batch x olen\n\n        # get length info\n        input_length = chainer.Variable(self.xp.array(ilens, dtype=np.int32))\n        label_length = chainer.Variable(self.xp.array(olens, dtype=np.int32))\n        logging.info(\n            self.__class__.__name__ + \" input lengths:  \" + str(input_length.data)\n        )\n        logging.info(\n            self.__class__.__name__ + \" output lengths: \" + str(label_length.data)\n        )\n\n        # get ctc loss\n        self.loss = 
F.connectionist_temporal_classification(\n            y_hat, y_true, 0, input_length, label_length\n        )\n        logging.info(\"ctc loss:\" + str(self.loss.data))\n\n        return self.loss\n\n    def log_softmax(self, hs):\n        \"\"\"Log_softmax of frame activations.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n\n        Returns:\n            chainer.Variable: A n-dimension float array.\n\n        \"\"\"\n        y_hat = self.ctc_lo(F.pad_sequence(hs), n_batch_axes=2)\n        return F.log_softmax(y_hat.reshape(-1, y_hat.shape[-1])).reshape(y_hat.shape)\n\n\nclass WarpCTC(chainer.Chain):\n    \"\"\"Chainer implementation of warp-ctc layer.\n\n    Args:\n        odim (int): The output dimension.\n        eproj (int | None): Dimension of input vector from encoder.\n        dropout_rate (float): Dropout rate.\n\n    \"\"\"\n\n    def __init__(self, odim, eprojs, dropout_rate):\n        super(WarpCTC, self).__init__()\n        self.dropout_rate = dropout_rate\n        self.loss = None\n\n        with self.init_scope():\n            self.ctc_lo = L.Linear(eprojs, odim)\n\n    def __call__(self, hs, ys):\n        \"\"\"Core function of the Warp-CTC layer.\n\n        Args:\n            hs (iterable of chainer.Variable | N-dimention array):\n                Input variable from encoder.\n            ys (iterable of chainer.Variable | N-dimension array):\n                Input variable of decoder.\n\n        Returns:\n           chainer.Variable: A variable holding a scalar value of the CTC loss.\n\n        \"\"\"\n        self.loss = None\n        ilens = [x.shape[0] for x in hs]\n        olens = [x.shape[0] for x in ys]\n\n        # zero padding for hs\n        y_hat = self.ctc_lo(\n            F.dropout(F.pad_sequence(hs), ratio=self.dropout_rate), n_batch_axes=2\n        )\n        y_hat = y_hat.transpose(1, 0, 2)  # batch x frames x hdim\n\n        # get length info\n        
logging.info(self.__class__.__name__ + \" input lengths:  \" + str(ilens))\n        logging.info(self.__class__.__name__ + \" output lengths: \" + str(olens))\n\n        # get ctc loss\n        from chainer_ctc.warpctc import ctc as warp_ctc\n\n        self.loss = warp_ctc(y_hat, ilens, [cuda.to_cpu(y.data) for y in ys])[0]\n        logging.info(\"ctc loss:\" + str(self.loss.data))\n\n        return self.loss\n\n    def log_softmax(self, hs):\n        \"\"\"Log_softmax of frame activations.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n\n        Returns:\n            chainer.Variable: A n-dimension float array.\n\n        \"\"\"\n        y_hat = self.ctc_lo(F.pad_sequence(hs), n_batch_axes=2)\n        return F.log_softmax(y_hat.reshape(-1, y_hat.shape[-1])).reshape(y_hat.shape)\n\n    def argmax(self, hs_pad):\n        \"\"\"argmax of frame activations\n\n        :param chainer variable hs_pad: 3d tensor (B, Tmax, eprojs)\n        :return: argmax applied 2d tensor (B, Tmax)\n        :rtype: chainer.Variable\n        \"\"\"\n        return F.argmax(self.ctc_lo(F.pad_sequence(hs_pad), n_batch_axes=2), axis=-1)\n\n\ndef ctc_for(args, odim):\n    \"\"\"Return the CTC layer corresponding to the args.\n\n    Args:\n        args (Namespace): The program arguments.\n        odim (int): The output dimension.\n\n    Returns:\n        The CTC module.\n\n    \"\"\"\n    ctc_type = args.ctc_type\n    if ctc_type == \"builtin\":\n        logging.info(\"Using chainer CTC implementation\")\n        ctc = CTC(odim, args.eprojs, args.dropout_rate)\n    elif ctc_type == \"warpctc\":\n        logging.info(\"Using warpctc CTC implementation\")\n        ctc = WarpCTC(odim, args.eprojs, args.dropout_rate)\n    else:\n        raise ValueError('ctc_type must be \"builtin\" or \"warpctc\": {}'.format(ctc_type))\n    return ctc\n"
  },
  {
    "path": "nets/chainer_backend/deterministic_embed_id.py",
    "content": "import numpy\nimport six\n\nimport chainer\nfrom chainer import cuda\nfrom chainer import function_node\nfrom chainer.initializers import normal\n\n# from chainer.functions.connection import embed_id\nfrom chainer import link\nfrom chainer.utils import type_check\nfrom chainer import variable\n\n\"\"\"Deterministic EmbedID link and function\n\n   copied from chainer/links/connection/embed_id.py\n   and chainer/functions/connection/embed_id.py,\n   and modified not to use atomicAdd operation\n\"\"\"\n\n\nclass EmbedIDFunction(function_node.FunctionNode):\n    def __init__(self, ignore_label=None):\n        self.ignore_label = ignore_label\n        self._w_shape = None\n\n    def check_type_forward(self, in_types):\n        type_check.expect(in_types.size() == 2)\n        x_type, w_type = in_types\n        type_check.expect(\n            x_type.dtype.kind == \"i\",\n            x_type.ndim >= 1,\n        )\n        type_check.expect(w_type.dtype == numpy.float32, w_type.ndim == 2)\n\n    def forward(self, inputs):\n        self.retain_inputs((0,))\n        x, W = inputs\n        self._w_shape = W.shape\n\n        if not type_check.same_types(*inputs):\n            raise ValueError(\n                \"numpy and cupy must not be used together\\n\"\n                \"type(W): {0}, type(x): {1}\".format(type(W), type(x))\n            )\n\n        xp = cuda.get_array_module(*inputs)\n        if chainer.is_debug():\n            valid_x = xp.logical_and(0 <= x, x < len(W))\n            if self.ignore_label is not None:\n                valid_x = xp.logical_or(valid_x, x == self.ignore_label)\n            if not valid_x.all():\n                raise ValueError(\n                    \"Each not ignored `x` value need to satisfy\" \"`0 <= x < len(W)`\"\n                )\n\n        if self.ignore_label is not None:\n            mask = x == self.ignore_label\n            return (xp.where(mask[..., None], 0, W[xp.where(mask, 0, x)]),)\n\n        return (W[x],)\n\n 
   def backward(self, indexes, grad_outputs):\n        inputs = self.get_retained_inputs()\n        gW = EmbedIDGrad(self._w_shape, self.ignore_label).apply(inputs + grad_outputs)[\n            0\n        ]\n        return None, gW\n\n\nclass EmbedIDGrad(function_node.FunctionNode):\n    def __init__(self, w_shape, ignore_label=None):\n        self.w_shape = w_shape\n        self.ignore_label = ignore_label\n        self._gy_shape = None\n\n    def forward(self, inputs):\n        self.retain_inputs((0,))\n        xp = cuda.get_array_module(*inputs)\n        x, gy = inputs\n        self._gy_shape = gy.shape\n        gW = xp.zeros(self.w_shape, dtype=gy.dtype)\n\n        if xp is numpy:\n            # It is equivalent to `numpy.add.at(gW, x, gy)` but ufunc.at is\n            # too slow.\n            for ix, igy in six.moves.zip(x.ravel(), gy.reshape(x.size, -1)):\n                if ix == self.ignore_label:\n                    continue\n                gW[ix] += igy\n        else:\n            \"\"\"\n            # original code based on cuda elementwise method\n            if self.ignore_label is None:\n                cuda.elementwise(\n                    'T gy, S x, S n_out', 'raw T gW',\n                    'ptrdiff_t w_ind[] = {x, i % n_out};'\n                    'atomicAdd(&gW[w_ind], gy)',\n                    'embed_id_bwd')(\n                        gy, xp.expand_dims(x, -1), gW.shape[1], gW)\n            else:\n                cuda.elementwise(\n                    'T gy, S x, S n_out, S ignore', 'raw T gW',\n                    '''\n                    if (x != ignore) {\n                      ptrdiff_t w_ind[] = {x, i % n_out};\n                      atomicAdd(&gW[w_ind], gy);\n                    }\n                    ''',\n                    'embed_id_bwd_ignore_label')(\n                        gy, xp.expand_dims(x, -1), gW.shape[1],\n                        self.ignore_label, gW)\n            \"\"\"\n            # EmbedID gradient alternative 
without atomicAdd, which simply\n            # creates a one-hot vector and applies dot product\n            xi = xp.zeros((x.size, len(gW)), dtype=numpy.float32)\n            idx = xp.arange(x.size, dtype=numpy.int32) * len(gW) + x.ravel()\n            xi.ravel()[idx] = 1.0\n            if self.ignore_label is not None:\n                xi[:, self.ignore_label] = 0.0\n            gW = xi.T.dot(gy.reshape(x.size, -1)).astype(gW.dtype, copy=False)\n\n        return (gW,)\n\n    def backward(self, indexes, grads):\n        xp = cuda.get_array_module(*grads)\n        x = self.get_retained_inputs()[0].data\n        ggW = grads[0]\n\n        if self.ignore_label is not None:\n            mask = x == self.ignore_label\n            # To prevent index out of bounds, we need to check if ignore_label\n            # is inside of W.\n            if not (0 <= self.ignore_label < self.w_shape[1]):\n                x = xp.where(mask, 0, x)\n\n        ggy = ggW[x]\n\n        if self.ignore_label is not None:\n            mask, zero, _ = xp.broadcast_arrays(\n                mask[..., None], xp.zeros((), \"f\"), ggy.data\n            )\n            ggy = chainer.functions.where(mask, zero, ggy)\n        return None, ggy\n\n\ndef embed_id(x, W, ignore_label=None):\n    r\"\"\"Efficient linear function for one-hot input.\n\n    This function implements so called *word embeddings*. It takes two\n    arguments: a set of IDs (words) ``x`` in :math:`B` dimensional integer\n    vector, and a set of all ID (word) embeddings ``W`` in :math:`V \\\\times d`\n    float32 matrix. It outputs :math:`B \\\\times d` matrix whose ``i``-th\n    column is the ``x[i]``-th column of ``W``.\n    This function is only differentiable on the input ``W``.\n\n    Args:\n        x (chainer.Variable | np.ndarray): Batch vectors of IDs. Each\n            element must be signed integer.\n        W (chainer.Variable | np.ndarray): Distributed representation\n            of each ID (a.k.a. 
word embeddings).\n        ignore_label (int): If ignore_label is an int value, i-th column\n            of return value is filled with 0.\n\n    Returns:\n        chainer.Variable: Embedded variable.\n\n\n    .. rubric:: :class:`~chainer.links.EmbedID`\n\n    Examples:\n\n        >>> x = np.array([2, 1]).astype('i')\n        >>> x\n        array([2, 1], dtype=int32)\n        >>> W = np.array([[0, 0, 0],\n        ...               [1, 1, 1],\n        ...               [2, 2, 2]]).astype('f')\n        >>> W\n        array([[ 0.,  0.,  0.],\n               [ 1.,  1.,  1.],\n               [ 2.,  2.,  2.]], dtype=float32)\n        >>> F.embed_id(x, W).data\n        array([[ 2.,  2.,  2.],\n               [ 1.,  1.,  1.]], dtype=float32)\n        >>> F.embed_id(x, W, ignore_label=1).data\n        array([[ 2.,  2.,  2.],\n               [ 0.,  0.,  0.]], dtype=float32)\n\n    \"\"\"\n    return EmbedIDFunction(ignore_label=ignore_label).apply((x, W))[0]\n\n\nclass EmbedID(link.Link):\n    \"\"\"Efficient linear layer for one-hot input.\n\n    This is a link that wraps the :func:`~chainer.functions.embed_id` function.\n    This link holds the ID (word) embedding matrix ``W`` as a parameter.\n\n    Args:\n        in_size (int): Number of different identifiers (a.k.a. vocabulary size).\n        out_size (int): Output dimension.\n        initialW (Initializer): Initializer to initialize the weight.\n        ignore_label (int): If `ignore_label` is an int value, i-th column of\n            return value is filled with 0.\n\n    .. rubric:: :func:`~chainer.functions.embed_id`\n\n    Attributes:\n        W (~chainer.Variable): Embedding parameter matrix.\n\n    Examples:\n\n        >>> W = np.array([[0, 0, 0],\n        ...               [1, 1, 1],\n        ...               
[2, 2, 2]]).astype('f')\n        >>> W\n        array([[ 0.,  0.,  0.],\n               [ 1.,  1.,  1.],\n               [ 2.,  2.,  2.]], dtype=float32)\n        >>> l = L.EmbedID(W.shape[0], W.shape[1], initialW=W)\n        >>> x = np.array([2, 1]).astype('i')\n        >>> x\n        array([2, 1], dtype=int32)\n        >>> y = l(x)\n        >>> y.data\n        array([[ 2.,  2.,  2.],\n               [ 1.,  1.,  1.]], dtype=float32)\n\n    \"\"\"\n\n    ignore_label = None\n\n    def __init__(self, in_size, out_size, initialW=None, ignore_label=None):\n        super(EmbedID, self).__init__()\n        self.ignore_label = ignore_label\n\n        with self.init_scope():\n            if initialW is None:\n                initialW = normal.Normal(1.0)\n            self.W = variable.Parameter(initialW, (in_size, out_size))\n\n    def __call__(self, x):\n        \"\"\"Extracts the word embedding of given IDs.\n\n        Args:\n            x (chainer.Variable): Batch vectors of IDs.\n\n        Returns:\n            chainer.Variable: Batch of corresponding embeddings.\n\n        \"\"\"\n        return embed_id(x, self.W, ignore_label=self.ignore_label)\n"
  },
  {
    "path": "nets/chainer_backend/e2e_asr.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"RNN sequence-to-sequence speech recognition model (chainer).\"\"\"\n\nimport logging\nimport math\n\nimport chainer\nfrom chainer import reporter\nimport numpy as np\n\nfrom espnet.nets.chainer_backend.asr_interface import ChainerASRInterface\nfrom espnet.nets.chainer_backend.ctc import ctc_for\nfrom espnet.nets.chainer_backend.rnn.attentions import att_for\nfrom espnet.nets.chainer_backend.rnn.decoders import decoder_for\nfrom espnet.nets.chainer_backend.rnn.encoders import encoder_for\nfrom espnet.nets.e2e_asr_common import label_smoothing_dist\nfrom espnet.nets.pytorch_backend.e2e_asr import E2E as E2E_pytorch\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\n\nCTC_LOSS_THRESHOLD = 10000\n\n\nclass E2E(ChainerASRInterface):\n    \"\"\"E2E module for chainer backend.\n\n    Args:\n        idim (int): Dimension of the inputs.\n        odim (int): Dimension of the outputs.\n        args (parser.args): Training config.\n        flag_return (bool): If True, train() would return\n            additional metrics in addition to the training\n            loss.\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        return E2E_pytorch.add_arguments(parser)\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        return self.enc.conv_subsampling_factor * int(np.prod(self.subsample))\n\n    def __init__(self, idim, odim, args, flag_return=True):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        chainer.Chain.__init__(self)\n        self.mtlalpha = args.mtlalpha\n        assert 0 <= self.mtlalpha <= 1, \"mtlalpha must be [0,1]\"\n      
  self.etype = args.etype\n        self.verbose = args.verbose\n        self.char_list = args.char_list\n        self.outdir = args.outdir\n\n        # below means the last number becomes eos/sos ID\n        # note that sos/eos IDs are identical\n        self.sos = odim - 1\n        self.eos = odim - 1\n\n        # subsample info\n        self.subsample = get_subsample(args, mode=\"asr\", arch=\"rnn\")\n\n        # label smoothing info\n        if args.lsm_type:\n            logging.info(\"Use label smoothing with \" + args.lsm_type)\n            labeldist = label_smoothing_dist(\n                odim, args.lsm_type, transcript=args.train_json\n            )\n        else:\n            labeldist = None\n\n        with self.init_scope():\n            # encoder\n            self.enc = encoder_for(args, idim, self.subsample)\n            # ctc\n            self.ctc = ctc_for(args, odim)\n            # attention\n            self.att = att_for(args)\n            # decoder\n            self.dec = decoder_for(args, odim, self.sos, self.eos, self.att, labeldist)\n\n        self.acc = None\n        self.loss = None\n        self.flag_return = flag_return\n\n    def forward(self, xs, ilens, ys):\n        \"\"\"E2E forward propagation.\n\n        Args:\n            xs (chainer.Variable): Batch of padded charactor ids. (B, Tmax)\n            ilens (chainer.Variable): Batch of length of each input batch. (B,)\n            ys (chainer.Variable): Batch of padded target features. (B, Lmax, odim)\n\n        Returns:\n            float: Loss that calculated by attention and ctc loss.\n            float (optional): Ctc loss.\n            float (optional): Attention loss.\n            float (optional): Accuracy.\n\n        \"\"\"\n        # 1. encoder\n        hs, ilens = self.enc(xs, ilens)\n\n        # 3. CTC loss\n        if self.mtlalpha == 0:\n            loss_ctc = None\n        else:\n            loss_ctc = self.ctc(hs, ys)\n\n        # 4. 
attention loss\n        if self.mtlalpha == 1:\n            loss_att = None\n            acc = None\n        else:\n            loss_att, acc = self.dec(hs, ys)\n\n        self.acc = acc\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = loss_att\n        elif alpha == 1:\n            self.loss = loss_ctc\n        else:\n            self.loss = alpha * loss_ctc + (1 - alpha) * loss_att\n\n        if self.loss.data < CTC_LOSS_THRESHOLD and not math.isnan(self.loss.data):\n            reporter.report({\"loss_ctc\": loss_ctc}, self)\n            reporter.report({\"loss_att\": loss_att}, self)\n            reporter.report({\"acc\": acc}, self)\n\n            logging.info(\"mtl loss:\" + str(self.loss.data))\n            reporter.report({\"loss\": self.loss}, self)\n        else:\n            logging.warning(\"loss (=%f) is not correct\", self.loss.data)\n        if self.flag_return:\n            return self.loss, loss_ctc, loss_att, acc\n        else:\n            return self.loss\n\n    def recognize(self, x, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E greedy/beam search.\n\n        Args:\n            x (chainer.Variable): Input tensor for recognition.\n            recog_args (parser.args): Arguments of config file.\n            char_list (List[str]): List of characters.\n            rnnlm (Module): RNNLM module defined at `espnet.lm.chainer_backend.lm`.\n\n        Returns:\n            List[Dict[str, Any]]: Result of recognition.\n\n        \"\"\"\n        # subsample frame\n        x = x[:: self.subsample[0], :]\n        ilen = self.xp.array(x.shape[0], dtype=np.int32)\n        h = chainer.Variable(self.xp.array(x, dtype=np.float32))\n\n        with chainer.no_backprop_mode(), chainer.using_config(\"train\", False):\n            # 1. 
encoder\n            # make a utt list (1) to use the same interface for encoder\n            h, _ = self.enc([h], [ilen])\n\n            # calculate log P(z_t|X) for CTC scores\n            if recog_args.ctc_weight > 0.0:\n                lpz = self.ctc.log_softmax(h).data[0]\n            else:\n                lpz = None\n\n            # 2. decoder\n            # decode the first utterance\n            y = self.dec.recognize_beam(h[0], lpz, recog_args, char_list, rnnlm)\n\n            return y\n\n    def calculate_all_attentions(self, xs, ilens, ys):\n        \"\"\"E2E attention calculation.\n\n        Args:\n            xs (List): List of padded input sequences. [(T1, idim), (T2, idim), ...]\n            ilens (np.ndarray): Batch of lengths of input sequences. (B)\n            ys (List): List of character id sequence tensor. [(L1), (L2), (L3), ...]\n\n        Returns:\n            float np.ndarray: Attention weights. (B, Lmax, Tmax)\n\n        \"\"\"\n        hs, ilens = self.enc(xs, ilens)\n        att_ws = self.dec.calculate_all_attentions(hs, ys)\n\n        return att_ws\n\n    @staticmethod\n    def custom_converter(subsampling_factor=0):\n        \"\"\"Get customconverter of the model.\"\"\"\n        from espnet.nets.chainer_backend.rnn.training import CustomConverter\n\n        return CustomConverter(subsampling_factor=subsampling_factor)\n\n    @staticmethod\n    def custom_updater(iters, optimizer, converter, device=-1, accum_grad=1):\n        \"\"\"Get custom_updater of the model.\"\"\"\n        from espnet.nets.chainer_backend.rnn.training import CustomUpdater\n\n        return CustomUpdater(\n            iters, optimizer, converter=converter, device=device, accum_grad=accum_grad\n        )\n\n    @staticmethod\n    def custom_parallel_updater(iters, optimizer, converter, devices, accum_grad=1):\n        \"\"\"Get custom_parallel_updater of the model.\"\"\"\n        from espnet.nets.chainer_backend.rnn.training import CustomParallelUpdater\n\n        
return CustomParallelUpdater(\n            iters,\n            optimizer,\n            converter=converter,\n            devices=devices,\n            accum_grad=accum_grad,\n        )\n"
  },
  {
    "path": "nets/chainer_backend/e2e_asr_transformer.py",
    "content": "# encoding: utf-8\n\"\"\"Transformer-based model for End-to-end ASR.\"\"\"\n\nfrom argparse import Namespace\nfrom distutils.util import strtobool\nimport logging\nimport math\n\nimport chainer\nimport chainer.functions as F\nfrom chainer import reporter\nimport numpy as np\nimport six\n\nfrom espnet.nets.chainer_backend.asr_interface import ChainerASRInterface\nfrom espnet.nets.chainer_backend.transformer.attention import MultiHeadAttention\nfrom espnet.nets.chainer_backend.transformer import ctc\nfrom espnet.nets.chainer_backend.transformer.decoder import Decoder\nfrom espnet.nets.chainer_backend.transformer.encoder import Encoder\nfrom espnet.nets.chainer_backend.transformer.label_smoothing_loss import (\n    LabelSmoothingLoss,  # noqa: H301\n)\nfrom espnet.nets.chainer_backend.transformer.training import CustomConverter\nfrom espnet.nets.chainer_backend.transformer.training import CustomUpdater\nfrom espnet.nets.chainer_backend.transformer.training import (\n    CustomParallelUpdater,  # noqa: H301\n)\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScore\nfrom espnet.nets.e2e_asr_common import end_detect\nfrom espnet.nets.e2e_asr_common import ErrorCalculator\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\n\n\nCTC_SCORING_RATIO = 1.5\nMAX_DECODER_OUTPUT = 5\n\n\nclass E2E(ChainerASRInterface):\n    \"\"\"E2E module.\n\n    Args:\n        idim (int): Input dimmensions.\n        odim (int): Output dimmensions.\n        args (Namespace): Training config.\n        ignore_id (int, optional): Id for ignoring a character.\n        flag_return (bool, optional): If true, return a list with (loss,\n        loss_ctc, loss_att, acc) in forward. 
Otherwise, return loss.\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Customize flags for transformer setup.\n\n        Args:\n            parser (Namespace): Training config.\n\n        \"\"\"\n        group = parser.add_argument_group(\"transformer model setting\")\n        group.add_argument(\n            \"--transformer-init\",\n            type=str,\n            default=\"pytorch\",\n            help=\"how to initialize transformer parameters\",\n        )\n        group.add_argument(\n            \"--transformer-input-layer\",\n            type=str,\n            default=\"conv2d\",\n            choices=[\"conv2d\", \"linear\", \"embed\"],\n            help=\"transformer input layer type\",\n        )\n        group.add_argument(\n            \"--transformer-attn-dropout-rate\",\n            default=None,\n            type=float,\n            help=\"dropout in transformer attention. use --dropout-rate if None is set\",\n        )\n        group.add_argument(\n            \"--transformer-lr\",\n            default=10.0,\n            type=float,\n            help=\"Initial value of learning rate\",\n        )\n        group.add_argument(\n            \"--transformer-warmup-steps\",\n            default=25000,\n            type=int,\n            help=\"optimizer warmup steps\",\n        )\n        group.add_argument(\n            \"--transformer-length-normalized-loss\",\n            default=True,\n            type=strtobool,\n            help=\"normalize loss by length\",\n        )\n\n        group.add_argument(\n            \"--dropout-rate\",\n            default=0.0,\n            type=float,\n            help=\"Dropout rate for the encoder\",\n        )\n        # Encoder\n        group.add_argument(\n            \"--elayers\",\n            default=4,\n            type=int,\n            help=\"Number of encoder layers (for shared recognition part \"\n            \"in multi-speaker asr mode)\",\n        )\n        
group.add_argument(\n            \"--eunits\",\n            \"-u\",\n            default=300,\n            type=int,\n            help=\"Number of encoder hidden units\",\n        )\n        # Attention\n        group.add_argument(\n            \"--adim\",\n            default=320,\n            type=int,\n            help=\"Number of attention transformation dimensions\",\n        )\n        group.add_argument(\n            \"--aheads\",\n            default=4,\n            type=int,\n            help=\"Number of heads for multi head attention\",\n        )\n        # Decoder\n        group.add_argument(\n            \"--dlayers\", default=1, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=320, type=int, help=\"Number of decoder hidden units\"\n        )\n        return parser\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        return self.encoder.conv_subsampling_factor * int(np.prod(self.subsample))\n\n    def __init__(self, idim, odim, args, ignore_id=-1, flag_return=True):\n        \"\"\"Initialize the transformer.\"\"\"\n        chainer.Chain.__init__(self)\n        self.mtlalpha = args.mtlalpha\n        assert 0 <= self.mtlalpha <= 1, \"mtlalpha must be [0,1]\"\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n        self.use_label_smoothing = False\n        self.char_list = args.char_list\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.scale_emb = args.adim ** 0.5\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.subsample = get_subsample(args, mode=\"asr\", arch=\"transformer\")\n        self.ignore_id = ignore_id\n        self.reset_parameters(args)\n        with self.init_scope():\n            self.encoder = Encoder(\n                idim=idim,\n                attention_dim=args.adim,\n          
      attention_heads=args.aheads,\n                linear_units=args.eunits,\n                input_layer=args.transformer_input_layer,\n                dropout_rate=args.dropout_rate,\n                positional_dropout_rate=args.dropout_rate,\n                attention_dropout_rate=args.transformer_attn_dropout_rate,\n                initialW=self.initialW,\n                initial_bias=self.initialB,\n            )\n            self.decoder = Decoder(\n                odim, args, initialW=self.initialW, initial_bias=self.initialB\n            )\n            self.criterion = LabelSmoothingLoss(\n                args.lsm_weight,\n                len(args.char_list),\n                args.transformer_length_normalized_loss,\n            )\n            if args.mtlalpha > 0.0:\n                if args.ctc_type == \"builtin\":\n                    logging.info(\"Using chainer CTC implementation\")\n                    self.ctc = ctc.CTC(odim, args.adim, args.dropout_rate)\n                elif args.ctc_type == \"warpctc\":\n                    logging.info(\"Using warpctc CTC implementation\")\n                    self.ctc = ctc.WarpCTC(odim, args.adim, args.dropout_rate)\n                else:\n                    raise ValueError(\n                        'ctc_type must be \"builtin\" or \"warpctc\": {}'.format(\n                            args.ctc_type\n                        )\n                    )\n            else:\n                self.ctc = None\n        self.dims = args.adim\n        self.odim = odim\n        self.flag_return = flag_return\n        if args.report_cer or args.report_wer:\n            self.error_calculator = ErrorCalculator(\n                args.char_list,\n                args.sym_space,\n                args.sym_blank,\n                args.report_cer,\n                args.report_wer,\n            )\n        else:\n            self.error_calculator = None\n        if \"Namespace\" in str(type(args)):\n            self.verbose = 0 if 
\"verbose\" not in args else args.verbose\n        else:\n            self.verbose = 0 if args.verbose is None else args.verbose\n\n    def reset_parameters(self, args):\n        \"\"\"Initialize the Weight according to the give initialize-type.\n\n        Args:\n            args (Namespace): Transformer config.\n\n        \"\"\"\n        type_init = args.transformer_init\n        if type_init == \"lecun_uniform\":\n            logging.info(\"Using LeCunUniform as Parameter initializer\")\n            self.initialW = chainer.initializers.LeCunUniform\n        elif type_init == \"lecun_normal\":\n            logging.info(\"Using LeCunNormal as Parameter initializer\")\n            self.initialW = chainer.initializers.LeCunNormal\n        elif type_init == \"gorot_uniform\":\n            logging.info(\"Using GlorotUniform as Parameter initializer\")\n            self.initialW = chainer.initializers.GlorotUniform\n        elif type_init == \"gorot_normal\":\n            logging.info(\"Using GlorotNormal as Parameter initializer\")\n            self.initialW = chainer.initializers.GlorotNormal\n        elif type_init == \"he_uniform\":\n            logging.info(\"Using HeUniform as Parameter initializer\")\n            self.initialW = chainer.initializers.HeUniform\n        elif type_init == \"he_normal\":\n            logging.info(\"Using HeNormal as Parameter initializer\")\n            self.initialW = chainer.initializers.HeNormal\n        elif type_init == \"pytorch\":\n            logging.info(\"Using Pytorch initializer\")\n            self.initialW = chainer.initializers.Uniform\n        else:\n            logging.info(\"Using Chainer default as Parameter initializer\")\n            self.initialW = chainer.initializers.Uniform\n        self.initialB = chainer.initializers.Uniform\n\n    def forward(self, xs, ilens, ys_pad, calculate_attentions=False):\n        \"\"\"E2E forward propagation.\n\n        Args:\n            xs (chainer.Variable): Batch of padded 
character ids. (B, Tmax)\n            ilens (chainer.Variable): Batch of length of each input batch. (B,)\n            ys_pad (chainer.Variable): Batch of padded target features. (B, Lmax, odim)\n            calculate_attentions (bool): If true, return value is the output of encoder.\n\n        Returns:\n            float: Training loss.\n            float (optional): Training loss for ctc.\n            float (optional): Training loss for attention.\n            float (optional): Accuracy.\n            chainer.Variable (Optional): Output of the encoder.\n\n        \"\"\"\n        alpha = self.mtlalpha\n\n        # 1. Encoder\n        xs, x_mask, ilens = self.encoder(xs, ilens)\n\n        # 2. CTC loss\n        cer_ctc = None\n        if alpha == 0.0:\n            loss_ctc = None\n        else:\n            _ys = [y.astype(np.int32) for y in ys_pad]\n            loss_ctc = self.ctc(xs, _ys)\n            if self.error_calculator is not None:\n                with chainer.no_backprop_mode():\n                    ys_hat = chainer.backends.cuda.to_cpu(self.ctc.argmax(xs).data)\n                cer_ctc = self.error_calculator(ys_hat, ys_pad, is_ctc=True)\n\n        # 3. Decoder\n        if calculate_attentions:\n            self.calculate_attentions(xs, x_mask, ys_pad)\n        ys = self.decoder(ys_pad, xs, x_mask)\n\n        # 4. 
Attention Loss\n        cer, wer = None, None\n        if alpha == 1:\n            loss_att = None\n            acc = None\n        else:\n            # Make target\n            eos = np.array([self.eos], \"i\")\n            with chainer.no_backprop_mode():\n                ys_pad_out = [np.concatenate([y, eos], axis=0) for y in ys_pad]\n                ys_pad_out = F.pad_sequence(ys_pad_out, padding=-1).data\n                ys_pad_out = self.xp.array(ys_pad_out)\n\n            loss_att = self.criterion(ys, ys_pad_out)\n            acc = F.accuracy(\n                ys.reshape(-1, self.odim), ys_pad_out.reshape(-1), ignore_label=-1\n            )\n            if (not chainer.config.train) and (self.error_calculator is not None):\n                cer, wer = self.error_calculator(ys, ys_pad)\n\n        if alpha == 0.0:\n            self.loss = loss_att\n            loss_att_data = loss_att.data\n            loss_ctc_data = None\n        elif alpha == 1.0:\n            self.loss = loss_ctc\n            loss_att_data = None\n            loss_ctc_data = loss_ctc.data\n        else:\n            self.loss = alpha * loss_ctc + (1 - alpha) * loss_att\n            loss_att_data = loss_att.data\n            loss_ctc_data = loss_ctc.data\n        loss_data = self.loss.data\n\n        if not math.isnan(loss_data):\n            reporter.report({\"loss_ctc\": loss_ctc_data}, self)\n            reporter.report({\"loss_att\": loss_att_data}, self)\n            reporter.report({\"acc\": acc}, self)\n\n            reporter.report({\"cer_ctc\": cer_ctc}, self)\n            reporter.report({\"cer\": cer}, self)\n            reporter.report({\"wer\": wer}, self)\n\n            logging.info(\"mtl loss:\" + str(loss_data))\n            reporter.report({\"loss\": loss_data}, self)\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n\n        if self.flag_return:\n            loss_ctc = None\n            return self.loss, loss_ctc, loss_att, acc\n        
else:\n            return self.loss\n\n    def calculate_attentions(self, xs, x_mask, ys_pad):\n        \"\"\"Calculate Attentions.\"\"\"\n        self.decoder(ys_pad, xs, x_mask)\n\n    def recognize(self, x_block, recog_args, char_list=None, rnnlm=None):\n        \"\"\"E2E recognition function.\n\n        Args:\n            x_block (ndarray): Input acoustic feature (B, T, D) or (T, D).\n            recog_args (Namespace): Argument namespace containing options.\n            char_list (List[str]): List of characters.\n            rnnlm (chainer.Chain): Language model module defined at\n            `espnet.lm.chainer_backend.lm`.\n\n        Returns:\n            List: N-best decoding results.\n\n        \"\"\"\n        with chainer.no_backprop_mode(), chainer.using_config(\"train\", False):\n            # 1. encoder\n            ilens = [x_block.shape[0]]\n            batch = len(ilens)\n            xs, _, _ = self.encoder(x_block[None, :, :], ilens)\n\n            # calculate log P(z_t|X) for CTC scores\n            if recog_args.ctc_weight > 0.0:\n                lpz = self.ctc.log_softmax(xs.reshape(batch, -1, self.dims)).data[0]\n            else:\n                lpz = None\n            # 2. 
decoder\n            if recog_args.lm_weight == 0.0:\n                rnnlm = None\n            y = self.recognize_beam(xs, lpz, recog_args, char_list, rnnlm)\n\n        return y\n\n    def recognize_beam(self, h, lpz, recog_args, char_list=None, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        Args:\n            h (ndarray): Encoder output features (B, T, D) or (T, D).\n            lpz (ndarray): Log probabilities from CTC.\n            recog_args (Namespace): Argument namespace containing options.\n            char_list (List[str]): List of characters.\n            rnnlm (chainer.Chain): Language model module defined at\n            `espnet.lm.chainer_backend.lm`.\n\n        Returns:\n            List: N-best decoding results.\n\n        \"\"\"\n        logging.info(\"input lengths: \" + str(h.shape[1]))\n\n        # initialization\n        n_len = h.shape[1]\n        xp = self.xp\n        h_mask = xp.ones((1, n_len))\n\n        # search params\n        beam = recog_args.beam_size\n        penalty = recog_args.penalty\n        ctc_weight = recog_args.ctc_weight\n\n        # prepare sos\n        y = self.sos\n        if recog_args.maxlenratio == 0:\n            maxlen = n_len\n        else:\n            maxlen = max(1, int(recog_args.maxlenratio * n_len))\n        minlen = int(recog_args.minlenratio * n_len)\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        if rnnlm:\n            hyp = {\"score\": 0.0, \"yseq\": [y], \"rnnlm_prev\": None}\n        else:\n            hyp = {\"score\": 0.0, \"yseq\": [y]}\n\n        if lpz is not None:\n            ctc_prefix_score = CTCPrefixScore(lpz, 0, self.eos, self.xp)\n            hyp[\"ctc_state_prev\"] = ctc_prefix_score.initial_state()\n            hyp[\"ctc_score_prev\"] = 0.0\n            if ctc_weight != 1.0:\n                # pre-pruning based on attention scores\n                ctc_beam = 
min(lpz.shape[-1], int(beam * CTC_SCORING_RATIO))\n            else:\n                ctc_beam = lpz.shape[-1]\n\n        hyps = [hyp]\n        ended_hyps = []\n\n        for i in six.moves.range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            hyps_best_kept = []\n            for hyp in hyps:\n                ys = F.expand_dims(xp.array(hyp[\"yseq\"]), axis=0).data\n                out = self.decoder(ys, h, h_mask)\n\n                # get nbest local scores and their ids\n                local_att_scores = F.log_softmax(out[:, -1], axis=-1).data\n                if rnnlm:\n                    rnnlm_state, local_lm_scores = rnnlm.predict(\n                        hyp[\"rnnlm_prev\"], hyp[\"yseq\"][i]\n                    )\n                    local_scores = (\n                        local_att_scores + recog_args.lm_weight * local_lm_scores\n                    )\n                else:\n                    local_scores = local_att_scores\n\n                if lpz is not None:\n                    local_best_ids = xp.argsort(local_scores, axis=1)[0, ::-1][\n                        :ctc_beam\n                    ]\n                    ctc_scores, ctc_states = ctc_prefix_score(\n                        hyp[\"yseq\"], local_best_ids, hyp[\"ctc_state_prev\"]\n                    )\n                    local_scores = (1.0 - ctc_weight) * local_att_scores[\n                        :, local_best_ids\n                    ] + ctc_weight * (ctc_scores - hyp[\"ctc_score_prev\"])\n                    if rnnlm:\n                        local_scores += (\n                            recog_args.lm_weight * local_lm_scores[:, local_best_ids]\n                        )\n                    joint_best_ids = xp.argsort(local_scores, axis=1)[0, ::-1][:beam]\n                    local_best_scores = local_scores[:, joint_best_ids]\n                    local_best_ids = local_best_ids[joint_best_ids]\n                else:\n                    local_best_ids = 
self.xp.argsort(local_scores, axis=1)[0, ::-1][\n                        :beam\n                    ]\n                    local_best_scores = local_scores[:, local_best_ids]\n\n                for j in six.moves.range(beam):\n                    new_hyp = {}\n                    new_hyp[\"score\"] = hyp[\"score\"] + float(local_best_scores[0, j])\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = int(local_best_ids[j])\n                    if rnnlm:\n                        new_hyp[\"rnnlm_prev\"] = rnnlm_state\n                    if lpz is not None:\n                        new_hyp[\"ctc_state_prev\"] = ctc_states[joint_best_ids[j]]\n                        new_hyp[\"ctc_score_prev\"] = ctc_scores[joint_best_ids[j]]\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypotheses: \" + str(len(hyps)))\n            if char_list is not None:\n                logging.debug(\n                    \"best hypo: \"\n                    + \"\".join([char_list[int(x)] for x in hyps[0][\"yseq\"][1:]])\n                    + \" score: \"\n                    + str(hyps[0][\"score\"])\n                )\n\n            # add eos in the final loop to avoid that there are no ended hyps\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last position in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.eos)\n\n            # add ended hypotheses to a final list,\n            # and remove them from current hypotheses\n            # (this will be a problem, number of hyps < beam)\n            remained_hyps = 
[]\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * penalty\n                        if rnnlm:  # Word LM needs to add final <eos> score\n                            hyp[\"score\"] += recog_args.lm_weight * rnnlm.final(\n                                hyp[\"rnnlm_prev\"]\n                            )\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # end detection\n            if end_detect(ended_hyps, i) and recog_args.maxlenratio == 0.0:\n                logging.info(\"end detected at %d\", i)\n                break\n\n            hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. 
Finish decoding.\")\n                break\n            if char_list is not None:\n                for hyp in hyps:\n                    logging.debug(\n                        \"hypo: \" + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]])\n                    )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(\n            ended_hyps, key=lambda x: x[\"score\"], reverse=True\n        )  # [:min(len(ended_hyps), recog_args.nbest)]\n\n        logging.debug(nbest_hyps)\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there is no N-best results, perform recognition \"\n                \"again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            recog_args = Namespace(**vars(recog_args))\n            recog_args.minlenratio = max(0.0, recog_args.minlenratio - 0.1)\n            return self.recognize_beam(h, lpz, recog_args, char_list, rnnlm)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n        # remove sos\n        return nbest_hyps\n\n    def calculate_all_attentions(self, xs, ilens, ys):\n        \"\"\"E2E attention calculation.\n\n        Args:\n            xs (List[tuple()]): List of padded input sequences.\n                [(T1, idim), (T2, idim), ...]\n            ilens (ndarray): Batch of lengths of input sequences. (B)\n            ys (List): List of character id sequence tensor. [(L1), (L2), (L3), ...]\n\n        Returns:\n            float ndarray: Attention weights. 
(B, Lmax, Tmax)\n\n        \"\"\"\n        with chainer.no_backprop_mode():\n            self(xs, ilens, ys, calculate_attentions=True)\n        ret = dict()\n        for name, m in self.namedlinks():\n            if isinstance(m, MultiHeadAttention):\n                var = m.attn\n                var.to_cpu()\n                _name = name[1:].replace(\"/\", \"_\")\n                ret[_name] = var.data\n        return ret\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Attention plot function.\n\n        Redirects to PlotAttentionReport\n\n        Returns:\n            PlotAttentionReport\n\n        \"\"\"\n        return PlotAttentionReport\n\n    @staticmethod\n    def custom_converter(subsampling_factor=0):\n        \"\"\"Get customconverter of the model.\"\"\"\n        return CustomConverter()\n\n    @staticmethod\n    def custom_updater(iters, optimizer, converter, device=-1, accum_grad=1):\n        \"\"\"Get custom_updater of the model.\"\"\"\n        return CustomUpdater(\n            iters, optimizer, converter=converter, device=device, accum_grad=accum_grad\n        )\n\n    @staticmethod\n    def custom_parallel_updater(iters, optimizer, converter, devices, accum_grad=1):\n        \"\"\"Get custom_parallel_updater of the model.\"\"\"\n        return CustomParallelUpdater(\n            iters,\n            optimizer,\n            converter=converter,\n            devices=devices,\n            accum_grad=accum_grad,\n        )\n"
  },
  {
    "path": "nets/chainer_backend/nets_utils.py",
    "content": "import chainer.functions as F\n\n\ndef _subsamplex(x, n):\n    x = [F.get_item(xx, (slice(None, None, n), slice(None))) for xx in x]\n    ilens = [xx.shape[0] for xx in x]\n    return x, ilens\n"
  },
  {
    "path": "nets/chainer_backend/rnn/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/chainer_backend/rnn/attentions.py",
    "content": "import chainer\nimport chainer.functions as F\nimport chainer.links as L\n\nimport numpy as np\n\n\n# dot product based attention\nclass AttDot(chainer.Chain):\n    \"\"\"Compute attention based on dot product.\n\n    Args:\n        eprojs (int | None): Dimension of input vectors from encoder.\n        dunits (int | None): Dimension of input vectors for decoder.\n        att_dim (int): Dimension of input vectors for attention.\n\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, att_dim):\n        super(AttDot, self).__init__()\n        with self.init_scope():\n            self.mlp_enc = L.Linear(eprojs, att_dim)\n            self.mlp_dec = L.Linear(dunits, att_dim)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n\n    def reset(self):\n        \"\"\"Reset states.\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n\n    def __call__(self, enc_hs, dec_z, att_prev, scaling=2.0):\n        \"\"\"Compute AttDot forward layer.\n\n        Args:\n            enc_hs (chainer.Variable | N-dimensional array):\n                Input variable from encoder.\n            dec_z (chainer.Variable | N-dimensional array): Input variable of decoder.\n            scaling (float): Scaling weight to make attention sharp.\n\n        Returns:\n            chainer.Variable: Weighted sum over flames.\n            chainer.Variable: Attention weight.\n\n        \"\"\"\n        batch = len(enc_hs)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None:\n            self.enc_h = F.pad_sequence(enc_hs)  # utt x frame x hdim\n            self.h_length = self.enc_h.shape[1]\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = F.tanh(self.mlp_enc(self.enc_h, n_batch_axes=2))\n\n        if dec_z is None:\n            dec_z = 
chainer.Variable(\n                self.xp.zeros((batch, self.dunits), dtype=np.float32)\n            )\n        else:\n            dec_z = dec_z.reshape(batch, self.dunits)\n\n        # <phi (h_t), psi (s)> for all t\n        u = F.broadcast_to(\n            F.expand_dims(F.tanh(self.mlp_dec(dec_z)), 1), self.pre_compute_enc_h.shape\n        )\n        e = F.sum(self.pre_compute_enc_h * u, axis=2)  # utt x frame\n        # Applying a minus-large-number filter\n        # to make a probability value zero for a padded area\n        # simply degrades the performance, and I gave up this implementation\n        # Apply a scaling to make an attention sharp\n        w = F.softmax(scaling * e)\n        # weighted sum over flames\n        # utt x hdim\n        c = F.sum(\n            self.enc_h * F.broadcast_to(F.expand_dims(w, 2), self.enc_h.shape), axis=1\n        )\n\n        return c, w\n\n\n# location based attention\nclass AttLoc(chainer.Chain):\n    \"\"\"Compute location-based attention.\n\n    Args:\n        eprojs (int | None): Dimension of input vectors from encoder.\n        dunits (int | None): Dimension of input vectors for decoder.\n        att_dim (int): Dimension of input vectors for attention.\n        aconv_chans (int): Number of channels of output arrays from convolutional layer.\n        aconv_filts (int): Size of filters of convolutional layer.\n\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, att_dim, aconv_chans, aconv_filts):\n        super(AttLoc, self).__init__()\n        with self.init_scope():\n            self.mlp_enc = L.Linear(eprojs, att_dim)\n            self.mlp_dec = L.Linear(dunits, att_dim, nobias=True)\n            self.mlp_att = L.Linear(aconv_chans, att_dim, nobias=True)\n            self.loc_conv = L.Convolution2D(\n                1, aconv_chans, ksize=(1, 2 * aconv_filts + 1), pad=(0, aconv_filts)\n            )\n            self.gvec = L.Linear(att_dim, 1)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n      
  self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.aconv_chans = aconv_chans\n\n    def reset(self):\n        \"\"\"Reset states.\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n\n    def __call__(self, enc_hs, dec_z, att_prev, scaling=2.0):\n        \"\"\"Compute AttLoc forward layer.\n\n        Args:\n            enc_hs (chainer.Variable | N-dimensional array):\n                Input variable from encoders.\n            dec_z (chainer.Variable | N-dimensional array): Input variable of decoder.\n            att_prev (chainer.Variable | None): Attention weight.\n            scaling (float): Scaling weight to make attention sharp.\n\n        Returns:\n            chainer.Variable: Weighted sum over flames.\n            chainer.Variable: Attention weight.\n\n        \"\"\"\n        batch = len(enc_hs)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None:\n            self.enc_h = F.pad_sequence(enc_hs)  # utt x frame x hdim\n            self.h_length = self.enc_h.shape[1]\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h, n_batch_axes=2)\n\n        if dec_z is None:\n            dec_z = chainer.Variable(\n                self.xp.zeros((batch, self.dunits), dtype=np.float32)\n            )\n        else:\n            dec_z = dec_z.reshape(batch, self.dunits)\n\n        # initialize attention weight with uniform dist.\n        if att_prev is None:\n            att_prev = [\n                self.xp.full(hh.shape[0], 1.0 / hh.shape[0], dtype=np.float32)\n                for hh in enc_hs\n            ]\n            att_prev = [chainer.Variable(att) for att in att_prev]\n            att_prev = F.pad_sequence(att_prev)\n\n        # att_prev: utt x frame -> utt x 1 x 1 x frame\n        # -> utt x att_conv_chans x 1 x frame\n        att_conv = 
self.loc_conv(att_prev.reshape(batch, 1, 1, self.h_length))\n        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans\n        att_conv = F.swapaxes(F.squeeze(att_conv, axis=2), 1, 2)\n        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim\n        att_conv = self.mlp_att(att_conv, n_batch_axes=2)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = F.broadcast_to(\n            F.expand_dims(self.mlp_dec(dec_z), 1), self.pre_compute_enc_h.shape\n        )\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        # TODO(watanabe) use batch_matmul\n        e = F.squeeze(\n            self.gvec(\n                F.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled), n_batch_axes=2\n            ),\n            axis=2,\n        )\n        # Applying a minus-large-number filter\n        # to make a probability value zero for a padded area\n        # simply degrades the performance, and I gave up this implementation\n        # Apply a scaling to make an attention sharp\n        w = F.softmax(scaling * e)\n\n        # weighted sum over flames\n        # utt x hdim\n        c = F.sum(\n            self.enc_h * F.broadcast_to(F.expand_dims(w, 2), self.enc_h.shape), axis=1\n        )\n\n        return c, w\n\n\nclass NoAtt(chainer.Chain):\n    \"\"\"Compute non-attention layer.\n\n    This layer is a dummy attention layer to be compatible with other\n    attention-based models.\n\n    \"\"\"\n\n    def __init__(self):\n        super(NoAtt, self).__init__()\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.c = None\n\n    def reset(self):\n        \"\"\"Reset states.\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.c = None\n\n    def __call__(self, enc_hs, dec_z, att_prev):\n        \"\"\"Compute NoAtt forward layer.\n\n        Args:\n            enc_hs 
(chainer.Variable | N-dimensional array):\n                Input variable from encoders.\n            dec_z: Dummy.\n            att_prev (chainer.Variable | None): Attention weight.\n\n        Returns:\n            chainer.Variable: Sum over frames.\n            chainer.Variable: Attention weight.\n\n        \"\"\"\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None:\n            self.enc_h = F.pad_sequence(enc_hs)  # utt x frame x hdim\n            self.h_length = self.enc_h.shape[1]\n\n        # initialize attention weight with uniform dist.\n        if att_prev is None:\n            att_prev = [\n                self.xp.full(hh.shape[0], 1.0 / hh.shape[0], dtype=np.float32)\n                for hh in enc_hs\n            ]\n            att_prev = [chainer.Variable(att) for att in att_prev]\n            att_prev = F.pad_sequence(att_prev)\n            self.c = F.sum(\n                self.enc_h\n                * F.broadcast_to(F.expand_dims(att_prev, 2), self.enc_h.shape),\n                axis=1,\n            )\n\n        return self.c, att_prev\n\n\ndef att_for(args):\n    \"\"\"Returns an attention layer given the program arguments.\n\n    Args:\n        args (Namespace): The arguments.\n\n    Returns:\n        chainer.Chain: The corresponding attention module.\n\n    \"\"\"\n    if args.atype == \"dot\":\n        att = AttDot(args.eprojs, args.dunits, args.adim)\n    elif args.atype == \"location\":\n        att = AttLoc(\n            args.eprojs, args.dunits, args.adim, args.aconv_chans, args.aconv_filts\n        )\n    elif args.atype == \"noatt\":\n        att = NoAtt()\n    else:\n        raise NotImplementedError(\n            \"chainer supports only noatt, dot, and location attention.\"\n        )\n    return att\n"
  },
  {
    "path": "nets/chainer_backend/rnn/decoders.py",
    "content": "import logging\nimport random\nimport six\n\nimport chainer\nimport chainer.functions as F\nimport chainer.links as L\nimport numpy as np\n\nimport espnet.nets.chainer_backend.deterministic_embed_id as DL\n\nfrom argparse import Namespace\n\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScore\nfrom espnet.nets.e2e_asr_common import end_detect\n\nCTC_SCORING_RATIO = 1.5\nMAX_DECODER_OUTPUT = 5\n\n\nclass Decoder(chainer.Chain):\n    \"\"\"Decoder layer.\n\n    Args:\n        eprojs (int): Dimension of input variables from encoder.\n        odim (int): The output dimension.\n        dtype (str): Decoder type.\n        dlayers (int): Number of layers for decoder.\n        dunits (int): Dimension of input vector of decoder.\n        sos (int): Number to indicate the start of sequences.\n        eos (int): Number to indicate the end of sequences.\n        att (Module): Attention module defined at\n            `espnet.espnet.nets.chainer_backend.attentions`.\n        verbose (int): Verbosity level.\n        char_list (List[str]): List of all charactors.\n        labeldist (numpy.array): Distributed array of counted transcript length.\n        lsm_weight (float): Weight to use when calculating the training loss.\n        sampling_probability (float): Threshold for scheduled sampling.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        eprojs,\n        odim,\n        dtype,\n        dlayers,\n        dunits,\n        sos,\n        eos,\n        att,\n        verbose=0,\n        char_list=None,\n        labeldist=None,\n        lsm_weight=0.0,\n        sampling_probability=0.0,\n    ):\n        super(Decoder, self).__init__()\n        with self.init_scope():\n            self.embed = DL.EmbedID(odim, dunits)\n            self.rnn0 = (\n                L.StatelessLSTM(dunits + eprojs, dunits)\n                if dtype == \"lstm\"\n                else L.StatelessGRU(dunits + eprojs, dunits)\n            )\n            for i in six.moves.range(1, 
dlayers):\n                setattr(\n                    self,\n                    \"rnn%d\" % i,\n                    L.StatelessLSTM(dunits, dunits)\n                    if dtype == \"lstm\"\n                    else L.StatelessGRU(dunits, dunits),\n                )\n            self.output = L.Linear(dunits, odim)\n        self.dtype = dtype\n        self.loss = None\n        self.att = att\n        self.dlayers = dlayers\n        self.dunits = dunits\n        self.sos = sos\n        self.eos = eos\n        self.verbose = verbose\n        self.char_list = char_list\n        # for label smoothing\n        self.labeldist = labeldist\n        self.vlabeldist = None\n        self.lsm_weight = lsm_weight\n        self.sampling_probability = sampling_probability\n\n    def rnn_forward(self, ey, z_list, c_list, z_prev, c_prev):\n        if self.dtype == \"lstm\":\n            c_list[0], z_list[0] = self.rnn0(c_prev[0], z_prev[0], ey)\n            for i in six.moves.range(1, self.dlayers):\n                c_list[i], z_list[i] = self[\"rnn%d\" % i](\n                    c_prev[i], z_prev[i], z_list[i - 1]\n                )\n        else:\n            if z_prev[0] is None:\n                xp = self.xp\n                with chainer.backends.cuda.get_device_from_id(self._device_id):\n                    z_prev[0] = chainer.Variable(\n                        xp.zeros((ey.shape[0], self.dunits), dtype=ey.dtype)\n                    )\n            z_list[0] = self.rnn0(z_prev[0], ey)\n            for i in six.moves.range(1, self.dlayers):\n                if z_prev[i] is None:\n                    xp = self.xp\n                    with chainer.backends.cuda.get_device_from_id(self._device_id):\n                        z_prev[i] = chainer.Variable(\n                            xp.zeros(\n                                (z_list[i - 1].shape[0], self.dunits),\n                                dtype=z_list[i - 1].dtype,\n                            )\n                        
)\n                z_list[i] = self[\"rnn%d\" % i](z_prev[i], z_list[i - 1])\n        return z_list, c_list\n\n    def __call__(self, hs, ys):\n        \"\"\"Core function of Decoder layer.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n            ys (list of chainer.Variable | N-dimension array):\n                Input variable of decoder.\n\n        Returns:\n            chainer.Variable: A variable holding a scalar array of the training loss.\n            chainer.Variable: A variable holding a scalar array of the accuracy.\n\n        \"\"\"\n        self.loss = None\n        # prepare input and output word sequences with sos/eos IDs\n        eos = self.xp.array([self.eos], \"i\")\n        sos = self.xp.array([self.sos], \"i\")\n        ys_in = [F.concat([sos, y], axis=0) for y in ys]\n        ys_out = [F.concat([y, eos], axis=0) for y in ys]\n\n        # padding for ys with -1\n        # pys: utt x olen\n        pad_ys_in = F.pad_sequence(ys_in, padding=self.eos)\n        pad_ys_out = F.pad_sequence(ys_out, padding=-1)\n\n        # get dim, length info\n        batch = pad_ys_out.shape[0]\n        olength = pad_ys_out.shape[1]\n        logging.info(\n            self.__class__.__name__\n            + \" input lengths:  \"\n            + str(self.xp.array([h.shape[0] for h in hs]))\n        )\n        logging.info(\n            self.__class__.__name__\n            + \" output lengths: \"\n            + str(self.xp.array([y.shape[0] for y in ys_out]))\n        )\n\n        # initialization\n        c_list = [None]  # list of cell state of each layer\n        z_list = [None]  # list of hidden state of each layer\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(None)\n            z_list.append(None)\n        att_w = None\n        z_all = []\n        self.att.reset()  # reset pre-computation of h\n\n        # pre-computation of embedding\n        eys = 
self.embed(pad_ys_in)  # utt x olen x zdim\n        eys = F.separate(eys, axis=1)\n\n        # loop for an output sequence\n        for i in six.moves.range(olength):\n            att_c, att_w = self.att(hs, z_list[0], att_w)\n            if i > 0 and random.random() < self.sampling_probability:\n                logging.info(\" scheduled sampling \")\n                z_out = self.output(z_all[-1])\n                z_out = F.argmax(F.log_softmax(z_out), axis=1)\n                z_out = self.embed(z_out)\n                ey = F.hstack((z_out, att_c))  # utt x (zdim + hdim)\n            else:\n                ey = F.hstack((eys[i], att_c))  # utt x (zdim + hdim)\n            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)\n            z_all.append(z_list[-1])\n\n        z_all = F.stack(z_all, axis=1).reshape(batch * olength, self.dunits)\n        # compute loss\n        y_all = self.output(z_all)\n        self.loss = F.softmax_cross_entropy(y_all, F.flatten(pad_ys_out))\n        # -1: eos, which is removed in the loss computation\n        self.loss *= np.mean([len(x) for x in ys_in]) - 1\n        acc = F.accuracy(y_all, F.flatten(pad_ys_out), ignore_label=-1)\n        logging.info(\"att loss:\" + str(self.loss.data))\n\n        # show predicted character sequence for debug\n        if self.verbose > 0 and self.char_list is not None:\n            y_hat = y_all.reshape(batch, olength, -1)\n            y_true = pad_ys_out\n            for (i, y_hat_), y_true_ in zip(enumerate(y_hat.data), y_true.data):\n                if i == MAX_DECODER_OUTPUT:\n                    break\n                idx_hat = self.xp.argmax(y_hat_[y_true_ != -1], axis=1)\n                idx_true = y_true_[y_true_ != -1]\n                seq_hat = [self.char_list[int(idx)] for idx in idx_hat]\n                seq_true = [self.char_list[int(idx)] for idx in idx_true]\n                seq_hat = \"\".join(seq_hat).replace(\"<space>\", \" \")\n                seq_true = 
\"\".join(seq_true).replace(\"<space>\", \" \")\n                logging.info(\"groundtruth[%d]: \" % i + seq_true)\n                logging.info(\"prediction [%d]: \" % i + seq_hat)\n\n        if self.labeldist is not None:\n            if self.vlabeldist is None:\n                self.vlabeldist = chainer.Variable(self.xp.asarray(self.labeldist))\n            loss_reg = -F.sum(\n                F.scale(F.log_softmax(y_all), self.vlabeldist, axis=1)\n            ) / len(ys_in)\n            self.loss = (1.0 - self.lsm_weight) * self.loss + self.lsm_weight * loss_reg\n\n        return self.loss, acc\n\n    def recognize_beam(self, h, lpz, recog_args, char_list, rnnlm=None):\n        \"\"\"Beam search implementation.\n\n        Args:\n            h (chainer.Variable): One of the output from the encoder.\n            lpz (chainer.Variable | None): Result of net propagation.\n            recog_args (Namespace): The argument.\n            char_list (List[str]): List of all charactors.\n            rnnlm (Module): RNNLM module. 
Defined at `espnet.lm.chainer_backend.lm`\n\n        Returns:\n            List[Dict[str,Any]]: Result of recognition.\n\n        \"\"\"\n        logging.info(\"input lengths: \" + str(h.shape[0]))\n        # initialization\n        c_list = [None]  # list of cell state of each layer\n        z_list = [None]  # list of hidden state of each layer\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(None)\n            z_list.append(None)\n        a = None\n        self.att.reset()  # reset pre-computation of h\n\n        # search params\n        beam = recog_args.beam_size\n        penalty = recog_args.penalty\n        ctc_weight = recog_args.ctc_weight\n\n        # prepare sos\n        y = self.xp.full(1, self.sos, \"i\")\n        if recog_args.maxlenratio == 0:\n            maxlen = h.shape[0]\n        else:\n            # maxlen >= 1\n            maxlen = max(1, int(recog_args.maxlenratio * h.shape[0]))\n        minlen = int(recog_args.minlenratio * h.shape[0])\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        if rnnlm:\n            hyp = {\n                \"score\": 0.0,\n                \"yseq\": [y],\n                \"c_prev\": c_list,\n                \"z_prev\": z_list,\n                \"a_prev\": a,\n                \"rnnlm_prev\": None,\n            }\n        else:\n            hyp = {\n                \"score\": 0.0,\n                \"yseq\": [y],\n                \"c_prev\": c_list,\n                \"z_prev\": z_list,\n                \"a_prev\": a,\n            }\n        if lpz is not None:\n            ctc_prefix_score = CTCPrefixScore(lpz, 0, self.eos, self.xp)\n            hyp[\"ctc_state_prev\"] = ctc_prefix_score.initial_state()\n            hyp[\"ctc_score_prev\"] = 0.0\n            if ctc_weight != 1.0:\n                # pre-pruning based on attention scores\n                ctc_beam = 
min(lpz.shape[-1], int(beam * CTC_SCORING_RATIO))\n            else:\n                ctc_beam = lpz.shape[-1]\n        hyps = [hyp]\n        ended_hyps = []\n\n        for i in six.moves.range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            hyps_best_kept = []\n            for hyp in hyps:\n                ey = self.embed(hyp[\"yseq\"][i])  # utt list (1) x zdim\n                att_c, att_w = self.att([h], hyp[\"z_prev\"][0], hyp[\"a_prev\"])\n                ey = F.hstack((ey, att_c))  # utt(1) x (zdim + hdim)\n\n                z_list, c_list = self.rnn_forward(\n                    ey, z_list, c_list, hyp[\"z_prev\"], hyp[\"c_prev\"]\n                )\n\n                # get nbest local scores and their ids\n                local_att_scores = F.log_softmax(self.output(z_list[-1])).data\n                if rnnlm:\n                    rnnlm_state, local_lm_scores = rnnlm.predict(\n                        hyp[\"rnnlm_prev\"], hyp[\"yseq\"][i]\n                    )\n                    local_scores = (\n                        local_att_scores + recog_args.lm_weight * local_lm_scores\n                    )\n                else:\n                    local_scores = local_att_scores\n\n                if lpz is not None:\n                    local_best_ids = self.xp.argsort(local_scores, axis=1)[0, ::-1][\n                        :ctc_beam\n                    ]\n                    ctc_scores, ctc_states = ctc_prefix_score(\n                        hyp[\"yseq\"], local_best_ids, hyp[\"ctc_state_prev\"]\n                    )\n                    local_scores = (1.0 - ctc_weight) * local_att_scores[\n                        :, local_best_ids\n                    ] + ctc_weight * (ctc_scores - hyp[\"ctc_score_prev\"])\n                    if rnnlm:\n                        local_scores += (\n                            recog_args.lm_weight * local_lm_scores[:, local_best_ids]\n                        )\n                    joint_best_ids 
= self.xp.argsort(local_scores, axis=1)[0, ::-1][\n                        :beam\n                    ]\n                    local_best_scores = local_scores[:, joint_best_ids]\n                    local_best_ids = local_best_ids[joint_best_ids]\n                else:\n                    local_best_ids = self.xp.argsort(local_scores, axis=1)[0, ::-1][\n                        :beam\n                    ]\n                    local_best_scores = local_scores[:, local_best_ids]\n\n                for j in six.moves.range(beam):\n                    new_hyp = {}\n                    # do not copy {z,c}_list directly\n                    new_hyp[\"z_prev\"] = z_list[:]\n                    new_hyp[\"c_prev\"] = c_list[:]\n                    new_hyp[\"a_prev\"] = att_w\n                    new_hyp[\"score\"] = hyp[\"score\"] + local_best_scores[0, j]\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = self.xp.full(\n                        1, local_best_ids[j], \"i\"\n                    )\n                    if rnnlm:\n                        new_hyp[\"rnnlm_prev\"] = rnnlm_state\n                    if lpz is not None:\n                        new_hyp[\"ctc_state_prev\"] = ctc_states[joint_best_ids[j]]\n                        new_hyp[\"ctc_score_prev\"] = ctc_scores[joint_best_ids[j]]\n                    # will be (2 x beam) hyps at most\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypotheses: \" + str(len(hyps)))\n            logging.debug(\n                \"best hypo: \"\n                + \"\".join([char_list[int(x)] for x in 
hyps[0][\"yseq\"][1:]]).replace(\n                    \"<space>\", \" \"\n                )\n            )\n\n            # add eos in the final loop to avoid that there are no ended hyps\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last position in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.xp.full(1, self.eos, \"i\"))\n\n            # add ended hypotheses to a final list,\n            # and removed them from current hypotheses\n            # (this will be a problem, number of hyps < beam)\n            remained_hyps = []\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * penalty\n                        if rnnlm:  # Word LM needs to add final <eos> score\n                            hyp[\"score\"] += recog_args.lm_weight * rnnlm.final(\n                                hyp[\"rnnlm_prev\"]\n                            )\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # end detection\n            if end_detect(ended_hyps, i) and recog_args.maxlenratio == 0.0:\n                logging.info(\"end detected at %d\", i)\n                break\n\n            hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. 
Finish decoding.\")\n                break\n\n            for hyp in hyps:\n                logging.debug(\n                    \"hypo: \"\n                    + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]]).replace(\n                        \"<space>\", \" \"\n                    )\n                )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[\n            : min(len(ended_hyps), recog_args.nbest)\n        ]\n\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there is no N-best results, \"\n                \"perform recognition again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            recog_args = Namespace(**vars(recog_args))\n            recog_args.minlenratio = max(0.0, recog_args.minlenratio - 0.1)\n            return self.recognize_beam(h, lpz, recog_args, char_list, rnnlm)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n\n        return nbest_hyps\n\n    def calculate_all_attentions(self, hs, ys):\n        \"\"\"Calculate all of attentions.\n\n        Args:\n            hs (list of chainer.Variable | N-dimensional array):\n                Input variable from encoder.\n            ys (list of chainer.Variable | N-dimensional array):\n                Input variable of decoder.\n\n        Returns:\n            chainer.Variable: List of attention weights.\n\n        \"\"\"\n        # prepare input and output word sequences with sos/eos IDs\n        eos = self.xp.array([self.eos], \"i\")\n        sos = self.xp.array([self.sos], \"i\")\n        ys_in = [F.concat([sos, y], axis=0) for y in 
ys]\n        ys_out = [F.concat([y, eos], axis=0) for y in ys]\n\n        # padding for ys with -1\n        # pys: utt x olen\n        pad_ys_in = F.pad_sequence(ys_in, padding=self.eos)\n        pad_ys_out = F.pad_sequence(ys_out, padding=-1)\n\n        # get length info\n        olength = pad_ys_out.shape[1]\n\n        # initialization\n        c_list = [None]  # list of cell state of each layer\n        z_list = [None]  # list of hidden state of each layer\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(None)\n            z_list.append(None)\n        att_w = None\n        att_ws = []\n        self.att.reset()  # reset pre-computation of h\n\n        # pre-computation of embedding\n        eys = self.embed(pad_ys_in)  # utt x olen x zdim\n        eys = F.separate(eys, axis=1)\n\n        # loop for an output sequence\n        for i in six.moves.range(olength):\n            att_c, att_w = self.att(hs, z_list[0], att_w)\n            ey = F.hstack((eys[i], att_c))  # utt x (zdim + hdim)\n            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)\n            att_ws.append(att_w)  # for debugging\n\n        att_ws = F.stack(att_ws, axis=1)\n        att_ws.to_cpu()\n\n        return att_ws.data\n\n\ndef decoder_for(args, odim, sos, eos, att, labeldist):\n    \"\"\"Return the decoding layer corresponding to the args.\n\n    Args:\n        args (Namespace): The program arguments.\n        odim (int): The output dimension.\n        sos (int): Number to indicate the start of sequences.\n        eos (int) Number to indicate the end of sequences.\n        att (Module):\n            Attention module defined at `espnet.nets.chainer_backend.attentions`.\n        labeldist (numpy.array): Distributed array of length od transcript.\n\n    Returns:\n        chainer.Chain: The decoder module.\n\n    \"\"\"\n    return Decoder(\n        args.eprojs,\n        odim,\n        args.dtype,\n        args.dlayers,\n        args.dunits,\n 
       sos,\n        eos,\n        att,\n        args.verbose,\n        args.char_list,\n        labeldist,\n        args.lsm_weight,\n        args.sampling_probability,\n    )\n"
  },
  {
    "path": "nets/chainer_backend/rnn/encoders.py",
    "content": "import logging\nimport six\n\nimport chainer\nimport chainer.functions as F\nimport chainer.links as L\nimport numpy as np\n\nfrom chainer import cuda\n\nfrom espnet.nets.chainer_backend.nets_utils import _subsamplex\nfrom espnet.nets.e2e_asr_common import get_vgg2l_odim\n\n\n# TODO(watanabe) explanation of BLSTMP\nclass RNNP(chainer.Chain):\n    \"\"\"RNN with projection layer module.\n\n    Args:\n        idim (int): Dimension of inputs.\n        elayers (int): Number of encoder layers.\n        cdim (int): Number of rnn units. (resulted in cdim * 2 if bidirectional)\n        hdim (int): Number of projection units.\n        subsample (np.ndarray): List to use sabsample the input array.\n        dropout (float): Dropout rate.\n        typ (str): The RNN type.\n\n    \"\"\"\n\n    def __init__(self, idim, elayers, cdim, hdim, subsample, dropout, typ=\"blstm\"):\n        super(RNNP, self).__init__()\n        bidir = typ[0] == \"b\"\n        if bidir:\n            rnn = L.NStepBiLSTM if \"lstm\" in typ else L.NStepBiGRU\n        else:\n            rnn = L.NStepLSTM if \"lstm\" in typ else L.NStepGRU\n        rnn_label = \"birnn\" if bidir else \"rnn\"\n        with self.init_scope():\n            for i in six.moves.range(elayers):\n                if i == 0:\n                    inputdim = idim\n                else:\n                    inputdim = hdim\n                _cdim = 2 * cdim if bidir else cdim\n                # bottleneck layer to merge\n                setattr(\n                    self, \"{}{:d}\".format(rnn_label, i), rnn(1, inputdim, cdim, dropout)\n                )\n                setattr(self, \"bt%d\" % i, L.Linear(_cdim, hdim))\n\n        self.elayers = elayers\n        self.rnn_label = rnn_label\n        self.cdim = cdim\n        self.subsample = subsample\n        self.typ = typ\n        self.bidir = bidir\n\n    def __call__(self, xs, ilens):\n        \"\"\"RNNP forward.\n\n        Args:\n            xs (chainer.Variable): 
Batch of padded charactor ids. (B, Tmax)\n            ilens (chainer.Variable): Batch of length of each input batch. (B,)\n\n        Returns:\n            xs (chainer.Variable):subsampled vector of xs.\n            chainer.Variable: Subsampled vector of ilens.\n\n        \"\"\"\n        logging.info(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n\n        for layer in six.moves.range(self.elayers):\n            if \"lstm\" in self.typ:\n                _, _, ys = self[self.rnn_label + str(layer)](None, None, xs)\n            else:\n                _, ys = self[self.rnn_label + str(layer)](None, xs)\n            # ys: utt list of frame x cdim x 2 (2: means bidirectional)\n            # TODO(watanabe) replace subsample and FC layer with CNN\n            ys, ilens = _subsamplex(ys, self.subsample[layer + 1])\n            # (sum _utt frame_utt) x dim\n            ys = self[\"bt\" + str(layer)](F.vstack(ys))\n            xs = F.split_axis(ys, np.cumsum(ilens[:-1]), axis=0)\n\n        # final tanh operation\n        xs = F.split_axis(F.tanh(F.vstack(xs)), np.cumsum(ilens[:-1]), axis=0)\n\n        # 1 utterance case, it becomes an array, so need to make a utt tuple\n        if not isinstance(xs, tuple):\n            xs = [xs]\n\n        return xs, ilens  # x: utt list of frame x dim\n\n\nclass RNN(chainer.Chain):\n    \"\"\"RNN Module.\n\n    Args:\n        idim (int): Dimension of the imput.\n        elayers (int): Number of encoder layers.\n        cdim (int): Number of rnn units.\n        hdim (int): Number of projection units.\n        dropout (float): Dropout rate.\n        typ (str): Rnn type.\n\n    \"\"\"\n\n    def __init__(self, idim, elayers, cdim, hdim, dropout, typ=\"lstm\"):\n        super(RNN, self).__init__()\n        bidir = typ[0] == \"b\"\n        if bidir:\n            rnn = L.NStepBiLSTM if \"lstm\" in typ else L.NStepBiGRU\n        else:\n            rnn = L.NStepLSTM if \"lstm\" in typ else L.NStepGRU\n        _cdim = 2 * cdim if 
bidir else cdim\n        with self.init_scope():\n            self.nbrnn = rnn(elayers, idim, cdim, dropout)\n            self.l_last = L.Linear(_cdim, hdim)\n        self.typ = typ\n        self.bidir = bidir\n\n    def __call__(self, xs, ilens):\n        \"\"\"BRNN forward propagation.\n\n        Args:\n            xs (chainer.Variable): Batch of padded charactor ids. (B, Tmax)\n            ilens (chainer.Variable): Batch of length of each input batch. (B,)\n\n        Returns:\n            tuple(chainer.Variable): Tuple of `chainer.Variable` objects.\n            chainer.Variable: `ilens` .\n\n        \"\"\"\n        logging.info(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n        # need to move ilens to cpu\n        ilens = cuda.to_cpu(ilens)\n\n        if \"lstm\" in self.typ:\n            _, _, ys = self.nbrnn(None, None, xs)\n        else:\n            _, ys = self.nbrnn(None, xs)\n        ys = self.l_last(F.vstack(ys))  # (sum _utt frame_utt) x dim\n        xs = F.split_axis(ys, np.cumsum(ilens[:-1]), axis=0)\n\n        # final tanh operation\n        xs = F.split_axis(F.tanh(F.vstack(xs)), np.cumsum(ilens[:-1]), axis=0)\n\n        # 1 utterance case, it becomes an array, so need to make a utt tuple\n        if not isinstance(xs, tuple):\n            xs = [xs]\n\n        return xs, ilens  # x: utt list of frame x dim\n\n\n# TODO(watanabe) explanation of VGG2L, VGG2B (Block) might be better\nclass VGG2L(chainer.Chain):\n    \"\"\"VGG motibated cnn layers.\n\n    Args:\n        in_channel (int): Number of channels.\n\n    \"\"\"\n\n    def __init__(self, in_channel=1):\n        super(VGG2L, self).__init__()\n        with self.init_scope():\n            # CNN layer (VGG motivated)\n            self.conv1_1 = L.Convolution2D(in_channel, 64, 3, stride=1, pad=1)\n            self.conv1_2 = L.Convolution2D(64, 64, 3, stride=1, pad=1)\n            self.conv2_1 = L.Convolution2D(64, 128, 3, stride=1, pad=1)\n            self.conv2_2 = 
L.Convolution2D(128, 128, 3, stride=1, pad=1)\n\n        self.in_channel = in_channel\n\n    def __call__(self, xs, ilens):\n        \"\"\"VGG2L forward propagation.\n\n        Args:\n            xs (chainer.Variable): Batch of padded charactor ids. (B, Tmax)\n            ilens (chainer.Variable): Batch of length of each features. (B,)\n\n        Returns:\n            chainer.Variable: Subsampled vector of xs.\n            chainer.Variable: Subsampled vector of ilens.\n\n        \"\"\"\n        logging.info(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n\n        # x: utt x frame x dim\n        xs = F.pad_sequence(xs)\n\n        # x: utt x 1 (input channel num) x frame x dim\n        xs = F.swapaxes(\n            xs.reshape(\n                xs.shape[0],\n                xs.shape[1],\n                self.in_channel,\n                xs.shape[2] // self.in_channel,\n            ),\n            1,\n            2,\n        )\n\n        xs = F.relu(self.conv1_1(xs))\n        xs = F.relu(self.conv1_2(xs))\n        xs = F.max_pooling_2d(xs, 2, stride=2)\n\n        xs = F.relu(self.conv2_1(xs))\n        xs = F.relu(self.conv2_2(xs))\n        xs = F.max_pooling_2d(xs, 2, stride=2)\n\n        # change ilens accordingly\n        ilens = self.xp.array(\n            self.xp.ceil(self.xp.array(ilens, dtype=np.float32) / 2), dtype=np.int32\n        )\n        ilens = self.xp.array(\n            self.xp.ceil(self.xp.array(ilens, dtype=np.float32) / 2), dtype=np.int32\n        )\n\n        # x: utt_list of frame (remove zeropaded frames) x (input channel num x dim)\n        xs = F.swapaxes(xs, 1, 2)\n        xs = xs.reshape(xs.shape[0], xs.shape[1], xs.shape[2] * xs.shape[3])\n        xs = [xs[i, : ilens[i], :] for i in range(len(ilens))]\n\n        return xs, ilens\n\n\nclass Encoder(chainer.Chain):\n    \"\"\"Encoder network class.\n\n    Args:\n        etype (str): Type of encoder network.\n        idim (int): Number of dimensions of encoder network.\n        
elayers (int): Number of layers of encoder network.\n        eunits (int): Number of lstm units of encoder network.\n        eprojs (int): Number of projection units of encoder network.\n        subsample (np.array): Subsampling number. e.g. 1_2_2_2_1\n        dropout (float): Dropout rate.\n\n    \"\"\"\n\n    def __init__(\n        self, etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1\n    ):\n        super(Encoder, self).__init__()\n        typ = etype.lstrip(\"vgg\").rstrip(\"p\")\n        if typ not in [\"lstm\", \"gru\", \"blstm\", \"bgru\"]:\n            logging.error(\"Error: need to specify an appropriate encoder architecture\")\n        with self.init_scope():\n            if etype.startswith(\"vgg\"):\n                if etype[-1] == \"p\":\n                    self.enc = chainer.Sequential(\n                        VGG2L(in_channel),\n                        RNNP(\n                            get_vgg2l_odim(idim, in_channel=in_channel),\n                            elayers,\n                            eunits,\n                            eprojs,\n                            subsample,\n                            dropout,\n                            typ=typ,\n                        ),\n                    )\n                    logging.info(\"Use CNN-VGG + \" + typ.upper() + \"P for encoder\")\n                else:\n                    self.enc = chainer.Sequential(\n                        VGG2L(in_channel),\n                        RNN(\n                            get_vgg2l_odim(idim, in_channel=in_channel),\n                            elayers,\n                            eunits,\n                            eprojs,\n                            dropout,\n                            typ=typ,\n                        ),\n                    )\n                    logging.info(\"Use CNN-VGG + \" + typ.upper() + \" for encoder\")\n                self.conv_subsampling_factor = 4\n            else:\n                if 
etype[-1] == \"p\":\n                    self.enc = chainer.Sequential(\n                        RNNP(idim, elayers, eunits, eprojs, subsample, dropout, typ=typ)\n                    )\n                    logging.info(\n                        typ.upper() + \" with every-layer projection for encoder\"\n                    )\n                else:\n                    self.enc = chainer.Sequential(\n                        RNN(idim, elayers, eunits, eprojs, dropout, typ=typ)\n                    )\n                    logging.info(typ.upper() + \" without projection for encoder\")\n                self.conv_subsampling_factor = 1\n\n    def __call__(self, xs, ilens):\n        \"\"\"Encoder forward.\n\n        Args:\n            xs (chainer.Variable): Batch of padded character ids. (B, Tmax)\n            ilens (chainer.Variable): Batch of the length of each feature sequence. (B,)\n\n        Returns:\n            chainer.Variable: Output of the encoder.\n            chainer.Variable: (Subsampled) vector of ilens.\n\n        \"\"\"\n        xs, ilens = self.enc(xs, ilens)\n\n        return xs, ilens\n\n\ndef encoder_for(args, idim, subsample):\n    \"\"\"Return the Encoder module.\n\n    Args:\n        args (Namespace): The program arguments.\n        idim (int): Dimension of input array.\n        subsample (numpy.array): Subsample number. e.g. 1_2_2_2_1\n\n    Returns:\n        chainer.Chain: Encoder module.\n\n    \"\"\"\n    return Encoder(\n        args.etype,\n        idim,\n        args.elayers,\n        args.eunits,\n        args.eprojs,\n        subsample,\n        args.dropout_rate,\n    )\n"
  },
  {
    "path": "nets/chainer_backend/rnn/training.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\nimport collections\nimport logging\nimport math\nimport six\n\n# chainer related\nfrom chainer import cuda\nfrom chainer import training\nfrom chainer import Variable\n\nfrom chainer.training.updaters.multiprocess_parallel_updater import gather_grads\nfrom chainer.training.updaters.multiprocess_parallel_updater import gather_params\nfrom chainer.training.updaters.multiprocess_parallel_updater import scatter_grads\n\nimport numpy as np\n\n\n# copied from https://github.com/chainer/chainer/blob/master/chainer/optimizer.py\ndef sum_sqnorm(arr):\n    \"\"\"Calculate the norm of the array.\n\n    Args:\n        arr (numpy.ndarray)\n\n    Returns:\n        Float: Sum of the norm calculated from the given array.\n\n    \"\"\"\n    sq_sum = collections.defaultdict(float)\n    for x in arr:\n        with cuda.get_device_from_array(x) as dev:\n            if x is not None:\n                x = x.ravel()\n                s = x.dot(x)\n                sq_sum[int(dev)] += s\n    return sum([float(i) for i in six.itervalues(sq_sum)])\n\n\nclass CustomUpdater(training.StandardUpdater):\n    \"\"\"Custom updater for chainer.\n\n    Args:\n        train_iter (iterator | dict[str, iterator]): Dataset iterator for the\n            training dataset. It can also be a dictionary that maps strings to\n            iterators. If this is just an iterator, then the iterator is\n            registered by the name ``'main'``.\n        optimizer (optimizer | dict[str, optimizer]): Optimizer to update\n            parameters. It can also be a dictionary that maps strings to\n            optimizers. If this is just an optimizer, then the optimizer is\n            registered by the name ``'main'``.\n        converter (espnet.asr.chainer_backend.asr.CustomConverter): Converter\n            function to build input arrays. 
Each batch extracted by the main\n            iterator and the ``device`` option are passed to this function.\n            :func:`chainer.dataset.concat_examples` is used by default.\n        device (int or dict): The destination device info to send variables. In the\n            case of cpu or single gpu, `device=-1 or 0`, respectively.\n            In the case of multi-gpu, `device={\"main\":0, \"sub_1\": 1, ...}`.\n        accum_grad (int):The number of gradient accumulation. if set to 2, the network\n            parameters will be updated once in twice,\n            i.e. actual batchsize will be doubled.\n\n    \"\"\"\n\n    def __init__(self, train_iter, optimizer, converter, device, accum_grad=1):\n        super(CustomUpdater, self).__init__(\n            train_iter, optimizer, converter=converter, device=device\n        )\n        self.forward_count = 0\n        self.accum_grad = accum_grad\n        self.start = True\n        # To solve #1091, it is required to set the variable inside this class.\n        self.device = device\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Main update routine for Custom Updater.\"\"\"\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n\n        # Get batch and convert into variables\n        batch = train_iter.next()\n        x = self.converter(batch, self.device)\n        if self.start:\n            optimizer.target.cleargrads()\n            self.start = False\n\n        # Compute the loss at this time step and accumulate it\n        loss = optimizer.target(*x) / self.accum_grad\n        loss.backward()  # Backprop\n        loss.unchain_backward()  # Truncate the graph\n\n        # update parameters\n        self.forward_count += 1\n        if self.forward_count != self.accum_grad:\n            return\n        self.forward_count = 0\n        # compute the gradient norm to check if it is normal or not\n  
      grad_norm = np.sqrt(\n            sum_sqnorm([p.grad for p in optimizer.target.params(False)])\n        )\n        logging.info(\"grad norm={}\".format(grad_norm))\n        if math.isnan(grad_norm):\n            logging.warning(\"grad norm is nan. Do not update model.\")\n        else:\n            optimizer.update()\n        optimizer.target.cleargrads()  # Clear the parameter gradients\n\n    def update(self):\n        self.update_core()\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass CustomParallelUpdater(training.updaters.MultiprocessParallelUpdater):\n    \"\"\"Custom Parallel Updater for chainer.\n\n    Defines the main update routine.\n\n    Args:\n        train_iter (iterator | dict[str, iterator]): Dataset iterator for the\n            training dataset. It can also be a dictionary that maps strings to\n            iterators. If this is just an iterator, then the iterator is\n            registered by the name ``'main'``.\n        optimizer (optimizer | dict[str, optimizer]): Optimizer to update\n            parameters. It can also be a dictionary that maps strings to\n            optimizers. If this is just an optimizer, then the optimizer is\n            registered by the name ``'main'``.\n        converter (espnet.asr.chainer_backend.asr.CustomConverter): Converter\n            function to build input arrays. Each batch extracted by the main\n            iterator and the ``device`` option are passed to this function.\n            :func:`chainer.dataset.concat_examples` is used by default.\n        device (torch.device): Device to which the training data is sent.\n            Negative value\n            indicates the host memory (CPU).\n        accum_grad (int):The number of gradient accumulation. if set to 2,\n            the network parameters will be updated once in twice,\n            i.e. 
actual batchsize will be doubled.\n\n    \"\"\"\n\n    def __init__(self, train_iters, optimizer, converter, devices, accum_grad=1):\n        super(CustomParallelUpdater, self).__init__(\n            train_iters, optimizer, converter=converter, devices=devices\n        )\n        from cupy.cuda import nccl\n\n        self.accum_grad = accum_grad\n        self.forward_count = 0\n        self.nccl = nccl\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Main Update routine of the custom parallel updater.\"\"\"\n        self.setup_workers()\n\n        self._send_message((\"update\", None))\n        with cuda.Device(self._devices[0]):\n            # For reducing memory\n\n            optimizer = self.get_optimizer(\"main\")\n            batch = self.get_iterator(\"main\").next()\n            x = self.converter(batch, self._devices[0])\n\n            loss = self._master(*x) / self.accum_grad\n            loss.backward()\n            loss.unchain_backward()\n\n            # NCCL: reduce grads\n            null_stream = cuda.Stream.null\n            if self.comm is not None:\n                gg = gather_grads(self._master)\n                self.comm.reduce(\n                    gg.data.ptr,\n                    gg.data.ptr,\n                    gg.size,\n                    self.nccl.NCCL_FLOAT,\n                    self.nccl.NCCL_SUM,\n                    0,\n                    null_stream.ptr,\n                )\n                scatter_grads(self._master, gg)\n                del gg\n\n            # update parameters\n            self.forward_count += 1\n            if self.forward_count != self.accum_grad:\n                return\n            self.forward_count = 0\n            # check gradient value\n            grad_norm = np.sqrt(\n                sum_sqnorm([p.grad for p in optimizer.target.params(False)])\n            )\n            logging.info(\"grad norm={}\".format(grad_norm))\n\n         
   # update\n            if math.isnan(grad_norm):\n                logging.warning(\"grad norm is nan. Do not update model.\")\n            else:\n                optimizer.update()\n            self._master.cleargrads()\n\n            if self.comm is not None:\n                gp = gather_params(self._master)\n                self.comm.bcast(\n                    gp.data.ptr, gp.size, self.nccl.NCCL_FLOAT, 0, null_stream.ptr\n                )\n\n    def update(self):\n        self.update_core()\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass CustomConverter(object):\n    \"\"\"Custom Converter.\n\n    Args:\n        subsampling_factor (int): The subsampling factor.\n\n    \"\"\"\n\n    def __init__(self, subsampling_factor=1):\n        self.subsampling_factor = subsampling_factor\n\n    def __call__(self, batch, device):\n        \"\"\"Perform subsampling.\n\n        Args:\n            batch (list): Batch that will be subsampled.\n            device (int): Device ID; -1 indicates the CPU.\n\n        Returns:\n            list of chainer.Variable: Subsampled input feature sequences.\n            xp.array: Lengths of the subsampled input sequences.\n            list of chainer.Variable: Target label sequences.\n\n        \"\"\"\n        # set device\n        xp = cuda.cupy if device != -1 else np\n\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys = batch[0]\n\n        # perform subsampling\n        if self.subsampling_factor > 1:\n            xs = [x[:: self.subsampling_factor, :] for x in xs]\n\n        # get batch made of lengths of input sequences\n        ilens = [x.shape[0] for x in xs]\n\n        # convert to Variable\n        xs = [Variable(xp.array(x, dtype=xp.float32)) for x in xs]\n        ilens = xp.array(ilens, dtype=xp.int32)\n        ys = [Variable(xp.array(y, dtype=xp.int32)) for y in ys]\n\n        return xs, ilens, ys\n"
  },
  {
    "path": "nets/chainer_backend/transformer/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/chainer_backend/transformer/attention.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Attention.\"\"\"\n\nimport chainer\n\nimport chainer.functions as F\nimport chainer.links as L\n\nimport numpy as np\n\nMIN_VALUE = float(np.finfo(np.float32).min)\n\n\nclass MultiHeadAttention(chainer.Chain):\n    \"\"\"Multi Head Attention Layer.\n\n    Args:\n        n_units (int): Number of input units.\n        h (int): Number of attention heads.\n        dropout (float): Dropout rate.\n        initialW: Initializer to initialize the weight.\n        initial_bias: Initializer to initialize the bias.\n\n    :param int h: the number of heads\n    :param int n_units: the number of features\n    :param float dropout_rate: dropout rate\n\n    \"\"\"\n\n    def __init__(self, n_units, h=8, dropout=0.1, initialW=None, initial_bias=None):\n        \"\"\"Initialize MultiHeadAttention.\"\"\"\n        super(MultiHeadAttention, self).__init__()\n        assert n_units % h == 0\n        stvd = 1.0 / np.sqrt(n_units)\n        with self.init_scope():\n            self.linear_q = L.Linear(\n                n_units,\n                n_units,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.linear_k = L.Linear(\n                n_units,\n                n_units,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.linear_v = L.Linear(\n                n_units,\n                n_units,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.linear_out = L.Linear(\n                n_units,\n                n_units,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n        self.d_k = n_units // h\n        self.h = h\n        self.dropout = dropout\n        self.attn = None\n\n    def 
forward(self, e_var, s_var=None, mask=None, batch=1):\n        \"\"\"Core function of the Multi-head attention layer.\n\n        Args:\n            e_var (chainer.Variable): Variable of input array.\n            s_var (chainer.Variable): Variable of source array from encoder.\n            mask (chainer.Variable): Attention mask.\n            batch (int): Batch size.\n\n        Returns:\n            chainer.Variable: Output of multi-head attention layer.\n\n        \"\"\"\n        xp = self.xp\n        if s_var is None:\n            # (batch, time1/2, head, d_k)\n            Q = self.linear_q(e_var).reshape(batch, -1, self.h, self.d_k)\n            K = self.linear_k(e_var).reshape(batch, -1, self.h, self.d_k)\n            V = self.linear_v(e_var).reshape(batch, -1, self.h, self.d_k)\n        else:\n            Q = self.linear_q(e_var).reshape(batch, -1, self.h, self.d_k)\n            K = self.linear_k(s_var).reshape(batch, -1, self.h, self.d_k)\n            V = self.linear_v(s_var).reshape(batch, -1, self.h, self.d_k)\n        scores = F.matmul(F.swapaxes(Q, 1, 2), K.transpose(0, 2, 3, 1)) / np.sqrt(\n            self.d_k\n        )\n        if mask is not None:\n            mask = xp.stack([mask] * self.h, axis=1)\n            scores = F.where(mask, scores, xp.full(scores.shape, MIN_VALUE, \"f\"))\n        self.attn = F.softmax(scores, axis=-1)\n        p_attn = F.dropout(self.attn, self.dropout)\n        x = F.matmul(p_attn, F.swapaxes(V, 1, 2))\n        x = F.swapaxes(x, 1, 2).reshape(-1, self.h * self.d_k)\n        return self.linear_out(x)\n"
  },
  {
    "path": "nets/chainer_backend/transformer/ctc.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's CTC.\"\"\"\nimport logging\n\nimport chainer\nimport chainer.functions as F\nimport chainer.links as L\nimport numpy as np\n\n\n# TODO(nelson): Merge chainer_backend/transformer/ctc.py in chainer_backend/ctc.py\nclass CTC(chainer.Chain):\n    \"\"\"Chainer implementation of ctc layer.\n\n    Args:\n        odim (int): The output dimension.\n        eprojs (int | None): Dimension of input vectors from encoder.\n        dropout_rate (float): Dropout rate.\n\n    \"\"\"\n\n    def __init__(self, odim, eprojs, dropout_rate):\n        \"\"\"Initialize CTC.\"\"\"\n        super(CTC, self).__init__()\n        self.dropout_rate = dropout_rate\n        self.loss = None\n\n        with self.init_scope():\n            self.ctc_lo = L.Linear(eprojs, odim)\n\n    def __call__(self, hs, ys):\n        \"\"\"CTC forward.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n            ys (list of chainer.Variable | N-dimension array):\n                Input variable of decoder.\n\n        Returns:\n            chainer.Variable: A variable holding a scalar value of the CTC loss.\n\n        \"\"\"\n        self.loss = None\n        ilens = [x.shape[0] for x in hs]\n        olens = [x.shape[0] for x in ys]\n\n        # zero padding for hs\n        y_hat = self.ctc_lo(\n            F.dropout(F.pad_sequence(hs), ratio=self.dropout_rate), n_batch_axes=2\n        )\n        y_hat = F.separate(y_hat, axis=1)  # ilen list of batch x hdim\n\n        # zero padding for ys\n        y_true = F.pad_sequence(ys, padding=-1)  # batch x olen\n\n        # get length info\n        input_length = chainer.Variable(self.xp.array(ilens, dtype=np.int32))\n        label_length = chainer.Variable(self.xp.array(olens, dtype=np.int32))\n        logging.info(\n            self.__class__.__name__ + \" input lengths:  \" + str(input_length.data)\n        )\n      
  logging.info(\n            self.__class__.__name__ + \" output lengths: \" + str(label_length.data)\n        )\n\n        # get ctc loss\n        self.loss = F.connectionist_temporal_classification(\n            y_hat, y_true, 0, input_length, label_length\n        )\n        logging.info(\"ctc loss:\" + str(self.loss.data))\n\n        return self.loss\n\n    def log_softmax(self, hs):\n        \"\"\"Log_softmax of frame activations.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n\n        Returns:\n            chainer.Variable: A n-dimension float array.\n\n        \"\"\"\n        y_hat = self.ctc_lo(F.pad_sequence(hs), n_batch_axes=2)\n        return F.log_softmax(y_hat.reshape(-1, y_hat.shape[-1])).reshape(y_hat.shape)\n\n\nclass WarpCTC(chainer.Chain):\n    \"\"\"Chainer implementation of warp-ctc layer.\n\n    Args:\n        odim (int): The output dimension.\n        eproj (int | None): Dimension of input vector from encoder.\n        dropout_rate (float): Dropout rate.\n\n    \"\"\"\n\n    def __init__(self, odim, eprojs, dropout_rate):\n        \"\"\"Initialize WarpCTC.\"\"\"\n        super(WarpCTC, self).__init__()\n        # The main difference between the ctc for transformer and\n        # the rnn is because the target (ys) is already a list of\n        # arrays located in the cpu, while in rnn routine the target is\n        # a list of variables located in cpu/gpu. 
If the target of rnn becomes\n        # a list of cpu arrays then this file would no longer be required.\n        from chainer_ctc.warpctc import ctc as warp_ctc\n\n        self.ctc = warp_ctc\n        self.dropout_rate = dropout_rate\n        self.loss = None\n\n        with self.init_scope():\n            self.ctc_lo = L.Linear(eprojs, odim)\n\n    def forward(self, hs, ys):\n        \"\"\"Core function of the Warp-CTC layer.\n\n        Args:\n            hs (iterable of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n            ys (iterable of N-dimension array): Input variable of decoder.\n\n        Returns:\n            chainer.Variable: A variable holding a scalar value of the CTC loss.\n\n        \"\"\"\n        self.loss = None\n        ilens = [hs.shape[1]] * hs.shape[0]\n        olens = [x.shape[0] for x in ys]\n\n        # zero padding for hs\n        # output batch x frames x hdim > frames x batch x hdim\n        y_hat = self.ctc_lo(\n            F.dropout(hs, ratio=self.dropout_rate), n_batch_axes=2\n        ).transpose(1, 0, 2)\n\n        # get length info\n        logging.info(self.__class__.__name__ + \" input lengths:  \" + str(ilens))\n        logging.info(self.__class__.__name__ + \" output lengths: \" + str(olens))\n\n        # get ctc loss\n        self.loss = self.ctc(y_hat, ilens, ys)[0]\n        logging.info(\"ctc loss:\" + str(self.loss.data))\n        return self.loss\n\n    def log_softmax(self, hs):\n        \"\"\"Log_softmax of frame activations.\n\n        Args:\n            hs (list of chainer.Variable | N-dimension array):\n                Input variable from encoder.\n\n        Returns:\n            chainer.Variable: An n-dimension float array.\n\n        \"\"\"\n        y_hat = self.ctc_lo(F.pad_sequence(hs), n_batch_axes=2)\n        return F.log_softmax(y_hat.reshape(-1, y_hat.shape[-1])).reshape(y_hat.shape)\n\n    def argmax(self, hs_pad):\n        \"\"\"Argmax of frame activations.\n\n       
 :param chainer variable hs_pad: 3d tensor (B, Tmax, eprojs)\n        :return: argmax applied 2d tensor (B, Tmax)\n        :rtype: chainer.Variable.\n        \"\"\"\n        return F.argmax(self.ctc_lo(F.pad_sequence(hs_pad), n_batch_axes=2), axis=-1)\n"
  },
  {
    "path": "nets/chainer_backend/transformer/decoder.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Decoder.\"\"\"\n\nimport chainer\n\nimport chainer.functions as F\nimport chainer.links as L\n\nfrom espnet.nets.chainer_backend.transformer.decoder_layer import DecoderLayer\nfrom espnet.nets.chainer_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.chainer_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.chainer_backend.transformer.mask import make_history_mask\n\nimport numpy as np\n\n\nclass Decoder(chainer.Chain):\n    \"\"\"Decoder layer.\n\n    Args:\n        odim (int): The output dimension.\n        n_layers (int): Number of ecoder layers.\n        n_units (int): Number of attention units.\n        d_units (int): Dimension of input vector of decoder.\n        h (int): Number of attention heads.\n        dropout (float): Dropout rate.\n        initialW (Initializer): Initializer to initialize the weight.\n        initial_bias (Initializer): Initializer to initialize teh bias.\n\n    \"\"\"\n\n    def __init__(self, odim, args, initialW=None, initial_bias=None):\n        \"\"\"Initialize Decoder.\"\"\"\n        super(Decoder, self).__init__()\n        self.sos = odim - 1\n        self.eos = odim - 1\n        initialW = chainer.initializers.Uniform if initialW is None else initialW\n        initial_bias = (\n            chainer.initializers.Uniform if initial_bias is None else initial_bias\n        )\n        with self.init_scope():\n            self.output_norm = LayerNorm(args.adim)\n            self.pe = PositionalEncoding(args.adim, args.dropout_rate)\n            stvd = 1.0 / np.sqrt(args.adim)\n            self.output_layer = L.Linear(\n                args.adim,\n                odim,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.embed = L.EmbedID(\n                odim,\n                args.adim,\n                ignore_label=-1,\n                
initialW=chainer.initializers.Normal(scale=1.0),\n            )\n        for i in range(args.dlayers):\n            name = \"decoders.\" + str(i)\n            layer = DecoderLayer(\n                args.adim,\n                d_units=args.dunits,\n                h=args.aheads,\n                dropout=args.dropout_rate,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.add_link(name, layer)\n        self.n_layers = args.dlayers\n\n    def make_attention_mask(self, source_block, target_block):\n        \"\"\"Prepare the attention mask.\n\n        Args:\n            source_block (ndarray): Source block with dimensions: (B x S).\n            target_block (ndarray): Target block with dimensions: (B x T).\n        Returns:\n            ndarray: Mask with dimensions (B, S, T).\n\n        \"\"\"\n        mask = (target_block[:, None, :] >= 0) * (source_block[:, :, None] >= 0)\n        # (batch, source_length, target_length)\n        return mask\n\n    def forward(self, ys_pad, source, x_mask):\n        \"\"\"Forward decoder.\n\n        :param xp.array e: input token ids, int64 (batch, maxlen_out)\n        :param xp.array yy_mask: input token mask, uint8  (batch, maxlen_out)\n        :param xp.array source: encoded memory, float32  (batch, maxlen_in, feat)\n        :param xp.array xy_mask: encoded memory mask, uint8  (batch, maxlen_in)\n        :return e: decoded token score before softmax (batch, maxlen_out, token)\n        :rtype: chainer.Variable\n        \"\"\"\n        xp = self.xp\n        sos = np.array([self.sos], np.int32)\n        ys = [np.concatenate([sos, y], axis=0) for y in ys_pad]\n        e = F.pad_sequence(ys, padding=self.eos).data\n        e = xp.array(e)\n        # mask preparation\n        xy_mask = self.make_attention_mask(e, xp.array(x_mask))\n        yy_mask = self.make_attention_mask(e, e)\n        yy_mask *= make_history_mask(xp, e)\n\n        e = self.pe(self.embed(e))\n        
batch, length, dims = e.shape\n        e = e.reshape(-1, dims)\n        source = source.reshape(-1, dims)\n        for i in range(self.n_layers):\n            e = self[\"decoders.\" + str(i)](e, source, xy_mask, yy_mask, batch)\n        return self.output_layer(self.output_norm(e)).reshape(batch, length, -1)\n\n    def recognize(self, e, yy_mask, source):\n        \"\"\"Process recognition function.\"\"\"\n        e = self.forward(e, source, yy_mask)\n        return F.log_softmax(e, axis=-1)\n"
  },
  {
    "path": "nets/chainer_backend/transformer/decoder_layer.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Decoder Block.\"\"\"\n\nimport chainer\n\nimport chainer.functions as F\n\nfrom espnet.nets.chainer_backend.transformer.attention import MultiHeadAttention\nfrom espnet.nets.chainer_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.chainer_backend.transformer.positionwise_feed_forward import (\n    PositionwiseFeedForward,  # noqa: H301\n)\n\n\nclass DecoderLayer(chainer.Chain):\n    \"\"\"Single decoder layer module.\n\n    Args:\n        n_units (int): Number of input/output dimension of a FeedForward layer.\n        d_units (int): Number of units of hidden layer in a FeedForward layer.\n        h (int): Number of attention heads.\n        dropout (float): Dropout rate\n\n    \"\"\"\n\n    def __init__(\n        self, n_units, d_units=0, h=8, dropout=0.1, initialW=None, initial_bias=None\n    ):\n        \"\"\"Initialize DecoderLayer.\"\"\"\n        super(DecoderLayer, self).__init__()\n        with self.init_scope():\n            self.self_attn = MultiHeadAttention(\n                n_units,\n                h,\n                dropout=dropout,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.src_attn = MultiHeadAttention(\n                n_units,\n                h,\n                dropout=dropout,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.feed_forward = PositionwiseFeedForward(\n                n_units,\n                d_units=d_units,\n                dropout=dropout,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.norm1 = LayerNorm(n_units)\n            self.norm2 = LayerNorm(n_units)\n            self.norm3 = LayerNorm(n_units)\n        self.dropout = dropout\n\n    def forward(self, e, s, xy_mask, yy_mask, batch):\n        \"\"\"Compute Encoder layer.\n\n        
Args:\n            e (chainer.Variable): Batch of padded features. (B, Lmax)\n            s (chainer.Variable): Batch of padded character. (B, Tmax)\n\n        Returns:\n            chainer.Variable: Computed variable of decoder.\n\n        \"\"\"\n        n_e = self.norm1(e)\n        n_e = self.self_attn(n_e, mask=yy_mask, batch=batch)\n        e = e + F.dropout(n_e, self.dropout)\n\n        n_e = self.norm2(e)\n        n_e = self.src_attn(n_e, s_var=s, mask=xy_mask, batch=batch)\n        e = e + F.dropout(n_e, self.dropout)\n\n        n_e = self.norm3(e)\n        n_e = self.feed_forward(n_e)\n        e = e + F.dropout(n_e, self.dropout)\n        return e\n"
  },
  {
    "path": "nets/chainer_backend/transformer/embedding.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Positional Encoding.\"\"\"\n\nimport chainer\nimport chainer.functions as F\n\nimport numpy as np\n\n\nclass PositionalEncoding(chainer.Chain):\n    \"\"\"Positional encoding module.\n\n    :param int n_units: embedding dim\n    :param float dropout: dropout rate\n    :param int length: maximum input length\n\n    \"\"\"\n\n    def __init__(self, n_units, dropout=0.1, length=5000):\n        \"\"\"Initialize Positional Encoding.\"\"\"\n        # Implementation described in the paper\n        super(PositionalEncoding, self).__init__()\n        self.dropout = dropout\n        posi_block = np.arange(0, length, dtype=np.float32)[:, None]\n        unit_block = np.exp(\n            np.arange(0, n_units, 2, dtype=np.float32) * -(np.log(10000.0) / n_units)\n        )\n        self.pe = np.zeros((length, n_units), dtype=np.float32)\n        self.pe[:, ::2] = np.sin(posi_block * unit_block)\n        self.pe[:, 1::2] = np.cos(posi_block * unit_block)\n        self.scale = np.sqrt(n_units)\n\n    def forward(self, e):\n        \"\"\"Forward Positional Encoding.\"\"\"\n        length = e.shape[1]\n        e = e * self.scale + self.xp.array(self.pe[:length])\n        return F.dropout(e, self.dropout)\n"
  },
  {
    "path": "nets/chainer_backend/transformer/encoder.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Encoder.\"\"\"\n\nimport chainer\n\nfrom chainer import links as L\n\nfrom espnet.nets.chainer_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.chainer_backend.transformer.encoder_layer import EncoderLayer\nfrom espnet.nets.chainer_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.chainer_backend.transformer.mask import make_history_mask\nfrom espnet.nets.chainer_backend.transformer.subsampling import Conv2dSubsampling\nfrom espnet.nets.chainer_backend.transformer.subsampling import LinearSampling\n\nimport logging\nimport numpy as np\n\n\nclass Encoder(chainer.Chain):\n    \"\"\"Encoder.\n\n    Args:\n        input_type(str):\n            Sampling type. `input_type` must be `conv2d` or 'linear' currently.\n        idim (int): Dimension of inputs.\n        n_layers (int): Number of encoder layers.\n        n_units (int): Number of input/output dimension of a FeedForward layer.\n        d_units (int): Number of units of hidden layer in a FeedForward layer.\n        h (int): Number of attention heads.\n        dropout (float): Dropout rate\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        attention_dim=256,\n        attention_heads=4,\n        linear_units=2048,\n        num_blocks=6,\n        dropout_rate=0.1,\n        positional_dropout_rate=0.1,\n        attention_dropout_rate=0.0,\n        input_layer=\"conv2d\",\n        pos_enc_class=PositionalEncoding,\n        initialW=None,\n        initial_bias=None,\n    ):\n        \"\"\"Initialize Encoder.\n\n        Args:\n            idim (int): Input dimension.\n            args (Namespace): Training config.\n            initialW (int, optional):  Initializer to initialize the weight.\n            initial_bias (bool, optional): Initializer to initialize the bias.\n\n        \"\"\"\n        super(Encoder, self).__init__()\n        initialW = chainer.initializers.Uniform if 
initialW is None else initialW\n        initial_bias = (\n            chainer.initializers.Uniform if initial_bias is None else initial_bias\n        )\n        self.do_history_mask = False\n        with self.init_scope():\n            self.conv_subsampling_factor = 1\n            channels = 64  # Based in paper\n            if input_layer == \"conv2d\":\n                idim = int(np.ceil(np.ceil(idim / 2) / 2)) * channels\n                self.input_layer = Conv2dSubsampling(\n                    channels,\n                    idim,\n                    attention_dim,\n                    dropout=dropout_rate,\n                    initialW=initialW,\n                    initial_bias=initial_bias,\n                )\n                self.conv_subsampling_factor = 4\n            elif input_layer == \"linear\":\n                self.input_layer = LinearSampling(\n                    idim, attention_dim, initialW=initialW, initial_bias=initial_bias\n                )\n            elif input_layer == \"embed\":\n                self.input_layer = chainer.Sequential(\n                    L.EmbedID(idim, attention_dim, ignore_label=-1),\n                    pos_enc_class(attention_dim, positional_dropout_rate),\n                )\n                self.do_history_mask = True\n            else:\n                raise ValueError(\"unknown input_layer: \" + input_layer)\n            self.norm = LayerNorm(attention_dim)\n        for i in range(num_blocks):\n            name = \"encoders.\" + str(i)\n            layer = EncoderLayer(\n                attention_dim,\n                d_units=linear_units,\n                h=attention_heads,\n                dropout=attention_dropout_rate,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.add_link(name, layer)\n        self.n_layers = num_blocks\n\n    def forward(self, e, ilens):\n        \"\"\"Compute Encoder layer.\n\n        Args:\n            e 
(chainer.Variable): Batch of padded characters. (B, Tmax)\n            ilens (chainer.Variable): Batch of lengths of each input. (B,)\n\n        Returns:\n            chainer.Variable: Computed variable of encoder.\n            numpy.array: Mask.\n            chainer.Variable: Batch of lengths of each encoder output.\n\n        \"\"\"\n        if isinstance(self.input_layer, Conv2dSubsampling):\n            e, ilens = self.input_layer(e, ilens)\n        else:\n            e = self.input_layer(e)\n        batch, length, dims = e.shape\n        x_mask = np.ones([batch, length])\n        for j in range(batch):\n            x_mask[j, ilens[j] :] = -1\n        xx_mask = (x_mask[:, None, :] >= 0) * (x_mask[:, :, None] >= 0)\n        xx_mask = self.xp.array(xx_mask)\n        if self.do_history_mask:\n            history_mask = make_history_mask(self.xp, x_mask)\n            xx_mask *= history_mask\n        logging.debug(\"encoders size: \" + str(e.shape))\n        e = e.reshape(-1, dims)\n        for i in range(self.n_layers):\n            e = self[\"encoders.\" + str(i)](e, xx_mask, batch)\n        return self.norm(e).reshape(batch, length, -1), x_mask, ilens\n"
  },
  {
    "path": "nets/chainer_backend/transformer/encoder_layer.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Encoder Block.\"\"\"\n\nimport chainer\n\nimport chainer.functions as F\n\nfrom espnet.nets.chainer_backend.transformer.attention import MultiHeadAttention\nfrom espnet.nets.chainer_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.chainer_backend.transformer.positionwise_feed_forward import (\n    PositionwiseFeedForward,  # noqa: H301\n)\n\n\nclass EncoderLayer(chainer.Chain):\n    \"\"\"Single encoder layer module.\n\n    Args:\n        n_units (int): Number of input/output dimension of a FeedForward layer.\n        d_units (int): Number of units of hidden layer in a FeedForward layer.\n        h (int): Number of attention heads.\n        dropout (float): Dropout rate\n\n    \"\"\"\n\n    def __init__(\n        self, n_units, d_units=0, h=8, dropout=0.1, initialW=None, initial_bias=None\n    ):\n        \"\"\"Initialize EncoderLayer.\"\"\"\n        super(EncoderLayer, self).__init__()\n        with self.init_scope():\n            self.self_attn = MultiHeadAttention(\n                n_units,\n                h,\n                dropout=dropout,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.feed_forward = PositionwiseFeedForward(\n                n_units,\n                d_units=d_units,\n                dropout=dropout,\n                initialW=initialW,\n                initial_bias=initial_bias,\n            )\n            self.norm1 = LayerNorm(n_units)\n            self.norm2 = LayerNorm(n_units)\n        self.dropout = dropout\n        self.n_units = n_units\n\n    def forward(self, e, xx_mask, batch):\n        \"\"\"Forward Positional Encoding.\"\"\"\n        n_e = self.norm1(e)\n        n_e = self.self_attn(n_e, mask=xx_mask, batch=batch)\n        e = e + F.dropout(n_e, self.dropout)\n\n        n_e = self.norm2(e)\n        n_e = self.feed_forward(n_e)\n        e = e + F.dropout(n_e, 
self.dropout)\n        return e\n"
  },
  {
    "path": "nets/chainer_backend/transformer/label_smoothing_loss.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Label Smootion loss.\"\"\"\n\nimport logging\n\nimport chainer\n\nimport chainer.functions as F\n\n\nclass LabelSmoothingLoss(chainer.Chain):\n    \"\"\"Label Smoothing Loss.\n\n    Args:\n        smoothing (float): smoothing rate (0.0 means the conventional CE).\n        n_target_vocab (int): number of classes.\n        normalize_length (bool): normalize loss by sequence length if True.\n\n    \"\"\"\n\n    def __init__(self, smoothing, n_target_vocab, normalize_length=False, ignore_id=-1):\n        \"\"\"Initialize Loss.\"\"\"\n        super(LabelSmoothingLoss, self).__init__()\n        self.use_label_smoothing = False\n        if smoothing > 0.0:\n            logging.info(\"Use label smoothing\")\n            self.smoothing = smoothing\n            self.confidence = 1.0 - smoothing\n            self.use_label_smoothing = True\n            self.n_target_vocab = n_target_vocab\n        self.normalize_length = normalize_length\n        self.ignore_id = ignore_id\n        self.acc = None\n\n    def forward(self, ys_block, ys_pad):\n        \"\"\"Forward Loss.\n\n        Args:\n            ys_block (chainer.Variable): Predicted labels.\n            ys_pad (chainer.Variable): Target (true) labels.\n\n        Returns:\n            float: Training loss.\n\n        \"\"\"\n        # Output (all together at once for efficiency)\n        batch, length, dims = ys_block.shape\n        concat_logit_block = ys_block.reshape(-1, dims)\n\n        # Target reshape\n        concat_t_block = ys_pad.reshape((batch * length))\n        ignore_mask = concat_t_block >= 0\n        n_token = ignore_mask.sum()\n        normalizer = n_token if self.normalize_length else batch\n\n        if not self.use_label_smoothing:\n            loss = F.softmax_cross_entropy(concat_logit_block, concat_t_block)\n            loss = loss * n_token / normalizer\n        else:\n            log_prob = 
F.log_softmax(concat_logit_block)\n            broad_ignore_mask = self.xp.broadcast_to(\n                ignore_mask[:, None], concat_logit_block.shape\n            )\n            pre_loss = (\n                ignore_mask * log_prob[self.xp.arange(batch * length), concat_t_block]\n            )\n            loss = -F.sum(pre_loss) / normalizer\n            label_smoothing = broad_ignore_mask * -1.0 / self.n_target_vocab * log_prob\n            label_smoothing = F.sum(label_smoothing) / normalizer\n            loss = self.confidence * loss + self.smoothing * label_smoothing\n        return loss\n"
  },
  {
    "path": "nets/chainer_backend/transformer/layer_norm.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Label Smootion loss.\"\"\"\n\nimport chainer.links as L\n\n\nclass LayerNorm(L.LayerNormalization):\n    \"\"\"Redirect to L.LayerNormalization.\"\"\"\n\n    def __init__(self, dims, eps=1e-12):\n        \"\"\"Initialize LayerNorm.\"\"\"\n        super(LayerNorm, self).__init__(size=dims, eps=eps)\n\n    def __call__(self, e):\n        \"\"\"Forward LayerNorm.\"\"\"\n        return super(LayerNorm, self).__call__(e)\n"
  },
  {
    "path": "nets/chainer_backend/transformer/mask.py",
    "content": "\"\"\"Create mask for subsequent steps.\"\"\"\n\n\ndef make_history_mask(xp, block):\n    \"\"\"Prepare the history mask.\n\n    Args:\n        block (ndarray): Block with dimensions: (B x S).\n    Returns:\n        ndarray, np.ndarray: History mask with dimensions (B, S, S).\n\n    \"\"\"\n    batch, length = block.shape\n    arange = xp.arange(length)\n    history_mask = (arange[None] <= arange[:, None])[\n        None,\n    ]\n    history_mask = xp.broadcast_to(history_mask, (batch, length, length))\n    return history_mask\n"
  },
  {
    "path": "nets/chainer_backend/transformer/positionwise_feed_forward.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Positionwise Feedforward.\"\"\"\n\nimport chainer\n\nimport chainer.functions as F\nimport chainer.links as L\n\nimport numpy as np\n\n\nclass PositionwiseFeedForward(chainer.Chain):\n    \"\"\"Positionwise feed forward.\n\n    Args:\n        :param int idim: input dimenstion\n        :param int hidden_units: number of hidden units\n        :param float dropout_rate: dropout rate\n\n    \"\"\"\n\n    def __init__(\n        self, n_units, d_units=0, dropout=0.1, initialW=None, initial_bias=None\n    ):\n        \"\"\"Initialize PositionwiseFeedForward.\n\n        Args:\n            n_units (int): Input dimension.\n            d_units (int, optional): Output dimension of hidden layer.\n            dropout (float, optional): Dropout ratio.\n            initialW (int, optional):  Initializer to initialize the weight.\n            initial_bias (bool, optional): Initializer to initialize the bias.\n\n        \"\"\"\n        super(PositionwiseFeedForward, self).__init__()\n        n_inner_units = d_units if d_units > 0 else n_units * 4\n        with self.init_scope():\n            stvd = 1.0 / np.sqrt(n_units)\n            self.w_1 = L.Linear(\n                n_units,\n                n_inner_units,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            stvd = 1.0 / np.sqrt(n_inner_units)\n            self.w_2 = L.Linear(\n                n_inner_units,\n                n_units,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.act = F.relu\n        self.dropout = dropout\n\n    def __call__(self, e):\n        \"\"\"Initialize PositionwiseFeedForward.\n\n        Args:\n            e (chainer.Variable): Input variable.\n\n        Return:\n            chainer.Variable: Output variable.\n\n        \"\"\"\n        e = 
F.dropout(self.act(self.w_1(e)), self.dropout)\n        return self.w_2(e)\n"
  },
  {
    "path": "nets/chainer_backend/transformer/subsampling.py",
    "content": "# encoding: utf-8\n\"\"\"Class Declaration of Transformer's Input layers.\"\"\"\n\nimport chainer\n\nimport chainer.functions as F\nimport chainer.links as L\n\nfrom espnet.nets.chainer_backend.transformer.embedding import PositionalEncoding\n\nimport logging\nimport numpy as np\n\n\nclass Conv2dSubsampling(chainer.Chain):\n    \"\"\"Convolutional 2D subsampling (to 1/4 length).\n\n    :param int idim: input dim\n    :param int odim: output dim\n    :param flaot dropout_rate: dropout rate\n\n    \"\"\"\n\n    def __init__(\n        self, channels, idim, dims, dropout=0.1, initialW=None, initial_bias=None\n    ):\n        \"\"\"Initialize Conv2dSubsampling.\"\"\"\n        super(Conv2dSubsampling, self).__init__()\n        self.dropout = dropout\n        with self.init_scope():\n            # Standard deviation for Conv2D with 1 channel and kernel 3 x 3.\n            n = 1 * 3 * 3\n            stvd = 1.0 / np.sqrt(n)\n            self.conv1 = L.Convolution2D(\n                1,\n                channels,\n                3,\n                stride=2,\n                pad=1,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            n = channels * 3 * 3\n            stvd = 1.0 / np.sqrt(n)\n            self.conv2 = L.Convolution2D(\n                channels,\n                channels,\n                3,\n                stride=2,\n                pad=1,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            stvd = 1.0 / np.sqrt(dims)\n            self.out = L.Linear(\n                idim,\n                dims,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.pe = PositionalEncoding(dims, dropout)\n\n    def forward(self, xs, ilens):\n        \"\"\"Subsample x.\n\n        :param chainer.Variable x: input tensor\n    
    :return: subsampled x and subsampled input lengths\n\n        \"\"\"\n        xs = self.xp.array(xs[:, None])\n        xs = F.relu(self.conv1(xs))\n        xs = F.relu(self.conv2(xs))\n        batch, _, length, _ = xs.shape\n        xs = self.out(F.swapaxes(xs, 1, 2).reshape(batch * length, -1))\n        xs = self.pe(xs.reshape(batch, length, -1))\n        # change ilens accordingly (np.int was removed in NumPy 1.24; use int)\n        ilens = np.ceil(np.array(ilens, dtype=np.float32) / 2).astype(int)\n        ilens = np.ceil(np.array(ilens, dtype=np.float32) / 2).astype(int)\n        return xs, ilens\n\n\nclass LinearSampling(chainer.Chain):\n    \"\"\"Linear 1D subsampling.\n\n    :param int idim: input dim\n    :param int odim: output dim\n    :param float dropout_rate: dropout rate\n\n    \"\"\"\n\n    def __init__(self, idim, dims, dropout=0.1, initialW=None, initial_bias=None):\n        \"\"\"Initialize LinearSampling.\"\"\"\n        super(LinearSampling, self).__init__()\n        stvd = 1.0 / np.sqrt(dims)\n        self.dropout = dropout\n        with self.init_scope():\n            self.linear = L.Linear(\n                idim,\n                dims,\n                initialW=initialW(scale=stvd),\n                initial_bias=initial_bias(scale=stvd),\n            )\n            self.pe = PositionalEncoding(dims, dropout)\n\n    def forward(self, xs, ilens):\n        \"\"\"Subsample x.\n\n        :param chainer.Variable x: input tensor\n        :return: projected x and unchanged input lengths\n\n        \"\"\"\n        logging.info(xs.shape)\n        xs = self.linear(xs, n_batch_axes=2)\n        logging.info(xs.shape)\n        xs = self.pe(xs)\n        return xs, ilens\n"
  },
  {
    "path": "nets/chainer_backend/transformer/training.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\"\"\"Class Declaration of Transformer's Training Subprocess.\"\"\"\nimport collections\nimport logging\nimport math\nimport six\n\nfrom chainer import cuda\nfrom chainer import functions as F\nfrom chainer import training\nfrom chainer.training import extension\nfrom chainer.training.updaters.multiprocess_parallel_updater import gather_grads\nfrom chainer.training.updaters.multiprocess_parallel_updater import gather_params\nfrom chainer.training.updaters.multiprocess_parallel_updater import scatter_grads\nimport numpy as np\n\n\n# copied from https://github.com/chainer/chainer/blob/master/chainer/optimizer.py\ndef sum_sqnorm(arr):\n    \"\"\"Calculate the norm of the array.\n\n    Args:\n        arr (numpy.ndarray)\n\n    Returns:\n        Float: Sum of the norm calculated from the given array.\n\n    \"\"\"\n    sq_sum = collections.defaultdict(float)\n    for x in arr:\n        with cuda.get_device_from_array(x) as dev:\n            if x is not None:\n                x = x.ravel()\n                s = x.dot(x)\n                sq_sum[int(dev)] += s\n    return sum([float(i) for i in six.itervalues(sq_sum)])\n\n\nclass CustomUpdater(training.StandardUpdater):\n    \"\"\"Custom updater for chainer.\n\n    Args:\n        train_iter (iterator | dict[str, iterator]): Dataset iterator for the\n            training dataset. It can also be a dictionary that maps strings to\n            iterators. If this is just an iterator, then the iterator is\n            registered by the name ``'main'``.\n        optimizer (optimizer | dict[str, optimizer]): Optimizer to update\n            parameters. It can also be a dictionary that maps strings to\n            optimizers. 
If this is just an optimizer, then the optimizer is\n            registered by the name ``'main'``.\n        converter (espnet.asr.chainer_backend.asr.CustomConverter): Converter\n            function to build input arrays. Each batch extracted by the main\n            iterator and the ``device`` option are passed to this function.\n            :func:`chainer.dataset.concat_examples` is used by default.\n        device (int or dict): The destination device info to send variables. In the\n            case of cpu or single gpu, `device=-1 or 0`, respectively.\n            In the case of multi-gpu, `device={\"main\":0, \"sub_1\": 1, ...}`.\n        accum_grad (int):The number of gradient accumulation. if set to 2, the network\n            parameters will be updated once in twice,\n            i.e. actual batchsize will be doubled.\n\n    \"\"\"\n\n    def __init__(self, train_iter, optimizer, converter, device, accum_grad=1):\n        \"\"\"Initialize Custom Updater.\"\"\"\n        super(CustomUpdater, self).__init__(\n            train_iter, optimizer, converter=converter, device=device\n        )\n        self.accum_grad = accum_grad\n        self.forward_count = 0\n        self.start = True\n        self.device = device\n        logging.debug(\"using custom converter for transformer\")\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Process main update routine for Custom Updater.\"\"\"\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n\n        # Get batch and convert into variables\n        batch = train_iter.next()\n        x = self.converter(batch, self.device)\n        if self.start:\n            optimizer.target.cleargrads()\n            self.start = False\n\n        # Compute the loss at this time step and accumulate it\n        loss = optimizer.target(*x) / self.accum_grad\n        loss.backward()  # Backprop\n\n        
self.forward_count += 1\n        if self.forward_count != self.accum_grad:\n            return\n        self.forward_count = 0\n        # compute the gradient norm to check if it is normal or not\n        grad_norm = np.sqrt(\n            sum_sqnorm([p.grad for p in optimizer.target.params(False)])\n        )\n        logging.info(\"grad norm={}\".format(grad_norm))\n        if math.isnan(grad_norm):\n            logging.warning(\"grad norm is nan. Do not update model.\")\n        else:\n            optimizer.update()\n        optimizer.target.cleargrads()  # Clear the parameter gradients\n\n    def update(self):\n        \"\"\"Update step for Custom Updater.\"\"\"\n        self.update_core()\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass CustomParallelUpdater(training.updaters.MultiprocessParallelUpdater):\n    \"\"\"Custom Parallel Updater for chainer.\n\n    Defines the main update routine.\n\n    Args:\n        train_iter (iterator | dict[str, iterator]): Dataset iterator for the\n            training dataset. It can also be a dictionary that maps strings to\n            iterators. If this is just an iterator, then the iterator is\n            registered by the name ``'main'``.\n        optimizer (optimizer | dict[str, optimizer]): Optimizer to update\n            parameters. It can also be a dictionary that maps strings to\n            optimizers. If this is just an optimizer, then the optimizer is\n            registered by the name ``'main'``.\n        converter (espnet.asr.chainer_backend.asr.CustomConverter): Converter\n            function to build input arrays. Each batch extracted by the main\n            iterator and the ``device`` option are passed to this function.\n            :func:`chainer.dataset.concat_examples` is used by default.\n        device (torch.device): Device to which the training data is sent. 
Negative value\n            indicates the host memory (CPU).\n        accum_grad (int):The number of gradient accumulation. if set to 2, the network\n            parameters will be updated once in twice,\n            i.e. actual batchsize will be doubled.\n\n    \"\"\"\n\n    def __init__(self, train_iters, optimizer, converter, devices, accum_grad=1):\n        \"\"\"Initialize custom parallel updater.\"\"\"\n        from cupy.cuda import nccl\n\n        super(CustomParallelUpdater, self).__init__(\n            train_iters, optimizer, converter=converter, devices=devices\n        )\n        self.accum_grad = accum_grad\n        self.forward_count = 0\n        self.nccl = nccl\n        logging.debug(\"using custom parallel updater for transformer\")\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Process main update routine for Custom Parallel Updater.\"\"\"\n        self.setup_workers()\n\n        self._send_message((\"update\", None))\n        with cuda.Device(self._devices[0]):\n            # For reducing memory\n            optimizer = self.get_optimizer(\"main\")\n            batch = self.get_iterator(\"main\").next()\n            x = self.converter(batch, self._devices[0])\n\n            loss = self._master(*x) / self.accum_grad\n            loss.backward()\n\n            # NCCL: reduce grads\n            null_stream = cuda.Stream.null\n            if self.comm is not None:\n                gg = gather_grads(self._master)\n                self.comm.reduce(\n                    gg.data.ptr,\n                    gg.data.ptr,\n                    gg.size,\n                    self.nccl.NCCL_FLOAT,\n                    self.nccl.NCCL_SUM,\n                    0,\n                    null_stream.ptr,\n                )\n                scatter_grads(self._master, gg)\n                del gg\n\n            # update parameters\n            self.forward_count += 1\n            if 
self.forward_count != self.accum_grad:\n                return\n            self.forward_count = 0\n            # check gradient value\n            grad_norm = np.sqrt(\n                sum_sqnorm([p.grad for p in optimizer.target.params(False)])\n            )\n            logging.info(\"grad norm={}\".format(grad_norm))\n\n            # update\n            if math.isnan(grad_norm):\n                logging.warning(\"grad norm is nan. Do not update model.\")\n            else:\n                optimizer.update()\n            self._master.cleargrads()\n\n            if self.comm is not None:\n                gp = gather_params(self._master)\n                self.comm.bcast(\n                    gp.data.ptr, gp.size, self.nccl.NCCL_FLOAT, 0, null_stream.ptr\n                )\n\n    def update(self):\n        \"\"\"Update step for Custom Parallel Updater.\"\"\"\n        self.update_core()\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass VaswaniRule(extension.Extension):\n    \"\"\"Trainer extension to shift an optimizer attribute magically by Vaswani.\n\n    Args:\n        attr (str): Name of the attribute to shift.\n        rate (float): Rate of the exponential shift. This value is multiplied\n            to the attribute at each call.\n        init (float): Initial value of the attribute. If it is ``None``, the\n            extension extracts the attribute at the first call and uses it as\n            the initial value.\n        target (float): Target value of the attribute. If the attribute reaches\n            this value, the shift stops.\n        optimizer (~chainer.Optimizer): Target optimizer to adjust the\n            attribute. 
If it is ``None``, the main optimizer of the updater is\n            used.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        attr,\n        d,\n        warmup_steps=4000,\n        init=None,\n        target=None,\n        optimizer=None,\n        scale=1.0,\n    ):\n        \"\"\"Initialize Vaswani rule extension.\"\"\"\n        self._attr = attr\n        self._d_inv05 = d ** (-0.5) * scale\n        self._warmup_steps_inv15 = warmup_steps ** (-1.5)\n        self._init = init\n        self._target = target\n        self._optimizer = optimizer\n        self._t = 0\n        self._last_value = None\n\n    def initialize(self, trainer):\n        \"\"\"Initialize Optimizer values.\"\"\"\n        optimizer = self._get_optimizer(trainer)\n        # ensure that _init is set\n        if self._init is None:\n            self._init = self._d_inv05 * (1.0 * self._warmup_steps_inv15)\n        if self._last_value is not None:  # resuming from a snapshot\n            self._update_value(optimizer, self._last_value)\n        else:\n            self._update_value(optimizer, self._init)\n\n    def __call__(self, trainer):\n        \"\"\"Forward extension.\"\"\"\n        self._t += 1\n        optimizer = self._get_optimizer(trainer)\n        value = self._d_inv05 * min(\n            self._t ** (-0.5), self._t * self._warmup_steps_inv15\n        )\n        self._update_value(optimizer, value)\n\n    def serialize(self, serializer):\n        \"\"\"Serialize extension.\"\"\"\n        self._t = serializer(\"_t\", self._t)\n        self._last_value = serializer(\"_last_value\", self._last_value)\n\n    def _get_optimizer(self, trainer):\n        \"\"\"Obtain optimizer from trainer.\"\"\"\n        return self._optimizer or trainer.updater.get_optimizer(\"main\")\n\n    def _update_value(self, optimizer, value):\n        \"\"\"Update requested variable values.\"\"\"\n        setattr(optimizer, self._attr, value)\n        self._last_value = value\n\n\nclass CustomConverter(object):\n  
  \"\"\"Custom Converter.\n\n    Args:\n        subsampling_factor (int): The subsampling factor.\n\n    \"\"\"\n\n    def __init__(self):\n        \"\"\"Initialize subsampling.\"\"\"\n        pass\n\n    def __call__(self, batch, device):\n        \"\"\"Perform subsampling.\n\n        Args:\n            batch (list): Batch that will be sabsampled.\n            device (chainer.backend.Device): CPU or GPU device.\n\n        Returns:\n            chainer.Variable: xp.array that are padded and subsampled from batch.\n            xp.array: xp.array of the length of the mini-batches.\n            chainer.Variable: xp.array that are padded and subsampled from batch.\n\n        \"\"\"\n        # For transformer, data is processed in CPU.\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys = batch[0]\n        xs = F.pad_sequence(xs, padding=-1).data\n        # get batch of lengths of input sequences\n        ilens = np.array([x.shape[0] for x in xs], dtype=np.int32)\n        return xs, ilens, ys\n"
  },
  {
    "path": "nets/ctc_prefix_score.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2018 Mitsubishi Electric Research Labs (Takaaki Hori)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport torch\n\nimport numpy as np\nimport six\n\n\nclass CTCPrefixScoreTH(object):\n    \"\"\"Batch processing of CTCPrefixScore\n\n    which is based on Algorithm 2 in WATANABE et al.\n    \"HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,\"\n    but extended to efficiently compute the label probablities for multiple\n    hypotheses simultaneously\n    See also Seki et al. \"Vectorized Beam Search for CTC-Attention-Based\n    Speech Recognition,\" In INTERSPEECH (pp. 3825-3829), 2019.\n    \"\"\"\n\n    def __init__(self, x, xlens, blank, eos, margin=0):\n        \"\"\"Construct CTC prefix scorer\n\n        :param torch.Tensor x: input label posterior sequences (B, T, O)\n        :param torch.Tensor xlens: input lengths (B,)\n        :param int blank: blank label id\n        :param int eos: end-of-sequence id\n        :param int margin: margin parameter for windowing (0 means no windowing)\n        \"\"\"\n        # In the comment lines,\n        # we assume T: input_length, B: batch size, W: beam width, O: output dim.\n        self.logzero = -10000000000.0\n        self.blank = blank\n        self.eos = eos\n        self.batch = x.size(0)\n        self.input_length = x.size(1)\n        self.odim = x.size(2)\n        self.dtype = x.dtype\n        self.device = (\n            torch.device(\"cuda:%d\" % x.get_device())\n            if x.is_cuda\n            else torch.device(\"cpu\")\n        )\n        # Pad the rest of posteriors in the batch\n        # TODO(takaaki-hori): need a better way without for-loops\n        for i, l in enumerate(xlens):\n            if l < self.input_length:\n                x[i, l:, :] = self.logzero\n                x[i, l:, blank] = 0\n        # Reshape input x\n        xn = x.transpose(0, 1)  # (B, T, O) -> (T, B, O)\n        xb = xn[:, :, 
self.blank].unsqueeze(2).expand(-1, -1, self.odim)\n        self.x = torch.stack([xn, xb])  # (2, T, B, O)\n        self.end_frames = torch.as_tensor(xlens) - 1\n\n        # Setup CTC windowing\n        self.margin = margin\n        if margin > 0:\n            self.frame_ids = torch.arange(\n                self.input_length, dtype=self.dtype, device=self.device\n            )\n        # Base indices for index conversion\n        self.idx_bh = None\n        self.idx_b = torch.arange(self.batch, device=self.device)\n        self.idx_bo = (self.idx_b * self.odim).unsqueeze(1)\n\n    def __call__(self, y, state, scoring_ids=None, att_w=None):\n        \"\"\"Compute CTC prefix scores for next labels\n\n        :param list y: prefix label sequences\n        :param tuple state: previous CTC state\n        :param torch.Tensor pre_scores: scores for pre-selection of hypotheses (BW, O)\n        :param torch.Tensor att_w: attention weights to decide CTC window\n        :return new_state, ctc_local_scores (BW, O)\n        \"\"\"\n        output_length = len(y[0]) - 1  # ignore sos\n        last_ids = [yi[-1] for yi in y]  # last output label ids\n        n_bh = len(last_ids)  # batch * hyps\n        n_hyps = n_bh // self.batch  # assuming each utterance has the same # of hyps\n        self.scoring_num = scoring_ids.size(-1) if scoring_ids is not None else 0\n        # prepare state info\n        if state is None:\n            r_prev = torch.full(\n                (self.input_length, 2, self.batch, n_hyps),\n                self.logzero,\n                dtype=self.dtype,\n                device=self.device,\n            )\n            r_prev[:, 1] = torch.cumsum(self.x[0, :, :, self.blank], 0).unsqueeze(2)\n            r_prev = r_prev.view(-1, 2, n_bh)\n            s_prev = 0.0\n            f_min_prev = 0\n            f_max_prev = 1\n        else:\n            r_prev, s_prev, f_min_prev, f_max_prev = state\n\n        # select input dimensions for scoring\n        if 
self.scoring_num > 0:\n            scoring_idmap = torch.full(\n                (n_bh, self.odim), -1, dtype=torch.long, device=self.device\n            )\n            snum = self.scoring_num\n            if self.idx_bh is None or n_bh > len(self.idx_bh):\n                self.idx_bh = torch.arange(n_bh, device=self.device).view(-1, 1)\n            scoring_idmap[self.idx_bh[:n_bh], scoring_ids] = torch.arange(\n                snum, device=self.device\n            )\n            scoring_idx = (\n                scoring_ids + self.idx_bo.repeat(1, n_hyps).view(-1, 1)\n            ).view(-1)\n            x_ = torch.index_select(\n                self.x.view(2, -1, self.batch * self.odim), 2, scoring_idx\n            ).view(2, -1, n_bh, snum)\n        else:\n            scoring_ids = None\n            scoring_idmap = None\n            snum = self.odim\n            x_ = self.x.unsqueeze(3).repeat(1, 1, 1, n_hyps, 1).view(2, -1, n_bh, snum)\n\n        # new CTC forward probs are prepared as a (T x 2 x BW x S) tensor\n        # that corresponds to r_t^n(h) and r_t^b(h) in a batch.\n        r = torch.full(\n            (self.input_length, 2, n_bh, snum),\n            self.logzero,\n            dtype=self.dtype,\n            device=self.device,\n        )\n        if output_length == 0:\n            r[0, 0] = x_[0, 0]\n\n        r_sum = torch.logsumexp(r_prev, 1)\n        log_phi = r_sum.unsqueeze(2).repeat(1, 1, snum)\n        if scoring_ids is not None:\n            for idx in range(n_bh):\n                pos = scoring_idmap[idx, last_ids[idx]]\n                if pos >= 0:\n                    log_phi[:, idx, pos] = r_prev[:, 1, idx]\n        else:\n            for idx in range(n_bh):\n                log_phi[:, idx, last_ids[idx]] = r_prev[:, 1, idx]\n\n        # decide start and end frames based on attention weights\n        if att_w is not None and self.margin > 0:\n            f_arg = torch.matmul(att_w, self.frame_ids)\n            f_min = 
max(int(f_arg.min().cpu()), f_min_prev)\n            f_max = max(int(f_arg.max().cpu()), f_max_prev)\n            start = min(f_max_prev, max(f_min - self.margin, output_length, 1))\n            end = min(f_max + self.margin, self.input_length)\n        else:\n            f_min = f_max = 0\n            start = max(output_length, 1)\n            end = self.input_length\n\n        # compute forward probabilities log(r_t^n(h)) and log(r_t^b(h))\n        for t in range(start, end):\n            rp = r[t - 1]\n            rr = torch.stack([rp[0], log_phi[t - 1], rp[0], rp[1]]).view(\n                2, 2, n_bh, snum\n            )\n            r[t] = torch.logsumexp(rr, 1) + x_[:, t]\n\n        # compute log prefix probabilites log(psi)\n        log_phi_x = torch.cat((log_phi[0].unsqueeze(0), log_phi[:-1]), dim=0) + x_[0]\n        if scoring_ids is not None:\n            log_psi = torch.full(\n                (n_bh, self.odim), self.logzero, dtype=self.dtype, device=self.device\n            )\n            log_psi_ = torch.logsumexp(\n                torch.cat((log_phi_x[start:end], r[start - 1, 0].unsqueeze(0)), dim=0),\n                dim=0,\n            )\n            for si in range(n_bh):\n                log_psi[si, scoring_ids[si]] = log_psi_[si]\n        else:\n            log_psi = torch.logsumexp(\n                torch.cat((log_phi_x[start:end], r[start - 1, 0].unsqueeze(0)), dim=0),\n                dim=0,\n            )\n\n        for si in range(n_bh):\n            log_psi[si, self.eos] = r_sum[self.end_frames[si // n_hyps], si]\n\n        # exclude blank probs\n        log_psi[:, self.blank] = self.logzero\n\n        return (log_psi - s_prev), (r, log_psi, f_min, f_max, scoring_idmap)\n\n    def index_select_state(self, state, best_ids):\n        \"\"\"Select CTC states according to best ids\n\n        :param state    : CTC state\n        :param best_ids : index numbers selected by beam pruning (B, W)\n        :return selected_state\n        \"\"\"\n      
  r, s, f_min, f_max, scoring_idmap = state\n        # convert ids to BHO space\n        n_bh = len(s)\n        n_hyps = n_bh // self.batch\n        vidx = (best_ids + (self.idx_b * (n_hyps * self.odim)).view(-1, 1)).view(-1)\n        # select hypothesis scores\n        s_new = torch.index_select(s.view(-1), 0, vidx)\n        s_new = s_new.view(-1, 1).repeat(1, self.odim).view(n_bh, self.odim)\n        # convert ids to BHS space (S: scoring_num)\n        if scoring_idmap is not None:\n            snum = self.scoring_num\n            hyp_idx = (best_ids // self.odim + (self.idx_b * n_hyps).view(-1, 1)).view(\n                -1\n            )\n            label_ids = torch.fmod(best_ids, self.odim).view(-1)\n            score_idx = scoring_idmap[hyp_idx, label_ids]\n            score_idx[score_idx == -1] = 0\n            vidx = score_idx + hyp_idx * snum\n        else:\n            snum = self.odim\n        # select forward probabilities\n        r_new = torch.index_select(r.view(-1, 2, n_bh * snum), 2, vidx).view(\n            -1, 2, n_bh\n        )\n        return r_new, s_new, f_min, f_max\n\n    def extend_prob(self, x):\n        \"\"\"Extend CTC prob.\n\n        :param torch.Tensor x: input label posterior sequences (B, T, O)\n        \"\"\"\n\n        if self.x.shape[1] < x.shape[1]:  # self.x (2,T,B,O); x (B,T,O)\n            # Pad the rest of posteriors in the batch\n            # TODO(takaaki-hori): need a better way without for-loops\n            xlens = [x.size(1)]\n            for i, l in enumerate(xlens):\n                if l < self.input_length:\n                    x[i, l:, :] = self.logzero\n                    x[i, l:, self.blank] = 0\n            tmp_x = self.x\n            xn = x.transpose(0, 1)  # (B, T, O) -> (T, B, O)\n            xb = xn[:, :, self.blank].unsqueeze(2).expand(-1, -1, self.odim)\n            self.x = torch.stack([xn, xb])  # (2, T, B, O)\n            self.x[:, : tmp_x.shape[1], :, :] = tmp_x\n            self.input_length = 
x.size(1)\n            self.end_frames = torch.as_tensor(xlens) - 1\n\n    def extend_state(self, state):\n        \"\"\"Compute CTC prefix state.\n\n\n        :param state    : CTC state\n        :return ctc_state\n        \"\"\"\n\n        if state is None:\n            # nothing to do\n            return state\n        else:\n            r_prev, s_prev, f_min_prev, f_max_prev = state\n\n            r_prev_new = torch.full(\n                (self.input_length, 2),\n                self.logzero,\n                dtype=self.dtype,\n                device=self.device,\n            )\n            start = max(r_prev.shape[0], 1)\n            r_prev_new[0:start] = r_prev\n            for t in six.moves.range(start, self.input_length):\n                r_prev_new[t, 1] = r_prev_new[t - 1, 1] + self.x[0, t, :, self.blank]\n\n            return (r_prev_new, s_prev, f_min_prev, f_max_prev)\n\n\nclass CTCPrefixScore(object):\n    # by tyrion: the CTC prefix score is the probability of all hypotheses that start with\n    # that prefix: it is the accumulated probability of the given prefix U at any time t.\n    \"\"\"Compute CTC label sequence scores\n\n    which is based on Algorithm 2 in WATANABE et al.\n    \"HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,\"\n    but extended to efficiently compute the probabilities of multiple labels\n    simultaneously\n    \"\"\"\n\n    def __init__(self, x, blank, eos, xp):\n        self.xp = xp\n        self.logzero = -10000000000.0\n        self.blank = blank\n        self.eos = eos\n        self.input_length = len(x)\n        self.x = x\n\n    def initial_state(self):\n        \"\"\"Obtain an initial CTC state\n\n        :return: CTC state\n        \"\"\"\n        # initial CTC state is made of a frame x 2 tensor that corresponds to\n        # r_t^n(<sos>) and r_t^b(<sos>), where 0 and 1 of axis=1 represent\n        # superscripts n and b (non-blank and blank), respectively.\n        r = self.xp.full((self.input_length, 2), 
self.logzero, dtype=np.float32)\n        r[0, 1] = self.x[0, self.blank]\n        for i in six.moves.range(1, self.input_length):\n            r[i, 1] = r[i - 1, 1] + self.x[i, self.blank]\n        return r\n\n    def __call__(self, y, cs, r_prev):\n        \"\"\"Compute CTC prefix scores for next labels\n\n        :param y     : prefix label sequence\n        :param cs    : array of next labels\n        :param r_prev: previous CTC state\n        :return ctc_scores, ctc_states\n        \"\"\"\n        # initialize CTC states\n        output_length = len(y) - 1  # ignore sos\n        # new CTC states are prepared as a frame x (n or b) x n_labels tensor\n        # that corresponds to r_t^n(h) and r_t^b(h).\n        r = self.xp.ndarray((self.input_length, 2, len(cs)), dtype=np.float32)\n        xs = self.x[:, cs]\n        if output_length == 0:\n            r[0, 0] = xs[0]\n            r[0, 1] = self.logzero\n        else:\n            r[output_length - 1] = self.logzero\n\n        # prepare forward probabilities for the last label\n        r_sum = self.xp.logaddexp(\n            r_prev[:, 0], r_prev[:, 1]\n        )  # log(r_t^n(g) + r_t^b(g))\n        last = y[-1]\n        if output_length > 0 and last in cs:\n            log_phi = self.xp.ndarray((self.input_length, len(cs)), dtype=np.float32)\n            for i in six.moves.range(len(cs)):\n                log_phi[:, i] = r_sum if cs[i] != last else r_prev[:, 1]\n        else:\n            log_phi = r_sum\n\n        # compute forward probabilities log(r_t^n(h)), log(r_t^b(h)),\n        # and log prefix probabilities log(psi)\n        start = max(output_length, 1)\n        log_psi = r[start - 1, 0]\n        for t in six.moves.range(start, self.input_length):\n            r[t, 0] = self.xp.logaddexp(r[t - 1, 0], log_phi[t - 1]) + xs[t]\n            r[t, 1] = (\n                self.xp.logaddexp(r[t - 1, 0], r[t - 1, 1]) + self.x[t, self.blank]\n            )\n            log_psi = self.xp.logaddexp(log_psi, log_phi[t 
- 1] + xs[t])\n\n        # get P(...eos|X) that ends with the prefix itself\n        eos_pos = self.xp.where(cs == self.eos)[0]\n        if len(eos_pos) > 0:\n            log_psi[eos_pos] = r_sum[-1]  # log(r_T^n(g) + r_T^b(g))\n\n        # exclude blank probs\n        blank_pos = self.xp.where(cs == self.blank)[0]\n        if len(blank_pos) > 0:\n            log_psi[blank_pos] = self.logzero\n\n        # return the log prefix probability and CTC states, where the label axis\n        # of the CTC states is moved to the first axis to slice it easily\n        return log_psi, self.xp.rollaxis(r, 2)\n"
  },
  {
    "path": "nets/e2e_asr_common.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Common functions for ASR.\"\"\"\n\nimport json\nimport logging\nimport sys\n\nimport editdistance\nfrom itertools import groupby\nimport numpy as np\nimport six\n\n\ndef end_detect(ended_hyps, i, M=3, D_end=np.log(1 * np.exp(-10))):\n    \"\"\"End detection.\n\n    described in Eq. (50) of S. Watanabe et al\n    \"Hybrid CTC/Attention Architecture for End-to-End Speech Recognition\"\n\n    :param ended_hyps:\n    :param i:\n    :param M:\n    :param D_end:\n    :return:\n    \"\"\"\n    if len(ended_hyps) == 0:\n        return False\n    count = 0\n    best_hyp = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[0]\n    for m in six.moves.range(M):\n        # get ended_hyps with their length is i - m\n        hyp_length = i - m\n        hyps_same_length = [x for x in ended_hyps if len(x[\"yseq\"]) == hyp_length]\n        if len(hyps_same_length) > 0:\n            best_hyp_same_length = sorted(\n                hyps_same_length, key=lambda x: x[\"score\"], reverse=True\n            )[0]\n            if best_hyp_same_length[\"score\"] - best_hyp[\"score\"] < D_end:\n                count += 1\n\n    if count == M:\n        return True\n    else:\n        return False\n\n\n# TODO(takaaki-hori): add different smoothing methods\ndef label_smoothing_dist(odim, lsm_type, transcript=None, blank=0):\n    \"\"\"Obtain label distribution for loss smoothing.\n\n    :param odim:\n    :param lsm_type:\n    :param blank:\n    :param transcript:\n    :return:\n    \"\"\"\n    if transcript is not None:\n        with open(transcript, \"rb\") as f:\n            trans_json = json.load(f)[\"utts\"]\n\n    if lsm_type == \"unigram\":\n        assert transcript is not None, (\n            \"transcript is required for %s label smoothing\" % lsm_type\n        )\n        labelcount = 
np.zeros(odim)\n        for k, v in trans_json.items():\n            ids = np.array([int(n) for n in v[\"output\"][0][\"tokenid\"].split()])\n            # to avoid an error when there is no text in an uttrance\n            if len(ids) > 0:\n                labelcount[ids] += 1\n        labelcount[odim - 1] = len(transcript)  # count <eos>\n        labelcount[labelcount == 0] = 1  # flooring\n        labelcount[blank] = 0  # remove counts for blank\n        labeldist = labelcount.astype(np.float32) / np.sum(labelcount)\n    else:\n        logging.error(\"Error: unexpected label smoothing type: %s\" % lsm_type)\n        sys.exit()\n\n    return labeldist\n\n\ndef get_vgg2l_odim(idim, in_channel=3, out_channel=128):\n    \"\"\"Return the output size of the VGG frontend.\n\n    :param in_channel: input channel size\n    :param out_channel: output channel size\n    :return: output size\n    :rtype int\n    \"\"\"\n    idim = idim / in_channel\n    idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 1st max pooling\n    idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 2nd max pooling\n    return int(idim) * out_channel  # numer of channels\n\n\nclass ErrorCalculator(object):\n    \"\"\"Calculate CER and WER for E2E_ASR and CTC models during training.\n\n    :param y_hats: numpy array with predicted text\n    :param y_pads: numpy array with true (target) text\n    :param char_list:\n    :param sym_space:\n    :param sym_blank:\n    :return:\n    \"\"\"\n\n    def __init__(\n        self, char_list, sym_space, sym_blank, report_cer=False, report_wer=False\n    ):\n        \"\"\"Construct an ErrorCalculator object.\"\"\"\n        super(ErrorCalculator, self).__init__()\n\n        self.report_cer = report_cer\n        self.report_wer = report_wer\n\n        self.char_list = char_list\n        self.space = sym_space\n        self.blank = sym_blank\n        self.idx_blank = self.char_list.index(self.blank)\n        if self.space in self.char_list:\n            
self.idx_space = self.char_list.index(self.space)\n        else:\n            self.idx_space = None\n\n    def __call__(self, ys_hat, ys_pad, is_ctc=False):\n        \"\"\"Calculate sentence-level WER/CER score.\n\n        :param torch.Tensor ys_hat: prediction (batch, seqlen)\n        :param torch.Tensor ys_pad: reference (batch, seqlen)\n        :param bool is_ctc: calculate CER score for CTC\n        :return: sentence-level WER score\n        :rtype float\n        :return: sentence-level CER score\n        :rtype float\n        \"\"\"\n        cer, wer = None, None\n        if is_ctc:\n            return self.calculate_cer_ctc(ys_hat, ys_pad)\n        elif not self.report_cer and not self.report_wer:\n            return cer, wer\n\n        seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad)\n        if self.report_cer:\n            cer = self.calculate_cer(seqs_hat, seqs_true)\n\n        if self.report_wer:\n            wer = self.calculate_wer(seqs_hat, seqs_true)\n        return cer, wer\n\n    def calculate_cer_ctc(self, ys_hat, ys_pad):\n        \"\"\"Calculate sentence-level CER score for CTC.\n\n        :param torch.Tensor ys_hat: prediction (batch, seqlen)\n        :param torch.Tensor ys_pad: reference (batch, seqlen)\n        :return: average sentence-level CER score\n        :rtype float\n        \"\"\"\n        cers, char_ref_lens = [], []\n        for i, y in enumerate(ys_hat):\n            y_hat = [x[0] for x in groupby(y)]\n            y_true = ys_pad[i]\n            seq_hat, seq_true = [], []\n            for idx in y_hat:\n                idx = int(idx)\n                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:\n                    seq_hat.append(self.char_list[int(idx)])\n\n            for idx in y_true:\n                idx = int(idx)\n                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:\n                    seq_true.append(self.char_list[int(idx)])\n\n            hyp_chars = 
\"\".join(seq_hat)\n            ref_chars = \"\".join(seq_true)\n            if len(ref_chars) > 0:\n                cers.append(editdistance.eval(hyp_chars, ref_chars))\n                char_ref_lens.append(len(ref_chars))\n\n        cer_ctc = float(sum(cers)) / sum(char_ref_lens) if cers else None\n        return cer_ctc\n\n    def convert_to_char(self, ys_hat, ys_pad):\n        \"\"\"Convert index to character.\n\n        :param torch.Tensor seqs_hat: prediction (batch, seqlen)\n        :param torch.Tensor seqs_true: reference (batch, seqlen)\n        :return: token list of prediction\n        :rtype list\n        :return: token list of reference\n        :rtype list\n        \"\"\"\n        seqs_hat, seqs_true = [], []\n        for i, y_hat in enumerate(ys_hat):\n            y_true = ys_pad[i]\n            eos_true = np.where(y_true == -1)[0]\n            ymax = eos_true[0] if len(eos_true) > 0 else len(y_true)\n            # NOTE: padding index (-1) in y_true is used to pad y_hat\n            seq_hat = [self.char_list[int(idx)] for idx in y_hat[:ymax]]\n            seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]\n            seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n            seq_hat_text = seq_hat_text.replace(self.blank, \"\")\n            seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n            seqs_hat.append(seq_hat_text)\n            seqs_true.append(seq_true_text)\n        return seqs_hat, seqs_true\n\n    def calculate_cer(self, seqs_hat, seqs_true):\n        \"\"\"Calculate sentence-level CER score.\n\n        :param list seqs_hat: prediction\n        :param list seqs_true: reference\n        :return: average sentence-level CER score\n        :rtype float\n        \"\"\"\n        char_eds, char_ref_lens = [], []\n        for i, seq_hat_text in enumerate(seqs_hat):\n            seq_true_text = seqs_true[i]\n            hyp_chars = seq_hat_text.replace(\" \", \"\")\n            ref_chars 
= seq_true_text.replace(\" \", \"\")\n            char_eds.append(editdistance.eval(hyp_chars, ref_chars))\n            char_ref_lens.append(len(ref_chars))\n        return float(sum(char_eds)) / sum(char_ref_lens)\n\n    def calculate_wer(self, seqs_hat, seqs_true):\n        \"\"\"Calculate sentence-level WER score.\n\n        :param list seqs_hat: prediction\n        :param list seqs_true: reference\n        :return: average sentence-level WER score\n        :rtype float\n        \"\"\"\n        word_eds, word_ref_lens = [], []\n        for i, seq_hat_text in enumerate(seqs_hat):\n            seq_true_text = seqs_true[i]\n            hyp_words = seq_hat_text.split()\n            ref_words = seq_true_text.split()\n            word_eds.append(editdistance.eval(hyp_words, ref_words))\n            word_ref_lens.append(len(ref_words))\n        return float(sum(word_eds)) / sum(word_ref_lens)\n"
  },
  {
    "path": "nets/e2e_mt_common.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Common functions for ST and MT.\"\"\"\n\nimport nltk\nimport numpy as np\n\n\nclass ErrorCalculator(object):\n    \"\"\"Calculate BLEU for ST and MT models during training.\n\n    :param y_hats: numpy array with predicted text\n    :param y_pads: numpy array with true (target) text\n    :param char_list: vocabulary list\n    :param sym_space: space symbol\n    :param sym_pad: pad symbol\n    :param report_bleu: report BLUE score if True\n    \"\"\"\n\n    def __init__(self, char_list, sym_space, sym_pad, report_bleu=False):\n        \"\"\"Construct an ErrorCalculator object.\"\"\"\n        super(ErrorCalculator, self).__init__()\n        self.char_list = char_list\n        self.space = sym_space\n        self.pad = sym_pad\n        self.report_bleu = report_bleu\n        if self.space in self.char_list:\n            self.idx_space = self.char_list.index(self.space)\n        else:\n            self.idx_space = None\n\n    def __call__(self, ys_hat, ys_pad):\n        \"\"\"Calculate corpus-level BLEU score.\n\n        :param torch.Tensor ys_hat: prediction (batch, seqlen)\n        :param torch.Tensor ys_pad: reference (batch, seqlen)\n        :return: corpus-level BLEU score in a mini-batch\n        :rtype float\n        \"\"\"\n        bleu = None\n        if not self.report_bleu:\n            return bleu\n\n        bleu = self.calculate_corpus_bleu(ys_hat, ys_pad)\n        return bleu\n\n    def calculate_corpus_bleu(self, ys_hat, ys_pad):\n        \"\"\"Calculate corpus-level BLEU score in a mini-batch.\n\n        :param torch.Tensor seqs_hat: prediction (batch, seqlen)\n        :param torch.Tensor seqs_true: reference (batch, seqlen)\n        :return: corpus-level BLEU score\n        :rtype float\n        \"\"\"\n        seqs_hat, seqs_true = [], []\n        for i, y_hat in 
enumerate(ys_hat):\n            y_true = ys_pad[i]\n            eos_true = np.where(y_true == -1)[0]\n            ymax = eos_true[0] if len(eos_true) > 0 else len(y_true)\n            # NOTE: padding index (-1) in y_true is used to pad y_hat\n            # because y_hats is not padded with -1\n            seq_hat = [self.char_list[int(idx)] for idx in y_hat[:ymax]]\n            seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]\n            seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n            seq_hat_text = seq_hat_text.replace(self.pad, \"\")\n            seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n            seqs_hat.append(seq_hat_text)\n            seqs_true.append(seq_true_text)\n        bleu = nltk.bleu_score.corpus_bleu([[ref] for ref in seqs_true], seqs_hat)\n        return bleu * 100\n"
  },
  {
    "path": "nets/lm_interface.py",
    "content": "\"\"\"Language model interface.\"\"\"\n\nimport argparse\n\nfrom espnet.nets.scorer_interface import ScorerInterface\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass LMInterface(ScorerInterface):\n    \"\"\"LM Interface for ESPnet model implementation.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments to command line argument parser.\"\"\"\n        return parser\n\n    @classmethod\n    def build(cls, n_vocab: int, **kwargs):\n        \"\"\"Initialize this class with python-level args.\n\n        Args:\n            idim (int): The number of vocabulary.\n\n        Returns:\n            LMinterface: A new instance of LMInterface.\n\n        \"\"\"\n        # local import to avoid cyclic import in lm_train\n        from espnet.bin.lm_train import get_parser\n\n        def wrap(parser):\n            return get_parser(parser, required=False)\n\n        args = argparse.Namespace(**kwargs)\n        args = fill_missing_args(args, wrap)\n        args = fill_missing_args(args, cls.add_arguments)\n        return cls(n_vocab, args)\n\n    def forward(self, x, t):\n        \"\"\"Compute LM loss value from buffer sequences.\n\n        Args:\n            x (torch.Tensor): Input ids. (batch, len)\n            t (torch.Tensor): Target ids. 
(batch, len)\n\n        Returns:\n            tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Tuple of\n                loss to backward (scalar),\n                negative log-likelihood of t: -log p(t) (scalar) and\n                the number of elements in x (scalar)\n\n        Notes:\n            The last two return values are used\n            in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)\n\n        \"\"\"\n        raise NotImplementedError(\"forward method is not implemented\")\n\n\npredefined_lms = {\n    \"pytorch\": {\n        \"default\": \"espnet.nets.pytorch_backend.lm.default:DefaultRNNLM\",\n        \"seq_rnn\": \"espnet.nets.pytorch_backend.lm.seq_rnn:SequentialRNNLM\",\n        \"transformer\": \"espnet.nets.pytorch_backend.lm.transformer:TransformerLM\",\n    },\n    \"chainer\": {\"default\": \"espnet.lm.chainer_backend.lm:DefaultRNNLM\"},\n}\n\n\ndef dynamic_import_lm(module, backend):\n    \"\"\"Import LM class dynamically.\n\n    Args:\n        module (str): module_name:class_name or alias in `predefined_lms`\n        backend (str): NN backend. e.g., pytorch, chainer\n\n    Returns:\n        type: LM class\n\n    \"\"\"\n    model_class = dynamic_import(module, predefined_lms.get(backend, dict()))\n    assert issubclass(\n        model_class, LMInterface\n    ), f\"{module} does not implement LMInterface\"\n    return model_class\n"
  },
  {
    "path": "nets/mt_interface.py",
    "content": "\"\"\"MT Interface module.\"\"\"\nimport argparse\n\nfrom espnet.bin.asr_train import get_parser\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass MTInterface:\n    \"\"\"MT Interface for ESPnet model implementation.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments to parser.\"\"\"\n        return parser\n\n    @classmethod\n    def build(cls, idim: int, odim: int, **kwargs):\n        \"\"\"Initialize this class with python-level args.\n\n        Args:\n            idim (int): The number of an input feature dim.\n            odim (int): The number of output vocab.\n\n        Returns:\n            ASRinterface: A new instance of ASRInterface.\n\n        \"\"\"\n\n        def wrap(parser):\n            return get_parser(parser, required=False)\n\n        args = argparse.Namespace(**kwargs)\n        args = fill_missing_args(args, wrap)\n        args = fill_missing_args(args, cls.add_arguments)\n        return cls(idim, odim, args)\n\n    def forward(self, xs, ilens, ys):\n        \"\"\"Compute loss for training.\n\n        :param xs:\n            For pytorch, batch of padded source sequences torch.Tensor (B, Tmax, idim)\n            For chainer, list of source sequences chainer.Variable\n        :param ilens: batch of lengths of source sequences (B)\n            For pytorch, torch.Tensor\n            For chainer, list of int\n        :param ys:\n            For pytorch, batch of padded source sequences torch.Tensor (B, Lmax)\n            For chainer, list of source sequences chainer.Variable\n        :return: loss value\n        :rtype: torch.Tensor for pytorch, chainer.Variable for chainer\n        \"\"\"\n        raise NotImplementedError(\"forward method is not implemented\")\n\n    def translate(self, x, trans_args, char_list=None, rnnlm=None):\n        \"\"\"Translate x for evaluation.\n\n        :param ndarray x: input acouctic feature (B, T, D) or (T, D)\n        :param namespace 
trans_args: argument namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        raise NotImplementedError(\"translate method is not implemented\")\n\n    def translate_batch(self, x, trans_args, char_list=None, rnnlm=None):\n        \"\"\"Beam search implementation for batch.\n\n        :param torch.Tensor x: encoder hidden state sequences (B, Tmax, Henc)\n        :param namespace trans_args: argument namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        raise NotImplementedError(\"Batch decoding is not supported yet.\")\n\n    def calculate_all_attentions(self, xs, ilens, ys):\n        \"\"\"Calculate attention.\n\n        :param list xs: list of padded input sequences [(T1, idim), (T2, idim), ...]\n        :param ndarray ilens: batch of lengths of input sequences (B)\n        :param list ys: list of character id sequence tensor [(L1), (L2), (L3), ...]\n        :return: attention weights (B, Lmax, Tmax)\n        :rtype: float ndarray\n        \"\"\"\n        raise NotImplementedError(\"calculate_all_attentions method is not implemented\")\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Get attention plot class.\"\"\"\n        from espnet.asr.asr_utils import PlotAttentionReport\n\n        return PlotAttentionReport\n"
  },
  {
    "path": "nets/pytorch_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/conformer/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/conformer/argument.py",
    "content": "# Copyright 2020 Hirofumi Inaguma\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Conformer common arguments.\"\"\"\n\n\nfrom distutils.util import strtobool\nimport logging\n\n\ndef add_arguments_conformer_common(group):\n    \"\"\"Add Transformer common arguments.\"\"\"\n    group.add_argument(\n        \"--transformer-encoder-pos-enc-layer-type\",\n        type=str,\n        default=\"abs_pos\",\n        choices=[\"abs_pos\", \"scaled_abs_pos\", \"rel_pos\"],\n        help=\"Transformer encoder positional encoding layer type\",\n    )\n    group.add_argument(\n        \"--transformer-encoder-activation-type\",\n        type=str,\n        default=\"swish\",\n        choices=[\"relu\", \"hardtanh\", \"selu\", \"swish\"],\n        help=\"Transformer encoder activation function type\",\n    )\n    group.add_argument(\n        \"--macaron-style\",\n        default=False,\n        type=strtobool,\n        help=\"Whether to use macaron style for positionwise layer\",\n    )\n    # Attention\n    group.add_argument(\n        \"--zero-triu\",\n        default=False,\n        type=strtobool,\n        help=\"If true, zero the uppper triangular part of attention matrix.\",\n    )\n    # Relative positional encoding\n    group.add_argument(\n        \"--rel-pos-type\",\n        type=str,\n        default=\"legacy\",\n        choices=[\"legacy\", \"latest\"],\n        help=\"Whether to use the latest relative positional encoding or the legacy one.\"\n        \"The legacy relative positional encoding will be deprecated in the future.\"\n        \"More Details can be found in https://github.com/espnet/espnet/pull/2816.\",\n    )\n    # CNN module\n    group.add_argument(\n        \"--use-cnn-module\",\n        default=False,\n        type=strtobool,\n        help=\"Use convolution module or not\",\n    )\n    group.add_argument(\n        \"--cnn-module-kernel\",\n        default=31,\n        type=int,\n        help=\"Kernel size of 
convolution module.\",\n    )\n    return group\n\n\ndef verify_rel_pos_type(args):\n    \"\"\"Verify the relative positional encoding type for compatibility.\n\n    Args:\n        args (Namespace): original arguments\n    Returns:\n        args (Namespace): modified arguments\n    \"\"\"\n    rel_pos_type = getattr(args, \"rel_pos_type\", None)\n    if rel_pos_type is None or rel_pos_type == \"legacy\":\n        if args.transformer_encoder_pos_enc_layer_type == \"rel_pos\":\n            args.transformer_encoder_pos_enc_layer_type = \"legacy_rel_pos\"\n            logging.warning(\n                \"Using legacy_rel_pos and it will be deprecated in the future.\"\n            )\n        if args.transformer_encoder_selfattn_layer_type == \"rel_selfattn\":\n            args.transformer_encoder_selfattn_layer_type = \"legacy_rel_selfattn\"\n            logging.warning(\n                \"Using legacy_rel_selfattn and it will be deprecated in the future.\"\n            )\n\n    return args\n"
  },
  {
    "path": "nets/pytorch_backend/conformer/convolution.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Northwestern Polytechnical University (Pengcheng Guo)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"ConvolutionModule definition.\"\"\"\n\nfrom torch import nn\n\n\nclass ConvolutionModule(nn.Module):\n    \"\"\"ConvolutionModule in Conformer model.\n\n    Args:\n        channels (int): The number of channels of conv layers.\n        kernel_size (int): Kernerl size of conv layers.\n\n    \"\"\"\n\n    def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):\n        \"\"\"Construct an ConvolutionModule object.\"\"\"\n        super(ConvolutionModule, self).__init__()\n        # kernerl_size should be a odd number for 'SAME' padding\n        assert (kernel_size - 1) % 2 == 0\n\n        self.pointwise_conv1 = nn.Conv1d(\n            channels,\n            2 * channels,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n            bias=bias,\n        )\n        self.depthwise_conv = nn.Conv1d(\n            channels,\n            channels,\n            kernel_size,\n            stride=1,\n            padding=(kernel_size - 1) // 2,\n            groups=channels,\n            bias=bias,\n        )\n        # self.norm = nn.BatchNorm1d(channels)\n        # It would be harmful to use batch norm in DDP \n        # As it cannot be update globally\n        self.norm = nn.GroupNorm(2, channels)\n        self.pointwise_conv2 = nn.Conv1d(\n            channels,\n            channels,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n            bias=bias,\n        )\n        self.activation = activation\n\n    def forward(self, x):\n        \"\"\"Compute convolution module.\n\n        Args:\n            x (torch.Tensor): Input tensor (#batch, time, channels).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time, 
channels).\n\n        \"\"\"\n        # exchange the temporal dimension and the feature dimension\n        x = x.transpose(1, 2)\n\n        # GLU mechanism\n        x = self.pointwise_conv1(x)  # (batch, 2*channel, dim)\n        x = nn.functional.glu(x, dim=1)  # (batch, channel, dim)\n\n        # 1D Depthwise Conv\n        x = self.depthwise_conv(x)\n        x = self.activation(self.norm(x))\n\n        x = self.pointwise_conv2(x)\n\n        return x.transpose(1, 2)\n"
  },
  {
    "path": "nets/pytorch_backend/conformer/encoder.py",
    "content": "# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Northwestern Polytechnical University (Pengcheng Guo)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Encoder definition.\"\"\"\n\nimport logging\nimport torch\n\nfrom espnet.nets.pytorch_backend.conformer.convolution import ConvolutionModule\nfrom espnet.nets.pytorch_backend.conformer.encoder_layer import EncoderLayer\nfrom espnet.nets.pytorch_backend.nets_utils import get_activation\nfrom espnet.nets.pytorch_backend.transducer.vgg2l import VGG2L\nfrom espnet.nets.pytorch_backend.transformer.attention import (\n    MultiHeadedAttention,  # noqa: H301\n    RelPositionMultiHeadedAttention,  # noqa: H301\n    LegacyRelPositionMultiHeadedAttention,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.embedding import (\n    PositionalEncoding,  # noqa: H301\n    ScaledPositionalEncoding,  # noqa: H301\n    RelPositionalEncoding,  # noqa: H301\n    LegacyRelPositionalEncoding,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.pytorch_backend.transformer.multi_layer_conv import Conv1dLinear\nfrom espnet.nets.pytorch_backend.transformer.multi_layer_conv import MultiLayeredConv1d\nfrom espnet.nets.pytorch_backend.transformer.positionwise_feed_forward import (\n    PositionwiseFeedForward,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.repeat import repeat\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling\n\n\nclass Encoder(torch.nn.Module):\n    \"\"\"Conformer encoder module.\n\n    Args:\n        idim (int): Input dimension.\n        attention_dim (int): Dimention of attention.\n        attention_heads (int): The number of heads of multi head attention.\n        linear_units (int): The number of units of position-wise feed forward.\n        num_blocks (int): The number of decoder blocks.\n        dropout_rate (float): Dropout rate.\n        
positional_dropout_rate (float): Dropout rate after adding positional encoding.\n        attention_dropout_rate (float): Dropout rate in attention.\n        input_layer (Union[str, torch.nn.Module]): Input layer type.\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. x -> x + att(x)\n        positionwise_layer_type (str): \"linear\", \"conv1d\", or \"conv1d-linear\".\n        positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer.\n        macaron_style (bool): Whether to use macaron style for positionwise layer.\n        pos_enc_layer_type (str): Encoder positional encoding layer type.\n        selfattention_layer_type (str): Encoder attention layer type.\n        activation_type (str): Encoder activation function type.\n        use_cnn_module (bool): Whether to use convolution module.\n        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.\n        cnn_module_kernel (int): Kernel size of convolution module.\n        padding_idx (int): Padding idx for input_layer=embed.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        attention_dim=256,\n        attention_heads=4,\n        linear_units=2048,\n        num_blocks=6,\n        dropout_rate=0.1,\n        positional_dropout_rate=0.1,\n        attention_dropout_rate=0.0,\n        input_layer=\"conv2d\",\n        normalize_before=True,\n        concat_after=False,\n        positionwise_layer_type=\"linear\",\n        positionwise_conv_kernel_size=1,\n        macaron_style=False,\n        pos_enc_layer_type=\"abs_pos\",\n        selfattention_layer_type=\"selfattn\",\n        activation_type=\"swish\",\n        use_cnn_module=False,\n        
zero_triu=False,\n        cnn_module_kernel=31,\n        padding_idx=-1,\n    ):\n        \"\"\"Construct an Encoder object.\"\"\"\n        super(Encoder, self).__init__()\n\n        activation = get_activation(activation_type)\n        if pos_enc_layer_type == \"abs_pos\":\n            pos_enc_class = PositionalEncoding\n        elif pos_enc_layer_type == \"scaled_abs_pos\":\n            pos_enc_class = ScaledPositionalEncoding\n        elif pos_enc_layer_type == \"rel_pos\":\n            assert selfattention_layer_type == \"rel_selfattn\"\n            pos_enc_class = RelPositionalEncoding\n        elif pos_enc_layer_type == \"legacy_rel_pos\":\n            pos_enc_class = LegacyRelPositionalEncoding\n            assert selfattention_layer_type == \"legacy_rel_selfattn\"\n        else:\n            raise ValueError(\"unknown pos_enc_layer: \" + pos_enc_layer_type)\n\n        self.conv_subsampling_factor = 1\n        if input_layer == \"linear\":\n            self.embed = torch.nn.Sequential(\n                torch.nn.Linear(idim, attention_dim),\n                torch.nn.LayerNorm(attention_dim),\n                torch.nn.Dropout(dropout_rate),\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif input_layer == \"conv2d\":\n            self.embed = Conv2dSubsampling(\n                idim,\n                attention_dim,\n                dropout_rate,\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n            self.conv_subsampling_factor = 4\n        elif input_layer == \"vgg2l\":\n            self.embed = VGG2L(idim, attention_dim)\n            self.conv_subsampling_factor = 4\n        elif input_layer == \"embed\":\n            self.embed = torch.nn.Sequential(\n                torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif isinstance(input_layer, 
torch.nn.Module):\n            self.embed = torch.nn.Sequential(\n                input_layer,\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif input_layer is None:\n            self.embed = torch.nn.Sequential(\n                pos_enc_class(attention_dim, positional_dropout_rate)\n            )\n        else:\n            raise ValueError(\"unknown input_layer: \" + input_layer)\n        self.normalize_before = normalize_before\n\n        # self-attention module definition\n        if selfattention_layer_type == \"selfattn\":\n            logging.info(\"encoder self-attention layer type = self-attention\")\n            encoder_selfattn_layer = MultiHeadedAttention\n            encoder_selfattn_layer_args = (\n                attention_heads,\n                attention_dim,\n                attention_dropout_rate,\n            )\n        elif selfattention_layer_type == \"legacy_rel_selfattn\":\n            assert pos_enc_layer_type == \"legacy_rel_pos\"\n            encoder_selfattn_layer = LegacyRelPositionMultiHeadedAttention\n            encoder_selfattn_layer_args = (\n                attention_heads,\n                attention_dim,\n                attention_dropout_rate,\n            )\n        elif selfattention_layer_type == \"rel_selfattn\":\n            logging.info(\"encoder self-attention layer type = relative self-attention\")\n            assert pos_enc_layer_type == \"rel_pos\"\n            encoder_selfattn_layer = RelPositionMultiHeadedAttention\n            encoder_selfattn_layer_args = (\n                attention_heads,\n                attention_dim,\n                attention_dropout_rate,\n                zero_triu,\n            )\n        else:\n            raise ValueError(\"unknown encoder_attn_layer: \" + selfattention_layer_type)\n\n        # feed-forward module definition\n        if positionwise_layer_type == \"linear\":\n            positionwise_layer = PositionwiseFeedForward\n     
       positionwise_layer_args = (\n                attention_dim,\n                linear_units,\n                dropout_rate,\n                activation,\n            )\n        elif positionwise_layer_type == \"conv1d\":\n            positionwise_layer = MultiLayeredConv1d\n            positionwise_layer_args = (\n                attention_dim,\n                linear_units,\n                positionwise_conv_kernel_size,\n                dropout_rate,\n            )\n        elif positionwise_layer_type == \"conv1d-linear\":\n            positionwise_layer = Conv1dLinear\n            positionwise_layer_args = (\n                attention_dim,\n                linear_units,\n                positionwise_conv_kernel_size,\n                dropout_rate,\n            )\n        else:\n            raise NotImplementedError(\"Support only linear or conv1d.\")\n\n        # convolution module definition\n        convolution_layer = ConvolutionModule\n        convolution_layer_args = (attention_dim, cnn_module_kernel, activation)\n\n        self.encoders = repeat(\n            num_blocks,\n            lambda lnum: EncoderLayer(\n                attention_dim,\n                encoder_selfattn_layer(*encoder_selfattn_layer_args),\n                positionwise_layer(*positionwise_layer_args),\n                positionwise_layer(*positionwise_layer_args) if macaron_style else None,\n                convolution_layer(*convolution_layer_args) if use_cnn_module else None,\n                dropout_rate,\n                normalize_before,\n                concat_after,\n            ),\n        )\n        if self.normalize_before:\n            self.after_norm = LayerNorm(attention_dim)\n\n    def forward(self, xs, masks):\n        \"\"\"Encode input sequence.\n\n        Args:\n            xs (torch.Tensor): Input tensor (#batch, time, idim).\n            masks (torch.Tensor): Mask tensor (#batch, time).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, 
time, attention_dim).\n            torch.Tensor: Mask tensor (#batch, time).\n\n        \"\"\"\n        if isinstance(self.embed, (Conv2dSubsampling, VGG2L)):\n            xs, masks = self.embed(xs, masks)\n        else:\n            xs = self.embed(xs)\n\n        xs, masks = self.encoders(xs, masks)\n        if isinstance(xs, tuple):\n            xs = xs[0]\n\n        if self.normalize_before:\n            xs = self.after_norm(xs)\n        return xs, masks\n"
  },
  {
    "path": "nets/pytorch_backend/conformer/encoder_layer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Northwestern Polytechnical University (Pengcheng Guo)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Encoder self-attention layer definition.\"\"\"\n\nimport torch\n\nfrom torch import nn\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\nclass EncoderLayer(nn.Module):\n    \"\"\"Encoder layer module.\n\n    Args:\n        size (int): Input dimension.\n        self_attn (torch.nn.Module): Self-attention module instance.\n            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance\n            can be used as the argument.\n        feed_forward (torch.nn.Module): Feed-forward module instance.\n            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance\n            can be used as the argument.\n        feed_forward_macaron (torch.nn.Module): Additional feed-forward module instance.\n            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance\n            can be used as the argument.\n        conv_module (torch.nn.Module): Convolution module instance.\n            `ConvlutionModule` instance can be used as the argument.\n        dropout_rate (float): Dropout rate.\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n\n    \"\"\"\n\n    def __init__(\n        self,\n        size,\n        self_attn,\n        feed_forward,\n        feed_forward_macaron,\n        conv_module,\n        dropout_rate,\n        normalize_before=True,\n        concat_after=False,\n    ):\n        \"\"\"Construct an EncoderLayer object.\"\"\"\n        super(EncoderLayer, self).__init__()\n        self.self_attn = self_attn\n        self.feed_forward = feed_forward\n        self.feed_forward_macaron = feed_forward_macaron\n        self.conv_module = conv_module\n        self.norm_ff = LayerNorm(size)  # for the FNN module\n        self.norm_mha = LayerNorm(size)  # for the MHA module\n        if feed_forward_macaron is not None:\n            self.norm_ff_macaron = LayerNorm(size)\n            self.ff_scale = 0.5\n        else:\n            self.ff_scale = 1.0\n        if self.conv_module is not None:\n            self.norm_conv = LayerNorm(size)  # for the CNN module\n            self.norm_final = LayerNorm(size)  # for the final output of the block\n        self.dropout = nn.Dropout(dropout_rate)\n        self.size = size\n        self.normalize_before = normalize_before\n        self.concat_after = concat_after\n        if self.concat_after:\n            self.concat_linear = nn.Linear(size + size, size)\n\n    def forward(self, x_input, mask, cache=None):\n        \"\"\"Compute encoded features.\n\n        Args:\n            x_input (Union[Tuple, torch.Tensor]): Input tensor w/ or w/o pos emb.\n                - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].\n                - w/o pos emb: Tensor (#batch, time, size).\n            mask (torch.Tensor): Mask tensor for the input (#batch, time).\n            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time, size).\n            torch.Tensor: Mask tensor (#batch, time).\n\n        \"\"\"\n        if isinstance(x_input, 
tuple):\n            x, pos_emb = x_input[0], x_input[1]\n        else:\n            x, pos_emb = x_input, None\n\n        # whether to use macaron style\n        if self.feed_forward_macaron is not None:\n            residual = x\n            if self.normalize_before:\n                x = self.norm_ff_macaron(x)\n            x = residual + self.ff_scale * self.dropout(self.feed_forward_macaron(x))\n            if not self.normalize_before:\n                x = self.norm_ff_macaron(x)\n\n        # multi-headed self-attention module\n        residual = x\n        if self.normalize_before:\n            x = self.norm_mha(x)\n\n        if cache is None:\n            x_q = x\n        else:\n            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)\n            x_q = x[:, -1:, :]\n            residual = residual[:, -1:, :]\n            mask = None if mask is None else mask[:, -1:, :]\n\n        if pos_emb is not None:\n            x_att = self.self_attn(x_q, x, x, pos_emb, mask)\n        else:\n            x_att = self.self_attn(x_q, x, x, mask)\n\n        if self.concat_after:\n            x_concat = torch.cat((x, x_att), dim=-1)\n            x = residual + self.concat_linear(x_concat)\n        else:\n            x = residual + self.dropout(x_att)\n        if not self.normalize_before:\n            x = self.norm_mha(x)\n\n        # convolution module\n        if self.conv_module is not None:\n            residual = x\n            if self.normalize_before:\n                x = self.norm_conv(x)\n            x = residual + self.dropout(self.conv_module(x))\n            if not self.normalize_before:\n                x = self.norm_conv(x)\n\n        # feed forward module\n        residual = x\n        if self.normalize_before:\n            x = self.norm_ff(x)\n        x = residual + self.ff_scale * self.dropout(self.feed_forward(x))\n        if not self.normalize_before:\n            x = self.norm_ff(x)\n\n        if self.conv_module is not None:\n           
 x = self.norm_final(x)\n\n        if cache is not None:\n            x = torch.cat([cache, x], dim=1)\n\n        if pos_emb is not None:\n            return (x, pos_emb), mask\n\n        return x, mask\n"
  },
  {
    "path": "nets/pytorch_backend/conformer/swish.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Northwestern Polytechnical University (Pengcheng Guo)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Swish() activation function for Conformer.\"\"\"\n\nimport torch\n\n\nclass Swish(torch.nn.Module):\n    \"\"\"Construct an Swish object.\"\"\"\n\n    def forward(self, x):\n        \"\"\"Return Swich activation function.\"\"\"\n        return x * torch.sigmoid(x)\n"
  },
  {
    "path": "nets/pytorch_backend/ctc.py",
    "content": "from distutils.version import LooseVersion\nimport logging\n\nimport numpy as np\nimport six\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\n\n\nclass CTC(torch.nn.Module):\n    \"\"\"CTC module\n\n    :param int odim: dimension of outputs\n    :param int eprojs: number of encoder projection units\n    :param float dropout_rate: dropout rate (0.0 ~ 1.0)\n    :param str ctc_type: builtin or warpctc\n    :param bool reduce: reduce the CTC loss into a scalar\n    \"\"\"\n\n    def __init__(self, odim, eprojs, dropout_rate, ctc_type=\"warpctc\", reduce=True):\n        super().__init__()\n        self.dropout_rate = dropout_rate\n        self.loss = None\n        self.ctc_lo = torch.nn.Linear(eprojs, odim)\n        self.probs = None  # for visualization\n\n        # In case of Pytorch >= 1.7.0, CTC will be always builtin\n        self.ctc_type = (\n            ctc_type\n            if LooseVersion(torch.__version__) < LooseVersion(\"1.7.0\")\n            else \"builtin\"\n        )\n\n        # ctc_type = buitin not support Pytorch=1.0.1\n        if self.ctc_type == \"builtin\" and (\n            LooseVersion(torch.__version__) < LooseVersion(\"1.1.0\")\n        ):\n            self.ctc_type = \"cudnnctc\"\n\n        if ctc_type != self.ctc_type:\n            logging.warning(f\"CTC was set to {self.ctc_type} due to PyTorch version.\")\n\n        if self.ctc_type == \"builtin\":\n            reduction_type = \"sum\" if reduce else \"none\"\n            self.ctc_loss = torch.nn.CTCLoss(\n                reduction=reduction_type, zero_infinity=True\n            )\n        elif self.ctc_type == \"cudnnctc\":\n            reduction_type = \"sum\" if reduce else \"none\"\n            self.ctc_loss = torch.nn.CTCLoss(reduction=reduction_type)\n        elif self.ctc_type == \"warpctc\":\n            import warpctc_pytorch as warp_ctc\n\n            self.ctc_loss = warp_ctc.CTCLoss(size_average=True, 
reduce=reduce)\n        elif self.ctc_type == \"gtnctc\":\n            from espnet.nets.pytorch_backend.gtn_ctc import GTNCTCLossFunction\n\n            self.ctc_loss = GTNCTCLossFunction.apply\n        else:\n            raise ValueError(\n                'ctc_type must be \"builtin\" or \"warpctc\": {}'.format(self.ctc_type)\n            )\n\n        self.ignore_id = -1\n        self.reduce = reduce\n\n    def loss_fn(self, th_pred, th_target, th_ilen, th_olen):\n        if self.ctc_type in [\"builtin\", \"cudnnctc\"]:\n            th_pred = th_pred.log_softmax(2)\n            # Use the deterministic CuDNN implementation of CTC loss to avoid\n            #  [issue#17798](https://github.com/pytorch/pytorch/issues/17798)\n            with torch.backends.cudnn.flags(deterministic=True):\n                loss = self.ctc_loss(th_pred, th_target, th_ilen, th_olen)\n            # Batch-size average\n            loss = loss / th_pred.size(1)\n            return loss\n        elif self.ctc_type == \"warpctc\":\n            return self.ctc_loss(th_pred, th_target, th_ilen, th_olen)\n        elif self.ctc_type == \"gtnctc\":\n            targets = [t.tolist() for t in th_target]\n            log_probs = torch.nn.functional.log_softmax(th_pred, dim=2)\n            return self.ctc_loss(log_probs, targets, 0, \"none\")\n        else:\n            raise NotImplementedError\n\n    # Add the texts to be compatible with MMI loss\n    def forward(self, hs_pad, hlens, ys_pad, texts):\n        \"\"\"CTC forward\n\n        :param torch.Tensor hs_pad: batch of padded hidden state sequences (B, Tmax, D)\n        :param torch.Tensor hlens: batch of lengths of hidden state sequences (B)\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, Lmax)\n        :return: ctc loss value\n        :rtype: torch.Tensor\n        \"\"\"\n        # TODO(kan-bayashi): need to make more smart way\n        ys = [y[y != self.ignore_id] for y in ys_pad]  # parse 
padded ys\n\n        # zero padding for hs\n        ys_hat = self.ctc_lo(F.dropout(hs_pad, p=self.dropout_rate))\n        if self.ctc_type != \"gtnctc\":\n            ys_hat = ys_hat.transpose(0, 1)\n\n        if self.ctc_type == \"builtin\":\n            olens = to_device(ys_hat, torch.LongTensor([len(s) for s in ys]))\n            hlens = hlens.long()\n            ys_pad = torch.cat(ys)  # without this the code breaks for asr_mix\n            self.loss = self.loss_fn(ys_hat, ys_pad, hlens, olens)\n        else:\n            self.loss = None\n            hlens = torch.from_numpy(np.fromiter(hlens, dtype=np.int32))\n            olens = torch.from_numpy(\n                np.fromiter((x.size(0) for x in ys), dtype=np.int32)\n            )\n            # zero padding for ys\n            ys_true = torch.cat(ys).cpu().int()  # batch x olen\n            # get ctc loss\n            # expected shape of seqLength x batchSize x alphabet_size\n            dtype = ys_hat.dtype\n            if self.ctc_type == \"warpctc\" or dtype == torch.float16:\n                # warpctc only supports float32\n                # torch.ctc does not support float16 (#1751)\n                ys_hat = ys_hat.to(dtype=torch.float32)\n            if self.ctc_type == \"cudnnctc\":\n                # use GPU when using the cuDNN implementation\n                ys_true = to_device(hs_pad, ys_true)\n            if self.ctc_type == \"gtnctc\":\n                # keep as list for gtn\n                ys_true = ys\n            self.loss = to_device(\n                hs_pad, self.loss_fn(ys_hat, ys_true, hlens, olens)\n            ).to(dtype=dtype)\n\n        # get length info\n        logging.info(\n            self.__class__.__name__\n            + \" input lengths:  \"\n            + \"\".join(str(hlens).split(\"\\n\"))\n        )\n        logging.info(\n            self.__class__.__name__\n            + \" output lengths: \"\n            + \"\".join(str(olens).split(\"\\n\"))\n        )\n\n        if 
self.reduce:\n            # NOTE: sum() is needed to keep consistency\n            # since warpctc return as tensor w/ shape (1,)\n            # but builtin return as tensor w/o shape (scalar).\n            self.loss = self.loss.sum()\n            logging.info(\"ctc loss:\" + str(float(self.loss)))\n\n        return self.loss\n\n    def softmax(self, hs_pad):\n        \"\"\"softmax of frame activations\n\n        :param torch.Tensor hs_pad: 3d tensor (B, Tmax, eprojs)\n        :return: log softmax applied 3d tensor (B, Tmax, odim)\n        :rtype: torch.Tensor\n        \"\"\"\n        self.probs = F.softmax(self.ctc_lo(hs_pad), dim=2)\n        return self.probs\n\n    def log_softmax(self, hs_pad):\n        \"\"\"log_softmax of frame activations\n\n        :param torch.Tensor hs_pad: 3d tensor (B, Tmax, eprojs)\n        :return: log softmax applied 3d tensor (B, Tmax, odim)\n        :rtype: torch.Tensor\n        \"\"\"\n        return F.log_softmax(self.ctc_lo(hs_pad), dim=2)\n\n    def argmax(self, hs_pad):\n        \"\"\"argmax of frame activations\n\n        :param torch.Tensor hs_pad: 3d tensor (B, Tmax, eprojs)\n        :return: argmax applied 2d tensor (B, Tmax)\n        :rtype: torch.Tensor\n        \"\"\"\n        return torch.argmax(self.ctc_lo(hs_pad), dim=2)\n\n    def forced_align(self, h, y, blank_id=0):\n        \"\"\"forced alignment.\n\n        :param torch.Tensor h: hidden state sequence, 2d tensor (T, D)\n        :param torch.Tensor y: id sequence tensor 1d tensor (L)\n        :param int y: blank symbol index\n        :return: best alignment results\n        :rtype: list\n        \"\"\"\n\n        def interpolate_blank(label, blank_id=0):\n            \"\"\"Insert blank token between every two label token.\"\"\"\n            label = np.expand_dims(label, 1)\n            blanks = np.zeros((label.shape[0], 1), dtype=np.int64) + blank_id\n            label = np.concatenate([blanks, label], axis=1)\n            label = label.reshape(-1)\n            
label = np.append(label, label[0])\n            return label\n\n        lpz = self.log_softmax(h)\n        lpz = lpz.squeeze(0)\n\n        y_int = interpolate_blank(y, blank_id)\n\n        logdelta = np.zeros((lpz.size(0), len(y_int))) - 100000000000.0  # log of zero\n        state_path = (\n            np.zeros((lpz.size(0), len(y_int)), dtype=np.int16) - 1\n        )  # state path\n\n        logdelta[0, 0] = lpz[0][y_int[0]]\n        logdelta[0, 1] = lpz[0][y_int[1]]\n\n        for t in six.moves.range(1, lpz.size(0)):\n            for s in six.moves.range(len(y_int)):\n                if y_int[s] == blank_id or s < 2 or y_int[s] == y_int[s - 2]:\n                    candidates = np.array([logdelta[t - 1, s], logdelta[t - 1, s - 1]])\n                    prev_state = [s, s - 1]\n                else:\n                    candidates = np.array(\n                        [\n                            logdelta[t - 1, s],\n                            logdelta[t - 1, s - 1],\n                            logdelta[t - 1, s - 2],\n                        ]\n                    )\n                    prev_state = [s, s - 1, s - 2]\n                logdelta[t, s] = np.max(candidates) + lpz[t][y_int[s]]\n                state_path[t, s] = prev_state[np.argmax(candidates)]\n\n        state_seq = -1 * np.ones((lpz.size(0), 1), dtype=np.int16)\n\n        candidates = np.array(\n            [logdelta[-1, len(y_int) - 1], logdelta[-1, len(y_int) - 2]]\n        )\n        prev_state = [len(y_int) - 1, len(y_int) - 2]\n        state_seq[-1] = prev_state[np.argmax(candidates)]\n        for t in six.moves.range(lpz.size(0) - 2, -1, -1):\n            state_seq[t] = state_path[t + 1, state_seq[t + 1, 0]]\n\n        output_state_seq = []\n        for t in six.moves.range(0, lpz.size(0)):\n            output_state_seq.append(y_int[state_seq[t, 0]])\n\n        return output_state_seq\n\n\ndef ctc_for(args, odim, reduce=True):\n    \"\"\"Returns the CTC module for the given args and 
output dimension\n\n    :param Namespace args: the program args\n    :param int odim: the output dimension\n    :param bool reduce: return the CTC loss as a scalar\n    :return: the corresponding CTC module\n    \"\"\"\n    num_encs = getattr(args, \"num_encs\", 1)  # use getattr to keep compatibility\n    if num_encs == 1:\n        # compatible with single encoder asr mode\n        return CTC(\n            odim, args.eprojs, args.dropout_rate, ctc_type=args.ctc_type, reduce=reduce\n        )\n    elif num_encs > 1:\n        ctcs_list = torch.nn.ModuleList()\n        if args.share_ctc:\n            # use dropout_rate of the first encoder\n            ctc = CTC(\n                odim,\n                args.eprojs,\n                args.dropout_rate[0],\n                ctc_type=args.ctc_type,\n                reduce=reduce,\n            )\n            ctcs_list.append(ctc)\n        else:\n            for idx in range(num_encs):\n                ctc = CTC(\n                    odim,\n                    args.eprojs,\n                    args.dropout_rate[idx],\n                    ctc_type=args.ctc_type,\n                    reduce=reduce,\n                )\n                ctcs_list.append(ctc)\n        return ctcs_list\n    else:\n        raise ValueError(\n            \"Number of encoders needs to be at least one. {}\".format(num_encs)\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"RNN sequence-to-sequence speech recognition model (pytorch).\"\"\"\n\nimport argparse\nfrom itertools import groupby\nimport logging\nimport math\nimport os\n\nimport chainer\nfrom chainer import reporter\nimport editdistance\nimport numpy as np\nimport six\nimport torch\n\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.e2e_asr_common import label_smoothing_dist\nfrom espnet.nets.pytorch_backend.ctc import ctc_for\nfrom espnet.nets.pytorch_backend.frontends.frontend import frontend_for\nfrom espnet.nets.pytorch_backend.initialization import lecun_normal_init_parameters\nfrom espnet.nets.pytorch_backend.initialization import set_forget_bias_to_one\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.pytorch_backend.nets_utils import to_torch_tensor\nfrom espnet.nets.pytorch_backend.rnn.argument import (\n    add_arguments_rnn_encoder_common,  # noqa: H301\n    add_arguments_rnn_decoder_common,  # noqa: H301\n    add_arguments_rnn_attention_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.rnn.attentions import att_for\nfrom espnet.nets.pytorch_backend.rnn.decoders import decoder_for\nfrom espnet.nets.pytorch_backend.rnn.encoders import encoder_for\nfrom espnet.nets.scorers.ctc import CTCPrefixScorer\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\nCTC_LOSS_THRESHOLD = 10000\n\n\nclass Reporter(chainer.Chain):\n    \"\"\"A chainer reporter wrapper.\"\"\"\n\n    def report(self, loss_ctc, loss_att, loss_third, loss_mbr, acc, cer_ctc, cer, wer, mtl_loss):\n        \"\"\"Report at every step.\"\"\"\n        reporter.report({\"loss_ctc\": loss_ctc}, self)\n        reporter.report({\"loss_att\": loss_att}, self)\n        
reporter.report({\"loss_third\": loss_third}, self)\n        reporter.report({\"loss_mbr\": loss_mbr}, self)\n        reporter.report({\"acc\": acc}, self)\n        reporter.report({\"cer_ctc\": cer_ctc}, self)\n        reporter.report({\"cer\": cer}, self)\n        reporter.report({\"wer\": wer}, self)\n        logging.info(\"mtl loss:\" + str(mtl_loss))\n        reporter.report({\"loss\": mtl_loss}, self)\n\n\nclass E2E(ASRInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2E.encoder_add_arguments(parser)\n        E2E.attention_add_arguments(parser)\n        E2E.decoder_add_arguments(parser)\n        return parser\n\n    @staticmethod\n    def encoder_add_arguments(parser):\n        \"\"\"Add arguments for the encoder.\"\"\"\n        group = parser.add_argument_group(\"E2E encoder setting\")\n        group = add_arguments_rnn_encoder_common(group)\n        return parser\n\n    @staticmethod\n    def attention_add_arguments(parser):\n        \"\"\"Add arguments for the attention.\"\"\"\n        group = parser.add_argument_group(\"E2E attention setting\")\n        group = add_arguments_rnn_attention_common(group)\n        return parser\n\n    @staticmethod\n    def decoder_add_arguments(parser):\n        \"\"\"Add arguments for the decoder.\"\"\"\n        group = parser.add_argument_group(\"E2E decoder setting\")\n        group = add_arguments_rnn_decoder_common(group)\n        return parser\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        if isinstance(self.enc, torch.nn.ModuleList):\n            return self.enc[0].conv_subsampling_factor * int(np.prod(self.subsample))\n        else:\n            return self.enc.conv_subsampling_factor * 
int(np.prod(self.subsample))\n\n    def __init__(self, idim, odim, args):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        super(E2E, self).__init__()\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments for compatibility\n        args = fill_missing_args(args, self.add_arguments)\n\n        self.mtlalpha = args.mtlalpha\n        assert 0.0 <= self.mtlalpha <= 1.0, \"mtlalpha should be [0.0, 1.0]\"\n        self.etype = args.etype\n        self.verbose = args.verbose\n        # NOTE: for self.build method\n        args.char_list = getattr(args, \"char_list\", None)\n        self.char_list = args.char_list\n        self.outdir = args.outdir\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.reporter = Reporter()\n\n        # below means the last number becomes eos/sos ID\n        # note that sos/eos IDs are identical\n        self.sos = odim - 1\n        self.eos = odim - 1\n\n        # subsample info\n        self.subsample = get_subsample(args, mode=\"asr\", arch=\"rnn\")\n\n        # label smoothing info\n        if args.lsm_type and os.path.isfile(args.train_json):\n            logging.info(\"Use label smoothing with \" + args.lsm_type)\n            labeldist = label_smoothing_dist(\n                odim, args.lsm_type, transcript=args.train_json\n            )\n        else:\n            labeldist = None\n\n        # encoder\n        self.enc = encoder_for(args, idim, self.subsample)\n        # ctc\n        self.ctc = ctc_for(args, odim)\n        # attention\n        self.att = att_for(args)\n        # decoder\n        self.dec = decoder_for(args, odim, self.sos, self.eos, self.att, labeldist)\n\n        # weight initialization\n        self.init_like_chainer()\n\n        # options for beam search\n        if args.report_cer 
or args.report_wer:\n            recog_args = {\n                \"beam_size\": args.beam_size,\n                \"penalty\": args.penalty,\n                \"ctc_weight\": args.ctc_weight,\n                \"maxlenratio\": args.maxlenratio,\n                \"minlenratio\": args.minlenratio,\n                \"lm_weight\": args.lm_weight,\n                \"rnnlm\": args.rnnlm,\n                \"nbest\": args.nbest,\n                \"space\": args.sym_space,\n                \"blank\": args.sym_blank,\n            }\n\n            self.recog_args = argparse.Namespace(**recog_args)\n            self.report_cer = args.report_cer\n            self.report_wer = args.report_wer\n        else:\n            self.report_cer = False\n            self.report_wer = False\n        self.rnnlm = None\n\n        self.logzero = -10000000000.0\n        self.loss = None\n        self.acc = None\n\n    def init_like_chainer(self):\n        \"\"\"Initialize weight like chainer.\n\n        chainer basically uses LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0\n        pytorch basically uses W, b ~ Uniform(-fan_in**-0.5, fan_in**-0.5)\n        however, there are two exceptions as far as I know.\n        - EmbedID.W ~ Normal(0, 1)\n        - LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)\n        \"\"\"\n        lecun_normal_init_parameters(self)\n        # exceptions\n        # embed weight ~ Normal(0, 1)\n        self.dec.embed.weight.data.normal_(0, 1)\n        # forget-bias = 1.0\n        # https://discuss.pytorch.org/t/set-forget-gate-bias-of-lstm/1745\n        for i in six.moves.range(len(self.dec.decoder)):\n            set_forget_bias_to_one(self.dec.decoder[i].bias_ih)\n\n    def forward(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded 
token id sequence tensor (B, Lmax)\n        :return: loss value\n        :rtype: torch.Tensor\n        \"\"\"\n        # 0. Frontend\n        if self.frontend is not None:\n            hs_pad, hlens, mask = self.frontend(to_torch_tensor(xs_pad), ilens)\n            hs_pad, hlens = self.feature_transform(hs_pad, hlens)\n        else:\n            hs_pad, hlens = xs_pad, ilens\n\n        # 1. Encoder\n        hs_pad, hlens, _ = self.enc(hs_pad, hlens)\n\n        # 2. CTC loss\n        if self.mtlalpha == 0:\n            self.loss_ctc = None\n        else:\n            # CTC.forward takes a `texts` argument for MMI compatibility;\n            # the CTC loss itself does not use it, so pass None.\n            self.loss_ctc = self.ctc(hs_pad, hlens, ys_pad, None)\n\n        # 3. attention loss\n        if self.mtlalpha == 1:\n            self.loss_att, acc = None, None\n        else:\n            self.loss_att, acc, _ = self.dec(hs_pad, hlens, ys_pad)\n        self.acc = acc\n\n        # 4. compute cer without beam search\n        if self.mtlalpha == 0 or self.char_list is None:\n            cer_ctc = None\n        else:\n            cers = []\n\n            y_hats = self.ctc.argmax(hs_pad).data\n            for i, y in enumerate(y_hats):\n                y_hat = [x[0] for x in groupby(y)]\n                y_true = ys_pad[i]\n\n                seq_hat = [self.char_list[int(idx)] for idx in y_hat if int(idx) != -1]\n                seq_true = [\n                    self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                ]\n                seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n                seq_hat_text = seq_hat_text.replace(self.blank, \"\")\n                seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n\n                hyp_chars = seq_hat_text.replace(\" \", \"\")\n                ref_chars = seq_true_text.replace(\" \", \"\")\n                if len(ref_chars) > 0:\n                    cers.append(\n                        editdistance.eval(hyp_chars, ref_chars) / len(ref_chars)\n                    )\n\n            cer_ctc = sum(cers) / len(cers) 
if cers else None\n\n        # 5. compute cer/wer\n        if self.training or not (self.report_cer or self.report_wer):\n            cer, wer = 0.0, 0.0\n            # oracle_cer, oracle_wer = 0.0, 0.0\n        else:\n            if self.recog_args.ctc_weight > 0.0:\n                lpz = self.ctc.log_softmax(hs_pad).data\n            else:\n                lpz = None\n\n            word_eds, word_ref_lens, char_eds, char_ref_lens = [], [], [], []\n            nbest_hyps = self.dec.recognize_beam_batch(\n                hs_pad,\n                torch.tensor(hlens),\n                lpz,\n                self.recog_args,\n                self.char_list,\n                self.rnnlm,\n            )\n            # remove <sos> and <eos>\n            y_hats = [nbest_hyp[0][\"yseq\"][1:-1] for nbest_hyp in nbest_hyps]\n            for i, y_hat in enumerate(y_hats):\n                y_true = ys_pad[i]\n\n                seq_hat = [self.char_list[int(idx)] for idx in y_hat if int(idx) != -1]\n                seq_true = [\n                    self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                ]\n                seq_hat_text = \"\".join(seq_hat).replace(self.recog_args.space, \" \")\n                seq_hat_text = seq_hat_text.replace(self.recog_args.blank, \"\")\n                seq_true_text = \"\".join(seq_true).replace(self.recog_args.space, \" \")\n\n                hyp_words = seq_hat_text.split()\n                ref_words = seq_true_text.split()\n                word_eds.append(editdistance.eval(hyp_words, ref_words))\n                word_ref_lens.append(len(ref_words))\n                hyp_chars = seq_hat_text.replace(\" \", \"\")\n                ref_chars = seq_true_text.replace(\" \", \"\")\n                char_eds.append(editdistance.eval(hyp_chars, ref_chars))\n                char_ref_lens.append(len(ref_chars))\n\n            wer = (\n                0.0\n                if not self.report_wer\n                else 
float(sum(word_eds)) / sum(word_ref_lens)\n            )\n            cer = (\n                0.0\n                if not self.report_cer\n                else float(sum(char_eds)) / sum(char_ref_lens)\n            )\n\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = self.loss_att\n            loss_att_data = float(self.loss_att)\n            loss_ctc_data = None\n        elif alpha == 1:\n            self.loss = self.loss_ctc\n            loss_att_data = None\n            loss_ctc_data = float(self.loss_ctc)\n        else:\n            self.loss = alpha * self.loss_ctc + (1 - alpha) * self.loss_att\n            loss_att_data = float(self.loss_att)\n            loss_ctc_data = float(self.loss_ctc)\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_ctc_data,\n                loss_att_data,\n                None,  # loss_third: not computed by this model\n                None,  # loss_mbr: not computed by this model\n                acc,\n                cer_ctc,\n                cer,\n                wer,\n                loss_data,\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def scorers(self):\n        \"\"\"Scorers.\"\"\"\n        return dict(decoder=self.dec, ctc=CTCPrefixScorer(self.ctc, self.eos))\n\n    def encode(self, x):\n        \"\"\"Encode acoustic features.\n\n        :param ndarray x: input acoustic feature (T, D)\n        :return: encoder outputs\n        :rtype: torch.Tensor\n        \"\"\"\n        self.eval()\n        ilens = [x.shape[0]]\n\n        # subsample frame\n        x = x[:: self.subsample[0], :]\n        p = next(self.parameters())\n        h = torch.as_tensor(x, device=p.device, dtype=p.dtype)\n        # make a utt list (1) to use the same interface for encoder\n        hs = h.contiguous().unsqueeze(0)\n\n        # 0. 
Frontend\n        if self.frontend is not None:\n            enhanced, hlens, mask = self.frontend(hs, ilens)\n            hs, hlens = self.feature_transform(enhanced, hlens)\n        else:\n            hs, hlens = hs, ilens\n\n        # 1. encoder\n        hs, _, _ = self.enc(hs, hlens)\n        return hs.squeeze(0)\n\n    def recognize(self, x, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        :param ndarray x: input acoustic feature (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        hs = self.encode(x).unsqueeze(0)\n        # calculate log P(z_t|X) for CTC scores\n        if recog_args.ctc_weight > 0.0:\n            lpz = self.ctc.log_softmax(hs)[0]\n        else:\n            lpz = None\n\n        # 2. Decoder\n        # decode the first utterance\n        y = self.dec.recognize_beam(hs[0], lpz, recog_args, char_list, rnnlm)\n        return y\n\n    def recognize_batch(self, xs, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E batch beam search.\n\n        :param list xs: list of input acoustic feature arrays [(T_1, D), (T_2, D), ...]\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n        ilens = np.fromiter((xx.shape[0] for xx in xs), dtype=np.int64)\n\n        # subsample frame\n        xs = [xx[:: self.subsample[0], :] for xx in xs]\n        xs = [to_device(self, to_torch_tensor(xx).float()) for xx in xs]\n        xs_pad = pad_list(xs, 0.0)\n\n        # 0. 
Frontend\n        if self.frontend is not None:\n            enhanced, hlens, mask = self.frontend(xs_pad, ilens)\n            hs_pad, hlens = self.feature_transform(enhanced, hlens)\n        else:\n            hs_pad, hlens = xs_pad, ilens\n\n        # 1. Encoder\n        hs_pad, hlens, _ = self.enc(hs_pad, hlens)\n\n        # calculate log P(z_t|X) for CTC scores\n        if recog_args.ctc_weight > 0.0:\n            lpz = self.ctc.log_softmax(hs_pad)\n            normalize_score = False\n        else:\n            lpz = None\n            normalize_score = True\n\n        # 2. Decoder\n        hlens = torch.tensor(list(map(int, hlens)))  # make sure hlens is tensor\n        y = self.dec.recognize_beam_batch(\n            hs_pad,\n            hlens,\n            lpz,\n            recog_args,\n            char_list,\n            rnnlm,\n            normalize_score=normalize_score,\n        )\n\n        if prev:\n            self.train()\n        return y\n\n    def enhance(self, xs):\n        \"\"\"Forward only in the frontend stage.\n\n        :param ndarray xs: input acoustic feature (T, C, F)\n        :return: enhanced feature\n        :rtype: torch.Tensor\n        \"\"\"\n        if self.frontend is None:\n            raise RuntimeError(\"Frontend doesn't exist\")\n        prev = self.training\n        self.eval()\n        ilens = np.fromiter((xx.shape[0] for xx in xs), dtype=np.int64)\n\n        # subsample frame\n        xs = [xx[:: self.subsample[0], :] for xx in xs]\n        xs = [to_device(self, to_torch_tensor(xx).float()) for xx in xs]\n        xs_pad = pad_list(xs, 0.0)\n        enhanced, hlensm, mask = self.frontend(xs_pad, ilens)\n        if prev:\n            self.train()\n        return enhanced.cpu().numpy(), mask.cpu().numpy(), ilens\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param 
torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: attention weights with the following shape,\n            1) multi-head case => attention weights (B, H, Lmax, Tmax),\n            2) other case => attention weights (B, Lmax, Tmax).\n        :rtype: float ndarray\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            # 0. Frontend\n            if self.frontend is not None:\n                hs_pad, hlens, mask = self.frontend(to_torch_tensor(xs_pad), ilens)\n                hs_pad, hlens = self.feature_transform(hs_pad, hlens)\n            else:\n                hs_pad, hlens = xs_pad, ilens\n\n            # 1. Encoder\n            hpad, hlens, _ = self.enc(hs_pad, hlens)\n\n            # 2. Decoder\n            att_ws = self.dec.calculate_all_attentions(hpad, hlens, ys_pad)\n        self.train()\n        return att_ws\n\n    def calculate_all_ctc_probs(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E CTC probability calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: CTC probability (B, Tmax, vocab)\n        :rtype: float ndarray\n        \"\"\"\n        probs = None\n        if self.mtlalpha == 0:\n            return probs\n\n        self.eval()\n        with torch.no_grad():\n            # 0. Frontend\n            if self.frontend is not None:\n                hs_pad, hlens, mask = self.frontend(to_torch_tensor(xs_pad), ilens)\n                hs_pad, hlens = self.feature_transform(hs_pad, hlens)\n            else:\n                hs_pad, hlens = xs_pad, ilens\n\n            # 1. Encoder\n            hpad, hlens, _ = self.enc(hs_pad, hlens)\n\n            # 2. 
CTC probs\n            probs = self.ctc.softmax(hpad).cpu().numpy()\n        self.train()\n        return probs\n\n    def subsample_frames(self, x):\n        \"\"\"Subsample speech frames in the encoder.\"\"\"\n        # subsample frame\n        x = x[:: self.subsample[0], :]\n        ilen = [x.shape[0]]\n        h = to_device(self, torch.from_numpy(np.array(x, dtype=np.float32)))\n        # contiguous() returns a tensor rather than modifying in place\n        h = h.contiguous()\n        return h, ilen\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_conformer.py",
    "content": "# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Northwestern Polytechnical University (Pengcheng Guo)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"\nConformer speech recognition model (pytorch).\n\nIt is a fusion of `e2e_asr_transformer.py`\nRefer to: https://arxiv.org/abs/2005.08100\n\n\"\"\"\n\nfrom espnet.nets.pytorch_backend.conformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.e2e_asr_transformer import E2E as E2ETransformer\nfrom espnet.nets.pytorch_backend.conformer.argument import (\n    add_arguments_conformer_common,  # noqa: H301\n    verify_rel_pos_type,  # noqa: H301\n)\n\n\nclass E2E(E2ETransformer):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2ETransformer.add_arguments(parser)\n        E2E.add_conformer_arguments(parser)\n        return parser\n\n    @staticmethod\n    def add_conformer_arguments(parser):\n        \"\"\"Add arguments for conformer model.\"\"\"\n        group = parser.add_argument_group(\"conformer model specific setting\")\n        group = add_arguments_conformer_common(group)\n        return parser\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        super().__init__(idim, odim, args, ignore_id)\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n\n        # Check the relative positional encoding type\n        args = verify_rel_pos_type(args)\n\n        self.encoder = Encoder(\n            idim=idim,\n     
       attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=args.transformer_input_layer,\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            attention_dropout_rate=args.transformer_attn_dropout_rate,\n            pos_enc_layer_type=args.transformer_encoder_pos_enc_layer_type,\n            selfattention_layer_type=args.transformer_encoder_selfattn_layer_type,\n            activation_type=args.transformer_encoder_activation_type,\n            macaron_style=args.macaron_style,\n            use_cnn_module=args.use_cnn_module,\n            zero_triu=args.zero_triu,\n            cnn_module_kernel=args.cnn_module_kernel,\n        )\n        self.reset_parameters(args)\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_maskctc.py",
    "content": "# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Waseda University (Yosuke Higuchi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"\nMask CTC based non-autoregressive speech recognition model (pytorch).\n\nSee https://arxiv.org/abs/2005.08700 for the detail.\n\n\"\"\"\n\nfrom itertools import groupby\nimport logging\nimport math\n\nfrom distutils.util import strtobool\nimport numpy\nimport torch\n\nfrom espnet.nets.pytorch_backend.conformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.conformer.argument import (\n    add_arguments_conformer_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.e2e_asr import CTC_LOSS_THRESHOLD\nfrom espnet.nets.pytorch_backend.e2e_asr_transformer import E2E as E2ETransformer\nfrom espnet.nets.pytorch_backend.maskctc.add_mask_token import mask_uniform\nfrom espnet.nets.pytorch_backend.maskctc.mask import square_mask\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import th_accuracy\n\n\nclass E2E(E2ETransformer):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2ETransformer.add_arguments(parser)\n        E2E.add_maskctc_arguments(parser)\n\n        return parser\n\n    @staticmethod\n    def add_maskctc_arguments(parser):\n        \"\"\"Add arguments for maskctc model.\"\"\"\n        group = parser.add_argument_group(\"maskctc specific setting\")\n\n        group.add_argument(\n            \"--maskctc-use-conformer-encoder\",\n            default=False,\n            type=strtobool,\n        )\n        group = add_arguments_conformer_common(group)\n\n        return parser\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        
\"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        odim += 1  # for the mask token\n\n        super().__init__(idim, odim, args, ignore_id)\n        assert 0.0 <= self.mtlalpha < 1.0, \"mtlalpha should be [0.0, 1.0)\"\n\n        self.mask_token = odim - 1\n        self.sos = odim - 2\n        self.eos = odim - 2\n        self.odim = odim\n\n        if args.maskctc_use_conformer_encoder:\n            if args.transformer_attn_dropout_rate is None:\n                args.transformer_attn_dropout_rate = args.conformer_dropout_rate\n            self.encoder = Encoder(\n                idim=idim,\n                attention_dim=args.adim,\n                attention_heads=args.aheads,\n                linear_units=args.eunits,\n                num_blocks=args.elayers,\n                input_layer=args.transformer_input_layer,\n                dropout_rate=args.dropout_rate,\n                positional_dropout_rate=args.dropout_rate,\n                attention_dropout_rate=args.transformer_attn_dropout_rate,\n                pos_enc_layer_type=args.transformer_encoder_pos_enc_layer_type,\n                selfattention_layer_type=args.transformer_encoder_selfattn_layer_type,\n                activation_type=args.transformer_encoder_activation_type,\n                macaron_style=args.macaron_style,\n                use_cnn_module=args.use_cnn_module,\n                cnn_module_kernel=args.cnn_module_kernel,\n            )\n        self.reset_parameters(args)\n\n    def forward(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of source sequences (B)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :return: ctc loss value\n 
       :rtype: torch.Tensor\n        :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in attention decoder\n        :rtype: float\n        \"\"\"\n        # 1. forward encoder\n        xs_pad = xs_pad[:, : max(ilens)]  # for data parallel\n        src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2)\n        hs_pad, hs_mask = self.encoder(xs_pad, src_mask)\n        self.hs_pad = hs_pad\n\n        # 2. forward decoder\n        ys_in_pad, ys_out_pad = mask_uniform(\n            ys_pad, self.mask_token, self.eos, self.ignore_id\n        )\n        ys_mask = square_mask(ys_in_pad, self.eos)\n        pred_pad, pred_mask = self.decoder(ys_in_pad, ys_mask, hs_pad, hs_mask)\n        self.pred_pad = pred_pad\n\n        # 3. compute attention loss\n        loss_att = self.criterion(pred_pad, ys_out_pad)\n        self.acc = th_accuracy(\n            pred_pad.view(-1, self.odim), ys_out_pad, ignore_label=self.ignore_id\n        )\n\n        # 4. compute ctc loss\n        loss_ctc, cer_ctc = None, None\n        if self.mtlalpha > 0:\n            batch_size = xs_pad.size(0)\n            hs_len = hs_mask.view(batch_size, -1).sum(1)\n            loss_ctc = self.ctc(hs_pad.view(batch_size, -1, self.adim), hs_len, ys_pad)\n            if self.error_calculator is not None:\n                ys_hat = self.ctc.argmax(hs_pad.view(batch_size, -1, self.adim)).data\n                cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)\n            # for visualization\n            if not self.training:\n                self.ctc.softmax(hs_pad)\n\n        # 5. 
compute cer/wer\n        if self.training or self.error_calculator is None or self.decoder is None:\n            cer, wer = None, None\n        else:\n            ys_hat = pred_pad.argmax(dim=-1)\n            cer, wer = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())\n\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = None\n        else:\n            self.loss = alpha * loss_ctc + (1 - alpha) * loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = float(loss_ctc)\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_ctc_data, loss_att_data, self.acc, cer_ctc, cer, wer, loss_data\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def recognize(self, x, recog_args, char_list=None, rnnlm=None):\n        \"\"\"Recognize input speech.\n\n        :param ndarray x: input acoustic feature (B, T, D) or (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: decoding result\n        :rtype: list\n        \"\"\"\n\n        def num2str(char_list, mask_token, mask_char=\"_\"):\n            def f(yl):\n                cl = [char_list[y] if y != mask_token else mask_char for y in yl]\n                return \"\".join(cl).replace(\"<space>\", \" \")\n\n            return f\n\n        n2s = num2str(char_list, self.mask_token)\n\n        self.eval()\n        h = self.encode(x).unsqueeze(0)\n\n        # greedy ctc outputs\n        ctc_probs, ctc_ids = torch.exp(self.ctc.log_softmax(h)).max(dim=-1)\n        y_hat = torch.stack([x[0] for x in groupby(ctc_ids[0])])\n        y_idx = 
torch.nonzero(y_hat != 0).squeeze(-1)\n\n        # calculate token-level ctc probabilities by taking\n        # the maximum probability of consecutive frames with\n        # the same ctc symbols\n        probs_hat = []\n        cnt = 0\n        for i, y in enumerate(y_hat.tolist()):\n            probs_hat.append(-1)\n            while cnt < ctc_ids.shape[1] and y == ctc_ids[0][cnt]:\n                if probs_hat[i] < ctc_probs[0][cnt]:\n                    probs_hat[i] = ctc_probs[0][cnt].item()\n                cnt += 1\n        probs_hat = torch.from_numpy(numpy.array(probs_hat))\n\n        # mask ctc outputs based on ctc probabilities\n        p_thres = recog_args.maskctc_probability_threshold\n        mask_idx = torch.nonzero(probs_hat[y_idx] < p_thres).squeeze(-1)\n        confident_idx = torch.nonzero(probs_hat[y_idx] >= p_thres).squeeze(-1)\n        mask_num = len(mask_idx)\n\n        y_in = torch.zeros(1, len(y_idx), dtype=torch.long) + self.mask_token\n        y_in[0][confident_idx] = y_hat[y_idx][confident_idx]\n\n        logging.info(\"ctc:{}\".format(n2s(y_in[0].tolist())))\n\n        # iterative decoding\n        if not mask_num == 0:\n            K = recog_args.maskctc_n_iterations\n            num_iter = K if mask_num >= K and K > 0 else mask_num\n\n            for t in range(num_iter - 1):\n                pred, _ = self.decoder(y_in, None, h, None)\n                pred_score, pred_id = pred[0][mask_idx].max(dim=-1)\n                cand = torch.topk(pred_score, mask_num // num_iter, -1)[1]\n                y_in[0][mask_idx[cand]] = pred_id[cand]\n                mask_idx = torch.nonzero(y_in[0] == self.mask_token).squeeze(-1)\n\n                logging.info(\"msk:{}\".format(n2s(y_in[0].tolist())))\n\n            # predict leftover masks (|masks| < mask_num // num_iter)\n            pred, pred_mask = self.decoder(y_in, None, h, None)\n            y_in[0][mask_idx] = pred[0][mask_idx].argmax(dim=-1)\n\n            
logging.info(\"msk:{}\".format(n2s(y_in[0].tolist())))\n\n        ret = y_in.tolist()[0]\n        hyp = {\"score\": 0.0, \"yseq\": [self.sos] + ret + [self.eos]}\n\n        return [hyp]\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_mix.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"\nThis script is used to construct End-to-End models of multi-speaker ASR.\n\nCopyright 2017 Johns Hopkins University (Shinji Watanabe)\n Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\"\"\"\n\nimport argparse\nfrom itertools import groupby\nimport logging\nimport math\nimport os\nimport sys\n\nimport editdistance\nimport numpy as np\nimport six\nimport torch\n\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.e2e_asr_common import get_vgg2l_odim\nfrom espnet.nets.e2e_asr_common import label_smoothing_dist\nfrom espnet.nets.pytorch_backend.ctc import ctc_for\nfrom espnet.nets.pytorch_backend.e2e_asr import E2E as E2EASR\nfrom espnet.nets.pytorch_backend.e2e_asr import Reporter\nfrom espnet.nets.pytorch_backend.frontends.feature_transform import (\n    feature_transform_for,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.frontends.frontend import frontend_for\nfrom espnet.nets.pytorch_backend.initialization import lecun_normal_init_parameters\nfrom espnet.nets.pytorch_backend.initialization import set_forget_bias_to_one\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.pytorch_backend.nets_utils import to_torch_tensor\nfrom espnet.nets.pytorch_backend.rnn.attentions import att_for\nfrom espnet.nets.pytorch_backend.rnn.decoders import decoder_for\nfrom espnet.nets.pytorch_backend.rnn.encoders import encoder_for as encoder_for_single\nfrom espnet.nets.pytorch_backend.rnn.encoders import RNNP\nfrom espnet.nets.pytorch_backend.rnn.encoders import VGG2L\n\nCTC_LOSS_THRESHOLD = 10000\n\n\nclass PIT(object):\n    \"\"\"Permutation Invariant Training (PIT) module.\n\n    :parameter int num_spkrs: number of speakers for PIT process (2 or 3)\n    \"\"\"\n\n    def 
__init__(self, num_spkrs):\n        \"\"\"Initialize PIT module.\"\"\"\n        self.num_spkrs = num_spkrs\n\n        # [[0, 1], [1, 0]] or\n        # [[0, 1, 2], [0, 2, 1], [1, 0, 2], [1, 2, 0], [2, 1, 0], [2, 0, 1]]\n        self.perm_choices = []\n        initial_seq = np.linspace(0, num_spkrs - 1, num_spkrs, dtype=np.int64)\n        self.permutationDFS(initial_seq, 0)\n\n        # [[0, 3], [1, 2]] or\n        # [[0, 4, 8], [0, 5, 7], [1, 3, 8], [1, 5, 6], [2, 4, 6], [2, 3, 7]]\n        self.loss_perm_idx = np.linspace(\n            0, num_spkrs * (num_spkrs - 1), num_spkrs, dtype=np.int64\n        ).reshape(1, num_spkrs)\n        self.loss_perm_idx = (self.loss_perm_idx + np.array(self.perm_choices)).tolist()\n\n    def min_pit_sample(self, loss):\n        \"\"\"Compute the PIT loss for each sample.\n\n        :param 1-D torch.Tensor loss: list of losses for one sample,\n            including [h1r1, h1r2, h2r1, h2r2] or\n            [h1r1, h1r2, h1r3, h2r1, h2r2, h2r3, h3r1, h3r2, h3r3]\n        :return minimum loss of best permutation\n        :rtype torch.Tensor (1)\n        :return the best permutation\n        :rtype List: len=2\n\n        \"\"\"\n        score_perms = (\n            torch.stack(\n                [torch.sum(loss[loss_perm_idx]) for loss_perm_idx in self.loss_perm_idx]\n            )\n            / self.num_spkrs\n        )\n        perm_loss, min_idx = torch.min(score_perms, 0)\n        permutation = self.perm_choices[min_idx]\n        return perm_loss, permutation\n\n    def pit_process(self, losses):\n        \"\"\"Compute the PIT loss for a batch.\n\n        :param torch.Tensor losses: losses (B, 1|4|9)\n        :return minimum losses of a batch with best permutation\n        :rtype torch.Tensor (B)\n        :return the best permutation\n        :rtype torch.LongTensor (B, 1|2|3)\n\n        \"\"\"\n        bs = losses.size(0)\n        ret = [self.min_pit_sample(losses[i]) for i in range(bs)]\n\n        loss_perm = torch.stack([r[0] for r 
in ret], dim=0).to(losses.device)  # (B)\n        permutation = torch.tensor([r[1] for r in ret]).long().to(losses.device)\n        return torch.mean(loss_perm), permutation\n\n    def permutationDFS(self, source, start):\n        \"\"\"Get permutations with DFS.\n\n           The final result is all permutations of the 'source' sequence.\n           e.g. [[1, 2], [2, 1]] or\n                [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 2, 1], [3, 1, 2]]\n\n        :param np.ndarray source: (num_spkrs, 1), e.g. [1, 2, ..., N]\n        :param int start: the start point to permute\n\n        \"\"\"\n        if start == len(source) - 1:  # reach final state\n            self.perm_choices.append(source.tolist())\n        for i in range(start, len(source)):\n            # swap values at position start and i\n            source[start], source[i] = source[i], source[start]\n            self.permutationDFS(source, start + 1)\n            # reverse the swap\n            source[start], source[i] = source[i], source[start]\n\n\nclass E2E(ASRInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2EASR.encoder_add_arguments(parser)\n        E2E.encoder_mix_add_arguments(parser)\n        E2EASR.attention_add_arguments(parser)\n        E2EASR.decoder_add_arguments(parser)\n        return parser\n\n    @staticmethod\n    def encoder_mix_add_arguments(parser):\n        \"\"\"Add arguments for multi-speaker encoder.\"\"\"\n        group = parser.add_argument_group(\"E2E encoder setting for multi-speaker\")\n        # asr-mix encoder\n        group.add_argument(\n            \"--spa\",\n            action=\"store_true\",\n            help=\"Enable speaker parallel attention \"\n            \"for multi-speaker speech 
recognition task.\",\n        )\n        group.add_argument(\n            \"--elayers-sd\",\n            default=4,\n            type=int,\n            help=\"Number of speaker-differentiating encoder layers \"\n            \"for multi-speaker speech recognition task.\",\n        )\n        return parser\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        return self.enc.conv_subsampling_factor * int(np.prod(self.subsample))\n\n    def __init__(self, idim, odim, args):\n        \"\"\"Initialize multi-speaker E2E module.\"\"\"\n        super(E2E, self).__init__()\n        torch.nn.Module.__init__(self)\n        self.mtlalpha = args.mtlalpha\n        assert 0.0 <= self.mtlalpha <= 1.0, \"mtlalpha should be [0.0, 1.0]\"\n        self.etype = args.etype\n        self.verbose = args.verbose\n        # NOTE: for self.build method\n        args.char_list = getattr(args, \"char_list\", None)\n        self.char_list = args.char_list\n        self.outdir = args.outdir\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.reporter = Reporter()\n        self.num_spkrs = args.num_spkrs\n        self.spa = args.spa\n        self.pit = PIT(self.num_spkrs)\n\n        # below means the last number becomes eos/sos ID\n        # note that sos/eos IDs are identical\n        self.sos = odim - 1\n        self.eos = odim - 1\n\n        # subsample info\n        self.subsample = get_subsample(args, mode=\"asr\", arch=\"rnn_mix\")\n\n        # label smoothing info\n        if args.lsm_type and os.path.isfile(args.train_json):\n            logging.info(\"Use label smoothing with \" + args.lsm_type)\n            labeldist = label_smoothing_dist(\n                odim, args.lsm_type, transcript=args.train_json\n            )\n        else:\n            labeldist = None\n\n        if getattr(args, \"use_frontend\", False):  # use getattr to keep compatibility\n            self.frontend = frontend_for(args, 
idim)\n            self.feature_transform = feature_transform_for(args, (idim - 1) * 2)\n            idim = args.n_mels\n        else:\n            self.frontend = None\n\n        # encoder\n        self.enc = encoder_for(args, idim, self.subsample)\n        # ctc\n        self.ctc = ctc_for(args, odim, reduce=False)\n        # attention\n        num_att = self.num_spkrs if args.spa else 1\n        self.att = att_for(args, num_att)\n        # decoder\n        self.dec = decoder_for(args, odim, self.sos, self.eos, self.att, labeldist)\n\n        # weight initialization\n        self.init_like_chainer()\n\n        # options for beam search\n        if \"report_cer\" in vars(args) and (args.report_cer or args.report_wer):\n            recog_args = {\n                \"beam_size\": args.beam_size,\n                \"penalty\": args.penalty,\n                \"ctc_weight\": args.ctc_weight,\n                \"maxlenratio\": args.maxlenratio,\n                \"minlenratio\": args.minlenratio,\n                \"lm_weight\": args.lm_weight,\n                \"rnnlm\": args.rnnlm,\n                \"nbest\": args.nbest,\n                \"space\": args.sym_space,\n                \"blank\": args.sym_blank,\n            }\n\n            self.recog_args = argparse.Namespace(**recog_args)\n            self.report_cer = args.report_cer\n            self.report_wer = args.report_wer\n        else:\n            self.report_cer = False\n            self.report_wer = False\n        self.rnnlm = None\n\n        self.logzero = -10000000000.0\n        self.loss = None\n        self.acc = None\n\n    def init_like_chainer(self):\n        \"\"\"Initialize weight like chainer.\n\n        chainer basically uses LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0\n        pytorch basically uses W, b ~ Uniform(-fan_in**-0.5, fan_in**-0.5)\n\n        however, there are two exceptions as far as I know.\n        - EmbedID.W ~ Normal(0, 1)\n        - LSTM.upward.b[forget_gate_range] = 1 (but not 
used in NStepLSTM)\n        \"\"\"\n        lecun_normal_init_parameters(self)\n        # exceptions\n        # embed weight ~ Normal(0, 1)\n        self.dec.embed.weight.data.normal_(0, 1)\n        # forget-bias = 1.0\n        # https://discuss.pytorch.org/t/set-forget-gate-bias-of-lstm/1745\n        for i in six.moves.range(len(self.dec.decoder)):\n            set_forget_bias_to_one(self.dec.decoder[i].bias_ih)\n\n    def forward(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, num_spkrs, Lmax)\n        :return: ctc loss value\n        :rtype: torch.Tensor\n        :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in attention decoder\n        :rtype: float\n        \"\"\"\n        # 0. Frontend\n        if self.frontend is not None:\n            hs_pad, hlens, mask = self.frontend(to_torch_tensor(xs_pad), ilens)\n            if isinstance(hs_pad, list):\n                hlens_n = [None] * self.num_spkrs\n                for i in range(self.num_spkrs):\n                    hs_pad[i], hlens_n[i] = self.feature_transform(hs_pad[i], hlens)\n                hlens = hlens_n\n            else:\n                hs_pad, hlens = self.feature_transform(hs_pad, hlens)\n        else:\n            hs_pad, hlens = xs_pad, ilens\n\n        # 1. Encoder\n        if not isinstance(\n            hs_pad, list\n        ):  # single-channel input xs_pad (single- or multi-speaker)\n            hs_pad, hlens, _ = self.enc(hs_pad, hlens)\n        else:  # multi-channel multi-speaker input xs_pad\n            for i in range(self.num_spkrs):\n                hs_pad[i], hlens[i], _ = self.enc(hs_pad[i], hlens[i])\n\n        # 2. 
CTC loss\n        if self.mtlalpha == 0:\n            loss_ctc, min_perm = None, None\n        else:\n            if not isinstance(hs_pad, list):  # single-speaker input xs_pad\n                loss_ctc = torch.mean(self.ctc(hs_pad, hlens, ys_pad))\n            else:  # multi-speaker input xs_pad\n                ys_pad = ys_pad.transpose(0, 1)  # (num_spkrs, B, Lmax)\n                loss_ctc_perm = torch.stack(\n                    [\n                        self.ctc(\n                            hs_pad[i // self.num_spkrs],\n                            hlens[i // self.num_spkrs],\n                            ys_pad[i % self.num_spkrs],\n                        )\n                        for i in range(self.num_spkrs ** 2)\n                    ],\n                    dim=1,\n                )  # (B, num_spkrs^2)\n                loss_ctc, min_perm = self.pit.pit_process(loss_ctc_perm)\n                logging.info(\"ctc loss:\" + str(float(loss_ctc)))\n\n        # 3. attention loss\n        if self.mtlalpha == 1:\n            loss_att = None\n            acc = None\n        else:\n            if not isinstance(hs_pad, list):  # single-speaker input xs_pad\n                loss_att, acc, _ = self.dec(hs_pad, hlens, ys_pad)\n            else:\n                for i in range(ys_pad.size(1)):  # B\n                    ys_pad[:, i] = ys_pad[min_perm[i], i]\n                rslt = [\n                    self.dec(hs_pad[i], hlens[i], ys_pad[i], strm_idx=i)\n                    for i in range(self.num_spkrs)\n                ]\n                loss_att = sum([r[0] for r in rslt]) / float(len(rslt))\n                acc = sum([r[1] for r in rslt]) / float(len(rslt))\n        self.acc = acc\n\n        # 4. 
compute cer without beam search\n        if self.mtlalpha == 0 or self.char_list is None:\n            cer_ctc = None\n        else:\n            cers = []\n            for ns in range(self.num_spkrs):\n                y_hats = self.ctc.argmax(hs_pad[ns]).data\n                for i, y in enumerate(y_hats):\n                    y_hat = [x[0] for x in groupby(y)]\n                    y_true = ys_pad[ns][i]\n\n                    seq_hat = [\n                        self.char_list[int(idx)] for idx in y_hat if int(idx) != -1\n                    ]\n                    seq_true = [\n                        self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                    ]\n                    seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n                    seq_hat_text = seq_hat_text.replace(self.blank, \"\")\n                    seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n\n                    hyp_chars = seq_hat_text.replace(\" \", \"\")\n                    ref_chars = seq_true_text.replace(\" \", \"\")\n                    if len(ref_chars) > 0:\n                        cers.append(\n                            editdistance.eval(hyp_chars, ref_chars) / len(ref_chars)\n                        )\n\n            cer_ctc = sum(cers) / len(cers) if cers else None\n\n        # 5. 
compute cer/wer\n        if (\n            self.training\n            or not (self.report_cer or self.report_wer)\n            or not isinstance(hs_pad, list)\n        ):\n            cer, wer = 0.0, 0.0\n        else:\n            if self.recog_args.ctc_weight > 0.0:\n                lpz = [\n                    self.ctc.log_softmax(hs_pad[i]).data for i in range(self.num_spkrs)\n                ]\n            else:\n                lpz = None\n\n            word_eds, char_eds, word_ref_lens, char_ref_lens = [], [], [], []\n            nbest_hyps = [\n                self.dec.recognize_beam_batch(\n                    hs_pad[i],\n                    torch.tensor(hlens[i]),\n                    lpz[i],\n                    self.recog_args,\n                    self.char_list,\n                    self.rnnlm,\n                    strm_idx=i,\n                )\n                for i in range(self.num_spkrs)\n            ]\n            # remove <sos> and <eos>\n            y_hats = [\n                [nbest_hyp[0][\"yseq\"][1:-1] for nbest_hyp in nbest_hyps[i]]\n                for i in range(self.num_spkrs)\n            ]\n            for i in range(len(y_hats[0])):\n                hyp_words = []\n                hyp_chars = []\n                ref_words = []\n                ref_chars = []\n                for ns in range(self.num_spkrs):\n                    y_hat = y_hats[ns][i]\n                    y_true = ys_pad[ns][i]\n\n                    seq_hat = [\n                        self.char_list[int(idx)] for idx in y_hat if int(idx) != -1\n                    ]\n                    seq_true = [\n                        self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                    ]\n                    seq_hat_text = \"\".join(seq_hat).replace(self.recog_args.space, \" \")\n                    seq_hat_text = seq_hat_text.replace(self.recog_args.blank, \"\")\n                    seq_true_text = \"\".join(seq_true).replace(\n                   
     self.recog_args.space, \" \"\n                    )\n\n                    hyp_words.append(seq_hat_text.split())\n                    ref_words.append(seq_true_text.split())\n                    hyp_chars.append(seq_hat_text.replace(\" \", \"\"))\n                    ref_chars.append(seq_true_text.replace(\" \", \"\"))\n\n                tmp_word_ed = [\n                    editdistance.eval(\n                        hyp_words[ns // self.num_spkrs], ref_words[ns % self.num_spkrs]\n                    )\n                    for ns in range(self.num_spkrs ** 2)\n                ]  # h1r1,h1r2,h2r1,h2r2\n                tmp_char_ed = [\n                    editdistance.eval(\n                        hyp_chars[ns // self.num_spkrs], ref_chars[ns % self.num_spkrs]\n                    )\n                    for ns in range(self.num_spkrs ** 2)\n                ]  # h1r1,h1r2,h2r1,h2r2\n\n                word_eds.append(self.pit.min_pit_sample(torch.tensor(tmp_word_ed))[0])\n                word_ref_lens.append(len(sum(ref_words, [])))\n                char_eds.append(self.pit.min_pit_sample(torch.tensor(tmp_char_ed))[0])\n                char_ref_lens.append(len(\"\".join(ref_chars)))\n\n            wer = (\n                0.0\n                if not self.report_wer\n                else float(sum(word_eds)) / sum(word_ref_lens)\n            )\n            cer = (\n                0.0\n                if not self.report_cer\n                else float(sum(char_eds)) / sum(char_ref_lens)\n            )\n\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = None\n        elif alpha == 1:\n            self.loss = loss_ctc\n            loss_att_data = None\n            loss_ctc_data = float(loss_ctc)\n        else:\n            self.loss = alpha * loss_ctc + (1 - alpha) * loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = 
float(loss_ctc)\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_ctc_data, loss_att_data, self.acc, cer_ctc, cer, wer, loss_data\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def recognize(self, x, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        :param ndarray x: input acoustic feature (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n        ilens = [x.shape[0]]\n\n        # subsample frame\n        x = x[:: self.subsample[0], :]\n        h = to_device(self, to_torch_tensor(x).float())\n        # make a utt list (1) to use the same interface for encoder\n        hs = h.contiguous().unsqueeze(0)\n\n        # 0. Frontend\n        if self.frontend is not None:\n            hs, hlens, mask = self.frontend(hs, ilens)\n            hlens_n = [None] * self.num_spkrs\n            for i in range(self.num_spkrs):\n                hs[i], hlens_n[i] = self.feature_transform(hs[i], hlens)\n            hlens = hlens_n\n        else:\n            hs, hlens = hs, ilens\n\n        # 1. 
Encoder\n        if not isinstance(hs, list):  # single-channel multi-speaker input x\n            hs, hlens, _ = self.enc(hs, hlens)\n        else:  # multi-channel multi-speaker input x\n            for i in range(self.num_spkrs):\n                hs[i], hlens[i], _ = self.enc(hs[i], hlens[i])\n\n        # calculate log P(z_t|X) for CTC scores\n        if recog_args.ctc_weight > 0.0:\n            lpz = [self.ctc.log_softmax(i)[0] for i in hs]\n        else:\n            # one placeholder per speaker so that lpz[i] below stays valid\n            lpz = [None] * self.num_spkrs\n\n        # 2. decoder\n        # decode the first utterance\n        y = [\n            self.dec.recognize_beam(\n                hs[i][0], lpz[i], recog_args, char_list, rnnlm, strm_idx=i\n            )\n            for i in range(self.num_spkrs)\n        ]\n\n        if prev:\n            self.train()\n        return y\n\n    def recognize_batch(self, xs, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E batch beam search.\n\n        :param list xs: list of input acoustic feature arrays, each (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n        ilens = np.fromiter((xx.shape[0] for xx in xs), dtype=np.int64)\n\n        # subsample frame\n        xs = [xx[:: self.subsample[0], :] for xx in xs]\n        xs = [to_device(self, to_torch_tensor(xx).float()) for xx in xs]\n        xs_pad = pad_list(xs, 0.0)\n\n        # 0. Frontend\n        if self.frontend is not None:\n            hs_pad, hlens, mask = self.frontend(xs_pad, ilens)\n            hlens_n = [None] * self.num_spkrs\n            for i in range(self.num_spkrs):\n                hs_pad[i], hlens_n[i] = self.feature_transform(hs_pad[i], hlens)\n            hlens = hlens_n\n        else:\n            hs_pad, hlens = xs_pad, ilens\n\n        # 1. 
Encoder\n        if not isinstance(hs_pad, list):  # single-channel multi-speaker input x\n            hs_pad, hlens, _ = self.enc(hs_pad, hlens)\n        else:  # multi-channel multi-speaker input x\n            for i in range(self.num_spkrs):\n                hs_pad[i], hlens[i], _ = self.enc(hs_pad[i], hlens[i])\n\n        # calculate log P(z_t|X) for CTC scores\n        if recog_args.ctc_weight > 0.0:\n            lpz = [self.ctc.log_softmax(hs_pad[i]) for i in range(self.num_spkrs)]\n            normalize_score = False\n        else:\n            lpz = None\n            normalize_score = True\n\n        # 2. decoder\n        y = [\n            self.dec.recognize_beam_batch(\n                hs_pad[i],\n                hlens[i],\n                lpz[i],\n                recog_args,\n                char_list,\n                rnnlm,\n                normalize_score=normalize_score,\n                strm_idx=i,\n            )\n            for i in range(self.num_spkrs)\n        ]\n\n        if prev:\n            self.train()\n        return y\n\n    def enhance(self, xs):\n        \"\"\"Forward only the frontend stage.\n\n        :param ndarray xs: input acoustic feature (T, C, F)\n        \"\"\"\n        if self.frontend is None:\n            raise RuntimeError(\"Frontend doesn't exist\")\n        prev = self.training\n        self.eval()\n        ilens = np.fromiter((xx.shape[0] for xx in xs), dtype=np.int64)\n\n        # subsample frame\n        xs = [xx[:: self.subsample[0], :] for xx in xs]\n        xs = [to_device(self, to_torch_tensor(xx).float()) for xx in xs]\n        xs_pad = pad_list(xs, 0.0)\n        enhanced, hlensm, mask = self.frontend(xs_pad, ilens)\n        if prev:\n            self.train()\n\n        if isinstance(enhanced, (tuple, list)):\n            enhanced = list(enhanced)\n            mask = list(mask)\n            for idx in range(len(enhanced)):  # number of speakers\n                enhanced[idx] = enhanced[idx].cpu().numpy()\n        
        mask[idx] = mask[idx].cpu().numpy()\n            return enhanced, mask, ilens\n        return enhanced.cpu().numpy(), mask.cpu().numpy(), ilens\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, num_spkrs, Lmax)\n        :return: attention weights with the following shape,\n            1) multi-head case => attention weights (B, H, Lmax, Tmax),\n            2) other case => attention weights (B, Lmax, Tmax).\n        :rtype: float ndarray\n        \"\"\"\n        with torch.no_grad():\n            # 0. Frontend\n            if self.frontend is not None:\n                hs_pad, hlens, mask = self.frontend(to_torch_tensor(xs_pad), ilens)\n                hlens_n = [None] * self.num_spkrs\n                for i in range(self.num_spkrs):\n                    hs_pad[i], hlens_n[i] = self.feature_transform(hs_pad[i], hlens)\n                hlens = hlens_n\n            else:\n                hs_pad, hlens = xs_pad, ilens\n\n            # 1. 
Encoder\n            if not isinstance(hs_pad, list):  # single-channel multi-speaker input x\n                hs_pad, hlens, _ = self.enc(hs_pad, hlens)\n            else:  # multi-channel multi-speaker input x\n                for i in range(self.num_spkrs):\n                    hs_pad[i], hlens[i], _ = self.enc(hs_pad[i], hlens[i])\n\n            # Permutation\n            ys_pad = ys_pad.transpose(0, 1)  # (num_spkrs, B, Lmax)\n            if self.num_spkrs <= 3:\n                loss_ctc = torch.stack(\n                    [\n                        self.ctc(\n                            hs_pad[i // self.num_spkrs],\n                            hlens[i // self.num_spkrs],\n                            ys_pad[i % self.num_spkrs],\n                        )\n                        for i in range(self.num_spkrs ** 2)\n                    ],\n                    1,\n                )  # (B, num_spkrs^2)\n                loss_ctc, min_perm = self.pit.pit_process(loss_ctc)\n            for i in range(ys_pad.size(1)):  # B\n                ys_pad[:, i] = ys_pad[min_perm[i], i]\n\n            # 2. 
Decoder\n            att_ws = [\n                self.dec.calculate_all_attentions(\n                    hs_pad[i], hlens[i], ys_pad[i], strm_idx=i\n                )\n                for i in range(self.num_spkrs)\n            ]\n\n        return att_ws\n\n\nclass EncoderMix(torch.nn.Module):\n    \"\"\"Encoder module for the case of multi-speaker mixture speech.\n\n    :param str etype: type of encoder network\n    :param int idim: number of dimensions of encoder network\n    :param int elayers_sd:\n        number of layers of speaker differentiate part in encoder network\n    :param int elayers_rec:\n        number of layers of shared recognition part in encoder network\n    :param int eunits: number of lstm units of encoder network\n    :param int eprojs: number of projection units of encoder network\n    :param np.ndarray subsample: list of subsampling numbers\n    :param float dropout: dropout rate\n    :param int in_channel: number of input channels\n    :param int num_spkrs: number of speakers\n    \"\"\"\n\n    def __init__(\n        self,\n        etype,\n        idim,\n        elayers_sd,\n        elayers_rec,\n        eunits,\n        eprojs,\n        subsample,\n        dropout,\n        num_spkrs=2,\n        in_channel=1,\n    ):\n        \"\"\"Initialize the encoder of single-channel multi-speaker ASR.\"\"\"\n        super(EncoderMix, self).__init__()\n        # slice off the \"vgg\" prefix and \"p\" suffix; lstrip(\"vgg\") strips characters,\n        # not a prefix, and would turn \"vgggrup\" into \"ru\"\n        typ = etype[3:] if etype.startswith(\"vgg\") else etype\n        typ = typ[:-1] if typ.endswith(\"p\") else typ\n        if typ not in [\"lstm\", \"gru\", \"blstm\", \"bgru\"]:\n            logging.error(\"Error: need to specify an appropriate encoder architecture\")\n        if etype.startswith(\"vgg\"):\n            if etype[-1] == \"p\":\n                self.enc_mix = torch.nn.ModuleList([VGG2L(in_channel)])\n                self.enc_sd = torch.nn.ModuleList(\n                    [\n                        torch.nn.ModuleList(\n                            [\n                                RNNP(\n                                    
get_vgg2l_odim(idim, in_channel=in_channel),\n                                    elayers_sd,\n                                    eunits,\n                                    eprojs,\n                                    subsample[: elayers_sd + 1],\n                                    dropout,\n                                    typ=typ,\n                                )\n                            ]\n                        )\n                        for i in range(num_spkrs)\n                    ]\n                )\n                self.enc_rec = torch.nn.ModuleList(\n                    [\n                        RNNP(\n                            eprojs,\n                            elayers_rec,\n                            eunits,\n                            eprojs,\n                            subsample[elayers_sd:],\n                            dropout,\n                            typ=typ,\n                        )\n                    ]\n                )\n                logging.info(\"Use CNN-VGG + B\" + typ.upper() + \"P for encoder\")\n            else:\n                logging.error(\n                    f\"Error: need to specify an appropriate encoder architecture. \"\n                    f\"Illegal name {etype}\"\n                )\n                sys.exit()\n        else:\n            logging.error(\n                f\"Error: need to specify an appropriate encoder architecture. 
\"\n                f\"Illegal name {etype}\"\n            )\n            sys.exit()\n\n        self.num_spkrs = num_spkrs\n\n    def forward(self, xs_pad, ilens):\n        \"\"\"Encodermix forward.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :return: list: batch of hidden state sequences [num_spkrs x (B, Tmax, eprojs)]\n        :rtype: torch.Tensor\n        \"\"\"\n        # mixture encoder\n        for module in self.enc_mix:\n            xs_pad, ilens, _ = module(xs_pad, ilens)\n\n        # SD and Rec encoder\n        xs_pad_sd = [xs_pad for i in range(self.num_spkrs)]\n        ilens_sd = [ilens for i in range(self.num_spkrs)]\n        for ns in range(self.num_spkrs):\n            # Encoder_SD: speaker differentiate encoder\n            for module in self.enc_sd[ns]:\n                xs_pad_sd[ns], ilens_sd[ns], _ = module(xs_pad_sd[ns], ilens_sd[ns])\n            # Encoder_Rec: recognition encoder\n            for module in self.enc_rec:\n                xs_pad_sd[ns], ilens_sd[ns], _ = module(xs_pad_sd[ns], ilens_sd[ns])\n\n        # make mask to remove bias value in padded part\n        mask = to_device(xs_pad, make_pad_mask(ilens_sd[0]).unsqueeze(-1))\n\n        return [x.masked_fill(mask, 0.0) for x in xs_pad_sd], ilens_sd, None\n\n\ndef encoder_for(args, idim, subsample):\n    \"\"\"Construct the encoder.\"\"\"\n    if getattr(args, \"use_frontend\", False):  # use getattr to keep compatibility\n        # with frontend, the mixed speech are separated as streams for each speaker\n        return encoder_for_single(args, idim, subsample)\n    else:\n        return EncoderMix(\n            args.etype,\n            idim,\n            args.elayers_sd,\n            args.elayers,\n            args.eunits,\n            args.eprojs,\n            subsample,\n            args.dropout_rate,\n            args.num_spkrs,\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_mix_transformer.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n# Copyright 2020 Johns Hopkins University (Xuankai Chang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"\nTransformer speech recognition model for single-channel multi-speaker mixture speech.\n\nIt is a fusion of `e2e_asr_mix.py` and `e2e_asr_transformer.py`. Refer to:\n    https://arxiv.org/pdf/2002.03921.pdf\n1. The Transformer-based Encoder now consists of three stages:\n     (a): Enc_mix: encoding input mixture speech;\n     (b): Enc_SD: separating mixed speech representations;\n     (c): Enc_rec: transforming each separated speech representation.\n2. PIT is used in CTC to determine the permutation with minimum loss.\n\"\"\"\n\nfrom argparse import Namespace\nimport logging\nimport math\n\nimport numpy\nimport torch\n\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScore\nfrom espnet.nets.e2e_asr_common import end_detect\nfrom espnet.nets.pytorch_backend.ctc import CTC\nfrom espnet.nets.pytorch_backend.e2e_asr import CTC_LOSS_THRESHOLD\nfrom espnet.nets.pytorch_backend.e2e_asr_mix import E2E as E2EASRMIX\nfrom espnet.nets.pytorch_backend.e2e_asr_mix import PIT\nfrom espnet.nets.pytorch_backend.e2e_asr_transformer import E2E as E2EASR\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import th_accuracy\nfrom espnet.nets.pytorch_backend.rnn.decoders import CTC_SCORING_RATIO\nfrom espnet.nets.pytorch_backend.transformer.add_sos_eos import add_sos_eos\nfrom espnet.nets.pytorch_backend.transformer.encoder_mix import EncoderMix\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.pytorch_backend.transformer.mask import target_mask\n\n\nclass E2E(E2EASR, ASRInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: 
argument Namespace containing options\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2EASR.add_arguments(parser)\n        E2EASRMIX.encoder_mix_add_arguments(parser)\n        return parser\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        :param int ignore_id: padding id ignored in the targets\n        \"\"\"\n        super(E2E, self).__init__(idim, odim, args, ignore_id=ignore_id)\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n        self.encoder = EncoderMix(\n            idim=idim,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.eunits,\n            num_blocks_sd=args.elayers_sd,\n            num_blocks_rec=args.elayers,\n            input_layer=args.transformer_input_layer,\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            attention_dropout_rate=args.transformer_attn_dropout_rate,\n            num_spkrs=args.num_spkrs,\n        )\n\n        if args.mtlalpha > 0.0:\n            self.ctc = CTC(\n                odim, args.adim, args.dropout_rate, ctc_type=args.ctc_type, reduce=False\n            )\n        else:\n            self.ctc = None\n\n        self.num_spkrs = args.num_spkrs\n        self.pit = PIT(self.num_spkrs)\n\n    def forward(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of source sequences (B)\n        :param torch.Tensor ys_pad: batch of padded target sequences\n                                    (B, num_spkrs, Lmax)\n        :return: ctc loss value\n        :rtype: torch.Tensor\n  
      :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in attention decoder\n        :rtype: float\n        \"\"\"\n        # 1. forward encoder\n        xs_pad = xs_pad[:, : max(ilens)]  # for data parallel\n        src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2)\n        hs_pad, hs_mask = self.encoder(xs_pad, src_mask)  # list: speaker differentiate\n        self.hs_pad = hs_pad\n\n        # 2. ctc\n        # TODO(karita) show predicted text\n        # TODO(karita) calculate these stats\n        cer_ctc = None\n        assert self.mtlalpha > 0.0\n        batch_size = xs_pad.size(0)\n        ys_pad = ys_pad.transpose(0, 1)  # (num_spkrs, B, Lmax)\n        hs_len = [hs_mask[i].view(batch_size, -1).sum(1) for i in range(self.num_spkrs)]\n        loss_ctc_perm = torch.stack(\n            [\n                self.ctc(\n                    hs_pad[i // self.num_spkrs].view(batch_size, -1, self.adim),\n                    hs_len[i // self.num_spkrs],\n                    ys_pad[i % self.num_spkrs],\n                )\n                for i in range(self.num_spkrs ** 2)\n            ],\n            dim=1,\n        )  # (B, num_spkrs^2)\n        loss_ctc, min_perm = self.pit.pit_process(loss_ctc_perm)\n        logging.info(\"ctc loss:\" + str(float(loss_ctc)))\n\n        # Permute the labels according to loss\n        for b in range(batch_size):  # B\n            ys_pad[:, b] = ys_pad[min_perm[b], b]  # (num_spkrs, B, Lmax)\n        ys_out_len = [\n            float(torch.sum(ys_pad[i] != self.ignore_id)) for i in range(self.num_spkrs)\n        ]\n\n        # TODO(karita) show predicted text\n        # TODO(karita) calculate these stats\n        if self.error_calculator is not None:\n            cer_ctc = []\n            for i in range(self.num_spkrs):\n                ys_hat = self.ctc.argmax(hs_pad[i].view(batch_size, -1, self.adim)).data\n                cer_ctc.append(\n                    
self.error_calculator(ys_hat.cpu(), ys_pad[i].cpu(), is_ctc=True)\n                )\n            cer_ctc = sum(map(lambda x: x[0] * x[1], zip(cer_ctc, ys_out_len))) / sum(\n                ys_out_len\n            )\n        else:\n            cer_ctc = None\n\n        # 3. forward decoder\n        if self.mtlalpha == 1.0:\n            loss_att, self.acc, cer, wer = None, None, None, None\n        else:\n            pred_pad, pred_mask = [None] * self.num_spkrs, [None] * self.num_spkrs\n            loss_att, acc = [None] * self.num_spkrs, [None] * self.num_spkrs\n            for i in range(self.num_spkrs):\n                (\n                    pred_pad[i],\n                    pred_mask[i],\n                    loss_att[i],\n                    acc[i],\n                ) = self.decoder_and_attention(\n                    hs_pad[i], hs_mask[i], ys_pad[i], batch_size\n                )\n\n            # 4. compute attention loss\n            # The following is just an approximation\n            loss_att = sum(map(lambda x: x[0] * x[1], zip(loss_att, ys_out_len))) / sum(\n                ys_out_len\n            )\n            self.acc = sum(map(lambda x: x[0] * x[1], zip(acc, ys_out_len))) / sum(\n                ys_out_len\n            )\n\n            # 5. 
compute cer/wer\n            if self.training or self.error_calculator is None:\n                cer, wer = None, None\n            else:\n                # FIXME: pred_pad is a per-speaker list here, so pred_pad.argmax raises\n                # AttributeError; CER/WER would need to be computed per speaker\n                ys_hat = pred_pad.argmax(dim=-1)\n                cer, wer = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())\n\n        # copied from e2e_asr\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = None\n        elif alpha == 1:\n            self.loss = loss_ctc\n            loss_att_data = None\n            loss_ctc_data = float(loss_ctc)\n        else:\n            self.loss = alpha * loss_ctc + (1 - alpha) * loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = float(loss_ctc)\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_ctc_data, loss_att_data, self.acc, cer_ctc, cer, wer, loss_data\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def decoder_and_attention(self, hs_pad, hs_mask, ys_pad, batch_size):\n        \"\"\"Forward decoder and attention loss.\"\"\"\n        # forward decoder\n        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)\n        ys_mask = target_mask(ys_in_pad, self.ignore_id)\n        pred_pad, pred_mask = self.decoder(ys_in_pad, ys_mask, hs_pad, hs_mask)\n\n        # compute attention loss\n        loss_att = self.criterion(pred_pad, ys_out_pad)\n        acc = th_accuracy(\n            pred_pad.view(-1, self.odim), ys_out_pad, ignore_label=self.ignore_id\n        )\n        return pred_pad, pred_mask, loss_att, acc\n\n    def encode(self, x):\n        \"\"\"Encode acoustic features.\n\n        :param ndarray x: source acoustic feature (T, D)\n        :return: encoder outputs\n        :rtype: torch.Tensor\n        
\"\"\"\n        self.eval()\n        x = torch.as_tensor(x).unsqueeze(0)\n        enc_output, _ = self.encoder(x, None)\n        return enc_output\n\n    def recog(self, enc_output, recog_args, char_list=None, rnnlm=None, use_jit=False):\n        \"\"\"Recognize input speech of each speaker.\n\n        :param torch.Tensor enc_output: encoder outputs (B, T, D) or (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        if recog_args.ctc_weight > 0.0:\n            lpz = self.ctc.log_softmax(enc_output)\n            lpz = lpz.squeeze(0)\n        else:\n            lpz = None\n\n        h = enc_output.squeeze(0)\n\n        logging.info(\"input lengths: \" + str(h.size(0)))\n        # search params\n        beam = recog_args.beam_size\n        penalty = recog_args.penalty\n        ctc_weight = recog_args.ctc_weight\n\n        # prepare sos\n        y = self.sos\n        vy = h.new_zeros(1).long()\n\n        if recog_args.maxlenratio == 0:\n            maxlen = h.shape[0]\n        else:\n            # maxlen >= 1\n            maxlen = max(1, int(recog_args.maxlenratio * h.size(0)))\n        minlen = int(recog_args.minlenratio * h.size(0))\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        if rnnlm:\n            hyp = {\"score\": 0.0, \"yseq\": [y], \"rnnlm_prev\": None}\n        else:\n            hyp = {\"score\": 0.0, \"yseq\": [y]}\n        if lpz is not None:\n            ctc_prefix_score = CTCPrefixScore(lpz.detach().numpy(), 0, self.eos, numpy)\n            hyp[\"ctc_state_prev\"] = ctc_prefix_score.initial_state()\n            hyp[\"ctc_score_prev\"] = 0.0\n            if ctc_weight != 1.0:\n                # pre-pruning based on 
attention scores\n                ctc_beam = min(lpz.shape[-1], int(beam * CTC_SCORING_RATIO))\n            else:\n                ctc_beam = lpz.shape[-1]\n        hyps = [hyp]\n        ended_hyps = []\n\n        import six\n\n        traced_decoder = None\n        for i in six.moves.range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            hyps_best_kept = []\n            for hyp in hyps:\n                vy[0] = hyp[\"yseq\"][i]\n\n                # get nbest local scores and their ids\n                ys_mask = subsequent_mask(i + 1).unsqueeze(0)\n                ys = torch.tensor(hyp[\"yseq\"]).unsqueeze(0)\n                # FIXME: jit does not match non-jit result\n                if use_jit:\n                    if traced_decoder is None:\n                        traced_decoder = torch.jit.trace(\n                            self.decoder.forward_one_step, (ys, ys_mask, enc_output)\n                        )\n                    local_att_scores = traced_decoder(ys, ys_mask, enc_output)[0]\n                else:\n                    local_att_scores = self.decoder.forward_one_step(\n                        ys, ys_mask, enc_output\n                    )[0]\n\n                if rnnlm:\n                    rnnlm_state, local_lm_scores = rnnlm.predict(hyp[\"rnnlm_prev\"], vy)\n                    local_scores = (\n                        local_att_scores + recog_args.lm_weight * local_lm_scores\n                    )\n                else:\n                    local_scores = local_att_scores\n\n                if lpz is not None:\n                    local_best_scores, local_best_ids = torch.topk(\n                        local_att_scores, ctc_beam, dim=1\n                    )\n                    ctc_scores, ctc_states = ctc_prefix_score(\n                        hyp[\"yseq\"], local_best_ids[0], hyp[\"ctc_state_prev\"]\n                    )\n                    local_scores = (1.0 - ctc_weight) * local_att_scores[\n                    
    :, local_best_ids[0]\n                    ] + ctc_weight * torch.from_numpy(\n                        ctc_scores - hyp[\"ctc_score_prev\"]\n                    )\n                    if rnnlm:\n                        local_scores += (\n                            recog_args.lm_weight * local_lm_scores[:, local_best_ids[0]]\n                        )\n                    local_best_scores, joint_best_ids = torch.topk(\n                        local_scores, beam, dim=1\n                    )\n                    local_best_ids = local_best_ids[:, joint_best_ids[0]]\n                else:\n                    local_best_scores, local_best_ids = torch.topk(\n                        local_scores, beam, dim=1\n                    )\n\n                for j in six.moves.range(beam):\n                    new_hyp = {}\n                    new_hyp[\"score\"] = hyp[\"score\"] + float(local_best_scores[0, j])\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = int(local_best_ids[0, j])\n                    if rnnlm:\n                        new_hyp[\"rnnlm_prev\"] = rnnlm_state\n                    if lpz is not None:\n                        new_hyp[\"ctc_state_prev\"] = ctc_states[joint_best_ids[0, j]]\n                        new_hyp[\"ctc_score_prev\"] = ctc_scores[joint_best_ids[0, j]]\n                    # will be (2 x beam) hyps at most\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypothes: \" + str(len(hyps)))\n            if char_list is not None:\n                logging.debug(\n                    \"best hypo: \"\n                    + 
\"\".join([char_list[int(x)] for x in hyps[0][\"yseq\"][1:]])\n                )\n\n            # add eos in the final loop so that at least one hypothesis has ended\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last position in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.eos)\n\n            # add ended hypotheses to a final list, and remove them from the\n            # current hypotheses (this is a problem when number of hyps < beam)\n            remained_hyps = []\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * penalty\n                        if rnnlm:  # Word LM needs to add final <eos> score\n                            hyp[\"score\"] += recog_args.lm_weight * rnnlm.final(\n                                hyp[\"rnnlm_prev\"]\n                            )\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # end detection\n\n            if end_detect(ended_hyps, i) and recog_args.maxlenratio == 0.0:\n                logging.info(\"end detected at %d\", i)\n                break\n\n            hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. 
Finish decoding.\")\n                break\n\n            if char_list is not None:\n                for hyp in hyps:\n                    logging.debug(\n                        \"hypo: \" + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]])\n                    )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[\n            : min(len(ended_hyps), recog_args.nbest)\n        ]\n\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there are no N-best results, perform recognition \"\n                \"again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            recog_args = Namespace(**vars(recog_args))\n            recog_args.minlenratio = max(0.0, recog_args.minlenratio - 0.1)\n            return self.recog(enc_output, recog_args, char_list, rnnlm, use_jit)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n        return nbest_hyps\n\n    def recognize(self, x, recog_args, char_list=None, rnnlm=None, use_jit=False):\n        \"\"\"Recognize input speech of each speaker.\n\n        :param ndarray x: input acoustic feature (B, T, D) or (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        # Encoder\n        enc_output = self.encode(x)\n\n        # Decoder\n        nbest_hyps = []\n        for enc_out in enc_output:\n            nbest_hyps.append(\n                self.recog(enc_out, recog_args, 
char_list, rnnlm, use_jit)\n            )\n        return nbest_hyps\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_mulenc.py",
    "content": "# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n# Copyright 2017 Johns Hopkins University (Ruizhi Li)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Define e2e module for multi-encoder network. https://arxiv.org/pdf/1811.04903.pdf.\"\"\"\n\nimport argparse\nfrom itertools import groupby\nimport logging\nimport math\nimport os\n\nimport chainer\nfrom chainer import reporter\nimport editdistance\nimport numpy as np\nimport torch\n\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.e2e_asr_common import label_smoothing_dist\nfrom espnet.nets.pytorch_backend.ctc import ctc_for\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.pytorch_backend.nets_utils import to_torch_tensor\nfrom espnet.nets.pytorch_backend.rnn.attentions import att_for\nfrom espnet.nets.pytorch_backend.rnn.decoders import decoder_for\nfrom espnet.nets.pytorch_backend.rnn.encoders import Encoder\nfrom espnet.nets.pytorch_backend.rnn.encoders import encoder_for\nfrom espnet.nets.scorers.ctc import CTCPrefixScorer\nfrom espnet.utils.cli_utils import strtobool\n\nCTC_LOSS_THRESHOLD = 10000\n\n\nclass Reporter(chainer.Chain):\n    \"\"\"Define a chainer reporter wrapper.\"\"\"\n\n    def report(self, loss_ctc_list, loss_att, acc, cer_ctc_list, cer, wer, mtl_loss):\n        \"\"\"Define a chainer reporter function.\"\"\"\n        # loss_ctc_list = [weighted CTC, CTC1, CTC2, ... CTCN]\n        # cer_ctc_list = [weighted cer_ctc, cer_ctc_1, cer_ctc_2, ... 
cer_ctc_N]\n        num_encs = len(loss_ctc_list) - 1\n        reporter.report({\"loss_ctc\": loss_ctc_list[0]}, self)\n        for i in range(num_encs):\n            reporter.report({\"loss_ctc{}\".format(i + 1): loss_ctc_list[i + 1]}, self)\n        reporter.report({\"loss_att\": loss_att}, self)\n        reporter.report({\"acc\": acc}, self)\n        reporter.report({\"cer_ctc\": cer_ctc_list[0]}, self)\n        for i in range(num_encs):\n            reporter.report({\"cer_ctc{}\".format(i + 1): cer_ctc_list[i + 1]}, self)\n        reporter.report({\"cer\": cer}, self)\n        reporter.report({\"wer\": wer}, self)\n        logging.info(\"mtl loss:\" + str(mtl_loss))\n        reporter.report({\"loss\": mtl_loss}, self)\n\n\nclass E2E(ASRInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param List idims: List of dimensions of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments for multi-encoder setting.\"\"\"\n        E2E.encoder_add_arguments(parser)\n        E2E.attention_add_arguments(parser)\n        E2E.decoder_add_arguments(parser)\n        E2E.ctc_add_arguments(parser)\n        return parser\n\n    @staticmethod\n    def encoder_add_arguments(parser):\n        \"\"\"Add arguments for encoders in multi-encoder setting.\"\"\"\n        group = parser.add_argument_group(\"E2E encoder setting\")\n        group.add_argument(\n            \"--etype\",\n            action=\"append\",\n            type=str,\n            choices=[\n                \"lstm\",\n                \"blstm\",\n                \"lstmp\",\n                \"blstmp\",\n                \"vgglstmp\",\n                \"vggblstmp\",\n                \"vgglstm\",\n                \"vggblstm\",\n                \"gru\",\n                \"bgru\",\n                \"grup\",\n                \"bgrup\",\n                \"vgggrup\",\n 
               \"vggbgrup\",\n                \"vgggru\",\n                \"vggbgru\",\n            ],\n            help=\"Type of encoder network architecture\",\n        )\n        group.add_argument(\n            \"--elayers\",\n            type=int,\n            action=\"append\",\n            help=\"Number of encoder layers \"\n            \"(for shared recognition part in multi-speaker asr mode)\",\n        )\n        group.add_argument(\n            \"--eunits\",\n            \"-u\",\n            type=int,\n            action=\"append\",\n            help=\"Number of encoder hidden units\",\n        )\n        group.add_argument(\n            \"--eprojs\", default=320, type=int, help=\"Number of encoder projection units\"\n        )\n        group.add_argument(\n            \"--subsample\",\n            type=str,\n            action=\"append\",\n            help=\"Subsample input frames x_y_z means \"\n            \"subsample every x frame at 1st layer, \"\n            \"every y frame at 2nd layer etc.\",\n        )\n        return parser\n\n    @staticmethod\n    def attention_add_arguments(parser):\n        \"\"\"Add arguments for attentions in multi-encoder setting.\"\"\"\n        group = parser.add_argument_group(\"E2E attention setting\")\n        # attention\n        group.add_argument(\n            \"--atype\",\n            type=str,\n            action=\"append\",\n            choices=[\n                \"noatt\",\n                \"dot\",\n                \"add\",\n                \"location\",\n                \"coverage\",\n                \"coverage_location\",\n                \"location2d\",\n                \"location_recurrent\",\n                \"multi_head_dot\",\n                \"multi_head_add\",\n                \"multi_head_loc\",\n                \"multi_head_multi_res_loc\",\n            ],\n            help=\"Type of attention architecture\",\n        )\n        group.add_argument(\n            \"--adim\",\n            type=int,\n 
           action=\"append\",\n            help=\"Number of attention transformation dimensions\",\n        )\n        group.add_argument(\n            \"--awin\",\n            type=int,\n            action=\"append\",\n            help=\"Window size for location2d attention\",\n        )\n        group.add_argument(\n            \"--aheads\",\n            type=int,\n            action=\"append\",\n            help=\"Number of heads for multi head attention\",\n        )\n        group.add_argument(\n            \"--aconv-chans\",\n            type=int,\n            action=\"append\",\n            help=\"Number of attention convolution channels \\\n                           (negative value indicates no location-aware attention)\",\n        )\n        group.add_argument(\n            \"--aconv-filts\",\n            type=int,\n            action=\"append\",\n            help=\"Number of attention convolution filters \\\n                           (negative value indicates no location-aware attention)\",\n        )\n        group.add_argument(\n            \"--dropout-rate\",\n            type=float,\n            action=\"append\",\n            help=\"Dropout rate for the encoder\",\n        )\n        # hierarchical attention network (HAN)\n        group.add_argument(\n            \"--han-type\",\n            default=\"dot\",\n            type=str,\n            choices=[\n                \"noatt\",\n                \"dot\",\n                \"add\",\n                \"location\",\n                \"coverage\",\n                \"coverage_location\",\n                \"location2d\",\n                \"location_recurrent\",\n                \"multi_head_dot\",\n                \"multi_head_add\",\n                \"multi_head_loc\",\n                \"multi_head_multi_res_loc\",\n            ],\n            help=\"Type of attention architecture (multi-encoder asr mode only)\",\n        )\n        group.add_argument(\n            \"--han-dim\",\n            
default=320,\n            type=int,\n            help=\"Number of attention transformation dimensions in HAN\",\n        )\n        group.add_argument(\n            \"--han-win\",\n            default=5,\n            type=int,\n            help=\"Window size for location2d attention in HAN\",\n        )\n        group.add_argument(\n            \"--han-heads\",\n            default=4,\n            type=int,\n            help=\"Number of heads for multi head attention in HAN\",\n        )\n        group.add_argument(\n            \"--han-conv-chans\",\n            default=-1,\n            type=int,\n            help=\"Number of attention convolution channels  in HAN \\\n                           (negative value indicates no location-aware attention)\",\n        )\n        group.add_argument(\n            \"--han-conv-filts\",\n            default=100,\n            type=int,\n            help=\"Number of attention convolution filters in HAN \\\n                           (negative value indicates no location-aware attention)\",\n        )\n        return parser\n\n    @staticmethod\n    def decoder_add_arguments(parser):\n        \"\"\"Add arguments for decoder in multi-encoder setting.\"\"\"\n        group = parser.add_argument_group(\"E2E decoder setting\")\n        group.add_argument(\n            \"--dtype\",\n            default=\"lstm\",\n            type=str,\n            choices=[\"lstm\", \"gru\"],\n            help=\"Type of decoder network architecture\",\n        )\n        group.add_argument(\n            \"--dlayers\", default=1, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=320, type=int, help=\"Number of decoder hidden units\"\n        )\n        group.add_argument(\n            \"--dropout-rate-decoder\",\n            default=0.0,\n            type=float,\n            help=\"Dropout rate for the decoder\",\n        )\n        group.add_argument(\n            
\"--sampling-probability\",\n            default=0.0,\n            type=float,\n            help=\"Ratio of predicted labels fed back to decoder\",\n        )\n        group.add_argument(\n            \"--lsm-type\",\n            const=\"\",\n            default=\"\",\n            type=str,\n            nargs=\"?\",\n            choices=[\"\", \"unigram\"],\n            help=\"Apply label smoothing with a specified distribution type\",\n        )\n        return parser\n\n    @staticmethod\n    def ctc_add_arguments(parser):\n        \"\"\"Add arguments for ctc in multi-encoder setting.\"\"\"\n        group = parser.add_argument_group(\"E2E multi-ctc setting\")\n        group.add_argument(\n            \"--share-ctc\",\n            type=strtobool,\n            default=False,\n            help=\"The flag to switch to share ctc across multiple encoders \"\n            \"(multi-encoder asr mode only).\",\n        )\n        group.add_argument(\n            \"--weights-ctc-train\",\n            type=float,\n            action=\"append\",\n            help=\"ctc weight assigned to each encoder during training.\",\n        )\n        group.add_argument(\n            \"--weights-ctc-dec\",\n            type=float,\n            action=\"append\",\n            help=\"ctc weight assigned to each encoder during decoding.\",\n        )\n        return parser\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        if isinstance(self.enc, Encoder):\n            return self.enc.conv_subsampling_factor * int(\n                np.prod(self.subsample_list[0])\n            )\n        else:\n            return self.enc[0].conv_subsampling_factor * int(\n                np.prod(self.subsample_list[0])\n            )\n\n    def __init__(self, idims, odim, args):\n        \"\"\"Initialize this class with python-level args.\n\n        Args:\n            idims (list): list of the number of an input feature dim.\n            odim (int): The 
number of output vocab.\n            args (Namespace): arguments\n\n        \"\"\"\n        super(E2E, self).__init__()\n        torch.nn.Module.__init__(self)\n        self.mtlalpha = args.mtlalpha\n        assert 0.0 <= self.mtlalpha <= 1.0, \"mtlalpha should be [0.0, 1.0]\"\n        self.verbose = args.verbose\n        # NOTE: for self.build method\n        args.char_list = getattr(args, \"char_list\", None)\n        self.char_list = args.char_list\n        self.outdir = args.outdir\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.reporter = Reporter()\n        self.num_encs = args.num_encs\n        self.share_ctc = args.share_ctc\n\n        # below means the last number becomes eos/sos ID\n        # note that sos/eos IDs are identical\n        self.sos = odim - 1\n        self.eos = odim - 1\n\n        # subsample info\n        self.subsample_list = get_subsample(args, mode=\"asr\", arch=\"rnn_mulenc\")\n\n        # label smoothing info\n        if args.lsm_type and os.path.isfile(args.train_json):\n            logging.info(\"Use label smoothing with \" + args.lsm_type)\n            labeldist = label_smoothing_dist(\n                odim, args.lsm_type, transcript=args.train_json\n            )\n        else:\n            labeldist = None\n\n        # speech translation related\n        self.replace_sos = getattr(\n            args, \"replace_sos\", False\n        )  # use getattr to keep compatibility\n\n        self.frontend = None\n\n        # encoder\n        self.enc = encoder_for(args, idims, self.subsample_list)\n        # ctc\n        self.ctc = ctc_for(args, odim)\n        # attention\n        self.att = att_for(args)\n        # hierarchical attention network\n        han = att_for(args, han_mode=True)\n        self.att.append(han)\n        # decoder\n        self.dec = decoder_for(args, odim, self.sos, self.eos, self.att, labeldist)\n\n        if args.mtlalpha > 0 and self.num_encs > 1:\n            # 
weights-ctc,\n            # e.g. ctc_loss = w_1*ctc_1_loss + w_2 * ctc_2_loss + w_N * ctc_N_loss\n            self.weights_ctc_train = args.weights_ctc_train / np.sum(\n                args.weights_ctc_train\n            )  # normalize\n            self.weights_ctc_dec = args.weights_ctc_dec / np.sum(\n                args.weights_ctc_dec\n            )  # normalize\n            logging.info(\n                \"ctc weights (training during training): \"\n                + \" \".join([str(x) for x in self.weights_ctc_train])\n            )\n            logging.info(\n                \"ctc weights (decoding during training): \"\n                + \" \".join([str(x) for x in self.weights_ctc_dec])\n            )\n        else:\n            self.weights_ctc_dec = [1.0]\n            self.weights_ctc_train = [1.0]\n\n        # weight initialization\n        self.init_like_chainer()\n\n        # options for beam search\n        if args.report_cer or args.report_wer:\n            recog_args = {\n                \"beam_size\": args.beam_size,\n                \"penalty\": args.penalty,\n                \"ctc_weight\": args.ctc_weight,\n                \"maxlenratio\": args.maxlenratio,\n                \"minlenratio\": args.minlenratio,\n                \"lm_weight\": args.lm_weight,\n                \"rnnlm\": args.rnnlm,\n                \"nbest\": args.nbest,\n                \"space\": args.sym_space,\n                \"blank\": args.sym_blank,\n                \"tgt_lang\": False,\n                \"ctc_weights_dec\": self.weights_ctc_dec,\n            }\n\n            self.recog_args = argparse.Namespace(**recog_args)\n            self.report_cer = args.report_cer\n            self.report_wer = args.report_wer\n        else:\n            self.report_cer = False\n            self.report_wer = False\n        self.rnnlm = None\n\n        self.logzero = -10000000000.0\n        self.loss = None\n        self.acc = None\n\n    def init_like_chainer(self):\n        
\"\"\"Initialize weight like chainer.\n\n        chainer basically uses LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0\n        pytorch basically uses W, b ~ Uniform(-fan_in**-0.5, fan_in**-0.5)\n\n        however, there are two exceptions as far as I know.\n        - EmbedID.W ~ Normal(0, 1)\n        - LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)\n        \"\"\"\n\n        def lecun_normal_init_parameters(module):\n            for p in module.parameters():\n                data = p.data\n                if data.dim() == 1:\n                    # bias\n                    data.zero_()\n                elif data.dim() == 2:\n                    # linear weight\n                    n = data.size(1)\n                    stdv = 1.0 / math.sqrt(n)\n                    data.normal_(0, stdv)\n                elif data.dim() in (3, 4):\n                    # conv weight\n                    n = data.size(1)\n                    for k in data.size()[2:]:\n                        n *= k\n                    stdv = 1.0 / math.sqrt(n)\n                    data.normal_(0, stdv)\n                else:\n                    raise NotImplementedError\n\n        def set_forget_bias_to_one(bias):\n            n = bias.size(0)\n            start, end = n // 4, n // 2\n            bias.data[start:end].fill_(1.0)\n\n        lecun_normal_init_parameters(self)\n        # exceptions\n        # embed weight ~ Normal(0, 1)\n        self.dec.embed.weight.data.normal_(0, 1)\n        # forget-bias = 1.0\n        # https://discuss.pytorch.org/t/set-forget-gate-bias-of-lstm/1745\n        for i in range(len(self.dec.decoder)):\n            set_forget_bias_to_one(self.dec.decoder[i].bias_ih)\n\n    def forward(self, xs_pad_list, ilens_list, ys_pad):\n        \"\"\"E2E forward.\n\n        :param List xs_pad_list: list of batch (torch.Tensor) of padded input sequences\n                                [(B, Tmax_1, idim), (B, Tmax_2, idim),..]\n        :param List ilens_list:\n     
       list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, Lmax)\n        :return: loss value\n        :rtype: torch.Tensor\n        \"\"\"\n        if self.replace_sos:\n            tgt_lang_ids = ys_pad[:, 0:1]\n            ys_pad = ys_pad[:, 1:]  # remove target language ID in the beginning\n        else:\n            tgt_lang_ids = None\n\n        hs_pad_list, hlens_list, self.loss_ctc_list = [], [], []\n        for idx in range(self.num_encs):\n            # 1. Encoder\n            hs_pad, hlens, _ = self.enc[idx](xs_pad_list[idx], ilens_list[idx])\n\n            # 2. CTC loss\n            if self.mtlalpha == 0:\n                self.loss_ctc_list.append(None)\n            else:\n                ctc_idx = 0 if self.share_ctc else idx\n                loss_ctc = self.ctc[ctc_idx](hs_pad, hlens, ys_pad)\n                self.loss_ctc_list.append(loss_ctc)\n            hs_pad_list.append(hs_pad)\n            hlens_list.append(hlens)\n\n        # 3. attention loss\n        if self.mtlalpha == 1:\n            self.loss_att, acc = None, None\n        else:\n            self.loss_att, acc, _ = self.dec(\n                hs_pad_list, hlens_list, ys_pad, lang_ids=tgt_lang_ids\n            )\n        self.acc = acc\n\n        # 4. 
compute cer without beam search\n        if self.mtlalpha == 0 or self.char_list is None:\n            cer_ctc_list = [None] * (self.num_encs + 1)\n        else:\n            cer_ctc_list = []\n            for ind in range(self.num_encs):\n                cers = []\n                ctc_idx = 0 if self.share_ctc else ind\n                y_hats = self.ctc[ctc_idx].argmax(hs_pad_list[ind]).data\n                for i, y in enumerate(y_hats):\n                    y_hat = [x[0] for x in groupby(y)]\n                    y_true = ys_pad[i]\n\n                    seq_hat = [\n                        self.char_list[int(idx)] for idx in y_hat if int(idx) != -1\n                    ]\n                    seq_true = [\n                        self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                    ]\n                    seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n                    seq_hat_text = seq_hat_text.replace(self.blank, \"\")\n                    seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n\n                    hyp_chars = seq_hat_text.replace(\" \", \"\")\n                    ref_chars = seq_true_text.replace(\" \", \"\")\n                    if len(ref_chars) > 0:\n                        cers.append(\n                            editdistance.eval(hyp_chars, ref_chars) / len(ref_chars)\n                        )\n\n                cer_ctc = sum(cers) / len(cers) if cers else None\n                cer_ctc_list.append(cer_ctc)\n            cer_ctc_weighted = np.sum(\n                [\n                    item * self.weights_ctc_train[i]\n                    for i, item in enumerate(cer_ctc_list)\n                ]\n            )\n            cer_ctc_list = [float(cer_ctc_weighted)] + [\n                float(item) for item in cer_ctc_list\n            ]\n\n        # 5. 
compute cer/wer\n        if self.training or not (self.report_cer or self.report_wer):\n            cer, wer = 0.0, 0.0\n            # oracle_cer, oracle_wer = 0.0, 0.0\n        else:\n            if self.recog_args.ctc_weight > 0.0:\n                lpz_list = []\n                for idx in range(self.num_encs):\n                    ctc_idx = 0 if self.share_ctc else idx\n                    lpz = self.ctc[ctc_idx].log_softmax(hs_pad_list[idx]).data\n                    lpz_list.append(lpz)\n            else:\n                lpz_list = None\n\n            word_eds, word_ref_lens, char_eds, char_ref_lens = [], [], [], []\n            nbest_hyps = self.dec.recognize_beam_batch(\n                hs_pad_list,\n                hlens_list,\n                lpz_list,\n                self.recog_args,\n                self.char_list,\n                self.rnnlm,\n                lang_ids=tgt_lang_ids.squeeze(1).tolist() if self.replace_sos else None,\n            )\n            # remove <sos> and <eos>\n            y_hats = [nbest_hyp[0][\"yseq\"][1:-1] for nbest_hyp in nbest_hyps]\n            for i, y_hat in enumerate(y_hats):\n                y_true = ys_pad[i]\n\n                seq_hat = [self.char_list[int(idx)] for idx in y_hat if int(idx) != -1]\n                seq_true = [\n                    self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                ]\n                seq_hat_text = \"\".join(seq_hat).replace(self.recog_args.space, \" \")\n                seq_hat_text = seq_hat_text.replace(self.recog_args.blank, \"\")\n                seq_true_text = \"\".join(seq_true).replace(self.recog_args.space, \" \")\n\n                hyp_words = seq_hat_text.split()\n                ref_words = seq_true_text.split()\n                word_eds.append(editdistance.eval(hyp_words, ref_words))\n                word_ref_lens.append(len(ref_words))\n                hyp_chars = seq_hat_text.replace(\" \", \"\")\n                ref_chars = 
seq_true_text.replace(\" \", \"\")\n                char_eds.append(editdistance.eval(hyp_chars, ref_chars))\n                char_ref_lens.append(len(ref_chars))\n\n            wer = (\n                0.0\n                if not self.report_wer\n                else float(sum(word_eds)) / sum(word_ref_lens)\n            )\n            cer = (\n                0.0\n                if not self.report_cer\n                else float(sum(char_eds)) / sum(char_ref_lens)\n            )\n\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = self.loss_att\n            loss_att_data = float(self.loss_att)\n            loss_ctc_data_list = [None] * (self.num_encs + 1)\n        elif alpha == 1:\n            self.loss = torch.sum(\n                torch.cat(\n                    [\n                        (item * self.weights_ctc_train[i]).unsqueeze(0)\n                        for i, item in enumerate(self.loss_ctc_list)\n                    ]\n                )\n            )\n            loss_att_data = None\n            loss_ctc_data_list = [float(self.loss)] + [\n                float(item) for item in self.loss_ctc_list\n            ]\n        else:\n            self.loss_ctc = torch.sum(\n                torch.cat(\n                    [\n                        (item * self.weights_ctc_train[i]).unsqueeze(0)\n                        for i, item in enumerate(self.loss_ctc_list)\n                    ]\n                )\n            )\n            self.loss = alpha * self.loss_ctc + (1 - alpha) * self.loss_att\n            loss_att_data = float(self.loss_att)\n            loss_ctc_data_list = [float(self.loss_ctc)] + [\n                float(item) for item in self.loss_ctc_list\n            ]\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_ctc_data_list,\n                loss_att_data,\n                acc,\n               
 cer_ctc_list,\n                cer,\n                wer,\n                loss_data,\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def scorers(self):\n        \"\"\"Get scorers for `beam_search` (optional).\n\n        Returns:\n            dict[str, ScorerInterface]: dict of `ScorerInterface` objects\n\n        \"\"\"\n        return dict(decoder=self.dec, ctc=CTCPrefixScorer(self.ctc, self.eos))\n\n    def encode(self, x_list):\n        \"\"\"Encode feature.\n\n        Args:\n            x_list (list): input feature [(T1, D), (T2, D), ... ]\n        Returns:\n            list\n                encoded feature [(T1, D), (T2, D), ... ]\n\n        \"\"\"\n        self.eval()\n        ilens_list = [[x_list[idx].shape[0]] for idx in range(self.num_encs)]\n\n        # subsample frame\n        x_list = [\n            x_list[idx][:: self.subsample_list[idx][0], :]\n            for idx in range(self.num_encs)\n        ]\n        p = next(self.parameters())\n        x_list = [\n            torch.as_tensor(x_list[idx], device=p.device, dtype=p.dtype)\n            for idx in range(self.num_encs)\n        ]\n        # make a utt list (1) to use the same interface for encoder\n        xs_list = [\n            x_list[idx].contiguous().unsqueeze(0) for idx in range(self.num_encs)\n        ]\n\n        # 1. 
encoder\n        hs_list = []\n        for idx in range(self.num_encs):\n            hs, _, _ = self.enc[idx](xs_list[idx], ilens_list[idx])\n            hs_list.append(hs[0])\n        return hs_list\n\n    def recognize(self, x_list, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        :param list x_list: list of input acoustic feature [(T1, D), (T2, D), ...]\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        hs_list = self.encode(x_list)\n        # calculate log P(z_t|X) for CTC scores\n        if recog_args.ctc_weight > 0.0:\n            if self.share_ctc:\n                lpz_list = [\n                    self.ctc[0].log_softmax(hs_list[idx].unsqueeze(0))[0]\n                    for idx in range(self.num_encs)\n                ]\n            else:\n                lpz_list = [\n                    self.ctc[idx].log_softmax(hs_list[idx].unsqueeze(0))[0]\n                    for idx in range(self.num_encs)\n                ]\n        else:\n            lpz_list = None\n\n        # 2. 
Decoder\n        # decode the first utterance\n        y = self.dec.recognize_beam(hs_list, lpz_list, recog_args, char_list, rnnlm)\n        return y\n\n    def recognize_batch(self, xs_list, recog_args, char_list, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        :param list xs_list: list of list of input acoustic feature arrays\n                [[(T1_1, D), (T1_2, D), ...],[(T2_1, D), (T2_2, D), ...], ...]\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n        ilens_list = [\n            np.fromiter((xx.shape[0] for xx in xs_list[idx]), dtype=np.int64)\n            for idx in range(self.num_encs)\n        ]\n\n        # subsample frame\n        xs_list = [\n            [xx[:: self.subsample_list[idx][0], :] for xx in xs_list[idx]]\n            for idx in range(self.num_encs)\n        ]\n\n        xs_list = [\n            [to_device(self, to_torch_tensor(xx).float()) for xx in xs_list[idx]]\n            for idx in range(self.num_encs)\n        ]\n        xs_pad_list = [pad_list(xs_list[idx], 0.0) for idx in range(self.num_encs)]\n\n        # 1. 
Encoder\n        hs_pad_list, hlens_list = [], []\n        for idx in range(self.num_encs):\n            hs_pad, hlens, _ = self.enc[idx](xs_pad_list[idx], ilens_list[idx])\n            hs_pad_list.append(hs_pad)\n            hlens_list.append(hlens)\n\n        # calculate log P(z_t|X) for CTC scores\n        if recog_args.ctc_weight > 0.0:\n            if self.share_ctc:\n                lpz_list = [\n                    self.ctc[0].log_softmax(hs_pad_list[idx])\n                    for idx in range(self.num_encs)\n                ]\n            else:\n                lpz_list = [\n                    self.ctc[idx].log_softmax(hs_pad_list[idx])\n                    for idx in range(self.num_encs)\n                ]\n            normalize_score = False\n        else:\n            lpz_list = None\n            normalize_score = True\n\n        # 2. Decoder\n        hlens_list = [\n            torch.tensor(list(map(int, hlens_list[idx])))\n            for idx in range(self.num_encs)\n        ]  # make sure hlens is tensor\n        y = self.dec.recognize_beam_batch(\n            hs_pad_list,\n            hlens_list,\n            lpz_list,\n            recog_args,\n            char_list,\n            rnnlm,\n            normalize_score=normalize_score,\n        )\n\n        if prev:\n            self.train()\n        return y\n\n    def calculate_all_attentions(self, xs_pad_list, ilens_list, ys_pad):\n        \"\"\"E2E attention calculation.\n\n        :param List xs_pad_list: list of batch (torch.Tensor) of padded input sequences\n                                [(B, Tmax_1, idim), (B, Tmax_2, idim),..]\n        :param List ilens_list:\n            list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, Lmax)\n        :return: attention weights with the following shape,\n            1) multi-head case => attention weights (B, H, Lmax, Tmax),\n            
2) multi-encoder case\n                => [(B, Lmax, Tmax1), (B, Lmax, Tmax2), ..., (B, Lmax, NumEncs)]\n            3) other case => attention weights (B, Lmax, Tmax).\n        :rtype: float ndarray or list\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            # 1. Encoder\n            if self.replace_sos:\n                tgt_lang_ids = ys_pad[:, 0:1]\n                ys_pad = ys_pad[:, 1:]  # remove target language ID in the beginning\n            else:\n                tgt_lang_ids = None\n\n            hs_pad_list, hlens_list = [], []\n            for idx in range(self.num_encs):\n                hs_pad, hlens, _ = self.enc[idx](xs_pad_list[idx], ilens_list[idx])\n                hs_pad_list.append(hs_pad)\n                hlens_list.append(hlens)\n\n            # 2. Decoder\n            att_ws = self.dec.calculate_all_attentions(\n                hs_pad_list, hlens_list, ys_pad, lang_ids=tgt_lang_ids\n            )\n        self.train()\n        return att_ws\n\n    def calculate_all_ctc_probs(self, xs_pad_list, ilens_list, ys_pad):\n        \"\"\"E2E CTC probability calculation.\n\n        :param List xs_pad_list: list of batch (torch.Tensor) of padded input sequences\n                                [(B, Tmax_1, idim), (B, Tmax_2, idim),..]\n        :param List ilens_list:\n            list of batch (torch.Tensor) of lengths of input sequences [(B), (B), ..]\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, Lmax)\n        :return: CTC probability (B, Tmax, vocab)\n        :rtype: float ndarray or list\n        \"\"\"\n        probs_list = [None]\n        if self.mtlalpha == 0:\n            return probs_list\n\n        self.eval()\n        probs_list = []\n        with torch.no_grad():\n            # 1. Encoder\n            for idx in range(self.num_encs):\n                hs_pad, hlens, _ = self.enc[idx](xs_pad_list[idx], ilens_list[idx])\n\n                # 2. 
CTC loss\n                ctc_idx = 0 if self.share_ctc else idx\n                probs = self.ctc[ctc_idx].softmax(hs_pad).cpu().numpy()\n                probs_list.append(probs)\n        self.train()\n        return probs_list\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_transducer.py",
    "content": "\"\"\"Transducer speech recognition model (pytorch).\"\"\"\n\nfrom argparse import Namespace\nfrom collections import Counter\nfrom dataclasses import asdict\nfrom functools import partial\nfrom concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED\n\nimport logging\nimport math\nimport numpy\nimport functools\nimport chainer\nimport torch\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.pytorch_backend.ctc import ctc_for\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.transducer.arguments import (\n    add_encoder_general_arguments,  # noqa: H301\n    add_rnn_encoder_arguments,  # noqa: H301\n    add_custom_encoder_arguments,  # noqa: H301\n    add_decoder_general_arguments,  # noqa: H301\n    add_rnn_decoder_arguments,  # noqa: H301\n    add_custom_decoder_arguments,  # noqa: H301\n    add_custom_training_arguments,  # noqa: H301\n    add_transducer_arguments,  # noqa: H301\n    add_auxiliary_task_arguments,  # noqa: H301\n    add_att_scorer_arguments,\n)\nfrom espnet.nets.pytorch_backend.transducer.auxiliary_task import AuxiliaryTask\nfrom espnet.nets.pytorch_backend.transducer.custom_decoder import CustomDecoder\nfrom espnet.nets.pytorch_backend.transducer.custom_encoder import CustomEncoder\nfrom espnet.nets.pytorch_backend.transducer.error_calculator import ErrorCalculator\nfrom espnet.nets.pytorch_backend.transducer.initializer import initializer\nfrom espnet.nets.pytorch_backend.transducer.joint_network import JointNetwork\nfrom espnet.nets.pytorch_backend.transducer.loss import TransLoss\nfrom espnet.nets.pytorch_backend.transducer.rnn_decoder import DecoderRNNT\nfrom espnet.nets.pytorch_backend.transducer.rnn_encoder import encoder_for\nfrom espnet.nets.pytorch_backend.transducer.utils import prepare_loss_inputs\nfrom espnet.nets.pytorch_backend.transducer.utils import 
valid_aux_task_layer_list\nfrom espnet.nets.pytorch_backend.transformer.attention import (\n    MultiHeadedAttention,  # noqa: H301\n    RelPositionMultiHeadedAttention,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.label_smoothing_loss import (\n    LabelSmoothingLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.mask import target_mask\nfrom espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\nfrom espnet.utils.fill_missing_args import fill_missing_args\nfrom espnet.snowfall.warpper.warpper_mmi import K2MMI\nfrom espnet.snowfall.warpper.warpper_ctc import K2CTC\nfrom espnet.nets.beam_search_transducer import BeamSearchTransducer\nfrom espnet.nets.pytorch_backend.transformer.decoder import Decoder\nfrom espnet.nets.pytorch_backend.transformer.label_smoothing_loss import (\n    LabelSmoothingLoss,  # noqa: H301\n)\n\nimport editdistance\n\nclass Reporter(chainer.Chain):\n    \"\"\"A chainer reporter wrapper for transducer models.\"\"\"\n\n    def report(\n        self,\n        loss,\n        loss_trans,\n        loss_ctc,\n        loss_lm,\n        loss_aux_trans,\n        loss_aux_symm_kl,\n        loss_mbr,\n        loss_mmi,\n        loss_att,\n        cer,\n        wer,\n    ):\n        \"\"\"Instantiate reporter attributes.\"\"\"\n        chainer.reporter.report({\"loss\": loss}, self)\n        chainer.reporter.report({\"loss_trans\": loss_trans}, self)\n        chainer.reporter.report({\"loss_ctc\": loss_ctc}, self)\n        chainer.reporter.report({\"loss_lm\": loss_lm}, self)\n        chainer.reporter.report({\"loss_aux_trans\": loss_aux_trans}, self)\n        chainer.reporter.report({\"loss_aux_symm_kl\": loss_aux_symm_kl}, self)\n        chainer.reporter.report({\"loss_mbr\": loss_mbr}, self)\n        chainer.reporter.report({\"loss_mmi\": loss_mmi}, self)\n        chainer.reporter.report({\"loss_att\": loss_att}, self)\n        chainer.reporter.report({\"cer\": cer}, self)\n        
chainer.reporter.report({\"wer\": wer}, self)\n\n        logging.info(\"loss:\" + str(loss))\n\n\nclass E2E(ASRInterface, torch.nn.Module):\n    \"\"\"E2E module for transducer models.\n\n    Args:\n        idim (int): dimension of inputs\n        odim (int): dimension of outputs\n        args (Namespace): argument Namespace containing options\n        ignore_id (int): padding symbol id\n        blank_id (int): blank symbol id\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments for transducer model.\"\"\"\n        E2E.encoder_add_general_arguments(parser)\n        E2E.encoder_add_rnn_arguments(parser)\n        E2E.encoder_add_custom_arguments(parser)\n\n        E2E.decoder_add_general_arguments(parser)\n        E2E.decoder_add_rnn_arguments(parser)\n        E2E.decoder_add_custom_arguments(parser)\n\n        E2E.training_add_custom_arguments(parser)\n        E2E.transducer_add_arguments(parser)\n        E2E.auxiliary_task_add_arguments(parser)\n\n        E2E.att_scorer_arguments(parser)\n        return parser\n\n    @staticmethod\n    def att_scorer_arguments(parser):\n        \"\"\"Add attention scorer argument.\"\"\"\n        group = parser.add_argument_group(\"Attention scorer arguments\")\n        group = add_att_scorer_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def encoder_add_general_arguments(parser):\n        \"\"\"Add general arguments for encoder.\"\"\"\n        group = parser.add_argument_group(\"Encoder general arguments\")\n        group = add_encoder_general_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def encoder_add_rnn_arguments(parser):\n        \"\"\"Add arguments for RNN encoder.\"\"\"\n        group = parser.add_argument_group(\"RNN encoder arguments\")\n        group = add_rnn_encoder_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def encoder_add_custom_arguments(parser):\n        \"\"\"Add arguments for Custom encoder.\"\"\"\n      
  group = parser.add_argument_group(\"Custom encoder arguments\")\n        group = add_custom_encoder_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def decoder_add_general_arguments(parser):\n        \"\"\"Add general arguments for decoder.\"\"\"\n        group = parser.add_argument_group(\"Decoder general arguments\")\n        group = add_decoder_general_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def decoder_add_rnn_arguments(parser):\n        \"\"\"Add arguments for RNN decoder.\"\"\"\n        group = parser.add_argument_group(\"RNN decoder arguments\")\n        group = add_rnn_decoder_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def decoder_add_custom_arguments(parser):\n        \"\"\"Add arguments for Custom decoder.\"\"\"\n        group = parser.add_argument_group(\"Custom decoder arguments\")\n        group = add_custom_decoder_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def training_add_custom_arguments(parser):\n        \"\"\"Add arguments for Custom architecture training.\"\"\"\n        group = parser.add_argument_group(\"Training arguments for custom archictecture\")\n        group = add_custom_training_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def transducer_add_arguments(parser):\n        \"\"\"Add arguments for transducer model.\"\"\"\n        group = parser.add_argument_group(\"Transducer model arguments\")\n        group = add_transducer_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def auxiliary_task_add_arguments(parser):\n        \"\"\"Add arguments for auxiliary task.\"\"\"\n        group = parser.add_argument_group(\"Auxiliary task arguments\")\n        group = add_auxiliary_task_arguments(group)\n\n        return parser\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Get attention plot class.\"\"\"\n        return PlotAttentionReport\n\n    def get_total_subsampling_factor(self):\n    
    \"\"\"Get total subsampling factor.\"\"\"\n        if self.etype == \"custom\":\n            return self.encoder.conv_subsampling_factor * int(\n                numpy.prod(self.subsample)\n            )\n        else:\n            return self.enc.conv_subsampling_factor * int(numpy.prod(self.subsample))\n\n    def __init__(self, idim, odim, args, ignore_id=-1, blank_id=0, training=True):\n        \"\"\"Construct an E2E object for transducer model.\"\"\"\n        torch.nn.Module.__init__(self)\n        \n        args = fill_missing_args(args, self.add_arguments)\n\n        self.is_rnnt = True\n        self.transducer_weight = args.transducer_weight\n\n        self.use_aux_task = (\n            True if (args.aux_task_type is not None and training) else False\n        )\n\n        self.use_aux_ctc = args.aux_ctc #and training\n        self.aux_ctc_weight = args.aux_ctc_weight\n\n        self.use_aux_mmi = args.aux_mmi #and training\n        self.aux_mmi_weight = args.aux_mmi_weight\n\n        self.use_aux_cross_entropy = args.aux_cross_entropy #and training\n        self.aux_cross_entropy_weight = args.aux_cross_entropy_weight\n\n        self.use_aux_mbr = args.aux_mbr\n        self.aux_mbr_weight = args.aux_mbr_weight\n        self.aux_mbr_beam = args.aux_mbr_beam\n\n        self.use_att_scorer = args.att_scorer_weight > 0.0\n        self.att_scorer_weight = args.att_scorer_weight\n\n        if self.use_aux_task:\n            n_layers = (\n                (len(args.enc_block_arch) * args.enc_block_repeat - 1)\n                if args.enc_block_arch is not None\n                else (args.elayers - 1)\n            )\n\n            aux_task_layer_list = valid_aux_task_layer_list(\n                args.aux_task_layer_list,\n                n_layers,\n            )\n        else:\n            aux_task_layer_list = []\n\n        if \"custom\" in args.etype:\n            if args.enc_block_arch is None:\n                raise ValueError(\n                    \"When 
specifying custom encoder type, --enc-block-arch \"\n                    \"should also be specified in training config. See \"\n                    \"egs/vivos/asr1/conf/transducer/train_*.yaml for more info.\"\n                )\n\n            self.subsample = get_subsample(args, mode=\"asr\", arch=\"transformer\")\n\n            self.encoder = CustomEncoder(\n                idim,\n                args.enc_block_arch,\n                input_layer=args.custom_enc_input_layer,\n                repeat_block=args.enc_block_repeat,\n                self_attn_type=args.custom_enc_self_attn_type,\n                positional_encoding_type=args.custom_enc_positional_encoding_type,\n                positionwise_activation_type=args.custom_enc_pw_activation_type,\n                conv_mod_activation_type=args.custom_enc_conv_mod_activation_type,\n                aux_task_layer_list=aux_task_layer_list,\n            )\n            encoder_out = self.encoder.enc_out\n\n            self.most_dom_list = args.enc_block_arch[:]\n        else:\n            self.subsample = get_subsample(args, mode=\"asr\", arch=\"rnn-t\")\n\n            self.enc = encoder_for(\n                args,\n                idim,\n                self.subsample,\n                aux_task_layer_list=aux_task_layer_list,\n            )\n            encoder_out = args.eprojs\n\n        if \"custom\" in args.dtype:\n            if args.dec_block_arch is None:\n                raise ValueError(\n                    \"When specifying custom decoder type, --dec-block-arch \"\n                    \"should also be specified in training config. 
See \"\n                    \"egs/vivos/asr1/conf/transducer/train_*.yaml for more info.\"\n                )\n\n            self.decoder = CustomDecoder(\n                odim,\n                args.dec_block_arch,\n                input_layer=args.custom_dec_input_layer,\n                repeat_block=args.dec_block_repeat,\n                positionwise_activation_type=args.custom_dec_pw_activation_type,\n                dropout_rate_embed=args.dropout_rate_embed_decoder,\n            )\n            decoder_out = self.decoder.dunits\n\n            if \"custom\" in args.etype:\n                self.most_dom_list += args.dec_block_arch[:]\n            else:\n                self.most_dom_list = args.dec_block_arch[:]\n        else:\n            self.dec = DecoderRNNT(\n                odim,\n                args.dtype,\n                args.dlayers,\n                args.dunits,\n                blank_id,\n                args.dec_embed_dim,\n                args.dropout_rate_decoder,\n                args.dropout_rate_embed_decoder,\n            )\n            decoder_out = args.dunits\n\n        self.joint_network = JointNetwork(\n            odim, encoder_out, decoder_out, args.joint_dim, args.joint_activation_type\n        )\n\n        # Attention rescoring (use_att_scorer is a bool: att_scorer_weight > 0.0)\n        if self.use_att_scorer:\n            self.att_scorer = Decoder(\n                odim=odim,\n                selfattention_layer_type=args.att_decoder_selfattn_layer_type,\n                attention_dim=args.att_adim,\n                attention_heads=args.att_aheads,\n                conv_wshare=args.att_wshare,\n                conv_kernel_length=args.att_ldconv_decoder_kernel_length,\n                conv_usebias=args.att_ldconv_usebias,\n                linear_units=args.att_dunits,\n                num_blocks=args.att_dlayers,\n                dropout_rate=args.att_dropout_rate,\n                positional_dropout_rate=args.att_dropout_rate,\n                
self_attention_dropout_rate=args.att_attn_dropout_rate,\n                src_attention_dropout_rate=args.att_attn_dropout_rate,\n            )\n            self.att_scorer_criterion = LabelSmoothingLoss(\n                odim,\n                ignore_id,\n                args.lsm_weight,\n                args.att_length_normalized_loss,\n            )\n        else:\n            self.att_scorer = None\n            self.att_scorer_criterion = None\n\n        if hasattr(self, \"most_dom_list\"):\n            self.most_dom_dim = sorted(\n                Counter(\n                    d[\"d_hidden\"] for d in self.most_dom_list if \"d_hidden\" in d\n                ).most_common(),\n                key=lambda x: x[0],\n                reverse=True,\n            )[0][0]\n\n        self.etype = args.etype\n        self.dtype = args.dtype\n\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.blank_id = blank_id\n        self.ignore_id = ignore_id\n\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n\n        self.odim = odim\n\n        self.reporter = Reporter()\n\n        self.error_calculator = None\n\n        self.default_parameters(args)\n\n        self.criterion = TransLoss(args.trans_type, self.blank_id)\n        if training:\n\n            decoder = self.decoder if self.dtype == \"custom\" else self.dec\n\n            if args.report_cer or args.report_wer:\n                self.error_calculator = ErrorCalculator(\n                    decoder,\n                    self.joint_network,\n                    args.char_list,\n                    args.sym_space,\n                    args.sym_blank,\n                    args.report_cer,\n                    args.report_wer,\n                )\n\n            if self.use_aux_task:\n                self.auxiliary_task = AuxiliaryTask(\n                    decoder,\n                    self.joint_network,\n                    self.criterion,\n                    
args.aux_task_type,\n                    args.aux_task_weight,\n                    encoder_out,\n                    args.joint_dim,\n                )\n\n        if self.use_aux_ctc:\n            self.aux_ctc = ctc_for(\n                Namespace(\n                    num_encs=1,\n                    eprojs=encoder_out,\n                    dropout_rate=args.aux_ctc_dropout_rate,\n                    ctc_type=\"warpctc\",\n                ),\n                odim,\n            )\n\n        if self.use_aux_mmi:\n            # assert self.use_aux_ctc # ctc is needed for aishell-1 but not for librispeech\n            device = torch.device(f\"cuda:{args.local_rank}\") if torch.cuda.is_available() else torch.device(\"cpu\")\n            aux_mmi_module = K2MMI if args.aux_mmi_type == \"mmi\" else K2CTC\n            self.aux_mmi=aux_mmi_module(idim=encoder_out,\n                         lang=args.lang,\n                         char_list=args.char_list,\n                         device=device,\n                         dropout=args.aux_mmi_dropout_rate,\n                         den_scale=args.den_scale,\n                         eos_id=self.eos,\n                         use_segment=args.use_segment)\n\n        if self.use_aux_cross_entropy:\n            self.aux_decoder_output = torch.nn.Linear(decoder_out, odim)\n\n            self.aux_cross_entropy = LabelSmoothingLoss(\n                odim, ignore_id, args.aux_cross_entropy_smoothing\n            )\n\n        if self.use_aux_mbr:\n            assert args.resume is not None # need a seed model\n            self.beam_search = BeamSearchTransducer(\n                decoder=self.decoder if \"custom\" in self.dtype else self.dec,\n                joint_network=self.joint_network,\n                beam_size=self.aux_mbr_beam,\n                nbest=self.aux_mbr_beam,\n                search_type='alsd',\n            ) \n            self.char_list = args.char_list\n\n            self.mbr_trans_type = args.trans_type\n    
        if args.trans_type == \"warp-transducer\":\n                from warprnnt_pytorch import RNNTLoss\n                self.mbr_trans_loss = RNNTLoss(blank=self.blank_id, reduction=\"none\")\n            elif args.trans_type == \"warp-rnnt\":\n                from warp_rnnt import rnnt_loss\n                self.mbr_trans_loss = rnnt_loss\n            print(\"built beam search decoder for MBR\") \n\n        self.loss = None\n        self.rnnlm = None\n\n    def default_parameters(self, args):\n        \"\"\"Initialize/reset parameters for transducer.\n\n        Args:\n            args (Namespace): argument Namespace containing options\n\n        \"\"\"\n        initializer(self, args)\n\n    def forward(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n        \"\"\"E2E forward.\n\n        Args:\n            xs_pad (torch.Tensor): batch of padded source sequences (B, Tmax, idim)\n            ilens (torch.Tensor): batch of lengths of input sequences (B)\n            ys_pad (torch.Tensor): batch of padded target sequences (B, Lmax)\n\n        Returns:\n            loss (torch.Tensor): transducer loss value\n\n        \"\"\"\n        # 1. encoder\n        xs_pad = xs_pad[:, : max(ilens)]\n\n        if \"custom\" in self.etype:\n            src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2)\n\n            _hs_pad, hs_mask = self.encoder(xs_pad, src_mask)\n        else:\n            _hs_pad, hs_mask, _ = self.enc(xs_pad, ilens)\n\n        if self.use_aux_task:\n            hs_pad, aux_hs_pad = _hs_pad[0], _hs_pad[1]\n        else:\n            hs_pad, aux_hs_pad = _hs_pad, None\n\n        # 1.5. transducer preparation related\n        ys_in_pad, ys_out_pad, target, pred_len, target_len = prepare_loss_inputs(\n            ys_pad, hs_mask\n        )\n        \"\"\"\n        ys_in_pad : ys with blank_id in head. For decoder forward\n        ys_out_pad : ys with ignore_id in tail. 
For aux task \n        target: ys with padding only, for RNNT loss computation\n        pred_len: real length of hs_mask\n        target_len: real length of target \n        \"\"\"\n\n        if self.use_aux_mbr:\n            loss_mbr = self.mbr_forward(xs_pad_orig, ilens, ys_pad, hs_pad)\n            loss_mbr *= self.aux_mbr_weight\n        else:\n            loss_mbr = 0.0\n\n        # 2. decoder\n        if \"custom\" in self.dtype:\n            ys_mask = target_mask(ys_in_pad, self.blank_id)\n            pred_pad, _ = self.decoder(ys_in_pad, ys_mask)\n        else:\n            pred_pad = self.dec(hs_pad, ys_in_pad)\n\n        z = self.joint_network(hs_pad.unsqueeze(2), pred_pad.unsqueeze(1))\n\n        # 3. loss computation\n        loss_trans = self.criterion(z, target, pred_len, target_len)\n\n        if self.use_aux_task and aux_hs_pad is not None:\n            loss_aux_trans, loss_aux_symm_kl = self.auxiliary_task(\n                aux_hs_pad, pred_pad, z, target, pred_len, target_len\n            )\n        else:\n            loss_aux_trans, loss_aux_symm_kl = 0.0, 0.0\n\n        if self.use_aux_ctc or self.use_aux_mmi:\n            if \"custom\" in self.etype:\n                hlen = torch.IntTensor(\n                    [h.size(1) for h in hs_mask],\n                ).to(hs_mask.device)\n\n        if self.use_aux_ctc:\n            loss_ctc = self.aux_ctc_weight * self.aux_ctc(hs_pad, hlen, ys_pad, texts)\n        else:\n            loss_ctc = 0.0\n\n        if self.use_aux_mmi:\n            loss_mmi = self.aux_mmi_weight * self.aux_mmi(hs_pad, hlen, ys_pad, texts)\n        else:\n            loss_mmi = 0.0\n\n        if self.use_aux_cross_entropy:\n            loss_lm = self.aux_cross_entropy_weight * self.aux_cross_entropy(\n                self.aux_decoder_output(pred_pad), ys_out_pad\n            )\n        else:\n            loss_lm = 0.0\n\n        if self.use_att_scorer:\n            ys_mask = target_mask(ys_in_pad, self.ignore_id)\n            
pred_pad, _ = self.att_scorer(ys_in_pad, ys_mask, hs_pad, hs_mask)\n            loss_att = self.att_scorer_criterion(pred_pad, ys_out_pad)\n            loss_att *= self.att_scorer_weight\n        else:\n            loss_att = 0.0\n\n        loss = (\n            loss_trans\n            + self.transducer_weight * (loss_aux_trans + loss_aux_symm_kl)\n            + loss_ctc\n            + loss_mmi\n            + loss_lm\n            + loss_mbr\n            + loss_att\n        )\n\n        self.loss = loss\n        loss_data = float(loss)\n\n        # 4. compute cer/wer\n        if self.training or self.error_calculator is None:\n            cer, wer = None, None\n        else:\n            cer, wer = self.error_calculator(hs_pad, ys_pad)\n\n        if not math.isnan(loss_data):\n            self.reporter.report(\n                loss_data,\n                float(loss_trans),\n                float(loss_ctc),\n                float(loss_lm),\n                float(loss_aux_trans),\n                float(loss_aux_symm_kl),\n                float(loss_mbr),\n                float(loss_mmi),\n                float(loss_att),\n                cer,\n                wer,\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n\n        \"\"\" \n        # draw figures        \n        T = hs_pad.size(1)\n        mmi_array, trans_array = [], []\n        for t in range(1, T+1):\n            hlen = torch.Tensor([t]).int().cuda()\n            mmi_loss_item = - self.aux_mmi(hs_pad[:, :t], hlen, ys_pad[:, :-2], texts).item()\n            loss_trans_item = - self.criterion(z[:, :t, :-2].contiguous(), target[:, :-2], hlen, target_len-2).item()\n            mmi_array.append(mmi_loss_item)\n            trans_array.append(loss_trans_item)\n        print(mmi_array, trans_array)\n\n        print(texts[0])\n        import uuid\n        this_uuid = uuid.uuid4()\n        filename = f\"figures/{this_uuid}.png\"\n        print(f\"plot save in 
{filename}\")\n\n        import matplotlib\n\n        matplotlib.use(\"Agg\")\n        import matplotlib.pyplot as plt\n        # plt.style.use('seaborn-whitegrid')\n        palette = plt.get_cmap('Set1')\n        font1 = {'family' : 'Times New Roman',\n        'weight' : 'normal',\n        'size'   : 18,\n        }\n\n        plt.clf()\n        axis = range(1, len(mmi_array) + 1)\n        plt.plot(mmi_array, label=\"LF-MMI\", color=\"red\", marker='*')\n        plt.plot(trans_array, label=\"NT\", color=\"blue\", marker='v')\n        plt.xlabel(\"Frame Index t\", fontsize=14)\n        plt.ylabel(\"Log-Posterior\", fontsize=14)\n        plt.xticks([162, 163], fontsize=10)\n        plt.yticks([-80, -50, -20], fontsize=10)\n        plt.vlines(162, -100, 0, color=\"black\", linestyles = \"dashed\")\n        plt.vlines(163, -100, 0, color=\"black\", linestyles = \"dashed\") \n        plt.xlim((154, 175))\n        plt.ylim((-80, 0))\n        plt.legend(loc='upper left', fontsize=14)       \n\n        # plt.grid() \n        plt.tight_layout()\n        plt.savefig(filename)\n        \"\"\"\n\n        return self.loss\n\n        \n\n    def mbr_forward(self, xs_pad_orig, ilens, ys_pad, hs_pad):\n        batch_size = len(ilens)\n        \n        # (1) on-the-fly decoding\n        self.eval()\n        with torch.no_grad():\n            # decode without data augmentation (a.k.a., xs_pad_orig)\n            if \"custom\" in self.etype:\n                src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad_orig.device).unsqueeze(-2)\n                specs, hs_mask = self.encoder(xs_pad_orig, src_mask)\n            else:\n                specs, hs_mask, _ = self.enc(xs_pad_orig, ilens)           \n\n\n            hs = [h[h != 0] for h in hs_mask]\n            hlens = list(map(int, [h.size(0) for h in hs]))\n            specs = [h[:l] for h, l in zip(specs, hlens)]\n\n            # multi-thread on-the-fly decoding on GPU\n            \"\"\"\n            It is very inefficient to 
do on-the-fly decoding.\n            We tried multi-processing, but it failed because the dataloader cannot work\n              in a forked process.\n            Multi-threading is used instead. Remember to use 'export OMP_NUM_THREADS=<ncpu>'\n              to achieve faster decoding speed.\n            \"\"\"\n            with ThreadPoolExecutor(max_workers=self.aux_mbr_beam) as executor:\n                futures = [executor.submit(self.beam_search, h) for h in specs]\n                wait(futures, return_when=ALL_COMPLETED)\n            \n                hyps = []\n                for future in futures:\n                    hyps.extend(future.result()) \n                hyps = [h.yseq[1:] for h in hyps] # exclude <sos>\n    \n                # for debug\n                # for i, y in enumerate(ys_pad):\n                #     ref_text = \"\".join([self.char_list[x] for x in y if x != self.ignore_id])\n                #     print(f\"ref_text: {ref_text}\")\n                #     for y in hyps[i * self.aux_mbr_beam: (i+1) * self.aux_mbr_beam]:\n                #         hyp_text = \"\".join([self.char_list[x] for x in y if x != self.blank_id])\n                #         print(f\"hyp_text: {hyp_text}\")\n\n        self.train()\n\n        if not len(hyps) == self.aux_mbr_beam * batch_size:\n            print(\"WARNING: on-the-fly decoding failed in this iteration.\")\n            return 0.0\n\n        # (2) compute edit distance\n        dist = self.compute_edit_distance(hyps, ys_pad)\n \n        if dist is None:\n            print(\"WARNING: an error was encountered when computing edit distance\", flush=True)\n            return 0.0 # fail in editdistance.  
\n\n        # (3) RNN-T loss computation\n        # prepare many inputs\n        hyp_maxlen = max([len(hyp) for hyp in hyps])\n        hyps_pad = [hyp + [self.ignore_id] * (hyp_maxlen - len(hyp)) for hyp in hyps]\n        hyps_pad = torch.Tensor(hyps_pad).to(ys_pad.device).to(ys_pad.dtype)\n\n        hyps_in_pad, hyps_out_pad, target, pred_len, target_len = prepare_loss_inputs(\n                                                              hyps_pad, hs_mask) \n        \n        idx = torch.arange(self.aux_mbr_beam * batch_size) // self.aux_mbr_beam\n        pred_len = pred_len[idx]\n        hs_pad = hs_pad[idx]\n\n        # decoder and joint-net forward\n        \"\"\" We are not sure which hs_pad should be used in decoder forward \n            Currently we are using the hs_pad from xs_pad, since we consider\n            the encoder should also receive the gradient from denominator\n        \"\"\"\n        if \"custom\" in self.dtype:\n            hyps_mask = target_mask(hyps_in_pad, self.blank_id)\n            pred_pad, _ = self.decoder(hyps_in_pad, hyps_mask, hs_pad)\n        else:\n            pred_pad = self.dec(hs_pad, hyps_in_pad)\n\n        z = self.joint_network(hs_pad.unsqueeze(2), pred_pad.unsqueeze(1))\n\n        # loss computation\n        # we need reduction = 'none' for utt-level probability\n        # code for warp-rnnt is not tested\n        if self.mbr_trans_type == \"warp-rnnt\":\n            log_probs = torch.log_softmax(z, dim=-1)\n            loss_trans = self.mbr_trans_loss(\n                log_probs,\n                target,\n                pred_len,\n                target_len,\n                reduction=None,\n                blank=self.blank_id,\n                gather=True,\n            )\n        elif self.mbr_trans_type == \"warp-transducer\":\n            loss_trans = self.mbr_trans_loss(z, target, pred_len, target_len)\n\n        # This is exactly posterior P(W|O) \n        loss_trans = (-loss_trans).exp()\n        # 
print(\"probability: \", loss_trans)\n        # print(\"edit distance: \", dist)\n \n        # (4) MBR loss. \n        num = (loss_trans * dist).view(batch_size, self.aux_mbr_beam)\n        den = loss_trans.view(batch_size, self.aux_mbr_beam)\n        loss_mbr = num.sum(dim=-1) / den.sum(dim=-1)\n        loss_mbr = loss_mbr.mean() # RNN-T Loss also works in reduction=mean\n        \n        return loss_mbr \n \n    def compute_edit_distance(self, hyps, refs):\n        # hyps: list of list with number batch * beam\n        # refs: 2-D tensor of labels. -1 means padding\n  \n        # convert refs into list and remove padding \n        refs_device = refs.device\n        refs = refs.cpu().tolist()\n        refs = [[x for x in t if x != self.ignore_id] for t in refs]\n         \n        if not len(hyps) % len(refs) == 0:\n            raise ValueError(\"The number of hypotheses is not correct\")\n\n        beam = int(len(hyps) / len(refs))\n\n        dist = [editdistance.eval(hyp, refs[i//beam]) \n                for i, hyp in enumerate(hyps)\n               ] \n        dist = torch.IntTensor(dist).to(refs_device)\n        return dist\n\n    def encode_custom(self, x):\n        \"\"\"Encode acoustic features.\n\n        Args:\n            x (ndarray): input acoustic feature (T, D)\n\n        Returns:\n            x (torch.Tensor): encoded features (T, D_enc)\n\n        \"\"\"\n        x = torch.as_tensor(x).unsqueeze(0)\n        enc_output, _ = self.encoder(x, None)\n\n        return enc_output.squeeze(0)\n\n    def encode_rnn(self, x):\n        \"\"\"Encode acoustic features.\n\n        Args:\n            x (ndarray): input acoustic feature (T, D)\n\n        Returns:\n            x (torch.Tensor): encoded features (T, D_enc)\n\n        \"\"\"\n        p = next(self.parameters())\n\n        ilens = [x.shape[0]]\n        x = x[:: self.subsample[0], :]\n\n        h = torch.as_tensor(x, device=p.device, dtype=p.dtype)\n        hs = h.contiguous().unsqueeze(0)\n\n        
hs, _, _ = self.enc(hs, ilens)\n\n        return hs.squeeze(0)\n\n    def recognize(self, x, beam_search, decode_feature=\"combine\"):\n        \"\"\"Recognize input features.\n\n        Args:\n            x (ndarray): input acoustic feature (T, D)\n            beam_search (class): beam search class\n\n        Returns:\n            nbest_hyps (list): n-best decoding results\n\n        \"\"\"\n        assert decode_feature == \"combine\" # other method only for code-switch\n\n        self.eval()\n\n        if \"custom\" in self.etype:\n            h = self.encode_custom(x)\n        else:\n            h = self.encode_rnn(x)\n\n        nbest_hyps = beam_search(h)\n        return [asdict(n) for n in nbest_hyps]\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n        \"\"\"E2E attention calculation.\n\n        Args:\n            xs_pad (torch.Tensor): batch of padded input sequences (B, Tmax, idim)\n            ilens (torch.Tensor): batch of lengths of input sequences (B)\n            ys_pad (torch.Tensor):\n                batch of padded character id sequence tensor (B, Lmax)\n\n        Returns:\n            ret (ndarray): attention weights with the following shape,\n                1) multi-head case => attention weights (B, H, Lmax, Tmax),\n                2) other case => attention weights (B, Lmax, Tmax).\n\n        \"\"\"\n        self.eval()\n\n        if \"custom\" not in self.etype and \"custom\" not in self.dtype:\n            return []\n        else:\n            with torch.no_grad():\n                self.forward(xs_pad, ilens, ys_pad, texts, xs_pad_orig)\n\n            ret = dict()\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention) or isinstance(\n                    m, RelPositionMultiHeadedAttention\n                ):\n                    ret[name] = m.attn.cpu().numpy()\n\n        self.train()\n\n        return ret\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_transducer_cs.py",
    "content": "# Author: Jinchuan Tian; tianjinchuan@stu.pku.edu.cn ; tyriontian@tencent.com\n# Neural Transducer model for code-switch (bilingual problem)\n\nfrom argparse import Namespace\nfrom collections import Counter, defaultdict\nfrom dataclasses import asdict\n\nimport torch\nimport chainer\nimport numpy\nimport math\nimport logging\nfrom itertools import groupby\nfrom typing import Tuple, List\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.pytorch_backend.ctc import ctc_for\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.transducer.arguments import (\n    add_encoder_general_arguments,  # noqa: H301\n    add_custom_encoder_arguments,  # noqa: H301\n    add_decoder_general_arguments,  # noqa: H301\n    add_rnn_decoder_arguments,  # noqa: H301\n    add_custom_training_arguments,  # noqa: H301\n    add_transducer_arguments,  # noqa: H301\n    add_auxiliary_task_arguments,  # noqa: H301\n    add_att_scorer_arguments,\n    add_transducer_code_switch_arguments,\n)\nfrom espnet.nets.pytorch_backend.transducer.custom_encoder import CustomEncoder\nfrom espnet.nets.pytorch_backend.transducer.error_calculator import ErrorCalculator\nfrom espnet.nets.pytorch_backend.transducer.initializer import initializer\nfrom espnet.nets.pytorch_backend.transducer.joint_network import JointNetwork\nfrom espnet.nets.pytorch_backend.transducer.loss import TransLoss\nfrom espnet.nets.pytorch_backend.transducer.rnn_decoder import DecoderRNNT\nfrom espnet.nets.pytorch_backend.transformer.attention import (\n    MultiHeadedAttention,  # noqa: H301\n    RelPositionMultiHeadedAttention,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\nfrom espnet.utils.fill_missing_args import fill_missing_args\nfrom espnet.nets.pytorch_backend.transducer.utils import prepare_loss_inputs\nfrom 
espnet.nets.pytorch_backend.e2e_asr import pad_list\nfrom espnet.nets.transducer_decoder_interface import Hypothesis\n\n\nclass Reporter(chainer.Chain):\n    \"\"\"A chainer reporter wrapper for transducer models.\"\"\"\n\n    def report(\n        self,\n        loss,\n        loss_trans,\n        loss_ctc,\n        loss_lm,\n        loss_aux_trans,\n        loss_aux_symm_kl,\n        loss_mbr,\n        loss_mmi,\n        loss_att,\n        loss_lang,\n        cer,\n        wer,\n    ):\n        \"\"\"Instantiate reporter attributes.\"\"\"\n        chainer.reporter.report({\"loss\": loss}, self)\n        chainer.reporter.report({\"loss_trans\": loss_trans}, self)\n        chainer.reporter.report({\"loss_ctc\": loss_ctc}, self)\n        chainer.reporter.report({\"loss_lm\": loss_lm}, self)\n        chainer.reporter.report({\"loss_aux_trans\": loss_aux_trans}, self)\n        chainer.reporter.report({\"loss_aux_symm_kl\": loss_aux_symm_kl}, self)\n        chainer.reporter.report({\"loss_mbr\": loss_mbr}, self)\n        chainer.reporter.report({\"loss_mmi\": loss_mmi}, self)\n        chainer.reporter.report({\"loss_att\": loss_att}, self)\n        chainer.reporter.report({\"loss_lang\": loss_lang}, self)\n        chainer.reporter.report({\"cer\": cer}, self)\n        chainer.reporter.report({\"wer\": wer}, self)\n\n        logging.info(\"loss:\" + str(loss))\n\nclass E2E(ASRInterface, torch.nn.Module):\n    \"\"\"E2E module for transducer models.\n\n    Args:\n        idim (int): dimension of inputs\n        odim (int): dimension of outputs\n        args (Namespace): argument Namespace containing options\n        ignore_id (int): padding symbol id\n        blank_id (int): blank symbol id\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments for transducer model.\"\"\"\n        E2E.encoder_add_general_arguments(parser)\n        E2E.encoder_add_custom_arguments(parser)\n\n        E2E.decoder_add_general_arguments(parser)\n        
E2E.decoder_add_rnn_arguments(parser)\n\n        E2E.training_add_custom_arguments(parser)\n        E2E.transducer_add_arguments(parser)\n        E2E.auxiliary_task_add_arguments(parser)\n\n        E2E.transducer_add_code_switch_arguments(parser)\n        return parser\n\n    @staticmethod\n    def encoder_add_general_arguments(parser):\n        \"\"\"Add general arguments for encoder.\"\"\"\n        group = parser.add_argument_group(\"Encoder general arguments\")\n        group = add_encoder_general_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def encoder_add_custom_arguments(parser):\n        \"\"\"Add arguments for Custom encoder.\"\"\"\n        group = parser.add_argument_group(\"Custom encoder arguments\")\n        group = add_custom_encoder_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def decoder_add_general_arguments(parser):\n        \"\"\"Add general arguments for decoder.\"\"\"\n        group = parser.add_argument_group(\"Decoder general arguments\")\n        group = add_decoder_general_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def decoder_add_rnn_arguments(parser):\n        \"\"\"Add arguments for RNN decoder.\"\"\"\n        group = parser.add_argument_group(\"RNN decoder arguments\")\n        group = add_rnn_decoder_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def training_add_custom_arguments(parser):\n        \"\"\"Add arguments for Custom architecture training.\"\"\"\n        group = parser.add_argument_group(\"Training arguments for custom architecture\")\n        group = add_custom_training_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def transducer_add_arguments(parser):\n        \"\"\"Add arguments for transducer model.\"\"\"\n        group = parser.add_argument_group(\"Transducer model arguments\")\n        group = add_transducer_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def 
transducer_add_code_switch_arguments(parser):\n        \"\"\"Add code-switch arguments for transducer model.\"\"\"\n        group = parser.add_argument_group(\"Transducer code switch arguments\")\n        group = add_transducer_code_switch_arguments(group)\n\n        return parser\n\n    @staticmethod\n    def auxiliary_task_add_arguments(parser):\n        \"\"\"Add arguments for auxiliary task.\"\"\"\n        group = parser.add_argument_group(\"Auxiliary task arguments\")\n        group = add_auxiliary_task_arguments(group)\n\n        return parser\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        if self.shared_encoder:\n            return self.shared_encoder.conv_subsampling_factor * int(\n                numpy.prod(self.subsample)\n            )\n        else:\n            return self.chn_encoder.conv_subsampling_factor * int(\n                numpy.prod(self.subsample)\n            )\n\n    def __init__(self, idim, odim, args, ignore_id=-1, blank_id=0, training=True):\n        \"\"\"Construct an E2E object for transducer model.\"\"\"\n        # By default we only adopt Custom Encoder and RNN Decoder\n        torch.nn.Module.__init__(self)\n\n        args = fill_missing_args(args, self.add_arguments)\n\n        ### Common transducer configs ###\n        self.is_rnnt = True # legacy\n        self.transducer_weight = args.transducer_weight\n        self.etype = \"custom\" # legacy\n        self.dtype = \"rnn\"\n\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.blank_id = blank_id\n        self.ignore_id = ignore_id\n\n        ### code-switch parameters ###\n        self.chn_id = odim\n        self.eng_id = odim + 1\n        self.cs_id = odim + 2\n        self.chn_start = args.cs_chn_start\n        self.eng_start = args.cs_eng_start\n\n        self.use_adversial_examples = args.cs_use_adversial_examples\n        self.is_ctc_decoder = args.cs_is_ctc_decoder\n        
self.is_pretrain = args.cs_is_pretrain\n        self.use_decoder_expert = args.cs_decoder_expert       \n \n        self.aux_ctc_weight = args.aux_ctc_weight\n        self.lang_weight = args.cs_lang_weight\n\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.odim = odim\n        self.reporter = Reporter()\n        self.error_calculator = None\n\n        ### Modules ###\n        if args.cs_share_encoder:\n            self.shared_encoder = CustomEncoder(\n                idim=idim,\n                enc_arch=args.enc_block_arch,\n                input_layer=args.custom_enc_input_layer,\n                repeat_block=args.cs_share_encoder_layers,\n                self_attn_type=args.custom_enc_self_attn_type,\n                positional_encoding_type=args.custom_enc_positional_encoding_type,\n                positionwise_activation_type=args.custom_enc_pw_activation_type,\n                conv_mod_activation_type=args.custom_enc_conv_mod_activation_type,\n            )\n        else:\n            self.shared_encoder = None\n       \n        # When use shared_encoder, there is no cnn layers in chn/eng encoder \n        enc_params = dict(\n            idim=idim if not args.cs_share_encoder else self.shared_encoder.enc_out,\n            enc_arch=args.enc_block_arch,\n            input_layer=args.custom_enc_input_layer if not args.cs_share_encoder else \"null\",\n            repeat_block=args.enc_block_repeat,\n            self_attn_type=args.custom_enc_self_attn_type,\n            positional_encoding_type=args.custom_enc_positional_encoding_type,\n            positionwise_activation_type=args.custom_enc_pw_activation_type,\n            conv_mod_activation_type=args.custom_enc_conv_mod_activation_type,\n        )\n        # Make sure identical settings\n        self.chn_encoder = CustomEncoder(**enc_params)\n        self.eng_encoder = CustomEncoder(**enc_params)\n\n        self.subsample = get_subsample(args, mode=\"asr\", 
arch=\"transformer\")\n        encoder_out = self.chn_encoder.enc_out\n\n        self.most_dom_list = args.enc_block_arch[:]\n        self.most_dom_dim = sorted(\n            Counter(\n                d[\"d_hidden\"] for d in self.most_dom_list if \"d_hidden\" in d\n            ).most_common(),\n            key=lambda x: x[0],\n            reverse=True,\n        )[0][0]\n           \n\n        dec_param = (\n                odim,\n                args.dtype,\n                args.dlayers,\n                args.dunits,\n                blank_id,\n                args.dec_embed_dim,\n                args.dropout_rate_decoder,\n                args.dropout_rate_embed_decoder,\n        )\n        if self.use_decoder_expert:\n            raise NotImplementedError\n        else:\n            self.dec = DecoderRNNT(*dec_param) \n \n        decoder_out = args.dunits\n\n        self.joint_network = JointNetwork(\n            odim, encoder_out, decoder_out, args.joint_dim, args.joint_activation_type\n        )\n\n        if self.lang_weight > 0.0:\n            self.lang_classifer = torch.nn.Sequential(\n                                        torch.nn.Linear(encoder_out, 2 * encoder_out),\n                                        torch.nn.ReLU(),\n                                        torch.nn.Linear(2 * encoder_out, 3),\n                                      )\n \n        self.default_parameters(args)\n\n        ### Criterion ###\n        self.criterion = TransLoss(args.trans_type, self.blank_id)\n        self.aux_ctc = ctc_for(\n                Namespace(\n                    num_encs=1,\n                    eprojs=encoder_out,\n                    dropout_rate=args.aux_ctc_dropout_rate,\n                    ctc_type=\"warpctc\",\n                ),\n                odim,\n                reduce=False,\n        )\n        self.decoder_ctc = ctc_for(\n                Namespace(\n                    num_encs=1,\n                    eprojs=encoder_out,\n                    
dropout_rate=args.aux_ctc_dropout_rate,\n                     ctc_type=\"warpctc\",\n                ),\n                odim,\n                reduce=False,\n        )\n        self.lang_cls_criterion = torch.nn.CrossEntropyLoss()\n\n        self.loss = None\n        self.rnnlm = None\n        self.lms = {} # ngram LMs. Set externally during decoding\n\n    def default_parameters(self, args):\n        \"\"\"Initialize/reset parameters for transducer.\n\n        Args:\n            args (Namespace): argument Namespace containing options\n\n        \"\"\"\n        initializer(self, args)\n\n    ### Training Implementation  ###\n    def forward(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n        \"\"\"E2E forward.\n\n        Args:\n            xs_pad (torch.Tensor): batch of padded source sequences (B, Tmax, idim)\n            ilens (torch.Tensor): batch of lengths of input sequences (B)\n            ys_pad (torch.Tensor): batch of padded target sequences (B, Lmax)\n\n        Returns:\n            loss (torch.Tensor): transducer loss value\n        \"\"\"\n        # 0. process labels\n        ys_pad, cls_ids = ys_pad[:, 1:], ys_pad[:, 0].squeeze(0)\n\n        # 1. forward encoder\n        xs_pad = xs_pad[:, : max(ilens)]\n        src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2)\n\n        if self.shared_encoder:\n            hs_pad, hs_mask = self.shared_encoder(xs_pad, src_mask,\n                                                  return_as_intermidiate=True)\n        else:\n            hs_pad, hs_mask = xs_pad, src_mask\n\n        chn_hs_pad, chn_hs_mask = self.chn_encoder(hs_pad, hs_mask)\n        eng_hs_pad, eng_hs_mask = self.eng_encoder(hs_pad, hs_mask)\n\n        hs_pad, hs_mask = self.combine_fn(chn_hs_pad, eng_hs_pad,\n                                          chn_hs_mask, eng_hs_mask)\n\n        # 2. 
Decoder loss: either RNNT or CTC\n        if not self.is_pretrain:\n            if not self.is_ctc_decoder:    \n                ys_in_pad, ys_out_pad, target, pred_len, target_len = \\\n                    prepare_loss_inputs(ys_pad, hs_mask\n                )\n        \n                pred_pad = self.dec(hs_pad, ys_in_pad)\n    \n                z = self.joint_network(hs_pad.unsqueeze(2), pred_pad.unsqueeze(1))\n                loss_dec = self.criterion(z, target, pred_len, target_len)\n            else:\n                hlen = torch.IntTensor([h.size(1) for h in hs_mask]).to(hs_mask.device)\n                loss_dec = self.decoder_ctc(hs_pad, hlen, ys_pad, texts).sum()\n        else:\n            loss_dec = 0.0\n\n        # 3. auxiliary CTC\n        if self.aux_ctc_weight > 0.0:\n            chn_ys_pad, eng_ys_pad = self.monolingual_mask(ys_pad)\n            # print(chn_ys_pad, eng_ys_pad)\n            hlen = torch.IntTensor([h.size(1) for h in chn_hs_mask]).to(chn_hs_mask.device)\n            loss_ctc_chn = self.aux_ctc(chn_hs_pad, hlen, chn_ys_pad, texts)\n            loss_ctc_eng = self.aux_ctc(eng_hs_pad, hlen, eng_ys_pad, texts)\n        \n            # In fine-tuning we must compute two ctc loss for each utt\n            if self.use_adversial_examples:\n                loss_ctc = (loss_ctc_chn + loss_ctc_eng).sum() / 2\n            else:\n                chn_indices = torch.nonzero(cls_ids != self.eng_id).squeeze(1)\n                eng_indices = torch.nonzero(cls_ids != self.chn_id).squeeze(1)\n                loss_ctc = loss_ctc_chn[chn_indices].sum() + \\\n                           loss_ctc_eng[eng_indices].sum()\n        else:\n            loss_ctc = 0.0\n\n        # 4. 
language prediction loss\n        if self.lang_weight > 0.0:\n            loss_lang = self.lang_cls_criterion(\n                        self.lang_classifer(hs_pad.mean(dim=1)),\n                        cls_ids - self.chn_id\n                        )\n        else:\n            loss_lang = 0.0\n\n        # 5. aggregate loss and report\n        loss = loss_dec + \\\n               loss_ctc  * self.aux_ctc_weight + \\\n               loss_lang * self.lang_weight\n\n        self.loss = loss\n        loss_data = float(loss)\n\n        # Some report keys are not used here\n        if not math.isnan(loss_data):\n            self.reporter.report(\n                loss_data,\n                float(loss_dec),\n                float(loss_ctc),\n                0.0,  # loss_lm (unused)\n                0.0,  # loss_aux_trans (unused)\n                0.0,  # loss_aux_symm_kl (unused)\n                0.0,  # loss_mbr (unused)\n                0.0,  # loss_mmi (unused)\n                0.0,  # loss_att (unused)\n                float(loss_lang),\n                0.0,  # cer (unused)\n                0.0,  # wer (unused)\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n\n        return self.loss\n\n    # You may want to revise this function to combine encoder_output differently\n    def combine_fn(self, chn_hs_pad, eng_hs_pad, chn_hs_mask, eng_hs_mask):\n        return chn_hs_pad + eng_hs_pad, chn_hs_mask\n\n    def monolingual_mask(self, ys_pad):\n        # Special token ids: <chn> = 2, <eng> = 3.\n        # Targets for the Chinese branch: ids in [eng_start, odim) are replaced by <eng>.\n        ys_pad_chn = torch.where(torch.logical_and(\n            ys_pad >= self.eng_start, ys_pad < self.odim),\n            3, ys_pad)\n\n        # Targets for the English branch: ids in [chn_start, eng_start) are replaced by <chn>.\n        ys_pad_eng = torch.where(torch.logical_and(\n            ys_pad >= self.chn_start, ys_pad < self.eng_start),\n            2, ys_pad)\n\n        return ys_pad_chn, ys_pad_eng\n\n    ### Decoding Implementation ###\n    def encoder_forward(self, x):\n        # Run all encoders in inference mode\n        self.eval()\n        device = next(self.parameters()).device\n        x = torch.Tensor(x).to(device).unsqueeze(0)\n\n        if self.shared_encoder:\n            hs, _ = self.shared_encoder(x, 
None, return_as_intermidiate=True)\n        else:\n            hs = x\n\n        chn_hs, _ = self.chn_encoder(hs, None)\n        eng_hs, _ = self.eng_encoder(hs, None)\n\n        hs, _ = self.combine_fn(chn_hs, eng_hs, None, None)\n\n        # temporary code:\n        if hasattr(self, \"lang_classifer\"):\n            pred = torch.argmax(self.lang_classifer(hs.mean(dim=1))).item()\n            print(\"language classification results: \", pred, flush=True)\n\n        return hs, chn_hs, eng_hs\n\n    def recognize(self, x, beam_search=None, decode_feature=\"combine\"):\n        hs, chn_hs, eng_hs = self.encoder_forward(x)\n        if decode_feature == \"combine\":\n            feature = hs\n        elif decode_feature == \"chn\":\n            feature = chn_hs\n        elif decode_feature == \"eng\":\n            feature = eng_hs\n        else:\n            raise NotImplementedError\n\n        nbest_hyps = beam_search(feature)\n        return [asdict(n) for n in nbest_hyps] \n\n    # legacy, not used  \n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n        \"\"\"E2E attention calculation.\n\n        Args:\n            xs_pad (torch.Tensor): batch of padded input sequences (B, Tmax, idim)\n            ilens (torch.Tensor): batch of lengths of input sequences (B)\n            ys_pad (torch.Tensor):\n                batch of padded character id sequence tensor (B, Lmax)\n\n        Returns:\n            ret (ndarray): attention weights with the following shape,\n                1) multi-head case => attention weights (B, H, Lmax, Tmax),\n                2) other case => attention weights (B, Lmax, Tmax).\n\n        \"\"\"\n        self.eval()\n\n        if \"custom\" not in self.etype and \"custom\" not in self.dtype:\n            return []\n        else:\n            with torch.no_grad():\n                self.forward(xs_pad, ilens, ys_pad, texts, xs_pad_orig)\n\n            ret = dict()\n            for name, m in 
self.named_modules():\n                if isinstance(m, MultiHeadedAttention) or isinstance(\n                    m, RelPositionMultiHeadedAttention\n                ):\n                    ret[name] = m.attn.cpu().numpy()\n\n        self.train()\n\n        return ret\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_asr_transformer.py",
    "content": "# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Transformer speech recognition model (pytorch).\"\"\"\n\nfrom argparse import Namespace\nimport logging\nimport math\nimport copy\nimport numpy\nimport torch\nfrom concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED\nimport editdistance\n\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScore\nfrom espnet.nets.e2e_asr_common import end_detect\nfrom espnet.nets.e2e_asr_common import ErrorCalculator\nfrom espnet.nets.pytorch_backend.ctc import CTC\nfrom espnet.nets.pytorch_backend.e2e_asr import CTC_LOSS_THRESHOLD\nfrom espnet.nets.pytorch_backend.e2e_asr import Reporter\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import th_accuracy\nfrom espnet.nets.pytorch_backend.rnn.decoders import CTC_SCORING_RATIO\nfrom espnet.nets.pytorch_backend.transformer.add_sos_eos import add_sos_eos\nfrom espnet.nets.pytorch_backend.transformer.argument import (\n    add_arguments_transformer_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.attention import (\n    MultiHeadedAttention,  # noqa: H301\n    RelPositionMultiHeadedAttention,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.decoder import Decoder\nfrom espnet.nets.pytorch_backend.transformer.dynamic_conv import DynamicConvolution\nfrom espnet.nets.pytorch_backend.transformer.dynamic_conv2d import DynamicConvolution2D\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.initializer import initialize\nfrom espnet.nets.pytorch_backend.transformer.label_smoothing_loss import (\n    LabelSmoothingLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom 
espnet.nets.pytorch_backend.transformer.mask import target_mask\nfrom espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\nfrom espnet.nets.scorers.ctc import CTCPrefixScorer\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\nfrom espnet.snowfall.warpper.warpper_mmi import K2MMI\nfrom espnet.snowfall.warpper.warpper_ctc import K2CTC\nfrom espnet.nets.scorers.mmi import MMIPrefixScores\nfrom espnet.nets.scorers.mmi_lookahead import MMILookaheadScorer\nfrom espnet.nets.scorers.mmi_rescorer import MMIRescorer\nfrom espnet.nets.scorers.mmi_frame_scorer import MMIFrameScorer\nfrom espnet.nets.scorers.mmi_frame_prefix_scorer import MMIFramePrefixScorer\nfrom espnet.nets.scorers.ctc import CTCPrefixScorer\nfrom espnet.nets.beam_search import BeamSearch\n\nfrom espnet.utils.print import step_print\n\nclass E2E(ASRInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        group = parser.add_argument_group(\"transformer model setting\")\n\n        group = add_arguments_transformer_common(group)\n\n        return parser\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Return PlotAttentionReport.\"\"\"\n        return PlotAttentionReport\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        return self.encoder.conv_subsampling_factor * int(numpy.prod(self.subsample))\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments 
for compatibility\n        args = fill_missing_args(args, self.add_arguments)\n\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n        self.encoder = Encoder(\n            idim=idim,\n            selfattention_layer_type=args.transformer_encoder_selfattn_layer_type,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            conv_wshare=args.wshare,\n            conv_kernel_length=args.ldconv_encoder_kernel_length,\n            conv_usebias=args.ldconv_usebias,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=args.transformer_input_layer,\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            attention_dropout_rate=args.transformer_attn_dropout_rate,\n        )\n        if args.mtlalpha < 1:\n            self.decoder = Decoder(\n                odim=odim,\n                selfattention_layer_type=args.transformer_decoder_selfattn_layer_type,\n                attention_dim=args.adim,\n                attention_heads=args.aheads,\n                conv_wshare=args.wshare,\n                conv_kernel_length=args.ldconv_decoder_kernel_length,\n                conv_usebias=args.ldconv_usebias,\n                linear_units=args.dunits,\n                num_blocks=args.dlayers,\n                dropout_rate=args.dropout_rate,\n                positional_dropout_rate=args.dropout_rate,\n                self_attention_dropout_rate=args.transformer_attn_dropout_rate,\n                src_attention_dropout_rate=args.transformer_attn_dropout_rate,\n            )\n            self.criterion = LabelSmoothingLoss(\n                odim,\n                ignore_id,\n                args.lsm_weight,\n                args.transformer_length_normalized_loss,\n            )\n        else:\n            self.decoder = None\n            self.criterion = None\n        
self.blank = 0\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.odim = odim\n        self.ignore_id = ignore_id\n        self.subsample = get_subsample(args, mode=\"asr\", arch=\"transformer\")\n        self.reporter = Reporter()\n\n        self.reset_parameters(args)\n        self.adim = args.adim  # used for CTC (equal to d_model)\n        self.mtlalpha = args.mtlalpha\n        if args.mtlalpha > 0.0:\n            if args.ctc_type == \"k2mmi\" or args.ctc_type == \"k2ctc\":\n                device = torch.device(f\"cuda:{args.local_rank}\") if torch.cuda.is_available() else torch.device(\"cpu\")\n                enc_supervise_module = K2MMI if args.ctc_type == 'k2mmi' else K2CTC\n                self.ctc=enc_supervise_module(idim=args.adim,\n                         lang=args.lang,\n                         char_list=args.char_list,\n                         device=device,\n                         dropout=args.dropout_rate, \n                         den_scale=args.den_scale,\n                         eos_id=self.eos,\n                         use_segment=args.use_segment)\n                if args.third_weight:\n                    print(f\"You have used MMI to supervise encoder Training. However, \\\n                          you still add CTC on encoder with weight {args.third_weight}\")\n                    self.third_weight = args.third_weight\n                    self.third_loss = CTC(\n                      odim, args.adim, args.dropout_rate, ctc_type=args.ctc_type, reduce=True\n                      ) \n            else:\n                self.ctc = CTC(\n                    odim, args.adim, args.dropout_rate, ctc_type=args.ctc_type, reduce=True\n                )\n        else:\n            self.ctc = None\n\n        # Decoder for on-the-fly decoding. 
Used in MBR training\n        if args.aux_mbr:\n            scorers = {\"decoder\": self.decoder,\n                       \"ctc\": CTCPrefixScorer(self.ctc, self.eos),\n                      }\n            weights = {\"decoder\": 1 - args.mtlalpha, \n                       \"ctc\": args.mtlalpha,\n                      } \n            self.beam_search = BeamSearch(\n                beam_size=args.aux_mbr_beam,\n                vocab_size=len(args.char_list),\n                weights=weights,\n                scorers=scorers,\n                sos=self.sos,\n                eos=self.eos,\n                token_list=args.char_list,\n                pre_beam_score_key=None if args.mtlalpha == 1.0 else \"full\",\n            )\n            self.aux_mbr_beam = args.aux_mbr_beam\n            self.aux_mbr_weight = args.aux_mbr_weight\n            self.mbr_criterion = torch.nn.CrossEntropyLoss(\n                ignore_index=self.ignore_id,\n                reduction=\"none\",\n            ) \n        else:\n            self.beam_search = None \n\n        if args.report_cer or args.report_wer:\n            self.error_calculator = ErrorCalculator(\n                args.char_list,\n                args.sym_space,\n                args.sym_blank,\n                args.report_cer,\n                args.report_wer,\n            )\n        else:\n            self.error_calculator = None\n        self.rnnlm = None\n        self.char_list = args.char_list\n\n    def reset_parameters(self, args):\n        \"\"\"Initialize parameters.\"\"\"\n        # initialize parameters\n        initialize(self, args.transformer_init)\n\n    def forward(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of source sequences (B)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :return: ctc loss 
value\n        :rtype: torch.Tensor\n        :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in attention decoder\n        :rtype: float\n        \"\"\"\n        # 1. forward encoder\n        xs_pad = xs_pad[:, : max(ilens)]  # for data parallel\n        src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2)\n        hs_pad, hs_mask = self.encoder(xs_pad, src_mask)\n        self.hs_pad = hs_pad\n\n        # 2. forward decoder\n        if self.decoder is not None:\n            ys_in_pad, ys_out_pad = add_sos_eos(\n                ys_pad, self.sos, self.eos, self.ignore_id\n            )\n            ys_mask = target_mask(ys_in_pad, self.ignore_id)\n            pred_pad, pred_mask = self.decoder(ys_in_pad, ys_mask, hs_pad, hs_mask)\n            self.pred_pad = pred_pad\n\n            # 3. compute attention loss\n            loss_att = self.criterion(pred_pad, ys_out_pad)\n            self.acc = th_accuracy(\n                pred_pad.view(-1, self.odim), ys_out_pad, ignore_label=self.ignore_id\n            )\n        else:\n            loss_att = None\n            self.acc = None\n\n        # TODO(karita) show predicted text\n        # TODO(karita) calculate these stats\n        cer_ctc = None\n\n        if self.mtlalpha == 0.0:\n            loss_ctc = None\n        else:\n            batch_size = xs_pad.size(0)\n            hs_len = hs_mask.view(batch_size, -1).sum(1)\n            loss_ctc = self.ctc(hs_pad.view(batch_size, -1, self.adim), hs_len, ys_pad, texts)\n            if hasattr(self, \"third_weight\"):\n                third_loss = self.third_loss(hs_pad.view(batch_size, -1, self.adim), hs_len, ys_pad, texts)\n            if not self.training and self.error_calculator is not None:\n                ys_hat = self.ctc.argmax(hs_pad.view(batch_size, -1, self.adim)).data\n                cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)\n            # for visualization\n            
if not self.training:\n                self.ctc.softmax(hs_pad)\n\n        if self.beam_search:\n            loss_mbr = self.mbr_forward(xs_pad_orig, ilens, ys_pad, hs_pad, hs_mask)\n        \n        # 5. compute cer/wer\n        if self.training or self.error_calculator is None or self.decoder is None:\n            cer, wer = None, None\n        else:\n            ys_hat = pred_pad.argmax(dim=-1)\n            cer, wer = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())\n\n        # copied from e2e_asr\n        alpha = self.mtlalpha\n        if alpha == 0:\n            self.loss = loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = None\n        elif alpha == 1:\n            self.loss = loss_ctc\n            loss_att_data = None\n            loss_ctc_data = float(loss_ctc)\n        else:\n            self.loss = alpha * loss_ctc + (1 - alpha) * loss_att\n            loss_att_data = float(loss_att)\n            loss_ctc_data = float(loss_ctc)\n\n        # Add the third loss if it is adopted\n        if hasattr(self, \"third_weight\"):\n            self.loss += self.third_weight * third_loss\n            third_loss_data = float(third_loss) \n        else:\n            third_loss_data = None\n\n        if self.beam_search:\n            self.loss += self.aux_mbr_weight * loss_mbr\n            loss_mbr_data = float(loss_mbr)\n        else:\n            loss_mbr_data = None\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_ctc_data, loss_att_data, third_loss_data, loss_mbr_data, \n                self.acc, cer_ctc, cer, wer, loss_data\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n     \n    def mbr_forward(self, xs_pad_orig, ilens, ys_pad, hs_pad, hs_mask):\n        batch_size = len(ilens)\n\n        # (1) on-the-fly decoding\n        
self.eval()\n        with torch.no_grad():\n            src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad_orig.device).unsqueeze(-2)\n            specs, hs_mask = self.encoder(xs_pad_orig, src_mask)\n\n            hs = [h[h != 0] for h in hs_mask]\n            hlens = list(map(int, [h.size(0) for h in hs]))\n            specs = [h[:l] for h, l in zip(specs, hlens)]\n\n            with ThreadPoolExecutor(max_workers=self.aux_mbr_beam) as executor:\n                futures = [executor.submit(self.beam_search, h) for h in specs]\n                wait(futures, return_when=ALL_COMPLETED)\n\n            hyps = []\n            for future in futures:\n                hyps.extend(future.result()[:self.aux_mbr_beam])\n            hyps = [h.yseq[1:-1].tolist() for h in hyps] # exclude <sos>, <eos>\n\n            # for debug\n            for i, y in enumerate(ys_pad):\n                ref_text = \"\".join([self.char_list[x] for x in y if x != self.ignore_id])\n                # print(f\"ref_text: {ref_text}\")\n                for y in hyps[i * self.aux_mbr_beam: (i+1) * self.aux_mbr_beam]:\n                    hyp_text = \"\".join([self.char_list[x] for x in y])\n                    # print(f\"hyp_text: {hyp_text}\")\n        self.train()\n\n        # problem in decoding\n        if not len(hyps) == self.aux_mbr_beam * batch_size:\n            return 0.0\n\n        # (2) edit-distance\n        dist = self.compute_edit_distance(hyps, ys_pad)\n        if dist is None:\n            return 0.0 # fail in editdistance.\n\n        # (3) decoder forward: prob of each hyp\n        hyp_maxlen = max([len(hyp) for hyp in hyps])\n        hyps_pad = [hyp + [self.ignore_id] * (hyp_maxlen - len(hyp)) for hyp in hyps]\n        hyps_pad = torch.Tensor(hyps_pad).to(ys_pad.device).to(ys_pad.dtype)\n        hyps_pad_in, hyps_pad_out = add_sos_eos(hyps_pad, self.sos, self.eos, self.ignore_id)\n        hyps_mask = target_mask(hyps_pad_in, self.ignore_id)\n\n        idx = 
torch.arange(self.aux_mbr_beam * batch_size) // self.aux_mbr_beam\n        hs_pad = hs_pad[idx]\n        hs_mask = hs_mask[idx]\n\n        pred_pad, _ = self.decoder(hyps_pad_in, hyps_mask, hs_pad, hs_mask)\n        loss_att = self.mbr_criterion(pred_pad.permute(0, 2, 1), hyps_pad_out)\n        mask = torch.eq(hyps_pad_out.int(), self.ignore_id)\n        # masked_fill is not in-place: keep the result so padded positions contribute 0\n        loss_att = loss_att.masked_fill(mask, 0.0)\n        # sequence-level probability of each hypothesis\n        loss_att = (-loss_att.sum(dim=-1)).exp()\n\n        # (4) MBR loss: edit distance weighted by the normalized hypothesis probabilities\n        num = (loss_att * dist).view(batch_size, self.aux_mbr_beam)\n        den = loss_att.view(batch_size, self.aux_mbr_beam)\n        loss_mbr = num.sum(dim=-1) / (den.sum(dim=-1) + 1e-10) # smooth\n        loss_mbr = loss_mbr.mean() # the other losses also use reduction=mean\n        return loss_mbr\n\n    def compute_edit_distance(self, hyps, refs):\n        # hyps: list of batch * beam hypotheses, each a list of token ids\n        # refs: 2-D tensor of labels; ignore_id (-1 by default) marks padding\n\n        # convert refs into lists and remove padding\n        refs_device = refs.device\n        refs = refs.cpu().tolist()\n        refs = [[x for x in t if x != self.ignore_id] for t in refs]\n\n        if len(hyps) % len(refs) != 0:\n            raise ValueError(\"The number of hypotheses is not correct\")\n\n        beam = len(hyps) // len(refs)\n\n        dist = [editdistance.eval(hyp, refs[i//beam])\n                for i, hyp in enumerate(hyps)\n               ]\n        dist = torch.IntTensor(dist).to(refs_device)\n        return dist\n\n    def scorers(self):\n        \"\"\"Scorers.\"\"\"\n        return dict(decoder=self.decoder)\n\n    def encode(self, x):\n        \"\"\"Encode acoustic features.\n\n        :param ndarray x: source acoustic feature (T, D)\n        :return: encoder outputs\n        :rtype: torch.Tensor\n        \"\"\"\n        self.eval()\n        x = torch.as_tensor(x).unsqueeze(0)\n        enc_output, _ = self.encoder(x, None)\n        return enc_output.squeeze(0)\n\n 
   def recognize(self, x, recog_args, char_list=None, rnnlm=None, use_jit=False):\n        \"\"\"Recognize input speech.\n\n        :param ndarray x: input acoustic feature (B, T, D) or (T, D)\n        :param Namespace recog_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        enc_output = self.encode(x).unsqueeze(0)\n        if self.mtlalpha == 1.0:\n            recog_args.ctc_weight = 1.0\n            logging.info(\"Set to pure CTC decoding mode.\")\n\n        if self.mtlalpha > 0 and recog_args.ctc_weight == 1.0:\n            from itertools import groupby\n\n            lpz = self.ctc.argmax(enc_output)\n            collapsed_indices = [x[0] for x in groupby(lpz[0])]\n            hyp = [x for x in filter(lambda x: x != self.blank, collapsed_indices)]\n            nbest_hyps = [{\"score\": 0.0, \"yseq\": [self.sos] + hyp}]\n            if recog_args.beam_size > 1:\n                raise NotImplementedError(\"Pure CTC beam search is not implemented.\")\n            # TODO(hirofumi0810): Implement beam search\n            return nbest_hyps\n        elif self.mtlalpha > 0 and recog_args.ctc_weight > 0.0:\n            # Being compatible with LAS+MMI+CTC\n            ctc_module = self.ctc if isinstance(self.ctc, CTC) else self.third_loss\n            lpz = ctc_module.log_softmax(enc_output)\n            lpz = lpz.squeeze(0)\n        else:\n            lpz = None\n\n        if recog_args.mmi_weight > 0.0:\n            assert isinstance(self.ctc, K2MMI)\n            self.ctc.dump_weight(recog_args.local_rank)\n            if recog_args.mmi_type == \"lookahead\":\n                mmi_scorer = MMIFramePrefixScorer(lang=self.ctc.lang,\n                                                  device=\"cuda\",\n                                                  idim=self.adim,\n                
                                  sos_id=self.sos,\n                                                  rank=recog_args.local_rank,\n                                                  use_segment=recog_args.use_segment,\n                                                  char_list=char_list\n                                                  )\n            elif recog_args.mmi_type == \"frame\":\n                mmi_scorer = MMIFrameScorer(lang=self.ctc.lang,\n                                            device=self.ctc.device,\n                                            idim=self.adim,\n                                            sos_id=self.sos,\n                                            rank=recog_args.local_rank,\n                                            use_segment=recog_args.use_segment,\n                                            char_list=char_list\n                                            )\n            else:    \n                raise NotImplementedError\n        else:\n            mmi_scorer = None\n\n        if recog_args.mmi_rescore == True:\n            self.ctc.dump_weight(recog_args.local_rank)\n            if recog_args.mmi_weight > 0.0:\n                raise ValueError(\"Cannot do rescoring if mmi_weight > 0.0\")\n            mmi_rescorer = MMIRescorer(lang=self.ctc.lang,\n                                                device=self.ctc.device,\n                                                idim=self.adim,\n                                                sos_id=self.sos,\n                                                rank=recog_args.local_rank,\n                                                use_segment=recog_args.use_segment,\n                                                char_list=char_list)\n            print(\"Will do rescore after decoding\")\n        else:\n            mmi_rescorer = None \n\n        h = enc_output.squeeze(0)\n\n        logging.info(\"input lengths: \" + str(h.size(0)))\n        # search parms\n        beam = 
recog_args.beam_size\n        penalty = recog_args.penalty\n        ctc_weight = recog_args.ctc_weight\n\n        # prepare sos\n        y = self.sos\n        vy = h.new_zeros(1).long()\n\n        if recog_args.maxlenratio == 0:\n            maxlen = h.shape[0]\n        else:\n            # maxlen >= 1\n            maxlen = max(1, int(recog_args.maxlenratio * h.size(0)))\n        minlen = int(recog_args.minlenratio * h.size(0))\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        if rnnlm:\n            hyp = {\"score\": 0.0, \"yseq\": [y], \"rnnlm_prev\": None}\n        else:\n            hyp = {\"score\": 0.0, \"yseq\": [y]}\n        if lpz is not None:\n            ctc_prefix_score = CTCPrefixScore(lpz.detach().numpy(), 0, self.eos, numpy)\n            hyp[\"ctc_state_prev\"] = ctc_prefix_score.initial_state()\n            hyp[\"ctc_score_prev\"] = 0.0\n        \n        # CTC beam is independent of lpz.\n        ctc_beam = int(beam * CTC_SCORING_RATIO)\n\n        if mmi_scorer:\n            hyp[\"mmi_state\"] = mmi_scorer.init_state(enc_output.squeeze(0))\n        \n        # Trace each score in each step\n        logs = {\"att\": []}\n        if ctc_weight > 0.0:\n            logs[\"ctc\"] = []\n        if recog_args.mmi_weight > 0.0:\n            logs[\"mmi\"] = []\n        if recog_args.lm_weight > 0.0:\n            logs[\"lm\"] = []\n        hyp[\"logs\"] = logs\n        \n        hyps = [hyp]\n        ended_hyps = []\n\n        import six\n\n        traced_decoder = None\n        for i in six.moves.range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            hyps_best_kept = []\n            print(\"#\" * 20, f\"Iteration {i}\", \"#\" * 20, flush=True)\n            for hyp in hyps:\n                vy[0] = hyp[\"yseq\"][i]\n\n                # get nbest local scores and their ids\n                ys_mask = subsequent_mask(i + 
1).unsqueeze(0)\n                ys = torch.tensor(hyp[\"yseq\"]).unsqueeze(0)\n                # FIXME: jit does not match non-jit result\n                if use_jit:\n                    if traced_decoder is None:\n                        traced_decoder = torch.jit.trace(\n                            self.decoder.forward_one_step, (ys, ys_mask, enc_output)\n                        )\n                    local_att_scores = traced_decoder(ys, ys_mask, enc_output)[0]\n                else:\n                    local_att_scores = self.decoder.forward_one_step(\n                        ys, ys_mask, enc_output\n                    )[0]\n\n                if rnnlm:\n                    rnnlm_state, local_lm_scores = rnnlm.predict(hyp[\"rnnlm_prev\"], vy)\n                    local_scores = (\n                        local_att_scores + recog_args.lm_weight * local_lm_scores\n                    )\n                else:\n                    local_scores = local_att_scores\n                \n                if lpz is not None or mmi_scorer: # allow using either CTC or MMI\n                    # Accumulate Attention scores\n                    local_best_scores, local_best_ids = torch.topk(\n                        local_att_scores, ctc_beam, dim=1\n                    )\n                    local_scores = (1.0 - ctc_weight) * local_best_scores \n                    att_scores = local_best_scores \n                    # Previous Hypothesis\n                    prev_hyp = \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]]).replace(\"<space>\", \" \")\n                    print(f\"Previous Hypothesis: {prev_hyp} | Prev_Score: {hyp['score']}\")\n                    \n                    # Candidates\n                    candidates = [char_list[int(x)] for x in local_best_ids[0]]\n                    print(\"Candidates:     \" + \"\".join([\"{:<9}\".format(x) for x in candidates]))\n\n                    # Attention Scores\n                    
print(f\"Attention({1-recog_args.ctc_weight}): \" + \"\".join([\"{:<9.2f}\".format(x) for x in local_best_scores[0]])) \n\n                    # Accumulate CTC scores if provided \n                    if lpz is not None:\n                        ctc_scores, ctc_states = ctc_prefix_score(\n                            hyp[\"yseq\"], local_best_ids[0], hyp[\"ctc_state_prev\"]\n                        )\n                        local_scores += ctc_weight * torch.from_numpy(\n                                        ctc_scores - hyp[\"ctc_score_prev\"])\n                        print(f\"CTC({recog_args.ctc_weight}):       \" + \"\".join([\"{:<9.2f}\".format(x) for x in ctc_scores]))\n\n                    # Accumulate MMI scores if provided\n                    if mmi_scorer:\n                        prefix = torch.Tensor(hyp[\"yseq\"]).to(torch.int32).to(mmi_scorer.device)\n                        mmi_scores, mmi_states = mmi_scorer.score_partial(\n                          prefix, local_best_ids[0], hyp[\"mmi_state\"], None\n                        )\n                        local_scores += (recog_args.mmi_weight * mmi_scores)\n                        print(f\"MMI({recog_args.mmi_weight}):       \" + \"\".join([\"{:<9.2f}\".format(x) for x in mmi_scores]))\n\n                    # Accumulate LM scores if provided\n                    if recog_args.lm_weight > 0.0:\n                        local_scores += (\n                            recog_args.lm_weight * local_lm_scores[:, local_best_ids[0]]\n                        )\n                        lm_scores = local_lm_scores[:, local_best_ids[0]]\n                        print(f\"LM ({recog_args.lm_weight}):       \" + \"\".join([\"{:<9.2f}\".format(x) for x \\\n                          in local_lm_scores[:, local_best_ids[0]][0]]))\n\n                    local_best_scores, joint_best_ids = torch.topk(\n                        local_scores, beam, dim=1\n                    )\n                    local_best_ids = 
local_best_ids[:, joint_best_ids[0]]\n                else:\n                    local_best_scores, local_best_ids = torch.topk(\n                        local_scores, beam, dim=1\n                    )\n\n                for j in six.moves.range(beam):\n                    new_hyp = {}\n                    new_hyp[\"score\"] = hyp[\"score\"] + float(local_best_scores[0, j])\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = int(local_best_ids[0, j])\n                    if rnnlm:\n                        new_hyp[\"rnnlm_prev\"] = rnnlm_state\n                    if lpz is not None:\n                        new_hyp[\"ctc_state_prev\"] = ctc_states[joint_best_ids[0, j]]\n                        new_hyp[\"ctc_score_prev\"] = ctc_scores[joint_best_ids[0, j]]\n                    if mmi_scorer:\n                        new_hyp[\"mmi_state\"] = mmi_scorer.select_state(mmi_states, joint_best_ids[0, j])\n                    \n                    # Update log\n                    old_logs = copy.deepcopy(hyp[\"logs\"])\n                    if att_scores is not None:\n                        old_logs[\"att\"].append(att_scores[0, joint_best_ids[0, j]].item()) \n                    if ctc_weight > 0.0:\n                        old_logs[\"ctc\"].append(ctc_scores[joint_best_ids[0, j]].item())\n                    if recog_args.mmi_weight > 0.0:\n                        old_logs[\"mmi\"].append(mmi_scores[joint_best_ids[0, j]].item())\n                    if recog_args.lm_weight > 0.0:\n                        old_logs[\"lm\"].append(lm_scores[0, joint_best_ids[0, j]].item())\n                    new_hyp[\"logs\"] = old_logs\n                    \n                    # will be (2 x beam) hyps at most\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    
hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypotheses: \" + str(len(hyps)))\n            if char_list is not None:\n                logging.debug(\n                    \"best hypo: \"\n                    + \"\".join([char_list[int(x)] for x in hyps[0][\"yseq\"][1:]])\n                )\n            \n            # add eos in the final loop so that at least one hypothesis ends\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last position in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.eos)\n\n            # add ended hypotheses to a final list, and remove them from the current hypotheses\n            # (this will be a problem, number of hyps < beam)\n            remained_hyps = []\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * penalty\n                        if rnnlm:  # Word LM needs to add final <eos> score\n                            hyp[\"score\"] += recog_args.lm_weight * rnnlm.final(\n                                hyp[\"rnnlm_prev\"]\n                            )\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # show all ended hypotheses\n            for hyp in ended_hyps:\n                hyp_str = \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]]).replace(\"<space>\", \" \")\n                print(f\"{hyp_str} | {hyp['score']}\")\n\n            # end detection\n            if end_detect(ended_hyps, i) and recog_args.maxlenratio == 0.0:\n                logging.info(\"end detected at 
%d\", i)\n                break\n\n            hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. Finish decoding.\")\n                break\n\n            if char_list is not None:\n                for hyp in hyps:\n                    logging.debug(\n                        \"hypo: \" + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]])\n                    )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[\n            : min(len(ended_hyps), recog_args.nbest)\n        ]\n\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there are no N-best results, perform recognition \"\n                \"again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            recog_args = Namespace(**vars(recog_args))\n            recog_args.minlenratio = max(0.0, recog_args.minlenratio - 0.1)\n            return self.recognize(x, recog_args, char_list, rnnlm)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n        if mmi_rescorer:\n            nbest_hyps = mmi_rescorer.score(enc_output.squeeze(0), nbest_hyps, char_list)\n        return nbest_hyps\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token 
id sequence tensor (B, Lmax)\n        :return: attention weights (B, H, Lmax, Tmax)\n        :rtype: float ndarray\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            self.forward(xs_pad, ilens, ys_pad, texts, xs_pad_orig)\n        ret = dict()\n        for name, m in self.named_modules():\n            if (\n                isinstance(m, MultiHeadedAttention)\n                or isinstance(m, DynamicConvolution)\n                or isinstance(m, RelPositionMultiHeadedAttention)\n            ):\n                ret[name] = m.attn.cpu().numpy()\n            if isinstance(m, DynamicConvolution2D):\n                ret[name + \"_time\"] = m.attn_t.cpu().numpy()\n                ret[name + \"_freq\"] = m.attn_f.cpu().numpy()\n        self.train()\n        return ret\n\n    def calculate_all_ctc_probs(self, xs_pad, ilens, ys_pad, texts, xs_pad_orig):\n        \"\"\"E2E CTC probability calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: CTC probability (B, Tmax, vocab)\n        :rtype: float ndarray\n        \"\"\"\n        ret = None\n        if self.mtlalpha == 0:\n            return ret\n\n        self.eval()\n        with torch.no_grad():\n            self.forward(xs_pad, ilens, ys_pad, texts, xs_pad_orig)\n        for name, m in self.named_modules():\n            if isinstance(m, (CTC, K2MMI, K2CTC)) and m.probs is not None:\n                ret = m.probs.cpu().numpy()\n        self.train()\n        return ret\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_mt.py",
    "content": "# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"RNN sequence-to-sequence text translation model (pytorch).\"\"\"\n\nimport argparse\nimport logging\nimport math\nimport os\n\nimport chainer\nfrom chainer import reporter\nimport nltk\nimport numpy as np\nimport torch\n\nfrom espnet.nets.e2e_asr_common import label_smoothing_dist\nfrom espnet.nets.mt_interface import MTInterface\nfrom espnet.nets.pytorch_backend.initialization import uniform_init_parameters\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.pytorch_backend.rnn.argument import (\n    add_arguments_rnn_encoder_common,  # noqa: H301\n    add_arguments_rnn_decoder_common,  # noqa: H301\n    add_arguments_rnn_attention_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.rnn.attentions import att_for\nfrom espnet.nets.pytorch_backend.rnn.decoders import decoder_for\nfrom espnet.nets.pytorch_backend.rnn.encoders import encoder_for\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass Reporter(chainer.Chain):\n    \"\"\"A chainer reporter wrapper.\"\"\"\n\n    def report(self, loss, acc, ppl, bleu):\n        \"\"\"Report at every step.\"\"\"\n        reporter.report({\"loss\": loss}, self)\n        reporter.report({\"acc\": acc}, self)\n        reporter.report({\"ppl\": ppl}, self)\n        reporter.report({\"bleu\": bleu}, self)\n\n\nclass E2E(MTInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2E.encoder_add_arguments(parser)\n        E2E.attention_add_arguments(parser)\n     
   E2E.decoder_add_arguments(parser)\n        return parser\n\n    @staticmethod\n    def encoder_add_arguments(parser):\n        \"\"\"Add arguments for the encoder.\"\"\"\n        group = parser.add_argument_group(\"E2E encoder setting\")\n        group = add_arguments_rnn_encoder_common(group)\n        return parser\n\n    @staticmethod\n    def attention_add_arguments(parser):\n        \"\"\"Add arguments for the attention.\"\"\"\n        group = parser.add_argument_group(\"E2E attention setting\")\n        group = add_arguments_rnn_attention_common(group)\n        return parser\n\n    @staticmethod\n    def decoder_add_arguments(parser):\n        \"\"\"Add arguments for the decoder.\"\"\"\n        group = parser.add_argument_group(\"E2E decoder setting\")\n        group = add_arguments_rnn_decoder_common(group)\n        return parser\n\n    def __init__(self, idim, odim, args):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        super(E2E, self).__init__()\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments for compatibility\n        args = fill_missing_args(args, self.add_arguments)\n\n        self.etype = args.etype\n        self.verbose = args.verbose\n        # NOTE: for self.build method\n        args.char_list = getattr(args, \"char_list\", None)\n        self.char_list = args.char_list\n        self.outdir = args.outdir\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.reporter = Reporter()\n\n        # below means the last number becomes eos/sos ID\n        # note that sos/eos IDs are identical\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.pad = 0\n        # NOTE: we reserve index:0 for <pad> although this is reserved for a blank class\n        # in ASR. 
However, blank labels are not used in MT.\n        # To keep the vocabulary size,\n        # we use index:0 for padding instead of adding one more class.\n\n        # subsample info\n        self.subsample = get_subsample(args, mode=\"mt\", arch=\"rnn\")\n\n        # label smoothing info\n        if args.lsm_type and os.path.isfile(args.train_json):\n            logging.info(\"Use label smoothing with \" + args.lsm_type)\n            labeldist = label_smoothing_dist(\n                odim, args.lsm_type, transcript=args.train_json\n            )\n        else:\n            labeldist = None\n\n        # multilingual related\n        self.multilingual = getattr(args, \"multilingual\", False)\n        self.replace_sos = getattr(args, \"replace_sos\", False)\n\n        # encoder\n        self.embed = torch.nn.Embedding(idim, args.eunits, padding_idx=self.pad)\n        self.dropout = torch.nn.Dropout(p=args.dropout_rate)\n        self.enc = encoder_for(args, args.eunits, self.subsample)\n        # attention\n        self.att = att_for(args)\n        # decoder\n        self.dec = decoder_for(args, odim, self.sos, self.eos, self.att, labeldist)\n\n        # tie source and target embeddings\n        if args.tie_src_tgt_embedding:\n            if idim != odim:\n                raise ValueError(\n                    \"When using tie_src_tgt_embedding, idim and odim must be equal.\"\n                )\n            if args.eunits != args.dunits:\n                raise ValueError(\n                    \"When using tie_src_tgt_embedding, eunits and dunits must be equal.\"\n                )\n            self.embed.weight = self.dec.embed.weight\n\n        # tie embeddings and the classifier\n        if args.tie_classifier:\n            if args.context_residual:\n                raise ValueError(\n                    \"When using tie_classifier, context_residual must be turned off.\"\n                )\n            self.dec.output.weight = self.dec.embed.weight\n\n        # weight 
initialization\n        self.init_like_fairseq()\n\n        # options for beam search\n        if args.report_bleu:\n            trans_args = {\n                \"beam_size\": args.beam_size,\n                \"penalty\": args.penalty,\n                \"ctc_weight\": 0,\n                \"maxlenratio\": args.maxlenratio,\n                \"minlenratio\": args.minlenratio,\n                \"lm_weight\": args.lm_weight,\n                \"rnnlm\": args.rnnlm,\n                \"nbest\": args.nbest,\n                \"space\": args.sym_space,\n                \"blank\": args.sym_blank,\n                \"tgt_lang\": False,\n            }\n\n            self.trans_args = argparse.Namespace(**trans_args)\n            self.report_bleu = args.report_bleu\n        else:\n            self.report_bleu = False\n        self.rnnlm = None\n\n        self.logzero = -10000000000.0\n        self.loss = None\n        self.acc = None\n\n    def init_like_fairseq(self):\n        \"\"\"Initialize weights like Fairseq.\n\n        Fairseq basically uses W, b, EmbedID.W ~ Uniform(-0.1, 0.1).\n        \"\"\"\n        uniform_init_parameters(self)\n        # exceptions\n        # embed weight ~ Uniform(-0.1, 0.1), with the padding row set to zero\n        torch.nn.init.uniform_(self.embed.weight, -0.1, 0.1)\n        torch.nn.init.constant_(self.embed.weight[self.pad], 0)\n        torch.nn.init.uniform_(self.dec.embed.weight, -0.1, 0.1)\n        torch.nn.init.constant_(self.dec.embed.weight[self.pad], 0)\n\n    def forward(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: loss value\n        :rtype: torch.Tensor\n        \"\"\"\n        # 1. 
Encoder\n        xs_pad, ys_pad = self.target_language_biasing(xs_pad, ilens, ys_pad)\n        hs_pad, hlens, _ = self.enc(self.dropout(self.embed(xs_pad)), ilens)\n\n        # 3. attention loss\n        self.loss, self.acc, self.ppl = self.dec(hs_pad, hlens, ys_pad)\n\n        # 4. compute bleu\n        if self.training or not self.report_bleu:\n            self.bleu = 0.0\n        else:\n            lpz = None\n\n            nbest_hyps = self.dec.recognize_beam_batch(\n                hs_pad,\n                torch.tensor(hlens),\n                lpz,\n                self.trans_args,\n                self.char_list,\n                self.rnnlm,\n            )\n            # remove <sos> and <eos>\n            list_of_refs = []\n            hyps = []\n            y_hats = [nbest_hyp[0][\"yseq\"][1:-1] for nbest_hyp in nbest_hyps]\n            for i, y_hat in enumerate(y_hats):\n                y_true = ys_pad[i]\n\n                seq_hat = [self.char_list[int(idx)] for idx in y_hat if int(idx) != -1]\n                seq_true = [\n                    self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                ]\n                seq_hat_text = \"\".join(seq_hat).replace(self.trans_args.space, \" \")\n                seq_hat_text = seq_hat_text.replace(self.trans_args.blank, \"\")\n                seq_true_text = \"\".join(seq_true).replace(self.trans_args.space, \" \")\n\n                hyps += [seq_hat_text.split(\" \")]\n                list_of_refs += [[seq_true_text.split(\" \")]]\n\n            self.bleu = nltk.bleu_score.corpus_bleu(list_of_refs, hyps) * 100\n\n        loss_data = float(self.loss)\n        if not math.isnan(loss_data):\n            self.reporter.report(loss_data, self.acc, self.ppl, self.bleu)\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def target_language_biasing(self, xs_pad, ilens, ys_pad):\n        \"\"\"Prepend target language IDs to source 
sentences for multilingual MT.\n\n        These tags are prepended in source/target sentences as pre-processing.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: source text with the source language ID replaced by\n            the target language ID (unchanged if not multilingual)\n        :rtype: torch.Tensor\n        :return: target text without language IDs\n        :rtype: torch.Tensor\n        \"\"\"\n        if self.multilingual:\n            # remove language ID in the beginning\n            tgt_lang_ids = ys_pad[:, 0].unsqueeze(1)\n            xs_pad = xs_pad[:, 1:]  # remove source language IDs here\n            ys_pad = ys_pad[:, 1:]\n\n            # prepend target language ID to source sentences\n            xs_pad = torch.cat([tgt_lang_ids, xs_pad], dim=1)\n        return xs_pad, ys_pad\n\n    def translate(self, x, trans_args, char_list, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        :param ndarray x: input source text feature (B, T, D)\n        :param Namespace trans_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n\n        # 1. encoder\n        # make a utt list (1) to use the same interface for encoder\n        if self.multilingual:\n            ilen = [len(x[0][1:])]\n            h = to_device(\n                self, torch.from_numpy(np.fromiter(map(int, x[0][1:]), dtype=np.int64))\n            )\n        else:\n            ilen = [len(x[0])]\n            h = to_device(\n                self, torch.from_numpy(np.fromiter(map(int, x[0]), dtype=np.int64))\n            )\n        hs, _, _ = self.enc(self.dropout(self.embed(h.unsqueeze(0))), ilen)\n\n        # 2. 
decoder\n        # decode the first utterance\n        y = self.dec.recognize_beam(hs[0], None, trans_args, char_list, rnnlm)\n\n        if prev:\n            self.train()\n        return y\n\n    def translate_batch(self, xs, trans_args, char_list, rnnlm=None):\n        \"\"\"E2E batch beam search.\n\n        :param list xs:\n            list of input source text feature arrays [(T_1, D), (T_2, D), ...]\n        :param Namespace trans_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n\n        # 1. Encoder\n        if self.multilingual:\n            ilens = np.fromiter((len(xx[1:]) for xx in xs), dtype=np.int64)\n            hs = [to_device(self, torch.from_numpy(xx[1:])) for xx in xs]\n        else:\n            ilens = np.fromiter((len(xx) for xx in xs), dtype=np.int64)\n            hs = [to_device(self, torch.from_numpy(xx)) for xx in xs]\n        xpad = pad_list(hs, self.pad)\n        hs_pad, hlens, _ = self.enc(self.dropout(self.embed(xpad)), ilens)\n\n        # 2. 
Decoder\n        hlens = torch.tensor(list(map(int, hlens)))  # make sure hlens is tensor\n        y = self.dec.recognize_beam_batch(\n            hs_pad, hlens, None, trans_args, char_list, rnnlm\n        )\n\n        if prev:\n            self.train()\n        return y\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: attention weights with the following shape,\n            1) multi-head case => attention weights (B, H, Lmax, Tmax),\n            2) other case => attention weights (B, Lmax, Tmax).\n        :rtype: float ndarray\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            # 1. Encoder\n            xs_pad, ys_pad = self.target_language_biasing(xs_pad, ilens, ys_pad)\n            hpad, hlens, _ = self.enc(self.dropout(self.embed(xs_pad)), ilens)\n\n            # 2. Decoder\n            att_ws = self.dec.calculate_all_attentions(hpad, hlens, ys_pad)\n        self.train()\n        return att_ws\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_mt_transformer.py",
    "content": "# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Transformer text translation model (pytorch).\"\"\"\n\nfrom argparse import Namespace\nimport logging\nimport math\n\nimport numpy as np\nimport torch\n\nfrom espnet.nets.e2e_asr_common import end_detect\nfrom espnet.nets.e2e_mt_common import ErrorCalculator\nfrom espnet.nets.mt_interface import MTInterface\nfrom espnet.nets.pytorch_backend.e2e_mt import Reporter\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import th_accuracy\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.pytorch_backend.transformer.add_sos_eos import add_sos_eos\nfrom espnet.nets.pytorch_backend.transformer.argument import (\n    add_arguments_transformer_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.decoder import Decoder\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.initializer import initialize\nfrom espnet.nets.pytorch_backend.transformer.label_smoothing_loss import (\n    LabelSmoothingLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.pytorch_backend.transformer.mask import target_mask\nfrom espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass E2E(MTInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        group 
= parser.add_argument_group(\"transformer model setting\")\n        group = add_arguments_transformer_common(group)\n        return parser\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Return PlotAttentionReport.\"\"\"\n        return PlotAttentionReport\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments for compatibility\n        args = fill_missing_args(args, self.add_arguments)\n\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n        self.encoder = Encoder(\n            idim=idim,\n            selfattention_layer_type=args.transformer_encoder_selfattn_layer_type,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            conv_wshare=args.wshare,\n            conv_kernel_length=args.ldconv_encoder_kernel_length,\n            conv_usebias=args.ldconv_usebias,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=\"embed\",\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            attention_dropout_rate=args.transformer_attn_dropout_rate,\n        )\n        self.decoder = Decoder(\n            odim=odim,\n            selfattention_layer_type=args.transformer_decoder_selfattn_layer_type,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            conv_wshare=args.wshare,\n            conv_kernel_length=args.ldconv_decoder_kernel_length,\n            conv_usebias=args.ldconv_usebias,\n            linear_units=args.dunits,\n            num_blocks=args.dlayers,\n            
dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            self_attention_dropout_rate=args.transformer_attn_dropout_rate,\n            src_attention_dropout_rate=args.transformer_attn_dropout_rate,\n        )\n        self.pad = 0  # use <blank> for padding\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.odim = odim\n        self.ignore_id = ignore_id\n        self.subsample = get_subsample(args, mode=\"mt\", arch=\"transformer\")\n        self.reporter = Reporter()\n\n        # tie source and target embeddings\n        if args.tie_src_tgt_embedding:\n            if idim != odim:\n                raise ValueError(\n                    \"When using tie_src_tgt_embedding, idim and odim must be equal.\"\n                )\n            self.encoder.embed[0].weight = self.decoder.embed[0].weight\n\n        # tie embeddings and the classifier\n        if args.tie_classifier:\n            self.decoder.output_layer.weight = self.decoder.embed[0].weight\n\n        self.criterion = LabelSmoothingLoss(\n            self.odim,\n            self.ignore_id,\n            args.lsm_weight,\n            args.transformer_length_normalized_loss,\n        )\n        self.normalize_length = args.transformer_length_normalized_loss  # for PPL\n        self.reset_parameters(args)\n        self.adim = args.adim\n        self.error_calculator = ErrorCalculator(\n            args.char_list, args.sym_space, args.sym_blank, args.report_bleu\n        )\n        self.rnnlm = None\n\n        # multilingual MT related\n        self.multilingual = args.multilingual\n\n    def reset_parameters(self, args):\n        \"\"\"Initialize parameters.\"\"\"\n        initialize(self, args.transformer_init)\n        torch.nn.init.normal_(\n            self.encoder.embed[0].weight, mean=0, std=args.adim ** -0.5\n        )\n        torch.nn.init.constant_(self.encoder.embed[0].weight[self.pad], 0)\n        torch.nn.init.normal_(\n            
self.decoder.embed[0].weight, mean=0, std=args.adim ** -0.5\n        )\n        torch.nn.init.constant_(self.decoder.embed[0].weight[self.pad], 0)\n\n    def forward(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded source sequences (B, Tmax)\n        :param torch.Tensor ilens: batch of lengths of source sequences (B)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :rtype: torch.Tensor\n        :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in attention decoder\n        :rtype: float\n        \"\"\"\n        # 1. forward encoder\n        xs_pad = xs_pad[:, : max(ilens)]  # for data parallel\n        src_mask = (~make_pad_mask(ilens.tolist())).to(xs_pad.device).unsqueeze(-2)\n        xs_pad, ys_pad = self.target_forcing(xs_pad, ys_pad)\n        hs_pad, hs_mask = self.encoder(xs_pad, src_mask)\n\n        # 2. forward decoder\n        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)\n        ys_mask = target_mask(ys_in_pad, self.ignore_id)\n        pred_pad, pred_mask = self.decoder(ys_in_pad, ys_mask, hs_pad, hs_mask)\n\n        # 3. compute attention loss\n        self.loss = self.criterion(pred_pad, ys_out_pad)\n        self.acc = th_accuracy(\n            pred_pad.view(-1, self.odim), ys_out_pad, ignore_label=self.ignore_id\n        )\n\n        # 4. 
compute corpus-level bleu in a mini-batch\n        if self.training:\n            self.bleu = None\n        else:\n            ys_hat = pred_pad.argmax(dim=-1)\n            self.bleu = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())\n\n        loss_data = float(self.loss)\n        if self.normalize_length:\n            self.ppl = np.exp(loss_data)\n        else:\n            batch_size = ys_out_pad.size(0)\n            ys_out_pad = ys_out_pad.view(-1)\n            ignore = ys_out_pad == self.ignore_id  # (B*T,)\n            total_n_tokens = len(ys_out_pad) - ignore.sum().item()\n            self.ppl = np.exp(loss_data * batch_size / total_n_tokens)\n        if not math.isnan(loss_data):\n            self.reporter.report(loss_data, self.acc, self.ppl, self.bleu)\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def scorers(self):\n        \"\"\"Scorers.\"\"\"\n        return dict(decoder=self.decoder)\n\n    def encode(self, xs):\n        \"\"\"Encode source sentences.\"\"\"\n        self.eval()\n        xs = torch.as_tensor(xs).unsqueeze(0)\n        enc_output, _ = self.encoder(xs, None)\n        return enc_output.squeeze(0)\n\n    def target_forcing(self, xs_pad, ys_pad=None, tgt_lang=None):\n        \"\"\"Prepend target language IDs to source sentences for multilingual MT.\n\n        These tags are prepended in source/target sentences as pre-processing.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax)\n        :return: source text without language IDs\n        :rtype: torch.Tensor\n        :return: target text without language IDs\n        :rtype: torch.Tensor\n        :return: target language IDs\n        :rtype: torch.Tensor (B, 1)\n        \"\"\"\n        if self.multilingual:\n            xs_pad = xs_pad[:, 1:]  # remove source language IDs here\n            if ys_pad is not None:\n                # remove language ID in the beginning\n                
lang_ids = ys_pad[:, 0].unsqueeze(1)\n                ys_pad = ys_pad[:, 1:]\n            elif tgt_lang is not None:\n                lang_ids = xs_pad.new_zeros(xs_pad.size(0), 1).fill_(tgt_lang)\n            else:\n                raise ValueError(\"Set ys_pad or tgt_lang.\")\n\n            # prepend target language ID to source sentences\n            xs_pad = torch.cat([lang_ids, xs_pad], dim=1)\n        return xs_pad, ys_pad\n\n    def translate(self, x, trans_args, char_list=None):\n        \"\"\"Translate source text.\n\n        :param list x: input source text feature (T,)\n        :param Namespace trans_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        self.eval()  # NOTE: this is important because self.encode() is not used\n        assert isinstance(x, list)\n\n        # make a utt list (1) to use the same interface for encoder\n        # NOTE: keep `x` as the input list; translate() is re-invoked with it\n        # when no hypothesis ends, and that call asserts a list again\n        if self.multilingual:\n            src = to_device(\n                self, torch.from_numpy(np.fromiter(map(int, x[0][1:]), dtype=np.int64))\n            )\n        else:\n            src = to_device(\n                self, torch.from_numpy(np.fromiter(map(int, x[0]), dtype=np.int64))\n            )\n\n        logging.info(\"input lengths: \" + str(src.size(0)))\n        xs_pad = src.unsqueeze(0)\n        tgt_lang = None\n        if trans_args.tgt_lang:\n            tgt_lang = char_list.index(trans_args.tgt_lang)\n        xs_pad, _ = self.target_forcing(xs_pad, tgt_lang=tgt_lang)\n        h, _ = self.encoder(xs_pad, None)\n        logging.info(\"encoder output lengths: \" + str(h.size(1)))\n\n        # search params\n        beam = trans_args.beam_size\n        penalty = trans_args.penalty\n\n        if trans_args.maxlenratio == 0:\n            maxlen = h.size(1)\n        else:\n            # maxlen >= 1\n            maxlen = max(1, int(trans_args.maxlenratio * h.size(1)))\n        minlen = 
int(trans_args.minlenratio * h.size(1))\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        hyp = {\"score\": 0.0, \"yseq\": [self.sos]}\n        hyps = [hyp]\n        ended_hyps = []\n\n        for i in range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            # batchify\n            ys = h.new_zeros((len(hyps), i + 1), dtype=torch.int64)\n            for j, hyp in enumerate(hyps):\n                ys[j, :] = torch.tensor(hyp[\"yseq\"])\n            ys_mask = subsequent_mask(i + 1).unsqueeze(0).to(h.device)\n\n            local_scores = self.decoder.forward_one_step(\n                ys, ys_mask, h.repeat([len(hyps), 1, 1])\n            )[0]\n\n            hyps_best_kept = []\n            for j, hyp in enumerate(hyps):\n                local_best_scores, local_best_ids = torch.topk(\n                    local_scores[j : j + 1], beam, dim=1\n                )\n\n                # use a separate index so the hypothesis index j is not shadowed\n                for k in range(beam):\n                    new_hyp = {}\n                    new_hyp[\"score\"] = hyp[\"score\"] + float(local_best_scores[0, k])\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = int(local_best_ids[0, k])\n                    # will be (2 x beam) hyps at most\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypotheses: \" + str(len(hyps)))\n            if char_list is not None:\n                logging.debug(\n                    \"best hypo: \"\n                    + \"\".join([char_list[int(x)] for x in 
hyps[0][\"yseq\"][1:]])\n                )\n\n            # add eos in the final loop to avoid that there are no ended hyps\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last position in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.eos)\n\n            # add ended hypotheses to a final list, and remove them from the current\n            # hypotheses (this will be a problem, number of hyps < beam)\n            remained_hyps = []\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * penalty\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # end detection\n            if end_detect(ended_hyps, i) and trans_args.maxlenratio == 0.0:\n                logging.info(\"end detected at %d\", i)\n                break\n\n            hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. 
Finish decoding.\")\n                break\n\n            if char_list is not None:\n                for hyp in hyps:\n                    logging.debug(\n                        \"hypo: \" + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]])\n                    )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[\n            : min(len(ended_hyps), trans_args.nbest)\n        ]\n\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there are no N-best results, perform translation \"\n                \"again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            trans_args = Namespace(**vars(trans_args))\n            trans_args.minlenratio = max(0.0, trans_args.minlenratio - 0.1)\n            return self.translate(x, trans_args, char_list)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n        return nbest_hyps\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :return: attention weights (B, H, Lmax, Tmax)\n        :rtype: float ndarray\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            self.forward(xs_pad, ilens, ys_pad)\n        ret = dict()\n        for name, m in self.named_modules():\n            if isinstance(m, MultiHeadedAttention) and m.attn is not None:\n             
   ret[name] = m.attn.cpu().numpy()\n        self.train()\n        return ret\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_st.py",
    "content": "# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"RNN sequence-to-sequence speech translation model (pytorch).\"\"\"\n\nimport argparse\nimport copy\nimport logging\nimport math\nimport os\n\nimport editdistance\nimport nltk\n\nimport chainer\nimport numpy as np\nimport six\nimport torch\n\nfrom itertools import groupby\n\nfrom chainer import reporter\n\nfrom espnet.nets.e2e_asr_common import label_smoothing_dist\nfrom espnet.nets.pytorch_backend.ctc import CTC\nfrom espnet.nets.pytorch_backend.initialization import lecun_normal_init_parameters\nfrom espnet.nets.pytorch_backend.initialization import set_forget_bias_to_one\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.pytorch_backend.nets_utils import to_torch_tensor\nfrom espnet.nets.pytorch_backend.rnn.argument import (\n    add_arguments_rnn_encoder_common,  # noqa: H301\n    add_arguments_rnn_decoder_common,  # noqa: H301\n    add_arguments_rnn_attention_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.rnn.attentions import att_for\nfrom espnet.nets.pytorch_backend.rnn.decoders import decoder_for\nfrom espnet.nets.pytorch_backend.rnn.encoders import encoder_for\nfrom espnet.nets.st_interface import STInterface\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\nCTC_LOSS_THRESHOLD = 10000\n\n\nclass Reporter(chainer.Chain):\n    \"\"\"A chainer reporter wrapper.\"\"\"\n\n    def report(\n        self,\n        loss_asr,\n        loss_mt,\n        loss_st,\n        acc_asr,\n        acc_mt,\n        acc,\n        cer_ctc,\n        cer,\n        wer,\n        bleu,\n        mtl_loss,\n    ):\n        \"\"\"Report at every step.\"\"\"\n        reporter.report({\"loss_asr\": loss_asr}, self)\n        reporter.report({\"loss_mt\": loss_mt}, self)\n   
     reporter.report({\"loss_st\": loss_st}, self)\n        reporter.report({\"acc_asr\": acc_asr}, self)\n        reporter.report({\"acc_mt\": acc_mt}, self)\n        reporter.report({\"acc\": acc}, self)\n        reporter.report({\"cer_ctc\": cer_ctc}, self)\n        reporter.report({\"cer\": cer}, self)\n        reporter.report({\"wer\": wer}, self)\n        reporter.report({\"bleu\": bleu}, self)\n        logging.info(\"mtl loss:\" + str(mtl_loss))\n        reporter.report({\"loss\": mtl_loss}, self)\n\n\nclass E2E(STInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2E.encoder_add_arguments(parser)\n        E2E.attention_add_arguments(parser)\n        E2E.decoder_add_arguments(parser)\n        return parser\n\n    @staticmethod\n    def encoder_add_arguments(parser):\n        \"\"\"Add arguments for the encoder.\"\"\"\n        group = parser.add_argument_group(\"E2E encoder setting\")\n        group = add_arguments_rnn_encoder_common(group)\n        return parser\n\n    @staticmethod\n    def attention_add_arguments(parser):\n        \"\"\"Add arguments for the attention.\"\"\"\n        group = parser.add_argument_group(\"E2E attention setting\")\n        group = add_arguments_rnn_attention_common(group)\n        return parser\n\n    @staticmethod\n    def decoder_add_arguments(parser):\n        \"\"\"Add arguments for the decoder.\"\"\"\n        group = parser.add_argument_group(\"E2E decoder setting\")\n        group = add_arguments_rnn_decoder_common(group)\n        return parser\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        return self.enc.conv_subsampling_factor * int(np.prod(self.subsample))\n\n    def __init__(self, idim, odim, 
args):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        super(E2E, self).__init__()\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments for compatibility\n        args = fill_missing_args(args, self.add_arguments)\n\n        self.asr_weight = args.asr_weight\n        self.mt_weight = args.mt_weight\n        self.mtlalpha = args.mtlalpha\n        assert 0.0 <= self.asr_weight < 1.0, \"asr_weight should be [0.0, 1.0)\"\n        assert 0.0 <= self.mt_weight < 1.0, \"mt_weight should be [0.0, 1.0)\"\n        assert 0.0 <= self.mtlalpha <= 1.0, \"mtlalpha should be [0.0, 1.0]\"\n        self.etype = args.etype\n        self.verbose = args.verbose\n        # NOTE: for self.build method\n        args.char_list = getattr(args, \"char_list\", None)\n        self.char_list = args.char_list\n        self.outdir = args.outdir\n        self.space = args.sym_space\n        self.blank = args.sym_blank\n        self.reporter = Reporter()\n\n        # below means the last number becomes eos/sos ID\n        # note that sos/eos IDs are identical\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.pad = 0\n        # NOTE: we reserve index:0 for <pad> although this is reserved for a blank class\n        # in ASR. 
However, blank labels are not used in MT.\n        # To keep the vocabulary size,\n        # we use index:0 for padding instead of adding one more class.\n\n        # subsample info\n        self.subsample = get_subsample(args, mode=\"st\", arch=\"rnn\")\n\n        # label smoothing info\n        if args.lsm_type and os.path.isfile(args.train_json):\n            logging.info(\"Use label smoothing with \" + args.lsm_type)\n            labeldist = label_smoothing_dist(\n                odim, args.lsm_type, transcript=args.train_json\n            )\n        else:\n            labeldist = None\n\n        # multilingual related\n        self.multilingual = getattr(args, \"multilingual\", False)\n        self.replace_sos = getattr(args, \"replace_sos\", False)\n\n        # encoder\n        self.enc = encoder_for(args, idim, self.subsample)\n        # attention (ST)\n        self.att = att_for(args)\n        # decoder (ST)\n        self.dec = decoder_for(args, odim, self.sos, self.eos, self.att, labeldist)\n\n        # submodule for ASR task\n        self.ctc = None\n        self.att_asr = None\n        self.dec_asr = None\n        if self.asr_weight > 0:\n            if self.mtlalpha > 0.0:\n                self.ctc = CTC(\n                    odim,\n                    args.eprojs,\n                    args.dropout_rate,\n                    ctc_type=args.ctc_type,\n                    reduce=True,\n                )\n            if self.mtlalpha < 1.0:\n                # attention (asr)\n                self.att_asr = att_for(args)\n                # decoder (asr)\n                args_asr = copy.deepcopy(args)\n                args_asr.atype = \"location\"  # TODO(hirofumi0810): make this option\n                self.dec_asr = decoder_for(\n                    args_asr, odim, self.sos, self.eos, self.att_asr, labeldist\n                )\n\n        # submodule for MT task\n        if self.mt_weight > 0:\n            self.embed_mt = torch.nn.Embedding(odim, 
args.eunits, padding_idx=self.pad)\n            self.dropout_mt = torch.nn.Dropout(p=args.dropout_rate)\n            # NOTE: the np.int alias was removed in NumPy 1.24; builtin int is equivalent\n            self.enc_mt = encoder_for(\n                args, args.eunits, subsample=np.ones(args.elayers + 1, dtype=int)\n            )\n\n        # weight initialization\n        self.init_like_chainer()\n\n        # options for beam search\n        # NOTE: parenthesized so CER/WER reporting requires the ASR sub-task\n        if self.asr_weight > 0 and (args.report_cer or args.report_wer):\n            recog_args = {\n                \"beam_size\": args.beam_size,\n                \"penalty\": args.penalty,\n                \"ctc_weight\": args.ctc_weight,\n                \"maxlenratio\": args.maxlenratio,\n                \"minlenratio\": args.minlenratio,\n                \"lm_weight\": args.lm_weight,\n                \"rnnlm\": args.rnnlm,\n                \"nbest\": args.nbest,\n                \"space\": args.sym_space,\n                \"blank\": args.sym_blank,\n                \"tgt_lang\": False,\n            }\n\n            self.recog_args = argparse.Namespace(**recog_args)\n            self.report_cer = args.report_cer\n            self.report_wer = args.report_wer\n        else:\n            self.report_cer = False\n            self.report_wer = False\n        if args.report_bleu:\n            trans_args = {\n                \"beam_size\": args.beam_size,\n                \"penalty\": args.penalty,\n                \"ctc_weight\": 0,\n                \"maxlenratio\": args.maxlenratio,\n                \"minlenratio\": args.minlenratio,\n                \"lm_weight\": args.lm_weight,\n                \"rnnlm\": args.rnnlm,\n                \"nbest\": args.nbest,\n                \"space\": args.sym_space,\n                \"blank\": args.sym_blank,\n                \"tgt_lang\": False,\n            }\n\n            self.trans_args = argparse.Namespace(**trans_args)\n            self.report_bleu = args.report_bleu\n        else:\n            self.report_bleu = False\n        self.rnnlm = None\n\n        self.logzero = 
-10000000000.0\n        self.loss = None\n        self.acc = None\n\n    def init_like_chainer(self):\n        \"\"\"Initialize weight like chainer.\n\n        chainer basically uses LeCun way: W ~ Normal(0, fan_in ** -0.5), b = 0\n        pytorch basically uses W, b ~ Uniform(-fan_in**-0.5, fan_in**-0.5)\n        however, there are two exceptions as far as I know.\n        - EmbedID.W ~ Normal(0, 1)\n        - LSTM.upward.b[forget_gate_range] = 1 (but not used in NStepLSTM)\n        \"\"\"\n        lecun_normal_init_parameters(self)\n        # exceptions\n        # embed weight ~ Normal(0, 1)\n        self.dec.embed.weight.data.normal_(0, 1)\n        # forget-bias = 1.0\n        # https://discuss.pytorch.org/t/set-forget-gate-bias-of-lstm/1745\n        for i in six.moves.range(len(self.dec.decoder)):\n            set_forget_bias_to_one(self.dec.decoder[i].bias_ih)\n\n    def forward(self, xs_pad, ilens, ys_pad, ys_pad_src):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :param torch.Tensor ys_pad_src:\n            batch of padded source-language token id sequence tensor (B, Lmax)\n        :return: loss value\n        :rtype: torch.Tensor\n        \"\"\"\n        # 0. Extract target language ID\n        if self.multilingual:\n            tgt_lang_ids = ys_pad[:, 0:1]\n            ys_pad = ys_pad[:, 1:]  # remove target language ID in the beginning\n        else:\n            tgt_lang_ids = None\n\n        # 1. Encoder\n        hs_pad, hlens, _ = self.enc(xs_pad, ilens)\n\n        # 2. ST attention loss\n        self.loss_st, self.acc, _ = self.dec(\n            hs_pad, hlens, ys_pad, lang_ids=tgt_lang_ids\n        )\n\n        # 3. 
ASR loss\n        (\n            self.loss_asr_att,\n            acc_asr,\n            self.loss_asr_ctc,\n            cer_ctc,\n            cer,\n            wer,\n        ) = self.forward_asr(hs_pad, hlens, ys_pad_src)\n\n        # 4. MT attention loss\n        self.loss_mt, acc_mt = self.forward_mt(ys_pad, ys_pad_src)\n\n        # 5. Compute BLEU\n        if self.training or not self.report_bleu:\n            self.bleu = 0.0\n        else:\n            lpz = None\n\n            nbest_hyps = self.dec.recognize_beam_batch(\n                hs_pad,\n                torch.tensor(hlens),\n                lpz,\n                self.trans_args,\n                self.char_list,\n                self.rnnlm,\n                lang_ids=tgt_lang_ids.squeeze(1).tolist()\n                if self.multilingual\n                else None,\n            )\n            # remove <sos> and <eos>\n            list_of_refs = []\n            hyps = []\n            y_hats = [nbest_hyp[0][\"yseq\"][1:-1] for nbest_hyp in nbest_hyps]\n            for i, y_hat in enumerate(y_hats):\n                y_true = ys_pad[i]\n\n                seq_hat = [self.char_list[int(idx)] for idx in y_hat if int(idx) != -1]\n                seq_true = [\n                    self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                ]\n                seq_hat_text = \"\".join(seq_hat).replace(self.trans_args.space, \" \")\n                seq_hat_text = seq_hat_text.replace(self.trans_args.blank, \"\")\n                seq_true_text = \"\".join(seq_true).replace(self.trans_args.space, \" \")\n\n                hyps += [seq_hat_text.split(\" \")]\n                list_of_refs += [[seq_true_text.split(\" \")]]\n\n            self.bleu = nltk.bleu_score.corpus_bleu(list_of_refs, hyps) * 100\n\n        asr_ctc_weight = self.mtlalpha\n        self.loss_asr = (\n            asr_ctc_weight * self.loss_asr_ctc\n            + (1 - asr_ctc_weight) * self.loss_asr_att\n        )\n        self.loss = 
(\n            (1 - self.asr_weight - self.mt_weight) * self.loss_st\n            + self.asr_weight * self.loss_asr\n            + self.mt_weight * self.loss_mt\n        )\n        loss_st_data = float(self.loss_st)\n        loss_asr_data = float(self.loss_asr)\n        loss_mt_data = float(self.loss_mt)\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_asr_data,\n                loss_mt_data,\n                loss_st_data,\n                acc_asr,\n                acc_mt,\n                self.acc,\n                cer_ctc,\n                cer,\n                wer,\n                self.bleu,\n                loss_data,\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def forward_asr(self, hs_pad, hlens, ys_pad):\n        \"\"\"Forward pass in the auxiliary ASR task.\n\n        :param torch.Tensor hs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor hlens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :return: ASR attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in ASR attention decoder\n        :rtype: float\n        :return: ASR CTC loss value\n        :rtype: torch.Tensor\n        :return: character error rate from CTC prediction\n        :rtype: float\n        :return: character error rate from attention decoder prediction\n        :rtype: float\n        :return: word error rate from attention decoder prediction\n        :rtype: float\n        \"\"\"\n        loss_att, loss_ctc = 0.0, 0.0\n        acc = None\n        cer, wer = None, None\n        cer_ctc = None\n        if self.asr_weight == 0:\n            return loss_att, acc, loss_ctc, cer_ctc, cer, wer\n\n        # attention\n        if self.mtlalpha < 1:\n      
      loss_asr, acc_asr, _ = self.dec_asr(hs_pad, hlens, ys_pad)\n\n            # Compute wer and cer\n            if not self.training and (self.report_cer or self.report_wer):\n                if self.mtlalpha > 0 and self.recog_args.ctc_weight > 0.0:\n                    lpz = self.ctc.log_softmax(hs_pad).data\n                else:\n                    lpz = None\n\n                word_eds, word_ref_lens, char_eds, char_ref_lens = [], [], [], []\n                nbest_hyps_asr = self.dec_asr.recognize_beam_batch(\n                    hs_pad,\n                    torch.tensor(hlens),\n                    lpz,\n                    self.recog_args,\n                    self.char_list,\n                    self.rnnlm,\n                )\n                # remove <sos> and <eos>\n                y_hats = [nbest_hyp[0][\"yseq\"][1:-1] for nbest_hyp in nbest_hyps_asr]\n                for i, y_hat in enumerate(y_hats):\n                    y_true = ys_pad[i]\n\n                    seq_hat = [\n                        self.char_list[int(idx)] for idx in y_hat if int(idx) != -1\n                    ]\n                    seq_true = [\n                        self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                    ]\n                    seq_hat_text = \"\".join(seq_hat).replace(self.recog_args.space, \" \")\n                    seq_hat_text = seq_hat_text.replace(self.recog_args.blank, \"\")\n                    seq_true_text = \"\".join(seq_true).replace(\n                        self.recog_args.space, \" \"\n                    )\n\n                    hyp_words = seq_hat_text.split()\n                    ref_words = seq_true_text.split()\n                    word_eds.append(editdistance.eval(hyp_words, ref_words))\n                    word_ref_lens.append(len(ref_words))\n                    hyp_chars = seq_hat_text.replace(\" \", \"\")\n                    ref_chars = seq_true_text.replace(\" \", \"\")\n                    
char_eds.append(editdistance.eval(hyp_chars, ref_chars))\n                    char_ref_lens.append(len(ref_chars))\n\n                wer = (\n                    0.0\n                    if not self.report_wer\n                    else float(sum(word_eds)) / sum(word_ref_lens)\n                )\n                cer = (\n                    0.0\n                    if not self.report_cer\n                    else float(sum(char_eds)) / sum(char_ref_lens)\n                )\n\n        # CTC\n        if self.mtlalpha > 0:\n            loss_ctc = self.ctc(hs_pad, hlens, ys_pad)\n\n            # Compute cer with CTC prediction\n            if self.char_list is not None:\n                cers = []\n                y_hats = self.ctc.argmax(hs_pad).data\n                for i, y in enumerate(y_hats):\n                    y_hat = [x[0] for x in groupby(y)]\n                    y_true = ys_pad[i]\n\n                    seq_hat = [\n                        self.char_list[int(idx)] for idx in y_hat if int(idx) != -1\n                    ]\n                    seq_true = [\n                        self.char_list[int(idx)] for idx in y_true if int(idx) != -1\n                    ]\n                    seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n                    seq_hat_text = seq_hat_text.replace(self.blank, \"\")\n                    seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n\n                    hyp_chars = seq_hat_text.replace(\" \", \"\")\n                    ref_chars = seq_true_text.replace(\" \", \"\")\n                    if len(ref_chars) > 0:\n                        cers.append(\n                            editdistance.eval(hyp_chars, ref_chars) / len(ref_chars)\n                        )\n                cer_ctc = sum(cers) / len(cers) if cers else None\n\n        return loss_att, acc, loss_ctc, cer_ctc, cer, wer\n\n    def forward_mt(self, xs_pad, ys_pad):\n        \"\"\"Forward pass in the auxiliary MT task.\n\n        
:param torch.Tensor xs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :return: MT loss value\n        :rtype: torch.Tensor\n        :return: accuracy in MT decoder\n        :rtype: float\n        \"\"\"\n        loss = 0.0\n        acc = 0.0\n        if self.mt_weight == 0:\n            return loss, acc\n\n        ilens = torch.sum(xs_pad != -1, dim=1).cpu().numpy()\n        # NOTE: xs_pad is padded with -1\n        ys_src = [y[y != -1] for y in xs_pad]  # parse padded ys_src\n        xs_zero_pad = pad_list(ys_src, self.pad)  # re-pad with zero\n        hs_pad, hlens, _ = self.enc_mt(\n            self.dropout_mt(self.embed_mt(xs_zero_pad)), ilens\n        )\n        loss, acc, _ = self.dec(hs_pad, hlens, ys_pad)\n        return loss, acc\n\n    def scorers(self):\n        \"\"\"Scorers.\"\"\"\n        return dict(decoder=self.dec)\n\n    def encode(self, x):\n        \"\"\"Encode acoustic features.\n\n        :param ndarray x: input acoustic feature (T, D)\n        :return: encoder outputs\n        :rtype: torch.Tensor\n        \"\"\"\n        self.eval()\n        ilens = [x.shape[0]]\n\n        # subsample frame\n        x = x[:: self.subsample[0], :]\n        p = next(self.parameters())\n        h = torch.as_tensor(x, device=p.device, dtype=p.dtype)\n        # make a utt list (1) to use the same interface for encoder\n        hs = h.contiguous().unsqueeze(0)\n\n        # 1. 
encoder\n        hs, _, _ = self.enc(hs, ilens)\n        return hs.squeeze(0)\n\n    def translate(self, x, trans_args, char_list, rnnlm=None):\n        \"\"\"E2E beam search.\n\n        :param ndarray x: input acoustic feature (T, D)\n        :param Namespace trans_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        logging.info(\"input lengths: \" + str(x.shape[0]))\n        hs = self.encode(x).unsqueeze(0)\n        logging.info(\"encoder output lengths: \" + str(hs.size(1)))\n\n        # 2. Decoder\n        # decode the first utterance\n        y = self.dec.recognize_beam(hs[0], None, trans_args, char_list, rnnlm)\n        return y\n\n    def translate_batch(self, xs, trans_args, char_list, rnnlm=None):\n        \"\"\"E2E batch beam search.\n\n        :param list xs: list of input acoustic feature arrays [(T_1, D), (T_2, D), ...]\n        :param Namespace trans_args: argument Namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        prev = self.training\n        self.eval()\n        ilens = np.fromiter((xx.shape[0] for xx in xs), dtype=np.int64)\n\n        # subsample frame\n        xs = [xx[:: self.subsample[0], :] for xx in xs]\n        xs = [to_device(self, to_torch_tensor(xx).float()) for xx in xs]\n        xs_pad = pad_list(xs, 0.0)\n\n        # 1. Encoder\n        hs_pad, hlens, _ = self.enc(xs_pad, ilens)\n\n        # 2. 
Decoder\n        hlens = torch.tensor(list(map(int, hlens)))  # make sure hlens is tensor\n        y = self.dec.recognize_beam_batch(\n            hs_pad, hlens, None, trans_args, char_list, rnnlm\n        )\n\n        if prev:\n            self.train()\n        return y\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad, ys_pad_src):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :param torch.Tensor ys_pad_src:\n            batch of padded token id sequence tensor (B, Lmax)\n        :return: attention weights with the following shape,\n            1) multi-head case => attention weights (B, H, Lmax, Tmax),\n            2) other case => attention weights (B, Lmax, Tmax).\n        :rtype: float ndarray\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            # 1. Encoder\n            if self.multilingual:\n                tgt_lang_ids = ys_pad[:, 0:1]\n                ys_pad = ys_pad[:, 1:]  # remove target language ID in the beggining\n            else:\n                tgt_lang_ids = None\n            hpad, hlens, _ = self.enc(xs_pad, ilens)\n\n            # 2. 
Decoder\n            att_ws = self.dec.calculate_all_attentions(\n                hpad, hlens, ys_pad, lang_ids=tgt_lang_ids\n            )\n        self.train()\n        return att_ws\n\n    def calculate_all_ctc_probs(self, xs_pad, ilens, ys_pad, ys_pad_src):\n        \"\"\"E2E CTC probability calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :param torch.Tensor ys_pad_src:\n            batch of padded token id sequence tensor (B, Lmax)\n        :return: CTC probability (B, Tmax, vocab)\n        :rtype: float ndarray\n        \"\"\"\n        probs = None\n        if self.asr_weight == 0 or self.mtlalpha == 0:\n            return probs\n\n        self.eval()\n        with torch.no_grad():\n            # 1. Encoder\n            hpad, hlens, _ = self.enc(xs_pad, ilens)\n\n            # 2. CTC probs\n            probs = self.ctc.softmax(hpad).cpu().numpy()\n        self.train()\n        return probs\n\n    def subsample_frames(self, x):\n        \"\"\"Subsample speech frames in the encoder.\"\"\"\n        # subsample frame\n        x = x[:: self.subsample[0], :]\n        ilen = [x.shape[0]]\n        h = to_device(self, torch.from_numpy(np.array(x, dtype=np.float32)))\n        h.contiguous()\n        return h, ilen\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_st_conformer.py",
    "content": "# Copyright 2020 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"\nConformer speech translation model (pytorch).\n\nIt is a fusion of `e2e_st_transformer.py`\nRefer to: https://arxiv.org/abs/2005.08100\n\n\"\"\"\n\nfrom espnet.nets.pytorch_backend.conformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.e2e_st_transformer import E2E as E2ETransformer\nfrom espnet.nets.pytorch_backend.conformer.argument import (\n    add_arguments_conformer_common,  # noqa: H301\n    verify_rel_pos_type,  # noqa: H301\n)\n\n\nclass E2E(E2ETransformer):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        E2ETransformer.add_arguments(parser)\n        E2E.add_conformer_arguments(parser)\n        return parser\n\n    @staticmethod\n    def add_conformer_arguments(parser):\n        \"\"\"Add arguments for conformer model.\"\"\"\n        group = parser.add_argument_group(\"conformer model specific setting\")\n        group = add_arguments_conformer_common(group)\n        return parser\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        super().__init__(idim, odim, args, ignore_id)\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n\n        # Check the relative positional encoding type\n        args = verify_rel_pos_type(args)\n\n        self.encoder = Encoder(\n            idim=idim,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n      
      linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=args.transformer_input_layer,\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            attention_dropout_rate=args.transformer_attn_dropout_rate,\n            pos_enc_layer_type=args.transformer_encoder_pos_enc_layer_type,\n            selfattention_layer_type=args.transformer_encoder_selfattn_layer_type,\n            activation_type=args.transformer_encoder_activation_type,\n            macaron_style=args.macaron_style,\n            use_cnn_module=args.use_cnn_module,\n            cnn_module_kernel=args.cnn_module_kernel,\n        )\n        self.reset_parameters(args)\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_st_transformer.py",
    "content": "# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Transformer speech translation model (pytorch).\"\"\"\n\nfrom argparse import Namespace\nimport logging\nimport math\nimport numpy\n\nimport torch\n\nfrom espnet.nets.e2e_asr_common import end_detect\nfrom espnet.nets.e2e_asr_common import ErrorCalculator as ASRErrorCalculator\nfrom espnet.nets.e2e_mt_common import ErrorCalculator as MTErrorCalculator\nfrom espnet.nets.pytorch_backend.ctc import CTC\nfrom espnet.nets.pytorch_backend.e2e_asr import CTC_LOSS_THRESHOLD\nfrom espnet.nets.pytorch_backend.e2e_st import Reporter\nfrom espnet.nets.pytorch_backend.nets_utils import get_subsample\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import th_accuracy\nfrom espnet.nets.pytorch_backend.transformer.add_sos_eos import add_sos_eos\nfrom espnet.nets.pytorch_backend.transformer.argument import (\n    add_arguments_transformer_common,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.decoder import Decoder\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.initializer import initialize\nfrom espnet.nets.pytorch_backend.transformer.label_smoothing_loss import (\n    LabelSmoothingLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.pytorch_backend.transformer.mask import target_mask\nfrom espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\nfrom espnet.nets.st_interface import STInterface\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass E2E(STInterface, torch.nn.Module):\n    \"\"\"E2E module.\n\n    :param int idim: dimension of inputs\n    :param 
int odim: dimension of outputs\n    :param Namespace args: argument Namespace containing options\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments.\"\"\"\n        group = parser.add_argument_group(\"transformer model setting\")\n        group = add_arguments_transformer_common(group)\n        return parser\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Return PlotAttentionReport.\"\"\"\n        return PlotAttentionReport\n\n    def get_total_subsampling_factor(self):\n        \"\"\"Get total subsampling factor.\"\"\"\n        return self.encoder.conv_subsampling_factor * int(numpy.prod(self.subsample))\n\n    def __init__(self, idim, odim, args, ignore_id=-1):\n        \"\"\"Construct an E2E object.\n\n        :param int idim: dimension of inputs\n        :param int odim: dimension of outputs\n        :param Namespace args: argument Namespace containing options\n        \"\"\"\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments for compatibility\n        args = fill_missing_args(args, self.add_arguments)\n\n        if args.transformer_attn_dropout_rate is None:\n            args.transformer_attn_dropout_rate = args.dropout_rate\n        self.encoder = Encoder(\n            idim=idim,\n            selfattention_layer_type=args.transformer_encoder_selfattn_layer_type,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            conv_wshare=args.wshare,\n            conv_kernel_length=args.ldconv_encoder_kernel_length,\n            conv_usebias=args.ldconv_usebias,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=args.transformer_input_layer,\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            attention_dropout_rate=args.transformer_attn_dropout_rate,\n        )\n        self.decoder = Decoder(\n            odim=odim,\n            
selfattention_layer_type=args.transformer_decoder_selfattn_layer_type,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            conv_wshare=args.wshare,\n            conv_kernel_length=args.ldconv_decoder_kernel_length,\n            conv_usebias=args.ldconv_usebias,\n            linear_units=args.dunits,\n            num_blocks=args.dlayers,\n            dropout_rate=args.dropout_rate,\n            positional_dropout_rate=args.dropout_rate,\n            self_attention_dropout_rate=args.transformer_attn_dropout_rate,\n            src_attention_dropout_rate=args.transformer_attn_dropout_rate,\n        )\n        self.pad = 0  # use <blank> for padding\n        self.sos = odim - 1\n        self.eos = odim - 1\n        self.odim = odim\n        self.ignore_id = ignore_id\n        self.subsample = get_subsample(args, mode=\"st\", arch=\"transformer\")\n        self.reporter = Reporter()\n\n        self.criterion = LabelSmoothingLoss(\n            self.odim,\n            self.ignore_id,\n            args.lsm_weight,\n            args.transformer_length_normalized_loss,\n        )\n        # submodule for ASR task\n        self.mtlalpha = args.mtlalpha\n        self.asr_weight = args.asr_weight\n        if self.asr_weight > 0 and args.mtlalpha < 1:\n            self.decoder_asr = Decoder(\n                odim=odim,\n                attention_dim=args.adim,\n                attention_heads=args.aheads,\n                linear_units=args.dunits,\n                num_blocks=args.dlayers,\n                dropout_rate=args.dropout_rate,\n                positional_dropout_rate=args.dropout_rate,\n                self_attention_dropout_rate=args.transformer_attn_dropout_rate,\n                src_attention_dropout_rate=args.transformer_attn_dropout_rate,\n            )\n\n        # submodule for MT task\n        self.mt_weight = args.mt_weight\n        if self.mt_weight > 0:\n            self.encoder_mt = Encoder(\n                
idim=odim,\n                attention_dim=args.adim,\n                attention_heads=args.aheads,\n                linear_units=args.dunits,\n                num_blocks=args.dlayers,\n                input_layer=\"embed\",\n                dropout_rate=args.dropout_rate,\n                positional_dropout_rate=args.dropout_rate,\n                attention_dropout_rate=args.transformer_attn_dropout_rate,\n                padding_idx=0,\n            )\n        self.reset_parameters(args)  # NOTE: place after the submodule initialization\n        self.adim = args.adim  # used for CTC (equal to d_model)\n        if self.asr_weight > 0 and args.mtlalpha > 0.0:\n            self.ctc = CTC(\n                odim, args.adim, args.dropout_rate, ctc_type=args.ctc_type, reduce=True\n            )\n        else:\n            self.ctc = None\n\n        # translation error calculator\n        self.error_calculator = MTErrorCalculator(\n            args.char_list, args.sym_space, args.sym_blank, args.report_bleu\n        )\n\n        # recognition error calculator\n        self.error_calculator_asr = ASRErrorCalculator(\n            args.char_list,\n            args.sym_space,\n            args.sym_blank,\n            args.report_cer,\n            args.report_wer,\n        )\n        self.rnnlm = None\n\n        # multilingual E2E-ST related\n        self.multilingual = getattr(args, \"multilingual\", False)\n        self.replace_sos = getattr(args, \"replace_sos\", False)\n\n    def reset_parameters(self, args):\n        \"\"\"Initialize parameters.\"\"\"\n        initialize(self, args.transformer_init)\n        if self.mt_weight > 0:\n            torch.nn.init.normal_(\n                self.encoder_mt.embed[0].weight, mean=0, std=args.adim ** -0.5\n            )\n            torch.nn.init.constant_(self.encoder_mt.embed[0].weight[self.pad], 0)\n\n    def forward(self, xs_pad, ilens, ys_pad, ys_pad_src):\n        \"\"\"E2E forward.\n\n        :param torch.Tensor xs_pad: batch 
of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of source sequences (B)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :param torch.Tensor ys_pad_src: batch of padded target sequences (B, Lmax)\n        :return: ctc loss value\n        :rtype: torch.Tensor\n        :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in attention decoder\n        :rtype: float\n        \"\"\"\n        # 0. Extract target language ID\n        tgt_lang_ids = None\n        if self.multilingual:\n            tgt_lang_ids = ys_pad[:, 0:1]\n            ys_pad = ys_pad[:, 1:]  # remove target language ID in the beggining\n\n        # 1. forward encoder\n        xs_pad = xs_pad[:, : max(ilens)]  # for data parallel\n        src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2)\n        hs_pad, hs_mask = self.encoder(xs_pad, src_mask)\n\n        # 2. forward decoder\n        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)\n        # replace <sos> with target language ID\n        if self.replace_sos:\n            ys_in_pad = torch.cat([tgt_lang_ids, ys_in_pad[:, 1:]], dim=1)\n        ys_mask = target_mask(ys_in_pad, self.ignore_id)\n        pred_pad, pred_mask = self.decoder(ys_in_pad, ys_mask, hs_pad, hs_mask)\n\n        # 3. compute ST loss\n        loss_att = self.criterion(pred_pad, ys_out_pad)\n\n        self.acc = th_accuracy(\n            pred_pad.view(-1, self.odim), ys_out_pad, ignore_label=self.ignore_id\n        )\n\n        # 4. compute corpus-level bleu in a mini-batch\n        if self.training:\n            self.bleu = None\n        else:\n            ys_hat = pred_pad.argmax(dim=-1)\n            self.bleu = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())\n\n        # 5. 
compute auxiliary ASR loss\n        loss_asr_att, acc_asr, loss_asr_ctc, cer_ctc, cer, wer = self.forward_asr(\n            hs_pad, hs_mask, ys_pad_src\n        )\n\n        # 6. compute auxiliary MT loss\n        loss_mt, acc_mt = 0.0, None\n        if self.mt_weight > 0:\n            loss_mt, acc_mt = self.forward_mt(\n                ys_pad_src, ys_in_pad, ys_out_pad, ys_mask\n            )\n\n        asr_ctc_weight = self.mtlalpha\n        self.loss = (\n            (1 - self.asr_weight - self.mt_weight) * loss_att\n            + self.asr_weight\n            * (asr_ctc_weight * loss_asr_ctc + (1 - asr_ctc_weight) * loss_asr_att)\n            + self.mt_weight * loss_mt\n        )\n        loss_asr_data = float(\n            asr_ctc_weight * loss_asr_ctc + (1 - asr_ctc_weight) * loss_asr_att\n        )\n        loss_mt_data = None if self.mt_weight == 0 else float(loss_mt)\n        loss_st_data = float(loss_att)\n\n        loss_data = float(self.loss)\n        if loss_data < CTC_LOSS_THRESHOLD and not math.isnan(loss_data):\n            self.reporter.report(\n                loss_asr_data,\n                loss_mt_data,\n                loss_st_data,\n                acc_asr,\n                acc_mt,\n                self.acc,\n                cer_ctc,\n                cer,\n                wer,\n                self.bleu,\n                loss_data,\n            )\n        else:\n            logging.warning(\"loss (=%f) is not correct\", loss_data)\n        return self.loss\n\n    def forward_asr(self, hs_pad, hs_mask, ys_pad):\n        \"\"\"Forward pass in the auxiliary ASR task.\n\n        :param torch.Tensor hs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor hs_mask: batch of input token mask (B, Lmax)\n        :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n        :return: ASR attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy in ASR attention decoder\n        :rtype: 
float\n        :return: ASR CTC loss value\n        :rtype: torch.Tensor\n        :return: character error rate from CTC prediction\n        :rtype: float\n        :return: character error rate from attention decoder prediction\n        :rtype: float\n        :return: word error rate from attention decoder prediction\n        :rtype: float\n        \"\"\"\n        loss_att, loss_ctc = 0.0, 0.0\n        acc = None\n        cer, wer = None, None\n        cer_ctc = None\n        if self.asr_weight == 0:\n            return loss_att, acc, loss_ctc, cer_ctc, cer, wer\n\n        # attention\n        if self.mtlalpha < 1:\n            ys_in_pad_asr, ys_out_pad_asr = add_sos_eos(\n                ys_pad, self.sos, self.eos, self.ignore_id\n            )\n            ys_mask_asr = target_mask(ys_in_pad_asr, self.ignore_id)\n            pred_pad, _ = self.decoder_asr(ys_in_pad_asr, ys_mask_asr, hs_pad, hs_mask)\n            loss_att = self.criterion(pred_pad, ys_out_pad_asr)\n\n            acc = th_accuracy(\n                pred_pad.view(-1, self.odim),\n                ys_out_pad_asr,\n                ignore_label=self.ignore_id,\n            )\n            if not self.training:\n                ys_hat_asr = pred_pad.argmax(dim=-1)\n                cer, wer = self.error_calculator_asr(ys_hat_asr.cpu(), ys_pad.cpu())\n\n        # CTC\n        if self.mtlalpha > 0:\n            batch_size = hs_pad.size(0)\n            hs_len = hs_mask.view(batch_size, -1).sum(1)\n            loss_ctc = self.ctc(hs_pad.view(batch_size, -1, self.adim), hs_len, ys_pad)\n            if not self.training:\n                ys_hat_ctc = self.ctc.argmax(\n                    hs_pad.view(batch_size, -1, self.adim)\n                ).data\n                cer_ctc = self.error_calculator_asr(\n                    ys_hat_ctc.cpu(), ys_pad.cpu(), is_ctc=True\n                )\n                # for visualization\n                self.ctc.softmax(hs_pad)\n        return loss_att, acc, loss_ctc, cer_ctc, 
cer, wer\n\n    def forward_mt(self, xs_pad, ys_in_pad, ys_out_pad, ys_mask):\n        \"\"\"Forward pass in the auxiliary MT task.\n\n        :param torch.Tensor xs_pad: batch of padded source sequences (B, Tmax, idim)\n        :param torch.Tensor ys_in_pad: batch of padded target sequences (B, Lmax)\n        :param torch.Tensor ys_out_pad: batch of padded target sequences (B, Lmax)\n        :param torch.Tensor ys_mask: batch of input token mask (B, Lmax)\n        :return: MT loss value\n        :rtype: torch.Tensor\n        :return: accuracy in MT decoder\n        :rtype: float\n        \"\"\"\n        loss, acc = 0.0, None\n        if self.mt_weight == 0:\n            return loss, acc\n\n        ilens = torch.sum(xs_pad != self.ignore_id, dim=1).cpu().numpy()\n        # NOTE: xs_pad is padded with -1\n        xs = [x[x != self.ignore_id] for x in xs_pad]  # parse padded xs\n        xs_zero_pad = pad_list(xs, self.pad)  # re-pad with zero\n        xs_zero_pad = xs_zero_pad[:, : max(ilens)]  # for data parallel\n        src_mask = (\n            make_non_pad_mask(ilens.tolist()).to(xs_zero_pad.device).unsqueeze(-2)\n        )\n        hs_pad, hs_mask = self.encoder_mt(xs_zero_pad, src_mask)\n        pred_pad, _ = self.decoder(ys_in_pad, ys_mask, hs_pad, hs_mask)\n        loss = self.criterion(pred_pad, ys_out_pad)\n        acc = th_accuracy(\n            pred_pad.view(-1, self.odim), ys_out_pad, ignore_label=self.ignore_id\n        )\n        return loss, acc\n\n    def scorers(self):\n        \"\"\"Scorers.\"\"\"\n        return dict(decoder=self.decoder)\n\n    def encode(self, x):\n        \"\"\"Encode source acoustic features.\n\n        :param ndarray x: source acoustic feature (T, D)\n        :return: encoder outputs\n        :rtype: torch.Tensor\n        \"\"\"\n        self.eval()\n        x = torch.as_tensor(x).unsqueeze(0)\n        enc_output, _ = self.encoder(x, None)\n        return enc_output.squeeze(0)\n\n    def translate(\n        self,\n        
x,\n        trans_args,\n        char_list=None,\n    ):\n        \"\"\"Translate input speech.\n\n        :param ndnarray x: input acoustic feature (B, T, D) or (T, D)\n        :param Namespace trans_args: argment Namespace contraining options\n        :param list char_list: list of characters\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        # preprate sos\n        if getattr(trans_args, \"tgt_lang\", False):\n            if self.replace_sos:\n                y = char_list.index(trans_args.tgt_lang)\n        else:\n            y = self.sos\n        logging.info(\"<sos> index: \" + str(y))\n        logging.info(\"<sos> mark: \" + char_list[y])\n        logging.info(\"input lengths: \" + str(x.shape[0]))\n\n        enc_output = self.encode(x).unsqueeze(0)\n\n        h = enc_output\n\n        logging.info(\"encoder output lengths: \" + str(h.size(1)))\n        # search parms\n        beam = trans_args.beam_size\n        penalty = trans_args.penalty\n\n        if trans_args.maxlenratio == 0:\n            maxlen = h.size(1)\n        else:\n            # maxlen >= 1\n            maxlen = max(1, int(trans_args.maxlenratio * h.size(1)))\n        minlen = int(trans_args.minlenratio * h.size(1))\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        hyp = {\"score\": 0.0, \"yseq\": [y]}\n        hyps = [hyp]\n        ended_hyps = []\n\n        for i in range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            # batchfy\n            ys = h.new_zeros((len(hyps), i + 1), dtype=torch.int64)\n            for j, hyp in enumerate(hyps):\n                ys[j, :] = torch.tensor(hyp[\"yseq\"])\n            ys_mask = subsequent_mask(i + 1).unsqueeze(0).to(h.device)\n\n            local_scores = self.decoder.forward_one_step(\n                ys, ys_mask, h.repeat([len(hyps), 1, 1])\n            )[0]\n\n           
 hyps_best_kept = []\n            for j, hyp in enumerate(hyps):\n                local_best_scores, local_best_ids = torch.topk(\n                    local_scores[j : j + 1], beam, dim=1\n                )\n\n                for j in range(beam):\n                    new_hyp = {}\n                    new_hyp[\"score\"] = hyp[\"score\"] + float(local_best_scores[0, j])\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = int(local_best_ids[0, j])\n                    # will be (2 x beam) hyps at most\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypothes: \" + str(len(hyps)))\n            if char_list is not None:\n                logging.debug(\n                    \"best hypo: \"\n                    + \"\".join([char_list[int(x)] for x in hyps[0][\"yseq\"][1:]])\n                )\n\n            # add eos in the final loop to avoid that there are no ended hyps\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last postion in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.eos)\n\n            # add ended hypothes to a final list, and removed them from current hypothes\n            # (this will be a probmlem, number of hyps < beam)\n            remained_hyps = []\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * 
penalty\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # end detection\n            if end_detect(ended_hyps, i) and trans_args.maxlenratio == 0.0:\n                logging.info(\"end detected at %d\", i)\n                break\n\n            hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. Finish decoding.\")\n                break\n\n            if char_list is not None:\n                for hyp in hyps:\n                    logging.debug(\n                        \"hypo: \" + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]])\n                    )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[\n            : min(len(ended_hyps), trans_args.nbest)\n        ]\n\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there are no N-best results, perform translation \"\n                \"again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            trans_args = Namespace(**vars(trans_args))\n            trans_args.minlenratio = max(0.0, trans_args.minlenratio - 0.1)\n            return self.translate(x, trans_args, char_list)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n        return nbest_hyps\n\n    def calculate_all_attentions(self, xs_pad, ilens, ys_pad, ys_pad_src):\n        \"\"\"E2E attention calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences 
(B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :param torch.Tensor ys_pad_src:\n            batch of padded token id sequence tensor (B, Lmax)\n        :return: attention weights (B, H, Lmax, Tmax)\n        :rtype: float ndarray\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            self.forward(xs_pad, ilens, ys_pad, ys_pad_src)\n        ret = dict()\n        for name, m in self.named_modules():\n            if (\n                isinstance(m, MultiHeadedAttention) and m.attn is not None\n            ):  # skip MHA for submodules\n                ret[name] = m.attn.cpu().numpy()\n        self.train()\n        return ret\n\n    def calculate_all_ctc_probs(self, xs_pad, ilens, ys_pad, ys_pad_src):\n        \"\"\"E2E CTC probability calculation.\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor ys_pad: batch of padded token id sequence tensor (B, Lmax)\n        :param torch.Tensor ys_pad_src:\n            batch of padded token id sequence tensor (B, Lmax)\n        :return: CTC probability (B, Tmax, vocab)\n        :rtype: float ndarray\n        \"\"\"\n        ret = None\n        if self.asr_weight == 0 or self.mtlalpha == 0:\n            return ret\n\n        self.eval()\n        with torch.no_grad():\n            self.forward(xs_pad, ilens, ys_pad, ys_pad_src)\n        ret = None\n        for name, m in self.named_modules():\n            if isinstance(m, CTC) and m.probs is not None:\n                ret = m.probs.cpu().numpy()\n        self.train()\n        return ret\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_tts_fastspeech.py",
    "content": "# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"FastSpeech related modules.\"\"\"\n\nimport logging\n\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.nets.pytorch_backend.fastspeech.duration_calculator import (\n    DurationCalculator,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.fastspeech.duration_predictor import DurationPredictor\nfrom espnet.nets.pytorch_backend.fastspeech.duration_predictor import (\n    DurationPredictorLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.fastspeech.length_regulator import LengthRegulator\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Postnet\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.embedding import ScaledPositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.initializer import initialize\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass FeedForwardTransformerLoss(torch.nn.Module):\n    \"\"\"Loss function module for feed-forward Transformer.\"\"\"\n\n    def __init__(self, use_masking=True, use_weighted_masking=False):\n        \"\"\"Initialize feed-forward Transformer loss module.\n\n        Args:\n            use_masking (bool):\n                Whether to apply masking for padded part in loss calculation.\n            use_weighted_masking (bool):\n                Whether to apply weighted masking in loss calculation.\n\n        
\"\"\"\n        super(FeedForwardTransformerLoss, self).__init__()\n        assert (use_masking != use_weighted_masking) or not use_masking\n        self.use_masking = use_masking\n        self.use_weighted_masking = use_weighted_masking\n\n        # define criterions\n        reduction = \"none\" if self.use_weighted_masking else \"mean\"\n        self.l1_criterion = torch.nn.L1Loss(reduction=reduction)\n        self.duration_criterion = DurationPredictorLoss(reduction=reduction)\n\n    def forward(self, after_outs, before_outs, d_outs, ys, ds, ilens, olens):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            after_outs (Tensor): Batch of outputs after postnets (B, Lmax, odim).\n            before_outs (Tensor): Batch of outputs before postnets (B, Lmax, odim).\n            d_outs (Tensor): Batch of outputs of duration predictor (B, Tmax).\n            ys (Tensor): Batch of target features (B, Lmax, odim).\n            ds (Tensor): Batch of durations (B, Tmax).\n            ilens (LongTensor): Batch of the lengths of each input (B,).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n\n        Returns:\n            Tensor: L1 loss value.\n            Tensor: Duration predictor loss value.\n\n        \"\"\"\n        # apply mask to remove padded part\n        if self.use_masking:\n            duration_masks = make_non_pad_mask(ilens).to(ys.device)\n            d_outs = d_outs.masked_select(duration_masks)\n            ds = ds.masked_select(duration_masks)\n            out_masks = make_non_pad_mask(olens).unsqueeze(-1).to(ys.device)\n            before_outs = before_outs.masked_select(out_masks)\n            after_outs = (\n                after_outs.masked_select(out_masks) if after_outs is not None else None\n            )\n            ys = ys.masked_select(out_masks)\n\n        # calculate loss\n        l1_loss = self.l1_criterion(before_outs, ys)\n        if after_outs is not None:\n            l1_loss += 
self.l1_criterion(after_outs, ys)\n        duration_loss = self.duration_criterion(d_outs, ds)\n\n        # make weighted mask and apply it\n        if self.use_weighted_masking:\n            out_masks = make_non_pad_mask(olens).unsqueeze(-1).to(ys.device)\n            out_weights = out_masks.float() / out_masks.sum(dim=1, keepdim=True).float()\n            out_weights /= ys.size(0) * ys.size(2)\n            duration_masks = make_non_pad_mask(ilens).to(ys.device)\n            duration_weights = (\n                duration_masks.float() / duration_masks.sum(dim=1, keepdim=True).float()\n            )\n            duration_weights /= ds.size(0)\n\n            # apply weight\n            l1_loss = l1_loss.mul(out_weights).masked_select(out_masks).sum()\n            duration_loss = (\n                duration_loss.mul(duration_weights).masked_select(duration_masks).sum()\n            )\n\n        return l1_loss, duration_loss\n\n\nclass FeedForwardTransformer(TTSInterface, torch.nn.Module):\n    \"\"\"Feed Forward Transformer for TTS a.k.a. FastSpeech.\n\n    This is a module of FastSpeech,\n    feed-forward Transformer with duration predictor described in\n    `FastSpeech: Fast, Robust and Controllable Text to Speech`_,\n    which does not require any auto-regressive\n    processing during inference,\n    resulting in fast decoding compared with auto-regressive Transformer.\n\n    .. 
_`FastSpeech: Fast, Robust and Controllable Text to Speech`:\n        https://arxiv.org/pdf/1905.09263.pdf\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add model-specific arguments to the parser.\"\"\"\n        group = parser.add_argument_group(\"feed-forward transformer model setting\")\n        # network structure related\n        group.add_argument(\n            \"--adim\",\n            default=384,\n            type=int,\n            help=\"Number of attention transformation dimensions\",\n        )\n        group.add_argument(\n            \"--aheads\",\n            default=4,\n            type=int,\n            help=\"Number of heads for multi head attention\",\n        )\n        group.add_argument(\n            \"--elayers\", default=6, type=int, help=\"Number of encoder layers\"\n        )\n        group.add_argument(\n            \"--eunits\", default=1536, type=int, help=\"Number of encoder hidden units\"\n        )\n        group.add_argument(\n            \"--dlayers\", default=6, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=1536, type=int, help=\"Number of decoder hidden units\"\n        )\n        group.add_argument(\n            \"--positionwise-layer-type\",\n            default=\"linear\",\n            type=str,\n            choices=[\"linear\", \"conv1d\", \"conv1d-linear\"],\n            help=\"Positionwise layer type.\",\n        )\n        group.add_argument(\n            \"--positionwise-conv-kernel-size\",\n            default=3,\n            type=int,\n            help=\"Kernel size of positionwise conv1d layer\",\n        )\n        group.add_argument(\n            \"--postnet-layers\", default=0, type=int, help=\"Number of postnet layers\"\n        )\n        group.add_argument(\n            \"--postnet-chans\", default=256, type=int, help=\"Number of postnet channels\"\n        )\n        group.add_argument(\n            
\"--postnet-filts\", default=5, type=int, help=\"Filter size of postnet\"\n        )\n        group.add_argument(\n            \"--use-batch-norm\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use batch normalization\",\n        )\n        group.add_argument(\n            \"--use-scaled-pos-enc\",\n            default=True,\n            type=strtobool,\n            help=\"Use trainable scaled positional encoding \"\n            \"instead of the fixed scale one\",\n        )\n        group.add_argument(\n            \"--encoder-normalize-before\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to apply layer norm before encoder block\",\n        )\n        group.add_argument(\n            \"--decoder-normalize-before\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to apply layer norm before decoder block\",\n        )\n        group.add_argument(\n            \"--encoder-concat-after\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to concatenate attention layer's input and output in encoder\",\n        )\n        group.add_argument(\n            \"--decoder-concat-after\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to concatenate attention layer's input and output in decoder\",\n        )\n        group.add_argument(\n            \"--duration-predictor-layers\",\n            default=2,\n            type=int,\n            help=\"Number of layers in duration predictor\",\n        )\n        group.add_argument(\n            \"--duration-predictor-chans\",\n            default=384,\n            type=int,\n            help=\"Number of channels in duration predictor\",\n        )\n        group.add_argument(\n            \"--duration-predictor-kernel-size\",\n            default=3,\n            type=int,\n            help=\"Kernel size in duration predictor\",\n        )\n 
       group.add_argument(\n            \"--teacher-model\",\n            default=None,\n            type=str,\n            nargs=\"?\",\n            help=\"Teacher model file path\",\n        )\n        group.add_argument(\n            \"--reduction-factor\", default=1, type=int, help=\"Reduction factor\"\n        )\n        group.add_argument(\n            \"--spk-embed-dim\",\n            default=None,\n            type=int,\n            help=\"Number of speaker embedding dimensions\",\n        )\n        group.add_argument(\n            \"--spk-embed-integration-type\",\n            type=str,\n            default=\"add\",\n            choices=[\"add\", \"concat\"],\n            help=\"How to integrate speaker embedding\",\n        )\n        # training related\n        group.add_argument(\n            \"--transformer-init\",\n            type=str,\n            default=\"pytorch\",\n            choices=[\n                \"pytorch\",\n                \"xavier_uniform\",\n                \"xavier_normal\",\n                \"kaiming_uniform\",\n                \"kaiming_normal\",\n            ],\n            help=\"How to initialize transformer parameters\",\n        )\n        group.add_argument(\n            \"--initial-encoder-alpha\",\n            type=float,\n            default=1.0,\n            help=\"Initial alpha value in encoder's ScaledPositionalEncoding\",\n        )\n        group.add_argument(\n            \"--initial-decoder-alpha\",\n            type=float,\n            default=1.0,\n            help=\"Initial alpha value in decoder's ScaledPositionalEncoding\",\n        )\n        group.add_argument(\n            \"--transformer-lr\",\n            default=1.0,\n            type=float,\n            help=\"Initial value of learning rate\",\n        )\n        group.add_argument(\n            \"--transformer-warmup-steps\",\n            default=4000,\n            type=int,\n            help=\"Optimizer warmup steps\",\n        )\n        
group.add_argument(\n            \"--transformer-enc-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder except for attention\",\n        )\n        group.add_argument(\n            \"--transformer-enc-positional-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder positional encoding\",\n        )\n        group.add_argument(\n            \"--transformer-enc-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder self-attention\",\n        )\n        group.add_argument(\n            \"--transformer-dec-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder except \"\n            \"for attention and pos encoding\",\n        )\n        group.add_argument(\n            \"--transformer-dec-positional-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder positional encoding\",\n        )\n        group.add_argument(\n            \"--transformer-dec-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder self-attention\",\n        )\n        group.add_argument(\n            \"--transformer-enc-dec-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder-decoder attention\",\n        )\n        group.add_argument(\n            \"--duration-predictor-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for duration predictor\",\n        )\n        group.add_argument(\n            \"--postnet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in postnet\",\n        )\n        
 group.add_argument(\n            \"--transfer-encoder-from-teacher\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to transfer teacher's parameters\",\n        )\n        group.add_argument(\n            \"--transferred-encoder-module\",\n            default=\"all\",\n            type=str,\n            choices=[\"all\", \"embed\"],\n            help=\"Encoder modules to be transferred from teacher\",\n        )\n        # loss related\n        group.add_argument(\n            \"--use-masking\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use masking in calculation of loss\",\n        )\n        group.add_argument(\n            \"--use-weighted-masking\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use weighted masking in calculation of loss\",\n        )\n        return parser\n\n    def __init__(self, idim, odim, args=None):\n        \"\"\"Initialize feed-forward Transformer module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            args (Namespace, optional):\n                - elayers (int): Number of encoder layers.\n                - eunits (int): Number of encoder hidden units.\n                - adim (int): Number of attention transformation dimensions.\n                - aheads (int): Number of heads for multi head attention.\n                - dlayers (int): Number of decoder layers.\n                - dunits (int): Number of decoder hidden units.\n                - use_scaled_pos_enc (bool):\n                    Whether to use trainable scaled positional encoding.\n                - encoder_normalize_before (bool):\n                    Whether to perform layer normalization before encoder block.\n                - decoder_normalize_before (bool):\n                    Whether to perform layer normalization before decoder block.\n                - 
encoder_concat_after (bool): Whether to concatenate attention\n                    layer's input and output in encoder.\n                - decoder_concat_after (bool): Whether to concatenate attention\n                    layer's input and output in decoder.\n                - duration_predictor_layers (int): Number of duration predictor layers.\n                - duration_predictor_chans (int): Number of duration predictor channels.\n                - duration_predictor_kernel_size (int):\n                    Kernel size of duration predictor.\n                - spk_embed_dim (int): Number of speaker embedding dimensions.\n                - spk_embed_integration_type: How to integrate speaker embedding.\n                - teacher_model (str): Teacher auto-regressive transformer model path.\n                - reduction_factor (int): Reduction factor.\n                - transformer_init (str): How to initialize transformer parameters.\n                - transformer_lr (float): Initial value of learning rate.\n                - transformer_warmup_steps (int): Optimizer warmup steps.\n                - transformer_enc_dropout_rate (float):\n                    Dropout rate in encoder except attention & positional encoding.\n                - transformer_enc_positional_dropout_rate (float):\n                    Dropout rate after encoder positional encoding.\n                - transformer_enc_attn_dropout_rate (float):\n                    Dropout rate in encoder self-attention module.\n                - transformer_dec_dropout_rate (float):\n                    Dropout rate in decoder except attention & positional encoding.\n                - transformer_dec_positional_dropout_rate (float):\n                    Dropout rate after decoder positional encoding.\n                - transformer_dec_attn_dropout_rate (float):\n                    Dropout rate in decoder self-attention module.\n                - transformer_enc_dec_attn_dropout_rate (float):\n              
      Dropout rate in encoder-decoder attention module.\n                - use_masking (bool):\n                    Whether to apply masking for padded part in loss calculation.\n                - use_weighted_masking (bool):\n                    Whether to apply weighted masking in loss calculation.\n                - transfer_encoder_from_teacher:\n                    Whether to transfer encoder using teacher encoder parameters.\n                - transferred_encoder_module:\n                    Encoder module to be initialized using teacher parameters.\n\n        \"\"\"\n        # initialize base classes\n        TTSInterface.__init__(self)\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments\n        args = fill_missing_args(args, self.add_arguments)\n\n        # store hyperparameters\n        self.idim = idim\n        self.odim = odim\n        self.reduction_factor = args.reduction_factor\n        self.use_scaled_pos_enc = args.use_scaled_pos_enc\n        self.spk_embed_dim = args.spk_embed_dim\n        if self.spk_embed_dim is not None:\n            self.spk_embed_integration_type = args.spk_embed_integration_type\n\n        # use idx 0 as padding idx\n        padding_idx = 0\n\n        # get positional encoding class\n        pos_enc_class = (\n            ScaledPositionalEncoding if self.use_scaled_pos_enc else PositionalEncoding\n        )\n\n        # define encoder\n        encoder_input_layer = torch.nn.Embedding(\n            num_embeddings=idim, embedding_dim=args.adim, padding_idx=padding_idx\n        )\n        self.encoder = Encoder(\n            idim=idim,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=encoder_input_layer,\n            dropout_rate=args.transformer_enc_dropout_rate,\n            positional_dropout_rate=args.transformer_enc_positional_dropout_rate,\n            
attention_dropout_rate=args.transformer_enc_attn_dropout_rate,\n            pos_enc_class=pos_enc_class,\n            normalize_before=args.encoder_normalize_before,\n            concat_after=args.encoder_concat_after,\n            positionwise_layer_type=args.positionwise_layer_type,\n            positionwise_conv_kernel_size=args.positionwise_conv_kernel_size,\n        )\n\n        # define additional projection for speaker embedding\n        if self.spk_embed_dim is not None:\n            if self.spk_embed_integration_type == \"add\":\n                self.projection = torch.nn.Linear(self.spk_embed_dim, args.adim)\n            else:\n                self.projection = torch.nn.Linear(\n                    args.adim + self.spk_embed_dim, args.adim\n                )\n\n        # define duration predictor\n        self.duration_predictor = DurationPredictor(\n            idim=args.adim,\n            n_layers=args.duration_predictor_layers,\n            n_chans=args.duration_predictor_chans,\n            kernel_size=args.duration_predictor_kernel_size,\n            dropout_rate=args.duration_predictor_dropout_rate,\n        )\n\n        # define length regulator\n        self.length_regulator = LengthRegulator()\n\n        # define decoder\n        # NOTE: we use encoder as decoder\n        # because fastspeech's decoder is the same as encoder\n        self.decoder = Encoder(\n            idim=0,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.dunits,\n            num_blocks=args.dlayers,\n            input_layer=None,\n            dropout_rate=args.transformer_dec_dropout_rate,\n            positional_dropout_rate=args.transformer_dec_positional_dropout_rate,\n            attention_dropout_rate=args.transformer_dec_attn_dropout_rate,\n            pos_enc_class=pos_enc_class,\n            normalize_before=args.decoder_normalize_before,\n            concat_after=args.decoder_concat_after,\n            
positionwise_layer_type=args.positionwise_layer_type,\n            positionwise_conv_kernel_size=args.positionwise_conv_kernel_size,\n        )\n\n        # define final projection\n        self.feat_out = torch.nn.Linear(args.adim, odim * args.reduction_factor)\n\n        # define postnet\n        self.postnet = (\n            None\n            if args.postnet_layers == 0\n            else Postnet(\n                idim=idim,\n                odim=odim,\n                n_layers=args.postnet_layers,\n                n_chans=args.postnet_chans,\n                n_filts=args.postnet_filts,\n                use_batch_norm=args.use_batch_norm,\n                dropout_rate=args.postnet_dropout_rate,\n            )\n        )\n\n        # initialize parameters\n        self._reset_parameters(\n            init_type=args.transformer_init,\n            init_enc_alpha=args.initial_encoder_alpha,\n            init_dec_alpha=args.initial_decoder_alpha,\n        )\n\n        # define teacher model\n        if args.teacher_model is not None:\n            self.teacher = self._load_teacher_model(args.teacher_model)\n        else:\n            self.teacher = None\n\n        # define duration calculator\n        if self.teacher is not None:\n            self.duration_calculator = DurationCalculator(self.teacher)\n        else:\n            self.duration_calculator = None\n\n        # transfer teacher parameters\n        if self.teacher is not None and args.transfer_encoder_from_teacher:\n            self._transfer_from_teacher(args.transferred_encoder_module)\n\n        # define criterions\n        self.criterion = FeedForwardTransformerLoss(\n            use_masking=args.use_masking, use_weighted_masking=args.use_weighted_masking\n        )\n\n    def _forward(\n        self,\n        xs,\n        ilens,\n        ys=None,\n        olens=None,\n        spembs=None,\n        ds=None,\n        is_inference=False,\n        alpha=1.0,\n    ):\n        # forward encoder\n        
x_masks = self._source_mask(ilens)\n        hs, _ = self.encoder(xs, x_masks)  # (B, Tmax, adim)\n\n        # integrate speaker embedding\n        if self.spk_embed_dim is not None:\n            hs = self._integrate_with_spk_embed(hs, spembs)\n\n        # forward duration predictor and length regulator\n        d_masks = make_pad_mask(ilens).to(xs.device)\n        if is_inference:\n            d_outs = self.duration_predictor.inference(hs, d_masks)  # (B, Tmax)\n            hs = self.length_regulator(hs, d_outs, alpha)  # (B, Lmax, adim)\n        else:\n            if ds is None:\n                with torch.no_grad():\n                    ds = self.duration_calculator(\n                        xs, ilens, ys, olens, spembs\n                    )  # (B, Tmax)\n            d_outs = self.duration_predictor(hs, d_masks)  # (B, Tmax)\n            hs = self.length_regulator(hs, ds)  # (B, Lmax, adim)\n\n        # forward decoder\n        if olens is not None:\n            if self.reduction_factor > 1:\n                olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n            else:\n                olens_in = olens\n            h_masks = self._source_mask(olens_in)\n        else:\n            h_masks = None\n        zs, _ = self.decoder(hs, h_masks)  # (B, Lmax, adim)\n        before_outs = self.feat_out(zs).view(\n            zs.size(0), -1, self.odim\n        )  # (B, Lmax, odim)\n\n        # postnet -> (B, Lmax//r * r, odim)\n        if self.postnet is None:\n            after_outs = before_outs\n        else:\n            after_outs = before_outs + self.postnet(\n                before_outs.transpose(1, 2)\n            ).transpose(1, 2)\n\n        if is_inference:\n            return before_outs, after_outs, d_outs\n        else:\n            return before_outs, after_outs, ds, d_outs\n\n    def forward(self, xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n      
      xs (Tensor): Batch of padded character ids (B, Tmax).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n            extras (Tensor, optional): Batch of precalculated durations (B, Tmax, 1).\n\n        Returns:\n            Tensor: Loss value.\n\n        \"\"\"\n        # remove unnecessary padded part (for multi-gpus)\n        xs = xs[:, : max(ilens)]\n        ys = ys[:, : max(olens)]\n        if extras is not None:\n            extras = extras[:, : max(ilens)].squeeze(-1)\n\n        # forward propagation\n        before_outs, after_outs, ds, d_outs = self._forward(\n            xs, ilens, ys, olens, spembs=spembs, ds=extras, is_inference=False\n        )\n\n        # trim the ground truth so its length is a multiple of the reduction factor\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in olens])\n            max_olen = max(olens)\n            ys = ys[:, :max_olen]\n\n        # calculate loss\n        if self.postnet is None:\n            l1_loss, duration_loss = self.criterion(\n                None, before_outs, d_outs, ys, ds, ilens, olens\n            )\n        else:\n            l1_loss, duration_loss = self.criterion(\n                after_outs, before_outs, d_outs, ys, ds, ilens, olens\n            )\n        loss = l1_loss + duration_loss\n        report_keys = [\n            {\"l1_loss\": l1_loss.item()},\n            {\"duration_loss\": duration_loss.item()},\n            {\"loss\": loss.item()},\n        ]\n\n        # report extra information\n        if self.use_scaled_pos_enc:\n            report_keys += [\n                {\"encoder_alpha\": self.encoder.embed[-1].alpha.data.item()},\n                {\"decoder_alpha\": 
self.decoder.embed[-1].alpha.data.item()},\n            ]\n        self.reporter.report(report_keys)\n\n        return loss\n\n    def calculate_all_attentions(\n        self, xs, ilens, ys, olens, spembs=None, extras=None, *args, **kwargs\n    ):\n        \"\"\"Calculate all of the attention weights.\n\n        Args:\n            xs (Tensor): Batch of padded character ids (B, Tmax).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n            extras (Tensor, optional): Batch of precalculated durations (B, Tmax, 1).\n\n        Returns:\n            dict: Dict of attention weights and outputs.\n\n        \"\"\"\n        with torch.no_grad():\n            # remove unnecessary padded part (for multi-gpus)\n            xs = xs[:, : max(ilens)]\n            ys = ys[:, : max(olens)]\n            if extras is not None:\n                extras = extras[:, : max(ilens)].squeeze(-1)\n\n            # forward propagation\n            outs = self._forward(\n                xs, ilens, ys, olens, spembs=spembs, ds=extras, is_inference=False\n            )[1]\n\n        att_ws_dict = dict()\n        for name, m in self.named_modules():\n            if isinstance(m, MultiHeadedAttention):\n                attn = m.attn.cpu().numpy()\n                if \"encoder\" in name:\n                    attn = [a[:, :l, :l] for a, l in zip(attn, ilens.tolist())]\n                elif \"decoder\" in name:\n                    if \"src\" in name:\n                        attn = [\n                            a[:, :ol, :il]\n                            for a, il, ol in zip(attn, ilens.tolist(), olens.tolist())\n                        ]\n                    elif \"self\" in name:\n                        attn 
= [a[:, :l, :l] for a, l in zip(attn, olens.tolist())]\n                    else:\n                        logging.warning(\"unknown attention module: \" + name)\n                else:\n                    logging.warning(\"unknown attention module: \" + name)\n                att_ws_dict[name] = attn\n        att_ws_dict[\"predicted_fbank\"] = [\n            m[:l].T for m, l in zip(outs.cpu().numpy(), olens.tolist())\n        ]\n\n        return att_ws_dict\n\n    def inference(self, x, inference_args, spemb=None, *args, **kwargs):\n        \"\"\"Generate the sequence of features given the sequences of characters.\n\n        Args:\n            x (Tensor): Input sequence of characters (T,).\n            inference_args (Namespace): Dummy for compatibility.\n            spemb (Tensor, optional): Speaker embedding vector (spk_embed_dim).\n\n        Returns:\n            Tensor: Output sequence of features (L, odim).\n            None: Dummy for compatibility.\n            None: Dummy for compatibility.\n\n        \"\"\"\n        # setup batch axis\n        ilens = torch.tensor([x.shape[0]], dtype=torch.long, device=x.device)\n        xs = x.unsqueeze(0)\n        if spemb is not None:\n            spembs = spemb.unsqueeze(0)\n        else:\n            spembs = None\n\n        # get option\n        alpha = getattr(inference_args, \"fastspeech_alpha\", 1.0)\n\n        # inference\n        _, outs, _ = self._forward(\n            xs,\n            ilens,\n            spembs=spembs,\n            is_inference=True,\n            alpha=alpha,\n        )  # (1, L, odim)\n\n        return outs[0], None, None\n\n    def _integrate_with_spk_embed(self, hs, spembs):\n        \"\"\"Integrate speaker embedding with hidden states.\n\n        Args:\n            hs (Tensor): Batch of hidden state sequences (B, Tmax, adim).\n            spembs (Tensor): Batch of speaker embeddings (B, spk_embed_dim).\n\n        Returns:\n            Tensor: Batch of integrated hidden state sequences (B, 
Tmax, adim)\n\n        \"\"\"\n        if self.spk_embed_integration_type == \"add\":\n            # apply projection and then add to hidden states\n            spembs = self.projection(F.normalize(spembs))\n            hs = hs + spembs.unsqueeze(1)\n        elif self.spk_embed_integration_type == \"concat\":\n            # concat hidden states with spk embeds and then apply projection\n            spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n            hs = self.projection(torch.cat([hs, spembs], dim=-1))\n        else:\n            raise NotImplementedError(\"support only add or concat.\")\n\n        return hs\n\n    def _source_mask(self, ilens):\n        \"\"\"Make masks for self-attention.\n\n        Args:\n            ilens (LongTensor or List): Batch of lengths (B,).\n\n        Returns:\n            Tensor: Mask tensor for self-attention.\n                    dtype=torch.uint8 in PyTorch 1.2-\n                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n        Examples:\n            >>> ilens = [5, 3]\n            >>> self._source_mask(ilens)\n            tensor([[[1, 1, 1, 1, 1]],\n                    [[1, 1, 1, 0, 0]]], dtype=torch.uint8)\n\n        \"\"\"\n        x_masks = make_non_pad_mask(ilens).to(next(self.parameters()).device)\n        return x_masks.unsqueeze(-2)\n\n    def _load_teacher_model(self, model_path):\n        # get teacher model config\n        idim, odim, args = get_model_conf(model_path)\n\n        # assert dimensions are the same between teacher and student\n        assert idim == self.idim\n        assert odim == self.odim\n        assert args.reduction_factor == self.reduction_factor\n\n        # load teacher model\n        from espnet.utils.dynamic_import import dynamic_import\n\n        model_class = dynamic_import(args.model_module)\n        model = model_class(idim, odim, args)\n        torch_load(model_path, model)\n\n        # freeze teacher model parameters\n        for p in 
model.parameters():\n            p.requires_grad = False\n\n        return model\n\n    def _reset_parameters(self, init_type, init_enc_alpha=1.0, init_dec_alpha=1.0):\n        # initialize parameters\n        initialize(self, init_type)\n\n        # initialize alpha in scaled positional encoding\n        if self.use_scaled_pos_enc:\n            self.encoder.embed[-1].alpha.data = torch.tensor(init_enc_alpha)\n            self.decoder.embed[-1].alpha.data = torch.tensor(init_dec_alpha)\n\n    def _transfer_from_teacher(self, transferred_encoder_module):\n        if transferred_encoder_module == \"all\":\n            for (n1, p1), (n2, p2) in zip(\n                self.encoder.named_parameters(), self.teacher.encoder.named_parameters()\n            ):\n                assert n1 == n2, \"It seems that encoder structure is different.\"\n                assert p1.shape == p2.shape, \"It seems that encoder size is different.\"\n                p1.data.copy_(p2.data)\n        elif transferred_encoder_module == \"embed\":\n            student_shape = self.encoder.embed[0].weight.data.shape\n            teacher_shape = self.teacher.encoder.embed[0].weight.data.shape\n            assert (\n                student_shape == teacher_shape\n            ), \"It seems that embed dimension is different.\"\n            self.encoder.embed[0].weight.data.copy_(\n                self.teacher.encoder.embed[0].weight.data\n            )\n        else:\n            raise NotImplementedError(\"Support only all or embed.\")\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Return plot class for attention weight plot.\"\"\"\n        # Lazy import to avoid chainer dependency\n        from espnet.nets.pytorch_backend.e2e_tts_transformer import TTSPlot\n\n        return TTSPlot\n\n    @property\n    def base_plot_keys(self):\n        \"\"\"Return base key names to plot during training.\n\n        keys should match what `chainer.reporter` reports.\n        If you add the key 
`loss`,\n        the reporter will report `main/loss` and `validation/main/loss` values.\n        Also, `loss.png` will be created as a figure visualizing `main/loss`\n        and `validation/main/loss` values.\n\n        Returns:\n            list: List of strings which are base keys to plot during training.\n\n        \"\"\"\n        plot_keys = [\"loss\", \"l1_loss\", \"duration_loss\"]\n        if self.use_scaled_pos_enc:\n            plot_keys += [\"encoder_alpha\", \"decoder_alpha\"]\n\n        return plot_keys\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_tts_tacotron2.py",
    "content": "# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Tacotron 2 related modules.\"\"\"\n\nimport logging\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttForward\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttForwardTA\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttLoc\nfrom espnet.nets.pytorch_backend.tacotron2.cbhg import CBHG\nfrom espnet.nets.pytorch_backend.tacotron2.cbhg import CBHGLoss\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Decoder\nfrom espnet.nets.pytorch_backend.tacotron2.encoder import Encoder\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass GuidedAttentionLoss(torch.nn.Module):\n    \"\"\"Guided attention loss function module.\n\n    This module calculates the guided attention loss described\n    in `Efficiently Trainable Text-to-Speech System Based\n    on Deep Convolutional Networks with Guided Attention`_,\n    which forces the attention to be diagonal.\n\n    .. 
_`Efficiently Trainable Text-to-Speech System\n        Based on Deep Convolutional Networks with Guided Attention`:\n        https://arxiv.org/abs/1710.08969\n\n    \"\"\"\n\n    def __init__(self, sigma=0.4, alpha=1.0, reset_always=True):\n        \"\"\"Initialize guided attention loss module.\n\n        Args:\n            sigma (float, optional): Standard deviation to control\n                how close the attention is to a diagonal.\n            alpha (float, optional): Scaling coefficient (lambda).\n            reset_always (bool, optional): Whether to always reset masks.\n\n        \"\"\"\n        super(GuidedAttentionLoss, self).__init__()\n        self.sigma = sigma\n        self.alpha = alpha\n        self.reset_always = reset_always\n        self.guided_attn_masks = None\n        self.masks = None\n\n    def _reset_masks(self):\n        self.guided_attn_masks = None\n        self.masks = None\n\n    def forward(self, att_ws, ilens, olens):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            att_ws (Tensor): Batch of attention weights (B, T_max_out, T_max_in).\n            ilens (LongTensor): Batch of input lengths (B,).\n            olens (LongTensor): Batch of output lengths (B,).\n\n        Returns:\n            Tensor: Guided attention loss value.\n\n        \"\"\"\n        if self.guided_attn_masks is None:\n            self.guided_attn_masks = self._make_guided_attention_masks(ilens, olens).to(\n                att_ws.device\n            )\n        if self.masks is None:\n            self.masks = self._make_masks(ilens, olens).to(att_ws.device)\n        losses = self.guided_attn_masks * att_ws\n        loss = torch.mean(losses.masked_select(self.masks))\n        if self.reset_always:\n            self._reset_masks()\n        return self.alpha * loss\n\n    def _make_guided_attention_masks(self, ilens, olens):\n        n_batches = len(ilens)\n        max_ilen = max(ilens)\n        max_olen = max(olens)\n        guided_attn_masks = 
torch.zeros((n_batches, max_olen, max_ilen))\n        for idx, (ilen, olen) in enumerate(zip(ilens, olens)):\n            guided_attn_masks[idx, :olen, :ilen] = self._make_guided_attention_mask(\n                ilen, olen, self.sigma\n            )\n        return guided_attn_masks\n\n    @staticmethod\n    def _make_guided_attention_mask(ilen, olen, sigma):\n        \"\"\"Make guided attention mask.\n\n        Examples:\n            >>> guided_attn_mask = _make_guided_attention_mask(5, 5, 0.4)\n            >>> guided_attn_mask.shape\n            torch.Size([5, 5])\n            >>> guided_attn_mask\n            tensor([[0.0000, 0.1175, 0.3935, 0.6753, 0.8647],\n                    [0.1175, 0.0000, 0.1175, 0.3935, 0.6753],\n                    [0.3935, 0.1175, 0.0000, 0.1175, 0.3935],\n                    [0.6753, 0.3935, 0.1175, 0.0000, 0.1175],\n                    [0.8647, 0.6753, 0.3935, 0.1175, 0.0000]])\n            >>> guided_attn_mask = _make_guided_attention_mask(3, 6, 0.4)\n            >>> guided_attn_mask.shape\n            torch.Size([6, 3])\n            >>> guided_attn_mask\n            tensor([[0.0000, 0.2934, 0.7506],\n                    [0.0831, 0.0831, 0.5422],\n                    [0.2934, 0.0000, 0.2934],\n                    [0.5422, 0.0831, 0.0831],\n                    [0.7506, 0.2934, 0.0000],\n                    [0.8858, 0.5422, 0.0831]])\n\n        \"\"\"\n        grid_x, grid_y = torch.meshgrid(torch.arange(olen), torch.arange(ilen))\n        grid_x, grid_y = grid_x.float().to(olen.device), grid_y.float().to(ilen.device)\n        return 1.0 - torch.exp(\n            -((grid_y / ilen - grid_x / olen) ** 2) / (2 * (sigma ** 2))\n        )\n\n    @staticmethod\n    def _make_masks(ilens, olens):\n        \"\"\"Make masks indicating non-padded part.\n\n        Args:\n            ilens (LongTensor or List): Batch of lengths (B,).\n            olens (LongTensor or List): Batch of lengths (B,).\n\n        Returns:\n            Tensor: Mask tensor 
indicating non-padded part.\n                    dtype=torch.uint8 in PyTorch 1.2-\n                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n        Examples:\n            >>> ilens, olens = [5, 2], [8, 5]\n            >>> _make_masks(ilens, olens)\n            tensor([[[1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1],\n                     [1, 1, 1, 1, 1]],\n                    [[1, 1, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [0, 0, 0, 0, 0],\n                     [0, 0, 0, 0, 0],\n                     [0, 0, 0, 0, 0]]], dtype=torch.uint8)\n\n        \"\"\"\n        in_masks = make_non_pad_mask(ilens)  # (B, T_in)\n        out_masks = make_non_pad_mask(olens)  # (B, T_out)\n        return out_masks.unsqueeze(-1) & in_masks.unsqueeze(-2)  # (B, T_out, T_in)\n\n\nclass Tacotron2Loss(torch.nn.Module):\n    \"\"\"Loss function module for Tacotron2.\"\"\"\n\n    def __init__(\n        self, use_masking=True, use_weighted_masking=False, bce_pos_weight=20.0\n    ):\n        \"\"\"Initialize Tacotron2 loss module.\n\n        Args:\n            use_masking (bool): Whether to apply masking\n                for padded part in loss calculation.\n            use_weighted_masking (bool):\n                Whether to apply weighted masking in loss calculation.\n            bce_pos_weight (float): Weight of positive sample of stop token.\n\n        \"\"\"\n        super(Tacotron2Loss, self).__init__()\n        assert (use_masking != use_weighted_masking) or not use_masking\n        self.use_masking = use_masking\n        self.use_weighted_masking = use_weighted_masking\n\n        # define criterions\n        reduction = 
\"none\" if self.use_weighted_masking else \"mean\"\n        self.l1_criterion = torch.nn.L1Loss(reduction=reduction)\n        self.mse_criterion = torch.nn.MSELoss(reduction=reduction)\n        self.bce_criterion = torch.nn.BCEWithLogitsLoss(\n            reduction=reduction, pos_weight=torch.tensor(bce_pos_weight)\n        )\n\n        # NOTE(kan-bayashi): register pre hook function for the compatibility\n        self._register_load_state_dict_pre_hook(self._load_state_dict_pre_hook)\n\n    def forward(self, after_outs, before_outs, logits, ys, labels, olens):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            after_outs (Tensor): Batch of outputs after postnets (B, Lmax, odim).\n            before_outs (Tensor): Batch of outputs before postnets (B, Lmax, odim).\n            logits (Tensor): Batch of stop logits (B, Lmax).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            labels (LongTensor): Batch of the sequences of stop token labels (B, Lmax).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n\n        Returns:\n            Tensor: L1 loss value.\n            Tensor: Mean square error loss value.\n            Tensor: Binary cross entropy loss value.\n\n        \"\"\"\n        # make mask and apply it\n        if self.use_masking:\n            masks = make_non_pad_mask(olens).unsqueeze(-1).to(ys.device)\n            ys = ys.masked_select(masks)\n            after_outs = after_outs.masked_select(masks)\n            before_outs = before_outs.masked_select(masks)\n            labels = labels.masked_select(masks[:, :, 0])\n            logits = logits.masked_select(masks[:, :, 0])\n\n        # calculate loss\n        l1_loss = self.l1_criterion(after_outs, ys) + self.l1_criterion(before_outs, ys)\n        mse_loss = self.mse_criterion(after_outs, ys) + self.mse_criterion(\n            before_outs, ys\n        )\n        bce_loss = self.bce_criterion(logits, labels)\n\n        # 
make weighted mask and apply it\n        if self.use_weighted_masking:\n            masks = make_non_pad_mask(olens).unsqueeze(-1).to(ys.device)\n            weights = masks.float() / masks.sum(dim=1, keepdim=True).float()\n            out_weights = weights.div(ys.size(0) * ys.size(2))\n            logit_weights = weights.div(ys.size(0))\n\n            # apply weight\n            l1_loss = l1_loss.mul(out_weights).masked_select(masks).sum()\n            mse_loss = mse_loss.mul(out_weights).masked_select(masks).sum()\n            bce_loss = (\n                bce_loss.mul(logit_weights.squeeze(-1))\n                .masked_select(masks.squeeze(-1))\n                .sum()\n            )\n\n        return l1_loss, mse_loss, bce_loss\n\n    def _load_state_dict_pre_hook(\n        self,\n        state_dict,\n        prefix,\n        local_metadata,\n        strict,\n        missing_keys,\n        unexpected_keys,\n        error_msgs,\n    ):\n        \"\"\"Apply pre hook function before loading state dict.\n\n        From v.0.6.1 `bce_criterion.pos_weight` param is registered as a parameter but\n        old models do not include it and as a result, it causes missing key error when\n        loading old model parameters. This function solves the issue by adding param in\n        state dict before loading as a pre hook function\n        of the `load_state_dict` method.\n\n        \"\"\"\n        key = prefix + \"bce_criterion.pos_weight\"\n        if key not in state_dict:\n            state_dict[key] = self.bce_criterion.pos_weight\n\n\nclass Tacotron2(TTSInterface, torch.nn.Module):\n    \"\"\"Tacotron2 module for end-to-end text-to-speech (E2E-TTS).\n\n    This is a module of Spectrogram prediction network in Tacotron2 described\n    in `Natural TTS Synthesis\n    by Conditioning WaveNet on Mel Spectrogram Predictions`_,\n    which converts the sequence of characters\n    into the sequence of Mel-filterbanks.\n\n    .. 
_`Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions`:\n       https://arxiv.org/abs/1712.05884\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add model-specific arguments to the parser.\"\"\"\n        group = parser.add_argument_group(\"tacotron 2 model setting\")\n        # encoder\n        group.add_argument(\n            \"--embed-dim\",\n            default=512,\n            type=int,\n            help=\"Number of dimension of embedding\",\n        )\n        group.add_argument(\n            \"--elayers\", default=1, type=int, help=\"Number of encoder layers\"\n        )\n        group.add_argument(\n            \"--eunits\",\n            \"-u\",\n            default=512,\n            type=int,\n            help=\"Number of encoder hidden units\",\n        )\n        group.add_argument(\n            \"--econv-layers\",\n            default=3,\n            type=int,\n            help=\"Number of encoder convolution layers\",\n        )\n        group.add_argument(\n            \"--econv-chans\",\n            default=512,\n            type=int,\n            help=\"Number of encoder convolution channels\",\n        )\n        group.add_argument(\n            \"--econv-filts\",\n            default=5,\n            type=int,\n            help=\"Filter size of encoder convolution\",\n        )\n        # attention\n        group.add_argument(\n            \"--atype\",\n            default=\"location\",\n            type=str,\n            choices=[\"forward_ta\", \"forward\", \"location\"],\n            help=\"Type of attention mechanism\",\n        )\n        group.add_argument(\n            \"--adim\",\n            default=512,\n            type=int,\n            help=\"Number of attention transformation dimensions\",\n        )\n        group.add_argument(\n            \"--aconv-chans\",\n            default=32,\n            type=int,\n            help=\"Number of attention convolution channels\",\n      
  )\n        group.add_argument(\n            \"--aconv-filts\",\n            default=15,\n            type=int,\n            help=\"Filter size of attention convolution\",\n        )\n        group.add_argument(\n            \"--cumulate-att-w\",\n            default=True,\n            type=strtobool,\n            help=\"Whether or not to cumulate attention weights\",\n        )\n        # decoder\n        group.add_argument(\n            \"--dlayers\", default=2, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=1024, type=int, help=\"Number of decoder hidden units\"\n        )\n        group.add_argument(\n            \"--prenet-layers\", default=2, type=int, help=\"Number of prenet layers\"\n        )\n        group.add_argument(\n            \"--prenet-units\",\n            default=256,\n            type=int,\n            help=\"Number of prenet hidden units\",\n        )\n        group.add_argument(\n            \"--postnet-layers\", default=5, type=int, help=\"Number of postnet layers\"\n        )\n        group.add_argument(\n            \"--postnet-chans\", default=512, type=int, help=\"Number of postnet channels\"\n        )\n        group.add_argument(\n            \"--postnet-filts\", default=5, type=int, help=\"Filter size of postnet\"\n        )\n        group.add_argument(\n            \"--output-activation\",\n            default=None,\n            type=str,\n            nargs=\"?\",\n            help=\"Output activation function\",\n        )\n        # cbhg\n        group.add_argument(\n            \"--use-cbhg\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use CBHG module\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-bank-layers\",\n            default=8,\n            type=int,\n            help=\"Number of convoluional bank layers in CBHG\",\n        )\n        group.add_argument(\n            
\"--cbhg-conv-bank-chans\",\n            default=128,\n            type=int,\n            help=\"Number of convoluional bank channles in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-proj-filts\",\n            default=3,\n            type=int,\n            help=\"Filter size of convoluional projection layer in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-proj-chans\",\n            default=256,\n            type=int,\n            help=\"Number of convoluional projection channels in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-highway-layers\",\n            default=4,\n            type=int,\n            help=\"Number of highway layers in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-highway-units\",\n            default=128,\n            type=int,\n            help=\"Number of highway units in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-gru-units\",\n            default=256,\n            type=int,\n            help=\"Number of GRU units in CBHG\",\n        )\n        # model (parameter) related\n        group.add_argument(\n            \"--use-batch-norm\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use batch normalization\",\n        )\n        group.add_argument(\n            \"--use-concate\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to concatenate encoder embedding with decoder outputs\",\n        )\n        group.add_argument(\n            \"--use-residual\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use residual connection in conv layer\",\n        )\n        group.add_argument(\n            \"--dropout-rate\", default=0.5, type=float, help=\"Dropout rate\"\n        )\n        group.add_argument(\n            \"--zoneout-rate\", default=0.1, type=float, help=\"Zoneout rate\"\n        )\n        
group.add_argument(\n            \"--reduction-factor\", default=1, type=int, help=\"Reduction factor\"\n        )\n        group.add_argument(\n            \"--spk-embed-dim\",\n            default=None,\n            type=int,\n            help=\"Number of speaker embedding dimensions\",\n        )\n        group.add_argument(\n            \"--spc-dim\", default=None, type=int, help=\"Number of spectrogram dimensions\"\n        )\n        group.add_argument(\n            \"--pretrained-model\", default=None, type=str, help=\"Pretrained model path\"\n        )\n        # loss related\n        group.add_argument(\n            \"--use-masking\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use masking in calculation of loss\",\n        )\n        group.add_argument(\n            \"--use-weighted-masking\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use weighted masking in calculation of loss\",\n        )\n        group.add_argument(\n            \"--bce-pos-weight\",\n            default=20.0,\n            type=float,\n            help=\"Positive sample weight in BCE calculation \"\n            \"(only for use-masking=True)\",\n        )\n        group.add_argument(\n            \"--use-guided-attn-loss\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-sigma\",\n            default=0.4,\n            type=float,\n            help=\"Sigma in guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-lambda\",\n            default=1.0,\n            type=float,\n            help=\"Lambda in guided attention loss\",\n        )\n        return parser\n\n    def __init__(self, idim, odim, args=None):\n        \"\"\"Initialize Tacotron2 module.\n\n        Args:\n            idim (int): Dimension of the 
inputs.\n            odim (int): Dimension of the outputs.\n            args (Namespace, optional):\n                - spk_embed_dim (int): Dimension of the speaker embedding.\n                - embed_dim (int): Dimension of character embedding.\n                - elayers (int): The number of encoder blstm layers.\n                - eunits (int): The number of encoder blstm units.\n                - econv_layers (int): The number of encoder conv layers.\n                - econv_filts (int): The number of encoder conv filter size.\n                - econv_chans (int): The number of encoder conv filter channels.\n                - dlayers (int): The number of decoder lstm layers.\n                - dunits (int): The number of decoder lstm units.\n                - prenet_layers (int): The number of prenet layers.\n                - prenet_units (int): The number of prenet units.\n                - postnet_layers (int): The number of postnet layers.\n                - postnet_filts (int): The number of postnet filter size.\n                - postnet_chans (int): The number of postnet filter channels.\n                - output_activation (str): The name of activation function for outputs.\n                - adim (int): The number of dimension of mlp in attention.\n                - aconv_chans (int): The number of attention conv filter channels.\n                - aconv_filts (int): The number of attention conv filter size.\n                - cumulate_att_w (bool): Whether to cumulate previous attention weight.\n                - use_batch_norm (bool): Whether to use batch normalization.\n                - use_concate (bool): Whether to concatenate encoder embedding\n                    with decoder lstm outputs.\n                - dropout_rate (float): Dropout rate.\n                - zoneout_rate (float): Zoneout rate.\n                - reduction_factor (int): Reduction factor.\n                - spk_embed_dim (int): Number of speaker embedding dimensions.\n         
       - spc_dim (int): Number of spectrogram embedding dimensions\n                    (only for use_cbhg=True).\n                - use_cbhg (bool): Whether to use CBHG module.\n                - cbhg_conv_bank_layers (int): The number of convolutional banks in CBHG.\n                - cbhg_conv_bank_chans (int): The number of channels of\n                    convolutional bank in CBHG.\n                - cbhg_conv_proj_filts (int):\n                    The number of filter size of projection layer in CBHG.\n                - cbhg_conv_proj_chans (int):\n                    The number of channels of projection layer in CBHG.\n                - cbhg_highway_layers (int):\n                    The number of layers of highway network in CBHG.\n                - cbhg_highway_units (int):\n                    The number of units of highway network in CBHG.\n                - cbhg_gru_units (int): The number of units of GRU in CBHG.\n                - use_masking (bool):\n                    Whether to apply masking for padded part in loss calculation.\n                - use_weighted_masking (bool):\n                    Whether to apply weighted masking in loss calculation.\n                - bce_pos_weight (float):\n                    Weight of positive sample of stop token (only for use_masking=True).\n                - use_guided_attn_loss (bool): Whether to use guided attention loss.\n                - guided_attn_loss_sigma (float): Sigma in guided attention loss.\n                - guided_attn_loss_lambda (float): Lambda in guided attention loss.\n\n        \"\"\"\n        # initialize base classes\n        TTSInterface.__init__(self)\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments\n        args = fill_missing_args(args, self.add_arguments)\n\n        # store hyperparameters\n        self.idim = idim\n        self.odim = odim\n        self.spk_embed_dim = args.spk_embed_dim\n        self.cumulate_att_w = args.cumulate_att_w\n        
self.reduction_factor = args.reduction_factor\n        self.use_cbhg = args.use_cbhg\n        self.use_guided_attn_loss = args.use_guided_attn_loss\n\n        # define activation function for the final output\n        if args.output_activation is None:\n            self.output_activation_fn = None\n        elif hasattr(F, args.output_activation):\n            self.output_activation_fn = getattr(F, args.output_activation)\n        else:\n            raise ValueError(\n                \"there is no such an activation function. (%s)\" % args.output_activation\n            )\n\n        # set padding idx\n        padding_idx = 0\n\n        # define network modules\n        self.enc = Encoder(\n            idim=idim,\n            embed_dim=args.embed_dim,\n            elayers=args.elayers,\n            eunits=args.eunits,\n            econv_layers=args.econv_layers,\n            econv_chans=args.econv_chans,\n            econv_filts=args.econv_filts,\n            use_batch_norm=args.use_batch_norm,\n            use_residual=args.use_residual,\n            dropout_rate=args.dropout_rate,\n            padding_idx=padding_idx,\n        )\n        dec_idim = (\n            args.eunits\n            if args.spk_embed_dim is None\n            else args.eunits + args.spk_embed_dim\n        )\n        if args.atype == \"location\":\n            att = AttLoc(\n                dec_idim, args.dunits, args.adim, args.aconv_chans, args.aconv_filts\n            )\n        elif args.atype == \"forward\":\n            att = AttForward(\n                dec_idim, args.dunits, args.adim, args.aconv_chans, args.aconv_filts\n            )\n            if self.cumulate_att_w:\n                logging.warning(\n                    \"cumulation of attention weights is disabled in forward attention.\"\n                )\n                self.cumulate_att_w = False\n        elif args.atype == \"forward_ta\":\n            att = AttForwardTA(\n                dec_idim,\n                
args.dunits,\n                args.adim,\n                args.aconv_chans,\n                args.aconv_filts,\n                odim,\n            )\n            if self.cumulate_att_w:\n                logging.warning(\n                    \"cumulation of attention weights is disabled in forward attention.\"\n                )\n                self.cumulate_att_w = False\n        else:\n            raise NotImplementedError(\"Support only location or forward\")\n        self.dec = Decoder(\n            idim=dec_idim,\n            odim=odim,\n            att=att,\n            dlayers=args.dlayers,\n            dunits=args.dunits,\n            prenet_layers=args.prenet_layers,\n            prenet_units=args.prenet_units,\n            postnet_layers=args.postnet_layers,\n            postnet_chans=args.postnet_chans,\n            postnet_filts=args.postnet_filts,\n            output_activation_fn=self.output_activation_fn,\n            cumulate_att_w=self.cumulate_att_w,\n            use_batch_norm=args.use_batch_norm,\n            use_concate=args.use_concate,\n            dropout_rate=args.dropout_rate,\n            zoneout_rate=args.zoneout_rate,\n            reduction_factor=args.reduction_factor,\n        )\n        self.taco2_loss = Tacotron2Loss(\n            use_masking=args.use_masking,\n            use_weighted_masking=args.use_weighted_masking,\n            bce_pos_weight=args.bce_pos_weight,\n        )\n        if self.use_guided_attn_loss:\n            self.attn_loss = GuidedAttentionLoss(\n                sigma=args.guided_attn_loss_sigma,\n                alpha=args.guided_attn_loss_lambda,\n            )\n        if self.use_cbhg:\n            self.cbhg = CBHG(\n                idim=odim,\n                odim=args.spc_dim,\n                conv_bank_layers=args.cbhg_conv_bank_layers,\n                conv_bank_chans=args.cbhg_conv_bank_chans,\n                conv_proj_filts=args.cbhg_conv_proj_filts,\n                
conv_proj_chans=args.cbhg_conv_proj_chans,\n                highway_layers=args.cbhg_highway_layers,\n                highway_units=args.cbhg_highway_units,\n                gru_units=args.cbhg_gru_units,\n            )\n            self.cbhg_loss = CBHGLoss(use_masking=args.use_masking)\n\n        # load pretrained model\n        if args.pretrained_model is not None:\n            self.load_pretrained_model(args.pretrained_model)\n\n    def forward(\n        self, xs, ilens, ys, labels, olens, spembs=None, extras=None, *args, **kwargs\n    ):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of padded character ids (B, Tmax).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            labels (Tensor): Batch of padded stop-token targets (B, Lmax).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n            extras (Tensor, optional):\n                Batch of groundtruth spectrograms (B, Lmax, spc_dim).\n\n        Returns:\n            Tensor: Loss value.\n\n        \"\"\"\n        # remove unnecessary padded part (for multi-gpus)\n        max_in = max(ilens)\n        max_out = max(olens)\n        if max_in != xs.shape[1]:\n            xs = xs[:, :max_in]\n        if max_out != ys.shape[1]:\n            ys = ys[:, :max_out]\n            labels = labels[:, :max_out]\n\n        # calculate tacotron2 outputs\n        hs, hlens = self.enc(xs, ilens)\n        if self.spk_embed_dim is not None:\n            spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n            hs = torch.cat([hs, spembs], dim=-1)\n        after_outs, before_outs, logits, att_ws = self.dec(hs, hlens, ys)\n\n        # trim groundtruth length to a multiple of the reduction factor\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in 
olens])\n            max_out = max(olens)\n            ys = ys[:, :max_out]\n            labels = labels[:, :max_out]\n            labels[:, -1] = 1.0  # make sure at least one frame has 1\n\n        # calculate taco2 loss\n        l1_loss, mse_loss, bce_loss = self.taco2_loss(\n            after_outs, before_outs, logits, ys, labels, olens\n        )\n        loss = l1_loss + mse_loss + bce_loss\n        report_keys = [\n            {\"l1_loss\": l1_loss.item()},\n            {\"mse_loss\": mse_loss.item()},\n            {\"bce_loss\": bce_loss.item()},\n        ]\n\n        # calculate attention loss\n        if self.use_guided_attn_loss:\n            # NOTE(kan-bayashi):\n            # length of output for auto-regressive input will be changed when r > 1\n            if self.reduction_factor > 1:\n                olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n            else:\n                olens_in = olens\n            attn_loss = self.attn_loss(att_ws, ilens, olens_in)\n            loss = loss + attn_loss\n            report_keys += [\n                {\"attn_loss\": attn_loss.item()},\n            ]\n\n        # calculate cbhg loss\n        if self.use_cbhg:\n            # remove unnecessary padded part (for multi-gpus)\n            if max_out != extras.shape[1]:\n                extras = extras[:, :max_out]\n\n            # calculate cbhg outputs & loss and report them\n            cbhg_outs, _ = self.cbhg(after_outs, olens)\n            cbhg_l1_loss, cbhg_mse_loss = self.cbhg_loss(cbhg_outs, extras, olens)\n            loss = loss + cbhg_l1_loss + cbhg_mse_loss\n            report_keys += [\n                {\"cbhg_l1_loss\": cbhg_l1_loss.item()},\n                {\"cbhg_mse_loss\": cbhg_mse_loss.item()},\n            ]\n\n        report_keys += [{\"loss\": loss.item()}]\n        self.reporter.report(report_keys)\n\n        return loss\n\n    def inference(self, x, inference_args, spemb=None, *args, **kwargs):\n        
\"\"\"Generate the sequence of features given the sequences of characters.\n\n        Args:\n            x (Tensor): Input sequence of characters (T,).\n            inference_args (Namespace):\n                - threshold (float): Threshold in inference.\n                - minlenratio (float): Minimum length ratio in inference.\n                - maxlenratio (float): Maximum length ratio in inference.\n            spemb (Tensor, optional): Speaker embedding vector (spk_embed_dim).\n\n        Returns:\n            Tensor: Output sequence of features (L, odim).\n            Tensor: Output sequence of stop probabilities (L,).\n            Tensor: Attention weights (L, T).\n\n        \"\"\"\n        # get options\n        threshold = inference_args.threshold\n        minlenratio = inference_args.minlenratio\n        maxlenratio = inference_args.maxlenratio\n        use_att_constraint = getattr(\n            inference_args, \"use_att_constraint\", False\n        )  # keep compatibility\n        backward_window = inference_args.backward_window if use_att_constraint else 0\n        forward_window = inference_args.forward_window if use_att_constraint else 0\n\n        # inference\n        h = self.enc.inference(x)\n        if self.spk_embed_dim is not None:\n            spemb = F.normalize(spemb, dim=0).unsqueeze(0).expand(h.size(0), -1)\n            h = torch.cat([h, spemb], dim=-1)\n        outs, probs, att_ws = self.dec.inference(\n            h,\n            threshold,\n            minlenratio,\n            maxlenratio,\n            use_att_constraint=use_att_constraint,\n            backward_window=backward_window,\n            forward_window=forward_window,\n        )\n\n        if self.use_cbhg:\n            cbhg_outs = self.cbhg.inference(outs)\n            return cbhg_outs, probs, att_ws\n        else:\n            return outs, probs, att_ws\n\n    def calculate_all_attentions(\n        self, xs, ilens, ys, spembs=None, keep_tensor=False, *args, **kwargs\n    ):\n 
       \"\"\"Calculate all of the attention weights.\n\n        Args:\n            xs (Tensor): Batch of padded character ids (B, Tmax).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n            keep_tensor (bool, optional): Whether to keep original tensor.\n\n        Returns:\n            Union[ndarray, Tensor]: Batch of attention weights (B, Lmax, Tmax).\n\n        \"\"\"\n        # check ilens type (should be list of int)\n        if isinstance(ilens, torch.Tensor) or isinstance(ilens, np.ndarray):\n            ilens = list(map(int, ilens))\n\n        self.eval()\n        with torch.no_grad():\n            hs, hlens = self.enc(xs, ilens)\n            if self.spk_embed_dim is not None:\n                spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n                hs = torch.cat([hs, spembs], dim=-1)\n            att_ws = self.dec.calculate_all_attentions(hs, hlens, ys)\n        self.train()\n\n        if keep_tensor:\n            return att_ws\n        else:\n            return att_ws.cpu().numpy()\n\n    @property\n    def base_plot_keys(self):\n        \"\"\"Return base key names to plot during training.\n\n        keys should match what `chainer.reporter` reports.\n        If you add the key `loss`, the reporter will report `main/loss`\n        and `validation/main/loss` values.\n        also `loss.png` will be created as a figure visulizing `main/loss`\n        and `validation/main/loss` values.\n\n        Returns:\n            list: List of strings which are base keys to plot during training.\n\n        \"\"\"\n        plot_keys = [\"loss\", \"l1_loss\", \"mse_loss\", \"bce_loss\"]\n        if self.use_guided_attn_loss:\n            plot_keys += 
[\"attn_loss\"]\n        if self.use_cbhg:\n            plot_keys += [\"cbhg_l1_loss\", \"cbhg_mse_loss\"]\n        return plot_keys\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_tts_transformer.py",
    "content": "# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"TTS-Transformer related modules.\"\"\"\n\nimport logging\n\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.e2e_tts_tacotron2 import GuidedAttentionLoss\nfrom espnet.nets.pytorch_backend.e2e_tts_tacotron2 import (\n    Tacotron2Loss as TransformerLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Postnet\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Prenet as DecoderPrenet\nfrom espnet.nets.pytorch_backend.tacotron2.encoder import Encoder as EncoderPrenet\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.decoder import Decoder\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.embedding import ScaledPositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.initializer import initialize\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass GuidedMultiHeadAttentionLoss(GuidedAttentionLoss):\n    \"\"\"Guided attention loss function module for multi head attention.\n\n    Args:\n        sigma (float, optional): Standard deviation to control\n        how close attention to a diagonal.\n        alpha (float, optional): Scaling coefficient (lambda).\n        reset_always (bool, optional): Whether to always reset masks.\n\n    \"\"\"\n\n    def forward(self, att_ws, ilens, olens):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            att_ws (Tensor):\n                Batch 
of multi head attention weights (B, H, T_max_out, T_max_in).\n            ilens (LongTensor): Batch of input lengths (B,).\n            olens (LongTensor): Batch of output lengths (B,).\n\n        Returns:\n            Tensor: Guided attention loss value.\n\n        \"\"\"\n        if self.guided_attn_masks is None:\n            self.guided_attn_masks = (\n                self._make_guided_attention_masks(ilens, olens)\n                .to(att_ws.device)\n                .unsqueeze(1)\n            )\n        if self.masks is None:\n            self.masks = self._make_masks(ilens, olens).to(att_ws.device).unsqueeze(1)\n        losses = self.guided_attn_masks * att_ws\n        loss = torch.mean(losses.masked_select(self.masks))\n        if self.reset_always:\n            self._reset_masks()\n\n        return self.alpha * loss\n\n\ntry:\n    from espnet.nets.pytorch_backend.transformer.plot import PlotAttentionReport\nexcept (ImportError, TypeError):\n    TTSPlot = None\nelse:\n\n    class TTSPlot(PlotAttentionReport):\n        \"\"\"Attention plot module for TTS-Transformer.\"\"\"\n\n        def plotfn(\n            self, data_dict, uttid_list, attn_dict, outdir, suffix=\"png\", savefn=None\n        ):\n            \"\"\"Plot multi head attentions.\n\n            Args:\n                data_dict (dict): Utts info from json file.\n                uttid_list (list): List of utt_id.\n                attn_dict (dict): Multi head attention dict.\n                    Values should be numpy.ndarray (H, L, T)\n                outdir (str): Directory name to save figures.\n                suffix (str): Filename suffix including image type (e.g., png).\n                savefn (function): Function to save figures.\n\n            \"\"\"\n            import matplotlib.pyplot as plt\n            from espnet.nets.pytorch_backend.transformer.plot import (\n                _plot_and_save_attention,  # noqa: H301\n            )\n\n            for name, att_ws in attn_dict.items():\n   
             for utt_id, att_w in zip(uttid_list, att_ws):\n                    filename = \"%s/%s.%s.%s\" % (outdir, utt_id, name, suffix)\n                    if \"fbank\" in name:\n                        fig = plt.Figure()\n                        ax = fig.subplots(1, 1)\n                        ax.imshow(att_w, aspect=\"auto\")\n                        ax.set_xlabel(\"frames\")\n                        ax.set_ylabel(\"fbank coeff\")\n                        fig.tight_layout()\n                    else:\n                        fig = _plot_and_save_attention(att_w, filename)\n                    savefn(fig, filename)\n\n\nclass Transformer(TTSInterface, torch.nn.Module):\n    \"\"\"Text-to-Speech Transformer module.\n\n    This is a module of text-to-speech Transformer described\n    in `Neural Speech Synthesis with Transformer Network`_,\n    which convert the sequence of characters\n    or phonemes into the sequence of Mel-filterbanks.\n\n    .. _`Neural Speech Synthesis with Transformer Network`:\n        https://arxiv.org/pdf/1809.08895.pdf\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add model-specific arguments to the parser.\"\"\"\n        group = parser.add_argument_group(\"transformer model setting\")\n        # network structure related\n        group.add_argument(\n            \"--embed-dim\",\n            default=512,\n            type=int,\n            help=\"Dimension of character embedding in encoder prenet\",\n        )\n        group.add_argument(\n            \"--eprenet-conv-layers\",\n            default=3,\n            type=int,\n            help=\"Number of encoder prenet convolution layers\",\n        )\n        group.add_argument(\n            \"--eprenet-conv-chans\",\n            default=256,\n            type=int,\n            help=\"Number of encoder prenet convolution channels\",\n        )\n        group.add_argument(\n            \"--eprenet-conv-filts\",\n            default=5,\n            
type=int,\n            help=\"Filter size of encoder prenet convolution\",\n        )\n        group.add_argument(\n            \"--dprenet-layers\",\n            default=2,\n            type=int,\n            help=\"Number of decoder prenet layers\",\n        )\n        group.add_argument(\n            \"--dprenet-units\",\n            default=256,\n            type=int,\n            help=\"Number of decoder prenet hidden units\",\n        )\n        group.add_argument(\n            \"--elayers\", default=3, type=int, help=\"Number of encoder layers\"\n        )\n        group.add_argument(\n            \"--eunits\", default=1536, type=int, help=\"Number of encoder hidden units\"\n        )\n        group.add_argument(\n            \"--adim\",\n            default=384,\n            type=int,\n            help=\"Number of attention transformation dimensions\",\n        )\n        group.add_argument(\n            \"--aheads\",\n            default=4,\n            type=int,\n            help=\"Number of heads for multi head attention\",\n        )\n        group.add_argument(\n            \"--dlayers\", default=3, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=1536, type=int, help=\"Number of decoder hidden units\"\n        )\n        group.add_argument(\n            \"--positionwise-layer-type\",\n            default=\"linear\",\n            type=str,\n            choices=[\"linear\", \"conv1d\", \"conv1d-linear\"],\n            help=\"Positionwise layer type.\",\n        )\n        group.add_argument(\n            \"--positionwise-conv-kernel-size\",\n            default=1,\n            type=int,\n            help=\"Kernel size of positionwise conv1d layer\",\n        )\n        group.add_argument(\n            \"--postnet-layers\", default=5, type=int, help=\"Number of postnet layers\"\n        )\n        group.add_argument(\n            \"--postnet-chans\", default=256, type=int, 
help=\"Number of postnet channels\"\n        )\n        group.add_argument(\n            \"--postnet-filts\", default=5, type=int, help=\"Filter size of postnet\"\n        )\n        group.add_argument(\n            \"--use-scaled-pos-enc\",\n            default=True,\n            type=strtobool,\n            help=\"Use trainable scaled positional encoding \"\n            \"instead of the fixed scale one.\",\n        )\n        group.add_argument(\n            \"--use-batch-norm\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use batch normalization\",\n        )\n        group.add_argument(\n            \"--encoder-normalize-before\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to apply layer norm before encoder block\",\n        )\n        group.add_argument(\n            \"--decoder-normalize-before\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to apply layer norm before decoder block\",\n        )\n        group.add_argument(\n            \"--encoder-concat-after\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to concatenate attention layer's input and output in encoder\",\n        )\n        group.add_argument(\n            \"--decoder-concat-after\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to concatenate attention layer's input and output in decoder\",\n        )\n        group.add_argument(\n            \"--reduction-factor\", default=1, type=int, help=\"Reduction factor\"\n        )\n        group.add_argument(\n            \"--spk-embed-dim\",\n            default=None,\n            type=int,\n            help=\"Number of speaker embedding dimensions\",\n        )\n        group.add_argument(\n            \"--spk-embed-integration-type\",\n            type=str,\n            default=\"add\",\n            choices=[\"add\", \"concat\"],\n            
help=\"How to integrate speaker embedding\",\n        )\n        # training related\n        group.add_argument(\n            \"--transformer-init\",\n            type=str,\n            default=\"pytorch\",\n            choices=[\n                \"pytorch\",\n                \"xavier_uniform\",\n                \"xavier_normal\",\n                \"kaiming_uniform\",\n                \"kaiming_normal\",\n            ],\n            help=\"How to initialize transformer parameters\",\n        )\n        group.add_argument(\n            \"--initial-encoder-alpha\",\n            type=float,\n            default=1.0,\n            help=\"Initial alpha value in encoder's ScaledPositionalEncoding\",\n        )\n        group.add_argument(\n            \"--initial-decoder-alpha\",\n            type=float,\n            default=1.0,\n            help=\"Initial alpha value in decoder's ScaledPositionalEncoding\",\n        )\n        group.add_argument(\n            \"--transformer-lr\",\n            default=1.0,\n            type=float,\n            help=\"Initial value of learning rate\",\n        )\n        group.add_argument(\n            \"--transformer-warmup-steps\",\n            default=4000,\n            type=int,\n            help=\"Optimizer warmup steps\",\n        )\n        group.add_argument(\n            \"--transformer-enc-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder except for attention\",\n        )\n        group.add_argument(\n            \"--transformer-enc-positional-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder positional encoding\",\n        )\n        group.add_argument(\n            \"--transformer-enc-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder self-attention\",\n        )\n        group.add_argument(\n          
  \"--transformer-dec-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder \"\n            \"except for attention and pos encoding\",\n        )\n        group.add_argument(\n            \"--transformer-dec-positional-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder positional encoding\",\n        )\n        group.add_argument(\n            \"--transformer-dec-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder self-attention\",\n        )\n        group.add_argument(\n            \"--transformer-enc-dec-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder-decoder attention\",\n        )\n        group.add_argument(\n            \"--eprenet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in encoder prenet\",\n        )\n        group.add_argument(\n            \"--dprenet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in decoder prenet\",\n        )\n        group.add_argument(\n            \"--postnet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in postnet\",\n        )\n        group.add_argument(\n            \"--pretrained-model\", default=None, type=str, help=\"Pretrained model path\"\n        )\n        # loss related\n        group.add_argument(\n            \"--use-masking\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use masking in calculation of loss\",\n        )\n        group.add_argument(\n            \"--use-weighted-masking\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use weighted masking in calculation of 
loss\",\n        )\n        group.add_argument(\n            \"--loss-type\",\n            default=\"L1\",\n            choices=[\"L1\", \"L2\", \"L1+L2\"],\n            help=\"How to calculate loss\",\n        )\n        group.add_argument(\n            \"--bce-pos-weight\",\n            default=5.0,\n            type=float,\n            help=\"Positive sample weight in BCE calculation \"\n            \"(only for use-masking=True)\",\n        )\n        group.add_argument(\n            \"--use-guided-attn-loss\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-sigma\",\n            default=0.4,\n            type=float,\n            help=\"Sigma in guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-lambda\",\n            default=1.0,\n            type=float,\n            help=\"Lambda in guided attention loss\",\n        )\n        group.add_argument(\n            \"--num-heads-applied-guided-attn\",\n            default=2,\n            type=int,\n            help=\"Number of heads in each layer to apply guided attention loss. \"\n            \"If set to -1, it is applied to all of the heads.\",\n        )\n        group.add_argument(\n            \"--num-layers-applied-guided-attn\",\n            default=2,\n            type=int,\n            help=\"Number of layers to apply guided attention loss. \"\n            \"If set to -1, it is applied to all of the layers.\",\n        )\n        group.add_argument(\n            \"--modules-applied-guided-attn\",\n            type=str,\n            nargs=\"+\",\n            default=[\"encoder-decoder\"],\n            help=\"List of module names to apply guided attention loss\",\n        )\n        return parser\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Return plot class for attention weight plot.\"\"\"\n        
return TTSPlot\n\n    def __init__(self, idim, odim, args=None):\n        \"\"\"Initialize TTS-Transformer module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            args (Namespace, optional):\n                - embed_dim (int): Dimension of character embedding.\n                - eprenet_conv_layers (int):\n                    Number of encoder prenet convolution layers.\n                - eprenet_conv_chans (int):\n                    Number of encoder prenet convolution channels.\n                - eprenet_conv_filts (int): Filter size of encoder prenet convolution.\n                - dprenet_layers (int): Number of decoder prenet layers.\n                - dprenet_units (int): Number of decoder prenet hidden units.\n                - elayers (int): Number of encoder layers.\n                - eunits (int): Number of encoder hidden units.\n                - adim (int): Number of attention transformation dimensions.\n                - aheads (int): Number of heads for multi head attention.\n                - dlayers (int): Number of decoder layers.\n                - dunits (int): Number of decoder hidden units.\n                - postnet_layers (int): Number of postnet layers.\n                - postnet_chans (int): Number of postnet channels.\n                - postnet_filts (int): Filter size of postnet.\n                - use_scaled_pos_enc (bool):\n                    Whether to use trainable scaled positional encoding.\n                - use_batch_norm (bool):\n                    Whether to use batch normalization in encoder prenet.\n                - encoder_normalize_before (bool):\n                    Whether to perform layer normalization before encoder block.\n                - decoder_normalize_before (bool):\n                    Whether to perform layer normalization before decoder block.\n                - encoder_concat_after (bool): Whether to concatenate attention\n 
                   layer's input and output in encoder.\n                - decoder_concat_after (bool): Whether to concatenate attention\n                    layer's input and output in decoder.\n                - reduction_factor (int): Reduction factor.\n                - spk_embed_dim (int): Number of speaker embedding dimensions.\n                - spk_embed_integration_type: How to integrate speaker embedding.\n                - transformer_init (str): How to initialize transformer parameters.\n                - transformer_lr (float): Initial value of learning rate.\n                - transformer_warmup_steps (int): Optimizer warmup steps.\n                - transformer_enc_dropout_rate (float):\n                    Dropout rate in encoder except attention & positional encoding.\n                - transformer_enc_positional_dropout_rate (float):\n                    Dropout rate after encoder positional encoding.\n                - transformer_enc_attn_dropout_rate (float):\n                    Dropout rate in encoder self-attention module.\n                - transformer_dec_dropout_rate (float):\n                    Dropout rate in decoder except attention & positional encoding.\n                - transformer_dec_positional_dropout_rate (float):\n                    Dropout rate after decoder positional encoding.\n                - transformer_dec_attn_dropout_rate (float):\n                    Dropout rate in decoder self-attention module.\n                - transformer_enc_dec_attn_dropout_rate (float):\n                    Dropout rate in encoder-decoder attention module.\n                - eprenet_dropout_rate (float): Dropout rate in encoder prenet.\n                - dprenet_dropout_rate (float): Dropout rate in decoder prenet.\n                - postnet_dropout_rate (float): Dropout rate in postnet.\n                - use_masking (bool):\n                    Whether to apply masking for padded part in loss calculation.\n                - 
use_weighted_masking (bool):\n                    Whether to apply weighted masking in loss calculation.\n                - bce_pos_weight (float): Positive sample weight in bce calculation\n                    (only for use_masking=True).\n                - loss_type (str): How to calculate loss.\n                - use_guided_attn_loss (bool): Whether to use guided attention loss.\n                - num_heads_applied_guided_attn (int):\n                    Number of heads in each layer to apply guided attention loss.\n                - num_layers_applied_guided_attn (int):\n                    Number of layers to apply guided attention loss.\n                - modules_applied_guided_attn (list):\n                    List of module names to apply guided attention loss.\n                - guided_attn_loss_sigma (float): Sigma in guided attention loss.\n                - guided_attn_loss_lambda (float): Lambda in guided attention loss.\n\n        \"\"\"\n        # initialize base classes\n        TTSInterface.__init__(self)\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments\n        args = fill_missing_args(args, self.add_arguments)\n\n        # store hyperparameters\n        self.idim = idim\n        self.odim = odim\n        self.spk_embed_dim = args.spk_embed_dim\n        if self.spk_embed_dim is not None:\n            self.spk_embed_integration_type = args.spk_embed_integration_type\n        self.use_scaled_pos_enc = args.use_scaled_pos_enc\n        self.reduction_factor = args.reduction_factor\n        self.loss_type = args.loss_type\n        self.use_guided_attn_loss = args.use_guided_attn_loss\n        if self.use_guided_attn_loss:\n            if args.num_layers_applied_guided_attn == -1:\n                self.num_layers_applied_guided_attn = args.elayers\n            else:\n                self.num_layers_applied_guided_attn = (\n                    args.num_layers_applied_guided_attn\n                )\n            if 
args.num_heads_applied_guided_attn == -1:\n                self.num_heads_applied_guided_attn = args.aheads\n            else:\n                self.num_heads_applied_guided_attn = args.num_heads_applied_guided_attn\n            self.modules_applied_guided_attn = args.modules_applied_guided_attn\n\n        # use idx 0 as padding idx\n        padding_idx = 0\n\n        # get positional encoding class\n        pos_enc_class = (\n            ScaledPositionalEncoding if self.use_scaled_pos_enc else PositionalEncoding\n        )\n\n        # define transformer encoder\n        if args.eprenet_conv_layers != 0:\n            # encoder prenet\n            encoder_input_layer = torch.nn.Sequential(\n                EncoderPrenet(\n                    idim=idim,\n                    embed_dim=args.embed_dim,\n                    elayers=0,\n                    econv_layers=args.eprenet_conv_layers,\n                    econv_chans=args.eprenet_conv_chans,\n                    econv_filts=args.eprenet_conv_filts,\n                    use_batch_norm=args.use_batch_norm,\n                    dropout_rate=args.eprenet_dropout_rate,\n                    padding_idx=padding_idx,\n                ),\n                torch.nn.Linear(args.eprenet_conv_chans, args.adim),\n            )\n        else:\n            encoder_input_layer = torch.nn.Embedding(\n                num_embeddings=idim, embedding_dim=args.adim, padding_idx=padding_idx\n            )\n        self.encoder = Encoder(\n            idim=idim,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=encoder_input_layer,\n            dropout_rate=args.transformer_enc_dropout_rate,\n            positional_dropout_rate=args.transformer_enc_positional_dropout_rate,\n            attention_dropout_rate=args.transformer_enc_attn_dropout_rate,\n            pos_enc_class=pos_enc_class,\n            
normalize_before=args.encoder_normalize_before,\n            concat_after=args.encoder_concat_after,\n            positionwise_layer_type=args.positionwise_layer_type,\n            positionwise_conv_kernel_size=args.positionwise_conv_kernel_size,\n        )\n\n        # define projection layer\n        if self.spk_embed_dim is not None:\n            if self.spk_embed_integration_type == \"add\":\n                self.projection = torch.nn.Linear(self.spk_embed_dim, args.adim)\n            else:\n                self.projection = torch.nn.Linear(\n                    args.adim + self.spk_embed_dim, args.adim\n                )\n\n        # define transformer decoder\n        if args.dprenet_layers != 0:\n            # decoder prenet\n            decoder_input_layer = torch.nn.Sequential(\n                DecoderPrenet(\n                    idim=odim,\n                    n_layers=args.dprenet_layers,\n                    n_units=args.dprenet_units,\n                    dropout_rate=args.dprenet_dropout_rate,\n                ),\n                torch.nn.Linear(args.dprenet_units, args.adim),\n            )\n        else:\n            decoder_input_layer = \"linear\"\n        self.decoder = Decoder(\n            odim=-1,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.dunits,\n            num_blocks=args.dlayers,\n            dropout_rate=args.transformer_dec_dropout_rate,\n            positional_dropout_rate=args.transformer_dec_positional_dropout_rate,\n            self_attention_dropout_rate=args.transformer_dec_attn_dropout_rate,\n            src_attention_dropout_rate=args.transformer_enc_dec_attn_dropout_rate,\n            input_layer=decoder_input_layer,\n            use_output_layer=False,\n            pos_enc_class=pos_enc_class,\n            normalize_before=args.decoder_normalize_before,\n            concat_after=args.decoder_concat_after,\n        )\n\n        # define final projection\n       
 self.feat_out = torch.nn.Linear(args.adim, odim * args.reduction_factor)\n        self.prob_out = torch.nn.Linear(args.adim, args.reduction_factor)\n\n        # define postnet\n        self.postnet = (\n            None\n            if args.postnet_layers == 0\n            else Postnet(\n                idim=idim,\n                odim=odim,\n                n_layers=args.postnet_layers,\n                n_chans=args.postnet_chans,\n                n_filts=args.postnet_filts,\n                use_batch_norm=args.use_batch_norm,\n                dropout_rate=args.postnet_dropout_rate,\n            )\n        )\n\n        # define loss function\n        self.criterion = TransformerLoss(\n            use_masking=args.use_masking,\n            use_weighted_masking=args.use_weighted_masking,\n            bce_pos_weight=args.bce_pos_weight,\n        )\n        if self.use_guided_attn_loss:\n            self.attn_criterion = GuidedMultiHeadAttentionLoss(\n                sigma=args.guided_attn_loss_sigma,\n                alpha=args.guided_attn_loss_lambda,\n            )\n\n        # initialize parameters\n        self._reset_parameters(\n            init_type=args.transformer_init,\n            init_enc_alpha=args.initial_encoder_alpha,\n            init_dec_alpha=args.initial_decoder_alpha,\n        )\n\n        # load pretrained model\n        if args.pretrained_model is not None:\n            self.load_pretrained_model(args.pretrained_model)\n\n    def _reset_parameters(self, init_type, init_enc_alpha=1.0, init_dec_alpha=1.0):\n        # initialize parameters\n        initialize(self, init_type)\n\n        # initialize alpha in scaled positional encoding\n        if self.use_scaled_pos_enc:\n            self.encoder.embed[-1].alpha.data = torch.tensor(init_enc_alpha)\n            self.decoder.embed[-1].alpha.data = torch.tensor(init_dec_alpha)\n\n    def _add_first_frame_and_remove_last_frame(self, ys):\n        ys_in = torch.cat(\n            
[ys.new_zeros((ys.shape[0], 1, ys.shape[2])), ys[:, :-1]], dim=1\n        )\n        return ys_in\n\n    def forward(self, xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of padded character ids (B, Tmax).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n\n        Returns:\n            Tensor: Loss value.\n\n        \"\"\"\n        # remove unnecessary padded part (for multi-gpus)\n        max_ilen = max(ilens)\n        max_olen = max(olens)\n        if max_ilen != xs.shape[1]:\n            xs = xs[:, :max_ilen]\n        if max_olen != ys.shape[1]:\n            ys = ys[:, :max_olen]\n            labels = labels[:, :max_olen]\n\n        # forward encoder\n        x_masks = self._source_mask(ilens)\n        hs, h_masks = self.encoder(xs, x_masks)\n\n        # integrate speaker embedding\n        if self.spk_embed_dim is not None:\n            hs = self._integrate_with_spk_embed(hs, spembs)\n\n        # thin out frames for reduction factor (B, Lmax, odim) ->  (B, Lmax//r, odim)\n        if self.reduction_factor > 1:\n            ys_in = ys[:, self.reduction_factor - 1 :: self.reduction_factor]\n            olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n        else:\n            ys_in, olens_in = ys, olens\n\n        # add first zero frame and remove last frame for auto-regressive\n        ys_in = self._add_first_frame_and_remove_last_frame(ys_in)\n\n        # forward decoder\n        y_masks = self._target_mask(olens_in)\n        zs, _ = self.decoder(ys_in, y_masks, hs, h_masks)\n        # (B, Lmax//r, odim * r) -> (B, Lmax//r * r, odim)\n        
before_outs = self.feat_out(zs).view(zs.size(0), -1, self.odim)\n        # (B, Lmax//r, r) -> (B, Lmax//r * r)\n        logits = self.prob_out(zs).view(zs.size(0), -1)\n\n        # postnet -> (B, Lmax//r * r, odim)\n        if self.postnet is None:\n            after_outs = before_outs\n        else:\n            after_outs = before_outs + self.postnet(\n                before_outs.transpose(1, 2)\n            ).transpose(1, 2)\n\n        # modify mod part of groundtruth\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in olens])\n            max_olen = max(olens)\n            ys = ys[:, :max_olen]\n            labels = labels[:, :max_olen]\n            labels[:, -1] = 1.0  # make sure at least one frame has 1\n\n        # calculate loss values\n        l1_loss, l2_loss, bce_loss = self.criterion(\n            after_outs, before_outs, logits, ys, labels, olens\n        )\n        if self.loss_type == \"L1\":\n            loss = l1_loss + bce_loss\n        elif self.loss_type == \"L2\":\n            loss = l2_loss + bce_loss\n        elif self.loss_type == \"L1+L2\":\n            loss = l1_loss + l2_loss + bce_loss\n        else:\n            raise ValueError(\"unknown --loss-type \" + self.loss_type)\n        report_keys = [\n            {\"l1_loss\": l1_loss.item()},\n            {\"l2_loss\": l2_loss.item()},\n            {\"bce_loss\": bce_loss.item()},\n            {\"loss\": loss.item()},\n        ]\n\n        # calculate guided attention loss\n        if self.use_guided_attn_loss:\n            # calculate for encoder\n            if \"encoder\" in self.modules_applied_guided_attn:\n                att_ws = []\n                for idx, layer_idx in enumerate(\n                    reversed(range(len(self.encoder.encoders)))\n                ):\n                    att_ws += [\n                        self.encoder.encoders[layer_idx].self_attn.attn[\n                            :, : 
self.num_heads_applied_guided_attn\n                        ]\n                    ]\n                    if idx + 1 == self.num_layers_applied_guided_attn:\n                        break\n                att_ws = torch.cat(att_ws, dim=1)  # (B, H*L, T_in, T_in)\n                enc_attn_loss = self.attn_criterion(att_ws, ilens, ilens)\n                loss = loss + enc_attn_loss\n                report_keys += [{\"enc_attn_loss\": enc_attn_loss.item()}]\n            # calculate for decoder\n            if \"decoder\" in self.modules_applied_guided_attn:\n                att_ws = []\n                for idx, layer_idx in enumerate(\n                    reversed(range(len(self.decoder.decoders)))\n                ):\n                    att_ws += [\n                        self.decoder.decoders[layer_idx].self_attn.attn[\n                            :, : self.num_heads_applied_guided_attn\n                        ]\n                    ]\n                    if idx + 1 == self.num_layers_applied_guided_attn:\n                        break\n                att_ws = torch.cat(att_ws, dim=1)  # (B, H*L, T_out, T_out)\n                dec_attn_loss = self.attn_criterion(att_ws, olens_in, olens_in)\n                loss = loss + dec_attn_loss\n                report_keys += [{\"dec_attn_loss\": dec_attn_loss.item()}]\n            # calculate for encoder-decoder\n            if \"encoder-decoder\" in self.modules_applied_guided_attn:\n                att_ws = []\n                for idx, layer_idx in enumerate(\n                    reversed(range(len(self.decoder.decoders)))\n                ):\n                    att_ws += [\n                        self.decoder.decoders[layer_idx].src_attn.attn[\n                            :, : self.num_heads_applied_guided_attn\n                        ]\n                    ]\n                    if idx + 1 == self.num_layers_applied_guided_attn:\n                        break\n                att_ws = torch.cat(att_ws, dim=1)  # 
(B, H*L, T_out, T_in)\n                enc_dec_attn_loss = self.attn_criterion(att_ws, ilens, olens_in)\n                loss = loss + enc_dec_attn_loss\n                report_keys += [{\"enc_dec_attn_loss\": enc_dec_attn_loss.item()}]\n\n        # report extra information\n        if self.use_scaled_pos_enc:\n            report_keys += [\n                {\"encoder_alpha\": self.encoder.embed[-1].alpha.data.item()},\n                {\"decoder_alpha\": self.decoder.embed[-1].alpha.data.item()},\n            ]\n        self.reporter.report(report_keys)\n\n        return loss\n\n    def inference(self, x, inference_args, spemb=None, *args, **kwargs):\n        \"\"\"Generate the sequence of features given the sequences of characters.\n\n        Args:\n            x (Tensor): Input sequence of characters (T,).\n            inference_args (Namespace):\n                - threshold (float): Threshold in inference.\n                - minlenratio (float): Minimum length ratio in inference.\n                - maxlenratio (float): Maximum length ratio in inference.\n            spemb (Tensor, optional): Speaker embedding vector (spk_embed_dim).\n\n        Returns:\n            Tensor: Output sequence of features (L, odim).\n            Tensor: Output sequence of stop probabilities (L,).\n            Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).\n\n        \"\"\"\n        # get options\n        threshold = inference_args.threshold\n        minlenratio = inference_args.minlenratio\n        maxlenratio = inference_args.maxlenratio\n        use_att_constraint = getattr(\n            inference_args, \"use_att_constraint\", False\n        )  # keep compatibility\n        if use_att_constraint:\n            logging.warning(\n                \"Attention constraint is not yet supported in Transformer. 
Not enabled.\"\n            )\n\n        # forward encoder\n        xs = x.unsqueeze(0)\n        hs, _ = self.encoder(xs, None)\n\n        # integrate speaker embedding\n        if self.spk_embed_dim is not None:\n            spembs = spemb.unsqueeze(0)\n            hs = self._integrate_with_spk_embed(hs, spembs)\n\n        # set limits of length\n        maxlen = int(hs.size(1) * maxlenratio / self.reduction_factor)\n        minlen = int(hs.size(1) * minlenratio / self.reduction_factor)\n\n        # initialize\n        idx = 0\n        ys = hs.new_zeros(1, 1, self.odim)\n        outs, probs = [], []\n\n        # forward decoder step-by-step\n        z_cache = self.decoder.init_state(x)\n        while True:\n            # update index\n            idx += 1\n\n            # calculate output and stop prob at idx-th step\n            y_masks = subsequent_mask(idx).unsqueeze(0).to(x.device)\n            z, z_cache = self.decoder.forward_one_step(\n                ys, y_masks, hs, cache=z_cache\n            )  # (B, adim)\n            outs += [\n                self.feat_out(z).view(self.reduction_factor, self.odim)\n            ]  # [(r, odim), ...]\n            probs += [torch.sigmoid(self.prob_out(z))[0]]  # [(r), ...]\n\n            # update next inputs\n            ys = torch.cat(\n                (ys, outs[-1][-1].view(1, 1, self.odim)), dim=1\n            )  # (1, idx + 1, odim)\n\n            # get attention weights\n            att_ws_ = []\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention) and \"src\" in name:\n                    att_ws_ += [m.attn[0, :, -1].unsqueeze(1)]  # [(#heads, 1, T),...]\n            if idx == 1:\n                att_ws = att_ws_\n            else:\n                # [(#heads, l, T), ...]\n                att_ws = [\n                    torch.cat([att_w, att_w_], dim=1)\n                    for att_w, att_w_ in zip(att_ws, att_ws_)\n                ]\n\n            # check whether 
to finish generation\n            if int(sum(probs[-1] >= threshold)) > 0 or idx >= maxlen:\n                # check minimum length\n                if idx < minlen:\n                    continue\n                outs = (\n                    torch.cat(outs, dim=0).unsqueeze(0).transpose(1, 2)\n                )  # (L, odim) -> (1, L, odim) -> (1, odim, L)\n                if self.postnet is not None:\n                    outs = outs + self.postnet(outs)  # (1, odim, L)\n                outs = outs.transpose(2, 1).squeeze(0)  # (L, odim)\n                probs = torch.cat(probs, dim=0)\n                break\n\n        # concatenate attention weights -> (#layers, #heads, L, T)\n        att_ws = torch.stack(att_ws, dim=0)\n\n        return outs, probs, att_ws\n\n    def calculate_all_attentions(\n        self,\n        xs,\n        ilens,\n        ys,\n        olens,\n        spembs=None,\n        skip_output=False,\n        keep_tensor=False,\n        *args,\n        **kwargs\n    ):\n        \"\"\"Calculate all of the attention weights.\n\n        Args:\n            xs (Tensor): Batch of padded character ids (B, Tmax).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n            skip_output (bool, optional): Whether to skip calculate the final output.\n            keep_tensor (bool, optional): Whether to keep original tensor.\n\n        Returns:\n            dict: Dict of attention weights and outputs.\n\n        \"\"\"\n        self.eval()\n        with torch.no_grad():\n            # forward encoder\n            x_masks = self._source_mask(ilens)\n            hs, h_masks = self.encoder(xs, x_masks)\n\n            # integrate speaker embedding\n            if self.spk_embed_dim 
is not None:\n                hs = self._integrate_with_spk_embed(hs, spembs)\n\n            # thin out frames for reduction factor\n            # (B, Lmax, odim) ->  (B, Lmax//r, odim)\n            if self.reduction_factor > 1:\n                ys_in = ys[:, self.reduction_factor - 1 :: self.reduction_factor]\n                olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n            else:\n                ys_in, olens_in = ys, olens\n\n            # add first zero frame and remove last frame for auto-regressive\n            ys_in = self._add_first_frame_and_remove_last_frame(ys_in)\n\n            # forward decoder\n            y_masks = self._target_mask(olens_in)\n            zs, _ = self.decoder(ys_in, y_masks, hs, h_masks)\n\n            # calculate final outputs\n            if not skip_output:\n                before_outs = self.feat_out(zs).view(zs.size(0), -1, self.odim)\n                if self.postnet is None:\n                    after_outs = before_outs\n                else:\n                    after_outs = before_outs + self.postnet(\n                        before_outs.transpose(1, 2)\n                    ).transpose(1, 2)\n\n        # modify mod part of output lengths due to reduction factor > 1\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in olens])\n\n        # store into dict\n        att_ws_dict = dict()\n        if keep_tensor:\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention):\n                    att_ws_dict[name] = m.attn\n            if not skip_output:\n                att_ws_dict[\"before_postnet_fbank\"] = before_outs\n                att_ws_dict[\"after_postnet_fbank\"] = after_outs\n        else:\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention):\n                    attn = m.attn.cpu().numpy()\n                    if 
\"encoder\" in name:\n                        attn = [a[:, :l, :l] for a, l in zip(attn, ilens.tolist())]\n                    elif \"decoder\" in name:\n                        if \"src\" in name:\n                            attn = [\n                                a[:, :ol, :il]\n                                for a, il, ol in zip(\n                                    attn, ilens.tolist(), olens_in.tolist()\n                                )\n                            ]\n                        elif \"self\" in name:\n                            attn = [\n                                a[:, :l, :l] for a, l in zip(attn, olens_in.tolist())\n                            ]\n                        else:\n                            logging.warning(\"unknown attention module: \" + name)\n                    else:\n                        logging.warning(\"unknown attention module: \" + name)\n                    att_ws_dict[name] = attn\n            if not skip_output:\n                before_outs = before_outs.cpu().numpy()\n                after_outs = after_outs.cpu().numpy()\n                att_ws_dict[\"before_postnet_fbank\"] = [\n                    m[:l].T for m, l in zip(before_outs, olens.tolist())\n                ]\n                att_ws_dict[\"after_postnet_fbank\"] = [\n                    m[:l].T for m, l in zip(after_outs, olens.tolist())\n                ]\n        self.train()\n        return att_ws_dict\n\n    def _integrate_with_spk_embed(self, hs, spembs):\n        \"\"\"Integrate speaker embedding with hidden states.\n\n        Args:\n            hs (Tensor): Batch of hidden state sequences (B, Tmax, adim).\n            spembs (Tensor): Batch of speaker embeddings (B, spk_embed_dim).\n\n        Returns:\n            Tensor: Batch of integrated hidden state sequences (B, Tmax, adim)\n\n        \"\"\"\n        if self.spk_embed_integration_type == \"add\":\n            # apply projection and then add to hidden states\n            spembs = 
self.projection(F.normalize(spembs))\n            hs = hs + spembs.unsqueeze(1)\n        elif self.spk_embed_integration_type == \"concat\":\n            # concat hidden states with spk embeds and then apply projection\n            spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n            hs = self.projection(torch.cat([hs, spembs], dim=-1))\n        else:\n            raise NotImplementedError(\"support only add or concat.\")\n\n        return hs\n\n    def _source_mask(self, ilens):\n        \"\"\"Make masks for self-attention.\n\n        Args:\n            ilens (LongTensor or List): Batch of lengths (B,).\n\n        Returns:\n            Tensor: Mask tensor for self-attention.\n                    dtype=torch.uint8 in PyTorch 1.2-\n                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n        Examples:\n            >>> ilens = [5, 3]\n            >>> self._source_mask(ilens)\n            tensor([[[1, 1, 1, 1, 1]],\n                    [[1, 1, 1, 0, 0]]], dtype=torch.uint8)\n\n        \"\"\"\n        x_masks = make_non_pad_mask(ilens).to(next(self.parameters()).device)\n        return x_masks.unsqueeze(-2)\n\n    def _target_mask(self, olens):\n        \"\"\"Make masks for masked self-attention.\n\n        Args:\n            olens (LongTensor or List): Batch of lengths (B,).\n\n        Returns:\n            Tensor: Mask tensor for masked self-attention.\n                    dtype=torch.uint8 in PyTorch 1.2-\n                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n        Examples:\n            >>> olens = [5, 3]\n            >>> self._target_mask(olens)\n            tensor([[[1, 0, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 1, 0, 0],\n                     [1, 1, 1, 1, 0],\n                     [1, 1, 1, 1, 1]],\n                    [[1, 0, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 1, 0, 0],\n                     [1, 1, 1, 0, 0],\n       
              [1, 1, 1, 0, 0]]], dtype=torch.uint8)\n\n        \"\"\"\n        y_masks = make_non_pad_mask(olens).to(next(self.parameters()).device)\n        s_masks = subsequent_mask(y_masks.size(-1), device=y_masks.device).unsqueeze(0)\n        return y_masks.unsqueeze(-2) & s_masks\n\n    @property\n    def base_plot_keys(self):\n        \"\"\"Return base key names to plot during training.\n\n        keys should match what `chainer.reporter` reports.\n        If you add the key `loss`, the reporter will report `main/loss`\n        and `validation/main/loss` values.\n        also `loss.png` will be created as a figure visualizing `main/loss`\n        and `validation/main/loss` values.\n\n        Returns:\n            list: List of strings which are base keys to plot during training.\n\n        \"\"\"\n        plot_keys = [\"loss\", \"l1_loss\", \"l2_loss\", \"bce_loss\"]\n        if self.use_scaled_pos_enc:\n            plot_keys += [\"encoder_alpha\", \"decoder_alpha\"]\n        if self.use_guided_attn_loss:\n            if \"encoder\" in self.modules_applied_guided_attn:\n                plot_keys += [\"enc_attn_loss\"]\n            if \"decoder\" in self.modules_applied_guided_attn:\n                plot_keys += [\"dec_attn_loss\"]\n            if \"encoder-decoder\" in self.modules_applied_guided_attn:\n                plot_keys += [\"enc_dec_attn_loss\"]\n\n        return plot_keys\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_vc_tacotron2.py",
    "content": "# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Tacotron2-VC related modules.\"\"\"\n\nimport logging\n\nfrom distutils.util import strtobool\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttForward\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttForwardTA\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttLoc\nfrom espnet.nets.pytorch_backend.tacotron2.cbhg import CBHG\nfrom espnet.nets.pytorch_backend.tacotron2.cbhg import CBHGLoss\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Decoder\nfrom espnet.nets.pytorch_backend.tacotron2.encoder import Encoder\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.fill_missing_args import fill_missing_args\nfrom espnet.nets.pytorch_backend.e2e_tts_tacotron2 import (\n    GuidedAttentionLoss,  # noqa: H301\n    Tacotron2Loss,  # noqa: H301\n)\n\n\nclass Tacotron2(TTSInterface, torch.nn.Module):\n    \"\"\"VC Tacotron2 module for VC.\n\n    This is a module of Tacotron2-based VC model,\n    which convert the sequence of acoustic features\n    into the sequence of acoustic features.\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add model-specific arguments to the parser.\"\"\"\n        group = parser.add_argument_group(\"tacotron 2 model setting\")\n        # encoder\n        group.add_argument(\n            \"--elayers\", default=1, type=int, help=\"Number of encoder layers\"\n        )\n        group.add_argument(\n            \"--eunits\",\n            \"-u\",\n            default=512,\n            type=int,\n            help=\"Number of encoder hidden units\",\n        )\n        group.add_argument(\n            \"--econv-layers\",\n            default=3,\n            type=int,\n            help=\"Number of encoder convolution layers\",\n        )\n        
group.add_argument(\n            \"--econv-chans\",\n            default=512,\n            type=int,\n            help=\"Number of encoder convolution channels\",\n        )\n        group.add_argument(\n            \"--econv-filts\",\n            default=5,\n            type=int,\n            help=\"Filter size of encoder convolution\",\n        )\n        # attention\n        group.add_argument(\n            \"--atype\",\n            default=\"location\",\n            type=str,\n            choices=[\"forward_ta\", \"forward\", \"location\"],\n            help=\"Type of attention mechanism\",\n        )\n        group.add_argument(\n            \"--adim\",\n            default=512,\n            type=int,\n            help=\"Number of attention transformation dimensions\",\n        )\n        group.add_argument(\n            \"--aconv-chans\",\n            default=32,\n            type=int,\n            help=\"Number of attention convolution channels\",\n        )\n        group.add_argument(\n            \"--aconv-filts\",\n            default=15,\n            type=int,\n            help=\"Filter size of attention convolution\",\n        )\n        group.add_argument(\n            \"--cumulate-att-w\",\n            default=True,\n            type=strtobool,\n            help=\"Whether or not to cumulate attention weights\",\n        )\n        # decoder\n        group.add_argument(\n            \"--dlayers\", default=2, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=1024, type=int, help=\"Number of decoder hidden units\"\n        )\n        group.add_argument(\n            \"--prenet-layers\", default=2, type=int, help=\"Number of prenet layers\"\n        )\n        group.add_argument(\n            \"--prenet-units\",\n            default=256,\n            type=int,\n            help=\"Number of prenet hidden units\",\n        )\n        group.add_argument(\n            \"--postnet-layers\", 
default=5, type=int, help=\"Number of postnet layers\"\n        )\n        group.add_argument(\n            \"--postnet-chans\", default=512, type=int, help=\"Number of postnet channels\"\n        )\n        group.add_argument(\n            \"--postnet-filts\", default=5, type=int, help=\"Filter size of postnet\"\n        )\n        group.add_argument(\n            \"--output-activation\",\n            default=None,\n            type=str,\n            nargs=\"?\",\n            help=\"Output activation function\",\n        )\n        # cbhg\n        group.add_argument(\n            \"--use-cbhg\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use CBHG module\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-bank-layers\",\n            default=8,\n            type=int,\n            help=\"Number of convolutional bank layers in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-bank-chans\",\n            default=128,\n            type=int,\n            help=\"Number of convolutional bank channels in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-proj-filts\",\n            default=3,\n            type=int,\n            help=\"Filter size of convolutional projection layer in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-conv-proj-chans\",\n            default=256,\n            type=int,\n            help=\"Number of convolutional projection channels in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-highway-layers\",\n            default=4,\n            type=int,\n            help=\"Number of highway layers in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-highway-units\",\n            default=128,\n            type=int,\n            help=\"Number of highway units in CBHG\",\n        )\n        group.add_argument(\n            \"--cbhg-gru-units\",\n            default=256,\n            
type=int,\n            help=\"Number of GRU units in CBHG\",\n        )\n        # model (parameter) related\n        group.add_argument(\n            \"--use-batch-norm\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use batch normalization\",\n        )\n        group.add_argument(\n            \"--use-concate\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to concatenate encoder embedding with decoder outputs\",\n        )\n        group.add_argument(\n            \"--use-residual\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use residual connection in conv layer\",\n        )\n        group.add_argument(\n            \"--dropout-rate\", default=0.5, type=float, help=\"Dropout rate\"\n        )\n        group.add_argument(\n            \"--zoneout-rate\", default=0.1, type=float, help=\"Zoneout rate\"\n        )\n        group.add_argument(\n            \"--reduction-factor\",\n            default=1,\n            type=int,\n            help=\"Reduction factor (for decoder)\",\n        )\n        group.add_argument(\n            \"--encoder-reduction-factor\",\n            default=1,\n            type=int,\n            help=\"Reduction factor (for encoder)\",\n        )\n        group.add_argument(\n            \"--spk-embed-dim\",\n            default=None,\n            type=int,\n            help=\"Number of speaker embedding dimensions\",\n        )\n        group.add_argument(\n            \"--spc-dim\", default=None, type=int, help=\"Number of spectrogram dimensions\"\n        )\n        group.add_argument(\n            \"--pretrained-model\", default=None, type=str, help=\"Pretrained model path\"\n        )\n        # loss related\n        group.add_argument(\n            \"--use-masking\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use masking in calculation of loss\",\n        )\n        
group.add_argument(\n            \"--bce-pos-weight\",\n            default=20.0,\n            type=float,\n            help=\"Positive sample weight in BCE calculation \"\n            \"(only for use-masking=True)\",\n        )\n        group.add_argument(\n            \"--use-guided-attn-loss\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-sigma\",\n            default=0.4,\n            type=float,\n            help=\"Sigma in guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-lambda\",\n            default=1.0,\n            type=float,\n            help=\"Lambda in guided attention loss\",\n        )\n        group.add_argument(\n            \"--src-reconstruction-loss-lambda\",\n            default=1.0,\n            type=float,\n            help=\"Lambda in source reconstruction loss\",\n        )\n        group.add_argument(\n            \"--trg-reconstruction-loss-lambda\",\n            default=1.0,\n            type=float,\n            help=\"Lambda in target reconstruction loss\",\n        )\n        return parser\n\n    def __init__(self, idim, odim, args=None):\n        \"\"\"Initialize Tacotron2 module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            args (Namespace, optional):\n                - spk_embed_dim (int): Dimension of the speaker embedding.\n                - elayers (int): The number of encoder blstm layers.\n                - eunits (int): The number of encoder blstm units.\n                - econv_layers (int): The number of encoder conv layers.\n                - econv_filts (int): The number of encoder conv filter size.\n                - econv_chans (int): The number of encoder conv filter channels.\n                - dlayers (int): The number of decoder lstm 
layers.\n                - dunits (int): The number of decoder lstm units.\n                - prenet_layers (int): The number of prenet layers.\n                - prenet_units (int): The number of prenet units.\n                - postnet_layers (int): The number of postnet layers.\n                - postnet_filts (int): The filter size of postnet.\n                - postnet_chans (int): The number of postnet filter channels.\n                - output_activation (str): The name of activation function for outputs.\n                - adim (int): The number of dimension of mlp in attention.\n                - aconv_chans (int): The number of attention conv filter channels.\n                - aconv_filts (int): The filter size of attention conv.\n                - cumulate_att_w (bool): Whether to cumulate previous attention weight.\n                - use_batch_norm (bool): Whether to use batch normalization.\n                - use_concate (bool):\n                    Whether to concatenate encoder embedding with decoder lstm outputs.\n                - dropout_rate (float): Dropout rate.\n                - zoneout_rate (float): Zoneout rate.\n                - reduction_factor (int): Reduction factor.\n                - spk_embed_dim (int): Number of speaker embedding dimensions.\n                - spc_dim (int): Number of spectrogram embedding dimensions\n                    (only for use_cbhg=True).\n                - use_cbhg (bool): Whether to use CBHG module.\n                - cbhg_conv_bank_layers (int):\n                    The number of convolutional banks in CBHG.\n                - cbhg_conv_bank_chans (int):\n                    The number of channels of convolutional bank in CBHG.\n                - cbhg_conv_proj_filts (int):\n                    The filter size of projection layer in CBHG.\n                - cbhg_conv_proj_chans (int):\n                    The number of channels of projection layer in CBHG.\n                - 
cbhg_highway_layers (int):\n                    The number of layers of highway network in CBHG.\n                - cbhg_highway_units (int):\n                    The number of units of highway network in CBHG.\n                - cbhg_gru_units (int): The number of units of GRU in CBHG.\n                - use_masking (bool): Whether to mask padded part in loss calculation.\n                - bce_pos_weight (float): Weight of positive sample of stop token\n                    (only for use_masking=True).\n                - use_guided_attn_loss (bool): Whether to use guided attention loss.\n                - guided_attn_loss_sigma (float): Sigma in guided attention loss.\n                - guided_attn_loss_lambda (float): Lambda in guided attention loss.\n\n        \"\"\"\n        # initialize base classes\n        TTSInterface.__init__(self)\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments\n        args = fill_missing_args(args, self.add_arguments)\n\n        # store hyperparameters\n        self.idim = idim\n        self.odim = odim\n        self.adim = args.adim\n        self.spk_embed_dim = args.spk_embed_dim\n        self.cumulate_att_w = args.cumulate_att_w\n        self.reduction_factor = args.reduction_factor\n        self.encoder_reduction_factor = args.encoder_reduction_factor\n        self.use_cbhg = args.use_cbhg\n        self.use_guided_attn_loss = args.use_guided_attn_loss\n        self.src_reconstruction_loss_lambda = args.src_reconstruction_loss_lambda\n        self.trg_reconstruction_loss_lambda = args.trg_reconstruction_loss_lambda\n\n        # define activation function for the final output\n        if args.output_activation is None:\n            self.output_activation_fn = None\n        elif hasattr(F, args.output_activation):\n            self.output_activation_fn = getattr(F, args.output_activation)\n        else:\n            raise ValueError(\n                \"there is no such activation function. 
(%s)\" % args.output_activation\n            )\n\n        # define network modules\n        self.enc = Encoder(\n            idim=idim * args.encoder_reduction_factor,\n            input_layer=\"linear\",\n            elayers=args.elayers,\n            eunits=args.eunits,\n            econv_layers=args.econv_layers,\n            econv_chans=args.econv_chans,\n            econv_filts=args.econv_filts,\n            use_batch_norm=args.use_batch_norm,\n            use_residual=args.use_residual,\n            dropout_rate=args.dropout_rate,\n        )\n        dec_idim = (\n            args.eunits\n            if args.spk_embed_dim is None\n            else args.eunits + args.spk_embed_dim\n        )\n        if args.atype == \"location\":\n            att = AttLoc(\n                dec_idim, args.dunits, args.adim, args.aconv_chans, args.aconv_filts\n            )\n        elif args.atype == \"forward\":\n            att = AttForward(\n                dec_idim, args.dunits, args.adim, args.aconv_chans, args.aconv_filts\n            )\n            if self.cumulate_att_w:\n                logging.warning(\n                    \"cumulation of attention weights is disabled in forward attention.\"\n                )\n                self.cumulate_att_w = False\n        elif args.atype == \"forward_ta\":\n            att = AttForwardTA(\n                dec_idim,\n                args.dunits,\n                args.adim,\n                args.aconv_chans,\n                args.aconv_filts,\n                odim,\n            )\n            if self.cumulate_att_w:\n                logging.warning(\n                    \"cumulation of attention weights is disabled in forward attention.\"\n                )\n                self.cumulate_att_w = False\n        else:\n            raise NotImplementedError(\"Support only location, forward, or forward_ta\")\n        self.dec = Decoder(\n            idim=dec_idim,\n            odim=odim,\n            att=att,\n            
dlayers=args.dlayers,\n            dunits=args.dunits,\n            prenet_layers=args.prenet_layers,\n            prenet_units=args.prenet_units,\n            postnet_layers=args.postnet_layers,\n            postnet_chans=args.postnet_chans,\n            postnet_filts=args.postnet_filts,\n            output_activation_fn=self.output_activation_fn,\n            cumulate_att_w=self.cumulate_att_w,\n            use_batch_norm=args.use_batch_norm,\n            use_concate=args.use_concate,\n            dropout_rate=args.dropout_rate,\n            zoneout_rate=args.zoneout_rate,\n            reduction_factor=args.reduction_factor,\n        )\n        self.taco2_loss = Tacotron2Loss(\n            use_masking=args.use_masking, bce_pos_weight=args.bce_pos_weight\n        )\n        if self.use_guided_attn_loss:\n            self.attn_loss = GuidedAttentionLoss(\n                sigma=args.guided_attn_loss_sigma,\n                alpha=args.guided_attn_loss_lambda,\n            )\n        if self.use_cbhg:\n            self.cbhg = CBHG(\n                idim=odim,\n                odim=args.spc_dim,\n                conv_bank_layers=args.cbhg_conv_bank_layers,\n                conv_bank_chans=args.cbhg_conv_bank_chans,\n                conv_proj_filts=args.cbhg_conv_proj_filts,\n                conv_proj_chans=args.cbhg_conv_proj_chans,\n                highway_layers=args.cbhg_highway_layers,\n                highway_units=args.cbhg_highway_units,\n                gru_units=args.cbhg_gru_units,\n            )\n            self.cbhg_loss = CBHGLoss(use_masking=args.use_masking)\n        if self.src_reconstruction_loss_lambda > 0:\n            self.src_reconstructor = Encoder(\n                idim=dec_idim,\n                input_layer=\"linear\",\n                elayers=args.elayers,\n                eunits=args.eunits,\n                econv_layers=args.econv_layers,\n                econv_chans=args.econv_chans,\n                econv_filts=args.econv_filts,\n          
      use_batch_norm=args.use_batch_norm,\n                use_residual=args.use_residual,\n                dropout_rate=args.dropout_rate,\n            )\n            self.src_reconstructor_linear = torch.nn.Linear(\n                args.econv_chans, idim * args.encoder_reduction_factor\n            )\n\n            self.src_reconstruction_loss = CBHGLoss(use_masking=args.use_masking)\n        if self.trg_reconstruction_loss_lambda > 0:\n            self.trg_reconstructor = Encoder(\n                idim=dec_idim,\n                input_layer=\"linear\",\n                elayers=args.elayers,\n                eunits=args.eunits,\n                econv_layers=args.econv_layers,\n                econv_chans=args.econv_chans,\n                econv_filts=args.econv_filts,\n                use_batch_norm=args.use_batch_norm,\n                use_residual=args.use_residual,\n                dropout_rate=args.dropout_rate,\n            )\n            self.trg_reconstructor_linear = torch.nn.Linear(\n                args.econv_chans, odim * args.reduction_factor\n            )\n            self.trg_reconstruction_loss = CBHGLoss(use_masking=args.use_masking)\n\n        # load pretrained model\n        if args.pretrained_model is not None:\n            self.load_pretrained_model(args.pretrained_model)\n\n    def forward(\n        self, xs, ilens, ys, labels, olens, spembs=None, spcs=None, *args, **kwargs\n    ):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of padded acoustic features (B, Tmax, idim).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n            spcs (Tensor, optional):\n                Batch of groundtruth spectrograms (B, 
Lmax, spc_dim).\n\n        Returns:\n            Tensor: Loss value.\n\n        \"\"\"\n        # remove unnecessary padded part (for multi-gpus)\n        max_in = max(ilens)\n        max_out = max(olens)\n        if max_in != xs.shape[1]:\n            xs = xs[:, :max_in]\n        if max_out != ys.shape[1]:\n            ys = ys[:, :max_out]\n            labels = labels[:, :max_out]\n\n        # thin out input frames for reduction factor\n        # (B, Lmax, idim) ->  (B, Lmax // r, idim * r)\n        if self.encoder_reduction_factor > 1:\n            B, Lmax, idim = xs.shape\n            if Lmax % self.encoder_reduction_factor != 0:\n                xs = xs[:, : -(Lmax % self.encoder_reduction_factor), :]\n            xs_ds = xs.contiguous().view(\n                B,\n                int(Lmax / self.encoder_reduction_factor),\n                idim * self.encoder_reduction_factor,\n            )\n            ilens_ds = ilens.new(\n                [ilen // self.encoder_reduction_factor for ilen in ilens]\n            )\n        else:\n            xs_ds, ilens_ds = xs, ilens\n\n        # calculate tacotron2 outputs\n        hs, hlens = self.enc(xs_ds, ilens_ds)\n        if self.spk_embed_dim is not None:\n            spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n            hs = torch.cat([hs, spembs], dim=-1)\n        after_outs, before_outs, logits, att_ws = self.dec(hs, hlens, ys)\n\n        # calculate src reconstruction\n        if self.src_reconstruction_loss_lambda > 0:\n            B, _in_length, _adim = hs.shape\n            xt, xtlens = self.src_reconstructor(hs, hlens)\n            xt = self.src_reconstructor_linear(xt)\n            if self.encoder_reduction_factor > 1:\n                xt = xt.view(B, -1, self.idim)\n\n        # calculate trg reconstruction\n        if self.trg_reconstruction_loss_lambda > 0:\n            olens_trg_cp = olens.new(\n                sorted([olen // self.reduction_factor for olen in olens], 
reverse=True)\n            )\n            B, _in_length, _adim = hs.shape\n            _, _out_length, _ = att_ws.shape\n            # att_R should be [B, out_length / r_d, adim]\n            att_R = torch.sum(\n                hs.view(B, 1, _in_length, _adim)\n                * att_ws.view(B, _out_length, _in_length, 1),\n                dim=2,\n            )\n            yt, ytlens = self.trg_reconstructor(\n                att_R, olens_trg_cp\n            )  # is using olens correct?\n            yt = self.trg_reconstructor_linear(yt)\n            if self.reduction_factor > 1:\n                yt = yt.view(\n                    B, -1, self.odim\n                )  # now att_R should be [B, out_length, adim]\n\n        # modify mod part of groundtruth\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in olens])\n            max_out = max(olens)\n            ys = ys[:, :max_out]\n            labels = labels[:, :max_out]\n            labels[:, -1] = 1.0  # make sure at least one frame has 1\n        if self.encoder_reduction_factor > 1:\n            ilens = ilens.new(\n                [ilen - ilen % self.encoder_reduction_factor for ilen in ilens]\n            )\n            max_in = max(ilens)\n            xs = xs[:, :max_in]\n\n        # calculate taco2 loss\n        l1_loss, mse_loss, bce_loss = self.taco2_loss(\n            after_outs, before_outs, logits, ys, labels, olens\n        )\n        loss = l1_loss + mse_loss + bce_loss\n        report_keys = [\n            {\"l1_loss\": l1_loss.item()},\n            {\"mse_loss\": mse_loss.item()},\n            {\"bce_loss\": bce_loss.item()},\n        ]\n\n        # calculate context preservation loss\n        if self.src_reconstruction_loss_lambda > 0:\n            src_recon_l1_loss, src_recon_mse_loss = self.src_reconstruction_loss(\n                xt, xs, ilens\n            )\n            loss = loss + src_recon_l1_loss\n            report_keys 
+= [\n                {\"src_recon_l1_loss\": src_recon_l1_loss.item()},\n                {\"src_recon_mse_loss\": src_recon_mse_loss.item()},\n            ]\n        if self.trg_reconstruction_loss_lambda > 0:\n            trg_recon_l1_loss, trg_recon_mse_loss = self.trg_reconstruction_loss(\n                yt, ys, olens\n            )\n            loss = loss + trg_recon_l1_loss\n            report_keys += [\n                {\"trg_recon_l1_loss\": trg_recon_l1_loss.item()},\n                {\"trg_recon_mse_loss\": trg_recon_mse_loss.item()},\n            ]\n\n        # calculate attention loss\n        if self.use_guided_attn_loss:\n            # NOTE(kan-bayashi): length of output for auto-regressive input\n            #   will be changed when r > 1\n            if self.encoder_reduction_factor > 1:\n                ilens_in = ilens.new(\n                    [ilen // self.encoder_reduction_factor for ilen in ilens]\n                )\n            else:\n                ilens_in = ilens\n            if self.reduction_factor > 1:\n                olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n            else:\n                olens_in = olens\n            attn_loss = self.attn_loss(att_ws, ilens_in, olens_in)\n            loss = loss + attn_loss\n            report_keys += [\n                {\"attn_loss\": attn_loss.item()},\n            ]\n\n        # calculate cbhg loss\n        if self.use_cbhg:\n            # remove unnecessary padded part (for multi-gpus)\n            if max_out != spcs.shape[1]:\n                spcs = spcs[:, :max_out]\n\n            # calculate cbhg outputs & loss and report them\n            cbhg_outs, _ = self.cbhg(after_outs, olens)\n            cbhg_l1_loss, cbhg_mse_loss = self.cbhg_loss(cbhg_outs, spcs, olens)\n            loss = loss + cbhg_l1_loss + cbhg_mse_loss\n            report_keys += [\n                {\"cbhg_l1_loss\": cbhg_l1_loss.item()},\n                {\"cbhg_mse_loss\": 
cbhg_mse_loss.item()},\n            ]\n\n        report_keys += [{\"loss\": loss.item()}]\n        self.reporter.report(report_keys)\n\n        return loss\n\n    def inference(self, x, inference_args, spemb=None, *args, **kwargs):\n        \"\"\"Generate the sequence of features given the sequence of acoustic features.\n\n        Args:\n            x (Tensor): Input sequence of acoustic features (T, idim).\n            inference_args (Namespace):\n                - threshold (float): Threshold in inference.\n                - minlenratio (float): Minimum length ratio in inference.\n                - maxlenratio (float): Maximum length ratio in inference.\n            spemb (Tensor, optional): Speaker embedding vector (spk_embed_dim).\n\n        Returns:\n            Tensor: Output sequence of features (L, odim).\n            Tensor: Output sequence of stop probabilities (L,).\n            Tensor: Attention weights (L, T).\n\n        \"\"\"\n        # get options\n        threshold = inference_args.threshold\n        minlenratio = inference_args.minlenratio\n        maxlenratio = inference_args.maxlenratio\n\n        # thin out input frames for reduction factor\n        # (Lmax, idim) ->  (Lmax // r, idim * r)\n        if self.encoder_reduction_factor > 1:\n            Lmax, idim = x.shape\n            if Lmax % self.encoder_reduction_factor != 0:\n                x = x[: -(Lmax % self.encoder_reduction_factor), :]\n            x_ds = x.contiguous().view(\n                int(Lmax / self.encoder_reduction_factor),\n                idim * self.encoder_reduction_factor,\n            )\n        else:\n            x_ds = x\n\n        # inference\n        h = self.enc.inference(x_ds)\n        if self.spk_embed_dim is not None:\n            spemb = F.normalize(spemb, dim=0).unsqueeze(0).expand(h.size(0), -1)\n            h = torch.cat([h, spemb], dim=-1)\n        outs, probs, att_ws = self.dec.inference(h, threshold, minlenratio, maxlenratio)\n\n        if 
self.use_cbhg:\n            cbhg_outs = self.cbhg.inference(outs)\n            return cbhg_outs, probs, att_ws\n        else:\n            return outs, probs, att_ws\n\n    def calculate_all_attentions(self, xs, ilens, ys, spembs=None, *args, **kwargs):\n        \"\"\"Calculate all of the attention weights.\n\n        Args:\n            xs (Tensor): Batch of padded acoustic features (B, Tmax, idim).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n\n        Returns:\n            numpy.ndarray: Batch of attention weights (B, Lmax, Tmax).\n\n        \"\"\"\n        # check ilens type (should be list of int)\n        if isinstance(ilens, torch.Tensor) or isinstance(ilens, np.ndarray):\n            ilens = list(map(int, ilens))\n\n        self.eval()\n        with torch.no_grad():\n            # thin out input frames for reduction factor\n            # (B, Lmax, idim) ->  (B, Lmax // r, idim * r)\n            if self.encoder_reduction_factor > 1:\n                B, Lmax, idim = xs.shape\n                if Lmax % self.encoder_reduction_factor != 0:\n                    xs = xs[:, : -(Lmax % self.encoder_reduction_factor), :]\n                xs_ds = xs.contiguous().view(\n                    B,\n                    int(Lmax / self.encoder_reduction_factor),\n                    idim * self.encoder_reduction_factor,\n                )\n                ilens_ds = [ilen // self.encoder_reduction_factor for ilen in ilens]\n            else:\n                xs_ds, ilens_ds = xs, ilens\n\n            hs, hlens = self.enc(xs_ds, ilens_ds)\n            if self.spk_embed_dim is not None:\n                spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n                hs 
= torch.cat([hs, spembs], dim=-1)\n            att_ws = self.dec.calculate_all_attentions(hs, hlens, ys)\n        self.train()\n\n        return att_ws.cpu().numpy()\n\n    @property\n    def base_plot_keys(self):\n        \"\"\"Return base key names to plot during training.\n\n        Keys should match what `chainer.reporter` reports.\n        If you add the key `loss`, the reporter will report `main/loss`\n            and `validation/main/loss` values.\n        Also, `loss.png` will be created as a figure visualizing `main/loss`\n            and `validation/main/loss` values.\n\n        Returns:\n            list: List of strings which are base keys to plot during training.\n\n        \"\"\"\n        plot_keys = [\"loss\", \"l1_loss\", \"mse_loss\", \"bce_loss\"]\n        if self.use_guided_attn_loss:\n            plot_keys += [\"attn_loss\"]\n        if self.use_cbhg:\n            plot_keys += [\"cbhg_l1_loss\", \"cbhg_mse_loss\"]\n        if self.src_reconstruction_loss_lambda > 0:\n            plot_keys += [\"src_recon_l1_loss\", \"src_recon_mse_loss\"]\n        if self.trg_reconstruction_loss_lambda > 0:\n            plot_keys += [\"trg_recon_l1_loss\", \"trg_recon_mse_loss\"]\n        return plot_keys\n\n    def _sort_by_length(self, xs, ilens):\n        sort_ilens, sort_idx = ilens.sort(0, descending=True)\n        return xs[sort_idx], ilens[sort_idx], sort_idx\n\n    def _revert_sort_by_length(self, xs, ilens, sort_idx):\n        _, revert_idx = sort_idx.sort(0)\n        return xs[revert_idx], ilens[revert_idx]\n"
  },
  {
    "path": "nets/pytorch_backend/e2e_vc_transformer.py",
    "content": "# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Voice Transformer Network (Transformer-VC) related modules.\"\"\"\n\nimport logging\n\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.e2e_asr_transformer import subsequent_mask\nfrom espnet.nets.pytorch_backend.e2e_tts_tacotron2 import (\n    Tacotron2Loss as TransformerLoss,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Postnet\nfrom espnet.nets.pytorch_backend.tacotron2.decoder import Prenet as DecoderPrenet\nfrom espnet.nets.pytorch_backend.tacotron2.encoder import Encoder as EncoderPrenet\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.decoder import Decoder\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.embedding import ScaledPositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.initializer import initialize\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.cli_utils import strtobool\nfrom espnet.utils.fill_missing_args import fill_missing_args\nfrom espnet.nets.pytorch_backend.e2e_tts_transformer import (\n    GuidedMultiHeadAttentionLoss,  # noqa: H301\n    TTSPlot,  # noqa: H301\n)\n\n\nclass Transformer(TTSInterface, torch.nn.Module):\n    \"\"\"VC Transformer module.\n\n    This is a module of the Voice Transformer Network\n    (a.k.a. VTN or Transformer-VC) described in\n    `Voice Transformer Network: Sequence-to-Sequence\n    Voice Conversion Using Transformer with\n    Text-to-Speech Pretraining`_,\n    which convert the sequence of acoustic features\n    into the sequence of acoustic features.\n\n    .. 
_`Voice Transformer Network: Sequence-to-Sequence\n        Voice Conversion Using Transformer with\n        Text-to-Speech Pretraining`:\n        https://arxiv.org/pdf/1912.06813.pdf\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add model-specific arguments to the parser.\"\"\"\n        group = parser.add_argument_group(\"transformer model setting\")\n        # network structure related\n        group.add_argument(\n            \"--eprenet-conv-layers\",\n            default=0,\n            type=int,\n            help=\"Number of encoder prenet convolution layers\",\n        )\n        group.add_argument(\n            \"--eprenet-conv-chans\",\n            default=0,\n            type=int,\n            help=\"Number of encoder prenet convolution channels\",\n        )\n        group.add_argument(\n            \"--eprenet-conv-filts\",\n            default=0,\n            type=int,\n            help=\"Filter size of encoder prenet convolution\",\n        )\n        group.add_argument(\n            \"--transformer-input-layer\",\n            default=\"linear\",\n            type=str,\n            help=\"Type of input layer (linear or conv2d)\",\n        )\n        group.add_argument(\n            \"--dprenet-layers\",\n            default=2,\n            type=int,\n            help=\"Number of decoder prenet layers\",\n        )\n        group.add_argument(\n            \"--dprenet-units\",\n            default=256,\n            type=int,\n            help=\"Number of decoder prenet hidden units\",\n        )\n        group.add_argument(\n            \"--elayers\", default=3, type=int, help=\"Number of encoder layers\"\n        )\n        group.add_argument(\n            \"--eunits\", default=1536, type=int, help=\"Number of encoder hidden units\"\n        )\n        group.add_argument(\n            \"--adim\",\n            default=384,\n            type=int,\n            help=\"Number of attention transformation dimensions\",\n    
    )\n        group.add_argument(\n            \"--aheads\",\n            default=4,\n            type=int,\n            help=\"Number of heads for multi head attention\",\n        )\n        group.add_argument(\n            \"--dlayers\", default=3, type=int, help=\"Number of decoder layers\"\n        )\n        group.add_argument(\n            \"--dunits\", default=1536, type=int, help=\"Number of decoder hidden units\"\n        )\n        group.add_argument(\n            \"--positionwise-layer-type\",\n            default=\"linear\",\n            type=str,\n            choices=[\"linear\", \"conv1d\", \"conv1d-linear\"],\n            help=\"Positionwise layer type.\",\n        )\n        group.add_argument(\n            \"--positionwise-conv-kernel-size\",\n            default=1,\n            type=int,\n            help=\"Kernel size of positionwise conv1d layer\",\n        )\n        group.add_argument(\n            \"--postnet-layers\", default=5, type=int, help=\"Number of postnet layers\"\n        )\n        group.add_argument(\n            \"--postnet-chans\", default=256, type=int, help=\"Number of postnet channels\"\n        )\n        group.add_argument(\n            \"--postnet-filts\", default=5, type=int, help=\"Filter size of postnet\"\n        )\n        group.add_argument(\n            \"--use-scaled-pos-enc\",\n            default=True,\n            type=strtobool,\n            help=\"Use trainable scaled positional encoding \"\n            \"instead of the fixed scale one.\",\n        )\n        group.add_argument(\n            \"--use-batch-norm\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use batch normalization\",\n        )\n        group.add_argument(\n            \"--encoder-normalize-before\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to apply layer norm before encoder block\",\n        )\n        group.add_argument(\n            
\"--decoder-normalize-before\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to apply layer norm before decoder block\",\n        )\n        group.add_argument(\n            \"--encoder-concat-after\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to concatenate attention layer's input and output in encoder\",\n        )\n        group.add_argument(\n            \"--decoder-concat-after\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to concatenate attention layer's input and output in decoder\",\n        )\n        group.add_argument(\n            \"--reduction-factor\",\n            default=1,\n            type=int,\n            help=\"Reduction factor (for decoder)\",\n        )\n        group.add_argument(\n            \"--encoder-reduction-factor\",\n            default=1,\n            type=int,\n            help=\"Reduction factor (for encoder)\",\n        )\n        group.add_argument(\n            \"--spk-embed-dim\",\n            default=None,\n            type=int,\n            help=\"Number of speaker embedding dimensions\",\n        )\n        group.add_argument(\n            \"--spk-embed-integration-type\",\n            type=str,\n            default=\"add\",\n            choices=[\"add\", \"concat\"],\n            help=\"How to integrate speaker embedding\",\n        )\n        # training related\n        group.add_argument(\n            \"--transformer-init\",\n            type=str,\n            default=\"pytorch\",\n            choices=[\n                \"pytorch\",\n                \"xavier_uniform\",\n                \"xavier_normal\",\n                \"kaiming_uniform\",\n                \"kaiming_normal\",\n            ],\n            help=\"How to initialize transformer parameters\",\n        )\n        group.add_argument(\n            \"--initial-encoder-alpha\",\n            type=float,\n            default=1.0,\n          
  help=\"Initial alpha value in encoder's ScaledPositionalEncoding\",\n        )\n        group.add_argument(\n            \"--initial-decoder-alpha\",\n            type=float,\n            default=1.0,\n            help=\"Initial alpha value in decoder's ScaledPositionalEncoding\",\n        )\n        group.add_argument(\n            \"--transformer-lr\",\n            default=1.0,\n            type=float,\n            help=\"Initial value of learning rate\",\n        )\n        group.add_argument(\n            \"--transformer-warmup-steps\",\n            default=4000,\n            type=int,\n            help=\"Optimizer warmup steps\",\n        )\n        group.add_argument(\n            \"--transformer-enc-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder except for attention\",\n        )\n        group.add_argument(\n            \"--transformer-enc-positional-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder positional encoding\",\n        )\n        group.add_argument(\n            \"--transformer-enc-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder self-attention\",\n        )\n        group.add_argument(\n            \"--transformer-dec-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder \"\n            \"except for attention and pos encoding\",\n        )\n        group.add_argument(\n            \"--transformer-dec-positional-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer decoder positional encoding\",\n        )\n        group.add_argument(\n            \"--transformer-dec-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer 
decoder self-attention\",\n        )\n        group.add_argument(\n            \"--transformer-enc-dec-attn-dropout-rate\",\n            default=0.1,\n            type=float,\n            help=\"Dropout rate for transformer encoder-decoder attention\",\n        )\n        group.add_argument(\n            \"--eprenet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in encoder prenet\",\n        )\n        group.add_argument(\n            \"--dprenet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in decoder prenet\",\n        )\n        group.add_argument(\n            \"--postnet-dropout-rate\",\n            default=0.5,\n            type=float,\n            help=\"Dropout rate in postnet\",\n        )\n        group.add_argument(\n            \"--pretrained-model\", default=None, type=str, help=\"Pretrained model path\"\n        )\n\n        # loss related\n        group.add_argument(\n            \"--use-masking\",\n            default=True,\n            type=strtobool,\n            help=\"Whether to use masking in calculation of loss\",\n        )\n        group.add_argument(\n            \"--use-weighted-masking\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use weighted masking in calculation of loss\",\n        )\n        group.add_argument(\n            \"--loss-type\",\n            default=\"L1\",\n            choices=[\"L1\", \"L2\", \"L1+L2\"],\n            help=\"How to calc loss\",\n        )\n        group.add_argument(\n            \"--bce-pos-weight\",\n            default=5.0,\n            type=float,\n            help=\"Positive sample weight in BCE calculation \"\n            \"(only for use-masking=True)\",\n        )\n        group.add_argument(\n            \"--use-guided-attn-loss\",\n            default=False,\n            type=strtobool,\n            help=\"Whether to use guided attention 
loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-sigma\",\n            default=0.4,\n            type=float,\n            help=\"Sigma in guided attention loss\",\n        )\n        group.add_argument(\n            \"--guided-attn-loss-lambda\",\n            default=1.0,\n            type=float,\n            help=\"Lambda in guided attention loss\",\n        )\n        group.add_argument(\n            \"--num-heads-applied-guided-attn\",\n            default=2,\n            type=int,\n            help=\"Number of heads in each layer to be applied guided attention loss\"\n            \"if set -1, all of the heads will be applied.\",\n        )\n        group.add_argument(\n            \"--num-layers-applied-guided-attn\",\n            default=2,\n            type=int,\n            help=\"Number of layers to be applied guided attention loss\"\n            \"if set -1, all of the layers will be applied.\",\n        )\n        group.add_argument(\n            \"--modules-applied-guided-attn\",\n            type=str,\n            nargs=\"+\",\n            default=[\"encoder-decoder\"],\n            help=\"Module name list to be applied guided attention loss\",\n        )\n        return parser\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Return plot class for attention weight plot.\"\"\"\n        return TTSPlot\n\n    def __init__(self, idim, odim, args=None):\n        \"\"\"Initialize Transformer-VC module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            args (Namespace, optional):\n                - eprenet_conv_layers (int):\n                    Number of encoder prenet convolution layers.\n                - eprenet_conv_chans (int):\n                    Number of encoder prenet convolution channels.\n                - eprenet_conv_filts (int):\n                    Filter size of encoder prenet convolution.\n                - 
transformer_input_layer (str): Input layer before the encoder.\n                - dprenet_layers (int): Number of decoder prenet layers.\n                - dprenet_units (int): Number of decoder prenet hidden units.\n                - elayers (int): Number of encoder layers.\n                - eunits (int): Number of encoder hidden units.\n                - adim (int): Number of attention transformation dimensions.\n                - aheads (int): Number of heads for multi head attention.\n                - dlayers (int): Number of decoder layers.\n                - dunits (int): Number of decoder hidden units.\n                - postnet_layers (int): Number of postnet layers.\n                - postnet_chans (int): Number of postnet channels.\n                - postnet_filts (int): Filter size of postnet.\n                - use_scaled_pos_enc (bool):\n                    Whether to use trainable scaled positional encoding.\n                - use_batch_norm (bool):\n                    Whether to use batch normalization in encoder prenet.\n                - encoder_normalize_before (bool):\n                    Whether to perform layer normalization before encoder block.\n                - decoder_normalize_before (bool):\n                    Whether to perform layer normalization before decoder block.\n                - encoder_concat_after (bool): Whether to concatenate\n                    attention layer's input and output in encoder.\n                - decoder_concat_after (bool): Whether to concatenate\n                    attention layer's input and output in decoder.\n                - reduction_factor (int): Reduction factor (for decoder).\n                - encoder_reduction_factor (int): Reduction factor (for encoder).\n                - spk_embed_dim (int): Number of speaker embedding dimensions.\n                - spk_embed_integration_type: How to integrate speaker embedding.\n                - transformer_init (float): How to initialize transformer 
parameters.\n                - transformer_lr (float): Initial value of learning rate.\n                - transformer_warmup_steps (int): Optimizer warmup steps.\n                - transformer_enc_dropout_rate (float):\n                    Dropout rate in encoder except attention & positional encoding.\n                - transformer_enc_positional_dropout_rate (float):\n                    Dropout rate after encoder positional encoding.\n                - transformer_enc_attn_dropout_rate (float):\n                    Dropout rate in encoder self-attention module.\n                - transformer_dec_dropout_rate (float):\n                    Dropout rate in decoder except attention & positional encoding.\n                - transformer_dec_positional_dropout_rate (float):\n                    Dropout rate after decoder positional encoding.\n                - transformer_dec_attn_dropout_rate (float):\n                    Dropout rate in decoder self-attention module.\n                - transformer_enc_dec_attn_dropout_rate (float):\n                    Dropout rate in encoder-decoder attention module.\n                - eprenet_dropout_rate (float): Dropout rate in encoder prenet.\n                - dprenet_dropout_rate (float): Dropout rate in decoder prenet.\n                - postnet_dropout_rate (float): Dropout rate in postnet.\n                - use_masking (bool):\n                    Whether to apply masking for padded part in loss calculation.\n                - use_weighted_masking (bool):\n                    Whether to apply weighted masking in loss calculation.\n                - bce_pos_weight (float): Positive sample weight in bce calculation\n                    (only for use_masking=true).\n                - loss_type (str): How to calculate loss.\n                - use_guided_attn_loss (bool): Whether to use guided attention loss.\n                - num_heads_applied_guided_attn (int):\n                    Number of heads in each layer to apply 
guided attention loss.\n                - num_layers_applied_guided_attn (int):\n                    Number of layers to apply guided attention loss.\n                - modules_applied_guided_attn (list):\n                    List of module names to apply guided attention loss.\n                - guided-attn-loss-sigma (float) Sigma in guided attention loss.\n                - guided-attn-loss-lambda (float): Lambda in guided attention loss.\n\n        \"\"\"\n        # initialize base classes\n        TTSInterface.__init__(self)\n        torch.nn.Module.__init__(self)\n\n        # fill missing arguments\n        args = fill_missing_args(args, self.add_arguments)\n\n        # store hyperparameters\n        self.idim = idim\n        self.odim = odim\n        self.spk_embed_dim = args.spk_embed_dim\n        if self.spk_embed_dim is not None:\n            self.spk_embed_integration_type = args.spk_embed_integration_type\n        self.use_scaled_pos_enc = args.use_scaled_pos_enc\n        self.reduction_factor = args.reduction_factor\n        self.encoder_reduction_factor = args.encoder_reduction_factor\n        self.transformer_input_layer = args.transformer_input_layer\n        self.loss_type = args.loss_type\n        self.use_guided_attn_loss = args.use_guided_attn_loss\n        if self.use_guided_attn_loss:\n            if args.num_layers_applied_guided_attn == -1:\n                self.num_layers_applied_guided_attn = args.elayers\n            else:\n                self.num_layers_applied_guided_attn = (\n                    args.num_layers_applied_guided_attn\n                )\n            if args.num_heads_applied_guided_attn == -1:\n                self.num_heads_applied_guided_attn = args.aheads\n            else:\n                self.num_heads_applied_guided_attn = args.num_heads_applied_guided_attn\n            self.modules_applied_guided_attn = args.modules_applied_guided_attn\n\n        # use idx 0 as padding idx\n        padding_idx = 0\n\n        # get 
positional encoding class\n        pos_enc_class = (\n            ScaledPositionalEncoding if self.use_scaled_pos_enc else PositionalEncoding\n        )\n\n        # define transformer encoder\n        if args.eprenet_conv_layers != 0:\n            # encoder prenet\n            encoder_input_layer = torch.nn.Sequential(\n                EncoderPrenet(\n                    idim=idim,\n                    elayers=0,\n                    econv_layers=args.eprenet_conv_layers,\n                    econv_chans=args.eprenet_conv_chans,\n                    econv_filts=args.eprenet_conv_filts,\n                    use_batch_norm=args.use_batch_norm,\n                    dropout_rate=args.eprenet_dropout_rate,\n                    padding_idx=padding_idx,\n                    input_layer=torch.nn.Linear(\n                        idim * args.encoder_reduction_factor, idim\n                    ),\n                ),\n                torch.nn.Linear(args.eprenet_conv_chans, args.adim),\n            )\n        elif args.transformer_input_layer == \"linear\":\n            encoder_input_layer = torch.nn.Linear(\n                idim * args.encoder_reduction_factor, args.adim\n            )\n        else:\n            encoder_input_layer = args.transformer_input_layer\n        self.encoder = Encoder(\n            idim=idim,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.eunits,\n            num_blocks=args.elayers,\n            input_layer=encoder_input_layer,\n            dropout_rate=args.transformer_enc_dropout_rate,\n            positional_dropout_rate=args.transformer_enc_positional_dropout_rate,\n            attention_dropout_rate=args.transformer_enc_attn_dropout_rate,\n            pos_enc_class=pos_enc_class,\n            normalize_before=args.encoder_normalize_before,\n            concat_after=args.encoder_concat_after,\n            positionwise_layer_type=args.positionwise_layer_type,\n            
positionwise_conv_kernel_size=args.positionwise_conv_kernel_size,\n        )\n\n        # define projection layer\n        if self.spk_embed_dim is not None:\n            if self.spk_embed_integration_type == \"add\":\n                self.projection = torch.nn.Linear(self.spk_embed_dim, args.adim)\n            else:\n                self.projection = torch.nn.Linear(\n                    args.adim + self.spk_embed_dim, args.adim\n                )\n\n        # define transformer decoder\n        if args.dprenet_layers != 0:\n            # decoder prenet\n            decoder_input_layer = torch.nn.Sequential(\n                DecoderPrenet(\n                    idim=odim,\n                    n_layers=args.dprenet_layers,\n                    n_units=args.dprenet_units,\n                    dropout_rate=args.dprenet_dropout_rate,\n                ),\n                torch.nn.Linear(args.dprenet_units, args.adim),\n            )\n        else:\n            decoder_input_layer = \"linear\"\n        self.decoder = Decoder(\n            odim=-1,\n            attention_dim=args.adim,\n            attention_heads=args.aheads,\n            linear_units=args.dunits,\n            num_blocks=args.dlayers,\n            dropout_rate=args.transformer_dec_dropout_rate,\n            positional_dropout_rate=args.transformer_dec_positional_dropout_rate,\n            self_attention_dropout_rate=args.transformer_dec_attn_dropout_rate,\n            src_attention_dropout_rate=args.transformer_enc_dec_attn_dropout_rate,\n            input_layer=decoder_input_layer,\n            use_output_layer=False,\n            pos_enc_class=pos_enc_class,\n            normalize_before=args.decoder_normalize_before,\n            concat_after=args.decoder_concat_after,\n        )\n\n        # define final projection\n        self.feat_out = torch.nn.Linear(args.adim, odim * args.reduction_factor)\n        self.prob_out = torch.nn.Linear(args.adim, args.reduction_factor)\n\n        # define postnet\n   
     self.postnet = (\n            None\n            if args.postnet_layers == 0\n            else Postnet(\n                idim=idim,\n                odim=odim,\n                n_layers=args.postnet_layers,\n                n_chans=args.postnet_chans,\n                n_filts=args.postnet_filts,\n                use_batch_norm=args.use_batch_norm,\n                dropout_rate=args.postnet_dropout_rate,\n            )\n        )\n\n        # define loss function\n        self.criterion = TransformerLoss(\n            use_masking=args.use_masking,\n            use_weighted_masking=args.use_weighted_masking,\n            bce_pos_weight=args.bce_pos_weight,\n        )\n        if self.use_guided_attn_loss:\n            self.attn_criterion = GuidedMultiHeadAttentionLoss(\n                sigma=args.guided_attn_loss_sigma,\n                alpha=args.guided_attn_loss_lambda,\n            )\n\n        # initialize parameters\n        self._reset_parameters(\n            init_type=args.transformer_init,\n            init_enc_alpha=args.initial_encoder_alpha,\n            init_dec_alpha=args.initial_decoder_alpha,\n        )\n\n        # load pretrained model\n        if args.pretrained_model is not None:\n            self.load_pretrained_model(args.pretrained_model)\n\n    def _reset_parameters(self, init_type, init_enc_alpha=1.0, init_dec_alpha=1.0):\n        # initialize parameters\n        initialize(self, init_type)\n\n        # initialize alpha in scaled positional encoding\n        if self.use_scaled_pos_enc:\n            self.encoder.embed[-1].alpha.data = torch.tensor(init_enc_alpha)\n            self.decoder.embed[-1].alpha.data = torch.tensor(init_dec_alpha)\n\n    def _add_first_frame_and_remove_last_frame(self, ys):\n        ys_in = torch.cat(\n            [ys.new_zeros((ys.shape[0], 1, ys.shape[2])), ys[:, :-1]], dim=1\n        )\n        return ys_in\n\n    def forward(self, xs, ilens, ys, labels, olens, spembs=None, *args, **kwargs):\n        
\"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of padded acoustic features (B, Tmax, idim).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional): Batch of speaker embedding vectors\n                (B, spk_embed_dim).\n\n        Returns:\n            Tensor: Loss value.\n\n        \"\"\"\n        # remove unnecessary padded part (for multi-gpus)\n        max_ilen = max(ilens)\n        max_olen = max(olens)\n        if max_ilen != xs.shape[1]:\n            xs = xs[:, :max_ilen]\n        if max_olen != ys.shape[1]:\n            ys = ys[:, :max_olen]\n            labels = labels[:, :max_olen]\n\n        # thin out input frames for reduction factor\n        # (B, Lmax, idim) ->  (B, Lmax // r, idim * r)\n        if self.encoder_reduction_factor > 1:\n            B, Lmax, idim = xs.shape\n            if Lmax % self.encoder_reduction_factor != 0:\n                xs = xs[:, : -(Lmax % self.encoder_reduction_factor), :]\n            xs_ds = xs.contiguous().view(\n                B,\n                int(Lmax / self.encoder_reduction_factor),\n                idim * self.encoder_reduction_factor,\n            )\n            ilens_ds = ilens.new(\n                [ilen // self.encoder_reduction_factor for ilen in ilens]\n            )\n        else:\n            xs_ds, ilens_ds = xs, ilens\n\n        # forward encoder\n        x_masks = self._source_mask(ilens_ds)\n        hs, hs_masks = self.encoder(xs_ds, x_masks)\n\n        # integrate speaker embedding\n        if self.spk_embed_dim is not None:\n            hs_int = self._integrate_with_spk_embed(hs, spembs)\n        else:\n            hs_int = hs\n\n        # thin out frames for reduction factor (B, Lmax, odim) ->  (B, Lmax//r, odim)\n        if self.reduction_factor > 
1:\n            ys_in = ys[:, self.reduction_factor - 1 :: self.reduction_factor]\n            olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n        else:\n            ys_in, olens_in = ys, olens\n\n        # add first zero frame and remove last frame for auto-regressive\n        ys_in = self._add_first_frame_and_remove_last_frame(ys_in)\n\n        # if conv2d, modify mask. Use ceiling division here\n        if \"conv2d\" in self.transformer_input_layer:\n            ilens_ds_st = ilens_ds.new(\n                [((ilen - 2 + 1) // 2 - 2 + 1) // 2 for ilen in ilens_ds]\n            )\n        else:\n            ilens_ds_st = ilens_ds\n\n        # forward decoder\n        y_masks = self._target_mask(olens_in)\n        zs, _ = self.decoder(ys_in, y_masks, hs_int, hs_masks)\n        # (B, Lmax//r, odim * r) -> (B, Lmax//r * r, odim)\n        before_outs = self.feat_out(zs).view(zs.size(0), -1, self.odim)\n        # (B, Lmax//r, r) -> (B, Lmax//r * r)\n        logits = self.prob_out(zs).view(zs.size(0), -1)\n\n        # postnet -> (B, Lmax//r * r, odim)\n        if self.postnet is None:\n            after_outs = before_outs\n        else:\n            after_outs = before_outs + self.postnet(\n                before_outs.transpose(1, 2)\n            ).transpose(1, 2)\n\n        # modifiy mod part of groundtruth\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in olens])\n            max_olen = max(olens)\n            ys = ys[:, :max_olen]\n            labels = labels[:, :max_olen]\n            labels[:, -1] = 1.0  # make sure at least one frame has 1\n\n        # caluculate loss values\n        l1_loss, l2_loss, bce_loss = self.criterion(\n            after_outs, before_outs, logits, ys, labels, olens\n        )\n        if self.loss_type == \"L1\":\n            loss = l1_loss + bce_loss\n        elif self.loss_type == \"L2\":\n            loss = l2_loss + bce_loss\n        elif 
self.loss_type == \"L1+L2\":\n            loss = l1_loss + l2_loss + bce_loss\n        else:\n            raise ValueError(\"unknown --loss-type \" + self.loss_type)\n        report_keys = [\n            {\"l1_loss\": l1_loss.item()},\n            {\"l2_loss\": l2_loss.item()},\n            {\"bce_loss\": bce_loss.item()},\n            {\"loss\": loss.item()},\n        ]\n\n        # calculate guided attention loss\n        if self.use_guided_attn_loss:\n            # calculate for encoder\n            if \"encoder\" in self.modules_applied_guided_attn:\n                att_ws = []\n                for idx, layer_idx in enumerate(\n                    reversed(range(len(self.encoder.encoders)))\n                ):\n                    att_ws += [\n                        self.encoder.encoders[layer_idx].self_attn.attn[\n                            :, : self.num_heads_applied_guided_attn\n                        ]\n                    ]\n                    if idx + 1 == self.num_layers_applied_guided_attn:\n                        break\n                att_ws = torch.cat(att_ws, dim=1)  # (B, H*L, T_in, T_in)\n                enc_attn_loss = self.attn_criterion(\n                    att_ws, ilens_ds_st, ilens_ds_st\n                )  # TODO(unilight): is changing to ilens_ds_st right?\n                loss = loss + enc_attn_loss\n                report_keys += [{\"enc_attn_loss\": enc_attn_loss.item()}]\n            # calculate for decoder\n            if \"decoder\" in self.modules_applied_guided_attn:\n                att_ws = []\n                for idx, layer_idx in enumerate(\n                    reversed(range(len(self.decoder.decoders)))\n                ):\n                    att_ws += [\n                        self.decoder.decoders[layer_idx].self_attn.attn[\n                            :, : self.num_heads_applied_guided_attn\n                        ]\n                    ]\n                    if idx + 1 == self.num_layers_applied_guided_attn:\n      
                  break\n                att_ws = torch.cat(att_ws, dim=1)  # (B, H*L, T_out, T_out)\n                dec_attn_loss = self.attn_criterion(att_ws, olens_in, olens_in)\n                loss = loss + dec_attn_loss\n                report_keys += [{\"dec_attn_loss\": dec_attn_loss.item()}]\n            # calculate for encoder-decoder\n            if \"encoder-decoder\" in self.modules_applied_guided_attn:\n                att_ws = []\n                for idx, layer_idx in enumerate(\n                    reversed(range(len(self.decoder.decoders)))\n                ):\n                    att_ws += [\n                        self.decoder.decoders[layer_idx].src_attn.attn[\n                            :, : self.num_heads_applied_guided_attn\n                        ]\n                    ]\n                    if idx + 1 == self.num_layers_applied_guided_attn:\n                        break\n                att_ws = torch.cat(att_ws, dim=1)  # (B, H*L, T_out, T_in)\n                enc_dec_attn_loss = self.attn_criterion(\n                    att_ws, ilens_ds_st, olens_in\n                )  # TODO(unilight): is changing to ilens_ds_st right?\n                loss = loss + enc_dec_attn_loss\n                report_keys += [{\"enc_dec_attn_loss\": enc_dec_attn_loss.item()}]\n\n        # report extra information\n        if self.use_scaled_pos_enc:\n            report_keys += [\n                {\"encoder_alpha\": self.encoder.embed[-1].alpha.data.item()},\n                {\"decoder_alpha\": self.decoder.embed[-1].alpha.data.item()},\n            ]\n        self.reporter.report(report_keys)\n\n        return loss\n\n    def inference(self, x, inference_args, spemb=None, *args, **kwargs):\n        \"\"\"Generate the sequence of features given the sequences of acoustic features.\n\n        Args:\n            x (Tensor): Input sequence of acoustic features (T, idim).\n            inference_args (Namespace):\n                - threshold (float): Threshold in 
inference.\n                - minlenratio (float): Minimum length ratio in inference.\n                - maxlenratio (float): Maximum length ratio in inference.\n            spemb (Tensor, optional): Speaker embedding vector (spk_embed_dim).\n\n        Returns:\n            Tensor: Output sequence of features (L, odim).\n            Tensor: Output sequence of stop probabilities (L,).\n            Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).\n\n        \"\"\"\n        # get options\n        threshold = inference_args.threshold\n        minlenratio = inference_args.minlenratio\n        maxlenratio = inference_args.maxlenratio\n        use_att_constraint = getattr(\n            inference_args, \"use_att_constraint\", False\n        )  # keep compatibility\n        if use_att_constraint:\n            logging.warning(\n                \"Attention constraint is not yet supported in Transformer. Not enabled.\"\n            )\n\n        # thin out input frames for reduction factor\n        # (B, Lmax, idim) ->  (B, Lmax // r, idim * r)\n        if self.encoder_reduction_factor > 1:\n            Lmax, idim = x.shape\n            if Lmax % self.encoder_reduction_factor != 0:\n                x = x[: -(Lmax % self.encoder_reduction_factor), :]\n            x_ds = x.contiguous().view(\n                int(Lmax / self.encoder_reduction_factor),\n                idim * self.encoder_reduction_factor,\n            )\n        else:\n            x_ds = x\n\n        # forward encoder\n        x_ds = x_ds.unsqueeze(0)\n        hs, _ = self.encoder(x_ds, None)\n\n        # integrate speaker embedding\n        if self.spk_embed_dim is not None:\n            spembs = spemb.unsqueeze(0)\n            hs = self._integrate_with_spk_embed(hs, spembs)\n\n        # set limits of length\n        maxlen = int(hs.size(1) * maxlenratio / self.reduction_factor)\n        minlen = int(hs.size(1) * minlenratio / self.reduction_factor)\n\n        # initialize\n        idx 
= 0\n        ys = hs.new_zeros(1, 1, self.odim)\n        outs, probs = [], []\n\n        # forward decoder step-by-step\n        z_cache = self.decoder.init_state(x)\n        while True:\n            # update index\n            idx += 1\n\n            # calculate output and stop prob at idx-th step\n            y_masks = subsequent_mask(idx).unsqueeze(0).to(x.device)\n            z, z_cache = self.decoder.forward_one_step(\n                ys, y_masks, hs, cache=z_cache\n            )  # (B, adim)\n            outs += [\n                self.feat_out(z).view(self.reduction_factor, self.odim)\n            ]  # [(r, odim), ...]\n            probs += [torch.sigmoid(self.prob_out(z))[0]]  # [(r), ...]\n\n            # update next inputs\n            ys = torch.cat(\n                (ys, outs[-1][-1].view(1, 1, self.odim)), dim=1\n            )  # (1, idx + 1, odim)\n\n            # get attention weights\n            att_ws_ = []\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention) and \"src\" in name:\n                    att_ws_ += [m.attn[0, :, -1].unsqueeze(1)]  # [(#heads, 1, T),...]\n            if idx == 1:\n                att_ws = att_ws_\n            else:\n                # [(#heads, l, T), ...]\n                att_ws = [\n                    torch.cat([att_w, att_w_], dim=1)\n                    for att_w, att_w_ in zip(att_ws, att_ws_)\n                ]\n\n            # check whether to finish generation\n            if int(sum(probs[-1] >= threshold)) > 0 or idx >= maxlen:\n                # check mininum length\n                if idx < minlen:\n                    continue\n                outs = (\n                    torch.cat(outs, dim=0).unsqueeze(0).transpose(1, 2)\n                )  # (L, odim) -> (1, L, odim) -> (1, odim, L)\n                if self.postnet is not None:\n                    outs = outs + self.postnet(outs)  # (1, odim, L)\n                outs = outs.transpose(2, 
1).squeeze(0)  # (L, odim)\n                probs = torch.cat(probs, dim=0)\n                break\n\n        # concatenate attention weights -> (#layers, #heads, L, T)\n        att_ws = torch.stack(att_ws, dim=0)\n\n        return outs, probs, att_ws\n\n    def calculate_all_attentions(\n        self,\n        xs,\n        ilens,\n        ys,\n        olens,\n        spembs=None,\n        skip_output=False,\n        keep_tensor=False,\n        *args,\n        **kwargs\n    ):\n        \"\"\"Calculate all of the attention weights.\n\n        Args:\n            xs (Tensor): Batch of padded acoustic features (B, Tmax, idim).\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor): Batch of padded target features (B, Lmax, odim).\n            olens (LongTensor): Batch of the lengths of each target (B,).\n            spembs (Tensor, optional): Batch of speaker embedding vectors\n                (B, spk_embed_dim).\n            skip_output (bool, optional): Whether to skip calculate the final output.\n            keep_tensor (bool, optional): Whether to keep original tensor.\n\n        Returns:\n            dict: Dict of attention weights and outputs.\n\n        \"\"\"\n        with torch.no_grad():\n            # thin out input frames for reduction factor\n            # (B, Lmax, idim) ->  (B, Lmax // r, idim * r)\n            if self.encoder_reduction_factor > 1:\n                B, Lmax, idim = xs.shape\n                if Lmax % self.encoder_reduction_factor != 0:\n                    xs = xs[:, : -(Lmax % self.encoder_reduction_factor), :]\n                xs_ds = xs.contiguous().view(\n                    B,\n                    int(Lmax / self.encoder_reduction_factor),\n                    idim * self.encoder_reduction_factor,\n                )\n                ilens_ds = ilens.new(\n                    [ilen // self.encoder_reduction_factor for ilen in ilens]\n                )\n            else:\n                
xs_ds, ilens_ds = xs, ilens\n\n            # forward encoder\n            x_masks = self._source_mask(ilens_ds)\n            hs, hs_masks = self.encoder(xs_ds, x_masks)\n\n            # integrate speaker embedding\n            if self.spk_embed_dim is not None:\n                hs = self._integrate_with_spk_embed(hs, spembs)\n\n            # thin out frames for reduction factor\n            # (B, Lmax, odim) ->  (B, Lmax//r, odim)\n            if self.reduction_factor > 1:\n                ys_in = ys[:, self.reduction_factor - 1 :: self.reduction_factor]\n                olens_in = olens.new([olen // self.reduction_factor for olen in olens])\n            else:\n                ys_in, olens_in = ys, olens\n\n            # add first zero frame and remove last frame for auto-regressive\n            ys_in = self._add_first_frame_and_remove_last_frame(ys_in)\n\n            # forward decoder\n            y_masks = self._target_mask(olens_in)\n            zs, _ = self.decoder(ys_in, y_masks, hs, hs_masks)\n\n            # calculate final outputs\n            if not skip_output:\n                before_outs = self.feat_out(zs).view(zs.size(0), -1, self.odim)\n                if self.postnet is None:\n                    after_outs = before_outs\n                else:\n                    after_outs = before_outs + self.postnet(\n                        before_outs.transpose(1, 2)\n                    ).transpose(1, 2)\n\n        # modifiy mod part of output lengths due to reduction factor > 1\n        if self.reduction_factor > 1:\n            olens = olens.new([olen - olen % self.reduction_factor for olen in olens])\n\n        # store into dict\n        att_ws_dict = dict()\n        if keep_tensor:\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention):\n                    att_ws_dict[name] = m.attn\n            if not skip_output:\n                att_ws_dict[\"before_postnet_fbank\"] = before_outs\n                
att_ws_dict[\"after_postnet_fbank\"] = after_outs\n        else:\n            for name, m in self.named_modules():\n                if isinstance(m, MultiHeadedAttention):\n                    attn = m.attn.cpu().numpy()\n                    if \"encoder\" in name:\n                        attn = [a[:, :l, :l] for a, l in zip(attn, ilens.tolist())]\n                    elif \"decoder\" in name:\n                        if \"src\" in name:\n                            attn = [\n                                a[:, :ol, :il]\n                                for a, il, ol in zip(\n                                    attn, ilens.tolist(), olens_in.tolist()\n                                )\n                            ]\n                        elif \"self\" in name:\n                            attn = [\n                                a[:, :l, :l] for a, l in zip(attn, olens_in.tolist())\n                            ]\n                        else:\n                            logging.warning(\"unknown attention module: \" + name)\n                    else:\n                        logging.warning(\"unknown attention module: \" + name)\n                    att_ws_dict[name] = attn\n            if not skip_output:\n                before_outs = before_outs.cpu().numpy()\n                after_outs = after_outs.cpu().numpy()\n                att_ws_dict[\"before_postnet_fbank\"] = [\n                    m[:l].T for m, l in zip(before_outs, olens.tolist())\n                ]\n                att_ws_dict[\"after_postnet_fbank\"] = [\n                    m[:l].T for m, l in zip(after_outs, olens.tolist())\n                ]\n\n        return att_ws_dict\n\n    def _integrate_with_spk_embed(self, hs, spembs):\n        \"\"\"Integrate speaker embedding with hidden states.\n\n        Args:\n            hs (Tensor): Batch of hidden state sequences (B, Tmax, adim).\n            spembs (Tensor): Batch of speaker embeddings (B, spk_embed_dim).\n\n        Returns:\n            
Tensor: Batch of integrated hidden state sequences (B, Tmax, adim)\n\n        \"\"\"\n        if self.spk_embed_integration_type == \"add\":\n            # apply projection and then add to hidden states\n            spembs = self.projection(F.normalize(spembs))\n            hs = hs + spembs.unsqueeze(1)\n        elif self.spk_embed_integration_type == \"concat\":\n            # concat hidden states with spk embeds and then apply projection\n            spembs = F.normalize(spembs).unsqueeze(1).expand(-1, hs.size(1), -1)\n            hs = self.projection(torch.cat([hs, spembs], dim=-1))\n        else:\n            raise NotImplementedError(\"support only add or concat.\")\n\n        return hs\n\n    def _source_mask(self, ilens):\n        \"\"\"Make masks for self-attention.\n\n        Args:\n            ilens (LongTensor or List): Batch of lengths (B,).\n\n        Returns:\n            Tensor: Mask tensor for self-attention.\n                    dtype=torch.uint8 in PyTorch 1.2-\n                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n        Examples:\n            >>> ilens = [5, 3]\n            >>> self._source_mask(ilens)\n            tensor([[[1, 1, 1, 1, 1],\n                    [[1, 1, 1, 0, 0]]], dtype=torch.uint8)\n\n        \"\"\"\n        x_masks = make_non_pad_mask(ilens).to(next(self.parameters()).device)\n        return x_masks.unsqueeze(-2)\n\n    def _target_mask(self, olens):\n        \"\"\"Make masks for masked self-attention.\n\n        Args:\n            olens (LongTensor or List): Batch of lengths (B,).\n\n        Returns:\n            Tensor: Mask tensor for masked self-attention.\n                    dtype=torch.uint8 in PyTorch 1.2-\n                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n        Examples:\n            >>> olens = [5, 3]\n            >>> self._target_mask(olens)\n            tensor([[[1, 0, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 1, 0, 0],\n                 
    [1, 1, 1, 1, 0],\n                     [1, 1, 1, 1, 1]],\n                    [[1, 0, 0, 0, 0],\n                     [1, 1, 0, 0, 0],\n                     [1, 1, 1, 0, 0],\n                     [1, 1, 1, 0, 0],\n                     [1, 1, 1, 0, 0]]], dtype=torch.uint8)\n\n        \"\"\"\n        y_masks = make_non_pad_mask(olens).to(next(self.parameters()).device)\n        s_masks = subsequent_mask(y_masks.size(-1), device=y_masks.device).unsqueeze(0)\n        return y_masks.unsqueeze(-2) & s_masks\n\n    @property\n    def base_plot_keys(self):\n        \"\"\"Return base key names to plot during training.\n\n        Keys should match what `chainer.reporter` reports.\n        If you add the key `loss`, the reporter will report `main/loss`\n            and `validation/main/loss` values.\n        Also, `loss.png` will be created as a figure visualizing `main/loss`\n            and `validation/main/loss` values.\n\n        Returns:\n            list: List of strings which are base keys to plot during training.\n\n        \"\"\"\n        plot_keys = [\"loss\", \"l1_loss\", \"l2_loss\", \"bce_loss\"]\n        if self.use_scaled_pos_enc:\n            plot_keys += [\"encoder_alpha\", \"decoder_alpha\"]\n        if self.use_guided_attn_loss:\n            if \"encoder\" in self.modules_applied_guided_attn:\n                plot_keys += [\"enc_attn_loss\"]\n            if \"decoder\" in self.modules_applied_guided_attn:\n                plot_keys += [\"dec_attn_loss\"]\n            if \"encoder-decoder\" in self.modules_applied_guided_attn:\n                plot_keys += [\"enc_dec_attn_loss\"]\n\n        return plot_keys\n"
  },
  {
    "path": "nets/pytorch_backend/fastspeech/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/fastspeech/duration_calculator.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Duration calculator related modules.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.e2e_tts_tacotron2 import Tacotron2\nfrom espnet.nets.pytorch_backend.e2e_tts_transformer import Transformer\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\n\n\nclass DurationCalculator(torch.nn.Module):\n    \"\"\"Duration calculator module for FastSpeech.\n\n    Todo:\n        * Fix the duplicated calculation of diagonal head decision\n\n    \"\"\"\n\n    def __init__(self, teacher_model):\n        \"\"\"Initialize duration calculator module.\n\n        Args:\n            teacher_model (e2e_tts_transformer.Transformer):\n                Pretrained auto-regressive Transformer.\n\n        \"\"\"\n        super(DurationCalculator, self).__init__()\n        if isinstance(teacher_model, Transformer):\n            self.register_buffer(\"diag_head_idx\", torch.tensor(-1))\n        elif isinstance(teacher_model, Tacotron2):\n            pass\n        else:\n            raise ValueError(\n                \"teacher model should be the instance of \"\n                \"e2e_tts_transformer.Transformer or e2e_tts_tacotron2.Tacotron2.\"\n            )\n        self.teacher_model = teacher_model\n\n    def forward(self, xs, ilens, ys, olens, spembs=None):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of the padded sequences of character ids (B, Tmax).\n            ilens (Tensor): Batch of lengths of each input sequence (B,).\n            ys (Tensor):\n                Batch of the padded sequence of target features (B, Lmax, odim).\n            olens (Tensor): Batch of lengths of each output sequence (B,).\n            spembs (Tensor, optional):\n                Batch of speaker embedding vectors (B, spk_embed_dim).\n\n        Returns:\n            Tensor: 
Batch of durations (B, Tmax).\n\n        \"\"\"\n        if isinstance(self.teacher_model, Transformer):\n            att_ws = self._calculate_encoder_decoder_attentions(\n                xs, ilens, ys, olens, spembs=spembs\n            )\n            # TODO(kan-bayashi): fix this issue\n            # this does not work in multi-gpu case. registered buffer is not saved.\n            if int(self.diag_head_idx) == -1:\n                self._init_diagonal_head(att_ws)\n            att_ws = att_ws[:, self.diag_head_idx]\n        else:\n            # NOTE(kan-bayashi): Here we assume that the teacher is tacotron 2\n            att_ws = self.teacher_model.calculate_all_attentions(\n                xs, ilens, ys, spembs=spembs, keep_tensor=True\n            )\n        durations = [\n            self._calculate_duration(att_w, ilen, olen)\n            for att_w, ilen, olen in zip(att_ws, ilens, olens)\n        ]\n\n        return pad_list(durations, 0)\n\n    @staticmethod\n    def _calculate_duration(att_w, ilen, olen):\n        return torch.stack(\n            [att_w[:olen, :ilen].argmax(-1).eq(i).sum() for i in range(ilen)]\n        )\n\n    def _init_diagonal_head(self, att_ws):\n        diagonal_scores = att_ws.max(dim=-1)[0].mean(dim=-1).mean(dim=0)  # (H * L,)\n        self.register_buffer(\"diag_head_idx\", diagonal_scores.argmax())\n\n    def _calculate_encoder_decoder_attentions(self, xs, ilens, ys, olens, spembs=None):\n        att_dict = self.teacher_model.calculate_all_attentions(\n            xs, ilens, ys, olens, spembs=spembs, skip_output=True, keep_tensor=True\n        )\n        return torch.cat(\n            [att_dict[k] for k in att_dict.keys() if \"src_attn\" in k], dim=1\n        )  # (B, H*L, Lmax, Tmax)\n"
  },
  {
    "path": "nets/pytorch_backend/fastspeech/duration_predictor.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Duration predictor related modules.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\nclass DurationPredictor(torch.nn.Module):\n    \"\"\"Duration predictor module.\n\n    This is a module of duration predictor described\n    in `FastSpeech: Fast, Robust and Controllable Text to Speech`_.\n    The duration predictor predicts a duration of each frame in log domain\n    from the hidden embeddings of encoder.\n\n    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:\n        https://arxiv.org/pdf/1905.09263.pdf\n\n    Note:\n        The calculation domain of outputs is different\n        between in `forward` and in `inference`. In `forward`,\n        the outputs are calculated in log domain but in `inference`,\n        those are calculated in linear domain.\n\n    \"\"\"\n\n    def __init__(\n        self, idim, n_layers=2, n_chans=384, kernel_size=3, dropout_rate=0.1, offset=1.0\n    ):\n        \"\"\"Initilize duration predictor module.\n\n        Args:\n            idim (int): Input dimension.\n            n_layers (int, optional): Number of convolutional layers.\n            n_chans (int, optional): Number of channels of convolutional layers.\n            kernel_size (int, optional): Kernel size of convolutional layers.\n            dropout_rate (float, optional): Dropout rate.\n            offset (float, optional): Offset value to avoid nan in log domain.\n\n        \"\"\"\n        super(DurationPredictor, self).__init__()\n        self.offset = offset\n        self.conv = torch.nn.ModuleList()\n        for idx in range(n_layers):\n            in_chans = idim if idx == 0 else n_chans\n            self.conv += [\n                torch.nn.Sequential(\n                    torch.nn.Conv1d(\n                        in_chans,\n           
             n_chans,\n                        kernel_size,\n                        stride=1,\n                        padding=(kernel_size - 1) // 2,\n                    ),\n                    torch.nn.ReLU(),\n                    LayerNorm(n_chans, dim=1),\n                    torch.nn.Dropout(dropout_rate),\n                )\n            ]\n        self.linear = torch.nn.Linear(n_chans, 1)\n\n    def _forward(self, xs, x_masks=None, is_inference=False):\n        xs = xs.transpose(1, -1)  # (B, idim, Tmax)\n        for f in self.conv:\n            xs = f(xs)  # (B, C, Tmax)\n\n        # NOTE: calculate in log domain\n        xs = self.linear(xs.transpose(1, -1)).squeeze(-1)  # (B, Tmax)\n\n        if is_inference:\n            # NOTE: calculate in linear domain\n            xs = torch.clamp(\n                torch.round(xs.exp() - self.offset), min=0\n            ).long()  # avoid negative value\n\n        if x_masks is not None:\n            xs = xs.masked_fill(x_masks, 0.0)\n\n        return xs\n\n    def forward(self, xs, x_masks=None):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of input sequences (B, Tmax, idim).\n            x_masks (ByteTensor, optional):\n                Batch of masks indicating padded part (B, Tmax).\n\n        Returns:\n            Tensor: Batch of predicted durations in log domain (B, Tmax).\n\n        \"\"\"\n        return self._forward(xs, x_masks, False)\n\n    def inference(self, xs, x_masks=None):\n        \"\"\"Inference duration.\n\n        Args:\n            xs (Tensor): Batch of input sequences (B, Tmax, idim).\n            x_masks (ByteTensor, optional):\n                Batch of masks indicating padded part (B, Tmax).\n\n        Returns:\n            LongTensor: Batch of predicted durations in linear domain (B, Tmax).\n\n        \"\"\"\n        return self._forward(xs, x_masks, True)\n\n\nclass DurationPredictorLoss(torch.nn.Module):\n    \"\"\"Loss function module for 
duration predictor.\n\n    The loss value is calculated in log domain to make it Gaussian.\n\n    \"\"\"\n\n    def __init__(self, offset=1.0, reduction=\"mean\"):\n        \"\"\"Initialize duration predictor loss module.\n\n        Args:\n            offset (float, optional): Offset value to avoid nan in log domain.\n            reduction (str): Reduction type in loss calculation.\n\n        \"\"\"\n        super(DurationPredictorLoss, self).__init__()\n        self.criterion = torch.nn.MSELoss(reduction=reduction)\n        self.offset = offset\n\n    def forward(self, outputs, targets):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            outputs (Tensor): Batch of predicted durations in log domain (B, T)\n            targets (LongTensor): Batch of groundtruth durations in linear domain (B, T)\n\n        Returns:\n            Tensor: Mean squared error loss value.\n\n        Note:\n            `outputs` is in log domain but `targets` is in linear domain.\n\n        \"\"\"\n        # NOTE: outputs is in log domain while targets is in linear domain,\n        #   so convert targets to log domain before computing the loss\n        targets = torch.log(targets.float() + self.offset)\n        loss = self.criterion(outputs, targets)\n\n        return loss\n"
  },
  {
    "path": "nets/pytorch_backend/fastspeech/length_regulator.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Length regulator related modules.\"\"\"\n\nimport logging\n\nfrom distutils.version import LooseVersion\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\n\nis_torch_1_1_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.1\")\n\n\nclass LengthRegulator(torch.nn.Module):\n    \"\"\"Length regulator module for feed-forward Transformer.\n\n    This is a module of length regulator described in\n    `FastSpeech: Fast, Robust and Controllable Text to Speech`_.\n    The length regulator expands char or\n    phoneme-level embedding features to frame-level by repeating each\n    feature based on the corresponding predicted durations.\n\n    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:\n        https://arxiv.org/pdf/1905.09263.pdf\n\n    \"\"\"\n\n    def __init__(self, pad_value=0.0):\n        \"\"\"Initilize length regulator module.\n\n        Args:\n            pad_value (float, optional): Value used for padding.\n\n        \"\"\"\n        super(LengthRegulator, self).__init__()\n        self.pad_value = pad_value\n        if is_torch_1_1_plus:\n            self.repeat_fn = self._repeat_one_sequence\n        else:\n            self.repeat_fn = self._legacy_repeat_one_sequence\n\n    def forward(self, xs, ds, alpha=1.0):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of sequences of char or phoneme embeddings (B, Tmax, D).\n            ds (LongTensor): Batch of durations of each frame (B, T).\n            alpha (float, optional): Alpha value to control speed of speech.\n\n        Returns:\n            Tensor: replicated input tensor based on durations (B, T*, D).\n\n        \"\"\"\n        if alpha != 1.0:\n            assert alpha > 0\n            ds = torch.round(ds.float() * alpha).long()\n\n        if 
ds.sum() == 0:\n            logging.warning(\n                \"predicted durations includes all 0 sequences. \"\n                \"fill the first element with 1.\"\n            )\n            # NOTE(kan-bayashi): This case must not happen in teacher forcing.\n            #   It can happen in inference with a bad duration predictor.\n            #   So we do not need to care about the padded sequence case here.\n            ds[ds.sum(dim=1).eq(0)] = 1\n\n        return pad_list([self.repeat_fn(x, d) for x, d in zip(xs, ds)], self.pad_value)\n\n    def _repeat_one_sequence(self, x, d):\n        \"\"\"Repeat each frame according to duration for torch 1.1+.\"\"\"\n        return torch.repeat_interleave(x, d, dim=0)\n\n    def _legacy_repeat_one_sequence(self, x, d):\n        \"\"\"Repeat each frame according to duration for torch 1.0.\n\n        Examples:\n            >>> x = torch.tensor([[1], [2], [3]])\n            tensor([[1],\n                    [2],\n                    [3]])\n            >>> d = torch.tensor([1, 2, 3])\n            tensor([1, 2, 3])\n            >>> self._legacy_repeat_one_sequence(x, d)\n            tensor([[1],\n                    [2],\n                    [2],\n                    [3],\n                    [3],\n                    [3]])\n\n        \"\"\"\n        return torch.cat(\n            [x_.repeat(int(d_), 1) for x_, d_ in zip(x, d) if d_ != 0], dim=0\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/beamformer.py",
    "content": "import torch\nfrom torch_complex import functional as FC\nfrom torch_complex.tensor import ComplexTensor\n\n\ndef get_power_spectral_density_matrix(\n    xs: ComplexTensor, mask: torch.Tensor, normalization=True, eps: float = 1e-15\n) -> ComplexTensor:\n    \"\"\"Return cross-channel power spectral density (PSD) matrix\n\n    Args:\n        xs (ComplexTensor): (..., F, C, T)\n        mask (torch.Tensor): (..., F, C, T)\n        normalization (bool):\n        eps (float):\n    Returns\n        psd (ComplexTensor): (..., F, C, C)\n\n    \"\"\"\n    # outer product: (..., C_1, T) x (..., C_2, T) -> (..., T, C, C_2)\n    psd_Y = FC.einsum(\"...ct,...et->...tce\", [xs, xs.conj()])\n\n    # Averaging mask along C: (..., C, T) -> (..., T)\n    mask = mask.mean(dim=-2)\n\n    # Normalized mask along T: (..., T)\n    if normalization:\n        # If assuming the tensor is padded with zero, the summation along\n        # the time axis is same regardless of the padding length.\n        mask = mask / (mask.sum(dim=-1, keepdim=True) + eps)\n\n    # psd: (..., T, C, C)\n    psd = psd_Y * mask[..., None, None]\n    # (..., T, C, C) -> (..., C, C)\n    psd = psd.sum(dim=-3)\n\n    return psd\n\n\ndef get_mvdr_vector(\n    psd_s: ComplexTensor,\n    psd_n: ComplexTensor,\n    reference_vector: torch.Tensor,\n    eps: float = 1e-15,\n) -> ComplexTensor:\n    \"\"\"Return the MVDR(Minimum Variance Distortionless Response) vector:\n\n        h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u\n\n    Reference:\n        On optimal frequency-domain multichannel linear filtering\n        for noise reduction; M. 
Souden et al., 2010;\n        https://ieeexplore.ieee.org/document/5089420\n\n    Args:\n        psd_s (ComplexTensor): (..., F, C, C)\n        psd_n (ComplexTensor): (..., F, C, C)\n        reference_vector (torch.Tensor): (..., C)\n        eps (float):\n    Returns:\n        beamform_vector (ComplexTensor): (..., F, C)\n    \"\"\"\n    # Add eps to the diagonal of psd_n (note: this modifies psd_n in place)\n    C = psd_n.size(-1)\n    eye = torch.eye(C, dtype=psd_n.dtype, device=psd_n.device)\n    shape = [1 for _ in range(psd_n.dim() - 2)] + [C, C]\n    eye = eye.view(*shape)\n    psd_n += eps * eye\n\n    # numerator: (..., C_1, C_2) x (..., C_2, C_3) -> (..., C_1, C_3)\n    numerator = FC.einsum(\"...ec,...cd->...ed\", [psd_n.inverse(), psd_s])\n    # ws: (..., C, C) / (...,) -> (..., C, C)\n    ws = numerator / (FC.trace(numerator)[..., None, None] + eps)\n    # h: (..., F, C_1, C_2) x (..., C_2) -> (..., F, C_1)\n    beamform_vector = FC.einsum(\"...fec,...c->...fe\", [ws, reference_vector])\n    return beamform_vector\n\n\ndef apply_beamforming_vector(\n    beamform_vector: ComplexTensor, mix: ComplexTensor\n) -> ComplexTensor:\n    # (..., C) x (..., C, T) -> (..., T)\n    es = FC.einsum(\"...c,...ct->...t\", [beamform_vector.conj(), mix])\n    return es\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/dnn_beamformer.py",
    "content": "from distutils.version import LooseVersion\nfrom typing import Tuple\n\nimport torch\nfrom torch.nn import functional as F\n\nfrom espnet.nets.pytorch_backend.frontends.beamformer import apply_beamforming_vector\nfrom espnet.nets.pytorch_backend.frontends.beamformer import get_mvdr_vector\nfrom espnet.nets.pytorch_backend.frontends.beamformer import (\n    get_power_spectral_density_matrix,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.frontends.mask_estimator import MaskEstimator\nfrom torch_complex.tensor import ComplexTensor\n\nis_torch_1_2_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.2.0\")\nis_torch_1_3_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.3.0\")\n\n\nclass DNN_Beamformer(torch.nn.Module):\n    \"\"\"DNN mask based Beamformer\n\n    Citation:\n        Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017;\n        https://arxiv.org/abs/1703.04783\n\n    \"\"\"\n\n    def __init__(\n        self,\n        bidim,\n        btype=\"blstmp\",\n        blayers=3,\n        bunits=300,\n        bprojs=320,\n        bnmask=2,\n        dropout_rate=0.0,\n        badim=320,\n        ref_channel: int = -1,\n        beamformer_type=\"mvdr\",\n    ):\n        super().__init__()\n        self.mask = MaskEstimator(\n            btype, bidim, blayers, bunits, bprojs, dropout_rate, nmask=bnmask\n        )\n        self.ref = AttentionReference(bidim, badim)\n        self.ref_channel = ref_channel\n\n        self.nmask = bnmask\n\n        if beamformer_type != \"mvdr\":\n            raise ValueError(\n                \"Not supporting beamformer_type={}\".format(beamformer_type)\n            )\n        self.beamformer_type = beamformer_type\n\n    def forward(\n        self, data: ComplexTensor, ilens: torch.LongTensor\n    ) -> Tuple[ComplexTensor, torch.LongTensor, ComplexTensor]:\n        \"\"\"The forward function\n\n        Notation:\n            B: Batch\n            C: Channel\n            T: Time or 
Sequence length\n            F: Freq\n\n        Args:\n            data (ComplexTensor): (B, T, C, F)\n            ilens (torch.Tensor): (B,)\n        Returns:\n            enhanced (ComplexTensor): (B, T, F)\n            ilens (torch.Tensor): (B,)\n\n        \"\"\"\n\n        def apply_beamforming(data, ilens, psd_speech, psd_noise):\n            # u: (B, C)\n            if self.ref_channel < 0:\n                u, _ = self.ref(psd_speech, ilens)\n            else:\n                # (optional) Create onehot vector for fixed reference microphone\n                u = torch.zeros(\n                    *(data.size()[:-3] + (data.size(-2),)), device=data.device\n                )\n                u[..., self.ref_channel].fill_(1)\n\n            ws = get_mvdr_vector(psd_speech, psd_noise, u)\n            enhanced = apply_beamforming_vector(ws, data)\n\n            return enhanced, ws\n\n        # data (B, T, C, F) -> (B, F, C, T)\n        data = data.permute(0, 3, 2, 1)\n\n        # mask: (B, F, C, T)\n        masks, _ = self.mask(data, ilens)\n        assert self.nmask == len(masks)\n\n        if self.nmask == 2:  # (mask_speech, mask_noise)\n            mask_speech, mask_noise = masks\n\n            psd_speech = get_power_spectral_density_matrix(data, mask_speech)\n            psd_noise = get_power_spectral_density_matrix(data, mask_noise)\n\n            enhanced, ws = apply_beamforming(data, ilens, psd_speech, psd_noise)\n\n            # (..., F, T) -> (..., T, F)\n            enhanced = enhanced.transpose(-1, -2)\n            mask_speech = mask_speech.transpose(-1, -3)\n        else:  # multi-speaker case: (mask_speech1, ..., mask_noise)\n            mask_speech = list(masks[:-1])\n            mask_noise = masks[-1]\n\n            psd_speeches = [\n                get_power_spectral_density_matrix(data, mask) for mask in mask_speech\n            ]\n            psd_noise = get_power_spectral_density_matrix(data, mask_noise)\n\n            enhanced = []\n            
ws = []\n            for i in range(self.nmask - 1):\n                psd_speech = psd_speeches.pop(i)\n                # treat all other speakers' psd_speech as noises\n                enh, w = apply_beamforming(\n                    data, ilens, psd_speech, sum(psd_speeches) + psd_noise\n                )\n                psd_speeches.insert(i, psd_speech)\n\n                # (..., F, T) -> (..., T, F)\n                enh = enh.transpose(-1, -2)\n                mask_speech[i] = mask_speech[i].transpose(-1, -3)\n\n                enhanced.append(enh)\n                ws.append(w)\n\n        return enhanced, ilens, mask_speech\n\n\nclass AttentionReference(torch.nn.Module):\n    def __init__(self, bidim, att_dim):\n        super().__init__()\n        self.mlp_psd = torch.nn.Linear(bidim, att_dim)\n        self.gvec = torch.nn.Linear(att_dim, 1)\n\n    def forward(\n        self, psd_in: ComplexTensor, ilens: torch.LongTensor, scaling: float = 2.0\n    ) -> Tuple[torch.Tensor, torch.LongTensor]:\n        \"\"\"The forward function\n\n        Args:\n            psd_in (ComplexTensor): (B, F, C, C)\n            ilens (torch.Tensor): (B,)\n            scaling (float):\n        Returns:\n            u (torch.Tensor): (B, C)\n            ilens (torch.Tensor): (B,)\n        \"\"\"\n        B, _, C = psd_in.size()[:3]\n        assert psd_in.size(2) == psd_in.size(3), psd_in.size()\n        # psd_in: (B, F, C, C)\n        datatype = torch.bool if is_torch_1_3_plus else torch.uint8\n        datatype2 = torch.bool if is_torch_1_2_plus else torch.uint8\n        psd = psd_in.masked_fill(\n            torch.eye(C, dtype=datatype, device=psd_in.device).type(datatype2), 0\n        )\n        # psd: (B, F, C, C) -> (B, C, F)\n        psd = (psd.sum(dim=-1) / (C - 1)).transpose(-1, -2)\n\n        # Calculate amplitude\n        psd_feat = (psd.real ** 2 + psd.imag ** 2) ** 0.5\n\n        # (B, C, F) -> (B, C, F2)\n        mlp_psd = self.mlp_psd(psd_feat)\n        # (B, C, F2) -> 
(B, C, 1) -> (B, C)\n        e = self.gvec(torch.tanh(mlp_psd)).squeeze(-1)\n        u = F.softmax(scaling * e, dim=-1)\n        return u, ilens\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/dnn_wpe.py",
    "content": "from typing import Tuple\n\nfrom pytorch_wpe import wpe_one_iteration\nimport torch\nfrom torch_complex.tensor import ComplexTensor\n\nfrom espnet.nets.pytorch_backend.frontends.mask_estimator import MaskEstimator\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\n\n\nclass DNN_WPE(torch.nn.Module):\n    def __init__(\n        self,\n        wtype: str = \"blstmp\",\n        widim: int = 257,\n        wlayers: int = 3,\n        wunits: int = 300,\n        wprojs: int = 320,\n        dropout_rate: float = 0.0,\n        taps: int = 5,\n        delay: int = 3,\n        use_dnn_mask: bool = True,\n        iterations: int = 1,\n        normalization: bool = False,\n    ):\n        super().__init__()\n        self.iterations = iterations\n        self.taps = taps\n        self.delay = delay\n\n        self.normalization = normalization\n        self.use_dnn_mask = use_dnn_mask\n\n        self.inverse_power = True\n\n        if self.use_dnn_mask:\n            self.mask_est = MaskEstimator(\n                wtype, widim, wlayers, wunits, wprojs, dropout_rate, nmask=1\n            )\n\n    def forward(\n        self, data: ComplexTensor, ilens: torch.LongTensor\n    ) -> Tuple[ComplexTensor, torch.LongTensor, ComplexTensor]:\n        \"\"\"The forward function\n\n        Notation:\n            B: Batch\n            C: Channel\n            T: Time or Sequence length\n            F: Freq or Some dimension of the feature vector\n\n        Args:\n            data: (B, C, T, F)\n            ilens: (B,)\n        Returns:\n            data: (B, C, T, F)\n            ilens: (B,)\n        \"\"\"\n        # (B, T, C, F) -> (B, F, C, T)\n        enhanced = data = data.permute(0, 3, 2, 1)\n        mask = None\n\n        for i in range(self.iterations):\n            # Calculate power: (..., C, T)\n            power = enhanced.real ** 2 + enhanced.imag ** 2\n            if i == 0 and self.use_dnn_mask:\n                # mask: (B, F, C, T)\n                
(mask,), _ = self.mask_est(enhanced, ilens)\n                if self.normalization:\n                    # Normalize along T\n                    mask = mask / mask.sum(dim=-1)[..., None]\n                # (..., C, T) * (..., C, T) -> (..., C, T)\n                power = power * mask\n\n            # Averaging along the channel axis: (..., C, T) -> (..., T)\n            power = power.mean(dim=-2)\n\n            # enhanced: (..., C, T) -> (..., C, T)\n            enhanced = wpe_one_iteration(\n                data.contiguous(),\n                power,\n                taps=self.taps,\n                delay=self.delay,\n                inverse_power=self.inverse_power,\n            )\n\n            enhanced.masked_fill_(make_pad_mask(ilens, enhanced.real), 0)\n\n        # (B, F, C, T) -> (B, T, C, F)\n        enhanced = enhanced.permute(0, 3, 2, 1)\n        if mask is not None:\n            mask = mask.transpose(-1, -3)\n        return enhanced, ilens, mask\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/feature_transform.py",
    "content": "from typing import List\nfrom typing import Tuple\nfrom typing import Union\n\nimport librosa\nimport numpy as np\nimport torch\nfrom torch_complex.tensor import ComplexTensor\n\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\n\n\nclass FeatureTransform(torch.nn.Module):\n    def __init__(\n        self,\n        # Mel options,\n        fs: int = 16000,\n        n_fft: int = 512,\n        n_mels: int = 80,\n        fmin: float = 0.0,\n        fmax: float = None,\n        # Normalization\n        stats_file: str = None,\n        apply_uttmvn: bool = True,\n        uttmvn_norm_means: bool = True,\n        uttmvn_norm_vars: bool = False,\n    ):\n        super().__init__()\n        self.apply_uttmvn = apply_uttmvn\n\n        self.logmel = LogMel(fs=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)\n        self.stats_file = stats_file\n        if stats_file is not None:\n            self.global_mvn = GlobalMVN(stats_file)\n        else:\n            self.global_mvn = None\n\n        if self.apply_uttmvn is not None:\n            self.uttmvn = UtteranceMVN(\n                norm_means=uttmvn_norm_means, norm_vars=uttmvn_norm_vars\n            )\n        else:\n            self.uttmvn = None\n\n    def forward(\n        self, x: ComplexTensor, ilens: Union[torch.LongTensor, np.ndarray, List[int]]\n    ) -> Tuple[torch.Tensor, torch.LongTensor]:\n        # (B, T, F) or (B, T, C, F)\n        if x.dim() not in (3, 4):\n            raise ValueError(f\"Input dim must be 3 or 4: {x.dim()}\")\n        if not torch.is_tensor(ilens):\n            ilens = torch.from_numpy(np.asarray(ilens)).to(x.device)\n\n        if x.dim() == 4:\n            # h: (B, T, C, F) -> h: (B, T, F)\n            if self.training:\n                # Select 1ch randomly\n                ch = np.random.randint(x.size(2))\n                h = x[:, :, ch, :]\n            else:\n                # Use the first channel\n                h = x[:, :, 0, :]\n        else:\n   
         h = x\n\n        # h: ComplexTensor(B, T, F) -> torch.Tensor(B, T, F)\n        h = h.real ** 2 + h.imag ** 2\n\n        h, _ = self.logmel(h, ilens)\n        if self.stats_file is not None:\n            h, _ = self.global_mvn(h, ilens)\n        if self.apply_uttmvn:\n            h, _ = self.uttmvn(h, ilens)\n\n        return h, ilens\n\n\nclass LogMel(torch.nn.Module):\n    \"\"\"Convert STFT to fbank feats\n\n    The arguments is same as librosa.filters.mel\n\n    Args:\n        fs: number > 0 [scalar] sampling rate of the incoming signal\n        n_fft: int > 0 [scalar] number of FFT components\n        n_mels: int > 0 [scalar] number of Mel bands to generate\n        fmin: float >= 0 [scalar] lowest frequency (in Hz)\n        fmax: float >= 0 [scalar] highest frequency (in Hz).\n            If `None`, use `fmax = fs / 2.0`\n        htk: use HTK formula instead of Slaney\n        norm: {None, 1, np.inf} [scalar]\n            if 1, divide the triangular mel weights by the width of the mel band\n            (area normalization).  
Otherwise, leave all the triangles aiming for\n            a peak value of 1.0\n\n    \"\"\"\n\n    def __init__(\n        self,\n        fs: int = 16000,\n        n_fft: int = 512,\n        n_mels: int = 80,\n        fmin: float = 0.0,\n        fmax: float = None,\n        htk: bool = False,\n        norm=1,\n    ):\n        super().__init__()\n\n        _mel_options = dict(\n            sr=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax, htk=htk, norm=norm\n        )\n        self.mel_options = _mel_options\n\n        # Note(kamo): The mel matrix of librosa is different from kaldi.\n        melmat = librosa.filters.mel(**_mel_options)\n        # melmat: (D2, D1) -> (D1, D2)\n        self.register_buffer(\"melmat\", torch.from_numpy(melmat.T).float())\n\n    def extra_repr(self):\n        return \", \".join(f\"{k}={v}\" for k, v in self.mel_options.items())\n\n    def forward(\n        self, feat: torch.Tensor, ilens: torch.LongTensor\n    ) -> Tuple[torch.Tensor, torch.LongTensor]:\n        # feat: (B, T, D1) x melmat: (D1, D2) -> mel_feat: (B, T, D2)\n        mel_feat = torch.matmul(feat, self.melmat)\n\n        logmel_feat = (mel_feat + 1e-20).log()\n        # Zero padding\n        logmel_feat = logmel_feat.masked_fill(make_pad_mask(ilens, logmel_feat, 1), 0.0)\n        return logmel_feat, ilens\n\n\nclass GlobalMVN(torch.nn.Module):\n    \"\"\"Apply global mean and variance normalization\n\n    Args:\n        stats_file(str): npy file of 1-dim array or text file.\n            The elements from the first to\n            the {(len(array) - 1) / 2}th are treated as\n            the sum of features,\n            the rest excluding the last element are\n            treated as the sum of the squared values of features,\n            and the last element equals the number of samples.\n        eps(float): Floor value for the standard deviation.\n    \"\"\"\n\n    def __init__(\n        self,\n        stats_file: str,\n        norm_means: bool = True,\n        norm_vars: bool = True,\n   
     eps: float = 1.0e-20,\n    ):\n        super().__init__()\n        self.norm_means = norm_means\n        self.norm_vars = norm_vars\n\n        self.stats_file = stats_file\n        stats = np.load(stats_file)\n\n        stats = stats.astype(float)\n        assert (len(stats) - 1) % 2 == 0, stats.shape\n\n        count = stats.flatten()[-1]\n        mean = stats[: (len(stats) - 1) // 2] / count\n        var = stats[(len(stats) - 1) // 2 : -1] / count - mean * mean\n        std = np.maximum(np.sqrt(var), eps)\n\n        self.register_buffer(\"bias\", torch.from_numpy(-mean.astype(np.float32)))\n        self.register_buffer(\"scale\", torch.from_numpy(1 / std.astype(np.float32)))\n\n    def extra_repr(self):\n        return (\n            f\"stats_file={self.stats_file}, \"\n            f\"norm_means={self.norm_means}, norm_vars={self.norm_vars}\"\n        )\n\n    def forward(\n        self, x: torch.Tensor, ilens: torch.LongTensor\n    ) -> Tuple[torch.Tensor, torch.LongTensor]:\n        # feat: (B, T, D)\n        if self.norm_means:\n            x += self.bias.type_as(x)\n            # Zero the padded frames in place; the out-of-place masked_fill\n            # used before discarded its result and left the padding untouched.\n            x.masked_fill_(make_pad_mask(ilens, x, 1), 0.0)\n\n        if self.norm_vars:\n            x *= self.scale.type_as(x)\n        return x, ilens\n\n\nclass UtteranceMVN(torch.nn.Module):\n    def __init__(\n        self, norm_means: bool = True, norm_vars: bool = False, eps: float = 1.0e-20\n    ):\n        super().__init__()\n        self.norm_means = norm_means\n        self.norm_vars = norm_vars\n        self.eps = eps\n\n    def extra_repr(self):\n        return f\"norm_means={self.norm_means}, norm_vars={self.norm_vars}\"\n\n    def forward(\n        self, x: torch.Tensor, ilens: torch.LongTensor\n    ) -> Tuple[torch.Tensor, torch.LongTensor]:\n        return utterance_mvn(\n            x, ilens, norm_means=self.norm_means, norm_vars=self.norm_vars, eps=self.eps\n        )\n\n\ndef utterance_mvn(\n    x: torch.Tensor,\n    ilens: torch.LongTensor,\n    norm_means: bool = 
True,\n    norm_vars: bool = False,\n    eps: float = 1.0e-20,\n) -> Tuple[torch.Tensor, torch.LongTensor]:\n    \"\"\"Apply utterance mean and variance normalization\n\n    Args:\n        x: (B, T, D), assumed zero padded\n        ilens: (B,)\n        norm_means: subtract the per-utterance mean if True\n        norm_vars: divide by the per-utterance standard deviation if True\n        eps: floor applied to the variance\n\n    \"\"\"\n    ilens_ = ilens.type_as(x)\n    # mean: (B, D)\n    mean = x.sum(dim=1) / ilens_[:, None]\n\n    if norm_means:\n        x -= mean[:, None, :]\n        x_ = x\n    else:\n        x_ = x - mean[:, None, :]\n\n    # Zero padding, in place (masked_fill without the trailing underscore\n    # returns a new tensor and leaves x_ unchanged)\n    x_.masked_fill_(make_pad_mask(ilens, x_, 1), 0.0)\n    if norm_vars:\n        var = x_.pow(2).sum(dim=1) / ilens_[:, None]\n        var = torch.clamp(var, min=eps)\n        x /= var.sqrt()[:, None, :]\n        x_ = x\n    return x_, ilens\n\n\ndef feature_transform_for(args, n_fft):\n    return FeatureTransform(\n        # Mel options,\n        fs=args.fbank_fs,\n        n_fft=n_fft,\n        n_mels=args.n_mels,\n        fmin=args.fbank_fmin,\n        fmax=args.fbank_fmax,\n        # Normalization\n        stats_file=args.stats_file,\n        apply_uttmvn=args.apply_uttmvn,\n        uttmvn_norm_means=args.uttmvn_norm_means,\n        uttmvn_norm_vars=args.uttmvn_norm_vars,\n    )\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/frontend.py",
    "content": "from typing import List\nfrom typing import Optional\nfrom typing import Tuple\nfrom typing import Union\n\nimport numpy\nimport torch\nimport torch.nn as nn\nfrom torch_complex.tensor import ComplexTensor\n\nfrom espnet.nets.pytorch_backend.frontends.dnn_beamformer import DNN_Beamformer\nfrom espnet.nets.pytorch_backend.frontends.dnn_wpe import DNN_WPE\n\n\nclass Frontend(nn.Module):\n    def __init__(\n        self,\n        idim: int,\n        # WPE options\n        use_wpe: bool = False,\n        wtype: str = \"blstmp\",\n        wlayers: int = 3,\n        wunits: int = 300,\n        wprojs: int = 320,\n        wdropout_rate: float = 0.0,\n        taps: int = 5,\n        delay: int = 3,\n        use_dnn_mask_for_wpe: bool = True,\n        # Beamformer options\n        use_beamformer: bool = False,\n        btype: str = \"blstmp\",\n        blayers: int = 3,\n        bunits: int = 300,\n        bprojs: int = 320,\n        bnmask: int = 2,\n        badim: int = 320,\n        ref_channel: int = -1,\n        bdropout_rate=0.0,\n    ):\n        super().__init__()\n\n        self.use_beamformer = use_beamformer\n        self.use_wpe = use_wpe\n        self.use_dnn_mask_for_wpe = use_dnn_mask_for_wpe\n        # use frontend for all the data,\n        # e.g. 
in the case of multi-speaker speech separation\n        self.use_frontend_for_all = bnmask > 2\n\n        if self.use_wpe:\n            if self.use_dnn_mask_for_wpe:\n                # Use DNN for power estimation\n                # (Not observed significant gains)\n                iterations = 1\n            else:\n                # Performing as conventional WPE, without DNN Estimator\n                iterations = 2\n\n            self.wpe = DNN_WPE(\n                wtype=wtype,\n                widim=idim,\n                wunits=wunits,\n                wprojs=wprojs,\n                wlayers=wlayers,\n                taps=taps,\n                delay=delay,\n                dropout_rate=wdropout_rate,\n                iterations=iterations,\n                use_dnn_mask=use_dnn_mask_for_wpe,\n            )\n        else:\n            self.wpe = None\n\n        if self.use_beamformer:\n            self.beamformer = DNN_Beamformer(\n                btype=btype,\n                bidim=idim,\n                bunits=bunits,\n                bprojs=bprojs,\n                blayers=blayers,\n                bnmask=bnmask,\n                dropout_rate=bdropout_rate,\n                badim=badim,\n                ref_channel=ref_channel,\n            )\n        else:\n            self.beamformer = None\n\n    def forward(\n        self, x: ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]\n    ) -> Tuple[ComplexTensor, torch.LongTensor, Optional[ComplexTensor]]:\n        assert len(x) == len(ilens), (len(x), len(ilens))\n        # (B, T, F) or (B, T, C, F)\n        if x.dim() not in (3, 4):\n            raise ValueError(f\"Input dim must be 3 or 4: {x.dim()}\")\n        if not torch.is_tensor(ilens):\n            ilens = torch.from_numpy(numpy.asarray(ilens)).to(x.device)\n\n        mask = None\n        h = x\n        if h.dim() == 4:\n            if self.training:\n                choices = [(False, False)] if not self.use_frontend_for_all else 
[]\n                if self.use_wpe:\n                    choices.append((True, False))\n\n                if self.use_beamformer:\n                    choices.append((False, True))\n\n                use_wpe, use_beamformer = choices[numpy.random.randint(len(choices))]\n\n            else:\n                use_wpe = self.use_wpe\n                use_beamformer = self.use_beamformer\n\n            # 1. WPE\n            if use_wpe:\n                # h: (B, T, C, F) -> h: (B, T, C, F)\n                h, ilens, mask = self.wpe(h, ilens)\n\n            # 2. Beamformer\n            if use_beamformer:\n                # h: (B, T, C, F) -> h: (B, T, F)\n                h, ilens, mask = self.beamformer(h, ilens)\n\n        return h, ilens, mask\n\n\ndef frontend_for(args, idim):\n    return Frontend(\n        idim=idim,\n        # WPE options\n        use_wpe=args.use_wpe,\n        wtype=args.wtype,\n        wlayers=args.wlayers,\n        wunits=args.wunits,\n        wprojs=args.wprojs,\n        wdropout_rate=args.wdropout_rate,\n        taps=args.wpe_taps,\n        delay=args.wpe_delay,\n        use_dnn_mask_for_wpe=args.use_dnn_mask_for_wpe,\n        # Beamformer options\n        use_beamformer=args.use_beamformer,\n        btype=args.btype,\n        blayers=args.blayers,\n        bunits=args.bunits,\n        bprojs=args.bprojs,\n        bnmask=args.bnmask,\n        badim=args.badim,\n        ref_channel=args.ref_channel,\n        bdropout_rate=args.bdropout_rate,\n    )\n"
  },
  {
    "path": "nets/pytorch_backend/frontends/mask_estimator.py",
    "content": "from typing import Tuple\n\nimport numpy as np\nimport torch\nfrom torch.nn import functional as F\nfrom torch_complex.tensor import ComplexTensor\n\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.rnn.encoders import RNN\nfrom espnet.nets.pytorch_backend.rnn.encoders import RNNP\n\n\nclass MaskEstimator(torch.nn.Module):\n    def __init__(self, type, idim, layers, units, projs, dropout, nmask=1):\n        super().__init__()\n        # NOTE: np.int was removed in NumPy 1.24; use an explicit integer dtype\n        subsample = np.ones(layers + 1, dtype=np.int64)\n\n        typ = type.lstrip(\"vgg\").rstrip(\"p\")\n        if type[-1] == \"p\":\n            self.brnn = RNNP(idim, layers, units, projs, subsample, dropout, typ=typ)\n        else:\n            self.brnn = RNN(idim, layers, units, projs, dropout, typ=typ)\n\n        self.type = type\n        self.nmask = nmask\n        self.linears = torch.nn.ModuleList(\n            [torch.nn.Linear(projs, idim) for _ in range(nmask)]\n        )\n\n    def forward(\n        self, xs: ComplexTensor, ilens: torch.LongTensor\n    ) -> Tuple[Tuple[torch.Tensor, ...], torch.LongTensor]:\n        \"\"\"The forward function\n\n        Args:\n            xs: (B, F, C, T)\n            ilens: (B,)\n        Returns:\n            masks: A tuple of the masks. 
(B, F, C, T)\n            ilens: (B,)\n        \"\"\"\n        assert xs.size(0) == ilens.size(0), (xs.size(0), ilens.size(0))\n        _, _, C, input_length = xs.size()\n        # (B, F, C, T) -> (B, C, T, F)\n        xs = xs.permute(0, 2, 3, 1)\n\n        # Calculate amplitude: (B, C, T, F) -> (B, C, T, F)\n        xs = (xs.real ** 2 + xs.imag ** 2) ** 0.5\n        # xs: (B, C, T, F) -> xs: (B * C, T, F)\n        xs = xs.contiguous().view(-1, xs.size(-2), xs.size(-1))\n        # ilens: (B,) -> ilens_: (B * C)\n        ilens_ = ilens[:, None].expand(-1, C).contiguous().view(-1)\n\n        # xs: (B * C, T, F) -> xs: (B * C, T, D)\n        xs, _, _ = self.brnn(xs, ilens_)\n        # xs: (B * C, T, D) -> xs: (B, C, T, D)\n        xs = xs.view(-1, C, xs.size(-2), xs.size(-1))\n\n        masks = []\n        for linear in self.linears:\n            # xs: (B, C, T, D) -> mask:(B, C, T, F)\n            mask = linear(xs)\n\n            mask = torch.sigmoid(mask)\n            # Zero padding (masked_fill is out-of-place, so keep its result)\n            mask = mask.masked_fill(make_pad_mask(ilens, mask, length_dim=2), 0)\n\n            # (B, C, T, F) -> (B, F, C, T)\n            mask = mask.permute(0, 3, 1, 2)\n\n            # Take care of the multi-GPU case: if input_length > max(ilens)\n            if mask.size(-1) < input_length:\n                mask = F.pad(mask, [0, input_length - mask.size(-1)], value=0)\n            masks.append(mask)\n\n        return tuple(masks), ilens\n"
  },
  {
    "path": "nets/pytorch_backend/gtn_ctc.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"GTN CTC implementation.\"\"\"\n\nimport gtn\nimport torch\n\n\nclass GTNCTCLossFunction(torch.autograd.Function):\n    \"\"\"GTN CTC module.\"\"\"\n\n    # Copied from FB's GTN example implementation:\n    # https://github.com/facebookresearch/gtn_applications/blob/master/utils.py#L251\n\n    @staticmethod\n    def create_ctc_graph(target, blank_idx):\n        \"\"\"Build gtn graph.\n\n        :param list target: single target sequence\n        :param int blank_idx: index of blank token\n        :return: gtn graph of target sequence\n        :rtype: gtn.Graph\n        \"\"\"\n        g_criterion = gtn.Graph(False)\n        L = len(target)\n        S = 2 * L + 1\n        for s in range(S):\n            idx = (s - 1) // 2\n            g_criterion.add_node(s == 0, s == S - 1 or s == S - 2)\n            label = target[idx] if s % 2 else blank_idx\n            g_criterion.add_arc(s, s, label)\n            if s > 0:\n                g_criterion.add_arc(s - 1, s, label)\n            if s % 2 and s > 1 and label != target[idx - 1]:\n                g_criterion.add_arc(s - 2, s, label)\n        g_criterion.arc_sort(False)\n        return g_criterion\n\n    @staticmethod\n    def forward(ctx, log_probs, targets, blank_idx=0, reduction=\"none\"):\n        \"\"\"Forward computation.\n\n        :param torch.tensor log_probs: batched log softmax probabilities (B, Tmax, oDim)\n        :param list targets: batched target sequences, list of lists\n        :param int blank_idx: index of blank token\n        :return: ctc loss value\n        :rtype: torch.Tensor\n        \"\"\"\n        B, T, C = log_probs.shape\n        losses = [None] * B\n        scales = [None] * B\n        emissions_graphs = [None] * B\n\n        def process(b):\n            # create emission graph\n            g_emissions = gtn.linear_graph(T, C, log_probs.requires_grad)\n            cpu_data = log_probs[b].cpu().contiguous()\n            
g_emissions.set_weights(cpu_data.data_ptr())\n\n            # create criterion graph\n            g_criterion = GTNCTCLossFunction.create_ctc_graph(targets[b], blank_idx)\n            # compose the graphs\n            g_loss = gtn.negate(\n                gtn.forward_score(gtn.intersect(g_emissions, g_criterion))\n            )\n\n            scale = 1.0\n            if reduction == \"mean\":\n                L = len(targets[b])\n                scale = 1.0 / L if L > 0 else scale\n            elif reduction != \"none\":\n                raise ValueError(\"invalid value for reduction '\" + str(reduction) + \"'\")\n\n            # Save for backward:\n            losses[b] = g_loss\n            scales[b] = scale\n            emissions_graphs[b] = g_emissions\n\n        gtn.parallel_for(process, range(B))\n\n        ctx.auxiliary_data = (losses, scales, emissions_graphs, log_probs.shape)\n        loss = torch.tensor([losses[b].item() * scales[b] for b in range(B)])\n        return torch.mean(loss.cuda() if log_probs.is_cuda else loss)\n\n    @staticmethod\n    def backward(ctx, grad_output):\n        \"\"\"Backward computation.\n\n        :param torch.tensor grad_output: backward passed gradient value\n        :return: cumulative gradient output\n        :rtype: (torch.Tensor, None, None, None)\n        \"\"\"\n        losses, scales, emissions_graphs, in_shape = ctx.auxiliary_data\n        B, T, C = in_shape\n        input_grad = torch.empty((B, T, C))\n\n        def process(b):\n            gtn.backward(losses[b], False)\n            emissions = emissions_graphs[b]\n            grad = emissions.grad().weights_to_numpy()\n            input_grad[b] = torch.from_numpy(grad).view(1, T, C) * scales[b]\n\n        gtn.parallel_for(process, range(B))\n\n        if grad_output.is_cuda:\n            input_grad = input_grad.cuda()\n        input_grad *= grad_output / B\n\n        return (\n            input_grad,\n            None,  # targets\n            None,  # blank_idx\n  
          None,  # reduction\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/initialization.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Initialization functions for RNN sequence-to-sequence models.\"\"\"\n\nimport math\n\n\ndef lecun_normal_init_parameters(module):\n    \"\"\"Initialize parameters in LeCun's manner.\"\"\"\n    for p in module.parameters():\n        data = p.data\n        if data.dim() == 1:\n            # bias\n            data.zero_()\n        elif data.dim() == 2:\n            # linear weight\n            n = data.size(1)\n            stdv = 1.0 / math.sqrt(n)\n            data.normal_(0, stdv)\n        elif data.dim() in (3, 4):\n            # conv weight\n            n = data.size(1)\n            for k in data.size()[2:]:\n                n *= k\n            stdv = 1.0 / math.sqrt(n)\n            data.normal_(0, stdv)\n        else:\n            raise NotImplementedError\n\n\ndef uniform_init_parameters(module):\n    \"\"\"Initialize parameters with a uniform distribution.\"\"\"\n    for p in module.parameters():\n        data = p.data\n        if data.dim() == 1:\n            # bias\n            data.uniform_(-0.1, 0.1)\n        elif data.dim() == 2:\n            # linear weight\n            data.uniform_(-0.1, 0.1)\n        elif data.dim() in (3, 4):\n            # conv weight\n            pass  # use the pytorch default\n        else:\n            raise NotImplementedError\n\n\ndef set_forget_bias_to_one(bias):\n    \"\"\"Initialize a bias vector in the forget gate with one.\"\"\"\n    n = bias.size(0)\n    start, end = n // 4, n // 2\n    bias.data[start:end].fill_(1.0)\n"
  },
  {
    "path": "nets/pytorch_backend/lm/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/lm/default.py",
    "content": "\"\"\"Default Recurrent Neural Network Language Model in `lm_train.py`.\"\"\"\n\nfrom typing import Any\nfrom typing import List\nfrom typing import Tuple\n\nimport logging\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom espnet.nets.lm_interface import LMInterface\nfrom espnet.nets.pytorch_backend.e2e_asr import to_device\nfrom espnet.nets.scorer_interface import BatchScorerInterface\nfrom espnet.utils.cli_utils import strtobool\n\n\nclass DefaultRNNLM(BatchScorerInterface, LMInterface, nn.Module):\n    \"\"\"Default RNNLM for `LMInterface` Implementation.\n\n    Note:\n        PyTorch seems to leak memory when one GPU computes this after data parallel.\n        If parallel GPUs compute this, it seems to be fine.\n        See also https://github.com/espnet/espnet/issues/1075\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments to command line argument parser.\"\"\"\n        parser.add_argument(\n            \"--type\",\n            type=str,\n            default=\"lstm\",\n            nargs=\"?\",\n            choices=[\"lstm\", \"gru\"],\n            help=\"Which type of RNN to use\",\n        )\n        parser.add_argument(\n            \"--layer\", \"-l\", type=int, default=2, help=\"Number of hidden layers\"\n        )\n        parser.add_argument(\n            \"--unit\", \"-u\", type=int, default=650, help=\"Number of hidden units\"\n        )\n        parser.add_argument(\n            \"--embed-unit\",\n            default=None,\n            type=int,\n            help=\"Number of hidden units in embedding layer, \"\n            \"if it is not specified, it keeps the same number with hidden units.\",\n        )\n        parser.add_argument(\n            \"--dropout-rate\", type=float, default=0.5, help=\"dropout probability\"\n        )\n        parser.add_argument(\n            \"--emb-dropout-rate\",\n            type=float,\n            default=0.0,\n            
help=\"emb dropout probability\",\n        )\n        parser.add_argument(\n            \"--tie-weights\",\n            type=strtobool,\n            default=False,\n            help=\"Tie input and output embeddings\",\n        )\n        return parser\n\n    def __init__(self, n_vocab, args):\n        \"\"\"Initialize class.\n\n        Args:\n            n_vocab (int): The size of the vocabulary\n            args (argparse.Namespace): configurations. see py:method:`add_arguments`\n\n        \"\"\"\n        nn.Module.__init__(self)\n        # NOTE: for a compatibility with less than 0.5.0 version models\n        dropout_rate = getattr(args, \"dropout_rate\", 0.0)\n        # NOTE: for a compatibility with less than 0.6.1 version models\n        embed_unit = getattr(args, \"embed_unit\", None)\n        # NOTE: for a compatibility with less than 0.9.7 version models\n        emb_dropout_rate = getattr(args, \"emb_dropout_rate\", 0.0)\n        # NOTE: for a compatibility with less than 0.9.7 version models\n        tie_weights = getattr(args, \"tie_weights\", False)\n\n        self.model = ClassifierWithState(\n            RNNLM(\n                n_vocab,\n                args.layer,\n                args.unit,\n                embed_unit,\n                args.type,\n                dropout_rate,\n                emb_dropout_rate,\n                tie_weights,\n            )\n        )\n\n    def state_dict(self):\n        \"\"\"Dump state dict.\"\"\"\n        return self.model.state_dict()\n\n    def load_state_dict(self, d):\n        \"\"\"Load state dict.\"\"\"\n        self.model.load_state_dict(d)\n\n    def forward(self, x, t):\n        \"\"\"Compute LM loss value from buffer sequences.\n\n        Args:\n            x (torch.Tensor): Input ids. (batch, len)\n            t (torch.Tensor): Target ids. 
(batch, len)\n\n        Returns:\n            tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Tuple of\n                loss to backward (scalar),\n                negative log-likelihood of t: -log p(t) (scalar) and\n                the number of elements in x (scalar)\n\n        Notes:\n            The last two return values are used\n            in perplexity: p(t)^{-1/n} = exp(-log p(t) / n)\n\n        \"\"\"\n        loss = 0\n        logp = 0\n        count = torch.tensor(0).long()\n        state = None\n        batch_size, sequence_length = x.shape\n        for i in range(sequence_length):\n            # Compute the loss at this time step and accumulate it\n            state, loss_batch = self.model(state, x[:, i], t[:, i])\n            non_zeros = torch.sum(x[:, i] != 0, dtype=loss_batch.dtype)\n            loss += loss_batch.mean() * non_zeros\n            logp += torch.sum(loss_batch * non_zeros)\n            count += int(non_zeros)\n        return loss / batch_size, loss, count.to(loss.device)\n\n    def score(self, y, state, x):\n        \"\"\"Score new token.\n\n        Args:\n            y (torch.Tensor): 1D torch.int64 prefix tokens.\n            state: Scorer state for prefix tokens\n            x (torch.Tensor): 2D encoder feature that generates ys.\n\n        Returns:\n            tuple[torch.Tensor, Any]: Tuple of\n                torch.float32 scores for next token (n_vocab)\n                and next state for ys\n\n        \"\"\"\n        new_state, scores = self.model.predict(state, y[-1].unsqueeze(0))\n        return scores.squeeze(0), new_state\n\n    def final_score(self, state):\n        \"\"\"Score eos.\n\n        Args:\n            state: Scorer state for prefix tokens\n\n        Returns:\n            float: final score\n\n        \"\"\"\n        return self.model.final(state)\n\n    # batch beam search API (see BatchScorerInterface)\n    def batch_score(\n        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor\n    ) -> 
Tuple[torch.Tensor, List[Any]]:\n        \"\"\"Score new token batch.\n\n        Args:\n            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).\n            states (List[Any]): Scorer states for prefix tokens.\n            xs (torch.Tensor):\n                The encoder feature that generates ys (n_batch, xlen, n_feat).\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        # merge states\n        n_batch = len(ys)\n        n_layers = self.model.predictor.n_layers\n        if self.model.predictor.typ == \"lstm\":\n            keys = (\"c\", \"h\")\n        else:\n            keys = (\"h\",)\n\n        if states[0] is None:\n            states = None\n        else:\n            # transpose state of [batch, key, layer] into [key, layer, batch]\n            states = {\n                k: [\n                    torch.stack([states[b][k][i] for b in range(n_batch)])\n                    for i in range(n_layers)\n                ]\n                for k in keys\n            }\n        states, logp = self.model.predict(states, ys[:, -1])\n\n        # transpose state of [key, layer, batch] into [batch, key, layer]\n        return (\n            logp,\n            [\n                {k: [states[k][i][b] for i in range(n_layers)] for k in keys}\n                for b in range(n_batch)\n            ],\n        )\n\n\nclass ClassifierWithState(nn.Module):\n    \"\"\"A wrapper for pytorch RNNLM.\"\"\"\n\n    def __init__(\n        self, predictor, lossfun=nn.CrossEntropyLoss(reduction=\"none\"), label_key=-1\n    ):\n        \"\"\"Initialize class.\n\n        :param torch.nn.Module predictor : The RNNLM\n        :param function lossfun : The loss function to use\n        :param int/str label_key :\n\n        \"\"\"\n        if not (isinstance(label_key, (int, str))):\n            raise 
TypeError(\"label_key must be int or str, but is %s\" % type(label_key))\n        super(ClassifierWithState, self).__init__()\n        self.lossfun = lossfun\n        self.y = None\n        self.loss = None\n        self.label_key = label_key\n        self.predictor = predictor\n\n    def forward(self, state, *args, **kwargs):\n        \"\"\"Compute the loss value for an input and label pair.\n\n        Notes:\n            It also computes accuracy and stores it to the attribute.\n            When ``label_key`` is ``int``, the corresponding element in ``args``\n            is treated as ground truth labels. And when it is ``str``, the\n            element in ``kwargs`` is used.\n            The all elements of ``args`` and ``kwargs`` except the groundtruth\n            labels are features.\n            It feeds features to the predictor and compare the result\n            with ground truth labels.\n\n        :param torch.Tensor state : the LM state\n        :param list[torch.Tensor] args : Input minibatch\n        :param dict[torch.Tensor] kwargs : Input minibatch\n        :return loss value\n        :rtype torch.Tensor\n\n        \"\"\"\n        if isinstance(self.label_key, int):\n            if not (-len(args) <= self.label_key < len(args)):\n                msg = \"Label key %d is out of bounds\" % self.label_key\n                raise ValueError(msg)\n            t = args[self.label_key]\n            if self.label_key == -1:\n                args = args[:-1]\n            else:\n                args = args[: self.label_key] + args[self.label_key + 1 :]\n        elif isinstance(self.label_key, str):\n            if self.label_key not in kwargs:\n                msg = 'Label key \"%s\" is not found' % self.label_key\n                raise ValueError(msg)\n            t = kwargs[self.label_key]\n            del kwargs[self.label_key]\n\n        self.y = None\n        self.loss = None\n        state, self.y = self.predictor(state, *args, **kwargs)\n        
self.loss = self.lossfun(self.y, t)\n        return state, self.loss\n\n    def predict(self, state, x):\n        \"\"\"Predict log probabilities for given state and input x using the predictor.\n\n        :param torch.Tensor state : The current state\n        :param torch.Tensor x : The input\n        :return a tuple (new state, log prob vector)\n        :rtype (torch.Tensor, torch.Tensor)\n        \"\"\"\n        if hasattr(self.predictor, \"normalized\") and self.predictor.normalized:\n            return self.predictor(state, x)\n        else:\n            state, z = self.predictor(state, x)\n            return state, F.log_softmax(z, dim=1)\n\n    def buff_predict(self, state, x, n):\n        \"\"\"Predict new tokens from buffered inputs.\"\"\"\n        if self.predictor.__class__.__name__ == \"RNNLM\":\n            return self.predict(state, x)\n\n        new_state = []\n        new_log_y = []\n        for i in range(n):\n            state_i = None if state is None else state[i]\n            state_i, log_y = self.predict(state_i, x[i].unsqueeze(0))\n            new_state.append(state_i)\n            new_log_y.append(log_y)\n\n        return new_state, torch.cat(new_log_y)\n\n    def final(self, state, index=None):\n        \"\"\"Predict final log probabilities for given state using the predictor.\n\n        :param state: The state\n        :return The final log probabilities\n        :rtype torch.Tensor\n        \"\"\"\n        if hasattr(self.predictor, \"final\"):\n            if index is not None:\n                return self.predictor.final(state[index])\n            else:\n                return self.predictor.final(state)\n        else:\n            return 0.0\n\n\n# Definition of a recurrent net for language modeling\nclass RNNLM(nn.Module):\n    \"\"\"A pytorch RNNLM.\"\"\"\n\n    def __init__(\n        self,\n        n_vocab,\n        n_layers,\n        n_units,\n        n_embed=None,\n        typ=\"lstm\",\n        dropout_rate=0.5,\n        
emb_dropout_rate=0.0,\n        tie_weights=False,\n    ):\n        \"\"\"Initialize class.\n\n        :param int n_vocab: The size of the vocabulary\n        :param int n_layers: The number of layers to create\n        :param int n_units: The number of units per layer\n        :param str typ: The RNN type\n        \"\"\"\n        super(RNNLM, self).__init__()\n        if n_embed is None:\n            n_embed = n_units\n\n        self.embed = nn.Embedding(n_vocab, n_embed)\n\n        if emb_dropout_rate == 0.0:\n            self.embed_drop = None\n        else:\n            self.embed_drop = nn.Dropout(emb_dropout_rate)\n\n        if typ == \"lstm\":\n            self.rnn = nn.ModuleList(\n                [nn.LSTMCell(n_embed, n_units)]\n                + [nn.LSTMCell(n_units, n_units) for _ in range(n_layers - 1)]\n            )\n        else:\n            self.rnn = nn.ModuleList(\n                [nn.GRUCell(n_embed, n_units)]\n                + [nn.GRUCell(n_units, n_units) for _ in range(n_layers - 1)]\n            )\n\n        self.dropout = nn.ModuleList(\n            [nn.Dropout(dropout_rate) for _ in range(n_layers + 1)]\n        )\n        self.lo = nn.Linear(n_units, n_vocab)\n        self.n_layers = n_layers\n        self.n_units = n_units\n        self.typ = typ\n\n        logging.info(\"Tie weights set to {}\".format(tie_weights))\n        logging.info(\"Dropout set to {}\".format(dropout_rate))\n        logging.info(\"Emb Dropout set to {}\".format(emb_dropout_rate))\n\n        if tie_weights:\n            assert (\n                n_embed == n_units\n            ), \"Tie Weights: True need embedding and final dimensions to match\"\n            self.lo.weight = self.embed.weight\n\n        # initialize parameters from uniform distribution\n        for param in self.parameters():\n            param.data.uniform_(-0.1, 0.1)\n\n    def zero_state(self, batchsize):\n        \"\"\"Initialize state.\"\"\"\n        p = next(self.parameters())\n        return 
torch.zeros(batchsize, self.n_units).to(device=p.device, dtype=p.dtype)\n\n    def forward(self, state, x):\n        \"\"\"Forward neural networks.\"\"\"\n        if state is None:\n            h = [to_device(x, self.zero_state(x.size(0))) for n in range(self.n_layers)]\n            state = {\"h\": h}\n            if self.typ == \"lstm\":\n                c = [\n                    to_device(x, self.zero_state(x.size(0)))\n                    for n in range(self.n_layers)\n                ]\n                state = {\"c\": c, \"h\": h}\n\n        h = [None] * self.n_layers\n        if self.embed_drop is not None:\n            emb = self.embed_drop(self.embed(x))\n        else:\n            emb = self.embed(x)\n        if self.typ == \"lstm\":\n            c = [None] * self.n_layers\n            h[0], c[0] = self.rnn[0](\n                self.dropout[0](emb), (state[\"h\"][0], state[\"c\"][0])\n            )\n            for n in range(1, self.n_layers):\n                h[n], c[n] = self.rnn[n](\n                    self.dropout[n](h[n - 1]), (state[\"h\"][n], state[\"c\"][n])\n                )\n            state = {\"c\": c, \"h\": h}\n        else:\n            h[0] = self.rnn[0](self.dropout[0](emb), state[\"h\"][0])\n            for n in range(1, self.n_layers):\n                h[n] = self.rnn[n](self.dropout[n](h[n - 1]), state[\"h\"][n])\n            state = {\"h\": h}\n        y = self.lo(self.dropout[-1](h[-1]))\n        return state, y\n"
  },
  {
    "path": "nets/pytorch_backend/lm/seq_rnn.py",
    "content": "\"\"\"Sequential implementation of Recurrent Neural Network Language Model.\"\"\"\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom espnet.nets.lm_interface import LMInterface\n\n\nclass SequentialRNNLM(LMInterface, torch.nn.Module):\n    \"\"\"Sequential RNNLM.\n\n    See also:\n        https://github.com/pytorch/examples/blob/4581968193699de14b56527296262dd76ab43557/word_language_model/model.py\n\n    \"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments to command line argument parser.\"\"\"\n        parser.add_argument(\n            \"--type\",\n            type=str,\n            default=\"lstm\",\n            nargs=\"?\",\n            choices=[\"lstm\", \"gru\"],\n            help=\"Which type of RNN to use\",\n        )\n        parser.add_argument(\n            \"--layer\", \"-l\", type=int, default=2, help=\"Number of hidden layers\"\n        )\n        parser.add_argument(\n            \"--unit\", \"-u\", type=int, default=650, help=\"Number of hidden units\"\n        )\n        parser.add_argument(\n            \"--dropout-rate\", type=float, default=0.5, help=\"dropout probability\"\n        )\n        return parser\n\n    def __init__(self, n_vocab, args):\n        \"\"\"Initialize class.\n\n        Args:\n            n_vocab (int): The size of the vocabulary\n            args (argparse.Namespace): configurations. 
see py:method:`add_arguments`\n\n        \"\"\"\n        torch.nn.Module.__init__(self)\n        self._setup(\n            rnn_type=args.type.upper(),\n            ntoken=n_vocab,\n            ninp=args.unit,\n            nhid=args.unit,\n            nlayers=args.layer,\n            dropout=args.dropout_rate,\n        )\n\n    def _setup(\n        self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False\n    ):\n        self.drop = nn.Dropout(dropout)\n        self.encoder = nn.Embedding(ntoken, ninp)\n        if rnn_type in [\"LSTM\", \"GRU\"]:\n            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)\n        else:\n            try:\n                nonlinearity = {\"RNN_TANH\": \"tanh\", \"RNN_RELU\": \"relu\"}[rnn_type]\n            except KeyError:\n                raise ValueError(\n                    \"An invalid option for `--model` was supplied, \"\n                    \"options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']\"\n                )\n            self.rnn = nn.RNN(\n                ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout\n            )\n        self.decoder = nn.Linear(nhid, ntoken)\n\n        # Optionally tie weights as in:\n        # \"Using the Output Embedding to Improve Language Models\" (Press & Wolf 2016)\n        # https://arxiv.org/abs/1608.05859\n        # and\n        # \"Tying Word Vectors and Word Classifiers:\n        #  A Loss Framework for Language Modeling\" (Inan et al. 
2016)\n        # https://arxiv.org/abs/1611.01462\n        if tie_weights:\n            if nhid != ninp:\n                raise ValueError(\n                    \"When using the tied flag, nhid must be equal to emsize\"\n                )\n            self.decoder.weight = self.encoder.weight\n\n        self._init_weights()\n\n        self.rnn_type = rnn_type\n        self.nhid = nhid\n        self.nlayers = nlayers\n\n    def _init_weights(self):\n        # NOTE: original init in pytorch/examples\n        # initrange = 0.1\n        # self.encoder.weight.data.uniform_(-initrange, initrange)\n        # self.decoder.bias.data.zero_()\n        # self.decoder.weight.data.uniform_(-initrange, initrange)\n        # NOTE: our default.py:RNNLM init\n        for param in self.parameters():\n            param.data.uniform_(-0.1, 0.1)\n\n    def forward(self, x, t):\n        \"\"\"Compute LM loss value from buffer sequences.\n\n        Args:\n            x (torch.Tensor): Input ids. (batch, len)\n            t (torch.Tensor): Target ids. 
(batch, len)\n\n        Returns:\n            tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Tuple of\n                loss to backward (scalar),\n                negative log-likelihood of t: -log p(t) (scalar) and\n                the number of elements in x (scalar)\n\n        Notes:\n            The last two return values are used\n            in perplexity: p(t)^{-n} = exp(-log p(t) / n)\n\n        \"\"\"\n        y = self._before_loss(x, None)[0]\n        mask = (x != 0).to(y.dtype)\n        loss = F.cross_entropy(y.view(-1, y.shape[-1]), t.view(-1), reduction=\"none\")\n        logp = loss * mask.view(-1)\n        logp = logp.sum()\n        count = mask.sum()\n        return logp / count, logp, count\n\n    def _before_loss(self, input, hidden):\n        emb = self.drop(self.encoder(input))\n        output, hidden = self.rnn(emb, hidden)\n        output = self.drop(output)\n        decoded = self.decoder(\n            output.view(output.size(0) * output.size(1), output.size(2))\n        )\n        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden\n\n    def init_state(self, x):\n        \"\"\"Get an initial state for decoding.\n\n        Args:\n            x (torch.Tensor): The encoded feature tensor\n\n        Returns: initial state\n\n        \"\"\"\n        bsz = 1\n        weight = next(self.parameters())\n        if self.rnn_type == \"LSTM\":\n            return (\n                weight.new_zeros(self.nlayers, bsz, self.nhid),\n                weight.new_zeros(self.nlayers, bsz, self.nhid),\n            )\n        else:\n            return weight.new_zeros(self.nlayers, bsz, self.nhid)\n\n    def score(self, y, state, x):\n        \"\"\"Score new token.\n\n        Args:\n            y (torch.Tensor): 1D torch.int64 prefix tokens.\n            state: Scorer state for prefix tokens\n            x (torch.Tensor): 2D encoder feature that generates ys.\n\n        Returns:\n            tuple[torch.Tensor, Any]: Tuple of\n         
       torch.float32 scores for next token (n_vocab)\n                and next state for ys\n\n        \"\"\"\n        y, new_state = self._before_loss(y[-1].view(1, 1), state)\n        logp = y.log_softmax(dim=-1).view(-1)\n        return logp, new_state\n"
  },
  {
    "path": "nets/pytorch_backend/lm/transformer.py",
    "content": "\"\"\"Transformer language model.\"\"\"\n\nfrom typing import Any\nfrom typing import List\nfrom typing import Tuple\n\nimport logging\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom espnet.nets.lm_interface import LMInterface\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.scorer_interface import BatchScorerInterface\nfrom espnet.utils.cli_utils import strtobool\n\n\nclass TransformerLM(nn.Module, LMInterface, BatchScorerInterface):\n    \"\"\"Transformer language model.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add arguments to command line argument parser.\"\"\"\n        parser.add_argument(\n            \"--layer\", type=int, default=4, help=\"Number of hidden layers\"\n        )\n        parser.add_argument(\n            \"--unit\",\n            type=int,\n            default=1024,\n            help=\"Number of hidden units in feedforward layer\",\n        )\n        parser.add_argument(\n            \"--att-unit\",\n            type=int,\n            default=256,\n            help=\"Number of hidden units in attention layer\",\n        )\n        parser.add_argument(\n            \"--embed-unit\",\n            type=int,\n            default=128,\n            help=\"Number of hidden units in embedding layer\",\n        )\n        parser.add_argument(\n            \"--head\", type=int, default=2, help=\"Number of multi head attention\"\n        )\n        parser.add_argument(\n            \"--dropout-rate\", type=float, default=0.5, help=\"dropout probability\"\n        )\n        parser.add_argument(\n            \"--att-dropout-rate\",\n            type=float,\n            default=0.0,\n            help=\"att dropout probability\",\n        )\n        parser.add_argument(\n            
\"--emb-dropout-rate\",\n            type=float,\n            default=0.0,\n            help=\"emb dropout probability\",\n        )\n        parser.add_argument(\n            \"--tie-weights\",\n            type=strtobool,\n            default=False,\n            help=\"Tie input and output embeddings\",\n        )\n        parser.add_argument(\n            \"--pos-enc\",\n            default=\"sinusoidal\",\n            choices=[\"sinusoidal\", \"none\"],\n            help=\"positional encoding\",\n        )\n        return parser\n\n    def __init__(self, n_vocab, args):\n        \"\"\"Initialize class.\n\n        Args:\n            n_vocab (int): The size of the vocabulary\n            args (argparse.Namespace): configurations. see py:method:`add_arguments`\n\n        \"\"\"\n        nn.Module.__init__(self)\n\n        # NOTE: for a compatibility with less than 0.9.7 version models\n        emb_dropout_rate = getattr(args, \"emb_dropout_rate\", 0.0)\n        # NOTE: for a compatibility with less than 0.9.7 version models\n        tie_weights = getattr(args, \"tie_weights\", False)\n        # NOTE: for a compatibility with less than 0.9.7 version models\n        att_dropout_rate = getattr(args, \"att_dropout_rate\", 0.0)\n\n        if args.pos_enc == \"sinusoidal\":\n            pos_enc_class = PositionalEncoding\n        elif args.pos_enc == \"none\":\n\n            def pos_enc_class(*args, **kwargs):\n                return nn.Sequential()  # indentity\n\n        else:\n            raise ValueError(f\"unknown pos-enc option: {args.pos_enc}\")\n\n        self.embed = nn.Embedding(n_vocab, args.embed_unit)\n\n        if emb_dropout_rate == 0.0:\n            self.embed_drop = None\n        else:\n            self.embed_drop = nn.Dropout(emb_dropout_rate)\n\n        self.encoder = Encoder(\n            idim=args.embed_unit,\n            attention_dim=args.att_unit,\n            attention_heads=args.head,\n            linear_units=args.unit,\n            
num_blocks=args.layer,\n            dropout_rate=args.dropout_rate,\n            attention_dropout_rate=att_dropout_rate,\n            input_layer=\"linear\",\n            pos_enc_class=pos_enc_class,\n        )\n        self.decoder = nn.Linear(args.att_unit, n_vocab)\n\n        logging.info(\"Tie weights set to {}\".format(tie_weights))\n        logging.info(\"Dropout set to {}\".format(args.dropout_rate))\n        logging.info(\"Emb Dropout set to {}\".format(emb_dropout_rate))\n        logging.info(\"Att Dropout set to {}\".format(att_dropout_rate))\n\n        if tie_weights:\n            assert (\n                args.att_unit == args.embed_unit\n            ), \"Tie Weights: True need embedding and final dimensions to match\"\n            self.decoder.weight = self.embed.weight\n\n    def _target_mask(self, ys_in_pad):\n        ys_mask = ys_in_pad != 0\n        m = subsequent_mask(ys_mask.size(-1), device=ys_mask.device).unsqueeze(0)\n        return ys_mask.unsqueeze(-2) & m\n\n    def forward(\n        self, x: torch.Tensor, t: torch.Tensor\n    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n        \"\"\"Compute LM loss value from buffer sequences.\n\n        Args:\n            x (torch.Tensor): Input ids. (batch, len)\n            t (torch.Tensor): Target ids. 
(batch, len)\n\n        Returns:\n            tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Tuple of\n                loss to backward (scalar),\n                negative log-likelihood of t: -log p(t) (scalar) and\n                the number of elements in x (scalar)\n\n        Notes:\n            The last two return values are used\n            in perplexity: p(t)^{-n} = exp(-log p(t) / n)\n\n        \"\"\"\n        xm = x != 0\n\n        if self.embed_drop is not None:\n            emb = self.embed_drop(self.embed(x))\n        else:\n            emb = self.embed(x)\n\n        h, _ = self.encoder(emb, self._target_mask(x))\n        y = self.decoder(h)\n        loss = F.cross_entropy(y.view(-1, y.shape[-1]), t.view(-1), reduction=\"none\")\n        mask = xm.to(dtype=loss.dtype)\n        logp = loss * mask.view(-1)\n        logp = logp.sum()\n        count = mask.sum()\n        return logp / count, logp, count\n\n    def score(\n        self, y: torch.Tensor, state: Any, x: torch.Tensor\n    ) -> Tuple[torch.Tensor, Any]:\n        \"\"\"Score new token.\n\n        Args:\n            y (torch.Tensor): 1D torch.int64 prefix tokens.\n            state: Scorer state for prefix tokens\n            x (torch.Tensor): encoder feature that generates ys.\n\n        Returns:\n            tuple[torch.Tensor, Any]: Tuple of\n                torch.float32 scores for next token (n_vocab)\n                and next state for ys\n\n        \"\"\"\n        y = y.unsqueeze(0)\n        if self.embed_drop is not None:\n            emb = self.embed_drop(self.embed(y))\n        else:\n            emb = self.embed(y)\n\n        h, _, cache = self.encoder.forward_one_step(\n            emb, self._target_mask(y), cache=state\n        )\n        h = self.decoder(h[:, -1])\n        logp = h.log_softmax(dim=-1).squeeze(0)\n        return logp, cache\n\n    def score_partial(\n        self, y: torch.Tensor, next_tokens: Any, state: Any, x: torch.Tensor\n    ) -> Tuple[torch.Tensor, Any]:\n   
     scores, state = self.score(y, state, x)\n        scores = scores[next_tokens]\n        return scores, state\n\n    def select_state(self, states, i):\n        return states\n\n    # batch beam search API (see BatchScorerInterface)\n    def batch_score(\n        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor\n    ) -> Tuple[torch.Tensor, List[Any]]:\n        \"\"\"Score new token batch (required).\n\n        Args:\n            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).\n            states (List[Any]): Scorer states for prefix tokens.\n            xs (torch.Tensor):\n                The encoder feature that generates ys (n_batch, xlen, n_feat).\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        # merge states\n        n_batch = len(ys)\n        n_layers = len(self.encoder.encoders)\n        if states[0] is None:\n            batch_state = None\n        else:\n            # transpose state of [batch, layer] into [layer, batch]\n            batch_state = [\n                torch.stack([states[b][i] for b in range(n_batch)])\n                for i in range(n_layers)\n            ]\n\n        if self.embed_drop is not None:\n            emb = self.embed_drop(self.embed(ys))\n        else:\n            emb = self.embed(ys)\n\n        # batch decoding\n        h, _, states = self.encoder.forward_one_step(\n            emb, self._target_mask(ys), cache=batch_state\n        )\n        h = self.decoder(h[:, -1])\n        logp = h.log_softmax(dim=-1)\n\n        # transpose state of [layer, batch] into [batch, layer]\n        state_list = [[states[i][b] for i in range(n_layers)] for b in range(n_batch)]\n        return logp, state_list\n"
  },
  {
    "path": "nets/pytorch_backend/maskctc/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/maskctc/add_mask_token.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Waseda University (Yosuke Higuchi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Token masking module for Masked LM.\"\"\"\n\nimport numpy\n\n\ndef mask_uniform(ys_pad, mask_token, eos, ignore_id):\n    \"\"\"Replace random tokens with <mask> label and add <eos> label.\n\n    The number of <mask> is chosen from a uniform distribution\n    between one and the target sequence's length.\n    :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n    :param int mask_token: index of <mask>\n    :param int eos: index of <eos>\n    :param int ignore_id: index of padding\n    :return: padded tensor (B, Lmax)\n    :rtype: torch.Tensor\n    :return: padded tensor (B, Lmax)\n    :rtype: torch.Tensor\n    \"\"\"\n    from espnet.nets.pytorch_backend.nets_utils import pad_list\n\n    ys = [y[y != ignore_id] for y in ys_pad]  # parse padded ys\n    ys_out = [y.new(y.size()).fill_(ignore_id) for y in ys]\n    ys_in = [y.clone() for y in ys]\n    for i in range(len(ys)):\n        num_samples = numpy.random.randint(1, len(ys[i]) + 1)\n        idx = numpy.random.choice(len(ys[i]), num_samples)\n\n        ys_in[i][idx] = mask_token\n        ys_out[i][idx] = ys[i][idx]\n\n    return pad_list(ys_in, eos), pad_list(ys_out, ignore_id)\n"
  },
  {
    "path": "nets/pytorch_backend/maskctc/mask.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Johns Hopkins University (Shinji Watanabe)\n#                Waseda University (Yosuke Higuchi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Attention masking module for Masked LM.\"\"\"\n\n\ndef square_mask(ys_in_pad, ignore_id):\n    \"\"\"Create attention mask to avoid attending on padding tokens.\n\n    :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n    :param int ignore_id: index of padding\n    :param torch.dtype dtype: result dtype\n    :rtype: torch.Tensor (B, Lmax, Lmax)\n    \"\"\"\n    ys_mask = (ys_in_pad != ignore_id).unsqueeze(-2)\n    ymax = ys_mask.size(-1)\n    ys_mask_tmp = ys_mask.transpose(1, 2).repeat(1, 1, ymax)\n    ys_mask = ys_mask.repeat(1, ymax, 1) & ys_mask_tmp\n\n    return ys_mask\n"
  },
  {
    "path": "nets/pytorch_backend/nets_utils.py",
    "content": "# -*- coding: utf-8 -*-\n\n\"\"\"Network related utility tools.\"\"\"\n\nimport logging\nfrom typing import Dict\n\nimport numpy as np\nimport torch\n\n\ndef to_device(m, x):\n    \"\"\"Send tensor into the device of the module.\n\n    Args:\n        m (torch.nn.Module): Torch module.\n        x (Tensor): Torch tensor.\n\n    Returns:\n        Tensor: Torch tensor located in the same place as torch module.\n\n    \"\"\"\n    if isinstance(m, torch.nn.Module):\n        device = next(m.parameters()).device\n    elif isinstance(m, torch.Tensor):\n        device = m.device\n    else:\n        raise TypeError(\n            \"Expected torch.nn.Module or torch.tensor, \" f\"bot got: {type(m)}\"\n        )\n    return x.to(device)\n\n\ndef pad_list(xs, pad_value):\n    \"\"\"Perform padding for the list of tensors.\n\n    Args:\n        xs (List): List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)].\n        pad_value (float): Value for padding.\n\n    Returns:\n        Tensor: Padded tensor (B, Tmax, `*`).\n\n    Examples:\n        >>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]\n        >>> x\n        [tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]\n        >>> pad_list(x, 0)\n        tensor([[1., 1., 1., 1.],\n                [1., 1., 0., 0.],\n                [1., 0., 0., 0.]])\n\n    \"\"\"\n    n_batch = len(xs)\n    max_len = max(x.size(0) for x in xs)\n    pad = xs[0].new(n_batch, max_len, *xs[0].size()[1:]).fill_(pad_value)\n\n    for i in range(n_batch):\n        pad[i, : xs[i].size(0)] = xs[i]\n\n    return pad\n\n\ndef make_pad_mask(lengths, xs=None, length_dim=-1):\n    \"\"\"Make mask tensor containing indices of padded part.\n\n    Args:\n        lengths (LongTensor or List): Batch of lengths (B,).\n        xs (Tensor, optional): The reference tensor.\n            If set, masks will be the same shape as this tensor.\n        length_dim (int, optional): Dimension indicator of the above tensor.\n            See the 
example.\n\n    Returns:\n        Tensor: Mask tensor containing indices of padded part.\n                dtype=torch.uint8 in PyTorch 1.2-\n                dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n    Examples:\n        With only lengths.\n\n        >>> lengths = [5, 3, 2]\n        >>> make_pad_mask(lengths)\n        masks = [[0, 0, 0, 0, 0],\n                 [0, 0, 0, 1, 1],\n                 [0, 0, 1, 1, 1]]\n\n        With the reference tensor.\n\n        >>> xs = torch.zeros((3, 2, 4))\n        >>> make_pad_mask(lengths, xs)\n        tensor([[[0, 0, 0, 0],\n                 [0, 0, 0, 0]],\n                [[0, 0, 0, 1],\n                 [0, 0, 0, 1]],\n                [[0, 0, 1, 1],\n                 [0, 0, 1, 1]]], dtype=torch.uint8)\n        >>> xs = torch.zeros((3, 2, 6))\n        >>> make_pad_mask(lengths, xs)\n        tensor([[[0, 0, 0, 0, 0, 1],\n                 [0, 0, 0, 0, 0, 1]],\n                [[0, 0, 0, 1, 1, 1],\n                 [0, 0, 0, 1, 1, 1]],\n                [[0, 0, 1, 1, 1, 1],\n                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)\n\n        With the reference tensor and dimension indicator.\n\n        >>> xs = torch.zeros((3, 6, 6))\n        >>> make_pad_mask(lengths, xs, 1)\n        tensor([[[0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [1, 1, 1, 1, 1, 1]],\n                [[0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1]],\n                [[0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)\n        >>> make_pad_mask(lengths, xs, 2)\n  
      tensor([[[0, 0, 0, 0, 0, 1],\n                 [0, 0, 0, 0, 0, 1],\n                 [0, 0, 0, 0, 0, 1],\n                 [0, 0, 0, 0, 0, 1],\n                 [0, 0, 0, 0, 0, 1],\n                 [0, 0, 0, 0, 0, 1]],\n                [[0, 0, 0, 1, 1, 1],\n                 [0, 0, 0, 1, 1, 1],\n                 [0, 0, 0, 1, 1, 1],\n                 [0, 0, 0, 1, 1, 1],\n                 [0, 0, 0, 1, 1, 1],\n                 [0, 0, 0, 1, 1, 1]],\n                [[0, 0, 1, 1, 1, 1],\n                 [0, 0, 1, 1, 1, 1],\n                 [0, 0, 1, 1, 1, 1],\n                 [0, 0, 1, 1, 1, 1],\n                 [0, 0, 1, 1, 1, 1],\n                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)\n\n    \"\"\"\n    if length_dim == 0:\n        raise ValueError(\"length_dim cannot be 0: {}\".format(length_dim))\n\n    if not isinstance(lengths, list):\n        lengths = lengths.tolist()\n    bs = int(len(lengths))\n    if xs is None:\n        maxlen = int(max(lengths))\n    else:\n        maxlen = xs.size(length_dim)\n\n    seq_range = torch.arange(0, maxlen, dtype=torch.int64)\n    seq_range_expand = seq_range.unsqueeze(0).expand(bs, maxlen)\n    seq_length_expand = seq_range_expand.new(lengths).unsqueeze(-1)\n    mask = seq_range_expand >= seq_length_expand\n\n    if xs is not None:\n        assert xs.size(0) == bs, (xs.size(0), bs)\n\n        if length_dim < 0:\n            length_dim = xs.dim() + length_dim\n        # ind = (:, None, ..., None, :, , None, ..., None)\n        ind = tuple(\n            slice(None) if i in (0, length_dim) else None for i in range(xs.dim())\n        )\n        mask = mask[ind].expand_as(xs).to(xs.device)\n    return mask\n\n\ndef make_non_pad_mask(lengths, xs=None, length_dim=-1):\n    \"\"\"Make mask tensor containing indices of non-padded part.\n\n    Args:\n        lengths (LongTensor or List): Batch of lengths (B,).\n        xs (Tensor, optional): The reference tensor.\n            If set, masks will be the same shape as this 
tensor.\n        length_dim (int, optional): Dimension indicator of the above tensor.\n            See the example.\n\n    Returns:\n        Tensor: Mask tensor containing indices of non-padded part.\n                dtype=torch.uint8 in PyTorch 1.2-\n                dtype=torch.bool in PyTorch 1.2+ (including 1.2)\n\n    Examples:\n        With only lengths.\n\n        >>> lengths = [5, 3, 2]\n        >>> make_non_pad_mask(lengths)\n        masks = [[1, 1, 1, 1, 1],\n                 [1, 1, 1, 0, 0],\n                 [1, 1, 0, 0, 0]]\n\n        With the reference tensor.\n\n        >>> xs = torch.zeros((3, 2, 4))\n        >>> make_non_pad_mask(lengths, xs)\n        tensor([[[1, 1, 1, 1],\n                 [1, 1, 1, 1]],\n                [[1, 1, 1, 0],\n                 [1, 1, 1, 0]],\n                [[1, 1, 0, 0],\n                 [1, 1, 0, 0]]], dtype=torch.uint8)\n        >>> xs = torch.zeros((3, 2, 6))\n        >>> make_non_pad_mask(lengths, xs)\n        tensor([[[1, 1, 1, 1, 1, 0],\n                 [1, 1, 1, 1, 1, 0]],\n                [[1, 1, 1, 0, 0, 0],\n                 [1, 1, 1, 0, 0, 0]],\n                [[1, 1, 0, 0, 0, 0],\n                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)\n\n        With the reference tensor and dimension indicator.\n\n        >>> xs = torch.zeros((3, 6, 6))\n        >>> make_non_pad_mask(lengths, xs, 1)\n        tensor([[[1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [0, 0, 0, 0, 0, 0]],\n                [[1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0]],\n                [[1, 1, 1, 1, 1, 1],\n                 [1, 1, 1, 1, 1, 1],\n                 [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0],\n           
      [0, 0, 0, 0, 0, 0],\n                 [0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)\n        >>> make_non_pad_mask(lengths, xs, 2)\n        tensor([[[1, 1, 1, 1, 1, 0],\n                 [1, 1, 1, 1, 1, 0],\n                 [1, 1, 1, 1, 1, 0],\n                 [1, 1, 1, 1, 1, 0],\n                 [1, 1, 1, 1, 1, 0],\n                 [1, 1, 1, 1, 1, 0]],\n                [[1, 1, 1, 0, 0, 0],\n                 [1, 1, 1, 0, 0, 0],\n                 [1, 1, 1, 0, 0, 0],\n                 [1, 1, 1, 0, 0, 0],\n                 [1, 1, 1, 0, 0, 0],\n                 [1, 1, 1, 0, 0, 0]],\n                [[1, 1, 0, 0, 0, 0],\n                 [1, 1, 0, 0, 0, 0],\n                 [1, 1, 0, 0, 0, 0],\n                 [1, 1, 0, 0, 0, 0],\n                 [1, 1, 0, 0, 0, 0],\n                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)\n\n    \"\"\"\n    return ~make_pad_mask(lengths, xs, length_dim)\n\n\ndef mask_by_length(xs, lengths, fill=0):\n    \"\"\"Mask tensor according to length.\n\n    Args:\n        xs (Tensor): Batch of input tensor (B, `*`).\n        lengths (LongTensor or List): Batch of lengths (B,).\n        fill (int or float): Value to fill masked part.\n\n    Returns:\n        Tensor: Batch of masked input tensor (B, `*`).\n\n    Examples:\n        >>> x = torch.arange(5).repeat(3, 1) + 1\n        >>> x\n        tensor([[1, 2, 3, 4, 5],\n                [1, 2, 3, 4, 5],\n                [1, 2, 3, 4, 5]])\n        >>> lengths = [5, 3, 2]\n        >>> mask_by_length(x, lengths)\n        tensor([[1, 2, 3, 4, 5],\n                [1, 2, 3, 0, 0],\n                [1, 2, 0, 0, 0]])\n\n    \"\"\"\n    assert xs.size(0) == len(lengths)\n    ret = xs.data.new(*xs.size()).fill_(fill)\n    for i, l in enumerate(lengths):\n        ret[i, :l] = xs[i, :l]\n    return ret\n\n\ndef th_accuracy(pad_outputs, pad_targets, ignore_label):\n    \"\"\"Calculate accuracy.\n\n    Args:\n        pad_outputs (Tensor): Prediction tensors (B * Lmax, D).\n        pad_targets 
(LongTensor): Target label tensors (B, Lmax).\n        ignore_label (int): Ignore label id.\n\n    Returns:\n        float: Accuracy value (0.0 - 1.0).\n\n    \"\"\"\n    pad_pred = pad_outputs.view(\n        pad_targets.size(0), pad_targets.size(1), pad_outputs.size(1)\n    ).argmax(2)\n    mask = pad_targets != ignore_label\n    numerator = torch.sum(\n        pad_pred.masked_select(mask) == pad_targets.masked_select(mask)\n    )\n    denominator = torch.sum(mask)\n    return float(numerator) / float(denominator)\n\n\ndef to_torch_tensor(x):\n    \"\"\"Change to torch.Tensor or ComplexTensor from numpy.ndarray.\n\n    Args:\n        x: Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.\n\n    Returns:\n        Tensor or ComplexTensor: Type converted inputs.\n\n    Examples:\n        >>> xs = np.ones(3, dtype=np.float32)\n        >>> xs = to_torch_tensor(xs)\n        tensor([1., 1., 1.])\n        >>> xs = torch.ones(3, 4, 5)\n        >>> assert to_torch_tensor(xs) is xs\n        >>> xs = {'real': xs, 'imag': xs}\n        >>> to_torch_tensor(xs)\n        ComplexTensor(\n        Real:\n        tensor([1., 1., 1.])\n        Imag;\n        tensor([1., 1., 1.])\n        )\n\n    \"\"\"\n    # If numpy, change to torch tensor\n    if isinstance(x, np.ndarray):\n        if x.dtype.kind == \"c\":\n            # Dynamically importing because torch_complex requires python3\n            from torch_complex.tensor import ComplexTensor\n\n            return ComplexTensor(x)\n        else:\n            return torch.from_numpy(x)\n\n    # If {'real': ..., 'imag': ...}, convert to ComplexTensor\n    elif isinstance(x, dict):\n        # Dynamically importing because torch_complex requires python3\n        from torch_complex.tensor import ComplexTensor\n\n        if \"real\" not in x or \"imag\" not in x:\n            raise ValueError(\"x must have 'real' and 'imag' keys: {}\".format(list(x)))\n        # Relative importing because of using python3 syntax\n        
return ComplexTensor(x[\"real\"], x[\"imag\"])\n\n    # If torch.Tensor, as it is\n    elif isinstance(x, torch.Tensor):\n        return x\n\n    else:\n        error = (\n            \"x must be numpy.ndarray, torch.Tensor or a dict like \"\n            \"{{'real': torch.Tensor, 'imag': torch.Tensor}}, \"\n            \"but got {}\".format(type(x))\n        )\n        try:\n            from torch_complex.tensor import ComplexTensor\n        except Exception:\n            # If PY2\n            raise ValueError(error)\n        else:\n            # If PY3\n            if isinstance(x, ComplexTensor):\n                return x\n            else:\n                raise ValueError(error)\n\n\ndef get_subsample(train_args, mode, arch):\n    \"\"\"Parse the subsampling factors from the args for the specified `mode` and `arch`.\n\n    Args:\n        train_args: argument Namespace containing options.\n        mode: one of ('asr', 'mt', 'st')\n        arch: one of ('rnn', 'rnn-t', 'rnn_mix', 'rnn_mulenc', 'transformer')\n\n    Returns:\n        np.ndarray / List[np.ndarray]: subsampling factors.\n    \"\"\"\n    if arch == \"transformer\":\n        return np.array([1])\n\n    elif mode == \"mt\" and arch == \"rnn\":\n        # +1 means input (+1) and layers outputs (train_args.elayers)\n        # NOTE: the np.int alias was removed in NumPy 1.24; builtin int is equivalent\n        subsample = np.ones(train_args.elayers + 1, dtype=int)\n        logging.warning(\"Subsampling is not performed for machine translation.\")\n        logging.info(\"subsample: \" + \" \".join([str(x) for x in subsample]))\n        return subsample\n\n    elif (\n        (mode == \"asr\" and arch in (\"rnn\", \"rnn-t\"))\n        or (mode == \"mt\" and arch == \"rnn\")\n        or (mode == \"st\" and arch == \"rnn\")\n    ):\n        subsample = np.ones(train_args.elayers + 1, dtype=int)\n        if train_args.etype.endswith(\"p\") and not train_args.etype.startswith(\"vgg\"):\n            ss = train_args.subsample.split(\"_\")\n            for j in range(min(train_args.elayers + 1, 
len(ss))):\n                subsample[j] = int(ss[j])\n        else:\n            logging.warning(\n                \"Subsampling is not performed for vgg*. \"\n                \"It is performed in max pooling layers at CNN.\"\n            )\n        logging.info(\"subsample: \" + \" \".join([str(x) for x in subsample]))\n        return subsample\n\n    elif mode == \"asr\" and arch == \"rnn_mix\":\n        subsample = np.ones(\n            train_args.elayers_sd + train_args.elayers + 1, dtype=int\n        )\n        if train_args.etype.endswith(\"p\") and not train_args.etype.startswith(\"vgg\"):\n            ss = train_args.subsample.split(\"_\")\n            for j in range(\n                min(train_args.elayers_sd + train_args.elayers + 1, len(ss))\n            ):\n                subsample[j] = int(ss[j])\n        else:\n            logging.warning(\n                \"Subsampling is not performed for vgg*. \"\n                \"It is performed in max pooling layers at CNN.\"\n            )\n        logging.info(\"subsample: \" + \" \".join([str(x) for x in subsample]))\n        return subsample\n\n    elif mode == \"asr\" and arch == \"rnn_mulenc\":\n        subsample_list = []\n        for idx in range(train_args.num_encs):\n            subsample = np.ones(train_args.elayers[idx] + 1, dtype=int)\n            if train_args.etype[idx].endswith(\"p\") and not train_args.etype[\n                idx\n            ].startswith(\"vgg\"):\n                ss = train_args.subsample[idx].split(\"_\")\n                for j in range(min(train_args.elayers[idx] + 1, len(ss))):\n                    subsample[j] = int(ss[j])\n            else:\n                logging.warning(\n                    \"Encoder %d: Subsampling is not performed for vgg*. 
\"\n                    \"It is performed in max pooling layers at CNN.\",\n                    idx + 1,\n                )\n            logging.info(\"subsample: \" + \" \".join([str(x) for x in subsample]))\n            subsample_list.append(subsample)\n        return subsample_list\n\n    else:\n        raise ValueError(\"Invalid options: mode={}, arch={}\".format(mode, arch))\n\n\ndef rename_state_dict(\n    old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor]\n):\n    \"\"\"Replace keys of old prefix with new prefix in state dict.\"\"\"\n    # need this list not to break the dict iterator\n    old_keys = [k for k in state_dict if k.startswith(old_prefix)]\n    if len(old_keys) > 0:\n        logging.warning(f\"Rename: {old_prefix} -> {new_prefix}\")\n    for k in old_keys:\n        v = state_dict.pop(k)\n        new_k = k.replace(old_prefix, new_prefix)\n        state_dict[new_k] = v\n\n\ndef get_activation(act):\n    \"\"\"Return activation function.\"\"\"\n    # Lazy load to avoid unused import\n    from espnet.nets.pytorch_backend.conformer.swish import Swish\n\n    activation_funcs = {\n        \"hardtanh\": torch.nn.Hardtanh,\n        \"tanh\": torch.nn.Tanh,\n        \"relu\": torch.nn.ReLU,\n        \"selu\": torch.nn.SELU,\n        \"swish\": Swish,\n    }\n\n    return activation_funcs[act]()\n"
  },
  {
    "path": "nets/pytorch_backend/rnn/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/rnn/argument.py",
    "content": "# Copyright 2020 Hirofumi Inaguma\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Conformer common arguments.\"\"\"\n\n\ndef add_arguments_rnn_encoder_common(group):\n    \"\"\"Define common arguments for RNN encoder.\"\"\"\n    group.add_argument(\n        \"--etype\",\n        default=\"blstmp\",\n        type=str,\n        choices=[\n            \"lstm\",\n            \"blstm\",\n            \"lstmp\",\n            \"blstmp\",\n            \"vgglstmp\",\n            \"vggblstmp\",\n            \"vgglstm\",\n            \"vggblstm\",\n            \"gru\",\n            \"bgru\",\n            \"grup\",\n            \"bgrup\",\n            \"vgggrup\",\n            \"vggbgrup\",\n            \"vgggru\",\n            \"vggbgru\",\n        ],\n        help=\"Type of encoder network architecture\",\n    )\n    group.add_argument(\n        \"--elayers\",\n        default=4,\n        type=int,\n        help=\"Number of encoder layers\",\n    )\n    group.add_argument(\n        \"--eunits\",\n        \"-u\",\n        default=300,\n        type=int,\n        help=\"Number of encoder hidden units\",\n    )\n    group.add_argument(\n        \"--eprojs\", default=320, type=int, help=\"Number of encoder projection units\"\n    )\n    group.add_argument(\n        \"--subsample\",\n        default=\"1\",\n        type=str,\n        help=\"Subsample input frames x_y_z means \"\n        \"subsample every x frame at 1st layer, \"\n        \"every y frame at 2nd layer etc.\",\n    )\n    return group\n\n\ndef add_arguments_rnn_decoder_common(group):\n    \"\"\"Define common arguments for RNN decoder.\"\"\"\n    group.add_argument(\n        \"--dtype\",\n        default=\"lstm\",\n        type=str,\n        choices=[\"lstm\", \"gru\"],\n        help=\"Type of decoder network architecture\",\n    )\n    group.add_argument(\n        \"--dlayers\", default=1, type=int, help=\"Number of decoder layers\"\n    )\n    group.add_argument(\n        
\"--dunits\", default=320, type=int, help=\"Number of decoder hidden units\"\n    )\n    group.add_argument(\n        \"--dropout-rate-decoder\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for the decoder\",\n    )\n    group.add_argument(\n        \"--sampling-probability\",\n        default=0.0,\n        type=float,\n        help=\"Ratio of predicted labels fed back to decoder\",\n    )\n    group.add_argument(\n        \"--lsm-type\",\n        const=\"\",\n        default=\"\",\n        type=str,\n        nargs=\"?\",\n        choices=[\"\", \"unigram\"],\n        help=\"Apply label smoothing with a specified distribution type\",\n    )\n    return group\n\n\ndef add_arguments_rnn_attention_common(group):\n    \"\"\"Define common arguments for RNN attention.\"\"\"\n    group.add_argument(\n        \"--atype\",\n        default=\"dot\",\n        type=str,\n        choices=[\n            \"noatt\",\n            \"dot\",\n            \"add\",\n            \"location\",\n            \"coverage\",\n            \"coverage_location\",\n            \"location2d\",\n            \"location_recurrent\",\n            \"multi_head_dot\",\n            \"multi_head_add\",\n            \"multi_head_loc\",\n            \"multi_head_multi_res_loc\",\n        ],\n        help=\"Type of attention architecture\",\n    )\n    group.add_argument(\n        \"--adim\",\n        default=320,\n        type=int,\n        help=\"Number of attention transformation dimensions\",\n    )\n    group.add_argument(\n        \"--awin\", default=5, type=int, help=\"Window size for location2d attention\"\n    )\n    group.add_argument(\n        \"--aheads\",\n        default=4,\n        type=int,\n        help=\"Number of heads for multi head attention\",\n    )\n    group.add_argument(\n        \"--aconv-chans\",\n        default=-1,\n        type=int,\n        help=\"Number of attention convolution channels \\\n                       (negative value indicates no 
location-aware attention)\",\n    )\n    group.add_argument(\n        \"--aconv-filts\",\n        default=100,\n        type=int,\n        help=\"Number of attention convolution filters \\\n                       (negative value indicates no location-aware attention)\",\n    )\n    group.add_argument(\n        \"--dropout-rate\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for the encoder\",\n    )\n    return group\n"
  },
  {
    "path": "nets/pytorch_backend/rnn/attentions.py",
    "content": "\"\"\"Attention modules for RNN.\"\"\"\n\nimport math\nimport six\n\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\n\n\ndef _apply_attention_constraint(\n    e, last_attended_idx, backward_window=1, forward_window=3\n):\n    \"\"\"Apply monotonic attention constraint.\n\n    This function apply the monotonic attention constraint\n    introduced in `Deep Voice 3: Scaling\n    Text-to-Speech with Convolutional Sequence Learning`_.\n\n    Args:\n        e (Tensor): Attention energy before applying softmax (1, T).\n        last_attended_idx (int): The index of the inputs of the last attended [0, T].\n        backward_window (int, optional): Backward window size in attention constraint.\n        forward_window (int, optional): Forward window size in attetion constraint.\n\n    Returns:\n        Tensor: Monotonic constrained attention energy (1, T).\n\n    .. 
_`Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning`:\n        https://arxiv.org/abs/1710.07654\n\n    \"\"\"\n    if e.size(0) != 1:\n        raise NotImplementedError(\"Batch attention constraining is not yet supported.\")\n    backward_idx = last_attended_idx - backward_window\n    forward_idx = last_attended_idx + forward_window\n    if backward_idx > 0:\n        e[:, :backward_idx] = -float(\"inf\")\n    if forward_idx < e.size(1):\n        e[:, forward_idx:] = -float(\"inf\")\n    return e\n\n\nclass NoAtt(torch.nn.Module):\n    \"\"\"No attention\"\"\"\n\n    def __init__(self):\n        super(NoAtt, self).__init__()\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.c = None\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.c = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):\n        \"\"\"NoAtt forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B, T_max, D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: dummy (does not use)\n        :param torch.Tensor att_prev: dummy (does not use)\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weights\n        :rtype: torch.Tensor\n        \"\"\"\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n\n        # initialize attention weight with uniform dist.\n        if att_prev is None:\n            # if no bias, 0 0-pad goes 0\n            mask = 1.0 - make_pad_mask(enc_hs_len).float()\n            att_prev = mask / 
mask.new(enc_hs_len).unsqueeze(-1)\n            att_prev = att_prev.to(self.enc_h)\n            self.c = torch.sum(\n                self.enc_h * att_prev.view(batch, self.h_length, 1), dim=1\n            )\n\n        return self.c, att_prev\n\n\nclass AttDot(torch.nn.Module):\n    \"\"\"Dot product attention\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, att_dim, han_mode=False):\n        super(AttDot, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):\n        \"\"\"AttDot forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: dummy (not used)\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weight (B x T_max)\n        :rtype: torch.Tensor\n        \"\"\"\n\n        batch = enc_hs_pad.size(0)\n        # pre-compute all h outside the decoder loop\n        if 
self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = torch.tanh(self.mlp_enc(self.enc_h))\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        e = torch.sum(\n            self.pre_compute_enc_h\n            * torch.tanh(self.mlp_dec(dec_z)).view(batch, 1, self.att_dim),\n            dim=2,\n        )  # utt x frame\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n        w = F.softmax(scaling * e, dim=1)\n\n        # weighted sum over frames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n        return c, w\n\n\nclass AttAdd(torch.nn.Module):\n    \"\"\"Additive attention\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, att_dim, han_mode=False):\n        super(AttAdd, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.gvec = torch.nn.Linear(att_dim, 1)\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n      
  self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):\n        \"\"\"AttAdd forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: dummy (does not use)\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weights (B x T_max)\n        :rtype: torch.Tensor\n        \"\"\"\n\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(torch.tanh(self.pre_compute_enc_h + dec_z_tiled)).squeeze(2)\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n        w = F.softmax(scaling * e, dim=1)\n\n        # weighted sum over flames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, 
self.h_length, 1), dim=1)\n\n        return c, w\n\n\nclass AttLoc(torch.nn.Module):\n    \"\"\"location-aware attention module.\n\n    Reference: Attention-Based Models for Speech Recognition\n        (https://arxiv.org/pdf/1506.07503.pdf)\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(\n        self, eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False\n    ):\n        super(AttLoc, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)\n        self.loc_conv = torch.nn.Conv2d(\n            1,\n            aconv_chans,\n            (1, 2 * aconv_filts + 1),\n            padding=(0, aconv_filts),\n            bias=False,\n        )\n        self.gvec = torch.nn.Linear(att_dim, 1)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(\n        self,\n        enc_hs_pad,\n        enc_hs_len,\n        dec_z,\n        att_prev,\n        scaling=2.0,\n        last_attended_idx=None,\n        backward_window=1,\n        forward_window=3,\n    ):\n        \"\"\"Calculate AttLoc forward propagation.\n\n        :param torch.Tensor enc_hs_pad: 
padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: previous attention weight (B x T_max)\n        :param float scaling: scaling parameter before applying softmax\n        :param int last_attended_idx: index of the inputs of the last attended\n        :param int backward_window: backward window size in attention constraint\n        :param int forward_window: forward window size in attention constraint\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weights (B x T_max)\n        :rtype: torch.Tensor\n        \"\"\"\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        # initialize attention weight with uniform dist.\n        if att_prev is None:\n            # if no bias, 0 0-pad goes 0\n            att_prev = 1.0 - make_pad_mask(enc_hs_len).to(\n                device=dec_z.device, dtype=dec_z.dtype\n            )\n            att_prev = att_prev / att_prev.new(enc_hs_len).unsqueeze(-1)\n\n        # att_prev: utt x frame -> utt x 1 x 1 x frame\n        # -> utt x att_conv_chans x 1 x frame\n        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))\n        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x 
att_conv_chans\n        att_conv = att_conv.squeeze(2).transpose(1, 2)\n        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim\n        att_conv = self.mlp_att(att_conv)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)\n        ).squeeze(2)\n\n        # NOTE: consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n\n        # apply monotonic attention constraint (mainly for TTS)\n        if last_attended_idx is not None:\n            e = _apply_attention_constraint(\n                e, last_attended_idx, backward_window, forward_window\n            )\n\n        w = F.softmax(scaling * e, dim=1)\n\n        # weighted sum over frames\n        # utt x hdim\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n\n        return c, w\n\n\nclass AttCov(torch.nn.Module):\n    \"\"\"Coverage mechanism attention\n\n    Reference: Get To The Point: Summarization with Pointer-Generator Networks\n       (https://arxiv.org/abs/1704.04368)\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, att_dim, han_mode=False):\n        super(AttCov, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.wvec = torch.nn.Linear(1, att_dim)\n        self.gvec = torch.nn.Linear(att_dim, 1)\n\n        self.dunits = 
dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0):\n        \"\"\"AttCov forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param list att_prev_list: list of previous attention weight\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: list of previous attention weights\n        :rtype: list\n        \"\"\"\n\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        # initialize attention weight with uniform dist.\n        if att_prev_list is None:\n            # if no bias, 0 0-pad goes 0\n            att_prev_list = to_device(\n                enc_hs_pad, (1.0 - make_pad_mask(enc_hs_len).float())\n            )\n            att_prev_list = [\n                att_prev_list / att_prev_list.new(enc_hs_len).unsqueeze(-1)\n            ]\n\n        # 
att_prev_list: L' * [B x T] => cov_vec B x T\n        cov_vec = sum(att_prev_list)\n        # cov_vec: B x T => B x T x 1 => B x T x att_dim\n        cov_vec = self.wvec(cov_vec.unsqueeze(-1))\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            torch.tanh(cov_vec + self.pre_compute_enc_h + dec_z_tiled)\n        ).squeeze(2)\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n        w = F.softmax(scaling * e, dim=1)\n        att_prev_list += [w]\n\n        # weighted sum over frames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n\n        return c, att_prev_list\n\n\nclass AttLoc2D(torch.nn.Module):\n    \"\"\"2D location-aware attention\n\n    This attention is an extended version of location aware attention.\n    It takes not only the attention weights from one frame before,\n    but also those from earlier frames into account.\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param int att_win: attention window size (default=5)\n    :param bool han_mode:\n        flag to switch on mode of hierarchical attention and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(\n        self, eprojs, dunits, att_dim, att_win, aconv_chans, aconv_filts, han_mode=False\n    ):\n        super(AttLoc2D, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, 
bias=False)\n        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)\n        self.loc_conv = torch.nn.Conv2d(\n            1,\n            aconv_chans,\n            (att_win, 2 * aconv_filts + 1),\n            padding=(0, aconv_filts),\n            bias=False,\n        )\n        self.gvec = torch.nn.Linear(att_dim, 1)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.aconv_chans = aconv_chans\n        self.att_win = att_win\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):\n        \"\"\"AttLoc2D forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: previous attention weight (B x att_win x T_max)\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weights (B x att_win x T_max)\n        :rtype: torch.Tensor\n        \"\"\"\n\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n 
       else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        # initialize attention weight with uniform dist.\n        if att_prev is None:\n            # B * [Li x att_win]\n            # if no bias, 0 0-pad goes 0\n            att_prev = to_device(enc_hs_pad, (1.0 - make_pad_mask(enc_hs_len).float()))\n            att_prev = att_prev / att_prev.new(enc_hs_len).unsqueeze(-1)\n            att_prev = att_prev.unsqueeze(1).expand(-1, self.att_win, -1)\n\n        # att_prev: B x att_win x Tmax -> B x 1 x att_win x Tmax -> B x C x 1 x Tmax\n        att_conv = self.loc_conv(att_prev.unsqueeze(1))\n        # att_conv: B x C x 1 x Tmax -> B x Tmax x C\n        att_conv = att_conv.squeeze(2).transpose(1, 2)\n        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim\n        att_conv = self.mlp_att(att_conv)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)\n        ).squeeze(2)\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n        w = F.softmax(scaling * e, dim=1)\n\n        # weighted sum over flames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n\n        # update att_prev: B x att_win x Tmax -> B x att_win+1 x Tmax\n        # -> B x att_win x Tmax\n        att_prev = torch.cat([att_prev, w.unsqueeze(1)], dim=1)\n        att_prev = att_prev[:, 1:]\n\n        return c, att_prev\n\n\nclass AttLocRec(torch.nn.Module):\n    \"\"\"location-aware recurrent attention\n\n    This attention is an extended version of location aware attention.\n    With the 
use of RNN,\n    it takes the effect of the history of attention weights into account.\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param bool han_mode:\n        flag to switch on mode of hierarchical attention and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(\n        self, eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False\n    ):\n        super(AttLocRec, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.loc_conv = torch.nn.Conv2d(\n            1,\n            aconv_chans,\n            (1, 2 * aconv_filts + 1),\n            padding=(0, aconv_filts),\n            bias=False,\n        )\n        self.att_lstm = torch.nn.LSTMCell(aconv_chans, att_dim, bias=False)\n        self.gvec = torch.nn.Linear(att_dim, 1)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev_states, scaling=2.0):\n        \"\"\"AttLocRec forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param tuple att_prev_states: previous attention weight and lstm states\n                                      ((B, 
T_max), ((B, att_dim), (B, att_dim)))\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weights and lstm states (w, (hx, cx))\n                 ((B, T_max), ((B, att_dim), (B, att_dim)))\n        :rtype: tuple\n        \"\"\"\n\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        if att_prev_states is None:\n            # initialize attention weight with uniform dist.\n            # if no bias, 0 0-pad goes 0\n            att_prev = to_device(enc_hs_pad, (1.0 - make_pad_mask(enc_hs_len).float()))\n            att_prev = att_prev / att_prev.new(enc_hs_len).unsqueeze(-1)\n\n            # initialize lstm states\n            att_h = enc_hs_pad.new_zeros(batch, self.att_dim)\n            att_c = enc_hs_pad.new_zeros(batch, self.att_dim)\n            att_states = (att_h, att_c)\n        else:\n            att_prev = att_prev_states[0]\n            att_states = att_prev_states[1]\n\n        # B x 1 x 1 x T -> B x C x 1 x T\n        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))\n        # apply non-linear\n        att_conv = F.relu(att_conv)\n        # B x C x 1 x T -> B x C x 1 x 1 -> B x C\n        att_conv = F.max_pool2d(att_conv, (1, att_conv.size(3))).view(batch, -1)\n\n        att_h, att_c = self.att_lstm(att_conv, att_states)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, 
self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            torch.tanh(att_h.unsqueeze(1) + self.pre_compute_enc_h + dec_z_tiled)\n        ).squeeze(2)\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n        w = F.softmax(scaling * e, dim=1)\n\n        # weighted sum over frames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n\n        return c, (w, (att_h, att_c))\n\n\nclass AttCovLoc(torch.nn.Module):\n    \"\"\"Coverage mechanism location aware attention\n\n    This attention is a combination of coverage and location-aware attentions.\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param bool han_mode:\n        flag to switch on mode of hierarchical attention and not store pre_compute_enc_h\n    \"\"\"\n\n    def __init__(\n        self, eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False\n    ):\n        super(AttCovLoc, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)\n        self.loc_conv = torch.nn.Conv2d(\n            1,\n            aconv_chans,\n            (1, 2 * aconv_filts + 1),\n            padding=(0, aconv_filts),\n            bias=False,\n        )\n        self.gvec = torch.nn.Linear(att_dim, 1)\n\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n   
     self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.aconv_chans = aconv_chans\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0):\n        \"\"\"AttCovLoc forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param list att_prev_list: list of previous attention weight\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: list of previous attention weights\n        :rtype: list\n        \"\"\"\n\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        # initialize attention weight with uniform dist.\n        if att_prev_list is None:\n            # if no bias, 0 0-pad goes 0\n            mask = 1.0 - make_pad_mask(enc_hs_len).float()\n            att_prev_list = [\n                to_device(enc_hs_pad, mask / mask.new(enc_hs_len).unsqueeze(-1))\n            ]\n\n        # att_prev_list: L' * [B x T] => cov_vec B x T\n        cov_vec = sum(att_prev_list)\n\n        # cov_vec: B x T -> B x 1 x 1 x T 
-> B x C x 1 x T\n        att_conv = self.loc_conv(cov_vec.view(batch, 1, 1, self.h_length))\n        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans\n        att_conv = att_conv.squeeze(2).transpose(1, 2)\n        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim\n        att_conv = self.mlp_att(att_conv)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)\n        ).squeeze(2)\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n        w = F.softmax(scaling * e, dim=1)\n        att_prev_list += [w]\n\n        # weighted sum over frames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n\n        return c, att_prev_list\n\n\nclass AttMultiHeadDot(torch.nn.Module):\n    \"\"\"Multi head dot product attention\n\n    Reference: Attention is all you need\n        (https://arxiv.org/abs/1706.03762)\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int aheads: # heads of multi head attention\n    :param int att_dim_k: dimension k in multi head attention\n    :param int att_dim_v: dimension v in multi head attention\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_k and pre_compute_v\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, aheads, att_dim_k, att_dim_v, han_mode=False):\n        super(AttMultiHeadDot, self).__init__()\n        self.mlp_q = torch.nn.ModuleList()\n        self.mlp_k = torch.nn.ModuleList()\n       
 self.mlp_v = torch.nn.ModuleList()\n        for _ in six.moves.range(aheads):\n            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]\n            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]\n            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]\n        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.aheads = aheads\n        self.att_dim_k = att_dim_k\n        self.att_dim_v = att_dim_v\n        self.scaling = 1.0 / math.sqrt(att_dim_k)\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):\n        \"\"\"AttMultiHeadDot forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: dummy (does not use)\n        :return: attention weighted encoder state (B x D_enc)\n        :rtype: torch.Tensor\n        :return: list of previous attention weight (B x T_max) * aheads\n        :rtype: list\n        \"\"\"\n\n        batch = enc_hs_pad.size(0)\n        # pre-compute all k and v outside the decoder loop\n        if self.pre_compute_k is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_k = [\n                torch.tanh(self.mlp_k[h](self.enc_h))\n                for h 
in six.moves.range(self.aheads)\n            ]\n\n        if self.pre_compute_v is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_v = [\n                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)\n            ]\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        c = []\n        w = []\n        for h in six.moves.range(self.aheads):\n            e = torch.sum(\n                self.pre_compute_k[h]\n                * torch.tanh(self.mlp_q[h](dec_z)).view(batch, 1, self.att_dim_k),\n                dim=2,\n            )  # utt x frame\n\n            # NOTE consider zero padding when compute w.\n            if self.mask is None:\n                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n            e.masked_fill_(self.mask, -float(\"inf\"))\n            w += [F.softmax(self.scaling * e, dim=1)]\n\n            # weighted sum over frames\n            # utt x hdim\n            # NOTE use bmm instead of sum(*)\n            c += [\n                torch.sum(\n                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1\n                )\n            ]\n\n        # concat all of c\n        c = self.mlp_o(torch.cat(c, dim=1))\n\n        return c, w\n\n\nclass AttMultiHeadAdd(torch.nn.Module):\n    \"\"\"Multi head additive attention\n\n    Reference: Attention is all you need\n        (https://arxiv.org/abs/1706.03762)\n\n    This attention is multi head attention using additive attention for each head.\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int aheads: # heads of multi head attention\n    :param int att_dim_k: dimension k in multi head attention\n    :param int att_dim_v: 
dimension v in multi head attention\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_k and pre_compute_v\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, aheads, att_dim_k, att_dim_v, han_mode=False):\n        super(AttMultiHeadAdd, self).__init__()\n        self.mlp_q = torch.nn.ModuleList()\n        self.mlp_k = torch.nn.ModuleList()\n        self.mlp_v = torch.nn.ModuleList()\n        self.gvec = torch.nn.ModuleList()\n        for _ in six.moves.range(aheads):\n            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]\n            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]\n            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]\n            self.gvec += [torch.nn.Linear(att_dim_k, 1)]\n        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.aheads = aheads\n        self.att_dim_k = att_dim_k\n        self.att_dim_v = att_dim_v\n        self.scaling = 1.0 / math.sqrt(att_dim_k)\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):\n        \"\"\"AttMultiHeadAdd forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: dummy (does not use)\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        
:return: list of previous attention weight (B x T_max) * aheads\n        :rtype: list\n        \"\"\"\n\n        batch = enc_hs_pad.size(0)\n        # pre-compute all k and v outside the decoder loop\n        if self.pre_compute_k is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_k = [\n                self.mlp_k[h](self.enc_h) for h in six.moves.range(self.aheads)\n            ]\n\n        if self.pre_compute_v is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_v = [\n                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)\n            ]\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        c = []\n        w = []\n        for h in six.moves.range(self.aheads):\n            e = self.gvec[h](\n                torch.tanh(\n                    self.pre_compute_k[h]\n                    + self.mlp_q[h](dec_z).view(batch, 1, self.att_dim_k)\n                )\n            ).squeeze(2)\n\n            # NOTE consider zero padding when compute w.\n            if self.mask is None:\n                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n            e.masked_fill_(self.mask, -float(\"inf\"))\n            w += [F.softmax(self.scaling * e, dim=1)]\n\n            # weighted sum over frames\n            # utt x hdim\n            # NOTE use bmm instead of sum(*)\n            c += [\n                torch.sum(\n                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1\n                )\n            ]\n\n        # concat all of c\n        c = self.mlp_o(torch.cat(c, dim=1))\n\n        return c, 
w\n\n\nclass AttMultiHeadLoc(torch.nn.Module):\n    \"\"\"Multi head location based attention\n\n    Reference: Attention is all you need\n        (https://arxiv.org/abs/1706.03762)\n\n    This attention is multi head attention using location-aware attention for each head.\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int aheads: # heads of multi head attention\n    :param int att_dim_k: dimension k in multi head attention\n    :param int att_dim_v: dimension v in multi head attention\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_k and pre_compute_v\n    \"\"\"\n\n    def __init__(\n        self,\n        eprojs,\n        dunits,\n        aheads,\n        att_dim_k,\n        att_dim_v,\n        aconv_chans,\n        aconv_filts,\n        han_mode=False,\n    ):\n        super(AttMultiHeadLoc, self).__init__()\n        self.mlp_q = torch.nn.ModuleList()\n        self.mlp_k = torch.nn.ModuleList()\n        self.mlp_v = torch.nn.ModuleList()\n        self.gvec = torch.nn.ModuleList()\n        self.loc_conv = torch.nn.ModuleList()\n        self.mlp_att = torch.nn.ModuleList()\n        for _ in six.moves.range(aheads):\n            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]\n            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]\n            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]\n            self.gvec += [torch.nn.Linear(att_dim_k, 1)]\n            self.loc_conv += [\n                torch.nn.Conv2d(\n                    1,\n                    aconv_chans,\n                    (1, 2 * aconv_filts + 1),\n                    padding=(0, aconv_filts),\n                    bias=False,\n                )\n            ]\n            self.mlp_att += 
[torch.nn.Linear(aconv_chans, att_dim_k, bias=False)]\n        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.aheads = aheads\n        self.att_dim_k = att_dim_k\n        self.att_dim_v = att_dim_v\n        self.scaling = 1.0 / math.sqrt(att_dim_k)\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):\n        \"\"\"AttMultiHeadLoc forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev:\n            list of previous attention weight (B x T_max) * aheads\n        :param float scaling: scaling parameter before applying softmax\n        :return: attention weighted encoder state (B x D_enc)\n        :rtype: torch.Tensor\n        :return: list of previous attention weight (B x T_max) * aheads\n        :rtype: list\n        \"\"\"\n\n        batch = enc_hs_pad.size(0)\n        # pre-compute all k and v outside the decoder loop\n        if self.pre_compute_k is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_k = [\n                self.mlp_k[h](self.enc_h) for h in six.moves.range(self.aheads)\n            ]\n\n        if self.pre_compute_v is None or self.han_mode:\n            self.enc_h = 
enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_v = [\n                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)\n            ]\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        if att_prev is None:\n            att_prev = []\n            for _ in six.moves.range(self.aheads):\n                # if no bias, 0 0-pad goes 0\n                mask = 1.0 - make_pad_mask(enc_hs_len).float()\n                att_prev += [\n                    to_device(enc_hs_pad, mask / mask.new(enc_hs_len).unsqueeze(-1))\n                ]\n\n        c = []\n        w = []\n        for h in six.moves.range(self.aheads):\n            att_conv = self.loc_conv[h](att_prev[h].view(batch, 1, 1, self.h_length))\n            att_conv = att_conv.squeeze(2).transpose(1, 2)\n            att_conv = self.mlp_att[h](att_conv)\n\n            e = self.gvec[h](\n                torch.tanh(\n                    self.pre_compute_k[h]\n                    + att_conv\n                    + self.mlp_q[h](dec_z).view(batch, 1, self.att_dim_k)\n                )\n            ).squeeze(2)\n\n            # NOTE consider zero padding when compute w.\n            if self.mask is None:\n                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n            e.masked_fill_(self.mask, -float(\"inf\"))\n            w += [F.softmax(scaling * e, dim=1)]\n\n            # weighted sum over frames\n            # utt x hdim\n            # NOTE use bmm instead of sum(*)\n            c += [\n                torch.sum(\n                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1\n                )\n            ]\n\n        # concat all of c\n        c = self.mlp_o(torch.cat(c, dim=1))\n\n        return c, w\n\n\nclass 
AttMultiHeadMultiResLoc(torch.nn.Module):\n    \"\"\"Multi head multi resolution location based attention\n\n    Reference: Attention is all you need\n        (https://arxiv.org/abs/1706.03762)\n\n    This attention is multi head attention using location-aware attention for each head.\n    Furthermore, it uses a different filter size for each head.\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int aheads: # heads of multi head attention\n    :param int att_dim_k: dimension k in multi head attention\n    :param int att_dim_v: dimension v in multi head attention\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: maximum filter size of attention convolution\n        each head uses afilts = aconv_filts * (head + 1) // aheads\n        e.g. aheads=4, aconv_filts=100 => filter size = 25, 50, 75, 100\n    :param bool han_mode: flag to switch on mode of hierarchical attention\n        and not store pre_compute_k and pre_compute_v\n    \"\"\"\n\n    def __init__(\n        self,\n        eprojs,\n        dunits,\n        aheads,\n        att_dim_k,\n        att_dim_v,\n        aconv_chans,\n        aconv_filts,\n        han_mode=False,\n    ):\n        super(AttMultiHeadMultiResLoc, self).__init__()\n        self.mlp_q = torch.nn.ModuleList()\n        self.mlp_k = torch.nn.ModuleList()\n        self.mlp_v = torch.nn.ModuleList()\n        self.gvec = torch.nn.ModuleList()\n        self.loc_conv = torch.nn.ModuleList()\n        self.mlp_att = torch.nn.ModuleList()\n        for h in six.moves.range(aheads):\n            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]\n            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]\n            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]\n            self.gvec += [torch.nn.Linear(att_dim_k, 1)]\n            afilts = aconv_filts * (h + 1) // aheads\n            self.loc_conv += [\n                
torch.nn.Conv2d(\n                    1, aconv_chans, (1, 2 * afilts + 1), padding=(0, afilts), bias=False\n                )\n            ]\n            self.mlp_att += [torch.nn.Linear(aconv_chans, att_dim_k, bias=False)]\n        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.aheads = aheads\n        self.att_dim_k = att_dim_k\n        self.att_dim_v = att_dim_v\n        self.scaling = 1.0 / math.sqrt(att_dim_k)\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n        self.han_mode = han_mode\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_k = None\n        self.pre_compute_v = None\n        self.mask = None\n\n    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):\n        \"\"\"AttMultiHeadMultiResLoc forward\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: list of previous attention weight\n            (B x T_max) * aheads\n        :return: attention weighted encoder state (B x D_enc)\n        :rtype: torch.Tensor\n        :return: list of previous attention weight (B x T_max) * aheads\n        :rtype: list\n        \"\"\"\n\n        batch = enc_hs_pad.size(0)\n        # pre-compute all k and v outside the decoder loop\n        if self.pre_compute_k is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_k = [\n                self.mlp_k[h](self.enc_h) for h in six.moves.range(self.aheads)\n            
]\n\n        if self.pre_compute_v is None or self.han_mode:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_v = [\n                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)\n            ]\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        if att_prev is None:\n            att_prev = []\n            for _ in six.moves.range(self.aheads):\n                # if no bias, 0 0-pad goes 0\n                mask = 1.0 - make_pad_mask(enc_hs_len).float()\n                att_prev += [\n                    to_device(enc_hs_pad, mask / mask.new(enc_hs_len).unsqueeze(-1))\n                ]\n\n        c = []\n        w = []\n        for h in six.moves.range(self.aheads):\n            att_conv = self.loc_conv[h](att_prev[h].view(batch, 1, 1, self.h_length))\n            att_conv = att_conv.squeeze(2).transpose(1, 2)\n            att_conv = self.mlp_att[h](att_conv)\n\n            e = self.gvec[h](\n                torch.tanh(\n                    self.pre_compute_k[h]\n                    + att_conv\n                    + self.mlp_q[h](dec_z).view(batch, 1, self.att_dim_k)\n                )\n            ).squeeze(2)\n\n            # NOTE consider zero padding when compute w.\n            if self.mask is None:\n                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n            e.masked_fill_(self.mask, -float(\"inf\"))\n            w += [F.softmax(self.scaling * e, dim=1)]\n\n            # weighted sum over frames\n            # utt x hdim\n            # NOTE use bmm instead of sum(*)\n            c += [\n                torch.sum(\n                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1\n                )\n            ]\n\n        # concat all of c\n        c = 
self.mlp_o(torch.cat(c, dim=1))\n\n        return c, w\n\n\nclass AttForward(torch.nn.Module):\n    \"\"\"Forward attention module.\n\n    Reference:\n    Forward attention in sequence-to-sequence acoustic modeling for speech synthesis\n        (https://arxiv.org/pdf/1807.06736.pdf)\n\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    \"\"\"\n\n    def __init__(self, eprojs, dunits, att_dim, aconv_chans, aconv_filts):\n        super(AttForward, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)\n        self.loc_conv = torch.nn.Conv2d(\n            1,\n            aconv_chans,\n            (1, 2 * aconv_filts + 1),\n            padding=(0, aconv_filts),\n            bias=False,\n        )\n        self.gvec = torch.nn.Linear(att_dim, 1)\n        self.dunits = dunits\n        self.eprojs = eprojs\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def reset(self):\n        \"\"\"reset states\"\"\"\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n\n    def forward(\n        self,\n        enc_hs_pad,\n        enc_hs_len,\n        dec_z,\n        att_prev,\n        scaling=1.0,\n        last_attended_idx=None,\n        backward_window=1,\n        forward_window=3,\n    ):\n        \"\"\"Calculate AttForward forward propagation.\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param 
torch.Tensor dec_z: decoder hidden state (B x D_dec)\n        :param torch.Tensor att_prev: attention weights of previous step\n        :param float scaling: scaling parameter before applying softmax\n        :param int last_attended_idx: index of the inputs of the last attended\n        :param int backward_window: backward window size in attention constraint\n        :param int forward_window: forward window size in attention constraint\n        :return: attention weighted encoder state (B, D_enc)\n        :rtype: torch.Tensor\n        :return: previous attention weights (B x T_max)\n        :rtype: torch.Tensor\n        \"\"\"\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is None:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        if att_prev is None:\n            # initial attention will be [1, 0, 0, ...]\n            att_prev = enc_hs_pad.new_zeros(*enc_hs_pad.size()[:2])\n            att_prev[:, 0] = 1.0\n\n        # att_prev: utt x frame -> utt x 1 x 1 x frame\n        # -> utt x att_conv_chans x 1 x frame\n        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))\n        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans\n        att_conv = att_conv.squeeze(2).transpose(1, 2)\n        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim\n        att_conv = self.mlp_att(att_conv)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).unsqueeze(1)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            
torch.tanh(self.pre_compute_enc_h + dec_z_tiled + att_conv)\n        ).squeeze(2)\n\n        # NOTE: consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n\n        # apply monotonic attention constraint (mainly for TTS)\n        if last_attended_idx is not None:\n            e = _apply_attention_constraint(\n                e, last_attended_idx, backward_window, forward_window\n            )\n\n        w = F.softmax(scaling * e, dim=1)\n\n        # forward attention\n        att_prev_shift = F.pad(att_prev, (1, 0))[:, :-1]\n        w = (att_prev + att_prev_shift) * w\n        # NOTE: clamp is needed to avoid nan gradient\n        w = F.normalize(torch.clamp(w, 1e-6), p=1, dim=1)\n\n        # weighted sum over frames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.unsqueeze(-1), dim=1)\n\n        return c, w\n\n\nclass AttForwardTA(torch.nn.Module):\n    \"\"\"Forward attention with transition agent module.\n\n    Reference:\n    Forward attention in sequence-to-sequence acoustic modeling for speech synthesis\n        (https://arxiv.org/pdf/1807.06736.pdf)\n\n    :param int eunits: # units of encoder\n    :param int dunits: # units of decoder\n    :param int att_dim: attention dimension\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param int odim: output dimension\n    \"\"\"\n\n    def __init__(self, eunits, dunits, att_dim, aconv_chans, aconv_filts, odim):\n        super(AttForwardTA, self).__init__()\n        self.mlp_enc = torch.nn.Linear(eunits, att_dim)\n        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)\n        self.mlp_ta = torch.nn.Linear(eunits + dunits + odim, 1)\n        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)\n        
self.loc_conv = torch.nn.Conv2d(\n            1,\n            aconv_chans,\n            (1, 2 * aconv_filts + 1),\n            padding=(0, aconv_filts),\n            bias=False,\n        )\n        self.gvec = torch.nn.Linear(att_dim, 1)\n        self.dunits = dunits\n        self.eunits = eunits\n        self.att_dim = att_dim\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.trans_agent_prob = 0.5\n\n    def reset(self):\n        self.h_length = None\n        self.enc_h = None\n        self.pre_compute_enc_h = None\n        self.mask = None\n        self.trans_agent_prob = 0.5\n\n    def forward(\n        self,\n        enc_hs_pad,\n        enc_hs_len,\n        dec_z,\n        att_prev,\n        out_prev,\n        scaling=1.0,\n        last_attended_idx=None,\n        backward_window=1,\n        forward_window=3,\n    ):\n        \"\"\"Calculate AttForwardTA forward propagation.\n\n        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B, Tmax, eunits)\n        :param list enc_hs_len: padded encoder hidden state length (B)\n        :param torch.Tensor dec_z: decoder hidden state (B, dunits)\n        :param torch.Tensor att_prev: attention weights of previous step\n        :param torch.Tensor out_prev: decoder outputs of previous step (B, odim)\n        :param float scaling: scaling parameter before applying softmax\n        :param int last_attended_idx: index of the inputs of the last attended\n        :param int backward_window: backward window size in attention constraint\n        :param int forward_window: forward window size in attention constraint\n        :return: attention weighted encoder state (B, eunits)\n        :rtype: torch.Tensor\n        :return: previous attention weights (B, Tmax)\n        :rtype: torch.Tensor\n        \"\"\"\n        batch = len(enc_hs_pad)\n        # pre-compute all h outside the decoder loop\n        if self.pre_compute_enc_h is 
None:\n            self.enc_h = enc_hs_pad  # utt x frame x hdim\n            self.h_length = self.enc_h.size(1)\n            # utt x frame x att_dim\n            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)\n\n        if dec_z is None:\n            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)\n        else:\n            dec_z = dec_z.view(batch, self.dunits)\n\n        if att_prev is None:\n            # initial attention will be [1, 0, 0, ...]\n            att_prev = enc_hs_pad.new_zeros(*enc_hs_pad.size()[:2])\n            att_prev[:, 0] = 1.0\n\n        # att_prev: utt x frame -> utt x 1 x 1 x frame\n        # -> utt x att_conv_chans x 1 x frame\n        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))\n        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans\n        att_conv = att_conv.squeeze(2).transpose(1, 2)\n        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim\n        att_conv = self.mlp_att(att_conv)\n\n        # dec_z_tiled: utt x frame x att_dim\n        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)\n\n        # dot with gvec\n        # utt x frame x att_dim -> utt x frame\n        e = self.gvec(\n            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)\n        ).squeeze(2)\n\n        # NOTE consider zero padding when compute w.\n        if self.mask is None:\n            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))\n        e.masked_fill_(self.mask, -float(\"inf\"))\n\n        # apply monotonic attention constraint (mainly for TTS)\n        if last_attended_idx is not None:\n            e = _apply_attention_constraint(\n                e, last_attended_idx, backward_window, forward_window\n            )\n\n        w = F.softmax(scaling * e, dim=1)\n\n        # forward attention\n        att_prev_shift = F.pad(att_prev, (1, 0))[:, :-1]\n        w = (\n            self.trans_agent_prob * att_prev\n            + (1 - 
self.trans_agent_prob) * att_prev_shift\n        ) * w\n        # NOTE: clamp is needed to avoid nan gradient\n        w = F.normalize(torch.clamp(w, 1e-6), p=1, dim=1)\n\n        # weighted sum over frames\n        # utt x hdim\n        # NOTE use bmm instead of sum(*)\n        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)\n\n        # update transition agent prob\n        self.trans_agent_prob = torch.sigmoid(\n            self.mlp_ta(torch.cat([c, out_prev, dec_z], dim=1))\n        )\n\n        return c, w\n\n\ndef att_for(args, num_att=1, han_mode=False):\n    \"\"\"Instantiates an attention module given the program arguments\n\n    :param Namespace args: The arguments\n    :param int num_att: number of attention modules\n        (in multi-speaker case, it can be 2 or more)\n    :param bool han_mode: switch on/off mode of hierarchical attention network (HAN)\n    :rtype: torch.nn.Module\n    :return: The attention module\n    \"\"\"\n    att_list = torch.nn.ModuleList()\n    num_encs = getattr(args, \"num_encs\", 1)  # use getattr to keep compatibility\n    aheads = getattr(args, \"aheads\", None)\n    awin = getattr(args, \"awin\", None)\n    aconv_chans = getattr(args, \"aconv_chans\", None)\n    aconv_filts = getattr(args, \"aconv_filts\", None)\n\n    if num_encs == 1:\n        for i in range(num_att):\n            att = initial_att(\n                args.atype,\n                args.eprojs,\n                args.dunits,\n                aheads,\n                args.adim,\n                awin,\n                aconv_chans,\n                aconv_filts,\n            )\n            att_list.append(att)\n    elif num_encs > 1:  # no multi-speaker mode\n        if han_mode:\n            att = initial_att(\n                args.han_type,\n                args.eprojs,\n                args.dunits,\n                args.han_heads,\n                args.han_dim,\n                args.han_win,\n                args.han_conv_chans,\n             
   args.han_conv_filts,\n                han_mode=True,\n            )\n            return att\n        else:\n            att_list = torch.nn.ModuleList()\n            for idx in range(num_encs):\n                att = initial_att(\n                    args.atype[idx],\n                    args.eprojs,\n                    args.dunits,\n                    aheads[idx],\n                    args.adim[idx],\n                    awin[idx],\n                    aconv_chans[idx],\n                    aconv_filts[idx],\n                )\n                att_list.append(att)\n    else:\n        raise ValueError(\n            \"Number of encoders needs to be at least one. {}\".format(num_encs)\n        )\n    return att_list\n\n\ndef initial_att(\n    atype, eprojs, dunits, aheads, adim, awin, aconv_chans, aconv_filts, han_mode=False\n):\n    \"\"\"Instantiates a single attention module\n\n    :param str atype: attention type\n    :param int eprojs: # projection-units of encoder\n    :param int dunits: # units of decoder\n    :param int aheads: # heads of multi head attention\n    :param int adim: attention dimension\n    :param int awin: attention window size\n    :param int aconv_chans: # channels of attention convolution\n    :param int aconv_filts: filter size of attention convolution\n    :param bool han_mode: flag to switch on the mode of hierarchical attention\n    :return: The attention module\n    \"\"\"\n\n    if atype == \"noatt\":\n        att = NoAtt()\n    elif atype == \"dot\":\n        att = AttDot(eprojs, dunits, adim, han_mode)\n    elif atype == \"add\":\n        att = AttAdd(eprojs, dunits, adim, han_mode)\n    elif atype == \"location\":\n        att = AttLoc(eprojs, dunits, adim, aconv_chans, aconv_filts, han_mode)\n    elif atype == \"location2d\":\n        att = AttLoc2D(eprojs, dunits, adim, awin, aconv_chans, aconv_filts, han_mode)\n    elif atype == \"location_recurrent\":\n        att = AttLocRec(eprojs, dunits, adim, aconv_chans, aconv_filts, 
han_mode)\n    elif atype == \"coverage\":\n        att = AttCov(eprojs, dunits, adim, han_mode)\n    elif atype == \"coverage_location\":\n        att = AttCovLoc(eprojs, dunits, adim, aconv_chans, aconv_filts, han_mode)\n    elif atype == \"multi_head_dot\":\n        att = AttMultiHeadDot(eprojs, dunits, aheads, adim, adim, han_mode)\n    elif atype == \"multi_head_add\":\n        att = AttMultiHeadAdd(eprojs, dunits, aheads, adim, adim, han_mode)\n    elif atype == \"multi_head_loc\":\n        att = AttMultiHeadLoc(\n            eprojs, dunits, aheads, adim, adim, aconv_chans, aconv_filts, han_mode\n        )\n    elif atype == \"multi_head_multi_res_loc\":\n        att = AttMultiHeadMultiResLoc(\n            eprojs, dunits, aheads, adim, adim, aconv_chans, aconv_filts, han_mode\n        )\n    return att\n\n\ndef att_to_numpy(att_ws, att):\n    \"\"\"Converts attention weights to a numpy array given the attention\n\n    :param list att_ws: The attention weights\n    :param torch.nn.Module att: The attention\n    :rtype: np.ndarray\n    :return: The numpy array of the attention weights\n    \"\"\"\n    # convert to numpy array with the shape (B, Lmax, Tmax)\n    if isinstance(att, AttLoc2D):\n        # att_ws => list of previous concate attentions\n        att_ws = torch.stack([aw[:, -1] for aw in att_ws], dim=1).cpu().numpy()\n    elif isinstance(att, (AttCov, AttCovLoc)):\n        # att_ws => list of list of previous attentions\n        att_ws = (\n            torch.stack([aw[idx] for idx, aw in enumerate(att_ws)], dim=1).cpu().numpy()\n        )\n    elif isinstance(att, AttLocRec):\n        # att_ws => list of tuple of attention and hidden states\n        att_ws = torch.stack([aw[0] for aw in att_ws], dim=1).cpu().numpy()\n    elif isinstance(\n        att,\n        (AttMultiHeadDot, AttMultiHeadAdd, AttMultiHeadLoc, AttMultiHeadMultiResLoc),\n    ):\n        # att_ws => list of list of each head attention\n        n_heads = len(att_ws[0])\n        
att_ws_sorted_by_head = []\n        for h in six.moves.range(n_heads):\n            att_ws_head = torch.stack([aw[h] for aw in att_ws], dim=1)\n            att_ws_sorted_by_head += [att_ws_head]\n        att_ws = torch.stack(att_ws_sorted_by_head, dim=1).cpu().numpy()\n    else:\n        # att_ws => list of attentions\n        att_ws = torch.stack(att_ws, dim=1).cpu().numpy()\n    return att_ws\n"
  },
  {
    "path": "nets/pytorch_backend/rnn/decoders.py",
    "content": "from distutils.version import LooseVersion\nimport logging\nimport math\nimport random\nimport six\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\n\nfrom argparse import Namespace\n\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScore\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScoreTH\nfrom espnet.nets.e2e_asr_common import end_detect\n\nfrom espnet.nets.pytorch_backend.rnn.attentions import att_to_numpy\n\nfrom espnet.nets.pytorch_backend.nets_utils import mask_by_length\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.pytorch_backend.nets_utils import th_accuracy\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\nfrom espnet.nets.scorer_interface import ScorerInterface\n\nMAX_DECODER_OUTPUT = 5\nCTC_SCORING_RATIO = 1.5\n\n\nclass Decoder(torch.nn.Module, ScorerInterface):\n    \"\"\"Decoder module\n\n    :param int eprojs: encoder projection units\n    :param int odim: dimension of outputs\n    :param str dtype: gru or lstm\n    :param int dlayers: decoder layers\n    :param int dunits: decoder units\n    :param int sos: start of sequence symbol id\n    :param int eos: end of sequence symbol id\n    :param torch.nn.Module att: attention module\n    :param int verbose: verbose level\n    :param list char_list: list of character strings\n    :param ndarray labeldist: distribution of label smoothing\n    :param float lsm_weight: label smoothing weight\n    :param float sampling_probability: scheduled sampling probability\n    :param float dropout: dropout rate\n    :param float context_residual: if True, use context vector for token generation\n    :param float replace_sos: use for multilingual (speech/text) translation\n    \"\"\"\n\n    def __init__(\n        self,\n        eprojs,\n        odim,\n        dtype,\n        dlayers,\n        dunits,\n        sos,\n        eos,\n        att,\n        verbose=0,\n        char_list=None,\n        labeldist=None,\n        
lsm_weight=0.0,\n        sampling_probability=0.0,\n        dropout=0.0,\n        context_residual=False,\n        replace_sos=False,\n        num_encs=1,\n    ):\n\n        torch.nn.Module.__init__(self)\n        self.dtype = dtype\n        self.dunits = dunits\n        self.dlayers = dlayers\n        self.context_residual = context_residual\n        self.embed = torch.nn.Embedding(odim, dunits)\n        self.dropout_emb = torch.nn.Dropout(p=dropout)\n\n        self.decoder = torch.nn.ModuleList()\n        self.dropout_dec = torch.nn.ModuleList()\n        self.decoder += [\n            torch.nn.LSTMCell(dunits + eprojs, dunits)\n            if self.dtype == \"lstm\"\n            else torch.nn.GRUCell(dunits + eprojs, dunits)\n        ]\n        self.dropout_dec += [torch.nn.Dropout(p=dropout)]\n        for _ in six.moves.range(1, self.dlayers):\n            self.decoder += [\n                torch.nn.LSTMCell(dunits, dunits)\n                if self.dtype == \"lstm\"\n                else torch.nn.GRUCell(dunits, dunits)\n            ]\n            self.dropout_dec += [torch.nn.Dropout(p=dropout)]\n            # NOTE: dropout is applied only for the vertical connections\n            # see https://arxiv.org/pdf/1409.2329.pdf\n        self.ignore_id = -1\n\n        if context_residual:\n            self.output = torch.nn.Linear(dunits + eprojs, odim)\n        else:\n            self.output = torch.nn.Linear(dunits, odim)\n\n        self.loss = None\n        self.att = att\n        self.dunits = dunits\n        self.sos = sos\n        self.eos = eos\n        self.odim = odim\n        self.verbose = verbose\n        self.char_list = char_list\n        # for label smoothing\n        self.labeldist = labeldist\n        self.vlabeldist = None\n        self.lsm_weight = lsm_weight\n        self.sampling_probability = sampling_probability\n        self.dropout = dropout\n        self.num_encs = num_encs\n\n        # for multilingual E2E-ST\n        self.replace_sos = 
replace_sos\n\n        self.logzero = -10000000000.0\n\n    def zero_state(self, hs_pad):\n        return hs_pad.new_zeros(hs_pad.size(0), self.dunits)\n\n    def rnn_forward(self, ey, z_list, c_list, z_prev, c_prev):\n        if self.dtype == \"lstm\":\n            z_list[0], c_list[0] = self.decoder[0](ey, (z_prev[0], c_prev[0]))\n            for i in six.moves.range(1, self.dlayers):\n                z_list[i], c_list[i] = self.decoder[i](\n                    self.dropout_dec[i - 1](z_list[i - 1]), (z_prev[i], c_prev[i])\n                )\n        else:\n            z_list[0] = self.decoder[0](ey, z_prev[0])\n            for i in six.moves.range(1, self.dlayers):\n                z_list[i] = self.decoder[i](\n                    self.dropout_dec[i - 1](z_list[i - 1]), z_prev[i]\n                )\n        return z_list, c_list\n\n    def forward(self, hs_pad, hlens, ys_pad, strm_idx=0, lang_ids=None):\n        \"\"\"Decoder forward\n\n        :param torch.Tensor hs_pad: batch of padded hidden state sequences (B, Tmax, D)\n                                    [in multi-encoder case,\n                                    list of torch.Tensor,\n                                    [(B, Tmax_1, D), (B, Tmax_2, D), ..., ] ]\n        :param torch.Tensor hlens: batch of lengths of hidden state sequences (B)\n                                   [in multi-encoder case, list of torch.Tensor,\n                                   [(B), (B), ..., ] ]\n        :param torch.Tensor ys_pad: batch of padded character id sequence tensor\n                                    (B, Lmax)\n        :param int strm_idx: stream index indicates the index of decoding stream.\n        :param torch.Tensor lang_ids: batch of target language id tensor (B, 1)\n        :return: attention loss value\n        :rtype: torch.Tensor\n        :return: accuracy\n        :rtype: float\n        :return: perplexity\n        :rtype: float\n        \"\"\"\n        # to support multiple encoder asr mode, in single encoder mode,\n        # convert torch.Tensor to List 
of torch.Tensor\n        if self.num_encs == 1:\n            hs_pad = [hs_pad]\n            hlens = [hlens]\n\n        # TODO(kan-bayashi): need to make more smart way\n        ys = [y[y != self.ignore_id] for y in ys_pad]  # parse padded ys\n        # attention index for the attention module\n        # in SPA (speaker parallel attention),\n        # att_idx is used to select attention module. In other cases, it is 0.\n        att_idx = min(strm_idx, len(self.att) - 1)\n\n        # hlens should be list of list of integer\n        hlens = [list(map(int, hlens[idx])) for idx in range(self.num_encs)]\n\n        self.loss = None\n        # prepare input and output word sequences with sos/eos IDs\n        eos = ys[0].new([self.eos])\n        sos = ys[0].new([self.sos])\n        if self.replace_sos:\n            ys_in = [torch.cat([idx, y], dim=0) for idx, y in zip(lang_ids, ys)]\n        else:\n            ys_in = [torch.cat([sos, y], dim=0) for y in ys]\n        ys_out = [torch.cat([y, eos], dim=0) for y in ys]\n\n        # padding for ys with -1\n        # pys: utt x olen\n        ys_in_pad = pad_list(ys_in, self.eos)\n        ys_out_pad = pad_list(ys_out, self.ignore_id)\n\n        # get dim, length info\n        batch = ys_out_pad.size(0)\n        olength = ys_out_pad.size(1)\n        for idx in range(self.num_encs):\n            logging.info(\n                self.__class__.__name__\n                + \"Number of Encoder:{}; enc{}: input lengths: {}.\".format(\n                    self.num_encs, idx + 1, hlens[idx]\n                )\n            )\n        logging.info(\n            self.__class__.__name__\n            + \" output lengths: \"\n            + str([y.size(0) for y in ys_out])\n        )\n\n        # initialization\n        c_list = [self.zero_state(hs_pad[0])]\n        z_list = [self.zero_state(hs_pad[0])]\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(self.zero_state(hs_pad[0]))\n            
z_list.append(self.zero_state(hs_pad[0]))\n        z_all = []\n        if self.num_encs == 1:\n            att_w = None\n            self.att[att_idx].reset()  # reset pre-computation of h\n        else:\n            att_w_list = [None] * (self.num_encs + 1)  # atts + han\n            att_c_list = [None] * (self.num_encs)  # atts\n            for idx in range(self.num_encs + 1):\n                self.att[idx].reset()  # reset pre-computation of h in atts and han\n\n        # pre-computation of embedding\n        eys = self.dropout_emb(self.embed(ys_in_pad))  # utt x olen x zdim\n\n        # loop for an output sequence\n        for i in six.moves.range(olength):\n            if self.num_encs == 1:\n                att_c, att_w = self.att[att_idx](\n                    hs_pad[0], hlens[0], self.dropout_dec[0](z_list[0]), att_w\n                )\n            else:\n                for idx in range(self.num_encs):\n                    att_c_list[idx], att_w_list[idx] = self.att[idx](\n                        hs_pad[idx],\n                        hlens[idx],\n                        self.dropout_dec[0](z_list[0]),\n                        att_w_list[idx],\n                    )\n                hs_pad_han = torch.stack(att_c_list, dim=1)\n                hlens_han = [self.num_encs] * len(ys_in)\n                att_c, att_w_list[self.num_encs] = self.att[self.num_encs](\n                    hs_pad_han,\n                    hlens_han,\n                    self.dropout_dec[0](z_list[0]),\n                    att_w_list[self.num_encs],\n                )\n            if i > 0 and random.random() < self.sampling_probability:\n                logging.info(\" scheduled sampling \")\n                z_out = self.output(z_all[-1])\n                z_out = np.argmax(z_out.detach().cpu(), axis=1)\n                z_out = self.dropout_emb(self.embed(to_device(hs_pad[0], z_out)))\n                ey = torch.cat((z_out, att_c), dim=1)  # utt x (zdim + hdim)\n            else:\n     
           ey = torch.cat((eys[:, i, :], att_c), dim=1)  # utt x (zdim + hdim)\n            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)\n            if self.context_residual:\n                z_all.append(\n                    torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)\n                )  # utt x (zdim + hdim)\n            else:\n                z_all.append(self.dropout_dec[-1](z_list[-1]))  # utt x (zdim)\n\n        z_all = torch.stack(z_all, dim=1).view(batch * olength, -1)\n        # compute loss\n        y_all = self.output(z_all)\n        if LooseVersion(torch.__version__) < LooseVersion(\"1.0\"):\n            reduction_str = \"elementwise_mean\"\n        else:\n            reduction_str = \"mean\"\n        self.loss = F.cross_entropy(\n            y_all,\n            ys_out_pad.view(-1),\n            ignore_index=self.ignore_id,\n            reduction=reduction_str,\n        )\n        # compute perplexity\n        ppl = math.exp(self.loss.item())\n        # -1: eos, which is removed in the loss computation\n        self.loss *= np.mean([len(x) for x in ys_in]) - 1\n        acc = th_accuracy(y_all, ys_out_pad, ignore_label=self.ignore_id)\n        logging.info(\"att loss:\" + \"\".join(str(self.loss.item()).split(\"\\n\")))\n\n        # show predicted character sequence for debug\n        if self.verbose > 0 and self.char_list is not None:\n            ys_hat = y_all.view(batch, olength, -1)\n            ys_true = ys_out_pad\n            for (i, y_hat), y_true in zip(\n                enumerate(ys_hat.detach().cpu().numpy()), ys_true.detach().cpu().numpy()\n            ):\n                if i == MAX_DECODER_OUTPUT:\n                    break\n                idx_hat = np.argmax(y_hat[y_true != self.ignore_id], axis=1)\n                idx_true = y_true[y_true != self.ignore_id]\n                seq_hat = [self.char_list[int(idx)] for idx in idx_hat]\n                seq_true = [self.char_list[int(idx)] for idx in 
idx_true]\n                seq_hat = \"\".join(seq_hat)\n                seq_true = \"\".join(seq_true)\n                logging.info(\"groundtruth[%d]: \" % i + seq_true)\n                logging.info(\"prediction [%d]: \" % i + seq_hat)\n\n        if self.labeldist is not None:\n            if self.vlabeldist is None:\n                self.vlabeldist = to_device(hs_pad[0], torch.from_numpy(self.labeldist))\n            loss_reg = -torch.sum(\n                (F.log_softmax(y_all, dim=1) * self.vlabeldist).view(-1), dim=0\n            ) / len(ys_in)\n            self.loss = (1.0 - self.lsm_weight) * self.loss + self.lsm_weight * loss_reg\n\n        return self.loss, acc, ppl\n\n    def recognize_beam(self, h, lpz, recog_args, char_list, rnnlm=None, strm_idx=0):\n        \"\"\"beam search implementation\n\n        :param torch.Tensor h: encoder hidden state (T, eprojs)\n                                [in multi-encoder case, list of torch.Tensor,\n                                [(T1, eprojs), (T2, eprojs), ...] ]\n        :param torch.Tensor lpz: ctc log softmax output (T, odim)\n                                [in multi-encoder case, list of torch.Tensor,\n                                [(T1, odim), (T2, odim), ...] 
]\n        :param Namespace recog_args: argument Namespace containing options\n        :param char_list: list of character strings\n        :param torch.nn.Module rnnlm: language module\n        :param int strm_idx:\n            stream index for speaker parallel attention in multi-speaker case\n        :return: N-best decoding results\n        :rtype: list of dicts\n        \"\"\"\n        # to support multiple encoder asr mode, in single encoder mode,\n        # convert torch.Tensor to List of torch.Tensor\n        if self.num_encs == 1:\n            h = [h]\n            lpz = [lpz]\n        if self.num_encs > 1 and lpz is None:\n            lpz = [lpz] * self.num_encs\n\n        for idx in range(self.num_encs):\n            logging.info(\n                \"Number of Encoder:{}; enc{}: input lengths: {}.\".format(\n                    self.num_encs, idx + 1, h[idx].size(0)\n                )\n            )\n        att_idx = min(strm_idx, len(self.att) - 1)\n        # initialization\n        c_list = [self.zero_state(h[0].unsqueeze(0))]\n        z_list = [self.zero_state(h[0].unsqueeze(0))]\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(self.zero_state(h[0].unsqueeze(0)))\n            z_list.append(self.zero_state(h[0].unsqueeze(0)))\n        if self.num_encs == 1:\n            a = None\n            self.att[att_idx].reset()  # reset pre-computation of h\n        else:\n            a = [None] * (self.num_encs + 1)  # atts + han\n            att_w_list = [None] * (self.num_encs + 1)  # atts + han\n            att_c_list = [None] * (self.num_encs)  # atts\n            for idx in range(self.num_encs + 1):\n                self.att[idx].reset()  # reset pre-computation of h in atts and han\n\n        # search params\n        beam = recog_args.beam_size\n        penalty = recog_args.penalty\n        ctc_weight = getattr(recog_args, \"ctc_weight\", False)  # for NMT\n\n        if lpz[0] is not None and self.num_encs > 1:\n            # 
weights-ctc,\n            # e.g. ctc_loss = w_1*ctc_1_loss + w_2 * ctc_2_loss + w_N * ctc_N_loss\n            weights_ctc_dec = recog_args.weights_ctc_dec / np.sum(\n                recog_args.weights_ctc_dec\n            )  # normalize\n            logging.info(\n                \"ctc weights (decoding): \" + \" \".join([str(x) for x in weights_ctc_dec])\n            )\n        else:\n            weights_ctc_dec = [1.0]\n\n        # prepare sos\n        if self.replace_sos and recog_args.tgt_lang:\n            y = char_list.index(recog_args.tgt_lang)\n        else:\n            y = self.sos\n        logging.info(\"<sos> index: \" + str(y))\n        logging.info(\"<sos> mark: \" + char_list[y])\n        vy = h[0].new_zeros(1).long()\n\n        maxlen = np.amin([h[idx].size(0) for idx in range(self.num_encs)])\n        if recog_args.maxlenratio != 0:\n            # maxlen >= 1\n            maxlen = max(1, int(recog_args.maxlenratio * maxlen))\n        minlen = int(recog_args.minlenratio * maxlen)\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialize hypothesis\n        if rnnlm:\n            hyp = {\n                \"score\": 0.0,\n                \"yseq\": [y],\n                \"c_prev\": c_list,\n                \"z_prev\": z_list,\n                \"a_prev\": a,\n                \"rnnlm_prev\": None,\n            }\n        else:\n            hyp = {\n                \"score\": 0.0,\n                \"yseq\": [y],\n                \"c_prev\": c_list,\n                \"z_prev\": z_list,\n                \"a_prev\": a,\n            }\n        if lpz[0] is not None:\n            ctc_prefix_score = [\n                CTCPrefixScore(lpz[idx].detach().numpy(), 0, self.eos, np)\n                for idx in range(self.num_encs)\n            ]\n            hyp[\"ctc_state_prev\"] = [\n                ctc_prefix_score[idx].initial_state() for idx in range(self.num_encs)\n         
   ]\n            hyp[\"ctc_score_prev\"] = [0.0] * self.num_encs\n            if ctc_weight != 1.0:\n                # pre-pruning based on attention scores\n                ctc_beam = min(lpz[0].shape[-1], int(beam * CTC_SCORING_RATIO))\n            else:\n                ctc_beam = lpz[0].shape[-1]\n        hyps = [hyp]\n        ended_hyps = []\n\n        for i in six.moves.range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            hyps_best_kept = []\n            for hyp in hyps:\n                vy[0] = hyp[\"yseq\"][i]\n                ey = self.dropout_emb(self.embed(vy))  # utt list (1) x zdim\n                if self.num_encs == 1:\n                    att_c, att_w = self.att[att_idx](\n                        h[0].unsqueeze(0),\n                        [h[0].size(0)],\n                        self.dropout_dec[0](hyp[\"z_prev\"][0]),\n                        hyp[\"a_prev\"],\n                    )\n                else:\n                    for idx in range(self.num_encs):\n                        att_c_list[idx], att_w_list[idx] = self.att[idx](\n                            h[idx].unsqueeze(0),\n                            [h[idx].size(0)],\n                            self.dropout_dec[0](hyp[\"z_prev\"][0]),\n                            hyp[\"a_prev\"][idx],\n                        )\n                    h_han = torch.stack(att_c_list, dim=1)\n                    att_c, att_w_list[self.num_encs] = self.att[self.num_encs](\n                        h_han,\n                        [self.num_encs],\n                        self.dropout_dec[0](hyp[\"z_prev\"][0]),\n                        hyp[\"a_prev\"][self.num_encs],\n                    )\n                ey = torch.cat((ey, att_c), dim=1)  # utt(1) x (zdim + hdim)\n                z_list, c_list = self.rnn_forward(\n                    ey, z_list, c_list, hyp[\"z_prev\"], hyp[\"c_prev\"]\n                )\n\n                # get nbest local scores and their ids\n                
if self.context_residual:\n                    logits = self.output(\n                        torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)\n                    )\n                else:\n                    logits = self.output(self.dropout_dec[-1](z_list[-1]))\n                local_att_scores = F.log_softmax(logits, dim=1)\n                if rnnlm:\n                    rnnlm_state, local_lm_scores = rnnlm.predict(hyp[\"rnnlm_prev\"], vy)\n                    local_scores = (\n                        local_att_scores + recog_args.lm_weight * local_lm_scores\n                    )\n                else:\n                    local_scores = local_att_scores\n\n                if lpz[0] is not None:\n                    local_best_scores, local_best_ids = torch.topk(\n                        local_att_scores, ctc_beam, dim=1\n                    )\n                    ctc_scores, ctc_states = (\n                        [None] * self.num_encs,\n                        [None] * self.num_encs,\n                    )\n                    for idx in range(self.num_encs):\n                        ctc_scores[idx], ctc_states[idx] = ctc_prefix_score[idx](\n                            hyp[\"yseq\"], local_best_ids[0], hyp[\"ctc_state_prev\"][idx]\n                        )\n                    local_scores = (1.0 - ctc_weight) * local_att_scores[\n                        :, local_best_ids[0]\n                    ]\n                    if self.num_encs == 1:\n                        local_scores += ctc_weight * torch.from_numpy(\n                            ctc_scores[0] - hyp[\"ctc_score_prev\"][0]\n                        )\n                    else:\n                        for idx in range(self.num_encs):\n                            local_scores += (\n                                ctc_weight\n                                * weights_ctc_dec[idx]\n                                * torch.from_numpy(\n                                    ctc_scores[idx] - 
hyp[\"ctc_score_prev\"][idx]\n                                )\n                            )\n                    if rnnlm:\n                        local_scores += (\n                            recog_args.lm_weight * local_lm_scores[:, local_best_ids[0]]\n                        )\n                    local_best_scores, joint_best_ids = torch.topk(\n                        local_scores, beam, dim=1\n                    )\n                    local_best_ids = local_best_ids[:, joint_best_ids[0]]\n                else:\n                    local_best_scores, local_best_ids = torch.topk(\n                        local_scores, beam, dim=1\n                    )\n\n                for j in six.moves.range(beam):\n                    new_hyp = {}\n                    # [:] is needed!\n                    new_hyp[\"z_prev\"] = z_list[:]\n                    new_hyp[\"c_prev\"] = c_list[:]\n                    if self.num_encs == 1:\n                        new_hyp[\"a_prev\"] = att_w[:]\n                    else:\n                        new_hyp[\"a_prev\"] = [\n                            att_w_list[idx][:] for idx in range(self.num_encs + 1)\n                        ]\n                    new_hyp[\"score\"] = hyp[\"score\"] + local_best_scores[0, j]\n                    new_hyp[\"yseq\"] = [0] * (1 + len(hyp[\"yseq\"]))\n                    new_hyp[\"yseq\"][: len(hyp[\"yseq\"])] = hyp[\"yseq\"]\n                    new_hyp[\"yseq\"][len(hyp[\"yseq\"])] = int(local_best_ids[0, j])\n                    if rnnlm:\n                        new_hyp[\"rnnlm_prev\"] = rnnlm_state\n                    if lpz[0] is not None:\n                        new_hyp[\"ctc_state_prev\"] = [\n                            ctc_states[idx][joint_best_ids[0, j]]\n                            for idx in range(self.num_encs)\n                        ]\n                        new_hyp[\"ctc_score_prev\"] = [\n                            ctc_scores[idx][joint_best_ids[0, j]]\n                    
        for idx in range(self.num_encs)\n                        ]\n                    # will be (2 x beam) hyps at most\n                    hyps_best_kept.append(new_hyp)\n\n                hyps_best_kept = sorted(\n                    hyps_best_kept, key=lambda x: x[\"score\"], reverse=True\n                )[:beam]\n\n            # sort and get nbest\n            hyps = hyps_best_kept\n            logging.debug(\"number of pruned hypotheses: \" + str(len(hyps)))\n            logging.debug(\n                \"best hypo: \"\n                + \"\".join([char_list[int(x)] for x in hyps[0][\"yseq\"][1:]])\n            )\n\n            # add eos in the final loop to avoid that there are no ended hyps\n            if i == maxlen - 1:\n                logging.info(\"adding <eos> in the last position in the loop\")\n                for hyp in hyps:\n                    hyp[\"yseq\"].append(self.eos)\n\n            # add ended hypotheses to a final list,\n            # and removed them from current hypotheses\n            # (this will be a problem, number of hyps < beam)\n            remained_hyps = []\n            for hyp in hyps:\n                if hyp[\"yseq\"][-1] == self.eos:\n                    # only store the sequence that has more than minlen outputs\n                    # also add penalty\n                    if len(hyp[\"yseq\"]) > minlen:\n                        hyp[\"score\"] += (i + 1) * penalty\n                        if rnnlm:  # Word LM needs to add final <eos> score\n                            hyp[\"score\"] += recog_args.lm_weight * rnnlm.final(\n                                hyp[\"rnnlm_prev\"]\n                            )\n                        ended_hyps.append(hyp)\n                else:\n                    remained_hyps.append(hyp)\n\n            # end detection\n            if end_detect(ended_hyps, i) and recog_args.maxlenratio == 0.0:\n                logging.info(\"end detected at %d\", i)\n                break\n\n            
hyps = remained_hyps\n            if len(hyps) > 0:\n                logging.debug(\"remaining hypotheses: \" + str(len(hyps)))\n            else:\n                logging.info(\"no hypothesis. Finish decoding.\")\n                break\n\n            for hyp in hyps:\n                logging.debug(\n                    \"hypo: \" + \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]])\n                )\n\n            logging.debug(\"number of ended hypotheses: \" + str(len(ended_hyps)))\n\n        nbest_hyps = sorted(ended_hyps, key=lambda x: x[\"score\"], reverse=True)[\n            : min(len(ended_hyps), recog_args.nbest)\n        ]\n\n        # check number of hypotheses\n        if len(nbest_hyps) == 0:\n            logging.warning(\n                \"there is no N-best results, \"\n                \"perform recognition again with smaller minlenratio.\"\n            )\n            # should copy because Namespace will be overwritten globally\n            recog_args = Namespace(**vars(recog_args))\n            recog_args.minlenratio = max(0.0, recog_args.minlenratio - 0.1)\n            if self.num_encs == 1:\n                return self.recognize_beam(h[0], lpz[0], recog_args, char_list, rnnlm)\n            else:\n                return self.recognize_beam(h, lpz, recog_args, char_list, rnnlm)\n\n        logging.info(\"total log probability: \" + str(nbest_hyps[0][\"score\"]))\n        logging.info(\n            \"normalized log probability: \"\n            + str(nbest_hyps[0][\"score\"] / len(nbest_hyps[0][\"yseq\"]))\n        )\n\n        # remove sos\n        return nbest_hyps\n\n    def recognize_beam_batch(\n        self,\n        h,\n        hlens,\n        lpz,\n        recog_args,\n        char_list,\n        rnnlm=None,\n        normalize_score=True,\n        strm_idx=0,\n        lang_ids=None,\n    ):\n        # to support mutiple encoder asr mode, in single encoder mode,\n        # convert torch.Tensor to List of torch.Tensor\n        if 
self.num_encs == 1:\n            h = [h]\n            hlens = [hlens]\n            lpz = [lpz]\n        if self.num_encs > 1 and lpz is None:\n            lpz = [lpz] * self.num_encs\n\n        att_idx = min(strm_idx, len(self.att) - 1)\n        for idx in range(self.num_encs):\n            logging.info(\n                \"Number of Encoder:{}; enc{}: input lengths: {}.\".format(\n                    self.num_encs, idx + 1, h[idx].size(1)\n                )\n            )\n            h[idx] = mask_by_length(h[idx], hlens[idx], 0.0)\n\n        # search params\n        batch = len(hlens[0])\n        beam = recog_args.beam_size\n        penalty = recog_args.penalty\n        ctc_weight = getattr(recog_args, \"ctc_weight\", 0)  # for NMT\n        att_weight = 1.0 - ctc_weight\n        ctc_margin = getattr(\n            recog_args, \"ctc_window_margin\", 0\n        )  # use getattr to keep compatibility\n        # weights-ctc,\n        # e.g. ctc_loss = w_1*ctc_1_loss + w_2 * ctc_2_loss + w_N * ctc_N_loss\n        if lpz[0] is not None and self.num_encs > 1:\n            weights_ctc_dec = recog_args.weights_ctc_dec / np.sum(\n                recog_args.weights_ctc_dec\n            )  # normalize\n            logging.info(\n                \"ctc weights (decoding): \" + \" \".join([str(x) for x in weights_ctc_dec])\n            )\n        else:\n            weights_ctc_dec = [1.0]\n\n        n_bb = batch * beam\n        pad_b = to_device(h[0], torch.arange(batch) * beam).view(-1, 1)\n\n        max_hlen = np.amin([max(hlens[idx]) for idx in range(self.num_encs)])\n        if recog_args.maxlenratio == 0:\n            maxlen = max_hlen\n        else:\n            maxlen = max(1, int(recog_args.maxlenratio * max_hlen))\n        minlen = int(recog_args.minlenratio * max_hlen)\n        logging.info(\"max output length: \" + str(maxlen))\n        logging.info(\"min output length: \" + str(minlen))\n\n        # initialization\n        c_prev = [\n            to_device(h[0], 
torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)\n        ]\n        z_prev = [\n            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)\n        ]\n        c_list = [\n            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)\n        ]\n        z_list = [\n            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)\n        ]\n        vscores = to_device(h[0], torch.zeros(batch, beam))\n\n        rnnlm_state = None\n        if self.num_encs == 1:\n            a_prev = [None]\n            att_w_list, ctc_scorer, ctc_state = [None], [None], [None]\n            self.att[att_idx].reset()  # reset pre-computation of h\n        else:\n            a_prev = [None] * (self.num_encs + 1)  # atts + han\n            att_w_list = [None] * (self.num_encs + 1)  # atts + han\n            att_c_list = [None] * (self.num_encs)  # atts\n            ctc_scorer, ctc_state = [None] * (self.num_encs), [None] * (self.num_encs)\n            for idx in range(self.num_encs + 1):\n                self.att[idx].reset()  # reset pre-computation of h in atts and han\n\n        if self.replace_sos and recog_args.tgt_lang:\n            logging.info(\"<sos> index: \" + str(char_list.index(recog_args.tgt_lang)))\n            logging.info(\"<sos> mark: \" + recog_args.tgt_lang)\n            yseq = [\n                [char_list.index(recog_args.tgt_lang)] for _ in six.moves.range(n_bb)\n            ]\n        elif lang_ids is not None:\n            # NOTE: used for evaluation during training\n            yseq = [\n                [lang_ids[b // recog_args.beam_size]] for b in six.moves.range(n_bb)\n            ]\n        else:\n            logging.info(\"<sos> index: \" + str(self.sos))\n            logging.info(\"<sos> mark: \" + char_list[self.sos])\n            yseq = [[self.sos] for _ in six.moves.range(n_bb)]\n\n        accum_odim_ids = [self.sos for _ in six.moves.range(n_bb)]\n        
stop_search = [False for _ in six.moves.range(batch)]\n        nbest_hyps = [[] for _ in six.moves.range(batch)]\n        ended_hyps = [[] for _ in range(batch)]\n\n        exp_hlens = [\n            hlens[idx].repeat(beam).view(beam, batch).transpose(0, 1).contiguous()\n            for idx in range(self.num_encs)\n        ]\n        exp_hlens = [exp_hlens[idx].view(-1).tolist() for idx in range(self.num_encs)]\n        exp_h = [\n            h[idx].unsqueeze(1).repeat(1, beam, 1, 1).contiguous()\n            for idx in range(self.num_encs)\n        ]\n        exp_h = [\n            exp_h[idx].view(n_bb, h[idx].size()[1], h[idx].size()[2])\n            for idx in range(self.num_encs)\n        ]\n\n        if lpz[0] is not None:\n            scoring_num = min(\n                int(beam * CTC_SCORING_RATIO)\n                if att_weight > 0.0 and not lpz[0].is_cuda\n                else 0,\n                lpz[0].size(-1),\n            )\n            ctc_scorer = [\n                CTCPrefixScoreTH(\n                    lpz[idx],\n                    hlens[idx],\n                    0,\n                    self.eos,\n                    margin=ctc_margin,\n                )\n                for idx in range(self.num_encs)\n            ]\n\n        for i in six.moves.range(maxlen):\n            logging.debug(\"position \" + str(i))\n\n            vy = to_device(h[0], torch.LongTensor(self._get_last_yseq(yseq)))\n            ey = self.dropout_emb(self.embed(vy))\n            if self.num_encs == 1:\n                att_c, att_w = self.att[att_idx](\n                    exp_h[0], exp_hlens[0], self.dropout_dec[0](z_prev[0]), a_prev[0]\n                )\n                att_w_list = [att_w]\n            else:\n                for idx in range(self.num_encs):\n                    att_c_list[idx], att_w_list[idx] = self.att[idx](\n                        exp_h[idx],\n                        exp_hlens[idx],\n                        self.dropout_dec[0](z_prev[0]),\n         
               a_prev[idx],\n                    )\n                exp_h_han = torch.stack(att_c_list, dim=1)\n                att_c, att_w_list[self.num_encs] = self.att[self.num_encs](\n                    exp_h_han,\n                    [self.num_encs] * n_bb,\n                    self.dropout_dec[0](z_prev[0]),\n                    a_prev[self.num_encs],\n                )\n            ey = torch.cat((ey, att_c), dim=1)\n\n            # attention decoder\n            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_prev, c_prev)\n            if self.context_residual:\n                logits = self.output(\n                    torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)\n                )\n            else:\n                logits = self.output(self.dropout_dec[-1](z_list[-1]))\n            local_scores = att_weight * F.log_softmax(logits, dim=1)\n\n            # rnnlm\n            if rnnlm:\n                rnnlm_state, local_lm_scores = rnnlm.buff_predict(rnnlm_state, vy, n_bb)\n                local_scores = local_scores + recog_args.lm_weight * local_lm_scores\n\n            # ctc\n            if ctc_scorer[0]:\n                local_scores[:, 0] = self.logzero  # avoid choosing blank\n                part_ids = (\n                    torch.topk(local_scores, scoring_num, dim=-1)[1]\n                    if scoring_num > 0\n                    else None\n                )\n                for idx in range(self.num_encs):\n                    att_w = att_w_list[idx]\n                    att_w_ = att_w if isinstance(att_w, torch.Tensor) else att_w[0]\n                    local_ctc_scores, ctc_state[idx] = ctc_scorer[idx](\n                        yseq, ctc_state[idx], part_ids, att_w_\n                    )\n                    local_scores = (\n                        local_scores\n                        + ctc_weight * weights_ctc_dec[idx] * local_ctc_scores\n                    )\n\n            local_scores = local_scores.view(batch, 
beam, self.odim)\n            if i == 0:\n                local_scores[:, 1:, :] = self.logzero\n\n            # accumulate scores\n            eos_vscores = local_scores[:, :, self.eos] + vscores\n            vscores = vscores.view(batch, beam, 1).repeat(1, 1, self.odim)\n            vscores[:, :, self.eos] = self.logzero\n            vscores = (vscores + local_scores).view(batch, -1)\n\n            # global pruning\n            accum_best_scores, accum_best_ids = torch.topk(vscores, beam, 1)\n            accum_odim_ids = (\n                torch.fmod(accum_best_ids, self.odim).view(-1).data.cpu().tolist()\n            )\n            accum_padded_beam_ids = (\n                (accum_best_ids // self.odim + pad_b).view(-1).data.cpu().tolist()\n            )\n\n            y_prev = yseq[:][:]\n            yseq = self._index_select_list(yseq, accum_padded_beam_ids)\n            yseq = self._append_ids(yseq, accum_odim_ids)\n            vscores = accum_best_scores\n            vidx = to_device(h[0], torch.LongTensor(accum_padded_beam_ids))\n\n            a_prev = []\n            num_atts = self.num_encs if self.num_encs == 1 else self.num_encs + 1\n            for idx in range(num_atts):\n                if isinstance(att_w_list[idx], torch.Tensor):\n                    _a_prev = torch.index_select(\n                        att_w_list[idx].view(n_bb, *att_w_list[idx].shape[1:]), 0, vidx\n                    )\n                elif isinstance(att_w_list[idx], list):\n                    # handle the case of multi-head attention\n                    _a_prev = [\n                        torch.index_select(att_w_one.view(n_bb, -1), 0, vidx)\n                        for att_w_one in att_w_list[idx]\n                    ]\n                else:\n                    # handle the case of location_recurrent when return is a tuple\n                    _a_prev_ = torch.index_select(\n                        att_w_list[idx][0].view(n_bb, -1), 0, vidx\n                    )\n      
              _h_prev_ = torch.index_select(\n                        att_w_list[idx][1][0].view(n_bb, -1), 0, vidx\n                    )\n                    _c_prev_ = torch.index_select(\n                        att_w_list[idx][1][1].view(n_bb, -1), 0, vidx\n                    )\n                    _a_prev = (_a_prev_, (_h_prev_, _c_prev_))\n                a_prev.append(_a_prev)\n            z_prev = [\n                torch.index_select(z_list[li].view(n_bb, -1), 0, vidx)\n                for li in range(self.dlayers)\n            ]\n            c_prev = [\n                torch.index_select(c_list[li].view(n_bb, -1), 0, vidx)\n                for li in range(self.dlayers)\n            ]\n\n            # pick ended hyps\n            if i >= minlen:\n                k = 0\n                penalty_i = (i + 1) * penalty\n                thr = accum_best_scores[:, -1]\n                for samp_i in six.moves.range(batch):\n                    if stop_search[samp_i]:\n                        k = k + beam\n                        continue\n                    for beam_j in six.moves.range(beam):\n                        _vscore = None\n                        if eos_vscores[samp_i, beam_j] > thr[samp_i]:\n                            yk = y_prev[k][:]\n                            if len(yk) <= min(\n                                hlens[idx][samp_i] for idx in range(self.num_encs)\n                            ):\n                                _vscore = eos_vscores[samp_i][beam_j] + penalty_i\n                        elif i == maxlen - 1:\n                            yk = yseq[k][:]\n                            _vscore = vscores[samp_i][beam_j] + penalty_i\n                        if _vscore:\n                            yk.append(self.eos)\n                            if rnnlm:\n                                _vscore += recog_args.lm_weight * rnnlm.final(\n                                    rnnlm_state, index=k\n                                )\n              
              _score = _vscore.data.cpu().numpy()\n                            ended_hyps[samp_i].append(\n                                {\"yseq\": yk, \"vscore\": _vscore, \"score\": _score}\n                            )\n                        k = k + 1\n\n            # end detection\n            stop_search = [\n                stop_search[samp_i] or end_detect(ended_hyps[samp_i], i)\n                for samp_i in six.moves.range(batch)\n            ]\n            stop_search_summary = list(set(stop_search))\n            if len(stop_search_summary) == 1 and stop_search_summary[0]:\n                break\n\n            if rnnlm:\n                rnnlm_state = self._index_select_lm_state(rnnlm_state, 0, vidx)\n            if ctc_scorer[0]:\n                for idx in range(self.num_encs):\n                    ctc_state[idx] = ctc_scorer[idx].index_select_state(\n                        ctc_state[idx], accum_best_ids\n                    )\n\n        torch.cuda.empty_cache()\n\n        dummy_hyps = [\n            {\"yseq\": [self.sos, self.eos], \"score\": np.array([-float(\"inf\")])}\n        ]\n        ended_hyps = [\n            ended_hyps[samp_i] if len(ended_hyps[samp_i]) != 0 else dummy_hyps\n            for samp_i in six.moves.range(batch)\n        ]\n        if normalize_score:\n            for samp_i in six.moves.range(batch):\n                for x in ended_hyps[samp_i]:\n                    x[\"score\"] /= len(x[\"yseq\"])\n\n        nbest_hyps = [\n            sorted(ended_hyps[samp_i], key=lambda x: x[\"score\"], reverse=True)[\n                : min(len(ended_hyps[samp_i]), recog_args.nbest)\n            ]\n            for samp_i in six.moves.range(batch)\n        ]\n\n        return nbest_hyps\n\n    def calculate_all_attentions(self, hs_pad, hlen, ys_pad, strm_idx=0, lang_ids=None):\n        \"\"\"Calculate all of attentions\n\n        :param torch.Tensor hs_pad: batch of padded hidden state sequences\n                                    (B, 
Tmax, D)\n                                    in multi-encoder case, list of torch.Tensor,\n                                    [(B, Tmax_1, D), (B, Tmax_2, D), ...]\n        :param torch.Tensor hlen: batch of lengths of hidden state sequences (B)\n                                    in multi-encoder case, list of torch.Tensor,\n                                    [(B), (B), ...]\n        :param torch.Tensor ys_pad:\n            batch of padded character id sequence tensor (B, Lmax)\n        :param int strm_idx:\n            stream index for parallel speaker attention in multi-speaker case\n        :param torch.Tensor lang_ids: batch of target language id tensor (B, 1)\n        :return: attention weights with the following shape,\n            1) multi-head case => attention weights (B, H, Lmax, Tmax),\n            2) multi-encoder case =>\n                [(B, Lmax, Tmax1), (B, Lmax, Tmax2), ..., (B, Lmax, NumEncs)]\n            3) other case => attention weights (B, Lmax, Tmax).\n        :rtype: float ndarray\n        \"\"\"\n        # to support multiple encoder asr mode, in single encoder mode,\n        # convert torch.Tensor to List of torch.Tensor\n        if self.num_encs == 1:\n            hs_pad = [hs_pad]\n            hlen = [hlen]\n\n        # TODO(kan-bayashi): need a smarter way to do this\n        ys = [y[y != self.ignore_id] for y in ys_pad]  # parse padded ys\n        att_idx = min(strm_idx, len(self.att) - 1)\n\n        # hlen should be list of list of integer\n        hlen = [list(map(int, hlen[idx])) for idx in range(self.num_encs)]\n\n        self.loss = None\n        # prepare input and output word sequences with sos/eos IDs\n        eos = ys[0].new([self.eos])\n        sos = ys[0].new([self.sos])\n        if self.replace_sos:\n            ys_in = [torch.cat([idx, y], dim=0) for idx, y in zip(lang_ids, ys)]\n        else:\n            ys_in = [torch.cat([sos, y], dim=0) for y in ys]\n        ys_out = [torch.cat([y, eos], dim=0) for y in 
ys]\n\n        # padding for ys with -1\n        # pys: utt x olen\n        ys_in_pad = pad_list(ys_in, self.eos)\n        ys_out_pad = pad_list(ys_out, self.ignore_id)\n\n        # get length info\n        olength = ys_out_pad.size(1)\n\n        # initialization\n        c_list = [self.zero_state(hs_pad[0])]\n        z_list = [self.zero_state(hs_pad[0])]\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(self.zero_state(hs_pad[0]))\n            z_list.append(self.zero_state(hs_pad[0]))\n        att_ws = []\n        if self.num_encs == 1:\n            att_w = None\n            self.att[att_idx].reset()  # reset pre-computation of h\n        else:\n            att_w_list = [None] * (self.num_encs + 1)  # atts + han\n            att_c_list = [None] * (self.num_encs)  # atts\n            for idx in range(self.num_encs + 1):\n                self.att[idx].reset()  # reset pre-computation of h in atts and han\n\n        # pre-computation of embedding\n        eys = self.dropout_emb(self.embed(ys_in_pad))  # utt x olen x zdim\n\n        # loop for an output sequence\n        for i in six.moves.range(olength):\n            if self.num_encs == 1:\n                att_c, att_w = self.att[att_idx](\n                    hs_pad[0], hlen[0], self.dropout_dec[0](z_list[0]), att_w\n                )\n                att_ws.append(att_w)\n            else:\n                for idx in range(self.num_encs):\n                    att_c_list[idx], att_w_list[idx] = self.att[idx](\n                        hs_pad[idx],\n                        hlen[idx],\n                        self.dropout_dec[0](z_list[0]),\n                        att_w_list[idx],\n                    )\n                hs_pad_han = torch.stack(att_c_list, dim=1)\n                hlen_han = [self.num_encs] * len(ys_in)\n                att_c, att_w_list[self.num_encs] = self.att[self.num_encs](\n                    hs_pad_han,\n                    hlen_han,\n                    
self.dropout_dec[0](z_list[0]),\n                    att_w_list[self.num_encs],\n                )\n                att_ws.append(att_w_list.copy())\n            ey = torch.cat((eys[:, i, :], att_c), dim=1)  # utt x (zdim + hdim)\n            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)\n\n        if self.num_encs == 1:\n            # convert to numpy array with the shape (B, Lmax, Tmax)\n            att_ws = att_to_numpy(att_ws, self.att[att_idx])\n        else:\n            _att_ws = []\n            for idx, ws in enumerate(zip(*att_ws)):\n                ws = att_to_numpy(ws, self.att[idx])\n                _att_ws.append(ws)\n            att_ws = _att_ws\n        return att_ws\n\n    @staticmethod\n    def _get_last_yseq(exp_yseq):\n        last = []\n        for y_seq in exp_yseq:\n            last.append(y_seq[-1])\n        return last\n\n    @staticmethod\n    def _append_ids(yseq, ids):\n        if isinstance(ids, list):\n            for i, j in enumerate(ids):\n                yseq[i].append(j)\n        else:\n            for i in range(len(yseq)):\n                yseq[i].append(ids)\n        return yseq\n\n    @staticmethod\n    def _index_select_list(yseq, lst):\n        new_yseq = []\n        for i in lst:\n            new_yseq.append(yseq[i][:])\n        return new_yseq\n\n    @staticmethod\n    def _index_select_lm_state(rnnlm_state, dim, vidx):\n        if isinstance(rnnlm_state, dict):\n            new_state = {}\n            for k, v in rnnlm_state.items():\n                new_state[k] = [torch.index_select(vi, dim, vidx) for vi in v]\n        elif isinstance(rnnlm_state, list):\n            new_state = []\n            for i in vidx:\n                new_state.append(rnnlm_state[int(i)][:])\n        return new_state\n\n    # scorer interface methods\n    def init_state(self, x):\n        # to support multiple encoder asr mode, in single encoder mode,\n        # convert torch.Tensor to List of torch.Tensor\n        if 
self.num_encs == 1:\n            x = [x]\n\n        c_list = [self.zero_state(x[0].unsqueeze(0))]\n        z_list = [self.zero_state(x[0].unsqueeze(0))]\n        for _ in six.moves.range(1, self.dlayers):\n            c_list.append(self.zero_state(x[0].unsqueeze(0)))\n            z_list.append(self.zero_state(x[0].unsqueeze(0)))\n        # TODO(karita): support strm_index for `asr_mix`\n        strm_index = 0\n        att_idx = min(strm_index, len(self.att) - 1)\n        if self.num_encs == 1:\n            a = None\n            self.att[att_idx].reset()  # reset pre-computation of h\n        else:\n            a = [None] * (self.num_encs + 1)  # atts + han\n            for idx in range(self.num_encs + 1):\n                self.att[idx].reset()  # reset pre-computation of h in atts and han\n        return dict(\n            c_prev=c_list[:],\n            z_prev=z_list[:],\n            a_prev=a,\n            workspace=(att_idx, z_list, c_list),\n        )\n\n    def score(self, yseq, state, x):\n        # to support multiple encoder asr mode, in single encoder mode,\n        # convert torch.Tensor to List of torch.Tensor\n        if self.num_encs == 1:\n            x = [x]\n\n        att_idx, z_list, c_list = state[\"workspace\"]\n        vy = yseq[-1].unsqueeze(0)\n        ey = self.dropout_emb(self.embed(vy))  # utt list (1) x zdim\n        if self.num_encs == 1:\n            att_c, att_w = self.att[att_idx](\n                x[0].unsqueeze(0),\n                [x[0].size(0)],\n                self.dropout_dec[0](state[\"z_prev\"][0]),\n                state[\"a_prev\"],\n            )\n        else:\n            att_w = [None] * (self.num_encs + 1)  # atts + han\n            att_c_list = [None] * (self.num_encs)  # atts\n            for idx in range(self.num_encs):\n                att_c_list[idx], att_w[idx] = self.att[idx](\n                    x[idx].unsqueeze(0),\n                    [x[idx].size(0)],\n                    
self.dropout_dec[0](state[\"z_prev\"][0]),\n                    state[\"a_prev\"][idx],\n                )\n            h_han = torch.stack(att_c_list, dim=1)\n            att_c, att_w[self.num_encs] = self.att[self.num_encs](\n                h_han,\n                [self.num_encs],\n                self.dropout_dec[0](state[\"z_prev\"][0]),\n                state[\"a_prev\"][self.num_encs],\n            )\n        ey = torch.cat((ey, att_c), dim=1)  # utt(1) x (zdim + hdim)\n        z_list, c_list = self.rnn_forward(\n            ey, z_list, c_list, state[\"z_prev\"], state[\"c_prev\"]\n        )\n        if self.context_residual:\n            logits = self.output(\n                torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)\n            )\n        else:\n            logits = self.output(self.dropout_dec[-1](z_list[-1]))\n        logp = F.log_softmax(logits, dim=1).squeeze(0)\n        return (\n            logp,\n            dict(\n                c_prev=c_list[:],\n                z_prev=z_list[:],\n                a_prev=att_w,\n                workspace=(att_idx, z_list, c_list),\n            ),\n        )\n\n\ndef decoder_for(args, odim, sos, eos, att, labeldist):\n    return Decoder(\n        args.eprojs,\n        odim,\n        args.dtype,\n        args.dlayers,\n        args.dunits,\n        sos,\n        eos,\n        att,\n        args.verbose,\n        args.char_list,\n        labeldist,\n        args.lsm_weight,\n        args.sampling_probability,\n        args.dropout_rate_decoder,\n        getattr(args, \"context_residual\", False),  # use getattr to keep compatibility\n        getattr(args, \"replace_sos\", False),  # use getattr to keep compatibility\n        getattr(args, \"num_encs\", 1),\n    )  # use getattr to keep compatibility\n"
  },
  {
    "path": "nets/pytorch_backend/rnn/encoders.py",
    "content": "import logging\nimport six\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch.nn.utils.rnn import pack_padded_sequence\nfrom torch.nn.utils.rnn import pad_packed_sequence\n\nfrom espnet.nets.e2e_asr_common import get_vgg2l_odim\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\n\n\nclass RNNP(torch.nn.Module):\n    \"\"\"RNN with projection layer module\n\n    :param int idim: dimension of inputs\n    :param int elayers: number of encoder layers\n    :param int cdim: number of rnn units (resulted in cdim * 2 if bidirectional)\n    :param int hdim: number of projection units\n    :param np.ndarray subsample: list of subsampling numbers\n    :param float dropout: dropout rate\n    :param str typ: The RNN type\n    \"\"\"\n\n    def __init__(self, idim, elayers, cdim, hdim, subsample, dropout, typ=\"blstm\"):\n        super(RNNP, self).__init__()\n        bidir = typ[0] == \"b\"\n        for i in six.moves.range(elayers):\n            if i == 0:\n                inputdim = idim\n            else:\n                inputdim = hdim\n\n            RNN = torch.nn.LSTM if \"lstm\" in typ else torch.nn.GRU\n            rnn = RNN(\n                inputdim, cdim, num_layers=1, bidirectional=bidir, batch_first=True\n            )\n\n            setattr(self, \"%s%d\" % (\"birnn\" if bidir else \"rnn\", i), rnn)\n\n            # bottleneck layer to merge\n            if bidir:\n                setattr(self, \"bt%d\" % i, torch.nn.Linear(2 * cdim, hdim))\n            else:\n                setattr(self, \"bt%d\" % i, torch.nn.Linear(cdim, hdim))\n\n        self.elayers = elayers\n        self.cdim = cdim\n        self.subsample = subsample\n        self.typ = typ\n        self.bidir = bidir\n        self.dropout = dropout\n\n    def forward(self, xs_pad, ilens, prev_state=None):\n        \"\"\"RNNP forward\n\n        :param torch.Tensor xs_pad: batch of 
padded input sequences (B, Tmax, idim)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor prev_state: batch of previous RNN states\n        :return: batch of hidden state sequences (B, Tmax, hdim)\n        :rtype: torch.Tensor\n        \"\"\"\n        logging.debug(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n        elayer_states = []\n        for layer in six.moves.range(self.elayers):\n            if not isinstance(ilens, torch.Tensor):\n                ilens = torch.tensor(ilens)\n            xs_pack = pack_padded_sequence(xs_pad, ilens.cpu(), batch_first=True)\n            rnn = getattr(self, (\"birnn\" if self.bidir else \"rnn\") + str(layer))\n            rnn.flatten_parameters()\n            if prev_state is not None and rnn.bidirectional:\n                prev_state = reset_backward_rnn_state(prev_state)\n            ys, states = rnn(\n                xs_pack, hx=None if prev_state is None else prev_state[layer]\n            )\n            elayer_states.append(states)\n            # ys: utt list of frame x cdim x 2 (2: means bidirectional)\n            ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)\n            sub = self.subsample[layer + 1]\n            if sub > 1:\n                ys_pad = ys_pad[:, ::sub]\n                ilens = torch.tensor([int(i + 1) // sub for i in ilens])\n            # (sum _utt frame_utt) x dim\n            projection_layer = getattr(self, \"bt%d\" % layer)\n            projected = projection_layer(ys_pad.contiguous().view(-1, ys_pad.size(2)))\n            xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)\n            if layer < self.elayers - 1:\n                xs_pad = torch.tanh(F.dropout(xs_pad, p=self.dropout))\n\n        return xs_pad, ilens, elayer_states  # x: utt list of frame x dim\n\n\nclass RNN(torch.nn.Module):\n    \"\"\"RNN module\n\n    :param int idim: dimension of inputs\n    :param int elayers: number of encoder 
layers\n    :param int cdim: number of rnn units (resulted in cdim * 2 if bidirectional)\n    :param int hdim: number of final projection units\n    :param float dropout: dropout rate\n    :param str typ: The RNN type\n    \"\"\"\n\n    def __init__(self, idim, elayers, cdim, hdim, dropout, typ=\"blstm\"):\n        super(RNN, self).__init__()\n        bidir = typ[0] == \"b\"\n        self.nbrnn = (\n            torch.nn.LSTM(\n                idim,\n                cdim,\n                elayers,\n                batch_first=True,\n                dropout=dropout,\n                bidirectional=bidir,\n            )\n            if \"lstm\" in typ\n            else torch.nn.GRU(\n                idim,\n                cdim,\n                elayers,\n                batch_first=True,\n                dropout=dropout,\n                bidirectional=bidir,\n            )\n        )\n        if bidir:\n            self.l_last = torch.nn.Linear(cdim * 2, hdim)\n        else:\n            self.l_last = torch.nn.Linear(cdim, hdim)\n        self.typ = typ\n\n    def forward(self, xs_pad, ilens, prev_state=None):\n        \"\"\"RNN forward\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor prev_state: batch of previous RNN states\n        :return: batch of hidden state sequences (B, Tmax, eprojs)\n        :rtype: torch.Tensor\n        \"\"\"\n        logging.debug(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n        if not isinstance(ilens, torch.Tensor):\n            ilens = torch.tensor(ilens)\n        xs_pack = pack_padded_sequence(xs_pad, ilens.cpu(), batch_first=True)\n        self.nbrnn.flatten_parameters()\n        if prev_state is not None and self.nbrnn.bidirectional:\n            # We assume that when previous state is passed,\n            # it means that we're streaming the input\n            # and therefore 
cannot propagate backward BRNN state\n            # (otherwise it goes in the wrong direction)\n            prev_state = reset_backward_rnn_state(prev_state)\n        ys, states = self.nbrnn(xs_pack, hx=prev_state)\n        # ys: utt list of frame x cdim x 2 (2: means bidirectional)\n        ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)\n        # (sum _utt frame_utt) x dim\n        projected = torch.tanh(\n            self.l_last(ys_pad.contiguous().view(-1, ys_pad.size(2)))\n        )\n        xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)\n        return xs_pad, ilens, states  # x: utt list of frame x dim\n\n\ndef reset_backward_rnn_state(states):\n    \"\"\"Sets backward BRNN states to zeroes\n\n    Useful in processing of sliding windows over the inputs\n    \"\"\"\n    if isinstance(states, (list, tuple)):\n        for state in states:\n            state[1::2] = 0.0\n    else:\n        states[1::2] = 0.0\n    return states\n\n\nclass VGG2L(torch.nn.Module):\n    \"\"\"VGG-like module\n\n    :param int in_channel: number of input channels\n    \"\"\"\n\n    def __init__(self, in_channel=1):\n        super(VGG2L, self).__init__()\n        # CNN layer (VGG motivated)\n        self.conv1_1 = torch.nn.Conv2d(in_channel, 64, 3, stride=1, padding=1)\n        self.conv1_2 = torch.nn.Conv2d(64, 64, 3, stride=1, padding=1)\n        self.conv2_1 = torch.nn.Conv2d(64, 128, 3, stride=1, padding=1)\n        self.conv2_2 = torch.nn.Conv2d(128, 128, 3, stride=1, padding=1)\n\n        self.in_channel = in_channel\n\n    def forward(self, xs_pad, ilens, **kwargs):\n        \"\"\"VGG2L forward\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :return: batch of padded hidden state sequences (B, Tmax // 4, 128 * D // 4)\n        :rtype: torch.Tensor\n        \"\"\"\n        logging.debug(self.__class__.__name__ + \" input lengths: \" + 
str(ilens))\n\n        # x: utt x frame x dim\n        # xs_pad = F.pad_sequence(xs_pad)\n\n        # x: utt x 1 (input channel num) x frame x dim\n        xs_pad = xs_pad.view(\n            xs_pad.size(0),\n            xs_pad.size(1),\n            self.in_channel,\n            xs_pad.size(2) // self.in_channel,\n        ).transpose(1, 2)\n\n        # NOTE: max_pool1d ?\n        xs_pad = F.relu(self.conv1_1(xs_pad))\n        xs_pad = F.relu(self.conv1_2(xs_pad))\n        xs_pad = F.max_pool2d(xs_pad, 2, stride=2, ceil_mode=True)\n\n        xs_pad = F.relu(self.conv2_1(xs_pad))\n        xs_pad = F.relu(self.conv2_2(xs_pad))\n        xs_pad = F.max_pool2d(xs_pad, 2, stride=2, ceil_mode=True)\n        if torch.is_tensor(ilens):\n            ilens = ilens.cpu().numpy()\n        else:\n            ilens = np.array(ilens, dtype=np.float32)\n        ilens = np.array(np.ceil(ilens / 2), dtype=np.int64)\n        ilens = np.array(\n            np.ceil(np.array(ilens, dtype=np.float32) / 2), dtype=np.int64\n        ).tolist()\n\n        # x: utt_list of frame (remove zeropaded frames) x (input channel num x dim)\n        xs_pad = xs_pad.transpose(1, 2)\n        xs_pad = xs_pad.contiguous().view(\n            xs_pad.size(0), xs_pad.size(1), xs_pad.size(2) * xs_pad.size(3)\n        )\n        return xs_pad, ilens, None  # no state in this layer\n\n\nclass Encoder(torch.nn.Module):\n    \"\"\"Encoder module\n\n    :param str etype: type of encoder network\n    :param int idim: number of dimensions of encoder network\n    :param int elayers: number of layers of encoder network\n    :param int eunits: number of lstm units of encoder network\n    :param int eprojs: number of projection units of encoder network\n    :param np.ndarray subsample: list of subsampling numbers\n    :param float dropout: dropout rate\n    :param int in_channel: number of input channels\n    \"\"\"\n\n    def __init__(\n        self, etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1\n  
  ):\n        super(Encoder, self).__init__()\n        # str.lstrip(\"vgg\") strips any leading 'v'/'g' characters (\"gru\" -> \"ru\"),\n        # so remove the \"vgg\" prefix explicitly instead.\n        typ = etype[3:] if etype.startswith(\"vgg\") else etype\n        typ = typ.rstrip(\"p\")\n        if typ not in [\"lstm\", \"gru\", \"blstm\", \"bgru\"]:\n            logging.error(\"Error: need to specify an appropriate encoder architecture\")\n\n        if etype.startswith(\"vgg\"):\n            if etype[-1] == \"p\":\n                self.enc = torch.nn.ModuleList(\n                    [\n                        VGG2L(in_channel),\n                        RNNP(\n                            get_vgg2l_odim(idim, in_channel=in_channel),\n                            elayers,\n                            eunits,\n                            eprojs,\n                            subsample,\n                            dropout,\n                            typ=typ,\n                        ),\n                    ]\n                )\n                logging.info(\"Use CNN-VGG + \" + typ.upper() + \"P for encoder\")\n            else:\n                self.enc = torch.nn.ModuleList(\n                    [\n                        VGG2L(in_channel),\n                        RNN(\n                            get_vgg2l_odim(idim, in_channel=in_channel),\n                            elayers,\n                            eunits,\n                            eprojs,\n                            dropout,\n                            typ=typ,\n                        ),\n                    ]\n                )\n                logging.info(\"Use CNN-VGG + \" + typ.upper() + \" for encoder\")\n            self.conv_subsampling_factor = 4\n        else:\n            if etype[-1] == \"p\":\n                self.enc = torch.nn.ModuleList(\n                    [RNNP(idim, elayers, eunits, eprojs, subsample, dropout, typ=typ)]\n                )\n                logging.info(typ.upper() + \" with every-layer projection for encoder\")\n            else:\n                self.enc = torch.nn.ModuleList(\n                    [RNN(idim, elayers, eunits, eprojs, dropout, 
typ=typ)]\n                )\n                logging.info(typ.upper() + \" without projection for encoder\")\n            self.conv_subsampling_factor = 1\n\n    def forward(self, xs_pad, ilens, prev_states=None):\n        \"\"\"Encoder forward\n\n        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)\n        :param torch.Tensor ilens: batch of lengths of input sequences (B)\n        :param torch.Tensor prev_state: batch of previous encoder hidden states (?, ...)\n        :return: batch of hidden state sequences (B, Tmax, eprojs)\n        :rtype: torch.Tensor\n        \"\"\"\n        if prev_states is None:\n            prev_states = [None] * len(self.enc)\n        assert len(prev_states) == len(self.enc)\n\n        current_states = []\n        for module, prev_state in zip(self.enc, prev_states):\n            xs_pad, ilens, states = module(xs_pad, ilens, prev_state=prev_state)\n            current_states.append(states)\n\n        # make mask to remove bias value in padded part\n        mask = to_device(xs_pad, make_pad_mask(ilens).unsqueeze(-1))\n\n        return xs_pad.masked_fill(mask, 0.0), ilens, current_states\n\n\ndef encoder_for(args, idim, subsample):\n    \"\"\"Instantiates an encoder module given the program arguments\n\n    :param Namespace args: The arguments\n    :param int or List of integer idim: dimension of input, e.g. 83, or\n                                        List of dimensions of inputs, e.g. [83,83]\n    :param List or List of List subsample: subsample factors, e.g. [1,2,2,1,1], or\n                                        List of subsample factors of each encoder.\n                                         e.g. 
[[1,2,2,1,1], [1,2,2,1,1]]\n    :rtype: torch.nn.Module\n    :return: The encoder module\n    \"\"\"\n    num_encs = getattr(args, \"num_encs\", 1)  # use getattr to keep compatibility\n    if num_encs == 1:\n        # compatible with single encoder asr mode\n        return Encoder(\n            args.etype,\n            idim,\n            args.elayers,\n            args.eunits,\n            args.eprojs,\n            subsample,\n            args.dropout_rate,\n        )\n    elif num_encs > 1:\n        enc_list = torch.nn.ModuleList()\n        for idx in range(num_encs):\n            enc = Encoder(\n                args.etype[idx],\n                idim[idx],\n                args.elayers[idx],\n                args.eunits[idx],\n                args.eprojs,\n                subsample[idx],\n                args.dropout_rate[idx],\n            )\n            enc_list.append(enc)\n        return enc_list\n    else:\n        raise ValueError(\n            \"Number of encoders needs to be at least one. {}\".format(num_encs)\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/streaming/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/streaming/segment.py",
    "content": "import numpy as np\nimport torch\n\n\nclass SegmentStreamingE2E(object):\n    \"\"\"SegmentStreamingE2E constructor.\n\n    :param E2E e2e: E2E ASR object\n    :param recog_args: arguments for \"recognize\" method of E2E\n    \"\"\"\n\n    def __init__(self, e2e, recog_args, rnnlm=None):\n        self._e2e = e2e\n        self._recog_args = recog_args\n        self._char_list = e2e.char_list\n        self._rnnlm = rnnlm\n\n        self._e2e.eval()\n\n        self._blank_idx_in_char_list = -1\n        for idx in range(len(self._char_list)):\n            if self._char_list[idx] == self._e2e.blank:\n                self._blank_idx_in_char_list = idx\n                break\n\n        self._subsampling_factor = np.prod(e2e.subsample)\n        self._activates = 0\n        self._blank_dur = 0\n\n        self._previous_input = []\n        self._previous_encoder_recurrent_state = None\n        self._encoder_states = []\n        self._ctc_posteriors = []\n\n        assert (\n            self._recog_args.batchsize <= 1\n        ), \"SegmentStreamingE2E works only with batch size <= 1\"\n        assert (\n            \"b\" not in self._e2e.etype\n        ), \"SegmentStreamingE2E works only with uni-directional encoders\"\n\n    def accept_input(self, x):\n        \"\"\"Call this method each time a new batch of input is available.\"\"\"\n\n        self._previous_input.extend(x)\n        h, ilen = self._e2e.subsample_frames(x)\n\n        # Run encoder and apply greedy search on CTC softmax output\n        h, _, self._previous_encoder_recurrent_state = self._e2e.enc(\n            h.unsqueeze(0), ilen, self._previous_encoder_recurrent_state\n        )\n        z = self._e2e.ctc.argmax(h).squeeze(0)\n\n        if self._activates == 0 and z[0] != self._blank_idx_in_char_list:\n            self._activates = 1\n\n            # Rerun encoder with zero state at onset of detection\n            tail_len = self._subsampling_factor * (\n                
self._recog_args.streaming_onset_margin + 1\n            )\n            h, ilen = self._e2e.subsample_frames(\n                np.reshape(\n                    self._previous_input[-tail_len:], [-1, len(self._previous_input[0])]\n                )\n            )\n            h, _, self._previous_encoder_recurrent_state = self._e2e.enc(\n                h.unsqueeze(0), ilen, None\n            )\n\n        hyp = None\n        if self._activates == 1:\n            self._encoder_states.extend(h.squeeze(0))\n            self._ctc_posteriors.extend(self._e2e.ctc.log_softmax(h).squeeze(0))\n\n            if z[0] == self._blank_idx_in_char_list:\n                self._blank_dur += 1\n            else:\n                self._blank_dur = 0\n\n            if self._blank_dur >= self._recog_args.streaming_min_blank_dur:\n                seg_len = (\n                    len(self._encoder_states)\n                    - self._blank_dur\n                    + self._recog_args.streaming_offset_margin\n                )\n                if seg_len > 0:\n                    # Run decoder with a detected segment\n                    h = torch.cat(self._encoder_states[:seg_len], dim=0).view(\n                        -1, self._encoder_states[0].size(0)\n                    )\n                    if self._recog_args.ctc_weight > 0.0:\n                        lpz = torch.cat(self._ctc_posteriors[:seg_len], dim=0).view(\n                            -1, self._ctc_posteriors[0].size(0)\n                        )\n                        if self._recog_args.batchsize > 0:\n                            lpz = lpz.unsqueeze(0)\n                        normalize_score = False\n                    else:\n                        lpz = None\n                        normalize_score = True\n\n                    if self._recog_args.batchsize == 0:\n                        hyp = self._e2e.dec.recognize_beam(\n                            h, lpz, self._recog_args, self._char_list, self._rnnlm\n             
           )\n                    else:\n                        hlens = torch.tensor([h.shape[0]])\n                        hyp = self._e2e.dec.recognize_beam_batch(\n                            h.unsqueeze(0),\n                            hlens,\n                            lpz,\n                            self._recog_args,\n                            self._char_list,\n                            self._rnnlm,\n                            normalize_score=normalize_score,\n                        )[0]\n\n                    self._activates = 0\n                    self._blank_dur = 0\n\n                    tail_len = (\n                        self._subsampling_factor\n                        * self._recog_args.streaming_onset_margin\n                    )\n                    self._previous_input = self._previous_input[-tail_len:]\n                    self._encoder_states = []\n                    self._ctc_posteriors = []\n\n        return hyp\n"
  },
  {
    "path": "nets/pytorch_backend/streaming/window.py",
    "content": "import torch\n\n\n# TODO(pzelasko): Currently allows half-streaming only;\n#  needs streaming attention decoder implementation\nclass WindowStreamingE2E(object):\n    \"\"\"WindowStreamingE2E constructor.\n\n    :param E2E e2e: E2E ASR object\n    :param recog_args: arguments for \"recognize\" method of E2E\n    \"\"\"\n\n    def __init__(self, e2e, recog_args, rnnlm=None):\n        self._e2e = e2e\n        self._recog_args = recog_args\n        self._char_list = e2e.char_list\n        self._rnnlm = rnnlm\n\n        self._e2e.eval()\n\n        self._offset = 0\n        self._previous_encoder_recurrent_state = None\n        self._encoder_states = []\n        self._ctc_posteriors = []\n        self._last_recognition = None\n\n        assert (\n            self._recog_args.ctc_weight > 0.0\n        ), \"WindowStreamingE2E works only with combined CTC and attention decoders.\"\n\n    def accept_input(self, x):\n        \"\"\"Call this method each time a new batch of input is available.\"\"\"\n\n        h, ilen = self._e2e.subsample_frames(x)\n\n        # Streaming encoder\n        h, _, self._previous_encoder_recurrent_state = self._e2e.enc(\n            h.unsqueeze(0), ilen, self._previous_encoder_recurrent_state\n        )\n        self._encoder_states.append(h.squeeze(0))\n\n        # CTC posteriors for the incoming audio\n        self._ctc_posteriors.append(self._e2e.ctc.log_softmax(h).squeeze(0))\n\n    def _input_window_for_decoder(self, use_all=False):\n        if use_all:\n            return (\n                torch.cat(self._encoder_states, dim=0),\n                torch.cat(self._ctc_posteriors, dim=0),\n            )\n\n        def select_unprocessed_windows(window_tensors):\n            last_offset = self._offset\n            offset_traversed = 0\n            selected_windows = []\n            for es in window_tensors:\n                if offset_traversed > last_offset:\n                    selected_windows.append(es)\n                    
continue\n                offset_traversed += es.size(1)\n            return torch.cat(selected_windows, dim=0)\n\n        return (\n            select_unprocessed_windows(self._encoder_states),\n            select_unprocessed_windows(self._ctc_posteriors),\n        )\n\n    def decode_with_attention_offline(self):\n        \"\"\"Run the attention decoder offline.\n\n        Works even if the previous layers (encoder and CTC decoder) were\n        being run in the online mode.\n        This method should be run after all the audio has been consumed.\n        This is used mostly to compare the results between offline\n        and online implementation of the previous layers.\n        \"\"\"\n        h, lpz = self._input_window_for_decoder(use_all=True)\n\n        return self._e2e.dec.recognize_beam(\n            h, lpz, self._recog_args, self._char_list, self._rnnlm\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/tacotron2/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/tacotron2/cbhg.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"CBHG related modules.\"\"\"\n\nimport torch\nimport torch.nn.functional as F\n\nfrom torch.nn.utils.rnn import pack_padded_sequence\nfrom torch.nn.utils.rnn import pad_packed_sequence\n\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\n\n\nclass CBHGLoss(torch.nn.Module):\n    \"\"\"Loss function module for CBHG.\"\"\"\n\n    def __init__(self, use_masking=True):\n        \"\"\"Initialize CBHG loss module.\n\n        Args:\n            use_masking (bool): Whether to mask padded part in loss calculation.\n\n        \"\"\"\n        super(CBHGLoss, self).__init__()\n        self.use_masking = use_masking\n\n    def forward(self, cbhg_outs, spcs, olens):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            cbhg_outs (Tensor): Batch of CBHG outputs (B, Lmax, spc_dim).\n            spcs (Tensor): Batch of groundtruth of spectrogram (B, Lmax, spc_dim).\n            olens (LongTensor): Batch of the lengths of each sequence (B,).\n\n        Returns:\n            Tensor: L1 loss value\n            Tensor: Mean square error loss value.\n\n        \"\"\"\n        # perform masking for padded values\n        if self.use_masking:\n            mask = make_non_pad_mask(olens).unsqueeze(-1).to(spcs.device)\n            spcs = spcs.masked_select(mask)\n            cbhg_outs = cbhg_outs.masked_select(mask)\n\n        # calculate loss\n        cbhg_l1_loss = F.l1_loss(cbhg_outs, spcs)\n        cbhg_mse_loss = F.mse_loss(cbhg_outs, spcs)\n\n        return cbhg_l1_loss, cbhg_mse_loss\n\n\nclass CBHG(torch.nn.Module):\n    \"\"\"CBHG module to convert log Mel-filterbanks to linear spectrogram.\n\n    This is a module of CBHG introduced\n    in `Tacotron: Towards End-to-End Speech Synthesis`_.\n    The CBHG converts the sequence of log Mel-filterbanks into 
linear spectrogram.\n\n    .. _`Tacotron: Towards End-to-End Speech Synthesis`:\n         https://arxiv.org/abs/1703.10135\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        odim,\n        conv_bank_layers=8,\n        conv_bank_chans=128,\n        conv_proj_filts=3,\n        conv_proj_chans=256,\n        highway_layers=4,\n        highway_units=128,\n        gru_units=256,\n    ):\n        \"\"\"Initialize CBHG module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            conv_bank_layers (int, optional): The number of convolution bank layers.\n            conv_bank_chans (int, optional): The number of channels in convolution bank.\n            conv_proj_filts (int, optional):\n                Kernel size of convolutional projection layer.\n            conv_proj_chans (int, optional):\n                The number of channels in convolutional projection layer.\n            highway_layers (int, optional): The number of highway network layers.\n            highway_units (int, optional): The number of highway network units.\n            gru_units (int, optional): The number of GRU units (for both directions).\n\n        \"\"\"\n        super(CBHG, self).__init__()\n        self.idim = idim\n        self.odim = odim\n        self.conv_bank_layers = conv_bank_layers\n        self.conv_bank_chans = conv_bank_chans\n        self.conv_proj_filts = conv_proj_filts\n        self.conv_proj_chans = conv_proj_chans\n        self.highway_layers = highway_layers\n        self.highway_units = highway_units\n        self.gru_units = gru_units\n\n        # define 1d convolution bank\n        self.conv_bank = torch.nn.ModuleList()\n        for k in range(1, self.conv_bank_layers + 1):\n            if k % 2 != 0:\n                padding = (k - 1) // 2\n            else:\n                padding = ((k - 1) // 2, (k - 1) // 2 + 1)\n            self.conv_bank += [\n                
torch.nn.Sequential(\n                    torch.nn.ConstantPad1d(padding, 0.0),\n                    torch.nn.Conv1d(\n                        idim, self.conv_bank_chans, k, stride=1, padding=0, bias=True\n                    ),\n                    torch.nn.BatchNorm1d(self.conv_bank_chans),\n                    torch.nn.ReLU(),\n                )\n            ]\n\n        # define max pooling (need padding for one-side to keep same length)\n        self.max_pool = torch.nn.Sequential(\n            torch.nn.ConstantPad1d((0, 1), 0.0), torch.nn.MaxPool1d(2, stride=1)\n        )\n\n        # define 1d convolution projection\n        self.projections = torch.nn.Sequential(\n            torch.nn.Conv1d(\n                self.conv_bank_chans * self.conv_bank_layers,\n                self.conv_proj_chans,\n                self.conv_proj_filts,\n                stride=1,\n                padding=(self.conv_proj_filts - 1) // 2,\n                bias=True,\n            ),\n            torch.nn.BatchNorm1d(self.conv_proj_chans),\n            torch.nn.ReLU(),\n            torch.nn.Conv1d(\n                self.conv_proj_chans,\n                self.idim,\n                self.conv_proj_filts,\n                stride=1,\n                padding=(self.conv_proj_filts - 1) // 2,\n                bias=True,\n            ),\n            torch.nn.BatchNorm1d(self.idim),\n        )\n\n        # define highway network\n        self.highways = torch.nn.ModuleList()\n        self.highways += [torch.nn.Linear(idim, self.highway_units)]\n        for _ in range(self.highway_layers):\n            self.highways += [HighwayNet(self.highway_units)]\n\n        # define bidirectional GRU\n        self.gru = torch.nn.GRU(\n            self.highway_units,\n            gru_units // 2,\n            num_layers=1,\n            batch_first=True,\n            bidirectional=True,\n        )\n\n        # define final projection\n        self.output = torch.nn.Linear(gru_units, odim, bias=True)\n\n    
def forward(self, xs, ilens):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of the padded sequences of inputs (B, Tmax, idim).\n            ilens (LongTensor): Batch of lengths of each input sequence (B,).\n\n        Return:\n            Tensor: Batch of the padded sequence of outputs (B, Tmax, odim).\n            LongTensor: Batch of lengths of each output sequence (B,).\n\n        \"\"\"\n        xs = xs.transpose(1, 2)  # (B, idim, Tmax)\n        convs = []\n        for k in range(self.conv_bank_layers):\n            convs += [self.conv_bank[k](xs)]\n        convs = torch.cat(convs, dim=1)  # (B, #CH * #BANK, Tmax)\n        convs = self.max_pool(convs)\n        convs = self.projections(convs).transpose(1, 2)  # (B, Tmax, idim)\n        xs = xs.transpose(1, 2) + convs\n        # + 1 for dimension adjustment layer\n        for i in range(self.highway_layers + 1):\n            xs = self.highways[i](xs)\n\n        # sort by length\n        xs, ilens, sort_idx = self._sort_by_length(xs, ilens)\n\n        # total_length needs for DataParallel\n        # (see https://github.com/pytorch/pytorch/pull/6327)\n        total_length = xs.size(1)\n        if not isinstance(ilens, torch.Tensor):\n            ilens = torch.tensor(ilens)\n        xs = pack_padded_sequence(xs, ilens.cpu(), batch_first=True)\n        self.gru.flatten_parameters()\n        xs, _ = self.gru(xs)\n        xs, ilens = pad_packed_sequence(xs, batch_first=True, total_length=total_length)\n\n        # revert sorting by length\n        xs, ilens = self._revert_sort_by_length(xs, ilens, sort_idx)\n\n        xs = self.output(xs)  # (B, Tmax, odim)\n\n        return xs, ilens\n\n    def inference(self, x):\n        \"\"\"Inference.\n\n        Args:\n            x (Tensor): The sequences of inputs (T, idim).\n\n        Return:\n            Tensor: The sequence of outputs (T, odim).\n\n        \"\"\"\n        assert len(x.size()) == 2\n        xs = 
x.unsqueeze(0)\n        ilens = x.new([x.size(0)]).long()\n\n        return self.forward(xs, ilens)[0][0]\n\n    def _sort_by_length(self, xs, ilens):\n        sort_ilens, sort_idx = ilens.sort(0, descending=True)\n        return xs[sort_idx], ilens[sort_idx], sort_idx\n\n    def _revert_sort_by_length(self, xs, ilens, sort_idx):\n        _, revert_idx = sort_idx.sort(0)\n        return xs[revert_idx], ilens[revert_idx]\n\n\nclass HighwayNet(torch.nn.Module):\n    \"\"\"Highway Network module.\n\n    This is a module of Highway Network introduced in `Highway Networks`_.\n\n    .. _`Highway Networks`: https://arxiv.org/abs/1505.00387\n\n    \"\"\"\n\n    def __init__(self, idim):\n        \"\"\"Initialize Highway Network module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n\n        \"\"\"\n        super(HighwayNet, self).__init__()\n        self.idim = idim\n        self.projection = torch.nn.Sequential(\n            torch.nn.Linear(idim, idim), torch.nn.ReLU()\n        )\n        self.gate = torch.nn.Sequential(torch.nn.Linear(idim, idim), torch.nn.Sigmoid())\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (Tensor): Batch of inputs (B, ..., idim).\n\n        Returns:\n            Tensor: Batch of outputs, which are the same shape as inputs (B, ..., idim).\n\n        \"\"\"\n        proj = self.projection(x)\n        gate = self.gate(x)\n        return proj * gate + x * (1.0 - gate)\n"
  },
  {
    "path": "nets/pytorch_backend/tacotron2/decoder.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Tacotron2 decoder related modules.\"\"\"\n\nimport six\n\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.pytorch_backend.rnn.attentions import AttForwardTA\n\n\ndef decoder_init(m):\n    \"\"\"Initialize decoder parameters.\"\"\"\n    if isinstance(m, torch.nn.Conv1d):\n        torch.nn.init.xavier_uniform_(m.weight, torch.nn.init.calculate_gain(\"tanh\"))\n\n\nclass ZoneOutCell(torch.nn.Module):\n    \"\"\"ZoneOut Cell module.\n\n    This is a module of zoneout described in\n    `Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations`_.\n    This code is modified from `eladhoffer/seq2seq.pytorch`_.\n\n    Examples:\n        >>> lstm = torch.nn.LSTMCell(16, 32)\n        >>> lstm = ZoneOutCell(lstm, 0.5)\n\n    .. _`Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations`:\n        https://arxiv.org/abs/1606.01305\n\n    .. _`eladhoffer/seq2seq.pytorch`:\n        https://github.com/eladhoffer/seq2seq.pytorch\n\n    \"\"\"\n\n    def __init__(self, cell, zoneout_rate=0.1):\n        \"\"\"Initialize zone out cell module.\n\n        Args:\n            cell (torch.nn.Module): Pytorch recurrent cell module\n                e.g. 
`torch.nn.LSTMCell`.\n            zoneout_rate (float, optional): Probability of zoneout from 0.0 to 1.0.\n\n        \"\"\"\n        super(ZoneOutCell, self).__init__()\n        self.cell = cell\n        self.hidden_size = cell.hidden_size\n        self.zoneout_rate = zoneout_rate\n        if zoneout_rate > 1.0 or zoneout_rate < 0.0:\n            raise ValueError(\n                \"zoneout probability must be in the range from 0.0 to 1.0.\"\n            )\n\n    def forward(self, inputs, hidden):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            inputs (Tensor): Batch of input tensor (B, input_size).\n            hidden (tuple):\n                - Tensor: Batch of initial hidden states (B, hidden_size).\n                - Tensor: Batch of initial cell states (B, hidden_size).\n\n        Returns:\n            tuple:\n                - Tensor: Batch of next hidden states (B, hidden_size).\n                - Tensor: Batch of next cell states (B, hidden_size).\n\n        \"\"\"\n        next_hidden = self.cell(inputs, hidden)\n        next_hidden = self._zoneout(hidden, next_hidden, self.zoneout_rate)\n        return next_hidden\n\n    def _zoneout(self, h, next_h, prob):\n        # apply recursively\n        if isinstance(h, tuple):\n            num_h = len(h)\n            if not isinstance(prob, tuple):\n                prob = tuple([prob] * num_h)\n            return tuple(\n                [self._zoneout(h[i], next_h[i], prob[i]) for i in range(num_h)]\n            )\n\n        if self.training:\n            mask = h.new(*h.size()).bernoulli_(prob)\n            return mask * h + (1 - mask) * next_h\n        else:\n            return prob * h + (1 - prob) * next_h\n\n\nclass Prenet(torch.nn.Module):\n    \"\"\"Prenet module for decoder of Spectrogram prediction network.\n\n    This is a module of Prenet in the decoder of Spectrogram prediction network,\n    which is described in `Natural TTS\n    Synthesis by Conditioning WaveNet on 
Mel Spectrogram Predictions`_.\n    The Prenet performs nonlinear conversion\n    of inputs before input to auto-regressive lstm,\n    which helps to learn diagonal attentions.\n\n    Note:\n        This module always applies dropout even in evaluation.\n        See the detail in `Natural TTS Synthesis by\n        Conditioning WaveNet on Mel Spectrogram Predictions`_.\n\n    .. _`Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions`:\n       https://arxiv.org/abs/1712.05884\n\n    \"\"\"\n\n    def __init__(self, idim, n_layers=2, n_units=256, dropout_rate=0.5):\n        \"\"\"Initialize prenet module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            n_layers (int, optional): The number of prenet layers.\n            n_units (int, optional): The number of prenet units.\n            dropout_rate (float, optional): Dropout rate.\n\n        \"\"\"\n        super(Prenet, self).__init__()\n        self.dropout_rate = dropout_rate\n        self.prenet = torch.nn.ModuleList()\n        for layer in six.moves.range(n_layers):\n            n_inputs = idim if layer == 0 else n_units\n            self.prenet += [\n                torch.nn.Sequential(torch.nn.Linear(n_inputs, n_units), torch.nn.ReLU())\n            ]\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (Tensor): Batch of input tensors (B, ..., idim).\n\n        Returns:\n            Tensor: Batch of output tensors (B, ..., n_units).\n\n        \"\"\"\n        for i in six.moves.range(len(self.prenet)):\n            # NOTE: F.dropout defaults to training=True, so dropout is applied\n            # even in evaluation mode (see the class docstring).\n            x = F.dropout(self.prenet[i](x), self.dropout_rate)\n        return x\n\n\nclass Postnet(torch.nn.Module):\n    \"\"\"Postnet module for Spectrogram prediction network.\n\n    This is a module of Postnet in Spectrogram prediction network,\n    which is described in `Natural TTS Synthesis by\n    Conditioning WaveNet on Mel Spectrogram Predictions`_.\n    The Postnet refines the predicted\n    
Mel-filterbank of the decoder,\n    which helps to compensate for the detailed structure of the spectrogram.\n\n    .. _`Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions`:\n       https://arxiv.org/abs/1712.05884\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        odim,\n        n_layers=5,\n        n_chans=512,\n        n_filts=5,\n        dropout_rate=0.5,\n        use_batch_norm=True,\n    ):\n        \"\"\"Initialize postnet module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            n_layers (int, optional): The number of layers.\n            n_filts (int, optional): The filter (kernel) size.\n            n_chans (int, optional): The number of filter channels.\n            use_batch_norm (bool, optional): Whether to use batch normalization.\n            dropout_rate (float, optional): Dropout rate.\n\n        \"\"\"\n        super(Postnet, self).__init__()\n        self.postnet = torch.nn.ModuleList()\n        for layer in six.moves.range(n_layers - 1):\n            ichans = odim if layer == 0 else n_chans\n            ochans = odim if layer == n_layers - 1 else n_chans\n            if use_batch_norm:\n                self.postnet += [\n                    torch.nn.Sequential(\n                        torch.nn.Conv1d(\n                            ichans,\n                            ochans,\n                            n_filts,\n                            stride=1,\n                            padding=(n_filts - 1) // 2,\n                            bias=False,\n                        ),\n                        torch.nn.BatchNorm1d(ochans),\n                        torch.nn.Tanh(),\n                        torch.nn.Dropout(dropout_rate),\n                    )\n                ]\n            else:\n                self.postnet += [\n                    torch.nn.Sequential(\n                        torch.nn.Conv1d(\n                       
     ichans,\n                            ochans,\n                            n_filts,\n                            stride=1,\n                            padding=(n_filts - 1) // 2,\n                            bias=False,\n                        ),\n                        torch.nn.Tanh(),\n                        torch.nn.Dropout(dropout_rate),\n                    )\n                ]\n        ichans = n_chans if n_layers != 1 else odim\n        if use_batch_norm:\n            self.postnet += [\n                torch.nn.Sequential(\n                    torch.nn.Conv1d(\n                        ichans,\n                        odim,\n                        n_filts,\n                        stride=1,\n                        padding=(n_filts - 1) // 2,\n                        bias=False,\n                    ),\n                    torch.nn.BatchNorm1d(odim),\n                    torch.nn.Dropout(dropout_rate),\n                )\n            ]\n        else:\n            self.postnet += [\n                torch.nn.Sequential(\n                    torch.nn.Conv1d(\n                        ichans,\n                        odim,\n                        n_filts,\n                        stride=1,\n                        padding=(n_filts - 1) // 2,\n                        bias=False,\n                    ),\n                    torch.nn.Dropout(dropout_rate),\n                )\n            ]\n\n    def forward(self, xs):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of the sequences of padded input tensors (B, idim, Tmax).\n\n        Returns:\n            Tensor: Batch of padded output tensor. 
(B, odim, Tmax).\n\n        \"\"\"\n        for i in six.moves.range(len(self.postnet)):\n            xs = self.postnet[i](xs)\n        return xs\n\n\nclass Decoder(torch.nn.Module):\n    \"\"\"Decoder module of Spectrogram prediction network.\n\n    This is a module of decoder of Spectrogram prediction network in Tacotron2,\n    which is described in `Natural TTS\n    Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions`_.\n    The decoder generates the sequence of\n    features from the sequence of the hidden states.\n\n    .. _`Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions`:\n       https://arxiv.org/abs/1712.05884\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        odim,\n        att,\n        dlayers=2,\n        dunits=1024,\n        prenet_layers=2,\n        prenet_units=256,\n        postnet_layers=5,\n        postnet_chans=512,\n        postnet_filts=5,\n        output_activation_fn=None,\n        cumulate_att_w=True,\n        use_batch_norm=True,\n        use_concate=True,\n        dropout_rate=0.5,\n        zoneout_rate=0.1,\n        reduction_factor=1,\n    ):\n        \"\"\"Initialize Tacotron2 decoder module.\n\n        Args:\n            idim (int): Dimension of the inputs.\n            odim (int): Dimension of the outputs.\n            att (torch.nn.Module): Instance of attention class.\n            dlayers (int, optional): The number of decoder lstm layers.\n            dunits (int, optional): The number of decoder lstm units.\n            prenet_layers (int, optional): The number of prenet layers.\n            prenet_units (int, optional): The number of prenet units.\n            postnet_layers (int, optional): The number of postnet layers.\n            postnet_filts (int, optional): The postnet filter (kernel) size.\n            postnet_chans (int, optional): The number of postnet filter channels.\n            output_activation_fn (torch.nn.Module, optional):\n                
Activation function for outputs.\n            cumulate_att_w (bool, optional):\n                Whether to cumulate previous attention weight.\n            use_batch_norm (bool, optional): Whether to use batch normalization.\n            use_concate (bool, optional): Whether to concatenate encoder embedding\n                with decoder lstm outputs.\n            dropout_rate (float, optional): Dropout rate.\n            zoneout_rate (float, optional): Zoneout rate.\n            reduction_factor (int, optional): Reduction factor.\n\n        \"\"\"\n        super(Decoder, self).__init__()\n\n        # store the hyperparameters\n        self.idim = idim\n        self.odim = odim\n        self.att = att\n        self.output_activation_fn = output_activation_fn\n        self.cumulate_att_w = cumulate_att_w\n        self.use_concate = use_concate\n        self.reduction_factor = reduction_factor\n\n        # check attention type\n        if isinstance(self.att, AttForwardTA):\n            self.use_att_extra_inputs = True\n        else:\n            self.use_att_extra_inputs = False\n\n        # define lstm network\n        prenet_units = prenet_units if prenet_layers != 0 else odim\n        self.lstm = torch.nn.ModuleList()\n        for layer in six.moves.range(dlayers):\n            iunits = idim + prenet_units if layer == 0 else dunits\n            lstm = torch.nn.LSTMCell(iunits, dunits)\n            if zoneout_rate > 0.0:\n                lstm = ZoneOutCell(lstm, zoneout_rate)\n            self.lstm += [lstm]\n\n        # define prenet\n        if prenet_layers > 0:\n            self.prenet = Prenet(\n                idim=odim,\n                n_layers=prenet_layers,\n                n_units=prenet_units,\n                dropout_rate=dropout_rate,\n            )\n        else:\n            self.prenet = None\n\n        # define postnet\n        if postnet_layers > 0:\n            self.postnet = Postnet(\n                idim=idim,\n                odim=odim,\n     
           n_layers=postnet_layers,\n                n_chans=postnet_chans,\n                n_filts=postnet_filts,\n                use_batch_norm=use_batch_norm,\n                dropout_rate=dropout_rate,\n            )\n        else:\n            self.postnet = None\n\n        # define projection layers\n        iunits = idim + dunits if use_concate else dunits\n        self.feat_out = torch.nn.Linear(iunits, odim * reduction_factor, bias=False)\n        self.prob_out = torch.nn.Linear(iunits, reduction_factor)\n\n        # initialize\n        self.apply(decoder_init)\n\n    def _zero_state(self, hs):\n        init_hs = hs.new_zeros(hs.size(0), self.lstm[0].hidden_size)\n        return init_hs\n\n    def forward(self, hs, hlens, ys):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            hs (Tensor): Batch of the sequences of padded hidden states (B, Tmax, idim).\n            hlens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor):\n                Batch of the sequences of padded target features (B, Lmax, odim).\n\n        Returns:\n            Tensor: Batch of output tensors after postnet (B, Lmax, odim).\n            Tensor: Batch of output tensors before postnet (B, Lmax, odim).\n            Tensor: Batch of logits of stop prediction (B, Lmax).\n            Tensor: Batch of attention weights (B, Lmax, Tmax).\n\n        Note:\n            This computation is performed in teacher-forcing manner.\n\n        \"\"\"\n        # thin out frames (B, Lmax, odim) ->  (B, Lmax/r, odim)\n        if self.reduction_factor > 1:\n            ys = ys[:, self.reduction_factor - 1 :: self.reduction_factor]\n\n        # length list should be list of int\n        hlens = list(map(int, hlens))\n\n        # initialize hidden states of decoder\n        c_list = [self._zero_state(hs)]\n        z_list = [self._zero_state(hs)]\n        for _ in six.moves.range(1, len(self.lstm)):\n            c_list += [self._zero_state(hs)]\n      
      z_list += [self._zero_state(hs)]\n        prev_out = hs.new_zeros(hs.size(0), self.odim)\n\n        # initialize attention\n        prev_att_w = None\n        self.att.reset()\n\n        # loop for an output sequence\n        outs, logits, att_ws = [], [], []\n        for y in ys.transpose(0, 1):\n            if self.use_att_extra_inputs:\n                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w, prev_out)\n            else:\n                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w)\n            prenet_out = self.prenet(prev_out) if self.prenet is not None else prev_out\n            xs = torch.cat([att_c, prenet_out], dim=1)\n            z_list[0], c_list[0] = self.lstm[0](xs, (z_list[0], c_list[0]))\n            for i in six.moves.range(1, len(self.lstm)):\n                z_list[i], c_list[i] = self.lstm[i](\n                    z_list[i - 1], (z_list[i], c_list[i])\n                )\n            zcs = (\n                torch.cat([z_list[-1], att_c], dim=1)\n                if self.use_concate\n                else z_list[-1]\n            )\n            outs += [self.feat_out(zcs).view(hs.size(0), self.odim, -1)]\n            logits += [self.prob_out(zcs)]\n            att_ws += [att_w]\n            prev_out = y  # teacher forcing\n            if self.cumulate_att_w and prev_att_w is not None:\n                prev_att_w = prev_att_w + att_w  # Note: error when use +=\n            else:\n                prev_att_w = att_w\n\n        logits = torch.cat(logits, dim=1)  # (B, Lmax)\n        before_outs = torch.cat(outs, dim=2)  # (B, odim, Lmax)\n        att_ws = torch.stack(att_ws, dim=1)  # (B, Lmax, Tmax)\n\n        if self.reduction_factor > 1:\n            before_outs = before_outs.view(\n                before_outs.size(0), self.odim, -1\n            )  # (B, odim, Lmax)\n\n        if self.postnet is not None:\n            after_outs = before_outs + self.postnet(before_outs)  # (B, odim, Lmax)\n        else:\n            
after_outs = before_outs\n        before_outs = before_outs.transpose(2, 1)  # (B, Lmax, odim)\n        after_outs = after_outs.transpose(2, 1)  # (B, Lmax, odim)\n\n        # apply activation function for scaling\n        if self.output_activation_fn is not None:\n            before_outs = self.output_activation_fn(before_outs)\n            after_outs = self.output_activation_fn(after_outs)\n\n        return after_outs, before_outs, logits, att_ws\n\n    def inference(\n        self,\n        h,\n        threshold=0.5,\n        minlenratio=0.0,\n        maxlenratio=10.0,\n        use_att_constraint=False,\n        backward_window=None,\n        forward_window=None,\n    ):\n        \"\"\"Generate the sequence of features given the sequences of characters.\n\n        Args:\n            h (Tensor): Input sequence of encoder hidden states (T, C).\n            threshold (float, optional): Threshold to stop generation.\n            minlenratio (float, optional): Minimum length ratio.\n                If set to 1.0 and the length of input is 10,\n                the minimum length of outputs will be 10 * 1 = 10.\n            maxlenratio (float, optional): Maximum length ratio.\n                If set to 10 and the length of input is 10,\n                the maximum length of outputs will be 10 * 10 = 100.\n            use_att_constraint (bool):\n                Whether to apply attention constraint introduced in `Deep Voice 3`_.\n            backward_window (int): Backward window size in attention constraint.\n            forward_window (int): Forward window size in attention constraint.\n\n        Returns:\n            Tensor: Output sequence of features (L, odim).\n            Tensor: Output sequence of stop probabilities (L,).\n            Tensor: Attention weights (L, T).\n\n        Note:\n            This computation is performed in auto-regressive manner.\n\n        .. 
_`Deep Voice 3`: https://arxiv.org/abs/1710.07654\n\n        \"\"\"\n        # setup\n        assert len(h.size()) == 2\n        hs = h.unsqueeze(0)\n        ilens = [h.size(0)]\n        maxlen = int(h.size(0) * maxlenratio)\n        minlen = int(h.size(0) * minlenratio)\n\n        # initialize hidden states of decoder\n        c_list = [self._zero_state(hs)]\n        z_list = [self._zero_state(hs)]\n        for _ in six.moves.range(1, len(self.lstm)):\n            c_list += [self._zero_state(hs)]\n            z_list += [self._zero_state(hs)]\n        prev_out = hs.new_zeros(1, self.odim)\n\n        # initialize attention\n        prev_att_w = None\n        self.att.reset()\n\n        # setup for attention constraint\n        if use_att_constraint:\n            last_attended_idx = 0\n        else:\n            last_attended_idx = None\n\n        # loop for an output sequence\n        idx = 0\n        outs, att_ws, probs = [], [], []\n        while True:\n            # updated index\n            idx += self.reduction_factor\n\n            # decoder calculation\n            if self.use_att_extra_inputs:\n                att_c, att_w = self.att(\n                    hs,\n                    ilens,\n                    z_list[0],\n                    prev_att_w,\n                    prev_out,\n                    last_attended_idx=last_attended_idx,\n                    backward_window=backward_window,\n                    forward_window=forward_window,\n                )\n            else:\n                att_c, att_w = self.att(\n                    hs,\n                    ilens,\n                    z_list[0],\n                    prev_att_w,\n                    last_attended_idx=last_attended_idx,\n                    backward_window=backward_window,\n                    forward_window=forward_window,\n                )\n\n            att_ws += [att_w]\n            prenet_out = self.prenet(prev_out) if self.prenet is not None else prev_out\n            xs = 
torch.cat([att_c, prenet_out], dim=1)\n            z_list[0], c_list[0] = self.lstm[0](xs, (z_list[0], c_list[0]))\n            for i in six.moves.range(1, len(self.lstm)):\n                z_list[i], c_list[i] = self.lstm[i](\n                    z_list[i - 1], (z_list[i], c_list[i])\n                )\n            zcs = (\n                torch.cat([z_list[-1], att_c], dim=1)\n                if self.use_concate\n                else z_list[-1]\n            )\n            outs += [self.feat_out(zcs).view(1, self.odim, -1)]  # [(1, odim, r), ...]\n            probs += [torch.sigmoid(self.prob_out(zcs))[0]]  # [(r), ...]\n            if self.output_activation_fn is not None:\n                prev_out = self.output_activation_fn(outs[-1][:, :, -1])  # (1, odim)\n            else:\n                prev_out = outs[-1][:, :, -1]  # (1, odim)\n            if self.cumulate_att_w and prev_att_w is not None:\n                prev_att_w = prev_att_w + att_w  # Note: error when use +=\n            else:\n                prev_att_w = att_w\n            if use_att_constraint:\n                last_attended_idx = int(att_w.argmax())\n\n            # check whether to finish generation\n            if int(sum(probs[-1] >= threshold)) > 0 or idx >= maxlen:\n                # check mininum length\n                if idx < minlen:\n                    continue\n                outs = torch.cat(outs, dim=2)  # (1, odim, L)\n                if self.postnet is not None:\n                    outs = outs + self.postnet(outs)  # (1, odim, L)\n                outs = outs.transpose(2, 1).squeeze(0)  # (L, odim)\n                probs = torch.cat(probs, dim=0)\n                att_ws = torch.cat(att_ws, dim=0)\n                break\n\n        if self.output_activation_fn is not None:\n            outs = self.output_activation_fn(outs)\n\n        return outs, probs, att_ws\n\n    def calculate_all_attentions(self, hs, hlens, ys):\n        \"\"\"Calculate all of the attention weights.\n\n     
   Args:\n            hs (Tensor): Batch of the sequences of padded hidden states (B, Tmax, idim).\n            hlens (LongTensor): Batch of lengths of each input batch (B,).\n            ys (Tensor):\n                Batch of the sequences of padded target features (B, Lmax, odim).\n\n        Returns:\n            numpy.ndarray: Batch of attention weights (B, Lmax, Tmax).\n\n        Note:\n            This computation is performed in teacher-forcing manner.\n\n        \"\"\"\n        # thin out frames (B, Lmax, odim) ->  (B, Lmax/r, odim)\n        if self.reduction_factor > 1:\n            ys = ys[:, self.reduction_factor - 1 :: self.reduction_factor]\n\n        # length list should be list of int\n        hlens = list(map(int, hlens))\n\n        # initialize hidden states of decoder\n        c_list = [self._zero_state(hs)]\n        z_list = [self._zero_state(hs)]\n        for _ in six.moves.range(1, len(self.lstm)):\n            c_list += [self._zero_state(hs)]\n            z_list += [self._zero_state(hs)]\n        prev_out = hs.new_zeros(hs.size(0), self.odim)\n\n        # initialize attention\n        prev_att_w = None\n        self.att.reset()\n\n        # loop for an output sequence\n        att_ws = []\n        for y in ys.transpose(0, 1):\n            if self.use_att_extra_inputs:\n                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w, prev_out)\n            else:\n                att_c, att_w = self.att(hs, hlens, z_list[0], prev_att_w)\n            att_ws += [att_w]\n            prenet_out = self.prenet(prev_out) if self.prenet is not None else prev_out\n            xs = torch.cat([att_c, prenet_out], dim=1)\n            z_list[0], c_list[0] = self.lstm[0](xs, (z_list[0], c_list[0]))\n            for i in six.moves.range(1, len(self.lstm)):\n                z_list[i], c_list[i] = self.lstm[i](\n                    z_list[i - 1], (z_list[i], c_list[i])\n                )\n            prev_out = y  # teacher forcing\n            if 
self.cumulate_att_w and prev_att_w is not None:\n                prev_att_w = prev_att_w + att_w  # Note: error when use +=\n            else:\n                prev_att_w = att_w\n\n        att_ws = torch.stack(att_ws, dim=1)  # (B, Lmax, Tmax)\n\n        return att_ws\n"
  },
  {
    "path": "nets/pytorch_backend/tacotron2/encoder.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Tacotron2 encoder related modules.\"\"\"\n\nimport six\n\nimport torch\n\nfrom torch.nn.utils.rnn import pack_padded_sequence\nfrom torch.nn.utils.rnn import pad_packed_sequence\n\n\ndef encoder_init(m):\n    \"\"\"Initialize encoder parameters.\"\"\"\n    if isinstance(m, torch.nn.Conv1d):\n        torch.nn.init.xavier_uniform_(m.weight, torch.nn.init.calculate_gain(\"relu\"))\n\n\nclass Encoder(torch.nn.Module):\n    \"\"\"Encoder module of Spectrogram prediction network.\n\n    This is a module of encoder of Spectrogram prediction network in Tacotron2,\n    which described in `Natural TTS Synthesis by Conditioning WaveNet on Mel\n    Spectrogram Predictions`_. This is the encoder which converts either a sequence\n    of characters or acoustic features into the sequence of hidden states.\n\n    .. _`Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions`:\n       https://arxiv.org/abs/1712.05884\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        input_layer=\"embed\",\n        embed_dim=512,\n        elayers=1,\n        eunits=512,\n        econv_layers=3,\n        econv_chans=512,\n        econv_filts=5,\n        use_batch_norm=True,\n        use_residual=False,\n        dropout_rate=0.5,\n        padding_idx=0,\n    ):\n        \"\"\"Initialize Tacotron2 encoder module.\n\n        Args:\n            idim (int) Dimension of the inputs.\n            input_layer (str): Input layer type.\n            embed_dim (int, optional) Dimension of character embedding.\n            elayers (int, optional) The number of encoder blstm layers.\n            eunits (int, optional) The number of encoder blstm units.\n            econv_layers (int, optional) The number of encoder conv layers.\n            econv_filts (int, optional) The number of 
encoder conv filter size.\n            econv_chans (int, optional) The number of encoder conv filter channels.\n            use_batch_norm (bool, optional) Whether to use batch normalization.\n            use_residual (bool, optional) Whether to use residual connection.\n            dropout_rate (float, optional) Dropout rate.\n\n        \"\"\"\n        super(Encoder, self).__init__()\n        # store the hyperparameters\n        self.idim = idim\n        self.use_residual = use_residual\n\n        # define network layer modules\n        if input_layer == \"linear\":\n            self.embed = torch.nn.Linear(idim, econv_chans)\n        elif input_layer == \"embed\":\n            self.embed = torch.nn.Embedding(idim, embed_dim, padding_idx=padding_idx)\n        else:\n            raise ValueError(\"unknown input_layer: \" + input_layer)\n\n        if econv_layers > 0:\n            self.convs = torch.nn.ModuleList()\n            for layer in six.moves.range(econv_layers):\n                ichans = (\n                    embed_dim if layer == 0 and input_layer == \"embed\" else econv_chans\n                )\n                if use_batch_norm:\n                    self.convs += [\n                        torch.nn.Sequential(\n                            torch.nn.Conv1d(\n                                ichans,\n                                econv_chans,\n                                econv_filts,\n                                stride=1,\n                                padding=(econv_filts - 1) // 2,\n                                bias=False,\n                            ),\n                            torch.nn.BatchNorm1d(econv_chans),\n                            torch.nn.ReLU(),\n                            torch.nn.Dropout(dropout_rate),\n                        )\n                    ]\n                else:\n                    self.convs += [\n                        torch.nn.Sequential(\n                            torch.nn.Conv1d(\n                    
            ichans,\n                                econv_chans,\n                                econv_filts,\n                                stride=1,\n                                padding=(econv_filts - 1) // 2,\n                                bias=False,\n                            ),\n                            torch.nn.ReLU(),\n                            torch.nn.Dropout(dropout_rate),\n                        )\n                    ]\n        else:\n            self.convs = None\n        if elayers > 0:\n            iunits = econv_chans if econv_layers != 0 else embed_dim\n            self.blstm = torch.nn.LSTM(\n                iunits, eunits // 2, elayers, batch_first=True, bidirectional=True\n            )\n        else:\n            self.blstm = None\n\n        # initialize\n        self.apply(encoder_init)\n\n    def forward(self, xs, ilens=None):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            xs (Tensor): Batch of the padded sequence. Either character ids (B, Tmax)\n                or acoustic feature (B, Tmax, idim * encoder_reduction_factor). 
Padded\n                value should be 0.\n            ilens (LongTensor): Batch of lengths of each input batch (B,).\n\n        Returns:\n            Tensor: Batch of the sequences of encoder states (B, Tmax, eunits).\n            LongTensor: Batch of lengths of each sequence (B,).\n\n        Note:\n            If no BLSTM layers are configured (elayers == 0), only the tensor of\n            states is returned, without the lengths.\n\n        \"\"\"\n        xs = self.embed(xs).transpose(1, 2)\n        if self.convs is not None:\n            for i in six.moves.range(len(self.convs)):\n                if self.use_residual:\n                    xs += self.convs[i](xs)\n                else:\n                    xs = self.convs[i](xs)\n        if self.blstm is None:\n            return xs.transpose(1, 2)\n        if not isinstance(ilens, torch.Tensor):\n            ilens = torch.tensor(ilens)\n        xs = pack_padded_sequence(xs.transpose(1, 2), ilens.cpu(), batch_first=True)\n        self.blstm.flatten_parameters()\n        xs, _ = self.blstm(xs)  # (B, Tmax, C)\n        xs, hlens = pad_packed_sequence(xs, batch_first=True)\n\n        return xs, hlens\n\n    def inference(self, x):\n        \"\"\"Inference.\n\n        Args:\n            x (Tensor): The sequence of character ids (T,)\n                    or acoustic feature (T, idim * encoder_reduction_factor).\n\n        Returns:\n            Tensor: The sequences of encoder states (T, eunits).\n\n        \"\"\"\n        xs = x.unsqueeze(0)\n        ilens = torch.tensor([x.size(0)])\n\n        return self.forward(xs, ilens)[0][0]\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/arguments.py",
    "content": "\"\"\"Transducer model arguments.\"\"\"\n\nimport ast\nfrom distutils.util import strtobool\n\n\ndef add_encoder_general_arguments(group):\n    \"\"\"Define general arguments for encoder.\"\"\"\n    group.add_argument(\n        \"--etype\",\n        default=\"blstmp\",\n        type=str,\n        choices=[\n            \"custom\",\n            \"lstm\",\n            \"blstm\",\n            \"lstmp\",\n            \"blstmp\",\n            \"vgglstmp\",\n            \"vggblstmp\",\n            \"vgglstm\",\n            \"vggblstm\",\n            \"gru\",\n            \"bgru\",\n            \"grup\",\n            \"bgrup\",\n            \"vgggrup\",\n            \"vggbgrup\",\n            \"vgggru\",\n            \"vggbgru\",\n        ],\n        help=\"Type of encoder network architecture\",\n    )\n    group.add_argument(\n        \"--dropout-rate\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for the encoder\",\n    )\n\n    return group\n\n\ndef add_rnn_encoder_arguments(group):\n    \"\"\"Define arguments for RNN encoder.\"\"\"\n    group.add_argument(\n        \"--elayers\",\n        default=4,\n        type=int,\n        help=\"Number of encoder layers (for shared recognition part \"\n        \"in multi-speaker asr mode)\",\n    )\n    group.add_argument(\n        \"--eunits\",\n        \"-u\",\n        default=300,\n        type=int,\n        help=\"Number of encoder hidden units\",\n    )\n    group.add_argument(\n        \"--eprojs\", default=320, type=int, help=\"Number of encoder projection units\"\n    )\n    group.add_argument(\n        \"--subsample\",\n        default=\"1\",\n        type=str,\n        help=\"Subsample input frames x_y_z means subsample every x frame \"\n        \"at 1st layer, every y frame at 2nd layer etc.\",\n    )\n\n    return group\n\n\ndef add_custom_encoder_arguments(group):\n    \"\"\"Define arguments for Custom encoder.\"\"\"\n    group.add_argument(\n        
\"--enc-block-arch\",\n        type=eval,\n        action=\"append\",\n        default=None,\n        help=\"Encoder architecture definition by blocks\",\n    )\n    group.add_argument(\n        \"--enc-block-repeat\",\n        default=0,\n        type=int,\n        help=\"Repeat N times the provided encoder blocks if N > 1\",\n    )\n    group.add_argument(\n        \"--custom-enc-input-layer\",\n        type=str,\n        default=\"conv2d\",\n        choices=[\"conv2d\", \"vgg2l\", \"linear\", \"embed\", \"null\"],\n        help=\"Custom encoder input layer type\",\n    )\n    group.add_argument(\n        \"--custom-enc-positional-encoding-type\",\n        type=str,\n        default=\"abs_pos\",\n        choices=[\"abs_pos\", \"scaled_abs_pos\", \"rel_pos\"],\n        help=\"Custom encoder positional encoding layer type\",\n    )\n    group.add_argument(\n        \"--custom-enc-self-attn-type\",\n        type=str,\n        default=\"self_attn\",\n        choices=[\"self_attn\", \"rel_self_attn\"],\n        help=\"Custom encoder self-attention type\",\n    )\n    group.add_argument(\n        \"--custom-enc-pw-activation-type\",\n        type=str,\n        default=\"relu\",\n        choices=[\"relu\", \"hardtanh\", \"selu\", \"swish\"],\n        help=\"Custom encoder pointwise activation type\",\n    )\n    group.add_argument(\n        \"--custom-enc-conv-mod-activation-type\",\n        type=str,\n        default=\"swish\",\n        choices=[\"relu\", \"hardtanh\", \"selu\", \"swish\"],\n        help=\"Custom encoder convolutional module activation type\",\n    )\n\n    return group\n\n\ndef add_decoder_general_arguments(group):\n    \"\"\"Define general arguments for encoder.\"\"\"\n    group.add_argument(\n        \"--dtype\",\n        default=\"lstm\",\n        type=str,\n        choices=[\"lstm\", \"gru\", \"custom\"],\n        help=\"Type of decoder to use\",\n    )\n    group.add_argument(\n        \"--dropout-rate-decoder\",\n        default=0.0,\n        
type=float,\n        help=\"Dropout rate for the decoder\",\n    )\n    group.add_argument(\n        \"--dropout-rate-embed-decoder\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for the decoder embedding layer\",\n    )\n\n    return group\n\n\ndef add_rnn_decoder_arguments(group):\n    \"\"\"Define arguments for RNN decoder.\"\"\"\n    group.add_argument(\n        \"--dec-embed-dim\",\n        default=320,\n        type=int,\n        help=\"Number of decoder embeddings dimensions\",\n    )\n    group.add_argument(\n        \"--dlayers\", default=1, type=int, help=\"Number of decoder layers\"\n    )\n    group.add_argument(\n        \"--dunits\", default=320, type=int, help=\"Number of decoder hidden units\"\n    )\n\n    return group\n\n\ndef add_custom_decoder_arguments(group):\n    \"\"\"Define arguments for Custom decoder.\"\"\"\n    group.add_argument(\n        \"--dec-block-arch\",\n        type=eval,\n        action=\"append\",\n        default=None,\n        help=\"Custom decoder blocks definition\",\n    )\n    group.add_argument(\n        \"--dec-block-repeat\",\n        default=1,\n        type=int,\n        help=\"Repeat N times the provided decoder blocks if N > 1\",\n    )\n    group.add_argument(\n        \"--custom-dec-input-layer\",\n        type=str,\n        default=\"embed\",\n        choices=[\"linear\", \"embed\"],\n        help=\"Custom decoder input layer type\",\n    )\n    group.add_argument(\n        \"--custom-dec-pw-activation-type\",\n        type=str,\n        default=\"relu\",\n        choices=[\"relu\", \"hardtanh\", \"selu\", \"swish\"],\n        help=\"Custom decoder pointwise activation type\",\n    )\n\n    return group\n\n\ndef add_custom_training_arguments(group):\n    \"\"\"Define arguments for training with Custom architecture.\"\"\"\n    group.add_argument(\n        \"--transformer-warmup-steps\",\n        default=25000,\n        type=int,\n        help=\"Optimizer warmup steps\",\n    )\n    
group.add_argument(\n        \"--transformer-lr\",\n        default=10.0,\n        type=float,\n        help=\"Initial value of learning rate\",\n    )\n\n    return group\n\n\ndef add_transducer_arguments(group):\n    \"\"\"Define general arguments for transducer model.\"\"\"\n    group.add_argument(\n        \"--trans-type\",\n        default=\"warp-transducer\",\n        type=str,\n        choices=[\"warp-transducer\", \"warp-rnnt\"],\n        help=\"Type of transducer implementation to calculate loss.\",\n    )\n    group.add_argument(\n        \"--transducer-weight\",\n        default=1.0,\n        type=float,\n        help=\"Weight of transducer loss when auxiliary task is used.\",\n    )\n    group.add_argument(\n        \"--joint-dim\",\n        default=320,\n        type=int,\n        help=\"Number of dimensions in joint space\",\n    )\n    group.add_argument(\n        \"--joint-activation-type\",\n        type=str,\n        default=\"tanh\",\n        choices=[\"relu\", \"tanh\", \"swish\"],\n        help=\"Joint network activation type\",\n    )\n    group.add_argument(\n        \"--score-norm\",\n        type=strtobool,\n        nargs=\"?\",\n        default=True,\n        help=\"Normalize transducer scores by length\",\n    )\n\n    return group\n\n\ndef add_auxiliary_task_arguments(group):\n    \"\"\"Add arguments for auxiliary task.\"\"\"\n    group.add_argument(\n        \"--aux-task-type\",\n        nargs=\"?\",\n        default=None,\n        choices=[\"default\", \"symm_kl_div\", \"both\"],\n        help=\"Type of auxiliary task.\",\n    )\n    group.add_argument(\n        \"--aux-task-layer-list\",\n        default=None,\n        type=ast.literal_eval,\n        help=\"List of layers to use for auxiliary task.\",\n    )\n    group.add_argument(\n        \"--aux-task-weight\",\n        default=0.3,\n        type=float,\n        help=\"Weight of auxiliary task loss.\",\n    )\n    group.add_argument(\n        \"--aux-ctc\",\n        
type=strtobool,\n        nargs=\"?\",\n        default=False,\n        help=\"Whether to use CTC as auxiliary task.\",\n    )\n    group.add_argument(\n        \"--aux-ctc-weight\",\n        default=1.0,\n        type=float,\n        help=\"Weight of auxiliary CTC loss\",\n    )\n    group.add_argument(\n        \"--aux-ctc-dropout-rate\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for auxiliary CTC\",\n    )\n    group.add_argument(\n        \"--aux-cross-entropy\",\n        type=strtobool,\n        nargs=\"?\",\n        default=False,\n        help=\"Whether to use CE as auxiliary task for the prediction network.\",\n    )\n    group.add_argument(\n        \"--aux-cross-entropy-smoothing\",\n        default=0.0,\n        type=float,\n        help=\"Smoothing rate for cross-entropy. If > 0, enables label smoothing loss.\",\n    )\n    group.add_argument(\n        \"--aux-cross-entropy-weight\",\n        default=0.5,\n        type=float,\n        help=\"Weight of auxiliary cross-entropy loss\",\n    )\n    group.add_argument(\n        \"--aux-mmi\",\n        type=strtobool,\n        nargs=\"?\",\n        default=False,\n        help=\"Whether to use MMI as auxiliary task.\",\n    )\n    group.add_argument(\n        \"--aux-mmi-weight\",\n        default=0.5,\n        type=float,\n        help=\"Weight of auxiliary MMI loss\",\n    )\n    group.add_argument(\n        \"--aux-mmi-dropout-rate\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for auxiliary MMI\",\n    )\n    group.add_argument(\n        \"--aux-mmi-type\",\n        type=str,\n        choices=['mmi', 'phonectc'],\n        default='mmi',\n        help=\"Auxiliary criterion type: 'mmi' (LF-MMI) or 'phonectc' (CTC)\",\n    )\n    group.add_argument(\n        \"--aux-mbr\",\n        type=strtobool,\n        nargs=\"?\",\n        default=False,\n        help=\"Whether to use MBR as auxiliary task.\",\n    )\n    group.add_argument(\n        \"--aux-mbr-weight\",\n        default=1.0,\n        type=float,\n   
     help=\"Weight of auxiliary mbr loss\",\n    )\n    group.add_argument(\n        \"--aux-mbr-beam\",\n        default=2,\n        type=int,\n        help=\"Number of hypotheses for MBR loss computation\",\n    )\n\n    return group\n\n\ndef add_att_scorer_arguments(group):\n    \"\"\"Define arguments for the attention scorer.\n\n    The arguments are mainly copied from\n    espnet.nets.pytorch_backend.transformer.argument; only those for the\n    attention decoder / rescorer are kept. All of them carry the prefix 'att',\n    which marks them as specific to the RNN-T attention scorer.\n    \"\"\"\n    group.add_argument(\n        \"--att-scorer-weight\",\n        default=0.0,\n        type=float,\n        help=\"weight of attention scorer loss\",\n    )\n    group.add_argument(\n        \"--att-decoder-selfattn-layer-type\",\n        type=str,\n        default=\"selfattn\",\n        choices=[\n            \"selfattn\",\n            \"lightconv\",\n            \"lightconv2d\",\n            \"dynamicconv\",\n            \"dynamicconv2d\",\n            \"light-dynamicconv2d\",\n        ],\n        help=\"transformer decoder self-attention layer type\",\n    )\n    group.add_argument(\n        \"--att-adim\",\n        default=320,\n        type=int,\n        help=\"Number of attention transformation dimensions\",\n    )\n    group.add_argument(\n        \"--att-aheads\",\n        default=4,\n        type=int,\n        help=\"Number of heads for multi head attention\",\n    )\n    group.add_argument(\n        \"--att-wshare\",\n        default=4,\n        type=int,\n        help=\"Number of parameter sharing for lightweight convolution\",\n    )\n    group.add_argument(\n        \"--att-ldconv-decoder-kernel-length\",\n        default=\"11_13_15_17_19_21\",\n        type=str,\n        help=\"kernel size for lightweight/dynamic convolution: \"\n        'Decoder side. 
For example, \"21_23_25\" means kernel length 21 for '\n        \"First layer, 23 for Second layer and so on.\",\n    )\n    group.add_argument(\n        \"--att-ldconv-usebias\",\n        type=strtobool,\n        default=False,\n        help=\"use bias term in lightweight/dynamic convolution\",\n    )\n    group.add_argument(\n        \"--att-dlayers\", default=1, type=int, help=\"Number of decoder layers\"\n    )\n    group.add_argument(\n        \"--att-dunits\", default=320, type=int, help=\"Number of decoder hidden units\"\n    )\n    group.add_argument(\n        \"--att-attn-dropout-rate\",\n        default=None,\n        type=float,\n        help=\"dropout in transformer attention. use --dropout-rate if None is set\",\n    )\n    group.add_argument(\n        \"--att-dropout-rate\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for the encoder\",\n    )\n    group.add_argument(\n        \"--att-length-normalized-loss\",\n        default=True,\n        type=strtobool,\n        help=\"normalize loss by length\",\n    )\n    return group\n\n\ndef add_transducer_code_switch_arguments(group):\n    \"\"\"Define general arguments for transducer model.\"\"\"\n    group.add_argument(\n        \"--cs-is-pretrain\",\n        default=False,\n        type=strtobool,\n        help=\"If true, ignore decoder loss\",\n    )\n    group.add_argument(\n        \"--cs-share-encoder\",\n        default=False,\n        type=strtobool,\n        help=\"If true, use a shared encoder before the language-specific encoder\",\n    )\n    group.add_argument(\n        \"--cs-share-encoder-layers\",\n        default=9,\n        type=int,\n        help=\"If true, number of layers in shared encoder\",\n    )\n    group.add_argument(\n        \"--cs-chn-start\",\n        default=5,\n        type=int,\n        help=\"start index of chn symbols in dict\",\n    )\n    group.add_argument(\n        \"--cs-eng-start\",\n        default=4302,\n        type=int,\n        
help=\"start index of eng symbols in dict\",\n    )\n    group.add_argument(\n        \"--cs-use-adversial-examples\",\n        default=False,\n        type=strtobool,\n        help=\"If true, mask symbols not from this language\",\n    )\n    group.add_argument(\n        \"--cs-is-ctc-decoder\",\n        default=False,\n        type=strtobool,\n        help=\"If true, the fine tuning system is on CTC rather than RNNT\",\n    )\n    group.add_argument(\n        \"--cs-use-mask-predictor\",\n        default=False,\n        type=strtobool,\n        help=\"If true, use a mask-filter process in combine function\",\n    )\n    group.add_argument(\n        \"--cs-lang-weight\",\n        default=0.0,\n        type=float,\n        help=\"weight of language classification\",\n    )\n    group.add_argument(\n        \"--cs-decoder-expert\",\n        default=False,\n        type=strtobool,\n        help=\"If true, use decoder expert\",\n    )\n    return group\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/auxiliary_task.py",
    "content": "\"\"\"Auxiliary task implementation for transducer models.\"\"\"\n\nfrom itertools import chain\nfrom typing import List\nfrom typing import Tuple\nfrom typing import Union\n\nimport torch\nimport torch.nn.functional as F\n\nfrom espnet.nets.transducer_decoder_interface import TransducerDecoderInterface\n\n\nclass AuxiliaryTask(torch.nn.Module):\n    \"\"\"Auxiliary task module.\"\"\"\n\n    def __init__(\n        self,\n        decoder: Union[torch.nn.Module, TransducerDecoderInterface],\n        joint_network: torch.nn.Module,\n        rnnt_criterion: torch.nn.Module,\n        aux_task_type: str,\n        aux_task_weight: int,\n        encoder_out_dim: int,\n        joint_dim: int,\n    ):\n        \"\"\"Auxiliary task initialization.\n\n        Args:\n            decoder: Decoder module\n            joint_network: Joint network module\n            aux_task_type: Auxiliary task type\n            aux_task_weight: Auxiliary task weight\n            encoder_out: Encoder output dimension\n            joint_dim: Joint space dimension\n\n        \"\"\"\n        super().__init__()\n\n        self.rnnt_criterion = rnnt_criterion\n\n        self.mlp_net = torch.nn.Sequential(\n            torch.nn.Linear(encoder_out_dim, joint_dim),\n            torch.nn.ReLU(),\n            torch.nn.Linear(joint_dim, joint_dim),\n        )\n\n        self.decoder = decoder\n        self.joint_network = joint_network\n\n        self.aux_task_type = aux_task_type\n        self.aux_task_weight = aux_task_weight\n\n    def forward(\n        self,\n        enc_out_aux: List,\n        dec_out: torch.Tensor,\n        main_joint: torch.Tensor,\n        target: torch.Tensor,\n        pred_len: torch.Tensor,\n        target_len: torch.Tensor,\n    ) -> Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"Forward auxiliary task.\n\n        Args:\n            enc_out_aux: List of encoder intermediate outputs\n            dec_out: Decoder outputs\n            main_joint: Joint output for 
main task\n            target: Target labels\n            pred_len: Prediction lengths\n            target_len: Target lengths\n\n        Returns:\n            : (Weighted auxiliary transducer loss, Weighted auxiliary symmetric KL loss)\n\n        \"\"\"\n        aux_trans = 0\n        aux_symm_kl = 0\n\n        for p in chain(self.decoder.parameters(), self.joint_network.parameters()):\n            p.requires_grad = False\n\n        for i, enc_aux in enumerate(enc_out_aux):\n            aux_mlp = self.mlp_net(enc_aux)\n\n            aux_joint = self.joint_network(\n                aux_mlp.unsqueeze(2),\n                dec_out.unsqueeze(1),\n                is_aux=True,\n            )\n\n            if self.aux_task_type != \"symm_kl_div\":\n                aux_trans += self.rnnt_criterion(\n                    aux_joint,\n                    target,\n                    pred_len,\n                    target_len,\n                )\n\n            if self.aux_task_type != \"default\":\n                aux_symm_kl += F.kl_div(\n                    F.log_softmax(main_joint, dim=-1),\n                    F.softmax(aux_joint, dim=-1),\n                    reduction=\"mean\",\n                ) + F.kl_div(\n                    F.log_softmax(aux_joint, dim=-1),\n                    F.softmax(main_joint, dim=-1),\n                    reduction=\"mean\",\n                )\n\n        for p in chain(self.decoder.parameters(), self.joint_network.parameters()):\n            p.requires_grad = True\n\n        return self.aux_task_weight * aux_trans, self.aux_task_weight * aux_symm_kl\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/blocks.py",
    "content": "\"\"\"Set of methods to create custom architecture.\"\"\"\n\nfrom collections import Counter\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.conformer.convolution import ConvolutionModule\nfrom espnet.nets.pytorch_backend.conformer.encoder_layer import (\n    EncoderLayer as ConformerEncoderLayer,  # noqa: H301\n)\n\nfrom espnet.nets.pytorch_backend.nets_utils import get_activation\n\nfrom espnet.nets.pytorch_backend.transducer.causal_conv1d import CausalConv1d\nfrom espnet.nets.pytorch_backend.transducer.transformer_decoder_layer import (\n    DecoderLayer,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transducer.tdnn import TDNN\nfrom espnet.nets.pytorch_backend.transducer.vgg2l import VGG2L\n\nfrom espnet.nets.pytorch_backend.transformer.attention import (\n    MultiHeadedAttention,  # noqa: H301\n    RelPositionMultiHeadedAttention,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.encoder_layer import EncoderLayer\nfrom espnet.nets.pytorch_backend.transformer.embedding import (\n    PositionalEncoding,  # noqa: H301\n    ScaledPositionalEncoding,  # noqa: H301\n    RelPositionalEncoding,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.positionwise_feed_forward import (\n    PositionwiseFeedForward,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.repeat import MultiSequential\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling\n\n\ndef check_and_prepare(net_part, blocks_arch, input_layer):\n    \"\"\"Check consecutive block shapes match and prepare input parameters.\n\n    Args:\n        net_part (str): either 'encoder' or 'decoder'\n        blocks_arch (list): list of blocks for network part (type and parameters)\n        input_layer (str): input layer type\n\n    Return:\n        input_layer (str): input layer type\n        input_layer_odim (int): output dim of input layer\n        input_dropout_rate (float): dropout rate of input layer\n        
input_pos_dropout_rate (float): dropout rate of input layer positional enc.\n        out_dim (int): output dim of last block\n\n    \"\"\"\n    input_dropout_rate = sorted(\n        Counter(\n            b[\"dropout-rate\"] for b in blocks_arch if \"dropout-rate\" in b\n        ).most_common(),\n        key=lambda x: x[0],\n        reverse=True,\n    )\n\n    input_pos_dropout_rate = sorted(\n        Counter(\n            b[\"pos-dropout-rate\"] for b in blocks_arch if \"pos-dropout-rate\" in b\n        ).most_common(),\n        key=lambda x: x[0],\n        reverse=True,\n    )\n\n    input_dropout_rate = input_dropout_rate[0][0] if input_dropout_rate else 0.0\n    input_pos_dropout_rate = (\n        input_pos_dropout_rate[0][0] if input_pos_dropout_rate else 0.0\n    )\n\n    cmp_io = []\n    has_transformer = False\n    has_conformer = False\n    for i in range(len(blocks_arch)):\n        if \"type\" in blocks_arch[i]:\n            block_type = blocks_arch[i][\"type\"]\n        else:\n            raise ValueError(\"type is not defined in the \" + str(i + 1) + \"th block.\")\n\n        if block_type == \"transformer\":\n            if not {\"d_hidden\", \"d_ff\", \"heads\"}.issubset(blocks_arch[i]):\n                raise ValueError(\n                    \"Block \"\n                    + str(i + 1)\n                    + \" in \"\n                    + net_part\n                    + \": Transformer block format is: {'type: transformer', \"\n                    \"'d_hidden': int, 'd_ff': int, 'heads': int, [...]}\"\n                )\n\n            has_transformer = True\n            cmp_io.append((blocks_arch[i][\"d_hidden\"], blocks_arch[i][\"d_hidden\"]))\n        elif block_type == \"conformer\":\n            if net_part != \"encoder\":\n                raise ValueError(\n                    \"Block \" + str(i + 1) + \": conformer type is only for encoder part.\"\n                )\n\n            if not {\n                \"d_hidden\",\n                
\"d_ff\",\n                \"heads\",\n                \"macaron_style\",\n                \"use_conv_mod\",\n            }.issubset(blocks_arch[i]):\n                raise ValueError(\n                    \"Block \"\n                    + str(i + 1)\n                    + \" in \"\n                    + net_part\n                    + \": Conformer block format is {'type: conformer', \"\n                    \"'d_hidden': int, 'd_ff': int, 'heads': int, \"\n                    \"'macaron_style': bool, 'use_conv_mod': bool, [...]}\"\n                )\n\n            if (\n                blocks_arch[i][\"use_conv_mod\"] is True\n                and \"conv_mod_kernel\" not in blocks_arch[i]\n            ):\n                raise ValueError(\n                    \"Block \"\n                    + str(i + 1)\n                    + \": 'use_conv_mod' is True but 'use_conv_kernel' is not specified\"\n                )\n\n            has_conformer = True\n            cmp_io.append((blocks_arch[i][\"d_hidden\"], blocks_arch[i][\"d_hidden\"]))\n        elif block_type == \"causal-conv1d\":\n            if not {\"idim\", \"odim\", \"kernel_size\"}.issubset(blocks_arch[i]):\n                raise ValueError(\n                    \"Block \"\n                    + str(i + 1)\n                    + \" in \"\n                    + net_part\n                    + \": causal conv1d block format is: {'type: causal-conv1d', \"\n                    \"'idim': int, 'odim': int, 'kernel_size': int}\"\n                )\n\n            if i == 0:\n                input_layer = \"c-embed\"\n\n            cmp_io.append((blocks_arch[i][\"idim\"], blocks_arch[i][\"odim\"]))\n        elif block_type == \"tdnn\":\n            if not {\"idim\", \"odim\", \"ctx_size\", \"dilation\", \"stride\"}.issubset(\n                blocks_arch[i]\n            ):\n                raise ValueError(\n                    \"Block \"\n                    + str(i + 1)\n                    + \" in \"\n                
    + net_part\n                    + \": TDNN block format is: {'type: tdnn', \"\n                    \"'idim': int, 'odim': int, 'ctx_size': int, \"\n                    \"'dilation': int, 'stride': int, [...]}\"\n                )\n\n            cmp_io.append((blocks_arch[i][\"idim\"], blocks_arch[i][\"odim\"]))\n        else:\n            raise NotImplementedError(\n                \"Wrong type for block \"\n                + str(i + 1)\n                + \" in \"\n                + net_part\n                + \". Currently supported: \"\n                \"tdnn, causal-conv1d, transformer or conformer\"\n            )\n\n    if has_transformer and has_conformer:\n        raise NotImplementedError(\n            net_part + \": transformer and conformer blocks \"\n            \"can't be defined in the same net part.\"\n        )\n\n    for i in range(1, len(cmp_io)):\n        if cmp_io[(i - 1)][1] != cmp_io[i][0]:\n            raise ValueError(\n                \"Output/Input mismatch between blocks \"\n                + str(i)\n                + \" and \"\n                + str(i + 1)\n                + \" in \"\n                + net_part\n            )\n\n    if blocks_arch[0][\"type\"] in (\"tdnn\", \"causal-conv1d\"):\n        input_layer_odim = blocks_arch[0][\"idim\"]\n    else:\n        input_layer_odim = blocks_arch[0][\"d_hidden\"]\n\n    if blocks_arch[-1][\"type\"] in (\"tdnn\", \"causal-conv1d\"):\n        out_dim = blocks_arch[-1][\"odim\"]\n    else:\n        out_dim = blocks_arch[-1][\"d_hidden\"]\n\n    return (\n        input_layer,\n        input_layer_odim,\n        input_dropout_rate,\n        input_pos_dropout_rate,\n        out_dim,\n    )\n\n\ndef get_pos_enc_and_att_class(net_part, pos_enc_type, self_attn_type):\n    \"\"\"Get positional encoding and self attention module class.\n\n    Args:\n        net_part (str): either 'encoder' or 'decoder'\n        pos_enc_type (str): positional encoding type\n        self_attn_type (str): self-attention 
type\n\n    Return:\n        pos_enc_class (torch.nn.Module): positional encoding class\n        self_attn_class (torch.nn.Module): self-attention class\n\n    \"\"\"\n    if pos_enc_type == \"abs_pos\":\n        pos_enc_class = PositionalEncoding\n    elif pos_enc_type == \"scaled_abs_pos\":\n        pos_enc_class = ScaledPositionalEncoding\n    elif pos_enc_type == \"rel_pos\":\n        if net_part == \"encoder\" and self_attn_type != \"rel_self_attn\":\n            raise ValueError(\"'rel_pos' is only compatible with 'rel_self_attn'\")\n        pos_enc_class = RelPositionalEncoding\n    else:\n        raise NotImplementedError(\n            \"pos_enc_type should be either 'abs_pos', 'scaled_abs_pos' or 'rel_pos'\"\n        )\n\n    if self_attn_type == \"rel_self_attn\":\n        self_attn_class = RelPositionMultiHeadedAttention\n    else:\n        self_attn_class = MultiHeadedAttention\n\n    return pos_enc_class, self_attn_class\n\n\ndef build_input_layer(\n    input_layer,\n    idim,\n    odim,\n    pos_enc_class,\n    dropout_rate_embed,\n    dropout_rate,\n    pos_dropout_rate,\n    padding_idx,\n):\n    \"\"\"Build input layer.\n\n    Args:\n        input_layer (str): input layer type\n        idim (int): input dimension\n        odim (int): output dimension\n        pos_enc_class (class): positional encoding class\n        dropout_rate_embed (float): dropout rate for embedding layer\n        dropout_rate (float): dropout rate for input layer\n        pos_dropout_rate (float): dropout rate for positional encoding\n        padding_idx (int): padding index for embedding input layer (if specified)\n\n    Returns:\n        (torch.nn.*): input layer module\n        subsampling_factor (int): subsampling factor\n\n    \"\"\"\n    if input_layer == \"null\":\n        return None, 1\n    elif pos_enc_class.__name__ == \"RelPositionalEncoding\":\n        pos_enc_class_subsampling = pos_enc_class(odim, pos_dropout_rate)\n    else:\n        pos_enc_class_subsampling = 
None\n\n    if input_layer == \"linear\":\n        return (\n            torch.nn.Sequential(\n                torch.nn.Linear(idim, odim),\n                torch.nn.LayerNorm(odim),\n                torch.nn.Dropout(dropout_rate),\n                torch.nn.ReLU(),\n                pos_enc_class(odim, pos_dropout_rate),\n            ),\n            1,\n        )\n    elif input_layer == \"conv2d\":\n        return Conv2dSubsampling(idim, odim, dropout_rate, pos_enc_class_subsampling), 4\n    elif input_layer == \"vgg2l\":\n        return VGG2L(idim, odim, pos_enc_class_subsampling), 4\n    elif input_layer == \"embed\":\n        return (\n            torch.nn.Sequential(\n                torch.nn.Embedding(idim, odim, padding_idx=padding_idx),\n                pos_enc_class(odim, pos_dropout_rate),\n            ),\n            1,\n        )\n    elif input_layer == \"c-embed\":\n        return (\n            torch.nn.Sequential(\n                torch.nn.Embedding(idim, odim, padding_idx=padding_idx),\n                torch.nn.Dropout(dropout_rate_embed),\n            ),\n            1,\n        )\n    else:\n        raise NotImplementedError(\"Support: linear, conv2d, vgg2l and embed\")\n\n\ndef build_transformer_block(net_part, block_arch, pw_layer_type, pw_activation_type):\n    \"\"\"Build function for transformer block.\n\n    Args:\n        net_part (str): either 'encoder' or 'decoder'\n        block_arch (dict): transformer block parameters\n        pw_layer_type (str): positionwise layer type\n        pw_activation_type (str): positionwise activation type\n\n    Returns:\n        (function): function to create transformer block\n\n    \"\"\"\n    d_hidden = block_arch[\"d_hidden\"]\n    d_ff = block_arch[\"d_ff\"]\n    heads = block_arch[\"heads\"]\n\n    dropout_rate = block_arch[\"dropout-rate\"] if \"dropout-rate\" in block_arch else 0.0\n    pos_dropout_rate = (\n        block_arch[\"pos-dropout-rate\"] if \"pos-dropout-rate\" in block_arch else 0.0\n   
 )\n    att_dropout_rate = (\n        block_arch[\"att-dropout-rate\"] if \"att-dropout-rate\" in block_arch else 0.0\n    )\n\n    if pw_layer_type == \"linear\":\n        pw_layer = PositionwiseFeedForward\n        pw_activation = get_activation(pw_activation_type)\n        pw_layer_args = (d_hidden, d_ff, pos_dropout_rate, pw_activation)\n    else:\n        raise NotImplementedError(\"Transformer block only supports linear yet.\")\n\n    if net_part == \"encoder\":\n        transformer_layer_class = EncoderLayer\n    elif net_part == \"decoder\":\n        transformer_layer_class = DecoderLayer\n\n    return lambda: transformer_layer_class(\n        d_hidden,\n        MultiHeadedAttention(heads, d_hidden, att_dropout_rate),\n        pw_layer(*pw_layer_args),\n        dropout_rate,\n    )\n\n\ndef build_conformer_block(\n    block_arch,\n    self_attn_class,\n    pw_layer_type,\n    pw_activation_type,\n    conv_mod_activation_type,\n):\n    \"\"\"Build function for conformer block.\n\n    Args:\n        block_arch (dict): conformer block parameters\n        self_attn_type (str): self-attention module type\n        pw_layer_type (str): positionwise layer type\n        pw_activation_type (str): positionwise activation type\n        conv_mod_activation_type (str): convolutional module activation type\n\n    Returns:\n        (function): function to create conformer block\n\n    \"\"\"\n    d_hidden = block_arch[\"d_hidden\"]\n    d_ff = block_arch[\"d_ff\"]\n    heads = block_arch[\"heads\"]\n    macaron_style = block_arch[\"macaron_style\"]\n    use_conv_mod = block_arch[\"use_conv_mod\"]\n\n    dropout_rate = block_arch[\"dropout-rate\"] if \"dropout-rate\" in block_arch else 0.0\n    pos_dropout_rate = (\n        block_arch[\"pos-dropout-rate\"] if \"pos-dropout-rate\" in block_arch else 0.0\n    )\n    att_dropout_rate = (\n        block_arch[\"att-dropout-rate\"] if \"att-dropout-rate\" in block_arch else 0.0\n    )\n\n    if pw_layer_type == \"linear\":\n      
  pw_layer = PositionwiseFeedForward\n        pw_activation = get_activation(pw_activation_type)\n        pw_layer_args = (d_hidden, d_ff, pos_dropout_rate, pw_activation)\n    else:\n        raise NotImplementedError(\"Conformer block only supports linear yet.\")\n\n    if use_conv_mod:\n        conv_layer = ConvolutionModule\n        conv_activation = get_activation(conv_mod_activation_type)\n        conv_layers_args = (d_hidden, block_arch[\"conv_mod_kernel\"], conv_activation)\n\n    return lambda: ConformerEncoderLayer(\n        d_hidden,\n        self_attn_class(heads, d_hidden, att_dropout_rate),\n        pw_layer(*pw_layer_args),\n        pw_layer(*pw_layer_args) if macaron_style else None,\n        conv_layer(*conv_layers_args) if use_conv_mod else None,\n        dropout_rate,\n    )\n\n\ndef build_causal_conv1d_block(block_arch):\n    \"\"\"Build function for causal conv1d block.\n\n    Args:\n        block_arch (dict): causal conv1d block parameters\n\n    Returns:\n        (function): function to create causal conv1d block\n\n    \"\"\"\n    idim = block_arch[\"idim\"]\n    odim = block_arch[\"odim\"]\n    kernel_size = block_arch[\"kernel_size\"]\n\n    return lambda: CausalConv1d(idim, odim, kernel_size)\n\n\ndef build_tdnn_block(block_arch):\n    \"\"\"Build function for tdnn block.\n\n    Args:\n        block_arch (dict): tdnn block parameters\n\n    Returns:\n        (function): function to create tdnn block\n\n    \"\"\"\n    idim = block_arch[\"idim\"]\n    odim = block_arch[\"odim\"]\n    ctx_size = block_arch[\"ctx_size\"]\n    dilation = block_arch[\"dilation\"]\n    stride = block_arch[\"stride\"]\n\n    use_batch_norm = (\n        block_arch[\"use-batch-norm\"] if \"use-batch-norm\" in block_arch else False\n    )\n    use_relu = block_arch[\"use-relu\"] if \"use-relu\" in block_arch else False\n\n    dropout_rate = block_arch[\"dropout-rate\"] if \"dropout-rate\" in block_arch else 0.0\n\n    return lambda: TDNN(\n        idim,\n        
odim,\n        ctx_size=ctx_size,\n        dilation=dilation,\n        stride=stride,\n        dropout_rate=dropout_rate,\n        batch_norm=use_batch_norm,\n        relu=use_relu,\n    )\n\n\ndef build_blocks(\n    net_part,\n    idim,\n    input_layer,\n    blocks_arch,\n    repeat_block=0,\n    self_attn_type=\"self_attn\",\n    positional_encoding_type=\"abs_pos\",\n    positionwise_layer_type=\"linear\",\n    positionwise_activation_type=\"relu\",\n    conv_mod_activation_type=\"relu\",\n    dropout_rate_embed=0.0,\n    padding_idx=-1,\n):\n    \"\"\"Build block for customizable architecture.\n\n    Args:\n        net_part (str): either 'encoder' or 'decoder'\n        idim (int): dimension of inputs\n        input_layer (str): input layer type\n        blocks_arch (list): list of blocks for network part (type and parameters)\n        repeat_block (int): repeat provided blocks N times if N > 1\n        positional_encoding_type (str): positional encoding layer type\n        positionwise_layer_type (str): linear\n        positionwise_activation_type (str): positionwise activation type\n        conv_mod_activation_type (str): convolutional module activation type\n        dropout_rate_embed (float): dropout rate for embedding\n        padding_idx (int): padding index for embedding input layer (if specified)\n\n    Returns:\n        in_layer (torch.nn.*): input layer\n        all_blocks (MultiSequential): all blocks for network part\n        out_dim (int): dimension of last block output\n        conv_subsampling_factor (int): subsampling factor in frontend CNN\n\n    \"\"\"\n    fn_modules = []\n\n    (\n        input_layer,\n        input_layer_odim,\n        input_dropout_rate,\n        input_pos_dropout_rate,\n        out_dim,\n    ) = check_and_prepare(net_part, blocks_arch, input_layer)\n\n    pos_enc_class, self_attn_class = get_pos_enc_and_att_class(\n        net_part, positional_encoding_type, self_attn_type\n    )\n\n    in_layer, conv_subsampling_factor = 
build_input_layer(\n        input_layer,\n        idim,\n        input_layer_odim,\n        pos_enc_class,\n        dropout_rate_embed,\n        input_dropout_rate,\n        input_pos_dropout_rate,\n        padding_idx,\n    )\n\n    for i in range(len(blocks_arch)):\n        block_type = blocks_arch[i][\"type\"]\n\n        if block_type == \"tdnn\":\n            module = build_tdnn_block(blocks_arch[i])\n        elif block_type == \"transformer\":\n            module = build_transformer_block(\n                net_part,\n                blocks_arch[i],\n                positionwise_layer_type,\n                positionwise_activation_type,\n            )\n        elif block_type == \"conformer\":\n            module = build_conformer_block(\n                blocks_arch[i],\n                self_attn_class,\n                positionwise_layer_type,\n                positionwise_activation_type,\n                conv_mod_activation_type,\n            )\n        elif block_type == \"causal-conv1d\":\n            module = build_causal_conv1d_block(blocks_arch[i])\n\n        fn_modules.append(module)\n\n    if repeat_block > 1:\n        fn_modules = fn_modules * repeat_block\n\n    return (\n        in_layer,\n        MultiSequential(*[fn() for fn in fn_modules]),\n        out_dim,\n        conv_subsampling_factor,\n    )\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/causal_conv1d.py",
    "content": "\"\"\"CausalConv1d module definition for custom decoder.\"\"\"\n\nimport torch\n\n\nclass CausalConv1d(torch.nn.Module):\n    \"\"\"CausalConv1d module for custom decoder.\n\n    Args:\n        idim (int): dimension of inputs\n        odim (int): dimension of outputs\n        kernel_size (int): size of convolving kernel\n        stride (int): stride of the convolution\n        dilation (int): spacing between the kernel points\n        groups (int): number of blocked connections from ichannels to ochannels\n        bias (bool): whether to add a learnable bias to the output\n\n    \"\"\"\n\n    def __init__(\n        self, idim, odim, kernel_size, stride=1, dilation=1, groups=1, bias=True\n    ):\n        \"\"\"Construct a CausalConv1d object.\"\"\"\n        super().__init__()\n\n        self._pad = (kernel_size - 1) * dilation\n\n        self.causal_conv1d = torch.nn.Conv1d(\n            idim,\n            odim,\n            kernel_size=kernel_size,\n            stride=stride,\n            padding=self._pad,\n            dilation=dilation,\n            groups=groups,\n            bias=bias,\n        )\n\n    def forward(self, x, x_mask, cache=None):\n        \"\"\"CausalConv1d forward for x.\n\n        Args:\n            x (torch.Tensor): input torch (B, U, idim)\n            x_mask (torch.Tensor): (B, 1, U)\n\n        Returns:\n            x (torch.Tensor): input torch (B, sub(U), attention_dim)\n            x_mask (torch.Tensor): (B, 1, sub(U))\n\n        \"\"\"\n        x = x.permute(0, 2, 1)\n        x = self.causal_conv1d(x)\n\n        if self._pad != 0:\n            x = x[:, :, : -self._pad]\n\n        x = x.permute(0, 2, 1)\n\n        return x, x_mask\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/custom_decoder.py",
    "content": "\"\"\"Custom decoder definition for transducer models.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.transducer.blocks import build_blocks\nfrom espnet.nets.pytorch_backend.transducer.utils import check_batch_state\nfrom espnet.nets.pytorch_backend.transducer.utils import check_state\nfrom espnet.nets.pytorch_backend.transducer.utils import pad_sequence\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.transducer_decoder_interface import TransducerDecoderInterface\n\n\nclass CustomDecoder(TransducerDecoderInterface, torch.nn.Module):\n    \"\"\"Custom decoder module for transducer models.\n\n    Args:\n        odim (int): dimension of outputs\n        dec_arch (list): list of layer definitions\n        input_layer (str): input layer type\n        repeat_block (int): repeat provided blocks N times if N > 1\n        positional_encoding_type (str): positional encoding type\n        positionwise_layer_type (str): linear\n        positionwise_activation_type (str): positionwise activation type\n        dropout_rate_embed (float): dropout rate for embedding layer (if specified)\n        blank (int): blank symbol ID\n\n    \"\"\"\n\n    def __init__(\n        self,\n        odim,\n        dec_arch,\n        input_layer=\"embed\",\n        repeat_block=0,\n        joint_activation_type=\"tanh\",\n        positional_encoding_type=\"abs_pos\",\n        positionwise_layer_type=\"linear\",\n        positionwise_activation_type=\"relu\",\n        dropout_rate_embed=0.0,\n        blank=0,\n    ):\n        \"\"\"Construct a CustomDecoder object.\"\"\"\n        torch.nn.Module.__init__(self)\n\n        self.embed, self.decoders, ddim, _ = build_blocks(\n            \"decoder\",\n            odim,\n            input_layer,\n            dec_arch,\n            repeat_block=repeat_block,\n            
positional_encoding_type=positional_encoding_type,\n            positionwise_layer_type=positionwise_layer_type,\n            positionwise_activation_type=positionwise_activation_type,\n            dropout_rate_embed=dropout_rate_embed,\n            padding_idx=blank,\n        )\n\n        self.after_norm = LayerNorm(ddim)\n\n        self.dlayers = len(self.decoders)\n        self.dunits = ddim\n        self.odim = odim\n\n        self.blank = blank\n\n    def set_device(self, device):\n        \"\"\"Set GPU device to use.\n\n        Args:\n            device (torch.device): device id\n\n        \"\"\"\n        self.device = device\n\n    def init_state(self, batch_size=None, device=None, dtype=None):\n        \"\"\"Initialize decoder states.\n\n        Args:\n            None\n\n        Returns:\n            state (list): batch of decoder states [L x None]\n\n        \"\"\"\n        state = [None] * self.dlayers\n\n        return state\n\n    def forward(self, tgt, tgt_mask, memory):\n        \"\"\"Forward custom decoder.\n\n        Args:\n            tgt (torch.Tensor): input token ids, int64 (batch, maxlen_out)\n                                if input_layer == \"embed\"\n                                input tensor\n                                (batch, maxlen_out, #mels) in the other cases\n            tgt_mask (torch.Tensor): input token mask,  (batch, maxlen_out)\n                                     dtype=torch.uint8 in PyTorch 1.2-\n                                     dtype=torch.bool in PyTorch 1.2+ (include 1.2)\n            memory (torch.Tensor): encoded memory, float32  (batch, maxlen_in, feat)\n\n        Return:\n            tgt (torch.Tensor): decoder output (batch, maxlen_out, dim_dec)\n            tgt_mask (torch.Tensor): score mask before softmax (batch, maxlen_out)\n\n        \"\"\"\n        tgt = self.embed(tgt)\n\n        tgt, tgt_mask = self.decoders(tgt, tgt_mask)\n        tgt = self.after_norm(tgt)\n\n        return tgt, 
tgt_mask\n\n    def score(self, hyp, cache):\n        \"\"\"Forward one step.\n\n        Args:\n            hyp (dataclass): hypothesis\n            cache (dict): states cache\n\n        Returns:\n            y (torch.Tensor): decoder outputs (1, dec_dim)\n            (list): decoder states\n                [L x (1, max_len, dec_dim)]\n            lm_tokens (torch.Tensor): token id for LM (1)\n\n        \"\"\"\n        tgt = torch.tensor([hyp.yseq], device=self.device)\n        lm_tokens = tgt[:, -1]\n\n        str_yseq = \"\".join(list(map(str, hyp.yseq)))\n\n        if str_yseq in cache:\n            y, new_state = cache[str_yseq]\n        else:\n            tgt_mask = subsequent_mask(len(hyp.yseq)).unsqueeze_(0)\n\n            state = check_state(hyp.dec_state, (tgt.size(1) - 1), self.blank)\n\n            tgt = self.embed(tgt)\n\n            new_state = []\n            for s, decoder in zip(state, self.decoders):\n                tgt, tgt_mask = decoder(tgt, tgt_mask, cache=s)\n                new_state.append(tgt)\n\n            y = self.after_norm(tgt[:, -1])\n\n            cache[str_yseq] = (y, new_state)\n\n        return y[0], new_state, lm_tokens\n\n    def batch_score(self, hyps, batch_states, cache, use_lm):\n        \"\"\"Forward batch one step.\n\n        Args:\n            hyps (list): batch of hypotheses\n            batch_states (list): decoder states\n                [L x (B, max_len, dec_dim)]\n            cache (dict): states cache\n\n        Returns:\n            batch_y (torch.Tensor): decoder output (B, dec_dim)\n            batch_states (list): decoder states\n                [L x (B, max_len, dec_dim)]\n            lm_tokens (torch.Tensor): batch of token ids for LM (B)\n\n        \"\"\"\n        final_batch = len(hyps)\n\n        process = []\n        done = [None for _ in range(final_batch)]\n\n        for i, hyp in enumerate(hyps):\n            str_yseq = \"\".join(list(map(str, hyp.yseq)))\n\n            if str_yseq in cache:\n          
      done[i] = cache[str_yseq]\n            else:\n                process.append((str_yseq, hyp.yseq, hyp.dec_state))\n\n        if process:\n            _tokens = pad_sequence([p[1] for p in process], self.blank)\n            batch_tokens = torch.LongTensor(_tokens).to(self.device)\n\n            tgt_mask = (\n                subsequent_mask(batch_tokens.size(-1))\n                .unsqueeze_(0)\n                .expand(len(process), -1, -1)\n            )\n\n            dec_state = self.create_batch_states(\n                self.init_state(),\n                [p[2] for p in process],\n                _tokens,\n            )\n\n            tgt = self.embed(batch_tokens)\n\n            next_state = []\n            for s, decoder in zip(dec_state, self.decoders):\n                tgt, tgt_mask = decoder(tgt, tgt_mask, cache=s)\n                next_state.append(tgt)\n\n            tgt = self.after_norm(tgt[:, -1])\n\n        j = 0\n        for i in range(final_batch):\n            if done[i] is None:\n                new_state = self.select_state(next_state, j)\n\n                done[i] = (tgt[j], new_state)\n                cache[process[j][0]] = (tgt[j], new_state)\n\n                j += 1\n\n        self.create_batch_states(\n            batch_states, [d[1] for d in done], [[0] + h.yseq for h in hyps]\n        )\n        batch_y = torch.stack([d[0] for d in done])\n\n        if use_lm:\n            lm_tokens = torch.LongTensor(\n                [hyp.yseq[-1] for hyp in hyps]\n            ).to(self.device)\n\n            return batch_y, batch_states, lm_tokens\n\n        return batch_y, batch_states, None\n\n    def select_state(self, batch_states, idx):\n        \"\"\"Get decoder state from batch of states, for given id.\n\n        Args:\n            batch_states (list): batch of decoder states\n                [L x (B, max_len, dec_dim)]\n            idx (int): index to extract state from batch of states\n\n        Returns:\n            state_idx 
(list): decoder states for given id\n                [L x (1, max_len, dec_dim)]\n\n        \"\"\"\n        if batch_states[0] is None:\n            return batch_states\n\n        state_idx = [batch_states[layer][idx] for layer in range(self.dlayers)]\n\n        return state_idx\n\n    def create_batch_states(self, batch_states, l_states, check_list):\n        \"\"\"Create batch of decoder states.\n\n        Args:\n            batch_states (list): batch of decoder states\n                [L x (B, max_len, dec_dim)]\n            l_states (list): list of decoder states\n                [B x [L x (1, max_len, dec_dim)]]\n            check_list (list): list of sequences for max_len\n\n        Returns:\n            batch_states (list): batch of decoder states\n                [L x (B, max_len, dec_dim)]\n\n        \"\"\"\n        if l_states[0][0] is None:\n            return batch_states\n\n        max_len = max(len(elem) for elem in check_list) - 1\n\n        for layer in range(self.dlayers):\n            batch_states[layer] = check_batch_state(\n                [s[layer] for s in l_states], max_len, self.blank\n            )\n\n        return batch_states\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/custom_encoder.py",
    "content": "\"\"\"Cutom encoder definition for transducer models.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.transducer.blocks import build_blocks\nfrom espnet.nets.pytorch_backend.transducer.vgg2l import VGG2L\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling\n\n\nclass CustomEncoder(torch.nn.Module):\n    \"\"\"Custom encoder module for transducer models.\n\n    Args:\n        idim (int): input dim\n        enc_arch (list): list of encoder blocks (type and parameters)\n        input_layer (str): input layer type\n        repeat_block (int): repeat provided block N times if N > 1\n        self_attn_type (str): type of self-attention\n        positional_encoding_type (str): positional encoding type\n        positionwise_layer_type (str): linear\n        positionwise_activation_type (str): positionwise activation type\n        conv_mod_activation_type (str): convolutional module activation type\n        normalize_before (bool): whether to use layer_norm before the first block\n        aux_task_layer_list (list): list of layer ids for intermediate output\n        padding_idx (int): padding_idx for embedding input layer (if specified)\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        enc_arch,\n        input_layer=\"linear\",\n        repeat_block=0,\n        self_attn_type=\"selfattn\",\n        positional_encoding_type=\"abs_pos\",\n        positionwise_layer_type=\"linear\",\n        positionwise_activation_type=\"relu\",\n        conv_mod_activation_type=\"relu\",\n        normalize_before=True,\n        aux_task_layer_list=[],\n        padding_idx=-1,\n    ):\n        \"\"\"Construct an CustomEncoder object.\"\"\"\n        super().__init__()\n        (\n            self.embed,\n            self.encoders,\n            self.enc_out,\n            self.conv_subsampling_factor,\n        ) = build_blocks(\n            
\"encoder\",\n            idim,\n            input_layer,\n            enc_arch,\n            repeat_block=repeat_block,\n            self_attn_type=self_attn_type,\n            positional_encoding_type=positional_encoding_type,\n            positionwise_layer_type=positionwise_layer_type,\n            positionwise_activation_type=positionwise_activation_type,\n            conv_mod_activation_type=conv_mod_activation_type,\n            padding_idx=padding_idx,\n        )\n\n        self.normalize_before = normalize_before\n\n        if self.normalize_before:\n            self.after_norm = LayerNorm(self.enc_out)\n\n        self.n_blocks = len(enc_arch) * repeat_block\n\n        self.aux_task_layer_list = aux_task_layer_list\n\n    def forward(self, xs, masks, return_as_intermidiate=False):\n        \"\"\"Encode input sequence.\n\n        Args:\n            xs (torch.Tensor): input tensor\n            masks (torch.Tensor): input mask\n\n        Returns:\n            xs (torch.Tensor or tuple):\n                position embedded output or\n                (position embedded output, auxiliary outputs)\n            mask (torch.Tensor): position embedded mask\n\n        \"\"\"\n        if self.embed is None:\n            xs, masks = xs, masks\n        elif isinstance(self.embed, (Conv2dSubsampling, VGG2L)):\n            xs, masks = self.embed(xs, masks)\n        else:\n            xs = self.embed(xs)\n\n        if self.aux_task_layer_list:\n            aux_xs_list = []\n\n            for b in range(self.n_blocks):\n                xs, masks = self.encoders[b](xs, masks)\n\n                if b in self.aux_task_layer_list:\n                    if isinstance(xs, tuple):\n                        aux_xs = xs[0]\n                    else:\n                        aux_xs = xs\n\n                    if self.normalize_before:\n                        aux_xs_list.append(self.after_norm(aux_xs))\n                    else:\n                        aux_xs_list.append(aux_xs)\n      
  else:\n            xs, masks = self.encoders(xs, masks)\n\n        # When returned as an intermediate output, xs may still be a tuple that\n        # carries pos_emb, so the following conformer layers can reuse it.\n        if return_as_intermidiate:\n            return xs, masks\n\n        if isinstance(xs, tuple):\n            xs = xs[0]\n\n        if self.normalize_before:\n            xs = self.after_norm(xs)\n\n        if self.aux_task_layer_list:\n            return (xs, aux_xs_list), masks\n\n        return xs, masks\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/error_calculator.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\n\n\"\"\"CER/WER monitoring for transducer models.\"\"\"\n\nimport editdistance\n\nfrom espnet.nets.beam_search_transducer import BeamSearchTransducer\n\n\nclass ErrorCalculator(object):\n    \"\"\"Calculate CER and WER for transducer models.\n\n    Args:\n        decoder (torch.nn.Module|TransducerDecoderInterface): decoder module\n        joint_network (torch.nn.Module): joint network module\n        token_list (list): list of tokens\n        sym_space (str): space symbol\n        sym_blank (str): blank symbol\n        report_cer (boolean): compute CER option\n        report_wer (boolean): compute WER option\n\n    \"\"\"\n\n    def __init__(\n        self,\n        decoder,\n        joint_network,\n        token_list,\n        sym_space,\n        sym_blank,\n        report_cer=False,\n        report_wer=False,\n    ):\n        \"\"\"Construct an ErrorCalculator object for transducer model.\"\"\"\n        super().__init__()\n\n        self.beam_search = BeamSearchTransducer(\n            decoder=decoder,\n            joint_network=joint_network,\n            beam_size=1,\n        )\n\n        self.decoder = decoder\n\n        self.token_list = token_list\n        self.space = sym_space\n        self.blank = sym_blank\n\n        self.report_cer = report_cer\n        self.report_wer = report_wer\n\n    def __call__(self, hs_pad, ys_pad):\n        \"\"\"Calculate sentence-level WER/CER score for transducer models.\n\n        Args:\n            hs_pad (torch.Tensor): batch of padded input sequence (batch, T, D)\n            ys_pad (torch.Tensor): reference (batch, seqlen)\n\n        Returns:\n            (float): sentence-level CER score\n            (float): sentence-level WER score\n\n        \"\"\"\n        cer, wer = None, None\n\n        batchsize = int(hs_pad.size(0))\n        batch_nbest = []\n\n        hs_pad = hs_pad.to(next(self.decoder.parameters()).device)\n\n        for b in range(batchsize):\n    
        nbest_hyps = self.beam_search(hs_pad[b])\n            batch_nbest.append(nbest_hyps[-1])\n\n        ys_hat = [nbest_hyp.yseq[1:] for nbest_hyp in batch_nbest]\n\n        seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad.cpu())\n\n        if self.report_cer:\n            cer = self.calculate_cer(seqs_hat, seqs_true)\n\n        if self.report_wer:\n            wer = self.calculate_wer(seqs_hat, seqs_true)\n\n        return cer, wer\n\n    def convert_to_char(self, ys_hat, ys_pad):\n        \"\"\"Convert token ids to text.\n\n        Args:\n            ys_hat (list): predicted token id sequences [batch x seqlen]\n            ys_pad (torch.Tensor): reference (batch, seqlen)\n\n        Returns:\n            (list): prediction strings\n            (list): reference strings\n\n        \"\"\"\n        seqs_hat, seqs_true = [], []\n\n        for i, y_hat in enumerate(ys_hat):\n            y_true = ys_pad[i]\n\n            seq_hat = [self.token_list[int(idx)] for idx in y_hat]\n            seq_true = [self.token_list[int(idx)] for idx in y_true if int(idx) != -1]\n\n            seq_hat_text = \"\".join(seq_hat).replace(self.space, \" \")\n            seq_hat_text = seq_hat_text.replace(self.blank, \"\")\n            seq_true_text = \"\".join(seq_true).replace(self.space, \" \")\n\n            seqs_hat.append(seq_hat_text)\n            seqs_true.append(seq_true_text)\n\n        return seqs_hat, seqs_true\n\n    def calculate_cer(self, seqs_hat, seqs_true):\n        \"\"\"Calculate sentence-level CER score for transducer model.\n\n        Args:\n            seqs_hat (list): prediction strings (batch)\n            seqs_true (list): reference strings (batch)\n\n        Returns:\n            (float): average sentence-level CER score\n\n        \"\"\"\n        char_eds, char_ref_lens = [], []\n\n        for i, seq_hat_text in enumerate(seqs_hat):\n            seq_true_text = seqs_true[i]\n\n            hyp_chars = seq_hat_text.replace(\" \", 
\"\")\n            ref_chars = seq_true_text.replace(\" \", \"\")\n\n            char_eds.append(editdistance.eval(hyp_chars, ref_chars))\n            char_ref_lens.append(len(ref_chars))\n\n        return float(sum(char_eds)) / sum(char_ref_lens)\n\n    def calculate_wer(self, seqs_hat, seqs_true):\n        \"\"\"Calculate sentence-level WER score for transducer model.\n\n        Args:\n            seqs_hat (torch.Tensor): prediction (batch, seqlen)\n            seqs_true (torch.Tensor): reference (batch, seqlen)\n\n        Returns:\n            (float): average sentence-level WER score\n\n        \"\"\"\n        word_eds, word_ref_lens = [], []\n\n        for i, seq_hat_text in enumerate(seqs_hat):\n            seq_true_text = seqs_true[i]\n\n            hyp_words = seq_hat_text.split()\n            ref_words = seq_true_text.split()\n\n            word_eds.append(editdistance.eval(hyp_words, ref_words))\n            word_ref_lens.append(len(ref_words))\n\n        return float(sum(word_eds)) / sum(word_ref_lens)\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/initializer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n\"\"\"Parameter initialization for transducer model.\"\"\"\n\nimport math\n\nfrom espnet.nets.pytorch_backend.initialization import set_forget_bias_to_one\n\n\ndef initializer(model, args):\n    \"\"\"Initialize transducer model.\n\n    Args:\n        model (torch.nn.Module): transducer instance\n        args (Namespace): argument Namespace containing options\n\n    \"\"\"\n\n    # RNN only\n    for name, p in model.named_parameters():\n        if any(x in name for x in [\"enc.\", \"dec.\", \"joint_network\"]):\n            # rnn based parts + joint network\n            if p.dim() == 1:\n                # bias\n                p.data.zero_()\n            elif p.dim() == 2:\n                # linear weight\n                n = p.size(1)\n                stdv = 1.0 / math.sqrt(n)\n                p.data.normal_(0, stdv)\n            elif p.dim() in (3, 4):\n                # conv weight\n                n = p.size(1)\n                for k in p.size()[2:]:\n                    n *= k\n                    stdv = 1.0 / math.sqrt(n)\n                    p.data.normal_(0, stdv)\n\n    if args.dtype != \"custom\":\n        model.dec.embed.weight.data.normal_(0, 1)\n\n        for i in range(model.dec.dlayers):\n            set_forget_bias_to_one(getattr(model.dec.decoder[i], \"bias_ih_l0\"))\n            set_forget_bias_to_one(getattr(model.dec.decoder[i], \"bias_hh_l0\"))\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/joint_network.py",
    "content": "\"\"\"Transducer joint network implementation.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.nets_utils import get_activation\n\n\nclass JointNetwork(torch.nn.Module):\n    \"\"\"Transducer joint network module.\n\n    Args:\n        joint_space_size: Dimension of joint space\n        joint_activation_type: Activation type for joint network\n\n    \"\"\"\n\n    def __init__(\n        self,\n        vocab_size: int,\n        encoder_output_size: int,\n        decoder_output_size: int,\n        joint_space_size: int,\n        joint_activation_type: int,\n    ):\n        \"\"\"Joint network initializer.\"\"\"\n        super().__init__()\n\n        self.lin_enc = torch.nn.Linear(encoder_output_size, joint_space_size)\n        self.lin_dec = torch.nn.Linear(\n            decoder_output_size, joint_space_size, bias=False\n        )\n\n        self.lin_out = torch.nn.Linear(joint_space_size, vocab_size)\n\n        self.joint_activation = get_activation(joint_activation_type)\n\n    def forward(\n        self, h_enc: torch.Tensor, h_dec: torch.Tensor, is_aux: bool = False\n    ) -> torch.Tensor:\n        \"\"\"Joint computation of z.\n\n        Args:\n            h_enc: Batch of expanded hidden state (B, T, 1, D_enc)\n            h_dec: Batch of expanded hidden state (B, 1, U, D_dec)\n\n        Returns:\n            z: Output (B, T, U, vocab_size)\n\n        \"\"\"\n        if is_aux:\n            z = self.joint_activation(h_enc + self.lin_dec(h_dec))\n        else:\n            z = self.joint_activation(self.lin_enc(h_enc) + self.lin_dec(h_dec))\n        z = self.lin_out(z)\n\n        return z\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/loss.py",
    "content": "#!/usr/bin/env python3\n\n\"\"\"Transducer loss module.\"\"\"\n\nimport torch\n\n\nclass TransLoss(torch.nn.Module):\n    \"\"\"Transducer loss module.\n\n    Args:\n        trans_type (str): type of transducer implementation to calculate loss.\n        blank_id (int): blank symbol id\n    \"\"\"\n\n    def __init__(self, trans_type, blank_id):\n        \"\"\"Construct an TransLoss object.\"\"\"\n        super().__init__()\n\n        device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n        if trans_type == \"warp-transducer\":\n            from warprnnt_pytorch import RNNTLoss\n\n            self.trans_loss = RNNTLoss(blank=blank_id)\n        elif trans_type == \"warp-rnnt\":\n            if device.type == \"cuda\":\n                try:\n                    from warp_rnnt import rnnt_loss\n\n                    self.trans_loss = rnnt_loss\n                except ImportError:\n                    raise ImportError(\n                        \"warp-rnnt is not installed. 
Please re-setup\"\n                        \" espnet or use 'warp-transducer'\"\n                    )\n            else:\n                raise ValueError(\"warp-rnnt is not supported in CPU mode\")\n\n        self.trans_type = trans_type\n        self.blank_id = blank_id\n\n    def forward(self, pred_pad, target, pred_len, target_len):\n        \"\"\"Compute transducer loss.\n\n        Args:\n            pred_pad (torch.Tensor): batch of predicted sequences\n                (batch, maxlen_in, maxlen_out+1, odim)\n            target (torch.Tensor): batch of target sequences (batch, maxlen_out)\n            pred_len (torch.Tensor): batch of lengths of predicted sequences (batch)\n            target_len (torch.Tensor): batch of lengths of target sequences (batch)\n\n        Returns:\n            loss (torch.Tensor): transducer loss\n\n        \"\"\"\n        dtype = pred_pad.dtype\n        if dtype != torch.float32:\n            # warp-transducer and warp-rnnt only support float32\n            pred_pad = pred_pad.to(dtype=torch.float32)\n\n        if self.trans_type == \"warp-rnnt\":\n            log_probs = torch.log_softmax(pred_pad, dim=-1)\n\n            loss = self.trans_loss(\n                log_probs,\n                target,\n                pred_len,\n                target_len,\n                reduction=\"mean\",\n                blank=self.blank_id,\n                gather=True,\n            )\n        else:\n            loss = self.trans_loss(pred_pad, target, pred_len, target_len)\n        loss = loss.to(dtype=dtype)\n\n        return loss\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/rnn_decoder.py",
    "content": "\"\"\"RNN decoder for transducer-based models.\"\"\"\n\nimport torch\n\nfrom espnet.nets.transducer_decoder_interface import TransducerDecoderInterface\n\n\nclass DecoderRNNT(TransducerDecoderInterface, torch.nn.Module):\n    \"\"\"RNN-T Decoder module.\n\n    Args:\n        odim (int): dimension of outputs\n        dtype (str): gru or lstm\n        dlayers (int): # prediction layers\n        dunits (int): # prediction units\n        blank (int): blank symbol id\n        embed_dim (int): dimension of embeddings\n        dropout (float): dropout rate\n        dropout_embed (float): embedding dropout rate\n\n    \"\"\"\n\n    def __init__(\n        self,\n        odim,\n        dtype,\n        dlayers,\n        dunits,\n        blank,\n        embed_dim,\n        dropout=0.0,\n        dropout_embed=0.0,\n    ):\n        \"\"\"Transducer initializer.\"\"\"\n        super().__init__()\n\n        self.embed = torch.nn.Embedding(odim, embed_dim, padding_idx=blank)\n        self.dropout_embed = torch.nn.Dropout(p=dropout_embed)\n\n        dec_net = torch.nn.LSTM if dtype == \"lstm\" else torch.nn.GRU\n\n        self.decoder = torch.nn.ModuleList(\n            [dec_net(embed_dim, dunits, 1, batch_first=True)]\n        )\n        self.dropout_dec = torch.nn.Dropout(p=dropout)\n\n        for _ in range(1, dlayers):\n            self.decoder += [dec_net(dunits, dunits, 1, batch_first=True)]\n\n        self.dlayers = dlayers\n        self.dunits = dunits\n        self.dtype = dtype\n\n        self.odim = odim\n\n        self.ignore_id = -1\n        self.blank = blank\n\n        self.multi_gpus = torch.cuda.device_count() > 1\n\n    def set_device(self, device):\n        \"\"\"Set GPU device to use.\n\n        Args:\n            device (torch.device): device id\n\n        \"\"\"\n        self.device = device\n\n    def set_data_type(self, data_type):\n        \"\"\"Set GPU device to use.\n\n        Args:\n            data_type (torch.dtype): Tensor data type\n\n 
       \"\"\"\n        self.data_type = data_type\n\n    def init_state(self, batch_size):\n        \"\"\"Initialize decoder states.\n\n        Args:\n            batch_size (int): Batch size\n\n        Returns:\n            (tuple): batch of decoder states\n                ((L, B, dec_dim), (L, B, dec_dim))\n\n        \"\"\"\n        h_n = torch.zeros(\n            self.dlayers,\n            batch_size,\n            self.dunits,\n            device=self.device,\n            dtype=self.data_type,\n        )\n\n        if self.dtype == \"lstm\":\n            c_n = torch.zeros(\n                self.dlayers,\n                batch_size,\n                self.dunits,\n                device=self.device,\n                dtype=self.data_type,\n            )\n\n            return (h_n, c_n)\n\n        return (h_n, None)\n\n    def rnn_forward(self, y, state):\n        \"\"\"RNN forward.\n\n        Args:\n            y (torch.Tensor): batch of input features (B, emb_dim)\n            state (tuple): batch of decoder states\n                ((L, B, dec_dim), (L, B, dec_dim))\n\n        Returns:\n            y (torch.Tensor): batch of output features (B, dec_dim)\n            (tuple): batch of decoder states\n                (L, B, dec_dim), (L, B, dec_dim))\n\n        \"\"\"\n        h_prev, c_prev = state\n        h_next, c_next = self.init_state(y.size(0))\n\n        for layer in range(self.dlayers):\n            if self.dtype == \"lstm\":\n                y, (\n                    h_next[layer : layer + 1],\n                    c_next[layer : layer + 1],\n                ) = self.decoder[layer](\n                    y, hx=(h_prev[layer : layer + 1], c_prev[layer : layer + 1])\n                )\n            else:\n                y, h_next[layer : layer + 1] = self.decoder[layer](\n                    y, hx=h_prev[layer : layer + 1]\n                )\n\n            y = self.dropout_dec(y)\n\n        return y, (h_next, c_next)\n\n    def forward(self, hs_pad, 
ys_in_pad):\n        \"\"\"Forward function for transducer.\n\n        Args:\n            hs_pad (torch.Tensor):\n                batch of padded hidden state sequences (B, Tmax, D)\n            ys_in_pad (torch.Tensor):\n                batch of padded character id sequence tensor (B, Lmax+1)\n\n        Returns:\n            h_dec (torch.Tensor):\n                batch of decoder output sequences (B, Lmax+1, dec_dim)\n\n        \"\"\"\n        self.set_device(hs_pad.device)\n        self.set_data_type(hs_pad.dtype)\n\n        state = self.init_state(hs_pad.size(0))\n        eys = self.dropout_embed(self.embed(ys_in_pad))\n\n        h_dec, _ = self.rnn_forward(eys, state)\n\n        return h_dec\n\n    def score(self, hyp, cache):\n        \"\"\"Forward one step.\n\n        Args:\n            hyp (dataclass): hypothesis\n            cache (dict): states cache\n\n        Returns:\n            y (torch.Tensor): decoder outputs (dec_dim)\n            state (tuple): decoder states\n                ((L, 1, dec_dim), (L, 1, dec_dim)),\n            (torch.Tensor): token id for LM (1,)\n\n        \"\"\"\n        vy = torch.full((1, 1), hyp.yseq[-1], dtype=torch.long, device=self.device)\n\n        str_yseq = \"\".join(list(map(str, hyp.yseq)))\n\n        if str_yseq in cache:\n            y, state = cache[str_yseq]\n        else:\n            ey = self.embed(vy)\n\n            y, state = self.rnn_forward(ey, hyp.dec_state)\n            cache[str_yseq] = (y, state)\n\n        return y[0][0], state, vy[0]\n\n    def batch_score(self, hyps, batch_states, cache, use_lm):\n        \"\"\"Forward batch one step.\n\n        Args:\n            hyps (list): batch of hypotheses\n            batch_states (tuple): batch of decoder states\n                ((L, B, dec_dim), (L, B, dec_dim))\n            cache (dict): states cache\n            use_lm (bool): whether a LM is used for decoding\n\n        Returns:\n            batch_y (torch.Tensor): decoder output (B, dec_dim)\n            batch_states (tuple): batch of decoder states\n    
            ((L, B, dec_dim), (L, B, dec_dim))\n            lm_tokens (torch.Tensor): batch of token ids for LM (B)\n\n        \"\"\"\n        final_batch = len(hyps)\n\n        process = []\n        done = [None] * final_batch\n\n        for i, hyp in enumerate(hyps):\n            str_yseq = \"\".join(list(map(str, hyp.yseq)))\n\n            if str_yseq in cache:\n                done[i] = cache[str_yseq]\n            else:\n                process.append((str_yseq, hyp.yseq[-1], hyp.dec_state))\n\n        if process:\n            tokens = torch.LongTensor([[p[1]] for p in process]).to(self.device)\n            dec_state = self.create_batch_states(\n                self.init_state(tokens.size(0)), [p[2] for p in process]\n            )\n\n            ey = self.embed(tokens)\n            y, dec_state = self.rnn_forward(ey, dec_state)\n\n        j = 0\n        for i in range(final_batch):\n            if done[i] is None:\n                new_state = self.select_state(dec_state, j)\n\n                done[i] = (y[j], new_state)\n                cache[process[j][0]] = (y[j], new_state)\n\n                j += 1\n\n        batch_y = torch.cat([d[0] for d in done], dim=0)\n        batch_states = self.create_batch_states(batch_states, [d[1] for d in done])\n\n        if use_lm:\n            lm_tokens = torch.LongTensor([h.yseq[-1] for h in hyps]).to(self.device)\n\n            return batch_y, batch_states, lm_tokens\n\n        return batch_y, batch_states, None\n\n    def select_state(self, batch_states, idx):\n        \"\"\"Get decoder state from batch of states, for given id.\n\n        Args:\n            batch_states (tuple): batch of decoder states\n                ((L, B, dec_dim), (L, B, dec_dim))\n            idx (int): index to extract state from batch of states\n\n        Returns:\n            (tuple): decoder states for given id\n                ((L, 1, dec_dim), (L, 1, dec_dim))\n\n        \"\"\"\n        return (\n            batch_states[0][:, idx : idx 
+ 1, :],\n            batch_states[1][:, idx : idx + 1, :] if self.dtype == \"lstm\" else None,\n        )\n\n    def create_batch_states(self, batch_states, l_states, l_tokens=None):\n        \"\"\"Create batch of decoder states.\n\n        Args:\n            batch_states (tuple): batch of decoder states\n                ((L, B, dec_dim), (L, B, dec_dim))\n            l_states (list): list of decoder states\n                [B x ((L, 1, dec_dim), (L, 1, dec_dim))]\n            l_tokens (list): unused by the RNN decoder\n\n        Returns:\n            batch_states (tuple): batch of decoder states\n                ((L, B, dec_dim), (L, B, dec_dim))\n\n        \"\"\"\n        return (\n            torch.cat([s[0] for s in l_states], dim=1),\n            torch.cat([s[1] for s in l_states], dim=1)\n            if self.dtype == \"lstm\"\n            else None,\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/rnn_encoder.py",
    "content": "\"\"\"RNN encoder implementation for transducer-based models.\n\nThese classes are based on the ones in espnet.nets.pytorch_backend.rnn.encoders,\nand modified to output intermediate layers representation based on a list of\nlayers given as input. These additional outputs are intended to be used with\nauxiliary tasks.\nIt should be noted that, here, RNN class rely on a stack of 1-layer LSTM instead\nof a multi-layer LSTM for that purpose.\n\n\"\"\"\n\nimport argparse\nimport logging\nfrom typing import List\nfrom typing import Optional\nfrom typing import Tuple\nfrom typing import Union\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch.nn.utils.rnn import pack_padded_sequence\nfrom torch.nn.utils.rnn import pad_packed_sequence\n\nfrom espnet.nets.e2e_asr_common import get_vgg2l_odim\nfrom espnet.nets.pytorch_backend.nets_utils import make_pad_mask\nfrom espnet.nets.pytorch_backend.nets_utils import to_device\n\n\nclass RNNP(torch.nn.Module):\n    \"\"\"RNN with projection layer module.\n\n    Args:\n        idim: Dimension of inputs\n        elayers: Dimension of encoder layers\n        cdim: Number of units (results in cdim * 2 if bidirectional)\n        hdim: Number of projection units\n        subsample: List of subsampling number\n        dropout: Dropout rate\n        typ: RNN type\n        aux_task_layer_list: List of layer ids for intermediate output\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim: int,\n        elayers: int,\n        cdim: int,\n        hdim: int,\n        subsample: np.ndarray,\n        dropout: float,\n        typ: str = \"blstm\",\n        aux_task_layer_list: List = [],\n    ):\n        \"\"\"Initialize RNNP module.\"\"\"\n        super(RNNP, self).__init__()\n\n        bidir = typ[0] == \"b\"\n        for i in range(elayers):\n            if i == 0:\n                inputdim = idim\n            else:\n                inputdim = hdim\n\n            RNN = torch.nn.LSTM if 
\"lstm\" in typ else torch.nn.GRU\n            rnn = RNN(\n                inputdim, cdim, num_layers=1, bidirectional=bidir, batch_first=True\n            )\n\n            setattr(self, \"%s%d\" % (\"birnn\" if bidir else \"rnn\", i), rnn)\n\n            if bidir:\n                setattr(self, \"bt%d\" % i, torch.nn.Linear(2 * cdim, hdim))\n            else:\n                setattr(self, \"bt%d\" % i, torch.nn.Linear(cdim, hdim))\n\n        self.elayers = elayers\n        self.cdim = cdim\n        self.subsample = subsample\n        self.typ = typ\n        self.bidir = bidir\n        self.dropout = dropout\n\n        self.aux_task_layer_list = aux_task_layer_list\n\n    def forward(\n        self,\n        xs_pad: torch.Tensor,\n        ilens: torch.Tensor,\n        prev_state: Optional[torch.Tensor] = None,\n    ) -> Union[Tuple[torch.Tensor, List], torch.Tensor]:\n        \"\"\"RNNP forward.\n\n        Args:\n            xs_pad: Batch of padded input sequences (B, Tmax, idim)\n            ilens: Batch of lengths of input sequences (B)\n            prev_state: Batch of previous RNN states\n\n        Returns:\n            : Batch of padded output sequences (B, Tmax, hdim)\n                    or tuple w/ aux outputs ((B, Tmax, hdim), [L x (B, Tmax, hdim)])\n            : Batch of lengths of output sequences (B)\n            : Batch of hidden state sequences (B, Tmax, hdim)\n\n        \"\"\"\n        logging.debug(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n\n        aux_xs_list = []\n        elayer_states = []\n        for layer in range(self.elayers):\n            if not isinstance(ilens, torch.Tensor):\n                ilens = torch.tensor(ilens)\n\n            xs_pack = pack_padded_sequence(xs_pad, ilens.cpu(), batch_first=True)\n            rnn = getattr(self, (\"birnn\" if self.bidir else \"rnn\") + str(layer))\n            rnn.flatten_parameters()\n\n            if prev_state is not None and rnn.bidirectional:\n                prev_state 
= reset_backward_rnn_state(prev_state)\n\n            ys, states = rnn(\n                xs_pack, hx=None if prev_state is None else prev_state[layer]\n            )\n            elayer_states.append(states)\n\n            ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)\n\n            sub = self.subsample[layer + 1]\n            if sub > 1:\n                ys_pad = ys_pad[:, ::sub]\n                ilens = torch.tensor([int(i + 1) // sub for i in ilens])\n\n            projection_layer = getattr(self, \"bt%d\" % layer)\n            projected = projection_layer(ys_pad.contiguous().view(-1, ys_pad.size(2)))\n            xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)\n\n            if layer in self.aux_task_layer_list:\n                aux_xs_list.append(xs_pad)\n\n            if layer < self.elayers - 1:\n                xs_pad = torch.tanh(F.dropout(xs_pad, p=self.dropout))\n\n        if aux_xs_list:\n            return (xs_pad, aux_xs_list), ilens, elayer_states\n        else:\n            return xs_pad, ilens, elayer_states\n\n\nclass RNN(torch.nn.Module):\n    \"\"\"RNN module.\n\n    Args:\n        idim: Dimension of inputs\n        elayers: Number of encoder layers\n        cdim: Number of rnn units (resulted in cdim * 2 if bidirectional)\n        hdim: Number of final projection units\n        dropout: Dropout rate\n        typ: The RNN type\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim: int,\n        elayers: int,\n        cdim: int,\n        hdim: int,\n        dropout: float,\n        typ: str = \"blstm\",\n        aux_task_layer_list: List = [],\n    ):\n        \"\"\"Initialize RNN module.\"\"\"\n        super(RNN, self).__init__()\n\n        bidir = typ[0] == \"b\"\n\n        for i in range(elayers):\n            if i == 0:\n                inputdim = idim\n            else:\n                inputdim = cdim\n\n            layer_type = torch.nn.LSTM if \"lstm\" in typ else torch.nn.GRU\n            rnn = 
layer_type(\n                inputdim, cdim, num_layers=1, bidirectional=bidir, batch_first=True\n            )\n\n            setattr(self, \"%s%d\" % (\"birnn\" if bidir else \"rnn\", i), rnn)\n\n        self.dropout = torch.nn.Dropout(p=dropout)\n\n        self.elayers = elayers\n        self.cdim = cdim\n        self.hdim = hdim\n        self.typ = typ\n        self.bidir = bidir\n\n        self.l_last = torch.nn.Linear(cdim, hdim)\n\n        self.aux_task_layer_list = aux_task_layer_list\n\n    def forward(\n        self,\n        xs_pad: torch.Tensor,\n        ilens: torch.Tensor,\n        prev_state: Optional[torch.Tensor] = None,\n    ) -> Union[Tuple[torch.Tensor, List], torch.Tensor]:\n        \"\"\"RNN forward.\n\n        Args:\n            xs_pad: Batch of padded input sequences (B, Tmax, idim)\n            ilens: Batch of lengths of input sequences (B)\n            prev_state: Batch of previous RNN states\n\n        Returns:\n            : Batch of padded output sequences (B, Tmax, hdim)\n                    or tuple w/ aux outputs ((B, Tmax, hdim), [L x (B, Tmax, hdim)])\n            : Batch of lengths of output sequences (B)\n            : Batch of hidden state sequences (B, Tmax, hdim)\n\n        \"\"\"\n        logging.debug(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n\n        aux_xs_list = []\n        elayer_states = []\n        for layer in range(self.elayers):\n            if not isinstance(ilens, torch.Tensor):\n                ilens = torch.tensor(ilens)\n\n            xs_pack = pack_padded_sequence(xs_pad, ilens.cpu(), batch_first=True)\n\n            rnn = getattr(self, (\"birnn\" if self.bidir else \"rnn\") + str(layer))\n            rnn.flatten_parameters()\n\n            if prev_state is not None and rnn.bidirectional:\n                prev_state = reset_backward_rnn_state(prev_state)\n\n            xs, states = rnn(\n                xs_pack, hx=None if prev_state is None else prev_state[layer]\n            )\n          
  elayer_states.append(states)\n\n            xs_pad, ilens = pad_packed_sequence(xs, batch_first=True)\n\n            if self.bidir:\n                xs_pad = xs_pad[:, :, : self.cdim] + xs_pad[:, :, self.cdim :]\n\n            if layer in self.aux_task_layer_list:\n                aux_projected = torch.tanh(\n                    self.l_last(xs_pad.contiguous().view(-1, xs_pad.size(2)))\n                )\n                aux_xs_pad = aux_projected.view(xs_pad.size(0), xs_pad.size(1), -1)\n\n                aux_xs_list.append(aux_xs_pad)\n\n            if layer < self.elayers - 1:\n                xs_pad = self.dropout(xs_pad)\n\n        projected = torch.tanh(\n            self.l_last(xs_pad.contiguous().view(-1, xs_pad.size(2)))\n        )\n        xs_pad = projected.view(xs_pad.size(0), xs_pad.size(1), -1)\n\n        if aux_xs_list:\n            return (xs_pad, aux_xs_list), ilens, elayer_states\n        else:\n            return xs_pad, ilens, elayer_states\n\n\ndef reset_backward_rnn_state(\n    states: Union[torch.Tensor, Tuple, List]\n) -> Union[torch.Tensor, Tuple, List]:\n    \"\"\"Set backward BRNN states to zeroes.\n\n    Args:\n        states: RNN states\n\n    Returns:\n        states: RNN states with backward set to zeroes\n\n    \"\"\"\n    if isinstance(states, (list, tuple)):\n        for state in states:\n            state[1::2] = 0.0\n    else:\n        states[1::2] = 0.0\n    return states\n\n\nclass VGG2L(torch.nn.Module):\n    \"\"\"VGG-like module.\n\n    Args:\n        in_channel: number of input channels\n\n    \"\"\"\n\n    def __init__(self, in_channel: int = 1):\n        \"\"\"Initialize VGG-like module.\"\"\"\n        super(VGG2L, self).__init__()\n\n        # CNN layer (VGG motivated)\n        self.conv1_1 = torch.nn.Conv2d(in_channel, 64, 3, stride=1, padding=1)\n        self.conv1_2 = torch.nn.Conv2d(64, 64, 3, stride=1, padding=1)\n        self.conv2_1 = torch.nn.Conv2d(64, 128, 3, stride=1, padding=1)\n        self.conv2_2 = 
torch.nn.Conv2d(128, 128, 3, stride=1, padding=1)\n\n        self.in_channel = in_channel\n\n    def forward(self, xs_pad: torch.Tensor, ilens: torch.Tensor, **kwargs):\n        \"\"\"VGG2L forward.\n\n        Args:\n            xs_pad: Batch of padded input sequences (B, Tmax, D)\n            ilens: Batch of lengths of input sequences (B)\n\n        Returns:\n            : Batch of padded output sequences (B, Tmax // 4, 128 * D // 4)\n            : Batch of lengths of output sequences (B)\n\n        \"\"\"\n        logging.debug(self.__class__.__name__ + \" input lengths: \" + str(ilens))\n\n        xs_pad = xs_pad.view(\n            xs_pad.size(0),\n            xs_pad.size(1),\n            self.in_channel,\n            xs_pad.size(2) // self.in_channel,\n        ).transpose(1, 2)\n\n        xs_pad = F.relu(self.conv1_1(xs_pad))\n        xs_pad = F.relu(self.conv1_2(xs_pad))\n        xs_pad = F.max_pool2d(xs_pad, 2, stride=2, ceil_mode=True)\n\n        xs_pad = F.relu(self.conv2_1(xs_pad))\n        xs_pad = F.relu(self.conv2_2(xs_pad))\n        xs_pad = F.max_pool2d(xs_pad, 2, stride=2, ceil_mode=True)\n\n        if torch.is_tensor(ilens):\n            ilens = ilens.cpu().numpy()\n        else:\n            ilens = np.array(ilens, dtype=np.float32)\n        ilens = np.array(np.ceil(ilens / 2), dtype=np.int64)\n        ilens = np.array(\n            np.ceil(np.array(ilens, dtype=np.float32) / 2), dtype=np.int64\n        ).tolist()\n\n        xs_pad = xs_pad.transpose(1, 2)\n        xs_pad = xs_pad.contiguous().view(\n            xs_pad.size(0), xs_pad.size(1), xs_pad.size(2) * xs_pad.size(3)\n        )\n\n        return xs_pad, ilens, None\n\n\nclass Encoder(torch.nn.Module):\n    \"\"\"Encoder module.\n\n    Args:\n        etype: Type of encoder network\n        idim: Number of dimensions of encoder network\n        elayers: Number of layers of encoder network\n        eunits: Number of RNN units of encoder network\n        eprojs: Number of projection units of 
encoder network\n        subsample: List of subsampling numbers\n        dropout: Dropout rate\n        in_channel: Number of input channels\n\n    \"\"\"\n\n    def __init__(\n        self,\n        etype: str,\n        idim: int,\n        elayers: int,\n        eunits: int,\n        eprojs: int,\n        subsample: np.ndarray,\n        dropout: float,\n        in_channel: int = 1,\n        aux_task_layer_list: List = [],\n    ):\n        \"\"\"Initialize Encoder module.\"\"\"\n        super(Encoder, self).__init__()\n\n        typ = etype.lstrip(\"vgg\").rstrip(\"p\")\n        if typ not in [\"lstm\", \"gru\", \"blstm\", \"bgru\"]:\n            logging.error(\"Error: need to specify an appropriate encoder architecture\")\n\n        if etype.startswith(\"vgg\"):\n            if etype[-1] == \"p\":\n                self.enc = torch.nn.ModuleList(\n                    [\n                        VGG2L(in_channel),\n                        RNNP(\n                            get_vgg2l_odim(idim, in_channel=in_channel),\n                            elayers,\n                            eunits,\n                            eprojs,\n                            subsample,\n                            dropout,\n                            typ=typ,\n                            aux_task_layer_list=aux_task_layer_list,\n                        ),\n                    ]\n                )\n                logging.info(\"Use CNN-VGG + \" + typ.upper() + \"P for encoder\")\n            else:\n                self.enc = torch.nn.ModuleList(\n                    [\n                        VGG2L(in_channel),\n                        RNN(\n                            get_vgg2l_odim(idim, in_channel=in_channel),\n                            elayers,\n                            eunits,\n                            eprojs,\n                            dropout,\n                            typ=typ,\n                            aux_task_layer_list=aux_task_layer_list,\n                   
     ),\n                    ]\n                )\n                logging.info(\"Use CNN-VGG + \" + typ.upper() + \" for encoder\")\n            self.conv_subsampling_factor = 4\n        else:\n            if etype[-1] == \"p\":\n                self.enc = torch.nn.ModuleList(\n                    [\n                        RNNP(\n                            idim,\n                            elayers,\n                            eunits,\n                            eprojs,\n                            subsample,\n                            dropout,\n                            typ=typ,\n                            aux_task_layer_list=aux_task_layer_list,\n                        )\n                    ]\n                )\n                logging.info(typ.upper() + \" with every-layer projection for encoder\")\n            else:\n                self.enc = torch.nn.ModuleList(\n                    [\n                        RNN(\n                            idim,\n                            elayers,\n                            eunits,\n                            eprojs,\n                            dropout,\n                            typ=typ,\n                            aux_task_layer_list=aux_task_layer_list,\n                        )\n                    ]\n                )\n                logging.info(typ.upper() + \" without projection for encoder\")\n            self.conv_subsampling_factor = 1\n\n    def forward(self, xs_pad, ilens, prev_states=None):\n        \"\"\"Forward encoder.\n\n        Args:\n            xs_pad: Batch of padded input sequences (B, Tmax, idim)\n            ilens: Batch of lengths of input sequences (B)\n            prev_state: Batch of previous encoder hidden states (B, ??)\n\n        Returns:\n            : Batch of padded output sequences (B, Tmax, hdim)\n                    or tuple w/ aux outputs ((B, Tmax, hdim), [L x (B, Tmax, hdim)])\n            : Batch of lengths of output sequences (B)\n            : Batch of 
hidden state sequences (B, Tmax, hdim)\n\n        \"\"\"\n        if prev_states is None:\n            prev_states = [None] * len(self.enc)\n        assert len(prev_states) == len(self.enc)\n\n        current_states = []\n        for module, prev_state in zip(self.enc, prev_states):\n            xs_pad, ilens, states = module(\n                xs_pad,\n                ilens,\n                prev_state=prev_state,\n            )\n            current_states.append(states)\n\n        if isinstance(xs_pad, tuple):\n            final_xs_pad, aux_xs_list = xs_pad[0], xs_pad[1]\n\n            mask = to_device(final_xs_pad, make_pad_mask(ilens).unsqueeze(-1))\n\n            aux_xs_list = [layer.masked_fill(mask, 0.0) for layer in aux_xs_list]\n\n            return (\n                (\n                    final_xs_pad.masked_fill(mask, 0.0),\n                    aux_xs_list,\n                ),\n                ilens,\n                current_states,\n            )\n        else:\n            mask = to_device(xs_pad, make_pad_mask(ilens).unsqueeze(-1))\n\n            return xs_pad.masked_fill(mask, 0.0), ilens, current_states\n\n\ndef encoder_for(\n    args: argparse.Namespace,\n    idim: Union[int, List],\n    subsample: np.ndarray,\n    aux_task_layer_list: List = [],\n) -> Union[torch.nn.Module, List[torch.nn.Module]]:\n    \"\"\"Instantiate an encoder module given the program arguments.\n\n    Args:\n        args: The model arguments\n        idim: Dimension of inputs or list of dimensions of inputs for each encoder\n        subsample: subsample factors or list of subsample factors for each encoder\n\n    Returns:\n        : The encoder module or list of encoder modules\n\n    \"\"\"\n    return Encoder(\n        args.etype,\n        idim,\n        args.elayers,\n        args.eunits,\n        args.eprojs,\n        subsample,\n        args.dropout_rate,\n        aux_task_layer_list=aux_task_layer_list,\n    )\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/tdnn.py",
    "content": "\"\"\"TDNN modules definition for transformer encoder.\"\"\"\n\nimport logging\nfrom typing import Tuple\nfrom typing import Union\n\nimport torch\n\n\nclass TDNN(torch.nn.Module):\n    \"\"\"TDNN implementation with symmetric context.\n\n    Args:\n        idim: Dimension of inputs\n        odim: Dimension of outputs\n        ctx_size: Size of context window\n        stride: Stride of the sliding blocks\n        dilation: Parameter to control the stride of\n                  elements within the neighborhood\n        batch_norm: Whether to use batch normalization\n        relu: Whether to use non-linearity layer (ReLU)\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim: int,\n        odim: int,\n        ctx_size: int = 5,\n        dilation: int = 1,\n        stride: int = 1,\n        batch_norm: bool = False,\n        relu: bool = True,\n        dropout_rate: float = 0.0,\n    ):\n        \"\"\"Construct a TDNN object.\"\"\"\n        super().__init__()\n\n        self.idim = idim\n        self.odim = odim\n\n        self.ctx_size = ctx_size\n        self.stride = stride\n        self.dilation = dilation\n\n        self.batch_norm = batch_norm\n        self.relu = relu\n\n        self.tdnn = torch.nn.Conv1d(\n            idim, odim, ctx_size, stride=stride, dilation=dilation\n        )\n\n        if self.relu:\n            self.relu_func = torch.nn.ReLU()\n\n        if self.batch_norm:\n            self.bn = torch.nn.BatchNorm1d(odim)\n\n        self.dropout = torch.nn.Dropout(p=dropout_rate)\n\n    def forward(\n        self,\n        x_input: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],\n        masks: torch.Tensor,\n    ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor]:\n        \"\"\"Forward TDNN.\n\n        Args:\n            x_input: Input tensor (B, T, idim) or ((B, T, idim), (B, T, att_dim))\n            or ((B, T, idim), (B, 2*T-1, att_dim))\n            masks: Input mask (B, 1, T)\n\n   
     Returns:\n            x_output: Output tensor (B, sub(T), odim)\n                          or ((B, sub(T), odim), (B, sub(T), att_dim))\n            mask: Output mask (B, 1, sub(T))\n\n        \"\"\"\n        if isinstance(x_input, tuple):\n            xs, pos_emb = x_input[0], x_input[1]\n        else:\n            xs, pos_emb = x_input, None\n\n        # The bidirect_pos is used to distinguish legacy_rel_pos and rel_pos in\n        # Conformer model. Note the `legacy_rel_pos` will be deprecated in the future.\n        # Details can be found in https://github.com/espnet/espnet/pull/2816.\n        if pos_emb is not None and pos_emb.size(1) == 2 * xs.size(1) - 1:\n            logging.warning(\"Using bidirectional relative positional encoding.\")\n            bidirect_pos = True\n        else:\n            bidirect_pos = False\n\n        xs = xs.transpose(1, 2)\n        xs = self.tdnn(xs)\n\n        if self.relu:\n            xs = self.relu_func(xs)\n\n        xs = self.dropout(xs)\n\n        if self.batch_norm:\n            xs = self.bn(xs)\n\n        xs = xs.transpose(1, 2)\n\n        return self.create_outputs(xs, pos_emb, masks, bidirect_pos=bidirect_pos)\n\n    def create_outputs(\n        self,\n        xs: torch.Tensor,\n        pos_emb: torch.Tensor,\n        masks: torch.Tensor,\n        bidirect_pos: bool = False,\n    ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], torch.Tensor]:\n        \"\"\"Create outputs with subsampled version of pos_emb and masks.\n\n        Args:\n            xs: Output tensor (B, sub(T), odim)\n            pos_emb: Input positional embedding tensor (B, T, att_dim)\n            or (B, 2*T-1, att_dim)\n            masks: Input mask (B, 1, T)\n            bidirect_pos: whether to use bidirectional positional embedding\n\n        Returns:\n            xs: Output tensor (B, sub(T), odim)\n            pos_emb: Output positional embedding tensor (B, sub(T), att_dim)\n            or (B, 2*sub(T)-1, att_dim)\n      
      masks: Output mask (B, 1, sub(T))\n\n        \"\"\"\n        sub = (self.ctx_size - 1) * self.dilation\n\n        if masks is not None:\n            if sub != 0:\n                masks = masks[:, :, :-sub]\n\n            masks = masks[:, :, :: self.stride]\n\n        if pos_emb is not None:\n            # If the bidirect_pos is true, the pos_emb will include both positive and\n            # negative embeddings. Refer to https://github.com/espnet/espnet/pull/2816.\n            if bidirect_pos:\n                pos_emb_positive = pos_emb[:, : pos_emb.size(1) // 2 + 1, :]\n                pos_emb_negative = pos_emb[:, pos_emb.size(1) // 2 :, :]\n\n                if sub != 0:\n                    pos_emb_positive = pos_emb_positive[:, :-sub, :]\n                    pos_emb_negative = pos_emb_negative[:, :-sub, :]\n\n                pos_emb_positive = pos_emb_positive[:, :: self.stride, :]\n                pos_emb_negative = pos_emb_negative[:, :: self.stride, :]\n                pos_emb = torch.cat(\n                    [pos_emb_positive, pos_emb_negative[:, 1:, :]], dim=1\n                )\n            else:\n                if sub != 0:\n                    pos_emb = pos_emb[:, :-sub, :]\n\n                pos_emb = pos_emb[:, :: self.stride, :]\n\n            return (xs, pos_emb), masks\n\n        return xs, masks\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/transformer_decoder_layer.py",
    "content": "\"\"\"Decoder layer definition for transformer-transducer models.\"\"\"\n\nimport torch\nfrom torch import nn\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\nclass DecoderLayer(nn.Module):\n    \"\"\"Single decoder layer module for transformer-transducer models.\n\n    Args:\n        size (int): input dim\n        self_attn (MultiHeadedAttention): self attention module\n        feed_forward (PositionwiseFeedForward): feed forward layer module\n        dropout_rate (float): dropout rate\n        normalize_before (bool): whether to use layer_norm before the first block\n\n    \"\"\"\n\n    def __init__(self, size, self_attn, feed_forward, dropout_rate):\n        \"\"\"Construct an DecoderLayer object.\"\"\"\n        super().__init__()\n\n        self.self_attn = self_attn\n        self.feed_forward = feed_forward\n\n        self.norm1 = LayerNorm(size)\n        self.norm2 = LayerNorm(size)\n\n        self.dropout = nn.Dropout(dropout_rate)\n\n        self.size = size\n\n    def forward(self, tgt, tgt_mask, cache=None):\n        \"\"\"Compute decoded features.\n\n        Args:\n            tgt (torch.Tensor): decoded previous target features (B, Lmax, idim)\n            tgt_mask (torch.Tensor): mask for tgt (B, Lmax)\n            cache (torch.Tensor): cached output (B, Lmax-1, idim)\n\n        Returns:\n            tgt (torch.Tensor): decoder target features (B, Lmax, odim)\n            tgt_mask (torch.Tensor): mask for tgt (B, Lmax)\n        \"\"\"\n        residual = tgt\n        tgt = self.norm1(tgt)\n\n        if cache is None:\n            tgt_q = tgt\n        else:\n            assert cache.shape == (\n                tgt.shape[0],\n                tgt.shape[1] - 1,\n                self.size,\n            ), f\"{cache.shape} == {(tgt.shape[0], tgt.shape[1] - 1, self.size)}\"\n\n            tgt_q = tgt[:, -1:, :]\n            residual = residual[:, -1:, :]\n\n            if tgt_mask is not None:\n                
tgt_mask = tgt_mask[:, -1:, :]\n\n        tgt = residual + self.dropout(self.self_attn(tgt_q, tgt, tgt, tgt_mask))\n\n        residual = tgt\n        tgt = self.norm2(tgt)\n\n        tgt = residual + self.dropout(self.feed_forward(tgt))\n\n        if cache is not None:\n            tgt = torch.cat([cache, tgt], dim=1)\n\n        return tgt, tgt_mask\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/utils.py",
    "content": "\"\"\"Utility functions for transducer models.\"\"\"\n\nimport os\n\nimport numpy as np\nimport torch\n\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\n\n\ndef prepare_loss_inputs(ys_pad, hlens, blank_id=0, ignore_id=-1):\n    \"\"\"Prepare tensors for transducer loss computation.\n\n    Args:\n        ys_pad (torch.Tensor): batch of padded target sequences (B, Lmax)\n        hlens (torch.Tensor): batch of hidden sequence lengthts (B)\n                              or batch of masks (B, 1, Tmax)\n        blank_id (int): index of blank label\n        ignore_id (int): index of initial padding\n\n    Returns:\n        ys_in_pad (torch.Tensor): batch of padded target sequences + blank (B, Lmax + 1)\n        target (torch.Tensor): batch of padded target sequences (B, Lmax)\n        pred_len (torch.Tensor): batch of hidden sequence lengths (B)\n        target_len (torch.Tensor): batch of output sequence lengths (B)\n\n    \"\"\"\n    device = ys_pad.device\n\n    ys = [y[y != ignore_id] for y in ys_pad]\n    blank = ys[0].new([blank_id])\n\n    ys_in_pad = pad_list([torch.cat([blank, y], dim=0) for y in ys], blank_id)\n    ys_out_pad = pad_list([torch.cat([y, blank], dim=0) for y in ys], ignore_id)\n\n    target = pad_list(ys, blank_id).type(torch.int32).to(device)\n    target_len = torch.IntTensor([y.size(0) for y in ys]).to(device)\n\n    if torch.is_tensor(hlens):\n        if hlens.dim() > 1:\n            hs = [h[h != 0] for h in hlens]\n            hlens = list(map(int, [h.size(0) for h in hs]))\n        else:\n            hlens = list(map(int, hlens))\n\n    pred_len = torch.IntTensor(hlens).to(device)\n\n    return ys_in_pad, ys_out_pad, target, pred_len, target_len\n\n\ndef valid_aux_task_layer_list(aux_layer_ids, enc_num_layers):\n    \"\"\"Check whether input list of auxiliary layer ids is valid.\n\n       Return the valid list sorted with duplicated removed.\n\n    Args:\n        aux_layer_ids (list): Auxiliary layers ids\n        
enc_num_layers (int): Number of encoder layers\n\n    Returns:\n        valid (list): Validated list of layers for auxiliary task\n\n    \"\"\"\n    if (\n        not isinstance(aux_layer_ids, list)\n        or not aux_layer_ids\n        or not all(isinstance(layer, int) for layer in aux_layer_ids)\n    ):\n        raise ValueError(\"--aux-task-layer-list argument takes a list of layer ids.\")\n\n    sorted_list = sorted(aux_layer_ids, key=int, reverse=False)\n    valid = list(filter(lambda x: 0 <= x < enc_num_layers, sorted_list))\n\n    if sorted_list != valid:\n        raise ValueError(\n            \"Provided list of layer ids for auxiliary task is incorrect. \"\n            \"IDs should be between [0, %d]\" % (enc_num_layers - 1)\n        )\n\n    return valid\n\n\ndef is_prefix(x, pref):\n    \"\"\"Check prefix.\n\n    Args:\n        x (list): token id sequence\n        pref (list): token id sequence\n\n    Returns:\n       (boolean): whether pref is a prefix of x.\n\n    \"\"\"\n    if len(pref) >= len(x):\n        return False\n\n    for i in range(len(pref)):\n        if pref[i] != x[i]:\n            return False\n\n    return True\n\n\ndef substract(x, subset):\n    \"\"\"Remove elements of subset if corresponding token id sequence exist in x.\n\n    Args:\n        x (list): set of hypotheses\n        subset (list): subset of hypotheses\n\n    Returns:\n       final (list): new set\n\n    \"\"\"\n    final = []\n\n    for x_ in x:\n        if any(x_.yseq == sub.yseq for sub in subset):\n            continue\n        final.append(x_)\n\n    return final\n\n\ndef select_lm_state(lm_states, idx, lm_layers, is_wordlm):\n    \"\"\"Get LM state from batch for given id.\n\n    Args:\n        lm_states (list or dict): batch of LM states\n        idx (int): index to extract state from batch state\n        lm_layers (int): number of LM layers\n        is_wordlm (bool): whether provided LM is a word-LM\n\n    Returns:\n       idx_state (dict): LM state for given 
id\n\n    \"\"\"\n    if is_wordlm:\n        idx_state = lm_states[idx]\n    else:\n        idx_state = {}\n\n        idx_state[\"c\"] = [lm_states[\"c\"][layer][idx] for layer in range(lm_layers)]\n        idx_state[\"h\"] = [lm_states[\"h\"][layer][idx] for layer in range(lm_layers)]\n\n    return idx_state\n\n\ndef create_lm_batch_state(lm_states_list, lm_layers, is_wordlm):\n    \"\"\"Create batch of LM states.\n\n    Args:\n        lm_states (list or dict): list of individual LM states\n        lm_layers (int): number of LM layers\n        is_wordlm (bool): whether provided LM is a word-LM\n\n    Returns:\n       batch_states (list): batch of LM states\n\n    \"\"\"\n    if is_wordlm:\n        batch_states = lm_states_list\n    else:\n        batch_states = {}\n\n        batch_states[\"c\"] = [\n            torch.stack([state[\"c\"][layer] for state in lm_states_list])\n            for layer in range(lm_layers)\n        ]\n        batch_states[\"h\"] = [\n            torch.stack([state[\"h\"][layer] for state in lm_states_list])\n            for layer in range(lm_layers)\n        ]\n\n    return batch_states\n\n\ndef init_lm_state(lm_model):\n    \"\"\"Initialize LM state.\n\n    Args:\n        lm_model (torch.nn.Module): LM module\n\n    Returns:\n        lm_state (dict): initial LM state\n\n    \"\"\"\n    lm_layers = len(lm_model.rnn)\n    lm_units_typ = lm_model.typ\n    lm_units = lm_model.n_units\n\n    p = next(lm_model.parameters())\n\n    h = [\n        torch.zeros(lm_units).to(device=p.device, dtype=p.dtype)\n        for _ in range(lm_layers)\n    ]\n\n    lm_state = {\"h\": h}\n\n    if lm_units_typ == \"lstm\":\n        lm_state[\"c\"] = [\n            torch.zeros(lm_units).to(device=p.device, dtype=p.dtype)\n            for _ in range(lm_layers)\n        ]\n\n    return lm_state\n\n\ndef recombine_hyps(hyps, mmi_weight):\n    \"\"\"Recombine hypotheses with equivalent output sequence.\n\n    Args:\n        hyps (list): list of hypotheses\n\n    
Returns:\n       final (list): list of recombined hypotheses\n\n    \"\"\"\n    final = []\n\n    for hyp in hyps:\n        seq_final = [f.yseq for f in final if f.yseq]\n\n        if hyp.yseq in seq_final:\n            seq_pos = seq_final.index(hyp.yseq)\n\n            # for the same u, t, the MMI score should be the same\n            # (compare the absolute difference, not the signed one).\n            assert abs(final[seq_pos].mmi_tot_score - hyp.mmi_tot_score) < 1e-5\n            mmi_score = hyp.mmi_tot_score\n\n            # the MMI score should not be combined: it is independent of paths\n            final[seq_pos].score = np.logaddexp(final[seq_pos].score - mmi_score * mmi_weight, hyp.score - mmi_score * mmi_weight)\n            final[seq_pos].score += mmi_weight * mmi_score\n        else:\n            final.append(hyp)\n\n    return final # prev: hyps\n\n\ndef pad_sequence(seqlist, pad_token):\n    \"\"\"Left pad list of token id sequences.\n\n    Args:\n        seqlist (list): list of token id sequences\n        pad_token (int): padding token id\n\n    Returns:\n        final (list): list of padded token id sequences\n\n    \"\"\"\n    maxlen = max(len(x) for x in seqlist)\n\n    final = [([pad_token] * (maxlen - len(x))) + x for x in seqlist]\n\n    return final\n\n\ndef check_state(state, max_len, pad_token):\n    \"\"\"Check state and left pad or trim if necessary.\n\n    Args:\n        state (list): list of L decoder states (in_len, dec_dim)\n        max_len (int): maximum length authorized\n        pad_token (int): padding token id\n\n    Returns:\n        final (list): list of L padded decoder states (1, max_len, dec_dim)\n\n    \"\"\"\n    if state is None or max_len < 1 or state[0].size(1) == max_len:\n        return state\n\n    curr_len = state[0].size(1)\n\n    if curr_len > max_len:\n        trim_val = int(state[0].size(1) - max_len)\n\n        for i, s in enumerate(state):\n            state[i] = s[:, trim_val:, :]\n    else:\n        layers = len(state)\n        ddim = state[0].size(2)\n\n        final_dims = (1, 
max_len, ddim)\n        final = [state[0].data.new(*final_dims).fill_(pad_token) for _ in range(layers)]\n\n        for i, s in enumerate(state):\n            final[i][:, (max_len - s.size(1)) : max_len, :] = s\n\n        return final\n\n    return state\n\n\ndef check_batch_state(state, max_len, pad_token):\n    \"\"\"Check batch of states and left pad or trim if necessary.\n\n    Args:\n        state (list): list of L decoder states (B, ?, dec_dim)\n        max_len (int): maximum length authorized\n        pad_token (int): padding token id\n\n    Returns:\n        final (list): list of L decoder states (B, pred_len, dec_dim)\n\n    \"\"\"\n    final_dims = (len(state), max_len, state[0].size(1))\n    final = state[0].data.new(*final_dims).fill_(pad_token)\n\n    for i, s in enumerate(state):\n        curr_len = s.size(0)\n\n        if curr_len < max_len:\n            final[i, (max_len - curr_len) : max_len, :] = s\n        else:\n            final[i, :, :] = s[(curr_len - max_len) :, :]\n\n    return final\n\n\ndef custom_torch_load(model_path, model, training=True):\n    \"\"\"Load transducer model modules and parameters with training-only ones removed.\n\n    Args:\n        model_path (str): Model path\n        model (torch.nn.Module): The model with pretrained modules\n\n    \"\"\"\n    if \"snapshot\" in os.path.basename(model_path):\n        model_state_dict = torch.load(\n            model_path, map_location=lambda storage, loc: storage\n        )[\"model\"]\n    else:\n        model_state_dict = torch.load(\n            model_path, map_location=lambda storage, loc: storage\n        )\n\n    if not training:\n        model_state_dict = {\n            k: v for k, v in model_state_dict.items() # if not k.startswith(\"aux\")\n        }\n\n    model.load_state_dict(model_state_dict)\n\n    del model_state_dict\n"
  },
  {
    "path": "nets/pytorch_backend/transducer/vgg2l.py",
    "content": "\"\"\"VGG2L module definition for transformer encoder.\"\"\"\n\nfrom typing import Tuple\nfrom typing import Union\n\nimport torch\n\n\nclass VGG2L(torch.nn.Module):\n    \"\"\"VGG2L module for custom encoder.\n\n    Args:\n        idim: Dimension of inputs\n        odim: Dimension of outputs\n        pos_enc: Positional encoding class\n\n    \"\"\"\n\n    def __init__(self, idim: int, odim: int, pos_enc: torch.nn.Module = None):\n        \"\"\"Construct a VGG2L object.\"\"\"\n        super().__init__()\n\n        self.vgg2l = torch.nn.Sequential(\n            torch.nn.Conv2d(1, 64, 3, stride=1, padding=1),\n            torch.nn.ReLU(),\n            torch.nn.Conv2d(64, 64, 3, stride=1, padding=1),\n            torch.nn.ReLU(),\n            torch.nn.MaxPool2d((3, 2)),\n            torch.nn.Conv2d(64, 128, 3, stride=1, padding=1),\n            torch.nn.ReLU(),\n            torch.nn.Conv2d(128, 128, 3, stride=1, padding=1),\n            torch.nn.ReLU(),\n            torch.nn.MaxPool2d((2, 2)),\n        )\n\n        if pos_enc is not None:\n            self.output = torch.nn.Sequential(\n                torch.nn.Linear(128 * ((idim // 2) // 2), odim), pos_enc\n            )\n        else:\n            self.output = torch.nn.Linear(128 * ((idim // 2) // 2), odim)\n\n    def forward(\n        self, x: torch.Tensor, x_mask: torch.Tensor\n    ) -> Union[\n        Tuple[torch.Tensor, torch.Tensor],\n        Tuple[Tuple[torch.Tensor, torch.Tensor], torch.Tensor],\n    ]:\n        \"\"\"VGG2L forward for x.\n\n        Args:\n            x: Input tensor (B, T, idim)\n            x_mask: Input mask (B, 1, T)\n\n        Returns:\n            x: Output tensor (B, sub(T), odim)\n                   or ((B, sub(T), odim), (B, sub(T), att_dim))\n            x_mask: Output mask (B, 1, sub(T))\n\n        \"\"\"\n        x = x.unsqueeze(1)\n        x = self.vgg2l(x)\n        b, c, t, f = x.size()\n\n        x = self.output(x.transpose(1, 2).contiguous().view(b, t, c * 
f))\n        if x_mask is not None:\n            x_mask = self.create_new_mask(x_mask)\n\n        return x, x_mask\n\n    def create_new_mask(self, x_mask: torch.Tensor) -> torch.Tensor:\n        \"\"\"Create a subsampled version of x_mask.\n\n        Args:\n            x_mask: Input mask (B, 1, T)\n\n        Returns:\n            x_mask: Output mask (B, 1, sub(T))\n\n        \"\"\"\n        x_t1 = x_mask.size(2) - (x_mask.size(2) % 3)\n        x_mask = x_mask[:, :, :x_t1][:, :, ::3]\n\n        x_t2 = x_mask.size(2) - (x_mask.size(2) % 2)\n        x_mask = x_mask[:, :, :x_t2][:, :, ::2]\n\n        return x_mask\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/add_sos_eos.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Utility functions for Transformer.\"\"\"\n\nimport torch\n\n\ndef add_sos_eos(ys_pad, sos, eos, ignore_id):\n    \"\"\"Add <sos> and <eos> labels.\n\n    :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n    :param int sos: index of <sos>\n    :param int eos: index of <eos>\n    :param int ignore_id: index of padding\n    :return: decoder input with <sos> prepended, padded with <eos> (B, Lmax + 1)\n    :rtype: torch.Tensor\n    :return: decoder target with <eos> appended, padded with ignore_id\n        (B, Lmax + 1)\n    :rtype: torch.Tensor\n    \"\"\"\n    from espnet.nets.pytorch_backend.nets_utils import pad_list\n\n    _sos = ys_pad.new([sos])\n    _eos = ys_pad.new([eos])\n    ys = [y[y != ignore_id] for y in ys_pad]  # parse padded ys\n    ys_in = [torch.cat([_sos, y], dim=0) for y in ys]\n    ys_out = [torch.cat([y, _eos], dim=0) for y in ys]\n    return pad_list(ys_in, eos), pad_list(ys_out, ignore_id)\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/argument.py",
    "content": "# Copyright 2020 Hirofumi Inaguma\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Transformer common arguments.\"\"\"\n\n\nfrom distutils.util import strtobool\n\n\ndef add_arguments_transformer_common(group):\n    \"\"\"Add Transformer common arguments.\"\"\"\n    group.add_argument(\n        \"--transformer-init\",\n        type=str,\n        default=\"pytorch\",\n        choices=[\n            \"pytorch\",\n            \"xavier_uniform\",\n            \"xavier_normal\",\n            \"kaiming_uniform\",\n            \"kaiming_normal\",\n        ],\n        help=\"how to initialize transformer parameters\",\n    )\n    group.add_argument(\n        \"--transformer-input-layer\",\n        type=str,\n        default=\"conv2d\",\n        choices=[\"conv2d\", \"linear\", \"embed\"],\n        help=\"transformer input layer type\",\n    )\n    group.add_argument(\n        \"--transformer-attn-dropout-rate\",\n        default=None,\n        type=float,\n        help=\"dropout in transformer attention. 
use --dropout-rate if None is set\",\n    )\n    group.add_argument(\n        \"--transformer-lr\",\n        default=10.0,\n        type=float,\n        help=\"Initial value of learning rate\",\n    )\n    group.add_argument(\n        \"--transformer-warmup-steps\",\n        default=25000,\n        type=int,\n        help=\"optimizer warmup steps\",\n    )\n    group.add_argument(\n        \"--transformer-length-normalized-loss\",\n        default=True,\n        type=strtobool,\n        help=\"normalize loss by length\",\n    )\n    group.add_argument(\n        \"--transformer-encoder-selfattn-layer-type\",\n        type=str,\n        default=\"selfattn\",\n        choices=[\n            \"selfattn\",\n            \"rel_selfattn\",\n            \"lightconv\",\n            \"lightconv2d\",\n            \"dynamicconv\",\n            \"dynamicconv2d\",\n            \"light-dynamicconv2d\",\n        ],\n        help=\"transformer encoder self-attention layer type\",\n    )\n    group.add_argument(\n        \"--transformer-decoder-selfattn-layer-type\",\n        type=str,\n        default=\"selfattn\",\n        choices=[\n            \"selfattn\",\n            \"lightconv\",\n            \"lightconv2d\",\n            \"dynamicconv\",\n            \"dynamicconv2d\",\n            \"light-dynamicconv2d\",\n        ],\n        help=\"transformer decoder self-attention layer type\",\n    )\n    # Lightweight/Dynamic convolution related parameters.\n    # See https://arxiv.org/abs/1912.11793v2\n    # and https://arxiv.org/abs/1901.10430 for detail of the method.\n    # Configurations used in the first paper are in\n    # egs/{csj, librispeech}/asr1/conf/tuning/ld_conv/\n    group.add_argument(\n        \"--wshare\",\n        default=4,\n        type=int,\n        help=\"Number of parameter sharing for lightweight convolution\",\n    )\n    group.add_argument(\n        \"--ldconv-encoder-kernel-length\",\n        default=\"21_23_25_27_29_31_33_35_37_39_41_43\",\n        
type=str,\n        help=\"kernel size for lightweight/dynamic convolution: \"\n        'Encoder side. For example, \"21_23_25\" means kernel length 21 for '\n        \"First layer, 23 for Second layer and so on.\",\n    )\n    group.add_argument(\n        \"--ldconv-decoder-kernel-length\",\n        default=\"11_13_15_17_19_21\",\n        type=str,\n        help=\"kernel size for lightweight/dynamic convolution: \"\n        'Decoder side. For example, \"21_23_25\" means kernel length 21 for '\n        \"First layer, 23 for Second layer and so on.\",\n    )\n    group.add_argument(\n        \"--ldconv-usebias\",\n        type=strtobool,\n        default=False,\n        help=\"use bias term in lightweight/dynamic convolution\",\n    )\n    group.add_argument(\n        \"--dropout-rate\",\n        default=0.0,\n        type=float,\n        help=\"Dropout rate for the encoder\",\n    )\n    # Encoder\n    group.add_argument(\n        \"--elayers\",\n        default=4,\n        type=int,\n        help=\"Number of encoder layers (for shared recognition part \"\n        \"in multi-speaker asr mode)\",\n    )\n    group.add_argument(\n        \"--eunits\",\n        \"-u\",\n        default=300,\n        type=int,\n        help=\"Number of encoder hidden units\",\n    )\n    # Attention\n    group.add_argument(\n        \"--adim\",\n        default=320,\n        type=int,\n        help=\"Number of attention transformation dimensions\",\n    )\n    group.add_argument(\n        \"--aheads\",\n        default=4,\n        type=int,\n        help=\"Number of heads for multi head attention\",\n    )\n    # Decoder\n    group.add_argument(\n        \"--dlayers\", default=1, type=int, help=\"Number of decoder layers\"\n    )\n    group.add_argument(\n        \"--dunits\", default=320, type=int, help=\"Number of decoder hidden units\"\n    )\n\n    # MBR \n    group.add_argument(\n        \"--aux-mbr\",\n        type=strtobool,\n        nargs=\"?\",\n        default=False,\n        
help=\"Whether to use mbr as auxiliary task.\",\n    )\n    group.add_argument(\n        \"--aux-mbr-weight\",\n        default=1.0,\n        type=float,\n        help=\"Weight of auxiliary mbr loss\",\n    )\n    group.add_argument(\n        \"--aux-mbr-beam\",\n        default=2,\n        type=int,\n        help=\"Number of hypothesis for MBR loss computation\",\n    )\n    return group\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/attention.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Multi-Head Attention layer definition.\"\"\"\n\nimport math\n\nimport numpy\nimport torch\nfrom torch import nn\n\n\nclass MultiHeadedAttention(nn.Module):\n    \"\"\"Multi-Head Attention layer.\n\n    Args:\n        n_head (int): The number of heads.\n        n_feat (int): The number of features.\n        dropout_rate (float): Dropout rate.\n\n    \"\"\"\n\n    def __init__(self, n_head, n_feat, dropout_rate):\n        \"\"\"Construct an MultiHeadedAttention object.\"\"\"\n        super(MultiHeadedAttention, self).__init__()\n        assert n_feat % n_head == 0\n        # We assume d_v always equals d_k\n        self.d_k = n_feat // n_head\n        self.h = n_head\n        self.linear_q = nn.Linear(n_feat, n_feat)\n        self.linear_k = nn.Linear(n_feat, n_feat)\n        self.linear_v = nn.Linear(n_feat, n_feat)\n        self.linear_out = nn.Linear(n_feat, n_feat)\n        self.attn = None\n        self.dropout = nn.Dropout(p=dropout_rate)\n\n    def forward_qkv(self, query, key, value):\n        \"\"\"Transform query, key and value.\n\n        Args:\n            query (torch.Tensor): Query tensor (#batch, time1, size).\n            key (torch.Tensor): Key tensor (#batch, time2, size).\n            value (torch.Tensor): Value tensor (#batch, time2, size).\n\n        Returns:\n            torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).\n            torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).\n            torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).\n\n        \"\"\"\n        n_batch = query.size(0)\n        q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)\n        k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)\n        v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)\n        q = q.transpose(1, 
2)  # (batch, head, time1, d_k)\n        k = k.transpose(1, 2)  # (batch, head, time2, d_k)\n        v = v.transpose(1, 2)  # (batch, head, time2, d_k)\n\n        return q, k, v\n\n    def forward_attention(self, value, scores, mask):\n        \"\"\"Compute attention context vector.\n\n        Args:\n            value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).\n            scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).\n            mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).\n\n        Returns:\n            torch.Tensor: Transformed value (#batch, time1, d_model)\n                weighted by the attention score (#batch, time1, time2).\n\n        \"\"\"\n        n_batch = value.size(0)\n        if mask is not None:\n            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, *, time2)\n            min_value = float(\n                numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min\n            )\n            scores = scores.masked_fill(mask, min_value)\n            self.attn = torch.softmax(scores, dim=-1).masked_fill(\n                mask, 0.0\n            )  # (batch, head, time1, time2)\n        else:\n            self.attn = torch.softmax(scores, dim=-1)  # (batch, head, time1, time2)\n\n        p_attn = self.dropout(self.attn)\n        x = torch.matmul(p_attn, value)  # (batch, head, time1, d_k)\n        x = (\n            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)\n        )  # (batch, time1, d_model)\n\n        return self.linear_out(x)  # (batch, time1, d_model)\n\n    def forward(self, query, key, value, mask):\n        \"\"\"Compute scaled dot product attention.\n\n        Args:\n            query (torch.Tensor): Query tensor (#batch, time1, size).\n            key (torch.Tensor): Key tensor (#batch, time2, size).\n            value (torch.Tensor): Value tensor (#batch, time2, size).\n            mask (torch.Tensor): Mask tensor (#batch, 1, 
time2) or\n                (#batch, time1, time2).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time1, d_model).\n\n        \"\"\"\n        q, k, v = self.forward_qkv(query, key, value)\n        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)\n        return self.forward_attention(v, scores, mask)\n\n\nclass LegacyRelPositionMultiHeadedAttention(MultiHeadedAttention):\n    \"\"\"Multi-Head Attention layer with relative position encoding (old version).\n\n    Details can be found in https://github.com/espnet/espnet/pull/2816.\n\n    Paper: https://arxiv.org/abs/1901.02860\n\n    Args:\n        n_head (int): The number of heads.\n        n_feat (int): The number of features.\n        dropout_rate (float): Dropout rate.\n        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.\n\n    \"\"\"\n\n    def __init__(self, n_head, n_feat, dropout_rate, zero_triu=False):\n        \"\"\"Construct an RelPositionMultiHeadedAttention object.\"\"\"\n        super().__init__(n_head, n_feat, dropout_rate)\n        self.zero_triu = zero_triu\n        # linear transformation for positional encoding\n        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)\n        # these two learnable bias are used in matrix c and matrix d\n        # as described in https://arxiv.org/abs/1901.02860 Section 3.3\n        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))\n        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))\n        torch.nn.init.xavier_uniform_(self.pos_bias_u)\n        torch.nn.init.xavier_uniform_(self.pos_bias_v)\n\n    def rel_shift(self, x):\n        \"\"\"Compute relative positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, head, time1, time2).\n\n        Returns:\n            torch.Tensor: Output tensor.\n\n        \"\"\"\n        zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)\n        x_padded = 
torch.cat([zero_pad, x], dim=-1)\n\n        x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))\n        x = x_padded[:, :, 1:].view_as(x)\n\n        if self.zero_triu:\n            ones = torch.ones((x.size(2), x.size(3)), device=x.device)\n            x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]\n\n        return x\n\n    def forward(self, query, key, value, pos_emb, mask):\n        \"\"\"Compute 'Scaled Dot Product Attention' with rel. positional encoding.\n\n        Args:\n            query (torch.Tensor): Query tensor (#batch, time1, size).\n            key (torch.Tensor): Key tensor (#batch, time2, size).\n            value (torch.Tensor): Value tensor (#batch, time2, size).\n            pos_emb (torch.Tensor): Positional embedding tensor (#batch, time1, size).\n            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or\n                (#batch, time1, time2).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time1, d_model).\n\n        \"\"\"\n        q, k, v = self.forward_qkv(query, key, value)\n        q = q.transpose(1, 2)  # (batch, time1, head, d_k)\n\n        n_batch_pos = pos_emb.size(0)\n        p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)\n        p = p.transpose(1, 2)  # (batch, head, time1, d_k)\n\n        # (batch, head, time1, d_k)\n        q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)\n        # (batch, head, time1, d_k)\n        q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)\n\n        # compute attention score\n        # first compute matrix a and matrix c\n        # as described in https://arxiv.org/abs/1901.02860 Section 3.3\n        # (batch, head, time1, time2)\n        matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))\n\n        # compute matrix b and matrix d\n        # (batch, head, time1, time1)\n        matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))\n        matrix_bd = self.rel_shift(matrix_bd)\n\n        scores = 
(matrix_ac + matrix_bd) / math.sqrt(\n            self.d_k\n        )  # (batch, head, time1, time2)\n\n        return self.forward_attention(v, scores, mask)\n\n\nclass RelPositionMultiHeadedAttention(MultiHeadedAttention):\n    \"\"\"Multi-Head Attention layer with relative position encoding (new implementation).\n\n    Details can be found in https://github.com/espnet/espnet/pull/2816.\n\n    Paper: https://arxiv.org/abs/1901.02860\n\n    Args:\n        n_head (int): The number of heads.\n        n_feat (int): The number of features.\n        dropout_rate (float): Dropout rate.\n        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.\n\n    \"\"\"\n\n    def __init__(self, n_head, n_feat, dropout_rate, zero_triu=False):\n        \"\"\"Construct an RelPositionMultiHeadedAttention object.\"\"\"\n        super().__init__(n_head, n_feat, dropout_rate)\n        self.zero_triu = zero_triu\n        # linear transformation for positional encoding\n        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)\n        # these two learnable bias are used in matrix c and matrix d\n        # as described in https://arxiv.org/abs/1901.02860 Section 3.3\n        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))\n        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))\n        torch.nn.init.xavier_uniform_(self.pos_bias_u)\n        torch.nn.init.xavier_uniform_(self.pos_bias_v)\n\n    def rel_shift(self, x):\n        \"\"\"Compute relative positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, head, time1, 2*time1-1).\n            time1 means the length of query vector.\n\n        Returns:\n            torch.Tensor: Output tensor.\n\n        \"\"\"\n        zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)\n        x_padded = torch.cat([zero_pad, x], dim=-1)\n\n        x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))\n        x = 
x_padded[:, :, 1:].view_as(x)[\n            :, :, :, : x.size(-1) // 2 + 1\n        ]  # only keep the positions from 0 to time2\n\n        if self.zero_triu:\n            ones = torch.ones((x.size(2), x.size(3)), device=x.device)\n            x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]\n\n        return x\n\n    def forward(self, query, key, value, pos_emb, mask):\n        \"\"\"Compute 'Scaled Dot Product Attention' with rel. positional encoding.\n\n        Args:\n            query (torch.Tensor): Query tensor (#batch, time1, size).\n            key (torch.Tensor): Key tensor (#batch, time2, size).\n            value (torch.Tensor): Value tensor (#batch, time2, size).\n            pos_emb (torch.Tensor): Positional embedding tensor\n                (#batch, 2*time1-1, size).\n            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or\n                (#batch, time1, time2).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time1, d_model).\n\n        \"\"\"\n        q, k, v = self.forward_qkv(query, key, value)\n        q = q.transpose(1, 2)  # (batch, time1, head, d_k)\n\n        n_batch_pos = pos_emb.size(0)\n        p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)\n        p = p.transpose(1, 2)  # (batch, head, 2*time1-1, d_k)\n\n        # (batch, head, time1, d_k)\n        q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)\n        # (batch, head, time1, d_k)\n        q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)\n\n        # compute attention score\n        # first compute matrix a and matrix c\n        # as described in https://arxiv.org/abs/1901.02860 Section 3.3\n        # (batch, head, time1, time2)\n        matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))\n\n        # compute matrix b and matrix d\n        # (batch, head, time1, 2*time1-1)\n        matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))\n        matrix_bd = self.rel_shift(matrix_bd)\n\n     
   scores = (matrix_ac + matrix_bd) / math.sqrt(\n            self.d_k\n        )  # (batch, head, time1, time2)\n\n        return self.forward_attention(v, scores, mask)\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/contextual_block_encoder_layer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Emiru Tsunoo\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Encoder self-attention layer definition.\"\"\"\n\nimport torch\n\nfrom torch import nn\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\nclass ContextualBlockEncoderLayer(nn.Module):\n    \"\"\"Contextual Block Encoder layer module.\n\n    Args:\n        size (int): Input dimension.\n        self_attn (torch.nn.Module): Self-attention module instance.\n            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance\n            can be used as the argument.\n        feed_forward (torch.nn.Module): Feed-forward module instance.\n            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance\n            can be used as the argument.\n        dropout_rate (float): Dropout rate.\n        total_layer_num (int): Total number of layers\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n\n    \"\"\"\n\n    def __init__(\n        self,\n        size,\n        self_attn,\n        feed_forward,\n        dropout_rate,\n        total_layer_num,\n        normalize_before=True,\n        concat_after=False,\n    ):\n        \"\"\"Construct an EncoderLayer object.\"\"\"\n        super(ContextualBlockEncoderLayer, self).__init__()\n        self.self_attn = self_attn\n        self.feed_forward = feed_forward\n        self.norm1 = LayerNorm(size)\n        self.norm2 = LayerNorm(size)\n        self.dropout = nn.Dropout(dropout_rate)\n        self.size = size\n        self.normalize_before = normalize_before\n        self.concat_after = concat_after\n        self.total_layer_num = total_layer_num\n        if self.concat_after:\n            self.concat_linear = nn.Linear(size + size, size)\n\n    def forward(self, x, mask, past_ctx=None, next_ctx=None, layer_idx=0, cache=None):\n        \"\"\"Compute encoded features.\n\n        Args:\n            x_input (torch.Tensor): Input tensor (#batch, time, size).\n            mask (torch.Tensor): Mask tensor for the input (#batch, time).\n            past_ctx (torch.Tensor): Previous contexutal vector\n            next_ctx (torch.Tensor): Next contexutal vector\n            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time, size).\n            torch.Tensor: Mask tensor (#batch, time).\n            cur_ctx (torch.Tensor): Current contexutal vector\n            next_ctx (torch.Tensor): Next contexutal vector\n            layer_idx (int): layer index number\n\n        \"\"\"\n        nbatch = x.size(0)\n        nblock = x.size(1)\n\n        if past_ctx is not None:\n            if next_ctx is None:\n                # store all context vectors in one tensor\n                next_ctx = past_ctx.new_zeros(\n                    nbatch, nblock, self.total_layer_num, x.size(-1)\n                )\n            
else:\n                x[:, :, 0] = past_ctx[:, :, layer_idx]\n\n        # reshape ( nbatch, nblock, block_size + 2, dim )\n        #     -> ( nbatch * nblock, block_size + 2, dim )\n        x = x.view(-1, x.size(-2), x.size(-1))\n        if mask is not None:\n            mask = mask.view(-1, mask.size(-2), mask.size(-1))\n\n        residual = x\n        if self.normalize_before:\n            x = self.norm1(x)\n\n        if cache is None:\n            x_q = x\n        else:\n            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)\n            x_q = x[:, -1:, :]\n            residual = residual[:, -1:, :]\n            mask = None if mask is None else mask[:, -1:, :]\n\n        if self.concat_after:\n            x_concat = torch.cat((x, self.self_attn(x_q, x, x, mask)), dim=-1)\n            x = residual + self.concat_linear(x_concat)\n        else:\n            x = residual + self.dropout(self.self_attn(x_q, x, x, mask))\n        if not self.normalize_before:\n            x = self.norm1(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.norm2(x)\n        x = residual + self.dropout(self.feed_forward(x))\n        if not self.normalize_before:\n            x = self.norm2(x)\n\n        if cache is not None:\n            x = torch.cat([cache, x], dim=1)\n\n        layer_idx += 1\n        # reshape ( nbatch * nblock, block_size + 2, dim )\n        #       -> ( nbatch, nblock, block_size + 2, dim )\n        x = x.view(nbatch, -1, x.size(-2), x.size(-1)).squeeze(1)\n        if mask is not None:\n            mask = mask.view(nbatch, -1, mask.size(-2), mask.size(-1)).squeeze(1)\n\n        if next_ctx is not None and layer_idx < self.total_layer_num:\n            next_ctx[:, 0, layer_idx, :] = x[:, 0, -1, :]\n            next_ctx[:, 1:, layer_idx, :] = x[:, 0:-1, -1, :]\n\n        return x, mask, next_ctx, next_ctx, layer_idx\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/decoder.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Decoder definition.\"\"\"\n\nimport logging\n\nfrom typing import Any\nfrom typing import List\nfrom typing import Tuple\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.nets_utils import rename_state_dict\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.decoder_layer import DecoderLayer\nfrom espnet.nets.pytorch_backend.transformer.dynamic_conv import DynamicConvolution\nfrom espnet.nets.pytorch_backend.transformer.dynamic_conv2d import DynamicConvolution2D\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.pytorch_backend.transformer.lightconv import LightweightConvolution\nfrom espnet.nets.pytorch_backend.transformer.lightconv2d import LightweightConvolution2D\nfrom espnet.nets.pytorch_backend.transformer.mask import subsequent_mask\nfrom espnet.nets.pytorch_backend.transformer.positionwise_feed_forward import (\n    PositionwiseFeedForward,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.repeat import repeat\nfrom espnet.nets.scorer_interface import BatchScorerInterface\n\n\ndef _pre_hook(\n    state_dict,\n    prefix,\n    local_metadata,\n    strict,\n    missing_keys,\n    unexpected_keys,\n    error_msgs,\n):\n    # https://github.com/espnet/espnet/commit/3d422f6de8d4f03673b89e1caef698745ec749ea#diff-bffb1396f038b317b2b64dd96e6d3563\n    rename_state_dict(prefix + \"output_norm.\", prefix + \"after_norm.\", state_dict)\n\n\nclass Decoder(BatchScorerInterface, torch.nn.Module):\n    \"\"\"Transformer decoder module.\n\n    Args:\n        odim (int): Output dimension.\n        selfattention_layer_type (str): Self-attention layer type.\n        attention_dim (int): Dimension 
of attention.\n        attention_heads (int): The number of heads of multi head attention.\n        conv_wshare (int): The number of kernel of convolution. Only used in\n            selfattention_layer_type == \"lightconv*\" or \"dynamicconv*\".\n        conv_kernel_length (Union[int, str]): Kernel size str of convolution\n            (e.g. 71_71_71_71_71_71). Only used in selfattention_layer_type\n            == \"lightconv*\" or \"dynamicconv*\".\n        conv_usebias (bool): Whether to use bias in convolution. Only used in\n            selfattention_layer_type == \"lightconv*\" or \"dynamicconv*\".\n        linear_units (int): The number of units of position-wise feed forward.\n        num_blocks (int): The number of decoder blocks.\n        dropout_rate (float): Dropout rate.\n        positional_dropout_rate (float): Dropout rate after adding positional encoding.\n        self_attention_dropout_rate (float): Dropout rate in self-attention.\n        src_attention_dropout_rate (float): Dropout rate in source-attention.\n        input_layer (Union[str, torch.nn.Module]): Input layer type.\n        use_output_layer (bool): Whether to use output layer.\n        pos_enc_class (torch.nn.Module): Positional encoding module class.\n            `PositionalEncoding` or `ScaledPositionalEncoding`\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n\n    \"\"\"\n\n    def __init__(\n        self,\n        odim,\n        selfattention_layer_type=\"selfattn\",\n        attention_dim=256,\n        attention_heads=4,\n        conv_wshare=4,\n        conv_kernel_length=11,\n        conv_usebias=False,\n        linear_units=2048,\n        num_blocks=6,\n        dropout_rate=0.1,\n        positional_dropout_rate=0.1,\n        self_attention_dropout_rate=0.0,\n        src_attention_dropout_rate=0.0,\n        input_layer=\"embed\",\n        use_output_layer=True,\n        pos_enc_class=PositionalEncoding,\n        normalize_before=True,\n        concat_after=False,\n    ):\n        \"\"\"Construct an Decoder object.\"\"\"\n        torch.nn.Module.__init__(self)\n        self._register_load_state_dict_pre_hook(_pre_hook)\n        if input_layer == \"embed\":\n            self.embed = torch.nn.Sequential(\n                torch.nn.Embedding(odim, attention_dim),\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif input_layer == \"linear\":\n            self.embed = torch.nn.Sequential(\n                torch.nn.Linear(odim, attention_dim),\n                torch.nn.LayerNorm(attention_dim),\n                torch.nn.Dropout(dropout_rate),\n                torch.nn.ReLU(),\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif isinstance(input_layer, torch.nn.Module):\n            self.embed = torch.nn.Sequential(\n                input_layer, pos_enc_class(attention_dim, positional_dropout_rate)\n            )\n        else:\n            raise NotImplementedError(\"only `embed` or torch.nn.Module is supported.\")\n        self.normalize_before = normalize_before\n\n        # self-attention module definition\n        if selfattention_layer_type == \"selfattn\":\n            logging.info(\"decoder self-attention layer type = self-attention\")\n            decoder_selfattn_layer = MultiHeadedAttention\n           
 decoder_selfattn_layer_args = [\n                (\n                    attention_heads,\n                    attention_dim,\n                    self_attention_dropout_rate,\n                )\n            ] * num_blocks\n        elif selfattention_layer_type == \"lightconv\":\n            logging.info(\"decoder self-attention layer type = lightweight convolution\")\n            decoder_selfattn_layer = LightweightConvolution\n            decoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    self_attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    True,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        elif selfattention_layer_type == \"lightconv2d\":\n            logging.info(\n                \"decoder self-attention layer \"\n                \"type = lightweight convolution 2-dimentional\"\n            )\n            decoder_selfattn_layer = LightweightConvolution2D\n            decoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    self_attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    True,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        elif selfattention_layer_type == \"dynamicconv\":\n            logging.info(\"decoder self-attention layer type = dynamic convolution\")\n            decoder_selfattn_layer = DynamicConvolution\n            decoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    self_attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    True,\n                    
conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        elif selfattention_layer_type == \"dynamicconv2d\":\n            logging.info(\n                \"decoder self-attention layer type = dynamic convolution 2-dimentional\"\n            )\n            decoder_selfattn_layer = DynamicConvolution2D\n            decoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    self_attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    True,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n\n        self.decoders = repeat(\n            num_blocks,\n            lambda lnum: DecoderLayer(\n                attention_dim,\n                decoder_selfattn_layer(*decoder_selfattn_layer_args[lnum]),\n                MultiHeadedAttention(\n                    attention_heads, attention_dim, src_attention_dropout_rate\n                ),\n                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),\n                dropout_rate,\n                normalize_before,\n                concat_after,\n            ),\n        )\n        self.selfattention_layer_type = selfattention_layer_type\n        if self.normalize_before:\n            self.after_norm = LayerNorm(attention_dim)\n        if use_output_layer:\n            self.output_layer = torch.nn.Linear(attention_dim, odim)\n        else:\n            self.output_layer = None\n\n    def forward(self, tgt, tgt_mask, memory, memory_mask):\n        \"\"\"Forward decoder.\n\n        Args:\n            tgt (torch.Tensor): Input token ids, int64 (#batch, maxlen_out) if\n                input_layer == \"embed\". 
In the other case, input tensor\n                (#batch, maxlen_out, odim).\n            tgt_mask (torch.Tensor): Input token mask (#batch, maxlen_out).\n                dtype=torch.uint8 in PyTorch 1.2- and dtype=torch.bool in PyTorch 1.2+\n                (include 1.2).\n            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, feat).\n            memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).\n                dtype=torch.uint8 in PyTorch 1.2- and dtype=torch.bool in PyTorch 1.2+\n                (include 1.2).\n\n        Returns:\n            torch.Tensor: Decoded token score before softmax (#batch, maxlen_out, odim)\n                   if use_output_layer is True. In the other case, final block outputs\n                   (#batch, maxlen_out, attention_dim).\n            torch.Tensor: Score mask before softmax (#batch, maxlen_out).\n\n        \"\"\"\n        x = self.embed(tgt)\n        x, tgt_mask, memory, memory_mask = self.decoders(\n            x, tgt_mask, memory, memory_mask\n        )\n        if self.normalize_before:\n            x = self.after_norm(x)\n        if self.output_layer is not None:\n            x = self.output_layer(x)\n        return x, tgt_mask\n\n    def forward_one_step(self, tgt, tgt_mask, memory, cache=None):\n        \"\"\"Forward one step.\n\n        Args:\n            tgt (torch.Tensor): Input token ids, int64 (#batch, maxlen_out).\n            tgt_mask (torch.Tensor): Input token mask (#batch, maxlen_out).\n                dtype=torch.uint8 in PyTorch 1.2- and dtype=torch.bool in PyTorch 1.2+\n                (include 1.2).\n            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, feat).\n            cache (List[torch.Tensor]): List of cached tensors.\n                Each tensor shape should be (#batch, maxlen_out - 1, size).\n\n        Returns:\n            torch.Tensor: Output tensor for the last position (#batch, odim).\n            List[torch.Tensor]: List of cache 
tensors of each decoder layer.\n\n        \"\"\"\n        x = self.embed(tgt)\n        if cache is None:\n            cache = [None] * len(self.decoders)\n        new_cache = []\n        for c, decoder in zip(cache, self.decoders):\n            x, tgt_mask, memory, memory_mask = decoder(\n                x, tgt_mask, memory, None, cache=c\n            )\n            new_cache.append(x)\n\n        if self.normalize_before:\n            y = self.after_norm(x[:, -1])\n        else:\n            y = x[:, -1]\n        if self.output_layer is not None:\n            y = torch.log_softmax(self.output_layer(y), dim=-1)\n\n        return y, new_cache\n\n    # beam search API (see ScorerInterface)\n    def score(self, ys, state, x):\n        \"\"\"Score.\"\"\"\n        ys_mask = subsequent_mask(len(ys), device=x.device).unsqueeze(0)\n        if self.selfattention_layer_type != \"selfattn\":\n            # TODO(karita): implement cache\n            logging.warning(\n                f\"{self.selfattention_layer_type} does not support cached decoding.\"\n            )\n            state = None\n        logp, state = self.forward_one_step(\n            ys.unsqueeze(0), ys_mask, x.unsqueeze(0), cache=state\n        )\n        return logp.squeeze(0), state\n\n    # batch beam search API (see BatchScorerInterface)\n    def batch_score(\n        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor\n    ) -> Tuple[torch.Tensor, List[Any]]:\n        \"\"\"Score new token batch (required).\n\n        Args:\n            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).\n            states (List[Any]): Scorer states for prefix tokens.\n            xs (torch.Tensor):\n                The encoder feature that generates ys (n_batch, xlen, n_feat).\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        
# merge states\n        n_batch = len(ys)\n        n_layers = len(self.decoders)\n        if states[0] is None:\n            batch_state = None\n        else:\n            # transpose state of [batch, layer] into [layer, batch]\n            batch_state = [\n                torch.stack([states[b][i] for b in range(n_batch)])\n                for i in range(n_layers)\n            ]\n\n        # batch decoding\n        ys_mask = subsequent_mask(ys.size(-1), device=xs.device).unsqueeze(0)\n        logp, states = self.forward_one_step(ys, ys_mask, xs, cache=batch_state)\n\n        # transpose state of [layer, batch] into [batch, layer]\n        state_list = [[states[i][b] for i in range(n_layers)] for b in range(n_batch)]\n        return logp, state_list\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/decoder_layer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Decoder self-attention layer definition.\"\"\"\n\nimport torch\nfrom torch import nn\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\nclass DecoderLayer(nn.Module):\n    \"\"\"Single decoder layer module.\n\n    Args:\n        size (int): Input dimension.\n        self_attn (torch.nn.Module): Self-attention module instance.\n            `MultiHeadedAttention` instance can be used as the argument.\n        src_attn (torch.nn.Module): Self-attention module instance.\n            `MultiHeadedAttention` instance can be used as the argument.\n        feed_forward (torch.nn.Module): Feed-forward module instance.\n            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance\n            can be used as the argument.\n        dropout_rate (float): Dropout rate.\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n\n\n    \"\"\"\n\n    def __init__(\n        self,\n        size,\n        self_attn,\n        src_attn,\n        feed_forward,\n        dropout_rate,\n        normalize_before=True,\n        concat_after=False,\n    ):\n        \"\"\"Construct an DecoderLayer object.\"\"\"\n        super(DecoderLayer, self).__init__()\n        self.size = size\n        self.self_attn = self_attn\n        self.src_attn = src_attn\n        self.feed_forward = feed_forward\n        self.norm1 = LayerNorm(size)\n        self.norm2 = LayerNorm(size)\n        self.norm3 = LayerNorm(size)\n        self.dropout = nn.Dropout(dropout_rate)\n        self.normalize_before = normalize_before\n        self.concat_after = concat_after\n        if self.concat_after:\n            self.concat_linear1 = nn.Linear(size + size, size)\n            self.concat_linear2 = nn.Linear(size + size, size)\n\n    def forward(self, tgt, tgt_mask, memory, memory_mask, cache=None):\n        \"\"\"Compute decoded features.\n\n        Args:\n            tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).\n            tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).\n            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).\n            memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).\n            cache (List[torch.Tensor]): List of cached tensors.\n                Each tensor shape should be (#batch, maxlen_out - 1, size).\n\n        Returns:\n            torch.Tensor: Output tensor(#batch, maxlen_out, size).\n            torch.Tensor: Mask for output tensor (#batch, maxlen_out).\n            torch.Tensor: Encoded memory (#batch, maxlen_in, size).\n            torch.Tensor: Encoded memory mask (#batch, maxlen_in).\n\n        \"\"\"\n        residual = tgt\n        if self.normalize_before:\n            tgt = self.norm1(tgt)\n\n        if cache is None:\n            tgt_q = tgt\n            tgt_q_mask = tgt_mask\n        
else:\n            # compute only the last frame query keeping dim: max_time_out -> 1\n            assert cache.shape == (\n                tgt.shape[0],\n                tgt.shape[1] - 1,\n                self.size,\n            ), f\"{cache.shape} == {(tgt.shape[0], tgt.shape[1] - 1, self.size)}\"\n            tgt_q = tgt[:, -1:, :]\n            residual = residual[:, -1:, :]\n            tgt_q_mask = None\n            if tgt_mask is not None:\n                tgt_q_mask = tgt_mask[:, -1:, :]\n\n        if self.concat_after:\n            tgt_concat = torch.cat(\n                (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1\n            )\n            x = residual + self.concat_linear1(tgt_concat)\n        else:\n            x = residual + self.dropout(self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))\n        if not self.normalize_before:\n            x = self.norm1(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.norm2(x)\n        if self.concat_after:\n            x_concat = torch.cat(\n                (x, self.src_attn(x, memory, memory, memory_mask)), dim=-1\n            )\n            x = residual + self.concat_linear2(x_concat)\n        else:\n            x = residual + self.dropout(self.src_attn(x, memory, memory, memory_mask))\n        if not self.normalize_before:\n            x = self.norm2(x)\n\n        residual = x\n        if self.normalize_before:\n            x = self.norm3(x)\n        x = residual + self.dropout(self.feed_forward(x))\n        if not self.normalize_before:\n            x = self.norm3(x)\n\n        if cache is not None:\n            x = torch.cat([cache, x], dim=1)\n\n        return x, tgt_mask, memory, memory_mask\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/dynamic_conv.py",
    "content": "\"\"\"Dynamic Convolution module.\"\"\"\n\nimport numpy\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\n\n\nMIN_VALUE = float(numpy.finfo(numpy.float32).min)\n\n\nclass DynamicConvolution(nn.Module):\n    \"\"\"Dynamic Convolution layer.\n\n    This implementation is based on\n    https://github.com/pytorch/fairseq/tree/master/fairseq\n\n    Args:\n        wshare (int): the number of kernel of convolution\n        n_feat (int): the number of features\n        dropout_rate (float): dropout_rate\n        kernel_size (int): kernel size (length)\n        use_kernel_mask (bool): Use causal mask or not for convolution kernel\n        use_bias (bool): Use bias term or not.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        wshare,\n        n_feat,\n        dropout_rate,\n        kernel_size,\n        use_kernel_mask=False,\n        use_bias=False,\n    ):\n        \"\"\"Construct Dynamic Convolution layer.\"\"\"\n        super(DynamicConvolution, self).__init__()\n\n        assert n_feat % wshare == 0\n        self.wshare = wshare\n        self.use_kernel_mask = use_kernel_mask\n        self.dropout_rate = dropout_rate\n        self.kernel_size = kernel_size\n        self.attn = None\n\n        # linear -> GLU -- -> lightconv -> linear\n        #               \\        /\n        #                 Linear\n        self.linear1 = nn.Linear(n_feat, n_feat * 2)\n        self.linear2 = nn.Linear(n_feat, n_feat)\n        self.linear_weight = nn.Linear(n_feat, self.wshare * 1 * kernel_size)\n        nn.init.xavier_uniform(self.linear_weight.weight)\n        self.act = nn.GLU()\n\n        # dynamic conv related\n        self.use_bias = use_bias\n        if self.use_bias:\n            self.bias = nn.Parameter(torch.Tensor(n_feat))\n\n    def forward(self, query, key, value, mask):\n        \"\"\"Forward of 'Dynamic Convolution'.\n\n        This function takes query, key and value but uses only quert.\n        This is just for 
compatibility with self-attention layer (attention.py)\n\n        Args:\n            query (torch.Tensor): (batch, time1, d_model) input tensor\n            key (torch.Tensor): (batch, time2, d_model) NOT USED\n            value (torch.Tensor): (batch, time2, d_model) NOT USED\n            mask (torch.Tensor): (batch, time1, time2) mask\n\n        Return:\n            x (torch.Tensor): (batch, time1, d_model) output\n\n        \"\"\"\n        # linear -> GLU -- -> lightconv -> linear\n        #               \\        /\n        #                 Linear\n        x = query\n        B, T, C = x.size()\n        H = self.wshare\n        k = self.kernel_size\n\n        # first linear layer\n        x = self.linear1(x)\n\n        # GLU activation\n        x = self.act(x)\n\n        # get kernel of convolution\n        weight = self.linear_weight(x)  # B x T x kH\n        weight = F.dropout(weight, self.dropout_rate, training=self.training)\n        weight = weight.view(B, T, H, k).transpose(1, 2).contiguous()  # B x H x T x k\n        weight_new = torch.zeros(B * H * T * (T + k - 1), dtype=weight.dtype)\n        weight_new = weight_new.view(B, H, T, T + k - 1).fill_(float(\"-inf\"))\n        weight_new = weight_new.to(x.device)  # B x H x T x T+k-1\n        weight_new.as_strided(\n            (B, H, T, k), ((T + k - 1) * T * H, (T + k - 1) * T, T + k, 1)\n        ).copy_(weight)\n        weight_new = weight_new.narrow(-1, int((k - 1) / 2), T)  # B x H x T x T(k)\n        if self.use_kernel_mask:\n            kernel_mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0)\n            weight_new = weight_new.masked_fill(kernel_mask == 0.0, float(\"-inf\"))\n        weight_new = F.softmax(weight_new, dim=-1)\n        self.attn = weight_new\n        weight_new = weight_new.view(B * H, T, T)\n\n        # convolution\n        x = x.transpose(1, 2).contiguous()  # B x C x T\n        x = x.view(B * H, int(C / H), T).transpose(1, 2)\n        x = torch.bmm(weight_new, x)  
# BH x T x C/H\n        x = x.transpose(1, 2).contiguous().view(B, C, T)\n\n        if self.use_bias:\n            x = x + self.bias.view(1, -1, 1)\n        x = x.transpose(1, 2)  # B x T x C\n\n        if mask is not None and not self.use_kernel_mask:\n            mask = mask.transpose(-1, -2)\n            x = x.masked_fill(mask == 0, 0.0)\n\n        # second linear layer\n        x = self.linear2(x)\n        return x\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/dynamic_conv2d.py",
    "content": "\"\"\"Dynamic 2-Dimentional Convolution module.\"\"\"\n\nimport numpy\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\n\n\nMIN_VALUE = float(numpy.finfo(numpy.float32).min)\n\n\nclass DynamicConvolution2D(nn.Module):\n    \"\"\"Dynamic 2-Dimentional Convolution layer.\n\n    This implementation is based on\n    https://github.com/pytorch/fairseq/tree/master/fairseq\n\n    Args:\n        wshare (int): the number of kernel of convolution\n        n_feat (int): the number of features\n        dropout_rate (float): dropout_rate\n        kernel_size (int): kernel size (length)\n        use_kernel_mask (bool): Use causal mask or not for convolution kernel\n        use_bias (bool): Use bias term or not.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        wshare,\n        n_feat,\n        dropout_rate,\n        kernel_size,\n        use_kernel_mask=False,\n        use_bias=False,\n    ):\n        \"\"\"Construct Dynamic 2-Dimentional Convolution layer.\"\"\"\n        super(DynamicConvolution2D, self).__init__()\n\n        assert n_feat % wshare == 0\n        self.wshare = wshare\n        self.use_kernel_mask = use_kernel_mask\n        self.dropout_rate = dropout_rate\n        self.kernel_size = kernel_size\n        self.padding_size = int(kernel_size / 2)\n        self.attn_t = None\n        self.attn_f = None\n\n        # linear -> GLU -- -> lightconv -> linear\n        #               \\        /\n        #                 Linear\n        self.linear1 = nn.Linear(n_feat, n_feat * 2)\n        self.linear2 = nn.Linear(n_feat * 2, n_feat)\n        self.linear_weight = nn.Linear(n_feat, self.wshare * 1 * kernel_size)\n        nn.init.xavier_uniform(self.linear_weight.weight)\n        self.linear_weight_f = nn.Linear(n_feat, kernel_size)\n        nn.init.xavier_uniform(self.linear_weight_f.weight)\n        self.act = nn.GLU()\n\n        # dynamic conv related\n        self.use_bias = use_bias\n        if self.use_bias:\n            
self.bias = nn.Parameter(torch.Tensor(n_feat))\n\n    def forward(self, query, key, value, mask):\n        \"\"\"Forward of 'Dynamic 2-Dimensional Convolution'.\n\n        This function takes query, key and value but uses only query.\n        This is just for compatibility with self-attention layer (attention.py)\n\n        Args:\n            query (torch.Tensor): (batch, time1, d_model) input tensor\n            key (torch.Tensor): (batch, time2, d_model) NOT USED\n            value (torch.Tensor): (batch, time2, d_model) NOT USED\n            mask (torch.Tensor): (batch, time1, time2) mask\n\n        Return:\n            x (torch.Tensor): (batch, time1, d_model) output\n\n        \"\"\"\n        # linear -> GLU -- -> lightconv -> linear\n        #               \\        /\n        #                 Linear\n        x = query\n        B, T, C = x.size()\n        H = self.wshare\n        k = self.kernel_size\n\n        # first linear layer\n        x = self.linear1(x)\n\n        # GLU activation\n        x = self.act(x)\n\n        # convolution of frequency axis\n        weight_f = self.linear_weight_f(x).view(B * T, 1, k)  # BT x 1 x k\n        self.attn_f = weight_f.view(B, T, k).unsqueeze(1)\n        xf = F.conv1d(\n            x.view(1, B * T, C), weight_f, padding=self.padding_size, groups=B * T\n        )\n        xf = xf.view(B, T, C)\n\n        # get kernel of convolution\n        weight = self.linear_weight(x)  # B x T x kH\n        weight = F.dropout(weight, self.dropout_rate, training=self.training)\n        weight = weight.view(B, T, H, k).transpose(1, 2).contiguous()  # B x H x T x k\n        weight_new = torch.zeros(B * H * T * (T + k - 1), dtype=weight.dtype)\n        weight_new = weight_new.view(B, H, T, T + k - 1).fill_(float(\"-inf\"))\n        weight_new = weight_new.to(x.device)  # B x H x T x T+k-1\n        weight_new.as_strided(\n            (B, H, T, k), ((T + k - 1) * T * H, (T + k - 1) * T, T + k, 1)\n        ).copy_(weight)\n        
weight_new = weight_new.narrow(-1, int((k - 1) / 2), T)  # B x H x T x T(k)\n        if self.use_kernel_mask:\n            kernel_mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0)\n            weight_new = weight_new.masked_fill(kernel_mask == 0.0, float(\"-inf\"))\n        weight_new = F.softmax(weight_new, dim=-1)\n        self.attn_t = weight_new\n        weight_new = weight_new.view(B * H, T, T)\n\n        # convolution\n        x = x.transpose(1, 2).contiguous()  # B x C x T\n        x = x.view(B * H, int(C / H), T).transpose(1, 2)\n        x = torch.bmm(weight_new, x)\n        x = x.transpose(1, 2).contiguous().view(B, C, T)\n\n        if self.use_bias:\n            x = x + self.bias.view(1, -1, 1)\n        x = x.transpose(1, 2)  # B x T x C\n        x = torch.cat((x, xf), -1)  # B x T x Cx2\n\n        if mask is not None and not self.use_kernel_mask:\n            mask = mask.transpose(-1, -2)\n            x = x.masked_fill(mask == 0, 0.0)\n\n        # second linear layer\n        x = self.linear2(x)\n        return x\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/embedding.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Positional Encoding Module.\"\"\"\n\nimport math\n\nimport torch\n\n\ndef _pre_hook(\n    state_dict,\n    prefix,\n    local_metadata,\n    strict,\n    missing_keys,\n    unexpected_keys,\n    error_msgs,\n):\n    \"\"\"Perform pre-hook in load_state_dict for backward compatibility.\n\n    Note:\n        We saved self.pe until v.0.5.2 but we have omitted it later.\n        Therefore, we remove the item \"pe\" from `state_dict` for backward compatibility.\n\n    \"\"\"\n    k = prefix + \"pe\"\n    if k in state_dict:\n        state_dict.pop(k)\n\n\nclass PositionalEncoding(torch.nn.Module):\n    \"\"\"Positional encoding.\n\n    Args:\n        d_model (int): Embedding dimension.\n        dropout_rate (float): Dropout rate.\n        max_len (int): Maximum input length.\n        reverse (bool): Whether to reverse the input position. Only for\n        the class LegacyRelPositionalEncoding. 
We remove it in the current\n        class RelPositionalEncoding.\n\n    \"\"\"\n\n    def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):\n        \"\"\"Construct an PositionalEncoding object.\"\"\"\n        super(PositionalEncoding, self).__init__()\n        self.d_model = d_model\n        self.reverse = reverse\n        self.xscale = math.sqrt(self.d_model)\n        self.dropout = torch.nn.Dropout(p=dropout_rate)\n        self.pe = None\n        self.extend_pe(torch.tensor(0.0).expand(1, max_len))\n        self._register_load_state_dict_pre_hook(_pre_hook)\n\n    def extend_pe(self, x):\n        \"\"\"Reset the positional encodings.\"\"\"\n        if self.pe is not None:\n            if self.pe.size(1) >= x.size(1):\n                if self.pe.dtype != x.dtype or self.pe.device != x.device:\n                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)\n                return\n        pe = torch.zeros(x.size(1), self.d_model)\n        if self.reverse:\n            position = torch.arange(\n                x.size(1) - 1, -1, -1.0, dtype=torch.float32\n            ).unsqueeze(1)\n        else:\n            position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)\n        div_term = torch.exp(\n            torch.arange(0, self.d_model, 2, dtype=torch.float32)\n            * -(math.log(10000.0) / self.d_model)\n        )\n        pe[:, 0::2] = torch.sin(position * div_term)\n        pe[:, 1::2] = torch.cos(position * div_term)\n        pe = pe.unsqueeze(0)\n        self.pe = pe.to(device=x.device, dtype=x.dtype)\n\n    def forward(self, x: torch.Tensor):\n        \"\"\"Add positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, `*`).\n\n        Returns:\n            torch.Tensor: Encoded tensor (batch, time, `*`).\n\n        \"\"\"\n        self.extend_pe(x)\n        x = x * self.xscale + self.pe[:, : x.size(1)]\n        return self.dropout(x)\n\n\nclass 
ScaledPositionalEncoding(PositionalEncoding):\n    \"\"\"Scaled positional encoding module.\n\n    See Sec. 3.2  https://arxiv.org/abs/1809.08895\n\n    Args:\n        d_model (int): Embedding dimension.\n        dropout_rate (float): Dropout rate.\n        max_len (int): Maximum input length.\n\n    \"\"\"\n\n    def __init__(self, d_model, dropout_rate, max_len=5000):\n        \"\"\"Initialize class.\"\"\"\n        super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)\n        self.alpha = torch.nn.Parameter(torch.tensor(1.0))\n\n    def reset_parameters(self):\n        \"\"\"Reset parameters.\"\"\"\n        self.alpha.data = torch.tensor(1.0)\n\n    def forward(self, x):\n        \"\"\"Add positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, `*`).\n\n        Returns:\n            torch.Tensor: Encoded tensor (batch, time, `*`).\n\n        \"\"\"\n        self.extend_pe(x)\n        x = x + self.alpha * self.pe[:, : x.size(1)]\n        return self.dropout(x)\n\n\nclass LegacyRelPositionalEncoding(PositionalEncoding):\n    \"\"\"Relative positional encoding module (old version).\n\n    Details can be found in https://github.com/espnet/espnet/pull/2816.\n\n    See : Appendix B in https://arxiv.org/abs/1901.02860\n\n    Args:\n        d_model (int): Embedding dimension.\n        dropout_rate (float): Dropout rate.\n        max_len (int): Maximum input length.\n\n    \"\"\"\n\n    def __init__(self, d_model, dropout_rate, max_len=5000):\n        \"\"\"Initialize class.\"\"\"\n        super().__init__(\n            d_model=d_model,\n            dropout_rate=dropout_rate,\n            max_len=max_len,\n            reverse=True,\n        )\n\n    def forward(self, x):\n        \"\"\"Compute positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, `*`).\n\n        Returns:\n            torch.Tensor: Encoded tensor (batch, time, `*`).\n            torch.Tensor: 
Positional embedding tensor (1, time, `*`).\n\n        \"\"\"\n        self.extend_pe(x)\n        x = x * self.xscale\n        pos_emb = self.pe[:, : x.size(1)]\n        return self.dropout(x), self.dropout(pos_emb)\n\n\nclass RelPositionalEncoding(torch.nn.Module):\n    \"\"\"Relative positional encoding module (new implementation).\n\n    Details can be found in https://github.com/espnet/espnet/pull/2816.\n\n    See : Appendix B in https://arxiv.org/abs/1901.02860\n\n    Args:\n        d_model (int): Embedding dimension.\n        dropout_rate (float): Dropout rate.\n        max_len (int): Maximum input length.\n\n    \"\"\"\n\n    def __init__(self, d_model, dropout_rate, max_len=5000):\n        \"\"\"Construct a RelPositionalEncoding object.\"\"\"\n        super(RelPositionalEncoding, self).__init__()\n        self.d_model = d_model\n        self.xscale = math.sqrt(self.d_model)\n        self.dropout = torch.nn.Dropout(p=dropout_rate)\n        self.pe = None\n        self.extend_pe(torch.tensor(0.0).expand(1, max_len))\n\n    def extend_pe(self, x):\n        \"\"\"Reset the positional encodings.\"\"\"\n        if self.pe is not None:\n            # self.pe contains both positive and negative parts\n            # the length of self.pe is 2 * input_len - 1\n            if self.pe.size(1) >= x.size(1) * 2 - 1:\n                if self.pe.dtype != x.dtype or self.pe.device != x.device:\n                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)\n                return\n        # Suppose `i` is the position of the query vector and `j` is the\n        # position of the key vector. 
We use positive relative positions when keys\n        # are to the left (i>j) and negative relative positions otherwise (i<j).\n        pe_positive = torch.zeros(x.size(1), self.d_model)\n        pe_negative = torch.zeros(x.size(1), self.d_model)\n        position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)\n        div_term = torch.exp(\n            torch.arange(0, self.d_model, 2, dtype=torch.float32)\n            * -(math.log(10000.0) / self.d_model)\n        )\n        pe_positive[:, 0::2] = torch.sin(position * div_term)\n        pe_positive[:, 1::2] = torch.cos(position * div_term)\n        pe_negative[:, 0::2] = torch.sin(-1 * position * div_term)\n        pe_negative[:, 1::2] = torch.cos(-1 * position * div_term)\n\n        # Reverse the order of positive indices and concat both positive and\n        # negative indices. This is used to support the shifting trick\n        # as in https://arxiv.org/abs/1901.02860\n        pe_positive = torch.flip(pe_positive, [0]).unsqueeze(0)\n        pe_negative = pe_negative[1:].unsqueeze(0)\n        pe = torch.cat([pe_positive, pe_negative], dim=1)\n        self.pe = pe.to(device=x.device, dtype=x.dtype)\n\n    def forward(self, x: torch.Tensor):\n        \"\"\"Add positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, `*`).\n\n        Returns:\n            torch.Tensor: Encoded tensor (batch, time, `*`).\n            torch.Tensor: Positional embedding tensor (1, 2*time-1, `*`).\n\n        \"\"\"\n        self.extend_pe(x)\n        x = x * self.xscale\n        pos_emb = self.pe[\n            :,\n            self.pe.size(1) // 2 - x.size(1) + 1 : self.pe.size(1) // 2 + x.size(1),\n        ]\n        return self.dropout(x), self.dropout(pos_emb)\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/encoder.py",
    "content": "# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Encoder definition.\"\"\"\n\nimport logging\nimport torch\n\nfrom espnet.nets.pytorch_backend.nets_utils import rename_state_dict\nfrom espnet.nets.pytorch_backend.transducer.vgg2l import VGG2L\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.dynamic_conv import DynamicConvolution\nfrom espnet.nets.pytorch_backend.transformer.dynamic_conv2d import DynamicConvolution2D\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.encoder_layer import EncoderLayer\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\nfrom espnet.nets.pytorch_backend.transformer.lightconv import LightweightConvolution\nfrom espnet.nets.pytorch_backend.transformer.lightconv2d import LightweightConvolution2D\nfrom espnet.nets.pytorch_backend.transformer.multi_layer_conv import Conv1dLinear\nfrom espnet.nets.pytorch_backend.transformer.multi_layer_conv import MultiLayeredConv1d\nfrom espnet.nets.pytorch_backend.transformer.positionwise_feed_forward import (\n    PositionwiseFeedForward,  # noqa: H301\n)\nfrom espnet.nets.pytorch_backend.transformer.repeat import repeat\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling6\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling8\n\n\ndef _pre_hook(\n    state_dict,\n    prefix,\n    local_metadata,\n    strict,\n    missing_keys,\n    unexpected_keys,\n    error_msgs,\n):\n    # https://github.com/espnet/espnet/commit/21d70286c354c66c0350e65dc098d2ee236faccc#diff-bffb1396f038b317b2b64dd96e6d3563\n    rename_state_dict(prefix + \"input_layer.\", prefix + \"embed.\", state_dict)\n    # 
https://github.com/espnet/espnet/commit/3d422f6de8d4f03673b89e1caef698745ec749ea#diff-bffb1396f038b317b2b64dd96e6d3563\n    rename_state_dict(prefix + \"norm.\", prefix + \"after_norm.\", state_dict)\n\n\nclass Encoder(torch.nn.Module):\n    \"\"\"Transformer encoder module.\n\n    Args:\n        idim (int): Input dimension.\n        attention_dim (int): Dimension of attention.\n        attention_heads (int): The number of heads of multi head attention.\n        conv_wshare (int): The number of kernels of convolution. Only used in\n            selfattention_layer_type == \"lightconv*\" or \"dynamicconv*\".\n        conv_kernel_length (Union[int, str]): Kernel size str of convolution\n            (e.g. 71_71_71_71_71_71). Only used in selfattention_layer_type\n            == \"lightconv*\" or \"dynamicconv*\".\n        conv_usebias (bool): Whether to use bias in convolution. Only used in\n            selfattention_layer_type == \"lightconv*\" or \"dynamicconv*\".\n        linear_units (int): The number of units of position-wise feed forward.\n        num_blocks (int): The number of encoder blocks.\n        dropout_rate (float): Dropout rate.\n        positional_dropout_rate (float): Dropout rate after adding positional encoding.\n        attention_dropout_rate (float): Dropout rate in attention.\n        input_layer (Union[str, torch.nn.Module]): Input layer type.\n        pos_enc_class (torch.nn.Module): Positional encoding module class.\n            `PositionalEncoding` or `ScaledPositionalEncoding`\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n        positionwise_layer_type (str): \"linear\", \"conv1d\", or \"conv1d-linear\".\n        positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer.\n        selfattention_layer_type (str): Encoder attention layer type.\n        padding_idx (int): Padding idx for input_layer=embed.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        attention_dim=256,\n        attention_heads=4,\n        conv_wshare=4,\n        conv_kernel_length=\"11\",\n        conv_usebias=False,\n        linear_units=2048,\n        num_blocks=6,\n        dropout_rate=0.1,\n        positional_dropout_rate=0.1,\n        attention_dropout_rate=0.0,\n        input_layer=\"conv2d\",\n        pos_enc_class=PositionalEncoding,\n        normalize_before=True,\n        concat_after=False,\n        positionwise_layer_type=\"linear\",\n        positionwise_conv_kernel_size=1,\n        selfattention_layer_type=\"selfattn\",\n        padding_idx=-1,\n    ):\n        \"\"\"Construct an Encoder object.\"\"\"\n        super(Encoder, self).__init__()\n        self._register_load_state_dict_pre_hook(_pre_hook)\n\n        self.conv_subsampling_factor = 1\n        if input_layer == \"linear\":\n            self.embed = torch.nn.Sequential(\n                torch.nn.Linear(idim, attention_dim),\n                torch.nn.LayerNorm(attention_dim),\n                torch.nn.Dropout(dropout_rate),\n                torch.nn.ReLU(),\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif input_layer == \"conv2d\":\n            self.embed = Conv2dSubsampling(idim, attention_dim, dropout_rate)\n            self.conv_subsampling_factor = 4\n        elif input_layer == \"conv2d-scaled-pos-enc\":\n            self.embed = Conv2dSubsampling(\n                idim,\n                attention_dim,\n                dropout_rate,\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n         
   self.conv_subsampling_factor = 4\n        elif input_layer == \"conv2d6\":\n            self.embed = Conv2dSubsampling6(idim, attention_dim, dropout_rate)\n            self.conv_subsampling_factor = 6\n        elif input_layer == \"conv2d8\":\n            self.embed = Conv2dSubsampling8(idim, attention_dim, dropout_rate)\n            self.conv_subsampling_factor = 8\n        elif input_layer == \"vgg2l\":\n            self.embed = VGG2L(idim, attention_dim)\n            self.conv_subsampling_factor = 4\n        elif input_layer == \"embed\":\n            self.embed = torch.nn.Sequential(\n                torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif isinstance(input_layer, torch.nn.Module):\n            self.embed = torch.nn.Sequential(\n                input_layer,\n                pos_enc_class(attention_dim, positional_dropout_rate),\n            )\n        elif input_layer is None:\n            self.embed = torch.nn.Sequential(\n                pos_enc_class(attention_dim, positional_dropout_rate)\n            )\n        else:\n            raise ValueError(\"unknown input_layer: \" + input_layer)\n        self.normalize_before = normalize_before\n        positionwise_layer, positionwise_layer_args = self.get_positionwise_layer(\n            positionwise_layer_type,\n            attention_dim,\n            linear_units,\n            dropout_rate,\n            positionwise_conv_kernel_size,\n        )\n        if selfattention_layer_type in [\n            \"selfattn\",\n            \"rel_selfattn\",\n            \"legacy_rel_selfattn\",\n        ]:\n            logging.info(\"encoder self-attention layer type = self-attention\")\n            encoder_selfattn_layer = MultiHeadedAttention\n            encoder_selfattn_layer_args = [\n                (\n                    attention_heads,\n                    attention_dim,\n                   
 attention_dropout_rate,\n                )\n            ] * num_blocks\n        elif selfattention_layer_type == \"lightconv\":\n            logging.info(\"encoder self-attention layer type = lightweight convolution\")\n            encoder_selfattn_layer = LightweightConvolution\n            encoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    False,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        elif selfattention_layer_type == \"lightconv2d\":\n            logging.info(\n                \"encoder self-attention layer \"\n                \"type = lightweight convolution 2-dimentional\"\n            )\n            encoder_selfattn_layer = LightweightConvolution2D\n            encoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    False,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        elif selfattention_layer_type == \"dynamicconv\":\n            logging.info(\"encoder self-attention layer type = dynamic convolution\")\n            encoder_selfattn_layer = DynamicConvolution\n            encoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    False,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        elif selfattention_layer_type == \"dynamicconv2d\":\n         
   logging.info(\n                \"encoder self-attention layer type = dynamic convolution 2-dimentional\"\n            )\n            encoder_selfattn_layer = DynamicConvolution2D\n            encoder_selfattn_layer_args = [\n                (\n                    conv_wshare,\n                    attention_dim,\n                    attention_dropout_rate,\n                    int(conv_kernel_length.split(\"_\")[lnum]),\n                    False,\n                    conv_usebias,\n                )\n                for lnum in range(num_blocks)\n            ]\n        else:\n            raise NotImplementedError(selfattention_layer_type)\n\n        self.encoders = repeat(\n            num_blocks,\n            lambda lnum: EncoderLayer(\n                attention_dim,\n                encoder_selfattn_layer(*encoder_selfattn_layer_args[lnum]),\n                positionwise_layer(*positionwise_layer_args),\n                dropout_rate,\n                normalize_before,\n                concat_after,\n            ),\n        )\n        if self.normalize_before:\n            self.after_norm = LayerNorm(attention_dim)\n\n    def get_positionwise_layer(\n        self,\n        positionwise_layer_type=\"linear\",\n        attention_dim=256,\n        linear_units=2048,\n        dropout_rate=0.1,\n        positionwise_conv_kernel_size=1,\n    ):\n        \"\"\"Define positionwise layer.\"\"\"\n        if positionwise_layer_type == \"linear\":\n            positionwise_layer = PositionwiseFeedForward\n            positionwise_layer_args = (attention_dim, linear_units, dropout_rate)\n        elif positionwise_layer_type == \"conv1d\":\n            positionwise_layer = MultiLayeredConv1d\n            positionwise_layer_args = (\n                attention_dim,\n                linear_units,\n                positionwise_conv_kernel_size,\n                dropout_rate,\n            )\n        elif positionwise_layer_type == \"conv1d-linear\":\n            
positionwise_layer = Conv1dLinear\n            positionwise_layer_args = (\n                attention_dim,\n                linear_units,\n                positionwise_conv_kernel_size,\n                dropout_rate,\n            )\n        else:\n            raise NotImplementedError(\"Support only linear or conv1d.\")\n        return positionwise_layer, positionwise_layer_args\n\n    def forward(self, xs, masks):\n        \"\"\"Encode input sequence.\n\n        Args:\n            xs (torch.Tensor): Input tensor (#batch, time, idim).\n            masks (torch.Tensor): Mask tensor (#batch, time).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time, attention_dim).\n            torch.Tensor: Mask tensor (#batch, time).\n\n        \"\"\"\n        if isinstance(\n            self.embed,\n            (Conv2dSubsampling, Conv2dSubsampling6, Conv2dSubsampling8, VGG2L),\n        ):\n            xs, masks = self.embed(xs, masks)\n        else:\n            xs = self.embed(xs)\n        xs, masks = self.encoders(xs, masks)\n        if self.normalize_before:\n            xs = self.after_norm(xs)\n        return xs, masks\n\n    def forward_one_step(self, xs, masks, cache=None):\n        \"\"\"Encode input frame.\n\n        Args:\n            xs (torch.Tensor): Input tensor.\n            masks (torch.Tensor): Mask tensor.\n            cache (List[torch.Tensor]): List of cache tensors.\n\n        Returns:\n            torch.Tensor: Output tensor.\n            torch.Tensor: Mask tensor.\n            List[torch.Tensor]: List of new cache tensors.\n\n        \"\"\"\n        if isinstance(self.embed, Conv2dSubsampling):\n            xs, masks = self.embed(xs, masks)\n        else:\n            xs = self.embed(xs)\n        if cache is None:\n            cache = [None for _ in range(len(self.encoders))]\n        new_cache = []\n        for c, e in zip(cache, self.encoders):\n            xs, masks = e(xs, masks, cache=c)\n            new_cache.append(xs)\n       
 if self.normalize_before:\n            xs = self.after_norm(xs)\n        return xs, masks, new_cache\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/encoder_layer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Encoder self-attention layer definition.\"\"\"\n\nimport torch\n\nfrom torch import nn\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\nclass EncoderLayer(nn.Module):\n    \"\"\"Encoder layer module.\n\n    Args:\n        size (int): Input dimension.\n        self_attn (torch.nn.Module): Self-attention module instance.\n            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance\n            can be used as the argument.\n        feed_forward (torch.nn.Module): Feed-forward module instance.\n            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance\n            can be used as the argument.\n        dropout_rate (float): Dropout rate.\n        normalize_before (bool): Whether to use layer_norm before the first block.\n        concat_after (bool): Whether to concat attention layer's input and output.\n            if True, additional linear will be applied.\n            i.e. x -> x + linear(concat(x, att(x)))\n            if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n\n    \"\"\"\n\n    def __init__(\n        self,\n        size,\n        self_attn,\n        feed_forward,\n        dropout_rate,\n        normalize_before=True,\n        concat_after=False,\n    ):\n        \"\"\"Construct an EncoderLayer object.\"\"\"\n        super(EncoderLayer, self).__init__()\n        self.self_attn = self_attn\n        self.feed_forward = feed_forward\n        self.norm1 = LayerNorm(size)\n        self.norm2 = LayerNorm(size)\n        self.dropout = nn.Dropout(dropout_rate)\n        self.size = size\n        self.normalize_before = normalize_before\n        self.concat_after = concat_after\n        if self.concat_after:\n            self.concat_linear = nn.Linear(size + size, size)\n\n    def forward(self, x, mask, cache=None):\n        \"\"\"Compute encoded features.\n\n        Args:\n            x_input (torch.Tensor): Input tensor (#batch, time, size).\n            mask (torch.Tensor): Mask tensor for the input (#batch, time).\n            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).\n\n        Returns:\n            torch.Tensor: Output tensor (#batch, time, size).\n            torch.Tensor: Mask tensor (#batch, time).\n\n        \"\"\"\n        residual = x\n        if self.normalize_before:\n            x = self.norm1(x)\n\n        if cache is None:\n            x_q = x\n        else:\n            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)\n            x_q = x[:, -1:, :]\n            residual = residual[:, -1:, :]\n            mask = None if mask is None else mask[:, -1:, :]\n\n        if self.concat_after:\n            x_concat = torch.cat((x, self.self_attn(x_q, x, x, mask)), dim=-1)\n            x = residual + self.concat_linear(x_concat)\n        else:\n            x = residual + self.dropout(self.self_attn(x_q, x, x, mask))\n        if not self.normalize_before:\n            x = self.norm1(x)\n\n        residual = x\n        if self.normalize_before:\n            x = 
self.norm2(x)\n        x = residual + self.dropout(self.feed_forward(x))\n        if not self.normalize_before:\n            x = self.norm2(x)\n\n        if cache is not None:\n            x = torch.cat([cache, x], dim=1)\n\n        return x, mask\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/encoder_mix.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Encoder Mix definition.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.transducer.vgg2l import VGG2L\nfrom espnet.nets.pytorch_backend.transformer.attention import MultiHeadedAttention\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\nfrom espnet.nets.pytorch_backend.transformer.encoder import Encoder\nfrom espnet.nets.pytorch_backend.transformer.encoder_layer import EncoderLayer\nfrom espnet.nets.pytorch_backend.transformer.repeat import repeat\nfrom espnet.nets.pytorch_backend.transformer.subsampling import Conv2dSubsampling\n\n\nclass EncoderMix(Encoder, torch.nn.Module):\n    \"\"\"Transformer encoder module.\n\n    :param int idim: input dim\n    :param int attention_dim: dimention of attention\n    :param int attention_heads: the number of heads of multi head attention\n    :param int linear_units: the number of units of position-wise feed forward\n    :param int num_blocks: the number of decoder blocks\n    :param float dropout_rate: dropout rate\n    :param float attention_dropout_rate: dropout rate in attention\n    :param float positional_dropout_rate: dropout rate after adding positional encoding\n    :param str or torch.nn.Module input_layer: input layer type\n    :param class pos_enc_class: PositionalEncoding or ScaledPositionalEncoding\n    :param bool normalize_before: whether to use layer_norm before the first block\n    :param bool concat_after: whether to concat attention layer's input and output\n        if True, additional linear will be applied.\n        i.e. x -> x + linear(concat(x, att(x)))\n        if False, no additional linear will be applied. i.e. 
x -> x + att(x)\n    :param str positionwise_layer_type: linear or conv1d\n    :param int positionwise_conv_kernel_size: kernel size of positionwise conv1d layer\n    :param int padding_idx: padding_idx for input_layer=embed\n    :param int num_spkrs: the number of speakers\n    \"\"\"\n\n    def __init__(\n        self,\n        idim,\n        attention_dim=256,\n        attention_heads=4,\n        linear_units=2048,\n        num_blocks_sd=4,\n        num_blocks_rec=8,\n        dropout_rate=0.1,\n        positional_dropout_rate=0.1,\n        attention_dropout_rate=0.0,\n        input_layer=\"conv2d\",\n        pos_enc_class=PositionalEncoding,\n        normalize_before=True,\n        concat_after=False,\n        positionwise_layer_type=\"linear\",\n        positionwise_conv_kernel_size=1,\n        padding_idx=-1,\n        num_spkrs=2,\n    ):\n        \"\"\"Construct an EncoderMix object.\"\"\"\n        super(EncoderMix, self).__init__(\n            idim=idim,\n            selfattention_layer_type=\"selfattn\",\n            attention_dim=attention_dim,\n            attention_heads=attention_heads,\n            linear_units=linear_units,\n            num_blocks=num_blocks_rec,\n            dropout_rate=dropout_rate,\n            positional_dropout_rate=positional_dropout_rate,\n            attention_dropout_rate=attention_dropout_rate,\n            input_layer=input_layer,\n            pos_enc_class=pos_enc_class,\n            normalize_before=normalize_before,\n            concat_after=concat_after,\n            positionwise_layer_type=positionwise_layer_type,\n            positionwise_conv_kernel_size=positionwise_conv_kernel_size,\n            padding_idx=padding_idx,\n        )\n        positionwise_layer, positionwise_layer_args = self.get_positionwise_layer(\n            positionwise_layer_type,\n            attention_dim,\n            linear_units,\n            dropout_rate,\n            positionwise_conv_kernel_size,\n        )\n        self.num_spkrs = num_spkrs\n        self.encoders_sd = 
torch.nn.ModuleList(\n            [\n                repeat(\n                    num_blocks_sd,\n                    lambda lnum: EncoderLayer(\n                        attention_dim,\n                        MultiHeadedAttention(\n                            attention_heads, attention_dim, attention_dropout_rate\n                        ),\n                        positionwise_layer(*positionwise_layer_args),\n                        dropout_rate,\n                        normalize_before,\n                        concat_after,\n                    ),\n                )\n                for i in range(num_spkrs)\n            ]\n        )\n\n    def forward(self, xs, masks):\n        \"\"\"Encode input sequence.\n\n        :param torch.Tensor xs: input tensor\n        :param torch.Tensor masks: input mask\n        :return: position embedded tensor and mask\n        :rtype Tuple[torch.Tensor, torch.Tensor]:\n        \"\"\"\n        if isinstance(self.embed, (Conv2dSubsampling, VGG2L)):\n            xs, masks = self.embed(xs, masks)\n        else:\n            xs = self.embed(xs)\n        xs_sd, masks_sd = [None] * self.num_spkrs, [None] * self.num_spkrs\n\n        for ns in range(self.num_spkrs):\n            xs_sd[ns], masks_sd[ns] = self.encoders_sd[ns](xs, masks)\n            xs_sd[ns], masks_sd[ns] = self.encoders(xs_sd[ns], masks_sd[ns])  # Enc_rec\n            if self.normalize_before:\n                xs_sd[ns] = self.after_norm(xs_sd[ns])\n        return xs_sd, masks_sd\n\n    def forward_one_step(self, xs, masks, cache=None):\n        \"\"\"Encode input frame.\n\n        :param torch.Tensor xs: input tensor\n        :param torch.Tensor masks: input mask\n        :param List[torch.Tensor] cache: cache tensors\n        :return: position embedded tensor, mask and new cache\n        :rtype Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]:\n        \"\"\"\n        if isinstance(self.embed, Conv2dSubsampling):\n            xs, masks = self.embed(xs, 
masks)\n        else:\n            xs = self.embed(xs)\n\n        new_cache_sd = []\n        for ns in range(self.num_spkrs):\n            # number of speaker-differentiating blocks for this speaker\n            num_sd = len(self.encoders_sd[ns])\n            if cache is None:\n                cache = [None for _ in range(num_sd + len(self.encoders))]\n            new_cache = []\n            for c, e in zip(cache[:num_sd], self.encoders_sd[ns]):\n                xs, masks = e(xs, masks, cache=c)\n                new_cache.append(xs)\n            # the shared recognition blocks are stored in self.encoders\n            for c, e in zip(cache[num_sd:], self.encoders):\n                xs, masks = e(xs, masks, cache=c)\n                new_cache.append(xs)\n            new_cache_sd.append(new_cache)\n            if self.normalize_before:\n                xs = self.after_norm(xs)\n        return xs, masks, new_cache_sd\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/initializer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Parameter initialization.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.transformer.layer_norm import LayerNorm\n\n\ndef initialize(model, init_type=\"pytorch\"):\n    \"\"\"Initialize Transformer module.\n\n    :param torch.nn.Module model: transformer instance\n    :param str init_type: initialization type\n    \"\"\"\n    if init_type == \"pytorch\":\n        return\n\n    # weight init\n    for p in model.parameters():\n        if p.dim() > 1:\n            if init_type == \"xavier_uniform\":\n                torch.nn.init.xavier_uniform_(p.data)\n            elif init_type == \"xavier_normal\":\n                torch.nn.init.xavier_normal_(p.data)\n            elif init_type == \"kaiming_uniform\":\n                torch.nn.init.kaiming_uniform_(p.data, nonlinearity=\"relu\")\n            elif init_type == \"kaiming_normal\":\n                torch.nn.init.kaiming_normal_(p.data, nonlinearity=\"relu\")\n            else:\n                raise ValueError(\"Unknown initialization: \" + init_type)\n    # bias init\n    for p in model.parameters():\n        if p.dim() == 1:\n            p.data.zero_()\n\n    # reset some modules with default init\n    for m in model.modules():\n        if isinstance(m, (torch.nn.Embedding, LayerNorm)):\n            m.reset_parameters()\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/label_smoothing_loss.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Label smoothing module.\"\"\"\n\nimport torch\nfrom torch import nn\n\n\nclass LabelSmoothingLoss(nn.Module):\n    \"\"\"Label-smoothing loss.\n\n    :param int size: the number of class\n    :param int padding_idx: ignored class id\n    :param float smoothing: smoothing rate (0.0 means the conventional CE)\n    :param bool normalize_length: normalize loss by sequence length if True\n    :param torch.nn.Module criterion: loss function to be smoothed\n    \"\"\"\n\n    def __init__(\n        self,\n        size,\n        padding_idx,\n        smoothing,\n        normalize_length=False,\n        criterion=nn.KLDivLoss(reduction=\"none\"),\n    ):\n        \"\"\"Construct an LabelSmoothingLoss object.\"\"\"\n        super(LabelSmoothingLoss, self).__init__()\n        self.criterion = criterion\n        self.padding_idx = padding_idx\n        self.confidence = 1.0 - smoothing\n        self.smoothing = smoothing\n        self.size = size\n        self.true_dist = None\n        self.normalize_length = normalize_length\n\n    def forward(self, x, target):\n        \"\"\"Compute loss between x and target.\n\n        :param torch.Tensor x: prediction (batch, seqlen, class)\n        :param torch.Tensor target:\n            target signal masked with self.padding_id (batch, seqlen)\n        :return: scalar float value\n        :rtype torch.Tensor\n        \"\"\"\n        assert x.size(2) == self.size\n        batch_size = x.size(0)\n        x = x.view(-1, self.size)\n        target = target.view(-1)\n        with torch.no_grad():\n            true_dist = x.clone()\n            true_dist.fill_(self.smoothing / (self.size - 1))\n            ignore = target == self.padding_idx  # (B,)\n            total = len(target) - ignore.sum().item()\n            target = target.masked_fill(ignore, 0)  # avoid -1 index\n      
      true_dist.scatter_(1, target.unsqueeze(1), self.confidence)\n        kl = self.criterion(torch.log_softmax(x, dim=1), true_dist)\n        denom = total if self.normalize_length else batch_size\n        return kl.masked_fill(ignore.unsqueeze(1), 0).sum() / denom\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/layer_norm.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Layer normalization module.\"\"\"\n\nimport torch\n\n\nclass LayerNorm(torch.nn.LayerNorm):\n    \"\"\"Layer normalization module.\n\n    Args:\n        nout (int): Output dim size.\n        dim (int): Dimension to be normalized.\n\n    \"\"\"\n\n    def __init__(self, nout, dim=-1):\n        \"\"\"Construct an LayerNorm object.\"\"\"\n        super(LayerNorm, self).__init__(nout, eps=1e-12)\n        self.dim = dim\n\n    def forward(self, x):\n        \"\"\"Apply layer normalization.\n\n        Args:\n            x (torch.Tensor): Input tensor.\n\n        Returns:\n            torch.Tensor: Normalized tensor.\n\n        \"\"\"\n        if self.dim == -1:\n            return super(LayerNorm, self).forward(x)\n        return (\n            super(LayerNorm, self)\n            .forward(x.transpose(self.dim, -1))\n            .transpose(self.dim, -1)\n        )\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/lightconv.py",
    "content": "\"\"\"Lightweight Convolution Module.\"\"\"\n\nimport numpy\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\n\n\nMIN_VALUE = float(numpy.finfo(numpy.float32).min)\n\n\nclass LightweightConvolution(nn.Module):\n    \"\"\"Lightweight Convolution layer.\n\n    This implementation is based on\n    https://github.com/pytorch/fairseq/tree/master/fairseq\n\n    Args:\n        wshare (int): the number of kernel of convolution\n        n_feat (int): the number of features\n        dropout_rate (float): dropout_rate\n        kernel_size (int): kernel size (length)\n        use_kernel_mask (bool): Use causal mask or not for convolution kernel\n        use_bias (bool): Use bias term or not.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        wshare,\n        n_feat,\n        dropout_rate,\n        kernel_size,\n        use_kernel_mask=False,\n        use_bias=False,\n    ):\n        \"\"\"Construct Lightweight Convolution layer.\"\"\"\n        super(LightweightConvolution, self).__init__()\n\n        assert n_feat % wshare == 0\n        self.wshare = wshare\n        self.use_kernel_mask = use_kernel_mask\n        self.dropout_rate = dropout_rate\n        self.kernel_size = kernel_size\n        self.padding_size = int(kernel_size / 2)\n\n        # linear -> GLU -> lightconv -> linear\n        self.linear1 = nn.Linear(n_feat, n_feat * 2)\n        self.linear2 = nn.Linear(n_feat, n_feat)\n        self.act = nn.GLU()\n\n        # lightconv related\n        self.weight = nn.Parameter(\n            torch.Tensor(self.wshare, 1, kernel_size).uniform_(0, 1)\n        )\n        self.use_bias = use_bias\n        if self.use_bias:\n            self.bias = nn.Parameter(torch.Tensor(n_feat))\n\n        # mask of kernel\n        kernel_mask0 = torch.zeros(self.wshare, int(kernel_size / 2))\n        kernel_mask1 = torch.ones(self.wshare, int(kernel_size / 2 + 1))\n        self.kernel_mask = torch.cat((kernel_mask1, kernel_mask0), 
dim=-1).unsqueeze(1)\n\n    def forward(self, query, key, value, mask):\n        \"\"\"Forward of 'Lightweight Convolution'.\n\n        This function takes query, key and value but uses only query.\n        This is just for compatibility with self-attention layer (attention.py)\n\n        Args:\n            query (torch.Tensor): (batch, time1, d_model) input tensor\n            key (torch.Tensor): (batch, time2, d_model) NOT USED\n            value (torch.Tensor): (batch, time2, d_model) NOT USED\n            mask (torch.Tensor): (batch, time1, time2) mask\n\n        Return:\n            x (torch.Tensor): (batch, time1, d_model) output\n\n        \"\"\"\n        # linear -> GLU -> lightconv -> linear\n        x = query\n        B, T, C = x.size()\n        H = self.wshare\n\n        # first linear layer\n        x = self.linear1(x)\n\n        # GLU activation\n        x = self.act(x)\n\n        # lightconv\n        x = x.transpose(1, 2).contiguous().view(-1, H, T)  # B x C x T\n        weight = F.dropout(self.weight, self.dropout_rate, training=self.training)\n        if self.use_kernel_mask:\n            self.kernel_mask = self.kernel_mask.to(x.device)\n            weight = weight.masked_fill(self.kernel_mask == 0.0, float(\"-inf\"))\n        weight = F.softmax(weight, dim=-1)\n        x = F.conv1d(x, weight, padding=self.padding_size, groups=self.wshare).view(\n            B, C, T\n        )\n        if self.use_bias:\n            x = x + self.bias.view(1, -1, 1)\n        x = x.transpose(1, 2)  # B x T x C\n\n        if mask is not None and not self.use_kernel_mask:\n            mask = mask.transpose(-1, -2)\n            x = x.masked_fill(mask == 0, 0.0)\n\n        # second linear layer\n        x = self.linear2(x)\n        return x\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/lightconv2d.py",
    "content": "\"\"\"Lightweight 2-Dimentional Convolution module.\"\"\"\n\nimport numpy\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\n\n\nMIN_VALUE = float(numpy.finfo(numpy.float32).min)\n\n\nclass LightweightConvolution2D(nn.Module):\n    \"\"\"Lightweight 2-Dimentional Convolution layer.\n\n    This implementation is based on\n    https://github.com/pytorch/fairseq/tree/master/fairseq\n\n    Args:\n        wshare (int): the number of kernel of convolution\n        n_feat (int): the number of features\n        dropout_rate (float): dropout_rate\n        kernel_size (int): kernel size (length)\n        use_kernel_mask (bool): Use causal mask or not for convolution kernel\n        use_bias (bool): Use bias term or not.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        wshare,\n        n_feat,\n        dropout_rate,\n        kernel_size,\n        use_kernel_mask=False,\n        use_bias=False,\n    ):\n        \"\"\"Construct Lightweight 2-Dimentional Convolution layer.\"\"\"\n        super(LightweightConvolution2D, self).__init__()\n\n        assert n_feat % wshare == 0\n        self.wshare = wshare\n        self.use_kernel_mask = use_kernel_mask\n        self.dropout_rate = dropout_rate\n        self.kernel_size = kernel_size\n        self.padding_size = int(kernel_size / 2)\n\n        # linear -> GLU -> lightconv -> linear\n        self.linear1 = nn.Linear(n_feat, n_feat * 2)\n        self.linear2 = nn.Linear(n_feat * 2, n_feat)\n        self.act = nn.GLU()\n\n        # lightconv related\n        self.weight = nn.Parameter(\n            torch.Tensor(self.wshare, 1, kernel_size).uniform_(0, 1)\n        )\n        self.weight_f = nn.Parameter(torch.Tensor(1, 1, kernel_size).uniform_(0, 1))\n        self.use_bias = use_bias\n        if self.use_bias:\n            self.bias = nn.Parameter(torch.Tensor(n_feat))\n\n        # mask of kernel\n        kernel_mask0 = torch.zeros(self.wshare, int(kernel_size / 2))\n        kernel_mask1 = 
torch.ones(self.wshare, int(kernel_size / 2 + 1))\n        self.kernel_mask = torch.cat((kernel_mask1, kernel_mask0), dim=-1).unsqueeze(1)\n\n    def forward(self, query, key, value, mask):\n        \"\"\"Forward of 'Lightweight 2-Dimentional Convolution'.\n\n        This function takes query, key and value but uses only query.\n        This is just for compatibility with self-attention layer (attention.py)\n\n        Args:\n            query (torch.Tensor): (batch, time1, d_model) input tensor\n            key (torch.Tensor): (batch, time2, d_model) NOT USED\n            value (torch.Tensor): (batch, time2, d_model) NOT USED\n            mask (torch.Tensor): (batch, time1, time2) mask\n\n        Return:\n            x (torch.Tensor): (batch, time1, d_model) ouput\n\n        \"\"\"\n        # linear -> GLU -> lightconv -> linear\n        x = query\n        B, T, C = x.size()\n        H = self.wshare\n\n        # first liner layer\n        x = self.linear1(x)\n\n        # GLU activation\n        x = self.act(x)\n\n        # convolution along frequency axis\n        weight_f = F.softmax(self.weight_f, dim=-1)\n        weight_f = F.dropout(weight_f, self.dropout_rate, training=self.training)\n        weight_new = torch.zeros(\n            B * T, 1, self.kernel_size, device=x.device, dtype=x.dtype\n        ).copy_(weight_f)\n        xf = F.conv1d(\n            x.view(1, B * T, C), weight_new, padding=self.padding_size, groups=B * T\n        ).view(B, T, C)\n\n        # lightconv\n        x = x.transpose(1, 2).contiguous().view(-1, H, T)  # B x C x T\n        weight = F.dropout(self.weight, self.dropout_rate, training=self.training)\n        if self.use_kernel_mask:\n            self.kernel_mask = self.kernel_mask.to(x.device)\n            weight = weight.masked_fill(self.kernel_mask == 0.0, float(\"-inf\"))\n        weight = F.softmax(weight, dim=-1)\n        x = F.conv1d(x, weight, padding=self.padding_size, groups=self.wshare).view(\n            B, C, T\n        )\n  
      if self.use_bias:\n            x = x + self.bias.view(1, -1, 1)\n        x = x.transpose(1, 2)  # B x T x C\n        x = torch.cat((x, xf), -1)  # B x T x Cx2\n\n        if mask is not None and not self.use_kernel_mask:\n            mask = mask.transpose(-1, -2)\n            x = x.masked_fill(mask == 0, 0.0)\n\n        # second linear layer\n        x = self.linear2(x)\n        return x\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/mask.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Mask module.\"\"\"\n\nfrom distutils.version import LooseVersion\n\nimport torch\n\nis_torch_1_2_plus = LooseVersion(torch.__version__) >= LooseVersion(\"1.2.0\")\n# LooseVersion('1.2.0') == LooseVersion(torch.__version__) can't include e.g. 1.2.0+aaa\nis_torch_1_2 = (\n    LooseVersion(\"1.3\") > LooseVersion(torch.__version__) >= LooseVersion(\"1.2\")\n)\ndatatype = torch.bool if is_torch_1_2_plus else torch.uint8\n\n\ndef subsequent_mask(size, device=\"cpu\", dtype=datatype):\n    \"\"\"Create mask for subsequent steps (size, size).\n\n    :param int size: size of mask\n    :param str device: \"cpu\" or \"cuda\" or torch.Tensor.device\n    :param torch.dtype dtype: result dtype\n    :rtype: torch.Tensor\n    >>> subsequent_mask(3)\n    [[1, 0, 0],\n     [1, 1, 0],\n     [1, 1, 1]]\n    \"\"\"\n    if is_torch_1_2 and dtype == torch.bool:\n        # torch=1.2 doesn't support tril for bool tensor\n        ret = torch.ones(size, size, device=device, dtype=torch.uint8)\n        return torch.tril(ret, out=ret).type(dtype)\n    else:\n        ret = torch.ones(size, size, device=device, dtype=dtype)\n        return torch.tril(ret, out=ret)\n\n\ndef target_mask(ys_in_pad, ignore_id):\n    \"\"\"Create mask for decoder self-attention.\n\n    :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)\n    :param int ignore_id: index of padding\n    :param torch.dtype dtype: result dtype\n    :rtype: torch.Tensor (B, Lmax, Lmax)\n    \"\"\"\n    ys_mask = ys_in_pad != ignore_id\n    m = subsequent_mask(ys_mask.size(-1), device=ys_mask.device).unsqueeze(0)\n    return ys_mask.unsqueeze(-2) & m\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/multi_layer_conv.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Tomoki Hayashi\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Layer modules for FFT block in FastSpeech (Feed-forward Transformer).\"\"\"\n\nimport torch\n\n\nclass MultiLayeredConv1d(torch.nn.Module):\n    \"\"\"Multi-layered conv1d for Transformer block.\n\n    This is a module of multi-layered conv1d designed\n    to replace positionwise feed-forward network\n    in Transformer block, which is introduced in\n    `FastSpeech: Fast, Robust and Controllable Text to Speech`_.\n\n    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:\n        https://arxiv.org/pdf/1905.09263.pdf\n\n    \"\"\"\n\n    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):\n        \"\"\"Initialize MultiLayeredConv1d module.\n\n        Args:\n            in_chans (int): Number of input channels.\n            hidden_chans (int): Number of hidden channels.\n            kernel_size (int): Kernel size of conv1d.\n            dropout_rate (float): Dropout rate.\n\n        \"\"\"\n        super(MultiLayeredConv1d, self).__init__()\n        self.w_1 = torch.nn.Conv1d(\n            in_chans,\n            hidden_chans,\n            kernel_size,\n            stride=1,\n            padding=(kernel_size - 1) // 2,\n        )\n        self.w_2 = torch.nn.Conv1d(\n            hidden_chans,\n            in_chans,\n            kernel_size,\n            stride=1,\n            padding=(kernel_size - 1) // 2,\n        )\n        self.dropout = torch.nn.Dropout(dropout_rate)\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (torch.Tensor): Batch of input tensors (B, T, in_chans).\n\n        Returns:\n            torch.Tensor: Batch of output tensors (B, T, in_chans).\n\n        \"\"\"\n        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)\n        return self.w_2(self.dropout(x).transpose(-1, 
1)).transpose(-1, 1)\n\n\nclass Conv1dLinear(torch.nn.Module):\n    \"\"\"Conv1D + Linear for Transformer block.\n\n    A variant of MultiLayeredConv1d, which replaces the second conv layer\n    with a linear layer.\n\n    \"\"\"\n\n    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):\n        \"\"\"Initialize Conv1dLinear module.\n\n        Args:\n            in_chans (int): Number of input channels.\n            hidden_chans (int): Number of hidden channels.\n            kernel_size (int): Kernel size of conv1d.\n            dropout_rate (float): Dropout rate.\n\n        \"\"\"\n        super(Conv1dLinear, self).__init__()\n        self.w_1 = torch.nn.Conv1d(\n            in_chans,\n            hidden_chans,\n            kernel_size,\n            stride=1,\n            padding=(kernel_size - 1) // 2,\n        )\n        self.w_2 = torch.nn.Linear(hidden_chans, in_chans)\n        self.dropout = torch.nn.Dropout(dropout_rate)\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (torch.Tensor): Batch of input tensors (B, T, in_chans).\n\n        Returns:\n            torch.Tensor: Batch of output tensors (B, T, in_chans).\n\n        \"\"\"\n        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)\n        return self.w_2(self.dropout(x))\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/optimizer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Optimizer module.\"\"\"\n\nimport torch\n\n\nclass NoamOpt(object):\n    \"\"\"Optim wrapper that implements rate.\"\"\"\n\n    def __init__(self, model_size, factor, warmup, optimizer):\n        \"\"\"Construct an NoamOpt object.\"\"\"\n        self.optimizer = optimizer\n        self._step = 0\n        self.warmup = warmup\n        self.factor = factor\n        self.model_size = model_size\n        self._rate = 0\n\n    @property\n    def param_groups(self):\n        \"\"\"Return param_groups.\"\"\"\n        return self.optimizer.param_groups\n\n    def step(self):\n        \"\"\"Update parameters and rate.\"\"\"\n        self._step += 1\n        rate = self.rate()\n        for p in self.optimizer.param_groups:\n            p[\"lr\"] = rate\n        self._rate = rate\n        self.optimizer.step()\n\n    def rate(self, step=None):\n        \"\"\"Implement `lrate` above.\"\"\"\n        if step is None:\n            step = self._step\n        return (\n            self.factor\n            * self.model_size ** (-0.5)\n            * min(step ** (-0.5), step * self.warmup ** (-1.5))\n        )\n\n    def zero_grad(self):\n        \"\"\"Reset gradient.\"\"\"\n        self.optimizer.zero_grad()\n\n    def state_dict(self):\n        \"\"\"Return state_dict.\"\"\"\n        return {\n            \"_step\": self._step,\n            \"warmup\": self.warmup,\n            \"factor\": self.factor,\n            \"model_size\": self.model_size,\n            \"_rate\": self._rate,\n            \"optimizer\": self.optimizer.state_dict(),\n        }\n\n    def load_state_dict(self, state_dict):\n        \"\"\"Load state_dict.\"\"\"\n        for key, value in state_dict.items():\n            if key == \"optimizer\":\n                self.optimizer.load_state_dict(state_dict[\"optimizer\"])\n            else:\n         
       setattr(self, key, value)\n\n\ndef get_std_opt(model_params, d_model, warmup, factor):\n    \"\"\"Get standard NoamOpt.\"\"\"\n    base = torch.optim.Adam(model_params, lr=0, betas=(0.9, 0.98), eps=1e-9)\n    return NoamOpt(d_model, factor, warmup, base)\n\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/plot.py",
    "content": "# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport logging\n\nimport matplotlib.pyplot as plt\nimport numpy\nimport os\n\nfrom espnet.asr import asr_utils\n\n\ndef _plot_and_save_attention(att_w, filename, xtokens=None, ytokens=None):\n    # dynamically import matplotlib due to not found error\n    from matplotlib.ticker import MaxNLocator\n\n    d = os.path.dirname(filename)\n    if not os.path.exists(d):\n        os.makedirs(d)\n    w, h = plt.figaspect(1.0 / len(att_w))\n    fig = plt.Figure(figsize=(w * 2, h * 2))\n    axes = fig.subplots(1, len(att_w))\n    if len(att_w) == 1:\n        axes = [axes]\n    for ax, aw in zip(axes, att_w):\n        # plt.subplot(1, len(att_w), h)\n        ax.imshow(aw.astype(numpy.float32), aspect=\"auto\")\n        ax.set_xlabel(\"Input\")\n        ax.set_ylabel(\"Output\")\n        ax.xaxis.set_major_locator(MaxNLocator(integer=True))\n        ax.yaxis.set_major_locator(MaxNLocator(integer=True))\n        # Labels for major ticks\n        if xtokens is not None:\n            ax.set_xticks(numpy.linspace(0, len(xtokens) - 1, len(xtokens)))\n            ax.set_xticks(numpy.linspace(0, len(xtokens) - 1, 1), minor=True)\n            ax.set_xticklabels(xtokens + [\"\"], rotation=40)\n        if ytokens is not None:\n            ax.set_yticks(numpy.linspace(0, len(ytokens) - 1, len(ytokens)))\n            ax.set_yticks(numpy.linspace(0, len(ytokens) - 1, 1), minor=True)\n            ax.set_yticklabels(ytokens + [\"\"])\n    fig.tight_layout()\n    return fig\n\n\ndef savefig(plot, filename):\n    plot.savefig(filename)\n    plt.clf()\n\n\ndef plot_multi_head_attention(\n    data,\n    uttid_list,\n    attn_dict,\n    outdir,\n    suffix=\"png\",\n    savefn=savefig,\n    ikey=\"input\",\n    iaxis=0,\n    okey=\"output\",\n    oaxis=0,\n    subsampling_factor=4,\n):\n    \"\"\"Plot multi head attentions.\n\n    :param dict data: utts info from json file\n    :param 
List uttid_list: utterance IDs\n    :param dict[str, torch.Tensor] attn_dict: multi head attention dict.\n        values should be torch.Tensor (head, input_length, output_length)\n    :param str outdir: dir to save fig\n    :param str suffix: filename suffix including image type (e.g., png)\n    :param savefn: function to save\n    :param str ikey: key to access input\n    :param int iaxis: dimension to access input\n    :param str okey: key to access output\n    :param int oaxis: dimension to access output\n    :param subsampling_factor: subsampling factor in encoder\n\n    \"\"\"\n    for name, att_ws in attn_dict.items():\n        for idx, att_w in enumerate(att_ws):\n            data_i = data[uttid_list[idx]]\n            filename = \"%s/%s.%s.%s\" % (outdir, uttid_list[idx], name, suffix)\n            dec_len = int(data_i[okey][oaxis][\"shape\"][0]) + 1  # +1 for <eos>\n            enc_len = int(data_i[ikey][iaxis][\"shape\"][0])\n            is_mt = \"token\" in data_i[ikey][iaxis].keys()\n            # for ASR/ST\n            if not is_mt:\n                enc_len //= subsampling_factor\n            xtokens, ytokens = None, None\n            if \"encoder\" in name:\n                att_w = att_w[:, :enc_len, :enc_len]\n                # for MT\n                if is_mt:\n                    xtokens = data_i[ikey][iaxis][\"token\"].split()\n                    ytokens = xtokens[:]\n            elif \"decoder\" in name:\n                if \"self\" in name:\n                    # self-attention\n                    att_w = att_w[:, :dec_len, :dec_len]\n                    if \"token\" in data_i[okey][oaxis].keys():\n                        ytokens = data_i[okey][oaxis][\"token\"].split() + [\"<eos>\"]\n                        xtokens = [\"<sos>\"] + data_i[okey][oaxis][\"token\"].split()\n                else:\n                    # cross-attention\n                    att_w = att_w[:, :dec_len, :enc_len]\n                    if \"token\" in 
data_i[okey][oaxis].keys():\n                        ytokens = data_i[okey][oaxis][\"token\"].split() + [\"<eos>\"]\n                    # for MT\n                    if is_mt:\n                        xtokens = data_i[ikey][iaxis][\"token\"].split()\n            else:\n                logging.warning(\"unknown name for shaping attention\")\n            fig = _plot_and_save_attention(att_w, filename, xtokens, ytokens)\n            savefn(fig, filename)\n\n\nclass PlotAttentionReport(asr_utils.PlotAttentionReport):\n    def plotfn(self, *args, **kwargs):\n        kwargs[\"ikey\"] = self.ikey\n        kwargs[\"iaxis\"] = self.iaxis\n        kwargs[\"okey\"] = self.okey\n        kwargs[\"oaxis\"] = self.oaxis\n        kwargs[\"subsampling_factor\"] = self.factor\n        plot_multi_head_attention(*args, **kwargs)\n\n    def __call__(self, trainer):\n        attn_dict, uttid_list = self.get_attention_weights()\n        suffix = \"ep.{.updater.epoch}.png\".format(trainer)\n        self.plotfn(self.data_dict, uttid_list, attn_dict, self.outdir, suffix, savefig)\n\n    def get_attention_weights(self):\n        return_batch, uttid_list = self.transform(self.data, return_uttid=True)\n        batch = self.converter([return_batch], self.device)\n        if isinstance(batch, tuple):\n            att_ws = self.att_vis_fn(*batch)\n        elif isinstance(batch, dict):\n            att_ws = self.att_vis_fn(**batch)\n        return att_ws, uttid_list\n\n    def log_attentions(self, logger, step):\n        def log_fig(plot, filename):\n            logger.add_figure(os.path.basename(filename), plot, step)\n            plt.clf()\n\n        attn_dict, uttid_list = self.get_attention_weights()\n        self.plotfn(self.data_dict, uttid_list, attn_dict, self.outdir, \"\", log_fig)\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/positionwise_feed_forward.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Positionwise feed forward layer definition.\"\"\"\n\nimport torch\n\n\nclass PositionwiseFeedForward(torch.nn.Module):\n    \"\"\"Positionwise feed forward layer.\n\n    Args:\n        idim (int): Input dimension.\n        hidden_units (int): The number of hidden units.\n        dropout_rate (float): Dropout rate.\n        activation (torch.nn.Module): Activation applied after the first linear layer.\n\n    \"\"\"\n\n    def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()):\n        \"\"\"Construct a PositionwiseFeedForward object.\"\"\"\n        super(PositionwiseFeedForward, self).__init__()\n        self.w_1 = torch.nn.Linear(idim, hidden_units)\n        self.w_2 = torch.nn.Linear(hidden_units, idim)\n        self.dropout = torch.nn.Dropout(dropout_rate)\n        self.activation = activation\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        return self.w_2(self.dropout(self.activation(self.w_1(x))))\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/repeat.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Repeat the same layer definition.\"\"\"\n\nimport torch\n\n\nclass MultiSequential(torch.nn.Sequential):\n    \"\"\"Multi-input multi-output torch.nn.Sequential.\"\"\"\n\n    def forward(self, *args):\n        \"\"\"Repeat.\"\"\"\n        for m in self:\n            args = m(*args)\n        return args\n\n\ndef repeat(N, fn):\n    \"\"\"Repeat module N times.\n\n    Args:\n        N (int): Number of repeat time.\n        fn (Callable): Function to generate module.\n\n    Returns:\n        MultiSequential: Repeated model instance.\n\n    \"\"\"\n    return MultiSequential(*[fn(n) for n in range(N)])\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/sgd_optimizer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Optimizer module.\"\"\"\n\nimport torch\n\n\nclass NoamOpt(object):\n    \"\"\"Optim wrapper that implements rate.\"\"\"\n\n    def __init__(self, model_size, factor, warmup, optimizer):\n        \"\"\"Construct an NoamOpt object.\"\"\"\n        self.optimizer = optimizer\n        self._step = 0\n        self.warmup = warmup\n        self.factor = factor\n        self.model_size = model_size\n        self._rate = 0\n\n    @property\n    def param_groups(self):\n        \"\"\"Return param_groups.\"\"\"\n        return self.optimizer.param_groups\n\n    def step(self):\n        \"\"\"Update parameters and rate.\"\"\"\n        self._step += 1\n        rate = self.rate()\n        #for p in self.optimizer.param_groups:\n        #    p[\"lr\"] = rate\n        self._rate = rate\n        self.optimizer.step()\n\n    def rate(self, step=None):\n        \"\"\"Implement `lrate` above.\"\"\"\n        if step is None:\n            step = self._step\n        return (\n            self.factor\n            * self.model_size ** (-0.5)\n            * min(step ** (-0.5), step * self.warmup ** (-1.5))\n        )\n\n    def zero_grad(self):\n        \"\"\"Reset gradient.\"\"\"\n        self.optimizer.zero_grad()\n\n    def state_dict(self):\n        \"\"\"Return state_dict.\"\"\"\n        return {\n            \"_step\": self._step,\n            \"warmup\": self.warmup,\n            \"factor\": self.factor,\n            \"model_size\": self.model_size,\n            \"_rate\": self._rate,\n            \"optimizer\": self.optimizer.state_dict(),\n        }\n\n    def load_state_dict(self, state_dict):\n        \"\"\"Load state_dict.\"\"\"\n        for key, value in state_dict.items():\n            if key == \"optimizer\":\n                self.optimizer.load_state_dict(state_dict[\"optimizer\"])\n            else:\n       
         setattr(self, key, value)\n\n\ndef get_sgd_opt(model_params, d_model, warmup, factor):\n    \"\"\"Get a plain SGD optimizer wrapped in NoamOpt.\n\n    Adam is then implemented by the global optimizer. The NoamOpt in this\n    module only tracks the Noam rate; it does not write it to the param groups.\n    \"\"\"\n    base = torch.optim.SGD(model_params, lr=1.0)  # No momentum\n    return NoamOpt(d_model, factor, warmup, base)\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/subsampling.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2019 Shigeki Karita\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Subsampling layer definition.\"\"\"\n\nimport torch\n\nfrom espnet.nets.pytorch_backend.transformer.embedding import PositionalEncoding\n\n\nclass TooShortUttError(Exception):\n    \"\"\"Raised when the utt is too short for subsampling.\n\n    Args:\n        message (str): Message for error catch\n        actual_size (int): the short size that cannot pass the subsampling\n        limit (int): the limit size for subsampling\n\n    \"\"\"\n\n    def __init__(self, message, actual_size, limit):\n        \"\"\"Construct a TooShortUttError for error handler.\"\"\"\n        super().__init__(message)\n        self.actual_size = actual_size\n        self.limit = limit\n\n\ndef check_short_utt(ins, size):\n    \"\"\"Check if the utterance is too short for subsampling.\"\"\"\n    if isinstance(ins, Conv2dSubsampling) and size < 7:\n        return True, 7\n    if isinstance(ins, Conv2dSubsampling6) and size < 11:\n        return True, 11\n    if isinstance(ins, Conv2dSubsampling8) and size < 15:\n        return True, 15\n    return False, -1\n\n\nclass Conv2dSubsampling(torch.nn.Module):\n    \"\"\"Convolutional 2D subsampling (to 1/4 length).\n\n    Args:\n        idim (int): Input dimension.\n        odim (int): Output dimension.\n        dropout_rate (float): Dropout rate.\n        pos_enc (torch.nn.Module): Custom position encoding layer.\n\n    \"\"\"\n\n    def __init__(self, idim, odim, dropout_rate, pos_enc=None):\n        \"\"\"Construct an Conv2dSubsampling object.\"\"\"\n        super(Conv2dSubsampling, self).__init__()\n        self.conv = torch.nn.Sequential(\n            torch.nn.Conv2d(1, odim, 3, 2),\n            torch.nn.ReLU(),\n            torch.nn.Conv2d(odim, odim, 3, 2),\n            torch.nn.ReLU(),\n        )\n        self.out = torch.nn.Sequential(\n            torch.nn.Linear(odim * 
(((idim - 1) // 2 - 1) // 2), odim),\n            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),\n        )\n\n    def forward(self, x, x_mask):\n        \"\"\"Subsample x.\n\n        Args:\n            x (torch.Tensor): Input tensor (#batch, time, idim).\n            x_mask (torch.Tensor): Input mask (#batch, 1, time).\n\n        Returns:\n            torch.Tensor: Subsampled tensor (#batch, time', odim),\n                where time' = time // 4.\n            torch.Tensor: Subsampled mask (#batch, 1, time'),\n                where time' = time // 4.\n\n        \"\"\"\n        x = x.unsqueeze(1)  # (b, c, t, f)\n        x = self.conv(x)\n        b, c, t, f = x.size()\n        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))\n        if x_mask is None:\n            return x, None\n        return x, x_mask[:, :, :-2:2][:, :, :-2:2]\n\n    def __getitem__(self, key):\n        \"\"\"Get item.\n\n        When reset_parameters() is called, if use_scaled_pos_enc is used,\n            return the positioning encoding.\n\n        \"\"\"\n        if key != -1:\n            raise NotImplementedError(\"Support only `-1` (for `reset_parameters`).\")\n        return self.out[key]\n\n\nclass Conv2dSubsampling6(torch.nn.Module):\n    \"\"\"Convolutional 2D subsampling (to 1/6 length).\n\n    Args:\n        idim (int): Input dimension.\n        odim (int): Output dimension.\n        dropout_rate (float): Dropout rate.\n        pos_enc (torch.nn.Module): Custom position encoding layer.\n\n    \"\"\"\n\n    def __init__(self, idim, odim, dropout_rate, pos_enc=None):\n        \"\"\"Construct an Conv2dSubsampling6 object.\"\"\"\n        super(Conv2dSubsampling6, self).__init__()\n        self.conv = torch.nn.Sequential(\n            torch.nn.Conv2d(1, odim, 3, 2),\n            torch.nn.ReLU(),\n            torch.nn.Conv2d(odim, odim, 5, 3),\n            torch.nn.ReLU(),\n        )\n        self.out = torch.nn.Sequential(\n            
torch.nn.Linear(odim * (((idim - 1) // 2 - 2) // 3), odim),\n            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),\n        )\n\n    def forward(self, x, x_mask):\n        \"\"\"Subsample x.\n\n        Args:\n            x (torch.Tensor): Input tensor (#batch, time, idim).\n            x_mask (torch.Tensor): Input mask (#batch, 1, time).\n\n        Returns:\n            torch.Tensor: Subsampled tensor (#batch, time', odim),\n                where time' = time // 6.\n            torch.Tensor: Subsampled mask (#batch, 1, time'),\n                where time' = time // 6.\n\n        \"\"\"\n        x = x.unsqueeze(1)  # (b, c, t, f)\n        x = self.conv(x)\n        b, c, t, f = x.size()\n        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))\n        if x_mask is None:\n            return x, None\n        return x, x_mask[:, :, :-2:2][:, :, :-4:3]\n\n\nclass Conv2dSubsampling8(torch.nn.Module):\n    \"\"\"Convolutional 2D subsampling (to 1/8 length).\n\n    Args:\n        idim (int): Input dimension.\n        odim (int): Output dimension.\n        dropout_rate (float): Dropout rate.\n        pos_enc (torch.nn.Module): Custom position encoding layer.\n\n    \"\"\"\n\n    def __init__(self, idim, odim, dropout_rate, pos_enc=None):\n        \"\"\"Construct an Conv2dSubsampling8 object.\"\"\"\n        super(Conv2dSubsampling8, self).__init__()\n        self.conv = torch.nn.Sequential(\n            torch.nn.Conv2d(1, odim, 3, 2),\n            torch.nn.ReLU(),\n            torch.nn.Conv2d(odim, odim, 3, 2),\n            torch.nn.ReLU(),\n            torch.nn.Conv2d(odim, odim, 3, 2),\n            torch.nn.ReLU(),\n        )\n        self.out = torch.nn.Sequential(\n            torch.nn.Linear(odim * ((((idim - 1) // 2 - 1) // 2 - 1) // 2), odim),\n            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),\n        )\n\n    def forward(self, x, x_mask):\n        \"\"\"Subsample x.\n\n        
Args:\n            x (torch.Tensor): Input tensor (#batch, time, idim).\n            x_mask (torch.Tensor): Input mask (#batch, 1, time).\n\n        Returns:\n            torch.Tensor: Subsampled tensor (#batch, time', odim),\n                where time' = time // 8.\n            torch.Tensor: Subsampled mask (#batch, 1, time'),\n                where time' = time // 8.\n\n        \"\"\"\n        x = x.unsqueeze(1)  # (b, c, t, f)\n        x = self.conv(x)\n        b, c, t, f = x.size()\n        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))\n        if x_mask is None:\n            return x, None\n        return x, x_mask[:, :, :-2:2][:, :, :-2:2][:, :, :-2:2]\n"
  },
  {
    "path": "nets/pytorch_backend/transformer/subsampling_without_posenc.py",
    "content": "# Copyright 2020 Emiru Tsunoo\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Subsampling layer definition.\"\"\"\n\nimport math\nimport torch\n\n\nclass Conv2dSubsamplingWOPosEnc(torch.nn.Module):\n    \"\"\"Convolutional 2D subsampling.\n\n    Args:\n        idim (int): Input dimension.\n        odim (int): Output dimension.\n        dropout_rate (float): Dropout rate.\n        kernels (list): kernel sizes\n        strides (list): stride sizes\n\n    \"\"\"\n\n    def __init__(self, idim, odim, dropout_rate, kernels, strides):\n        \"\"\"Construct an Conv2dSubsamplingWOPosEnc object.\"\"\"\n        assert len(kernels) == len(strides)\n        super().__init__()\n        conv = []\n        olen = idim\n        for i, (k, s) in enumerate(zip(kernels, strides)):\n            conv += [\n                torch.nn.Conv2d(1 if i == 0 else odim, odim, k, s),\n                torch.nn.ReLU(),\n            ]\n            olen = math.floor((olen - k) / s + 1)\n        self.conv = torch.nn.Sequential(*conv)\n        self.out = torch.nn.Linear(odim * olen, odim)\n        self.strides = strides\n        self.kernels = kernels\n\n    def forward(self, x, x_mask):\n        \"\"\"Subsample x.\n\n        Args:\n            x (torch.Tensor): Input tensor (#batch, time, idim).\n            x_mask (torch.Tensor): Input mask (#batch, 1, time).\n\n        Returns:\n            torch.Tensor: Subsampled tensor (#batch, time', odim),\n                where time' = time // 4.\n            torch.Tensor: Subsampled mask (#batch, 1, time'),\n                where time' = time // 4.\n\n        \"\"\"\n        x = x.unsqueeze(1)  # (b, c, t, f)\n        x = self.conv(x)\n        b, c, t, f = x.size()\n        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))\n        if x_mask is None:\n            return x, None\n        for k, s in zip(self.kernels, self.strides):\n            x_mask = x_mask[:, :, : -k + 1 : s]\n        return x, x_mask\n"
  },
  {
    "path": "nets/pytorch_backend/wavenet.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Copyright 2019 Tomoki Hayashi (Nagoya University)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"This code is based on https://github.com/kan-bayashi/PytorchWaveNetVocoder.\"\"\"\n\nimport logging\nimport sys\nimport time\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\n\nfrom torch import nn\n\n\ndef encode_mu_law(x, mu=256):\n    \"\"\"Perform mu-law encoding.\n\n    Args:\n        x (ndarray): Audio signal with the range from -1 to 1.\n        mu (int): Quantized level.\n\n    Returns:\n        ndarray: Quantized audio signal with the range from 0 to mu - 1.\n\n    \"\"\"\n    mu = mu - 1\n    fx = np.sign(x) * np.log(1 + mu * np.abs(x)) / np.log(1 + mu)\n    return np.floor((fx + 1) / 2 * mu + 0.5).astype(np.int64)\n\n\ndef decode_mu_law(y, mu=256):\n    \"\"\"Perform mu-law decoding.\n\n    Args:\n        y (ndarray): Quantized audio signal with the range from 0 to mu - 1.\n        mu (int): Quantized level.\n\n    Returns:\n        ndarray: Audio signal with the range from -1 to 1.\n\n    \"\"\"\n    mu = mu - 1\n    fx = (y - 0.5) / mu * 2 - 1\n    x = np.sign(fx) / mu * ((1 + mu) ** np.abs(fx) - 1)\n    return x\n\n\ndef initialize(m):\n    \"\"\"Initialize conv layers with xavier.\n\n    Args:\n        m (torch.nn.Module): Torch module.\n\n    \"\"\"\n    if isinstance(m, nn.Conv1d):\n        nn.init.xavier_uniform_(m.weight)\n        nn.init.constant_(m.bias, 0.0)\n\n    if isinstance(m, nn.ConvTranspose2d):\n        nn.init.constant_(m.weight, 1.0)\n        nn.init.constant_(m.bias, 0.0)\n\n\nclass OneHot(nn.Module):\n    \"\"\"Convert to one-hot vector.\n\n    Args:\n        depth (int): Dimension of one-hot vector.\n\n    \"\"\"\n\n    def __init__(self, depth):\n        super(OneHot, self).__init__()\n        self.depth = depth\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (LongTensor): long tensor 
variable with the shape  (B, T)\n\n        Returns:\n            Tensor: float tensor variable with the shape (B, depth, T)\n\n        \"\"\"\n        x = x % self.depth\n        x = torch.unsqueeze(x, 2)\n        x_onehot = x.new_zeros(x.size(0), x.size(1), self.depth).float()\n\n        return x_onehot.scatter_(2, x, 1)\n\n\nclass CausalConv1d(nn.Module):\n    \"\"\"1D dilated causal convolution.\"\"\"\n\n    def __init__(self, in_channels, out_channels, kernel_size, dilation=1, bias=True):\n        super(CausalConv1d, self).__init__()\n        self.in_channels = in_channels\n        self.out_channels = out_channels\n        self.kernel_size = kernel_size\n        self.dilation = dilation\n        self.padding = padding = (kernel_size - 1) * dilation\n        self.conv = nn.Conv1d(\n            in_channels,\n            out_channels,\n            kernel_size,\n            padding=padding,\n            dilation=dilation,\n            bias=bias,\n        )\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (Tensor): Input tensor with the shape (B, in_channels, T).\n\n        Returns:\n            Tensor: Tensor with the shape (B, out_channels, T)\n\n        \"\"\"\n        x = self.conv(x)\n        if self.padding != 0:\n            x = x[:, :, : -self.padding]\n        return x\n\n\nclass UpSampling(nn.Module):\n    \"\"\"Upsampling layer with deconvolution.\n\n    Args:\n        upsampling_factor (int): Upsampling factor.\n\n    \"\"\"\n\n    def __init__(self, upsampling_factor, bias=True):\n        super(UpSampling, self).__init__()\n        self.upsampling_factor = upsampling_factor\n        self.bias = bias\n        self.conv = nn.ConvTranspose2d(\n            1,\n            1,\n            kernel_size=(1, self.upsampling_factor),\n            stride=(1, self.upsampling_factor),\n            bias=self.bias,\n        )\n\n    def forward(self, x):\n        \"\"\"Calculate forward propagation.\n\n        
Args:\n            x (Tensor): Input tensor with the shape  (B, C, T)\n\n        Returns:\n            Tensor: Tensor with the shape (B, C, T') where T' = T * upsampling_factor.\n\n        \"\"\"\n        x = x.unsqueeze(1)  # B x 1 x C x T\n        x = self.conv(x)  # B x 1 x C x T'\n        return x.squeeze(1)\n\n\nclass WaveNet(nn.Module):\n    \"\"\"Conditional wavenet.\n\n    Args:\n        n_quantize (int): Number of quantization.\n        n_aux (int): Number of aux feature dimension.\n        n_resch (int): Number of filter channels for residual block.\n        n_skipch (int): Number of filter channels for skip connection.\n        dilation_depth (int): Number of dilation depth\n            (e.g. if set 10, max dilation = 2^(10-1)).\n        dilation_repeat (int): Number of dilation repeat.\n        kernel_size (int): Filter size of dilated causal convolution.\n        upsampling_factor (int): Upsampling factor.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        n_quantize=256,\n        n_aux=28,\n        n_resch=512,\n        n_skipch=256,\n        dilation_depth=10,\n        dilation_repeat=3,\n        kernel_size=2,\n        upsampling_factor=0,\n    ):\n        super(WaveNet, self).__init__()\n        self.n_aux = n_aux\n        self.n_quantize = n_quantize\n        self.n_resch = n_resch\n        self.n_skipch = n_skipch\n        self.kernel_size = kernel_size\n        self.dilation_depth = dilation_depth\n        self.dilation_repeat = dilation_repeat\n        self.upsampling_factor = upsampling_factor\n\n        self.dilations = [\n            2 ** i for i in range(self.dilation_depth)\n        ] * self.dilation_repeat\n        self.receptive_field = (self.kernel_size - 1) * sum(self.dilations) + 1\n\n        # for preprocessing\n        self.onehot = OneHot(self.n_quantize)\n        self.causal = CausalConv1d(self.n_quantize, self.n_resch, self.kernel_size)\n        if self.upsampling_factor > 0:\n            self.upsampling = 
UpSampling(self.upsampling_factor)\n\n        # for residual blocks\n        self.dil_sigmoid = nn.ModuleList()\n        self.dil_tanh = nn.ModuleList()\n        self.aux_1x1_sigmoid = nn.ModuleList()\n        self.aux_1x1_tanh = nn.ModuleList()\n        self.skip_1x1 = nn.ModuleList()\n        self.res_1x1 = nn.ModuleList()\n        for d in self.dilations:\n            self.dil_sigmoid += [\n                CausalConv1d(self.n_resch, self.n_resch, self.kernel_size, d)\n            ]\n            self.dil_tanh += [\n                CausalConv1d(self.n_resch, self.n_resch, self.kernel_size, d)\n            ]\n            self.aux_1x1_sigmoid += [nn.Conv1d(self.n_aux, self.n_resch, 1)]\n            self.aux_1x1_tanh += [nn.Conv1d(self.n_aux, self.n_resch, 1)]\n            self.skip_1x1 += [nn.Conv1d(self.n_resch, self.n_skipch, 1)]\n            self.res_1x1 += [nn.Conv1d(self.n_resch, self.n_resch, 1)]\n\n        # for postprocessing\n        self.conv_post_1 = nn.Conv1d(self.n_skipch, self.n_skipch, 1)\n        self.conv_post_2 = nn.Conv1d(self.n_skipch, self.n_quantize, 1)\n\n    def forward(self, x, h):\n        \"\"\"Calculate forward propagation.\n\n        Args:\n            x (LongTensor): Quantized input waveform tensor with the shape  (B, T).\n            h (Tensor): Auxiliary feature tensor with the shape  (B, n_aux, T).\n\n        Returns:\n            Tensor: Logits with the shape (B, T, n_quantize).\n\n        \"\"\"\n        # preprocess\n        output = self._preprocess(x)\n        if self.upsampling_factor > 0:\n            h = self.upsampling(h)\n\n        # residual block\n        skip_connections = []\n        for i in range(len(self.dilations)):\n            output, skip = self._residual_forward(\n                output,\n                h,\n                self.dil_sigmoid[i],\n                self.dil_tanh[i],\n                self.aux_1x1_sigmoid[i],\n                self.aux_1x1_tanh[i],\n                self.skip_1x1[i],\n                
self.res_1x1[i],\n            )\n            skip_connections.append(skip)\n\n        # skip-connection part\n        output = sum(skip_connections)\n        output = self._postprocess(output)\n\n        return output\n\n    def generate(self, x, h, n_samples, interval=None, mode=\"sampling\"):\n        \"\"\"Generate a waveform with fast generation algorithm.\n\n        This generation is based on `Fast WaveNet Generation Algorithm`_.\n\n        Args:\n            x (LongTensor): Initial waveform tensor with the shape  (T,).\n            h (Tensor): Auxiliary feature tensor with the shape  (n_samples + T, n_aux).\n            n_samples (int): Number of samples to be generated.\n            interval (int, optional): Log interval.\n            mode (str, optional): \"sampling\" or \"argmax\".\n\n        Return:\n            ndarray: Generated quantized waveform (n_samples).\n\n        .. _`Fast WaveNet Generation Algorithm`: https://arxiv.org/abs/1611.09482\n\n        \"\"\"\n        # reshape inputs\n        assert len(x.shape) == 1\n        assert len(h.shape) == 2 and h.shape[1] == self.n_aux\n        x = x.unsqueeze(0)\n        h = h.transpose(0, 1).unsqueeze(0)\n\n        # perform upsampling\n        if self.upsampling_factor > 0:\n            h = self.upsampling(h)\n\n        # padding for shortage\n        if n_samples > h.shape[2]:\n            h = F.pad(h, (0, n_samples - h.shape[2]), \"replicate\")\n\n        # padding if the length is less than the receptive field\n        n_pad = self.receptive_field - x.size(1)\n        if n_pad > 0:\n            x = F.pad(x, (n_pad, 0), \"constant\", self.n_quantize // 2)\n            h = F.pad(h, (n_pad, 0), \"replicate\")\n\n        # prepare buffer\n        output = self._preprocess(x)\n        h_ = h[:, :, : x.size(1)]\n        output_buffer = []\n        buffer_size = []\n        for i, d in enumerate(self.dilations):\n            output, _ = self._residual_forward(\n                output,\n                h_,\n                
self.dil_sigmoid[i],\n                self.dil_tanh[i],\n                self.aux_1x1_sigmoid[i],\n                self.aux_1x1_tanh[i],\n                self.skip_1x1[i],\n                self.res_1x1[i],\n            )\n            if d == 2 ** (self.dilation_depth - 1):\n                buffer_size.append(self.kernel_size - 1)\n            else:\n                buffer_size.append(d * 2 * (self.kernel_size - 1))\n            output_buffer.append(output[:, :, -buffer_size[i] - 1 : -1])\n\n        # generate\n        samples = x[0]\n        start_time = time.time()\n        for i in range(n_samples):\n            output = samples[-self.kernel_size * 2 + 1 :].unsqueeze(0)\n            output = self._preprocess(output)\n            h_ = h[:, :, samples.size(0) - 1].contiguous().view(1, self.n_aux, 1)\n            output_buffer_next = []\n            skip_connections = []\n            for j, d in enumerate(self.dilations):\n                output, skip = self._generate_residual_forward(\n                    output,\n                    h_,\n                    self.dil_sigmoid[j],\n                    self.dil_tanh[j],\n                    self.aux_1x1_sigmoid[j],\n                    self.aux_1x1_tanh[j],\n                    self.skip_1x1[j],\n                    self.res_1x1[j],\n                )\n                output = torch.cat([output_buffer[j], output], dim=2)\n                output_buffer_next.append(output[:, :, -buffer_size[j] :])\n                skip_connections.append(skip)\n\n            # update buffer\n            output_buffer = output_buffer_next\n\n            # get predicted sample\n            output = sum(skip_connections)\n            output = self._postprocess(output)[0]\n            if mode == \"sampling\":\n                posterior = F.softmax(output[-1], dim=0)\n                dist = torch.distributions.Categorical(posterior)\n                sample = dist.sample().unsqueeze(0)\n            elif mode == \"argmax\":\n                
sample = output.argmax(-1)\n            else:\n                logging.error(\"mode should be sampling or argmax\")\n                sys.exit(1)\n            samples = torch.cat([samples, sample], dim=0)\n\n            # show progress\n            if interval is not None and (i + 1) % interval == 0:\n                elapsed_time_per_sample = (time.time() - start_time) / interval\n                logging.info(\n                    \"%d/%d estimated time = %.3f sec (%.3f sec / sample)\"\n                    % (\n                        i + 1,\n                        n_samples,\n                        (n_samples - i - 1) * elapsed_time_per_sample,\n                        elapsed_time_per_sample,\n                    )\n                )\n                start_time = time.time()\n\n        return samples[-n_samples:].cpu().numpy()\n\n    def _preprocess(self, x):\n        x = self.onehot(x).transpose(1, 2)\n        output = self.causal(x)\n        return output\n\n    def _postprocess(self, x):\n        output = F.relu(x)\n        output = self.conv_post_1(output)\n        output = F.relu(output)  # B x C x T\n        output = self.conv_post_2(output).transpose(1, 2)  # B x T x C\n        return output\n\n    def _residual_forward(\n        self,\n        x,\n        h,\n        dil_sigmoid,\n        dil_tanh,\n        aux_1x1_sigmoid,\n        aux_1x1_tanh,\n        skip_1x1,\n        res_1x1,\n    ):\n        output_sigmoid = dil_sigmoid(x)\n        output_tanh = dil_tanh(x)\n        aux_output_sigmoid = aux_1x1_sigmoid(h)\n        aux_output_tanh = aux_1x1_tanh(h)\n        output = torch.sigmoid(output_sigmoid + aux_output_sigmoid) * torch.tanh(\n            output_tanh + aux_output_tanh\n        )\n        skip = skip_1x1(output)\n        output = res_1x1(output)\n        output = output + x\n        return output, skip\n\n    def _generate_residual_forward(\n        self,\n        x,\n        h,\n        dil_sigmoid,\n        dil_tanh,\n        
aux_1x1_sigmoid,\n        aux_1x1_tanh,\n        skip_1x1,\n        res_1x1,\n    ):\n        output_sigmoid = dil_sigmoid(x)[:, :, -1:]\n        output_tanh = dil_tanh(x)[:, :, -1:]\n        aux_output_sigmoid = aux_1x1_sigmoid(h)\n        aux_output_tanh = aux_1x1_tanh(h)\n        output = torch.sigmoid(output_sigmoid + aux_output_sigmoid) * torch.tanh(\n            output_tanh + aux_output_tanh\n        )\n        skip = skip_1x1(output)\n        output = res_1x1(output)\n        output = output + x[:, :, -1:]  # B x C x 1\n        return output, skip\n"
  },
  {
    "path": "nets/scorer_interface.py",
    "content": "\"\"\"Scorer interface module.\"\"\"\n\nfrom typing import Any\nfrom typing import List\nfrom typing import Tuple\n\nimport torch\nimport warnings\n\n\nclass ScorerInterface:\n    \"\"\"Scorer interface for beam search.\n\n    The scorer performs scoring of the all tokens in vocabulary.\n\n    Examples:\n        * Search heuristics\n            * :class:`espnet.nets.scorers.length_bonus.LengthBonus`\n        * Decoder networks of the sequence-to-sequence models\n            * :class:`espnet.nets.pytorch_backend.nets.transformer.decoder.Decoder`\n            * :class:`espnet.nets.pytorch_backend.nets.rnn.decoders.Decoder`\n        * Neural language models\n            * :class:`espnet.nets.pytorch_backend.lm.transformer.TransformerLM`\n            * :class:`espnet.nets.pytorch_backend.lm.default.DefaultRNNLM`\n            * :class:`espnet.nets.pytorch_backend.lm.seq_rnn.SequentialRNNLM`\n\n    \"\"\"\n\n    def init_state(self, x: torch.Tensor) -> Any:\n        \"\"\"Get an initial state for decoding (optional).\n\n        Args:\n            x (torch.Tensor): The encoded feature tensor\n\n        Returns: initial state\n\n        \"\"\"\n        return None\n\n    def select_state(self, state: Any, i: int, new_id: int = None) -> Any:\n        \"\"\"Select state with relative ids in the main beam search.\n\n        Args:\n            state: Decoder state for prefix tokens\n            i (int): Index to select a state in the main beam search\n            new_id (int): New label index to select a state if necessary\n\n        Returns:\n            state: pruned state\n\n        \"\"\"\n        return None if state is None else state[i]\n\n    def score(\n        self, y: torch.Tensor, state: Any, x: torch.Tensor\n    ) -> Tuple[torch.Tensor, Any]:\n        \"\"\"Score new token (required).\n\n        Args:\n            y (torch.Tensor): 1D torch.int64 prefix tokens.\n            state: Scorer state for prefix tokens\n            x (torch.Tensor): The 
encoder feature that generates ys.\n\n        Returns:\n            tuple[torch.Tensor, Any]: Tuple of\n                scores for next token that has a shape of `(n_vocab)`\n                and next state for ys\n\n        \"\"\"\n        raise NotImplementedError\n\n    def final_score(self, state: Any) -> float:\n        \"\"\"Score eos (optional).\n\n        Args:\n            state: Scorer state for prefix tokens\n\n        Returns:\n            float: final score\n\n        \"\"\"\n        return 0.0\n\n\nclass BatchScorerInterface(ScorerInterface):\n    \"\"\"Batch scorer interface.\"\"\"\n\n    def batch_init_state(self, x: torch.Tensor) -> Any:\n        \"\"\"Get an initial state for decoding (optional).\n\n        Args:\n            x (torch.Tensor): The encoded feature tensor\n\n        Returns: initial state\n\n        \"\"\"\n        return self.init_state(x)\n\n    def batch_score(\n        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor\n    ) -> Tuple[torch.Tensor, List[Any]]:\n        \"\"\"Score new token batch (required).\n\n        Args:\n            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).\n            states (List[Any]): Scorer states for prefix tokens.\n            xs (torch.Tensor):\n                The encoder feature that generates ys (n_batch, xlen, n_feat).\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        warnings.warn(\n            \"{} batch score is implemented through for loop not parallelized\".format(\n                self.__class__.__name__\n            )\n        )\n        scores = list()\n        outstates = list()\n        for i, (y, state, x) in enumerate(zip(ys, states, xs)):\n            score, outstate = self.score(y, state, x)\n            outstates.append(outstate)\n            scores.append(score)\n        
scores = torch.cat(scores, 0).view(ys.shape[0], -1)\n        return scores, outstates\n\n\nclass PartialScorerInterface(ScorerInterface):\n    \"\"\"Partial scorer interface for beam search.\n\n    The partial scorer performs scoring after the non-partial scorers have finished\n    scoring, and receives pre-pruned next tokens to score because scoring all the\n    tokens would be too expensive.\n\n    Examples:\n         * Prefix search for connectionist-temporal-classification models\n             * :class:`espnet.nets.scorers.ctc.CTCPrefixScorer`\n\n    \"\"\"\n\n    def score_partial(\n        self, y: torch.Tensor, next_tokens: torch.Tensor, state: Any, x: torch.Tensor\n    ) -> Tuple[torch.Tensor, Any]:\n        \"\"\"Score new token (required).\n\n        Args:\n            y (torch.Tensor): 1D prefix token\n            next_tokens (torch.Tensor): torch.int64 next token to score\n            state: decoder state for prefix tokens\n            x (torch.Tensor): The encoder feature that generates ys\n\n        Returns:\n            tuple[torch.Tensor, Any]:\n                Tuple of a score tensor for y that has a shape `(len(next_tokens),)`\n                and next state for ys\n\n        \"\"\"\n        raise NotImplementedError\n\n\nclass BatchPartialScorerInterface(BatchScorerInterface, PartialScorerInterface):\n    \"\"\"Batch partial scorer interface for beam search.\"\"\"\n\n    def batch_score_partial(\n        self,\n        ys: torch.Tensor,\n        next_tokens: torch.Tensor,\n        states: List[Any],\n        xs: torch.Tensor,\n    ) -> Tuple[torch.Tensor, Any]:\n        \"\"\"Score new token (required).\n\n        Args:\n            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).\n            next_tokens (torch.Tensor): torch.int64 tokens to score (n_batch, n_token).\n            states (List[Any]): Scorer states for prefix tokens.\n            xs (torch.Tensor):\n                The encoder feature that generates ys (n_batch, xlen, n_feat).\n\n     
   Returns:\n            tuple[torch.Tensor, Any]:\n                Tuple of a score tensor for ys that has a shape `(n_batch, n_vocab)`\n                and next states for ys\n        \"\"\"\n        raise NotImplementedError\n"
  },
  {
    "path": "nets/scorers/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "nets/scorers/_mmi_utils.py",
    "content": "# Author: Jinchuan Tian ; Jan 2022\n# jinchuantian@stu.pku.edu.cn\n\n# We test our code on k2 version 1.2; other versions may encounter problems due to API change.\n# This file contains the MMI-related utility functions:\n# 1. The (equivalent implementation of) step composition between the training / decoding graph;\n# 2. The Lattice generation process with look-ahead mechanism.\n\nfrom typing import List\nfrom typing import Optional\nfrom typing import Tuple\n\nimport torch\nimport k2\nimport _k2\n\nfrom k2 import Fsa, DenseFsaVec \n\n\"\"\"\nIntersection function without autograd.\n\n(1) We write this function since the arc_map_a is not accessible in k2 API\n(2) Currently we are not using the pruned version to keep all paths.\n    We will try to find a balance between the speed and the precision later.\n\"\"\"\ndef intersect_dense_forward(a_fsas: Fsa,\n                           b_fsas: DenseFsaVec,\n                           search_beam: float,\n                           output_beam: float,\n                           prune: bool,\n                           min_active_states: int,\n                           max_active_states: int,\n                           seqframe_idx_name: Optional[str] = None,\n                           frame_idx_name: Optional[str] = None): \n\n    out_fsa = [0]\n\n    if prune:\n        ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense_pruned(\n            a_fsas=a_fsas.arcs,\n            b_fsas=b_fsas.dense_fsa_vec,\n            search_beam=search_beam,\n            output_beam=output_beam,\n            min_active_states=min_active_states,\n            max_active_states=max_active_states)\n    else:\n        ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(\n            a_fsas=a_fsas.arcs,\n            b_fsas=b_fsas.dense_fsa_vec,\n            a_to_b_map=None,\n            output_beam=output_beam)\n\n    out_fsa[0] = Fsa(ragged_arc)\n\n    seqframe_idx = None\n    if frame_idx_name is not None:\n        
num_cols = b_fsas.dense_fsa_vec.scores_dim1()\n        seqframe_idx = arc_map_b // num_cols\n        shape = b_fsas.dense_fsa_vec.shape()\n        fsa_idx0 = _k2.index_select(shape.row_ids(1), seqframe_idx)\n        frame_idx = seqframe_idx - _k2.index_select(\n            shape.row_splits(1), fsa_idx0)\n        assert not hasattr(out_fsa[0], frame_idx_name)\n        setattr(out_fsa[0], frame_idx_name, frame_idx)\n\n    if seqframe_idx_name is not None:\n        if seqframe_idx is None:\n            num_cols = b_fsas.dense_fsa_vec.scores_dim1()\n            seqframe_idx = arc_map_b // num_cols\n\n        assert not hasattr(out_fsa[0], seqframe_idx_name)\n        setattr(out_fsa[0], seqframe_idx_name, seqframe_idx)\n\n    return out_fsa[0], arc_map_a, arc_map_b\n\n# Recover the frame-level probability so we avoid using a loop to \n# do the intersection for each frame\n# TODO: support batch trace \ndef step_trace(out_fsas, a_fsas, arc_map_a):\n    assert out_fsas.shape[0] == a_fsas.shape[0]\n    num_fsa = a_fsas.shape[0]\n\n    # K2 FsaVec Meta-info: num_state; 0; \n    # state_accumulated_counts (row_splits1); arc_accumulated_counts (row_splits12)\n    \n    # 1.1 Find all a_fsas arcs and meta-info\n    a_fsa_dict = a_fsas.as_dict()\n    a_fsa_meta = a_fsa_dict[\"arcs\"][: 2 * num_fsa + 4]\n    a_fsa_arcs = a_fsa_dict[\"arcs\"][2 * num_fsa + 4:].view(-1, 4) # exclude meta-info\n\n    # 1.2 Assign global state-ids\n    for i in range(num_fsa):\n        a_fsa_arcs[a_fsa_meta[i+num_fsa+3]: a_fsa_meta[i+num_fsa+4]][:, :2] += a_fsa_meta[i + 2]\n\n    # 1.3 Find all ending states and their scores\n    a_fsa_ending_mask = a_fsa_arcs[:, 2] == -1\n    a_ending_states = torch.masked_select(a_fsa_arcs[:, 0], a_fsa_ending_mask)\n    a_ending_scores = torch.masked_select(a_fsas.scores, a_fsa_ending_mask)\n\n    # 2.1 Find all out_fsas arcs and sort by entering states \n    out_fsa_dict = out_fsas.as_dict()\n    out_fsa_meta = out_fsa_dict[\"arcs\"][:2 * num_fsa + 4]\n    
out_fsa_arcs = out_fsa_dict[\"arcs\"][2 * num_fsa + 4:].view(-1, 4)\n    out_incoming_ragged = out_fsas._get_incoming_arcs()\n\n    # 2.2 For each state, find an arc entering it\n    # We actually do not need arcs in out_fsas but need those in a_fsas. -> select arc_map\n    transform_index = out_incoming_ragged.values().long()\n    select_index = torch.unique_consecutive(out_incoming_ragged.row_splits(2))[:-1].long()\n   \n    arc_map_a_uniq = arc_map_a[transform_index][select_index]\n    frame_idx = out_fsas.frame_idx[transform_index][select_index]\n\n    # 2.3 Find all corresponding arcs in a_fsas and their entering states\n    a_fsa_arcs_uniq = a_fsa_arcs[arc_map_a_uniq.long()]\n    a_states_uniq = a_fsa_arcs_uniq[:, 1]\n\n    # 3.1 Find the forward scores and remove scores on starting states\n    raw_state_scores = out_fsas._get_forward_scores(True, True)\n    raw_state_scores_ = []\n    for i in range(num_fsa):\n        s, e = out_fsa_meta[2 + i], out_fsa_meta[3 + i] \n        raw_state_scores_.append(raw_state_scores[s + 1: e])\n    raw_state_scores = torch.cat(raw_state_scores_, dim=0)\n \n    # 3.2 Add ending state scores to the raw state_scores \n    # if the final state is reachable. 
Else set to -inf\n    state_scores = torch.ones_like(raw_state_scores) * - float('inf')\n    for state, score in zip(a_ending_states, a_ending_scores):\n        state_scores = torch.where(a_states_uniq==state, \n                                   raw_state_scores + score, \n                                   state_scores)\n    \n    # 3.3 Allocate scores on each frame and each Fsa\n    frame_ids, counts = torch.unique_consecutive(frame_idx, return_counts=True)\n    score_sequences, start = [], 0\n    score_sequence = []\n    for i, (fid, fc) in enumerate(zip(frame_ids.tolist(), counts.tolist())):\n        frame_score = torch.logsumexp(state_scores[start: start+fc], dim=0)        \n        score_sequence.append(frame_score)\n        start += fc\n\n        if i == len(counts) - 1 or fid > frame_ids[i+1]:\n            score_sequences.append(torch.stack(score_sequence, dim=0)[:-1])\n            score_sequence = []\n            \n    return score_sequences\n\n\"\"\"\nStep intersection implementation\n\nInput:\nfsa, FsaVec, training graph like CTC, MMI. Need duplication.\ndense_fsa_vec, DenseFsaVec, created from nnet_output and the corresponding length in t-axis.\nprune: bool, If true, use a pruned version of intersection.\nsearch_beam: float, parameter used in pruned intersection only.\noutput_beam: float, parameter used in intersection.\nmin_active_states: int, parameter used in pruned intersection only.\nmax_active_states: int, parameter used in pruned intersection only.\n\nOutput: \nscore_sequences: List of 1-D tensors. The number of tensors is equal to the number of Fsas in `fsa`.\n                 Each tensor has length of T where T is the number of effective frames in nnet_output.\n                 The t-th element represents the `tot_score` of the intersected Fsa between the input `fsa` \n                 and the first t frames.\n\nThis implementation is much faster than using a loop for T times, as the intersection is performed only once\nfor each Fsa. 
The sequence is recovered from the generated Fsa and the arc_map_a.\n\"\"\"\ndef step_intersect(fsa, \n                   dense_fsa_vec, \n                   prune=False, \n                   search_beam=100, \n                   output_beam=100,\n                   min_active_states=30,\n                   max_active_states=50000):\n    \n    out_fsa, arc_map_a, arc_map_b = intersect_dense_forward(\n      a_fsas = fsa,\n      b_fsas = dense_fsa_vec,\n      search_beam = search_beam,\n      output_beam = output_beam,\n      prune = prune,\n      min_active_states = min_active_states,\n      max_active_states = max_active_states,\n      seqframe_idx_name = \"seqframe_idx\",\n      frame_idx_name = \"frame_idx\"\n    )\n\n    return step_trace(out_fsa, fsa, arc_map_a) \n\ndef step_intersect_test(): \n    from pathlib import Path\n    lang=Path(\"data/lang_phone\")\n    device = torch.device(\"cpu\")\n    \n    # import for test only\n    from espnet.nets.scorer_interface import PartialScorerInterface\n    from snowfall.training.mmi_graph import MmiTrainingGraphCompiler\n    from snowfall.lexicon import Lexicon\n    from snowfall.training.mmi_graph import create_bigram_phone_lm\n\n    lexicon = Lexicon(lang)\n    oov = open(lang / 'oov.txt').read().strip()\n    graph_compiler = MmiTrainingGraphCompiler(lexicon, device, oov)\n    phone_ids = lexicon.phone_symbols()\n\n    torch.manual_seed(888)\n    P = create_bigram_phone_lm(phone_ids)\n    P.scores = torch.randn_like(P.scores)\n\n    # texts = [\"你好\", \"再见\"]\n    texts = [\"中华人民共和国万岁\", \"世界人民大团结万岁\"]\n    num, den = graph_compiler.compile(texts, P, replicate_den=True)\n    graph = num \n \n    T = 100\n    beam_size = len(texts)\n    odim = len(phone_ids) + 1\n    nnet_output = torch.rand([beam_size, T, odim])\n\n    supervision = torch.stack([\n                          torch.arange(beam_size),\n                          torch.zeros(beam_size),\n                          torch.ones(beam_size) * T,\n               
           ], dim=-1).cpu().int()   \n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision) \n\n    score_sequences = step_intersect(graph, \n                                    dense_fsa_vec,\n                                    prune=False,\n                                    search_beam=30,\n                                    output_beam=20,\n                                    min_active_states=30,\n                                    max_active_states=100000) \n\n    print(\"####  old method ###\")\n    buf = []\n    for t in range(1, T+1):\n        supervision = torch.stack([\n                          torch.arange(beam_size),\n                          torch.zeros(beam_size),\n                          torch.ones(beam_size) * t,\n                          ], dim=-1).cpu().int()\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n        num_lats = k2.intersect_dense(graph, dense_fsa_vec, output_beam=30.0)\n        num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        buf.append(num_tot_scores)\n\n    buf = torch.stack(buf, dim=1)\n    score_sequences = torch.stack(score_sequences, dim=0)\n    print(buf - score_sequences)\n \nif __name__ == \"__main__\":\n    step_intersect_test() \n"
  },
  {
    "path": "nets/scorers/ctc.py",
    "content": "\"\"\"ScorerInterface implementation for CTC.\"\"\"\n\nimport numpy as np\nimport torch\n\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScore\nfrom espnet.nets.ctc_prefix_score import CTCPrefixScoreTH\nfrom espnet.nets.scorer_interface import BatchPartialScorerInterface\n\n\nclass CTCPrefixScorer(BatchPartialScorerInterface):\n    \"\"\"Decoder interface wrapper for CTCPrefixScore.\"\"\"\n\n    def __init__(self, ctc: torch.nn.Module, eos: int):\n        \"\"\"Initialize class.\n\n        Args:\n            ctc (torch.nn.Module): The CTC implementation.\n                For example, :class:`espnet.nets.pytorch_backend.ctc.CTC`\n            eos (int): The end-of-sequence id.\n\n        \"\"\"\n        self.ctc = ctc\n        self.eos = eos\n        self.impl = None\n\n    def init_state(self, x: torch.Tensor):\n        \"\"\"Get an initial state for decoding.\n\n        Args:\n            x (torch.Tensor): The encoded feature tensor\n\n        Returns: initial state\n\n        \"\"\"\n        logp = self.ctc.log_softmax(x.unsqueeze(0)).detach().squeeze(0).cpu().numpy()\n        # TODO(karita): use CTCPrefixScoreTH\n        self.impl = CTCPrefixScore(logp, 0, self.eos, np)\n        return 0, self.impl.initial_state()\n\n    def select_state(self, state, i, new_id=None):\n        \"\"\"Select state with relative ids in the main beam search.\n\n        Args:\n            state: Decoder state for prefix tokens\n            i (int): Index to select a state in the main beam search\n            new_id (int): New label id to select a state if necessary\n\n        Returns:\n            state: pruned state\n\n        \"\"\"\n        if type(state) == tuple:\n            if len(state) == 2:  # for CTCPrefixScore\n                sc, st = state\n                return sc[i], st[i]\n            else:  # for CTCPrefixScoreTH (need new_id > 0)\n                r, log_psi, f_min, f_max, scoring_idmap = state\n                s = log_psi[i, 
new_id].expand(log_psi.size(1))\n                if scoring_idmap is not None:\n                    return r[:, :, i, scoring_idmap[i, new_id]], s, f_min, f_max\n                else:\n                    return r[:, :, i, new_id], s, f_min, f_max\n        return None if state is None else state[i]\n\n    def score_partial(self, y, ids, state, x):\n        \"\"\"Score new token.\n\n        Args:\n            y (torch.Tensor): 1D prefix token\n            ids (torch.Tensor): torch.int64 next token to score\n            state: decoder state for prefix tokens\n            x (torch.Tensor): 2D encoder feature that generates ys\n\n        Returns:\n            tuple[torch.Tensor, Any]:\n                Tuple of a score tensor for y that has a shape `(len(ids),)`\n                and next state for ys\n\n        \"\"\"\n        prev_score, state = state\n        presub_score, new_st = self.impl(y.cpu(), ids.cpu(), state)\n        tscore = torch.as_tensor(\n            presub_score - prev_score, device=x.device, dtype=x.dtype\n        )\n        return tscore, (presub_score, new_st)\n\n    def batch_init_state(self, x: torch.Tensor):\n        \"\"\"Get an initial state for decoding.\n\n        Args:\n            x (torch.Tensor): The encoded feature tensor\n\n        Returns: initial state\n\n        \"\"\"\n        logp = self.ctc.log_softmax(x.unsqueeze(0))  # assuming batch_size = 1\n        xlen = torch.tensor([logp.size(1)])\n        self.impl = CTCPrefixScoreTH(logp, xlen, 0, self.eos)\n        return None\n\n    def batch_score_partial(self, y, ids, state, x):\n        \"\"\"Score new token.\n\n        Args:\n            y (torch.Tensor): 1D prefix token\n            ids (torch.Tensor): torch.int64 next token to score\n            state: decoder state for prefix tokens\n            x (torch.Tensor): 2D encoder feature that generates ys\n\n        Returns:\n            tuple[torch.Tensor, Any]:\n                Tuple of a score tensor for y that has 
a shape `(len(next_tokens),)`\n                and next state for ys\n\n        \"\"\"\n        batch_state = (\n            (\n                torch.stack([s[0] for s in state], dim=2),\n                torch.stack([s[1] for s in state]),\n                state[0][2],\n                state[0][3],\n            )\n            if state[0] is not None\n            else None\n        )\n        return self.impl(y, batch_state, ids)\n\n    def extend_prob(self, x: torch.Tensor):\n        \"\"\"Extend probs for decoding.\n\n        This extension is for streaming decoding\n        as in Eq (14) in https://arxiv.org/abs/2006.14941\n\n        Args:\n            x (torch.Tensor): The encoded feature tensor\n\n        \"\"\"\n        logp = self.ctc.log_softmax(x.unsqueeze(0))\n        self.impl.extend_prob(logp)\n\n    def extend_state(self, state):\n        \"\"\"Extend state for decoding.\n\n        This extension is for streaming decoding\n        as in Eq (14) in https://arxiv.org/abs/2006.14941\n\n        Args:\n            state: The states of hyps\n\n        Returns: extended state\n\n        \"\"\"\n        new_state = []\n        for s in state:\n            new_state.append(self.impl.extend_state(s))\n\n        return new_state\n"
  },
  {
    "path": "nets/scorers/ctc_rnnt_scorer.py",
    "content": "import k2\nimport torch\nimport math\nfrom snowfall.training.ctc_graph import CtcTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\n\n\nclass CTCRNNTScorer():\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.device = device\n\n        # compiler\n        self.lang = lang\n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = CtcTrainingGraphCompiler(\n                              L_inv=self.lexicon.L_inv,\n                              phones=self.lexicon.phones,\n                              words=self.lexicon.words\n                              )\n\n        # linear\n        self.phone_ids = self.lexicon.phone_symbols()\n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.load_weight(rank)\n\n        self.char_list = char_list\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"ctc_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def den_scores(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n        nnet_output = torch.nn.functional.log_softmax(nnet_output, dim=-1)\n\n        # None is to be compatible with MMIRNNTScorer\n        return nnet_output, None \n\n    def batch_score(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        print(\"tu_sum: \", tu_sum)\n\n        batch = len(A)\n        if batch == 0:\n            return A\n\n        # (1) supervision\n        # +1 since frame start with 0; +1 since redundant blank\n        # the supervision must be descending order.\n        ts = [tu_sum - len(h.yseq) + 2 for h in A]\n        ts = torch.Tensor(ts).long()\n        supervision = 
torch.stack([torch.arange(batch),\n                                   torch.zeros(batch),\n                                   ts\n                                  ], dim=1).to(torch.int32) \n        indices = torch.argsort(supervision[:, 2], descending=True)\n        supervision = supervision[indices]\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                       supervision)\n\n        # (2) texts\n        texts = [h.yseq[1:] for h in A] # exclude starting <sos>\n        texts = [\" \".join([self.char_list[x] for x in text]) for text in texts] # need modification for BPE\n        texts = [texts[idx] for idx in indices] # reorder\n        graphs = self.graph_compiler.compile(texts)\n\n        # (3) intersection.  \n        lats = k2.intersect_dense(graphs, dense_fsa_vec, output_beam=10.0)\n        tot_scores = lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        tot_scores = torch.where(tot_scores == -math.inf, 0.0, tot_scores)\n        \n        # (4) assign and post-process\n        # Question: How to deal with the hypothesis with empty yseq\n        idx_to_empty_str = [j for j, x in enumerate(texts) if x == \"\"]\n        for j in idx_to_empty_str:\n            tot_scores[j] = 0.0\n\n        for j in range(batch):\n            h = A[indices[j]]\n\n            # step_score = (tot_scores[j] - h.mmi_tot_score)*mmi_weight\n            h.score += (tot_scores[j].item() - h.mmi_tot_score) * mmi_weight\n            h.mmi_tot_score = tot_scores[j].item()\n            # print(f\"idx: {indices[j]} | Hypothesis: {texts[j]} | CTC Score: {h.mmi_tot_score} | Tot Score: {h.score} | CTC step Score: {step_score}\", flush=True)\n        \n        return A\n         \n"
  },
  {
    "path": "nets/scorers/length_bonus.py",
    "content": "\"\"\"Length bonus module.\"\"\"\nfrom typing import Any\nfrom typing import List\nfrom typing import Tuple\n\nimport torch\n\nfrom espnet.nets.scorer_interface import BatchScorerInterface\n\n\nclass LengthBonus(BatchScorerInterface):\n    \"\"\"Length bonus in beam search.\"\"\"\n\n    def __init__(self, n_vocab: int):\n        \"\"\"Initialize class.\n\n        Args:\n            n_vocab (int): The number of tokens in vocabulary for beam search\n\n        \"\"\"\n        self.n = n_vocab\n\n    def score(self, y, state, x):\n        \"\"\"Score new token.\n\n        Args:\n            y (torch.Tensor): 1D torch.int64 prefix tokens.\n            state: Scorer state for prefix tokens\n            x (torch.Tensor): 2D encoder feature that generates ys.\n\n        Returns:\n            tuple[torch.Tensor, Any]: Tuple of\n                torch.float32 scores for next token (n_vocab)\n                and None\n\n        \"\"\"\n        return torch.tensor([1.0], device=x.device, dtype=x.dtype).expand(self.n), None\n\n    def batch_score(\n        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor\n    ) -> Tuple[torch.Tensor, List[Any]]:\n        \"\"\"Score new token batch.\n\n        Args:\n            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).\n            states (List[Any]): Scorer states for prefix tokens.\n            xs (torch.Tensor):\n                The encoder feature that generates ys (n_batch, xlen, n_feat).\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        return (\n            torch.tensor([1.0], device=xs.device, dtype=xs.dtype).expand(\n                ys.shape[0], self.n\n            ),\n            None,\n        )\n"
  },
  {
    "path": "nets/scorers/lookahead.py",
    "content": "import k2\nimport torch\nimport numpy as np\n\ndef search_lexical_tree(self, node, next_tokens):\n    if node is None:\n        print(\"None node given!\")\n\n    intervals, next_nodes = [], []\n    # some tokens are invalid (e.g., invalid word combination). however, we should\n    # still compute the score for it to be compatible. We will force that score\n    # to logzero in postprocess stage, but need to use the index_to_kill make a record.\n    index_to_kill = []\n\n    for idx, i in enumerate(next_tokens):\n        # node is the previous one if _ is not proposed else root\n        subword = self.char_list[i]\n        # case (1): '_' or <eos> is proposed, which means end of the word\n        if subword == self.bpe_space or subword == \"<eos>\":\n            this_node = node # keep 'node' unchanged\n            # Invalid and kill. Previous node cannot be root\n            if this_node == self.lexroot:\n                interval = [self.word_unk_id-1, self.word_unk_id]\n                this_node = None\n                index_to_kill.append(idx)\n            # score is for a word, not a word prefix -> interval for only one word \n            else:\n                interval = [this_node[2][0], this_node[2][0] + 1]\n                # next_node is root so the next token is valid even though it is not\n                # start with '_' \n                this_node = self.lexroot\n\n        # case (2): impossible token. kill them\n        elif subword == \"<blank>\" or subword == \"<unk>\":\n            this_node = None\n            interval = [self.word_unk_id-1, self.word_unk_id]\n            index_to_kill.append(idx)\n\n        # case (3): ordinary tokens. 
All special tokens should never reach this branch\n        else:\n            # a subword starting with '_' is the prefix of a new word -> search from root\n            this_node = self.lexroot if subword.startswith(self.bpe_space) else node\n\n            subword = subword.replace(self.bpe_space, \"\")\n            for c in subword:\n                cid = self.alphabet_dict[c]\n                # descend to the successor\n                if cid in this_node[0]:\n                    this_node = this_node[0][cid]\n                # no valid successor found. kill this hypothesis\n                else:\n                    this_node = None\n                    break\n\n            if this_node is not None and this_node[2] is not None:\n                interval = this_node[2]\n            else:\n                interval = [self.word_unk_id-1, self.word_unk_id]\n                index_to_kill.append(idx)\n\n        # plus one to correct the interval. See the building process of lexroot\n        interval = [interval[0] + 1, interval[1] + 1]\n        intervals.append(interval)\n        # this_node == None always means a kill\n        next_nodes.append(this_node)\n\n    return intervals, next_nodes, index_to_kill\n\n\ndef parse_lookahead(yseq, lexroot, char_list, alphabet, word_dict, bpe_space):\n\n    # (1) check whether the final word is complete\n    final_token = char_list[yseq[-1]]\n    tail_complete = final_token in [\"<blank>\", \"<eos>\", \"<unk>\", bpe_space]\n\n    # (2) recover the string\n    yseq = \"\".join([char_list[y] for y in yseq])\\\n           .replace(\"<blank>\", \"\")\\\n           .replace(\"<eos>\", \"\")\\\n           .replace(\"<unk>\", bpe_space + \"<unk>\")\\\n           .replace(bpe_space, \" \")\\\n           .strip().split()\n\n    # (3) parse prefix\n    unk_id = word_dict[\"<UNK>\"]\n    prefix = [word_dict[tok] if tok in word_dict else unk_id \n                for tok in yseq[:-1]]\n\n    # (4) parse 
interval of tail\n\n    tail = yseq[-1] if len(yseq) > 0 else \"<unk>\"\n    if tail == \"<unk>\":\n        interval = [unk_id-1, unk_id]\n    else:\n        node = lexroot\n        for c in tail:\n            cid = alphabet[c]\n            if cid in node[0]:\n                node = node[0][cid]\n                interval = [node[2][0], node[2][0] + 1]\\\n                               if tail_complete else node[2]\n            else:\n                interval = [unk_id-1, unk_id]\n                break\n\n    # shift by 1: see building process of lexroot\n    interval = [interval[0] + 1, interval[1] + 1]   \n\n    # yseq = \" \".join(yseq)\n    # print(f\"yseq: {yseq} prefix: {prefix} interval: {interval}\") \n    return prefix, interval\n\ndef build_word_fsa_mat(prefix, interval):\n    prefix_len = len(prefix)\n\n    # prefix part\n    start_state = np.arange(prefix_len)\n    end_state = np.arange(prefix_len) + 1\n    labels = np.array(prefix)\n    scores = np.zeros(prefix_len)\n    prefix_part = np.stack([start_state, end_state, labels, scores], axis=1)\n\n    # interval_part\n    interval_len = interval[1] - interval[0]\n    start_state = np.ones(interval_len) * prefix_len\n    end_state = np.ones(interval_len) * (prefix_len + 1)\n    labels = np.arange(interval[0], interval[1])\n    scores = np.zeros(interval_len)\n    interval_part = np.stack([start_state, end_state, labels, scores], axis=1)\n\n    # final arc\n    final_arc = np.array([[prefix_len + 1, prefix_len + 2, -1, 0]])\n\n    # combine \n    mat = np.concatenate([prefix_part, interval_part, final_arc], axis=0)\n    mat = torch.from_numpy(mat).int()\n    return mat\n"
  },
  {
    "path": "nets/scorers/mmi.py",
    "content": "import k2\nimport torch\nimport torch.nn.functional as F\nimport sys\nimport jieba\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom snowfall.warpper.mmi_utils import build_word_mapping, convert_transcription\n\n# All methods are overrided\nclass MMIPrefixScores(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \"\"\"\n        lang: Path object of lang dir\n        device: torch device\n \n        We do not refer K2MMI module in model to avoid device conflict\n        \"\"\"\n        self.lang = lang\n        self.device = device\n        self.sos_id = sos_id # sos and eos are the same\n        self.char_list = char_list\n        self.oovid = int(open(self.lang / 'oov.int').read().strip())\n\n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(\n                                 self.lexicon, device=device)    \n\n        self.phone_ids = self.lexicon.phone_symbols()\n        # ignore dropout here; +1 for blank; maybe need log_softmax\n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n\n        # if True, the numerator score would be calculated on segments instead of single characters\n        self.use_segment = use_segment \n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"] \n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                  
 \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        \"\"\"\n        x: 2-D tensor, utterance encoder output without batch\n        \n        return: nnet_output: [B, T, D], log_distribution without being normalized (exp-sum!=1)\n                supervision: supervision for single utterance\n                prev_score: initial score\n\n        The denominator score is independent to the prefix. It is a constant for any hypothesis.\n        So it would be ignored during decoding: We only consider the numerator, and the initial \n        score is set 0.\n        \"\"\"\n        \n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n        \n        T = x.size()[1]\n        supervision = torch.Tensor([[0, 0, T]]).to(torch.int32)\n\n        prev_score = torch.Tensor([0]).to(torch.float32)\n        return nnet_output, supervision, prev_score \n\n    def select_state(self, states, j):\n        nnet_output_single, supervision_single, prev_scores = states\n        prev_score = torch.Tensor([prev_scores[j]])\n        return nnet_output_single, supervision_single, prev_score\n\n    def score(**kargs):        \n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        \"\"\"\n        y: prefix hypothesis, start with sos\n        next_tokens: candidates for next token\n\tstate: decoding state\n\ths_pad: encoder output, ignore\n\n        return:\n        tok_scores: prefix g, new token c, hypothesis h. token_scores = score(h) - score(g)\n                    which is the score for c.\n        state: directly copy nnet_output_single and supervision_single. 
 \n               Also save score(h) for every candidate; it becomes score(g) in the next round.\n        \"\"\"\n        nnet_output_single, supervision_single, prev_score = state\n        batch_size = next_tokens.size()[0]\n\n        # acoustic\n        supervision = supervision_single.repeat(batch_size, 1)\n        nnet_output = nnet_output_single.repeat(batch_size, 1, 1)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # texts\n        y = y.unsqueeze(0).repeat(batch_size, 1)\n        next_tokens = next_tokens.unsqueeze(1)\n        ys = torch.cat([y, next_tokens], dim=-1)\n        # texts = convert_transcription(ys, self.word_mapping, self.lexicon.words,\n        #                               self.oovid, [self.sos_id])\n        texts = [\"\".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n        texts = [text.replace(\"<space>\", \" \") for text in texts]\n        print(texts)\n\n        if self.use_segment:\n            texts = self.segmentation(texts)\n        \n        num_graphs, _ = self.graph_compiler.compile(\n                            texts, self.P, replicate_den=True)\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n        \n        num_tot_scores = num_lats.get_tot_scores(log_semiring=True,\n                                                 use_double_scores=True)\n        tok_scores = num_tot_scores - prev_score\n        \n        return tok_scores, (nnet_output_single, supervision_single, num_tot_scores)\n\n    # Full-sequence scoring is not supported; this redefinition replaces the\n    # score() stub declared earlier in the class.\n    def score(self, state):\n        raise NotImplementedError\n\n    def final_score(self, state):\n        # Hook for adding a score at the final <eos>.\n        # No special score is given for the last <eos>, so return 0.\n        return 0\n\n    def segmentation(self, ys):\n        ys = [y.replace(\" \", \"\") for y in ys]\n        ys = [jieba.cut(y, cut_all=False) for y in ys]\n        ys = [\" \".join(list(y)) for y in ys]\n        return ys\n"
  },
  {
    "path": "nets/scorers/mmi_alignment_score.py",
    "content": "import k2\nimport torch\nimport math\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.nets.scorers.mmi_utils import step_intersect\n\nclass MMIRNNTScorer():\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list, mas_lookahead=1):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n     \n        self.mas_lookahead = mas_lookahead\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def den_scores(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        # (2) den_scores\n        texts = [\"<UNK>\"] # use a random text, just to get den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        T = nnet_output.size(1)\n        
supervision = torch.Tensor([[0, 0, T]]).int()\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n        den_scores = step_intersect(den, dense_fsa_vec)[0]\n\n        return nnet_output, den_scores\n\n    def batch_rescore(self, A, nnet_output, den_scores):\n        batch, T = len(A), nnet_output.size(1)\n        if batch == 0:\n            return A\n\n        texts = [h.yseq[1:] for h in A]\n        texts = [\" \".join([self.char_list[x] for x in text]) for text in texts]\n        num, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n\n        supervision = torch.stack([torch.arange(batch),\n                                   torch.zeros(batch),\n                                   torch.ones(batch) * T], dim=1).int()\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1), supervision)\n\n        num_lats = k2.intersect_dense_pruned(num,\n                                             dense_fsa_vec,\n                                             search_beam=20.0,\n                                             output_beam=10.0,\n                                             min_active_states=30,\n                                             max_active_states=100000)\n        num_scores = num_lats.get_tot_scores(True, True)\n        tot_scores = num_scores - den_scores[-1] # num_frame -> index\n\n        for h, s in zip(A, tot_scores):\n            h.mmi_prev_score = s.item()\n        \n        return A\n\n    def batch_score(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        if self.mas_lookahead > 15:\n            return self.Imp_A(A, nnet_output, den_scores, tu_sum, mmi_weight)\n        else:\n            return self.Imp_B(A, nnet_output, den_scores, tu_sum, mmi_weight)\n\n    def Imp_B(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        batch = len(A)\n        if batch == 0:\n            return A\n\n        # reorder: increasing order in u means decreasing order in t\n        #          this 
is required by k2 supervision\n        A.sort(key=lambda h: len(h.yseq))\n\n        # (1) get ts: the alignment length in t-axis \n        # +1 since frame start with 0; +1 since redundant blank\n        ts = [tu_sum - len(h.yseq) + 2 for h in A]\n        ts = torch.Tensor(ts).long()\n\n        # (2) compile numerator graph\n        texts = [h.yseq[1:] for h in A] # exclude starting <sos>\n        texts = [\" \".join([self.char_list[x] for x in text]) for text in texts] # need modification for BPE\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n\n        # (3) intersection\n        lookahead_range = (0, self.mas_lookahead)\n        tot_scores_collection = []\n        T = nnet_output.size()[1]\n        for s in range(lookahead_range[0], lookahead_range[1] + 1):\n            ts_shift = torch.clamp(ts + s, min=1, max=T)\n            supervision = torch.stack([torch.arange(batch),\n                                       torch.zeros(batch),\n                                       ts_shift\n                                       ], dim=1).to(torch.int32)\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                           supervision)\n            num_lats = k2.intersect_dense_pruned(num_graphs,\n                                             dense_fsa_vec,\n                                             search_beam=20.0,\n                                             output_beam=10.0,\n                                             min_active_states=30,\n                                             max_active_states=20000)\n            num_scores = num_lats.get_tot_scores(True, True)\n            # num_scores = torch.where(num_scores == -math.inf, 0.0, num_scores)\n            tot_scores = num_scores - den_scores[ts_shift-1] # num_frame -> idx_frame\n            tot_scores_collection.append(tot_scores)\n        tot_scores = torch.stack(tot_scores_collection, dim=1) # [beam, T]\n\n     
   # Note: only the top-1 score can be used here, not logsumexp or a top-k sum,\n        # since torch.clamp repeats these scores at the boundaries\n        tot_scores, _ = torch.topk(tot_scores, 1, dim=-1)\n\n        # (4) assign and post-process\n        idx_to_empty_str = [j for j, h in enumerate(A) if len(h.yseq) == 1]\n        for j in idx_to_empty_str:\n            tot_scores[j] = 0.0\n\n        for j in range(batch):\n            h = A[j]\n            h.score += (tot_scores[j].item() - h.mmi_prev_score) * mmi_weight\n            h.mmi_prev_score = tot_scores[j].item()\n\n        A = [h for h in A if h.score > -1e8]\n        return A\n    \n\n    # Version using step_intersect. Very slow; only used when the lookahead is very large.\n    def Imp_A(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        batch = len(A)\n        if batch == 0 or mmi_weight == 0:\n            return A\n        \n        # For hypotheses without mmi_tot_score, compute the score sequences and assign them\n        texts = [h.yseq[1:] for h in A if h.mmi_tot_score is None]\n        texts = [\" \".join([self.char_list[x] for x in text]) for text in texts]\n        indices = [i for i, h in enumerate(A) if h.mmi_tot_score is None]\n\n        T, num_texts = nnet_output.size(1), len(texts)\n        if num_texts > 0:\n            num, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n            supervision = torch.stack([torch.arange(num_texts),\n                                       torch.zeros(num_texts),\n                                       torch.ones(num_texts) * T], dim=1).int()\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(num_texts, 1, 1), supervision)\n            num_scores = step_intersect(num, dense_fsa_vec)\n            assert len(num_scores) == num_texts\n    \n            tot_scores = [x - den_scores for x in num_scores]\n            \n            for i, score in zip(indices, tot_scores):\n                A[i].mmi_tot_score = 
score.tolist()\n\n        # selecting the scores accordingly\n        assert all([h.mmi_tot_score is not None for h in A])\n        ts = [tu_sum - len(h.yseq[1:]) for h in A]\n        curr_scores = [max(h.mmi_tot_score[t: min(t + self.mas_lookahead, T)])\n                      for t, h in zip(ts, A)]\n        prev_scores = [h.mmi_prev_score for h in A]\n        diff_scores = [a - b for a, b in zip(curr_scores, prev_scores)]\n\n        for curr_s, diff_s, h in zip(curr_scores, diff_scores, A):\n            h.score += mmi_weight * diff_s\n            h.mmi_prev_score = curr_s\n       \n        # exclude all hypotheses whose end-states are not reachable \n        A = [h for h in A if h.score > -1e8]\n        \n        return A\n\n    \n"
  },
  {
    "path": "nets/scorers/mmi_frame_prefix_scorer.py",
    "content": "import k2\nimport torch\nfrom espnet.lm.lm_utils import make_lexical_tree\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\n\n\nclass MMIFramePrefixScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        # Although the main decoding process adopt CPU. MMI computation may\n        # Adopt GPU to accelerate\n        #self.device = torch.device(f\"cuda:{rank-1}\") if device == \"cuda\" \\\n        #              else torch.device(\"cpu\")\n        self.device = torch.device(\"cpu\")\n        print(\"MMI scorer device: \", self.device)\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n\n        # special tokens and lexical root\n        self.word2id = self.lexicon.words._sym2id \n        self.char2id = {c: cid for cid, c in enumerate(char_list)}\n        self.id2char = {cid: c for cid, c in enumerate(char_list)}\n        self.eos = sos_id\n        self.word_unk_id = self.word2id[\"<UNK>\"]\n        self.char_space_id = self.char2id[\"<space>\"]\n        self.lexroot = make_lexical_tree(self.word2id, self.char2id, \"<UNK>\")\n        print(f\"lexroot succ: {self.lexroot[0].keys()}\")\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = 
torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        # (2) den_scores\n        texts = [\"<UNK>\"] # use a random text, just to get den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        T = x.size()[1]\n        den_scores = []\n        # use a loop since denominator would consume much memory\n        # in descending order\n        for t in range(T, 0, -1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0).cpu() # [T] -> [B, T]\n        print(\"Den scores: \", den_scores)\n\n        # (3) Prev Score is zero\n        prev_score = torch.Tensor([0]).to(torch.float32)\n        return nnet_output, den_scores, prev_score, self.lexroot\n\n    def select_state(self, states, j):\n        nnet_output_single, den_scores, prev_scores, next_roots = states\n        return nnet_output_single, den_scores, prev_scores[j], next_roots[j]\n\n    def score(**kargs):\n        raise NotImplementedError\n\n    def split_intervals(self, olds, max_interval=2000):\n        # it may cause error in k2 if the interval is too large. 
e.g., > 5k\n        news, indices = [], []\n        cnt = 0\n        for start, end in olds:\n            group_idx = []\n            while start <= end:\n                news.append([start, min(end, start + max_interval)])\n                start += max_interval\n                group_idx.append(cnt)\n                cnt += 1\n            indices.append(group_idx)\n        return news, indices\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        # (1) unpack the state\n        nnet_output_single, den_scores, prev_score, root = state\n    \n        # (2) process the prefix and intervals; build numerator graphs       \n        prefix = \"\".join([self.id2char[c.item()] for c in y]).replace(\"<eos>\", \"\").replace(\"<space>\", \" \")\n        prefix = [self.word2id.get(tok, self.word_unk_id) for tok in prefix.strip().split()]\n        # If the prefix does not end with <space>, its last part is an\n        # unfinished sub-word and is dropped.\n        if y[-1] != self.char_space_id:\n            prefix = prefix[:-1]\n        print(f\"prefix: {prefix}\")\n\n        intervals = []\n        force_zero = [] # invalid cases. Their prob is set to logzero at the end\n        next_roots = []\n        for idx, tok in enumerate(next_tokens):\n            tok = tok.item()\n            # case 1: <space> or <eos> indicates the end of a word\n            # This reduces the probability from a prefix probability\n            # to the exact probability of a word\n            if tok == self.char_space_id or tok == self.eos:\n                if root[1] > -1:\n                    intervals.append([root[1]-1, root[1]])\n                    next_roots.append(self.lexroot)\n                    print(f\"{idx}-th: {tok}, space / eos. This is a valid space\")\n                else:\n                    intervals.append((self.word_unk_id-1, self.word_unk_id))\n                    force_zero.append(idx)\n                    next_roots.append(self.lexroot)\n                    print(f\"{idx}-th: {tok}, space / eos. 
This is an invalid space\")\n            # case 2: OOV. kill it\n            elif not tok in root[0]:\n                intervals.append((self.word_unk_id-1, self.word_unk_id))\n                force_zero.append(idx)\n                next_roots.append(self.lexroot)\n                print(f\"{idx}-th: {tok}, oov\")\n            # case 3: A valid intra-word transition. \n            # shift to next lexicon node\n            else:\n                intervals.append(root[0][tok][2])\n                next_roots.append(root[0][tok])\n                print(f\"{idx}-th: {tok}, intra-trans\")\n       \n        # Being compatible with lexroot format \n        intervals = [(l+1, r+1) for l, r in intervals]\n        # Long intervals may cause error in k2. split them into\n        split_intervals, interval_indices = self.split_intervals(intervals)\n        print(f\"intervals: {intervals}, split intervals: {split_intervals}\", flush=True)\n        num_split = len(split_intervals)\n\n        num_graphs = self.graph_compiler.compile_nums_for_prefix_scoring(\n                                          prefix, split_intervals, self.P)\n        \n        # (3) frame-wise intersection\n        # calculate frame-by-frame to avoid OOM\n        T = nnet_output_single.size()[1]\n        nnet_output = nnet_output_single.repeat(num_split, 1, 1)\n        score_segment = []\n   \n        for t in range(1, T+1):\n            supervision = torch.stack([\n                          torch.arange(num_split),\n                          torch.zeros(num_split),\n                          torch.ones(num_split) * t\n                          ], dim=1).to(torch.int32)\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output[:, :t], supervision)\n            num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0) \n            num_tot_scores_split = num_lats.get_tot_scores(log_semiring=True,\n                                                           use_double_scores=True)\n            
score_segment.append(num_tot_scores_split)\n        \n        score_segment = torch.stack(score_segment, dim=1) # [num_split, T]\n        num_scores = torch.zeros(len(interval_indices), T) # [batch_size, T]\n        for i in range(len(interval_indices)):\n            num_scores[i] = torch.logsumexp(score_segment[interval_indices[i]], dim=0)\n            if i in force_zero:\n                num_scores[i] = self.logzero\n\n        # (4) finalize\n        tot_scores = num_scores - den_scores.unsqueeze(0)\n        tot_scores = torch.logsumexp(tot_scores, dim=-1)[0]\n        tok_scores = tot_scores - prev_score\n        state = (nnet_output_single, den_scores, tot_scores, next_roots)\n        return tok_scores.numpy(), state\n\n    def final_score(self, state):\n        return 0     \n                \n"
  },
  {
    "path": "nets/scorers/mmi_frame_scorer.py",
    "content": "import k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\n# from espnet.nets.scorers.trace_frame import trace_frame\nfrom espnet.nets.scorers.mmi_utils import step_intersect\n\n\nclass MMIFrameScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.oov = self.oovid = open(self.lang / 'oov.txt').read().strip()\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device, self.oov)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n\n        for i in range(10):\n            try:\n                self.load_weight(rank)\n            except:\n                print(f\"{i}-th trail to load MMI matrix weight but fail\")\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = 
self.lo(x)\n\n        # (2) den_scores\n        texts = [\"<UNK>\"] # an arbitrary text, only needed to obtain the den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        #torch.set_printoptions(sci_mode=False)\n        #fake_den_scores = trace_frame(nnet_output, den)\n\n        T = x.size()[1]\n        den_scores = []\n        # use a loop, since scoring all prefix lengths at once would consume too much memory\n        # prefix lengths are visited in descending order\n        for t in range(T, 0, -1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0) # [T] -> [B, T]\n        print(\"den_scores: \", den_scores)\n\n        ### DEBUG ###\n        supervision = torch.Tensor([[0, 0, T]]).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n        den_scores_ = step_intersect(den, dense_fsa_vec)[0].unsqueeze(0)\n        den_scores_ = torch.flip(den_scores_, [1]) \n\n        max_diff = torch.max(torch.abs(den_scores - den_scores_)).item()\n        if abs(max_diff) > 0.02:\n            print(\"denominator error: \", den_scores, den_scores_)\n            raise ValueError\n        ### END DEBUG ###\n\n        # (3) the previous score starts at zero\n        prev_score = torch.Tensor([0]).to(torch.float32)\n        return nnet_output, den_scores, prev_score \n\n    def select_state(self, states, j):\n        nnet_output_single, den_scores, prev_scores = states\n        return nnet_output_single, den_scores, prev_scores[j]\n\n    def score(self, *args, **kwargs):\n        # full scoring is not supported; only score_partial is implemented\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        # Warning: All frame-level scores are 
adopted in reverse order in time-axis \n        # since k2 requires a descending input length\n\n        # (1) unpack state\n        nnet_output_single, den_scores, prev_score = state\n\n        # (2) acoustic\n        T = nnet_output_single.size()[1]\n        batch_size = len(next_tokens)\n        num_egs = T * batch_size\n        supervision = torch.stack([\n                      torch.arange(num_egs),\n                      torch.zeros(num_egs),\n                      torch.arange(T, 0, -1).unsqueeze(1).repeat(1, batch_size).view(-1)\n                      ], dim=1).to(torch.int32) \n        nnet_output = nnet_output_single.repeat(num_egs, 1, 1)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # (3) texts\n        y = y.unsqueeze(0).repeat(num_egs, 1)\n        next_tokens = next_tokens.unsqueeze(1).repeat(T, 1)\n        ys = torch.cat([y, next_tokens], dim=1)\n        \n        # This is for Chinese. Need more tuning on English\n        #if not \"<space>\" in self.char_list:\n        #    texts = [\" \".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n        #    texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n        #else:\n        #    texts = [\"\".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n        #    texts = [text.replace(\"<eos>\", \"\").replace(\"<space>\", \"<space> \").strip() for text in texts]\n        texts = [\" \".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n        texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n\n        # (4) compute and accumulate\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)         \n        num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        num_tot_scores = num_tot_scores.view(T, batch_size).transpose(0, 1) # -> [B, T]\n        \n     
   ### DEBUG ###\n        supervision = torch.stack([\n                      torch.arange(batch_size),\n                      torch.zeros(batch_size),\n                      torch.ones(batch_size).int() * T\n                      ], dim=1).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        ys = ys[: batch_size]\n        texts = [\" \".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n        texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n        num_tot_scores_ = torch.stack(step_intersect(num_graphs, dense_fsa_vec), dim=0)\n        num_tot_scores_ = torch.flip(num_tot_scores_, [1])\n\n        max_diff = torch.max(torch.abs(num_tot_scores - num_tot_scores_)).item()\n        if abs(max_diff) > 0.02:\n            print(\"numerator error: \", num_tot_scores, num_tot_scores_)\n            raise ValueError \n        ### END DEBUG ### \n\n        #     subtract the denominator scores\n        tot_scores_frame = num_tot_scores - den_scores\n        tot_scores = torch.logsumexp(tot_scores_frame, dim=-1)\n\n        # (5) treat <eos> and CTC <blk> specially\n        next_tokens = next_tokens.squeeze(1)[:batch_size] # recover the initial next_tokens\n       \n        # <eos> uses the exact (full-sequence) probability rather than the prefix probability \n        eos_pos = torch.where(next_tokens == self.eos)[0]\n        if len(eos_pos) > 0:\n            tot_scores[eos_pos] = tot_scores_frame[eos_pos.item(), 0]\n\n        # CTC blank is never allowed in hypothesis. 
kill it\n        blk_pos = torch.where(next_tokens == self.blank)[0]\n        if len(blk_pos) > 0:\n            tot_scores[blk_pos] = self.logzero\n        \n        # (6) finalize\n        tok_scores = tot_scores - prev_score\n        state = nnet_output_single, den_scores, tot_scores\n        return tok_scores, state\n\n    def final_score(self, state):\n        return 0     \n                \n"
  },
  {
    "path": "nets/scorers/mmi_frame_scorer_trace.py",
    "content": "import k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.nets.scorers.trace_frame import compute_frame_level_scores_batch\n\nclass MMIFrameScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        # (2) den_scores\n        \"\"\"\n        texts = [\"<UNK>\"] # use a random text, just to get den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        T = nnet_output.size()[1]\n        print(T)\n        den_scores 
= []\n        for t in range(T, 0, -1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0) # [T] -> [B, T]\n        print(\"den score computed from previous version: \", den_scores)\n        \"\"\"\n\n        # trace the lattice: this is much faster\n        texts = [\"<UNK>\"] \n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n        assert len(texts) == nnet_output.size()[0]\n        den_scores = compute_frame_level_scores_batch(den, nnet_output)\n        print(\"den_scores: \", den_scores)\n\n        # (3) the previous score starts at zero\n        prev_score = torch.Tensor([0]).to(torch.float32)\n        return nnet_output, den_scores, prev_score \n\n    def select_state(self, states, j):\n        nnet_output_single, den_scores, prev_scores = states\n        return nnet_output_single, den_scores, prev_scores[j]\n\n    def score(self, *args, **kwargs):\n        # full scoring is not supported; only score_partial is implemented\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        # Warning: all frame-level scores are stored in reverse order along the time axis \n        # since k2 requires a descending input length\n\n        # (1) unpack state\n        nnet_output_single, den_frame_scores, prev_score = state\n\n        \"\"\"\n        # (2) acoustic\n        T = nnet_output_single.size()[1]\n        batch_size = len(next_tokens)\n        num_egs = T * batch_size\n        supervision = torch.stack([\n                      torch.arange(num_egs),\n                      torch.zeros(num_egs),\n                      torch.arange(T, 0, -1).unsqueeze(1).repeat(1, batch_size).view(-1)\n                      
], dim=1).to(torch.int32) \n        nnet_output = nnet_output_single.repeat(num_egs, 1, 1)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # (3) texts\n        y = y.unsqueeze(0).repeat(num_egs, 1)\n        next_tokens = next_tokens.unsqueeze(1).repeat(T, 1)\n        ys = torch.cat([y, next_tokens], dim=1)\n        # This is for Chinese. Need more tuning on English\n        if not \"<space>\" in self.char_list:\n            texts = [\" \".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n            texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n        else:\n            texts = [\"\".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n            texts = [text.replace(\"<eos>\", \"\").replace(\"<space>\", \"<space> \").strip() for text in texts]\n\n        # (4) compute and accumulate\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)         \n        num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        num_tot_scores = num_tot_scores.view(T, batch_size).transpose(0, 1) # -> [B, T]\n        \"\"\"\n\n        # (2) acoustic\n        T = nnet_output_single.size()[1]\n        batch_size = len(next_tokens)\n        nnet_output = nnet_output_single.repeat(batch_size, 1, 1) \n        supervision = torch.stack([\n                      torch.arange(batch_size),\n                      torch.zeros(batch_size),\n                      torch.ones(batch_size) * T\n                      ], dim=1).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # (3) texts:\n        y = y.unsqueeze(0).repeat(batch_size, 1)\n        next_tokens = next_tokens.unsqueeze(1)\n        ys = torch.cat([y, next_tokens], dim=1)\n        if not \"<space>\" in self.char_list:\n            texts = [\" 
\".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n            texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n        else:\n            texts = [\"\".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n            texts = [text.replace(\"<eos>\", \"\").replace(\"<space>\", \"<space> \").strip() for text in texts]\n        num, _  = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n        print(texts)\n\n        # (4) compute and accumulate\n        num_frame_scores = compute_frame_level_scores_batch(num, nnet_output) \n\n        #     combine the denominator scores\n        tot_scores_frame = num_frame_scores - den_frame_scores\n        tot_scores = torch.logsumexp(tot_scores_frame, dim=-1)\n        print(\"numerator scores: \", num_frame_scores)\n        print(\"log_posterior: \", tot_scores_frame)\n\n        # (5) treat <eos> and ctc <blk> specailly\n        next_tokens = next_tokens.squeeze(1) # recover the initail next_tokens\n       \n        # <eos> means the exact probability rather than the prefix probability \n        eos_pos = torch.where(next_tokens == self.eos)[0]\n        if len(eos_pos) > 0:\n            tot_scores[eos_pos] = tot_scores_frame[eos_pos.item(), 0]\n\n        # CTC blank is never allowed in hypothesis. kill it\n        blk_pos = torch.where(next_tokens == self.blank)[0]\n        if len(blk_pos) > 0:\n            tot_scores[blk_pos] = self.logzero\n        \n        # (6) finalize\n        tok_scores = tot_scores - prev_score\n        state = nnet_output_single, den_frame_scores, tot_scores\n        return tok_scores, state\n\n    def final_score(self, state):\n        return 0     \n                \n"
  },
  {
    "path": "nets/scorers/mmi_lookahead.py",
    "content": "import k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom espnet.lm.lm_utils import make_lexical_tree\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.nets.scorers.lookahead import parse_lookahead, build_word_fsa_mat\n\nclass MMILookaheadScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.oov = self.oovid = open(self.lang / 'oov.txt').read().strip()\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device, self.oov)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n       \n        # We use a char-level lexicon root but search in it by BPE\n        alphabet = [chr(i) for i in range(65, 91)] + [\"\\'\"] if \"A\" in char_list else\\\n                   [chr(i) for i in range(97, 123)] + [\"\\'\"]\n        self.alphabet_dict = {c: i+1 for i, c in enumerate(alphabet)} \n        self.word_dict = self.lexicon.words._sym2id\n        self.word_unk_id = int(open(self.lang / 'oov.int').read().strip()) \n        self.lexroot = make_lexical_tree(self.word_dict, self.alphabet_dict, self.word_unk_id) # 3 is unknown-id\n        print(\"end of lex building\", flush=True)\n\n        self.char_list = char_list # BPE char list\n        self.bpe_space = char_list[-2][0]\n\n        # special value\n        self.logzero = -10000.0\n        self.eos = sos_id\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        
ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # (1) nnet_output\n        nnet_output = self.lo(x.unsqueeze(0))\n\n        # (2) den_graph\n        texts = [\"<UNK>\"]\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        # (3) den_scores. Use a loop to avoid a memory spike\n        T = nnet_output.size()[1]\n        den_scores = []\n        for t in range(T, 0, -1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0) # [T] -> [B, T]\n\n        # (4) others\n        prev_score = torch.Tensor([0]).to(torch.float32)\n\n        return nnet_output, den_scores, prev_score\n\n    def select_state(self, states, j):\n        # only prev_scores is per-hypothesis; the other entries are shared\n        nnet_output_single, den_scores, prev_scores = states\n        return nnet_output_single, den_scores, prev_scores[j]\n\n    def score(self, *args, **kwargs):\n        # full scoring is not supported; only score_partial is implemented\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        # (1) unpack the state\n        nnet_output_single, den_frame_scores, prev_score = state\n        beam_size = len(next_tokens)        \n\n        # (2) build numerator graph\n        yseqs = torch.cat([\n                y.unsqueeze(0).repeat(beam_size, 1),\n                
next_tokens.unsqueeze(1)], dim=1).cpu().tolist()\n        prefix_and_interval = [parse_lookahead(yseq, self.lexroot,\n                               self.char_list, self.alphabet_dict,\n                               self.word_dict, self.bpe_space)\n                               for yseq in yseqs]\n        word_fsa_mats = [build_word_fsa_mat(*x) for x in prefix_and_interval]\n        word_fsa = [k2.Fsa.from_dict({\"arcs\": mat}) for mat in word_fsa_mats]\n        word_fsa = k2.create_fsa_vec(word_fsa)\n        num_graphs = self.graph_compiler.compile_lookahead_numerators(word_fsa, self.P) \n\n        # (3) loop to compute frame-level prob. cannot do this in one go due to memory limitation\n        T = nnet_output_single.size()[1]\n        num_frame_scores = []\n        nnet_output = nnet_output_single.expand(beam_size, -1, -1)\n        for t in range(T, 0, -1):\n            supervision = torch.stack([\n                          torch.arange(beam_size),\n                          torch.zeros(beam_size),\n                          torch.ones(beam_size) * t,\n                          ], dim=-1).cpu().int()\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output[:, :t, :], supervision)\n            \n            # A pruned version. Or it would be much slow. 
parameters are tunable\n            lats = k2.intersect_dense_pruned(num_graphs,\n                                         dense_fsa_vec,\n                                         search_beam=10.0,\n                                         output_beam=5.0,\n                                         min_active_states=30,\n                                         max_active_states=10000)\n            num_frame_score = lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            num_frame_scores.append(num_frame_score)\n        num_frame_scores = torch.stack(num_frame_scores, dim=-1) # [beam, T]\n        # Important: clamp -inf to logzero so the scores below stay finite\n        num_frame_scores = torch.clamp(num_frame_scores, min=self.logzero)\n \n        frame_scores = num_frame_scores - den_frame_scores\n        scores = torch.logsumexp(frame_scores, dim=-1)\n\n        # (4) postprocess\n        # (4.1) <eos> should only be aligned with the last frame\n        eos_pos = torch.where(next_tokens == self.eos)[0]\n        if len(eos_pos) > 0:\n            scores[eos_pos] = frame_scores[eos_pos.item(), 0]\n\n        # TODO: kill some valid token like <blank>\n        \n        # (5) return\n        token_scores = scores - prev_score\n        states = nnet_output_single, den_frame_scores, scores\n        return token_scores, states\n\n    \"\"\"  \n    def search_lexical_tree(self, node, next_tokens):\n        if node is None:\n            print(\"None node given!\")\n\n        intervals, next_nodes = [], []\n        # some tokens are invalid (e.g., invalid word combination). however, we should\n        # still compute the score for it to be compatible. 
We will force that score\n        # to logzero in postprocess stage, but need to use the index_to_kill make a record.\n        index_to_kill = [] \n\n        for idx, i in enumerate(next_tokens):\n            # node is the previous one if _ is not proposed else root\n            subword = self.char_list[i]\n            # case (1): '_' or <eos> is proposed, which means end of the word\n            if subword == self.bpe_space or subword == \"<eos>\":\n                this_node = node # keep 'node' unchanged\n                # Invalid and kill. Previous node cannot be root\n                if this_node == self.lexroot:\n                    interval = [self.word_unk_id-1, self.word_unk_id]\n                    this_node = None\n                    index_to_kill.append(idx)\n                # score is for a word, not a word prefix -> interval for only one word \n                else:\n                    interval = [this_node[2][0], this_node[2][0] + 1]\n                    # next_node is root so the next token is valid even though it is not\n                    # start with '_' \n                    this_node = self.lexroot\n\n            # case (2): impossible token. kill them\n            elif subword == \"<blank>\" or subword == \"<unk>\":\n                this_node = None\n                interval = [self.word_unk_id-1, self.word_unk_id]\n                index_to_kill.append(idx)\n\n            # case (3): ordinary tokens. 
All special token should never reach this branch\n            else:\n                # subword start with '_' means a prefix of new word -> search from root\n                this_node = self.lexroot if subword.startswith(self.bpe_space) else node\n\n                subword = subword.replace(self.bpe_space, \"\")\n                for c in subword:\n                    cid = self.alphabet_dict[c]\n                    # descent to successor\n                    if cid in this_node[0]:\n                        this_node = this_node[0][cid]\n                    # no valid successor found. kill this hypothesis\n                    else:\n                        this_node = None\n                        break\n            \n                if this_node is not None and this_node[2] is not None:\n                    interval = this_node[2]\n                else:\n                    interval = [self.word_unk_id-1, self.word_unk_id]\n                    index_to_kill.append(idx)\n            \n            # plus one to correct the interval. see building process of lexroot\n            interval = [interval[0] + 1, interval[1] + 1]  \n            intervals.append(interval)\n            # this_node == None always means a kill\n            next_nodes.append(this_node)\n\n        return intervals, next_nodes, index_to_kill\n    \"\"\"\n    def final_score(self, state):\n        return 0\n     \n"
  },
  {
    "path": "nets/scorers/mmi_lookahead_bak.py",
    "content": "import k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom espnet.lm.lm_utils import make_lexical_tree\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\n\n\nclass MMILookaheadScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n\n        self.wdict = self.lexicon.words._sym2id\n        self.pdict = self.lexicon.phones._sym2id\n        self.lexroot = make_lexical_tree(self.wdict, self.pdict, \"<UNK>\")\n        self.char_list = char_list\n        self.space_str = \"<space>\"\n        self.oovid = int(open(self.lang / 'oov.int').read().strip()) \n        self.unk_id = self.wdict[\"<UNK>\"]\n        self.sos_id = sos_id\n        \n        self.logzero = -10000.0\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # TODO: determine the format of state\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        T = 
x.size()[1]\n        supervision = torch.Tensor([[0, 0, T]]).to(torch.int32)\n\n        prev_score = torch.Tensor([0]).to(torch.float32)\n        return nnet_output, supervision, prev_score, self.lexroot \n\n    def select_state(self, states, j):\n        # TODO: select the state\n        pass\n\n    def score(**kargs):\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        # (1) unpack state\n        nnet_output_single, supervision_single, prev_score, node = state\n        batch_size = next_tokens.size()[0]\n\n        # (2) handle the prefix and candidate, find the prefix_ids and intervals\n        # prefix\n        prefix = [self.char_list[x] for x in y[1:]]\n        # prefix = [char_list[x] for x in y[1:]]\n        prefix = \"\".join(prefix).replace(self.space_str, \" \").strip().split(' ')\n        word_transition = False\n        if self.char_list[y[-1]] == self.space_str:\n            # end with <space>, word transition\n            prefix = prefix[:-1]\n            word_transition = True\n        prefix = [self.wdict.get(w, self.oovid)  for w in prefix]\n        \n        # candidate: any interval: start-1, end -> [start, end)\n        intervals, next_nodes = [], []\n        for tok in next_tokens:\n            tok = self.char_list[tok]\n            \n            if tok == self.space_str: \n                next_nodes.append(self.lexroot)\n            # attention symbol but not mmi symbol\n            if not tok in self.pdict:\n                intervals.append((self.unk_id - 1, self.unk_id))\n                next_nodes.append(None)\n            else:\n                pid = self.pdict[tok]\n                if pid in node[0]: # successors exists: valid word\n                    intervals.append(node[0][pid][2])\n                    next_nodes.append(node[0][pid])\n                else: # OOV: \n                    intervals.append((self.unk_id - 1, self.unk_id))\n                    next_nodes.append(None)\n   
     intervals = [(l+1, r+1) for l, r in intervals] # work around: to be compatible with lex-tree \n        split_intervals, indices = self.split_intervals(intervals)\n        num_split = len(split_intervals)\n\n        # (3) acoustic information\n        supervision = supervision_single.repeat(num_split, 1)\n        nnet_output = nnet_output_single.repeat(num_split, 1, 1)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n        \n        # (4) accumulate probability\n        num_graphs = self.graph_compiler.compile_nums_for_prefix_scoring(prefix, split_intervals, self.P)\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=5.0)\n        num_tot_scores_split = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        num_tot_scores = torch.zeros(batch_size).to(y.device)\n        for i in range(batch_size):\n            num_tot_scores[i] = torch.logsumexp(num_tot_scores_split[indices[i]], dim=-1)\n\n        num_tok_scores = num_tot_scores - prev_score\n        return num_tok_scores, (nnet_output_single, supervision_single, num_tot_scores, next_nodes)    \n         \n\n    def split_intervals(self, olds, max_interval=2000):\n        # it may cause error in k2 if the interval is too large. e.g., > 5k\n        # so a large interval should be split into multiple small intervals.\n        news, indices = [], []\n        cnt = 0\n        for start, end in olds:\n            group_idx = []\n            while start <= end:\n                news.append([start, min(end, start + max_interval)])\n                start += max_interval\n                group_idx.append(cnt)\n                cnt += 1\n            indices.append(group_idx)\n        return news, indices\n"
  },
  {
    "path": "nets/scorers/mmi_lookahead_split.py",
    "content": "import k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom espnet.lm.lm_utils import make_lexical_tree\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\n\n\nclass MMILookaheadScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n       \n        # We use a char-level lexicon root but search in it by BPE\n        alphabet = [chr(i) for i in range(65, 91)] + [\"\\'\"] # upper class\n        self.alphabet_dict = {c: i+1 for i, c in enumerate(alphabet)} \n        self.word_dict = self.lexicon.words._sym2id\n        self.word_unk_id = int(open(self.lang / 'oov.int').read().strip()) \n        self.lexroot = make_lexical_tree(self.word_dict, self.alphabet_dict, self.word_unk_id) # 3 is unknown-id\n        print(\"end of lex building\", flush=True)\n\n        self.char_list = char_list # BPE char list\n        self.bpe_space = char_list[-2][0]\n\n        # special value\n        self.logzero = -10000.0\n        self.eos = sos_id\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": 
ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # (1) nnet_output\n        nnet_output = self.lo(x.unsqueeze(0))\n\n        # (2) den_graph\n        texts = [\"<UNK>\"]\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        # (3) den_scores. Use a loop to avoid memory spark\n        T = nnet_output.size()[1]\n        den_scores = []\n        for t in range(T, 0, -1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0) # [T] -> [B, T]\n\n        # (4) others\n        prev_score = torch.Tensor([0]).to(torch.float32)\n\n        return nnet_output, den_scores, prev_score, self.lexroot\n\n    def select_state(self, states, j):\n        # only the prev_scores and next_nodes should be selected\n        nnet_output_single, den_scores, prev_scores, next_nodes = states\n        return nnet_output_single, den_scores, prev_scores[j], next_nodes[j]\n\n    def score(**kargs):\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        print(\"start a new partial score\", flush=True)\n        torch.set_printoptions(sci_mode=False)\n        # (1) unpack state\n        nnet_output_single, den_frame_scores, prev_score, node = state\n        beam_size = len(next_tokens)\n        T = nnet_output_single.size()[1]   \n\n        # (2) build numerator graph\n        \n        prefix = \"\".join([self.char_list[x.item()] for x in y[1:]])\\\n                 .replace(\" \", \"\").replace(self.bpe_space, \" 
\").strip().split(\" \")\n\n        intervals, next_nodes, index_to_kill = self.search_lexical_tree(node, next_tokens)\n        # if proposed token does not start with '_', y[-1] should be removed during graph composition\n        # '_' means the proposal of new word. <eos> cannot be seen as a new word. other special tokens\n        # would be killed\n        drop_prefix_tail = [int(not self.char_list[x.item()].startswith(self.bpe_space)) for x in next_tokens]\n        # some intervals are too long. we need to split them before computation and then combine \n        split_intervals, interval_indexes, split_drop_prefix_tail = self.split_intervals(intervals, drop_prefix_tail)\n        print(\"intervals: \\n\", intervals, \"split intervals: \\n\", split_intervals, \"interval_index: \\n\", interval_indexes)\n        graphs = self.graph_compiler.compile_nums_for_prefix_scoring(\n                     prefix, split_intervals, self.P, split_drop_prefix_tail)\n\n        # (3) loop to compute frame-level prob. cannot do this in one go due to memory limitation\n        split_size = len(split_intervals)\n        split_frame_scores = []\n        nnet_output = nnet_output_single.expand(split_size, -1, -1)\n        for t in range(T, 0, -1):\n            supervision = torch.stack([\n                          torch.arange(split_size),\n                          torch.zeros(split_size),\n                          torch.ones(split_size) * t,\n                          ], dim=-1).cpu().int()\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output[:, :t, :], supervision)\n            # split_lats = k2.intersect_dense(graphs, dense_fsa_vec, output_beam=3.0)\n            # must use a pruned version! 
the unpruned intersection is very slow here, especially for large intervals\n            split_lats = k2.intersect_dense_pruned(graphs,\n                                         dense_fsa_vec,\n                                         search_beam=10.0,\n                                         output_beam=5.0,\n                                         min_active_states=30,\n                                         max_active_states=10000)\n            split_frame_score = split_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            split_frame_scores.append(split_frame_score)\n            print(f\"intersection t = {t}\", flush=True)\n        split_frame_scores = torch.stack(split_frame_scores, dim=-1) # [split_size, T]\n        \n        # (4) merge the split pieces of each interval, then subtract den_scores\n        num_frame_scores = []\n        for interval_index in interval_indexes:\n            num_frame_score = torch.logsumexp(split_frame_scores[interval_index], dim=0)\n            num_frame_scores.append(num_frame_score)\n        num_frame_scores = torch.stack(num_frame_scores, dim=0) # [beam_size, T]\n\n        frame_scores = num_frame_scores - den_frame_scores\n        scores = torch.logsumexp(frame_scores, dim=-1)\n\n        # (5) postprocess\n        # A. <eos> must be aligned with the last frame only. Column 0 holds t = T\n        #    because the loops above run from T down to 1\n        eos_pos = torch.where(next_tokens == self.eos)[0]\n        if len(eos_pos) > 0:\n            print(\"found eos position: \", eos_pos)\n            scores[eos_pos] = frame_scores[eos_pos.item(), 0]\n\n        # B. 
kill hypothesis in \"index_to_kill\"\n        print(\"index to kill: \", index_to_kill)\n        if len(index_to_kill) > 0:\n            for idx in index_to_kill:\n                scores[idx] = self.logzero\n        \n        # (6) return\n        token_scores = scores - prev_score\n        print(\"token score: \", token_scores, flush=True) \n        states = nnet_output_single, den_frame_scores, scores, next_nodes\n        return token_scores, states\n\n   \n    def search_lexical_tree(self, node, next_tokens):\n        if node is None:\n            print(\"None node given!\")\n\n        intervals, next_nodes = [], []\n        # some tokens are invalid (e.g., invalid word combination). however, we should\n        # still compute the score for it to be compatible. We will force that score\n        # to logzero in postprocess stage, but need to use the index_to_kill make a record.\n        index_to_kill = [] \n\n        for idx, i in enumerate(next_tokens):\n            # node is the previous one if _ is not proposed else root\n            subword = self.char_list[i]\n            print(\"subword: \", subword)\n            # case (1): '_' or <eos> is proposed, which means end of the word\n            if subword == self.bpe_space or subword == \"<eos>\":\n                this_node = node # keep 'node' unchanged\n                # Invalid and kill. 
Previous node cannot be root\n                if this_node == self.lexroot:\n                    interval = [self.word_unk_id-1, self.word_unk_id]\n                    this_node = None\n                    index_to_kill.append(idx)\n                # score is for a word, not a word prefix -> interval for only one word \n                else:\n                    interval = [this_node[2][0], this_node[2][0] + 1]\n                    # next_node is root so the next token is valid even though it is not\n                    # start with '_' \n                    this_node = self.lexroot\n\n            # case (2): impossible token. kill them\n            elif subword == \"<blank>\" or subword == \"<unk>\":\n                this_node = None\n                interval = [self.word_unk_id-1, self.word_unk_id]\n                index_to_kill.append(idx)\n\n            # case (3): ordinary tokens. All special token should never reach this branch\n            else:\n                # subword start with '_' means a prefix of new word -> search from root\n                this_node = self.lexroot if subword.startswith(self.bpe_space) else node\n\n                subword = subword.replace(self.bpe_space, \"\")\n                for c in subword:\n                    cid = self.alphabet_dict[c]\n                    # descent to successor\n                    if cid in this_node[0]:\n                        this_node = this_node[0][cid]\n                    # no valid successor found. kill this hypothesis\n                    else:\n                        this_node = None\n                        break\n            \n                if this_node is not None and this_node[2] is not None:\n                    interval = this_node[2]\n                else:\n                    interval = [self.word_unk_id-1, self.word_unk_id]\n                    index_to_kill.append(idx)\n            \n            # plus one to correct the interval. 
see building process of lexroot\n            interval = [interval[0] + 1, interval[1] + 1]  \n            intervals.append(interval)\n            # this_node == None always means a kill\n            next_nodes.append(this_node)\n\n        return intervals, next_nodes, index_to_kill\n\n    def split_intervals(self, intervals, is_new_word, max_len=20000):\n        \"\"\"\n        Intervals longer than max_len cause errors in k2.\n        Split each one into pieces of at most max_len and record, for every\n        original interval, the positions of its pieces in the returned list.\n        \"\"\"\n        new_intervals = []\n        interval_indexes = []\n        interval_is_new_word = []\n        for idx, (left, right) in enumerate(intervals):\n            interval_index = []\n            new_word = is_new_word[idx]\n            while left < right:\n                new_intervals.append([left, min(left + max_len, right)])\n                left += max_len\n                # index into new_intervals, so the caller can gather the pieces\n                interval_index.append(len(new_intervals) - 1)\n                interval_is_new_word.append(new_word) # each piece inherits the flag of its interval\n            interval_indexes.append(interval_index)\n\n        return new_intervals, interval_indexes, interval_is_new_word\n\n    def final_score(self, state):\n        return 0\n             \n"
  },
  {
    "path": "nets/scorers/mmi_prefix_score.py",
    "content": "import os\nimport k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.nets.scorers.mmi_utils import step_intersect\n\nclass MMIFrameScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list, weight_path):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.oov = self.oovid = open(self.lang / 'oov.txt').read().strip()\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device, self.oov)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n\n        for i in range(10):\n            try:\n                self.load_weight(rank, weight_path)\n            except:\n                print(f\"{i}-th trail to load MMI matrix weight but fail\")\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n\n    def load_weight(self, rank, path):\n        # load lo weight and lm_scores\n        ckpt_path = os.path.join(path, f\"mmi_param.{rank}.pth\")\n        ckpt_dict = torch.load(ckpt_path)\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        # (1) den_graphs\n        texts = [\"<UNK>\"] # 
use a random text, just to get den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        # (2) DenseFsaVec\n        nnet_output = self.lo(x.unsqueeze(0))\n        supervision = torch.Tensor([[0, 0, nnet_output.size(1)]]).int()\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # (3) den_scores: [B, T]\n        den_scores = step_intersect(den, dense_fsa_vec)[0].unsqueeze(0)\n\n        # (4) initialize prev_score by 0.0\n        prev_score = torch.Tensor([0.0]).to(torch.float32)\n        return nnet_output, den_scores, prev_score \n\n    def select_state(self, states, j):\n        nnet_output_single, den_scores, prev_scores = states\n        return nnet_output_single, den_scores, prev_scores[j]\n\n    def score(**kargs):\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        nnet_output_single, den_scores, prev_score = state\n        batch_size = len(next_tokens)\n\n        # (1) num_graphs\n        ys = torch.cat([y.unsqueeze(0).repeat(batch_size, 1), next_tokens.unsqueeze(1)], dim=1)\n        texts = [\" \".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n        texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n        num, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n\n        # (2) DenseFsaVec\n        supervision = torch.stack([\n                      torch.arange(batch_size),\n                      torch.zeros(batch_size),\n                      torch.ones(batch_size) * nnet_output_single.size(1)\n                      ], dim=1).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output_single.repeat(batch_size, 1, 1), supervision)\n\n        # (3) num_scores: [B, T]\n        num_scores = torch.stack(step_intersect(num, dense_fsa_vec), dim=0) \n\n        # (4) compute frame scores and accumulate along t-axis\n        tot_scores_frame = num_scores - den_scores\n        
tot_scores = torch.logsumexp(tot_scores_frame, dim=-1)\n\n        # (5) post-process\n        # <eos> means the exact probability rather than the prefix probability \n        eos_pos = torch.where(next_tokens == self.eos)[0]\n        if len(eos_pos) > 0:\n            tot_scores[eos_pos] = tot_scores_frame[eos_pos.item(), -1]\n\n        # CTC blank is never allowed in hypothesis. kill it\n        blk_pos = torch.where(next_tokens == self.blank)[0]\n        if len(blk_pos) > 0:\n            tot_scores[blk_pos] = self.logzero\n        \n        # (6) finalize\n        tok_scores = tot_scores - prev_score\n        state = nnet_output_single, den_scores, tot_scores\n        return tok_scores, state\n\n    def final_score(self, state):\n        return 0     \n                \n"
  },
  {
    "path": "nets/scorers/mmi_rescorer.py",
    "content": "import os\nimport k2\nimport torch\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.asr.asr_utils import parse_hypothesis\n\nclass MMIRescorer(object):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list, weight_path):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.oov = self.oovid = open(self.lang / 'oov.txt').read().strip()\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device, self.oov)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank, weight_path)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n\n        if not char_list[-2][0].isalnum():\n            self.bpe_space = char_list[-2][0]\n            print(\"use bpe. 
bpe space is: \", self.bpe_space)\n        else:\n            self.bpe_space = \"\"\n        \n    def load_weight(self, rank, path):\n        # load lo weight and lm_scores\n        ckpt_path = os.path.join(path, f\"mmi_param.{rank}.pth\")\n        ckpt_dict = torch.load(ckpt_path)\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def score(self, x, nbest_hyps, v2=False):\n        batch_size = len(nbest_hyps)        \n\n        # (1) acoustic\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n        T = x.size()[1]\n        supervision = torch.Tensor([[0, 0, T]]).to(torch.int32)\n        \n        nnet_output = nnet_output.repeat(batch_size, 1, 1)\n        supervision = supervision.repeat(batch_size, 1)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # (2) texts\n        \"\"\"\n        texts, scores = [], []\n        for idx, hyp in enumerate(nbest_hyps):\n            if v2:\n                tokenid = hyp.yseq.tolist()[1:]\n                text = \" \".join([char_list[x] for x in tokenid])\n            else:\n                text, token, tokenid, score = parse_hypothesis(hyp, char_list)\n            text = text.replace(\"<eos>\", \"\").strip()\n            texts.append(text)\n            text_split = text.strip().split()\n            for tok in text_split:\n                if not tok in self.lexicon.words:\n                    print(f\"{idx}: Found oov {tok}\")\n        \"\"\"\n        texts = []\n        for idx, hyp in enumerate(nbest_hyps):\n            if v2:\n                tokenid = hyp.yseq[1:]\n                tokens = [self.char_list[x] for x in tokenid]\n                if isinstance(tokenid, torch.Tensor):\n                    tokenid = tokenid.tolist()\n                if 
\"<space>\" in self.char_list: # English\n                    text = \"\".join(tokens).replace(\"<space>\", \" \")\n                else: # Mandarin, BPE\n                    text = \" \".join(tokens)\n                text = text.replace(\"<eos>\", \"\") \n            else:\n                text, _, tokenid, _ = parse_hypothesis(hyp, self.char_list)\n\n            # BPE space:\n            if self.bpe_space is not None:\n                text = text.replace(\" \", \"\").replace(self.bpe_space, \" \").strip() \n\n            texts.append(text)\n\n        # (3) computation\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n        num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        \n                \n        # (4) add MMI scores\n        new_hyps = []\n        for i, hyp in enumerate(nbest_hyps):\n            if v2:\n                if hasattr(hyp, \"scores\"):\n                    hyp.scores[\"mmi_tot_score\"] = num_tot_scores[i].item()\n                else:\n                    setattr(hyp, 'mmi_tot_score', num_tot_scores[i].item())\n            else:\n                hyp[\"mmi_tot_score\"] = num_tot_scores[i].item()\n            new_hyps.append(hyp)\n        return new_hyps \n"
  },
  {
    "path": "nets/scorers/mmi_rnnt_lookahead_scorer.py",
    "content": "import k2\nimport torch\nimport math\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.lm.lm_utils import make_lexical_tree\nfrom espnet.nets.scorers.lookahead import parse_lookahead, build_word_fsa_mat\n\n\nclass MMIRNNTLookaheadScorer():\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n       \n        # lexical root\n        alphabet = [chr(i) for i in range(65, 91)] + [\"\\'\"] # upper class\n        self.alphabet_dict = {c: i+1 for i, c in enumerate(alphabet)}\n        self.word_dict = self.lexicon.words._sym2id\n        self.word_unk_id = int(open(self.lang / 'oov.int').read().strip())\n        self.lexroot = make_lexical_tree(self.word_dict, self.alphabet_dict, self.word_unk_id) # 3 is unknown-id\n        print(\"end of lex building\", flush=True)\n \n        self.char_list = char_list\n        self.bpe_space = char_list[-2][0]\n        \n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.logzero = -10000\n\n        self.lookahead = True\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = 
ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def den_scores(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        # (2) den_scores\n        texts = [\"<UNK>\"] # use a random text, just to get den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        T = x.size()[1]\n        den_scores = []\n        # the denominator graph consumes a lot of memory, so score one prefix\n        # length at a time, in ascending order of t\n        for t in range(1, T+1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0) # [T] -> [B, T]\n\n        return nnet_output, den_scores \n\n    # this is deprecated: just reorder A at the beginning\n    def batch_score_(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n\n        batch = len(A)\n        if batch == 0:\n            return A\n\n        # (1) supervision\n        # +1 since frames start at 0; +1 for the redundant blank\n        # the supervision must be in descending order of length.\n        ts = [tu_sum - len(h.yseq) + 2 for h in A]\n        ts = torch.Tensor(ts).long()\n        supervision = torch.stack([torch.arange(batch),\n                                   torch.zeros(batch),\n                                   ts\n                                  ], dim=1).to(torch.int32) \n        indices = torch.argsort(supervision[:, 2], descending=True)\n        supervision = supervision[indices]\n        dense_fsa_vec = 
k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                       supervision)\n\n        # compile numerator graph and keep it in the order of indices\n        prefix_and_interval = [parse_lookahead(h.yseq, self.lexroot, \n                               self.char_list, self.alphabet_dict, \n                               self.word_dict, self.bpe_space)\n                               for h in A]\n        prefix_and_interval = [prefix_and_interval[j] for j in indices]\n        word_fsa_mats = [build_word_fsa_mat(*x) for x in prefix_and_interval]\n        word_fsa = [k2.Fsa.from_dict({\"arcs\": mat}) for mat in word_fsa_mats]\n        word_fsa = k2.create_fsa_vec(word_fsa)\n        num_graphs = self.graph_compiler.compile_lookahead_numerators(word_fsa, self.P) \n\n        # (3) intersection.  \n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n        num_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        num_scores = torch.where(num_scores == -math.inf, 0.0, num_scores)\n        # num_scores is in the order of indices\n        ts = torch.Tensor([ts[j] for j in indices]).long()\n        tot_scores = num_scores - den_scores[0][ts-1] # -1: num_frames -> idx_frames\n\n        # (4) assign and post-process\n        # Question: How to deal with the hypothesis with empty yseq\n        idx_to_empty_str = [j for j, h in enumerate(A) if len(h.yseq) == 1]\n        for j in idx_to_empty_str:\n            tot_scores[indicesj] = 0.0\n\n        for j in range(batch):\n            h = A[indices[j]]\n            # print(f\"idx: {indices[j]} | Hypothesis: {texts[j]} | Prev MMI Score: {h.mmi_tot_score} | Prev Score: {h.score} | MMI step Score: {(tot_scores[j] - h.mmi_tot_score)*mmi_weight}\")\n            h.score += (tot_scores[j].item() - h.mmi_tot_score) * mmi_weight\n            h.mmi_tot_score = tot_scores[j].item()\n        \n        return A\n\n    # version of no lookahead in time-axis\n  
  def __batch_score(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        print(f\"Result of tu_sum: {tu_sum}\")\n        batch = len(A)\n        if batch == 0:\n            return A\n\n        # reorder: increasing order in u means decreasing order in t\n        #          this is required by k2 supervision\n        A.sort(key=lambda h: len(h.yseq))\n\n        # (1) supervision\n        # +1 since frame start with 0; +1 since redundant blank\n        ts = [tu_sum - len(h.yseq) + 2 for h in A]\n        ts = torch.Tensor(ts).long() \n\n        supervision = torch.stack([torch.arange(batch),\n                                   torch.zeros(batch),\n                                   ts\n                                  ], dim=1).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                       supervision)\n\n        # (2) compile numerator graph\n        prefix_and_interval = [parse_lookahead(h.yseq, self.lexroot,\n                               self.char_list, self.alphabet_dict,\n                               self.word_dict, self.bpe_space)\n                               for h in A]\n        word_fsa_mats = [build_word_fsa_mat(*x) for x in prefix_and_interval]\n        word_fsa = [k2.Fsa.from_dict({\"arcs\": mat}) for mat in word_fsa_mats]\n        word_fsa = k2.create_fsa_vec(word_fsa)\n        num_graphs = self.graph_compiler.compile_lookahead_numerators(word_fsa, self.P)\n\n        # (3) intersection\n        # num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n        num_lats = k2.intersect_dense_pruned(num_graphs,\n                                         dense_fsa_vec,\n                                         search_beam=20.0,\n                                         output_beam=10.0,\n                                         min_active_states=30,\n                                         max_active_states=20000)\n        num_scores = 
num_lats.get_tot_scores(True, True)\n        num_scores = torch.where(num_scores == -math.inf, 0.0, num_scores)\n        tot_scores = num_scores - den_scores[0][ts-1] # num_frame -> idx_frame\n\n        # (4) assign and post-process\n        idx_to_empty_str = [j for j, h in enumerate(A) if len(h.yseq) == 1]\n        for j in idx_to_empty_str:\n            tot_scores[j] = 0.0\n\n        for j in range(batch):\n            h = A[j]\n            h.score += (tot_scores[j].item() - h.mmi_tot_score) * mmi_weight\n            h.mmi_tot_score = tot_scores[j].item()\n            \n            text = \"\".join([self.char_list[x] for x in h.yseq[1:]])\n            print(f\"text: {text} | score: {h.score} | mmi_score: {h.mmi_tot_score}\", flush=True)\n\n        return A\n\n    # version of lookahead in time-axis\n    def batch_score(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        print(f\"Result of tu_sum: {tu_sum}\")\n        batch = len(A)\n        if batch == 0:\n            return A\n\n        # reorder: increasing order in u means decreasing order in t\n        #          this is required by k2 supervision\n        A.sort(key=lambda h: len(h.yseq))\n\n        # (1) get ts: the alignment length in t-axis \n        # +1 since frame start with 0; +1 since redundant blank\n        ts = [tu_sum - len(h.yseq) + 2 for h in A]\n        ts = torch.Tensor(ts).long()\n\n        # (2) compile numerator graph\n        prefix_and_interval = [parse_lookahead(h.yseq, self.lexroot,\n                               self.char_list, self.alphabet_dict,\n                               self.word_dict, self.bpe_space)\n                               for h in A]\n        word_fsa_mats = [build_word_fsa_mat(*x) for x in prefix_and_interval]\n        word_fsa = [k2.Fsa.from_dict({\"arcs\": mat}) for mat in word_fsa_mats]\n        word_fsa = k2.create_fsa_vec(word_fsa)\n        num_graphs = self.graph_compiler.compile_lookahead_numerators(word_fsa, self.P)\n\n        # (3) 
intersection\n        lookahead_range = (0, 25) # tunable paramter. avoid this hard code in the future\n        tot_scores_collection = []\n        T = nnet_output.size()[1]\n        for s in range(lookahead_range[0], lookahead_range[1] + 1): # be symmetric\n            ts_shift = torch.clamp(ts + s, min=1, max=T)\n            supervision = torch.stack([torch.arange(batch),\n                                       torch.zeros(batch),\n                                       ts_shift\n                                       ], dim=1).to(torch.int32)\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                           supervision)\n            num_lats = k2.intersect_dense_pruned(num_graphs,\n                                             dense_fsa_vec,\n                                             search_beam=20.0,\n                                             output_beam=10.0,\n                                             min_active_states=30,\n                                             max_active_states=20000)\n            num_scores = num_lats.get_tot_scores(True, True)\n            num_scores = torch.where(num_scores == -math.inf, 0.0, num_scores)\n            tot_scores = num_scores - den_scores[0][ts_shift-1] # num_frame -> idx_frame\n            tot_scores_collection.append(tot_scores)\n        tot_scores = torch.stack(tot_scores_collection, dim=1) # [beam, T]\n        \n        # hint: we can only use top-1 score rather than logsumexp or top-k-sum\n        # since torch.clamp leads to repeatition of these scores at boundaries\n        tot_scores, _ = torch.topk(tot_scores, 1, dim=-1)\n\n        # (4) assign and post-process\n        idx_to_empty_str = [j for j, h in enumerate(A) if len(h.yseq) == 1]\n        for j in idx_to_empty_str:\n            tot_scores[j] = 0.0\n\n        for j in range(batch):\n            h = A[j]\n            h.score += (tot_scores[j].item() - h.mmi_tot_score) * mmi_weight\n           
 h.mmi_tot_score = tot_scores[j].item()\n\n            #text = \"\".join([self.char_list[x] for x in h.yseq[1:]])\n            #print(f\"text: {text} | score: {h.score} | mmi_score: {h.mmi_tot_score} | score_rnnt: {h.score - h.mmi_tot_score * mmi_weight}\", flush=True)\n\n        return A\n"
  },
  {
    "path": "nets/scorers/mmi_rnnt_scorer.py",
    "content": "import os\nimport k2\nimport torch\nimport math\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\n\n\nclass MMIRNNTScorer():\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list, weight_path, lookahead=0):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.oov = self.oovid = open(self.lang / 'oov.txt').read().strip()\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device, self.oov)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank, weight_path)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n\n        self.lookahead = lookahead\n\n    def load_weight(self, rank, path):\n        # load lo weight and lm_scores\n        ckpt_path = os.path.join(path, f\"mmi_param.{rank}.pth\") \n        ckpt_dict = torch.load(ckpt_path)\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def den_scores(self, x):\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        # (2) den_scores\n        texts = [\"<UNK>\"] # use a random text, just to get den graph\n        _, den = 
self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        T = x.size()[1]\n        den_scores = []\n        # use a loop since denominator would consume much memory\n        # in acscending order\n        for t in range(1, T+1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            den_lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            den_tot_scores = den_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            den_scores.append(den_tot_scores)\n        den_scores = torch.cat(den_scores).unsqueeze(0) # [T] -> [B, T]\n\n        return nnet_output, den_scores\n\n    def batch_rescore(self, A, h):\n        ans = []\n        start = 0\n        while start < len(A):\n            ans.append(self._batch_rescore(A[start: start + 50], h))\n            start += 50\n\n        return A\n\n    def _batch_rescore(self, A, h):\n        nnet_output = self.lo(h.unsqueeze(0))\n        batch, T = len(A), nnet_output.size(1)\n        if batch == 0:\n            return A\n\n        texts = [h.yseq[1:] for h in A]\n        texts = [\" \".join([self.char_list[x] for x in text]) for text in texts]\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n\n        supervision = torch.stack([torch.arange(batch),\n                                   torch.zeros(batch),\n                                   torch.ones(batch) * T\n                                  ], dim=1).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                       supervision)\n\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=30.0)\n        num_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n\n        for h, s in zip(A, num_scores):\n            h.mmi_tot_score = s.item()\n        \n        
return A\n\n    # batch score without time-axis lookahead. TODO: remove the indices as the order is not important\n    def batch_score(self, A, nnet_output, den_scores, tu_sum, mmi_weight):\n        \n        batch, T = len(A), nnet_output.size(1)\n        if batch == 0:\n            return A\n\n        # (1) supervision\n        # +1 since frame start with 0; +1 since redundant <sos>\n        # the supervision must be descending order.\n        ts = [tu_sum - len(h.yseq) + 2 for h in A]\n        ts = torch.Tensor(ts).long()\n        supervision = torch.stack([torch.arange(batch),\n                                   torch.zeros(batch),\n                                   ts\n                                  ], dim=1).to(torch.int32) \n        indices = torch.argsort(supervision[:, 2], descending=True)\n        supervision = supervision[indices]\n        # dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n        #                                supervision)\n\n        # (2) texts\n        texts = [h.yseq[1:] for h in A] # exclude starting <sos>\n        texts = [\" \".join([self.char_list[x] for x in text]) for text in texts] # need modification for BPE\n        texts = [texts[idx] for idx in indices] # reorder\n        num_graphs, _ = self.graph_compiler.compile(texts, self.P, replicate_den=False)\n\n        # (3) intersection. 
\n        num_score_collection = []\n        for i in range(self.lookahead + 1):\n            supervision = torch.stack([torch.arange(batch),\n                                       torch.zeros(batch),\n                                       torch.clamp(ts + i, min=1, max=T)\n                                       ], dim=1).to(torch.int32)[indices]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output.repeat(batch, 1, 1),\n                                           supervision)\n            num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=200.0)\n            num_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            num_scores = torch.where(num_scores == -math.inf, 0.0, num_scores)\n            num_score_collection.append(num_scores)\n\n        num_scores = torch.stack(num_score_collection, dim=1).max(1)[0]\n        ts = torch.Tensor([ts[j] for j in indices]).long()\n        tot_scores = num_scores - den_scores[0][ts-1] # -1: num_frames -> idx_frames\n\n        # (4) assign and post-process\n        idx_to_empty_str = [j for j, x in enumerate(texts) if x == \"\"]\n        for j in idx_to_empty_str:\n            tot_scores[j] = 0.0\n\n        for j in range(batch):\n            h = A[indices[j]]\n            h.score += (tot_scores[j].item() - h.mmi_tot_score) * mmi_weight\n            h.mmi_tot_score = tot_scores[j].item()\n        \n        return A\n"
  },
  {
    "path": "nets/scorers/mmi_utils.py",
    "content": "# Author: Jinchuan Tian ; Jan 2022\n# jinchuantian@stu.pku.edu.cn\n\n# We test our code on k2 version 1.2; other versions may encounter problems due to API change.\n# This file contains the MMI-related utility functions:\n# 1. The (equivalent implementation of) step composition between the training / decoding graph;\n# 2. The Lattice generation process with look-ahead mechanism.\n\nfrom typing import List\nfrom typing import Optional\nfrom typing import Tuple\n\nimport torch\nimport k2\nimport _k2\nimport numpy as np # debug\nfrom k2 import Fsa, DenseFsaVec \n\n\"\"\"\nIntersection function without autograd.\n\n(1) We write this function since the arc_map_a is not accessible in k2 API\n(2) Currently we are not using the pruned version to keep all paths.\n    We will try to find a balance between the speed and the precision later.\n\"\"\"\ndef intersect_dense_forward(a_fsas: Fsa,\n                           b_fsas: DenseFsaVec,\n                           search_beam: float,\n                           output_beam: float,\n                           prune: bool,\n                           min_active_states: int,\n                           max_active_states: int,\n                           seqframe_idx_name: Optional[str] = None,\n                           frame_idx_name: Optional[str] = None): \n\n    out_fsa = [0]\n\n    if prune:\n        ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense_pruned(\n            a_fsas=a_fsas.arcs,\n            b_fsas=b_fsas.dense_fsa_vec,\n            search_beam=search_beam,\n            output_beam=output_beam,\n            min_active_states=min_active_states,\n            max_active_states=max_active_states)\n    else:\n        ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(\n            a_fsas=a_fsas.arcs,\n            b_fsas=b_fsas.dense_fsa_vec,\n            a_to_b_map=None,\n            output_beam=output_beam)\n\n    out_fsa[0] = Fsa(ragged_arc)\n\n    seqframe_idx = None\n    if 
frame_idx_name is not None:\n        num_cols = b_fsas.dense_fsa_vec.scores_dim1()\n        seqframe_idx = arc_map_b // num_cols\n        shape = b_fsas.dense_fsa_vec.shape()\n        fsa_idx0 = _k2.index_select(shape.row_ids(1), seqframe_idx)\n        frame_idx = seqframe_idx - _k2.index_select(\n            shape.row_splits(1), fsa_idx0)\n        assert not hasattr(out_fsa[0], frame_idx_name)\n        setattr(out_fsa[0], frame_idx_name, frame_idx)\n\n    if seqframe_idx_name is not None:\n        if seqframe_idx is None:\n            num_cols = b_fsas.dense_fsa_vec.scores_dim1()\n            seqframe_idx = arc_map_b // num_cols\n\n        assert not hasattr(out_fsa[0], seqframe_idx_name)\n        setattr(out_fsa[0], seqframe_idx_name, seqframe_idx)\n\n    return out_fsa[0], arc_map_a, arc_map_b\n\n\n# For each state, Add the score on the ending arc if the end state is reachable \n# from this state. Then return the frame-level scores for each Fsa.\ndef step_trace(out_fsas, a_fsas, arc_map_a, durations):\n    assert out_fsas.shape[0] == a_fsas.shape[0]\n    num_fsa = a_fsas.shape[0]\n\n    # K2 FsaVec Meta-info: num_state; 0; \n    # state_accumulated_counts (row_splits1); \n    # arc_accumulated_counts (row_splits12);\n    \n    # 1.1 Find all a_fsas arcs and meta-info\n    a_fsa_dict = a_fsas.as_dict()\n    a_fsa_meta = a_fsa_dict[\"arcs\"][: 2 * num_fsa + 4].long()\n    a_fsa_arcs = a_fsa_dict[\"arcs\"][2 * num_fsa + 4:].view(-1, 4) \n\n    # 1.2 Assign global state-ids\n    for i in range(num_fsa):\n        a_fsa_arcs[a_fsa_meta[i+num_fsa+3]: a_fsa_meta[i+num_fsa+4]][:, :2] += a_fsa_meta[i + 2]\n\n    # 1.3 Find all ending states and their scores. 
-1 means arcs entering ending states.\n    a_fsa_ending_mask = a_fsa_arcs[:, 2] == -1\n    a_ending_states = torch.masked_select(a_fsa_arcs[:, 0], a_fsa_ending_mask)\n    a_ending_scores = torch.masked_select(a_fsas.scores, a_fsa_ending_mask)\n\n    # 2.1 Find all out_fsas arcs and sort by entering states \n    out_fsa_dict = out_fsas.as_dict()\n    out_fsa_meta = out_fsa_dict[\"arcs\"][:2 * num_fsa + 4].long()\n    out_fsa_arcs = out_fsa_dict[\"arcs\"][2 * num_fsa + 4:].view(-1, 4)\n    out_incoming_ragged = out_fsas._get_incoming_arcs()\n\n    # 2.2 For each state, find an arc entering it\n    #     No entering arcs for start states but a fake one is assigned to it.\n    #     We need the corresponding arcs in a_fsas so arc_map_a is selected\n    transform_index = out_incoming_ragged.values().long()\n    select_index = out_incoming_ragged.row_splits(2).long()[:-1]\n    arc_map_a_uniq = arc_map_a[transform_index][select_index]\n    frame_idx = out_fsas.frame_idx[transform_index][select_index]\n\n    # 2.3 Find all corresponding arcs in a_fsas and their entering states\n    #     Starting states of each Fsa is set to 0 to disable the fake arcs\n    #     They are not needed to be accurate as long as 0 is not in `a_ending_states`\n    a_fsa_arcs_uniq = a_fsa_arcs[arc_map_a_uniq.long()]\n    a_states_uniq = a_fsa_arcs_uniq[:, 1]\n    # We use this to avoid out-of-range error when last several FSAs are empty\n    start_indices = torch.where(out_fsa_meta[2: 2 + num_fsa] == out_fsa_meta[2 + num_fsa],\n                                0, out_fsa_meta[2: 2 + num_fsa])\n    a_states_uniq[start_indices] = 0\n\n    # 3.1 Find the forward scores\n    #     Add ending state scores to the raw state_scores \n    #     if the final state is reachable. 
Else set to -inf\n    raw_state_scores = out_fsas._get_forward_scores(True, True)\n    state_scores = torch.ones_like(raw_state_scores) * float('-1e10')\n    for state, score in zip(a_ending_states, a_ending_scores):\n        state_scores = torch.where(a_states_uniq==state, \n                                   raw_state_scores + score, \n                                   state_scores)\n    \n    # 3.2 Allocate scores on each frame and each Fsa\n    #     Score on starting state is also accumulated on frame 1\n    #     But it is always ok since this score is always -inf\n    #     TODO: maybe we need to assume T in dense_fsa_vec supervision is identical \n    frame_ids, counts = torch.unique_consecutive(frame_idx, return_counts=True)\n\n    score_sequences, start = [], 0\n    score_sequence = []\n    for i, (fid, fc) in enumerate(zip(frame_ids.tolist(), counts.tolist())):\n        frame_score = torch.logsumexp(state_scores[start: start+fc], dim=0)        \n        score_sequence.append(frame_score)\n        start += fc\n\n        if i == len(counts) - 1 or fid > frame_ids[i+1]:\n            score_sequences.append(torch.stack(score_sequence, dim=0)[:-1])\n            score_sequence = []\n\n    # 3.3 For empty Fsas, assign -inf score sequences.\n    ans, index = [], 0\n    is_empties = out_fsa_meta[2: 2 + num_fsa] == out_fsa_meta[3: 3 + num_fsa]\n    for i, is_empty in enumerate(is_empties):\n        if is_empty:\n            ans.append(torch.ones(durations[i]) * float('-1e10'))\n        else:\n            ans.append(score_sequences[index])\n            index += 1\n    assert len(ans) == num_fsa\n    return ans\n\n\"\"\"\nStep intersection implementation\n\nInput:\nfsa, FsaVec, training graph like CTC, MMI. 
Need duplication.\ndense_fsa_vec, DenseFsaVec, created from nnet_output and the corresponding length in t-axis.\nprune: bool, If true, use a pruned version of intersection.\nsearch_beam: float, parameter used in pruned intersection only.\noutput_beam: float, parameter used in intersection.\nmin_active_states: int, parameter used in pruned intersection only.\nmax_active_states: int, parameter used in pruned intersection only.\n\nOutput: \nscore_sequences: List of 1-D tensors. The number of tensors is equal to the number of Fsas in `fsa`.\n                 Each tensor has length of T where T is the number of effective frames in nnet_output.\n                 The t-th element represents the `tot_score` of the intersected Fsa between the input `fsa` \n                 and the first t frames.\n\nThis implementation is much faster than looping T times, as the intersection is only performed once\nfor each Fsa. The sequence is recovered from the output Fsa and the arc_map_a generated by T frames.\n\nThe intersection would fail if the hypothesis is too long to reach the end-state. If all intersections\nfail, the step_trace would result in an error so we check the out_fsa in advance. (Currently we only\nuse this in MMI prefix score computation so all hypotheses and dense_fsa_vec have the same lengths. 
\nThis could also be used in MMI alignment score computation with time-axis lookahead and this function\nmight be further revised later)\n\nCurrently we find the `search_beam` and the `output_beam` should be very large to achieve accurate\ndecoding performance.\n\"\"\"\ndef step_intersect(fsa, \n                   dense_fsa_vec, \n                   prune=False, \n                   search_beam=3000, \n                   output_beam=2000,\n                   min_active_states=30,\n                   max_active_states=5000000):\n\n    out_fsa, arc_map_a, arc_map_b = intersect_dense_forward(\n      a_fsas = fsa,\n      b_fsas = dense_fsa_vec,\n      search_beam = search_beam,\n      output_beam = output_beam,\n      prune = prune,\n      min_active_states = min_active_states,\n      max_active_states = max_active_states,\n      seqframe_idx_name = \"seqframe_idx\",\n      frame_idx_name = \"frame_idx\"\n    )\n\n     # If all intersections fail\n    if out_fsa.num_arcs == 0:\n        ans = []\n        for d in dense_fsa_vec.duration:\n            ans.append(torch.ones(d.item()) * float(\"-1e10\"))\n        return ans\n\n    return step_trace(out_fsa, fsa, arc_map_a, dense_fsa_vec.duration) \n\ndef step_intersect_test(): \n    from pathlib import Path\n    lang=Path(\"data/lang_phone\")\n    device = torch.device(\"cpu\")\n    \n    # import for test only\n    from espnet.nets.scorer_interface import PartialScorerInterface\n    from snowfall.training.mmi_graph import MmiTrainingGraphCompiler\n    from snowfall.lexicon import Lexicon\n    from snowfall.training.mmi_graph import create_bigram_phone_lm\n\n    lexicon = Lexicon(lang)\n    oov = open(lang / 'oov.txt').read().strip()\n    graph_compiler = MmiTrainingGraphCompiler(lexicon, device, oov)\n    phone_ids = lexicon.phone_symbols()\n\n    torch.manual_seed(888)\n    P = create_bigram_phone_lm(phone_ids)\n    P.scores = torch.randn_like(P.scores)\n\n    texts = [\"你\", \"好\"]\n    # texts = ['本 市 警 察 近', '本 市 警 察 
今', '本 市 警 察 信', '本 市 警 察 昨', '本 市 警 察 二', '本 市 警 察 记', '本 市 警 察 数', '本 市 警 察 甚', '本 市 警 察 继', '本 市 警 察 至', '本 市 警 察 进', '本 市 警 察 几', '本 市 警 察 五', '本 市 警 察 仅', '本 市 警 察 上']\n    num, den = graph_compiler.compile(texts, P, replicate_den=True)\n    graph = num \n \n    T = 3\n    beam_size = len(texts)\n    odim = len(phone_ids) + 1\n    nnet_output = torch.rand([beam_size, T, odim])\n\n    supervision = torch.stack([\n                          torch.arange(beam_size),\n                          torch.zeros(beam_size),\n                          torch.ones(beam_size) * T,\n                          ], dim=-1).cpu().int()   \n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision) \n    score_sequences = step_intersect(graph, \n                                    dense_fsa_vec,\n                                    prune=False,\n                                    search_beam=30,\n                                    output_beam=20,\n                                    min_active_states=30,\n                                    max_active_states=100000) \n    print(score_sequences)\n    return\n    print(\"####  old method ###\")\n    buf = []\n    for t in range(1, T+1):\n        supervision = torch.stack([\n                          torch.arange(beam_size),\n                          torch.zeros(beam_size),\n                          torch.ones(beam_size) * t,\n                          ], dim=-1).cpu().int()\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n        num_lats = k2.intersect_dense(graph, dense_fsa_vec, output_beam=30.0)\n        num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n        buf.append(num_tot_scores)\n\n    buf = torch.stack(buf, dim=1)\n    score_sequences = torch.stack(score_sequences, dim=0)\n    print(buf - score_sequences)\n \nif __name__ == \"__main__\":\n    step_intersect_test() \n"
  },
  {
    "path": "nets/scorers/new_mmi_frame_scorer.py",
    "content": "import k2\nimport torch\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.nets.scorers.trace_frame import trace_frame\n\nclass MMIFrameScorer(PartialScorerInterface):\n    def __init__(self, lang, device, idim, sos_id, rank, use_segment, char_list):\n        \n        self.lang = lang\n        self.device = device\n        \n        self.lexicon = Lexicon(lang)\n        self.graph_compiler = MmiTrainingGraphCompiler(self.lexicon, self.device)\n        self.phone_ids = self.lexicon.phone_symbols()\n      \n        self.lo = torch.nn.Linear(idim, len(self.phone_ids) + 1) \n        self.lm_scores = None\n        self.load_weight(rank)\n\n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.set_scores_stochastic_(self.lm_scores)\n        \n        self.char_list = char_list\n        self.eos = sos_id  # <sos> is identical to <eos>\n        self.blank = 0 # by default 0 means CTC blank \n        self.logzero = -10000\n\n    def load_weight(self, rank):\n        # load lo weight and lm_scores\n        ckpt_dict = torch.load(self.lang / f\"mmi_param.{rank}.pth\")\n        for v in ckpt_dict.values():\n            v.requires_grad=False\n        self.lm_scores = ckpt_dict[\"lm_scores\"]\n        lo_dict = {\"weight\": ckpt_dict[\"lo.1.weight\"],\n                   \"bias\": ckpt_dict[\"lo.1.bias\"]}\n        self.lo.load_state_dict(lo_dict)\n\n    def init_state(self, x):\n        torch.set_printoptions(sci_mode=False)\n\n        x = x[:50]\n        # (1) nnet_output\n        x = x.unsqueeze(0)\n        nnet_output = self.lo(x)\n\n        # (2) den_scores\n        texts = [\"<UNK>\"] # use a random text, just to get den graph\n        _, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n\n        T = x.size()[1]\n\n        
den_scores = trace_frame(nnet_output, den, \"<UNK>\")\n        # [T] -> [B, T]\n        den_scores = den_scores.unsqueeze(0)\n        print(\"den_score: \", den_scores)\n\n        # (3) Prev Score is zero\n        prev_score = torch.Tensor([0]).to(torch.float32)\n        return nnet_output, den_scores, prev_score \n\n    def select_state(self, states, j):\n        nnet_output_single, den_scores, prev_scores = states\n        return nnet_output_single, den_scores, prev_scores[j]\n\n    def score(self, **kargs):\n        raise NotImplementedError\n\n    def score_partial(self, y, next_tokens, state, hs_pad):\n        # Warning: All frame-level scores are adopted in reverse order in time-axis \n        # since k2 requires a descending input length\n\n        # (1) unpack state\n        nnet_output_single, den_scores, prev_score = state\n        batch_size = len(next_tokens)\n\n        # (3) texts\n        y = y.unsqueeze(0).repeat(batch_size, 1)\n        next_tokens = next_tokens.unsqueeze(1)\n        ys = torch.cat([y, next_tokens], dim=1)\n        # This is for Chinese. 
Need more tuning on English\n        if not \"<space>\" in self.char_list:\n            texts = [\" \".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n            texts = [text.replace(\"<eos>\", \"\").strip() for text in texts]\n        else:\n            texts = [\"\".join([self.char_list[tid] for tid in text[1:]]) for text in ys]\n            texts = [text.replace(\"<eos>\", \"\").replace(\"<space>\", \"<space> \").strip() for text in texts]\n\n        num_scores = []\n        for text in texts:\n             print(text, flush=True)\n             num_graph, _ = self.graph_compiler.compile([text], self.P, replicate_den=False)\n             num_graph[0].draw(f\"{text}_graph.svg\")\n             num_score = trace_frame(nnet_output_single, num_graph, text)\n             num_scores.append(num_score)\n        num_scores = torch.stack(num_scores, dim=0)\n\n        #     minus the denominator scores\n        # we should keep the frame-level result for <eos>\n        tot_scores_frame = num_scores - den_scores\n        tot_scores = torch.logsumexp(tot_scores_frame, dim=-1)\n        print(tot_scores_frame)       \n \n        # (5) treat <eos> and ctc <blk> specailly\n        # <eos> means the exact probability rather than the prefix probability \n        eos_pos = torch.where(next_tokens == self.eos)[0]\n        if len(eos_pos) > 0:\n            tot_scores[eos_pos] = tot_scores_frame[eos_pos.item(), 0]\n\n        # CTC blank is never allowed in hypothesis. kill it\n        blk_pos = torch.where(next_tokens == self.blank)[0]\n        if len(blk_pos) > 0:\n            tot_scores[blk_pos] = self.logzero\n        \n        # (6) finalize\n        tok_scores = tot_scores - prev_score\n        state = nnet_output_single, den_scores, tot_scores\n        return tok_scores, state\n\n    def final_score(self, state):\n        return 0     \n                \n"
  },
  {
    "path": "nets/scorers/ngram.py",
    "content": "\"\"\"Ngram lm implement.\"\"\"\n\nfrom abc import ABC\n\nimport kenlm\nimport torch\n\nfrom espnet.nets.scorer_interface import BatchScorerInterface\nfrom espnet.nets.scorer_interface import PartialScorerInterface\n\n\nclass Ngrambase(ABC):\n    \"\"\"Ngram base implemented throught ScorerInterface.\"\"\"\n\n    def __init__(self, ngram_model, token_list):\n        \"\"\"Initialize Ngrambase.\n\n        Args:\n            ngram_model: ngram model path\n            token_list: token list from dict or model.json\n\n        \"\"\"\n        self.chardict = [x if x != \"<eos>\" else \"</s>\" for x in token_list]\n        self.charlen = len(self.chardict)\n        self.lm = kenlm.LanguageModel(ngram_model)\n        self.tmpkenlmstate = kenlm.State()\n\n    def init_state(self, x):\n        \"\"\"Initialize tmp state.\"\"\"\n        state = kenlm.State()\n        self.lm.NullContextWrite(state)\n        return state\n\n    def score_partial_(self, y, next_token, state, x):\n        \"\"\"Score interface for both full and partial scorer.\n\n        Args:\n            y: previous char\n            next_token: next token need to be score\n            state: previous state\n            x: encoded feature\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        out_state = kenlm.State()\n        ys = self.chardict[y[-1]] if y.shape[0] > 1 else \"<s>\"\n        self.lm.BaseScore(state, ys, out_state)\n        scores = torch.empty_like(next_token, dtype=x.dtype, device=y.device)\n        for i, j in enumerate(next_token):\n            scores[i] = self.lm.BaseScore(\n                out_state, self.chardict[j], self.tmpkenlmstate\n            )\n        return scores, out_state\n\n\nclass NgramFullScorer(Ngrambase, BatchScorerInterface):\n    \"\"\"Fullscorer for ngram.\"\"\"\n\n    def 
score(self, y, state, x):\n        \"\"\"Score interface for both full and partial scorer.\n\n        Args:\n            y: previous char\n            state: previous state\n            x: encoded feature\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        return self.score_partial_(y, torch.tensor(range(self.charlen)), state, x)\n\n\nclass NgramPartScorer(Ngrambase, PartialScorerInterface):\n    \"\"\"Partialscorer for ngram.\"\"\"\n\n    def score_partial(self, y, next_token, state, x):\n        \"\"\"Score interface for both full and partial scorer.\n\n        Args:\n            y: previous char\n            next_token: next token need to be score\n            state: previous state\n            x: encoded feature\n\n        Returns:\n            tuple[torch.Tensor, List[Any]]: Tuple of\n                batchfied scores for next token with shape of `(n_batch, n_vocab)`\n                and next state list for ys.\n\n        \"\"\"\n        return self.score_partial_(y, next_token, state, x)\n\n    def select_state(self, state, i):\n        \"\"\"Empty select state for scorer interface.\"\"\"\n        return state\n"
  },
  {
    "path": "nets/scorers/sorted_matcher.py",
    "content": "import math\nimport kaldi.fstext as fst\n\nclass SortedMatcher(object):\n    \"\"\"\n    class implements searching arc/scores on FST\n    \n    Args:\n        vector_fst (object): loaded fst\n        max_num_arcs (int): maximum number of arcs starting from one fst state\n        max_id (int): maximum i/o label id of LM fst\n        backoff_id (int): backoff id of LM fst\n        disambig_ids (List of int): disambig ids of LM fst\n    \"\"\"\n    def __init__(self, vector_fst, max_num_arcs,\n                 max_id, backoff_id, disambig_ids):\n        #make sure fst is i/o label sorted\n        self.fst = vector_fst\n        self.max_num_arcs = max_num_arcs\n        self.max_id = max_id\n        self.backoff_id = backoff_id\n        self.disambig_ids = disambig_ids\n\n    def search(self, state_id, ilabel):\n        \"\"\"\n        binary search on ArcIterator\n        \"\"\"\n        aiter = self.fst.arcs(state_id)\n        #binary search on ArcIterator\n        size = self.max_num_arcs\n        high = size - 1\n        while size > 1:\n            half = size // 2\n            mid = high - half\n            aiter.seek(mid)\n            if aiter.done():\n                cur_id = self.max_id\n            else:\n                cur_id = aiter.value().ilabel\n            if cur_id >= ilabel:\n                high = mid\n            size -= half\n        aiter.seek(high)\n        if aiter.done():\n            return False, None\n        if aiter.value().ilabel == ilabel:\n            return True, aiter\n        return False, None\n\n    \"\"\"\n    Tyriontian, Questions:\n    (1) This function tries to find every paths (with backoff) that \n        accepts ilabel. There are possibly many of them, and will not\n        lead to much difference (hori's book). 
\n        To achieve better speed, try to only find the path that \n        has samllest order of backoff.\n    \"\"\"\n    def get_scores_wodisambig(self, state_id, ilabel, init_score=0.0):\n        scores = []\n        states = []\n        bf_score = init_score\n        cur_state = state_id\n        while True:\n            has_arc, aiter = self.search(cur_state, ilabel)\n            if has_arc:\n                scores.append(bf_score + aiter.value().weight.value)\n                states.append(aiter.value().nextstate)\n            \n            has_backoff, aiter_bf = self.search(cur_state, self.backoff_id)\n            if has_backoff:\n                bf_score += aiter_bf.value().weight.value\n                cur_state = aiter_bf.value().nextstate\n            else:\n                return scores, states\n\n    \"\"\"\n    Given the state_id and ilabel, this function tries to find all \n    paths that have different disambig symbols and backoff order.\n    This could be much time consuming. \n    O( log(max_num_arcs) * num_disambig * (lm_rank - 1) )\n    \"\"\"\n    def get_scores(self, state_id, ilabel):\n        init_scores = [0.0]\n        init_states = [state_id]\n        #check disambig arcs,\n        for label in self.disambig_ids:\n            found, aiter = self.search(state_id, label)\n            if found:\n                init_scores.append(aiter.value().weight.value)\n                init_states.append(aiter.value().nextstate)\n        scores, states = [], []\n        for i, init_score in enumerate(init_scores):\n            cur_sc, cur_st = self.get_scores_wodisambig(init_states[i],\n                                                        ilabel, init_score)\n            scores.extend(cur_sc)\n            states.extend(cur_st)\n        return scores, states\n\n\n    \"\"\"\n    Similarly, find all path with different disambig symbols. 
Then \n    add the final score with all possible backoff\n    \"\"\"\n    def final_score(self, state_id):\n        final_scores = [0.0]\n        final_states = [state_id]\n        #check disambig arcs,\n        for label in self.disambig_ids:\n            found, aiter = self.search(state_id, label)\n            if found:\n                final_scores.append(aiter.value().weight.value)\n                final_states.append(aiter.value().nextstate)\n        def search_final(state_id, init_score=0.0):\n            score = init_score\n            cur_state = state_id\n            while True:\n                final_score = self.fst.final(cur_state).value\n                if math.isinf(final_score):\n                    found, aiter = self.search(cur_state, self.backoff_id)\n                    if found:\n                        score += aiter.value().weight.value\n                        cur_state = aiter.value().nextstate\n                    else:\n                        return float('inf'), None\n                else:\n                    score += final_score\n                    return score, cur_state\n        for i, final_score in enumerate(final_scores):\n            final_scores[i], final_states[i] = search_final(final_states[i],\n                                                            final_score)\n        return final_scores, final_states\n"
  },
  {
    "path": "nets/scorers/test.py",
    "content": "        \n        num, den = self.graph_compiler.compile(texts, self.P, replicate_den=True)\n        T = x.size()[1]\n        scores = []\n        for t in range(T, 0, -1):\n            supervision = torch.Tensor([[0, 0, t]]).to(torch.int32) # [idx, start, length]\n            dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n            lats = k2.intersect_dense(den, dense_fsa_vec, output_beam=10.0)\n            frame_score = lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n            scores.append(frame_score)\n        tot_scores = torch.cat(scores).unsqueeze(0)\n\n\n"
  },
  {
    "path": "nets/scorers/tlg_scorer.py",
    "content": "# Author: tyriontian\n# tyriontian@tencent.com\n\nimport os\nimport sys\nimport torch\nimport kaldi.fstext as fst\n\nfrom pathlib import Path\nfrom espnet.nets.scorer_interface import PartialScorerInterface\nfrom espnet.nets.scorers.sorted_matcher import SortedMatcher\n\nclass TlgPartialScorer(PartialScorerInterface):\n    \"\"\"\n    This is a wrapper for Espnet: the word-level N-gram LM on-the-fly decoding method.\n    (proposed by cweng, cweng@tencent.com)\n    \"\"\"\n\n    def __init__(self, lang, nonblk_reward=0.0):\n        self.lang = Path(lang)\n        \n        # build the SortedMatcher: core of this algorithm\n        # the `lang` directory should have these files \n        disambig_ids = open(self.lang / \"disambig_ids\").readline().replace(\"\\n\", \"\").split(\",\")\n        disambig_ids = [int(i) for i in disambig_ids]\n        backoff_id = int(open(self.lang / \"backoff_id\").readline().strip())\n        max_id = int(open(self.lang / \"max_id\").readline().strip())\n        max_num_arcs = int(open(self.lang / \"max_num_arcs\").readline().strip())\n        fst_lm = fst.StdVectorFst.read(str(self.lang / \"LG.fst\"))\n\n        self.scorer = SortedMatcher(fst_lm, max_num_arcs, max_id, backoff_id, disambig_ids) \n        \n        # reward whenever a new non-blank token generated\n        assert nonblk_reward >= 0.0\n        self.nonblk_reward = nonblk_reward\n\n        print(\"Build TLG scorer successfully!\", flush=True)\n\n    def init_state(self, x=None):\n        \"\"\"\n        0 is the starting state\n        \"\"\"\n        return {0: 0.0}\n\n    def score_partial(self, y, next_tokens, state, x):\n        \"\"\"\n        args:\n        y: interface required. 
Not used here\n        next_tokens: list of token-ids to search\n        state: dict, {state1: score1, state2: score2, ...}\n               state is shared for all token-ids\n        x: interface required, Not used here\n\n        return:\n        scores: list of scores for each token-id\n        next_states: list of dicts, each of which is in format like `state`\n\n        Hint: next_tokens contains no <blank> \n        \"\"\"\n        scores = []\n        next_states = []\n        for tok_id in next_tokens:\n            # <eps> is not in our vocab but is used when compiling LG.fst,\n            # so every token-id is shifted by 1\n            score, next_state = self.score_one(tok_id + 1, state)\n            scores.append(score)\n            next_states.append(next_state)\n\n        return scores, next_states\n\n    def score_one(self, tok_id, state_dict):\n        # Default entry, in case the searched results are all empty.\n        scores = [1e10]\n        next_states = [0]\n        for state, prev_score in state_dict.items():\n            searched = list(self.scorer.get_scores(state, tok_id))\n            searched[0] = [x + prev_score for x in searched[0]]\n            scores += searched[0]\n            next_states += searched[1]\n        \n        # The scores being compared already include the previous scores.\n        # Keep the minimum score for each reachable state.\n        next_dict = {}\n        for state, score in zip(next_states, scores):\n            if state in next_dict:\n                next_dict[state] = min(next_dict[state], score)\n            else:\n                next_dict[state] = score\n        \n        next_dict = {k: v + self.nonblk_reward for k, v in next_dict.items()}\n        # The minimum value in the state dict is exactly the accumulated score of the\n        # whole history. The first-order difference is the token-level score.\n        score = min(next_dict.values()) - min(state_dict.values())\n        return - score, next_dict\n           \n    def final_score(self, states):\n        \"\"\"\n        args:\n        states: list of dict {state1: score1, state2: score2, ...}\n        \n        return: \n        scores: final scores for each hypothesis\n        States are not returned since they are no longer needed.\n        \"\"\"\n        scores = []\n        for state in states:\n            score = self.final_score_one(state)\n            scores.append(score)\n        return scores\n\n    def final_score_one(self, state_dict):\n        scores = []\n        for state, _ in state_dict.items():\n            searched = self.scorer.final_score(state)\n            scores += searched[0]\n        score = min(scores) - min(state_dict.values())\n        return score\n        \nif __name__ == \"__main__\":\n   token_list = [s.split()[0] for s in open(\"data/char.txt\").readlines()]\n   token_list.insert(0, \"<blk>\")\n   # TlgPartialScorer.__init__ accepts only `lang` and `nonblk_reward`\n   scorer = TlgPartialScorer(\"data/tlg_ngram\")\n\n   texts = [\"天空很蓝\", \"天坑很蓝\", \"我爱你\", \"我艾你\", \"宇智波鼬\", \"宇子波鼬\", \"翁超\",\"余剑威\", \"田晋川\"]\n   for text in texts:\n       text_ids = [token_list.index(t) for t in text]\n       state = scorer.init_state(None)\n       for text_id in text_ids:\n           score, next_states = scorer.score_partial(None, [text_id], state, None)\n           state = next_states[0]\n           print(f\"token: {token_list[text_id]} | score: {score} | state: {state}\")\n       score = scorer.final_score([state])\n       print(f\"Final score: {score}\") \n   \n"
  },
  {
    "path": "nets/scorers/trace_frame.py",
    "content": "import torch\nimport k2\nimport numpy as np\nimport _k2\n\n\"\"\"\ndef _trace_frame(lats): \n    arcs = lats[0].as_dict()['arcs']\n    lats[0].draw(\"den_3frame.svg\")\n\n    frame2state = []\n    prev_buf, cur_buf = [0], []\n\n    for arc in arcs:\n        f, t, _, _ = arc\n        f, t = int(f), int(t)\n\n        if f in prev_buf:\n            if not t in cur_buf:\n                cur_buf.append(t)\n\n        else:\n            frame2state.append(prev_buf)\n            prev_buf = cur_buf\n            cur_buf = [t]\n    \n    frame2state.append(prev_buf) # last frame\n    frame2state.append([t]) # final state\n    return frame2state\n\"\"\"\n\ndef trace_lattice(lats):\n    arcs = lats.arcs.values()[:, :2]\n    T = max(lats.frame).item()\n    frame2state = [[] for _ in range(T+1)]\n\n    for idx, (_, dst) in enumerate(arcs.tolist()):\n        frame_idx = lats.frame[idx]\n        if dst not in frame2state[frame_idx]:\n            frame2state[frame_idx].append(dst)\n     \n    return frame2state\n\ndef compute_frame_level_scores(graph, nnet_output):\n    T = nnet_output.size()[1]\n\n    # dump lattice\n    supervision = torch.Tensor([[0, 0, T]]).to(torch.int32) \n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n    lats = k2.intersect_dense(graph, dense_fsa_vec, output_beam=10.0,\\\n           seqframe_idx_name='seqframe', frame_idx_name='frame')\n    \n    # compute frame-level scores\n    forward_scores = lats.get_forward_scores(True, True)\n    frame2states = trace_lattice(lats)\n    assert len(frame2states) == T + 1 # extra final state\n\n    tot_scores = []\n    for t in range(T, 0, -1):\n        # scores for the last frame\n        if t == T:\n            tot_scores.append(forward_scores[-1])\n        \n        # scores for other frames\n        else:\n            states = frame2states[t-1]\n            frame_score = torch.logsumexp(forward_scores[states], dim=-1)\n            tot_scores.append(frame_score)\n    tot_scores = 
torch.stack(tot_scores, dim=0)\n    \n    return tot_scores\n\ndef trace_lattice_batch(lats, batch):\n    T = max(lats.frame).item()\n    frame2state = [[[] for _ in range(T+1)] for __ in range(batch)] # 2-D list: [batch, T]\n    arcs = lats.arcs.values()[:, :2].tolist() \n\n    batch_idx, last_is_zero = -1, False\n    for idx, (src, dst) in enumerate(arcs):\n        \n        if src == 0 and last_is_zero == False:\n            batch_idx += 1\n            last_is_zero = True\n\n        if not src == 0:\n            last_is_zero = False\n\n        frame_idx = lats.frame[idx]\n        if dst not in frame2state[batch_idx][frame_idx]:\n            frame2state[batch_idx][frame_idx].append(dst) \n\n    return frame2state\n\ndef split_forward_scores(scores):\n    # splits the forward_scores according to the start state\n    scores_splits = []\n    prev_idx = 0 \n    for i in range(1, len(scores)):\n        if scores[i] == 0:\n            scores_splits.append(scores[prev_idx: i])\n            prev_idx = i\n    scores_splits.append(scores[prev_idx:])\n    return scores_splits\n\ndef compute_frame_level_scores_batch(graphs, nnet_output):\n    # We would assume that nnet_output in different batch\n    # is the same. 
This is only used for batch decoding\n    batch, T, _ = nnet_output.size()\n\n    supervision = torch.stack([\n                  torch.arange(batch),\n                  torch.zeros(batch),\n                  torch.ones(batch) * T\n                  ], dim=1).to(torch.int32)\n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n    lats = k2.intersect_dense(graphs, dense_fsa_vec, output_beam=10.0,\n              seqframe_idx_name='seqframe', frame_idx_name='frame')\n\n    forward_scores = lats.get_forward_scores(True, True)\n    forward_scores = split_forward_scores(forward_scores)\n\n    frame2state = trace_lattice_batch(lats, batch)\n    \n    tot_scores = [[] for _ in range(batch)]\n    for b in range(batch):\n        for f in range(T, 0, -1): # descending order\n            state = frame2state[b][f-1]\n            frame_score = torch.logsumexp(forward_scores[b][state], dim=-1)\n            tot_scores[b].append(frame_score)\n    tot_scores = torch.Tensor(tot_scores)\n    return tot_scores \n\n# this is only for debugging\ndef compute_frame_level_scores_loop(graph, nnet_output):\n    T = nnet_output.size()[1]\n\n    tot_scores = []\n    for t in range(T, 0, -1):\n        # feed one more frame if it is not the last frame,\n        # so that the states in the first t frames are identical\n        # to those in the whole lattice\n        t_ = t if t == T else t + 1\n        supervision = torch.Tensor([[0, 0, t_]]).to(torch.int32)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n        lats = k2.intersect_dense(graph, dense_fsa_vec, output_beam=10.0,\\\n               seqframe_idx_name='seqframe', frame_idx_name='frame')\n\n        forward_scores = lats.get_forward_scores(True, True)\n        frame2states = trace_lattice(lats)\n        \n        if t == T:\n            tot_scores.append(forward_scores[-1])\n        else:\n            assert len(frame2states) == t + 2\n            states = frame2states[t-1]\n            frame_score = 
torch.logsumexp(forward_scores[states], dim=-1) \n            tot_scores.append(frame_score)\n    tot_scores = torch.stack(tot_scores, dim=0)\n    return tot_scores\n\nif __name__ == '__main__':\n    batch_size = 3\n    nnet_output = torch.tensor(\n    [\n     [0.1, 0.22, 0.28, 0.4],\n     [0.1, 0.13, 0.07, 0.7],\n     [0.6, 0.2, 0.05, 0.15],\n    ], dtype=torch.float32\n    ).unsqueeze(0).repeat(batch_size, 1, 1)\n    nnet_output = torch.nn.functional.log_softmax(nnet_output, -1)\n\n    graph = k2.ctc_graph([[1], [1,2], [1,2,3]])\n    \n    scores = compute_frame_level_scores_batch(graph, nnet_output)\n    \n    #scores = compute_frame_level_scores(graph, nnet_output)\n    #print(\"Scores computed by new version: \", scores)\n\n    #scores = compute_frame_level_scores_loop(graph, nnet_output)\n    #print(\"Scores computed by original version: \", scores)\n"
  },
  {
    "path": "nets/scorers/word_ngram.py",
    "content": "# Author: tyriontian\n# tyriontian@tencent.com\n# also see: https://github.com/k2-fsa/k2/issues/874\n\nimport os\nimport sys\nimport re\nimport torch\nimport k2\nimport copy\nimport fcntl\nimport time\n\nfrom pathlib import Path\nfrom espnet.nets.scorer_interface import PartialScorerInterface\n\n\ndef is_disambig_symbol(symbol: str, pattern: re.Pattern = re.compile(r'^#\\d+$')) -> bool:\n    return pattern.match(symbol) is not None\n\ndef find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:\n    return min(v for k, v in symbols._sym2id.items() if is_disambig_symbol(k))\n\n\"\"\"\nSeveral reminders:\n1. make sure '#0' is the last symbol in words.txt so '<s>' and '</s>' will not be \n   changed to epsilon in `G.labels[G.labels >= first_word_disambig_id] = 0`\n2. There might be a bug that `k2.create_fsa_vec` will make G.fst unsorted. make sure\n   you sort the G.fst as a FsaVec rather than a Fsa.\n3. To be compatible with the back-off paths in G.fst, self-loops are added to the \n   raw lattices.\n4. This module works on both CPU and GPU. it depends on the type of `device` you feed.\n   We find that CPU is even faster than GPU when the test scale is small\n5. Each time we load the G.fst from disk we need to call `k2.arc_sort` to make sure \n   the `properties` in it is correct. However, this would leads to a spike in memory\n   use and will slow the initialization. Directly change the G.properties is dangerous\n   and cannot solve this problem.\n6. Reading the G.pt exclusively, which is ensured by the system file lock.\n   Each time only one process would do the reading. This is to avoid memory spike. 
\n\"\"\"\nclass WordNgram():\n    def __init__(self, lang, device):\n        self.lang = Path(lang)\n        self.device = device\n        self.is_cuda = device.type == \"cuda\"\n\n        self.symbol_table = k2.SymbolTable.from_file(self.lang / 'words.txt')\n        self.oovid = int(open(self.lang / 'oov.int').read().strip())\n\n\n        self.load_G()\n        return\n        # rapid access on the disk may lead to error\n        # try many times\n        for i in range(10):\n            try:\n                self.load_G()\n                break\n            except:\n                print(f\"{i}-th trial to load G.fst but failed\")              \n\n    def load_G(self):\n        \n        if os.path.exists(self.lang / 'G.pt'):\n            f = open(self.lang / 'G.pt', 'r')\n            fcntl.flock(f, fcntl.LOCK_EX) # lock\n            G_dict = torch.load(self.lang / 'G.pt')\n            fcntl.flock(f, fcntl.LOCK_UN) # unlock\n            G = k2.Fsa.from_dict(G_dict).to(self.device)\n            G = k2.create_fsa_vec([G])\n            G = k2.arc_sort(G)\n            print(\"Successfully load the cached G.pt\", flush=True)\n            \n        else:\n            f = open(self.lang / 'G.fst.txt') \n            G = k2.Fsa.from_openfst(f.read(), acceptor=False)\n            del G.aux_labels\n\n            first_word_disambig_id = find_first_disambig_symbol(self.symbol_table)\n            G.labels[G.labels >= first_word_disambig_id] = 0 \n            G = k2.arc_sort(G)\n            \n            torch.save(G.as_dict(), self.lang / 'G.pt')\n            print(\"No cached G.pt found. Build a new G from G.fst.txt\")\n        \n        self.G = G.to(self.device)\n\n    def text2lat(self, text, gram_len=6):\n        \"\"\"\n        Enumreate all possible paths that split text into word sequence. 
\n        Output will be an epsilon-free, acyclic lattice on self.device\n        \n        text: list of tokens \n        gram_len: the maximum length of each word, in characters\n        \"\"\"\n        text = [tok.replace(\" \", \"\") for tok in text] # exclude blank\n        \n        if gram_len < 1:\n            raise ValueError(\"invalid gram_len. it should be at least 1\")\n\n        arcs = []\n        for s in range(len(text) + 1):\n            for r in range(1, gram_len + 1):\n                if len(text) - s - r < 0:\n                    continue\n                \n                # in-vocabulary words get their own arc; a single unknown\n                # character falls back to the OOV id\n                w = \"\".join(text[s: s + r])\n                if w in self.symbol_table:\n                    wid = self.symbol_table[w] \n                elif r == 1:\n                    wid = self.oovid\n                else:\n                    continue\n                \n                arc = [s, s+r, wid, 0.0]\n                arcs.append(arc)\n        arcs.append([len(text), len(text) + 1, -1, 0.0])\n        arcs.append([len(text) + 1])\n      \n        arcs = sorted(arcs, key=lambda arc: arc[0])\n        arcs = [[str(i) for i in arc] for arc in arcs]\n        arcs = [' '.join(arc) for arc in arcs]\n        arcs = '\\n'.join(arcs)\n\n        lat = k2.Fsa.from_str(arcs, True)\n        lat = k2.arc_sort(lat)\n        return lat.to(self.device)\n\n    def score_lattice(self, lats, log_semiring=True):\n        \"\"\"\n        Apply the scores on G.fst to the lattice.\n        Both the input and the output are k2.FsaVec\n        \"\"\"\n        assert lats.device == self.device\n\n        lats = k2.add_epsilon_self_loops(lats)\n        if self.is_cuda:\n            b_to_a_map = torch.zeros(lats.shape[0],\n                         device=self.device, dtype=torch.int32)\n            scored_lattice = k2.intersect_device(\n                             self.G, lats, b_to_a_map,\n                             sorted_match_a=True \n                             )\n            scored_lattice = k2.top_sort(\n                             k2.connect(\n                             k2.remove_epsilon_self_loops(\n                             scored_lattice\n                             )))\n        else:\n            scored_lattice = k2.intersect(self.G, lats, \n                             treat_epsilons_specially=False)\n            scored_lattice = k2.top_sort(\n                             k2.connect(\n                             k2.remove_epsilon_self_loops(scored_lattice\n                             )))\n        return scored_lattice\n\n    def score_texts(self, texts, log_semiring=True):\n        lats = [self.text2lat(t) for t in texts]\n        lats = k2.create_fsa_vec(lats)\n     \n        scored_lattice = self.score_lattice(lats, log_semiring)\n        \n        scores = scored_lattice._get_tot_scores(log_semiring, True)\n        \n        return scores\n\n    def draw(self, fsavec, prefix=None):\n        for i in range(fsavec.shape[0]):\n            fsa = fsavec[i]\n            fsa.draw(f\"{prefix}_{i}.svg\")\n\n\n# Wrapper to Espnet scorer interface\nclass WordNgramPartialScorer(PartialScorerInterface):\n    def __init__(self, lang, device, token_list, ignore_tokens=(\"<eos>\", \"<blank>\"), log_semiring=True, lower_char=True):\n        self.WordNgram = WordNgram(lang, device)\n        self.log_semiring = log_semiring\n        \n        self.token_list = copy.deepcopy(token_list)\n        for tok in ignore_tokens:\n            if tok in self.token_list:\n                idx = self.token_list.index(tok)\n                self.token_list[idx] = \"\"\n \n        # oov should be a single character\n        if \"<unk>\" in self.token_list:\n            idx = self.token_list.index(\"<unk>\")\n            self.token_list[idx] = \"兲\"\n\n        # ignore upper / lower case of english characters.\n        # this should be consistent with your words.txt\n        for tok in self.token_list:\n            if tok.isalpha():\n                idx 
= self.token_list.index(tok)\n                if lower_char:\n                    self.token_list[idx] = tok.lower()\n                else:\n                    self.token_list[idx] = tok.upper()\n\n    def init_state(self, x):\n        return 0.0\n\n    def score_partial(self, y, next_tokens, state, x):\n        prefix = [self.token_list[i] for i in y] \n        next_tokens = [self.token_list[i] for i in next_tokens]\n        texts = [prefix + [tok] for tok in next_tokens]\n\n        scores = self.WordNgram.score_texts(texts, \n                     log_semiring=self.log_semiring)\n        \n        return scores - state, scores\n    \nif __name__ == \"__main__\":\n    device = torch.device(\"cpu\") # cpu or cuda:0\n    lang = sys.argv[1]\n    word_ngram = WordNgram(lang, device)\n\n    texts = [[\"这\", \"件\", \"事\", u\"\\u2581\"+\"INTEREST\", \"ING\", u\"\\u2581\"+\"ALLOW\" , \"ED\"]]\n    print(texts)\n    for i in range(1):\n        scores = word_ngram.score_texts(texts, log_semiring=True)\n        print(scores)\n"
  },
  {
    "path": "nets/st_interface.py",
    "content": "\"\"\"ST Interface module.\"\"\"\n\nfrom espnet.nets.asr_interface import ASRInterface\nfrom espnet.utils.dynamic_import import dynamic_import\n\n\nclass STInterface(ASRInterface):\n    \"\"\"ST Interface for ESPnet model implementation.\n\n    NOTE: This class is inherited from ASRInterface to enable joint translation\n    and recognition when performing multi-task learning with the ASR task.\n\n    \"\"\"\n\n    def translate(self, x, trans_args, char_list=None, rnnlm=None, ensemble_models=[]):\n        \"\"\"Recognize x for evaluation.\n\n        :param ndarray x: input acouctic feature (B, T, D) or (T, D)\n        :param namespace trans_args: argment namespace contraining options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        raise NotImplementedError(\"translate method is not implemented\")\n\n    def translate_batch(self, x, trans_args, char_list=None, rnnlm=None):\n        \"\"\"Beam search implementation for batch.\n\n        :param torch.Tensor x: encoder hidden state sequences (B, Tmax, Henc)\n        :param namespace trans_args: argument namespace containing options\n        :param list char_list: list of characters\n        :param torch.nn.Module rnnlm: language model module\n        :return: N-best decoding results\n        :rtype: list\n        \"\"\"\n        raise NotImplementedError(\"Batch decoding is not supported yet.\")\n\n\npredefined_st = {\n    \"pytorch\": {\n        \"rnn\": \"espnet.nets.pytorch_backend.e2e_st:E2E\",\n        \"transformer\": \"espnet.nets.pytorch_backend.e2e_st_transformer:E2E\",\n    },\n    # \"chainer\": {\n    #     \"rnn\": \"espnet.nets.chainer_backend.e2e_st:E2E\",\n    #     \"transformer\": \"espnet.nets.chainer_backend.e2e_st_transformer:E2E\",\n    # }\n}\n\n\ndef dynamic_import_st(module, backend):\n    \"\"\"Import ST models dynamically.\n\n 
   Args:\n        module (str): module_name:class_name or alias in `predefined_st`\n        backend (str): NN backend. e.g., pytorch, chainer\n\n    Returns:\n        type: ST class\n\n    \"\"\"\n    model_class = dynamic_import(module, predefined_st.get(backend, dict()))\n    assert issubclass(\n        model_class, STInterface\n    ), f\"{module} does not implement STInterface\"\n    return model_class\n"
  },
  {
    "path": "nets/transducer_decoder_interface.py",
    "content": "\"\"\"Transducer decoder interface module.\"\"\"\n\nfrom dataclasses import dataclass\nfrom typing import Any\nfrom typing import Dict\nfrom typing import List\nfrom typing import Optional\nfrom typing import Tuple\nfrom typing import Union\n\nimport torch\n\n\n@dataclass\nclass Hypothesis:\n    \"\"\"Default hypothesis definition for beam search.\"\"\"\n\n    score: float\n    yseq: List[int]\n    dec_state: Union[\n        Tuple[torch.Tensor, Optional[torch.Tensor]], List[torch.Tensor], torch.Tensor\n    ]\n    lm_state: Union[Dict[str, Any], List[Any]] = None\n    mmi_tot_score: float = None\n    word_ngram_score: float = None\n    tlg_state: dict = None\n\n\n@dataclass\nclass NSCHypothesis(Hypothesis):\n    \"\"\"Extended hypothesis definition for NSC beam search.\"\"\"\n\n    y: List[torch.Tensor] = None\n    lm_scores: torch.Tensor = None\n\n\nclass TransducerDecoderInterface:\n    \"\"\"Decoder interface for transducer models.\"\"\"\n\n    def init_state(\n        self,\n        batch_size: int,\n        device: torch.device,\n    ) -> Union[\n        Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]\n    ]:\n        \"\"\"Initialize decoder states.\n\n        Args:\n            batch_size: Batch size for initial state\n            device: Device for initial state\n\n        Returns:\n            state: Initialized state\n\n        \"\"\"\n        raise NotImplementedError(\"init_state method is not implemented\")\n\n    def score(\n        self,\n        hyp: Union[Hypothesis, NSCHypothesis],\n        cache: Dict[str, Any],\n    ) -> Union[\n        Tuple[torch.Tensor, Optional[torch.Tensor]],\n        torch.Tensor,\n        List[Optional[torch.Tensor]],\n    ]:\n        \"\"\"Forward one hypothesis.\n\n        Args:\n            hyp: Hypothesis.\n            cache: Pairs of (y, state) for each token sequence (key)\n\n        Returns:\n            y: Decoder outputs\n            new_state: New decoder state\n          
  lm_tokens: Token id for LM\n\n        \"\"\"\n        raise NotImplementedError(\"score method is not implemented\")\n\n    def batch_score(\n        self,\n        hyps: Union[List[Hypothesis], List[NSCHypothesis]],\n        batch_states: Union[\n            Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]\n        ],\n        cache: Dict[str, Any],\n    ) -> Union[\n        Tuple[torch.Tensor, Optional[torch.Tensor]],\n        torch.Tensor,\n        List[Optional[torch.Tensor]],\n    ]:\n        \"\"\"Forward batch of hypotheses.\n\n        Args:\n            hyps: Batch of hypotheses\n            batch_states: Batch of decoder states\n            cache: pairs of (y, state) for each token sequence (key)\n\n        Returns:\n            batch_y: Decoder outputs\n            batch_states: Batch of decoder states\n            lm_tokens: Batch of token ids for LM\n\n        \"\"\"\n        raise NotImplementedError(\"batch_score method is not implemented\")\n\n    def select_state(\n        self,\n        batch_states: Union[\n            Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]\n        ],\n        idx: int,\n    ) -> Union[\n        Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]\n    ]:\n        \"\"\"Get decoder state from batch for given id.\n\n        Args:\n            batch_states: Batch of decoder states\n            idx: Index to extract state from batch\n\n        Returns:\n            state_idx: Decoder state for given id\n\n        \"\"\"\n        raise NotImplementedError(\"select_state method is not implemented\")\n\n    def create_batch_states(\n        self,\n        batch_states: Union[\n            Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]\n        ],\n        l_states: List[\n            Union[\n                Tuple[torch.Tensor, Optional[torch.Tensor]],\n                List[Optional[torch.Tensor]],\n            ]\n        ],\n   
     l_tokens: List[List[int]],\n    ) -> Union[\n        Tuple[torch.Tensor, Optional[torch.Tensor]], List[Optional[torch.Tensor]]\n    ]:\n        \"\"\"Create batch of decoder states.\n\n        Args:\n            batch_states: Batch of decoder states\n            l_states: List of decoder states\n            l_tokens: List of token sequences for input batch\n\n        Returns:\n            batch_states: Batch of decoder states\n\n        \"\"\"\n        raise NotImplementedError(\"create_batch_states method is not implemented\")\n"
  },
  {
    "path": "nets/tts_interface.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"TTS Interface realted modules.\"\"\"\n\nfrom espnet.asr.asr_utils import torch_load\n\n\ntry:\n    import chainer\nexcept ImportError:\n    Reporter = None\nelse:\n\n    class Reporter(chainer.Chain):\n        \"\"\"Reporter module.\"\"\"\n\n        def report(self, dicts):\n            \"\"\"Report values from a given dict.\"\"\"\n            for d in dicts:\n                chainer.reporter.report(d, self)\n\n\nclass TTSInterface(object):\n    \"\"\"TTS Interface for ESPnet model implementation.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser):\n        \"\"\"Add model specific argments to parser.\"\"\"\n        return parser\n\n    def __init__(self):\n        \"\"\"Initilize TTS module.\"\"\"\n        self.reporter = Reporter()\n\n    def forward(self, *args, **kwargs):\n        \"\"\"Calculate TTS forward propagation.\n\n        Returns:\n            Tensor: Loss value.\n\n        \"\"\"\n        raise NotImplementedError(\"forward method is not implemented\")\n\n    def inference(self, *args, **kwargs):\n        \"\"\"Generate the sequence of features given the sequences of characters.\n\n        Returns:\n            Tensor: The sequence of generated features (L, odim).\n            Tensor: The sequence of stop probabilities (L,).\n            Tensor: The sequence of attention weights (L, T).\n\n        \"\"\"\n        raise NotImplementedError(\"inference method is not implemented\")\n\n    def calculate_all_attentions(self, *args, **kwargs):\n        \"\"\"Calculate TTS attention weights.\n\n        Args:\n            Tensor: Batch of attention weights (B, Lmax, Tmax).\n\n        \"\"\"\n        raise NotImplementedError(\"calculate_all_attentions method is not implemented\")\n\n    def load_pretrained_model(self, model_path):\n        \"\"\"Load pretrained model parameters.\"\"\"\n    
    torch_load(model_path, self)\n\n    @property\n    def attention_plot_class(self):\n        \"\"\"Plot attention weights.\"\"\"\n        from espnet.asr.asr_utils import PlotAttentionReport\n\n        return PlotAttentionReport\n\n    @property\n    def base_plot_keys(self):\n        \"\"\"Return base key names to plot during training.\n\n        The keys should match what `chainer.reporter` reports.\n        if you add the key `loss`,\n        the reporter will report `main/loss` and `validation/main/loss` values.\n        also `loss.png` will be created as a figure visulizing `main/loss`\n        and `validation/main/loss` values.\n\n        Returns:\n            list[str]:  Base keys to plot during training.\n\n        \"\"\"\n        return [\"loss\"]\n"
  },
  {
    "path": "optimizer/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "optimizer/chainer.py",
    "content": "\"\"\"Chainer optimizer builders.\"\"\"\nimport argparse\n\nimport chainer\nfrom chainer.optimizer_hooks import WeightDecay\n\nfrom espnet.optimizer.factory import OptimizerFactoryInterface\nfrom espnet.optimizer.parser import adadelta\nfrom espnet.optimizer.parser import adam\nfrom espnet.optimizer.parser import sgd\n\n\nclass AdamFactory(OptimizerFactoryInterface):\n    \"\"\"Adam factory.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        return adam(parser)\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        opt = chainer.optimizers.Adam(\n            alpha=args.lr,\n            beta1=args.beta1,\n            beta2=args.beta2,\n        )\n        opt.setup(target)\n        opt.add_hook(WeightDecay(args.weight_decay))\n        return opt\n\n\nclass SGDFactory(OptimizerFactoryInterface):\n    \"\"\"SGD factory.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        return sgd(parser)\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        opt = chainer.optimizers.SGD(\n            lr=args.lr,\n        )\n        opt.setup(target)\n        opt.add_hook(WeightDecay(args.weight_decay))\n        return opt\n\n\nclass AdadeltaFactory(OptimizerFactoryInterface):\n    \"\"\"Adadelta 
factory.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        return adadelta(parser)\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        opt = chainer.optimizers.AdaDelta(\n            rho=args.rho,\n            eps=args.eps,\n        )\n        opt.setup(target)\n        opt.add_hook(WeightDecay(args.weight_decay))\n        return opt\n\n\nOPTIMIZER_FACTORY_DICT = {\n    \"adam\": AdamFactory,\n    \"sgd\": SGDFactory,\n    \"adadelta\": AdadeltaFactory,\n}\n"
  },
  {
    "path": "optimizer/factory.py",
    "content": "\"\"\"Import optimizer class dynamically.\"\"\"\nimport argparse\n\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass OptimizerFactoryInterface:\n    \"\"\"Optimizer adaptor.\"\"\"\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        raise NotImplementedError()\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        return parser\n\n    @classmethod\n    def build(cls, target, **kwargs):\n        \"\"\"Initialize optimizer with python-level args.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n\n        Returns:\n            new Optimizer\n\n        \"\"\"\n        args = argparse.Namespace(**kwargs)\n        args = fill_missing_args(args, cls.add_arguments)\n        return cls.from_args(target, args)\n\n\ndef dynamic_import_optimizer(name: str, backend: str) -> OptimizerFactoryInterface:\n    \"\"\"Import optimizer class dynamically.\n\n    Args:\n        name (str): alias name or dynamic import syntax `module:class`\n        backend (str): backend name e.g., chainer or pytorch\n\n    Returns:\n        OptimizerFactoryInterface or FunctionalOptimizerAdaptor\n\n    \"\"\"\n    if backend == \"pytorch\":\n        from espnet.optimizer.pytorch import OPTIMIZER_FACTORY_DICT\n\n        return OPTIMIZER_FACTORY_DICT[name]\n    elif backend == \"chainer\":\n        from espnet.optimizer.chainer import OPTIMIZER_FACTORY_DICT\n\n        return OPTIMIZER_FACTORY_DICT[name]\n    else:\n        raise NotImplementedError(f\"unsupported backend: 
{backend}\")\n\n    factory_class = dynamic_import(name)\n    assert issubclass(factory_class, OptimizerFactoryInterface)\n    return factory_class\n"
  },
  {
    "path": "optimizer/parser.py",
    "content": "\"\"\"Common optimizer default config for multiple backends.\"\"\"\n\n\ndef sgd(parser):\n    \"\"\"Add arguments.\"\"\"\n    parser.add_argument(\"--lr\", type=float, default=1.0, help=\"Learning rate\")\n    parser.add_argument(\"--weight-decay\", type=float, default=0.0, help=\"Weight decay\")\n    return parser\n\n\ndef adam(parser):\n    \"\"\"Add arguments.\"\"\"\n    parser.add_argument(\"--lr\", type=float, default=1e-3, help=\"Learning rate\")\n    parser.add_argument(\"--beta1\", type=float, default=0.9, help=\"Beta1\")\n    parser.add_argument(\"--beta2\", type=float, default=0.999, help=\"Beta2\")\n    parser.add_argument(\"--weight-decay\", type=float, default=0.0, help=\"Weight decay\")\n    return parser\n\n\ndef adadelta(parser):\n    \"\"\"Add arguments.\"\"\"\n    parser.add_argument(\"--rho\", type=float, default=0.95, help=\"Rho\")\n    parser.add_argument(\"--eps\", type=float, default=1e-8, help=\"Eps\")\n    parser.add_argument(\"--weight-decay\", type=float, default=0.0, help=\"Weight decay\")\n    return parser\n"
  },
  {
    "path": "optimizer/pytorch.py",
    "content": "\"\"\"PyTorch optimizer builders.\"\"\"\nimport argparse\n\nimport torch\n\nfrom espnet.optimizer.factory import OptimizerFactoryInterface\nfrom espnet.optimizer.parser import adadelta\nfrom espnet.optimizer.parser import adam\nfrom espnet.optimizer.parser import sgd\n\n\nclass AdamFactory(OptimizerFactoryInterface):\n    \"\"\"Adam factory.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        return adam(parser)\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        return torch.optim.Adam(\n            target,\n            lr=args.lr,\n            weight_decay=args.weight_decay,\n            betas=(args.beta1, args.beta2),\n        )\n\n\nclass SGDFactory(OptimizerFactoryInterface):\n    \"\"\"SGD factory.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        return sgd(parser)\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        return torch.optim.SGD(\n            target,\n            lr=args.lr,\n            weight_decay=args.weight_decay,\n        )\n\n\nclass AdadeltaFactory(OptimizerFactoryInterface):\n    \"\"\"Adadelta factory.\"\"\"\n\n    @staticmethod\n    def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:\n        \"\"\"Register args.\"\"\"\n        
return adadelta(parser)\n\n    @staticmethod\n    def from_args(target, args: argparse.Namespace):\n        \"\"\"Initialize optimizer from argparse Namespace.\n\n        Args:\n            target: for pytorch `model.parameters()`,\n                for chainer `model`\n            args (argparse.Namespace): parsed command-line args\n\n        \"\"\"\n        return torch.optim.Adadelta(\n            target,\n            rho=args.rho,\n            eps=args.eps,\n            weight_decay=args.weight_decay,\n        )\n\n\nOPTIMIZER_FACTORY_DICT = {\n    \"adam\": AdamFactory,\n    \"sgd\": SGDFactory,\n    \"adadelta\": AdadeltaFactory,\n}\n"
  },
  {
    "path": "scheduler/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "scheduler/chainer.py",
    "content": "\"\"\"Chainer optimizer schdulers.\"\"\"\n\nfrom typing import List\n\nfrom chainer.optimizer import Optimizer\n\nfrom espnet.scheduler.scheduler import SchedulerInterface\n\n\nclass ChainerScheduler:\n    \"\"\"Chainer optimizer scheduler.\"\"\"\n\n    def __init__(self, schedulers: List[SchedulerInterface], optimizer: Optimizer):\n        \"\"\"Initialize class.\"\"\"\n        self.schedulers = schedulers\n        self.optimizer = optimizer\n        self.init_values = dict()\n        for s in self.schedulers:\n            self.init_values[s.key] = getattr(self.optimizer, s.key)\n\n    def step(self, n_iter: int):\n        \"\"\"Update optimizer by scheduling.\"\"\"\n        for s in self.schedulers:\n            new_val = self.init_values[s.key] * s.scale(n_iter)\n            setattr(self.optimizer, s.key, new_val)\n"
  },
  {
    "path": "scheduler/pytorch.py",
    "content": "\"\"\"PyTorch optimizer schdulers.\"\"\"\n\nfrom typing import List\n\nfrom torch.optim import Optimizer\n\nfrom espnet.scheduler.scheduler import SchedulerInterface\n\n\nclass PyTorchScheduler:\n    \"\"\"PyTorch optimizer scheduler.\"\"\"\n\n    def __init__(self, schedulers: List[SchedulerInterface], optimizer: Optimizer):\n        \"\"\"Initialize class.\"\"\"\n        self.schedulers = schedulers\n        self.optimizer = optimizer\n        for s in self.schedulers:\n            for group in optimizer.param_groups:\n                group.setdefault(\"initial_\" + s.key, group[s.key])\n\n    def step(self, n_iter: int):\n        \"\"\"Update optimizer by scheduling.\"\"\"\n        for s in self.schedulers:\n            for group in self.optimizer.param_groups:\n                group[s.key] = group[\"initial_\" + s.key] * s.scale(n_iter)\n"
  },
  {
    "path": "scheduler/scheduler.py",
    "content": "\"\"\"Schedulers.\"\"\"\n\nimport argparse\n\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.fill_missing_args import fill_missing_args\n\n\nclass _PrefixParser:\n    def __init__(self, parser, prefix):\n        self.parser = parser\n        self.prefix = prefix\n\n    def add_argument(self, name, **kwargs):\n        assert name.startswith(\"--\")\n        self.parser.add_argument(self.prefix + name[2:], **kwargs)\n\n\nclass SchedulerInterface:\n    \"\"\"Scheduler interface.\"\"\"\n\n    alias = \"\"\n\n    def __init__(self, key: str, args: argparse.Namespace):\n        \"\"\"Initialize class.\"\"\"\n        self.key = key\n        prefix = key + \"_\" + self.alias + \"_\"\n        for k, v in vars(args).items():\n            if k.startswith(prefix):\n                setattr(self, k[len(prefix) :], v)\n\n    def get_arg(self, name):\n        \"\"\"Get argument without prefix.\"\"\"\n        return getattr(self.args, f\"{self.key}_{self.alias}_{name}\")\n\n    @classmethod\n    def add_arguments(cls, key: str, parser: argparse.ArgumentParser):\n        \"\"\"Add arguments for CLI.\"\"\"\n        group = parser.add_argument_group(f\"{cls.alias} scheduler\")\n        cls._add_arguments(_PrefixParser(parser=group, prefix=f\"--{key}-{cls.alias}-\"))\n        return parser\n\n    @staticmethod\n    def _add_arguments(parser: _PrefixParser):\n        pass\n\n    @classmethod\n    def build(cls, key: str, **kwargs):\n        \"\"\"Initialize this class with python-level args.\n\n        Args:\n            key (str): key of hyper parameter\n\n        Returns:\n            LMinterface: A new instance of LMInterface.\n\n        \"\"\"\n\n        def add(parser):\n            return cls.add_arguments(key, parser)\n\n        kwargs = {f\"{key}_{cls.alias}_\" + k: v for k, v in kwargs.items()}\n        args = argparse.Namespace(**kwargs)\n        args = fill_missing_args(args, add)\n        return cls(key, args)\n\n    def scale(self, 
n_iter: int) -> float:\n        \"\"\"Scale at `n_iter`.\n\n        Args:\n            n_iter (int): number of current iterations.\n\n        Returns:\n            float: current scale of learning rate.\n\n        \"\"\"\n        raise NotImplementedError()\n\n\nSCHEDULER_DICT = {}\n\n\ndef register_scheduler(cls):\n    \"\"\"Register scheduler.\"\"\"\n    SCHEDULER_DICT[cls.alias] = cls.__module__ + \":\" + cls.__name__\n    return cls\n\n\ndef dynamic_import_scheduler(module):\n    \"\"\"Import Scheduler class dynamically.\n\n    Args:\n        module (str): module_name:class_name or alias in `SCHEDULER_DICT`\n\n    Returns:\n        type: Scheduler class\n\n    \"\"\"\n    model_class = dynamic_import(module, SCHEDULER_DICT)\n    assert issubclass(\n        model_class, SchedulerInterface\n    ), f\"{module} does not implement SchedulerInterface\"\n    return model_class\n\n\n@register_scheduler\nclass NoScheduler(SchedulerInterface):\n    \"\"\"Scheduler which does nothing.\"\"\"\n\n    alias = \"none\"\n\n    def scale(self, n_iter):\n        \"\"\"Scale of lr.\"\"\"\n        return 1.0\n\n\n@register_scheduler\nclass NoamScheduler(SchedulerInterface):\n    \"\"\"Warmup + InverseSqrt decay scheduler.\n\n    Args:\n        noam_warmup (int): number of warmup iterations.\n\n    \"\"\"\n\n    alias = \"noam\"\n\n    @staticmethod\n    def _add_arguments(parser: _PrefixParser):\n        \"\"\"Add scheduler args.\"\"\"\n        parser.add_argument(\n            \"--warmup\", type=int, default=1000, help=\"Number of warmup iterations.\"\n        )\n\n    def __init__(self, key, args):\n        \"\"\"Initialize class.\"\"\"\n        super().__init__(key, args)\n        self.normalize = 1 / (self.warmup * self.warmup ** -1.5)\n\n    def scale(self, step):\n        \"\"\"Scale of lr.\"\"\"\n        step += 1  # because step starts from 0\n        return self.normalize * min(step ** -0.5, step * self.warmup ** -1.5)\n\n\n@register_scheduler\nclass 
CyclicCosineScheduler(SchedulerInterface):\n    \"\"\"Cyclic cosine annealing.\n\n    Args:\n        cosine_warmup (int): number of warmup iterations.\n        cosine_total (int): number of total annealing iterations.\n\n    Notes:\n        Proposed in https://openreview.net/pdf?id=BJYwwY9ll\n        (and https://arxiv.org/pdf/1608.03983.pdf).\n        Used in the GPT2 config of Megatron-LM https://github.com/NVIDIA/Megatron-LM\n\n    \"\"\"\n\n    alias = \"cosine\"\n\n    @staticmethod\n    def _add_arguments(parser: _PrefixParser):\n        \"\"\"Add scheduler args.\"\"\"\n        parser.add_argument(\n            \"--warmup\", type=int, default=1000, help=\"Number of warmup iterations.\"\n        )\n        parser.add_argument(\n            \"--total\",\n            type=int,\n            default=100000,\n            help=\"Number of total annealing iterations.\",\n        )\n\n    def scale(self, n_iter):\n        \"\"\"Scale of lr.\"\"\"\n        import math\n\n        return 0.5 * (math.cos(math.pi * (n_iter - self.warmup) / self.total) + 1)\n"
  },
  {
    "path": "snowfall/__init__.py",
    "content": ""
  },
  {
    "path": "snowfall/common.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2019-2020 Mobvoi AI Lab, Beijing, China (author: Fangjun Kuang)\n# Apache 2.0\nimport argparse\nimport logging\nimport os\nimport re\nfrom collections import defaultdict\nfrom datetime import datetime\nfrom pathlib import Path\nfrom typing import Any, Dict, Iterable, List, Optional, TextIO, Tuple, Union\n\nimport k2\nimport kaldialign\nimport torch\nimport torch.distributed as dist\nfrom torch.cuda.amp import GradScaler\nfrom torch.nn.parallel import DistributedDataParallel\n\nfrom snowfall.models import AcousticModel\n\nPathlike = Union[str, Path]\n\n\ndef setup_logger(log_filename: Pathlike, log_level: str = 'info', use_console: bool = True) -> None:\n    now = datetime.now()\n    date_time = now.strftime('%Y-%m-%d-%H-%M-%S')\n    log_filename = '{}-{}'.format(log_filename, date_time)\n    os.makedirs(os.path.dirname(log_filename), exist_ok=True)\n\n    if dist.is_available() and dist.is_initialized():\n        world_size = dist.get_world_size()\n        rank = dist.get_rank()\n        formatter = f'%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] ({rank}/{world_size}) %(message)s'\n    else:\n        formatter = '%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s'\n\n    level = logging.ERROR\n    if log_level == 'debug':\n        level = logging.DEBUG\n    elif log_level == 'info':\n        level = logging.INFO\n    elif log_level == 'warning':\n        level = logging.WARNING\n    logging.basicConfig(filename=log_filename,\n                        format=formatter,\n                        level=level,\n                        filemode='w')\n    if use_console:\n        console = logging.StreamHandler()\n        console.setLevel(level)\n        console.setFormatter(logging.Formatter(formatter))\n        logging.getLogger('').addHandler(console)\n\n\ndef load_checkpoint(\n        filename: Pathlike,\n        model: AcousticModel,\n        optimizer: Optional[object] = None,\n        
scheduler: Optional[object] = None,\n        scaler: Optional[GradScaler] = None,\n) -> Dict[str, Any]:\n    logging.info('load checkpoint from {}'.format(filename))\n\n    checkpoint = torch.load(filename, map_location='cpu')\n\n    keys = [\n        'state_dict', 'optimizer', 'scheduler', 'epoch', 'learning_rate', 'objf', 'valid_objf',\n        'num_features', 'num_classes', 'subsampling_factor',\n        'global_batch_idx_train'\n    ]\n    missing_keys = set(keys) - set(checkpoint.keys())\n    if missing_keys:\n        raise ValueError(f\"Missing keys in checkpoint: {missing_keys}\")\n\n    if isinstance(model, DistributedDataParallel):\n        model = model.module\n\n    if not list(model.state_dict().keys())[0].startswith('module.') \\\n            and list(checkpoint['state_dict'])[0].startswith('module.'):\n        # the checkpoint was saved by DDP\n        logging.info('load checkpoint from DDP')\n        dst_state_dict = model.state_dict()\n        src_state_dict = checkpoint['state_dict']\n        for key in dst_state_dict.keys():\n            src_key = '{}.{}'.format('module', key)\n            dst_state_dict[key] = src_state_dict.pop(src_key)\n        assert len(src_state_dict) == 0\n        model.load_state_dict(dst_state_dict)\n    else:\n        model.load_state_dict(checkpoint['state_dict'])\n\n    model.num_features = checkpoint['num_features']\n    model.num_classes = checkpoint['num_classes']\n    model.subsampling_factor = checkpoint['subsampling_factor']\n\n    if optimizer is not None:\n        optimizer.load_state_dict(checkpoint['optimizer'])\n\n    if scheduler is not None:\n        scheduler.load_state_dict(checkpoint['scheduler'])\n\n    if scaler is not None:\n        scaler.load_state_dict(checkpoint['grad_scaler'])\n\n    return checkpoint\n\n\ndef average_checkpoint(filenames: List[Pathlike], model: AcousticModel) -> Dict[str, Any]:\n    logging.info('average over checkpoints {}'.format(filenames))\n\n    avg_model = None\n\n    # 
sum\n    for filename in filenames:\n        checkpoint = torch.load(filename, map_location='cpu')\n        checkpoint_model = checkpoint['state_dict']\n        if avg_model is None:\n            avg_model = checkpoint_model\n        else:\n            for k in avg_model.keys():\n                avg_model[k] += checkpoint_model[k]\n    # average\n    for k in avg_model.keys():\n        if avg_model[k] is not None:\n            if avg_model[k].is_floating_point():\n                avg_model[k] /= len(filenames)\n            else:\n                avg_model[k] //= len(filenames)\n\n    checkpoint['state_dict'] = avg_model\n\n    keys = [\n        'state_dict', 'optimizer', 'scheduler', 'epoch', 'learning_rate', 'objf', 'valid_objf',\n        'num_features', 'num_classes', 'subsampling_factor',\n        'global_batch_idx_train'\n    ]\n    missing_keys = set(keys) - set(checkpoint.keys())\n    if missing_keys:\n        raise ValueError(f\"Missing keys in checkpoint: {missing_keys}\")\n\n    if not list(model.state_dict().keys())[0].startswith('module.') \\\n            and list(checkpoint['state_dict'])[0].startswith('module.'):\n        # the checkpoint was saved by DDP\n        logging.info('load checkpoint from DDP')\n        dst_state_dict = model.state_dict()\n        src_state_dict = checkpoint['state_dict']\n        for key in dst_state_dict.keys():\n            src_key = '{}.{}'.format('module', key)\n            dst_state_dict[key] = src_state_dict.pop(src_key)\n        assert len(src_state_dict) == 0\n        model.load_state_dict(dst_state_dict)\n    else:\n        model.load_state_dict(checkpoint['state_dict'])\n\n    model.num_features = checkpoint['num_features']\n    model.num_classes = checkpoint['num_classes']\n    model.subsampling_factor = checkpoint['subsampling_factor']\n\n    return checkpoint\n\n\ndef save_checkpoint(\n        filename: Pathlike,\n        model: Union[AcousticModel, DistributedDataParallel],\n        optimizer: object,\n        
scheduler: object,\n        epoch: int,\n        learning_rate: float,\n        objf: float,\n        valid_objf: float,\n        global_batch_idx_train: int,\n        local_rank: int = 0,\n        scaler: Optional[GradScaler] = None\n) -> None:\n    if local_rank is not None and local_rank != 0:\n        return\n    if isinstance(model, DistributedDataParallel):\n        model = model.module\n    logging.info(f'Save checkpoint to {filename}: epoch={epoch}, '\n                 f'learning_rate={learning_rate}, objf={objf}, valid_objf={valid_objf}')\n    checkpoint = {\n        'state_dict': model.state_dict(),\n        'optimizer': optimizer.state_dict() if optimizer is not None else None,\n        'scheduler': scheduler.state_dict() if scheduler is not None else None,\n        'grad_scaler': scaler.state_dict() if scaler is not None else None,\n        'epoch': epoch,\n        'learning_rate': learning_rate,\n        'objf': objf,\n        'valid_objf': valid_objf,\n        'global_batch_idx_train': global_batch_idx_train,\n        'num_features': model.num_features,\n        'num_classes': model.num_classes,\n        'subsampling_factor': model.subsampling_factor,\n    }\n    torch.save(checkpoint, filename)\n\n\ndef save_training_info(\n        filename: Pathlike,\n        model_path: Pathlike,\n        current_epoch: int,\n        learning_rate: float,\n        objf: float,\n        best_objf: float,\n        valid_objf: float,\n        best_valid_objf: float,\n        best_epoch: int,\n        local_rank: int = 0\n):\n    if local_rank is not None and local_rank != 0:\n        return\n\n    with open(filename, 'w') as f:\n        f.write('model_path: {}\\n'.format(model_path))\n        f.write('epoch: {}\\n'.format(current_epoch))\n        f.write('learning rate: {}\\n'.format(learning_rate))\n        f.write('objf: {}\\n'.format(objf))\n        f.write('best objf: {}\\n'.format(best_objf))\n        f.write('valid objf: {}\\n'.format(valid_objf))\n        
f.write('best valid objf: {}\\n'.format(best_valid_objf))\n        f.write('best epoch: {}\\n'.format(best_epoch))\n\n    logging.info('write training info to {}'.format(filename))\n\n\ndef get_phone_symbols(symbol_table: k2.SymbolTable,\n                      pattern: str = r'^#\\d+$') -> List[int]:\n    '''Return a list of phone IDs containing no disambiguation symbols.\n\n    Caution:\n      0 is not a phone ID so it is excluded from the return value.\n\n    Args:\n      symbol_table:\n        A symbol table in k2.\n      pattern:\n        Symbols containing this pattern are disambiguation symbols.\n    Returns:\n      Return a list of symbol IDs excluding those from disambiguation symbols.\n    '''\n    regex = re.compile(pattern)\n    symbols = symbol_table.symbols\n    ans = []\n    for s in symbols:\n        if not regex.match(s):\n            ans.append(symbol_table[s])\n    if 0 in ans:\n        ans.remove(0)\n    ans.sort()\n    return ans\n\n\ndef cut_id_dumper(dataloader, path: Path):\n    \"\"\"\n    Debugging utility. Writes processed cut IDs to a file.\n    Expects ``return_cuts=True`` to be passed to the Dataset class.\n\n    Example::\n\n        >>> for batch in cut_id_dumper(dataloader):\n        ...     
pass\n    \"\"\"\n    if not dataloader.dataset.return_cuts:\n        # \"return_cuts=True\" was not set: pass batches through unchanged.\n        # (A bare `return dataloader` inside a generator would yield nothing.)\n        yield from dataloader\n        return\n    with path.open('w') as f:\n        for batch in dataloader:\n            for cut in batch['supervisions']['cut']:\n                print(cut.id, file=f)\n            yield batch\n\n\ndef str2bool(v):\n    if isinstance(v, bool):\n        return v\n    if v.lower() in ('yes', 'true', 't', 'y', '1'):\n        return True\n    elif v.lower() in ('no', 'false', 'f', 'n', '0'):\n        return False\n    else:\n        raise argparse.ArgumentTypeError('Boolean value expected.')\n\n\ndef describe(model: torch.nn.Module):\n    logging.info('=' * 80)\n    logging.info('Model parameters summary:')\n    logging.info('=' * 80)\n    total = 0\n    for name, param in model.named_parameters():\n        num_params = param.numel()\n        total += num_params\n        logging.info(f'* {name}: {num_params:>{80 - len(name) - 4}}')\n    logging.info('=' * 80)\n    logging.info(f'Total: {total}')\n    logging.info('=' * 80)\n\n\ndef get_texts(best_paths: k2.Fsa, indices: Optional[torch.Tensor] = None) -> List[List[int]]:\n    '''Extract the texts from the best-path FSAs, in the original order (before\n       the permutation given by `indices`).\n       Args:\n           best_paths:  a k2.Fsa with best_paths.arcs.num_axes() == 3, i.e.\n                    containing multiple FSAs, which is expected to be the result\n                    of k2.shortest_path (otherwise the returned values won't\n                    be meaningful).  Must have the 'aux_labels' attribute, as\n                  a ragged tensor.\n           indices: possibly a torch.Tensor giving the permutation that we used\n                    on the supervisions of this minibatch to put them in decreasing\n                    order of num-frames.  
We'll apply the inverse permutation.\n                    Doesn't have to be on the same device as `best_paths`\n      Return:\n          Returns a list of lists of int, containing the label sequences we\n          decoded.\n    '''\n    # remove any 0's or -1's (there should be no 0's left but may be -1's.)\n    aux_labels = k2.ragged.remove_values_leq(best_paths.aux_labels, 0)\n    aux_shape = k2.ragged.compose_ragged_shapes(best_paths.arcs.shape(),\n                                                aux_labels.shape())\n    # remove the states and arcs axes.\n    aux_shape = k2.ragged.remove_axis(aux_shape, 1)\n    aux_shape = k2.ragged.remove_axis(aux_shape, 1)\n    aux_labels = k2.RaggedInt(aux_shape, aux_labels.values())\n    assert (aux_labels.num_axes() == 2)\n    aux_labels, _ = k2.ragged.index(aux_labels,\n                                    invert_permutation(indices).to(dtype=torch.int32,\n                                                                   device=best_paths.device))\n    return k2.ragged.to_list(aux_labels)\n\n\ndef invert_permutation(indices: torch.Tensor) -> torch.Tensor:\n    ans = torch.zeros(indices.shape, device=indices.device, dtype=torch.long)\n    ans[indices] = torch.arange(0, indices.shape[0], device=indices.device)\n    return ans\n\n\ndef find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:\n    return min(v for k, v in symbols._sym2id.items() if k.startswith('#'))\n\n\ndef store_transcripts(path: Pathlike, texts: Iterable[Tuple[str, str]]):\n    with open(path, 'w') as f:\n        for ref, hyp in texts:\n            print(f'ref={ref}', file=f)\n            print(f'hyp={hyp}', file=f)\n\ndef write_error_stats(f: TextIO, test_set_name: str, results: List[Tuple[str,str]]) -> None:\n    subs: Dict[Tuple[str,str], int] = defaultdict(int)\n    ins: Dict[str, int] = defaultdict(int)\n    dels: Dict[str, int] = defaultdict(int)\n\n    # `words` stores counts per word, as follows:\n    #   corr, ref_sub, hyp_sub, ins, dels\n    
words: Dict[str, List[int]] = defaultdict(lambda: [0,0,0,0,0])\n    num_corr = 0\n    ERR = '*'\n    for ref, hyp in results:\n        ali = kaldialign.align(ref, hyp, ERR)\n        for ref_word,hyp_word in ali:\n            if ref_word == ERR:\n                ins[hyp_word] += 1\n                words[hyp_word][3] += 1\n            elif hyp_word == ERR:\n                dels[ref_word] += 1\n                words[ref_word][4] += 1\n            elif hyp_word != ref_word:\n                subs[(ref_word,hyp_word)] += 1\n                words[ref_word][1] += 1\n                words[hyp_word][2] += 1\n            else:\n                words[ref_word][0] += 1\n                num_corr += 1\n    ref_len = sum([len(r) for r,_ in results])\n    sub_errs = sum(subs.values())\n    ins_errs = sum(ins.values())\n    del_errs = sum(dels.values())\n    tot_errs = sub_errs + ins_errs + del_errs\n    tot_err_rate = '%.2f' % (100.0 * tot_errs / ref_len)\n\n    logging.info(\n        f'[{test_set_name}] %WER {tot_errs / ref_len:.2%} '\n        f'[{tot_errs} / {ref_len}, {ins_errs} ins, {del_errs} del, {sub_errs} sub ]'\n    )\n\n    print(f\"%WER = {tot_err_rate}\", file=f)\n    print(f\"Errors: {ins_errs} insertions, {del_errs} deletions, {sub_errs} substitutions, over {ref_len} reference words ({num_corr} correct)\",\n          file=f)\n    print(\"Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:\",\n          file=f)\n\n    print(\"\", file=f)\n    print(\"PER-UTT DETAILS: corr or (ref->hyp)  \", file=f)\n    for ref, hyp in results:\n        ali = kaldialign.align(ref, hyp, ERR)\n        combine_successive_errors = True\n        if combine_successive_errors:\n            ali = [ [[x],[y]] for x,y in ali ]\n            for i in range(len(ali) - 1):\n                if ali[i][0] != ali[i][1] and ali[i+1][0] != ali[i+1][1]:\n                    ali[i+1][0] = ali[i][0] + ali[i+1][0]\n                    ali[i+1][1] 
= ali[i][1] + ali[i+1][1]\n                    ali[i] = [[],[]]\n            ali = [ [list(filter(lambda a: a != ERR, x)),\n                     list(filter(lambda a: a != ERR, y))]\n                     for x,y in ali ]\n            ali = list(filter(lambda x: x != [[],[]], ali))\n            ali = [ [ERR if x == [] else ' '.join(x),\n                     ERR if y == [] else ' '.join(y)]\n                    for x,y in ali ]\n\n        print(' '.join((ref_word if ref_word == hyp_word else f'({ref_word}->{hyp_word})'\n                        for ref_word,hyp_word in ali)), file=f)\n\n\n    print(\"\", file=f)\n    print(\"SUBSTITUTIONS: count ref -> hyp\", file=f)\n\n    for count,(ref,hyp) in sorted([(v,k) for k,v in subs.items()], reverse=True):\n        print(f\"{count}   {ref} -> {hyp}\", file=f)\n\n    print(\"\", file=f)\n    print(\"DELETIONS: count ref\", file=f)\n    for count,ref in sorted([(v,k) for k,v in dels.items()], reverse=True):\n        print(f\"{count}   {ref}\", file=f)\n\n    print(\"\", file=f)\n    print(\"INSERTIONS: count hyp\", file=f)\n    for count,hyp in sorted([(v,k) for k,v in ins.items()], reverse=True):\n        print(f\"{count}   {hyp}\", file=f)\n\n    print(\"\", file=f)\n    print(\"PER-WORD STATS: word  corr tot_errs count_in_ref count_in_hyp\", file=f)\n    for _,word,counts in sorted([(sum(v[1:]),k,v) for k,v in words.items()], reverse=True):\n        (corr, ref_sub, hyp_sub, ins, dels) = counts\n        tot_errs = ref_sub + hyp_sub + ins + dels\n        ref_count = corr + ref_sub + dels\n        hyp_count = corr + hyp_sub + ins\n\n        print(f\"{word}   {corr} {tot_errs} {ref_count} {hyp_count}\", file=f)\n"
  },
  {
    "path": "snowfall/data/__init__.py",
    "content": "from .aishell import AishellAsrDataModule\nfrom .asr_datamodule import AsrDataModule\nfrom .datamodule import DataModule\nfrom .librispeech import LibriSpeechAsrDataModule"
  },
  {
    "path": "snowfall/data/aishell.py",
    "content": "import logging\nfrom functools import lru_cache\n\nfrom lhotse import CutSet, load_manifest\nfrom snowfall.data.asr_datamodule import AsrDataModule\n\n\nclass AishellAsrDataModule(AsrDataModule):\n    \"\"\"\n    Aishell ASR data module.\n    \"\"\"\n    @lru_cache()\n    def train_cuts(self) -> CutSet:\n        logging.info(\"About to get train cuts\")\n        return load_manifest(self.args.feature_dir / 'cuts_train.json.gz')\n\n    @lru_cache()\n    def valid_cuts(self) -> CutSet:\n        logging.info(\"About to get valid cuts\")\n        return load_manifest(self.args.feature_dir / 'cuts_dev.json.gz')\n\n    @lru_cache()\n    def test_cuts(self) -> CutSet:\n        logging.info(\"About to get test cuts\")\n        return load_manifest(self.args.feature_dir / 'cuts_test.json.gz')\n\n\n"
  },
  {
    "path": "snowfall/data/asr_datamodule.py",
    "content": "import argparse\nimport logging\nfrom pathlib import Path\nfrom typing import List, Union\n\nfrom torch.utils.data import DataLoader\n\nfrom lhotse import Fbank, FbankConfig, load_manifest\nfrom lhotse.dataset import BucketingSampler, CutConcatenate, CutMix, K2SpeechRecognitionDataset, SingleCutSampler, \\\n    SpecAugment\nfrom lhotse.dataset.input_strategies import OnTheFlyFeatures\nfrom snowfall.common import str2bool\nfrom snowfall.data.datamodule import DataModule\n\n\nclass AsrDataModule(DataModule):\n    \"\"\"\n    DataModule for K2 ASR experiments.\n    It assumes there is always one train and valid dataloader,\n    but there can be multiple test dataloaders (e.g. LibriSpeech test-clean and test-other).\n\n    It contains all the common data pipeline modules used in ASR experiments, e.g.:\n    - dynamic batch size,\n    - bucketing samplers,\n    - cut concatenation,\n    - augmentation,\n    - on-the-fly feature extraction\n\n    This class should be derived for specific corpora used in ASR tasks.\n    \"\"\"\n\n    @classmethod\n    def add_arguments(cls, parser: argparse.ArgumentParser):\n        super().add_arguments(parser)\n        group = parser.add_argument_group(\n            title='ASR data related options',\n            description='These options are used for the preparation of PyTorch DataLoaders '\n                        'from Lhotse CutSet\\'s -- they control the effective batch sizes, '\n                        'sampling strategies, applied data augmentations, etc.'\n        )\n        group.add_argument(\n            '--feature-dir',\n            type=Path,\n            default=Path('exp/data'),\n            help='Path to directory with train/valid/test cuts.'\n        )\n        group.add_argument(\n            '--max-duration',\n            type=int,\n            default=500.0,\n            help=\"Maximum pooled recordings duration (seconds) in a single batch.\")\n        group.add_argument(\n            
'--bucketing-sampler',\n            type=str2bool,\n            default=False,\n            help='When enabled, the batches will come from buckets of '\n                 'similar duration (saves padding frames).')\n        group.add_argument(\n            '--num-buckets',\n            type=int,\n            default=30,\n            help='The number of buckets for the BucketingSampler'\n                 '(you might want to increase it for larger datasets).')\n        group.add_argument(\n            '--concatenate-cuts',\n            type=str2bool,\n            default=True,\n            help='When enabled, utterances (cuts) will be concatenated '\n                 'to minimize the amount of padding.')\n        group.add_argument(\n            '--duration-factor',\n            type=float,\n            default=1.0,\n            help='Determines the maximum duration of a concatenated cut '\n                 'relative to the duration of the longest cut in a batch.')\n        group.add_argument(\n            '--gap',\n            type=float,\n            default=1.0,\n            help='The amount of padding (in seconds) inserted between concatenated cuts. '\n                 'This padding is filled with noise when noise augmentation is used.')\n        group.add_argument(\n            '--on-the-fly-feats',\n            type=str2bool,\n            default=False,\n            help='When enabled, use on-the-fly cut mixing and feature extraction. 
'\n                 'Will drop existing precomputed feature manifests if available.'\n        )\n\n    def train_dataloaders(self) -> DataLoader:\n        logging.info(\"About to get train cuts\")\n        cuts_train = self.train_cuts()\n\n        logging.info(\"About to get Musan cuts\")\n        cuts_musan = load_manifest(self.args.feature_dir / 'cuts_musan.json.gz')\n\n        logging.info(\"About to create train dataset\")\n        transforms = [CutMix(cuts=cuts_musan, prob=0.5, snr=(10, 20))]\n        if self.args.concatenate_cuts:\n            logging.info(f'Using cut concatenation with duration factor '\n                         f'{self.args.duration_factor} and gap {self.args.gap}.')\n            # Cut concatenation should be the first transform in the list,\n            # so that if we e.g. mix noise in, it will fill the gaps between different utterances.\n            transforms = [\n                             CutConcatenate(\n                                 duration_factor=self.args.duration_factor,\n                                 gap=self.args.gap\n                             )\n                         ] + transforms\n\n        input_transforms = [\n            SpecAugment(num_frame_masks=2, features_mask_size=27, num_feature_masks=2, frames_mask_size=100)\n        ]\n\n        train = K2SpeechRecognitionDataset(\n            cuts_train,\n            cut_transforms=transforms,\n            input_transforms=input_transforms\n        )\n\n        if self.args.on_the_fly_feats:\n            # NOTE: the PerturbSpeed transform should be added only if we remove it from data prep stage.\n            # # Add on-the-fly speed perturbation; since originally it would have increased epoch\n            # # size by 3, we will apply prob 2/3 and use 3x more epochs.\n            # # Speed perturbation probably should come first before concatenation,\n            # # but in principle the transforms order doesn't have to be strict (e.g. 
could be randomized)\n            # transforms = [PerturbSpeed(factors=[0.9, 1.1], p=2 / 3)] + transforms\n            # Drop feats to be on the safe side.\n            cuts_train = cuts_train.drop_features()\n            train = K2SpeechRecognitionDataset(\n                cuts=cuts_train,\n                cut_transforms=transforms,\n                input_strategy=OnTheFlyFeatures(Fbank(FbankConfig(num_mel_bins=80))),\n                input_transforms=input_transforms\n            )\n\n        if self.args.bucketing_sampler:\n            logging.info('Using BucketingSampler.')\n            train_sampler = BucketingSampler(\n                cuts_train,\n                max_duration=self.args.max_duration,\n                shuffle=True,\n                num_buckets=self.args.num_buckets\n            )\n        else:\n            logging.info('Using SingleCutSampler.')\n            train_sampler = SingleCutSampler(\n                cuts_train,\n                max_duration=self.args.max_duration,\n                shuffle=True,\n            )\n        logging.info(\"About to create train dataloader\")\n        train_dl = DataLoader(\n            train,\n            sampler=train_sampler,\n            batch_size=None,\n            num_workers=4,\n            persistent_workers=True,\n        )\n        return train_dl\n\n    def valid_dataloaders(self) -> DataLoader:\n        logging.info(\"About to get dev cuts\")\n        cuts_valid = self.valid_cuts()\n\n        transforms = [ ]\n        if self.args.concatenate_cuts:\n            transforms = [ CutConcatenate(\n                                 duration_factor=self.args.duration_factor,\n                                 gap=self.args.gap)\n                          ] + transforms\n\n\n        logging.info(\"About to create dev dataset\")\n        if self.args.on_the_fly_feats:\n            cuts_valid = cuts_valid.drop_features()\n            validate = K2SpeechRecognitionDataset(\n                
cuts_valid,\n                cut_transforms=transforms,\n                input_strategy=OnTheFlyFeatures(Fbank(FbankConfig(num_mel_bins=80)))\n            )\n        else:\n            validate = K2SpeechRecognitionDataset(cuts_valid,\n                                                  cut_transforms=transforms)\n        valid_sampler = SingleCutSampler(\n            cuts_valid,\n            max_duration=self.args.max_duration,\n            shuffle=True,\n        )\n        logging.info(\"About to create dev dataloader\")\n        valid_dl = DataLoader(\n            validate,\n            sampler=valid_sampler,\n            batch_size=None,\n            num_workers=2,\n            persistent_workers=True,\n        )\n        return valid_dl\n\n    def test_dataloaders(self) -> Union[DataLoader, List[DataLoader]]:\n        cuts = self.test_cuts()\n        is_list = isinstance(cuts, list)\n        test_loaders = []\n        if not is_list:\n            cuts = [cuts]\n\n        for cuts_test in cuts:\n            logging.debug(\"About to create test dataset\")\n            test = K2SpeechRecognitionDataset(\n                cuts_test,\n                input_strategy=OnTheFlyFeatures(Fbank(FbankConfig(num_mel_bins=80)))\n            )\n            sampler = SingleCutSampler(cuts_test, max_duration=self.args.max_duration)\n            logging.debug(\"About to create test dataloader\")\n            test_dl = DataLoader(test, batch_size=None, sampler=sampler, num_workers=1)\n            test_loaders.append(test_dl)\n\n        if is_list:\n            return test_loaders\n        else:\n            return test_loaders[0]\n"
  },
  {
    "path": "snowfall/data/datamodule.py",
    "content": "from torch.utils.data import DataLoader\nfrom typing import List, Union\n\nimport argparse\n\nfrom lhotse import CutSet\n\n\nclass DataModule:\n    \"\"\"\n    Contains dataset-related code. It is intended to read/construct Lhotse cuts,\n    and create Dataset/Sampler/DataLoader out of them.\n\n    There is a separate method to create each of train/valid/test DataLoader.\n    In principle, there might be multiple DataLoaders for each of train/valid/test\n    (e.g. when a corpus has multiple test sets).\n    The API of this class allows to return lists of CutSets/DataLoaders.\n    \"\"\"\n    def __init__(self, args: argparse.Namespace):\n        self.args = args\n\n    @classmethod\n    def add_arguments(cls, parser: argparse.ArgumentParser):\n        pass\n\n    def train_cuts(self) -> Union[CutSet, List[CutSet]]:\n        raise NotImplementedError()\n\n    def valid_cuts(self) -> Union[CutSet, List[CutSet]]:\n        raise NotImplementedError()\n\n    def test_cuts(self) -> Union[CutSet, List[CutSet]]:\n        raise NotImplementedError()\n\n    def train_dataloaders(self) -> Union[DataLoader, List[DataLoader]]:\n        raise NotImplementedError()\n\n    def valid_dataloaders(self) -> Union[DataLoader, List[DataLoader]]:\n        raise NotImplementedError()\n\n    def test_dataloaders(self) -> Union[DataLoader, List[DataLoader]]:\n        raise NotImplementedError()\n"
  },
  {
    "path": "snowfall/data/librispeech.py",
    "content": "import argparse\n\nfrom functools import lru_cache\n\nimport logging\nfrom typing import List\n\nfrom lhotse import CutSet, load_manifest\nfrom snowfall.common import str2bool\nfrom snowfall.data.asr_datamodule import AsrDataModule\n\n\nclass LibriSpeechAsrDataModule(AsrDataModule):\n    \"\"\"\n    LibriSpeech ASR data module. Can be used for 100h subset (``--full-libri false``) or full 960h set.\n    The train and valid cuts for standard Libri splits are concatenated into a single CutSet/DataLoader.\n    \"\"\"\n\n    @classmethod\n    def add_arguments(cls, parser: argparse.ArgumentParser):\n        super().add_arguments(parser)\n        group = parser.add_argument_group(title='LibriSpeech specific options')\n        group.add_argument(\n            '--full-libri',\n            type=str2bool,\n            default=True,\n            help='When enabled, use 960h LibriSpeech.')\n\n    @lru_cache()\n    def train_cuts(self) -> CutSet:\n        logging.info(\"About to get train cuts\")\n        cuts_train = load_manifest(self.args.feature_dir / 'cuts_train-clean-100.json.gz')\n        if self.args.full_libri:\n            cuts_train = (\n                    cuts_train +\n                    load_manifest(self.args.feature_dir / 'cuts_train-clean-360.json.gz') +\n                    load_manifest(self.args.feature_dir / 'cuts_train-other-500.json.gz')\n            )\n        return cuts_train\n\n    @lru_cache()\n    def valid_cuts(self) -> CutSet:\n        logging.info(\"About to get dev cuts\")\n        cuts_valid = (\n                load_manifest(self.args.feature_dir / 'cuts_dev-clean.json.gz') +\n                load_manifest(self.args.feature_dir / 'cuts_dev-other.json.gz')\n        )\n        return cuts_valid\n\n    @lru_cache()\n    def test_cuts(self) -> List[CutSet]:\n        test_sets = ['test-clean', 'test-other']\n        cuts = []\n        for test_set in test_sets:\n            logging.debug(\"About to get test cuts\")\n            
cuts.append(load_manifest(self.args.feature_dir / f'cuts_{test_set}.json.gz'))\n        return cuts\n"
  },
  {
    "path": "snowfall/decoding/__init__.py",
    "content": ""
  },
  {
    "path": "snowfall/decoding/graph.py",
    "content": "import logging\n\nimport k2\nimport torch\nfrom k2 import Fsa\n\n\ndef compile_HLG(\n        L: Fsa,\n        G: Fsa,\n        H: Fsa,\n        labels_disambig_id_start: int,\n        aux_labels_disambig_id_start: int\n) -> Fsa:\n    \"\"\"\n    Creates a decoding graph using a lexicon fst ``L`` and language model fsa ``G``.\n    Involves arc sorting, intersection, determinization, removal of disambiguation symbols\n    and adding epsilon self-loops.\n\n    Args:\n        L:\n            An ``Fsa`` that represents the lexicon (L), i.e. has phones as ``symbols``\n                and words as ``aux_symbols``.\n        G:\n            An ``Fsa`` that represents the language model (G), i.e. it's an acceptor\n            with words as ``symbols``.\n        H:  An ``Fsa`` that represents a specific topology used to convert the network\n            outputs to a sequence of phones.\n            Typically, it's a CTC topology fst, in which when 0 appears on the left\n            side, it represents the blank symbol; when it appears on the right side,\n            it indicates an epsilon.\n        labels_disambig_id_start:\n            An integer ID corresponding to the first disambiguation symbol in the\n            phonetic alphabet.\n        aux_labels_disambig_id_start:\n            An integer ID corresponding to the first disambiguation symbol in the\n            words vocabulary.\n    :return:\n    \"\"\"\n    L = k2.arc_sort(L)\n    G = k2.arc_sort(G)\n    logging.info(\"Intersecting L and G\")\n    LG = k2.compose(L, G)\n    logging.info(f'LG shape = {LG.shape}')\n    logging.info(\"Connecting L*G\")\n    LG = k2.connect(LG)\n    logging.info(f'LG shape = {LG.shape}')\n    logging.info(\"Determinizing L*G\")\n    LG = k2.determinize(LG)\n    logging.info(f'LG shape = {LG.shape}')\n    logging.info(\"Connecting det(L*G)\")\n    LG = k2.connect(LG)\n    logging.info(f'LG shape = {LG.shape}')\n    logging.info(\"Removing disambiguation symbols on 
L*G\")\n    LG.labels[LG.labels >= labels_disambig_id_start] = 0\n    if isinstance(LG.aux_labels, torch.Tensor):\n        LG.aux_labels[LG.aux_labels >= aux_labels_disambig_id_start] = 0\n    else:\n        LG.aux_labels.values()[LG.aux_labels.values() >= aux_labels_disambig_id_start] = 0\n    logging.info(\"Removing epsilons\")\n    LG = k2.remove_epsilon(LG)\n    logging.info(f'LG shape = {LG.shape}')\n    logging.info(\"Connecting rm-eps(det(L*G))\")\n    LG = k2.connect(LG)\n    logging.info(f'LG shape = {LG.shape}')\n    LG.aux_labels = k2.ragged.remove_values_eq(LG.aux_labels, 0)\n\n    logging.info(\"Arc sorting LG\")\n    LG = k2.arc_sort(LG)\n\n    logging.info(\"Composing ctc_topo and LG\")\n    HLG = k2.compose(H, LG, inner_labels='phones')\n\n    logging.info(\"Connecting HLG\")\n    HLG = k2.connect(HLG)\n\n    logging.info(\"Arc sorting HLG\")\n    HLG = k2.arc_sort(HLG)\n    logging.info(\n        f'HLG is arc sorted: {(HLG.properties & k2.fsa_properties.ARC_SORTED) != 0}'\n    )\n\n    # Attach a new attribute `lm_scores` so that we can recover\n    # the `am_scores` later.\n    # The scores on an arc consist of two parts:\n    #  scores = am_scores + lm_scores\n    # NOTE: we assume that both kinds of scores are in log-space.\n    HLG.lm_scores = HLG.scores.clone()\n    return HLG\n"
  },
  {
    "path": "snowfall/decoding/lm_rescore.py",
    "content": "# Copyright (c)  2021  Xiaomi Corporation (authors: Fangjun Kuang)\n\nfrom typing import Optional\n\nimport math\n\nimport k2\nimport torch\n\n\ndef _intersect_device(a_fsas: k2.Fsa, b_fsas: k2.Fsa, b_to_a_map: torch.Tensor,\n                      sorted_match_a: bool):\n    '''This is a wrapper of k2.intersect_device and its purpose is to split\n    b_fsas into several batches and process each batch separately to avoid\n    CUDA OOM error.\n\n    The arguments and return value of this function are the same as\n    k2.intersect_device.\n    '''\n    # NOTE: You can decrease batch_size in case of CUDA out of memory error.\n    batch_size = 500\n    num_fsas = b_fsas.shape[0]\n    if num_fsas <= batch_size:\n        return k2.intersect_device(a_fsas,\n                                   b_fsas,\n                                   b_to_a_map=b_to_a_map,\n                                   sorted_match_a=sorted_match_a)\n\n    num_batches = int(math.ceil(float(num_fsas) / batch_size))\n    splits = []\n    for i in range(num_batches):\n        start = i * batch_size\n        end = min(start + batch_size, num_fsas)\n        splits.append((start, end))\n\n    ans = []\n    for start, end in splits:\n        indexes = torch.arange(start, end).to(b_to_a_map)\n\n        fsas = k2.index(b_fsas, indexes)\n        b_to_a = k2.index(b_to_a_map, indexes)\n        path_lats = k2.intersect_device(a_fsas,\n                                        fsas,\n                                        b_to_a_map=b_to_a,\n                                        sorted_match_a=sorted_match_a)\n        ans.append(path_lats)\n\n    return k2.cat(ans)\n\n\ndef compute_am_scores(lats: k2.Fsa, word_fsas_with_epsilon_loops: k2.Fsa,\n                      path_to_seq_map: torch.Tensor) -> torch.Tensor:\n    '''Compute AM scores of n-best lists (represented as word_fsas).\n\n    Args:\n      lats:\n        An FsaVec, which is the output of `k2.intersect_dense_pruned`.\n        It must 
have the attribute `lm_scores`.\n      word_fsas_with_epsilon_loops:\n        An FsaVec representing an n-best list. Note that it has been processed\n        by `k2.add_epsilon_self_loops`.\n      path_to_seq_map:\n        A 1-D torch.Tensor with dtype torch.int32. path_to_seq_map[i] indicates\n        which sequence the i-th Fsa in word_fsas_with_epsilon_loops belongs to.\n        path_to_seq_map.numel() == word_fsas_with_epsilon_loops.arcs.dim0().\n    Returns:\n      Return a 1-D torch.Tensor containing the AM scores of each path.\n      `ans.numel() == word_fsas_with_epsilon_loops.shape[0]`\n    '''\n    device = lats.device\n    assert len(lats.shape) == 3\n    assert hasattr(lats, 'lm_scores')\n\n    # k2.compose() currently does not support b_to_a_map. To avoid\n    # replicating `lats`, we use k2.intersect_device here.\n    #\n    # lats has phone IDs as `labels` and word IDs as aux_labels, so we\n    # need to invert it here.\n    inverted_lats = k2.invert(lats)\n\n    # Now the `labels` of inverted_lats are word IDs (a 1-D torch.Tensor)\n    # and its `aux_labels` are phone IDs (a k2.RaggedInt with 2 axes)\n\n    # Remove its `aux_labels` since they are not needed in the\n    # following computation\n    del inverted_lats.aux_labels\n    inverted_lats = k2.arc_sort(inverted_lats)\n\n    am_path_lats = _intersect_device(inverted_lats,\n                                     word_fsas_with_epsilon_loops,\n                                     b_to_a_map=path_to_seq_map,\n                                     sorted_match_a=True)\n\n    # NOTE: `k2.connect` and `k2.top_sort` support only CPU at present\n    am_path_lats = k2.top_sort(k2.connect(am_path_lats.to('cpu'))).to(device)\n\n    # The `scores` of every arc consist of `am_scores` and `lm_scores`\n    am_path_lats.scores = am_path_lats.scores - am_path_lats.lm_scores\n\n    am_scores = am_path_lats.get_tot_scores(True, True)\n\n    return am_scores\n\n\n@torch.no_grad()\ndef rescore_with_n_best_list(lats: 
k2.Fsa, G: k2.Fsa,\n                             num_paths: int) -> k2.Fsa:\n    '''Decode using n-best list with LM rescoring.\n\n    `lats` is a decoding lattice, which has 3 axes. This function first\n    extracts `num_paths` paths from `lats` for each sequence using\n    `k2.random_paths`. The `am_scores` of these paths are computed.\n    For each path, its `lm_scores` is computed using `G` (which is an LM).\n    The final `tot_scores` is the sum of `am_scores` and `lm_scores`.\n    The path with the greatest `tot_scores` within a sequence is used\n    as the decoding output.\n\n    Args:\n      lats:\n        An FsaVec. It can be the output of `k2.intersect_dense_pruned`.\n      G:\n        An FsaVec representing the language model (LM). Note that it\n        is an FsaVec, but it contains only one Fsa.\n      num_paths:\n        It is the size `n` in `n-best` list.\n    Returns:\n      An FsaVec representing the best decoding path for each sequence\n      in the lattice.\n    '''\n    device = lats.device\n\n    assert len(lats.shape) == 3\n    assert hasattr(lats, 'aux_labels')\n    assert hasattr(lats, 'lm_scores')\n\n    assert G.shape == (1, None, None)\n    assert G.device == device\n    assert hasattr(G, 'aux_labels') is False\n\n    # First, extract `num_paths` paths for each sequence.\n    # paths is a k2.RaggedInt with axes [seq][path][arc_pos]\n    paths = k2.random_paths(lats, num_paths=num_paths, use_double_scores=True)\n\n    # word_seqs is a k2.RaggedInt sharing the same shape as `paths`\n    # but it contains word IDs. 
Note that it also contains 0s and -1s.\n    # The last entry in each sublist is -1.\n    word_seqs = k2.index(lats.aux_labels, paths)\n\n    # Remove epsilons and -1 from word_seqs\n    word_seqs = k2.ragged.remove_values_leq(word_seqs, 0)\n\n    # Remove repeated sequences to avoid redundant computation later.\n    #\n    # unique_word_seqs is still a k2.RaggedInt with 3 axes [seq][path][word]\n    # except that there are no repeated paths with the same word_seq\n    # within a seq.\n    #\n    # num_repeats is also a k2.RaggedInt with 2 axes containing the\n    # multiplicities of each path.\n    # num_repeats.num_elements() == unique_word_seqs.num_elements()\n    #\n    # Since k2.ragged.unique_sequences will reorder paths within a seq,\n    # `new2old` is a 1-D torch.Tensor mapping from the output path index\n    # to the input path index.\n    # new2old.numel() == unique_word_seqs.num_elements()\n    unique_word_seqs, num_repeats, new2old = k2.ragged.unique_sequences(\n        word_seqs, need_num_repeats=True, need_new2old_indexes=True)\n\n    seq_to_path_shape = k2.ragged.get_layer(unique_word_seqs.shape(), 0)\n\n    # path_to_seq_map is a 1-D torch.Tensor.\n    # path_to_seq_map[i] is the seq to which the i-th path\n    # belongs.\n    path_to_seq_map = seq_to_path_shape.row_ids(1)\n\n    # Remove the seq axis.\n    # Now unique_word_seqs has only two axes [path][word]\n    unique_word_seqs = k2.ragged.remove_axis(unique_word_seqs, 0)\n\n    # word_fsas is an FsaVec with axes [path][state][arc]\n    word_fsas = k2.linear_fsa(unique_word_seqs)\n\n    word_fsas_with_epsilon_loops = k2.add_epsilon_self_loops(word_fsas)\n\n    am_scores = compute_am_scores(lats, word_fsas_with_epsilon_loops,\n                                  path_to_seq_map)\n\n    # Now compute lm_scores\n    b_to_a_map = torch.zeros_like(path_to_seq_map)\n    lm_path_lats = _intersect_device(G,\n                                     word_fsas_with_epsilon_loops,\n                               
      b_to_a_map=b_to_a_map,\n                                     sorted_match_a=True)\n    lm_path_lats = k2.top_sort(k2.connect(lm_path_lats.to('cpu'))).to(device)\n    lm_scores = lm_path_lats.get_tot_scores(True, True)\n\n    tot_scores = am_scores + lm_scores\n\n    # Remember that we used `k2.ragged.unique_sequences` to remove repeated\n    # paths to avoid redundant computation in `k2.intersect_device`.\n    # Now we use `num_repeats` to correct the scores for each path.\n    #\n    # NOTE(fangjun): It is commented out as it leads to a worse WER\n    # tot_scores = tot_scores * num_repeats.values()\n\n    # TODO(fangjun): We may need to add `k2.RaggedDouble`\n    ragged_tot_scores = k2.RaggedFloat(seq_to_path_shape,\n                                       tot_scores.to(torch.float32))\n    argmax_indexes = k2.ragged.argmax_per_sublist(ragged_tot_scores)\n\n    # Use k2.index here since argmax_indexes' dtype is torch.int32\n    best_path_indexes = k2.index(new2old, argmax_indexes)\n\n    paths = k2.ragged.remove_axis(paths, 0)\n\n    # best_path is a k2.RaggedInt with 2 axes [path][arc_pos]\n    best_paths = k2.index(paths, best_path_indexes)\n\n    # labels is a k2.RaggedInt with 2 axes [path][phone_id]\n    # Note that it contains -1s.\n    labels = k2.index(lats.labels.contiguous(), best_paths)\n\n    labels = k2.ragged.remove_values_eq(labels, -1)\n\n    # lats.aux_labels is a k2.RaggedInt tensor with 2 axes, so\n    # aux_labels is also a k2.RaggedInt with 2 axes\n    aux_labels = k2.index(lats.aux_labels, best_paths.values())\n\n    best_path_fsas = k2.linear_fsa(labels)\n    best_path_fsas.aux_labels = aux_labels\n\n    return best_path_fsas\n\n\n@torch.no_grad()\ndef rescore_with_whole_lattice(lats: k2.Fsa,\n                               G_with_epsilon_loops: k2.Fsa) -> k2.Fsa:\n    '''Use whole lattice to rescore.\n\n    Args:\n      lats:\n        An FsaVec It can be the output of `k2.intersect_dense_pruned`.\n      G_with_epsilon_loops:\n        
An FsaVec representing the language model (LM). Note that it\n        is an FsaVec, but it contains only one Fsa.\n    '''\n    assert len(lats.shape) == 3\n    assert hasattr(lats, 'lm_scores')\n    assert G_with_epsilon_loops.shape == (1, None, None)\n\n    device = lats.device\n    lats.scores = lats.scores - lats.lm_scores\n    # Now, lats.scores contains only am_scores\n\n    # inverted_lats has word IDs as labels.\n    # Its aux_labels are phone IDs, stored as a ragged tensor (k2.RaggedInt)\n    inverted_lats = k2.invert(lats)\n    num_seqs = lats.shape[0]\n    inverted_lats_with_epsilon_loops = k2.add_epsilon_self_loops(inverted_lats)\n\n    b_to_a_map = torch.zeros(num_seqs, device=device, dtype=torch.int32)\n    try:\n        rescoring_lats = k2.intersect_device(G_with_epsilon_loops,\n                                             inverted_lats_with_epsilon_loops,\n                                             b_to_a_map,\n                                             sorted_match_a=True)\n    except RuntimeError as e:\n        print(f'Caught exception:\\n{e}\\n')\n        print(f'Number of FSAs: {inverted_lats.shape[0]}')\n        print('num_arcs before pruning: ',\n              inverted_lats_with_epsilon_loops.arcs.num_elements())\n\n        # NOTE(fangjun): The choice of the threshold 0.001 is arbitrary here\n        # to avoid OOM. 
We may need to fine tune it.\n        inverted_lats = k2.prune_on_arc_post(inverted_lats, 0.001, True)\n        inverted_lats_with_epsilon_loops = k2.add_epsilon_self_loops(\n            inverted_lats)\n        print('num_arcs after pruning: ',\n              inverted_lats_with_epsilon_loops.arcs.num_elements())\n\n        rescoring_lats = k2.intersect_device(G_with_epsilon_loops,\n                                             inverted_lats_with_epsilon_loops,\n                                             b_to_a_map,\n                                             sorted_match_a=True)\n\n    rescoring_lats = k2.top_sort(k2.connect(\n        rescoring_lats.to('cpu'))).to(device)\n    inverted_rescoring_lats = k2.invert(rescoring_lats)\n    # inverted rescoring_lats has phone IDs as labels\n    # and word IDs as aux_labels.\n\n    inverted_rescoring_lats = k2.remove_epsilon_self_loops(\n        inverted_rescoring_lats)\n    best_paths = k2.shortest_path(inverted_rescoring_lats,\n                                  use_double_scores=True)\n    return best_paths\n\n\n@torch.no_grad()\ndef decode_with_lm_rescoring(lats: k2.Fsa, G: k2.Fsa, num_paths: int,\n                             use_whole_lattice: bool) -> k2.Fsa:\n    '''Decode using n-best list with LM rescoring.\n\n    `lats` is a decoding lattice, which has 3 axes. This function first\n    extracts `num_paths` paths from `lats` for each sequence using\n    `k2.random_paths`. The `am_scores` of these paths are computed.\n    For each path, its `lm_scores` is computed using `G` (which is an LM).\n    The final `tot_scores` is the sum of `am_scores` and `lm_scores`.\n    The path with the greatest `tot_scores` within a sequence is used\n    as the decoding output.\n\n    Args:\n      lats:\n        An FsaVec It can be the output of `k2.intersect_dense_pruned`.\n      G:\n        An FsaVec representing the language model (LM). 
Note that it\n        is an FsaVec, but it contains only one Fsa.\n      num_paths:\n        It is the size `n` in `n-best` list.\n        Used only if use_whole_lattice is False.\n      use_whole_lattice:\n        True to use whole lattice for rescoring. False to use n-best list\n        for rescoring.\n    Returns:\n      An FsaVec representing the best decoding path for each sequence\n      in the lattice.\n    '''\n    if use_whole_lattice:\n        return rescore_with_whole_lattice(lats, G)\n    else:\n        return rescore_with_n_best_list(lats, G, num_paths)\n"
  },
  {
    "path": "snowfall/dist.py",
    "content": "import os\nimport torch\nfrom torch import distributed as dist\n\n\ndef setup_dist(rank, world_size, master_port = None):\n    os.environ['MASTER_ADDR'] = 'localhost'\n    os.environ['MASTER_PORT'] = ('12354' if master_port is None\n                                 else str(master_port))\n    dist.init_process_group(\"nccl\", rank=rank, world_size=world_size)\n    torch.cuda.set_device(rank)\n\n\ndef cleanup_dist():\n    dist.destroy_process_group()\n"
  },
  {
    "path": "snowfall/lexicon.py",
    "content": "import re\n\nfrom typing import List\n\nimport logging\n\nimport torch\nfrom pathlib import Path\n\nimport k2\n\n\nclass Lexicon:\n    def __init__(self, lang_dir: Path):\n        self.lang_dir = lang_dir\n        self.phones = k2.SymbolTable.from_file(self.lang_dir / 'phones.txt')\n        self.words = k2.SymbolTable.from_file(self.lang_dir / 'words.txt')\n\n        logging.info(\"Loading L.fst\")\n        if (self.lang_dir / 'Linv.pt').exists():\n            L_inv = k2.Fsa.from_dict(torch.load(self.lang_dir / 'Linv.pt'))\n        else:\n            with open(self.lang_dir / 'L.fst.txt') as f:\n                L = k2.Fsa.from_openfst(f.read(), acceptor=False)\n                L_inv = k2.arc_sort(L.invert_())\n                torch.save(L_inv.as_dict(), self.lang_dir / 'Linv.pt')\n        self.L_inv = L_inv\n\n    def phone_symbols(\n            self,\n            regex: str = re.compile(r'^#\\d+$')\n    ) -> List[int]:\n        '''Return a list of phone IDs containing no disambiguation symbols.\n\n        Caution:\n          0 is not a phone ID so it is excluded from the return value.\n\n        Args:\n          regex:\n            Symbols containing this pattern are disambiguation symbols.\n        Returns:\n          Return a list of symbol IDs excluding those from disambiguation symbols.\n        '''\n        symbols = self.phones.symbols\n        ans = []\n        for s in symbols:\n            if not regex.match(s):\n                ans.append(self.phones[s])\n        if 0 in ans:\n            ans.remove(0)\n        ans.sort()\n        return ans\n\n\n"
  },
  {
    "path": "snowfall/models/__init__.py",
    "content": "from .interface import AcousticModel\nfrom .tdnn import *\n"
  },
  {
    "path": "snowfall/models/conformer.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright (c)  2021  University of Chinese Academy of Sciences (author: Han Zhu)\n# Apache 2.0\n\nimport k2\nimport math\nimport torch\nfrom torch import Tensor, nn\nfrom typing import Dict, List, Optional, Tuple\nimport warnings\nimport copy\n\nfrom snowfall.common import get_texts\nfrom snowfall.models.transformer import Transformer, encoder_padding_mask\n\n\nclass Conformer(Transformer):\n    \"\"\"\n    Args:\n        num_features (int): Number of input features\n        num_classes (int): Number of output classes\n        subsampling_factor (int): subsampling factor of encoder (the convolution layers before transformers)\n        d_model (int): attention dimension\n        nhead (int): number of head\n        dim_feedforward (int): feedforward dimention\n        num_encoder_layers (int): number of encoder layers\n        num_decoder_layers (int): number of decoder layers\n        dropout (float): dropout rate\n        cnn_module_kernel (int): Kernel size of convolution module\n        normalize_before (bool): whether to use layer_norm before the first block.\n        vgg_frontend (bool): whether to use vgg frontend.\n    \"\"\"\n\n    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 4,\n                 d_model: int = 256, nhead: int = 4, dim_feedforward: int = 2048,\n                 num_encoder_layers: int = 12, num_decoder_layers: int = 6,\n                 dropout: float = 0.1, cnn_module_kernel: int = 31,\n                 normalize_before: bool = True, vgg_frontend: bool = False) -> None:\n        super(Conformer, self).__init__(num_features=num_features, num_classes=num_classes, subsampling_factor=subsampling_factor,\n                 d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,\n                 num_encoder_layers=num_encoder_layers, num_decoder_layers=num_decoder_layers,\n                 dropout=dropout, normalize_before=normalize_before, 
vgg_frontend=vgg_frontend)\n\n        self.encoder_pos = RelPositionalEncoding(d_model, dropout)\n\n        encoder_layer = ConformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, cnn_module_kernel, normalize_before)\n        self.encoder = ConformerEncoder(encoder_layer, num_encoder_layers)\n\n    def encode(self, x: Tensor, supervisions: Optional[Dict] = None) -> Tuple[Tensor, Optional[Tensor]]:\n        \"\"\"\n        Args:\n            x: Tensor of dimension (batch_size, num_features, input_length).\n            supervisions: Supervision in lhotse format, i.e., batch['supervisions']\n\n        Returns:\n            Tensor: Predictor tensor of dimension (input_length, batch_size, d_model).\n            Tensor: Mask tensor of dimension (batch_size, input_length)\n        \"\"\"\n        x = x.permute(0, 2, 1)  # (B, F, T) -> (B, T, F)\n\n        x = self.encoder_embed(x)\n        x, pos_emb = self.encoder_pos(x)\n        x = x.permute(1, 0, 2)  # (B, T, F) -> (T, B, F)\n        mask = encoder_padding_mask(x.size(0), supervisions)\n        mask = mask.to(x.device) if mask is not None else None\n        x = self.encoder(x, pos_emb, src_key_padding_mask=mask)  # (T, B, F)\n\n        return x, mask\n\n\nclass ConformerEncoderLayer(nn.Module):\n    \"\"\"\n    ConformerEncoderLayer is made up of self-attn, feedforward and convolution networks.\n    See: \"Conformer: Convolution-augmented Transformer for Speech Recognition\"\n\n    Args:\n        d_model: the number of expected features in the input (required).\n        nhead: the number of heads in the multiheadattention models (required).\n        dim_feedforward: the dimension of the feedforward network model (default=2048).\n        dropout: the dropout value (default=0.1).\n        cnn_module_kernel (int): Kernel size of convolution module.\n        normalize_before: whether to use layer_norm before the first block.\n\n    Examples::\n        >>> encoder_layer = ConformerEncoderLayer(d_model=512, nhead=8)\n   
     >>> src = torch.rand(10, 32, 512)\n        >>> pos_emb = torch.rand(32, 19, 512)\n        >>> out = encoder_layer(src, pos_emb)\n    \"\"\"\n\n    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,\n                 cnn_module_kernel: int = 31, normalize_before: bool = True) -> None:\n        super(ConformerEncoderLayer, self).__init__()\n        self.self_attn = RelPositionMultiheadAttention(d_model, nhead, dropout=0.0)\n\n        self.feed_forward = nn.Sequential(\n            nn.Linear(d_model, dim_feedforward),\n            Swish(),\n            nn.Dropout(dropout),\n            nn.Linear(dim_feedforward, d_model)\n        )\n\n        self.feed_forward_macaron = nn.Sequential(\n            nn.Linear(d_model, dim_feedforward),\n            Swish(),\n            nn.Dropout(dropout),\n            nn.Linear(dim_feedforward, d_model)\n        )\n\n        self.conv_module = ConvolutionModule(d_model, cnn_module_kernel)\n\n        self.norm_ff_macaron = nn.LayerNorm(d_model)  # for the macaron style FNN module\n        self.norm_ff = nn.LayerNorm(d_model)  # for the FNN module\n        self.norm_mha = nn.LayerNorm(d_model)  # for the MHA module\n\n        self.ff_scale = 0.5\n\n        self.norm_conv = nn.LayerNorm(d_model)  # for the CNN module\n        self.norm_final = nn.LayerNorm(d_model)  # for the final output of the block\n\n        self.dropout = nn.Dropout(dropout)\n\n        self.normalize_before = normalize_before\n\n    def forward(self, src: Tensor, pos_emb: Tensor, src_mask: Optional[Tensor] = None,\n                src_key_padding_mask: Optional[Tensor] = None) -> Tensor:\n        \"\"\"\n        Pass the input through the encoder layer.\n\n        Args:\n            src: the sequence to the encoder layer (required).\n            pos_emb: Positional embedding tensor (required).\n            src_mask: the mask for the src sequence (optional).\n            src_key_padding_mask: the mask for the src keys 
per batch (optional).\n\n        Shape:\n            src: (S, N, E).\n            pos_emb: (N, 2*S-1, E)\n            src_mask: (S, S).\n            src_key_padding_mask: (N, S).\n            S is the source sequence length, N is the batch size, E is the feature number\n        \"\"\"\n\n        # macaron style feed forward module\n        residual = src\n        if self.normalize_before:\n            src = self.norm_ff_macaron(src)\n        src = residual + self.ff_scale * self.dropout(self.feed_forward_macaron(src))\n        if not self.normalize_before:\n            src = self.norm_ff_macaron(src)\n\n        # multi-headed self-attention module\n        residual = src\n        if self.normalize_before:\n            src = self.norm_mha(src)\n        src_att = self.self_attn(src, src, src, pos_emb=pos_emb, attn_mask=src_mask,\n                              key_padding_mask=src_key_padding_mask)[0]\n        src = residual + self.dropout(src_att)\n        if not self.normalize_before:\n            src = self.norm_mha(src)\n\n        # convolution module\n        residual = src\n        if self.normalize_before:\n            src = self.norm_conv(src)\n        src = residual + self.dropout(self.conv_module(src))\n        if not self.normalize_before:\n            src = self.norm_conv(src)\n\n        # feed forward module\n        residual = src\n        if self.normalize_before:\n            src = self.norm_ff(src)\n        src = residual + self.ff_scale * self.dropout(self.feed_forward(src))\n        if not self.normalize_before:\n            src = self.norm_ff(src)\n        \n        if self.normalize_before:\n            src = self.norm_final(src)\n\n        return src\n\n\nclass ConformerEncoder(nn.TransformerEncoder):\n    r\"\"\"ConformerEncoder is a stack of N encoder layers\n\n    Args:\n        encoder_layer: an instance of the ConformerEncoderLayer() class (required).\n        num_layers: the number of sub-encoder-layers in the encoder (required).\n        
norm: the layer normalization component (optional).\n\n    Examples::\n        >>> encoder_layer = ConformerEncoderLayer(d_model=512, nhead=8)\n        >>> conformer_encoder = ConformerEncoder(encoder_layer, num_layers=6)\n        >>> src = torch.rand(10, 32, 512)\n        >>> pos_emb = torch.rand(32, 19, 512)\n        >>> out = conformer_encoder(src, pos_emb)\n    \"\"\"\n\n    def __init__(self, encoder_layer: nn.Module, num_layers: int, norm: nn.Module = None) -> None:\n        super(ConformerEncoder, self).__init__(encoder_layer=encoder_layer, num_layers=num_layers, norm=norm)\n\n    def forward(self, src: Tensor, pos_emb: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:\n        r\"\"\"Pass the input through the encoder layers in turn.\n\n        Args:\n            src: the sequence to the encoder (required).\n            pos_emb: Positional embedding tensor (required).\n            mask: the mask for the src sequence (optional).\n            src_key_padding_mask: the mask for the src keys per batch (optional).\n\n        Shape:\n            src: (S, N, E).\n            pos_emb: (N, 2*S-1, E)\n            mask: (S, S).\n            src_key_padding_mask: (N, S).\n            S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number\n\n        \"\"\"\n        output = src\n\n        for mod in self.layers:\n            output = mod(output, pos_emb, src_mask=mask, src_key_padding_mask=src_key_padding_mask)\n\n        if self.norm is not None:\n            output = self.norm(output)\n\n        return output\n\n\nclass RelPositionalEncoding(torch.nn.Module):\n    \"\"\"Relative positional encoding module.\n\n    See : Appendix B in \"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context\"\n    Modified from https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/transformer/embedding.py\n\n    Args:\n        d_model: Embedding 
dimension.\n        dropout_rate: Dropout rate.\n        max_len: Maximum input length.\n\n    \"\"\"\n\n    def __init__(self, d_model: int, dropout_rate: float, max_len: int = 5000) -> None:\n        \"\"\"Construct a RelPositionalEncoding object.\"\"\"\n        super(RelPositionalEncoding, self).__init__()\n        self.d_model = d_model\n        self.xscale = math.sqrt(self.d_model)\n        self.dropout = torch.nn.Dropout(p=dropout_rate)\n        self.pe = None\n        self.extend_pe(torch.tensor(0.0).expand(1, max_len))\n\n    def extend_pe(self, x: Tensor) -> None:\n        \"\"\"Reset the positional encodings.\"\"\"\n        if self.pe is not None:\n            # self.pe contains both positive and negative parts\n            # the length of self.pe is 2 * input_len - 1\n            if self.pe.size(1) >= x.size(1) * 2 - 1:\n                if self.pe.dtype != x.dtype or self.pe.device != x.device:\n                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)\n                return\n        # Suppose `i` is the position of the query vector and `j` is the\n        # position of the key vector. We use positive relative positions when keys\n        # are to the left (i>j) and negative relative positions otherwise (i<j).\n        pe_positive = torch.zeros(x.size(1), self.d_model)\n        pe_negative = torch.zeros(x.size(1), self.d_model)\n        position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)\n        div_term = torch.exp(\n            torch.arange(0, self.d_model, 2, dtype=torch.float32)\n            * -(math.log(10000.0) / self.d_model)\n        )\n        pe_positive[:, 0::2] = torch.sin(position * div_term)\n        pe_positive[:, 1::2] = torch.cos(position * div_term)\n        pe_negative[:, 0::2] = torch.sin(-1 * position * div_term)\n        pe_negative[:, 1::2] = torch.cos(-1 * position * div_term)\n\n        # Reverse the order of positive indices and concat both positive and\n        # negative indices. 
This is used to support the shifting trick\n        # as in \"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context\"\n        pe_positive = torch.flip(pe_positive, [0]).unsqueeze(0)\n        pe_negative = pe_negative[1:].unsqueeze(0)\n        pe = torch.cat([pe_positive, pe_negative], dim=1)\n        self.pe = pe.to(device=x.device, dtype=x.dtype)\n\n    def forward(self, x: torch.Tensor) -> Tuple[Tensor, Tensor]:\n        \"\"\"Add positional encoding.\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, `*`).\n\n        Returns:\n            torch.Tensor: Scaled input tensor (batch, time, `*`).\n            torch.Tensor: Relative positional embedding (1, 2*time-1, `*`).\n\n        \"\"\"\n        self.extend_pe(x)\n        x = x * self.xscale\n        pos_emb = self.pe[\n            :,\n            self.pe.size(1) // 2 - x.size(1) + 1 : self.pe.size(1) // 2 + x.size(1),\n        ]\n        return self.dropout(x), self.dropout(pos_emb)\n\n\nclass RelPositionMultiheadAttention(nn.Module):\n    r\"\"\"Multi-Head Attention layer with relative position encoding\n\n    See reference: \"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context\"\n\n    Args:\n        embed_dim: total dimension of the model.\n        num_heads: parallel attention heads.\n        dropout: dropout probability applied to attn_output_weights. Default: 0.0.\n\n    Examples::\n\n        >>> rel_pos_multihead_attn = RelPositionMultiheadAttention(embed_dim, num_heads)\n        >>> attn_output, attn_output_weights = rel_pos_multihead_attn(query, key, value, pos_emb)\n    \"\"\"\n\n    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.) 
-> None:\n        super(RelPositionMultiheadAttention, self).__init__()\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.dropout = dropout\n        self.head_dim = embed_dim // num_heads\n        assert self.head_dim * num_heads == self.embed_dim, \"embed_dim must be divisible by num_heads\"\n\n        self.in_proj = nn.Linear(embed_dim, 3 * embed_dim, bias=True)\n        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=True)\n\n        # linear transformation for positional encoding.\n        self.linear_pos = nn.Linear(embed_dim, embed_dim, bias=False)\n        # these two learnable bias are used in matrix c and matrix d\n        # as described in \"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context\" Section 3.3\n        self.pos_bias_u = nn.Parameter(torch.Tensor(num_heads, self.head_dim))\n        self.pos_bias_v = nn.Parameter(torch.Tensor(num_heads, self.head_dim))\n\n        self._reset_parameters()\n\n    def _reset_parameters(self) -> None:\n        nn.init.xavier_uniform_(self.in_proj.weight)\n        nn.init.constant_(self.in_proj.bias, 0.)\n        nn.init.constant_(self.out_proj.bias, 0.)\n\n        nn.init.xavier_uniform_(self.pos_bias_u)\n        nn.init.xavier_uniform_(self.pos_bias_v)\n\n    def forward(self, query: Tensor, key: Tensor, value: Tensor, pos_emb: Tensor, key_padding_mask: Optional[Tensor] = None,\n                need_weights: bool = True, attn_mask: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:\n        r\"\"\"\n        Args:\n            query, key, value: map a query and a set of key-value pairs to an output.\n            pos_emb: Positional embedding tensor\n            key_padding_mask: if provided, specified padding elements in the key will\n                be ignored by the attention. When given a binary mask and a value is True,\n                the corresponding value on the attention layer will be ignored. 
When given\n                a byte mask and a value is non-zero, the corresponding value on the attention\n                layer will be ignored\n            need_weights: output attn_output_weights.\n            attn_mask: 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all\n                the batches while a 3D mask allows to specify a different mask for the entries of each batch.\n\n        Shape:\n            - Inputs:\n            - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is\n            the embedding dimension.\n            - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is\n            the embedding dimension.\n            - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is\n            the embedding dimension.\n            - pos_emb: :math:`(N, 2*L-1, E)` where L is the target sequence length, N is the batch size, E is\n            the embedding dimension.\n            - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.\n            If a ByteTensor is provided, the non-zero positions will be ignored while the position\n            with the zero positions will be unchanged. If a BoolTensor is provided, the positions with the\n            value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.\n            - attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.\n            3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,\n            S is the source sequence length. attn_mask ensure that position i is allowed to attend the unmasked\n            positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend\n            while the zero positions will be unchanged. 
If a BoolTensor is provided, positions with ``True``\n            is not allowed to attend while ``False`` values will be unchanged. If a FloatTensor\n            is provided, it will be added to the attention weight.\n\n            - Outputs:\n            - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,\n            E is the embedding dimension.\n            - attn_output_weights: :math:`(N, L, S)` where N is the batch size,\n            L is the target sequence length, S is the source sequence length.\n        \"\"\"\n        return self.multi_head_attention_forward(\n                query, key, value, pos_emb, self.embed_dim, self.num_heads,\n                self.in_proj.weight, self.in_proj.bias,\n                self.dropout, self.out_proj.weight, self.out_proj.bias,\n                training=self.training,\n                key_padding_mask=key_padding_mask, need_weights=need_weights,\n                attn_mask=attn_mask)\n\n    def rel_shift(self, x: Tensor) -> Tensor:\n        \"\"\"Compute relative positional encoding.\n\n        Args:\n            x: Input tensor (batch, head, time1, 2*time1-1).\n                time1 means the length of query vector.\n\n        Returns:\n            Tensor: tensor of shape (batch, head, time1, time2)\n          (note: time2 has the same value as time1, but it is for\n          the key, while time1 is for the query).\n        \"\"\"\n        (batch_size, num_heads, time1, n) = x.shape\n        assert n == 2*time1 - 1\n        (batch_stride, head_stride, time1_stride, n_stride) = x.stride()\n        return x.as_strided((batch_size, num_heads, time1, time1),\n                            (batch_stride, head_stride, time1_stride - n_stride, n_stride),\n                            storage_offset=n_stride*(time1 - 1))\n\n    def multi_head_attention_forward(self, query: Tensor,\n                                    key: Tensor,\n                                    value: Tensor,\n       
                             pos_emb: Tensor,\n                                    embed_dim_to_check: int,\n                                    num_heads: int,\n                                    in_proj_weight: Tensor,\n                                    in_proj_bias: Tensor,\n                                    dropout_p: float,\n                                    out_proj_weight: Tensor,\n                                    out_proj_bias: Tensor,\n                                    training: bool = True,\n                                    key_padding_mask: Optional[Tensor] = None,\n                                    need_weights: bool = True,\n                                    attn_mask: Optional[Tensor] = None,\n                                    ) -> Tuple[Tensor, Optional[Tensor]]:\n        r\"\"\"\n        Args:\n            query, key, value: map a query and a set of key-value pairs to an output.\n            pos_emb: Positional embedding tensor\n            embed_dim_to_check: total dimension of the model.\n            num_heads: parallel attention heads.\n            in_proj_weight, in_proj_bias: input projection weight and bias.\n            dropout_p: probability of an element to be zeroed.\n            out_proj_weight, out_proj_bias: the output projection weight and bias.\n            training: apply dropout if is ``True``.\n            key_padding_mask: if provided, specified padding elements in the key will\n                be ignored by the attention. This is an binary mask. When the value is True,\n                the corresponding value on the attention layer will be filled with -inf.\n            need_weights: output attn_output_weights.\n            attn_mask: 2D or 3D mask that prevents attention to certain positions. 
A 2D mask will be broadcasted for all\n                the batches while a 3D mask allows to specify a different mask for the entries of each batch.\n\n        Shape:\n            Inputs:\n            - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is\n            the embedding dimension.\n            - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is\n            the embedding dimension.\n            - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is\n            the embedding dimension.\n            - pos_emb: :math:`(N, 2*L-1, E)` or :math:`(1, 2*L-1, E)` where L is the target sequence\n            length, N is the batch size, E is the embedding dimension.\n            - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.\n            If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions\n            will be unchanged. If a BoolTensor is provided, the positions with the\n            value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.\n            - attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.\n            3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,\n            S is the source sequence length. attn_mask ensures that position i is allowed to attend the unmasked\n            positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend\n            while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``\n            are not allowed to attend while ``False`` values will be unchanged. 
If a FloatTensor\n            is provided, it will be added to the attention weight.\n\n            Outputs:\n            - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,\n            E is the embedding dimension.\n            - attn_output_weights: :math:`(N, L, S)` where N is the batch size,\n            L is the target sequence length, S is the source sequence length.\n        \"\"\"\n\n        tgt_len, bsz, embed_dim = query.size()\n        assert embed_dim == embed_dim_to_check\n        assert key.size(0) == value.size(0) and key.size(1) == value.size(1)\n\n        head_dim = embed_dim // num_heads\n        assert head_dim * num_heads == embed_dim, \"embed_dim must be divisible by num_heads\"\n        scaling = float(head_dim) ** -0.5\n\n        if torch.equal(query, key) and torch.equal(key, value):\n            # self-attention\n            q, k, v = nn.functional.linear(query, in_proj_weight, in_proj_bias).chunk(3, dim=-1)\n\n        elif torch.equal(key, value):\n            # encoder-decoder attention\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = 0\n            _end = embed_dim\n            _w = in_proj_weight[_start:_end, :]\n            if _b is not None:\n                _b = _b[_start:_end]\n            q = nn.functional.linear(query, _w, _b)\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = embed_dim\n            _end = None\n            _w = in_proj_weight[_start:, :]\n            if _b is not None:\n                _b = _b[_start:]\n            k, v = nn.functional.linear(key, _w, _b).chunk(2, dim=-1)\n\n        else:\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = 0\n            _end = embed_dim\n            _w = in_proj_weight[_start:_end, 
:]\n            if _b is not None:\n                _b = _b[_start:_end]\n            q = nn.functional.linear(query, _w, _b)\n\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = embed_dim\n            _end = embed_dim * 2\n            _w = in_proj_weight[_start:_end, :]\n            if _b is not None:\n                _b = _b[_start:_end]\n            k = nn.functional.linear(key, _w, _b)\n\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = embed_dim * 2\n            _end = None\n            _w = in_proj_weight[_start:, :]\n            if _b is not None:\n                _b = _b[_start:]\n            v = nn.functional.linear(value, _w, _b)\n\n        q = q * scaling\n\n        if attn_mask is not None:\n            assert attn_mask.dtype == torch.float32 or attn_mask.dtype == torch.float64 or \\\n                attn_mask.dtype == torch.float16 or attn_mask.dtype == torch.uint8 or attn_mask.dtype == torch.bool, \\\n                'Only float, byte, and bool types are supported for attn_mask, not {}'.format(attn_mask.dtype)\n            if attn_mask.dtype == torch.uint8:\n                warnings.warn(\"Byte tensor for attn_mask is deprecated. 
Use bool tensor instead.\")\n                attn_mask = attn_mask.to(torch.bool)\n\n            if attn_mask.dim() == 2:\n                attn_mask = attn_mask.unsqueeze(0)\n                if list(attn_mask.size()) != [1, query.size(0), key.size(0)]:\n                    raise RuntimeError('The size of the 2D attn_mask is not correct.')\n            elif attn_mask.dim() == 3:\n                if list(attn_mask.size()) != [bsz * num_heads, query.size(0), key.size(0)]:\n                    raise RuntimeError('The size of the 3D attn_mask is not correct.')\n            else:\n                raise RuntimeError(\"attn_mask's dimension {} is not supported\".format(attn_mask.dim()))\n            # attn_mask's dim is 3 now.\n\n        # convert ByteTensor key_padding_mask to bool\n        if key_padding_mask is not None and key_padding_mask.dtype == torch.uint8:\n            warnings.warn(\"Byte tensor for key_padding_mask is deprecated. Use bool tensor instead.\")\n            key_padding_mask = key_padding_mask.to(torch.bool)\n\n        q = q.contiguous().view(tgt_len, bsz, num_heads, head_dim)\n        k = k.contiguous().view(-1, bsz, num_heads, head_dim)\n        v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)\n\n        src_len = k.size(0)\n\n        if key_padding_mask is not None:\n            assert key_padding_mask.size(0) == bsz, \"{} == {}\".format(key_padding_mask.size(0), bsz)\n            assert key_padding_mask.size(1) == src_len, \"{} == {}\".format(key_padding_mask.size(1), src_len)\n\n\n        q = q.transpose(0, 1)  # (batch, time1, head, d_k)\n\n        pos_emb_bsz = pos_emb.size(0)\n        assert pos_emb_bsz in (1, bsz)  # actually it is 1\n        p = self.linear_pos(pos_emb).view(pos_emb_bsz, -1, num_heads, head_dim)\n        p = p.transpose(1, 2)  # (batch, head, 2*time1-1, d_k)\n\n        q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2) # (batch, head, time1, d_k)\n\n        q_with_bias_v = (q + 
self.pos_bias_v).transpose(1, 2) # (batch, head, time1, d_k)\n\n        # compute attention score\n        # first compute matrix a and matrix c\n        # as described in \"Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context\" Section 3.3\n        k = k.permute(1, 2, 3, 0) # (batch, head, d_k, time2)\n        matrix_ac = torch.matmul(q_with_bias_u, k) # (batch, head, time1, time2)\n\n        # compute matrix b and matrix d\n        matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1)) # (batch, head, time1, 2*time1-1)\n        matrix_bd = self.rel_shift(matrix_bd)\n\n        attn_output_weights = (matrix_ac + matrix_bd)  # (batch, head, time1, time2)\n\n        attn_output_weights = attn_output_weights.view(bsz * num_heads, tgt_len, -1)\n\n        assert list(attn_output_weights.size()) == [bsz * num_heads, tgt_len, src_len]\n\n        if attn_mask is not None:\n            if attn_mask.dtype == torch.bool:\n                attn_output_weights.masked_fill_(attn_mask, float('-inf'))\n            else:\n                attn_output_weights += attn_mask\n\n        if key_padding_mask is not None:\n            attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)\n            attn_output_weights = attn_output_weights.masked_fill(\n                key_padding_mask.unsqueeze(1).unsqueeze(2),\n                float('-inf'),\n            )\n            attn_output_weights = attn_output_weights.view(bsz * num_heads, tgt_len, src_len)\n\n        attn_output_weights = nn.functional.softmax(\n            attn_output_weights, dim=-1)\n        attn_output_weights = nn.functional.dropout(attn_output_weights, p=dropout_p, training=training)\n\n        attn_output = torch.bmm(attn_output_weights, v)\n        assert list(attn_output.size()) == [bsz * num_heads, tgt_len, head_dim]\n        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)\n        attn_output = nn.functional.linear(attn_output, 
out_proj_weight, out_proj_bias)\n\n        if need_weights:\n            # average attention weights over heads\n            attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)\n            return attn_output, attn_output_weights.sum(dim=1) / num_heads\n        else:\n            return attn_output, None\n\n\nclass ConvolutionModule(nn.Module):\n    \"\"\"ConvolutionModule in Conformer model.\n    Modified from https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/conformer/convolution.py\n\n    Args:\n        channels (int): The number of channels of conv layers.\n        kernel_size (int): Kernel size of conv layers.\n        bias (bool): Whether to use bias in conv layers (default=True).\n\n    \"\"\"\n\n    def __init__(self, channels: int, kernel_size: int, bias: bool = True) -> None:\n        \"\"\"Construct a ConvolutionModule object.\"\"\"\n        super(ConvolutionModule, self).__init__()\n        # kernel_size should be an odd number for 'SAME' padding\n        assert (kernel_size - 1) % 2 == 0\n\n        self.pointwise_conv1 = nn.Conv1d(\n            channels,\n            2 * channels,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n            bias=bias,\n        )\n        self.depthwise_conv = nn.Conv1d(\n            channels,\n            channels,\n            kernel_size,\n            stride=1,\n            padding=(kernel_size - 1) // 2,\n            groups=channels,\n            bias=bias,\n        )\n        self.norm = nn.BatchNorm1d(channels)\n        self.pointwise_conv2 = nn.Conv1d(\n            channels,\n            channels,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n            bias=bias,\n        )\n        self.activation = Swish()\n\n    def forward(self, x: Tensor) -> Tensor:\n        \"\"\"Compute convolution module.\n\n        Args:\n            x: Input tensor (#time, batch, channels).\n\n        Returns:\n            
Tensor: Output tensor (#time, batch, channels).\n\n        \"\"\"\n        # move the temporal dimension last: (time, batch, channels) -> (batch, channels, time)\n        x = x.permute(1, 2, 0) # (batch, channels, time)\n\n        # GLU mechanism\n        x = self.pointwise_conv1(x)  # (batch, 2*channels, time)\n        x = nn.functional.glu(x, dim=1)  # (batch, channels, time)\n\n        # 1D Depthwise Conv\n        x = self.depthwise_conv(x)\n        x = self.activation(self.norm(x))\n\n        x = self.pointwise_conv2(x) # (batch, channels, time)\n\n        # restore the input layout: (time, batch, channels)\n        return x.permute(2, 0, 1)\n\n\nclass Swish(torch.nn.Module):\n    \"\"\"Swish activation module: x * sigmoid(x).\"\"\"\n\n    def forward(self, x: Tensor) -> Tensor:\n        \"\"\"Return the Swish activation of x.\"\"\"\n        return x * torch.sigmoid(x)\n"
  },
  {
    "path": "snowfall/models/contextnet.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright (c)  2021  University of Chinese Academy of Sciences (author: Han Zhu)\n# Apache 2.0\n\n\nimport torch\nfrom torch import Tensor, nn\nfrom typing import List\nfrom snowfall.models import AcousticModel\nfrom snowfall.models.conformer import Swish\n\n\nclass ContextNet(AcousticModel):\n    \"\"\"ContextNet. Reference: https://arxiv.org/pdf/2005.03191.pdf\n\n    Args:\n        num_features (int): Number of input features\n        num_classes (int): Number of output classes\n        kernel_size (int): Kernel size of convolution layers (default 3).\n        num_blocks (int): Number of context block (default 6).\n        num_layers (int): Number of depthwise convolution layers for each \n                context block (except first and last block) (default 5).\n        conv_out_channels (List[int]): Number of output channels produced by context blocks, \n                len(conv_out_channels) = num_blocks (default [*[256] * 2, *[512] * 3, 640]).\n        subsampling_layers (List[int]): Indexs of subsampling layers (default [1, 3]).\n        alpha (float): The factor to scale the output channel of the network (default 1.5).\n        dropout (float): Dropout (default 0.1).\n    \"\"\"\n\n    def __init__(\n        self,\n        num_features: int,\n        num_classes: int,\n        kernel_size: int = 3,\n        num_blocks: int = 6,\n        num_layers: int = 5,\n        conv_out_channels: List[int] = [*[256] * 2, *[512] * 3, 640],\n        subsampling_layers: List[int] = [1, 3],\n        alpha: float = 1.5,\n        dropout: int = 0.1,\n    ):\n        super().__init__()\n\n        self.num_features = num_features\n        self.num_classes = num_classes\n        self.subsampling_factor = 2 * len(subsampling_layers)\n\n        conv_channels = [num_features] +  \\\n                [int(channels * alpha) for channels in conv_out_channels]\n\n        strides = [1] * num_blocks\n        for layer in subsampling_layers:\n    
        strides[layer] = 2\n\n        residuals = [False, *[True] * (num_blocks - 2), False]\n\n        blocks_num_layers = [1, *[num_layers] * (num_blocks - 2), 1]\n\n        self.block_list = [\n            ContextNetBlock(\n                conv_channels[i],\n                conv_channels[i+1],\n                kernel_size=kernel_size,\n                stride=strides[i],\n                num_layers=blocks_num_layers[i],\n                dropout=dropout,\n                residual=residuals[i]\n            ) for i in range(num_blocks)]\n\n        self.blocks = nn.Sequential(*self.block_list)\n\n        self.output_layer = nn.Linear(conv_channels[-1], num_classes)\n\n    def forward(self, x, supervision = None):\n        \"\"\"\n        Args:\n            x (torch.Tensor): Input tensor (batch, channels, time).\n            supervision: Supervision in lhotse format, taken from batch['supervisions'].\n                        It is not used here; it is kept for interface consistency with the transformer.\n\n        Returns:\n            Tuple of (log-probabilities of shape (batch, num_classes, new_time), None, None).\n        \"\"\"\n        x = x.transpose(1, -1)\n        x = self.blocks(x)\n        x = self.output_layer(x)\n        x = nn.functional.log_softmax(x, dim=-1).transpose(1, -1)\n        return x, None, None\n\n\nclass ContextNetBlock(torch.nn.Module):\n    \"\"\"A block in ContextNet.\n\n    Args:\n        in_channels (int): Number of input channels of this block.\n        out_channels (int): Number of output channels of this block.\n        kernel_size (int) : Kernel size of convolution layers (default 3).\n        stride (int): Stride of this context block (default 1).\n        num_layers (int): Number of depthwise convolution layers for this context block (default 1).\n        dropout (float): Dropout (default 0.1).\n        residual (bool): Whether to apply residual connection at this context block (default True).\n    \"\"\"\n\n    def __init__(\n        
self,\n        in_channels: int,\n        out_channels: int,\n        kernel_size: int = 3,\n        stride: int = 1,\n        num_layers: int = 1,\n        dropout: float = 0.1,\n        residual: bool = True,\n    ):\n        super().__init__()\n\n        self.convs_list = [\n            ConvModule(\n                in_channels if i == 0 else out_channels,\n                out_channels,\n                kernel_size=kernel_size,\n                stride=stride if i == num_layers - 1 else 1,\n                padding=kernel_size // 2 - stride + 1 if i == num_layers - 1 else kernel_size // 2\n            ) for i in range(num_layers)]\n\n        self.convs = nn.Sequential(*self.convs_list)\n\n        self.SE = SEModule(channels=out_channels, kernel_size=kernel_size, padding=kernel_size // 2)\n\n        self.drop = nn.Dropout(dropout)\n\n        if residual:\n            self.residual = ConvModule(in_channels,\n                out_channels,\n                kernel_size=kernel_size,\n                padding=kernel_size // 2 - stride + 1,\n                stride=stride,\n                activation=None)\n        else:\n            self.residual = None\n        \n        self.activation = Swish()\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, channels).\n\n        Returns:\n            torch.Tensor: Output tensor (batch, time, channels).\n        \"\"\"\n        out = self.convs(x)\n        out = self.SE(out)\n        if self.residual:\n            out = out + self.residual(x)\n        out = self.activation(out)\n        out = self.drop(out)\n        return out\n\n\nclass SEModule(torch.nn.Module):\n    \"\"\"Squeeze-and-Excitation module.\n\n    Args:\n        channels (int): Input and output channels.\n        kernel_size (int) : Kernel size of convolution layers (default 3).\n        padding (int): Zero-padding added to both sides of the input.\n    \"\"\"\n\n    def __init__(\n        self,\n        
channels: int,\n        kernel_size: int,\n        padding: int\n    ):\n        super().__init__()\n\n        self.conv = ConvModule(channels, channels, kernel_size=kernel_size, padding=padding, stride=1)\n\n        self.avg_pool = nn.AdaptiveAvgPool1d(1)\n\n        self.bottleneck = nn.Sequential(\n            torch.nn.Linear(channels, channels // 8),\n            Swish(),\n            torch.nn.Linear(channels // 8, channels),\n            Swish(),\n        )\n\n        self.final_act = torch.nn.Sigmoid()\n\n    def forward(self, x):\n        \"\"\"Squeeze and excitation\n\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, channels).\n\n        Returns:\n            torch.Tensor: Output tensor (batch, time, channels).\n        \"\"\"\n        B, T, C = x.shape\n\n        x = self.conv(x).transpose(1, -1) # (B, C, T)\n        avg = self.avg_pool(x).transpose(1,-1) # (B, 1, C)\n        avg = self.bottleneck(avg)\n        avg = self.final_act(avg)\n        context = avg.repeat(1, T, 1) # (B, T, C)\n        out = x.transpose(1, -1) * context\n        return out\n\n\nclass ConvModule(torch.nn.Module):\n    \"\"\"\n    Args:\n        in_channels (int): Number of input channels.\n        out_channels (int): Number of output channels.\n        kernel_size (int): Size of the convolving kernel.\n        stride (int): Stride of the convolution (default 1).\n        dilation (int): Spacing between kernel elements (default 1).\n        padding (int): Zero-padding added to both sides of the input.\n        padding_mode (str): 'zeros', 'reflect', 'replicate' or 'circular' (default 'zeros').\n        bias (bool): If True, adds a learnable bias to the output (default: True).\n        activation (object): activation function used in this convolution module. 
(default: Swish)\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels: int,\n        out_channels: int,\n        kernel_size: int,\n        stride: int = 1,\n        dilation: int = 1,\n        padding: int = 0,\n        padding_mode : str = 'zeros',\n        bias: bool = True,\n        activation = Swish\n    ):\n        super().__init__()\n\n        self.conv = SeparableConv1D(\n            in_channels,\n            out_channels,\n            kernel_size=kernel_size,\n            stride=stride,\n            dilation=dilation,\n            padding=padding,\n            padding_mode=padding_mode,\n            bias=bias,\n        )\n\n        self.norm = torch.nn.BatchNorm1d(out_channels)\n\n        if activation:\n            self.activation = activation()\n        else:\n            self.activation = None\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, channels).\n\n        Returns:\n            torch.Tensor: Output tensor (batch, new_time, channels).\n        \"\"\"\n        x = self.conv(x).transpose(1, -1) # (B, C, T)\n        x = self.norm(x)\n        if self.activation:   \n            x = self.activation(x)\n        x = x.transpose(1, -1) # (B, T, C)\n        return x\n\n\nclass SeparableConv1D(nn.Module):\n    \"\"\"Depthwise separable 1D convolution.\n\n    Args:\n        in_channels (int): Number of input channels.\n        out_channels (int): Number of output channels.\n        kernel_size (int): Size of the convolving kernel.\n        stride (int): Stride of the convolution (default 1).\n        dilation (int): Spacing between kernel elements (default 1).\n        padding (int): Zero-padding added to both sides of the input.\n        padding_mode (str): 'zeros', 'reflect', 'replicate' or 'circular' (default 'zeros').\n        bias (bool): If True, adds a learnable bias to the output (default: True).\n\n    \"\"\"\n\n    def __init__(\n        self,\n        
in_channels: int,\n        out_channels: int,\n        kernel_size: int,\n        stride: int = 1,\n        dilation: int = 1,\n        padding: int = 0,\n        padding_mode : str = 'zeros',\n        bias: bool = True,\n    ):\n        super().__init__()\n\n        self.depthwise = nn.Conv1d(\n            in_channels,\n            in_channels,\n            kernel_size=kernel_size,\n            stride=stride,\n            dilation=dilation,\n            padding=padding,\n            padding_mode=padding_mode,\n            groups=in_channels,\n            bias=bias,\n        )\n\n        self.pointwise = nn.Conv1d(\n            in_channels,\n            out_channels,\n            kernel_size=1\n        )\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x (torch.Tensor): Input tensor (batch, time, channels).\n\n        Returns:\n            torch.Tensor: Output tensor (batch, time, channels).\n        \"\"\"\n        x = x.transpose(1, -1) # (B, C, T)\n        x = self.pointwise(self.depthwise(x)).transpose(1, -1) # (B, T, C)\n        return x"
  },
  {
    "path": "snowfall/models/interface.py",
    "content": "from typing import Optional\n\nfrom torch import nn\nfrom torch.utils.tensorboard import SummaryWriter\n\n\nclass AcousticModel(nn.Module):\n    \"\"\"\n    AcousticModel specifies the common attributes/methods that\n    will be exposed by all Snowfall acoustic model networks.\n    Think of it as an interface class.\n    \"\"\"\n\n    # A.k.a. the input feature dimension.\n    num_features: int\n\n    # A.k.a. the output dimension (could be the number of phones or\n    # characters in the vocabulary).\n    num_classes: int\n\n    # When greater than one, the network's output sequence length will be\n    # this many times smaller than the input sequence length.\n    subsampling_factor: int\n\n    def write_tensorboard_diagnostics(\n            self,\n            tb_writer: SummaryWriter,\n            global_step: Optional[int] = None\n    ):\n        \"\"\"\n        Collect interesting diagnostic info about the model and write it to TensorBoard.\n        Unless overridden, logs nothing.\n\n        :param tb_writer: a TensorBoard ``SummaryWriter`` instance.\n        :param global_step: optional number of total training steps done so far.\n        \"\"\"\n        pass\n"
  },
  {
    "path": "snowfall/models/tdnn.py",
    "content": "from typing import Optional\n\nfrom torch import Tensor\nfrom torch import nn\n\n# Copyright (c)  2020  Xiaomi Corporation (authors: Daniel Povey, Haowen Qiu)\n# Apache 2.0\nfrom torch.utils.tensorboard import SummaryWriter\n\nfrom snowfall.models import AcousticModel\nfrom snowfall.training.diagnostics import measure_weight_norms\n\n\nclass Tdnn1a(AcousticModel):\n    \"\"\"\n    Args:\n        num_features (int): Number of input features\n        num_classes (int): Number of output classes\n    \"\"\"\n\n    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 3) -> None:\n        super(Tdnn1a, self).__init__()\n        self.num_features = num_features\n        self.num_classes = num_classes\n        self.subsampling_factor = subsampling_factor\n        self.tdnn = nn.Sequential(\n            nn.Conv1d(in_channels=num_features,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      
out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=self.subsampling_factor,  # <---- stride=3: subsampling_factor!\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=2000,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=2000, affine=False),\n            nn.Conv1d(in_channels=2000,\n                      out_channels=2000,\n                      kernel_size=1,\n                      stride=1,\n                      padding=0), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=2000, affine=False),\n            nn.Conv1d(in_channels=2000,\n                      out_channels=num_classes,\n                      kernel_size=1,\n                      stride=1,\n                      padding=0))\n\n    def forward(self, x: Tensor) -> Tensor:\n        r\"\"\"\n        Args:\n            x (torch.Tensor): Tensor of dimension (batch_size, num_features, 
input_length).\n\n        Returns:\n            Tensor: Log-softmax output of dimension (batch_size, number_of_classes, output_length),\n                where output_length is roughly input_length / subsampling_factor.\n        \"\"\"\n\n        x = self.tdnn(x)\n        x = nn.functional.log_softmax(x, dim=1)\n        return x\n\n    def write_tensorboard_diagnostics(\n            self,\n            tb_writer: SummaryWriter,\n            global_step: Optional[int] = None\n    ):\n        tb_writer.add_scalars(\n            'train/weight_l2_norms',\n            measure_weight_norms(self, norm='l2'),\n            global_step=global_step\n        )\n        tb_writer.add_scalars(\n            'train/weight_max_norms',\n            measure_weight_norms(self, norm='linf'),\n            global_step=global_step\n        )\n"
  },
  {
    "path": "snowfall/models/tdnn_lstm.py",
    "content": "from typing import Optional\n\nfrom torch import Tensor\nfrom torch import nn\nfrom torch.utils.tensorboard import SummaryWriter\n\nfrom snowfall.models import AcousticModel\nfrom snowfall.training.diagnostics import measure_weight_norms\n\n\nclass TdnnLstm1a(AcousticModel):\n    \"\"\"\n    Args:\n        num_features (int): Number of input features\n        num_classes (int): Number of output classes\n    \"\"\"\n\n    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 3) -> None:\n        super().__init__()\n        self.num_features = num_features\n        self.num_classes = num_classes\n        self.subsampling_factor = subsampling_factor\n        self.tdnn = nn.Sequential(\n            nn.Conv1d(in_channels=num_features,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                   
   padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=self.subsampling_factor,  # <---- stride=3: subsampling_factor!\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n        )\n        self.lstm = nn.LSTM(500, 500)\n        self.dropout = nn.Dropout(0.5)\n        self.tdnn2 = nn.Sequential(\n            nn.Conv1d(in_channels=500,\n                      out_channels=2000,\n                      kernel_size=1,\n                      stride=1,\n                      padding=0), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=2000, affine=False),\n            nn.Conv1d(in_channels=2000,\n                      out_channels=num_classes,\n                      kernel_size=1,\n                      stride=1,\n                      padding=0)\n        )\n\n    def forward(self, x: Tensor) -> Tensor:\n        r\"\"\"\n        Args:\n            x (torch.Tensor): Tensor of dimension (batch_size, num_features, input_length).\n\n        Returns:\n            Tensor: Predictor tensor of dimension (batch_size, number_of_classes, input_length).\n        \"\"\"\n        x = self.tdnn(x)\n        x, _ = self.lstm(x.permute(2, 0, 1))  # (B, F, T) -> (T, B, F)\n        x = 
x.permute(1, 2, 0)  # (T, B, F) -> (B, F, T)\n        x = self.dropout(x)\n        x = self.tdnn2(x)\n        x = nn.functional.log_softmax(x, dim=1)\n        return x\n\n\nclass TdnnLstm1b(AcousticModel):\n    \"\"\"\n    Args:\n        num_features (int): Number of input features\n        num_classes (int): Number of output classes\n    \"\"\"\n\n    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 3) -> None:\n        super().__init__()\n        self.num_features = num_features\n        self.num_classes = num_classes\n        self.subsampling_factor = subsampling_factor\n        self.tdnn = nn.Sequential(\n            nn.Conv1d(in_channels=num_features,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=1,\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n            nn.Conv1d(in_channels=500,\n                      out_channels=500,\n                      kernel_size=3,\n                      stride=self.subsampling_factor,  # <---- stride: subsampling_factor!\n                      padding=1), nn.ReLU(inplace=True),\n            nn.BatchNorm1d(num_features=500, affine=False),\n        )\n        self.lstms = nn.ModuleList([\n            nn.LSTM(input_size=500, hidden_size=500, num_layers=1)\n            for _ in range(5)\n        ])\n        self.lstm_bnorms = nn.ModuleList([\n            nn.BatchNorm1d(num_features=500, affine=False)\n            for _ in range(5)\n        ])\n        self.dropout = nn.Dropout(0.2)\n        self.linear = nn.Linear(in_features=500, out_features=self.num_classes)\n\n    def forward(self, x: Tensor) -> 
Tensor:\n        \"\"\"\n        Args:\n            x (torch.Tensor): Tensor of dimension (batch_size, num_features, input_length).\n\n        Returns:\n            Tensor: Log-softmax output of dimension (batch_size, number_of_classes, output_length),\n                where output_length is roughly input_length / subsampling_factor.\n        \"\"\"\n        x = self.tdnn(x)\n        x = x.permute(2, 0, 1)  # (B, F, T) -> (T, B, F) -> how LSTM expects it\n        for lstm, bnorm in zip(self.lstms, self.lstm_bnorms):\n            x_new, _ = lstm(x)\n            x_new = bnorm(x_new.permute(1, 2, 0)).permute(2, 0, 1)  # (T, B, F) -> (B, F, T) -> (T, B, F)\n            x_new = self.dropout(x_new)\n            x = x_new + x  # skip connections\n        x = x.transpose(1, 0)  # (T, B, F) -> (B, T, F) -> linear expects \"features\" in the last dim\n        x = self.linear(x)\n        x = x.transpose(1, 2)  # (B, T, F) -> (B, F, T) -> shape expected by Snowfall\n        x = nn.functional.log_softmax(x, dim=1)\n        return x\n\n    def write_tensorboard_diagnostics(\n            self,\n            tb_writer: SummaryWriter,\n            global_step: Optional[int] = None\n    ):\n        tb_writer.add_scalars(\n            'train/weight_l2_norms',\n            measure_weight_norms(self, norm='l2'),\n            global_step=global_step\n        )\n        tb_writer.add_scalars(\n            'train/weight_max_norms',\n            measure_weight_norms(self, norm='linf'),\n            global_step=global_step\n        )\n"
  },
  {
    "path": "snowfall/models/tdnnf.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright 2021 John's Hopkins University (author: Piotr Żelasko)\n# Copyright 2020 Mobvoi AI Lab, Beijing, China (author: Fangjun Kuang)\n# Apache 2.0\nfrom typing import Optional\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch.utils.tensorboard import SummaryWriter\n\nfrom snowfall.models import AcousticModel\nfrom snowfall.training.diagnostics import measure_semiorthogonality, measure_weight_norms\n\n\"\"\"\nCAUTION! This model is not fully ported from Kaldi. It will converge, but its training\nis still unstable and it seems to underperform its Kaldi counterpart.\nWe expect to improve this going forward.\n\"\"\"\n\n\ndef tdnnf_optimizer(\n        model: nn.Module,\n        learning_rate: float = 5e-5,\n        momentum: float = 0.9,\n        weight_decay: float = 1e-5) -> torch.optim.Optimizer:\n    \"\"\"\n    This is an example of an optimizer with parameter/layer-specific learning rates.\n    We don't use it by default but it can be helpful in tuning the training of a specific model.\n    \"\"\"\n    out_layer_keys = {'output_affine.weight', 'output_affine.bias', 'prefinal_l.weight', 'prefinal_l.bias'}\n    return torch.optim.SGD([\n        # Default optimization settings\n        {'params': [p for key, p in model.named_parameters() if key not in out_layer_keys]},\n        # Output layer may need smaller LR\n        {'params': [model.output_affine.weight], 'lr': learning_rate * 0.5},\n        {'params': [model.output_affine.bias], 'lr': learning_rate * 0.1},\n    ],\n        lr=learning_rate,\n        momentum=momentum,\n        weight_decay=weight_decay\n    )\n\n\nclass Tdnnf1a(AcousticModel):\n    \"\"\"\n    This is a PyTorch implementation of a standard Kaldi TDNN-F model architecture.\n    The default configuration is based on the Kaldi nnet3 xconfig below,\n    except it doesn't use an LDA transform.\n    Note that unlike Kaldi models it does not have a cross-entropy output 
layer,\n    as Snowfall does not support alignments in training at this time.\n\n    .. code-block:\n\n        input dim=43 name=input\n        fixed-affine-layer name=lda input=Append(-1,0,1) affine-transform-file=exp/chain_cleaned_1c/tdnn1c_sp/configs/lda.mat\n        relu-batchnorm-dropout-layer name=tdnn1 l2-regularize=0.008 dropout-proportion=0.0 dropout-per-dim-continuous=true dim=1024\n        tdnnf-layer name=tdnnf2 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=1\n        tdnnf-layer name=tdnnf3 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=1\n        tdnnf-layer name=tdnnf4 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=1\n        tdnnf-layer name=tdnnf5 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=0\n        tdnnf-layer name=tdnnf6 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf7 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf8 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf9 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf10 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf11 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf12 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=3\n        tdnnf-layer name=tdnnf13 l2-regularize=0.008 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 
bottleneck-dim=128 time-stride=3\n        linear-component name=prefinal-l dim=256 l2-regularize=0.008 orthonormal-constraint=-1.0\n        prefinal-layer name=prefinal-chain input=prefinal-l l2-regularize=0.008 big-dim=1024 small-dim=256\n        output-layer name=output include-log-softmax=false dim=3456 l2-regularize=0.002\n    \"\"\"\n\n    def __init__(self,\n                 num_features,\n                 num_classes,\n                 hidden_dim=1024,\n                 bottleneck_dim=128,\n                 prefinal_bottleneck_dim=256,\n                 kernel_size_list=[3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3],\n                 subsampling_factor_list=[1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1],\n                 subsampling_factor=3):\n        super().__init__()\n\n        self.num_features = num_features\n        self.num_classes = num_classes\n        self.subsampling_factor = subsampling_factor\n\n        # at present, we support only frame_subsampling_factor to be 3\n        assert self.subsampling_factor == 3\n\n        assert len(kernel_size_list) == len(subsampling_factor_list)\n        num_layers = len(kernel_size_list)\n\n        self.ortho_constrain_count = 0\n\n        self.input_batch_norm = nn.BatchNorm1d(num_features=self.num_features, affine=False)\n\n        self.tdnn1 = TDNN(input_dim=self.num_features, hidden_dim=hidden_dim)\n\n        tdnnfs = []\n        for i in range(num_layers):\n            kernel_size = kernel_size_list[i]\n            subsampling_factor = subsampling_factor_list[i]\n            layer = FactorizedTDNN(dim=hidden_dim,\n                                   bottleneck_dim=bottleneck_dim,\n                                   kernel_size=kernel_size,\n                                   subsampling_factor=subsampling_factor,\n                                   cnn_padding=int(subsampling_factor == 1))\n            tdnnfs.append(layer)\n\n        # tdnnfs requires [N, C, T]\n        self.tdnnfs = nn.ModuleList(tdnnfs)\n\n        # 
prefinal_l affine requires [N, C, T]\n        self.prefinal_l = OrthonormalLinear(\n            dim=hidden_dim,\n            bottleneck_dim=prefinal_bottleneck_dim,\n            kernel_size=1)\n\n        # prefinal_chain requires [N, C, T]\n        self.prefinal_chain = PrefinalLayer(big_dim=hidden_dim,\n                                            small_dim=prefinal_bottleneck_dim)\n\n        # output_affine requires [N, T, C]\n        self.output_affine = nn.Linear(in_features=prefinal_bottleneck_dim,\n                                       out_features=self.num_classes)\n\n        self.register_forward_pre_hook(constrain_orthonormal_hook)\n\n    def forward(self, x, dropout=0.):\n        # input x is of shape: [batch_size, feat_dim, seq_len] = [N, C, T]\n        assert x.ndim == 3\n\n        # at this point, x is [N, C, T]\n        x = self.input_batch_norm(x)\n\n        # at this point, x is [N, C, T]\n        x = self.tdnn1(x, dropout=dropout)\n\n        # tdnnf requires input of shape [N, C, T]\n        for layer in self.tdnnfs:\n            x = layer(x, dropout=dropout)\n\n        # at this point, x is [N, C, T]\n        x = self.prefinal_l(x)\n\n        # at this point, x is [N, C, T]\n        nnet_output = self.prefinal_chain(x)\n        # at this point, nnet_output is [N, C, T]\n        nnet_output = nnet_output.permute(0, 2, 1)\n        # at this point, nnet_output is [N, T, C]\n        nnet_output = self.output_affine(nnet_output)\n        nnet_output = F.log_softmax(nnet_output, dim=2)\n        # we return nnet_output [N, C, T]\n        nnet_output = nnet_output.permute(0, 2, 1)\n        return nnet_output\n\n    def write_tensorboard_diagnostics(self, tb_writer: SummaryWriter, global_step: Optional[int] = None):\n        tb_writer.add_scalars(\n            'train/semiorthogonality_score',\n            measure_semiorthogonality(self),\n            global_step=global_step\n        )\n        tb_writer.add_scalars(\n            'train/weight_l2_norms',\n  
          measure_weight_norms(self, norm='l2'),\n            global_step=global_step\n        )\n        tb_writer.add_scalars(\n            'train/weight_max_norms',\n            measure_weight_norms(self, norm='linf'),\n            global_step=global_step\n        )\n\n\ndef constrain_orthonormal_hook(model, unused_x):\n    if not model.training:\n        return\n\n    model.ortho_constrain_count = (model.ortho_constrain_count + 1) % 2\n    if model.ortho_constrain_count != 0:\n        return\n\n    with torch.no_grad():\n        for m in model.modules():\n            if hasattr(m, 'constrain_orthonormal'):\n                m.constrain_orthonormal()\n\n\ndef _constrain_orthonormal_internal(M):\n    '''\n    Refer to\n        void ConstrainOrthonormalInternal(BaseFloat scale, CuMatrixBase<BaseFloat> *M)\n    from\n        https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-utils.cc#L982\n    Note that we always use the **floating** case.\n    '''\n    assert M.ndim == 2\n\n    num_rows = M.size(0)\n    num_cols = M.size(1)\n\n    assert num_rows <= num_cols\n\n    # P = M * M^T\n    P = torch.mm(M, M.t())\n    P_PT = torch.mm(P, P.t())\n\n    trace_P = torch.trace(P)\n    trace_P_P = torch.trace(P_PT)\n\n    scale = torch.sqrt(trace_P_P / trace_P)\n\n    ratio = trace_P_P * num_rows / (trace_P * trace_P)\n    assert ratio > 0.99\n\n    update_speed = 0.125\n\n    if ratio > 1.02:\n        update_speed *= 0.5\n        if ratio > 1.1:\n            update_speed *= 0.5\n\n    identity = torch.eye(num_rows, dtype=P.dtype, device=P.device)\n    P = P - scale * scale * identity\n\n    alpha = update_speed / (scale * scale)\n    M = M - 4 * alpha * torch.mm(P, M)\n    return M\n\n\nclass SharedDimScaleDropout(nn.Module):\n    def __init__(self, dim=1):\n        '''\n        Continuous scaled dropout that is const over chosen dim (usually across time)\n        Multiplies inputs by random mask taken from Uniform([1 - 2\\alpha, 1 + 2\\alpha])\n        '''\n        
super().__init__()\n        self.dim = dim\n        self.register_buffer('mask', torch.tensor(0.))\n\n    def forward(self, x, alpha=0.0):\n        if self.training and alpha > 0.:\n            # sample mask from uniform dist with dim of length 1 in self.dim and then repeat to match size\n            tied_mask_shape = list(x.shape)\n            tied_mask_shape[self.dim] = 1\n            repeats = [1 if i != self.dim else x.shape[self.dim]\n                       for i in range(len(x.shape))]\n            return x * self.mask.repeat(tied_mask_shape).uniform_(1 - 2 * alpha, 1 + 2 * alpha).repeat(repeats)\n            # expected value of dropout mask is 1 so no need to scale outputs like vanilla dropout\n        return x\n\n\nclass OrthonormalLinear(nn.Module):\n\n    def __init__(self, dim, bottleneck_dim, kernel_size):\n        super().__init__()\n        # WARNING(fangjun): kaldi uses [-1, 0] for the first linear layer\n        # and [0, 1] for the second affine layer;\n        # we use [-1, 0, 1] for the first linear layer if time_stride == 1\n\n        self.kernel_size = kernel_size\n\n        # conv requires [N, C, T]\n        self.conv = nn.Conv1d(in_channels=dim,\n                              out_channels=bottleneck_dim,\n                              kernel_size=kernel_size,\n                              bias=False)\n\n    def forward(self, x):\n        # input x is of shape: [batch_size, feat_dim, seq_len] = [N, C, T]\n        assert x.ndim == 3\n        x = self.conv(x)\n        return x\n\n    def constrain_orthonormal(self):\n        state_dict = self.conv.state_dict()\n        w = state_dict['weight']\n        # w is of shape [out_channels, in_channels, kernel_size]\n        out_channels = w.size(0)\n        in_channels = w.size(1)\n        kernel_size = w.size(2)\n\n        w = w.reshape(out_channels, -1)\n\n        num_rows = w.size(0)\n        num_cols = w.size(1)\n\n        need_transpose = False\n        if num_rows > num_cols:\n            w = 
w.t()\n            need_transpose = True\n\n        w = _constrain_orthonormal_internal(w)\n\n        if need_transpose:\n            w = w.t()\n\n        w = w.reshape(out_channels, in_channels, kernel_size)\n\n        state_dict['weight'] = w\n        self.conv.load_state_dict(state_dict)\n\n\nclass PrefinalLayer(nn.Module):\n\n    def __init__(self, big_dim, small_dim):\n        super().__init__()\n        self.affine = nn.Linear(in_features=small_dim, out_features=big_dim)\n        self.batchnorm1 = nn.BatchNorm1d(num_features=big_dim, affine=False)\n        self.linear = OrthonormalLinear(dim=big_dim,\n                                        bottleneck_dim=small_dim,\n                                        kernel_size=1)\n        self.batchnorm2 = nn.BatchNorm1d(num_features=small_dim, affine=False)\n\n    def forward(self, x):\n        # x is [N, C, T]\n        x = x.permute(0, 2, 1)\n\n        # at this point, x is [N, T, C]\n\n        x = self.affine(x)\n        x = F.relu(x)\n\n        # at this point, x is [N, T, C]\n\n        x = x.permute(0, 2, 1)\n\n        # at this point, x is [N, C, T]\n\n        x = self.batchnorm1(x)\n\n        x = self.linear(x)\n\n        x = self.batchnorm2(x)\n\n        return x\n\n\nclass TDNN(nn.Module):\n    '''\n    This class implements the following topology in kaldi:\n      relu-batchnorm-dropout-layer name=tdnn1 dropout-per-dim-continuous=true dim=1024\n    '''\n\n    def __init__(self, input_dim, hidden_dim):\n        super().__init__()\n        # affine conv1d requires [N, C, T]\n        self.affine = nn.Conv1d(in_channels=input_dim,\n                                out_channels=hidden_dim,\n                                kernel_size=1)\n\n        # tdnn1_batchnorm requires [N, C, T]\n        self.batchnorm = nn.BatchNorm1d(num_features=hidden_dim,\n                                        affine=False)\n\n        self.dropout = SharedDimScaleDropout(dim=2)\n\n    def forward(self, x, dropout=0.):\n        # input x 
is of shape: [batch_size, feat_dim, seq_len] = [N, C, T]\n        x = self.affine(x)\n        x = F.relu(x)\n        x = self.batchnorm(x)\n        x = self.dropout(x, alpha=dropout)\n        # return shape is [N, C, T]\n        return x\n\n\nclass FactorizedTDNN(nn.Module):\n    '''\n    This class implements the following topology in kaldi:\n      tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=1\n    References:\n        - http://danielpovey.com/files/2018_interspeech_tdnnf.pdf\n        - ConstrainOrthonormalInternal() from\n          https://github.com/kaldi-asr/kaldi/blob/master/src/nnet3/nnet-utils.cc#L982\n    '''\n\n    def __init__(self,\n                 dim,\n                 bottleneck_dim,\n                 kernel_size,\n                 subsampling_factor,\n                 bypass_scale=0.66,\n                 cnn_padding=1):\n        super().__init__()\n\n        assert abs(bypass_scale) <= 1\n\n        self.bypass_scale = bypass_scale\n\n        self.s = subsampling_factor\n\n        # linear requires [N, C, T]\n        self.linear = OrthonormalLinear(dim=dim,\n                                        bottleneck_dim=bottleneck_dim,\n                                        kernel_size=kernel_size)\n\n        # affine requires [N, C, T]\n        # WARNING(fangjun): we do not use nn.Linear here\n        # since we want to use `stride`\n        self.affine = nn.Conv1d(in_channels=bottleneck_dim,\n                                out_channels=dim,\n                                kernel_size=1,\n                                stride=subsampling_factor,\n                                padding=cnn_padding)\n\n        # batchnorm requires [N, C, T]\n        self.batchnorm = nn.BatchNorm1d(num_features=dim, affine=False)\n\n        self.dropout = SharedDimScaleDropout(dim=2)\n\n    def forward(self, x, dropout=0.):\n        # input x is of shape: [batch_size, feat_dim, seq_len] = [N, C, T]\n        assert x.ndim == 3\n\n        # 
save it for skip connection\n        input_x = x\n\n        x = self.linear(x)\n        # at this point, x is [N, C, T]\n\n        x = self.affine(x)\n        # at this point, x is [N, C, T]\n\n        x = F.relu(x)\n\n        # at this point, x is [N, C, T]\n\n        x = self.batchnorm(x)\n\n        # at this point, x is [N, C, T]\n\n        x = self.dropout(x, alpha=dropout)\n\n        if self.linear.kernel_size > 1:\n            # padding takes care of keeping the shapes correct\n            x = self.bypass_scale * input_x + x\n        else:\n            x = self.bypass_scale * input_x[:, :, ::self.s] + x\n        return x\n\n\n"
  },
  {
    "path": "snowfall/models/transformer.py",
    "content": "#!/usr/bin/env python3\n\n# Copyright (c)  2021  University of Chinese Academy of Sciences (author: Han Zhu)\n# Apache 2.0\n\nimport k2\nimport math\nimport torch\nfrom torch import Tensor, nn\nfrom typing import Dict, List, Optional, Tuple\n\nfrom snowfall.common import get_texts\nfrom snowfall.models import AcousticModel\n\n\nclass Transformer(AcousticModel):\n    \"\"\"\n    Args:\n        num_features (int): Number of input features\n        num_classes (int): Number of output classes\n        subsampling_factor (int): subsampling factor of encoder (the convolution layers before transformers)\n        d_model (int): attention dimension\n        nhead (int): number of head\n        dim_feedforward (int): feedforward dimention\n        num_encoder_layers (int): number of encoder layers\n        num_decoder_layers (int): number of decoder layers\n        dropout (float): dropout rate\n        normalize_before (bool): whether to use layer_norm before the first block.\n        vgg_frontend (bool): whether to use vgg frontend.\n    \"\"\"\n\n    def __init__(self, num_features: int, num_classes: int, subsampling_factor: int = 4,\n                 d_model: int = 256, nhead: int = 4, dim_feedforward: int = 2048,\n                 num_encoder_layers: int = 12, num_decoder_layers: int = 6,\n                 dropout: float = 0.1, normalize_before: bool = True,\n                 vgg_frontend: bool = False) -> None:\n        super().__init__()\n        self.num_features = num_features\n        self.num_classes = num_classes\n        self.subsampling_factor = subsampling_factor\n        if subsampling_factor != 4:\n            raise NotImplementedError(\"Support only 'subsampling_factor=4'.\")\n\n        self.encoder_embed = (VggSubsampling(num_features, d_model) if vgg_frontend else\n                              Conv2dSubsampling(num_features, d_model))\n        self.encoder_pos = PositionalEncoding(d_model, dropout)\n\n        encoder_layer = 
TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, normalize_before=normalize_before)\n\n        if normalize_before:\n            encoder_norm = nn.LayerNorm(d_model)\n        else:\n            encoder_norm = None\n\n        self.encoder = nn.TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)\n\n        self.encoder_output_layer = nn.Sequential(\n            nn.Dropout(p=dropout),\n            nn.Linear(d_model, num_classes)\n        )\n\n        if num_decoder_layers > 0:\n            self.decoder_num_class = self.num_classes + 1  # +1 for the sos/eos symbol\n\n            self.decoder_embed = nn.Embedding(self.decoder_num_class, d_model)\n            self.decoder_pos = PositionalEncoding(d_model, dropout)\n\n            decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, normalize_before=normalize_before)\n\n            if normalize_before:\n                decoder_norm = nn.LayerNorm(d_model)\n            else:\n                decoder_norm = None\n\n            self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)\n\n            self.decoder_output_layer = torch.nn.Linear(d_model, self.decoder_num_class)\n\n            self.decoder_criterion = LabelSmoothingLoss(self.decoder_num_class)\n        else:\n            self.decoder_criterion = None\n\n    def forward(self, x: Tensor, supervision: Optional[Dict] = None) -> Tuple[Tensor, Tensor, Optional[Tensor]]:\n        \"\"\"\n        Args:\n            x: Tensor of dimension (batch_size, num_features, input_length).\n            supervision: Supervision in lhotse format, obtained from batch['supervisions']\n\n        Returns:\n            Tensor: Log-softmax output of dimension (batch_size, number_of_classes, input_length).\n            Tensor: Encoder output before the linear layer, of dimension (input_length, batch_size, d_model).\n            Optional[Tensor]: Mask tensor of dimension (batch_size, input_length) or None.\n\n       
 \"\"\"\n        encoder_memory, memory_mask = self.encode(x, supervision)\n        x = self.encoder_output(encoder_memory)\n        return x, encoder_memory, memory_mask\n\n    def encode(self, x: Tensor, supervisions: Optional[Dict] = None) -> Tuple[Tensor, Optional[Tensor]]:\n        \"\"\"\n        Args:\n            x: Tensor of dimension (batch_size, num_features, input_length).\n            supervisions: Supervision in lhotse format, i.e., batch['supervisions']\n\n        Returns:\n            Tensor: Predictor tensor of dimension (input_length, batch_size, d_model).\n            Optional[Tensor]: Mask tensor of dimension (batch_size, input_length) or None.\n        \"\"\"\n        x = x.permute(0, 2, 1)  # (B, F, T) -> (B, T, F)\n\n        x = self.encoder_embed(x)\n        x = self.encoder_pos(x)\n        x = x.permute(1, 0, 2)  # (B, T, F) -> (T, B, F)\n        mask = encoder_padding_mask(x.size(0), supervisions)\n        mask = mask.to(x.device) if mask is not None else None\n        x = self.encoder(x, src_key_padding_mask=mask)  # (T, B, F)\n\n        return x, mask\n\n    def encoder_output(self, x: Tensor) -> Tensor:\n        \"\"\"\n        Args:\n            x: Tensor of dimension (input_length, batch_size, d_model).\n\n        Returns:\n            Tensor: Log-softmax output of dimension (batch_size, number_of_classes, input_length).\n        \"\"\"\n        x = self.encoder_output_layer(x).permute(1, 2, 0)  # (T, B, F) -> (B, F, T)\n        x = nn.functional.log_softmax(x, dim=1)  # (B, F, T)\n        return x\n\n    def decoder_forward(self, x: Tensor, encoder_mask: Tensor, supervision: Dict, graph_compiler: object) -> Tensor:\n        \"\"\"\n        Args:\n            x: Tensor of dimension (input_length, batch_size, d_model).\n            encoder_mask: Mask tensor of dimension (batch_size, input_length)\n            supervision: Supervision in lhotse format, obtained from batch['supervisions']\n            graph_compiler: use graph_compiler.L_inv 
(Its labels are words, while its aux_labels are phones)\n                            , graph_compiler.words and graph_compiler.oov\n\n        Returns:\n            Tensor: Decoder loss.\n        \"\"\"\n        batch_text = get_normal_transcripts(supervision, graph_compiler.lexicon.words, graph_compiler.oov)\n        ys_in_pad, ys_out_pad = add_sos_eos(batch_text, graph_compiler.L_inv, self.decoder_num_class - 1,\n                                            self.decoder_num_class - 1)\n        ys_in_pad = ys_in_pad.to(x.device)\n        ys_out_pad = ys_out_pad.to(x.device)\n\n        tgt_mask = generate_square_subsequent_mask(ys_in_pad.shape[-1]).to(x.device)\n\n        tgt_key_padding_mask = decoder_padding_mask(ys_in_pad)\n\n        tgt = self.decoder_embed(ys_in_pad)  # (B, T) -> (B, T, F)\n        tgt = self.decoder_pos(tgt)\n        tgt = tgt.permute(1, 0, 2)  # (B, T, F) -> (T, B, F)\n        pred_pad = self.decoder(tgt=tgt,\n                                memory=x,\n                                tgt_mask=tgt_mask,\n                                tgt_key_padding_mask=tgt_key_padding_mask,\n                                memory_key_padding_mask=encoder_mask)  # (T, B, F)\n        pred_pad = pred_pad.permute(1, 0, 2)  # (T, B, F) -> (B, T, F)\n        pred_pad = self.decoder_output_layer(pred_pad)  # (B, T, F)\n\n        decoder_loss = self.decoder_criterion(pred_pad, ys_out_pad)\n\n        return decoder_loss\n\n\nclass TransformerEncoderLayer(nn.Module):\n    \"\"\"\n    Modified from torch.nn.TransformerEncoderLayer. 
Add support of normalize_before,\n    i.e., use layer_norm before the first block.\n\n    Args:\n        d_model: the number of expected features in the input (required).\n        nhead: the number of heads in the multiheadattention models (required).\n        dim_feedforward: the dimension of the feedforward network model (default=2048).\n        dropout: the dropout value (default=0.1).\n        activation: the activation function of intermediate layer, relu or gelu (default=relu).\n        normalize_before: whether to use layer_norm before the first block.\n\n    Examples::\n        >>> encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)\n        >>> src = torch.rand(10, 32, 512)\n        >>> out = encoder_layer(src)\n    \"\"\"\n\n    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,\n                 activation: str = \"relu\", normalize_before: bool = True) -> None:\n        super(TransformerEncoderLayer, self).__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm1 = nn.LayerNorm(d_model)\n        self.norm2 = nn.LayerNorm(d_model)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n\n        self.normalize_before = normalize_before\n\n    def __setstate__(self, state):\n        if 'activation' not in state:\n            state['activation'] = nn.functional.relu\n        super(TransformerEncoderLayer, self).__setstate__(state)\n\n    def forward(self, src: Tensor, src_mask: Optional[Tensor] = None,\n                src_key_padding_mask: Optional[Tensor] = None) -> Tensor:\n        \"\"\"\n        Pass the input through the encoder layer.\n\n       
 Args:\n            src: the sequence to the encoder layer (required).\n            src_mask: the mask for the src sequence (optional).\n            src_key_padding_mask: the mask for the src keys per batch (optional).\n\n        Shape:\n            src: (S, N, E).\n            src_mask: (S, S).\n            src_key_padding_mask: (N, S).\n            S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number\n        \"\"\"\n        residual = src\n        if self.normalize_before:\n            src = self.norm1(src)\n        src2 = self.self_attn(src, src, src, attn_mask=src_mask,\n                              key_padding_mask=src_key_padding_mask)[0]\n        src = residual + self.dropout1(src2)\n        if not self.normalize_before:\n            src = self.norm1(src)\n\n        residual = src\n        if self.normalize_before:\n            src = self.norm2(src)\n        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))\n        src = residual + self.dropout2(src2)\n        if not self.normalize_before:\n            src = self.norm2(src)\n        return src\n\n\nclass TransformerDecoderLayer(nn.Module):\n    \"\"\"\n    Modified from torch.nn.TransformerDecoderLayer. 
Adds support for normalize_before,\n    i.e., use layer_norm before the first block.\n\n    Args:\n        d_model: the number of expected features in the input (required).\n        nhead: the number of heads in the multiheadattention models (required).\n        dim_feedforward: the dimension of the feedforward network model (default=2048).\n        dropout: the dropout value (default=0.1).\n        activation: the activation function of intermediate layer, relu or gelu (default=relu).\n        normalize_before: whether to use layer_norm before the first block.\n\n    Examples::\n        >>> decoder_layer = TransformerDecoderLayer(d_model=512, nhead=8)\n        >>> memory = torch.rand(10, 32, 512)\n        >>> tgt = torch.rand(20, 32, 512)\n        >>> out = decoder_layer(tgt, memory)\n    \"\"\"\n\n    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,\n                 activation: str = \"relu\", normalize_before: bool = True) -> None:\n        super(TransformerDecoderLayer, self).__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)\n        self.src_attn = nn.MultiheadAttention(d_model, nhead, dropout=0.0)\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm1 = nn.LayerNorm(d_model)\n        self.norm2 = nn.LayerNorm(d_model)\n        self.norm3 = nn.LayerNorm(d_model)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n        self.dropout3 = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n\n        self.normalize_before = normalize_before\n\n    def __setstate__(self, state):\n        if 'activation' not in state:\n            state['activation'] = nn.functional.relu\n        super(TransformerDecoderLayer, self).__setstate__(state)\n\n    def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: 
Optional[Tensor] = None,\n                memory_mask: Optional[Tensor] = None,\n                tgt_key_padding_mask: Optional[Tensor] = None,\n                memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:\n        \"\"\"Pass the inputs (and mask) through the decoder layer.\n\n        Args:\n            tgt: the sequence to the decoder layer (required).\n            memory: the sequence from the last layer of the encoder (required).\n            tgt_mask: the mask for the tgt sequence (optional).\n            memory_mask: the mask for the memory sequence (optional).\n            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).\n            memory_key_padding_mask: the mask for the memory keys per batch (optional).\n\n        Shape:\n            tgt: (T, N, E).\n            memory: (S, N, E).\n            tgt_mask: (T, T).\n            memory_mask: (T, S).\n            tgt_key_padding_mask: (N, T).\n            memory_key_padding_mask: (N, S).\n            S is the source sequence length, T is the target sequence length, N is the batch size, E is the feature number\n        \"\"\"\n        residual = tgt\n        if self.normalize_before:\n            tgt = self.norm1(tgt)\n        tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = residual + self.dropout1(tgt2)\n        if not self.normalize_before:\n            tgt = self.norm1(tgt)\n\n        residual = tgt\n        if self.normalize_before:\n            tgt = self.norm2(tgt)\n        tgt2 = self.src_attn(tgt, memory, memory, attn_mask=memory_mask,\n                             key_padding_mask=memory_key_padding_mask)[0]\n        tgt = residual + self.dropout2(tgt2)\n        if not self.normalize_before:\n            tgt = self.norm2(tgt)\n\n        residual = tgt\n        if self.normalize_before:\n            tgt = self.norm3(tgt)\n        tgt2 = 
self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = residual + self.dropout3(tgt2)\n        if not self.normalize_before:\n            tgt = self.norm3(tgt)\n        return tgt\n\n\ndef _get_activation_fn(activation: str):\n    if activation == \"relu\":\n        return nn.functional.relu\n    elif activation == \"gelu\":\n        return nn.functional.gelu\n\n    raise RuntimeError(\"activation should be relu/gelu, not {}\".format(activation))\n\n\nclass Conv2dSubsampling(nn.Module):\n    \"\"\"Convolutional 2D subsampling (to 1/4 length).\n        Modified from https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/transformer/subsampling.py\n\n    Args:\n        idim: Input dimension.\n        odim: Output dimension.\n\n    \"\"\"\n\n    def __init__(self, idim: int, odim: int) -> None:\n        \"\"\"Construct a Conv2dSubsampling object.\"\"\"\n        super(Conv2dSubsampling, self).__init__()\n        self.conv = nn.Sequential(\n            nn.Conv2d(in_channels=1, out_channels=odim, kernel_size=3, stride=2),\n            nn.ReLU(),\n            nn.Conv2d(in_channels=odim, out_channels=odim, kernel_size=3, stride=2),\n            nn.ReLU(),\n        )\n        self.out = nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim)\n\n    def forward(self, x: Tensor) -> Tensor:\n        \"\"\"Subsample x.\n\n        Args:\n            x: Input tensor of dimension (batch_size, input_length, num_features). 
(#batch, time, idim).\n\n        Returns:\n            torch.Tensor: Subsampled tensor of dimension (batch_size, input_length', d_model),\n                where input_length' == (((input_length - 1) // 2) - 1) // 2.\n\n        \"\"\"\n        x = x.unsqueeze(1)  # (b, c, t, f)\n        x = self.conv(x)\n        b, c, t, f = x.size()\n        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))\n        return x\n\n\nclass VggSubsampling(nn.Module):\n    \"\"\"Trying to follow the setup described here https://arxiv.org/pdf/1910.09799.pdf\n       This paper is not 100% explicit so I am guessing to some extent,\n       and trying to compare with other VGG implementations.\n\n    Args:\n        idim: Input dimension.\n        odim: Output dimension.\n\n    \"\"\"\n\n    def __init__(self, idim: int, odim: int) -> None:\n        \"\"\"Construct a VggSubsampling object.   This uses 2 VGG blocks with 2\n           Conv2d layers each, subsampling its input by a factor of 4 in the\n           time dimension.\n\n           Args:\n             idim:  Number of features at input, e.g. 40 or 80 for MFCC\n                    (will be treated as the image height).\n             odim:  Output dimension (number of features), e.g. 
256\n        \"\"\"\n        super(VggSubsampling, self).__init__()\n\n        cur_channels = 1\n        layers = []\n        block_dims = [32,64]\n\n        # The decision to use padding=1 for the 1st convolution, then padding=0\n        # for the 2nd and for the max-pooling, and ceil_mode=True, was driven by\n        # a back-compatibility concern so that the number of frames at the\n        # output would be equal to:\n        #  (((T-1)//2)-1)//2.\n        # We can consider changing this by using padding=1 on the 2nd convolution,\n        # so the num-frames at the output would be T//4.\n        for block_dim in block_dims:\n            layers.append(torch.nn.Conv2d(in_channels=cur_channels, out_channels=block_dim,\n                                          kernel_size=3, padding=1, stride=1))\n            layers.append(torch.nn.ReLU())\n            layers.append(torch.nn.Conv2d(in_channels=block_dim, out_channels=block_dim,\n                                          kernel_size=3, padding=0, stride=1))\n            layers.append(torch.nn.MaxPool2d(kernel_size=2, stride=2,\n                                             padding=0, ceil_mode=True))\n            cur_channels = block_dim\n\n        self.layers = nn.Sequential(*layers)\n\n        self.out = nn.Linear(block_dims[-1] * (((idim - 1) // 2 - 1) // 2), odim)\n\n\n    def forward(self, x: Tensor) -> Tensor:\n        \"\"\"Subsample x.\n\n        Args:\n            x: Input tensor of dimension (batch_size, input_length, num_features). 
(#batch, time, idim).\n\n        Returns:\n           torch.Tensor: Subsampled tensor of dimension (batch_size, input_length', d_model).\n              where input_length' == (((input_length - 1) // 2) - 1) // 2\n\n        \"\"\"\n        x = x.unsqueeze(1)  # (b, c, t, f)\n        x = self.layers(x)\n        b, c, t, f = x.size()\n        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))\n        return x\n\n\nclass PositionalEncoding(nn.Module):\n    \"\"\"\n    Positional encoding.\n\n    Args:\n        d_model: Embedding dimension.\n        dropout: Dropout rate.\n        max_len: Maximum input length.\n\n    \"\"\"\n\n    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000) -> None:\n        \"\"\"Construct an PositionalEncoding object.\"\"\"\n        super(PositionalEncoding, self).__init__()\n        self.d_model = d_model\n        self.xscale = math.sqrt(self.d_model)\n        self.dropout = nn.Dropout(p=dropout)\n        self.pe = None\n        self.extend_pe(torch.tensor(0.0).expand(1, max_len))\n\n    def extend_pe(self, x: Tensor) -> None:\n        \"\"\"Reset the positional encodings.\"\"\"\n        if self.pe is not None:\n            if self.pe.size(1) >= x.size(1):\n                if self.pe.dtype != x.dtype or self.pe.device != x.device:\n                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)\n                return\n        pe = torch.zeros(x.size(1), self.d_model)\n        position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)\n        div_term = torch.exp(\n            torch.arange(0, self.d_model, 2, dtype=torch.float32)\n            * -(math.log(10000.0) / self.d_model)\n        )\n        pe[:, 0::2] = torch.sin(position * div_term)\n        pe[:, 1::2] = torch.cos(position * div_term)\n        pe = pe.unsqueeze(0)\n        self.pe = pe.to(device=x.device, dtype=x.dtype)\n\n    def forward(self, x: Tensor) -> Tensor:\n        \"\"\"\n        Add positional encoding.\n\n   
     Args:\n            x: Input tensor of dimension (batch_size, input_length, d_model).\n\n        Returns:\n            torch.Tensor: Encoded tensor of dimension (batch_size, input_length, d_model).\n\n        \"\"\"\n        self.extend_pe(x)\n        x = x * self.xscale + self.pe[:, : x.size(1)]\n        return self.dropout(x)\n\n\nclass Noam(object):\n    \"\"\"\n    Implements the Noam optimizer. Proposed in \"Attention Is All You Need\", https://arxiv.org/pdf/1706.03762.pdf\n    Modified from https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/transformer/optimizer.py\n\n    Args:\n        params (iterable): iterable of parameters to optimize or dicts defining parameter groups\n        model_size: attention dimension of the transformer model\n        factor: learning rate factor\n        warm_step: warmup steps\n        weight_decay: weight decay passed to the underlying Adam optimizer\n    \"\"\"\n\n    def __init__(self, params, model_size: int = 256, factor: float = 10.0, warm_step: int = 25000, weight_decay=0) -> None:\n        \"\"\"Construct a Noam object.\"\"\"\n        self.optimizer = torch.optim.Adam(params, lr=0, betas=(0.9, 0.98), eps=1e-9, weight_decay=weight_decay)\n        self._step = 0\n        self.warmup = warm_step\n        self.factor = factor\n        self.model_size = model_size\n        self._rate = 0\n\n    @property\n    def param_groups(self):\n        \"\"\"Return param_groups.\"\"\"\n        return self.optimizer.param_groups\n\n    def step(self):\n        \"\"\"Update parameters and rate.\"\"\"\n        self._step += 1\n        rate = self.rate()\n        for p in self.optimizer.param_groups:\n            p[\"lr\"] = rate\n        self._rate = rate\n        self.optimizer.step()\n\n    def rate(self, step=None):\n        \"\"\"Return factor * model_size^-0.5 * min(step^-0.5, step * warmup^-1.5).\"\"\"\n        if step is None:\n            step = self._step\n        return (\n                self.factor\n                * self.model_size ** (-0.5)\n                * min(step ** (-0.5), step * self.warmup ** (-1.5))\n        
)\n\n    def zero_grad(self):\n        \"\"\"Reset gradient.\"\"\"\n        self.optimizer.zero_grad()\n\n    def state_dict(self):\n        \"\"\"Return state_dict.\"\"\"\n        return {\n            \"_step\": self._step,\n            \"warmup\": self.warmup,\n            \"factor\": self.factor,\n            \"model_size\": self.model_size,\n            \"_rate\": self._rate,\n            \"optimizer\": self.optimizer.state_dict(),\n        }\n\n    def load_state_dict(self, state_dict):\n        \"\"\"Load state_dict.\"\"\"\n        for key, value in state_dict.items():\n            if key == \"optimizer\":\n                self.optimizer.load_state_dict(state_dict[\"optimizer\"])\n            else:\n                setattr(self, key, value)\n\n\nclass LabelSmoothingLoss(nn.Module):\n    \"\"\"\n    Label-smoothing loss. KL-divergence between q_{smoothed ground truth prob.}(w)\n    and p_{prob. computed by model}(w) is minimized.\n    Modified from https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/transformer/label_smoothing_loss.py\n\n    Args:\n        size: the number of class\n        padding_idx: padding_idx: ignored class id\n        smoothing: smoothing rate (0.0 means the conventional CE)\n        normalize_length: normalize loss by sequence length if True\n        criterion: loss function to be smoothed\n    \"\"\"\n\n    def __init__(\n            self,\n            size: int,\n            padding_idx: int = -1,\n            smoothing: float = 0.1,\n            normalize_length: bool = False,\n            criterion: nn.Module = nn.KLDivLoss(reduction=\"none\"),\n    ) -> None:\n        \"\"\"Construct an LabelSmoothingLoss object.\"\"\"\n        super(LabelSmoothingLoss, self).__init__()\n        self.criterion = criterion\n        self.padding_idx = padding_idx\n        assert 0.0 < smoothing <= 1.0\n        self.confidence = 1.0 - smoothing\n        self.smoothing = smoothing\n        self.size = size\n        self.true_dist 
= None\n        self.normalize_length = normalize_length\n\n    def forward(self, x: Tensor, target: Tensor) -> Tensor:\n        \"\"\"\n        Compute loss between x and target.\n\n        Args:\n            x: prediction of dimension (batch_size, input_length, number_of_classes).\n            target: target masked with self.padding_idx of dimension (batch_size, input_length).\n\n        Returns:\n            torch.Tensor: scalar float value\n        \"\"\"\n        assert x.size(2) == self.size\n        batch_size = x.size(0)\n        x = x.view(-1, self.size)\n        target = target.view(-1)\n        with torch.no_grad():\n            true_dist = x.clone()\n            true_dist.fill_(self.smoothing / (self.size - 1))\n            ignore = target == self.padding_idx  # (B * T,)\n            total = len(target) - ignore.sum().item()\n            target = target.masked_fill(ignore, 0)  # avoid -1 index\n            true_dist.scatter_(1, target.unsqueeze(1), self.confidence)\n        kl = self.criterion(torch.log_softmax(x, dim=1), true_dist)\n        denom = total if self.normalize_length else batch_size\n        return kl.masked_fill(ignore.unsqueeze(1), 0).sum() / denom\n\n\ndef encoder_padding_mask(max_len: int, supervisions: Optional[Dict] = None) -> Optional[Tensor]:\n    \"\"\"Make mask tensor containing indices of padded part.\n\n    Args:\n        max_len: maximum length of input features\n        supervisions: Supervision in lhotse format, i.e., batch['supervisions']\n\n    Returns:\n        Tensor: Mask tensor of dimension (batch_size, input_length), True denotes the masked indices.\n    \"\"\"\n    if supervisions is None:\n        return None\n\n    supervision_segments = torch.stack(\n        (supervisions['sequence_idx'],\n         supervisions['start_frame'],\n         supervisions['num_frames']), 1).to(torch.int32)\n\n    lengths = [0 for _ in range(int(max(supervision_segments[:, 0])) + 1)]\n    for sequence_idx, start_frame, num_frames in 
supervision_segments:\n        lengths[sequence_idx] = start_frame + num_frames\n\n    lengths = [((i - 1) // 2 - 1) // 2 for i in lengths]\n    bs = len(lengths)\n    seq_range = torch.arange(0, max_len, dtype=torch.int64)\n    seq_range_expand = seq_range.unsqueeze(0).expand(bs, max_len)\n    seq_length_expand = seq_range_expand.new(lengths).unsqueeze(-1)\n    mask = seq_range_expand >= seq_length_expand\n\n    return mask\n\n\ndef decoder_padding_mask(ys_pad: Tensor, ignore_id: int = -1) -> Tensor:\n    \"\"\"Generate a length mask for input. The masked positions are filled with bool(True),\n        unmasked positions are filled with bool(False).\n\n    Args:\n        ys_pad: padded tensor of dimension (batch_size, input_length).\n        ignore_id: the ignored number (the padding number) in ys_pad\n\n    Returns:\n        Tensor: a mask tensor of dimension (batch_size, input_length).\n    \"\"\"\n    ys_mask = ys_pad == ignore_id\n    return ys_mask\n\n\ndef get_normal_transcripts(supervision: Dict, words: k2.SymbolTable, oov: str = '<UNK>') -> List[List[int]]:\n    \"\"\"Get normal transcripts (1 input recording has 1 transcript) from lhotse cut format.\n    Achieved by concatenating the transcripts corresponding to the same recording.\n\n    Args:\n        supervision: Supervision in lhotse format, i.e., batch['supervisions']\n        words: The word symbol table.\n        oov: Out of vocabulary word.\n\n    Returns:\n        List[List[int]]: List of concatenated transcripts, length is batch_size\n    \"\"\"\n\n    texts = [[token if token in words else oov\n              for token in text.split(' ')] for text in supervision['text']]\n    texts_ids = [[words[token] for token in text] for text in texts]\n\n    batch_text = [[] for _ in range(int(max(supervision['sequence_idx'])) + 1)]\n    for sequence_idx, text in zip(supervision['sequence_idx'], texts_ids):\n        batch_text[sequence_idx] = batch_text[sequence_idx] + text\n    return batch_text\n\n\ndef 
generate_square_subsequent_mask(sz: int) -> Tensor:\n    \"\"\"Generate a square mask for the sequence. The masked positions are filled with float('-inf').\n        Unmasked positions are filled with float(0.0).\n\n    Args:\n        sz: mask size\n\n    Returns:\n        Tensor: a square mask of dimension (sz, sz)\n    \"\"\"\n    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)\n    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))\n    return mask\n\n\ndef add_sos_eos(ys: List[List[int]], lexicon: k2.Fsa, sos: int, eos: int, ignore_id: int = -1) -> Tuple[Tensor, Tensor]:\n    \"\"\"Add <sos> and <eos> labels.\n\n    Args:\n        ys: batch of unpadded target sequences\n        lexicon: Its labels are words, while its aux_labels are phones.\n        sos: index of <sos>\n        eos: index of <eos>\n        ignore_id: index of padding\n\n    Returns:\n        Tensor: Input of transformer decoder. Padded tensor of dimension (batch_size, max_length).\n        Tensor: Output of transformer decoder. Padded tensor of dimension (batch_size, max_length).\n    \"\"\"\n\n    _sos = torch.tensor([sos])\n    _eos = torch.tensor([eos])\n    ys = get_hierarchical_targets(ys, lexicon)\n    ys_in = [torch.cat([_sos, y], dim=0) for y in ys]\n    ys_out = [torch.cat([y, _eos], dim=0) for y in ys]\n    return pad_list(ys_in, eos), pad_list(ys_out, ignore_id)\n\n\ndef pad_list(ys: List[Tensor], pad_value: float) -> Tensor:\n    \"\"\"Perform padding for the list of tensors.\n\n    Args:\n        ys: List of tensors. 
len(ys) = batch_size.\n        pad_value: Value for padding.\n\n    Returns:\n        Tensor: Padded tensor (batch_size, max_length, `*`).\n\n    Examples:\n        >>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]\n        >>> x\n        [tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]\n        >>> pad_list(x, 0)\n        tensor([[1., 1., 1., 1.],\n                [1., 1., 0., 0.],\n                [1., 0., 0., 0.]])\n\n    \"\"\"\n    n_batch = len(ys)\n    max_len = max(x.size(0) for x in ys)\n    pad = ys[0].new_full((n_batch, max_len, *ys[0].size()[1:]), pad_value)\n\n    for i in range(n_batch):\n        pad[i, : ys[i].size(0)] = ys[i]\n\n    return pad\n\n\ndef get_hierarchical_targets(ys: List[List[int]], lexicon: k2.Fsa) -> List[Tensor]:\n    \"\"\"Get hierarchical transcripts (i.e., phone level transcripts) from transcripts (i.e., word level transcripts).\n\n    Args:\n        ys: Word level transcripts.\n        lexicon: Its labels are words, while its aux_labels are phones.\n            If None, the word level transcripts are returned unchanged, as tensors.\n\n    Returns:\n        List[Tensor]: Phone level transcripts.\n\n    \"\"\"\n\n    if lexicon is None:\n        # No lexicon: keep the word level IDs, but still return tensors\n        # so that callers such as add_sos_eos can concatenate them.\n        return [torch.tensor(y) for y in ys]\n\n    L_inv = lexicon\n\n    n_batch = len(ys)\n    indices = torch.tensor(range(n_batch))\n    device = L_inv.device\n\n    transcripts = k2.create_fsa_vec([k2.linear_fsa(x, device=device) for x in ys])\n    transcripts_with_self_loops = k2.add_epsilon_self_loops(transcripts)\n\n    transcripts_lexicon = k2.intersect(\n        L_inv, transcripts_with_self_loops,\n        treat_epsilons_specially=False)\n    # Don't call invert_() above because we want to return phone IDs,\n    # which is the `aux_labels` of transcripts_lexicon\n    transcripts_lexicon = k2.remove_epsilon(transcripts_lexicon)\n    transcripts_lexicon = k2.top_sort(transcripts_lexicon)\n\n    transcripts_lexicon = k2.shortest_path(transcripts_lexicon, use_double_scores=True)\n\n    ys = get_texts(transcripts_lexicon, indices)\n    ys = [torch.tensor(y) for y in ys]\n\n 
   return ys\n\n\n\ndef test_transformer():\n    t = Transformer(40, 1281)\n    T = 200\n    f = torch.rand(31, 40, T)\n    g, _, _ = t(f)\n    assert g.shape == (31, 1281, (((T-1)//2)-1)//2)\n\ndef main():\n    test_transformer()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "snowfall/objectives/__init__.py",
    "content": "from .common import encode_supervisions\nfrom .ctc import CTCLoss\nfrom .mmi import LFMMILoss\n"
  },
  {
    "path": "snowfall/objectives/common.py",
    "content": "import math\nimport torch\nfrom torch import Tensor\nfrom typing import Dict, List, Tuple\n\n\ndef encode_supervisions(supervisions: Dict[str, Tensor]) -> Tuple[Tensor, List[str]]:\n    \"\"\"\n    Encodes Lhotse's ``batch[\"supervisions\"]`` dict into a pair of torch Tensor,\n    and a list of transcription strings.\n\n    The supervision tensor has shape ``(batch_size, 3)``.\n    Its second dimension contains information about sequence index [0],\n    start frames [1] and num frames [2].\n\n    The batch items might become re-ordered during this operation -- the returned tensor\n    and list of strings are guaranteed to be consistent with each other.\n\n    This mimics subsampling by a factor of 4 with Conv1D layer with no padding.\n    \"\"\"\n    supervision_segments = torch.stack(\n        (supervisions['sequence_idx'],\n         (((supervisions['start_frame'] - 1) // 2 - 1) // 2),\n         (((supervisions['num_frames'] - 1) // 2 - 1) // 2)),\n        1\n    ).to(torch.int32)\n    supervision_segments = torch.clamp(supervision_segments, min=0)\n    indices = torch.argsort(supervision_segments[:, 2], descending=True)\n    supervision_segments = supervision_segments[indices]\n    texts = supervisions['text']\n    texts = [texts[idx] for idx in indices]\n    return supervision_segments, texts\n\n\ndef get_tot_objf_and_num_frames(\n        tot_scores: Tensor,\n        frames_per_seq: Tensor\n    ) -> Tuple[torch.Tensor, int, int]:\n    \"\"\"Figures out the total score(log-prob) over all successful supervision segments\n    (i.e. 
those for which the total score wasn't -infinity), and the corresponding\n    number of frames of neural net output.\n\n    Args:\n        tot_scores: a Torch tensor of shape (num_segments,) containing total scores\n            from forward-backward.\n        frames_per_seq: a Torch tensor of shape (num_segments,) containing the number of\n            frames for each segment.\n\n    Returns:\n        A tuple of 3 scalar tensors: (tot_score, ok_frames, all_frames),\n        where ok_frames is the number of frames in successful (finite) segments, and\n        all_frames is the number of frames in all segments (finite or not).\n    \"\"\"\n    mask = torch.ne(tot_scores, -math.inf)\n    # finite_indexes is a tensor containing successful segment indexes, e.g.\n    # [ 0 1 3 4 5 ]\n    finite_indexes = torch.nonzero(mask).squeeze(1)\n    ok_frames = frames_per_seq[finite_indexes].sum()\n    all_frames = frames_per_seq.sum()\n    return tot_scores[finite_indexes].sum(), ok_frames, all_frames\n"
  },
  {
    "path": "snowfall/objectives/ctc.py",
    "content": "from typing import List, Tuple\n\nimport torch\nfrom torch import nn\n\nimport k2\n\nfrom snowfall.objectives.common import get_tot_objf_and_num_frames\nfrom snowfall.training.ctc_graph import CtcTrainingGraphCompiler\n\n\nclass CTCLoss(nn.Module):\n    \"\"\"\n    Connectionist Temporal Classification (CTC) loss.\n\n    TODO: more detailed description\n    \"\"\"\n    def __init__(\n            self,\n            graph_compiler: CtcTrainingGraphCompiler,\n    ):\n        super().__init__()\n        self.graph_compiler = graph_compiler\n\n    def forward(\n            self,\n            nnet_output: torch.Tensor,\n            texts: List,\n            supervision_segments: torch.Tensor\n    ) -> Tuple[torch.Tensor, int, int]:\n        num_graphs = self.graph_compiler.compile(texts).to(nnet_output.device)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)\n\n        num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, 10.0)\n\n        num_tot_scores = num_lats.get_tot_scores(\n            log_semiring=True,\n            use_double_scores=True\n        )\n        tot_scores = num_tot_scores\n        tot_score, tot_frames, all_frames = get_tot_objf_and_num_frames(\n            tot_scores,\n            supervision_segments[:, 2]\n        )\n        return tot_score, tot_frames, all_frames\n"
  },
  {
    "path": "snowfall/objectives/mmi.py",
    "content": "from typing import List, Tuple\n\nimport torch\nfrom torch import nn\n\nimport k2\n\nfrom snowfall.objectives.common import get_tot_objf_and_num_frames\nfrom snowfall.training.mmi_graph import MmiTrainingGraphCompiler\n\n\ndef _compute_mmi_loss_exact_optimized(\n        nnet_output: torch.Tensor,\n        texts: List[str],\n        supervision_segments: torch.Tensor,\n        graph_compiler: MmiTrainingGraphCompiler,\n        P: k2.Fsa,\n        den_scale: float = 1.0\n) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n    '''\n    The function name contains `exact`, which means it uses a version of\n    intersection without pruning.\n\n    `optimized` in the function name means this function is optimized\n    in that it calls k2.intersect_dense only once\n\n    Note:\n      It is faster at the cost of using more memory.\n\n    Args:\n      nnet_output:\n        A 3-D tensor of shape [N, T, C]\n      texts:\n        The transcript. Each element consists of space(s) separated words.\n      supervision_segments:\n        A 2-D tensor that will be passed to :func:`k2.DenseFsaVec`.\n      graph_compiler:\n        Used to build num_graphs and den_graphs\n      P:\n        Represents a bigram Fsa.\n      den_scale:\n        The scale applied to the denominator tot_scores.\n    '''\n\n    num_graphs, den_graphs = graph_compiler.compile(texts,\n                                                    P,\n                                                    replicate_den=False)\n\n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)\n\n    device = num_graphs.device\n\n    num_fsas = num_graphs.shape[0]\n    assert dense_fsa_vec.dim0() == num_fsas\n\n    assert den_graphs.shape[0] == 1\n\n    # the aux_labels of num_graphs is k2.RaggedInt\n    # but it is torch.Tensor for den_graphs.\n    #\n    # The following converts den_graphs.aux_labels\n    # from torch.Tensor to k2.RaggedInt so that\n    # we can use k2.append() later\n    
den_graphs.convert_attr_to_ragged_(name='aux_labels')\n\n    # The motivation to concatenate num_graphs and den_graphs\n    # is to reduce the number of calls to k2.intersect_dense.\n    num_den_graphs = k2.cat([num_graphs, den_graphs])\n\n    # NOTE: The a_to_b_map in k2.intersect_dense must be sorted\n    # so the following reorders num_den_graphs.\n    #\n    # The following code computes a_to_b_map\n\n    # [0, 1, 2, ... ]\n    num_graphs_indexes = torch.arange(num_fsas, dtype=torch.int32)\n\n    # [num_fsas, num_fsas, num_fsas, ... ]\n    den_graphs_indexes = torch.tensor([num_fsas] * num_fsas, dtype=torch.int32)\n\n    # [0, num_fsas, 1, num_fsas, 2, num_fsas, ... ]\n    num_den_graphs_indexes = torch.stack(\n        [num_graphs_indexes, den_graphs_indexes]).t().reshape(-1).to(device)\n\n    num_den_reordered_graphs = k2.index(num_den_graphs, num_den_graphs_indexes)\n\n    # [[0, 1, 2, ...]]\n    a_to_b_map = torch.arange(num_fsas, dtype=torch.int32).reshape(1, -1)\n\n    # [[0, 1, 2, ...]] -> [0, 0, 1, 1, 2, 2, ... 
]\n    a_to_b_map = a_to_b_map.repeat(2, 1).t().reshape(-1).to(device)\n\n    num_den_lats = k2.intersect_dense(num_den_reordered_graphs,\n                                      dense_fsa_vec,\n                                      output_beam=10.0,\n                                      a_to_b_map=a_to_b_map)\n\n    num_den_tot_scores = num_den_lats.get_tot_scores(log_semiring=True,\n                                                     use_double_scores=True)\n\n    num_tot_scores = num_den_tot_scores[::2]\n    den_tot_scores = num_den_tot_scores[1::2]\n\n    tot_scores = num_tot_scores - den_scale * den_tot_scores\n    tot_score, tot_frames, all_frames = get_tot_objf_and_num_frames(\n        tot_scores, supervision_segments[:, 2])\n    return tot_score, tot_frames, all_frames\n\n\ndef _compute_mmi_loss_exact_non_optimized(\n        nnet_output: torch.Tensor,\n        texts: List[str],\n        supervision_segments: torch.Tensor,\n        graph_compiler: MmiTrainingGraphCompiler,\n        P: k2.Fsa,\n        den_scale: float = 1.0\n) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n    '''\n    See :func:`_compute_mmi_loss_exact_optimized` for the meaning\n    of the arguments.\n\n    It's more readable, though it invokes k2.intersect_dense twice.\n\n    Note:\n      It uses less memory at the cost of speed. 
It is slower.\n    '''\n    num_graphs, den_graphs = graph_compiler.compile(texts,\n                                                    P,\n                                                    replicate_den=True)\n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)\n\n    num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n    den_lats = k2.intersect_dense(den_graphs, dense_fsa_vec, output_beam=10.0)\n\n    num_tot_scores = num_lats.get_tot_scores(log_semiring=True,\n                                             use_double_scores=True)\n\n    den_tot_scores = den_lats.get_tot_scores(log_semiring=True,\n                                             use_double_scores=True)\n    tot_scores = num_tot_scores - den_scale * den_tot_scores\n    tot_score, tot_frames, all_frames = get_tot_objf_and_num_frames(\n        tot_scores, supervision_segments[:, 2])\n    return tot_score, tot_frames, all_frames\n\n\ndef _compute_mmi_loss_pruned(\n        nnet_output: torch.Tensor,\n        texts: List[str],\n        supervision_segments: torch.Tensor,\n        graph_compiler: MmiTrainingGraphCompiler,\n        P: k2.Fsa,\n        den_scale: float = 1.0\n) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n    '''\n    See :func:`_compute_mmi_loss_exact_optimized` for the meaning\n    of the arguments.\n\n    `pruned` means it uses k2.intersect_dense_pruned\n\n    Note:\n      It uses the least amount of memory, but the loss is not exact due\n      to pruning.\n    '''\n    num_graphs, den_graphs = graph_compiler.compile(texts,\n                                                    P,\n                                                    replicate_den=False)\n\n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)\n\n    num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n\n    # the values for search_beam/output_beam/min_active_states/max_active_states\n    # are not tuned. 
You may want to tune them.\n    # 20 7 30 10000\n    # for wsj: 10 5 30 10000\n    # for aishell 20 7 30 10000\n    den_lats = k2.intersect_dense_pruned(den_graphs,\n                                         dense_fsa_vec,\n                                         search_beam=10.0,\n                                         output_beam=5.0,\n                                         min_active_states=30,\n                                         max_active_states=10000)\n\n    num_tot_scores = num_lats.get_tot_scores(log_semiring=True,\n                                             use_double_scores=True)\n\n    den_tot_scores = den_lats.get_tot_scores(log_semiring=True,\n                                             use_double_scores=True)\n\n    tot_scores = num_tot_scores - den_scale * den_tot_scores\n    tot_score, tot_frames, all_frames = get_tot_objf_and_num_frames(\n        tot_scores, supervision_segments[:, 2])\n    return tot_score, tot_frames, all_frames\n\n\nclass LFMMILoss(nn.Module):\n    \"\"\"\n    Computes Lattice-Free Maximum Mutual Information (LFMMI) loss.\n\n    TODO: more detailed description\n    \"\"\"\n\n    def __init__(\n            self,\n            graph_compiler: MmiTrainingGraphCompiler,\n            P: k2.Fsa,\n            use_pruned_intersect: bool = False,\n            den_scale: float = 1.0,\n    ):\n        super().__init__()\n        self.graph_compiler = graph_compiler\n        self.P = P\n        self.den_scale = den_scale\n        self.use_pruned_intersect = use_pruned_intersect\n\n    def forward(self, nnet_output: torch.Tensor, texts: List[str],\n                supervision_segments: torch.Tensor\n               ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:\n        if self.use_pruned_intersect:\n            func = _compute_mmi_loss_pruned\n        else:\n            func = _compute_mmi_loss_exact_non_optimized\n            # func = _compute_mmi_loss_exact_optimized\n\n        return func(nnet_output=nnet_output,\n      
              texts=texts,\n                    supervision_segments=supervision_segments,\n                    graph_compiler=self.graph_compiler,\n                    P=self.P,\n                    den_scale=self.den_scale)\n"
  },
  {
    "path": "snowfall/training/__init__.py",
    "content": ""
  },
  {
    "path": "snowfall/training/ctc_graph.py",
    "content": "# Copyright (c)  2020  Xiaomi Corp.       (author: Fangjun Kuang)\n\nfrom functools import lru_cache\nfrom typing import Iterable\nfrom typing import List\n\nimport torch\nimport k2\n\nfrom snowfall.common import get_phone_symbols\n\n\ndef build_ctc_topo(tokens: List[int]) -> k2.Fsa:\n    '''Build CTC topology.\n    A token which appears once on the right side (i.e. olabels) may\n    appear multiple times on the left side (ilabels), possibly with\n    epsilons in between.\n    When 0 appears on the left side, it represents the blank symbol;\n    when it appears on the right side, it indicates an epsilon. That\n    is, 0 has two meanings here.\n\n    Args:\n      tokens:\n        A list of tokens, e.g., phones, characters, etc.\n\n    Returns:\n      Returns an FST that converts repeated tokens to a single token.\n    '''\n    assert 0 in tokens, 'We assume 0 is the ID of the blank symbol'\n\n    num_states = len(tokens)\n    final_state = num_states\n    arcs = ''\n    for i in range(num_states):\n        for j in range(num_states):\n            if i == j:\n                arcs += f'{i} {i} {tokens[i]} 0 0.0\\n'\n            else:\n                arcs += f'{i} {j} {tokens[j]} {tokens[j]} 0.0\\n'\n        arcs += f'{i} {final_state} -1 -1 0.0\\n'\n    arcs += f'{final_state}'\n    ans = k2.Fsa.from_str(arcs, num_aux_labels=1)\n    ans = k2.arc_sort(ans)\n    return ans\n\n\nclass CtcTrainingGraphCompiler(object):\n\n    def __init__(self,\n                 L_inv: k2.Fsa,\n                 phones: k2.SymbolTable,\n                 words: k2.SymbolTable,\n                 oov: str = '<UNK>'):\n        '''\n        Args:\n          L_inv:\n            Its labels are words, while its aux_labels are phones.\n          phones:\n            The phone symbol table.\n          words:\n            The word symbol table.\n          oov:\n            Out of vocabulary word.\n        '''\n        # Arc-sort L_inv only if it is not already arc-sorted.\n        if (L_inv.properties & k2.fsa_properties.ARC_SORTED) == 0:\n            L_inv = 
k2.arc_sort(L_inv)\n\n        assert oov in words\n\n        self.L_inv = L_inv\n        self.phones = phones\n        self.words = words\n        self.oov = oov\n        phone_ids = get_phone_symbols(phones)\n        phone_ids_with_blank = [0] + phone_ids\n        self.ctc_topo = k2.arc_sort(build_ctc_topo(phone_ids_with_blank))\n\n    def compile(self, texts: Iterable[str]) -> k2.Fsa:\n        decoding_graphs = k2.create_fsa_vec(\n            [self.compile_one_and_cache(text) for text in texts])\n\n        # make sure the gradient is not accumulated\n        decoding_graphs.requires_grad_(False)\n        return decoding_graphs\n\n    @lru_cache(maxsize=100000)\n    def compile_one_and_cache(self, text: str) -> k2.Fsa:\n        tokens = (token if token in self.words else self.oov\n                  for token in text.split(' '))\n        word_ids = [self.words[token] for token in tokens]\n        label_graph = k2.linear_fsa(word_ids)\n        decoding_graph = k2.connect(k2.intersect(label_graph,\n                                                 self.L_inv)).invert_()\n        decoding_graph = k2.arc_sort(decoding_graph)\n        decoding_graph = k2.compose(self.ctc_topo, decoding_graph)\n        decoding_graph = k2.connect(decoding_graph)\n        return decoding_graph\n"
  },
  {
    "path": "snowfall/training/diagnostics.py",
    "content": "from typing import Dict, Optional\n\nimport torch\nfrom torch import nn\nfrom torch.cuda.amp import GradScaler\n\n\ndef l1_norm(x):\n    return torch.sum(torch.abs(x))\n\n\ndef l2_norm(x):\n    return torch.sum(torch.pow(x, 2))\n\n\ndef linf_norm(x):\n    return torch.max(torch.abs(x))\n\n\ndef measure_weight_norms(model: nn.Module, norm: str = 'l2') -> Dict[str, float]:\n    \"\"\"\n    Compute the norms of the model's parameters.\n\n    :param model: a torch.nn.Module instance\n    :param norm: how to compute the norm. Available values: 'l1', 'l2', 'linf'\n    :return: a dict mapping from parameter's name to its norm.\n    \"\"\"\n    with torch.no_grad():\n        norms = {}\n        for name, param in model.named_parameters():\n            if norm == 'l1':\n                val = l1_norm(param)\n            elif norm == 'l2':\n                val = l2_norm(param)\n            elif norm == 'linf':\n                val = linf_norm(param)\n            else:\n                raise ValueError(f\"Unknown norm type: {norm}\")\n            norms[name] = val.item()\n        return norms\n\n\ndef measure_semiorthogonality(model: nn.Module) -> Dict[str, float]:\n    \"\"\"\n    Compute the semi-orthogonality objective function proposed by:\n\n        \"Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks\",\n        Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohamadi,\n        Sanjeev Khudanpur, Interspeech 2018\n    \"\"\"\n    with torch.no_grad():\n        scores = {}\n        for name, m in model.named_modules():\n            if hasattr(m, 'constrain_orthonormal'):\n                weight = m.state_dict()['conv.weight']\n                dim = weight.shape[0]\n                w = weight.reshape(dim, -1)\n                P = torch.mm(w, w.t())\n                scale = torch.trace(torch.mm(P, P.t()) / torch.trace(P))\n                I = torch.eye(dim, dtype=P.dtype, device=P.device)\n                Q = P - 
scale * I\n                score = torch.trace(torch.mm(Q, Q.t()))\n                scores[name] = score.item()\n        return scores\n\n\ndef measure_gradient_norms(model: nn.Module, norm: str = 'l1') -> Dict[str, float]:\n    \"\"\"\n    Compute the norms of the gradients for each of model's parameters.\n    Parameters that have no gradient (``param.grad is None``) are skipped.\n\n    :param model: a torch.nn.Module instance\n    :param norm: how to compute the norm. Available values: 'l1', 'l2', 'linf'\n    :return: a dict mapping from parameter's name to its gradient's norm.\n    \"\"\"\n    with torch.no_grad():\n        norms = {}\n        for name, param in model.named_parameters():\n            # Measure the gradient, not the parameter itself.\n            grad = param.grad\n            if grad is None:\n                continue\n            if norm == 'l1':\n                val = l1_norm(grad)\n            elif norm == 'l2':\n                val = l2_norm(grad)\n            elif norm == 'linf':\n                val = linf_norm(grad)\n            else:\n                raise ValueError(f\"Unknown norm type: {norm}\")\n            norms[name] = val.item()\n        return norms\n\n\ndef optim_step_and_measure_param_change(\n        model: nn.Module,\n        optimizer: torch.optim.Optimizer,\n        scaler: Optional[GradScaler] = None\n) -> Dict[str, float]:\n    \"\"\"\n    Perform model weight update and measure the \"relative change in parameters per minibatch.\"\n    It is understood as a ratio between the squared L2 norm of the difference between the original and updated parameters,\n    and the squared L2 norm of the original parameters. It is given by the formula:\n\n        .. 
math::\n            \\begin{aligned}\n                \\delta = \\frac{\\Vert\\theta - \\theta_{new}\\Vert^2}{\\Vert\\theta\\Vert^2}\n            \\end{aligned}\n    \"\"\"\n    param_copy = {n: p.detach().clone() for n, p in model.named_parameters()}\n    if scaler:\n        scaler.step(optimizer)\n    else:\n        optimizer.step()\n    relative_change = {}\n    with torch.no_grad():\n        for n, p_new in model.named_parameters():\n            p_orig = param_copy[n]\n            delta = l2_norm(p_orig - p_new) / l2_norm(p_orig)\n            relative_change[n] = delta.item()\n    return relative_change\n"
  },
  {
    "path": "snowfall/training/mmi_graph.py",
    "content": "# Copyright (c)  2020  Xiaomi Corp.       (author: Fangjun Kuang)\n\nfrom typing import Iterable\nfrom typing import List\nfrom typing import Tuple\nimport numpy as np\nimport k2\nimport torch\n\nfrom .ctc_graph import build_ctc_topo\nfrom snowfall.common import get_phone_symbols\nfrom ..lexicon import Lexicon\n\n\ndef create_bigram_phone_lm(phones: List[int]) -> k2.Fsa:\n    '''Create a bigram phone LM.\n    The resulting FSA (P) has a start-state and a state for\n    each phone 1, 2, ...; and each of the above-mentioned states\n    has a transition to the state for each phone and also to the final-state.\n\n    Caution:\n      blank is not a phone.\n\n    Args:\n      phones:\n        A list of phone IDs.\n\n    Returns:\n      An FSA representing the bigram phone LM.\n    '''\n    assert 0 not in phones\n    final_state = len(phones) + 1\n    rules = ''\n    for i in range(1, final_state):\n        rules += f'0 {i} {phones[i-1]} 0.0\\n'\n\n    for i in range(1, final_state):\n        for j in range(1, final_state):\n            rules += f'{i} {j} {phones[j-1]} 0.0\\n'\n        rules += f'{i} {final_state} -1 0.0\\n'\n    rules += f'{final_state}'\n    return k2.Fsa.from_str(rules)\n\n\nclass MmiTrainingGraphCompiler(object):\n\n    def __init__(\n            self,\n            lexicon: Lexicon,\n            device: torch.device,\n            oov: str = '<UNK>'\n    ):\n        '''\n        Args:\n          lexicon:\n            The lexicon. It provides `L_inv` (its labels are words, while\n            its aux_labels are phones) and the `phones` and `words`\n            symbol tables.\n          device:\n            The device that L_inv and the CTC topology are moved to.\n          oov:\n            Out of vocabulary word.\n        '''\n        self.lexicon = lexicon\n        L_inv = self.lexicon.L_inv.to(device)\n\n        # Arc-sort L_inv only if it is not already arc-sorted.\n        if (L_inv.properties & k2.fsa_properties.ARC_SORTED) == 0:\n            L_inv = k2.arc_sort(L_inv)\n\n        assert L_inv.requires_grad is False\n\n        assert oov in self.lexicon.words\n\n        self.L_inv = L_inv\n        
self.oov_id = self.lexicon.words[oov]\n        self.oov = oov\n        self.device = device\n\n        phone_symbols = get_phone_symbols(self.lexicon.phones)\n        phone_symbols_with_blank = [0] + phone_symbols\n\n        ctc_topo = build_ctc_topo(phone_symbols_with_blank).to(device)\n        assert ctc_topo.requires_grad is False\n\n        self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert_())\n\n    def compile(self,\n                texts: Iterable[str],\n                P: k2.Fsa,\n                replicate_den: bool = True) -> Tuple[k2.Fsa, k2.Fsa]:\n        '''Create numerator and denominator graphs from transcripts\n        and the bigram phone LM.\n\n        Args:\n          texts:\n            A list of transcripts. Within a transcript, words are\n            separated by spaces.\n          P:\n            The bigram phone LM created by :func:`create_bigram_phone_lm`.\n          replicate_den:\n            If True, the returned den_graph is replicated to match the number\n            of FSAs in the returned num_graph; if False, the returned den_graph\n            contains only a single FSA\n        Returns:\n          A tuple (num_graph, den_graph), where\n\n            - `num_graph` is the numerator graph. It is an FsaVec with\n              shape `(len(texts), None, None)`.\n\n            - `den_graph` is the denominator graph. 
It is an FsaVec with the same\n              shape of the `num_graph` if replicate_den is True; otherwise, it\n              is an FsaVec containing only a single FSA.\n        '''\n        assert P.device == self.device\n        P_with_self_loops = k2.add_epsilon_self_loops(P)\n\n        ctc_topo_P = k2.intersect(self.ctc_topo_inv,\n                                  P_with_self_loops,\n                                  treat_epsilons_specially=False).invert()\n\n        ctc_topo_P = k2.arc_sort(ctc_topo_P)\n\n        num_graphs = self.build_num_graphs(texts)\n        num_graphs_with_self_loops = k2.remove_epsilon_and_add_self_loops(\n            num_graphs)\n        # num_graphs_with_self_loops[0].draw(\"linear_lex_rmeps_addslp.svg\")\n\n        num_graphs_with_self_loops = k2.arc_sort(num_graphs_with_self_loops)\n\n        num = k2.compose(ctc_topo_P,\n                         num_graphs_with_self_loops,\n                         treat_epsilons_specially=False)\n        num = k2.arc_sort(num)\n        # num[0].draw(\"num.svg\")\n\n        ctc_topo_P_vec = k2.create_fsa_vec([ctc_topo_P.detach()])\n        if replicate_den:\n            indexes = torch.zeros(len(texts),\n                                  dtype=torch.int32,\n                                  device=self.device)\n            den = k2.index_fsa(ctc_topo_P_vec, indexes)\n        else:\n            den = ctc_topo_P_vec\n\n        return num, den\n\n    def build_num_graphs(self, texts: List[str]) -> k2.Fsa:\n        '''Convert transcript to an Fsa with the help of lexicon\n        and word symbol table.\n\n        Args:\n          texts:\n            Each element is a transcript containing words separated by spaces.\n            For instance, it may be 'HELLO SNOWFALL', which contains\n            two words.\n\n        Returns:\n          Return an FST (FsaVec) corresponding to the transcript. 
Its `labels` are\n          phone IDs and `aux_labels` are word IDs.\n        '''\n        word_ids_list = []\n        for text in texts:\n            word_ids = []\n            for word in text.split(' '):\n                if word in self.lexicon.words:\n                    word_ids.append(self.lexicon.words[word])\n                else:\n                    word_ids.append(self.oov_id)\n            word_ids_list.append(word_ids)\n\n        fsa = k2.linear_fsa(word_ids_list, self.device)\n        fsa = k2.add_epsilon_self_loops(fsa)\n        assert fsa.device == self.device\n        num_graphs = k2.intersect(self.L_inv,\n                                  fsa,\n                                  treat_epsilons_specially=False).invert_()\n        num_graphs = k2.arc_sort(num_graphs)\n        return num_graphs\n\n    def compile_lookahead_numerators(self, word_fsa_vec, P):\n        # Compile lexicon graph\n        fsa = k2.add_epsilon_self_loops(word_fsa_vec)\n        assert fsa.device == self.device\n        num_graphs = k2.intersect(self.L_inv,\n                                  fsa,\n                                  treat_epsilons_specially=False).invert_()\n        num_graphs = k2.arc_sort(num_graphs)\n\n        # Compile ctc_topo_P\n        assert P.device == self.device\n        P_with_self_loops = k2.add_epsilon_self_loops(P)\n\n        ctc_topo_P = k2.intersect(self.ctc_topo_inv,\n                                  P_with_self_loops,\n                                  treat_epsilons_specially=False).invert()\n\n        ctc_topo_P = k2.arc_sort(ctc_topo_P)\n\n        # Combine\n        num_graphs_with_self_loops = k2.remove_epsilon_and_add_self_loops(\n            num_graphs)\n\n        num_graphs_with_self_loops = k2.arc_sort(num_graphs_with_self_loops)\n\n        num = k2.compose(ctc_topo_P,\n                         num_graphs_with_self_loops,\n                         treat_epsilons_specially=False)\n        num = k2.arc_sort(num)\n        return num\n\n    
\"\"\"\n    def build_word_fsa(self, prefix, candidate_intervals, drop_prefix_tail):\n        # convert prefix_ids in BPE domain to word sequence.\n        if '' in prefix:\n            prefix.remove('')\n\n        prefix_ids = [self.lexicon.words[word] if word in self.lexicon.words else self.oov_id \n                      for word in prefix]\n\n        # a special token that does not start with '_' could also be proposed in first iteration\n        # they requires 'drop_tail' but there is no tail to drop\n        # in this case, disable the 'drop_tail' operation\n        batch = len(candidate_intervals)  \n        drop_prefix_tail = [0] * batch if prefix == [] else drop_prefix_tail\n \n        # Prefix part \n        prefix_len = len(prefix_ids)\n        start_state = np.arange(prefix_len)\n        end_state = np.arange(prefix_len) + 1\n        labels = np.array(prefix_ids)\n        scores = np.zeros(prefix_len)\n        prefix_part = np.stack([start_state, end_state, labels, scores], axis=1)\n        \n        # candidate part\n        candidate_parts = []\n        ending_parts = []\n        for (start, end), drop_tail in zip(candidate_intervals, drop_prefix_tail): \n            num_candidate = end - start\n            start_state = np.ones(num_candidate) * (prefix_len - drop_tail)\n            end_state = np.ones(num_candidate) * (prefix_len + 1 - drop_tail)\n            labels = np.arange(start, end)\n            scores = np.zeros(num_candidate)\n            candidate_part = np.stack([start_state, end_state, labels, scores], axis=1)\n            candidate_parts.append(candidate_part)\n\n            # end arc\n            end_arc = np.array([[prefix_len + 1 - drop_tail, prefix_len + 2 - drop_tail, -1, 0]])\n            ending_parts.append(end_arc)\n         \n       \n        # assemble: do not need to arc_sort \n        num_vec = []\n        for i, (candidate_part, drop_tail) in enumerate(zip(candidate_parts, drop_prefix_tail)):\n            this_prefix_part = 
prefix_part[:-1] if drop_tail else prefix_part\n            end_arc = ending_parts[i]\n            num_mat = np.concatenate([this_prefix_part, candidate_part, end_arc], axis=0)\n            num_mat = torch.from_numpy(num_mat).to(torch.int32)\n            num_vec.append(num_mat)\n       \n        # convert to k2 FsaVec \n        num_vec = [k2.Fsa.from_dict({\"arcs\": num}) for num in num_vec]\n        num_vec = k2.create_fsa_vec(num_vec)\n        return num_vec    \n    \"\"\"\n"
  },
  {
    "path": "snowfall/training/mmi_mbr_graph.py",
    "content": "# Copyright (c)  2020  Xiaomi Corp.       (author: Fangjun Kuang)\n\nfrom functools import lru_cache\nfrom typing import Iterable\nfrom typing import List\nfrom typing import Tuple\nfrom pathlib import Path\n\nimport logging\n\nimport k2\nimport torch\n\nfrom .ctc_graph import build_ctc_topo\nfrom snowfall.common import get_phone_symbols\nfrom snowfall.decoding.graph import compile_HLG\n\n\ndef find_first_disambig_symbol(symbols: k2.SymbolTable) -> int:\n    return min(v for k, v in symbols._sym2id.items() if k.startswith('#'))\n\n\nclass MmiMbrTrainingGraphCompiler(object):\n\n    def __init__(self,\n                 L_inv: k2.Fsa,\n                 L_disambig: k2.Fsa,\n                 G: k2.Fsa,\n                 phones: k2.SymbolTable,\n                 words: k2.SymbolTable,\n                 device: torch.device,\n                 oov: str = '<UNK>'):\n        '''\n        Args:\n          L_inv:\n            Its labels are words, while its aux_labels are phones.\n          L_disambig:\n            L with disambig symbols. 
Its labels are phones and aux_labels\n            are words.\n          G:\n            The language model.\n          phones:\n            The phone symbol table.\n          words:\n            The word symbol table.\n          device:\n            The target device that all FSAs should be moved to.\n          oov:\n            Out of vocabulary word.\n        '''\n\n        L_inv = L_inv.to(device)\n        G = G.to(device)\n\n        # arc-sort only when the FSA is not already arc-sorted\n        if L_inv.properties & k2.fsa_properties.ARC_SORTED == 0:\n            L_inv = k2.arc_sort(L_inv)\n\n        if G.properties & k2.fsa_properties.ARC_SORTED == 0:\n            G = k2.arc_sort(G)\n\n        assert L_inv.requires_grad is False\n        assert G.requires_grad is False\n\n        assert oov in words\n\n        L = L_inv.invert()\n        L = k2.arc_sort(L)\n\n        self.L_inv = L_inv\n        self.L = L\n        self.phones = phones\n        self.words = words\n        self.device = device\n        self.oov_id = self.words[oov]\n\n        phone_symbols = get_phone_symbols(phones)\n        phone_symbols_with_blank = [0] + phone_symbols\n\n        ctc_topo = k2.arc_sort(\n            build_ctc_topo(phone_symbols_with_blank).to(device))\n        assert ctc_topo.requires_grad is False\n\n        self.ctc_topo = ctc_topo\n        self.ctc_topo_inv = k2.arc_sort(ctc_topo.invert())\n\n        lang_dir = Path('data/lang_nosp')\n        if not (lang_dir / 'HLG_uni.pt').exists():\n            logging.info(\"Composing (ctc_topo, L_disambig, G)\")\n            first_phone_disambig_id = find_first_disambig_symbol(phones)\n            first_word_disambig_id = find_first_disambig_symbol(words)\n            # decoding_graph is the result of composing (ctc_topo, L_disambig, G)\n            decoding_graph = compile_HLG(\n                L=L_disambig.to('cpu'),\n                G=G.to('cpu'),\n                H=ctc_topo.to('cpu'),\n                labels_disambig_id_start=first_phone_disambig_id,\n                
aux_labels_disambig_id_start=first_word_disambig_id)\n            torch.save(decoding_graph.as_dict(),\n                       lang_dir / 'HLG_uni.pt')\n        else:\n            logging.info(\"Loading pre-compiled HLG\")\n            decoding_graph = k2.Fsa.from_dict(\n                torch.load(lang_dir / 'HLG_uni.pt'))\n\n        assert hasattr(decoding_graph, 'phones')\n\n        self.decoding_graph = decoding_graph.to(device)\n\n    def compile(self, texts: Iterable[str],\n                P: k2.Fsa) -> Tuple[k2.Fsa, k2.Fsa, k2.Fsa]:\n        '''Create numerator and denominator graphs from transcripts\n        and the bigram phone LM.\n\n        Args:\n          texts:\n            A list of transcripts. Within a transcript, words are\n            separated by spaces.\n          P:\n            The bigram phone LM created by :func:`create_bigram_phone_lm`.\n        Returns:\n          A tuple (num_graph, den_graph, decoding_graph), where\n\n            - `num_graph` is the numerator graph. It is an FsaVec with\n              shape `(len(texts), None, None)`.\n              It is the result of compose(ctc_topo, P, L, transcript)\n\n            - `den_graph` is the denominator graph. 
It is an FsaVec with the same\n              shape of the `num_graph`.\n              It is the result of compose(ctc_topo, P).\n\n            - decoding_graph: It is the result of compose(ctc_topo, L_disambig, G)\n              Note that it is a single Fsa, not an FsaVec.\n        '''\n        assert P.device == self.device\n        P_with_self_loops = k2.add_epsilon_self_loops(P)\n\n        ctc_topo_P = k2.intersect(self.ctc_topo_inv,\n                                  P_with_self_loops,\n                                  treat_epsilons_specially=False).invert()\n        ctc_topo_P = k2.arc_sort(ctc_topo_P)\n\n        num_graphs = self.build_num_graphs(texts)\n\n        num_graphs_with_self_loops = k2.remove_epsilon_and_add_self_loops(\n            num_graphs)\n\n        num_graphs_with_self_loops = k2.arc_sort(num_graphs_with_self_loops)\n\n        num = k2.compose(ctc_topo_P,\n                         num_graphs_with_self_loops,\n                         treat_epsilons_specially=False,\n                         inner_labels='phones')\n        num = k2.arc_sort(num)\n\n        ctc_topo_P_vec = k2.create_fsa_vec([ctc_topo_P.detach()])\n        indexes = torch.zeros(len(texts),\n                              dtype=torch.int32,\n                              device=self.device)\n        den = k2.index_fsa(ctc_topo_P_vec, indexes)\n\n        return num, den, self.decoding_graph\n\n    def build_num_graphs(self, texts: List[str]) -> k2.Fsa:\n        '''Convert transcript to an Fsa with the help of lexicon\n        and word symbol table.\n\n        Args:\n          texts:\n            Each element is a transcript containing words separated by spaces.\n            For instance, it may be 'HELLO SNOWFALL', which contains\n            two words.\n\n        Returns:\n          Return an FST (FsaVec) corresponding to the transcript. 
Its `labels` are\n          phone IDs and `aux_labels` are word IDs.\n        '''\n        word_ids_list = []\n        for text in texts:\n            word_ids = []\n            for word in text.split(' '):\n                if word in self.words:\n                    word_ids.append(self.words[word])\n                else:\n                    word_ids.append(self.oov_id)\n            word_ids_list.append(word_ids)\n\n        fsa = k2.linear_fsa(word_ids_list, self.device)\n        fsa = k2.add_epsilon_self_loops(fsa)\n        num_graphs = k2.intersect(self.L_inv,\n                                  fsa,\n                                  treat_epsilons_specially=False).invert_()\n        num_graphs = k2.arc_sort(num_graphs)\n        return num_graphs\n"
  },
  {
    "path": "snowfall/warpper/k2_decode.py",
    "content": "import torch\nimport k2\nimport sys\nimport numpy as np\nimport logging\n\nfrom espnet.nets.pytorch_backend.nets_utils import make_non_pad_mask\nfrom kaldialign import edit_distance\n\nMAX_LEN = 2000\n\ndef k2_decode(model, device, js, sampler, batch_size, use_segment=False):\n    model.ctc.decode_init()\n    model.to(device)\n    model.eval()\n\n    egs = []\n    tot_results = []\n    tot_loss = []\n    num_egs = len(list(js.keys()))\n    for idx, name in enumerate(js.keys()):\n        egs.append((name, js[name]))\n   \n        if len(egs) == batch_size or idx == num_egs - 1:\n            # for logging\n            names = [eg[0] for eg in egs]\n            if not use_segment: # chinese\n                texts = [eg[1][\"output\"][0][\"token\"] for eg in egs] \n            else: # english\n                texts = [eg[1][\"output\"][0][\"text\"] for eg in egs]\n            ilens_from_json = [eg[1][\"input\"][0][\"shape\"][0] for eg in egs]\n            batch_size = len(names) # for last several examples\n\n            feats = sampler(egs)\n            xs_pad, ilens = build_batch_data(feats[0])\n            xs_pad = torch.from_numpy(xs_pad).to(device)\n            ilens = torch.from_numpy(ilens).to(device)\n            egs = []\n\n            src_mask = make_non_pad_mask(ilens.tolist()).to(xs_pad.device).unsqueeze(-2) \n            hs_pad, hs_mask = model.encoder(xs_pad, src_mask)\n            hs_len = hs_mask.view(batch_size, -1).sum(1)\n                \n            results = model.ctc.decode(hs_pad, hs_len, texts, use_segment) \n            tot_results += results  \n            parse_results(tot_results)\n\ndef build_batch_data(feats):\n    # feats: list of 2d ndarray\n    batch_size = len(feats)\n    dim = feats[0].shape[-1]\n    max_len = 0\n    buf = np.zeros((batch_size, MAX_LEN, dim), dtype=np.float32)\n    ilen = np.zeros(batch_size, dtype=np.int32)\n\n    for i in range(batch_size):\n        feat = feats[i]\n        feat_len = 
feat.shape[0]\n        # fill only row i; slicing with buf[i:] would also write into the padding of later rows\n        buf[i, :feat_len, :] = feat\n        ilen[i] = feat_len\n        max_len = max(max_len, feat_len)\n\n    buf = buf[:, :max_len, :]\n    return buf, ilen\n\ndef parse_results(results):\n    dists = [edit_distance(r, h) for r, h in results]\n    errors = {\n        key: sum(dist[key] for dist in dists)\n        for key in ['sub', 'ins', 'del', 'total']\n    }\n    total_chars = sum(len(ref) for ref, _ in results)\n    logging.warning(\n        f'%WER {errors[\"total\"] / total_chars:.2%} '\n        f'[{errors[\"total\"]} / {total_chars}, {errors[\"ins\"]} ins, {errors[\"del\"]} del, {errors[\"sub\"]} sub ]'\n    )\n\n"
  },
  {
    "path": "snowfall/warpper/mmi_test.py",
    "content": "import torch\nimport k2\nfrom pathlib import Path\nfrom snowfall.lexicon import Lexicon\nfrom snowfall.training.mmi_graph import create_bigram_phone_lm, MmiTrainingGraphCompiler\n\n\ndef main():\n    lang = Path(\"data/lang_k2mmi\")\n    lexicon = Lexicon(lang)\n    device = torch.device(\"cpu\")\n    graph_compiler = MmiTrainingGraphCompiler(lexicon, device=device)\n\n    phone_ids = lexicon.phone_symbols()\n    P = create_bigram_phone_lm(phone_ids)\n    \n    dim = len(phone_ids) + 1\n    T = 100\n    nnet_output = torch.rand(1, T, dim)\n    supervision = torch.Tensor([[0, 0, T]]).to(torch.int32)\n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n    texts = ['你 好']\n    num_graphs, _ = graph_compiler.compile(texts, P, replicate_den=False)\n\n    # num_lats = k2.intersection_dense(num_graphs, dense_fsa_vec, output_beam=10.0)\n    num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n\n    print(num_tot_scores)\n\nmain()\n    \n"
  },
  {
    "path": "snowfall/warpper/mmi_utils.py",
    "content": "import torch \n\ndef build_word_mapping(word_mapping):\n    ans = {}\n    for line in open(word_mapping):\n        f, t = line.split()\n        ans[int(f)] = int(t)\n    return ans \n\ndef convert_transcription(ys, mapping, words, oov_id, ignore_ids):\n    \"\"\"\n    ys: 2-D torch tensor. indexs of tokens\n    mapping: dict, from attention domain to MMI domain. No special tokens\n    words: dict, from MMI domain index to words\n    ignore_ids: list, ids to ignore\n    \n    We assume there should be NO KEY ERROR!\n    \"\"\"\n    ys = ys.cpu().numpy()\n    ys = [\n          [mapping.get(tok, oov_id) for tok in y if not tok in ignore_ids]\n          for y in ys\n         ]\n    ys = [\n          \" \".join([words[tok] for tok in y])\n          for y in ys\n         ]\n    return ys\n\ndef encode_supervision(hlens):\n    batch_size = hlens.size()[0]\n    supervision = torch.stack((torch.arange(batch_size),\n                              torch.zeros(batch_size),\n                              hlens.cpu()), 1).to(torch.int32)\n    supervision = torch.clamp(supervision, min=0)\n    indices = torch.argsort(supervision[:, 2], descending=True)\n    supervision = supervision[indices]\n    return supervision, indices\n\ndef parse_step(hyp, words, part_ids, weights, full_scores, part_scores, weighted_scores):\n    # previous hypothesis\n    word_hypo = \"\".join([words[x] for x in hyp.yseq])\n    print(f\"Previous Hypothesis:   {word_hypo}\")\n    print(f\"Previous Total scores: {hyp.score}\")\n    \n    # candidates:\n    part_toks = \"     \".join([words[tok] for tok in part_ids])\n    print(f\"Proposed Candidates:   {part_toks}\")\n\n    # slice full scores by part_ids. 
\n    # cannot modify the original data \n    weighted_scores_sliced = weighted_scores[part_ids]\n    full_scores_sliced = {}\n    for k in full_scores:\n        full_scores_sliced[k] = full_scores[k][part_ids]\n\n    # show scores from every source\n    score_dict = {**full_scores_sliced, **part_scores}\n    for k in score_dict:\n        info = \"{:<7}(weighted):   \".format(k)\n        for v in score_dict[k]:\n            info += \"{:>6.2f} \".format(v * weights[k])\n        print(info, flush=True)\n\n    score_dict = {**full_scores_sliced, **part_scores, \"total\": weighted_scores_sliced}\n    for k in score_dict:\n        info = \"{:<7}:             \".format(k)\n        for v in score_dict[k]:\n            info += \"{:>6.2f} \".format(v)\n        print(info, flush=True)\n"
  },
  {
    "path": "snowfall/warpper/prefix_scorer.py",
    "content": "import k2\nimport torch\nimport numpy as np\nfrom pathlib import Path\nfrom espnet.snowfall.lexicon import Lexicon\nfrom espnet.snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom espnet.snowfall.warpper.mmi_utils import encode_supervision\nfrom espnet.snowfall.training.mmi_graph import create_bigram_phone_lm\n\n\ndef build_word_fsa(prefix_ids, candidate_intervals):\n    batch = len(candidate_intervals)    \n\n    # Prefix part \n    prefix_len = len(prefix_ids)\n    start_state = np.arange(prefix_len)\n    end_state = np.arange(prefix_len) + 1\n    labels = np.array(prefix_ids)\n    scores = np.zeros(prefix_len)\n    \n    prefix_part = np.stack([start_state, end_state, labels, scores], axis=1)\n\n    # candidate part\n    candidate_parts = []\n    for start, end in candidate_intervals: \n        num_candidate = end - start\n        start_state = np.ones(num_candidate) * prefix_len \n        end_state = np.ones(num_candidate) * (prefix_len + 1)\n        labels = np.arange(start, end)\n        scores = np.zeros(num_candidate)\n        candidate_part = np.stack([start_state, end_state, labels, scores], axis=1)\n        candidate_parts.append(candidate_part)\n\n    # end arc\n    end_arc = np.array([[prefix_len + 1, prefix_len + 2, -1, 0]])\n   \n    # assemble: do not need to arc_sort \n    num_vec = []\n    for i, candidate_part in enumerate(candidate_parts):\n        num_mat = np.concatenate([prefix_part, candidate_part, end_arc], axis=0)\n        num_mat = torch.from_numpy(num_mat).to(torch.int32)\n        num_vec.append(num_mat)\n\n    num_vec = [k2.Fsa.from_dict({\"arcs\": num}) for num in num_vec]\n    num_vec = k2.create_fsa_vec(num_vec)\n    return num_vec\n    \nif __name__ == '__main__':\n    lang = Path(\"data/lang_char\")\n    device = torch.device(\"cpu\")\n    lexicon = Lexicon(lang)\n    compiler = MmiTrainingGraphCompiler(lexicon, device)\n    phones = lexicon.phone_symbols()\n    P = 
create_bigram_phone_lm(phones).to(device)\n\n    prefix_ids = [1] \n    candidate_intervals =  [[58968, 60968], [60968, 62968], [62968, 64968], [64968, 66298]]\n    num_graphs = compiler.compile_nums_for_prefix_scoring(prefix_ids, candidate_intervals, P)\n\n    batch = len(candidate_intervals)\n    nnet_output = torch.randn(batch, 500, len(phones) + 1).to(device)\n    nnet_output = torch.nn.functional.log_softmax(nnet_output, dim=-1)\n    hlens = torch.ones(batch) * 500\n    supervision, _ = encode_supervision(hlens)\n    dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n    num_lats = k2.intersect_dense(num_graphs, dense_fsa_vec, output_beam=5.0)\n    print(num_lats[0].as_dict())\n    num_tot_scores = num_lats.get_tot_scores(log_semiring=True, use_double_scores=True)\n    print(num_tot_scores)\n"
  },
  {
    "path": "snowfall/warpper/warpper_ctc.py",
    "content": "import torch\nimport k2\nimport torch.nn.functional as F\nimport os\nimport sys\nimport logging\nimport numpy as np\nfrom pathlib import Path\nfrom typing import List\nfrom typing import Union\nfrom k2 import Fsa, SymbolTable\nfrom espnet.snowfall.lexicon import Lexicon\nfrom snowfall.training.ctc_graph import CtcTrainingGraphCompiler\nfrom espnet.snowfall.objectives.ctc import CTCLoss\nfrom espnet.snowfall.common import get_phone_symbols, find_first_disambig_symbol, get_texts\nfrom espnet.snowfall.training.ctc_graph import build_ctc_topo\nfrom espnet.snowfall.decoding.graph import compile_HLG\nfrom lhotse.utils import nullcontext\nfrom espnet.snowfall.warpper.mmi_utils import build_word_mapping, convert_transcription, encode_supervision\n\n\"\"\"\nJune 29th\nself.phone_ids and self.phones are not identical\nself.phone_ids is built from lexicon and have no esp/blk\nself.phones is read from file and have esp/blk\nNot clear would this leads to a bug\n\"\"\"\n\nclass K2CTC(torch.nn.Module):\n\n    def __init__(self, \n                 idim, \n                 lang, \n                 char_list, \n                 device, \n                 dropout, \n                 den_scale, \n                 eos_id, \n                 pad_id=-1,\n                 use_segment=False):\n\n        \"\"\"\n        idim: input dim, usually the transformer output dim\n        lang: k2 lang directory\n        word_mapping: mapping from attention vocab to MMI vocab\n        device: torch.device object. device to build the loss module\n        dropout: dropout rate for linear out layer\n        den_scale: den_scale for MMI loss computation\n        eos_id: end of sentence id\n        pad_id: id of padding in ys_pad\n        use_segment: If true, the supervision of MMI training would use \"texts\"\n                     instead of ys_pad. 
Sensitive for Chinese\n        \"\"\"\n\n        super().__init__()\n        self.device = device\n        \n        # compiler\n        self.lang = Path(lang)\n        self.lexicon = Lexicon(self.lang)\n        self.graph_compiler = CtcTrainingGraphCompiler(\n                              L_inv=self.lexicon.L_inv,\n                              phones=self.lexicon.phones,\n                              words=self.lexicon.words\n                              )\n\n        # bigram LM\n        self.phone_ids = self.lexicon.phone_symbols() # blank excluded\n        self.words = self.lexicon.words\n        self.phones = self.lexicon.phones \n\n        # linear\n        self.idim = idim\n        self.odim = len(self.phone_ids) + 1\n        self.lo = torch.nn.Sequential(\n                    torch.nn.Dropout(p=dropout),\n                    torch.nn.Linear(self.idim, self.odim)\n                                     )\n\n        # others\n        self.eos_id = eos_id\n        self.pad_id = pad_id\n        self.use_segment=use_segment\n        self.char_list = char_list\n        self.oovid = int(open(self.lang / 'oov.int').read().strip())\n        self.probs = None # for visualization\n        self.HLG = None # Decoding graph. 
build by \"decode_init\"\n        print(\"INFO from CTC module:\")\n        print(f\"device: {device}\")\n        print(f\"use segment info: {use_segment}\")\n        print(f\"self.lo {self.lo}\")\n        print(f\"number of phones {len(self.phone_ids)}\")\n\n    # softmax, log_softmax and argmax for decoding and visualization\n    def log_softmax(self, hs_pad):\n        return self.softmax(hs_pad).log()\n\n    def softmax(self, hs_pad):\n        # self.probs is required by visualization\n        self.probs = F.softmax(self.lo(hs_pad), dim=2)\n        return self.probs\n\n    def argmax(self, hs_pad):\n        return torch.argmax(self.lo(hs_pad), dim=2)\n\n    def forward(self, hs_pad, hlens, ys_pad, texts):\n        \n        if self.use_segment:\n            ys = texts\n            if \"<space>\" in self.char_list:\n                ys = [y.replace(\" \", \"<space> \") for y in ys]\n        else:\n            # split by every character: BPE or chinese chars\n            ys = [[self.char_list[c] for c in y if c != self.pad_id] for y in ys_pad]\n            ys = [\" \".join(y).replace(\"<eos>\", \"\") for y in ys]\n \n        supervision, indices = encode_supervision(hlens)\n        ys = [ys[i] for i in indices]\n        \n        nnet_output = self.lo(hs_pad)\n        nnet_output = F.log_softmax(nnet_output, dim=-1)\n        \n        loss_fn = CTCLoss(self.graph_compiler)\n\n        grad_context = nullcontext if self.training else torch.no_grad\n\n        with grad_context():\n            ctc_loss, ctc_frames, all_frames = loss_fn(\n                nnet_output, ys, supervision)\n        batch_size = hlens.size()[0]\n        ctc_loss /= batch_size\n        return - ctc_loss\n\n    def decode(self, nnet_output, hlens, texts, is_english):\n       \n        # add linear function \n        nnet_output = self.lo(nnet_output)\n        supervision, indices = encode_supervision(hlens)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # Show MMI 
loss before decoding is disabled (see the quoted block below)\n        # K2CTC builds no bigram phone LM, so there is no self.P / self.lm_scores to update here\n        \"\"\"\n        loss_fn = LFMMILoss(\n            graph_compiler = self.graph_compiler,\n            P = self.P,\n            den_scale = self.den_scale\n        )\n\n        grad_context = nullcontext if self.training else torch.no_grad\n       \n        if not is_english: \n            texts_reorder = [\n                      \" \".join(list(text.replace(\" \", \"\")))\n                    for text in texts]\n\n            texts_reorder = [texts_reorder[i] for i in indices]\n        else:\n            texts_reorder = [texts[i] for i in indices]\n        with grad_context():\n            mmi_loss, tot_frames, all_frames = loss_fn(\n                nnet_output, texts_reorder, supervision)\n        mmi_loss = - mmi_loss / len(texts)\n        print(\"MMI Loss: \", mmi_loss)\n        \"\"\"\n        assert nnet_output.device == self.HLG.device\n        # 7.0 output beam is tunable\n        lattices = k2.intersect_dense_pruned(self.HLG, dense_fsa_vec, 20.0, 7.0, 30,\n                                             10000)\n        best_paths = k2.shortest_path(lattices, use_double_scores=True)\n        \n        assert best_paths.shape[0] == len(texts)\n        hyps = get_texts(best_paths, indices)\n        assert len(hyps) == len(texts)\n\n        results = []\n        batch_size =len(texts)\n        for i in range(batch_size):\n            hyp_words = [self.words.get(x) for x in hyps[i]]\n            ref_words = texts[i].split(' ')\n            \n            if not is_english:\n                hyp = \"\".join(hyp_words).replace(\" \", \"\")\n                ref = \"\".join(ref_words).replace(\" \", \"\")\n            else:\n                hyp = \" \".join(hyp_words)\n                ref = \" \".join(ref_words)\n            print(\"#\"*20)\n            print(f\"Reference: 
{ref}\")\n            print(f\"Hypothesis: {hyp}\")\n            sys.stdout.flush()\n\n            if not is_english:\n                ref_char = list(ref)\n                hyp_char = list(hyp)\n            else:\n                ref_char = ref.split()\n                hyp_char = hyp.split()\n            results.append((ref_char, hyp_char))\n\n        return results\n\n    def decode_init(self):\n        # Build HLG.fst\n        phone_ids = get_phone_symbols(self.phones) # will remove 0\n        phone_ids_with_blank = [0] + phone_ids\n        ctc_topo = k2.arc_sort(build_ctc_topo(phone_ids_with_blank))\n        if not os.path.exists(self.lang / 'HLG.pt'):\n            logging.debug(\"Loading L_disambig.fst.txt\")\n            with open(self.lang / 'L_disambig.fst.txt') as f:\n                L = k2.Fsa.from_openfst(f.read(), acceptor=False)\n            logging.debug(\"Loading G.fst.txt\")\n            with open(self.lang / 'G.fst.txt') as f:\n                G = k2.Fsa.from_openfst(f.read(), acceptor=False)\n            first_phone_disambig_id = find_first_disambig_symbol(self.phones)\n            first_word_disambig_id = find_first_disambig_symbol(self.words)\n            print(\"first disambig symbol: \", first_phone_disambig_id, first_word_disambig_id, flush=True)\n            HLG = compile_HLG(L=L,\n                             G=G,\n                             H=ctc_topo,\n                             labels_disambig_id_start=first_phone_disambig_id,\n                             aux_labels_disambig_id_start=first_word_disambig_id)\n            torch.save(HLG.as_dict(), self.lang / 'HLG.pt')\n        else:\n            logging.debug(\"Loading pre-compiled HLG\")\n            d = torch.load(self.lang / 'HLG.pt')\n            HLG = k2.Fsa.from_dict(d)\n\n        HLG = HLG.to(self.device)\n        HLG.aux_labels = k2.ragged.remove_values_eq(HLG.aux_labels, 0)\n        HLG.requires_grad_(False)\n        if not hasattr(HLG, 'lm_scores'):\n            
HLG.lm_scores = HLG.scores.clone()\n        self.HLG = HLG\n        print(\"Successful Initialize Decoding HLG\")\n        \n    def dump_weight(self, rank):\n        d = {}\n        for k, v in self.named_parameters():\n            print(f\"Found parameter {k} with shape {v.size()}\")\n            d[k] = v\n        save_path = self.lang / f\"ctc_param.{rank}.pth\"\n        torch.save(d, save_path)\n"
  },
  {
    "path": "snowfall/warpper/warpper_mmi.py",
    "content": "import torch\nimport k2\nimport torch.nn.functional as F\nimport os\nimport sys\nimport logging\nimport numpy as np\nfrom pathlib import Path\nfrom typing import List\nfrom typing import Union\nfrom k2 import Fsa, SymbolTable\nfrom espnet.snowfall.lexicon import Lexicon\nfrom espnet.snowfall.objectives.mmi import LFMMILoss\nfrom espnet.snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom espnet.snowfall.training.mmi_graph import create_bigram_phone_lm\nfrom espnet.snowfall.common import get_phone_symbols, find_first_disambig_symbol, get_texts\nfrom espnet.snowfall.training.ctc_graph import build_ctc_topo\nfrom espnet.snowfall.decoding.graph import compile_HLG\nfrom lhotse.utils import nullcontext\nfrom espnet.snowfall.warpper.mmi_utils import build_word_mapping, convert_transcription, encode_supervision\n\n\"\"\"\nJune 29th\nself.phone_ids and self.phones are not identical\nself.phone_ids is built from lexicon and have no esp/blk\nself.phones is read from file and have esp/blk\nNot clear would this leads to a bug\n\"\"\"\n\nclass K2MMI(torch.nn.Module):\n\n    def __init__(self, \n                 idim, \n                 lang, \n                 char_list, \n                 device, \n                 dropout, \n                 den_scale, \n                 eos_id, \n                 pad_id=-1,\n                 use_segment=False):\n\n        \"\"\"\n        idim: input dim, usually the transformer output dim\n        lang: k2 lang directory\n        word_mapping: mapping from attention vocab to MMI vocab\n        device: torch.device object. device to build the loss module\n        dropout: dropout rate for linear out layer\n        den_scale: den_scale for MMI loss computation\n        eos_id: end of sentence id\n        pad_id: id of padding in ys_pad\n        use_segment: If true, the supervision of MMI training would use \"texts\"\n                     instead of ys_pad. 
Sensitive for Chinese\n        \"\"\"\n\n        super().__init__()\n        self.device = device\n        \n        # compiler\n        self.lang = Path(lang)\n        self.lexicon = Lexicon(self.lang)\n        self.oovid = int(open(self.lang / 'oov.int').read().strip())\n        self.oov = self.lexicon.words[self.oovid]\n        self.graph_compiler = MmiTrainingGraphCompiler(\n                              lexicon=self.lexicon,\n                              device=self.device,\n                              oov=self.oov\n                              )\n\n        # bigram LM\n        self.phone_ids = self.lexicon.phone_symbols() # blank excluded\n        self.words = self.lexicon.words\n        self.phones = self.lexicon.phones \n        self.P = create_bigram_phone_lm(self.phone_ids)\n        self.P.scores = torch.zeros_like(self.P.scores)\n        self.P = self.P.to(self.device)\n        self.lm_scores = torch.nn.Parameter(self.P.scores.clone(), requires_grad=True)\n        self.use_pruned_intersect = len(self.phone_ids) > 500\n\n        # linear\n        self.idim = idim\n        self.odim = len(self.phone_ids) + 1\n        self.lo = torch.nn.Sequential(\n                    torch.nn.Dropout(p=dropout),\n                    torch.nn.Linear(self.idim, self.odim)\n                                     )\n\n        # others\n        self.eos_id = eos_id\n        self.pad_id = pad_id\n        self.den_scale = den_scale \n        self.use_segment=use_segment\n        self.char_list = char_list\n        self.probs = None # for visualization\n        self.HLG = None # Decoding graph. 
build by \"decode_init\"\n        print(\"INFO from MMI module:\")\n        print(f\"device: {device}\")\n        print(f\"use pruned_intersect: {self.use_pruned_intersect}\")\n        print(f\"use segment info: {use_segment}\")\n        print(f\"self.lo {self.lo}\")\n        print(f\"number of phones {len(self.phone_ids)}\")\n\n    # softmax, log_softmax and argmax for decoding and visualization\n    def log_softmax(self, hs_pad):\n        return self.softmax(hs_pad).log()\n\n    def softmax(self, hs_pad):\n        # self.probs is required by visualization\n        self.probs = F.softmax(self.lo(hs_pad), dim=2)\n        return self.probs\n\n    def argmax(self, hs_pad):\n        return torch.argmax(self.lo(hs_pad), dim=2)\n\n    def forward(self, hs_pad, hlens, ys_pad, texts):\n         \n        if self.use_segment:\n            ys = texts\n        else:\n            ys = [[self.char_list[c] for c in y if c != self.pad_id] for y in ys_pad]\n            ys = [\" \".join(y).replace(\"<eos>\", \"\") for y in ys]\n\n        supervision, indices = encode_supervision(hlens)\n        ys = [ys[i] for i in indices]\n        \n        nnet_output = self.lo(hs_pad)\n        \n        self.P.set_scores_stochastic_(self.lm_scores)\n        if self.training:\n            assert self.P.is_cpu\n            assert self.P.requires_grad is True\n        else:\n            # Never use segmentation in evaluation: to approximate the decoding stage\n            ys = [[self.char_list[c] for c in y if c != self.pad_id] for y in ys_pad]\n            ys = [\" \".join(y).replace(\"<eos>\", \"\") for y in ys]\n\n        loss_fn = LFMMILoss(\n            graph_compiler=self.graph_compiler,\n            P=self.P,\n            den_scale=self.den_scale,\n            use_pruned_intersect=self.use_pruned_intersect\n        )\n\n        grad_context = nullcontext if self.training else torch.no_grad\n\n        with grad_context():\n            mmi_loss, tot_frames, all_frames = loss_fn(\n            
    nnet_output, ys, supervision)\n        batch_size = hlens.size()[0]\n        mmi_loss /= batch_size\n        return - mmi_loss\n\n    def decode(self, nnet_output, hlens, texts, is_english):\n       \n        # add linear function \n        nnet_output = self.lo(nnet_output)\n        supervision, indices = encode_supervision(hlens)\n        dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision)\n\n        # Show MMI loss before decoding\n        self.P.set_scores_stochastic_(self.lm_scores)\n        if self.training:\n            assert self.P.is_cpu\n            assert self.P.requires_grad is False\n        \"\"\"\n        loss_fn = LFMMILoss(\n            graph_compiler = self.graph_compiler,\n            P = self.P,\n            den_scale = self.den_scale\n        )\n\n        grad_context = nullcontext if self.training else torch.no_grad\n       \n        if not is_english: \n            texts_reorder = [\n                      \" \".join(list(text.replace(\" \", \"\")))\n                    for text in texts]\n\n            texts_reorder = [texts_reorder[i] for i in indices]\n        else:\n            texts_reorder = [texts[i] for i in indices]\n        with grad_context():\n            mmi_loss, tot_frames, all_frames = loss_fn(\n                nnet_output, texts_reorder, supervision)\n        mmi_loss = - mmi_loss / len(texts)\n        print(\"MMI Loss: \", mmi_loss)\n        \"\"\"\n        assert nnet_output.device == self.HLG.device\n        # 7.0 output beam is tunable\n        lattices = k2.intersect_dense_pruned(self.HLG, dense_fsa_vec, 20.0, 7.0, 30,\n                                             10000)\n        best_paths = k2.shortest_path(lattices, use_double_scores=True)\n        \n        assert best_paths.shape[0] == len(texts)\n        hyps = get_texts(best_paths, indices)\n        assert len(hyps) == len(texts)\n\n        results = []\n        batch_size =len(texts)\n        for i in range(batch_size):\n            hyp_words = 
[self.words.get(x) for x in hyps[i]]\n            ref_words = texts[i].split(' ')\n            \n            if not is_english:\n                hyp = \"\".join(hyp_words).replace(\" \", \"\")\n                ref = \"\".join(ref_words).replace(\" \", \"\")\n            else:\n                hyp = \" \".join(hyp_words)\n                ref = \" \".join(ref_words)\n            print(\"#\"*20)\n            print(f\"Reference: {ref}\")\n            print(f\"Hypothesis: {hyp}\")\n            sys.stdout.flush()\n\n            if not is_english:\n                ref_char = list(ref)\n                hyp_char = list(hyp)\n            else:\n                ref_char = ref.split()\n                hyp_char = hyp.split()\n            results.append((ref_char, hyp_char))\n\n        return results\n\n    def decode_init(self):\n        # Build HLG.fst\n        phone_ids = get_phone_symbols(self.phones) # will remove 0\n        phone_ids_with_blank = [0] + phone_ids\n        ctc_topo = k2.arc_sort(build_ctc_topo(phone_ids_with_blank))\n        if not os.path.exists(self.lang / 'HLG.pt'):\n            logging.debug(\"Loading L_disambig.fst.txt\")\n            with open(self.lang / 'L_disambig.fst.txt') as f:\n                L = k2.Fsa.from_openfst(f.read(), acceptor=False)\n            logging.debug(\"Loading G.fst.txt\")\n            with open(self.lang / 'G.fst.txt') as f:\n                G = k2.Fsa.from_openfst(f.read(), acceptor=False)\n            first_phone_disambig_id = find_first_disambig_symbol(self.phones)\n            first_word_disambig_id = find_first_disambig_symbol(self.words)\n            print(\"first disambig symbol: \", first_phone_disambig_id, first_word_disambig_id, flush=True)\n            HLG = compile_HLG(L=L,\n                             G=G,\n                             H=ctc_topo,\n                             labels_disambig_id_start=first_phone_disambig_id,\n                             aux_labels_disambig_id_start=first_word_disambig_id)\n     
       torch.save(HLG.as_dict(), self.lang / 'HLG.pt')\n        else:\n            logging.debug(\"Loading pre-compiled HLG\")\n            d = torch.load(self.lang / 'HLG.pt')\n            HLG = k2.Fsa.from_dict(d)\n\n        HLG = HLG.to(self.device)\n        HLG.aux_labels = k2.ragged.remove_values_eq(HLG.aux_labels, 0)\n        HLG.requires_grad_(False)\n        if not hasattr(HLG, 'lm_scores'):\n            HLG.lm_scores = HLG.scores.clone()\n        self.HLG = HLG\n        print(\"Successful Initialize Decoding HLG\")\n    \n    def dump_weight(self, rank, path):\n        d = {}\n        for k, v in self.named_parameters():\n            print(f\"Found parameter {k} with shape {v.size()}\")\n            d[k] = v\n        save_path = os.path.join(path, f\"mmi_param.{rank}.pth\")\n        torch.save(d, save_path) \n"
  },
  {
    "path": "st/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "st/pytorch_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "st/pytorch_backend/st.py",
    "content": "# Copyright 2019 Kyoto University (Hirofumi Inaguma)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"Training/decoding definition for the speech translation task.\"\"\"\n\nimport json\nimport logging\nimport os\nimport sys\n\nfrom chainer import training\nfrom chainer.training import extensions\nimport numpy as np\nfrom tensorboardX import SummaryWriter\nimport torch\n\nfrom espnet.asr.asr_utils import adadelta_eps_decay\nfrom espnet.asr.asr_utils import adam_lr_decay\nfrom espnet.asr.asr_utils import add_results_to_json\nfrom espnet.asr.asr_utils import CompareValueTrigger\nfrom espnet.asr.asr_utils import restore_snapshot\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_model\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_modules\n\nfrom espnet.nets.pytorch_backend.e2e_asr import pad_list\nfrom espnet.nets.st_interface import STInterface\nfrom espnet.utils.dataset import ChainerDataLoader\nfrom espnet.utils.dataset import TransformDataset\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.iterators import ShufflingEnabler\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\nfrom espnet.asr.pytorch_backend.asr import CustomConverter as ASRCustomConverter\nfrom espnet.asr.pytorch_backend.asr import CustomEvaluator\nfrom espnet.asr.pytorch_backend.asr import CustomUpdater\n\nimport matplotlib\n\nmatplotlib.use(\"Agg\")\n\nif sys.version_info[0] == 2:\n    
from itertools import izip_longest as zip_longest\nelse:\n    from itertools import zip_longest as zip_longest\n\n\nclass CustomConverter(ASRCustomConverter):\n    \"\"\"Custom batch converter for Pytorch.\n\n    Args:\n        subsampling_factor (int): The subsampling factor.\n        dtype (torch.dtype): Data type to convert.\n        use_source_text (bool): use source transcription.\n\n    \"\"\"\n\n    def __init__(\n        self, subsampling_factor=1, dtype=torch.float32, use_source_text=False\n    ):\n        \"\"\"Construct a CustomConverter object.\"\"\"\n        super().__init__(subsampling_factor=subsampling_factor, dtype=dtype)\n        self.use_source_text = use_source_text\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Transform a batch and send it to a device.\n\n        Args:\n            batch (list): The batch to transform.\n            device (torch.device): The device to send to.\n\n        Returns:\n            tuple(torch.Tensor, torch.Tensor, torch.Tensor)\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys, ys_src = batch[0]\n\n        # get batch of lengths of input sequences\n        ilens = np.array([x.shape[0] for x in xs])\n        ilens = torch.from_numpy(ilens).to(device)\n\n        xs_pad = pad_list([torch.from_numpy(x).float() for x in xs], 0).to(\n            device, dtype=self.dtype\n        )\n\n        ys_pad = pad_list(\n            [torch.from_numpy(np.array(y, dtype=np.int64)) for y in ys],\n            self.ignore_id,\n        ).to(device)\n\n        if self.use_source_text:\n            ys_pad_src = pad_list(\n                [torch.from_numpy(np.array(y, dtype=np.int64)) for y in ys_src],\n                self.ignore_id,\n            ).to(device)\n        else:\n            ys_pad_src = None\n\n        return xs_pad, ilens, ys_pad, ys_pad_src\n\n\ndef train(args):\n    \"\"\"Train with the given args.\n\n    Args:\n        args 
(namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n\n    # check cuda availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n    idim = int(valid_json[utts[0]][\"input\"][0][\"shape\"][-1])\n    odim = int(valid_json[utts[0]][\"output\"][0][\"shape\"][-1])\n    logging.info(\"#input dims : \" + str(idim))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # Initialize with pre-trained ASR encoder and MT decoder\n    if args.enc_init is not None or args.dec_init is not None:\n        model = load_trained_modules(idim, odim, args, interface=STInterface)\n    else:\n        model_class = dynamic_import(args.model_module)\n        model = model_class(idim, odim, args)\n    assert isinstance(model, STInterface)\n    total_subsampling_factor = model.get_total_subsampling_factor()\n\n    # write model config\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim, odim, vars(args)), indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    reporter = model.reporter\n\n    # check the use of multi-gpu\n    if args.ngpu > 1:\n        if args.batch_size != 0:\n            logging.warning(\n                \"batch size is automatically increased (%d -> %d)\"\n                % (args.batch_size, args.batch_size * args.ngpu)\n            )\n            args.batch_size *= args.ngpu\n\n    # set torch device\n    device = 
torch.device(\"cuda\" if args.ngpu > 0 else \"cpu\")\n    if args.train_dtype in (\"float16\", \"float32\", \"float64\"):\n        dtype = getattr(torch, args.train_dtype)\n    else:\n        dtype = torch.float32\n    model = model.to(device=device, dtype=dtype)\n\n    logging.warning(\n        \"num. model params: {:,} (num. trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # Setup an optimizer\n    if args.opt == \"adadelta\":\n        optimizer = torch.optim.Adadelta(\n            model.parameters(), rho=0.95, eps=args.eps, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"adam\":\n        optimizer = torch.optim.Adam(\n            model.parameters(), lr=args.lr, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"noam\":\n        from espnet.nets.pytorch_backend.transformer.optimizer import get_std_opt\n\n        optimizer = get_std_opt(\n            model.parameters(),\n            args.adim,\n            args.transformer_warmup_steps,\n            args.transformer_lr,\n        )\n    else:\n        raise NotImplementedError(\"unknown optimizer: \" + args.opt)\n\n    # setup apex.amp\n    if args.train_dtype in (\"O0\", \"O1\", \"O2\", \"O3\"):\n        try:\n            from apex import amp\n        except ImportError as e:\n            logging.error(\n                f\"You need to install apex for --train-dtype {args.train_dtype}. 
\"\n                \"See https://github.com/NVIDIA/apex#linux\"\n            )\n            raise e\n        if args.opt == \"noam\":\n            model, optimizer.optimizer = amp.initialize(\n                model, optimizer.optimizer, opt_level=args.train_dtype\n            )\n        else:\n            model, optimizer = amp.initialize(\n                model, optimizer, opt_level=args.train_dtype\n            )\n        use_apex = True\n    else:\n        use_apex = False\n\n    # FIXME: TOO DIRTY HACK\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    # Setup a converter\n    converter = CustomConverter(\n        subsampling_factor=model.subsample[0],\n        dtype=dtype,\n        use_source_text=args.asr_weight > 0 or args.mt_weight > 0,\n    )\n\n    # read json data\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n    # make minibatch list (variable length)\n    train = make_batchset(\n        train_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        shortest_first=use_sortagrad,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=0,\n    )\n    valid = make_batchset(\n        valid_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n   
     batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        iaxis=0,\n        oaxis=0,\n    )\n\n    load_tr = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": True},  # Switch the mode of preprocessing\n    )\n    load_cv = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=True,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n    # hack to make batchsize argument as 1\n    # actual bathsize is included in a list\n    # default collate function converts numpy array to pytorch tensor\n    # we used an empty collate function instead which returns list\n    train_iter = ChainerDataLoader(\n        dataset=TransformDataset(train, lambda data: converter([load_tr(data)])),\n        batch_size=1,\n        num_workers=args.n_iter_processes,\n        shuffle=not use_sortagrad,\n        collate_fn=lambda x: x[0],\n    )\n    valid_iter = ChainerDataLoader(\n        dataset=TransformDataset(valid, lambda data: converter([load_cv(data)])),\n        batch_size=1,\n        shuffle=False,\n        collate_fn=lambda x: x[0],\n        num_workers=args.n_iter_processes,\n    )\n\n    # Set up a trainer\n    updater = CustomUpdater(\n        model,\n        args.grad_clip,\n        {\"main\": train_iter},\n        optimizer,\n        device,\n        args.ngpu,\n        args.grad_noise,\n        args.accum_grad,\n        use_apex=use_apex,\n    )\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n\n    # Resume from a snapshot\n    if args.resume:\n        logging.info(\"resumed from %s\" % 
args.resume)\n        torch_resume(args.resume, trainer)\n\n    # Evaluate the model with the test dataset for each epoch\n    if args.save_interval_iters > 0:\n        trainer.extend(\n            CustomEvaluator(model, {\"main\": valid_iter}, reporter, device, args.ngpu),\n            trigger=(args.save_interval_iters, \"iteration\"),\n        )\n    else:\n        trainer.extend(\n            CustomEvaluator(model, {\"main\": valid_iter}, reporter, device, args.ngpu)\n        )\n\n    # Save attention weight at each epoch\n    if args.num_save_attention > 0:\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"input\"][0][\"shape\"][1]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n            subsampling_factor=total_subsampling_factor,\n        )\n        trainer.extend(att_reporter, trigger=(1, \"epoch\"))\n    else:\n        att_reporter = None\n\n    # Save CTC prob at each epoch\n    if (args.asr_weight > 0 and args.mtlalpha > 0) and args.num_save_ctc > 0:\n        # NOTE: sort it by output lengths\n        data = sorted(\n            list(valid_json.items())[: args.num_save_ctc],\n            key=lambda x: int(x[1][\"output\"][0][\"shape\"][0]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            ctc_vis_fn = model.module.calculate_all_ctc_probs\n            plot_class = model.module.ctc_plot_class\n        else:\n            ctc_vis_fn = 
model.calculate_all_ctc_probs\n            plot_class = model.ctc_plot_class\n        ctc_reporter = plot_class(\n            ctc_vis_fn,\n            data,\n            args.outdir + \"/ctc_prob\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n            subsampling_factor=total_subsampling_factor,\n        )\n        trainer.extend(ctc_reporter, trigger=(1, \"epoch\"))\n    else:\n        ctc_reporter = None\n\n    # Make a plot for training and validation values\n    trainer.extend(\n        extensions.PlotReport(\n            [\n                \"main/loss\",\n                \"validation/main/loss\",\n                \"main/loss_asr\",\n                \"validation/main/loss_asr\",\n                \"main/loss_mt\",\n                \"validation/main/loss_mt\",\n                \"main/loss_st\",\n                \"validation/main/loss_st\",\n            ],\n            \"epoch\",\n            file_name=\"loss.png\",\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\n                \"main/acc\",\n                \"validation/main/acc\",\n                \"main/acc_asr\",\n                \"validation/main/acc_asr\",\n                \"main/acc_mt\",\n                \"validation/main/acc_mt\",\n            ],\n            \"epoch\",\n            file_name=\"acc.png\",\n        )\n    )\n    trainer.extend(\n        extensions.PlotReport(\n            [\"main/bleu\", \"validation/main/bleu\"], \"epoch\", file_name=\"bleu.png\"\n        )\n    )\n\n    # Save best models\n    trainer.extend(\n        snapshot_object(model, \"model.loss.best\"),\n        trigger=training.triggers.MinValueTrigger(\"validation/main/loss\"),\n    )\n    trainer.extend(\n        snapshot_object(model, \"model.acc.best\"),\n        trigger=training.triggers.MaxValueTrigger(\"validation/main/acc\"),\n    )\n\n    # save snapshot which contains model and optimizer states\n    if 
args.save_interval_iters > 0:\n        trainer.extend(\n            torch_snapshot(filename=\"snapshot.iter.{.updater.iteration}\"),\n            trigger=(args.save_interval_iters, \"iteration\"),\n        )\n    else:\n        trainer.extend(torch_snapshot(), trigger=(1, \"epoch\"))\n\n    # epsilon decay in the optimizer\n    if args.opt == \"adadelta\":\n        if args.criterion == \"acc\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.acc.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.loss.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adadelta_eps_decay(args.eps_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n    elif args.opt == \"adam\":\n        if args.criterion == \"acc\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.acc.best\", load_fn=torch_load\n   
             ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n            trainer.extend(\n                adam_lr_decay(args.lr_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/acc\",\n                    lambda best_value, current_value: best_value > current_value,\n                ),\n            )\n        elif args.criterion == \"loss\":\n            trainer.extend(\n                restore_snapshot(\n                    model, args.outdir + \"/model.loss.best\", load_fn=torch_load\n                ),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n            trainer.extend(\n                adam_lr_decay(args.lr_decay),\n                trigger=CompareValueTrigger(\n                    \"validation/main/loss\",\n                    lambda best_value, current_value: best_value < current_value,\n                ),\n            )\n\n    # Write a log of evaluation statistics for each epoch\n    trainer.extend(\n        extensions.LogReport(trigger=(args.report_interval_iters, \"iteration\"))\n    )\n    report_keys = [\n        \"epoch\",\n        \"iteration\",\n        \"main/loss\",\n        \"main/loss_st\",\n        \"main/loss_asr\",\n        \"validation/main/loss\",\n        \"validation/main/loss_st\",\n        \"validation/main/loss_asr\",\n        \"main/acc\",\n        \"validation/main/acc\",\n    ]\n    if args.asr_weight > 0:\n        report_keys.append(\"main/acc_asr\")\n        report_keys.append(\"validation/main/acc_asr\")\n    report_keys += [\"elapsed_time\"]\n    if args.opt == \"adadelta\":\n        trainer.extend(\n            extensions.observe_value(\n                
\"eps\",\n                lambda trainer: trainer.updater.get_optimizer(\"main\").param_groups[0][\n                    \"eps\"\n                ],\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"eps\")\n    elif args.opt in [\"adam\", \"noam\"]:\n        trainer.extend(\n            extensions.observe_value(\n                \"lr\",\n                lambda trainer: trainer.updater.get_optimizer(\"main\").param_groups[0][\n                    \"lr\"\n                ],\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n        report_keys.append(\"lr\")\n    if args.asr_weight > 0:\n        if args.mtlalpha > 0:\n            report_keys.append(\"main/cer_ctc\")\n            report_keys.append(\"validation/main/cer_ctc\")\n        if args.mtlalpha < 1:\n            if args.report_cer:\n                report_keys.append(\"validation/main/cer\")\n            if args.report_wer:\n                report_keys.append(\"validation/main/wer\")\n    if args.report_bleu:\n        report_keys.append(\"main/bleu\")\n        report_keys.append(\"validation/main/bleu\")\n    trainer.extend(\n        extensions.PrintReport(report_keys),\n        trigger=(args.report_interval_iters, \"iteration\"),\n    )\n\n    trainer.extend(extensions.ProgressBar(update_interval=args.report_interval_iters))\n    set_early_stop(trainer, args)\n\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        trainer.extend(\n            TensorboardLogger(\n                SummaryWriter(args.tensorboard_dir),\n                att_reporter=att_reporter,\n                ctc_reporter=ctc_reporter,\n            ),\n            trigger=(args.report_interval_iters, \"iteration\"),\n        )\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, args.epochs)\n\n\ndef trans(args):\n    \"\"\"Decode with the given args.\n\n    Args:\n        args 
(namespace): The program arguments.\n\n    \"\"\"\n    set_deterministic_pytorch(args)\n    model, train_args = load_trained_model(args.model)\n    assert isinstance(model, STInterface)\n    model.trans_args = args\n\n    # gpu\n    if args.ngpu == 1:\n        gpu_id = list(range(args.ngpu))\n        logging.info(\"gpu id: \" + str(gpu_id))\n        model.cuda()\n\n    # read json data\n    with open(args.trans_json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n    new_js = {}\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"asr\",\n        load_output=False,\n        sort_in_input_length=False,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},\n    )\n\n    if args.batchsize == 0:\n        with torch.no_grad():\n            for idx, name in enumerate(js.keys(), 1):\n                logging.info(\"(%d/%d) decoding \" + name, idx, len(js.keys()))\n                batch = [(name, js[name])]\n                feat = load_inputs_and_targets(batch)[0][0]\n                nbest_hyps = model.translate(\n                    feat,\n                    args,\n                    train_args.char_list,\n                )\n                new_js[name] = add_results_to_json(\n                    js[name], nbest_hyps, train_args.char_list\n                )\n\n    else:\n\n        def grouper(n, iterable, fillvalue=None):\n            kargs = [iter(iterable)] * n\n            return zip_longest(*kargs, fillvalue=fillvalue)\n\n        # sort data if batchsize > 1\n        keys = list(js.keys())\n        if args.batchsize > 1:\n            feat_lens = [js[key][\"input\"][0][\"shape\"][0] for key in keys]\n            sorted_index = sorted(range(len(feat_lens)), key=lambda i: -feat_lens[i])\n            keys = [keys[i] for i in sorted_index]\n\n        with torch.no_grad():\n            for names in grouper(args.batchsize, keys, 
None):\n                names = [name for name in names if name]\n                batch = [(name, js[name]) for name in names]\n                feats = load_inputs_and_targets(batch)[0]\n                nbest_hyps = model.translate_batch(\n                    feats,\n                    args,\n                    train_args.char_list,\n                )\n\n                for i, nbest_hyp in enumerate(nbest_hyps):\n                    name = names[i]\n                    new_js[name] = add_results_to_json(\n                        js[name], nbest_hyp, train_args.char_list\n                    )\n\n    with open(args.result_label, \"wb\") as f:\n        f.write(\n            json.dumps(\n                {\"utts\": new_js}, indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n"
  },
  {
    "path": "transform/__init__.py",
    "content": "\"\"\"Initialize main package.\"\"\"\n"
  },
  {
    "path": "transform/add_deltas.py",
    "content": "import numpy as np\n\n\ndef delta(feat, window):\n    assert window > 0\n    delta_feat = np.zeros_like(feat)\n    for i in range(1, window + 1):\n        delta_feat[:-i] += i * feat[i:]\n        delta_feat[i:] += -i * feat[:-i]\n        delta_feat[-i:] += i * feat[-1]\n        delta_feat[:i] += -i * feat[0]\n    delta_feat /= 2 * sum(i ** 2 for i in range(1, window + 1))\n    return delta_feat\n\n\ndef add_deltas(x, window=2, order=2):\n    feats = [x]\n    for _ in range(order):\n        feats.append(delta(feats[-1], window))\n    return np.concatenate(feats, axis=1)\n\n\nclass AddDeltas(object):\n    def __init__(self, window=2, order=2):\n        self.window = window\n        self.order = order\n\n    def __repr__(self):\n        return \"{name}(window={window}, order={order}\".format(\n            name=self.__class__.__name__, window=self.window, order=self.order\n        )\n\n    def __call__(self, x):\n        return add_deltas(x, window=self.window, order=self.order)\n"
  },
  {
    "path": "transform/channel_selector.py",
    "content": "import numpy\n\n\nclass ChannelSelector(object):\n    \"\"\"Select 1ch from multi-channel signal\"\"\"\n\n    def __init__(self, train_channel=\"random\", eval_channel=0, axis=1):\n        self.train_channel = train_channel\n        self.eval_channel = eval_channel\n        self.axis = axis\n\n    def __repr__(self):\n        return (\n            \"{name}(train_channel={train_channel}, \"\n            \"eval_channel={eval_channel}, axis={axis})\".format(\n                name=self.__class__.__name__,\n                train_channel=self.train_channel,\n                eval_channel=self.eval_channel,\n                axis=self.axis,\n            )\n        )\n\n    def __call__(self, x, train=True):\n        # Assuming x: [Time, Channel] by default\n\n        if x.ndim <= self.axis:\n            # If the dimension is insufficient, then unsqueeze\n            # (e.g [Time] -> [Time, 1])\n            ind = tuple(\n                slice(None) if i < x.ndim else None for i in range(self.axis + 1)\n            )\n            x = x[ind]\n\n        if train:\n            channel = self.train_channel\n        else:\n            channel = self.eval_channel\n\n        if channel == \"random\":\n            ch = numpy.random.randint(0, x.shape[self.axis])\n        else:\n            ch = channel\n\n        ind = tuple(slice(None) if i != self.axis else ch for i in range(x.ndim))\n        return x[ind]\n"
  },
  {
    "path": "transform/cmvn.py",
    "content": "import io\n\nimport h5py\nimport kaldiio\nimport numpy as np\n\n\nclass CMVN(object):\n    def __init__(\n        self,\n        stats,\n        norm_means=True,\n        norm_vars=False,\n        filetype=\"mat\",\n        utt2spk=None,\n        spk2utt=None,\n        reverse=False,\n        std_floor=1.0e-20,\n    ):\n        self.stats_file = stats\n        self.norm_means = norm_means\n        self.norm_vars = norm_vars\n        self.reverse = reverse\n\n        if isinstance(stats, dict):\n            stats_dict = dict(stats)\n        else:\n            # Use for global CMVN\n            if filetype == \"mat\":\n                stats_dict = {None: kaldiio.load_mat(stats)}\n            # Use for global CMVN\n            elif filetype == \"npy\":\n                stats_dict = {None: np.load(stats)}\n            # Use for speaker CMVN\n            elif filetype == \"ark\":\n                self.accept_uttid = True\n                stats_dict = dict(kaldiio.load_ark(stats))\n            # Use for speaker CMVN\n            elif filetype == \"hdf5\":\n                self.accept_uttid = True\n                stats_dict = h5py.File(stats)\n            else:\n                raise ValueError(\"Not supporting filetype={}\".format(filetype))\n\n        if utt2spk is not None:\n            self.utt2spk = {}\n            with io.open(utt2spk, \"r\", encoding=\"utf-8\") as f:\n                for line in f:\n                    utt, spk = line.rstrip().split(None, 1)\n                    self.utt2spk[utt] = spk\n        elif spk2utt is not None:\n            self.utt2spk = {}\n            with io.open(spk2utt, \"r\", encoding=\"utf-8\") as f:\n                for line in f:\n                    spk, utts = line.rstrip().split(None, 1)\n                    for utt in utts.split():\n                        self.utt2spk[utt] = spk\n        else:\n            self.utt2spk = None\n\n        # Kaldi makes a matrix for CMVN which has a shape of (2, feat_dim + 
1),\n        # and the first vector contains the sum of feats and the second is\n        # the sum of squares. The last value of the first, i.e. stats[0,-1],\n        # is the number of samples for this statistics.\n        self.bias = {}\n        self.scale = {}\n        for spk, stats in stats_dict.items():\n            assert len(stats) == 2, stats.shape\n\n            count = stats[0, -1]\n\n            # If the feature has two or more dimensions\n            if not (np.isscalar(count) or isinstance(count, (int, float))):\n                # The first is only used\n                count = count.flatten()[0]\n\n            mean = stats[0, :-1] / count\n            # V(x) = E(x^2) - (E(x))^2\n            var = stats[1, :-1] / count - mean * mean\n            std = np.maximum(np.sqrt(var), std_floor)\n            self.bias[spk] = -mean\n            self.scale[spk] = 1 / std\n\n    def __repr__(self):\n        return (\n            \"{name}(stats_file={stats_file}, \"\n            \"norm_means={norm_means}, norm_vars={norm_vars}, \"\n            \"reverse={reverse})\".format(\n                name=self.__class__.__name__,\n                stats_file=self.stats_file,\n                norm_means=self.norm_means,\n                norm_vars=self.norm_vars,\n                reverse=self.reverse,\n            )\n        )\n\n    def __call__(self, x, uttid=None):\n        if self.utt2spk is not None:\n            spk = self.utt2spk[uttid]\n        else:\n            spk = uttid\n\n        if not self.reverse:\n            if self.norm_means:\n                x = np.add(x, self.bias[spk])\n            if self.norm_vars:\n                x = np.multiply(x, self.scale[spk])\n\n        else:\n            if self.norm_vars:\n                x = np.divide(x, self.scale[spk])\n            if self.norm_means:\n                x = np.subtract(x, self.bias[spk])\n\n        return x\n\n\nclass UtteranceCMVN(object):\n    def __init__(self, norm_means=True, norm_vars=False, 
std_floor=1.0e-20):\n        self.norm_means = norm_means\n        self.norm_vars = norm_vars\n        self.std_floor = std_floor\n\n    def __repr__(self):\n        return \"{name}(norm_means={norm_means}, norm_vars={norm_vars})\".format(\n            name=self.__class__.__name__,\n            norm_means=self.norm_means,\n            norm_vars=self.norm_vars,\n        )\n\n    def __call__(self, x, uttid=None):\n        # x: [Time, Dim]\n        square_sums = (x ** 2).sum(axis=0)\n        mean = x.mean(axis=0)\n\n        if self.norm_means:\n            x = np.subtract(x, mean)\n\n        if self.norm_vars:\n            var = square_sums / x.shape[0] - mean ** 2\n            std = np.maximum(np.sqrt(var), self.std_floor)\n            x = np.divide(x, std)\n\n        return x\n"
  },
  {
    "path": "transform/functional.py",
    "content": "import inspect\n\nfrom espnet.transform.transform_interface import TransformInterface\nfrom espnet.utils.check_kwargs import check_kwargs\n\n\nclass FuncTrans(TransformInterface):\n    \"\"\"Functional Transformation\n\n    WARNING:\n        Builtin or C/C++ functions may not work properly\n        because this class heavily depends on the `inspect` module.\n\n    Usage:\n\n    >>> def foo_bar(x, a=1, b=2):\n    ...     '''Foo bar\n    ...     :param x: input\n    ...     :param int a: default 1\n    ...     :param int b: default 2\n    ...     '''\n    ...     return x + a - b\n\n\n    >>> class FooBar(FuncTrans):\n    ...     _func = foo_bar\n    ...     __doc__ = foo_bar.__doc__\n    \"\"\"\n\n    _func = None\n\n    def __init__(self, **kwargs):\n        self.kwargs = kwargs\n        check_kwargs(self.func, kwargs)\n\n    def __call__(self, x):\n        return self.func(x, **self.kwargs)\n\n    @classmethod\n    def add_arguments(cls, parser):\n        fname = cls._func.__name__.replace(\"_\", \"-\")\n        group = parser.add_argument_group(fname + \" transformation setting\")\n        for k, v in cls.default_params().items():\n            # TODO(karita): get help and choices from docstring?\n            attr = k.replace(\"_\", \"-\")\n            group.add_argument(f\"--{fname}-{attr}\", default=v, type=type(v))\n        return parser\n\n    @property\n    def func(self):\n        return type(self)._func\n\n    @classmethod\n    def default_params(cls):\n        try:\n            d = dict(inspect.signature(cls._func).parameters)\n        except ValueError:\n            d = dict()\n        return {\n            k: v.default for k, v in d.items() if v.default != inspect.Parameter.empty\n        }\n\n    def __repr__(self):\n        params = self.default_params()\n        params.update(**self.kwargs)\n        ret = self.__class__.__name__ + \"(\"\n        if len(params) == 0:\n            return ret + \")\"\n        for k, v in params.items():\n 
           ret += \"{}={}, \".format(k, v)\n        return ret[:-2] + \")\"\n"
  },
  {
    "path": "transform/perturb.py",
    "content": "import librosa\nimport numpy\nimport scipy\nimport soundfile\n\nfrom espnet.utils.io_utils import SoundHDF5File\n\n\nclass SpeedPerturbation(object):\n    \"\"\"SpeedPerturbation\n\n    The speed perturbation in kaldi uses sox-speed instead of sox-tempo,\n    and sox-speed just to resample the input,\n    i.e pitch and tempo are changed both.\n\n    \"Why use speed option instead of tempo -s in SoX for speed perturbation\"\n    https://groups.google.com/forum/#!topic/kaldi-help/8OOG7eE4sZ8\n\n    Warning:\n        This function is very slow because of resampling.\n        I recommmend to apply speed-perturb outside the training using sox.\n\n    \"\"\"\n\n    def __init__(\n        self,\n        lower=0.9,\n        upper=1.1,\n        utt2ratio=None,\n        keep_length=True,\n        res_type=\"kaiser_best\",\n        seed=None,\n    ):\n        self.res_type = res_type\n        self.keep_length = keep_length\n        self.state = numpy.random.RandomState(seed)\n\n        if utt2ratio is not None:\n            self.utt2ratio = {}\n            # Use the scheduled ratio for each utterances\n            self.utt2ratio_file = utt2ratio\n            self.lower = None\n            self.upper = None\n            self.accept_uttid = True\n\n            with open(utt2ratio, \"r\") as f:\n                for line in f:\n                    utt, ratio = line.rstrip().split(None, 1)\n                    ratio = float(ratio)\n                    self.utt2ratio[utt] = ratio\n        else:\n            self.utt2ratio = None\n            # The ratio is given on runtime randomly\n            self.lower = lower\n            self.upper = upper\n\n    def __repr__(self):\n        if self.utt2ratio is None:\n            return \"{}(lower={}, upper={}, \" \"keep_length={}, res_type={})\".format(\n                self.__class__.__name__,\n                self.lower,\n                self.upper,\n                self.keep_length,\n                self.res_type,\n        
    )\n        else:\n            return \"{}({}, res_type={})\".format(\n                self.__class__.__name__, self.utt2ratio_file, self.res_type\n            )\n\n    def __call__(self, x, uttid=None, train=True):\n        if not train:\n            return x\n\n        x = x.astype(numpy.float32)\n        # self.accept_uttid is only set when utt2ratio is given,\n        # so test self.utt2ratio, which always exists.\n        if self.utt2ratio is not None:\n            ratio = self.utt2ratio[uttid]\n        else:\n            ratio = self.state.uniform(self.lower, self.upper)\n\n        # Note1: resample requires the sampling-rate of input and output,\n        #        but actually only the ratio is used.\n        y = librosa.resample(x, orig_sr=ratio, target_sr=1, res_type=self.res_type)\n\n        if self.keep_length:\n            diff = abs(len(x) - len(y))\n            if len(y) > len(x):\n                # Truncate the resampled signal\n                y = y[diff // 2 : -((diff + 1) // 2)]\n            elif len(y) < len(x):\n                # Assume the time-axis is the first: (Time, Channel)\n                pad_width = [(diff // 2, (diff + 1) // 2)] + [\n                    (0, 0) for _ in range(y.ndim - 1)\n                ]\n                y = numpy.pad(\n                    y, pad_width=pad_width, constant_values=0, mode=\"constant\"\n                )\n        return y\n\n\nclass BandpassPerturbation(object):\n    \"\"\"BandpassPerturbation\n\n    Randomly drop out bins along the frequency axis.\n\n    The original idea comes from the following:\n        \"randomly-selected frequency band was cut off under the constraint of\n         leaving at least 1,000 Hz band within the range of less than 4,000Hz.\"\n        (The Hitachi/JHU CHiME-5 system: Advances in speech recognition for\n         everyday home environments using multiple microphone arrays;\n         http://spandh.dcs.shef.ac.uk/chime_workshop/papers/CHiME_2018_paper_kanda.pdf)\n\n    \"\"\"\n\n    def __init__(self, lower=0.0, upper=0.75, seed=None, axes=(-1,)):\n        self.lower = lower\n        self.upper = upper\n        self.state = 
numpy.random.RandomState(seed)\n        # x_stft: (Time, Channel, Freq)\n        self.axes = axes\n\n    def __repr__(self):\n        return \"{}(lower={}, upper={})\".format(\n            self.__class__.__name__, self.lower, self.upper\n        )\n\n    def __call__(self, x_stft, uttid=None, train=True):\n        if not train:\n            return x_stft\n\n        if x_stft.ndim == 1:\n            raise RuntimeError(\n                \"Input in time-freq domain: \" \"(Time, Channel, Freq) or (Time, Freq)\"\n            )\n\n        ratio = self.state.uniform(self.lower, self.upper)\n        # Map negative axis indices to non-negative ones (e.g. -1 -> ndim - 1)\n        axes = [i if i >= 0 else x_stft.ndim + i for i in self.axes]\n        shape = [s if i in axes else 1 for i, s in enumerate(x_stft.shape)]\n\n        mask = self.state.randn(*shape) > ratio\n        x_stft *= mask\n        return x_stft\n\n\nclass VolumePerturbation(object):\n    def __init__(self, lower=-1.6, upper=1.6, utt2ratio=None, dbunit=True, seed=None):\n        self.dbunit = dbunit\n        self.utt2ratio_file = utt2ratio\n        self.lower = lower\n        self.upper = upper\n        self.state = numpy.random.RandomState(seed)\n\n        if utt2ratio is not None:\n            # Use the scheduled ratio for each utterance\n            self.utt2ratio = {}\n            self.lower = None\n            self.upper = None\n            self.accept_uttid = True\n\n            with open(utt2ratio, \"r\") as f:\n                for line in f:\n                    utt, ratio = line.rstrip().split(None, 1)\n                    ratio = float(ratio)\n                    self.utt2ratio[utt] = ratio\n        else:\n            # The ratio is sampled randomly at runtime\n            self.utt2ratio = None\n\n    def __repr__(self):\n        if self.utt2ratio is None:\n            return \"{}(lower={}, upper={}, dbunit={})\".format(\n                self.__class__.__name__, self.lower, self.upper, self.dbunit\n            )\n        else:\n            return '{}(\"{}\", 
dbunit={})'.format(\n                self.__class__.__name__, self.utt2ratio_file, self.dbunit\n            )\n\n    def __call__(self, x, uttid=None, train=True):\n        if not train:\n            return x\n\n        x = x.astype(numpy.float32)\n\n        # self.accept_uttid is only set when utt2ratio is given,\n        # so test self.utt2ratio, which always exists.\n        if self.utt2ratio is not None:\n            ratio = self.utt2ratio[uttid]\n        else:\n            ratio = self.state.uniform(self.lower, self.upper)\n        if self.dbunit:\n            ratio = 10 ** (ratio / 20)\n        return x * ratio\n\n\nclass NoiseInjection(object):\n    \"\"\"Add isotropic noise\"\"\"\n\n    def __init__(\n        self,\n        utt2noise=None,\n        lower=-20,\n        upper=-5,\n        utt2ratio=None,\n        filetype=\"list\",\n        dbunit=True,\n        seed=None,\n    ):\n        self.utt2noise_file = utt2noise\n        self.utt2ratio_file = utt2ratio\n        self.filetype = filetype\n        self.dbunit = dbunit\n        self.lower = lower\n        self.upper = upper\n        self.state = numpy.random.RandomState(seed)\n\n        if utt2ratio is not None:\n            # Use the scheduled ratio for each utterance\n            self.utt2ratio = {}\n            # Read the ratios from the utt2ratio file (not from utt2noise)\n            with open(utt2ratio, \"r\") as f:\n                for line in f:\n                    utt, snr = line.rstrip().split(None, 1)\n                    snr = float(snr)\n                    self.utt2ratio[utt] = snr\n        else:\n            # The ratio is sampled randomly at runtime\n            self.utt2ratio = None\n\n        if utt2noise is not None:\n            self.utt2noise = {}\n            if filetype == \"list\":\n                with open(utt2noise, \"r\") as f:\n                    for line in f:\n                        utt, filename = line.rstrip().split(None, 1)\n                        signal, rate = soundfile.read(filename, dtype=\"int16\")\n                        # Load all files in memory\n                        self.utt2noise[utt] = (signal, rate)\n\n            elif filetype == \"sound.hdf5\":\n        
        self.utt2noise = SoundHDF5File(utt2noise, \"r\")\n            else:\n                raise ValueError(filetype)\n        else:\n            self.utt2noise = None\n\n        if utt2noise is not None and utt2ratio is not None:\n            if set(self.utt2ratio) != set(self.utt2noise):\n                raise RuntimeError(\n                    \"The uttids mismatch between {} and {}\".format(utt2ratio, utt2noise)\n                )\n\n    def __repr__(self):\n        if self.utt2ratio is None:\n            return \"{}(lower={}, upper={}, dbunit={})\".format(\n                self.__class__.__name__, self.lower, self.upper, self.dbunit\n            )\n        else:\n            return '{}(\"{}\", dbunit={})'.format(\n                self.__class__.__name__, self.utt2ratio_file, self.dbunit\n            )\n\n    def __call__(self, x, uttid=None, train=True):\n        if not train:\n            return x\n        x = x.astype(numpy.float32)\n\n        # 1. Get ratio of noise to signal in sound pressure level\n        if uttid is not None and self.utt2ratio is not None:\n            ratio = self.utt2ratio[uttid]\n        else:\n            ratio = self.state.uniform(self.lower, self.upper)\n\n        if self.dbunit:\n            ratio = 10 ** (ratio / 20)\n        scale = ratio * numpy.sqrt((x ** 2).mean())\n\n        # 2. 
Get noise\n        if self.utt2noise is not None:\n            # Get noise from the external source\n            if uttid is not None:\n                noise, rate = self.utt2noise[uttid]\n            else:\n                # Randomly select the noise source; entries are (signal, rate)\n                keys = list(self.utt2noise.keys())\n                key = keys[self.state.randint(len(keys))]\n                noise, rate = self.utt2noise[key]\n            # Normalize the level (cast first: the noise is loaded as int16)\n            noise = noise.astype(numpy.float32)\n            noise /= numpy.sqrt((noise ** 2).mean())\n\n            # Adjust the noise length\n            diff = abs(len(x) - len(noise))\n            # The upper bound of randint is exclusive; diff + 1 also covers diff == 0\n            offset = self.state.randint(0, diff + 1)\n            if len(noise) > len(x):\n                # Truncate noise\n                noise = noise[offset : offset + len(x)]\n            else:\n                noise = numpy.pad(noise, pad_width=[offset, diff - offset], mode=\"wrap\")\n\n        else:\n            # Generate white noise\n            noise = self.state.normal(0, 1, x.shape)\n\n        # 3. Add noise to signal\n        return x + noise * scale\n\n\nclass RIRConvolve(object):\n    def __init__(self, utt2rir, filetype=\"list\"):\n        self.utt2rir_file = utt2rir\n        self.filetype = filetype\n\n        self.utt2rir = {}\n        if filetype == \"list\":\n            with open(utt2rir, \"r\") as f:\n                for line in f:\n                    utt, filename = line.rstrip().split(None, 1)\n                    signal, rate = soundfile.read(filename, dtype=\"int16\")\n                    self.utt2rir[utt] = (signal, rate)\n\n        elif filetype == \"sound.hdf5\":\n            self.utt2rir = SoundHDF5File(utt2rir, \"r\")\n        else:\n            raise NotImplementedError(filetype)\n\n    def __repr__(self):\n        return '{}(\"{}\")'.format(self.__class__.__name__, self.utt2rir_file)\n\n    def __call__(self, x, uttid=None, train=True):\n        if not train:\n            return x\n\n        x = x.astype(numpy.float32)\n\n        if x.ndim != 1:\n            # Must be single channel\n            raise RuntimeError(\n                
\"Input x must be one dimensional array, but got {}\".format(x.shape)\n            )\n\n        rir, rate = self.utt2rir[uttid]\n        if rir.ndim == 2:\n            # FIXME(kamo): Use chainer.convolution_1d?\n            # rir: [Time, Channel] as returned by soundfile, so iterate over rir.T\n            # return [Time, Channel]\n            # NOTE: scipy.convolve was an alias of numpy.convolve\n            #       and is no longer available in recent SciPy.\n            return numpy.stack(\n                [numpy.convolve(x, r, mode=\"same\") for r in rir.T], axis=-1\n            )\n        else:\n            return numpy.convolve(x, rir, mode=\"same\")\n"
  },
  {
    "path": "transform/spec_augment.py",
    "content": "\"\"\"Spec Augment module for preprocessing i.e., data augmentation\"\"\"\n\nimport random\n\nimport numpy\nfrom PIL import Image\nfrom PIL.Image import BICUBIC\n\nfrom espnet.transform.functional import FuncTrans\n\n\ndef time_warp(x, max_time_warp=80, inplace=False, mode=\"PIL\"):\n    \"\"\"time warp for spec augment\n\n    move random center frame by the random width ~ uniform(-window, window)\n    :param numpy.ndarray x: spectrogram (time, freq)\n    :param int max_time_warp: maximum time frames to warp\n    :param bool inplace: overwrite x with the result\n    :param str mode: \"PIL\" (default, fast, not differentiable) or \"sparse_image_warp\"\n        (slow, differentiable)\n    :returns numpy.ndarray: time warped spectrogram (time, freq)\n    \"\"\"\n    window = max_time_warp\n    if mode == \"PIL\":\n        t = x.shape[0]\n        if t - window <= window:\n            return x\n        # NOTE: randrange(a, b) emits a, a + 1, ..., b - 1\n        center = random.randrange(window, t - window)\n        warped = random.randrange(center - window, center + window) + 1  # 1 ... 
t - 1\n\n        left = Image.fromarray(x[:center]).resize((x.shape[1], warped), BICUBIC)\n        right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped), BICUBIC)\n        if inplace:\n            x[:warped] = left\n            x[warped:] = right\n            return x\n        return numpy.concatenate((left, right), 0)\n    elif mode == \"sparse_image_warp\":\n        import torch\n\n        from espnet.utils import spec_augment\n\n        # TODO(karita): make this differentiable again\n        return spec_augment.time_warp(torch.from_numpy(x), window).numpy()\n    else:\n        raise NotImplementedError(\n            \"unknown resize mode: \"\n            + mode\n            + \", choose one from (PIL, sparse_image_warp).\"\n        )\n\n\nclass TimeWarp(FuncTrans):\n    _func = time_warp\n    __doc__ = time_warp.__doc__\n\n    def __call__(self, x, train):\n        if not train:\n            return x\n        return super().__call__(x)\n\n\ndef freq_mask(x, F=30, n_mask=2, replace_with_zero=True, inplace=False):\n    \"\"\"freq mask for spec augment\n\n    :param numpy.ndarray x: (time, freq)\n    :param int F: exclusive upper bound on the width of each mask\n    :param int n_mask: the number of masks\n    :param bool inplace: overwrite\n    :param bool replace_with_zero: pad zero on mask if true else use mean\n    \"\"\"\n    if inplace:\n        cloned = x\n    else:\n        cloned = x.copy()\n\n    num_mel_channels = cloned.shape[1]\n    fs = numpy.random.randint(0, F, size=(n_mask, 2))\n\n    for f, mask_end in fs:\n        f_zero = random.randrange(0, num_mel_channels - f)\n        mask_end += f_zero\n\n        # skip this mask when the sampled value f is zero\n        if f_zero == f_zero + f:\n            continue\n\n        if replace_with_zero:\n            cloned[:, f_zero:mask_end] = 0\n        else:\n            cloned[:, f_zero:mask_end] = cloned.mean()\n    return cloned\n\n\nclass FreqMask(FuncTrans):\n    _func = freq_mask\n    __doc__ = freq_mask.__doc__\n\n    def __call__(self, x, 
train):\n        if not train:\n            return x\n        return super().__call__(x)\n\n\ndef time_mask(spec, T=40, n_mask=2, replace_with_zero=True, inplace=False):\n    \"\"\"time mask for spec augment\n\n    :param numpy.ndarray spec: (time, freq)\n    :param int T: exclusive upper bound on the width of each mask\n    :param int n_mask: the number of masks\n    :param bool inplace: overwrite\n    :param bool replace_with_zero: pad zero on mask if true else use mean\n    \"\"\"\n    if inplace:\n        cloned = spec\n    else:\n        cloned = spec.copy()\n    len_spectro = cloned.shape[0]\n    ts = numpy.random.randint(0, T, size=(n_mask, 2))\n    for t, mask_end in ts:\n        # avoid randrange range error\n        if len_spectro - t <= 0:\n            continue\n        t_zero = random.randrange(0, len_spectro - t)\n\n        # skip this mask when the sampled value t is zero\n        if t_zero == t_zero + t:\n            continue\n\n        mask_end += t_zero\n        if replace_with_zero:\n            cloned[t_zero:mask_end] = 0\n        else:\n            cloned[t_zero:mask_end] = cloned.mean()\n    return cloned\n\n\nclass TimeMask(FuncTrans):\n    _func = time_mask\n    __doc__ = time_mask.__doc__\n\n    def __call__(self, x, train):\n        if not train:\n            return x\n        return super().__call__(x)\n\n\ndef spec_augment(\n    x,\n    resize_mode=\"PIL\",\n    max_time_warp=80,\n    max_freq_width=27,\n    n_freq_mask=2,\n    max_time_width=100,\n    n_time_mask=2,\n    inplace=True,\n    replace_with_zero=True,\n):\n    \"\"\"spec augment\n\n    apply random time warping and time/freq masking\n    default setting is based on LD (Librispeech double) in Table 2\n        https://arxiv.org/pdf/1904.08779.pdf\n\n    :param numpy.ndarray x: (time, freq)\n    :param str resize_mode: \"PIL\" (fast, nondifferentiable) or \"sparse_image_warp\"\n        (slow, differentiable)\n    :param int max_time_warp: maximum frames to warp the center frame in spectrogram (W)\n    :param int max_freq_width: 
maximum width of the random freq mask (F)\n    :param int n_freq_mask: the number of the random freq mask (m_F)\n    :param int max_time_width: maximum width of the random time mask (T)\n    :param int n_time_mask: the number of the random time mask (m_T)\n    :param bool inplace: overwrite intermediate array\n    :param bool replace_with_zero: pad zero on mask if true else use mean\n    \"\"\"\n    assert isinstance(x, numpy.ndarray)\n    assert x.ndim == 2\n    x = time_warp(x, max_time_warp, inplace=inplace, mode=resize_mode)\n    x = freq_mask(\n        x,\n        max_freq_width,\n        n_freq_mask,\n        inplace=inplace,\n        replace_with_zero=replace_with_zero,\n    )\n    x = time_mask(\n        x,\n        max_time_width,\n        n_time_mask,\n        inplace=inplace,\n        replace_with_zero=replace_with_zero,\n    )\n    return x\n\n\nclass SpecAugment(FuncTrans):\n    _func = spec_augment\n    __doc__ = spec_augment.__doc__\n\n    def __call__(self, x, train):\n        if not train:\n            return x\n        return super().__call__(x)\n"
  },
  {
    "path": "transform/spectrogram.py",
    "content": "import librosa\nimport numpy as np\n\n\ndef stft(\n    x, n_fft, n_shift, win_length=None, window=\"hann\", center=True, pad_mode=\"reflect\"\n):\n    # x: [Time, Channel]\n    if x.ndim == 1:\n        single_channel = True\n        # x: [Time] -> [Time, Channel]\n        x = x[:, None]\n    else:\n        single_channel = False\n    x = x.astype(np.float32)\n\n    # FIXME(kamo): librosa.stft can't use multi-channel?\n    # x: [Time, Channel, Freq]\n    x = np.stack(\n        [\n            librosa.stft(\n                x[:, ch],\n                n_fft=n_fft,\n                hop_length=n_shift,\n                win_length=win_length,\n                window=window,\n                center=center,\n                pad_mode=pad_mode,\n            ).T\n            for ch in range(x.shape[1])\n        ],\n        axis=1,\n    )\n\n    if single_channel:\n        # x: [Time, Channel, Freq] -> [Time, Freq]\n        x = x[:, 0]\n    return x\n\n\ndef istft(x, n_shift, win_length=None, window=\"hann\", center=True):\n    # x: [Time, Channel, Freq]\n    if x.ndim == 2:\n        single_channel = True\n        # x: [Time, Freq] -> [Time, Channel, Freq]\n        x = x[:, None, :]\n    else:\n        single_channel = False\n\n    # x: [Time, Channel]\n    x = np.stack(\n        [\n            librosa.istft(\n                x[:, ch].T,  # [Time, Freq] -> [Freq, Time]\n                hop_length=n_shift,\n                win_length=win_length,\n                window=window,\n                center=center,\n            )\n            for ch in range(x.shape[1])\n        ],\n        axis=1,\n    )\n\n    if single_channel:\n        # x: [Time, Channel] -> [Time]\n        x = x[:, 0]\n    return x\n\n\ndef stft2logmelspectrogram(x_stft, fs, n_mels, n_fft, fmin=None, fmax=None, eps=1e-10):\n    # x_stft: (Time, Channel, Freq) or (Time, Freq)\n    fmin = 0 if fmin is None else fmin\n    fmax = fs / 2 if fmax is None else fmax\n\n    # spc: (Time, Channel, Freq) or 
(Time, Freq)\n    spc = np.abs(x_stft)\n    # mel_basis: (Mel_freq, Freq)\n    # Keyword arguments: recent librosa no longer accepts these positionally\n    mel_basis = librosa.filters.mel(\n        sr=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax\n    )\n    # lmspc: (Time, Channel, Mel_freq) or (Time, Mel_freq)\n    lmspc = np.log10(np.maximum(eps, np.dot(spc, mel_basis.T)))\n\n    return lmspc\n\n\ndef spectrogram(x, n_fft, n_shift, win_length=None, window=\"hann\"):\n    # x: (Time, Channel) -> spc: (Time, Channel, Freq)\n    spc = np.abs(stft(x, n_fft, n_shift, win_length, window=window))\n    return spc\n\n\ndef logmelspectrogram(\n    x,\n    fs,\n    n_mels,\n    n_fft,\n    n_shift,\n    win_length=None,\n    window=\"hann\",\n    fmin=None,\n    fmax=None,\n    eps=1e-10,\n    pad_mode=\"reflect\",\n):\n    # stft: (Time, Channel, Freq) or (Time, Freq)\n    x_stft = stft(\n        x,\n        n_fft=n_fft,\n        n_shift=n_shift,\n        win_length=win_length,\n        window=window,\n        pad_mode=pad_mode,\n    )\n\n    return stft2logmelspectrogram(\n        x_stft, fs=fs, n_mels=n_mels, n_fft=n_fft, fmin=fmin, fmax=fmax, eps=eps\n    )\n\n\nclass Spectrogram(object):\n    def __init__(self, n_fft, n_shift, win_length=None, window=\"hann\"):\n        self.n_fft = n_fft\n        self.n_shift = n_shift\n        self.win_length = win_length\n        self.window = window\n\n    def __repr__(self):\n        return (\n            \"{name}(n_fft={n_fft}, n_shift={n_shift}, \"\n            \"win_length={win_length}, window={window})\".format(\n                name=self.__class__.__name__,\n                n_fft=self.n_fft,\n                n_shift=self.n_shift,\n                win_length=self.win_length,\n                window=self.window,\n            )\n        )\n\n    def __call__(self, x):\n        return spectrogram(\n            x,\n            n_fft=self.n_fft,\n            n_shift=self.n_shift,\n            win_length=self.win_length,\n            window=self.window,\n        )\n\n\nclass LogMelSpectrogram(object):\n    def __init__(\n        self,\n      
  fs,\n        n_mels,\n        n_fft,\n        n_shift,\n        win_length=None,\n        window=\"hann\",\n        fmin=None,\n        fmax=None,\n        eps=1e-10,\n    ):\n        self.fs = fs\n        self.n_mels = n_mels\n        self.n_fft = n_fft\n        self.n_shift = n_shift\n        self.win_length = win_length\n        self.window = window\n        self.fmin = fmin\n        self.fmax = fmax\n        self.eps = eps\n\n    def __repr__(self):\n        return (\n            \"{name}(fs={fs}, n_mels={n_mels}, n_fft={n_fft}, \"\n            \"n_shift={n_shift}, win_length={win_length}, window={window}, \"\n            \"fmin={fmin}, fmax={fmax}, eps={eps})\".format(\n                name=self.__class__.__name__,\n                fs=self.fs,\n                n_mels=self.n_mels,\n                n_fft=self.n_fft,\n                n_shift=self.n_shift,\n                win_length=self.win_length,\n                window=self.window,\n                fmin=self.fmin,\n                fmax=self.fmax,\n                eps=self.eps,\n            )\n        )\n\n    def __call__(self, x):\n        # Forward fmin, fmax and eps as well; they were stored but never used\n        return logmelspectrogram(\n            x,\n            fs=self.fs,\n            n_mels=self.n_mels,\n            n_fft=self.n_fft,\n            n_shift=self.n_shift,\n            win_length=self.win_length,\n            window=self.window,\n            fmin=self.fmin,\n            fmax=self.fmax,\n            eps=self.eps,\n        )\n\n\nclass Stft2LogMelSpectrogram(object):\n    def __init__(self, fs, n_mels, n_fft, fmin=None, fmax=None, eps=1e-10):\n        self.fs = fs\n        self.n_mels = n_mels\n        self.n_fft = n_fft\n        self.fmin = fmin\n        self.fmax = fmax\n        self.eps = eps\n\n    def __repr__(self):\n        return (\n            \"{name}(fs={fs}, n_mels={n_mels}, n_fft={n_fft}, \"\n            \"fmin={fmin}, fmax={fmax}, eps={eps})\".format(\n                name=self.__class__.__name__,\n                fs=self.fs,\n                n_mels=self.n_mels,\n                n_fft=self.n_fft,\n                
fmin=self.fmin,\n                fmax=self.fmax,\n                eps=self.eps,\n            )\n        )\n\n    def __call__(self, x):\n        # Forward eps as well; it was stored but never used\n        return stft2logmelspectrogram(\n            x,\n            fs=self.fs,\n            n_mels=self.n_mels,\n            n_fft=self.n_fft,\n            fmin=self.fmin,\n            fmax=self.fmax,\n            eps=self.eps,\n        )\n\n\nclass Stft(object):\n    def __init__(\n        self,\n        n_fft,\n        n_shift,\n        win_length=None,\n        window=\"hann\",\n        center=True,\n        pad_mode=\"reflect\",\n    ):\n        self.n_fft = n_fft\n        self.n_shift = n_shift\n        self.win_length = win_length\n        self.window = window\n        self.center = center\n        self.pad_mode = pad_mode\n\n    def __repr__(self):\n        return (\n            \"{name}(n_fft={n_fft}, n_shift={n_shift}, \"\n            \"win_length={win_length}, window={window}, \"\n            \"center={center}, pad_mode={pad_mode})\".format(\n                name=self.__class__.__name__,\n                n_fft=self.n_fft,\n                n_shift=self.n_shift,\n                win_length=self.win_length,\n                window=self.window,\n                center=self.center,\n                pad_mode=self.pad_mode,\n            )\n        )\n\n    def __call__(self, x):\n        return stft(\n            x,\n            self.n_fft,\n            self.n_shift,\n            win_length=self.win_length,\n            window=self.window,\n            center=self.center,\n            pad_mode=self.pad_mode,\n        )\n\n\nclass IStft(object):\n    def __init__(self, n_shift, win_length=None, window=\"hann\", center=True):\n        self.n_shift = n_shift\n        self.win_length = win_length\n        self.window = window\n        self.center = center\n\n    def __repr__(self):\n        return (\n            \"{name}(n_shift={n_shift}, \"\n            \"win_length={win_length}, window={window}, \"\n            \"center={center})\".format(\n                
name=self.__class__.__name__,\n                n_shift=self.n_shift,\n                win_length=self.win_length,\n                window=self.window,\n                center=self.center,\n            )\n        )\n\n    def __call__(self, x):\n        return istft(\n            x,\n            self.n_shift,\n            win_length=self.win_length,\n            window=self.window,\n            center=self.center,\n        )\n"
  },
  {
    "path": "transform/transform_interface.py",
    "content": "# TODO(karita): add this to all the transform impl.\nclass TransformInterface:\n    \"\"\"Transform Interface\"\"\"\n\n    def __call__(self, x):\n        raise NotImplementedError(\"__call__ method is not implemented\")\n\n    @classmethod\n    def add_arguments(cls, parser):\n        return parser\n\n    def __repr__(self):\n        return self.__class__.__name__ + \"()\"\n\n\nclass Identity(TransformInterface):\n    \"\"\"Identity Function\"\"\"\n\n    def __call__(self, x):\n        return x\n"
  },
  {
    "path": "transform/transformation.py",
    "content": "from collections import OrderedDict\nimport copy\nimport io\nimport logging\nimport sys\n\nimport yaml\n\nfrom espnet.utils.dynamic_import import dynamic_import\n\n\nPY2 = sys.version_info[0] == 2\n\nif PY2:\n    from collections import Sequence\n    from funcsigs import signature\nelse:\n    # The ABCs from 'collections' will stop working in 3.8\n    from collections.abc import Sequence\n    from inspect import signature\n\n\n# TODO(karita): inherit TransformInterface\n# TODO(karita): register cmd arguments in asr_train.py\nimport_alias = dict(\n    identity=\"espnet.transform.transform_interface:Identity\",\n    time_warp=\"espnet.transform.spec_augment:TimeWarp\",\n    time_mask=\"espnet.transform.spec_augment:TimeMask\",\n    freq_mask=\"espnet.transform.spec_augment:FreqMask\",\n    spec_augment=\"espnet.transform.spec_augment:SpecAugment\",\n    speed_perturbation=\"espnet.transform.perturb:SpeedPerturbation\",\n    volume_perturbation=\"espnet.transform.perturb:VolumePerturbation\",\n    noise_injection=\"espnet.transform.perturb:NoiseInjection\",\n    bandpass_perturbation=\"espnet.transform.perturb:BandpassPerturbation\",\n    rir_convolve=\"espnet.transform.perturb:RIRConvolve\",\n    delta=\"espnet.transform.add_deltas:AddDeltas\",\n    cmvn=\"espnet.transform.cmvn:CMVN\",\n    utterance_cmvn=\"espnet.transform.cmvn:UtteranceCMVN\",\n    fbank=\"espnet.transform.spectrogram:LogMelSpectrogram\",\n    spectrogram=\"espnet.transform.spectrogram:Spectrogram\",\n    stft=\"espnet.transform.spectrogram:Stft\",\n    istft=\"espnet.transform.spectrogram:IStft\",\n    stft2fbank=\"espnet.transform.spectrogram:Stft2LogMelSpectrogram\",\n    wpe=\"espnet.transform.wpe:WPE\",\n    channel_selector=\"espnet.transform.channel_selector:ChannelSelector\",\n)\n\n\nclass Transformation(object):\n    \"\"\"Apply some functions to the mini-batch\n\n    Examples:\n        >>> kwargs = {\"process\": [{\"type\": \"fbank\",\n        ...                        
\"n_mels\": 80,\n        ...                        \"fs\": 16000},\n        ...                       {\"type\": \"cmvn\",\n        ...                        \"stats\": \"data/train/cmvn.ark\",\n        ...                        \"norm_vars\": True},\n        ...                       {\"type\": \"delta\", \"window\": 2, \"order\": 2}]}\n        >>> transform = Transformation(kwargs)\n        >>> bs = 10\n        >>> xs = [np.random.randn(100, 80).astype(np.float32)\n        ...       for _ in range(bs)]\n        >>> xs = transform(xs)\n    \"\"\"\n\n    def __init__(self, conffile=None):\n        if conffile is not None:\n            if isinstance(conffile, dict):\n                self.conf = copy.deepcopy(conffile)\n            else:\n                with io.open(conffile, encoding=\"utf-8\") as f:\n                    self.conf = yaml.safe_load(f)\n                    assert isinstance(self.conf, dict), type(self.conf)\n        else:\n            self.conf = {\"mode\": \"sequential\", \"process\": []}\n\n        self.functions = OrderedDict()\n        if self.conf.get(\"mode\", \"sequential\") == \"sequential\":\n            for idx, process in enumerate(self.conf[\"process\"]):\n                assert isinstance(process, dict), type(process)\n                opts = dict(process)\n                process_type = opts.pop(\"type\")\n                class_obj = dynamic_import(process_type, import_alias)\n                # TODO(karita): assert issubclass(class_obj, TransformInterface)\n                try:\n                    self.functions[idx] = class_obj(**opts)\n                except TypeError:\n                    try:\n                        signa = signature(class_obj)\n                    except ValueError:\n                        # Some function, e.g. 
built-in function, are failed\n                        pass\n                    else:\n                        logging.error(\n                            \"Expected signature: {}({})\".format(\n                                class_obj.__name__, signa\n                            )\n                        )\n                    raise\n        else:\n            raise NotImplementedError(\n                \"Not supporting mode={}\".format(self.conf[\"mode\"])\n            )\n\n    def __repr__(self):\n        rep = \"\\n\" + \"\\n\".join(\n            \"    {}: {}\".format(k, v) for k, v in self.functions.items()\n        )\n        return \"{}({})\".format(self.__class__.__name__, rep)\n\n    def __call__(self, xs, uttid_list=None, **kwargs):\n        \"\"\"Return new mini-batch\n\n        :param Union[Sequence[np.ndarray], np.ndarray] xs:\n        :param Union[Sequence[str], str] uttid_list:\n        :return: batch:\n        :rtype: List[np.ndarray]\n        \"\"\"\n        if not isinstance(xs, Sequence):\n            is_batch = False\n            xs = [xs]\n        else:\n            is_batch = True\n\n        if isinstance(uttid_list, str):\n            uttid_list = [uttid_list for _ in range(len(xs))]\n\n        if self.conf.get(\"mode\", \"sequential\") == \"sequential\":\n            for idx in range(len(self.conf[\"process\"])):\n                func = self.functions[idx]\n                # TODO(karita): use TrainingTrans and UttTrans to check __call__ args\n                # Derive only the args which the func has\n                try:\n                    param = signature(func).parameters\n                except ValueError:\n                    # Some function, e.g. 
built-in function, are failed\n                    param = {}\n                _kwargs = {k: v for k, v in kwargs.items() if k in param}\n                try:\n                    if uttid_list is not None and \"uttid\" in param:\n                        xs = [func(x, u, **_kwargs) for x, u in zip(xs, uttid_list)]\n                    else:\n                        xs = [func(x, **_kwargs) for x in xs]\n                except Exception:\n                    logging.fatal(\n                        \"Caught an exception from the {}th func: {}\".format(idx, func)\n                    )\n                    raise\n        else:\n            raise NotImplementedError(\n                \"Not supporting mode={}\".format(self.conf[\"mode\"])\n            )\n\n        if is_batch:\n            return xs\n        else:\n            return xs[0]\n"
  },
  {
    "path": "transform/wpe.py",
    "content": "from nara_wpe.wpe import wpe\n\n\nclass WPE(object):\n    def __init__(\n        self, taps=10, delay=3, iterations=3, psd_context=0, statistics_mode=\"full\"\n    ):\n        self.taps = taps\n        self.delay = delay\n        self.iterations = iterations\n        self.psd_context = psd_context\n        self.statistics_mode = statistics_mode\n\n    def __repr__(self):\n        return (\n            \"{name}(taps={taps}, delay={delay}\"\n            \"iterations={iterations}, psd_context={psd_context}, \"\n            \"statistics_mode={statistics_mode})\".format(\n                name=self.__class__.__name__,\n                taps=self.taps,\n                delay=self.delay,\n                iterations=self.iterations,\n                psd_context=self.psd_context,\n                statistics_mode=self.statistics_mode,\n            )\n        )\n\n    def __call__(self, xs):\n        \"\"\"Return enhanced\n\n        :param np.ndarray xs: (Time, Channel, Frequency)\n        :return: enhanced_xs\n        :rtype: np.ndarray\n\n        \"\"\"\n        # nara_wpe.wpe: (F, C, T)\n        xs = wpe(\n            xs.transpose((2, 1, 0)),\n            taps=self.taps,\n            delay=self.delay,\n            iterations=self.iterations,\n            psd_context=self.psd_context,\n            statistics_mode=self.statistics_mode,\n        )\n        return xs.transpose(2, 1, 0)\n"
  },
  {
    "path": "tts/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "tts/pytorch_backend/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "tts/pytorch_backend/tts.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"E2E-TTS training / decoding functions.\"\"\"\n\nimport copy\nimport json\nimport logging\nimport math\nimport os\nimport time\n\nimport chainer\nimport kaldiio\nimport numpy as np\nimport torch\n\nfrom chainer import training\nfrom chainer.training import extensions\n\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_modules\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.dataset import ChainerDataLoader\nfrom espnet.utils.dataset import TransformDataset\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.evaluator import BaseEvaluator\n\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\nfrom espnet.utils.training.iterators import ShufflingEnabler\n\nimport matplotlib\n\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom tensorboardX import SummaryWriter\n\nmatplotlib.use(\"Agg\")\n\n\nclass CustomEvaluator(BaseEvaluator):\n    \"\"\"Custom evaluator.\"\"\"\n\n    def __init__(self, model, iterator, target, device):\n        \"\"\"Initilize module.\n\n        Args:\n            model (torch.nn.Module): Pytorch model instance.\n            iterator (chainer.dataset.Iterator): Iterator for validation.\n            target (chainer.Chain): 
Dummy chain instance.\n            device (torch.device): The device to be used in evaluation.\n\n        \"\"\"\n        super(CustomEvaluator, self).__init__(iterator, target)\n        self.model = model\n        self.device = device\n\n    # The core part of the update routine can be customized by overriding.\n    def evaluate(self):\n        \"\"\"Evaluate over validation iterator.\"\"\"\n        iterator = self._iterators[\"main\"]\n\n        if self.eval_hook:\n            self.eval_hook(self)\n\n        if hasattr(iterator, \"reset\"):\n            iterator.reset()\n            it = iterator\n        else:\n            it = copy.copy(iterator)\n\n        summary = chainer.reporter.DictSummary()\n\n        self.model.eval()\n        with torch.no_grad():\n            for batch in it:\n                if isinstance(batch, tuple):\n                    x = tuple(arr.to(self.device) for arr in batch)\n                else:\n                    x = batch\n                    for key in x.keys():\n                        x[key] = x[key].to(self.device)\n                observation = {}\n                with chainer.reporter.report_scope(observation):\n                    # convert to torch tensor\n                    if isinstance(x, tuple):\n                        self.model(*x)\n                    else:\n                        self.model(**x)\n                summary.add(observation)\n        self.model.train()\n\n        return summary.compute_mean()\n\n\nclass CustomUpdater(training.StandardUpdater):\n    \"\"\"Custom updater.\"\"\"\n\n    def __init__(self, model, grad_clip, iterator, optimizer, device, accum_grad=1):\n        \"\"\"Initialize module.\n\n        Args:\n            model (torch.nn.Module): Pytorch model instance.\n            grad_clip (float): The gradient clipping value.\n            iterator (chainer.dataset.Iterator): Iterator for training.\n            optimizer (torch.optim.Optimizer): Pytorch optimizer instance.\n     
       device (torch.device): The device to be used in training.\n\n        \"\"\"\n        super(CustomUpdater, self).__init__(iterator, optimizer)\n        self.model = model\n        self.grad_clip = grad_clip\n        self.device = device\n        self.clip_grad_norm = torch.nn.utils.clip_grad_norm_\n        self.accum_grad = accum_grad\n        self.forward_count = 0\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Update model one step.\"\"\"\n        # When we pass one iterator and optimizer to StandardUpdater.__init__,\n        # they are automatically named 'main'.\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n\n        # Get the next batch (a list of json files)\n        batch = train_iter.next()\n        if isinstance(batch, tuple):\n            x = tuple(arr.to(self.device) for arr in batch)\n        else:\n            x = batch\n            for key in x.keys():\n                x[key] = x[key].to(self.device)\n\n        # compute loss and gradient\n        if isinstance(x, tuple):\n            loss = self.model(*x).mean() / self.accum_grad\n        else:\n            loss = self.model(**x).mean() / self.accum_grad\n        loss.backward()\n\n        # update parameters\n        self.forward_count += 1\n        if self.forward_count != self.accum_grad:\n            return\n        self.forward_count = 0\n\n        # compute the gradient norm to check if it is normal or not\n        grad_norm = self.clip_grad_norm(self.model.parameters(), self.grad_clip)\n        logging.debug(\"grad norm={}\".format(grad_norm))\n        if math.isnan(grad_norm):\n            logging.warning(\"grad norm is nan. 
Do not update model.\")\n        else:\n            optimizer.step()\n        optimizer.zero_grad()\n\n    def update(self):\n        \"\"\"Run update function.\"\"\"\n        self.update_core()\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass CustomConverter(object):\n    \"\"\"Custom converter.\"\"\"\n\n    def __init__(self):\n        \"\"\"Initialize module.\"\"\"\n        # NOTE: keep as class for future development\n        pass\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Convert a given batch.\n\n        Args:\n            batch (list): List of ndarrays.\n            device (torch.device): The device to send the tensors to.\n\n        Returns:\n            dict: Dict of converted tensors.\n\n        Examples:\n            >>> batch = [([np.arange(5), np.arange(3)],\n                          [np.random.randn(8, 2), np.random.randn(4, 2)],\n                          None, None)]\n            >>> converter = CustomConverter()\n            >>> converter(batch, torch.device(\"cpu\"))\n            {'xs': tensor([[0, 1, 2, 3, 4],\n                           [0, 1, 2, 0, 0]]),\n             'ilens': tensor([5, 3]),\n             'ys': tensor([[[-0.4197, -1.1157],\n                            [-1.5837, -0.4299],\n                            [-2.0491,  0.9215],\n                            [-2.4326,  0.8891],\n                            [ 1.2323,  1.7388],\n                            [-0.3228,  0.6656],\n                            [-0.6025,  1.3693],\n                            [-1.0778,  1.3447]],\n                           [[ 0.1768, -0.3119],\n                            [ 0.4386,  2.5354],\n                            [-1.2181, -0.5918],\n                            [-0.6858, -0.8843],\n                            [ 0.0000,  0.0000],\n                            [ 0.0000,  0.0000],\n                            [ 0.0000,  0.0000],\n                            [ 0.0000,  0.0000]]]),\n             'labels': 
tensor([[0., 0., 0., 0., 0., 0., 0., 1.],\n                               [0., 0., 0., 1., 1., 1., 1., 1.]]),\n             'olens': tensor([8, 4])}\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys, spembs, extras = batch[0]\n\n        # get list of lengths (must be tensor for DataParallel)\n        ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).long().to(device)\n        olens = torch.from_numpy(np.array([y.shape[0] for y in ys])).long().to(device)\n\n        # perform padding and conversion to tensor\n        xs = pad_list([torch.from_numpy(x).long() for x in xs], 0).to(device)\n        ys = pad_list([torch.from_numpy(y).float() for y in ys], 0).to(device)\n\n        # make labels for stop prediction\n        labels = ys.new_zeros(ys.size(0), ys.size(1))\n        for i, l in enumerate(olens):\n            labels[i, l - 1 :] = 1.0\n\n        # prepare dict\n        new_batch = {\n            \"xs\": xs,\n            \"ilens\": ilens,\n            \"ys\": ys,\n            \"labels\": labels,\n            \"olens\": olens,\n        }\n\n        # load speaker embedding\n        if spembs is not None:\n            spembs = torch.from_numpy(np.array(spembs)).float()\n            new_batch[\"spembs\"] = spembs.to(device)\n\n        # load second target\n        if extras is not None:\n            extras = pad_list([torch.from_numpy(extra).float() for extra in extras], 0)\n            new_batch[\"extras\"] = extras.to(device)\n\n        return new_batch\n\n\ndef train(args):\n    \"\"\"Train E2E-TTS model.\"\"\"\n    set_deterministic_pytorch(args)\n\n    # check cuda availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n\n    # reverse input and output dimension\n    idim = 
int(valid_json[utts[0]][\"output\"][0][\"shape\"][1])\n    odim = int(valid_json[utts[0]][\"input\"][0][\"shape\"][1])\n    logging.info(\"#input dims : \" + str(idim))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # get extra input and output dimension\n    if args.use_speaker_embedding:\n        args.spk_embed_dim = int(valid_json[utts[0]][\"input\"][1][\"shape\"][0])\n    else:\n        args.spk_embed_dim = None\n    if args.use_second_target:\n        args.spc_dim = int(valid_json[utts[0]][\"input\"][1][\"shape\"][1])\n    else:\n        args.spc_dim = None\n\n    # write model config\n    if not os.path.exists(args.outdir):\n        os.makedirs(args.outdir)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim, odim, vars(args)), indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    # specify model architecture\n    if args.enc_init is not None or args.dec_init is not None:\n        model = load_trained_modules(idim, odim, args, TTSInterface)\n    else:\n        model_class = dynamic_import(args.model_module)\n        model = model_class(idim, odim, args)\n    assert isinstance(model, TTSInterface)\n    logging.info(model)\n    reporter = model.reporter\n\n    # check the use of multi-gpu\n    if args.ngpu > 1:\n        model = torch.nn.DataParallel(model, device_ids=list(range(args.ngpu)))\n        if args.batch_size != 0:\n            logging.warning(\n                \"batch size is automatically increased (%d -> %d)\"\n                % (args.batch_size, args.batch_size * args.ngpu)\n            )\n            args.batch_size *= args.ngpu\n\n    # set torch device\n    device = torch.device(\"cuda\" if args.ngpu > 0 else 
\"cpu\")\n    model = model.to(device)\n\n    # freeze modules, if specified\n    if args.freeze_mods:\n        if hasattr(model, \"module\"):\n            freeze_mods = [\"module.\" + x for x in args.freeze_mods]\n        else:\n            freeze_mods = args.freeze_mods\n\n        for mod, param in model.named_parameters():\n            if any(mod.startswith(key) for key in freeze_mods):\n                logging.info(f\"{mod} is frozen not to be updated.\")\n                param.requires_grad = False\n\n        model_params = filter(lambda x: x.requires_grad, model.parameters())\n    else:\n        model_params = model.parameters()\n\n    logging.warning(\n        \"num. model params: {:,} (num. trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # Setup an optimizer\n    if args.opt == \"adam\":\n        optimizer = torch.optim.Adam(\n            model_params, args.lr, eps=args.eps, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"noam\":\n        from espnet.nets.pytorch_backend.transformer.optimizer import get_std_opt\n\n        optimizer = get_std_opt(\n            model_params, args.adim, args.transformer_warmup_steps, args.transformer_lr\n        )\n    else:\n        raise NotImplementedError(\"unknown optimizer: \" + args.opt)\n\n    # FIXME: TOO DIRTY HACK\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    # read json data\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n    if 
use_sortagrad:\n        args.batch_sort_key = \"input\"\n    # make minibatch list (variable length)\n    train_batchset = make_batchset(\n        train_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        batch_sort_key=args.batch_sort_key,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        shortest_first=use_sortagrad,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        swap_io=True,\n        iaxis=0,\n        oaxis=0,\n    )\n    valid_batchset = make_batchset(\n        valid_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        batch_sort_key=args.batch_sort_key,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        swap_io=True,\n        iaxis=0,\n        oaxis=0,\n    )\n\n    load_tr = LoadInputsAndTargets(\n        mode=\"tts\",\n        use_speaker_embedding=args.use_speaker_embedding,\n        use_second_target=args.use_second_target,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": True},  # Switch the mode of preprocessing\n        keep_all_data_on_mem=args.keep_all_data_on_mem,\n    )\n\n    load_cv = LoadInputsAndTargets(\n        mode=\"tts\",\n        use_speaker_embedding=args.use_speaker_embedding,\n        use_second_target=args.use_second_target,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n        keep_all_data_on_mem=args.keep_all_data_on_mem,\n    )\n\n    converter = 
CustomConverter()\n    # hack to fix the batch_size argument at 1\n    # the actual batch size is the length of the list held by each item\n    train_iter = {\n        \"main\": ChainerDataLoader(\n            dataset=TransformDataset(\n                train_batchset, lambda data: converter([load_tr(data)])\n            ),\n            batch_size=1,\n            num_workers=args.num_iter_processes,\n            shuffle=not use_sortagrad,\n            collate_fn=lambda x: x[0],\n        )\n    }\n    valid_iter = {\n        \"main\": ChainerDataLoader(\n            dataset=TransformDataset(\n                valid_batchset, lambda data: converter([load_cv(data)])\n            ),\n            batch_size=1,\n            shuffle=False,\n            collate_fn=lambda x: x[0],\n            num_workers=args.num_iter_processes,\n        )\n    }\n\n    # Set up a trainer\n    updater = CustomUpdater(\n        model, args.grad_clip, train_iter, optimizer, device, args.accum_grad\n    )\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    # Resume from a snapshot\n    if args.resume:\n        logging.info(\"resumed from %s\" % args.resume)\n        torch_resume(args.resume, trainer)\n\n    # set intervals\n    eval_interval = (args.eval_interval_epochs, \"epoch\")\n    save_interval = (args.save_interval_epochs, \"epoch\")\n    report_interval = (args.report_interval_iters, \"iteration\")\n\n    # Evaluate the model with the validation dataset at each evaluation interval\n    trainer.extend(\n        CustomEvaluator(model, valid_iter, reporter, device), trigger=eval_interval\n    )\n\n    # Save a snapshot at each save interval\n    trainer.extend(torch_snapshot(), trigger=save_interval)\n\n    # Save best models\n    trainer.extend(\n        snapshot_object(model, \"model.loss.best\"),\n        trigger=training.triggers.MinValueTrigger(\n            \"validation/main/loss\", trigger=eval_interval\n        ),\n    )\n\n    # Save attention figure for each epoch\n    if args.num_save_attention > 
0:\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"output\"][0][\"shape\"][0]),\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n            reduction_factor = model.module.reduction_factor\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n            reduction_factor = model.reduction_factor\n        if reduction_factor > 1:\n            # fix the length to crop attention weight plot correctly\n            data = copy.deepcopy(data)\n            for idx in range(len(data)):\n                ilen = data[idx][1][\"input\"][0][\"shape\"][0]\n                data[idx][1][\"input\"][0][\"shape\"][0] = ilen // reduction_factor\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n            reverse=True,\n        )\n        trainer.extend(att_reporter, trigger=eval_interval)\n    else:\n        att_reporter = None\n\n    # Make a plot for training and validation values\n    if hasattr(model, \"module\"):\n        base_plot_keys = model.module.base_plot_keys\n    else:\n        base_plot_keys = model.base_plot_keys\n    plot_keys = []\n    for key in base_plot_keys:\n        plot_key = [\"main/\" + key, \"validation/main/\" + key]\n        trainer.extend(\n            extensions.PlotReport(plot_key, \"epoch\", file_name=key + \".png\"),\n            trigger=eval_interval,\n        )\n        plot_keys += plot_key\n    trainer.extend(\n        extensions.PlotReport(plot_keys, \"epoch\", file_name=\"all_loss.png\"),\n        trigger=eval_interval,\n    )\n\n    # Write a log of evaluation statistics for each epoch\n    
trainer.extend(extensions.LogReport(trigger=report_interval))\n    report_keys = [\"epoch\", \"iteration\", \"elapsed_time\"] + plot_keys\n    trainer.extend(extensions.PrintReport(report_keys), trigger=report_interval)\n    trainer.extend(extensions.ProgressBar(), trigger=report_interval)\n\n    set_early_stop(trainer, args)\n    if args.tensorboard_dir is not None and args.tensorboard_dir != \"\":\n        writer = SummaryWriter(args.tensorboard_dir)\n        trainer.extend(TensorboardLogger(writer, att_reporter), trigger=report_interval)\n\n    if use_sortagrad:\n        trainer.extend(\n            ShufflingEnabler([train_iter]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, args.epochs)\n\n\n@torch.no_grad()\ndef decode(args):\n    \"\"\"Decode with E2E-TTS model.\"\"\"\n    set_deterministic_pytorch(args)\n    # read training config\n    idim, odim, train_args = get_model_conf(args.model, args.model_conf)\n\n    # show arguments\n    for key in sorted(vars(args).keys()):\n        logging.info(\"args: \" + key + \": \" + str(vars(args)[key]))\n\n    # define model\n    model_class = dynamic_import(train_args.model_module)\n    model = model_class(idim, odim, train_args)\n    assert isinstance(model, TTSInterface)\n    logging.info(model)\n\n    # load trained model parameters\n    logging.info(\"reading model parameters from \" + args.model)\n    torch_load(args.model, model)\n    model.eval()\n\n    # set torch device\n    device = torch.device(\"cuda\" if args.ngpu > 0 else \"cpu\")\n    model = model.to(device)\n\n    # read json data\n    with open(args.json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n\n    # check directory\n    outdir = os.path.dirname(args.out)\n    if len(outdir) != 0 and not os.path.exists(outdir):\n        os.makedirs(outdir)\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        
mode=\"tts\",\n        load_input=False,\n        sort_in_input_length=False,\n        use_speaker_embedding=train_args.use_speaker_embedding,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n\n    # define function for plot prob and att_ws\n    def _plot_and_save(array, figname, figsize=(6, 4), dpi=150):\n        import matplotlib.pyplot as plt\n\n        shape = array.shape\n        if len(shape) == 1:\n            # for eos probability\n            plt.figure(figsize=figsize, dpi=dpi)\n            plt.plot(array)\n            plt.xlabel(\"Frame\")\n            plt.ylabel(\"Probability\")\n            plt.ylim([0, 1])\n        elif len(shape) == 2:\n            # for tacotron 2 attention weights, whose shape is (out_length, in_length)\n            plt.figure(figsize=figsize, dpi=dpi)\n            plt.imshow(array, aspect=\"auto\")\n            plt.xlabel(\"Input\")\n            plt.ylabel(\"Output\")\n        elif len(shape) == 4:\n            # for transformer attention weights,\n            # whose shape is (#leyers, #heads, out_length, in_length)\n            plt.figure(figsize=(figsize[0] * shape[0], figsize[1] * shape[1]), dpi=dpi)\n            for idx1, xs in enumerate(array):\n                for idx2, x in enumerate(xs, 1):\n                    plt.subplot(shape[0], shape[1], idx1 * shape[1] + idx2)\n                    plt.imshow(x, aspect=\"auto\")\n                    plt.xlabel(\"Input\")\n                    plt.ylabel(\"Output\")\n        else:\n            raise NotImplementedError(\"Support only from 1D to 4D array.\")\n        plt.tight_layout()\n        if not os.path.exists(os.path.dirname(figname)):\n            # NOTE: exist_ok = True is needed for parallel process decoding\n            os.makedirs(os.path.dirname(figname), exist_ok=True)\n        plt.savefig(figname)\n        
plt.close()\n\n    # define function to calculate focus rate\n    # (see section 3.3 in https://arxiv.org/abs/1905.09263)\n    def _calculate_focus_rete(att_ws):\n        if att_ws is None:\n            # fastspeech case -> None\n            return 1.0\n        elif len(att_ws.shape) == 2:\n            # tacotron 2 case -> (L, T)\n            return float(att_ws.max(dim=-1)[0].mean())\n        elif len(att_ws.shape) == 4:\n            # transformer case -> (#layers, #heads, L, T)\n            return float(att_ws.max(dim=-1)[0].mean(dim=-1).max())\n        else:\n            raise ValueError(\"att_ws should be 2 or 4 dimensional tensor.\")\n\n    # define function to convert attention to duration\n    def _convert_att_to_duration(att_ws):\n        if len(att_ws.shape) == 2:\n            # tacotron 2 case -> (L, T)\n            pass\n        elif len(att_ws.shape) == 4:\n            # transformer case -> (#layers, #heads, L, T)\n            # get the most diagonal head according to focus rate\n            att_ws = torch.cat(\n                [att_w for att_w in att_ws], dim=0\n            )  # (#heads * #layers, L, T)\n            diagonal_scores = att_ws.max(dim=-1)[0].mean(dim=-1)  # (#heads * #layers,)\n            diagonal_head_idx = diagonal_scores.argmax()\n            att_ws = att_ws[diagonal_head_idx]  # (L, T)\n        else:\n            raise ValueError(\"att_ws should be 2 or 4 dimensional tensor.\")\n        # calculate duration from 2d attention weight\n        durations = torch.stack(\n            [att_ws.argmax(-1).eq(i).sum() for i in range(att_ws.shape[1])]\n        )\n        return durations.view(-1, 1).float()\n\n    # define writer instances\n    feat_writer = kaldiio.WriteHelper(\"ark,scp:{o}.ark,{o}.scp\".format(o=args.out))\n    if args.save_durations:\n        dur_writer = kaldiio.WriteHelper(\n            \"ark,scp:{o}.ark,{o}.scp\".format(o=args.out.replace(\"feats\", \"durations\"))\n        )\n    if args.save_focus_rates:\n        
fr_writer = kaldiio.WriteHelper(\n            \"ark,scp:{o}.ark,{o}.scp\".format(o=args.out.replace(\"feats\", \"focus_rates\"))\n        )\n\n    # start decoding\n    for idx, utt_id in enumerate(js.keys()):\n        # setup inputs\n        batch = [(utt_id, js[utt_id])]\n        data = load_inputs_and_targets(batch)\n        x = torch.LongTensor(data[0][0]).to(device)\n        spemb = None\n        if train_args.use_speaker_embedding:\n            spemb = torch.FloatTensor(data[1][0]).to(device)\n\n        # decode and write\n        start_time = time.time()\n        outs, probs, att_ws = model.inference(x, args, spemb=spemb)\n        logging.info(\n            \"inference speed = %.1f frames / sec.\"\n            % (int(outs.size(0)) / (time.time() - start_time))\n        )\n        if outs.size(0) == x.size(0) * args.maxlenratio:\n            logging.warning(\"output length reaches maximum length (%s).\" % utt_id)\n        focus_rate = _calculate_focus_rete(att_ws)\n        logging.info(\n            \"(%d/%d) %s (size: %d->%d, focus rate: %.3f)\"\n            % (idx + 1, len(js.keys()), utt_id, x.size(0), outs.size(0), focus_rate)\n        )\n        feat_writer[utt_id] = outs.cpu().numpy()\n        if args.save_durations:\n            ds = _convert_att_to_duration(att_ws)\n            dur_writer[utt_id] = ds.cpu().numpy()\n        if args.save_focus_rates:\n            fr_writer[utt_id] = np.array(focus_rate).reshape(1, 1)\n\n        # plot and save prob and att_ws\n        if probs is not None:\n            _plot_and_save(\n                probs.cpu().numpy(),\n                os.path.dirname(args.out) + \"/probs/%s_prob.png\" % utt_id,\n            )\n        if att_ws is not None:\n            _plot_and_save(\n                att_ws.cpu().numpy(),\n                os.path.dirname(args.out) + \"/att_ws/%s_att_ws.png\" % utt_id,\n            )\n\n    # close file object\n    feat_writer.close()\n    if args.save_durations:\n        dur_writer.close()\n    
if args.save_focus_rates:\n        fr_writer.close()\n"
  },
  {
    "path": "utils/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "utils/bmuf.py",
    "content": "\"\"\"\nBMUF (block model update filtering) module\nimplementation of block model update filtering\n\"\"\"\n\nimport torch\nimport torch.distributed as dist\n#import torch.distributed.ReduceOp as ReduceOp\nimport torch.nn as nn\n\nSUCCESS = 1\nSTOP = 0\n\ndef _copy_vec_to_param(vec, parameters, is_grad=False):\n    \"\"\"Copy vector to the parameters\n\n    Args:\n        vec (Tensor): a single vector represents the parameters of a model.\n        parameters (Iterable[Tensor]): an iterator of Tensors that are the\n            parameters of a model.\n    \"\"\"\n    # Ensure vec of type Tensor\n    if not isinstance(vec, torch.Tensor):\n        raise TypeError('expected torch.Tensor, but got: {}'\n                        .format(torch.typename(vec)))\n    # Pointer for slicing the vector for each parameter\n    pointer = 0\n    for param in parameters:\n        # The length of the parameter\n        num_param = param.numel()\n        # Slice the vector, reshape it, and replace the old data of the parameter\n        if is_grad: \n            param.grad = param.grad.copy_(vec[pointer:pointer + num_param]\n                                      .view_as(param).data)\n        else:\n            param.data = param.data.copy_(vec[pointer:pointer + num_param]\n                                      .view_as(param).data)\n        # Increment the pointer\n        pointer += num_param\n\n\nclass BmufTrainer():\n    \"\"\"\n    Basic BMUF Trainer Class,\n    implements Nesterov Block Momentum\n\n    Args:\n        master_node (int): master node index, zero in most cases\n        rank (int): local rank, eg, 0-7 if 8GPUs are used\n        world_size (int): total number of workers\n        model (nn.module): model\n        block_momentum (float): block momentum value\n        block_lr (float): block learning rate\n    \"\"\"\n    def __init__(self, master_node, rank, world_size, model,\n                 block_momentum, block_lr):\n        self.master_node = 
master_node\n        self.rank = rank\n        self.world_size = world_size\n        self.model = model\n        self.block_momentum = block_momentum\n        self.block_lr = block_lr\n        dist.init_process_group(backend=\"nccl\", init_method=\"env://\")\n        #clone() make sure self.param\n        #NOT tied to model parameters\n        #data() enforces no grad\n        param_vec = nn.utils.parameters_to_vector(model.parameters())\n        self.param = param_vec.data.clone()\n        #broadcast initial param to other nodes\n        dist.broadcast(tensor=self.param, src=master_node, async_op=False)\n        num_param = self.param.numel()\n        if self.rank == master_node:\n            self.delta_prev = torch.FloatTensor([0]*num_param).cuda(self.rank)\n        else:\n            self.delta_prev = None\n            #nn.utils.vector_to_parameters(self.param.clone(),\n            #                              self.model.parameters())\n            _copy_vec_to_param(self.param, self.model.parameters())\n\n    def update_and_sync(self):\n        \"\"\"\n        Performs a single block sync and update\n        return SUCCESS if numericals are healthy\n        return STOP otherwise\n\n        \"\"\"\n        delta = self.param - \\\n                nn.utils.parameters_to_vector(self.model.parameters()).data\n        #gather block gradients into delta\n        #default: op=ReduceOp.SUM,\n        dist.reduce(tensor=delta, dst=self.master_node)\n        #check if model params are still healthy\n        if torch.isnan(delta).sum().item():\n            return STOP\n        if self.rank == self.master_node:\n            #for master node\n            delta = delta / float(self.world_size)\n            self.delta_prev = self.block_momentum * self.delta_prev + \\\n                              (self.block_lr *(1 - self.block_momentum)* delta)\n            #self.delta_prev = self.block_momentum * self.delta_prev + \\\n            #                   (self.block_lr * 
delta)\n            \n            self.param -= (1+self.block_momentum) * self.delta_prev\n        dist.broadcast(tensor=self.param, src=self.master_node, async_op=False)\n        _copy_vec_to_param(self.param, self.model.parameters())\n\n        return SUCCESS\n\n    def broadcast(self, tensor):\n        \"\"\"broadcast interface for trainer\"\"\"\n        dist.broadcast(tensor=tensor, src=self.master_node, async_op=False)\n\n    def sum_reduce(self, tensor):\n        \"\"\"sumreduce interface for trainer\"\"\"\n        #op=ReduceOp.SUM,\n        dist.reduce(tensor=tensor, dst=self.master_node)\n\n\nclass BlockAdamTrainer():\n    \"\"\"\n    By tyrion: Does this trainer requires the local optimizer being\n    SGD? which means the delta is still the gradients (scaled)\n    The learning rate is scheduled by the local optimizer so \n    the block_lr should be set 1.0.\n\n    This is essentially sync adam optimizer but\n    allows each worker to have individual loader\n    to improve the training efficiency, to replace\n    replace DataParallel()\n\n    Args:\n        master_node (int): master node index, zero in most cases\n        rank (int): local rank, eg, 0-7 if 8 GPUs are used\n        world_size (int): total number of workers\n        model (nn.module): torch model\n        block_lr (float): block learning rate\n\n    \"\"\"\n    def __init__(self, args, master_node, rank, world_size, model):\n        # Communication related\n        self.master_node = master_node\n        self.rank = rank\n        self.world_size = world_size\n        dist.init_process_group(backend=\"nccl\", init_method=\"env://\")\n     \n        # Model and optimizer \n        self.model = model\n        param_vec = nn.utils.parameters_to_vector(model.parameters()).data.clone() \n        dist.broadcast(tensor=param_vec, src=master_node, async_op=False)\n        _copy_vec_to_param(param_vec, self.model.parameters())\n\n        from espnet.nets.pytorch_backend.transformer.optimizer import 
get_std_opt\n        if hasattr(args, \"enc_block_arch\") or hasattr(args, \"dec_block_arch\"):\n            adim = model.most_dom_dim\n        else:\n            adim = args.adim\n\n        # optimize only trainable parameters, in case some modules are frozen\n        params = [p for p in model.parameters() if p.requires_grad]\n        self.optimizer = get_std_opt(\n            params, adim, args.transformer_warmup_steps, args.transformer_lr\n        )\n\n    def update_and_sync(self):\n        # Before calling this function we assume the forward-backward pass has\n        # finished, so the gradients of the params are populated.\n\n        params = [p for p in self.optimizer.param_groups[0][\"params\"] if hasattr(p.grad, \"data\")]\n        \n        # sum gradients over all workers (all_reduce defaults to ReduceOp.SUM;\n        # the result is not divided by world_size)\n        grad_vec = nn.utils.parameters_to_vector([p.grad.data for p in params])\n        dist.all_reduce(tensor=grad_vec.data)\n        \n        # Update with the global gradients\n        _copy_vec_to_param(grad_vec, params, is_grad=True)\n        self.optimizer.step()\n        self.optimizer.zero_grad()\n\n        return SUCCESS\n\n#class BlockAdamTrainer():\n#    \"\"\"\n#    By tyrion: Does this trainer requires the local optimizer being\n#    SGD? 
which means the delta is still the gradients (scaled)\n#    The learning rate is scheduled by the local optimizer so \n#    the block_lr should be set 1.0.\n#    This is essentially sync adam optimizer but\n#    allows each worker to have individual loader\n#    to improve the training efficiency, to replace\n#    replace DataParallel()\n#    Args:\n#        master_node (int): master node index, zero in most cases\n#        rank (int): local rank, eg, 0-7 if 8 GPUs are used\n#        world_size (int): total number of workers\n#        model (nn.module): torch model\n#        block_lr (float): block learning rate\n#    \"\"\"\n#    def __init__(self, args, master_node, rank, world_size, model):\n#        self.master_node = master_node\n#        self.rank = rank\n#        self.world_size = world_size\n#        self.model = model\n#        dist.init_process_group(backend=\"nccl\", init_method=\"env://\")\n#        #clone() make sure self.param\n#        #NOT tied to model parameters\n#        #data() enforces no grad\n#        param_vec = nn.utils.parameters_to_vector(model.parameters())\n#        self.param = nn.parameter.Parameter(param_vec.data.clone())\n#        #broadcast initial param to other nodes\n#        dist.broadcast(tensor=self.param.data, src=master_node, async_op=False)\n#        if self.rank == master_node:\n#            from espnet.nets.pytorch_backend.transformer.optimizer import get_std_opt\n#            if hasattr(args, \"enc_block_arch\") or hasattr(args, \"dec_block_arch\"):\n#                adim = model.most_dom_dim\n#            else:\n#                adim = args.adim\n#    \n#            # consider when some modules are freezed\n#            self.optimizer = get_std_opt(\n#                [self.param], adim, args.transformer_warmup_steps, args.transformer_lr\n#            )\n#        else:\n#            _copy_vec_to_param(self.param.data, self.model.parameters())\n#\n#    def update_and_sync(self):\n#        \"\"\"Perform a single block 
sync and update\n#           when the block size equals to batch size\n#           we are doing sync adam\n#        \"\"\"\n#        delta = self.param.data - \\\n#                nn.utils.parameters_to_vector(self.model.parameters()).data\n#        #gather block gradients into delta\n#        #op=ReduceOp.SUM,\n#        dist.reduce(tensor=delta, dst=self.master_node)\n#        #check if model params are still healthy\n#        if torch.isnan(delta).sum().item():\n#            return STOP\n#        if self.rank == self.master_node:\n#            #local rank is master node\n#            #delta = delta / float(self.world_size)\n#            #use delta.data to detach from computation graph\n#            self.param.grad = delta.data\n#            self.optimizer.step()\n#        dist.broadcast(tensor=self.param.data, src=self.master_node, async_op=False)\n#        _copy_vec_to_param(self.param.data, self.model.parameters())\n#\n#        return SUCCESS\n#\n#    def reset_model(self, model):\n#        del self.param\n#        param_vec = nn.utils.parameters_to_vector(model.parameters())\n#        self.param = nn.parameter.Parameter(param_vec.data.clone())\n\n\nclass BmufAdamTrainer():\n    \"\"\"The implementation of BMUF-adam, see more details in\n       Chen. 
et al, 2020, \"Parallelizing Adam Optimizer with\n       Blockwise Model-Update Filtering.\"\n\n    Args:\n        master_node (int): master node index, zero in most cases\n        rank (int): local rank, eg, 0-7 if 8 GPUs are used\n        world_size (int): total number of workers\n        model (nn.module): torch model\n        block_momentum (float): block momentum value\n        block_lr (float): block learning rate\n        sync_period (int): sync period in number of batches\n        optim (torch.optim.Optimizer): adam optimizer\n    \"\"\"\n    def __init__(self, master_node, rank, world_size, model,\n                 block_momentum, block_lr, sync_period, optim):\n        self.master_node = master_node\n        self.rank = rank\n        self.world_size = world_size\n        self.model = model\n        self.block_momentum = block_momentum\n        self.block_lr = block_lr\n        self.sync_period = sync_period\n        self.optim = optim\n        dist.init_process_group(backend=\"nccl\", init_method=\"env://\")\n        self.rho = 0.0\n        #default setup\n        self.betas = (0.9, 0.999)\n\t#clone() make sure self.param\n        #NOT tied to model parameters\n        #data() enforces no grad\n        param_vec = nn.utils.parameters_to_vector(model.parameters())\n        self.param = param_vec.data.clone()\n        #broadcast initial param to other nodes\n        dist.broadcast(tensor=self.param, src=master_node, async_op=False)\n        self.num_param = self.param.numel()\n        if self.rank == master_node:\n            self.delta_prev = torch.FloatTensor([0]*self.num_param)\\\n                                   .cuda(master_node)\n        else:\n            self.delta_prev = None\n            _copy_vec_to_param(self.param, self.model.parameters())\n\n        #initialize first and second moment buffer\n        dim = 0\n        for group in optim.param_groups:\n            self.betas = group['betas']\n            for p in group['params']:\n             
   dim += p.numel()\n        if self.rank == master_node:\n            self.exp_avg = torch.FloatTensor([0]*dim).cuda(self.rank)\n            self.exp_avg_sq = torch.FloatTensor([0]*dim).cuda(self.rank)\n        else:\n            self.exp_avg = None\n            self.exp_avg_sq = None\n        #extend param to accommodate first and second moments\n        vec_ext = torch.FloatTensor([0]*dim*2).cuda(self.rank)\n        self.param = torch.cat([self.param, vec_ext])\n\n    def update_and_sync(self):\n        \"\"\"perform single block sync and update\"\"\"\n        #gather block gradients into delta\n        delta = self.param[:self.num_param] - \\\n                nn.utils.parameters_to_vector(self.model.parameters()).data\n        #gather local first and second moment\n        exp_avg, exp_avg_sq = [], []\n        for group in self.optim.param_groups:\n            for p in group['params']:\n                if p.grad is None:\n                    continue\n                state = self.optim.state[p]\n                exp_avg.append(state['exp_avg'].view(-1))\n                exp_avg_sq.append(state['exp_avg_sq'].view(-1))\n        exp_avg = torch.cat(exp_avg)\n        exp_avg_sq = torch.cat(exp_avg_sq)\n        vec = torch.cat([delta, exp_avg, exp_avg_sq])\n        #op=ReduceOp.SUM,\n        dist.reduce(tensor=vec, dst=self.master_node)\n        #check if model params are still healthy\n        if torch.isnan(vec).sum().item():\n            return STOP\n        self.rho = self.block_momentum * self.rho + self.sync_period\n        if self.rank == self.master_node:\n            #local rank is master node\n            vec = vec / float(self.world_size)\n            self.delta_prev = self.block_momentum * self.delta_prev + \\\n                              (self.block_lr *(1 - self.block_momentum)*\\\n                               vec[:self.num_param])\n            self.param[:self.num_param] -= (1+self.block_momentum) \\\n                                           * 
self.delta_prev\n            #calculate first and second moment for next block\n            dim = (vec.numel() - self.num_param) // 2\n            beta1_tau = self.betas[0]**self.sync_period\n            beta2_tau = self.betas[1]**self.sync_period\n            beta1_rho = self.betas[0]**(self.rho*self.block_momentum)\n            beta2_rho = self.betas[1]**(self.rho*self.block_momentum)\n            self.exp_avg = beta1_tau * (beta1_rho - 1) * self.exp_avg\n            self.exp_avg += (1 - beta1_tau * beta1_rho) *\\\n                            vec[self.num_param:self.num_param+dim]\n            self.exp_avg = self.exp_avg / (1 - beta1_tau)\n            self.exp_avg_sq = beta2_tau * (beta2_rho - 1) * self.exp_avg_sq\n            self.exp_avg_sq += (1 - beta2_tau * beta2_rho) *\\\n                               vec[self.num_param+dim:]\n            self.exp_avg_sq = self.exp_avg_sq / (1 - beta2_tau)\n            self.param[self.num_param:self.num_param+dim] = self.exp_avg\n            self.param[self.num_param+dim:] = self.exp_avg_sq\n\n        dist.broadcast(tensor=self.param, src=self.master_node,\n                       async_op=False)\n        _copy_vec_to_param(self.param[:self.num_param],\n                           self.model.parameters())\n        #assign flattened moments to optimizer\n        ptr1 = self.num_param\n        ptr2 = self.num_param+(self.param.numel()-self.num_param)//2\n        for group in self.optim.param_groups:\n            for p in group['params']:\n                if p.grad is None:\n                    continue\n                state = self.optim.state[p]\n                state['step'] += self.rho * self.block_momentum\n                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']\n                numel = exp_avg.numel()\n                exp_avg.data = exp_avg.data\\\n                               .copy_(self.param[ptr1:ptr1+numel]\n                                      .view_as(exp_avg).data)\n                
exp_avg_sq.data = exp_avg_sq.data\\\n                                  .copy_(self.param[ptr2:ptr2+numel]\n                                         .view_as(exp_avg_sq).data)\n                ptr1 += numel\n                ptr2 += numel\n\n\n        return SUCCESS\n\n    def broadcast(self, tensor):\n        \"\"\"broadcast interface for trainer\"\"\"\n        dist.broadcast(tensor=tensor, src=self.master_node, async_op=False)\n\n    def sum_reduce(self, tensor):\n        \"\"\"sum reduce interface for trainer\"\"\"\n        #op=ReduceOp.SUM,\n        dist.reduce(tensor=tensor, dst=self.master_node)\n"
  },
  {
    "path": "utils/check_kwargs.py",
    "content": "import inspect\n\n\ndef check_kwargs(func, kwargs, name=None):\n    \"\"\"Check that kwargs are valid for func.\n\n    If kwargs contains a key that is not a parameter of func, raise TypeError\n    with the same message Python raises by default.\n    If the signature of func cannot be inspected, the check is skipped.\n\n    :param function func: function to be validated\n    :param dict kwargs: keyword arguments for func\n    :param str name: name used in TypeError (default is func name)\n    \"\"\"\n    try:\n        params = inspect.signature(func).parameters\n    except ValueError:\n        # no signature available (e.g. some builtins): nothing to check\n        return\n    if name is None:\n        name = func.__name__\n    for k in kwargs.keys():\n        if k not in params:\n            raise TypeError(f\"{name}() got an unexpected keyword argument '{k}'\")\n"
  },
  {
    "path": "utils/cli_readers.py",
    "content": "import io\nimport logging\nimport sys\n\nimport h5py\nimport kaldiio\nimport soundfile\n\nfrom espnet.utils.io_utils import SoundHDF5File\n\n\ndef file_reader_helper(\n    rspecifier: str,\n    filetype: str = \"mat\",\n    return_shape: bool = False,\n    segments: str = None,\n):\n    \"\"\"Read uttid and array in kaldi style\n\n    This function might be a bit confusing as \"ark\" is used\n    for HDF5 to imitate \"kaldi-rspecifier\".\n\n    Args:\n        rspecifier: Give as \"ark:feats.ark\" or \"scp:feats.scp\"\n        filetype: \"mat\" is kaldi-matrix, \"hdf5\": HDF5,\n            \"sound.hdf5\" and \"sound\" are for audio\n        return_shape: Return the shape of the matrix,\n            instead of the matrix. This can reduce IO cost for HDF5.\n        segments: Kaldi segments file; only used when filetype is \"mat\"\n    Returns:\n        Generator[Tuple[str, np.ndarray], None, None]:\n\n    Examples:\n        Read from kaldi-matrix ark file:\n\n        >>> for u, array in file_reader_helper('ark:feats.ark', 'mat'):\n        ...     array\n\n        Read from HDF5 file:\n\n        >>> for u, array in file_reader_helper('ark:feats.h5', 'hdf5'):\n        ...     
array\n\n    \"\"\"\n    if filetype == \"mat\":\n        return KaldiReader(rspecifier, return_shape=return_shape, segments=segments)\n    elif filetype == \"hdf5\":\n        return HDF5Reader(rspecifier, return_shape=return_shape)\n    elif filetype == \"sound.hdf5\":\n        return SoundHDF5Reader(rspecifier, return_shape=return_shape)\n    elif filetype == \"sound\":\n        return SoundReader(rspecifier, return_shape=return_shape)\n    else:\n        raise NotImplementedError(f\"filetype={filetype}\")\n\n\nclass KaldiReader:\n    def __init__(self, rspecifier, return_shape=False, segments=None):\n        self.rspecifier = rspecifier\n        self.return_shape = return_shape\n        self.segments = segments\n\n    def __iter__(self):\n        with kaldiio.ReadHelper(self.rspecifier, segments=self.segments) as reader:\n            for key, array in reader:\n                if self.return_shape:\n                    array = array.shape\n                yield key, array\n\n\nclass HDF5Reader:\n    def __init__(self, rspecifier, return_shape=False):\n        if \":\" not in rspecifier:\n            # self.rspecifier is not assigned yet, so format the argument itself\n            raise ValueError(\n                'Give \"rspecifier\" such as \"ark:some.ark: {}\"'.format(rspecifier)\n            )\n        self.rspecifier = rspecifier\n        self.ark_or_scp, self.filepath = self.rspecifier.split(\":\", 1)\n        if self.ark_or_scp not in [\"ark\", \"scp\"]:\n            raise ValueError(f\"Must be scp or ark: {self.ark_or_scp}\")\n\n        self.return_shape = return_shape\n\n    def __iter__(self):\n        if self.ark_or_scp == \"scp\":\n            hdf5_dict = {}\n            with open(self.filepath, \"r\", encoding=\"utf-8\") as f:\n                for line in f:\n                    key, value = line.rstrip().split(None, 1)\n\n                    if \":\" not in value:\n                        raise RuntimeError(\n                            \"scp file for hdf5 should be like: \"\n                            '\"uttid 
filepath.h5:key\": {}({})'.format(\n                                line, self.filepath\n                            )\n                        )\n                    path, h5_key = value.split(\":\", 1)\n\n                    hdf5_file = hdf5_dict.get(path)\n                    if hdf5_file is None:\n                        try:\n                            hdf5_file = h5py.File(path, \"r\")\n                        except Exception:\n                            logging.error(\"Error when loading {}\".format(path))\n                            raise\n                        hdf5_dict[path] = hdf5_file\n\n                    try:\n                        data = hdf5_file[h5_key]\n                    except Exception:\n                        logging.error(\n                            \"Error when loading {} with key={}\".format(path, h5_key)\n                        )\n                        raise\n\n                    if self.return_shape:\n                        yield key, data.shape\n                    else:\n                        yield key, data[()]\n\n            # Closing all files\n            for k in hdf5_dict:\n                try:\n                    hdf5_dict[k].close()\n                except Exception:\n                    pass\n\n        else:\n            if self.filepath == \"-\":\n                # Required h5py>=2.9\n                filepath = io.BytesIO(sys.stdin.buffer.read())\n            else:\n                filepath = self.filepath\n            with h5py.File(filepath, \"r\") as f:\n                for key in f:\n                    if self.return_shape:\n                        yield key, f[key].shape\n                    else:\n                        yield key, f[key][()]\n\n\nclass SoundHDF5Reader:\n    def __init__(self, rspecifier, return_shape=False):\n        if \":\" not in rspecifier:\n            raise ValueError(\n                'Give \"rspecifier\" such as \"ark:some.ark: {}\"'.format(rspecifier)\n            )\n      
  self.ark_or_scp, self.filepath = rspecifier.split(\":\", 1)\n        if self.ark_or_scp not in [\"ark\", \"scp\"]:\n            raise ValueError(f\"Must be scp or ark: {self.ark_or_scp}\")\n        self.return_shape = return_shape\n\n    def __iter__(self):\n        if self.ark_or_scp == \"scp\":\n            hdf5_dict = {}\n            with open(self.filepath, \"r\", encoding=\"utf-8\") as f:\n                for line in f:\n                    key, value = line.rstrip().split(None, 1)\n\n                    if \":\" not in value:\n                        raise RuntimeError(\n                            \"scp file for hdf5 should be like: \"\n                            '\"uttid filepath.h5:key\": {}({})'.format(\n                                line, self.filepath\n                            )\n                        )\n                    path, h5_key = value.split(\":\", 1)\n\n                    hdf5_file = hdf5_dict.get(path)\n                    if hdf5_file is None:\n                        try:\n                            hdf5_file = SoundHDF5File(path, \"r\")\n                        except Exception:\n                            logging.error(\"Error when loading {}\".format(path))\n                            raise\n                        hdf5_dict[path] = hdf5_file\n\n                    try:\n                        data = hdf5_file[h5_key]\n                    except Exception:\n                        logging.error(\n                            \"Error when loading {} with key={}\".format(path, h5_key)\n                        )\n                        raise\n\n                    # Change Tuple[ndarray, int] -> Tuple[int, ndarray]\n                    # (soundfile style -> scipy style)\n                    array, rate = data\n                    if self.return_shape:\n                        array = array.shape\n                    yield key, (rate, array)\n\n            # Closing all files\n            for k in hdf5_dict:\n                
try:\n                    hdf5_dict[k].close()\n                except Exception:\n                    pass\n\n        else:\n            if self.filepath == \"-\":\n                # Required h5py>=2.9\n                filepath = io.BytesIO(sys.stdin.buffer.read())\n            else:\n                filepath = self.filepath\n            for key, (a, r) in SoundHDF5File(filepath, \"r\").items():\n                if self.return_shape:\n                    a = a.shape\n                yield key, (r, a)\n\n\nclass SoundReader:\n    def __init__(self, rspecifier, return_shape=False):\n        if \":\" not in rspecifier:\n            raise ValueError(\n                'Give \"rspecifier\" such as \"scp:some.scp: {}\"'.format(rspecifier)\n            )\n        self.ark_or_scp, self.filepath = rspecifier.split(\":\", 1)\n        if self.ark_or_scp != \"scp\":\n            raise ValueError(\n                'Only supporting \"scp\" for sound file: {}'.format(self.ark_or_scp)\n            )\n        self.return_shape = return_shape\n\n    def __iter__(self):\n        with open(self.filepath, \"r\", encoding=\"utf-8\") as f:\n            for line in f:\n                key, sound_file_path = line.rstrip().split(None, 1)\n                # Assume PCM16\n                array, rate = soundfile.read(sound_file_path, dtype=\"int16\")\n                # Change Tuple[ndarray, int] -> Tuple[int, ndarray]\n                # (soundfile style -> scipy style)\n                if self.return_shape:\n                    array = array.shape\n                yield key, (rate, array)\n"
  },
  {
    "path": "utils/cli_utils.py",
    "content": "from collections.abc import Sequence\nfrom distutils.util import strtobool as dist_strtobool\nimport sys\n\nimport numpy\n\n\ndef strtobool(x):\n    # distutils.util.strtobool returns an integer (0 or 1), which is confusing,\n    # so convert it to bool\n    return bool(dist_strtobool(x))\n\n\ndef get_commandline_args():\n    extra_chars = [\n        \" \",\n        \";\",\n        \"&\",\n        \"(\",\n        \")\",\n        \"|\",\n        \"^\",\n        \"<\",\n        \">\",\n        \"?\",\n        \"*\",\n        \"[\",\n        \"]\",\n        \"$\",\n        \"`\",\n        '\"',\n        \"\\\\\",\n        \"!\",\n        \"{\",\n        \"}\",\n    ]\n\n    # Escape the extra characters for shell\n    argv = [\n        arg.replace(\"'\", \"'\\\\''\")\n        if all(char not in arg for char in extra_chars)\n        else \"'\" + arg.replace(\"'\", \"'\\\\''\") + \"'\"\n        for arg in sys.argv\n    ]\n\n    return sys.executable + \" \" + \" \".join(argv)\n\n\ndef is_scipy_wav_style(value):\n    # Check whether value is Tuple[int, numpy.ndarray]\n    return (\n        isinstance(value, Sequence)\n        and len(value) == 2\n        and isinstance(value[0], int)\n        and isinstance(value[1], numpy.ndarray)\n    )\n\n\ndef assert_scipy_wav_style(value):\n    assert is_scipy_wav_style(\n        value\n    ), \"Must be Tuple[int, numpy.ndarray], but got {}\".format(\n        type(value)\n        if not isinstance(value, Sequence)\n        else \"{}[{}]\".format(type(value), \", \".join(str(type(v)) for v in value))\n    )\n"
  },
  {
    "path": "utils/cli_writers.py",
    "content": "from pathlib import Path\nfrom typing import Dict\n\nimport h5py\nimport kaldiio\nimport numpy\nimport soundfile\n\nfrom espnet.utils.cli_utils import assert_scipy_wav_style\nfrom espnet.utils.io_utils import SoundHDF5File\n\n\ndef file_writer_helper(\n    wspecifier: str,\n    filetype: str = \"mat\",\n    write_num_frames: str = None,\n    compress: bool = False,\n    compression_method: int = 2,\n    pcm_format: str = \"wav\",\n):\n    \"\"\"Write matrices in kaldi style\n\n    Args:\n        wspecifier: e.g. ark,scp:out.ark,out.scp\n        filetype: \"mat\" is kaldi-matrix, \"hdf5\": HDF5,\n            \"sound.hdf5\" and \"sound\" are for audio\n        write_num_frames: e.g. 'ark,t:num_frames.txt'\n        compress: Compress or not\n        compression_method: Compression method passed to kaldiio\n            when compress is True (\"mat\" only)\n        pcm_format: Audio format for \"sound.hdf5\" and \"sound\", e.g. \"wav\"\n\n    Write in kaldi-matrix-ark with \"kaldi-scp\" file:\n\n    >>> with file_writer_helper('ark,scp:out.ark,out.scp') as f:\n    ...     f['uttid'] = array\n\n    This \"scp\" has the following format:\n\n        uttidA out.ark:1234\n        uttidB out.ark:2222\n\n    where 1234 and 2222 point to the starting byte address of each matrix.\n    (For details, see the official documentation of Kaldi)\n\n    Write in HDF5 with \"scp\" file:\n\n    >>> with file_writer_helper('ark,scp:out.h5,out.scp', 'hdf5') as f:\n    ...     f['uttid'] = array\n\n    This \"scp\" file is created as:\n\n        uttidA out.h5:uttidA\n        uttidB out.h5:uttidB\n\n    Unlike \"kaldi-ark\", HDF5 allows random access by key,\n    so \"scp\" is not strictly required for random reading.\n    Nevertheless we create \"scp\" for HDF5 because it is useful\n    for some use cases, e.g. 
Concatenation, Splitting.\n\n    \"\"\"\n    if filetype == \"mat\":\n        return KaldiWriter(\n            wspecifier,\n            write_num_frames=write_num_frames,\n            compress=compress,\n            compression_method=compression_method,\n        )\n    elif filetype == \"hdf5\":\n        return HDF5Writer(\n            wspecifier, write_num_frames=write_num_frames, compress=compress\n        )\n    elif filetype == \"sound.hdf5\":\n        return SoundHDF5Writer(\n            wspecifier, write_num_frames=write_num_frames, pcm_format=pcm_format\n        )\n    elif filetype == \"sound\":\n        return SoundWriter(\n            wspecifier, write_num_frames=write_num_frames, pcm_format=pcm_format\n        )\n    else:\n        raise NotImplementedError(f\"filetype={filetype}\")\n\n\nclass BaseWriter:\n    def __setitem__(self, key, value):\n        raise NotImplementedError\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, exc_type, exc_val, exc_tb):\n        self.close()\n\n    def close(self):\n        try:\n            self.writer.close()\n        except Exception:\n            pass\n\n        if self.writer_scp is not None:\n            try:\n                self.writer_scp.close()\n            except Exception:\n                pass\n\n        if self.writer_nframe is not None:\n            try:\n                self.writer_nframe.close()\n            except Exception:\n                pass\n\n\ndef get_num_frames_writer(write_num_frames: str):\n    \"\"\"Open a text file for writing the number of frames per utterance\n\n    Returns None if write_num_frames is None.\n\n    Examples:\n        >>> get_num_frames_writer('ark,t:num_frames.txt')\n    \"\"\"\n    if write_num_frames is not None:\n        if \":\" not in write_num_frames:\n            raise ValueError(\n                'Must include \":\", write_num_frames={}'.format(write_num_frames)\n            )\n\n        nframes_type, nframes_file = write_num_frames.split(\":\", 1)\n        if nframes_type != \"ark,t\":\n            raise ValueError(\n        
        \"Only supporting text mode. \"\n                \"e.g. --write-num-frames=ark,t:foo.txt :\"\n                \"{}\".format(nframes_type)\n            )\n\n        # nframes_file is only defined once the specifier has been parsed\n        return open(nframes_file, \"w\", encoding=\"utf-8\")\n\n    # no specifier given: nothing to open\n    return None\n\n\nclass KaldiWriter(BaseWriter):\n    def __init__(\n        self, wspecifier, write_num_frames=None, compress=False, compression_method=2\n    ):\n        if compress:\n            self.writer = kaldiio.WriteHelper(\n                wspecifier, compression_method=compression_method\n            )\n        else:\n            self.writer = kaldiio.WriteHelper(wspecifier)\n        self.writer_scp = None\n        if write_num_frames is not None:\n            self.writer_nframe = get_num_frames_writer(write_num_frames)\n        else:\n            self.writer_nframe = None\n\n    def __setitem__(self, key, value):\n        self.writer[key] = value\n        if self.writer_nframe is not None:\n            self.writer_nframe.write(f\"{key} {len(value)}\\n\")\n\n\ndef parse_wspecifier(wspecifier: str) -> Dict[str, str]:\n    \"\"\"Parse wspecifier to dict\n\n    Examples:\n        >>> parse_wspecifier('ark,scp:out.ark,out.scp')\n        {'ark': 'out.ark', 'scp': 'out.scp'}\n\n    \"\"\"\n    ark_scp, filepath = wspecifier.split(\":\", 1)\n    if ark_scp not in [\"ark\", \"scp,ark\", \"ark,scp\"]:\n        raise ValueError(\"{} is not allowed: {}\".format(ark_scp, wspecifier))\n    ark_scps = ark_scp.split(\",\")\n    filepaths = filepath.split(\",\")\n    if len(ark_scps) != len(filepaths):\n        raise ValueError(\"Mismatch: {} and {}\".format(ark_scp, filepath))\n    spec_dict = dict(zip(ark_scps, filepaths))\n    return spec_dict\n\n\nclass HDF5Writer(BaseWriter):\n    \"\"\"HDF5Writer\n\n    Examples:\n        >>> with HDF5Writer('ark:out.h5', compress=True) as f:\n        ...     
f['key'] = array\n    \"\"\"\n\n    def __init__(self, wspecifier, write_num_frames=None, compress=False):\n        spec_dict = parse_wspecifier(wspecifier)\n        self.filename = spec_dict[\"ark\"]\n\n        if compress:\n            self.kwargs = {\"compression\": \"gzip\"}\n        else:\n            self.kwargs = {}\n        self.writer = h5py.File(spec_dict[\"ark\"], \"w\")\n        if \"scp\" in spec_dict:\n            self.writer_scp = open(spec_dict[\"scp\"], \"w\", encoding=\"utf-8\")\n        else:\n            self.writer_scp = None\n        if write_num_frames is not None:\n            self.writer_nframe = get_num_frames_writer(write_num_frames)\n        else:\n            self.writer_nframe = None\n\n    def __setitem__(self, key, value):\n        self.writer.create_dataset(key, data=value, **self.kwargs)\n\n        if self.writer_scp is not None:\n            self.writer_scp.write(f\"{key} {self.filename}:{key}\\n\")\n        if self.writer_nframe is not None:\n            self.writer_nframe.write(f\"{key} {len(value)}\\n\")\n\n\nclass SoundHDF5Writer(BaseWriter):\n    \"\"\"SoundHDF5Writer\n\n    Examples:\n        >>> fs = 16000\n        >>> with SoundHDF5Writer('ark:out.h5') as f:\n        ...     
f['key'] = fs, array\n    \"\"\"\n\n    def __init__(self, wspecifier, write_num_frames=None, pcm_format=\"wav\"):\n        self.pcm_format = pcm_format\n        spec_dict = parse_wspecifier(wspecifier)\n        self.filename = spec_dict[\"ark\"]\n        self.writer = SoundHDF5File(spec_dict[\"ark\"], \"w\", format=self.pcm_format)\n        if \"scp\" in spec_dict:\n            self.writer_scp = open(spec_dict[\"scp\"], \"w\", encoding=\"utf-8\")\n        else:\n            self.writer_scp = None\n        if write_num_frames is not None:\n            self.writer_nframe = get_num_frames_writer(write_num_frames)\n        else:\n            self.writer_nframe = None\n\n    def __setitem__(self, key, value):\n        assert_scipy_wav_style(value)\n        # Change Tuple[int, ndarray] -> Tuple[ndarray, int]\n        # (scipy style -> soundfile style)\n        value = (value[1], value[0])\n        self.writer.create_dataset(key, data=value)\n\n        if self.writer_scp is not None:\n            self.writer_scp.write(f\"{key} {self.filename}:{key}\\n\")\n        if self.writer_nframe is not None:\n            self.writer_nframe.write(f\"{key} {len(value[0])}\\n\")\n\n\nclass SoundWriter(BaseWriter):\n    \"\"\"SoundWriter\n\n    Examples:\n        >>> fs = 16000\n        >>> with SoundWriter('ark,scp:outdir,out.scp') as f:\n        ...     f['key'] = fs, array\n    \"\"\"\n\n    def __init__(self, wspecifier, write_num_frames=None, pcm_format=\"wav\"):\n        self.pcm_format = pcm_format\n        spec_dict = parse_wspecifier(wspecifier)\n        # e.g. 
ark,scp:dirname,wav.scp\n        # -> The wave files are found in dirname/*.wav\n        self.dirname = spec_dict[\"ark\"]\n        Path(self.dirname).mkdir(parents=True, exist_ok=True)\n        self.writer = None\n\n        if \"scp\" in spec_dict:\n            self.writer_scp = open(spec_dict[\"scp\"], \"w\", encoding=\"utf-8\")\n        else:\n            self.writer_scp = None\n        if write_num_frames is not None:\n            self.writer_nframe = get_num_frames_writer(write_num_frames)\n        else:\n            self.writer_nframe = None\n\n    def __setitem__(self, key, value):\n        assert_scipy_wav_style(value)\n        rate, signal = value\n        wavfile = Path(self.dirname) / (key + \".\" + self.pcm_format)\n        soundfile.write(wavfile, signal.astype(numpy.int16), rate)\n\n        if self.writer_scp is not None:\n            self.writer_scp.write(f\"{key} {wavfile}\\n\")\n        if self.writer_nframe is not None:\n            self.writer_nframe.write(f\"{key} {len(signal)}\\n\")\n"
  },
  {
    "path": "utils/dataset.py",
    "content": "#!/usr/bin/env python\n\n# Copyright 2017 Johns Hopkins University (Shinji Watanabe)\n# Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"pytorch dataset and dataloader implementation for chainer training.\"\"\"\n\nimport torch\nimport torch.utils.data\nimport time\n\ndef get_time():\n    return time.asctime( time.localtime(time.time()))\n\nclass TransformDataset(torch.utils.data.Dataset):\n    \"\"\"Transform Dataset for pytorch backend.\n\n    Args:\n        data: list object from make_batchset\n        transfrom: transform function\n\n    \"\"\"\n\n    def __init__(self, data, transform):\n        \"\"\"Init function.\"\"\"\n        super(TransformDataset).__init__()\n        self.data = data\n        self.transform = transform\n\n    def __len__(self):\n        \"\"\"Len function.\"\"\"\n        return len(self.data)\n\n    def __getitem__(self, idx):\n        \"\"\"[] operator.\"\"\"\n        # print(f\"{get_time()}: data laoder call getitem\")\n        return self.transform(self.data[idx])\n\n\nclass ChainerDataLoader(object):\n    \"\"\"Pytorch dataloader in chainer style.\n\n    Args:\n        all args for torch.utils.data.dataloader.Dataloader\n\n    \"\"\"\n\n    def __init__(self, **kwargs):\n        \"\"\"Init function.\"\"\"\n        self.loader = torch.utils.data.dataloader.DataLoader(**kwargs)\n        self.len = len(kwargs[\"dataset\"])\n        self.current_position = 0\n        self.epoch = 0\n        self.iter = None\n        self.kwargs = kwargs\n\n    def next(self):\n        \"\"\"Implement next function.\"\"\"\n        if self.iter is None:\n            self.iter = iter(self.loader)\n        try:\n            ret = next(self.iter)\n        except StopIteration:\n            self.iter = None\n            return self.next()\n        self.current_position += 1\n        if self.current_position == self.len:\n            self.epoch = self.epoch + 1\n            self.current_position = 0\n        return ret\n\n    
def __iter__(self):\n        \"\"\"Implement iter function.\"\"\"\n        for batch in self.loader:\n            yield batch\n\n    @property\n    def epoch_detail(self):\n        \"\"\"Return the fractional epoch count required by chainer.\"\"\"\n        return self.epoch + self.current_position / self.len\n\n    def serialize(self, serializer):\n        \"\"\"Serialize and deserialize function.\"\"\"\n        epoch = serializer(\"epoch\", self.epoch)\n        current_position = serializer(\"current_position\", self.current_position)\n        self.epoch = epoch\n        self.current_position = current_position\n\n    def start_shuffle(self):\n        \"\"\"Shuffle function for sortagrad.\"\"\"\n        # Set shuffle=True only when a \"sampler\" entry was passed in; otherwise leave it unset (None)\n        self.kwargs[\"shuffle\"] = True if \"sampler\" in self.kwargs else None\n        self.loader = torch.utils.data.dataloader.DataLoader(**self.kwargs)\n\n    def finalize(self):\n        \"\"\"Implement finalize function.\"\"\"\n        del self.loader\n"
  },
  {
    "path": "utils/deterministic_utils.py",
    "content": "import logging\nimport os\n\nimport chainer\nimport torch\n\n\ndef set_deterministic_pytorch(args):\n    \"\"\"Ensures pytorch produces deterministic results depending on the program arguments\n\n    :param Namespace args: The program arguments\n    \"\"\"\n    # seed setting\n    torch.manual_seed(args.seed)\n\n    # debug mode setting\n    # 0 would be fastest, but 1 seems to be reasonable\n    # considering reproducibility\n    # remove type check\n    torch.backends.cudnn.deterministic = True\n    torch.backends.cudnn.benchmark = (\n        False  # https://github.com/pytorch/pytorch/issues/6351\n    )\n    if args.debugmode < 2:\n        chainer.config.type_check = False\n        logging.info(\"torch type check is disabled\")\n    # use deterministic computation or not\n    if args.debugmode < 1:\n        torch.backends.cudnn.deterministic = False\n        torch.backends.cudnn.benchmark = True\n        logging.info(\"torch cudnn deterministic is disabled\")\n\n\ndef set_deterministic_chainer(args):\n    \"\"\"Ensures chainer produces deterministic results depending on the program arguments\n\n    :param Namespace args: The program arguments\n    \"\"\"\n    # seed setting (chainer seed may not need it)\n    os.environ[\"CHAINER_SEED\"] = str(args.seed)\n    logging.info(\"chainer seed = \" + os.environ[\"CHAINER_SEED\"])\n\n    # debug mode setting\n    # 0 would be fastest, but 1 seems to be reasonable\n    # considering reproducibility\n    # remove type check\n    if args.debugmode < 2:\n        chainer.config.type_check = False\n        logging.info(\"chainer type check is disabled\")\n    # use deterministic computation or not\n    if args.debugmode < 1:\n        chainer.config.cudnn_deterministic = False\n        logging.info(\"chainer cudnn deterministic is disabled\")\n    else:\n        chainer.config.cudnn_deterministic = True\n"
  },
  {
    "path": "utils/draw_num_fst.py",
    "content": "#!/usr/bin/env python3\n# encoding: utf-8\nimport sys\nimport torch\nimport k2\nfrom pathlib import Path\nfrom espnet.snowfall.training.mmi_graph import MmiTrainingGraphCompiler\nfrom espnet.snowfall.lexicon import Lexicon\nfrom espnet.snowfall.training.mmi_graph import create_bigram_phone_lm\n\ndef main():\n\n    # compiler\n    lang = Path(\"data/lang_k2mmi\")\n    lexicon = Lexicon(lang)\n    device = torch.device(\"cuda:0\")\n    graph_compiler = MmiTrainingGraphCompiler(lexicon=lexicon, device=device)\n    \n    # P\n    phone_ids = lexicon.phone_symbols()\n    P = create_bigram_phone_lm(phone_ids)\n    P = P.to(device)\n\n    # compile num graph\n    ys = [\"S O U R C E <space> C O L O N\"]\n    num_graphs, _ = graph_compiler.compile(ys, P, replicate_den=True)\n    num = num_graphs[0]\n\n    # draw\n    num.draw(\"num.svg\") \n\n\nmain() \n\n"
  },
  {
    "path": "utils/dynamic_import.py",
    "content": "import importlib\n\n\ndef dynamic_import(import_path, alias=dict()):\n    \"\"\"dynamic import module and class\n\n    :param str import_path: syntax 'module_name:class_name'\n        e.g., 'espnet.transform.add_deltas:AddDeltas'\n    :param dict alias: shortcut for registered class\n    :return: imported class\n    \"\"\"\n    if import_path not in alias and \":\" not in import_path:\n        raise ValueError(\n            \"import_path should be one of {} or \"\n            'include \":\", e.g. \"espnet.transform.add_deltas:AddDeltas\" : '\n            \"{}\".format(set(alias), import_path)\n        )\n    if \":\" not in import_path:\n        import_path = alias[import_path]\n\n    module_name, objname = import_path.split(\":\")\n    m = importlib.import_module(module_name)\n    return getattr(m, objname)\n"
  },
  {
    "path": "utils/fill_missing_args.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Copyright 2018 Nagoya University (Tomoki Hayashi)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\nimport argparse\nimport logging\n\n\ndef fill_missing_args(args, add_arguments):\n    \"\"\"Fill missing arguments in args.\n\n    Args:\n        args (Namespace or None): Namesapce containing hyperparameters.\n        add_arguments (function): Function to add arguments.\n\n    Returns:\n        Namespace: Arguments whose missing ones are filled with default value.\n\n    Examples:\n        >>> from argparse import Namespace\n        >>> from espnet.nets.pytorch_backend.e2e_tts_tacotron2 import Tacotron2\n        >>> args = Namespace()\n        >>> fill_missing_args(args, Tacotron2.add_arguments_fn)\n        Namespace(aconv_chans=32, aconv_filts=15, adim=512, atype='location', ...)\n\n    \"\"\"\n    # check argument type\n    assert isinstance(args, argparse.Namespace) or args is None\n    assert callable(add_arguments)\n\n    # get default arguments\n    default_args, _ = add_arguments(argparse.ArgumentParser()).parse_known_args()\n\n    # convert to dict\n    args = {} if args is None else vars(args)\n    default_args = vars(default_args)\n\n    for key, value in default_args.items():\n        if key not in args:\n            logging.info(\n                'attribute \"%s\" does not exist. use default %s.' % (key, str(value))\n            )\n            args[key] = value\n\n    return argparse.Namespace(**args)\n"
  },
  {
    "path": "utils/io_utils.py",
    "content": "from collections import OrderedDict\nimport io\nimport logging\nimport os\nimport copy\nimport h5py\nimport kaldiio\nimport numpy as np\nimport soundfile\nimport time\nimport psutil\nimport torch\nfrom espnet.transform.transformation import Transformation\n\ndef get_time():\n    return time.asctime( time.localtime(time.time()))\n\nclass LoadInputsAndTargets(object):\n    \"\"\"Create a mini-batch from a list of dicts\n\n    >>> batch = [('utt1',\n    ...           dict(input=[dict(feat='some.ark:123',\n    ...                            filetype='mat',\n    ...                            name='input1',\n    ...                            shape=[100, 80])],\n    ...                output=[dict(tokenid='1 2 3 4',\n    ...                             name='target1',\n    ...                             shape=[4, 31])]]))\n    >>> l = LoadInputsAndTargets()\n    >>> feat, target = l(batch)\n\n    :param: str mode: Specify the task mode, \"asr\" or \"tts\"\n    :param: str preprocess_conf: The path of a json file for pre-processing\n    :param: bool load_input: If False, not to load the input data\n    :param: bool load_output: If False, not to load the output data\n    :param: bool sort_in_input_length: Sort the mini-batch in descending order\n        of the input length\n    :param: bool use_speaker_embedding: Used for tts mode only\n    :param: bool use_second_target: Used for tts mode only\n    :param: dict preprocess_args: Set some optional arguments for preprocessing\n    :param: Optional[dict] preprocess_args: Used for tts mode only\n    \"\"\"\n\n    def __init__(\n        self,\n        mode=\"asr\",\n        preprocess_conf=None,\n        load_input=True,\n        load_output=True,\n        sort_in_input_length=True,\n        use_speaker_embedding=False,\n        use_second_target=False,\n        preprocess_args=None,\n        keep_all_data_on_mem=False,\n        block_load=False,\n    ):\n        self._loaders = {}\n        if mode not in 
[\"asr\", \"tts\", \"mt\", \"vc\"]:\n            raise ValueError(\"Only asr or tts are allowed: mode={}\".format(mode))\n        if preprocess_conf is not None:\n            self.preprocessing = Transformation(preprocess_conf)\n            logging.warning(\n                \"[Experimental feature] Some preprocessing will be done \"\n                \"for the mini-batch creation using {}\".format(self.preprocessing)\n            )\n        else:\n            # If conf doesn't exist, this function don't touch anything.\n            self.preprocessing = None\n\n        if use_second_target and use_speaker_embedding and mode == \"tts\":\n            raise ValueError(\n                'Choose one of \"use_second_target\" and ' '\"use_speaker_embedding \"'\n            )\n        if (\n            (use_second_target or use_speaker_embedding)\n            and mode != \"tts\"\n            and mode != \"vc\"\n        ):\n            logging.warning(\n                '\"use_second_target\" and \"use_speaker_embedding\" is '\n                \"used only for tts or vc mode\"\n            )\n\n        self.mode = mode\n        self.load_output = load_output\n        self.load_input = load_input\n        self.sort_in_input_length = sort_in_input_length\n        self.use_speaker_embedding = use_speaker_embedding\n        self.use_second_target = use_second_target\n        if preprocess_args is None:\n            self.preprocess_args = {}\n        else:\n            assert isinstance(preprocess_args, dict), type(preprocess_args)\n            self.preprocess_args = dict(preprocess_args)\n\n        self.keep_all_data_on_mem = keep_all_data_on_mem\n        self.block_load = block_load\n\n    def __call__(self, batch, return_uttid=False):\n        \"\"\"Function to load inputs and targets from list of dicts\n\n        :param List[Tuple[str, dict]] batch: list of dict which is subset of\n            loaded data.json\n        :param bool return_uttid: return utterance ID information 
for visualization\n        :return: list of input token id sequences [(L_1), (L_2), ..., (L_B)]\n        :return: list of input feature sequences\n            [(T_1, D), (T_2, D), ..., (T_B, D)]\n        :rtype: list of float ndarray\n        :return: list of target token id sequences [(L_1), (L_2), ..., (L_B)]\n        :rtype: list of int ndarray\n\n        \"\"\"\n        x_feats_dict = OrderedDict()  # OrderedDict[str, List[np.ndarray]]\n        y_feats_dict = OrderedDict()  # OrderedDict[str, List[np.ndarray]]\n        uttid_list = []  # List[str]\n        text_list = [] # List[str]\n\n        if self.block_load:\n            _, info = batch[0]\n            ark_names = [parse_arkpath(inp[\"feat\"]) for inp in info[\"input\"]]\n\n        for uttid, info in batch:\n            uttid_list.append(uttid)\n\n            if self.load_input:\n                # Note(kamo): This for-loop is for multiple inputs\n                for idx, inp in enumerate(info[\"input\"]):\n                    # {\"input\":\n                    #  [{\"feat\": \"some/path.h5:F01_050C0101_PED_REAL\",\n                    #    \"filetype\": \"hdf5\",\n                    #    \"name\": \"input1\", ...}], ...}\n                    if self.block_load:\n                        assert parse_arkpath(inp[\"feat\"]) == ark_names[idx],\\\n                               f\"The batch should from the same ark if use block_load, key error: {inp['feat']}\"\n                    x = self._get_from_loader(\n                        filepath=inp[\"feat\"], filetype=inp.get(\"filetype\", \"mat\"),\n                        uttid=uttid\n                    )\n                    x_feats_dict.setdefault(inp[\"name\"], []).append(x)\n            # FIXME(kamo): Dirty way to load only speaker_embedding\n            elif self.mode == \"tts\" and self.use_speaker_embedding:\n                for idx, inp in enumerate(info[\"input\"]):\n                    if idx != 1 and len(info[\"input\"]) > 1:\n                        
x = None\n                    else:\n                        x = self._get_from_loader(\n                            filepath=inp[\"feat\"], filetype=inp.get(\"filetype\", \"mat\")\n                        )\n                    x_feats_dict.setdefault(inp[\"name\"], []).append(x)\n\n            if self.load_output:\n                if self.mode == \"mt\":\n                    x = np.fromiter(\n                        map(int, info[\"output\"][1][\"tokenid\"].split()), dtype=np.int64\n                    )\n                    x_feats_dict.setdefault(info[\"output\"][1][\"name\"], []).append(x)\n\n                for idx, inp in enumerate(info[\"output\"]):\n                    if \"tokenid\" in inp:\n                        # ======= Legacy format for output =======\n                        # {\"output\": [{\"tokenid\": \"1 2 3 4\"}])\n                        x = np.fromiter(\n                            map(int, inp[\"tokenid\"].split()), dtype=np.int64\n                        )\n                    else:\n                        # ======= New format =======\n                        # {\"input\":\n                        #  [{\"feat\": \"some/path.h5:F01_050C0101_PED_REAL\",\n                        #    \"filetype\": \"hdf5\",\n                        #    \"name\": \"target1\", ...}], ...}\n                        x = self._get_from_loader(\n                            filepath=inp[\"feat\"], filetype=inp.get(\"filetype\", \"mat\")\n                        )\n\n                    y_feats_dict.setdefault(inp[\"name\"], []).append(x)\n           \n            if \"text_org\" in info: \n                text_list.append(info[\"text_org\"])\n            else:\n                text_list.append(info[\"output\"][0][\"text\"])\n        \n        if self.mode == \"asr\":\n            return_batch, uttid_list = self._create_batch_asr(\n                x_feats_dict, y_feats_dict, uttid_list\n            )\n        elif self.mode == \"tts\":\n            _, info = 
batch[0]\n            eos = int(info[\"output\"][0][\"shape\"][1]) - 1\n            return_batch, uttid_list = self._create_batch_tts(\n                x_feats_dict, y_feats_dict, uttid_list, eos\n            )\n        elif self.mode == \"mt\":\n            return_batch, uttid_list = self._create_batch_mt(\n                x_feats_dict, y_feats_dict, uttid_list\n            )\n        elif self.mode == \"vc\":\n            return_batch, uttid_list = self._create_batch_vc(\n                x_feats_dict, y_feats_dict, uttid_list\n            )\n        else:\n            raise NotImplementedError(self.mode)\n        \n        \"\"\"\n        Additional information by tyriontian\n        xs_orig is the identical spectrum with xs but ignore preprocess (like specaug)\n        we need xs_orig for on-the-fly decoding in MBR training\n        text_org is the original text label sequence. we need this for MMI training\n        \"\"\"\n\n        return_batch[\"text_org\"] = text_list\n        return_batch[\"xs_orig\"] = copy.deepcopy(return_batch[\"input1\"])\n\n        if self.preprocessing is not None:\n            # Apply pre-processing all input features\n            for x_name in return_batch.keys():\n                if x_name.startswith(\"input\"):\n                    return_batch[x_name] = self.preprocessing(\n                        return_batch[x_name], uttid_list, **self.preprocess_args\n                    )\n      \n        if return_uttid:\n            return tuple(return_batch.values()), uttid_list\n\n        # Doesn't return the names now.\n        return tuple(return_batch.values())\n\n    def _create_batch_asr(self, x_feats_dict, y_feats_dict, uttid_list):\n        \"\"\"Create a OrderedDict for the mini-batch\n\n        :param OrderedDict x_feats_dict:\n            e.g. {\"input1\": [ndarray, ndarray, ...],\n                  \"input2\": [ndarray, ndarray, ...]}\n        :param OrderedDict y_feats_dict:\n            e.g. 
{\"target1\": [ndarray, ndarray, ...],\n                  \"target2\": [ndarray, ndarray, ...]}\n        :param: List[str] uttid_list:\n            Give uttid_list to sort in the same order as the mini-batch\n        :return: batch, uttid_list\n        :rtype: Tuple[OrderedDict, List[str]]\n        \"\"\"\n        # handle single-input and multi-input (paralell) asr mode\n        xs = list(x_feats_dict.values())\n\n        if self.load_output:\n            ys = list(y_feats_dict.values())\n            assert len(xs[0]) == len(ys[0]), (len(xs[0]), len(ys[0]))\n\n            # get index of non-zero length samples\n            nonzero_idx = list(filter(lambda i: len(ys[0][i]) > 0, range(len(ys[0]))))\n            for n in range(1, len(y_feats_dict)):\n                nonzero_idx = filter(lambda i: len(ys[n][i]) > 0, nonzero_idx)\n        else:\n            # Note(kamo): Be careful not to make nonzero_idx to a generator\n            nonzero_idx = list(range(len(xs[0])))\n\n        if self.sort_in_input_length:\n            # sort in input lengths based on the first input\n            nonzero_sorted_idx = sorted(nonzero_idx, key=lambda i: -len(xs[0][i]))\n        else:\n            nonzero_sorted_idx = nonzero_idx\n\n        if len(nonzero_sorted_idx) != len(xs[0]):\n            logging.warning(\n                \"Target sequences include empty tokenid (batch {} -> {}).\".format(\n                    len(xs[0]), len(nonzero_sorted_idx)\n                )\n            )\n\n        # remove zero-length samples\n        xs = [[x[i] for i in nonzero_sorted_idx] for x in xs]\n        uttid_list = [uttid_list[i] for i in nonzero_sorted_idx]\n\n        x_names = list(x_feats_dict.keys())\n        if self.load_output:\n            ys = [[y[i] for i in nonzero_sorted_idx] for y in ys]\n            y_names = list(y_feats_dict.keys())\n\n            # Keeping x_name and y_name, e.g. 
input1, for future extension\n            return_batch = OrderedDict(\n                [\n                    *[(x_name, x) for x_name, x in zip(x_names, xs)],\n                    *[(y_name, y) for y_name, y in zip(y_names, ys)],\n                ]\n            )\n        else:\n            return_batch = OrderedDict([(x_name, x) for x_name, x in zip(x_names, xs)])\n        return return_batch, uttid_list\n\n    def _create_batch_mt(self, x_feats_dict, y_feats_dict, uttid_list):\n        \"\"\"Create a OrderedDict for the mini-batch\n\n        :param OrderedDict x_feats_dict:\n        :param OrderedDict y_feats_dict:\n        :return: batch, uttid_list\n        :rtype: Tuple[OrderedDict, List[str]]\n        \"\"\"\n        # Create a list from the first item\n        xs = list(x_feats_dict.values())[0]\n\n        if self.load_output:\n            ys = list(y_feats_dict.values())[0]\n            assert len(xs) == len(ys), (len(xs), len(ys))\n\n            # get index of non-zero length samples\n            nonzero_idx = filter(lambda i: len(ys[i]) > 0, range(len(ys)))\n        else:\n            nonzero_idx = range(len(xs))\n\n        if self.sort_in_input_length:\n            # sort in input lengths\n            nonzero_sorted_idx = sorted(nonzero_idx, key=lambda i: -len(xs[i]))\n        else:\n            nonzero_sorted_idx = nonzero_idx\n\n        if len(nonzero_sorted_idx) != len(xs):\n            logging.warning(\n                \"Target sequences include empty tokenid (batch {} -> {}).\".format(\n                    len(xs), len(nonzero_sorted_idx)\n                )\n            )\n\n        # remove zero-length samples\n        xs = [xs[i] for i in nonzero_sorted_idx]\n        uttid_list = [uttid_list[i] for i in nonzero_sorted_idx]\n\n        x_name = list(x_feats_dict.keys())[0]\n        if self.load_output:\n            ys = [ys[i] for i in nonzero_sorted_idx]\n            y_name = list(y_feats_dict.keys())[0]\n\n            return_batch = 
OrderedDict([(x_name, xs), (y_name, ys)])\n        else:\n            return_batch = OrderedDict([(x_name, xs)])\n        return return_batch, uttid_list\n\n    def _create_batch_tts(self, x_feats_dict, y_feats_dict, uttid_list, eos):\n        \"\"\"Create a OrderedDict for the mini-batch\n\n        :param OrderedDict x_feats_dict:\n            e.g. {\"input1\": [ndarray, ndarray, ...],\n                  \"input2\": [ndarray, ndarray, ...]}\n        :param OrderedDict y_feats_dict:\n            e.g. {\"target1\": [ndarray, ndarray, ...],\n                  \"target2\": [ndarray, ndarray, ...]}\n        :param: List[str] uttid_list:\n        :param int eos:\n        :return: batch, uttid_list\n        :rtype: Tuple[OrderedDict, List[str]]\n        \"\"\"\n        # Use the output values as the input feats for tts mode\n        xs = list(y_feats_dict.values())[0]\n        # get index of non-zero length samples\n        nonzero_idx = list(filter(lambda i: len(xs[i]) > 0, range(len(xs))))\n        # sort in input lengths\n        if self.sort_in_input_length:\n            # sort in input lengths\n            nonzero_sorted_idx = sorted(nonzero_idx, key=lambda i: -len(xs[i]))\n        else:\n            nonzero_sorted_idx = nonzero_idx\n        # remove zero-length samples\n        xs = [xs[i] for i in nonzero_sorted_idx]\n        uttid_list = [uttid_list[i] for i in nonzero_sorted_idx]\n        # Added eos into input sequence\n        xs = [np.append(x, eos) for x in xs]\n\n        if self.load_input:\n            ys = list(x_feats_dict.values())[0]\n            assert len(xs) == len(ys), (len(xs), len(ys))\n            ys = [ys[i] for i in nonzero_sorted_idx]\n\n            spembs = None\n            spcs = None\n            spembs_name = \"spembs_none\"\n            spcs_name = \"spcs_none\"\n\n            if self.use_second_target:\n                spcs = list(x_feats_dict.values())[1]\n                spcs = [spcs[i] for i in nonzero_sorted_idx]\n                
spcs_name = list(x_feats_dict.keys())[1]\n\n            if self.use_speaker_embedding:\n                spembs = list(x_feats_dict.values())[1]\n                spembs = [spembs[i] for i in nonzero_sorted_idx]\n                spembs_name = list(x_feats_dict.keys())[1]\n\n            x_name = list(y_feats_dict.keys())[0]\n            y_name = list(x_feats_dict.keys())[0]\n\n            return_batch = OrderedDict(\n                [(x_name, xs), (y_name, ys), (spembs_name, spembs), (spcs_name, spcs)]\n            )\n        elif self.use_speaker_embedding:\n            if len(x_feats_dict) == 0:\n                raise IndexError(\"No speaker embedding is provided\")\n            elif len(x_feats_dict) == 1:\n                spembs_idx = 0\n            else:\n                spembs_idx = 1\n\n            spembs = list(x_feats_dict.values())[spembs_idx]\n            spembs = [spembs[i] for i in nonzero_sorted_idx]\n\n            x_name = list(y_feats_dict.keys())[0]\n            spembs_name = list(x_feats_dict.keys())[spembs_idx]\n\n            return_batch = OrderedDict([(x_name, xs), (spembs_name, spembs)])\n        else:\n            x_name = list(y_feats_dict.keys())[0]\n\n            return_batch = OrderedDict([(x_name, xs)])\n        return return_batch, uttid_list\n\n    def _create_batch_vc(self, x_feats_dict, y_feats_dict, uttid_list):\n        \"\"\"Create a OrderedDict for the mini-batch\n\n        :param OrderedDict x_feats_dict:\n            e.g. {\"input1\": [ndarray, ndarray, ...],\n                  \"input2\": [ndarray, ndarray, ...]}\n        :param OrderedDict y_feats_dict:\n            e.g. 
{\"target1\": [ndarray, ndarray, ...],\n                  \"target2\": [ndarray, ndarray, ...]}\n        :param: List[str] uttid_list:\n        :return: batch, uttid_list\n        :rtype: Tuple[OrderedDict, List[str]]\n        \"\"\"\n        # Create a list from the first item\n        xs = list(x_feats_dict.values())[0]\n\n        # get index of non-zero length samples\n        nonzero_idx = list(filter(lambda i: len(xs[i]) > 0, range(len(xs))))\n\n        # sort in input lengths\n        if self.sort_in_input_length:\n            # sort in input lengths\n            nonzero_sorted_idx = sorted(nonzero_idx, key=lambda i: -len(xs[i]))\n        else:\n            nonzero_sorted_idx = nonzero_idx\n\n        # remove zero-length samples\n        xs = [xs[i] for i in nonzero_sorted_idx]\n        uttid_list = [uttid_list[i] for i in nonzero_sorted_idx]\n\n        if self.load_output:\n            ys = list(y_feats_dict.values())[0]\n            assert len(xs) == len(ys), (len(xs), len(ys))\n            ys = [ys[i] for i in nonzero_sorted_idx]\n\n            spembs = None\n            spcs = None\n            spembs_name = \"spembs_none\"\n            spcs_name = \"spcs_none\"\n\n            if self.use_second_target:\n                raise ValueError(\"Currently second target not supported.\")\n                spcs = list(x_feats_dict.values())[1]\n                spcs = [spcs[i] for i in nonzero_sorted_idx]\n                spcs_name = list(x_feats_dict.keys())[1]\n\n            if self.use_speaker_embedding:\n                spembs = list(x_feats_dict.values())[1]\n                spembs = [spembs[i] for i in nonzero_sorted_idx]\n                spembs_name = list(x_feats_dict.keys())[1]\n\n            x_name = list(x_feats_dict.keys())[0]\n            y_name = list(y_feats_dict.keys())[0]\n\n            return_batch = OrderedDict(\n                [(x_name, xs), (y_name, ys), (spembs_name, spembs), (spcs_name, spcs)]\n            )\n        elif 
self.use_speaker_embedding:\n            if len(x_feats_dict) == 0:\n                raise IndexError(\"No speaker embedding is provided\")\n            elif len(x_feats_dict) == 1:\n                spembs_idx = 0\n            else:\n                spembs_idx = 1\n\n            spembs = list(x_feats_dict.values())[spembs_idx]\n            spembs = [spembs[i] for i in nonzero_sorted_idx]\n\n            x_name = list(x_feats_dict.keys())[0]\n            spembs_name = list(x_feats_dict.keys())[spembs_idx]\n\n            return_batch = OrderedDict([(x_name, xs), (spembs_name, spembs)])\n        else:\n            x_name = list(x_feats_dict.keys())[0]\n\n            return_batch = OrderedDict([(x_name, xs)])\n        return return_batch, uttid_list\n\n    def _get_from_loader(self, filepath, filetype, uttid=None):\n        \"\"\"Return ndarray\n\n        In order to make the fds to be opened only at the first referring,\n        the loader are stored in self._loaders\n\n        >>> ndarray = loader.get_from_loader(\n        ...     
'some/path.h5:F01_050C0101_PED_REAL', filetype='hdf5')\n\n        :param: str filepath:\n        :param: str filetype:\n        :return:\n        :rtype: np.ndarray\n        \"\"\"\n        if filetype == \"hdf5\":\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.h5:F01_050C0101_PED_REAL\",\n            #                \"filetype\": \"hdf5\",\n            # -> filepath = \"some/path.h5\", key = \"F01_050C0101_PED_REAL\"\n            filepath, key = filepath.split(\":\", 1)\n\n            loader = self._loaders.get(filepath)\n            if loader is None:\n                # To avoid disk access, create loader only for the first time\n                loader = h5py.File(filepath, \"r\")\n                self._loaders[filepath] = loader\n            return loader[key][()]\n        elif filetype == \"sound.hdf5\":\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.h5:F01_050C0101_PED_REAL\",\n            #                \"filetype\": \"sound.hdf5\",\n            # -> filepath = \"some/path.h5\", key = \"F01_050C0101_PED_REAL\"\n            filepath, key = filepath.split(\":\", 1)\n\n            loader = self._loaders.get(filepath)\n            if loader is None:\n                # To avoid disk access, create loader only for the first time\n                loader = SoundHDF5File(filepath, \"r\", dtype=\"int16\")\n                self._loaders[filepath] = loader\n            array, rate = loader[key]\n            return array\n        elif filetype == \"sound\":\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.wav\",\n            #                \"filetype\": \"sound\"},\n            # Assume PCM16\n            if not self.keep_all_data_on_mem:\n                array, _ = soundfile.read(filepath, dtype=\"int16\")\n                return array\n            if filepath not in self._loaders:\n                array, _ = soundfile.read(filepath, dtype=\"int16\")\n                
self._loaders[filepath] = array\n            return self._loaders[filepath]\n        elif filetype == \"npz\":\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.npz:F01_050C0101_PED_REAL\",\n            #                \"filetype\": \"npz\",\n            filepath, key = filepath.split(\":\", 1)\n\n            loader = self._loaders.get(filepath)\n            if loader is None:\n                # To avoid disk access, create loader only for the first time\n                loader = np.load(filepath)\n                self._loaders[filepath] = loader\n            return loader[key]\n        elif filetype == \"npy\":\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.npy\",\n            #                \"filetype\": \"npy\"},\n            if not self.keep_all_data_on_mem:\n                return np.load(filepath)\n            if filepath not in self._loaders:\n                self._loaders[filepath] = np.load(filepath)\n            return self._loaders[filepath]\n        elif filetype in [\"mat\", \"vec\"]:\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.ark:123\",\n            #                \"filetype\": \"mat\"}]},\n            # In this case, \"123\" indicates the starting points of the matrix\n            # load_mat can load both matrix and vector\n            if self.block_load:\n                ark_name = parse_arkpath(filepath) \n\n                # remove empty ark\n                if ark_name in self._loaders:\n                    if self._loaders[ark_name] == {}:\n                        del self._loaders[ark_name]\n\n                # load the ark when requested\n                elif not ark_name in self._loaders:\n                    ark_dict = load_ark_full(ark_name)\n                    self._loaders[ark_name] = ark_dict\n                \n                # use deep copy as the memory will be released sooner\n                try:\n                    data_copy = 
copy.deepcopy(self._loaders[ark_name][uttid])\n                    del self._loaders[ark_name][uttid]\n                \n                except KeyError:\n                    # During batching, the last minibatch may contain repeated\n                    # utterances. Their features have already been deleted from\n                    # the buffer, so fall back to loading them from disk.\n                    print(f\"Warning: {filepath} is loaded from disk directly\", flush=True)\n                    data = kaldiio.load_mat(filepath)\n                    data_copy = copy.deepcopy(data)\n                return data_copy\n\n            if not self.keep_all_data_on_mem:\n                return kaldiio.load_mat(filepath)\n            if filepath not in self._loaders:\n                self._loaders[filepath] = kaldiio.load_mat(filepath)\n            return self._loaders[filepath]\n        elif filetype == \"scp\":\n            # e.g.\n            #    {\"input\": [{\"feat\": \"some/path.scp:F01_050C0101_PED_REAL\",\n            #                \"filetype\": \"scp\",\n            filepath, key = filepath.split(\":\", 1)\n            loader = self._loaders.get(filepath)\n            if loader is None:\n                # To avoid disk access, create loader only for the first time\n                loader = kaldiio.load_scp(filepath)\n                self._loaders[filepath] = loader\n            return loader[key]\n        else:\n            raise NotImplementedError(\"Not supported: loader_type={}\".format(filetype))\n\n\nclass SoundHDF5File(object):\n    \"\"\"Collect sound files into an HDF5 file\n\n    >>> f = SoundHDF5File('a.flac.h5', mode='a')\n    >>> array = np.random.randint(0, 100, 100, dtype=np.int16)\n    >>> f['id'] = (array, 16000)\n    >>> array, rate = f['id']\n\n\n    :param: str filepath:\n    :param: str mode:\n    :param: str format: The type used when saving wav. 
flac, nist, htk, etc.\n    :param: str dtype:\n\n    \"\"\"\n\n    def __init__(self, filepath, mode=\"r+\", format=None, dtype=\"int16\", **kwargs):\n        self.filepath = filepath\n        self.mode = mode\n        self.dtype = dtype\n\n        self.file = h5py.File(filepath, mode, **kwargs)\n        if format is None:\n            # filepath = a.flac.h5 -> format = flac\n            second_ext = os.path.splitext(os.path.splitext(filepath)[0])[1]\n            format = second_ext[1:]\n            if format.upper() not in soundfile.available_formats():\n                # If not found, flac is selected\n                format = \"flac\"\n\n        # This format affects only saving\n        self.format = format\n\n    def __repr__(self):\n        return '<SoundHDF5 file \"{}\" (mode {}, format {}, type {})>'.format(\n            self.filepath, self.mode, self.format, self.dtype\n        )\n\n    def create_dataset(self, name, shape=None, data=None, **kwds):\n        f = io.BytesIO()\n        array, rate = data\n        soundfile.write(f, array, rate, format=self.format)\n        self.file.create_dataset(name, shape=shape, data=np.void(f.getvalue()), **kwds)\n\n    def __setitem__(self, name, data):\n        self.create_dataset(name, data=data)\n\n    def __getitem__(self, key):\n        data = self.file[key][()]\n        f = io.BytesIO(data.tobytes())\n        array, rate = soundfile.read(f, dtype=self.dtype)\n        return array, rate\n\n    def keys(self):\n        return self.file.keys()\n\n    def values(self):\n        for k in self.file:\n            yield self[k]\n\n    def items(self):\n        for k in self.file:\n            yield k, self[k]\n\n    def __iter__(self):\n        return iter(self.file)\n\n    def __contains__(self, item):\n        return item in self.file\n\n    def __len__(self):\n        return len(self.file)\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, exc_type, exc_val, exc_tb):\n        
self.file.close()\n\n    def close(self):\n        self.file.close()\n\n# return the path filename\ndef parse_arkpath(path):\n    return path.strip().split(\":\")[0]\n\n# read the whole ark without lazy loading\ndef load_ark_full(path):\n    ret = {}\n    for k, v in kaldiio.load_ark(path):\n        ret[k] = v\n    return ret\n\n# worker id suffix\ndef wid_suffix():\n    return \"_\" + str(torch.utils.data.get_worker_info().id)\n\n# monitor the system memory usage\ndef memory_ratio():\n    mem = psutil.virtual_memory()\n    return float(mem.used) / float(mem.total)\n"
  },
  {
    "path": "utils/parse_decoding_process.py",
    "content": "import os\nimport numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef plot_decoding_logs(graph_dir, char_list, recog_args, uttid, nbest_hyps):\n    graph_subdir = os.path.join(graph_dir, uttid)\n    os.makedirs(graph_subdir, exist_ok=True)\n\n    for i, hyp in enumerate(nbest_hyps):\n        hyp_chr = \"\".join([char_list[int(x)] for x in hyp[\"yseq\"][1:]]).replace(\"<space>\", \" \")\n        print(f\"{i}-th hypothesis of {uttid}: {hyp_chr}\")\n\n        logs = hyp[\"logs\"]\n        step_logs = process_logs(logs, recog_args, accum=False)\n        tot_logs = process_logs(logs, recog_args, accum=True)\n\n        filename = f\"{uttid}-{i}-step.png\"\n        filename = os.path.join(graph_subdir, filename)\n        plot_dict(filename, step_logs, hyp_chr)\n\n        filename = f\"{uttid}-{i}-tot.png\"\n        filename = os.path.join(graph_subdir, filename)\n        plot_dict(filename, tot_logs, hyp_chr) \n           \ndef plot_dict(filename, d, title):\n    \n    plt.clf()\n    plt.cla() \n    lines = []\n    for k, v in d.items():\n        x = np.arange(len(v)) \n        line = plt.plot(x, v, label=k)\n        lines.append(line)\n    plt.legend()\n    plt.title(title)\n    plt.savefig(filename)\n\ndef process_logs(logs, args, accum=False):\n    ans = {}\n    \n    for k, v in logs.items():\n        \n        if accum:\n            v = [sum(v[:l+1]) for l in range(len(v))]\n        \n        v = np.array(v)\n        if k == \"att\":\n            v = v * (1 - args.ctc_weight)\n        elif k == \"ctc\":\n            v = v * args.ctc_weight\n        elif k == \"mmi\":\n            v = v * args.mmi_weight\n        elif k == \"lm\":\n            v = v * args.lm_weight\n\n        ans[k] = v\n    \n    tot = np.zeros_like(ans[\"att\"])\n    for k, v in ans.items():\n        tot += v\n    ans[\"sum\"] = tot\n\n    return ans\n        \n"
  },
  {
    "path": "utils/parse_npy.py",
    "content": "import numpy as np\nimport sys\n\ndef main():\n    npy_file = sys.argv[1]\n    symbol_table = sys.argv[2]\n    \n    print(f\"Parsing file {npy_file}\")\n\n    log_probs = np.load(npy_file)\n    probs = np.exp(log_probs) \n\n    syms = {}\n    for line in open(symbol_table):\n        ph, pid = line.split()\n        syms[int(pid)] = ph\n\n    max_probs = np.argmax(probs, axis=-1)\n    max_syms = [syms[x] for x in max_probs]\n    max_syms = \" \".join(max_syms)\n    print(max_syms)\n\nmain()\n"
  },
  {
    "path": "utils/print.py",
    "content": "import time\nimport torch.distributed as dist\n\n\ndef step_print(ctx, flush=False):\n    tmark = time.asctime(time.localtime(time.time()))\n    rank = dist.get_rank()\n    print(f\"{tmark} | rank: {rank} | {ctx}\", flush=flush)\n"
  },
  {
    "path": "utils/rtf_calculator.py",
    "content": "import time\n\nclass RTF_calculator():\n    def __init__(self, js, fps=100):\n        self.js = js\n        self.fps = fps\n        self.time_stamp = None\n    \n    def tik(self):\n        self.time_stamp = time.time()\n\n    def tok(self):\n        time_elapsed = time.time() - self.time_stamp\n        time_utts = sum(\n                    v[\"input\"][0][\"shape\"][0] for v in self.js.values()\n                    )\n        time_utts /= self.fps \n\n        rtf =  time_elapsed / time_utts\n        print(\"RTF calculator: RTF is {:.2f} | time_utts: {:2f} | time_elapsed: {:.2f}\".format(rtf, time_utts, time_elapsed))\n"
  },
  {
    "path": "utils/sampler.py",
    "content": "import torch\nimport torch.utils.data as data\nimport random\n\n# We cannot make the data loading totally random due to the slow ceph \n# So we use this sampler to ensure that the data reading will be contrained\n# in limited number of arks\nclass BufferSampler(object):\n    def __init__(self, length, utts_per_ark, batch_size, buf_size, seed=0, prefetch_ratio=0.3):\n        \"\"\"\n        length: number of minibatches\n        utts_per_ark: the number of utterances in each ark except the last one\n        batch_size: the batch size used in training\n        buf_size: the number of arks that you want to put in the buffer\n        prefetch_ratio: when the remained number of minibatches is below this ratio, \n                        we start to featch the arks in next group\n                        0.5 means we begin to read the next group of arks when half of\n                        this group is consumed\n        \"\"\"\n    \n        self.batch_per_ark = int(utts_per_ark / batch_size)\n        self.buf_size = buf_size\n        self.prefetch_ratio = prefetch_ratio\n        self.num_batches = length\n        self.seed = seed\n        \n        # seed2 is a bias on seed. 
It never works independently\n        # it is different on different GPU ranks\n        try:\n            import torch.distributed as dist\n            self.seed2 = dist.get_rank()\n        except Exception:\n            print(\"Sampler: you are not using DDP training paradigm.\")\n            print(\"Sampler: So the rank bias of random seed is set to 0\", flush=True)\n            self.seed2 = 0\n\n    def __iter__(self):\n        self.reset()\n        print(\"A new iterator in sampler is built\")\n        # sanity check: the indices should cover 0, ..., length - 1\n        # (only their sum is verified here)\n        assert sum(self.indices) == self.num_batches * (self.num_batches - 1) / 2\n        return iter(self.indices)\n \n    def __len__(self):\n        return self.num_batches\n    \n    \"\"\"\n    This is the core function of this sampler\n    The output indices have the following properties:\n    (1) All arks are divided into several groups. Each group consists\n        of at most `buf_size` arks\n    (2) The indices are from the same group until all data in this group\n        is consumed. This is to avoid buffering too many arks.\n    (3) For DDP training, the grouping results are identical. This is to \n        ensure that the length distribution in this group is similar \n        across the different ranks. This is controlled by self.seed.\n    (4) Within the group, the order of indices cannot be identical \n        across the ranks, or the global mini-batch will be identical\n        in each epoch. In this case, we ensure that for any valid\n        t, the t-th minibatches in this group across the different \n        ranks are from the same ark-id but are not necessarily the same.\n        This provides more variation in training data. 
This is controlled\n        by `self.seed2`\n    \"\"\"\n    def _get_indices(self):\n        num_arks = int(self.num_batches // self.batch_per_ark) + \\\n                     int(self.num_batches % self.batch_per_ark != 0)\n\n        # group arks\n        ark_ids = list(range(num_arks))\n        random.shuffle(ark_ids)\n        start = 0\n        groups = []\n        while start < num_arks:\n            end = min(start + self.buf_size, num_arks)\n            group = ark_ids[start: end]\n            groups.append(group)\n            start += self.buf_size\n \n        def process_group(group, seed_bias):\n            eg_indices = [] # global idx of the mini-batches\n            ark_indices = [] # ark idx of the mini-batches\n            for i, arkid in enumerate(group):\n                start = arkid * self.batch_per_ark\n                end = min((arkid+1) * self.batch_per_ark, self.num_batches)\n \n                eg_indice = list(range(start, end))\n                eg_indices.append(eg_indice)\n \n                ark_indice = [i] * (end - start)\n                ark_indices.append(ark_indice)\n\n            ark_indices = self._splice_list(ark_indices)\n\n            # the ark_indices is with self.seed\n            # as we need it identical on different GPU ranks\n            random.shuffle(ark_indices)\n\n            # eg_indices is with self.seed + self.seed2\n            # we need it different on different GPU ranks \n            random.seed(self.seed + self.seed2)\n            for e in eg_indices:\n                random.shuffle(e)\n           \n            # we need recover the seed so the next time\n            # we shuffle ark_indices will still have\n            # the same results across the GPUs.\n            # we do not use `self.seed` only as it \n            # always return to the same start point\n            random.seed(self.seed + seed_bias + 888) \n            \n            # combine finally\n            group_indice = []\n            for i in 
ark_indices:\n                batch_idx = eg_indices[i].pop()\n                group_indice.append(batch_idx)\n            return group_indice\n\n        group_indices = [process_group(g, b) for b, g in enumerate(groups)]\n        return self._splice_list(group_indices)\n   \n    # Using these indices leads to identical global batches in \n    # each epoch \n    def _get_indices_deprecated(self):\n        num_arks = int(self.num_batches // self.batch_per_ark) + \\\n                     int(self.num_batches % self.batch_per_ark != 0)\n\n        ark_ids = list(range(num_arks))\n        random.shuffle(ark_ids)\n        ark_indices = [(idx * self.batch_per_ark, \n                        min((idx+1) * self.batch_per_ark, self.num_batches))\n                        for idx in ark_ids]\n        ark_indices = [list(range(*idx)) for idx in ark_indices]\n\n        # grouping ark indices and shuffle within the group\n        start = 0 \n        group_indices = []\n        while start < num_arks:\n            end = min(start + self.buf_size, num_arks)\n            \n            group_indice = ark_indices[start: end]\n            group_indice = self._splice_list(group_indice)\n            random.shuffle(group_indice)\n            group_indices.append(group_indice)\n            start += self.buf_size\n\n        group_indices = self._splice_list(group_indices)\n        \n        return group_indices\n\n\n    def reset(self, seed=None):\n        # change the seed and reset the indices\n        # It is important to use the seed in DDP training\n        # as the result of the sampler is identical on each GPU.\n        # Since the index of a minibatch is proportional to the\n        # length of its utterances, this will help us to balance \n        # the load of each GPU \n        seed = seed if seed is not None else self.seed + 1\n        self.seed = seed\n        random.seed(seed)\n        self.indices = self._get_indices()\n\n    def _splice_list(self, lsts):\n        out = []\n        for l 
in lsts:\n            out += l\n        return out\n\n    # this provides the prefetch factor of dataloader\n    # no matter how many mini-batches are preloaded, all\n    # arks in the next group will be loaded. So a small\n    # ratio is enough and will save memory\n    # just make sure 0.3 group of data will not run out\n    # before the next group is loaded\n    def get_prefetch_factor(self):\n        return int(self.buf_size * self.batch_per_ark * self.prefetch_ratio)\n\nclass testdataset:\n    def __init__(self, length):\n        self.l = length\n\n    def __len__(self):\n        return self.l\n\nif __name__ == '__main__':\n    # 26 batches (52 utts), 2 batches in each ark (4 utts), max 3 arks in buf, batch_size = 2\n    num_minibatches = 26\n    sampler = BufferSampler(num_minibatches, utts_per_ark=4, batch_size=2, buf_size=3)\n    out = \"\"\n    for i in iter(sampler):\n        out += f\"{i}\\t\"\n    print(out)\n\n"
  },
  {
    "path": "utils/spec_augment.py",
    "content": "# -*- coding: utf-8 -*-\n\n\"\"\"\nThis implementation is modified from https://github.com/zcaceres/spec_augment\n\nMIT License\n\nCopyright (c) 2019 Zach Caceres\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETjjHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n\"\"\"\n\nimport random\n\nimport torch\n\n\ndef specaug(\n    spec, W=5, F=30, T=40, num_freq_masks=2, num_time_masks=2, replace_with_zero=False\n):\n    \"\"\"SpecAugment\n\n    Reference:\n        SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition\n        (https://arxiv.org/pdf/1904.08779.pdf)\n\n    This implementation modified from https://github.com/zcaceres/spec_augment\n\n    :param torch.Tensor spec: input tensor with the shape (T, dim)\n    :param int W: time warp parameter\n    :param int F: maximum width of each freq mask\n    :param int T: maximum width of each time mask\n    :param int num_freq_masks: number of frequency masks\n    :param int num_time_masks: number of time masks\n    :param bool replace_with_zero: if 
True, masked parts will be filled with 0,\n        if False, filled with mean\n    \"\"\"\n    return time_mask(\n        freq_mask(\n            time_warp(spec, W=W),\n            F=F,\n            num_masks=num_freq_masks,\n            replace_with_zero=replace_with_zero,\n        ),\n        T=T,\n        num_masks=num_time_masks,\n        replace_with_zero=replace_with_zero,\n    )\n\n\ndef time_warp(spec, W=5):\n    \"\"\"Time warping\n\n    :param torch.Tensor spec: input tensor with shape (T, dim)\n    :param int W: time warp parameter\n    \"\"\"\n    spec = spec.unsqueeze(0)\n    spec_len = spec.shape[1]\n    num_rows = spec.shape[2]\n    device = spec.device\n\n    y = num_rows // 2\n    horizontal_line_at_ctr = spec[0, :, y]\n    assert len(horizontal_line_at_ctr) == spec_len\n\n    point_to_warp = horizontal_line_at_ctr[random.randrange(W, spec_len - W)]\n    assert isinstance(point_to_warp, torch.Tensor)\n\n    # Uniform distribution from (0,W) with chance to be up to W negative\n    dist_to_warp = random.randrange(-W, W)\n    src_pts, dest_pts = (\n        torch.tensor([[[point_to_warp, y]]], device=device),\n        torch.tensor([[[point_to_warp + dist_to_warp, y]]], device=device),\n    )\n    warped_spectro, dense_flows = sparse_image_warp(spec, src_pts, dest_pts)\n    return warped_spectro.squeeze(3).squeeze(0)\n\n\ndef freq_mask(spec, F=30, num_masks=1, replace_with_zero=False):\n    \"\"\"Frequency masking\n\n    :param torch.Tensor spec: input tensor with shape (T, dim)\n    :param int F: maximum width of each mask\n    :param int num_masks: number of masks\n    :param bool replace_with_zero: if True, masked parts will be filled with 0,\n        if False, filled with mean\n    \"\"\"\n    cloned = spec.unsqueeze(0).clone()\n    num_mel_channels = cloned.shape[2]\n\n    for i in range(0, num_masks):\n        f = random.randrange(0, F)\n        f_zero = random.randrange(0, num_mel_channels - f)\n\n        # avoids randrange error if values are 
equal and range is empty\n        if f_zero == f_zero + f:\n            return cloned.squeeze(0)\n\n        mask_end = random.randrange(f_zero, f_zero + f)\n        if replace_with_zero:\n            cloned[0][:, f_zero:mask_end] = 0\n        else:\n            cloned[0][:, f_zero:mask_end] = cloned.mean()\n    return cloned.squeeze(0)\n\n\ndef time_mask(spec, T=40, num_masks=1, replace_with_zero=False):\n    \"\"\"Time masking\n\n    :param torch.Tensor spec: input tensor with shape (T, dim)\n    :param int T: maximum width of each mask\n    :param int num_masks: number of masks\n    :param bool replace_with_zero: if True, masked parts will be filled with 0,\n        if False, filled with mean\n    \"\"\"\n    cloned = spec.unsqueeze(0).clone()\n    len_spectro = cloned.shape[1]\n\n    for i in range(0, num_masks):\n        t = random.randrange(0, T)\n        t_zero = random.randrange(0, len_spectro - t)\n\n        # avoids randrange error if values are equal and range is empty\n        if t_zero == t_zero + t:\n            return cloned.squeeze(0)\n\n        mask_end = random.randrange(t_zero, t_zero + t)\n        if replace_with_zero:\n            cloned[0][t_zero:mask_end, :] = 0\n        else:\n            cloned[0][t_zero:mask_end, :] = cloned.mean()\n    return cloned.squeeze(0)\n\n\ndef sparse_image_warp(\n    img_tensor,\n    source_control_point_locations,\n    dest_control_point_locations,\n    interpolation_order=2,\n    regularization_weight=0.0,\n    num_boundaries_points=0,\n):\n    device = img_tensor.device\n    control_point_flows = dest_control_point_locations - source_control_point_locations\n\n    batch_size, image_height, image_width = img_tensor.shape\n    flattened_grid_locations = get_flat_grid_locations(\n        image_height, image_width, device\n    )\n\n    flattened_flows = interpolate_spline(\n        dest_control_point_locations,\n        control_point_flows,\n        flattened_grid_locations,\n        interpolation_order,\n        
regularization_weight,\n    )\n\n    dense_flows = create_dense_flows(\n        flattened_flows, batch_size, image_height, image_width\n    )\n\n    warped_image = dense_image_warp(img_tensor, dense_flows)\n\n    return warped_image, dense_flows\n\n\ndef get_grid_locations(image_height, image_width, device):\n    y_range = torch.linspace(0, image_height - 1, image_height, device=device)\n    x_range = torch.linspace(0, image_width - 1, image_width, device=device)\n    y_grid, x_grid = torch.meshgrid(y_range, x_range)\n    return torch.stack((y_grid, x_grid), -1)\n\n\ndef flatten_grid_locations(grid_locations, image_height, image_width):\n    return torch.reshape(grid_locations, [image_height * image_width, 2])\n\n\ndef get_flat_grid_locations(image_height, image_width, device):\n    y_range = torch.linspace(0, image_height - 1, image_height, device=device)\n    x_range = torch.linspace(0, image_width - 1, image_width, device=device)\n    y_grid, x_grid = torch.meshgrid(y_range, x_range)\n    return torch.stack((y_grid, x_grid), -1).reshape([image_height * image_width, 2])\n\n\ndef create_dense_flows(flattened_flows, batch_size, image_height, image_width):\n    # possibly .view\n    return torch.reshape(flattened_flows, [batch_size, image_height, image_width, 2])\n\n\ndef interpolate_spline(\n    train_points,\n    train_values,\n    query_points,\n    order,\n    regularization_weight=0.0,\n):\n    # First, fit the spline to the observed data.\n    w, v = solve_interpolation(train_points, train_values, order, regularization_weight)\n    # Then, evaluate the spline at the query locations.\n    query_values = apply_interpolation(query_points, train_points, w, v, order)\n\n    return query_values\n\n\ndef solve_interpolation(train_points, train_values, order, regularization_weight):\n    device = train_points.device\n    b, n, d = train_points.shape\n    k = train_values.shape[-1]\n\n    c = train_points\n    f = train_values.float()\n\n    matrix_a = 
phi(cross_squared_distance_matrix(c, c), order).unsqueeze(0)  # [b, n, n]\n\n    # Append ones to the feature values for the bias term in the linear model.\n    ones = torch.ones(1, dtype=train_points.dtype, device=device).view([-1, 1, 1])\n    matrix_b = torch.cat((c, ones), 2).float()  # [b, n, d + 1]\n\n    # [b, n + d + 1, n]\n    left_block = torch.cat((matrix_a, torch.transpose(matrix_b, 2, 1)), 1)\n\n    num_b_cols = matrix_b.shape[2]  # d + 1\n\n    # In Tensorflow, zeros are used here. Pytorch solve fails with zeros\n    # for some reason we don't understand.\n    # So instead we use very tiny randn values (variance of one, zero mean)\n    # on one side of our multiplication.\n    lhs_zeros = torch.randn((b, num_b_cols, num_b_cols), device=device) / 1e10\n    right_block = torch.cat((matrix_b, lhs_zeros), 1)  # [b, n + d + 1, d + 1]\n    lhs = torch.cat((left_block, right_block), 2)  # [b, n + d + 1, n + d + 1]\n\n    rhs_zeros = torch.zeros(\n        (b, d + 1, k), dtype=train_points.dtype, device=device\n    ).float()\n    rhs = torch.cat((f, rhs_zeros), 1)  # [b, n + d + 1, k]\n\n    # Then, solve the linear system and unpack the results.\n    X, LU = torch.gesv(rhs, lhs)\n    w = X[:, :n, :]\n    v = X[:, n:, :]\n\n    return w, v\n\n\ndef cross_squared_distance_matrix(x, y):\n    \"\"\"Pairwise squared distance between two (batch) matrices' rows (2nd dim).\n\n    Computes the pairwise distances between rows of x and rows of y\n    Args:\n    x: [batch_size, n, d] float `Tensor`\n    y: [batch_size, m, d] float `Tensor`\n    Returns:\n    squared_dists: [batch_size, n, m] float `Tensor`, where\n    squared_dists[b,i,j] = ||x[b,i,:] - y[b,j,:]||^2\n    \"\"\"\n    x_norm_squared = torch.sum(torch.mul(x, x))\n    y_norm_squared = torch.sum(torch.mul(y, y))\n\n    x_y_transpose = torch.matmul(x.squeeze(0), y.squeeze(0).transpose(0, 1))\n\n    # squared_dists[b,i,j] = ||x_bi - y_bj||^2 = x_bi'x_bi- 2x_bi'x_bj + x_bj'x_bj\n    squared_dists = x_norm_squared 
- 2 * x_y_transpose + y_norm_squared\n\n    return squared_dists.float()\n\n\ndef phi(r, order):\n    \"\"\"Coordinate-wise nonlinearity used to define the order of the interpolation.\n\n    See https://en.wikipedia.org/wiki/Polyharmonic_spline for the definition.\n    Args:\n    r: input op\n    order: interpolation order\n    Returns:\n    phi_k evaluated coordinate-wise on r, for k = r\n    \"\"\"\n    EPSILON = torch.tensor(1e-10, device=r.device)\n    # using EPSILON prevents log(0), sqrt0), etc.\n    # sqrt(0) is well-defined, but its gradient is not\n    if order == 1:\n        r = torch.max(r, EPSILON)\n        r = torch.sqrt(r)\n        return r\n    elif order == 2:\n        return 0.5 * r * torch.log(torch.max(r, EPSILON))\n    elif order == 4:\n        return 0.5 * torch.square(r) * torch.log(torch.max(r, EPSILON))\n    elif order % 2 == 0:\n        r = torch.max(r, EPSILON)\n        return 0.5 * torch.pow(r, 0.5 * order) * torch.log(r)\n    else:\n        r = torch.max(r, EPSILON)\n        return torch.pow(r, 0.5 * order)\n\n\ndef apply_interpolation(query_points, train_points, w, v, order):\n    \"\"\"Apply polyharmonic interpolation model to data.\n\n    Notes:\n        Given coefficients w and v for the interpolation model, we evaluate\n        interpolated function values at query_points.\n\n    Args:\n        query_points: `[b, m, d]` x values to evaluate the interpolation at\n        train_points: `[b, n, d]` x values that act as the interpolation centers\n            ( the c variables in the wikipedia article)\n            w: `[b, n, k]` weights on each interpolation center\n            v: `[b, d, k]` weights on each input dimension\n        order: order of the interpolation\n\n    Returns:\n        Polyharmonic interpolation evaluated at points defined in query_points.\n    \"\"\"\n    query_points = query_points.unsqueeze(0)\n    # First, compute the contribution from the rbf term.\n    pairwise_dists = cross_squared_distance_matrix(\n        
query_points.float(), train_points.float()\n    )\n    phi_pairwise_dists = phi(pairwise_dists, order)\n\n    rbf_term = torch.matmul(phi_pairwise_dists, w)\n\n    # Then, compute the contribution from the linear term.\n    # Pad query_points with ones, for the bias term in the linear model.\n    ones = torch.ones_like(query_points[..., :1])\n    query_points_pad = torch.cat((query_points, ones), 2).float()\n    linear_term = torch.matmul(query_points_pad, v)\n\n    return rbf_term + linear_term\n\n\ndef dense_image_warp(image, flow):\n    \"\"\"Image warping using per-pixel flow vectors.\n\n    Apply a non-linear warp to the image, where the warp is specified by a dense\n    flow field of offset vectors that define the correspondences of pixel values\n    in the output image back to locations in the  source image. Specifically, the\n    pixel value at output[b, j, i, c] is\n    images[b, j - flow[b, j, i, 0], i - flow[b, j, i, 1], c].\n    The locations specified by this formula do not necessarily map to an int\n    index. Therefore, the pixel value is obtained by bilinear\n    interpolation of the 4 nearest pixels around\n    (b, j - flow[b, j, i, 0], i - flow[b, j, i, 1]). 
For locations outside\n    of the image, we use the nearest pixel values at the image boundary.\n    Args:\n    image: 4-D float `Tensor` with shape `[batch, height, width, channels]`.\n    flow: A 4-D float `Tensor` with shape `[batch, height, width, 2]`.\n    name: A name for the operation (optional).\n    Note that image and flow can be of type tf.half, tf.float32, or tf.float64,\n    and do not necessarily have to be the same type.\n    Returns:\n    A 4-D float `Tensor` with shape`[batch, height, width, channels]`\n    and same type as input image.\n    Raises:\n    ValueError: if height < 2 or width < 2 or the inputs have the wrong number\n    of dimensions.\n    \"\"\"\n    image = image.unsqueeze(3)  # add a single channel dimension to image tensor\n    batch_size, height, width, channels = image.shape\n    device = image.device\n\n    # The flow is defined on the image grid. Turn the flow into a list of query\n    # points in the grid space.\n    grid_x, grid_y = torch.meshgrid(\n        torch.arange(width, device=device), torch.arange(height, device=device)\n    )\n\n    stacked_grid = torch.stack((grid_y, grid_x), dim=2).float()\n\n    batched_grid = stacked_grid.unsqueeze(-1).permute(3, 1, 0, 2)\n\n    query_points_on_grid = batched_grid - flow\n    query_points_flattened = torch.reshape(\n        query_points_on_grid, [batch_size, height * width, 2]\n    )\n    # Compute values at the query points, then reshape the result back to the\n    # image grid.\n    interpolated = interpolate_bilinear(image, query_points_flattened)\n    interpolated = torch.reshape(interpolated, [batch_size, height, width, channels])\n    return interpolated\n\n\ndef interpolate_bilinear(\n    grid, query_points, name=\"interpolate_bilinear\", indexing=\"ij\"\n):\n    \"\"\"Similar to Matlab's interp2 function.\n\n    Notes:\n        Finds values for query points on a grid using bilinear interpolation.\n\n    Args:\n        grid: a 4-D float `Tensor` of shape `[batch, height, 
width, channels]`.\n        query_points: a 3-D float `Tensor` of N points with shape `[batch, N, 2]`.\n        name: a name for the operation (optional).\n        indexing: whether the query points are specified as row and column (ij),\n            or Cartesian coordinates (xy).\n\n    Returns:\n        values: a 3-D `Tensor` with shape `[batch, N, channels]`\n\n    Raises:\n        ValueError: if the indexing mode is invalid, or if the shape of the inputs\n        invalid.\n    \"\"\"\n    if indexing != \"ij\" and indexing != \"xy\":\n        raise ValueError(\"Indexing mode must be 'ij' or 'xy'\")\n\n    shape = grid.shape\n    if len(shape) != 4:\n        msg = \"Grid must be 4 dimensional. Received size: \"\n        raise ValueError(msg + str(grid.shape))\n\n    batch_size, height, width, channels = grid.shape\n\n    shape = [batch_size, height, width, channels]\n    query_type = query_points.dtype\n    grid_type = grid.dtype\n    grid_device = grid.device\n\n    num_queries = query_points.shape[1]\n\n    alphas = []\n    floors = []\n    ceils = []\n    index_order = [0, 1] if indexing == \"ij\" else [1, 0]\n    unstacked_query_points = query_points.unbind(2)\n\n    for dim in index_order:\n        queries = unstacked_query_points[dim]\n\n        size_in_indexing_dimension = shape[dim + 1]\n\n        # max_floor is size_in_indexing_dimension - 2 so that max_floor + 1\n        # is still a valid index into the grid.\n        max_floor = torch.tensor(\n            size_in_indexing_dimension - 2, dtype=query_type, device=grid_device\n        )\n        min_floor = torch.tensor(0.0, dtype=query_type, device=grid_device)\n        maxx = torch.max(min_floor, torch.floor(queries))\n        floor = torch.min(maxx, max_floor)\n        int_floor = floor.long()\n        floors.append(int_floor)\n        ceil = int_floor + 1\n        ceils.append(ceil)\n\n        # alpha has the same type as the grid, as we will directly use alpha\n        # when taking linear 
combinations of pixel values from the image.\n\n        alpha = torch.tensor((queries - floor), dtype=grid_type, device=grid_device)\n        min_alpha = torch.tensor(0.0, dtype=grid_type, device=grid_device)\n        max_alpha = torch.tensor(1.0, dtype=grid_type, device=grid_device)\n        alpha = torch.min(torch.max(min_alpha, alpha), max_alpha)\n\n        # Expand alpha to [b, n, 1] so we can use broadcasting\n        # (since the alpha values don't depend on the channel).\n        alpha = torch.unsqueeze(alpha, 2)\n        alphas.append(alpha)\n\n    flattened_grid = torch.reshape(grid, [batch_size * height * width, channels])\n    batch_offsets = torch.reshape(\n        torch.arange(batch_size, device=grid_device) * height * width, [batch_size, 1]\n    )\n\n    # This wraps array_ops.gather. We reshape the image data such that the\n    # batch, y, and x coordinates are pulled into the first dimension.\n    # Then we gather. Finally, we reshape the output back. It's possible this\n    # code would be made simpler by using array_ops.gather_nd.\n    def gather(y_coords, x_coords, name):\n        linear_coordinates = batch_offsets + y_coords * width + x_coords\n        gathered_values = torch.gather(flattened_grid.t(), 1, linear_coordinates)\n        return torch.reshape(gathered_values, [batch_size, num_queries, channels])\n\n    # grab the pixel values in the 4 corners around each query point\n    top_left = gather(floors[0], floors[1], \"top_left\")\n    top_right = gather(floors[0], ceils[1], \"top_right\")\n    bottom_left = gather(ceils[0], floors[1], \"bottom_left\")\n    bottom_right = gather(ceils[0], ceils[1], \"bottom_right\")\n\n    interp_top = alphas[1] * (top_right - top_left) + top_left\n    interp_bottom = alphas[1] * (bottom_right - bottom_left) + bottom_left\n    interp = alphas[0] * (interp_bottom - interp_top) + interp_top\n\n    return interp\n"
  },
  {
    "path": "utils/training/__init__.py",
    "content": "\"\"\"Initialize sub package.\"\"\"\n"
  },
  {
    "path": "utils/training/batchfy.py",
    "content": "import itertools\nimport logging\nimport numpy as np\nimport random\n\ndef batchfy_by_seq(\n    sorted_data,\n    batch_size,\n    max_length_in,\n    max_length_out,\n    min_batch_size=1,\n    shortest_first=False,\n    ikey=\"input\",\n    iaxis=0,\n    okey=\"output\",\n    oaxis=0,\n):\n    \"\"\"Make batch set from json dictionary\n\n    :param Dict[str, Dict[str, Any]] sorted_data: dictionary loaded from data.json\n    :param int batch_size: batch size\n    :param int max_length_in: maximum length of input to decide adaptive batch size\n    :param int max_length_out: maximum length of output to decide adaptive batch size\n    :param int min_batch_size: mininum batch size (for multi-gpu)\n    :param bool shortest_first: Sort from batch with shortest samples\n        to longest if true, otherwise reverse\n    :param str ikey: key to access input\n        (for ASR ikey=\"input\", for TTS, MT ikey=\"output\".)\n    :param int iaxis: dimension to access input\n        (for ASR, TTS iaxis=0, for MT iaxis=\"1\".)\n    :param str okey: key to access output\n        (for ASR, MT okey=\"output\". 
for TTS okey=\"input\".)\n    :param int oaxis: dimension to access output\n        (for ASR, TTS, MT oaxis=0, reserved for future research, -1 means all axes.)\n    :return: List[List[Tuple[str, dict]]] list of batches\n    \"\"\"\n    if batch_size <= 0:\n        raise ValueError(f\"Invalid batch_size={batch_size}\")\n\n    # check #utts is more than min_batch_size\n    if len(sorted_data) < min_batch_size:\n        raise ValueError(\n            f\"#utts({len(sorted_data)}) is less than min_batch_size({min_batch_size}).\"\n        )\n\n    # make list of minibatches\n    minibatches = []\n    start = 0\n    while True:\n        _, info = sorted_data[start]\n        ilen = int(info[ikey][iaxis][\"shape\"][0])\n        olen = (\n            int(info[okey][oaxis][\"shape\"][0])\n            if oaxis >= 0\n            else max(map(lambda x: int(x[\"shape\"][0]), info[okey]))\n        )\n        factor = max(int(ilen / max_length_in), int(olen / max_length_out))\n        # change batchsize depending on the input and output length\n        # if ilen = 1000 and max_length_in = 800\n        # then b = batchsize / 2\n        # and max(min_batch_size, .) 
avoids batchsize = 0\n        bs = max(min_batch_size, int(batch_size / (1 + factor)))\n        end = min(len(sorted_data), start + bs)\n        minibatch = sorted_data[start:end]\n        if shortest_first:\n            minibatch.reverse()\n\n        # check each batch is more than minimum batchsize\n        # we repeat the data in this mini-batch,\n        # so they are from the same ark\n        if len(minibatch) < min_batch_size:\n            #mod = min_batch_size - len(minibatch) % min_batch_size\n            #additional_minibatch = [\n            #    sorted_data[i] for i in np.random.randint(0, start, mod)\n            #]\n            #if shortest_first:\n            #    additional_minibatch.reverse()\n            repeat_data = minibatch[0] if shortest_first else minibatch[-1]\n            repeat_data = [repeat_data for _ in range(min_batch_size - len(minibatch))]\n            minibatch = repeat_data + minibatch if shortest_first else\\\n                        minibatch + repeat_data\n            # minibatch.extend(additional_minibatch)\n        minibatches.append(minibatch)\n\n        if end == len(sorted_data):\n            break\n        start = end\n\n    # batch: List[List[Tuple[str, dict]]]\n    return minibatches\n\n\ndef batchfy_by_bin(\n    sorted_data,\n    batch_bins,\n    num_batches=0,\n    min_batch_size=1,\n    shortest_first=False,\n    ikey=\"input\",\n    okey=\"output\",\n):\n    \"\"\"Make variably sized batch set, which maximizes\n\n    the number of bins up to `batch_bins`.\n\n    :param Dict[str, Dict[str, Any]] sorted_data: dictionary loaded from data.json\n    :param int batch_bins: Maximum frames of a batch\n    :param int num_batches: # number of batches to use (for debug)\n    :param int min_batch_size: minimum batch size (for multi-gpu)\n    :param int test: Return only every `test` batches\n    :param bool shortest_first: Sort from batch with shortest samples\n        to longest if true, otherwise reverse\n\n    :param str 
ikey: key to access input (for ASR ikey=\"input\", for TTS ikey=\"output\".)\n    :param str okey: key to access output (for ASR okey=\"output\". for TTS okey=\"input\".)\n\n    :return: List[Tuple[str, Dict[str, List[Dict[str, Any]]]] list of batches\n    \"\"\"\n    if batch_bins <= 0:\n        raise ValueError(f\"invalid batch_bins={batch_bins}\")\n    length = len(sorted_data)\n    idim = int(sorted_data[0][1][ikey][0][\"shape\"][1])\n    odim = int(sorted_data[0][1][okey][0][\"shape\"][1])\n    logging.info(\"# utts: \" + str(len(sorted_data)))\n    minibatches = []\n    start = 0\n    n = 0\n    while True:\n        # Dynamic batch size depending on size of samples\n        b = 0\n        next_size = 0\n        max_olen = 0\n        while next_size < batch_bins and (start + b) < length:\n            ilen = int(sorted_data[start + b][1][ikey][0][\"shape\"][0]) * idim\n            olen = int(sorted_data[start + b][1][okey][0][\"shape\"][0]) * odim\n            if olen > max_olen:\n                max_olen = olen\n            next_size = (max_olen + ilen) * (b + 1)\n            if next_size <= batch_bins:\n                b += 1\n            elif next_size == 0:\n                raise ValueError(\n                    f\"Can't fit one sample in batch_bins ({batch_bins}): \"\n                    f\"Please increase the value\"\n                )\n        end = min(length, start + max(min_batch_size, b))\n        batch = sorted_data[start:end]\n        if shortest_first:\n            batch.reverse()\n        minibatches.append(batch)\n        # Check for min_batch_size and fixes the batches if needed\n        i = -1\n        while len(minibatches[i]) < min_batch_size:\n            missing = min_batch_size - len(minibatches[i])\n            if -i == len(minibatches):\n                minibatches[i + 1].extend(minibatches[i])\n                minibatches = minibatches[1:]\n                break\n            else:\n                minibatches[i].extend(minibatches[i - 
1][:missing])\n                minibatches[i - 1] = minibatches[i - 1][missing:]\n                i -= 1\n        if end == length:\n            break\n        start = end\n        n += 1\n    if num_batches > 0:\n        minibatches = minibatches[:num_batches]\n    lengths = [len(x) for x in minibatches]\n    logging.info(\n        str(len(minibatches))\n        + \" batches containing from \"\n        + str(min(lengths))\n        + \" to \"\n        + str(max(lengths))\n        + \" samples \"\n        + \"(avg \"\n        + str(int(np.mean(lengths)))\n        + \" samples).\"\n    )\n    return minibatches\n\n\ndef batchfy_by_frame(\n    sorted_data,\n    max_frames_in,\n    max_frames_out,\n    max_frames_inout,\n    num_batches=0,\n    min_batch_size=1,\n    shortest_first=False,\n    ikey=\"input\",\n    okey=\"output\",\n):\n    \"\"\"Make variable batch set, which maximizes the number of frames to max_batch_frame.\n\n    :param Dict[str, Dict[str, Any]] sorted_data: dictionary loaded from data.json\n    :param int max_frames_in: Maximum input frames of a batch\n    :param int max_frames_out: Maximum output frames of a batch\n    :param int max_frames_inout: Maximum input+output frames of a batch\n    :param int num_batches: number of batches to use (for debug)\n    :param int min_batch_size: minimum batch size (for multi-gpu)\n    :param bool shortest_first: Sort from batch with shortest samples\n        to longest if true, otherwise reverse\n\n    :param str ikey: key to access input (for ASR ikey=\"input\", for TTS ikey=\"output\".)\n    :param str okey: key to access output (for ASR okey=\"output\". 
for TTS okey=\"input\".)\n\n    :return: List[Tuple[str, Dict[str, List[Dict[str, Any]]]] list of batches\n    \"\"\"\n    if max_frames_in <= 0 and max_frames_out <= 0 and max_frames_inout <= 0:\n        raise ValueError(\n            \"At least one of `--batch-frames-in`, `--batch-frames-out` or \"\n            \"`--batch-frames-inout` should be > 0\"\n        )\n    length = len(sorted_data)\n    minibatches = []\n    start = 0\n    end = 0\n    while end != length:\n        # Dynamic batch size depending on size of samples\n        b = 0\n        max_olen = 0\n        max_ilen = 0\n        while (start + b) < length:\n            ilen = int(sorted_data[start + b][1][ikey][0][\"shape\"][0])\n            if ilen > max_frames_in and max_frames_in != 0:\n                raise ValueError(\n                    f\"Can't fit one sample in --batch-frames-in ({max_frames_in}): \"\n                    f\"Please increase the value\"\n                )\n            olen = int(sorted_data[start + b][1][okey][0][\"shape\"][0])\n            if olen > max_frames_out and max_frames_out != 0:\n                raise ValueError(\n                    f\"Can't fit one sample in --batch-frames-out ({max_frames_out}): \"\n                    f\"Please increase the value\"\n                )\n            if ilen + olen > max_frames_inout and max_frames_inout != 0:\n                raise ValueError(\n                    \"Can't fit one sample in --batch-frames-inout \"\n                    f\"({max_frames_inout}): Please increase the value\"\n                )\n            max_olen = max(max_olen, olen)\n            max_ilen = max(max_ilen, ilen)\n            in_ok = max_ilen * (b + 1) <= max_frames_in or max_frames_in == 0\n            out_ok = max_olen * (b + 1) <= max_frames_out or max_frames_out == 0\n            inout_ok = (max_ilen + max_olen) * (\n                b + 1\n            ) <= max_frames_inout or max_frames_inout == 0\n            if in_ok and out_ok and inout_ok:\n      
          # add more seq in the minibatch\n                b += 1\n            else:\n                # no more seq in the minibatch\n                break\n        end = min(length, start + b)\n        batch = sorted_data[start:end]\n        if shortest_first:\n            batch.reverse()\n        minibatches.append(batch)\n        # Check for min_batch_size and fixes the batches if needed\n        i = -1\n        while len(minibatches[i]) < min_batch_size:\n            missing = min_batch_size - len(minibatches[i])\n            if -i == len(minibatches):\n                minibatches[i + 1].extend(minibatches[i])\n                minibatches = minibatches[1:]\n                break\n            else:\n                minibatches[i].extend(minibatches[i - 1][:missing])\n                minibatches[i - 1] = minibatches[i - 1][missing:]\n                i -= 1\n        start = end\n    if num_batches > 0:\n        minibatches = minibatches[:num_batches]\n    lengths = [len(x) for x in minibatches]\n    logging.info(\n        str(len(minibatches))\n        + \" batches containing from \"\n        + str(min(lengths))\n        + \" to \"\n        + str(max(lengths))\n        + \" samples \"\n        + \"(avg \"\n        + str(int(np.mean(lengths)))\n        + \" samples).\"\n    )\n\n    return minibatches\n\n\ndef batchfy_shuffle(data, batch_size, min_batch_size, num_batches, shortest_first):\n    import random\n\n    logging.info(\"use shuffled batch.\")\n    # random.sample requires a sequence (dict views raise TypeError on Python >= 3.11)\n    sorted_data = random.sample(list(data.items()), len(data))\n    logging.info(\"# utts: \" + str(len(sorted_data)))\n    # make list of minibatches\n    minibatches = []\n    start = 0\n    while True:\n        end = min(len(sorted_data), start + batch_size)\n        # check each batch is more than minimum batchsize\n        minibatch = sorted_data[start:end]\n        if shortest_first:\n            minibatch.reverse()\n        if len(minibatch) < min_batch_size:\n            mod = min_batch_size - len(minibatch) 
% min_batch_size\n            additional_minibatch = [\n                sorted_data[i] for i in np.random.randint(0, start, mod)\n            ]\n            if shortest_first:\n                additional_minibatch.reverse()\n            minibatch.extend(additional_minibatch)\n        minibatches.append(minibatch)\n        if end == len(sorted_data):\n            break\n        start = end\n\n    # for debugging\n    if num_batches > 0:\n        minibatches = minibatches[:num_batches]\n        logging.info(\"# minibatches: \" + str(len(minibatches)))\n    return minibatches\n\n\nBATCH_COUNT_CHOICES = [\"auto\", \"seq\", \"bin\", \"frame\"]\nBATCH_SORT_KEY_CHOICES = [\"input\", \"output\", \"shuffle\"]\n\n\ndef make_batchset(\n    data,\n    batch_size=0,\n    max_length_in=float(\"inf\"),\n    max_length_out=float(\"inf\"),\n    num_batches=0,\n    min_batch_size=1,\n    shortest_first=False,\n    batch_sort_key=\"input\",\n    swap_io=False,\n    mt=False,\n    no_sort=False,\n    count=\"auto\",\n    batch_bins=0,\n    batch_frames_in=0,\n    batch_frames_out=0,\n    batch_frames_inout=0,\n    iaxis=0,\n    oaxis=0,\n):\n    \"\"\"Make batch set from json dictionary\n\n    if utts have \"category\" value,\n\n        >>> data = {'utt1': {'category': 'A', 'input': ...},\n        ...         'utt2': {'category': 'B', 'input': ...},\n        ...         'utt3': {'category': 'B', 'input': ...},\n        ...         
'utt4': {'category': 'A', 'input': ...}}\n        >>> make_batchset(data, batch_size=2, ...)\n        [[('utt1', ...), ('utt4', ...)], [('utt2', ...), ('utt3', ...)]]\n\n    Note that utts without a \"category\" value are grouped into a single category,\n    so data with no categories at all is batched exactly as batchfy_by_{count}\n\n    :param Dict[str, Dict[str, Any]] data: dictionary loaded from data.json\n    :param int batch_size: maximum number of sequences in a minibatch.\n    :param int batch_bins: maximum number of bins (frames x dim) in a minibatch.\n    :param int batch_frames_in:  maximum number of input frames in a minibatch.\n    :param int batch_frames_out: maximum number of output frames in a minibatch.\n    :param int batch_frames_inout: maximum number of input+output frames in a minibatch.\n    :param str count: strategy to count maximum size of batch.\n        For choices, see espnet.utils.training.batchfy.BATCH_COUNT_CHOICES\n\n    :param int max_length_in: maximum length of input to decide adaptive batch size\n    :param int max_length_out: maximum length of output to decide adaptive batch size\n    :param int num_batches: number of batches to use (for debug)\n    :param int min_batch_size: minimum batch size (for multi-gpu)\n    :param bool shortest_first: Sort from batch with shortest samples\n        to longest if true, otherwise reverse\n    :param str batch_sort_key: how to sort data before creating minibatches\n        [\"input\", \"output\", \"shuffle\"]\n    :param bool swap_io: if True, use \"input\" as output and \"output\"\n        as input in `data` dict\n    :param bool mt: if True, use 0-axis of \"output\" as output and 1-axis of \"output\"\n        as input in `data` dict\n    :param bool no_sort: if True, keep the order of `data` instead of sorting it\n    :param int iaxis: dimension to access input\n        (for ASR, TTS iaxis=0, for MT iaxis=1.)\n    :param int oaxis: dimension to access output (for ASR, TTS, MT oaxis=0,\n        reserved for future research, -1 means all axes.)\n    :return: List[List[Tuple[str, dict]]] list of batches\n    \"\"\"\n\n    # check args\n    if 
count not in BATCH_COUNT_CHOICES:\n        raise ValueError(\n            f\"arg 'count' ({count}) should be one of {BATCH_COUNT_CHOICES}\"\n        )\n    if batch_sort_key not in BATCH_SORT_KEY_CHOICES:\n        raise ValueError(\n            f\"arg 'batch_sort_key' ({batch_sort_key}) should be \"\n            f\"one of {BATCH_SORT_KEY_CHOICES}\"\n        )\n\n    # TODO(karita): remove this by creating converter from ASR to TTS json format\n    batch_sort_axis = 0\n    if swap_io:\n        # for TTS\n        ikey = \"output\"\n        okey = \"input\"\n        if batch_sort_key == \"input\":\n            batch_sort_key = \"output\"\n        elif batch_sort_key == \"output\":\n            batch_sort_key = \"input\"\n    elif mt:\n        # for MT\n        ikey = \"output\"\n        okey = \"output\"\n        batch_sort_key = \"output\"\n        batch_sort_axis = 1\n        assert iaxis == 1\n        assert oaxis == 0\n        # NOTE: input is json['output'][1] and output is json['output'][0]\n    else:\n        ikey = \"input\"\n        okey = \"output\"\n\n    if count == \"auto\":\n        if batch_size != 0:\n            count = \"seq\"\n        elif batch_bins != 0:\n            count = \"bin\"\n        elif batch_frames_in != 0 or batch_frames_out != 0 or batch_frames_inout != 0:\n            count = \"frame\"\n        else:\n            raise ValueError(\n                f\"cannot detect `count` manually set one of {BATCH_COUNT_CHOICES}\"\n            )\n        logging.info(f\"count is auto detected as {count}\")\n\n    if count != \"seq\" and batch_sort_key == \"shuffle\":\n        raise ValueError(\"batch_sort_key=shuffle is only available if batch_count=seq\")\n\n    category2data = {}  # Dict[str, dict]\n    for k, v in data.items():\n        category2data.setdefault(v.get(\"category\"), {})[k] = v\n\n    batches_list = []  # List[List[List[Tuple[str, dict]]]]\n    for d in category2data.values():\n        if batch_sort_key == \"shuffle\":\n            
batches = batchfy_shuffle(\n                d, batch_size, min_batch_size, num_batches, shortest_first\n            )\n            batches_list.append(batches)\n            continue\n\n        # sort it by input lengths (long to short)\n        # add a random float in (0, 1) to shuffle multilingual data with the same length\n        if not no_sort:\n            sorted_data = sorted(\n                d.items(),\n                key=lambda data: int(data[1][batch_sort_key][batch_sort_axis][\"shape\"][0]) + random.random(),\n                reverse=not shortest_first,\n            )\n        else:\n            sorted_data = list(d.items())\n\n        logging.info(\"# utts: \" + str(len(sorted_data)))\n        if count == \"seq\":\n            batches = batchfy_by_seq(\n                sorted_data,\n                batch_size=batch_size,\n                max_length_in=max_length_in,\n                max_length_out=max_length_out,\n                min_batch_size=min_batch_size,\n                shortest_first=shortest_first,\n                ikey=ikey,\n                iaxis=iaxis,\n                okey=okey,\n                oaxis=oaxis,\n            )\n        if count == \"bin\":\n            batches = batchfy_by_bin(\n                sorted_data,\n                batch_bins=batch_bins,\n                min_batch_size=min_batch_size,\n                shortest_first=shortest_first,\n                ikey=ikey,\n                okey=okey,\n            )\n        if count == \"frame\":\n            batches = batchfy_by_frame(\n                sorted_data,\n                max_frames_in=batch_frames_in,\n                max_frames_out=batch_frames_out,\n                max_frames_inout=batch_frames_inout,\n                min_batch_size=min_batch_size,\n                shortest_first=shortest_first,\n                ikey=ikey,\n                okey=okey,\n            )\n        batches_list.append(batches)\n\n    if len(batches_list) == 1:\n        batches = 
batches_list[0]\n    else:\n        # Concat list. This way is faster than \"sum(batch_list, [])\"\n        batches = list(itertools.chain(*batches_list))\n\n    # for debugging\n    if num_batches > 0:\n        batches = batches[:num_batches]\n        print(f\"only keep {len(batches)} minibatches\")\n    logging.info(\"# minibatches: \" + str(len(batches)))\n    print(\"# minibatches: \" + str(len(batches)))\n    # batch: List[List[Tuple[str, dict]]]\n    return batches\n"
  },
  {
    "path": "utils/training/evaluator.py",
    "content": "from chainer.training.extensions import Evaluator\n\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\n\n\nclass BaseEvaluator(Evaluator):\n    \"\"\"Base Evaluator in ESPnet\"\"\"\n\n    def __call__(self, trainer=None):\n        ret = super().__call__(trainer)\n        try:\n            if trainer is not None:\n                # force tensorboard to report evaluation log\n                tb_logger = trainer.get_extension(TensorboardLogger.default_name)\n                tb_logger(trainer)\n        except ValueError:\n            pass\n        return ret\n"
  },
  {
    "path": "utils/training/iterators.py",
    "content": "import chainer\nfrom chainer.iterators import MultiprocessIterator\nfrom chainer.iterators import SerialIterator\nfrom chainer.iterators import ShuffleOrderSampler\nfrom chainer.training.extension import Extension\n\nimport numpy as np\n\n\nclass ShufflingEnabler(Extension):\n    \"\"\"An extension enabling shuffling on an Iterator\"\"\"\n\n    def __init__(self, iterators):\n        \"\"\"Inits the ShufflingEnabler\n\n        :param list[Iterator] iterators: The iterators to enable shuffling on\n        \"\"\"\n        self.set = False\n        self.iterators = iterators\n\n    def __call__(self, trainer):\n        \"\"\"Calls the enabler on the given iterator\n\n        :param trainer: The iterator\n        \"\"\"\n        if not self.set:\n            for iterator in self.iterators:\n                iterator.start_shuffle()\n            self.set = True\n\n\nclass ToggleableShufflingSerialIterator(SerialIterator):\n    \"\"\"A SerialIterator having its shuffling property activated during training\"\"\"\n\n    def __init__(self, dataset, batch_size, repeat=True, shuffle=True):\n        \"\"\"Init the Iterator\n\n        :param torch.nn.Tensor dataset: The dataset to take batches from\n        :param int batch_size: The batch size\n        :param bool repeat: Whether to repeat data (allow multiple epochs)\n        :param bool shuffle: Whether to shuffle the batches\n        \"\"\"\n        super(ToggleableShufflingSerialIterator, self).__init__(\n            dataset, batch_size, repeat, shuffle\n        )\n\n    def start_shuffle(self):\n        \"\"\"Starts shuffling (or reshuffles) the batches\"\"\"\n        self._shuffle = True\n        if int(chainer._version.__version__[0]) <= 4:\n            self._order = np.random.permutation(len(self.dataset))\n        else:\n            self.order_sampler = ShuffleOrderSampler()\n            self._order = self.order_sampler(np.arange(len(self.dataset)), 0)\n\n\nclass 
ToggleableShufflingMultiprocessIterator(MultiprocessIterator):\n    \"\"\"A MultiprocessIterator having its shuffling property activated during training\"\"\"\n\n    def __init__(\n        self,\n        dataset,\n        batch_size,\n        repeat=True,\n        shuffle=True,\n        n_processes=None,\n        n_prefetch=1,\n        shared_mem=None,\n        maxtasksperchild=20,\n    ):\n        \"\"\"Init the iterator\n\n        :param torch.nn.Tensor dataset: The dataset to take batches from\n        :param int batch_size: The batch size\n        :param bool repeat: Whether to repeat batches or not (enables multiple epochs)\n        :param bool shuffle: Whether to shuffle the order of the batches\n        :param int n_processes: How many processes to use\n        :param int n_prefetch: The number of batches to prefetch\n        :param int shared_mem: How much memory to share between processes\n        :param int maxtasksperchild: Maximum number of tasks per child\n        \"\"\"\n        super(ToggleableShufflingMultiprocessIterator, self).__init__(\n            dataset=dataset,\n            batch_size=batch_size,\n            repeat=repeat,\n            shuffle=shuffle,\n            n_processes=n_processes,\n            n_prefetch=n_prefetch,\n            shared_mem=shared_mem,\n            maxtasksperchild=maxtasksperchild,\n        )\n\n    def start_shuffle(self):\n        \"\"\"Starts shuffling (or reshuffles) the batches\"\"\"\n        self.shuffle = True\n        if int(chainer._version.__version__[0]) <= 4:\n            self._order = np.random.permutation(len(self.dataset))\n        else:\n            self.order_sampler = ShuffleOrderSampler()\n            self._order = self.order_sampler(np.arange(len(self.dataset)), 0)\n        self._set_prefetch_state()\n"
  },
  {
    "path": "utils/training/tensorboard_logger.py",
    "content": "from chainer.training.extension import Extension\n\n\nclass TensorboardLogger(Extension):\n    \"\"\"A tensorboard logger extension\"\"\"\n\n    default_name = \"espnet_tensorboard_logger\"\n\n    def __init__(\n        self, logger, att_reporter=None, ctc_reporter=None, entries=None, epoch=0\n    ):\n        \"\"\"Init the extension\n\n        :param SummaryWriter logger: The logger to use\n        :param PlotAttentionReporter att_reporter: The (optional) PlotAttentionReporter\n        :param entries: The entries to watch\n        :param int epoch: The starting epoch\n        \"\"\"\n        self._entries = entries\n        self._att_reporter = att_reporter\n        self._ctc_reporter = ctc_reporter\n        self._logger = logger\n        self._epoch = epoch\n\n    def __call__(self, trainer):\n        \"\"\"Updates the events file with the new values\n\n        :param trainer: The trainer\n        \"\"\"\n        observation = trainer.observation\n        for k, v in observation.items():\n            if (self._entries is not None) and (k not in self._entries):\n                continue\n            if k is not None and v is not None:\n                if \"cupy\" in str(type(v)):\n                    v = v.get()\n                if \"cupy\" in str(type(k)):\n                    k = k.get()\n                self._logger.add_scalar(k, v, trainer.updater.iteration)\n        if (\n            self._att_reporter is not None\n            and trainer.updater.get_iterator(\"main\").epoch > self._epoch\n        ):\n            self._epoch = trainer.updater.get_iterator(\"main\").epoch\n            self._att_reporter.log_attentions(self._logger, trainer.updater.iteration)\n        if (\n            self._ctc_reporter is not None\n            and trainer.updater.get_iterator(\"main\").epoch > self._epoch\n        ):\n            self._epoch = trainer.updater.get_iterator(\"main\").epoch\n            self._ctc_reporter.log_ctc_probs(self._logger, 
trainer.updater.iteration)\n"
  },
  {
    "path": "utils/training/train_utils.py",
    "content": "import chainer\nimport logging\n\n\ndef check_early_stop(trainer, epochs):\n    \"\"\"Checks an early stopping trigger and warns the user if it's the case\n\n    :param trainer: The trainer used for training\n    :param epochs: The maximum number of epochs\n    \"\"\"\n    end_epoch = trainer.updater.get_iterator(\"main\").epoch\n    if end_epoch < (epochs - 1):\n        logging.warning(\n            \"Hit early stop at epoch \"\n            + str(end_epoch)\n            + \"\\nYou can change the patience or set it to 0 to run all epochs\"\n        )\n\n\ndef set_early_stop(trainer, args, is_lm=False):\n    \"\"\"Sets the early stop trigger given the program arguments\n\n    :param trainer: The trainer used for training\n    :param args: The program arguments\n    :param is_lm: If the trainer is for a LM (epoch instead of epochs)\n    \"\"\"\n    patience = args.patience\n    criterion = args.early_stop_criterion\n    epochs = args.epoch if is_lm else args.epochs\n    mode = \"max\" if \"acc\" in criterion else \"min\"\n    if patience > 0:\n        trainer.stop_trigger = chainer.training.triggers.EarlyStoppingTrigger(\n            monitor=criterion,\n            mode=mode,\n            patients=patience,\n            max_trigger=(epochs, \"epoch\"),\n        )\n"
  },
  {
    "path": "vc/pytorch_backend/vc.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\n# Copyright 2020 Nagoya University (Wen-Chin Huang)\n#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)\n\n\"\"\"E2E VC training / decoding functions.\"\"\"\n\nimport copy\nimport json\nimport logging\nimport math\nimport os\nimport time\n\nimport chainer\nimport kaldiio\nimport numpy as np\nimport torch\n\nfrom chainer import training\nfrom chainer.training import extensions\n\nfrom espnet.asr.asr_utils import get_model_conf\nfrom espnet.asr.asr_utils import snapshot_object\nfrom espnet.asr.asr_utils import torch_load\nfrom espnet.asr.asr_utils import torch_resume\nfrom espnet.asr.asr_utils import torch_snapshot\nfrom espnet.asr.pytorch_backend.asr_init import load_trained_modules\nfrom espnet.nets.pytorch_backend.nets_utils import pad_list\nfrom espnet.nets.tts_interface import TTSInterface\nfrom espnet.utils.dataset import ChainerDataLoader\nfrom espnet.utils.dataset import TransformDataset\nfrom espnet.utils.dynamic_import import dynamic_import\nfrom espnet.utils.io_utils import LoadInputsAndTargets\nfrom espnet.utils.training.batchfy import make_batchset\nfrom espnet.utils.training.evaluator import BaseEvaluator\n\nfrom espnet.utils.deterministic_utils import set_deterministic_pytorch\nfrom espnet.utils.training.train_utils import check_early_stop\nfrom espnet.utils.training.train_utils import set_early_stop\n\nfrom espnet.utils.training.iterators import ShufflingEnabler\n\nimport matplotlib\n\nfrom espnet.utils.training.tensorboard_logger import TensorboardLogger\nfrom tensorboardX import SummaryWriter\n\nmatplotlib.use(\"Agg\")\n\n\nclass CustomEvaluator(BaseEvaluator):\n    \"\"\"Custom evaluator.\"\"\"\n\n    def __init__(self, model, iterator, target, device):\n        \"\"\"Initilize module.\n\n        Args:\n            model (torch.nn.Module): Pytorch model instance.\n            iterator (chainer.dataset.Iterator): Iterator for validation.\n            target (chainer.Chain): 
Dummy chain instance.\n            device (torch.device): The device to be used in evaluation.\n\n        \"\"\"\n        super(CustomEvaluator, self).__init__(iterator, target)\n        self.model = model\n        self.device = device\n\n    # The core part of the evaluation routine can be customized by overriding.\n    def evaluate(self):\n        \"\"\"Evaluate over validation iterator.\"\"\"\n        iterator = self._iterators[\"main\"]\n\n        if self.eval_hook:\n            self.eval_hook(self)\n\n        if hasattr(iterator, \"reset\"):\n            iterator.reset()\n            it = iterator\n        else:\n            it = copy.copy(iterator)\n\n        summary = chainer.reporter.DictSummary()\n\n        self.model.eval()\n        with torch.no_grad():\n            for batch in it:\n                if isinstance(batch, tuple):\n                    x = tuple(arr.to(self.device) for arr in batch)\n                else:\n                    x = batch\n                    for key in x.keys():\n                        x[key] = x[key].to(self.device)\n                observation = {}\n                with chainer.reporter.report_scope(observation):\n                    # run the forward pass; reported values land in `observation`\n                    if isinstance(x, tuple):\n                        self.model(*x)\n                    else:\n                        self.model(**x)\n                summary.add(observation)\n        self.model.train()\n\n        return summary.compute_mean()\n\n\nclass CustomUpdater(training.StandardUpdater):\n    \"\"\"Custom updater.\"\"\"\n\n    def __init__(self, model, grad_clip, iterator, optimizer, device, accum_grad=1):\n        \"\"\"Initialize module.\n\n        Args:\n            model (torch.nn.Module): PyTorch model instance.\n            grad_clip (float): The gradient clipping value.\n            iterator (chainer.dataset.Iterator): Iterator for training.\n            optimizer (torch.optim.Optimizer): PyTorch optimizer instance.\n     
       device (torch.device): The device to be used in training.\n\n        \"\"\"\n        super(CustomUpdater, self).__init__(iterator, optimizer)\n        self.model = model\n        self.grad_clip = grad_clip\n        self.device = device\n        self.clip_grad_norm = torch.nn.utils.clip_grad_norm_\n        self.accum_grad = accum_grad\n        self.forward_count = 0\n\n    # The core part of the update routine can be customized by overriding.\n    def update_core(self):\n        \"\"\"Update model one step.\"\"\"\n        # When we pass one iterator and optimizer to StandardUpdater.__init__,\n        # they are automatically named 'main'.\n        train_iter = self.get_iterator(\"main\")\n        optimizer = self.get_optimizer(\"main\")\n\n        # Get the next batch (a list of json files)\n        batch = train_iter.next()\n        if isinstance(batch, tuple):\n            x = tuple(arr.to(self.device) for arr in batch)\n        else:\n            x = batch\n            for key in x.keys():\n                x[key] = x[key].to(self.device)\n\n        # compute loss and gradient\n        if isinstance(x, tuple):\n            loss = self.model(*x).mean() / self.accum_grad\n        else:\n            loss = self.model(**x).mean() / self.accum_grad\n        loss.backward()\n\n        # update parameters\n        self.forward_count += 1\n        if self.forward_count != self.accum_grad:\n            return\n        self.forward_count = 0\n\n        # compute the gradient norm to check if it is normal or not\n        grad_norm = self.clip_grad_norm(self.model.parameters(), self.grad_clip)\n        logging.debug(\"grad norm={}\".format(grad_norm))\n        if math.isnan(grad_norm):\n            logging.warning(\"grad norm is nan. 
Do not update model.\")\n        else:\n            optimizer.step()\n        optimizer.zero_grad()\n\n    def update(self):\n        \"\"\"Run update function.\"\"\"\n        self.update_core()\n        if self.forward_count == 0:\n            self.iteration += 1\n\n\nclass CustomConverter(object):\n    \"\"\"Custom converter.\"\"\"\n\n    def __init__(self):\n        \"\"\"Initialize module.\"\"\"\n        # NOTE: keep as class for future development\n        pass\n\n    def __call__(self, batch, device=torch.device(\"cpu\")):\n        \"\"\"Convert a given batch.\n\n        Args:\n            batch (list): Single-element list holding a tuple\n                (xs, ys, spembs, extras) of lists of ndarrays.\n            device (torch.device): The device to send the tensors to.\n\n        Returns:\n            dict: Dict of converted tensors.\n\n        Examples:\n            >>> batch = [([np.arange(5), np.arange(3)],\n                          [np.random.randn(8, 2), np.random.randn(4, 2)],\n                          None, None)]\n            >>> converter = CustomConverter()\n            >>> converter(batch, torch.device(\"cpu\"))\n            {'xs': tensor([[0, 1, 2, 3, 4],\n                           [0, 1, 2, 0, 0]]),\n             'ilens': tensor([5, 3]),\n             'ys': tensor([[[-0.4197, -1.1157],\n                            [-1.5837, -0.4299],\n                            [-2.0491,  0.9215],\n                            [-2.4326,  0.8891],\n                            [ 1.2323,  1.7388],\n                            [-0.3228,  0.6656],\n                            [-0.6025,  1.3693],\n                            [-1.0778,  1.3447]],\n                           [[ 0.1768, -0.3119],\n                            [ 0.4386,  2.5354],\n                            [-1.2181, -0.5918],\n                            [-0.6858, -0.8843],\n                            [ 0.0000,  0.0000],\n                            [ 0.0000,  0.0000],\n                            [ 0.0000,  0.0000],\n                            [ 0.0000,  0.0000]]]),\n             'labels': 
tensor([[0., 0., 0., 0., 0., 0., 0., 1.],\n                               [0., 0., 0., 1., 1., 1., 1., 1.]]),\n             'olens': tensor([8, 4])}\n\n        \"\"\"\n        # batch should be located in list\n        assert len(batch) == 1\n        xs, ys, spembs, extras = batch[0]\n\n        # get list of lengths (must be tensor for DataParallel)\n        ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).long().to(device)\n        olens = torch.from_numpy(np.array([y.shape[0] for y in ys])).long().to(device)\n\n        # perform padding and conversion to tensor\n        xs = pad_list([torch.from_numpy(x).float() for x in xs], 0).to(device)\n        ys = pad_list([torch.from_numpy(y).float() for y in ys], 0).to(device)\n\n        # make labels for stop prediction\n        labels = ys.new_zeros(ys.size(0), ys.size(1))\n        for i, l in enumerate(olens):\n            labels[i, l - 1 :] = 1.0\n\n        # prepare dict\n        new_batch = {\n            \"xs\": xs,\n            \"ilens\": ilens,\n            \"ys\": ys,\n            \"labels\": labels,\n            \"olens\": olens,\n        }\n\n        # load speaker embedding\n        if spembs is not None:\n            spembs = torch.from_numpy(np.array(spembs)).float()\n            new_batch[\"spembs\"] = spembs.to(device)\n\n        # load second target\n        if extras is not None:\n            extras = pad_list([torch.from_numpy(extra).float() for extra in extras], 0)\n            new_batch[\"extras\"] = extras.to(device)\n\n        return new_batch\n\n\ndef train(args):\n    \"\"\"Train E2E VC model.\"\"\"\n    set_deterministic_pytorch(args)\n\n    # check cuda availability\n    if not torch.cuda.is_available():\n        logging.warning(\"cuda is not available\")\n\n    # get input and output dimension info\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n    utts = list(valid_json.keys())\n\n    # In TTS, this is reversed, but not in VC. 
See `espnet.utils.training.batchfy`\n    idim = int(valid_json[utts[0]][\"input\"][0][\"shape\"][1])\n    odim = int(valid_json[utts[0]][\"output\"][0][\"shape\"][1])\n    logging.info(\"#input dims : \" + str(idim))\n    logging.info(\"#output dims: \" + str(odim))\n\n    # get extra input and output dimension\n    if args.use_speaker_embedding:\n        args.spk_embed_dim = int(valid_json[utts[0]][\"input\"][1][\"shape\"][0])\n    else:\n        args.spk_embed_dim = None\n    if args.use_second_target:\n        args.spc_dim = int(valid_json[utts[0]][\"input\"][1][\"shape\"][1])\n    else:\n        args.spc_dim = None\n\n    # write model config\n    os.makedirs(args.outdir, exist_ok=True)\n    model_conf = args.outdir + \"/model.json\"\n    with open(model_conf, \"wb\") as f:\n        logging.info(\"writing a model config file to \" + model_conf)\n        f.write(\n            json.dumps(\n                (idim, odim, vars(args)), indent=4, ensure_ascii=False, sort_keys=True\n            ).encode(\"utf_8\")\n        )\n    for key in sorted(vars(args).keys()):\n        logging.info(\"ARGS: \" + key + \": \" + str(vars(args)[key]))\n\n    # specify model architecture\n    if args.enc_init is not None or args.dec_init is not None:\n        model = load_trained_modules(idim, odim, args, TTSInterface)\n    else:\n        model_class = dynamic_import(args.model_module)\n        model = model_class(idim, odim, args)\n    assert isinstance(model, TTSInterface)\n    logging.info(model)\n    reporter = model.reporter\n\n    # freeze modules, if specified\n    if args.freeze_mods:\n        for mod, param in model.named_parameters():\n            if any(mod.startswith(key) for key in args.freeze_mods):\n                logging.info(\"freezing %s\" % mod)\n                param.requires_grad = False\n\n    for mod, param in model.named_parameters():\n        if not param.requires_grad:\n            logging.info(\"Frozen module %s\" % mod)\n\n    
# check the use of multi-gpu\n    if args.ngpu > 1:\n        model = torch.nn.DataParallel(model, device_ids=list(range(args.ngpu)))\n        if args.batch_size != 0:\n            logging.warning(\n                \"batch size is automatically increased (%d -> %d)\"\n                % (args.batch_size, args.batch_size * args.ngpu)\n            )\n            args.batch_size *= args.ngpu\n\n    # set torch device\n    device = torch.device(\"cuda\" if args.ngpu > 0 else \"cpu\")\n    model = model.to(device)\n\n    logging.warning(\n        \"num. model params: {:,} (num. trained: {:,} ({:.1f}%))\".format(\n            sum(p.numel() for p in model.parameters()),\n            sum(p.numel() for p in model.parameters() if p.requires_grad),\n            sum(p.numel() for p in model.parameters() if p.requires_grad)\n            * 100.0\n            / sum(p.numel() for p in model.parameters()),\n        )\n    )\n\n    # Setup an optimizer\n    if args.opt == \"adam\":\n        optimizer = torch.optim.Adam(\n            model.parameters(), args.lr, eps=args.eps, weight_decay=args.weight_decay\n        )\n    elif args.opt == \"noam\":\n        from espnet.nets.pytorch_backend.transformer.optimizer import get_std_opt\n\n        optimizer = get_std_opt(\n            model, args.adim, args.transformer_warmup_steps, args.transformer_lr\n        )\n    elif args.opt == \"lamb\":\n        from pytorch_lamb import Lamb\n\n        optimizer = Lamb(\n            model.parameters(), lr=args.lr, weight_decay=0.01, betas=(0.9, 0.999)\n        )\n    else:\n        raise NotImplementedError(\"unknown optimizer: \" + args.opt)\n\n    # FIXME: TOO DIRTY HACK\n    setattr(optimizer, \"target\", reporter)\n    setattr(optimizer, \"serialize\", lambda s: reporter.serialize(s))\n\n    # read json data\n    with open(args.train_json, \"rb\") as f:\n        train_json = json.load(f)[\"utts\"]\n    with open(args.valid_json, \"rb\") as f:\n        valid_json = json.load(f)[\"utts\"]\n\n    
use_sortagrad = args.sortagrad == -1 or args.sortagrad > 0\n    if use_sortagrad:\n        args.batch_sort_key = \"input\"\n    # make minibatch list (variable length)\n    train_batchset = make_batchset(\n        train_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        batch_sort_key=args.batch_sort_key,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        shortest_first=use_sortagrad,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        swap_io=False,\n        iaxis=0,\n        oaxis=0,\n    )\n    valid_batchset = make_batchset(\n        valid_json,\n        args.batch_size,\n        args.maxlen_in,\n        args.maxlen_out,\n        args.minibatches,\n        batch_sort_key=args.batch_sort_key,\n        min_batch_size=args.ngpu if args.ngpu > 1 else 1,\n        count=args.batch_count,\n        batch_bins=args.batch_bins,\n        batch_frames_in=args.batch_frames_in,\n        batch_frames_out=args.batch_frames_out,\n        batch_frames_inout=args.batch_frames_inout,\n        swap_io=False,\n        iaxis=0,\n        oaxis=0,\n    )\n\n    load_tr = LoadInputsAndTargets(\n        mode=\"vc\",\n        use_speaker_embedding=args.use_speaker_embedding,\n        use_second_target=args.use_second_target,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": True},  # Switch the mode of preprocessing\n        keep_all_data_on_mem=args.keep_all_data_on_mem,\n    )\n\n    load_cv = LoadInputsAndTargets(\n        mode=\"vc\",\n        use_speaker_embedding=args.use_speaker_embedding,\n        use_second_target=args.use_second_target,\n        preprocess_conf=args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n        
keep_all_data_on_mem=args.keep_all_data_on_mem,\n    )\n\n    converter = CustomConverter()\n    # hack: the DataLoader batch_size is fixed to 1 because\n    # each dataset item is already a full minibatch (a list of utterances)\n    train_iter = {\n        \"main\": ChainerDataLoader(\n            dataset=TransformDataset(\n                train_batchset, lambda data: converter([load_tr(data)])\n            ),\n            batch_size=1,\n            num_workers=args.num_iter_processes,\n            shuffle=not use_sortagrad,\n            collate_fn=lambda x: x[0],\n        )\n    }\n    valid_iter = {\n        \"main\": ChainerDataLoader(\n            dataset=TransformDataset(\n                valid_batchset, lambda data: converter([load_cv(data)])\n            ),\n            batch_size=1,\n            shuffle=False,\n            collate_fn=lambda x: x[0],\n            num_workers=args.num_iter_processes,\n        )\n    }\n\n    # Set up a trainer\n    updater = CustomUpdater(\n        model, args.grad_clip, train_iter, optimizer, device, args.accum_grad\n    )\n    trainer = training.Trainer(updater, (args.epochs, \"epoch\"), out=args.outdir)\n\n    # Resume from a snapshot\n    if args.resume:\n        logging.info(\"resuming from %s\" % args.resume)\n        torch_resume(args.resume, trainer)\n\n    # set intervals\n    eval_interval = (args.eval_interval_epochs, \"epoch\")\n    save_interval = (args.save_interval_epochs, \"epoch\")\n    report_interval = (args.report_interval_iters, \"iteration\")\n\n    # Evaluate the model on the validation set at every eval_interval\n    trainer.extend(\n        CustomEvaluator(model, valid_iter, reporter, device), trigger=eval_interval\n    )\n\n    # Save a snapshot at every save_interval\n    trainer.extend(torch_snapshot(), trigger=save_interval)\n\n    # Save the model with the best validation loss\n    trainer.extend(\n        snapshot_object(model, \"model.loss.best\"),\n        trigger=training.triggers.MinValueTrigger(\n            \"validation/main/loss\", trigger=eval_interval\n        ),\n    )\n\n    # 
Save attention figure for each epoch\n    if args.num_save_attention > 0:\n        data = sorted(\n            list(valid_json.items())[: args.num_save_attention],\n            key=lambda x: int(x[1][\"input\"][0][\"shape\"][1]),\n            reverse=True,\n        )\n        if hasattr(model, \"module\"):\n            att_vis_fn = model.module.calculate_all_attentions\n            plot_class = model.module.attention_plot_class\n        else:\n            att_vis_fn = model.calculate_all_attentions\n            plot_class = model.attention_plot_class\n        att_reporter = plot_class(\n            att_vis_fn,\n            data,\n            args.outdir + \"/att_ws\",\n            converter=converter,\n            transform=load_cv,\n            device=device,\n            reverse=True,\n        )\n        trainer.extend(att_reporter, trigger=eval_interval)\n    else:\n        att_reporter = None\n\n    # Make a plot for training and validation values\n    if hasattr(model, \"module\"):\n        base_plot_keys = model.module.base_plot_keys\n    else:\n        base_plot_keys = model.base_plot_keys\n    plot_keys = []\n    for key in base_plot_keys:\n        plot_key = [\"main/\" + key, \"validation/main/\" + key]\n        trainer.extend(\n            extensions.PlotReport(plot_key, \"epoch\", file_name=key + \".png\"),\n            trigger=eval_interval,\n        )\n        plot_keys += plot_key\n    trainer.extend(\n        extensions.PlotReport(plot_keys, \"epoch\", file_name=\"all_loss.png\"),\n        trigger=eval_interval,\n    )\n\n    # Write a log of evaluation statistics for each epoch\n    trainer.extend(extensions.LogReport(trigger=report_interval))\n    report_keys = [\"epoch\", \"iteration\", \"elapsed_time\"] + plot_keys\n    trainer.extend(extensions.PrintReport(report_keys), trigger=report_interval)\n    trainer.extend(extensions.ProgressBar(), trigger=report_interval)\n\n    set_early_stop(trainer, args)\n    if args.tensorboard_dir is not None and 
args.tensorboard_dir != \"\":\n        writer = SummaryWriter(args.tensorboard_dir)\n        trainer.extend(TensorboardLogger(writer, att_reporter), trigger=report_interval)\n\n    if use_sortagrad:\n        # train_iter is a dict, so pass the underlying loader,\n        # which is the object that provides start_shuffle()\n        trainer.extend(\n            ShufflingEnabler([train_iter[\"main\"]]),\n            trigger=(args.sortagrad if args.sortagrad != -1 else args.epochs, \"epoch\"),\n        )\n\n    # Run the training\n    trainer.run()\n    check_early_stop(trainer, args.epochs)\n\n\n@torch.no_grad()\ndef decode(args):\n    \"\"\"Decode with E2E VC model.\"\"\"\n    set_deterministic_pytorch(args)\n    # read training config\n    idim, odim, train_args = get_model_conf(args.model, args.model_conf)\n\n    # show arguments\n    for key in sorted(vars(args).keys()):\n        logging.info(\"args: \" + key + \": \" + str(vars(args)[key]))\n\n    # define model\n    model_class = dynamic_import(train_args.model_module)\n    model = model_class(idim, odim, train_args)\n    assert isinstance(model, TTSInterface)\n    logging.info(model)\n\n    # load trained model parameters\n    logging.info(\"reading model parameters from \" + args.model)\n    torch_load(args.model, model)\n    model.eval()\n\n    # set torch device\n    device = torch.device(\"cuda\" if args.ngpu > 0 else \"cpu\")\n    model = model.to(device)\n\n    # read json data\n    with open(args.json, \"rb\") as f:\n        js = json.load(f)[\"utts\"]\n\n    # check directory\n    outdir = os.path.dirname(args.out)\n    if len(outdir) != 0 and not os.path.exists(outdir):\n        os.makedirs(outdir)\n\n    load_inputs_and_targets = LoadInputsAndTargets(\n        mode=\"vc\",\n        load_output=False,\n        sort_in_input_length=False,\n        use_speaker_embedding=train_args.use_speaker_embedding,\n        preprocess_conf=train_args.preprocess_conf\n        if args.preprocess_conf is None\n        else args.preprocess_conf,\n        preprocess_args={\"train\": False},  # Switch the mode of preprocessing\n    )\n\n    # define 
function for plot prob and att_ws\n    def _plot_and_save(array, figname, figsize=(6, 4), dpi=150):\n        import matplotlib.pyplot as plt\n\n        shape = array.shape\n        if len(shape) == 1:\n            # for eos probability\n            plt.figure(figsize=figsize, dpi=dpi)\n            plt.plot(array)\n            plt.xlabel(\"Frame\")\n            plt.ylabel(\"Probability\")\n            plt.ylim([0, 1])\n        elif len(shape) == 2:\n            # for tacotron 2 attention weights, whose shape is (out_length, in_length)\n            plt.figure(figsize=figsize, dpi=dpi)\n            plt.imshow(array, aspect=\"auto\")\n            plt.xlabel(\"Input\")\n            plt.ylabel(\"Output\")\n        elif len(shape) == 4:\n            # for transformer attention weights,\n            # whose shape is (#leyers, #heads, out_length, in_length)\n            plt.figure(figsize=(figsize[0] * shape[0], figsize[1] * shape[1]), dpi=dpi)\n            for idx1, xs in enumerate(array):\n                for idx2, x in enumerate(xs, 1):\n                    plt.subplot(shape[0], shape[1], idx1 * shape[1] + idx2)\n                    plt.imshow(x, aspect=\"auto\")\n                    plt.xlabel(\"Input\")\n                    plt.ylabel(\"Output\")\n        else:\n            raise NotImplementedError(\"Support only from 1D to 4D array.\")\n        plt.tight_layout()\n        if not os.path.exists(os.path.dirname(figname)):\n            # NOTE: exist_ok = True is needed for parallel process decoding\n            os.makedirs(os.path.dirname(figname), exist_ok=True)\n        plt.savefig(figname)\n        plt.close()\n\n    # define function to calculate focus rate\n    # (see section 3.3 in https://arxiv.org/abs/1905.09263)\n    def _calculate_focus_rete(att_ws):\n        if att_ws is None:\n            # fastspeech case -> None\n            return 1.0\n        elif len(att_ws.shape) == 2:\n            # tacotron 2 case -> (L, T)\n            return 
float(att_ws.max(dim=-1)[0].mean())\n        elif len(att_ws.shape) == 4:\n            # transformer case -> (#layers, #heads, L, T)\n            return float(att_ws.max(dim=-1)[0].mean(dim=-1).max())\n        else:\n            raise ValueError(\"att_ws should be 2 or 4 dimensional tensor.\")\n\n    # define function to convert attention to duration\n    def _convert_att_to_duration(att_ws):\n        if len(att_ws.shape) == 2:\n            # tacotron 2 case -> (L, T)\n            pass\n        elif len(att_ws.shape) == 4:\n            # transformer case -> (#layers, #heads, L, T)\n            # get the most diagonal head according to focus rate\n            att_ws = torch.cat(\n                [att_w for att_w in att_ws], dim=0\n            )  # (#heads * #layers, L, T)\n            diagonal_scores = att_ws.max(dim=-1)[0].mean(dim=-1)  # (#heads * #layers,)\n            diagonal_head_idx = diagonal_scores.argmax()\n            att_ws = att_ws[diagonal_head_idx]  # (L, T)\n        else:\n            raise ValueError(\"att_ws should be 2 or 4 dimensional tensor.\")\n        # calculate duration from 2d attention weight\n        durations = torch.stack(\n            [att_ws.argmax(-1).eq(i).sum() for i in range(att_ws.shape[1])]\n        )\n        return durations.view(-1, 1).float()\n\n    # define writer instances\n    feat_writer = kaldiio.WriteHelper(\"ark,scp:{o}.ark,{o}.scp\".format(o=args.out))\n    if args.save_durations:\n        dur_writer = kaldiio.WriteHelper(\n            \"ark,scp:{o}.ark,{o}.scp\".format(o=args.out.replace(\"feats\", \"durations\"))\n        )\n    if args.save_focus_rates:\n        fr_writer = kaldiio.WriteHelper(\n            \"ark,scp:{o}.ark,{o}.scp\".format(o=args.out.replace(\"feats\", \"focus_rates\"))\n        )\n\n    # start decoding\n    for idx, utt_id in enumerate(js.keys()):\n        # setup inputs\n        batch = [(utt_id, js[utt_id])]\n        data = load_inputs_and_targets(batch)\n        x = 
torch.FloatTensor(data[0][0]).to(device)\n        spemb = None\n        if train_args.use_speaker_embedding:\n            spemb = torch.FloatTensor(data[1][0]).to(device)\n\n        # decode and write\n        start_time = time.time()\n        outs, probs, att_ws = model.inference(x, args, spemb=spemb)\n        logging.info(\n            \"inference speed = %.1f frames / sec.\"\n            % (int(outs.size(0)) / (time.time() - start_time))\n        )\n        if outs.size(0) == x.size(0) * args.maxlenratio:\n            logging.warning(\"output length reaches maximum length (%s).\" % utt_id)\n        focus_rate = _calculate_focus_rete(att_ws)\n        logging.info(\n            \"(%d/%d) %s (size: %d->%d, focus rate: %.3f)\"\n            % (idx + 1, len(js.keys()), utt_id, x.size(0), outs.size(0), focus_rate)\n        )\n        feat_writer[utt_id] = outs.cpu().numpy()\n        if args.save_durations:\n            ds = _convert_att_to_duration(att_ws)\n            dur_writer[utt_id] = ds.cpu().numpy()\n        if args.save_focus_rates:\n            fr_writer[utt_id] = np.array(focus_rate).reshape(1, 1)\n\n        # plot and save prob and att_ws\n        if probs is not None:\n            _plot_and_save(\n                probs.cpu().numpy(),\n                os.path.dirname(args.out) + \"/probs/%s_prob.png\" % utt_id,\n            )\n        if att_ws is not None:\n            _plot_and_save(\n                att_ws.cpu().numpy(),\n                os.path.dirname(args.out) + \"/att_ws/%s_att_ws.png\" % utt_id,\n            )\n\n    # close file object\n    feat_writer.close()\n    if args.save_durations:\n        dur_writer.close()\n    if args.save_focus_rates:\n        fr_writer.close()\n"
  },
  {
    "path": "version.txt",
    "content": "\n0.9.9\n"
  }
]